Added support for dynamic groupby on all data interfaces #711

Merged
jlstevens merged 4 commits into master from dataset_dynamic_groupby on Jul 14, 2016

@philippjfr
Member

philippjfr commented Jun 4, 2016

A dynamic version of groupby proved exceptionally useful for large datasets we can now handle via the iris interface. However it can trivially be implemented in a general way using select, which is what I've done here.
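To illustrate the idea of building a lazy groupby on top of select, here is a minimal pure-Python sketch. It is not the HoloViews implementation: `select` and `lazy_groupby` are hypothetical helpers operating on a list of dicts, and each group is only computed when it is requested.

```python
from typing import Callable


def select(rows: list, **criteria) -> list:
    """Return the rows whose fields equal every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


def lazy_groupby(rows: list, dims: list) -> Callable[..., list]:
    """Return a callable that computes a single group on demand via select()."""
    def get_group(*key):
        if len(key) != len(dims):
            raise KeyError(key)
        # No groups are precomputed; the filter runs only for the requested key.
        return select(rows, **dict(zip(dims, key)))
    return get_group


rows = [
    {"Country": "UK", "Year": 1995, "Value": 0.1},
    {"Country": "UK", "Year": 1996, "Value": 0.2},
    {"Country": "USA", "Year": 1995, "Value": 0.3},
]
groups = lazy_groupby(rows, ["Country", "Year"])
uk_1996 = groups("UK", 1996)
```

Because nothing is materialized up front, the cost of each lookup is one pass over the data, which is the trade-off that makes this approach attractive for datasets too large to split eagerly.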

However, there are also some cases where the behavior is not well defined. When you apply a dynamic groupby to a columnar dataset, the data can be sparse, which means that some portions of the cartesian grid the DynamicMap defines can be empty. A simple example would be something like this:

import holoviews as hv
dataset = hv.Dataset((['UK', 'UK', 'USA'], [1995, 1996, 1995],  [0.1, 0.2, 0.3]),
                     kdims=['Country', 'Year', 'Index'], vdims=['Value'])
dmap = dataset.groupby(['Country', 'Year'], dynamic=True)
# dmap displays as  :DynamicMap   [Country,Year]
assert len(dmap['USA', 1996]) == 0

Here the entry for USA and 1996 did not exist, so it returned an empty Element. Alternatively it could raise a KeyError. However, since the semantics of a DynamicMap mean that anything inside the space defined by the Dimensions should be addressable, I think returning an empty Element is more appropriate. When you access a value that was not defined in the original Dataset, though, it should definitely raise a KeyError:

dmap['Canada', 1955]

So I'll have to make sure that, in bounded mode, DynamicMap.__getitem__ (and select) check that the requested key is among the defined values, not just within the bounds.
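The intended lookup semantics could be sketched roughly as follows. This is an assumption-laden illustration, not the actual DynamicMap code: `BoundedGroups` is a hypothetical class that records the values seen for each key dimension, raises a KeyError for any value outside those sets, and returns an empty list for a valid but unpopulated combination.

```python
class BoundedGroups:
    """Hypothetical stand-in for a bounded DynamicMap over a sparse table."""

    def __init__(self, rows, dims):
        self.rows = rows
        self.dims = dims
        # The values seen for each key dimension define the addressable grid.
        self.values = {d: {r[d] for r in rows} for d in dims}

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            key = (key,)
        if len(key) != len(self.dims):
            raise KeyError(key)
        for dim, val in zip(self.dims, key):
            # Check membership in the defined values, not just a min/max range.
            if val not in self.values[dim]:
                raise KeyError(f"{val!r} is not a defined value of {dim!r}")
        # A valid key with no matching rows yields an empty result.
        return [r for r in self.rows
                if all(r[d] == v for d, v in zip(self.dims, key))]


rows = [
    {"Country": "UK", "Year": 1995, "Value": 0.1},
    {"Country": "UK", "Year": 1996, "Value": 0.2},
    {"Country": "USA", "Year": 1995, "Value": 0.3},
]
groups = BoundedGroups(rows, ["Country", "Year"])
empty = groups["USA", 1996]  # both values are defined, but the cell is empty
```

With this check, a key such as ('UK', 1995.5) is rejected even though 1995.5 lies between the smallest and largest Year, which is the distinction between defined values and bounds.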

@philippjfr

Member

philippjfr commented Jun 8, 2016

Requires review and discussion about the behavior described above.

@jlstevens

Member

jlstevens commented Jun 8, 2016

By an empty element, do you mean an element without any data?

I remember discussing empty elements with you ages ago. If that is indeed what you mean, then that is the right behaviour. As long as all the visualization code is happy to process elements without data in them.

@philippjfr

Member

philippjfr commented Jun 8, 2016

> By an empty element, do you mean an element without any data?

Yes, basically Elements containing a length zero array or equivalent. We'd likely have to double check that all plots will handle them correctly though.

@jlstevens

Member

jlstevens commented Jun 8, 2016

Maybe for a separate issue, but I would like to say we always support empty elements. To do this, it would be good to automatically test that empty elements always work. That said, I'm not sure that what an 'empty element' is, is always defined. I suppose it is any valid datastructure (shape, type etc) with no data in it? Though, how could you have an empty MxN numpy array for instance?

For instance, we could have arranged it so data=None could be supported everywhere. Which could be useful but would be an orthogonal feature to empty data.

I like the idea of empty elements and have wanted to support them for ages. I'm just not sure that their semantics (i.e. how they should be declared) are fully defined and unambiguous.

@philippjfr

Member

philippjfr commented Jun 8, 2016

> Maybe for a separate issue, but I would like to say we always support empty elements. To do this, it would be good to automatically test that empty elements always work. That said, I'm not sure that what an 'empty element' is, is always defined. I suppose it is any valid datastructure (shape, type etc) with no data in it? Though, how could you have an empty MxN numpy array for instance?

You can define an array of shape (0, 0), so I don't think it's an issue. In the sparse data formats the shape is (0, D), which also works fine. Just need to make sure the plots don't choke on it.
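As a quick check of the shapes mentioned here, the following NumPy snippet builds both kinds of empty array. The choice of D = 3 columns is an arbitrary assumption for illustration.

```python
import numpy as np

# Gridded data with no samples at all.
empty_grid = np.empty((0, 0))

# Columnar (sparse) data: zero rows, D columns. D = 3 is an arbitrary choice.
empty_columns = np.empty((0, 3))

print(empty_grid.shape, empty_grid.size)        # (0, 0) 0
print(empty_columns.shape, len(empty_columns))  # (0, 3) 0
```

Both arrays keep a well-defined shape and dtype while holding no values, so code that only inspects the shape or iterates over rows continues to work.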

@jlstevens

Member

jlstevens commented Jun 8, 2016

I suppose the simplest solution might be to define the semantics as 'an empty element is any element with zero length data'. Very clear, even if you can always make data of the right shape etc that is empty. The assumption though is that len makes sense for all the data structures we support.

Otherwise, you can just declare it appropriately as you suggest.

@philippjfr

Member

philippjfr commented Jun 8, 2016

> I suppose the simplest solution might be to define the semantics as 'an empty element is any element with zero length data'. Very clear, even if you can always make data of the right shape etc that is empty. The assumption though is that len makes sense for all the data structures we support.

Length on Elements using the data interfaces is always defined as the total number of samples, so for a grid-based interface that's the product of the shape, and in the column-based format it's the number of rows. Checking for an empty Element should therefore be easy; the only problem is that some artists in matplotlib and bokeh might not allow being initialized with an empty array.
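That length rule can be written down in a few lines. The helpers below are hypothetical and only mirror the definition given above; they are not the functions the data interfaces actually expose.

```python
from math import prod


def gridded_length(shape):
    """Total number of samples on a grid: the product of its shape."""
    return prod(shape)


def columnar_length(rows):
    """Total number of samples in a table: the number of rows."""
    return len(rows)


def is_empty(length):
    """An element is empty when it holds zero samples."""
    return length == 0


full_grid = gridded_length((4, 5))        # 20 samples
degenerate_grid = gridded_length((3, 0))  # 0 samples despite a first axis of 3
no_rows = columnar_length([])             # 0 samples
```

Using the product of the shape rather than the length of the first axis matters for grids like (3, 0), which hold no samples even though their outer dimension is non-zero.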

@jlstevens

Member

jlstevens commented Jul 14, 2016

Looks good, and dynamic is False by default, so nothing should break unless the new feature is used. Merging.

@jlstevens jlstevens merged commit 124019e into master Jul 14, 2016

4 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.2%) to 69.583%
s3-reference-data-cache: Test data is cached.

@philippjfr philippjfr removed the in progress label Jul 14, 2016

@philippjfr philippjfr deleted the dataset_dynamic_groupby branch Sep 2, 2016
