
Data API #284

Merged
merged 230 commits into master on Nov 9, 2015

Conversation

philippjfr
Member

While not yet ready to be merged, this PR implements the proposals laid out in #269. I'll reproduce the main points of discussion here and outline a to-do list.

The core of the proposal is to add a core Columns Element class, which subsumes the functionality of the Chart, DFrame and NdElement classes. This will allow a wide range of Element classes to interchangeably use one of three data formats:

  • An array-based format: an NxD numpy array, where N is the number of samples and D the number of dimensions/columns.
  • A dataframe-based format with N samples and D columns, which should work with pandas, dask and blaze dataframes.
  • A pure-Python NdElement-based format storing the data as the keys and values of an OrderedDict.

This would mean the Curve, Points, VectorField, Bars, Table, ErrorBars, Spread, Scatter, Scatter3D, Trisurface, Distribution, TimeSeries and Regression Elements would immediately support all three data formats.
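To make the three formats concrete, here is a purely illustrative sketch (not the actual HoloViews code) of the same small three-column dataset expressed in each storage format; the dict stands in for a real pandas/dask/blaze DataFrame:

```python
from collections import OrderedDict
import numpy as np

# 1. Array format: an NxD array (N=3 samples, D=3 dimensions).
array_data = np.array([[0, 1, 10],
                       [1, 2, 20],
                       [2, 3, 30]])

# 2. DataFrame format (sketched here as a plain column dict; in
#    practice a pandas/dask/blaze DataFrame would be used).
frame_data = {'x': [0, 1, 2], 'y': [1, 2, 3], 'z': [10, 20, 30]}

# 3. NdElement format: an OrderedDict mapping key-dimension tuples
#    to value-dimension tuples.
ndelement_data = OrderedDict([((0, 1), (10,)),
                              ((1, 2), (20,)),
                              ((2, 3), (30,))])
```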

The data-format-specific implementations of the API are provided by utility classes; the interface will include:

  • dimension_values: Returns an array, list or pandas series of one column
  • range: Returns the minimum and maximum values along a dimension excluding NaNs.
  • select: Same as __getitem__ but works for one or multiple dimensions.
  • groupby: Groups by the values along one or multiple dimensions returning an NdMapping type indexed by the grouped dimensions with Elements of the same type as values. (not implemented for Charts)
  • dframe: Converts data to a dataframe
  • collapse_data: Applies a function across a list of Element.data attributes
  • reduce: Applies a reduce function across one or multiple axes
  • sample: Returns a Table of the samples specified as lists
  • reindex: Reorder specified dimensions dropping unspecified dimensions. (not implemented for Charts)
  • drop_dimension: Drops specified dimension(s)/column(s)
  • add_dimension: Adds named column at specified position with supplied value(s)
  • closest: Returns the closest samples given a list of coordinates.
  • shape: NxD shape of data
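As a rough sketch of what such a utility class might look like for the array format (class and method names here are illustrative, not the merged HoloViews API), a few of the methods above could be implemented as classmethods like this:

```python
import numpy as np

class ArrayInterface:
    """Illustrative utility class for the NxD array storage format."""

    @classmethod
    def dimension_values(cls, data, dim_index):
        """Return the column of values for one dimension."""
        return data[:, dim_index]

    @classmethod
    def range(cls, data, dim_index):
        """Minimum and maximum along a dimension, excluding NaNs."""
        column = data[:, dim_index]
        return np.nanmin(column), np.nanmax(column)

    @classmethod
    def shape(cls, data):
        """NxD shape of the data."""
        return data.shape

data = np.array([[0.0, 1.0], [1.0, np.nan], [2.0, 3.0]])
print(ArrayInterface.range(data, 1))  # NaN is excluded: (1.0, 3.0)
```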

and conversion methods, which can take an optional flag to return a clone of the Element with converted data:

  • array: Returns data as an array.
  • dframe: Returns data as a DataFrame.
  • mapping: Returns data as NdElement.
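The conversions can be illustrated with a toy round-trip between the array format and the NdElement-style mapping format (the real methods are array(), dframe() and mapping(); the helper names below are made up for this sketch):

```python
from collections import OrderedDict
import numpy as np

def to_mapping(array, kdims=1):
    """Convert an NxD array to an OrderedDict of key/value tuples."""
    return OrderedDict((tuple(row[:kdims]), tuple(row[kdims:]))
                       for row in array.tolist())

def to_array(mapping):
    """Convert the mapping format back to an NxD array."""
    return np.array([list(k) + list(v) for k, v in mapping.items()])

data = np.array([[0, 10], [1, 20]])
assert np.array_equal(to_array(to_mapping(data)), data)  # round-trips
```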

Backwards incompatible changes:

  • Various operations no longer return ItemTables; the variable return type was confusing and made the implementation very messy. All operations (select, reduce, sample) now return one of two types: either a modified Columns object or a single scalar.

Other features:

  • Currently Elements only support slicing of key dimensions, a common annoyance mentioned by several people. I propose we extend the slicing behavior of our types as follows: the __getitem__ method will now check whether the first or n-th index (where n is the number of key dimensions) matches an existing dimension by name, allowing the user to select particular columns. Passing a dimension name as the first index returns the values along that dimension; this matches the behavior of DataFrames and will allow us to provide an easy interface for more complex Boolean slicing, e.g.: columns[np.logical_or(columns['x'] > 5, columns['y'] > 10)]. Passing a dimension name as the n-th index preserves the current behavior when slicing tables, allowing you to select a particular value dimension. If the non-key indexes don't match any existing dimension, they are used to slice along the vdims.
  • Collapse/reduce work via the same mechanism: they are essentially a groupby operation over one or more key dimensions of an Element (or list of Elements), followed by an aggregation function.
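The proposed dimension-name indexing and Boolean slicing can be sketched with plain NumPy; this minimal stand-in class is hypothetical and exists only to show the intended __getitem__ semantics:

```python
import numpy as np

class Columns:  # hypothetical, minimal stand-in for illustration
    def __init__(self, data, dims):
        self.data, self.dims = data, dims

    def __getitem__(self, index):
        if isinstance(index, str) and index in self.dims:
            # Dimension name as first index: return that column's values.
            return self.data[:, self.dims.index(index)]
        # Boolean mask: return a new Columns with the matching rows.
        return Columns(self.data[index], self.dims)

columns = Columns(np.array([[6, 2], [1, 12], [2, 3]]), ['x', 'y'])
selected = columns[np.logical_or(columns['x'] > 5, columns['y'] > 10)]
print(selected.data)  # rows where x > 5 or y > 10
```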
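The collapse/reduce mechanism described above (groupby, then aggregate) can be sketched in pure Python; the (key, value) row representation and function name are for illustration only:

```python
from collections import OrderedDict

def groupby_aggregate(rows, agg):
    """Group (key, value) rows by key, then aggregate each group."""
    groups = OrderedDict()
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return OrderedDict((k, agg(v)) for k, v in groups.items())

rows = [('a', 1), ('b', 4), ('a', 3), ('b', 2)]
# Reduce over the key dimension with a mean aggregation:
print(groupby_aggregate(rows, agg=lambda vs: sum(vs) / len(vs)))
```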

One of the major sticking points is how the constructor should behave for different data formats. Currently the implemented behavior maps input types to storage formats as follows (some come with fallbacks if a certain condition isn't met):

  • ndarray -> ndarray
  • DataFrame -> DataFrame
  • NdElement -> NdElement
  • dict -> NdElement
  • list of scalar values -> ndarray
  • list of tuples for each row -> ndarray (if numeric) -> DataFrame (if available) -> NdElement
  • tuple of arrays/lists -> ndarray (if numeric) -> DataFrame (if available) -> NdElement
  • list of key value tuples (i.e. tuple of tuples) -> NdElement
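A hedged sketch of that dispatch logic (the function name is hypothetical, and the real implementation also applies the DataFrame/NdElement fallbacks listed above, which are elided here):

```python
from collections import OrderedDict
import numpy as np

def resolve_format(data):
    """Return which storage format a given input type would map to."""
    if isinstance(data, np.ndarray):
        return 'ndarray'
    if isinstance(data, (dict, OrderedDict)):
        return 'NdElement'
    if isinstance(data, list) and data and np.isscalar(data[0]):
        return 'ndarray'  # list of scalar values
    if isinstance(data, list) and data and isinstance(data[0], tuple):
        if isinstance(data[0][0], tuple):
            return 'NdElement'  # list of key/value tuples
        # list of row tuples -> ndarray if numeric (fallbacks elided)
        return 'ndarray'
    raise TypeError("Unsupported data format: %r" % type(data))

print(resolve_format(np.zeros((2, 2))))  # 'ndarray'
print(resolve_format([((0,), (1,))]))    # 'NdElement'
```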

The current to-do list is as follows:

  • Get nosetests passing
  • Test whether all affected Element types can be constructed with all data types.
  • Test whether all affected Element types display using both plotting backends.
  • Unify TableConversion interface with DFrame conversions
  • Implement data-type agnostic comparisons for Columns.
  • Refactor utility classes using classmethods rather than using instances (easier for now)
  • Write comprehensive set of unit tests providing 100% coverage of Columns module (Okay maybe not 100%, let's settle for 99%)
  • Achieve feature-parity between backends:
    • Constructors
    • getitem/select
    • dimension_values
    • range
    • shape
    • sample
    • reindex
    • add_dimension
    • closest
    • dframe
    • collapse
    • reduce
    • array
    • aggregate
    • sort
    • Add index dimension to NdElement storage format to allow duplicate entries (numpy and pandas support this by behavior already).
    • mapping
    • concatenate
  • Get notebook tests passing:
    • Composing Data
    • Containers
    • Continuous Coordinates
    • Elements (8 sorting related failures)
    • Exploring Data (2 NdMapping.info output changes)
    • Exporting
    • Introduction
    • Options
    • Pandas Conversion (16 sorting related failures)
    • Sampling Data (18 sorting related failures)
    • Showcase
  • Decide on good naming for the new module and classes
  • Document main Columns class and utility classes.

Improvements to plotting:

  • Support for datetime axes
  • Support for linked brushing when same DataFrame/NdElement is used across Layout.

That covers it for now. I won't have time to work on this in a concentrated way, so I'll be slowly ticking off the to-do list as I find time. In the meantime we can think about the naming.

Documentation and notebooks:

philippjfr and others added 30 commits August 31, 2015 18:52
@jlstevens
Contributor

Just a quick note about __setstate__: we should check if interface exists and only then decide on a suitable interface based on the data type (for old pickles). I can fix this after the merge...
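That backwards-compatible __setstate__ could look roughly like the following sketch; the class, attribute names and interface labels are illustrative stand-ins, not the merged code:

```python
import numpy as np

class ColumnsLike:  # hypothetical stand-in for the new Columns class
    def __setstate__(self, state):
        self.__dict__.update(state)
        if 'interface' not in state:
            # Old pickle without an interface: infer one from the data.
            if isinstance(self.data, np.ndarray):
                self.interface = 'array'
            else:
                self.interface = 'dictionary'

obj = ColumnsLike()
obj.__setstate__({'data': np.arange(3)})  # simulate loading an old pickle
print(obj.interface)  # 'array'
```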

@jlstevens
Contributor

One other thing to do after the merge: we should migrate HeatMap to be a Columns object. In addition, we should make sure that ItemTable can accept a row of a Table now that the return type from indexing has changed slightly. I suppose the rows of the ItemTable are simply the value dimensions of a Table with a single row?

@jlstevens
Contributor

Ok, all tests are green! Time to merge...

jlstevens added a commit that referenced this pull request Nov 9, 2015
Data API for flexible data manipulation and access
@jlstevens jlstevens merged commit e91b018 into master Nov 9, 2015
@jbednar
Member

jbednar commented Nov 9, 2015

Yay! Good work, guys!

@jbednar
Member

jbednar commented Nov 9, 2015

Is there an issue open elsewhere that lists blockers for the next release, now that this has been merged?

@philippjfr
Member Author

There's this milestone https://github.com/ioam/holoviews/milestones/v1.4.0.

Out of those, the following should be a quick fix (<1 hour each):

These are issues where we simply have to come to a decision:

These are items that might require a bit more work:

Then there's getting the build process streamlined and improving the docs:

@jbednar
Member

jbednar commented Nov 10, 2015

Out of those, the following should be a quick fix (<1 hour each):

Those all seem worth fixing, though the Arrows one might not be urgent.

These are issues where we simply have to come to a decision:

Well, you know my opinion on QuadMesh, so you two can be the deciding votes. I think we could postpone 221 again, as I still don't know the answer myself. I'm not sure what needs to be decided about 258; it sounds like you know you want reload_ext, so is the only issue whether to print the message? Again, you know my opinion on that, so just decide as you like.

These are items that might require a bit more work:

The last two seem a bit dangerous to add just before a release; it seems like we could either postpone them until after the release, or else put them in pretty much as-is but marked experimental. The DynamicMap stuff does seem very important to include in this release, to me.

Then there's getting the build process streamlined and improving the docs:

These last three seem safe to postpone until after a release; important but not required for the actual code release.

Apart from DynamicMap, the rest sounds like it adds up to a solid day's work, in which case I think you ought to do it now. How much for DynamicMap?

@jlstevens
Contributor

DynamicMap needs to be done as another PR. The basics are already working but we need to be careful that the design allows for the sort of interaction we really want (callbacks feeding information such as mouse position to the DynamicMap via something we have called Streams).

I am happy to work on that aspect, which shouldn't take too long. The part I would have more difficulty with is the JavaScript side: we need a flexible caching system that supports (1) no caching, (2) LRU, and (3) caching every nth frame. This is important for reproducibility and export, as the caching determines what you get when exporting a DynamicMap to a regular HoloMap. For this, Philipp would need to suggest how long it would take to implement (I think this could be easy: an LRU that accepts 0 or infinite frames, together with an integer modulo, i.e. two parameters, could do the job reasonably well).

@jbednar
Member

jbednar commented Nov 10, 2015

Sounds like DynamicMap should be released, since it's very useful and urgently needed, but marked experimental in the release notes since we're still polishing it.

For caching, seems like you do need some support, but that can be part of what's marked experimental, and we can tune it later. To me it seems like we do need at least two parameters:

  1. Something that limits the total amount cached (either in terms of memory, which seems tricky to calculate, or in terms of total number of frames, which seems easier), defaulting to some large number but potentially settable to infinity if people wish. This would be a hard limit, regardless of setting 2.
  2. Something that allows people to specify that only every nth frame is exported, regardless of how many are visible interactively. The skipped frames would never be exported, regardless of setting 1.

This still leaves it undefined what happens when the limit in 1 is reached -- should it start deleting LRU, or start decimating frames? We should probably just pick one behavior and leave it unconfigurable (at least for now), because if we started to add parameters to control it people would get confused between those option-1 parameters and those for option 2 (which is independent of option 1's limits).
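A sketch combining the two parameters discussed, a hard LRU limit on cached frames (parameter 1) and export of only every nth frame (parameter 2); all names here are hypothetical, and the eviction policy shown (delete least recently used) is just one of the behaviors under discussion:

```python
from collections import OrderedDict

class FrameCache:
    def __init__(self, max_frames=100, export_every=1):
        self.max_frames = max_frames      # hard cap on cached frames
        self.export_every = export_every  # keep every nth frame on export
        self._cache = OrderedDict()       # usage-ordered, i.e. LRU

    def add(self, index, frame):
        self._cache[index] = frame
        self._cache.move_to_end(index)    # mark as most recently used
        if len(self._cache) > self.max_frames:
            self._cache.popitem(last=False)  # evict least recently used

    def export(self):
        """Frames kept when converting a DynamicMap to a HoloMap."""
        return {i: f for i, f in self._cache.items()
                if i % self.export_every == 0}

cache = FrameCache(max_frames=3, export_every=2)
for i in range(5):
    cache.add(i, 'frame%d' % i)
print(sorted(cache._cache))   # hard limit keeps the last 3: [2, 3, 4]
print(sorted(cache.export())) # only every 2nd frame survives: [2, 4]
```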

@vascotenner vascotenner mentioned this pull request Nov 23, 2015