
Data API #284

Merged
merged 230 commits into master on Nov 9, 2015

Conversation

philippjfr
Member

While not yet ready to be merged, this PR implements the proposals laid out in #269. I'll reproduce the main points of discussion here and outline a to-do list.

The core of the proposal is to add a core Columns Element class, which subsumes the functionality of the Chart, DFrame and NdElement classes. This will allow a wide range of Element classes to interchangeably use one of three data formats:

  • An array-based format: an NxD numpy array, where N is the number of samples and D the number of dimensions/columns.
  • A dataframe-based format with N samples and D columns, which should work with pandas, dask and blaze dataframes.
  • A pure-Python NdElement-based format storing the data as the keys and values of an OrderedDict.

This would mean the Curve, Points, VectorField, Bars, Table, ErrorBars, Spread, Scatter, Scatter3D, Trisurface, Distribution, TimeSeries and Regression Elements would immediately support all three data formats.
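To make the three formats concrete, here is a purely illustrative sketch (not the actual HoloViews code) of the same small three-column dataset expressed in each storage format; the dict stands in for a real pandas/dask/blaze DataFrame:

```python
from collections import OrderedDict
import numpy as np

# 1. Array format: an NxD array (N=3 samples, D=3 dimensions).
array_data = np.array([[0, 1, 10],
                       [1, 2, 20],
                       [2, 3, 30]])

# 2. DataFrame format (sketched here as a plain column dict; in
#    practice a pandas/dask/blaze DataFrame would be used).
frame_data = {'x': [0, 1, 2], 'y': [1, 2, 3], 'z': [10, 20, 30]}

# 3. NdElement format: an OrderedDict mapping key-dimension tuples
#    to value-dimension tuples.
ndelement_data = OrderedDict([((0, 1), (10,)),
                              ((1, 2), (20,)),
                              ((2, 3), (30,))])
```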

The data-format-specific implementations of the API are provided by utility classes; the interface will include:

  • dimension_values: Returns an array, list or pandas series of one column
  • range: Returns the minimum and maximum values along a dimension excluding NaNs.
  • select: Same as __getitem__ but works for one or multiple dimensions.
  • groupby: Groups by the values along one or multiple dimensions returning an NdMapping type indexed by the grouped dimensions with Elements of the same type as values. (not implemented for Charts)
  • dframe: Converts data to a dataframe
  • collapse_data: Applies a function across a list of Element.data attributes
  • reduce: Applies a reduce function across one or multiple axes
  • sample: Returns a Table of the samples specified as lists
  • reindex: Reorder specified dimensions dropping unspecified dimensions. (not implemented for Charts)
  • drop_dimension: Drops specified dimension(s)/column(s)
  • add_dimension: Adds named column at specified position with supplied value(s)
  • closest: Returns the closest samples given a list of coordinates.
  • shape: NxD shape of data
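As a rough sketch of what such a utility class might look like for the array format (class and method names here are illustrative, not the merged HoloViews API), a few of the methods above could be implemented as classmethods like this:

```python
import numpy as np

class ArrayInterface:
    """Illustrative utility class for the NxD array storage format."""

    @classmethod
    def dimension_values(cls, data, dim_index):
        """Return the column of values for one dimension."""
        return data[:, dim_index]

    @classmethod
    def range(cls, data, dim_index):
        """Minimum and maximum along a dimension, excluding NaNs."""
        column = data[:, dim_index]
        return np.nanmin(column), np.nanmax(column)

    @classmethod
    def shape(cls, data):
        """NxD shape of the data."""
        return data.shape

data = np.array([[0.0, 1.0], [1.0, np.nan], [2.0, 3.0]])
print(ArrayInterface.range(data, 1))  # NaN is excluded: (1.0, 3.0)
```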

and conversion methods, which can take an optional flag to return a clone of the Element with converted data:

  • array: Returns data as an array.
  • dframe: Returns data as a DataFrame.
  • mapping: Returns data as NdElement.
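The conversions can be illustrated with a toy round-trip between the array format and the NdElement-style mapping format (the real methods are array(), dframe() and mapping(); the helper names below are made up for this sketch):

```python
from collections import OrderedDict
import numpy as np

def to_mapping(array, kdims=1):
    """Convert an NxD array to an OrderedDict of key/value tuples."""
    return OrderedDict((tuple(row[:kdims]), tuple(row[kdims:]))
                       for row in array.tolist())

def to_array(mapping):
    """Convert the mapping format back to an NxD array."""
    return np.array([list(k) + list(v) for k, v in mapping.items()])

data = np.array([[0, 10], [1, 20]])
assert np.array_equal(to_array(to_mapping(data)), data)  # round-trips
```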

Backwards incompatible changes:

  • Various operations no longer return ItemTables; the variable return type was confusing and made the implementation very messy. All operations (select, reduce, sample) now return one of two types: either a modified Columns object or a single scalar.

Other features:

  • Currently Elements only support slicing of key dimensions, a common annoyance mentioned by several people. I propose we extend the slicing behavior of our types as follows: the __getitem__ method will now check whether the first or n-th index (where n is the number of key dimensions) matches an existing dimension by name, allowing the user to select particular columns. Passing a dimension name as the first index returns the values along that dimension; this matches the behavior of DataFrames and will allow us to provide an easy interface for more complex Boolean slicing, e.g.: columns[np.logical_or(columns['x'] > 5, columns['y'] > 10)]. Passing a dimension name as the n-th index preserves the current behavior when slicing tables, allowing you to select a particular value dimension. If the non-key indexes don't match any existing dimension, they are used to slice along the vdims.
  • Collapse/reduce work via the same mechanism: they are essentially a groupby operation over one or more key dimensions of an Element (or list of Elements), followed by an aggregation function.
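The proposed dimension-name indexing and Boolean slicing can be sketched with plain NumPy; this minimal stand-in class is hypothetical and exists only to show the intended __getitem__ semantics:

```python
import numpy as np

class Columns:  # hypothetical, minimal stand-in for illustration
    def __init__(self, data, dims):
        self.data, self.dims = data, dims

    def __getitem__(self, index):
        if isinstance(index, str) and index in self.dims:
            # Dimension name as first index: return that column's values.
            return self.data[:, self.dims.index(index)]
        # Boolean mask: return a new Columns with the matching rows.
        return Columns(self.data[index], self.dims)

columns = Columns(np.array([[6, 2], [1, 12], [2, 3]]), ['x', 'y'])
selected = columns[np.logical_or(columns['x'] > 5, columns['y'] > 10)]
print(selected.data)  # rows where x > 5 or y > 10
```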
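The collapse/reduce mechanism described above (groupby, then aggregate) can be sketched in pure Python; the (key, value) row representation and function name are for illustration only:

```python
from collections import OrderedDict

def groupby_aggregate(rows, agg):
    """Group (key, value) rows by key, then aggregate each group."""
    groups = OrderedDict()
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return OrderedDict((k, agg(v)) for k, v in groups.items())

rows = [('a', 1), ('b', 4), ('a', 3), ('b', 2)]
# Reduce over the key dimension with a mean aggregation:
print(groupby_aggregate(rows, agg=lambda vs: sum(vs) / len(vs)))
```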

One of the major sticking points is how the constructor should behave for different data formats. Currently the implemented behavior maps input types to storage formats as follows (some come with fallbacks if a certain condition isn't met):

  • ndarray -> ndarray
  • DataFrame -> DataFrame
  • NdElement -> NdElement
  • dict -> NdElement
  • list of scalar values -> ndarray
  • list of tuples for each row -> ndarray (if numeric) -> DataFrame (if available) -> NdElement
  • tuple of arrays/lists -> ndarray (if numeric) -> DataFrame (if available) -> NdElement
  • list of key value tuples (i.e. tuple of tuples) -> NdElement
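A hedged sketch of that dispatch logic (the function name is hypothetical, and the real implementation also applies the DataFrame/NdElement fallbacks listed above, which are elided here):

```python
from collections import OrderedDict
import numpy as np

def resolve_format(data):
    """Return which storage format a given input type would map to."""
    if isinstance(data, np.ndarray):
        return 'ndarray'
    if isinstance(data, (dict, OrderedDict)):
        return 'NdElement'
    if isinstance(data, list) and data and np.isscalar(data[0]):
        return 'ndarray'  # list of scalar values
    if isinstance(data, list) and data and isinstance(data[0], tuple):
        if isinstance(data[0][0], tuple):
            return 'NdElement'  # list of key/value tuples
        # list of row tuples -> ndarray if numeric (fallbacks elided)
        return 'ndarray'
    raise TypeError("Unsupported data format: %r" % type(data))

print(resolve_format(np.zeros((2, 2))))  # 'ndarray'
print(resolve_format([((0,), (1,))]))    # 'NdElement'
```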

The current to-do list is as follows:

  • Get nosetests passing
  • Test whether all affected Element types can be constructed with all data types.
  • Test whether all affected Element types display using both plotting backends.
  • Unify TableConversion interface with DFrame conversions
  • Implement data-type agnostic comparisons for Columns.
  • Refactor utility classes using classmethods rather than using instances (easier for now)
  • Write comprehensive set of unit tests providing 100% coverage of Columns module (Okay maybe not 100%, let's settle for 99%)
  • Achieve feature-parity between backends:
    • Constructors
    • getitem/select
    • dimension_values
    • range
    • shape
    • sample
    • reindex
    • add_dimension
    • closest
    • dframe
    • collapse
    • reduce
    • array
    • aggregate
    • sort
    • Add index dimension to NdElement storage format to allow duplicate entries (numpy and pandas support this by behavior already).
    • mapping
    • concatenate
  • Get notebook tests passing:
    • Composing Data
    • Containers
    • Continuous Coordinates
    • Elements (8 sorting related failures)
    • Exploring Data (2 NdMapping.info output changes)
    • Exporting
    • Introduction
    • Options
    • Pandas Conversion (16 sorting related failures)
    • Sampling Data (18 sorting related failures)
    • Showcase
  • Decide on good naming for the new module and classes
  • Document main Columns class and utility classes.

Improvements to plotting:

  • Support for datetime axes
  • Support for linked brushing when same DataFrame/NdElement is used across Layout.

That covers it for now. I won't have time to work on this in a concentrated way, so I'll be slowly ticking off the to-do list as I find time. In the meantime we can think about the naming.

Documentation and notebooks:

philippjfr and others added 30 commits August 31, 2015 18:52
@jlstevens
Contributor

Just a quick note about __setstate__: we should check if interface exists and only then decide on a suitable interface based on the data type (for old pickles). I can fix this after the merge...
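That backwards-compatible __setstate__ could look roughly like the following sketch; the class, attribute names and interface labels are illustrative stand-ins, not the merged code:

```python
import numpy as np

class ColumnsLike:  # hypothetical stand-in for the new Columns class
    def __setstate__(self, state):
        self.__dict__.update(state)
        if 'interface' not in state:
            # Old pickle without an interface: infer one from the data.
            if isinstance(self.data, np.ndarray):
                self.interface = 'array'
            else:
                self.interface = 'dictionary'

obj = ColumnsLike()
obj.__setstate__({'data': np.arange(3)})  # simulate loading an old pickle
print(obj.interface)  # 'array'
```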

@jlstevens
Contributor

One other thing to do after the merge: we should migrate HeatMap to be a Columns object. In addition, we should make sure that ItemTable can accept a row of a Table now that the return type from indexing has changed slightly. I suppose the rows of the ItemTable are simply the value dimensions of a Table with a single row?

@jlstevens
Contributor

Ok, all tests are green! Time to merge...

jlstevens added a commit that referenced this pull request Nov 9, 2015
Data API for flexible data manipulation and access
@jlstevens jlstevens merged commit e91b018 into master Nov 9, 2015
@jbednar
Member

jbednar commented Nov 9, 2015

Yay! Good work, guys!

@jbednar
Member

jbednar commented Nov 9, 2015

Is there an issue open elsewhere that lists blockers for the next release, now that this has been merged?

@philippjfr
Member Author

There's this milestone https://github.com/ioam/holoviews/milestones/v1.4.0.

Out of those, the following should be a quick fix (<1 hour each):

These are issues where we simply have to come to a decision:

These are items that might require a bit more work:

Then there's getting the build process streamlined and improving the docs:

@jbednar
Member

jbednar commented Nov 10, 2015

Out of those, the following should be a quick fix (<1 hour each):

Those all seem worth fixing, though the Arrows one might not be urgent.

These are issues where we simply have to come to a decision:

Well, you know my opinion on QuadMesh, so you two can be the deciding votes. I think we could postpone 221 again, as I still don't know the answer myself. I'm not sure what needs to be decided about 258; it sounds like you know you want reload_ext, so is the only issue whether to print the message? Again, you know my opinion on that, so just decide as you like.

These are items that might require a bit more work:

The last two seem a bit dangerous to add just before a release; it seems like we could either postpone them until after the release, or else put them in pretty much as-is but marked experimental. The DynamicMap stuff does seem very important to include in this release, to me.

Then there's getting the build process streamlined and improving the docs:

These last three seem safe to postpone until after a release; important but not required for the actual code release.

Apart from DynamicMap, the rest sounds like it adds up to a solid day's work, in which case I think you ought to do it now. How much for DynamicMap?

@jlstevens
Contributor

DynamicMap needs to be done as another PR. The basics are already working but we need to be careful that the design allows for the sort of interaction we really want (callbacks feeding information such as mouse position to the DynamicMap via something we have called Streams).

I am happy to work on that aspect, which shouldn't take too long. The part I would have more difficulty with is the JavaScript side: we need a flexible caching system that supports (1) no caching, (2) LRU, and (3) caching every nth frame. This is important for reproducibility and export, as the caching determines what you get when exporting a DynamicMap to a regular HoloMap. For this, Philipp would need to suggest how long it would take to implement (I think this could be easy: an LRU that accepts 0 or infinite frames, together with an integer modulo, i.e. two parameters, could do the job reasonably well).

@jbednar
Member

jbednar commented Nov 10, 2015

Sounds like DynamicMap should be released, since it's very useful and urgently needed, but marked experimental in the release notes since we're still polishing it.

For caching, seems like you do need some support, but that can be part of what's marked experimental, and we can tune it later. To me it seems like we do need at least two parameters:

  1. Something that limits the total amount cached (either in terms of memory, which seems tricky to calculate, or in terms of total number of frames, which seems easier), defaulting to some large number but potentially settable to infinity if people wish. This would be a hard limit, regardless of setting 2.
  2. Something that allows people to specify that only every nth frame is exported, regardless of how many are visible interactively. The skipped frames would never be exported, regardless of setting 1.

This still leaves it undefined what happens when the limit in 1 is reached -- should it start deleting LRU, or start decimating frames? We should probably just pick one behavior and leave it unconfigurable (at least for now), because if we started to add parameters to control it people would get confused between those option-1 parameters and those for option 2 (which is independent of option 1's limits).
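A sketch combining the two parameters discussed, a hard LRU limit on cached frames (parameter 1) and export of only every nth frame (parameter 2); all names here are hypothetical, and the eviction policy shown (delete least recently used) is just one of the behaviors under discussion:

```python
from collections import OrderedDict

class FrameCache:
    def __init__(self, max_frames=100, export_every=1):
        self.max_frames = max_frames      # hard cap on cached frames
        self.export_every = export_every  # keep every nth frame on export
        self._cache = OrderedDict()       # usage-ordered, i.e. LRU

    def add(self, index, frame):
        self._cache[index] = frame
        self._cache.move_to_end(index)    # mark as most recently used
        if len(self._cache) > self.max_frames:
            self._cache.popitem(last=False)  # evict least recently used

    def export(self):
        """Frames kept when converting a DynamicMap to a HoloMap."""
        return {i: f for i, f in self._cache.items()
                if i % self.export_every == 0}

cache = FrameCache(max_frames=3, export_every=2)
for i in range(5):
    cache.add(i, 'frame%d' % i)
print(sorted(cache._cache))   # hard limit keeps the last 3: [2, 3, 4]
print(sorted(cache.export())) # only every 2nd frame survives: [2, 4]
```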

@vascotenner vascotenner mentioned this pull request Nov 23, 2015