
Multi-dimensional dim transforms on data sets #4080

Merged: 50 commits merged into master on Mar 9, 2020
Conversation

@poplarShift (Collaborator) commented on Oct 31, 2019:

Addresses #3932 and #237

Supersedes #3636.

Related to #3790.

This PR makes it possible to apply arbitrary dim transforms with multiple output values to Datasets, taking care to insert dimensions correctly. The new method is called .transform, and it accepts transforms in two ways: as positional (dimensions, dim_transform) tuples or as keyword arguments. The method applies each transform and either replaces the existing dimensions or appends the outputs as new value dimensions.

The upside of all of this is that we get complex statistical aggregation for free; see below for an example with hex bins that compute a trend within each bin.

List of changes

  • Add Dataset.transform
  • Implement Interface.assign
  • Dataset.aggregate now handles dim transforms
  • Add tests
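As a rough mental model of what `.transform` does (a pure-pandas sketch on made-up data, not the actual HoloViews code path): a multi-output transform evaluates a function over existing columns and inserts each output as a new value dimension, broadcasting scalars as needed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(u=[0, 1, 2, 3], v=[3, 2, 1, 0]))

# Hypothetical stand-in for a two-output dim transform:
# returns one array and one scalar.
def two_outputs(u, v):
    return u + v, np.mean(v)

a, b = two_outputs(df['u'].to_numpy(), df['v'].to_numpy())
out = df.assign(a=a, b=b)  # the scalar b is broadcast across all rows
```

Here `two_outputs`, `a`, and `b` are illustrative names; in HoloViews the output names come from the tuple or keyword supplied to `.transform`.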

Setup

import xarray as xr
from holoviews import Dataset
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import dim, opts

Multi-dimensional dim transforms and aggregations

df = pd.DataFrame(dict(
    x=np.array(range(7))%2,
    y=np.array(range(7))%3,
    u=np.array(range(7))%4,
    v=np.array(range(7))%5
))
ds = Dataset(df, ['x', 'y'], ['u', 'v'])
nds = ds.groupby(['x'])

# scalar output
tf1 = dim('u', lambda u, v: np.sum(u) + np.sum(v), dim('v'))
# tuple of arrays
tf2 = dim('u', lambda u, v: (u, np.mean(v)), dim('v'))
print(ds.data.head())
   x  y  u  v
0  0  0  0  0
1  1  1  1  1
2  0  2  2  2
3  1  0  3  3
4  0  1  0  4
print(ds.transform(w=tf1).data)

# same as:
# ds.transform(('w', tf1)).data
   x  y  u  v   w
0  0  0  0  0  20
1  1  1  1  1  20
2  0  2  2  2  20
3  1  0  3  3  20
4  0  1  0  4  20
5  1  2  1  0  20
6  0  0  2  1  20
print(ds.transform((('a', 'b'), tf2), drop=True).data)
a  b
0  1.5714285714285714
1  1.5714285714285714
2  1.5714285714285714
3  1.5714285714285714
print(ds.aggregate('y', w=tf1).data)
   y  w
0  0  9
1  1  6
2  2  5
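For comparison, the `aggregate` result above can be reproduced with a plain pandas groupby on the same toy data (a sketch of the semantics, not the HoloViews implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x=np.arange(7) % 2,
    y=np.arange(7) % 3,
    u=np.arange(7) % 4,
    v=np.arange(7) % 5,
))

# tf1 reduces each group to sum(u) + sum(v); grouping by 'y'
# mirrors ds.aggregate('y', w=tf1).
w = df.groupby('y')[['u', 'v']].sum().sum(axis=1)
print(w.tolist())  # [9, 6, 5]
```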

Example: Complex hex binning operations

hv.extension('bokeh')
xds = xr.tutorial.open_dataset('air_temperature').sel(time=slice(None, '2013-1-5'))
df = xds.to_dataframe().reset_index()

def regression(x1, x2):
    x1 = pd.to_numeric(x1) / 1e9 / 86400.  # datetime64[ns] -> days since the epoch
    p = np.polyfit(x1, x2, 1)
    return p[0], p[1]
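A quick sanity check of `regression` on synthetic data (the function is repeated here so the snippet is self-contained; the division turns nanoseconds since the epoch into days, so the slope is in units per day):

```python
import numpy as np
import pandas as pd

def regression(x1, x2):
    x1 = pd.to_numeric(x1) / 1e9 / 86400.  # datetime64[ns] -> days
    p = np.polyfit(x1, x2, 1)
    return p[0], p[1]

# Values rising by 2.0 per day should yield a slope of ~2.0.
times = pd.Series(pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']))
slope, offset = regression(times, np.array([10.0, 12.0, 14.0]))
```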

tf = dim('time', regression, dim('air'))
ds = hv.Dataset(df, ['lon', 'lat'], ['time', 'air'])

e = hv.HexTiles(ds)
e.opts(gridsize=10, aggregator=(('trend', 'offset'), tf),
       color=dim('trend'),
       scale=dim('offset').norm(),
       colorbar=True, width=600)
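The aggregation this drives can be sketched without Bokeh: bin the points spatially, then fit a line within each bin. The sketch below uses simple square bins and synthetic data purely for illustration (`HexTiles` bins hexagonally, and names like `trend_offset` are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(dict(
    lon=rng.uniform(0, 10, n),
    lat=rng.uniform(0, 10, n),
    t=np.tile(np.arange(10.0), 20),  # fake time axis, in days
))
df['air'] = 0.5 * df['t'] + rng.normal(0, 0.1, n)  # built-in trend of 0.5/day

# Square bins as a stand-in for hexagonal bins.
df['bin'] = (df['lon'] // 5).astype(int) * 2 + (df['lat'] // 5).astype(int)

def trend_offset(g):
    slope, intercept = np.polyfit(g['t'], g['air'], 1)
    return pd.Series(dict(trend=slope, offset=intercept))

per_bin = df.groupby('bin')[['t', 'air']].apply(trend_offset)
# each bin's 'trend' should come out close to 0.5
```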

[Image: hex-tiles plot of the air_temperature dataset, colored by per-bin trend and scaled by per-bin offset]

@philippjfr (Member) commented:

Sorry I never reviewed this. I'd very much like to get this into the release, so I'm going to take it over.

@poplarShift (Collaborator, Author) commented:

Sorry for not replying earlier. I currently don't have a lot of spare time, but I'm still interested in getting this in.

@philippjfr (Member) commented:

@poplarShift I have a potentially much simpler implementation here, but one thing I haven't figured out is the drop_duplicate_data keyword argument. Why is it needed here?

@poplarShift (Collaborator, Author) commented:

I see you already went ahead and got rid of it. I like the solution with .assign!

@philippjfr (Member) commented:

> I see you already went ahead and got rid of it. I like the solution with .assign!

I did, but is there a good reason why I shouldn't have?

@jbednar (Member) left a review comment:

This is some really amazing and useful functionality. Among many other things, it makes the new link_selections even more powerful, by making it simple to express arbitrarily complex data transformation pipelines that can then be linked by dimension automatically. This is a major step up in power for HoloViews!

Review threads (resolved) on: holoviews/core/accessors.py, holoviews/core/data/__init__.py, holoviews/core/data/dictionary.py, holoviews/core/util.py, holoviews/util/transform.py
@philippjfr philippjfr merged commit eaa8e4c into master Mar 9, 2020
@poplarShift (Collaborator, Author) commented:

@philippjfr Awesome!

Sorry for being a bit slow with replying these days. I don't remember the exact reasons for the drop_duplicate_data kwarg, but I think they were specific to my implementation.

Also, thanks for seeing this through. I'm super excited about using this straight from the source instead of my monkey-patched code snippets! As @jbednar said, this is indeed a major step up for workflow design.

@philippjfr philippjfr deleted the transforms branch April 25, 2022 14:41