Support ds.where and ds.summary and add selector #5805

Merged
merged 25 commits into from Jul 24, 2023

Hoxbro
Member

@Hoxbro Hoxbro commented Jul 13, 2023

Example code:

import datashader as ds
import holoviews as hv
import numpy as np
import pandas as pd
from holoviews.operation.datashader import rasterize

hv.extension("bokeh")

num = 10000
np.random.seed(1)

dists = {
    cat: pd.DataFrame(
        {
            "x": np.random.normal(x, s, num),
            "y": np.random.normal(y, s, num),
            "s": s,
            "val": val,
            "cat": cat,
        }
    )
    for x, y, s, val, cat in [
        (2, 2, 0.03, 0, "d1"),
        (2, -2, 0.10, 1, "d2"),
        (-2, -2, 0.50, 2, "d3"),
        (-2, 2, 1.00, 3, "d4"),
        (0, 0, 3.00, 4, "d5"),
    ]
}

df = pd.concat(dists, ignore_index=True)
agg = ds.where(ds.min("s"))

plot = rasterize(hv.Points(df), aggregator=agg).opts(
    tools=["hover"], colorbar=True, width=500
)

With agg = ds.where(ds.min("val")):

val2.mp4

With agg = ds.where(ds.min("s")):

s.mp4
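
A minimal, dependency-free sketch of what a "where min" reduction computes, assuming that `ds.where(ds.min("s"))` records, for each pixel, which row had the smallest `s`. The function `where_min`, the tiny 2x2 canvas, and the `-1` marker for empty pixels are illustrative choices for this sketch, not datashader's implementation:

```python
def where_min(rows, key, width, height, x_range, y_range):
    """Per pixel, return the index of the row with the smallest `key` value.

    Pixels that receive no points hold -1 (a convention chosen for this sketch).
    """
    (x0, x1), (y0, y1) = x_range, y_range
    best = [[None] * width for _ in range(height)]  # smallest key seen so far
    index = [[-1] * width for _ in range(height)]   # row index that produced it
    for i, row in enumerate(rows):
        x, y = row["x"], row["y"]
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # point falls outside the canvas
        col = int((x - x0) / (x1 - x0) * width)
        line = int((y - y0) / (y1 - y0) * height)
        if best[line][col] is None or row[key] < best[line][col]:
            best[line][col] = row[key]
            index[line][col] = i
    return index


rows = [
    {"x": 0.2, "y": 0.2, "s": 0.50, "cat": "d3"},
    {"x": 0.3, "y": 0.1, "s": 0.03, "cat": "d1"},  # same pixel as row 0, smaller s
    {"x": 0.8, "y": 0.7, "s": 1.00, "cat": "d4"},
]
index = where_min(rows, "s", width=2, height=2, x_range=(0, 1), y_range=(0, 1))
```

Because each pixel stores a row index rather than a value, any other column of the winning row (for example `cat`) can be looked up afterwards via `rows[index[line][col]]`.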

@Hoxbro Hoxbro marked this pull request as draft July 13, 2023 13:54
@codecov-commenter

codecov-commenter commented Jul 13, 2023

Codecov Report

Merging #5805 (cab0cc9) into main (d48950a) will increase coverage by 0.01%.
The diff coverage is 97.22%.

@@            Coverage Diff             @@
##             main    #5805      +/-   ##
==========================================
+ Coverage   88.19%   88.20%   +0.01%     
==========================================
  Files         307      307              
  Lines       63305    63399      +94     
==========================================
+ Hits        55829    55920      +91     
- Misses       7476     7479       +3     
| Flag | Coverage Δ |
|---|---|
| ui-tests | 22.33% <18.51%> (-0.02%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| holoviews/operation/datashader.py | 83.80% <95.00%> (+0.42%) ⬆️ |
| holoviews/tests/operation/test_datashader.py | 97.59% <100.00%> (+0.13%) ⬆️ |

@Hoxbro
Member Author

Hoxbro commented Jul 17, 2023

With selector:

screenrecord-2023-07-17_15.47.31.mp4

@jlstevens
Contributor

Looking good!

@Hoxbro Hoxbro changed the title Support where inspection Support ds.where and ds.summary and add selector Jul 19, 2023
@Hoxbro Hoxbro force-pushed the inspect_where branch 2 times, most recently from 15faf72 to 16d72a1 Compare July 19, 2023 09:52
@Hoxbro Hoxbro marked this pull request as ready for review July 19, 2023 09:57
@Hoxbro Hoxbro added this to the 1.17.0 milestone Jul 19, 2023
Member

@philippjfr philippjfr left a comment

This looks really good! It will need a whole new documentation section though (which we'll have @jbednar write) and really the datashader user guide will need to be split up. I did ask one clarification question about the ways in which aggregator=where(...) and selector=... interact because I can't quite wrap my head around it.

@jlstevens
Contributor

jlstevens commented Jul 20, 2023

Very happy with this PR already, just used it successfully in a talk at EuroPython! :-)

I'll be taking a closer look soon but my initial question is whether you have any idea how first_n and last_n might work as selectors? I suppose the DataSet could have m columns x n selector layers? While this would quickly blow up for wide data with lots of columns (or large n) I assume this is the natural extension?

Alternatively, first and last should probably behave like first_n and last_n where n=1...
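
One way to picture the relationship between `first` and `first_n`, as a plain-Python sketch. The helper names and the list of pixel coordinates are hypothetical stand-ins for illustration; this is not how datashader or this PR implements these reductions:

```python
def first_n(pixels, n):
    """Per pixel, keep the indices of the first n rows that land there."""
    out = {}
    for i, px in enumerate(pixels):
        kept = out.setdefault(px, [])
        if len(kept) < n:
            kept.append(i)
    return out


def first(pixels):
    """`first` expressed as `first_n` with n=1, unwrapping the single index."""
    return {px: idx[0] for px, idx in first_n(pixels, 1).items()}


# Pixel coordinate that each of five rows falls into (row order = list order).
pixels = [(0, 0), (0, 0), (1, 0), (0, 0), (1, 0)]
two_layers = first_n(pixels, 2)
one_layer = first(pixels)
```

With `n` selector layers, each pixel carries up to `n` row indices, and the `n=1` case reduces to the single-index result that `first` would give.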

params["vdims"] = [params["vdims"]]
sum_agg = ds.summary(**{str(params["vdims"][0]): agg_fn, "index": ds.where(sel_fn)})
agg = self._apply_datashader(dfdata, cvs_fn, sum_agg, agg_kwargs, x, y)
_ignore = [*params["vdims"], "index"]
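
The `ds.summary(...)` call in the snippet above bundles two reductions into one pass: the displayed aggregate under the vdim's name, and a `where` layer under `"index"`. A dependency-free sketch of that idea follows, assuming a count as the aggregate and a minimum as the selector. The function `summarize`, the `pixel_of` callback, and the `"Count"` name are hypothetical and only illustrate the shape of the result:

```python
def summarize(rows, pixel_of, vdim, sel_key):
    """One pass yielding two per-pixel layers: a count and a selected row index."""
    out = {}
    for i, row in enumerate(rows):
        px = pixel_of(row)
        cell = out.setdefault(px, {vdim: 0, "index": i})
        cell[vdim] += 1  # the displayed aggregate (a count here)
        if row[sel_key] < rows[cell["index"]][sel_key]:
            cell["index"] = i  # the selector layer (row with the minimum sel_key)
    return out


rows = [
    {"x": 0.2, "y": 0.4, "s": 0.50, "val": 2},
    {"x": 0.7, "y": 0.1, "s": 0.03, "val": 0},
    {"x": 1.5, "y": 0.5, "s": 1.00, "val": 3},
]
# Unit-sized pixels: truncate the coordinates to integers.
layers = summarize(rows, lambda r: (int(r["x"]), int(r["y"])), "Count", "s")
hovered = rows[layers[(0, 0)]["index"]]  # full row behind the pixel at (0, 0)
```

In this picture the aggregate decides what is colormapped, while the index layer only decides which original row supplies the extra hover information.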
Contributor

While it is a good default, the name "index" is a magic value. I could be wrong but couldn't this clash with a column name?
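
If such a clash turned out to be possible, one common remedy is to derive a key that no existing column uses. A small sketch, where `unique_key` is a hypothetical helper and not something this PR is known to contain:

```python
def unique_key(preferred, taken):
    """Return `preferred`, or a suffixed variant that is not already in `taken`."""
    taken = set(taken)
    if preferred not in taken:
        return preferred
    n = 1
    while f"{preferred}_{n}" in taken:
        n += 1
    return f"{preferred}_{n}"


no_clash = unique_key("index", ["x", "y", "s"])
one_clash = unique_key("index", ["x", "index"])
two_clashes = unique_key("index", ["index", "index_1"])
```

The generated name would then have to be used consistently both in the `ds.summary(...)` keyword and in the list of names to ignore.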

@jbednar
Member

jbednar commented Jul 20, 2023

> While this would quickly blow up for wide data with lots of columns (or large n) I assume this is the natural extension?

With the index array approach, first_3 should always only be 3x larger than the aggregate array, while with the approach of returning all columns the size and time taken would scale with the number and size of columns. Seems unsafe as a default!

An intermediate approach could be to return a fixed number of scalar columns only (up to e.g. 3) by default, but that seems quite arbitrary.
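
The scaling argument above can be made concrete with some rough arithmetic. The canvas size and the column count below are made-up numbers for illustration, and the helpers only count array cells, ignoring dtype sizes:

```python
def index_layer_cells(width, height, n):
    """Cells stored when each pixel keeps n row indices."""
    return width * height * n


def all_columns_cells(width, height, n, n_columns):
    """Cells stored when each pixel keeps n values for every column."""
    return width * height * n * n_columns


width, height = 500, 400                 # hypothetical canvas size
aggregate_cells = width * height         # one value per pixel
index_cells = index_layer_cells(width, height, n=3)
column_cells = all_columns_cells(width, height, n=3, n_columns=20)
```

Here the index approach costs three times the aggregate regardless of how wide the data is, while returning every column for a 20-column frame costs sixty times the aggregate.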

@Hoxbro Hoxbro mentioned this pull request Jul 21, 2023
3 tasks
@philippjfr
Member

I'd say we handle the first_n and last_n cases later.

@philippjfr
Member

I'm going to merge. I'll look into the first_n and last_n thing separately.


5 participants