Ensure categorical column order is the same across dask partitions #1239

Merged
merged 1 commit into from Jun 23, 2023

ianthomas23
Member

Fixes #1202.

This fixes datashader's handling of categorical columns in dask partitions. The breakage was triggered by a change in how dask reads categorical columns between releases 2022.7.0 and 2022.7.1 (dask/dask#9264), but the underlying cause is our slightly inconsistent handling of categorical columns, which we got away with historically but not after that change.

The relevant section of the dask docs is https://docs.dask.org/en/stable/dataframe-categoricals.html. When categorical columns are read from parquet files, dask does not enforce that the categories are the same across partitions, for performance reasons, so we need to do this ourselves. The docs provide a couple of solutions, both of which incur a performance loss as they involve traversing the whole dataframe.
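To illustrate the problem, here is a minimal pandas-only sketch. It does not use dask; the two Series are hypothetical stand-ins for two partitions of one categorical column, and all names are made up for illustration. It shows the same label getting a different integer code in each partition, and one way to put both partitions on a shared set of categories.

```python
import pandas as pd

# Two hypothetical "partitions" of one column. Each carries its own category
# list, as can happen when categories are determined per partition.
part0 = pd.Series(["w", "b", "w"], dtype=pd.CategoricalDtype(["w", "b"]))
part1 = pd.Series(["b", "h", "b"], dtype=pd.CategoricalDtype(["b", "h"]))

# The label "b" maps to a different integer code in each partition.
code_b_part0 = part0.cat.codes[part0 == "b"].iloc[0]
code_b_part1 = part1.cat.codes[part1 == "b"].iloc[0]

# Build one shared dtype from the union of all categories (this needs a pass
# over every partition) and apply it to each partition.
shared = pd.CategoricalDtype(
    sorted(set(part0.cat.categories) | set(part1.cat.categories))
)
fixed0 = part0.astype(shared)
fixed1 = part1.astype(shared)
```

After the cast, both partitions agree on the code for every label, which is the property the rest of the calculation relies on.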

Cutting a long story short, the fix is actually a two-liner in datashader. We already correct the categorical columns in dshape_from_dask() by calling categorize with a list of the categorical columns to correct. The corrected dask.DataFrame is used in datashader to work out the full list of categories for the categorical dimension of the returned xarray.DataArray. But the originally supplied (and therefore uncorrected) dask.DataFrame was still used for each individual partition's mapping of category to integer index, so the per-partition indices did not necessarily match the full list of categories. The fix is to use the categorically-corrected DataFrame for the remainder of the datashader calculations. As we were already performing the categorize call, we incur no performance loss by doing this.
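The effect of using uncorrected partitions can be sketched with plain pandas and numpy. This is not datashader's implementation; count_by_code is a hypothetical helper that accumulates per-partition counts into one array indexed by each partition's integer codes, which is roughly the shape of the problem described above.

```python
import numpy as np
import pandas as pd

def count_by_code(partitions, n_categories):
    """Sum per-partition counts, indexed by each partition's own integer codes."""
    total = np.zeros(n_categories, dtype=np.int64)
    for part in partitions:
        codes = part.cat.codes.to_numpy()
        total += np.bincount(codes, minlength=n_categories)
    return total

# Hypothetical partitions whose category lists differ.
part0 = pd.Series(["w", "b", "w"], dtype=pd.CategoricalDtype(["w", "b"]))
part1 = pd.Series(["b", "h", "b"], dtype=pd.CategoricalDtype(["b", "h"]))

# Full category list, as would be worked out from the corrected DataFrame.
all_cats = ["b", "h", "w"]
shared = pd.CategoricalDtype(all_cats)

# Uncorrected partitions: codes refer to per-partition category lists,
# so counts land in the wrong slots of the output.
wrong = count_by_code([part0, part1], len(all_cats))

# Corrected partitions: codes refer to the shared list, so counts line up.
right = count_by_code([p.astype(shared) for p in (part0, part1)], len(all_cats))
```

With the uncorrected partitions the totals come out as 4, 2, 0 for b, h, w, whereas the true counts, recovered with the shared categories, are 3, 1, 2.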

I have added a new explicit test.

Using the USA 2010 Census data and the following test code:

import dask.dataframe as dd
import datashader as ds

cvs = ds.Canvas(plot_width=450, plot_height=265, x_range=[-14E6, -7.4E6], y_range=[2.7E6, 6.4E6])

# One color per race category; the same key is used for both engines
color_key = {'w': 'aqua', 'b': 'lime', 'a': 'red', 'h': 'fuchsia', 'o': 'yellow'}

for engine in ('fastparquet', 'pyarrow'):
    df = dd.read_parquet('~/data/census2010.parq', engine=engine)
    agg = cvs.points(df, 'easting', 'northing', ds.count_cat('race'))
    img = ds.tf.shade(agg, how='eq_hist', color_key=color_key)
    ds.utils.export_image(img, f'issue1202_{engine}')

Before the fix we see, for the fastparquet and pyarrow readers respectively:
[image: before_fastparquet] [image: before_pyarrow]

and after the fix:
[image: after_fastparquet] [image: after_pyarrow]

@ianthomas23 ianthomas23 added this to the v0.15.1 milestone Jun 23, 2023
codecov bot commented Jun 23, 2023

Codecov Report

Merging #1239 (fc467e6) into main (6dce648) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1239   +/-   ##
=======================================
  Coverage   83.52%   83.52%           
=======================================
  Files          35       35           
  Lines        8778     8778           
=======================================
  Hits         7332     7332           
  Misses       1446     1446           
| Impacted Files | Coverage Δ |
| --- | --- |
| datashader/utils.py | 82.48% <ø> (ø) |
| datashader/core.py | 88.28% <100.00%> (ø) |

Member

@jbednar jbednar left a comment

Sounds too good to be true! Thanks.

@jbednar jbednar merged commit 5a89820 into holoviz:main Jun 23, 2023
16 checks passed
@ianthomas23 ianthomas23 deleted the 1202_dask_categories branch June 27, 2023 08:02
Development

Successfully merging this pull request may close these issues.

Categorical colorizing broken for census parquet file
2 participants