
Tell datashader to use a specific color for NaNs in categorical data #1019

Open
Noskario opened this issue Aug 23, 2021 · 1 comment
@Noskario

I have a very large dataset that I cannot plot directly using holoviews. I want to make a scatterplot colored by categorical data. Unfortunately my data is very sparse and many points have NA as their category. I would like to make these points gray. Is there any way to tell datashader to do this?

Here is how I do it now (more or less as proposed in https://holoviews.org/user_guide/Large_Data.html ), with a small example:

import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
import datashader as ds
from datashader.colors import Sets1to3
from holoviews.operation.datashader import datashade,dynspread



# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
raw_data = [('Alice', 60, 'London', 5) ,
           ('Bob', 14, 'Delhi' , 7) ,
           ('Charlie', 66, np.nan, 11) ,
           ('Dave', np.nan,'Delhi' , 15) ,
           ('Eveline', 33, 'Delhi' , 4) ,
           ('Fred', 32, 'New York', np.nan ),
           ('George', 95, 'Paris', 11)
            ]
# Create a DataFrame object
df = pd.DataFrame(raw_data, columns=['Name', 'Age', 'City', 'Experience'])
df['City']=pd.Categorical(df['City'])



x='Age'
y='Experience'
color='City'
cats=df[color].cat.categories





# Make dummy-points (currently the only way to make a legend: https://holoviews.org/user_guide/Large_Data.html)
for cat in cats:
    #Just to make clear how many points of a given category we have
    print(cat,((df[color]==cat)&(df[x].notnull())&(df[y].notnull())).sum())
color_key=[(name,color) for name, color in zip(cats,Sets1to3)]
color_points = hv.NdOverlay({n: hv.Points([0,0], label=str(n)).opts(color=c,size=0) for n,c in color_key})


# Create the plot with datashader
points=hv.Points(df, [x, y],label="%s vs %s" % (x, y),)
datashaded=datashade(points,aggregator=ds.by(color)).opts(width=800, height=480)

(dynspread(datashaded)*color_points).opts(legend_position='right')

It produces the following picture:
[image: bug_report_image]

Although there is only one person from Paris, the person with an NA category (Charlie) is also drawn in purple, the color for Paris. Is there a way to make that dot gray? I have tried many plots, and the NAs always seem to take the color of the last item in the legend.

It would be nice to be able to pass a parameter like NA_color='gray' to the datashade method. An option for not plotting points with an NA category at all, using the same kind of interface, would also be nice, but that is less important.

@jbednar changed the title from "tell datashader to use a specific color for NA's when plotting categorial data" to "Tell datashader to use a specific color for NaNs in categorical data" on Aug 23, 2021
@jbednar transferred this issue from holoviz/holoviews on Aug 23, 2021
@jbednar
Member

jbednar commented Aug 23, 2021

I've transferred this issue to the Datashader repo since the code changes involved would be at the Datashader level. For current versions of Datashader, my advice would be to use Pandas to modify the data before plotting, either to replace the NaNs with 'Unknown' or 'Other', or to delete rows where the category is NaN. That way the data will either be clearly labeled or not included, as desired.
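As a rough illustration of that workaround, here is a minimal pandas-only sketch. The small DataFrame is a stand-in for the example data above, and the label 'Unknown' is just one possible choice; nothing here depends on Datashader itself.

```python
import numpy as np
import pandas as pd

# Stand-in data: one row (the third) has a missing category.
df = pd.DataFrame({
    "Age": [60, 14, 66, 33, 95],
    "Experience": [5, 7, 11, 4, 11],
    "City": pd.Categorical(["London", "Delhi", np.nan, "Delhi", "Paris"]),
})

# Option 1: label missing categories explicitly.
# The new label must be registered as a category before fillna can use it.
labeled = df.copy()
labeled["City"] = labeled["City"].cat.add_categories("Unknown").fillna("Unknown")

# Option 2: drop rows whose category is missing.
dropped = df.dropna(subset=["City"])
```

Either `labeled` or `dropped` can then be passed to hv.Points in place of the original DataFrame. With option 1, 'Unknown' becomes an ordinary category, so it gets its own color instead of borrowing one from another category.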

That said, there are some reasonable feature requests here for Datashader's shade() function:

  1. Accept a separate nan_color value to use for NaN categorical values (presumably gray by default), as Bokeh already provides. Note that you will then need to handle NaN specially when you construct the legend as above, making sure that the legend shows which color is used for NaNs and has an appropriate label ("Unknown", "Missing", etc.).
  2. Add a flag skip_nans or skip_missing_categories, defaulting to False, which silently drops points where the category information is not known.
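Until such options exist, the legend handling mentioned in item 1 can be approximated by hand once NaNs have been relabeled in pandas. The sketch below is only an illustration: build_color_key is a hypothetical helper, not part of Datashader or HoloViews, and the three-color palette is an arbitrary placeholder.

```python
from itertools import cycle

def build_color_key(categories, palette, nan_label="Unknown", nan_color="gray"):
    """Map each real category to a palette color and the NaN label to nan_color.

    Hypothetical helper for illustration; not an existing library function.
    """
    # Assign palette colors only to real categories so the NaN label
    # does not consume one of them.
    real = [c for c in categories if c != nan_label]
    key = dict(zip(real, cycle(palette)))
    key[nan_label] = nan_color
    return key

# Placeholder palette; any list of color strings would do.
palette = ["#e41a1c", "#377eb8", "#4daf4a"]
color_key = build_color_key(["Delhi", "London", "Paris", "Unknown"], palette)
```

A dictionary like this should be usable both as the color_key argument when shading and as the source of the dummy legend points, so that the legend entry for 'Unknown' matches the gray used in the plot.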

Compared to the rest of Datashader, the shade() function is fairly self-contained and not especially tricky for a new contributor to figure out. I'd be happy to review a PR adding either or both of these features.

@maximlt added this to the wishlist milestone on Nov 29, 2021