Improvements to histplot (1D) for discrete data #3567
Comments
Thanks for the suggestions. I am not sure I agree with them though...
I guess I am not sure what you mean by "discretization step size" here, but
It sounds like you are saying you have already binned your data (i.e., you've done half the histogram computation yourself), but that seems like a special case. The heuristic that you propose will fail for, e.g., sparse data that users want to have sampled on every integer. Note that you can cast the values to strings (you probably want to sort first) if you want to treat the numbers more as names of categories than as numeric values.
If there is no statistically principled factor to choose, then I would say ... yes, probably? E.g. with your Poisson example, increase lambda to a larger value, and the default rule works well while your proposed adjustment way over-smooths. I'm not totally sure it makes sense to show a KDE plot over a discrete distribution. I kind of like how the default happens to look "weird" in a way that makes one stop and reconsider whether the smooth curve is an accurate representation of the data. Maybe there's a good reason for it, and seaborn won't stop you from adding it, but I don't see anything "wrong" with the default.
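To make the sparse-data objection concrete, here is a minimal sketch of the minimum-gap heuristic proposed in the issue, written in plain Python rather than NumPy. The function name `min_gap_binwidth` and both sample lists are made up for illustration; nothing here is seaborn's actual behavior.

```python
def min_gap_binwidth(values):
    """Smallest gap between distinct sorted values (the heuristic from the issue)."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    return min(b - a for a, b in zip(distinct, distinct[1:]))


# Hypothetical data recorded in steps of 0.5: the smallest gap equals the step.
half_steps = [0.0, 0.5, 0.5, 1.0, 2.0, 2.5]
print(min_gap_binwidth(half_steps))

# Hypothetical sparse integer counts: the smallest observed gap is much larger
# than 1, so the heuristic would not place a bin on every integer.
sparse_counts = [0, 10, 10, 20, 50]
print(min_gap_binwidth(sparse_counts))
```

The second case is the failure mode described above: the data are integers, but the observed values happen to be far apart, so the inferred width reflects the sample rather than the underlying step.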
I'm going to close this issue for now, but thanks for the contribution!
Hi @mwaskom, thank you for your thoughts and sorry for not responding earlier! Quick responses:
(Totally fine to leave this closed, I just wanted to add a bit more context.)
I wrote a longer response but it got eaten when I clicked on the link in your third bullet. Shorter answer:
Hi,
two suggestions for minor usability improvements concerning the handling of discrete data in histplot (with `discrete=True`).

Detect the correct bin size automatically
Currently, the bin size is just set to 1 automatically. However, data might be discrete with a different discretization step size. Of course I can set that manually, but it would be very convenient if it "just worked". Wouldn't that be as simple as something like `binwidth = np.diff(np.sort(df.x.unique())).min()`? (Surely a very inefficient implementation, but you get the idea.)

Adapt kde bandwidth method when `discrete=True`
For discrete data, we can get the below ugly KDE behavior. (This is simply `sns.histplot(df, x="x", discrete=True, kde=True)` with `df = pd.DataFrame({"x": np.random.poisson(lam=1, size=(10000,))})`.)
I am aware that kde bandwidth selection is a thorny topic, and that there are additional problems going on here because of the hard 0 boundary, but I would believe that an even slightly better default behavior should be possible? For instance, would simply setting `bandwidth = [some constant between 0.5 and 1] * binwidth` be an awful default for the `discrete=True` case?

For the example above, this is what I get with
![image](https://private-user-images.githubusercontent.com/1774207/284373569-dada2644-5455-4fd8-874a-c20eb1320495.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzMjk1NDYsIm5iZiI6MTcyMTMyOTI0NiwicGF0aCI6Ii8xNzc0MjA3LzI4NDM3MzU2OS1kYWRhMjY0NC01NDU1LTRmZDgtODc0YS1jMjBlYjEzMjA0OTUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcxOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MThUMTkwMDQ2WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YTU2ZTQ4NmRkNTk5YmUxMTE1MjE1MjIwN2U1YWFiM2QyZWRhMDc4ZWU4N2FhMmI3MTU0MDg4MGFmNTJkMDcxMCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.E3Cg1o3MJUoDFIU6S7M8tWKw031FSjitSNmxXtj0Vqg)
`kde_kws={'bw_method': 0.6}`, which is quite a bit closer to what would seem like a reasonable default to me.
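For readers who want to see why the default curve looks bumpy here, the following is a rough standard-library sketch, not seaborn's implementation. It assumes the scalar `bw_method` acts as a factor multiplying the sample standard deviation (as in SciPy's `gaussian_kde`), that the default factor follows Scott's rule `n ** (-1/5)` for 1D data, and it idealizes the sample by using the exact Poisson(1) probabilities in place of observed frequencies. The helper names `poisson_pmf`, `mixture_kde`, and `valley_to_peak` are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def mixture_kde(x, weights, bandwidth):
    """Gaussian KDE at x for data sitting on 0, 1, 2, ... with the given frequencies."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return sum(
        w * norm * math.exp(-0.5 * ((x - k) / bandwidth) ** 2)
        for k, w in enumerate(weights)
    )


n = 10_000                      # sample size from the example in the issue
lam = 1.0                       # Poisson mean from the example in the issue
weights = [poisson_pmf(k, lam) for k in range(20)]  # idealized frequencies
data_std = math.sqrt(lam)       # assumption: sample std is close to sqrt(lam)

scott_factor = n ** (-1 / 5)    # assumed default factor (Scott's rule, 1D)
default_bw = scott_factor * data_std
wider_bw = 0.6 * data_std       # corresponds to bw_method=0.6 under the assumption above


def valley_to_peak(bandwidth):
    """Density halfway between 0 and 1, relative to the density at 1."""
    return mixture_kde(0.5, weights, bandwidth) / mixture_kde(1.0, weights, bandwidth)


print(f"default bandwidth: {default_bw:.3f}, valley/peak: {valley_to_peak(default_bw):.3f}")
print(f"wider bandwidth:   {wider_bw:.3f}, valley/peak: {valley_to_peak(wider_bw):.3f}")
```

Under these assumptions the default kernel is much narrower than the spacing between integers, so the curve drops close to zero between bars, while the wider kernel fills the gaps. Since the factor scales with the spread of the data, the same `bw_method` value behaves differently for larger lambda, which is the over-smoothing concern raised in the reply.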