Improvements to histplot (1D) for discrete data #3567

Closed
e-pet opened this issue Nov 20, 2023 · 4 comments

e-pet commented Nov 20, 2023

Hi,

two suggestions for minor usability improvements concerning the handling of discrete data in histplot (with discrete=True).

Detect the correct bin size automatically

Currently, the bin size is just set to 1 automatically. However, data might be discrete with a different discretization step size. Of course I can set that manually, but it would be very convenient if it "just worked". Wouldn't that be as simple as something like binwidth = np.diff(np.sort(df.x.unique())).min()? (Surely a very inefficient implementation, but you get the idea.)
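A minimal sketch of the heuristic described above, assuming the values lie on an evenly spaced grid. The helper names `infer_step` and `centered_edges` are hypothetical and not part of seaborn; the example only uses NumPy to check the resulting bins.

```python
import numpy as np

def infer_step(values):
    """Smallest gap between distinct observed values (the proposed heuristic)."""
    distinct = np.unique(np.asarray(values, dtype=float))  # np.unique returns sorted values
    if distinct.size < 2:
        raise ValueError("need at least two distinct values to infer a step")
    return float(np.diff(distinct).min())

def centered_edges(values, step):
    """Bin edges of width `step`, each bin centered on a grid point."""
    lo, hi = float(np.min(values)), float(np.max(values))
    n_bins = int(round((hi - lo) / step)) + 1
    return lo - step / 2 + step * np.arange(n_bins + 1)

# Illustrative data: risk categories spaced 0.25 apart (made-up values)
risk = [0.0, 0.25, 0.25, 0.5, 1.0, 0.75, 0.25]
step = infer_step(risk)
edges = centered_edges(risk, step)
counts, _ = np.histogram(risk, bins=edges)
print(step, counts.tolist())
```

The computed `edges` could be handed to `histplot` through its `bins` parameter. Note that for sparse integer data such as `[0, 4, 8]` the heuristic returns a step of 4, so integers with no observations would no longer get their own bin.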

Adapt kde bandwidth method when discrete=True

For discrete data, we can get the below ugly KDE behavior. (This is simply sns.histplot(df, x="x", discrete=True, kde=True) with df = pd.DataFrame({"x": np.random.poisson(lam=1, size=(10000,))})).

[Image: histogram of the Poisson sample with the default KDE curve]

I am aware that kde bandwidth selection is a thorny topic, and that there are additional problems going on here because of the hard 0 boundary, but I would believe that an even slightly better default behavior should be possible? For instance, would simply setting bandwidth= [some constant between 0.5 and 1] * binwidth be an awful default for the discrete=True case?

For the example above, this is what I get with kde_kws={'bw_method': 0.6}, which is quite a bit closer to what would seem like a reasonable default to me.
[Image: the same histogram with kde_kws={'bw_method': 0.6}]
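One detail worth spelling out: a scalar `bw_method` is not an absolute bandwidth. In `scipy.stats.gaussian_kde`, a scalar is a factor that multiplies the sample standard deviation (computed with `ddof=1`), so `bw_method=0.6` only equals `0.6 * binwidth` when that standard deviation happens to be close to the bin width, as it is for Poisson data with `lam=1`. The sketch below, assuming seaborn forwards `kde_kws` to a `gaussian_kde`-style estimator, converts a target of `c * binwidth` into the matching factor. The helper name and the parameter `c` are hypothetical.

```python
import numpy as np

def bw_method_for_width(x, binwidth, c=0.6):
    """Scalar bw_method whose kernel standard deviation equals c * binwidth."""
    # gaussian_kde scales the kernel by the sample standard deviation (ddof=1)
    sample_sd = np.std(np.asarray(x, dtype=float), ddof=1)
    return c * binwidth / sample_sd

rng = np.random.default_rng(0)
x = rng.poisson(lam=1, size=10_000)
factor = bw_method_for_width(x, binwidth=1, c=0.6)
# The result could then be passed as kde_kws={"bw_method": factor}
print(factor)
```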

mwaskom (Owner) commented Nov 20, 2023

Thanks for the suggestions. I am not sure I agree with them though...

> However, data might be discrete with a different discretization step size

I guess I am not sure what you mean by "discretization step size" here, but discrete in this context is meant to evoke the idea of a discrete probability distribution over integers:

discrete: bool
If True, default to binwidth=1 and draw the bars so that they are centered
on their corresponding data points. This avoids “gaps” that may otherwise
appear when using discrete (integer) data.

It sounds like you are saying you have already binned your data (i.e., you've done half the histogram computation yourself), but that seems like a special case. The heuristic that you propose will fail for, e.g., sparse data that users want to have sampled on every integer. Note that you can cast the values to string (you probably want to sort first) if you want to treat the numbers more as names of categories than as numeric values.
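The string-casting workaround might look like the sketch below. The helper `to_sorted_labels` is hypothetical, and the data are made up; the sort happens on the numeric values before casting, since string labels would otherwise appear in order of first occurrence.

```python
import numpy as np

def to_sorted_labels(values):
    """Sort numerically, then cast to str so each distinct value acts as a category name."""
    ordered = np.sort(np.asarray(values, dtype=float))
    # "g" formatting keeps up to 6 significant digits; adjust if values are finer than that
    return [f"{v:g}" for v in ordered.tolist()]

risk = [1.0, 0.25, 0.5, 0.25]
labels = to_sorted_labels(risk)
print(labels)
# The labels could then be plotted with, e.g., sns.histplot(x=labels)
```

One trade-off of this approach: a grid point with no observations (0.75 in this example) gets no empty bar, because the axis only knows about labels that occur in the data.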

> For instance, would simply setting bandwidth= [some constant between 0.5 and 1] * binwidth be an awful default for the discrete=True case?

If there is no statistically principled factor to choose, then I would say ... yes, probably? E.g. with your Poisson example, increase lambda to a larger value, and the default rule works well while your proposed adjustment way over-smooths.
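A rough way to see the effect of lambda, assuming the KDE follows `scipy.stats.gaussian_kde` conventions: the default (Scott's rule) uses a factor of `n ** (-1/5)` for 1D data, and any scalar `bw_method` is likewise a factor on the data's standard deviation. The sketch compares the resulting kernel widths for the `bw_method=0.6` setting from the first comment, using the theoretical Poisson standard deviation `sqrt(lam)` rather than a sample. The function names are hypothetical.

```python
import math

def scott_factor(n):
    """Scott's rule factor for 1D data: n ** (-1/5)."""
    return n ** (-1 / 5)

def kernel_sd(factor, data_sd):
    """Kernel standard deviation implied by a scalar bandwidth factor."""
    return factor * data_sd

n = 10_000
for lam in (1, 100):
    sd = math.sqrt(lam)  # standard deviation of a Poisson(lam) variable
    default_bw = kernel_sd(scott_factor(n), sd)
    fixed_bw = kernel_sd(0.6, sd)
    print(f"lam={lam}: Scott {default_bw:.3f}, bw_method=0.6 {fixed_bw:.3f}")
# lam=1: Scott 0.158, bw_method=0.6 0.600
# lam=100: Scott 1.585, bw_method=0.6 6.000
```

At `lam=1` the default kernel is much narrower than the unit spacing of the data, which is consistent with the bumpy curve in the first comment. At `lam=100` the default kernel already spans more than one integer, while a factor of 0.6 spreads each observation over roughly six.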

I'm not totally sure it makes sense to show a KDE plot over a discrete distribution. I kind of like how the default happens to look "weird" in a way that makes one stop and reconsider whether the smooth curve is an accurate representation of the data. Maybe there's a good reason for it, and seaborn won't stop you from adding it, but I don't see anything "wrong" with the default.

mwaskom (Owner) commented Dec 5, 2023

I'm going to close this issue for now, but thanks for the contribution!

mwaskom closed this as not planned on Dec 5, 2023

e-pet (Author) commented Dec 10, 2023

Hi @mwaskom, thank you for your thoughts and sorry for not responding earlier! Quick responses:

  • discrete / quantized data can occur for all kinds of reasons in practical applications, not just as a result of a manual preprocessing step I believe? In my case, I was working with a risk model that used discrete risk categories of a certain width. So clearly discrete (and a proper discrete probability distribution) but not over integers.
  • Yes, you are certainly right that my naive proposal will fail in various cases. FWIW, sparse integer-valued samples seem like much more of a special case to me personally than dense non-integer discrete values. ;-) But that's probably just a function of the applications we've worked on / domains we're coming from.
  • Re: why do I want this / should this work out of the box? I am essentially using this as a histogram smoothing method. I am aware that this is not the cleanest solution / there are various more principled approaches available (though not in Seaborn, AFAIK), but I'm also not sure it's entirely terrible? It was just the simplest thing available. (See here for a discussion + various literature pointers regarding KDE for discrete variables.)

(Totally fine to leave this closed, I just wanted to add a bit more context.)

mwaskom (Owner) commented Dec 10, 2023

I wrote a longer response but it got eaten when I clicked on the link in your third bullet. Shorter answer:

  • There is already support for arbitrary categorical count/frequency plots in countplot, as well as in histplot when the data have a string type (even if those strings represent numbers) which may work for your use case.
  • Arguments about what is a "special case" aside, your proposed change to how binwidths are chosen for discrete=True would produce different results than are currently obtained with sparse integer data, and the new behavior would not be unambiguously better or more correct, so that feels like a non-starter.
  • If plotting a Gaussian KDE over discrete data works for you, feel free, but I don't like the idea of choosing an arbitrary bandwidth fudge factor to hide what's going on.
