Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

density method #18

Merged
merged 10 commits into from
Apr 14, 2023
Merged

density method #18

merged 10 commits into from
Apr 14, 2023

Conversation

grantmcdermott
Copy link
Owner

Fixes #17

Examples:

library(plot2)

# no groups
plot2(density(mtcars$mpg))

# with groups
plot2(density(iris$Sepal.Width), by = iris$Species)

Created on 2023-04-11 with reprex v2.0.2

@grantmcdermott
Copy link
Owner Author

grantmcdermott commented Apr 12, 2023

@zeileis All tests are passing so feel free to skip over this quickly if you don't have time to review. The main thing you may (or not) want to comment on are lines 583-605. In short, to get the individual group density plots I extract the model call and then run update on each data subset. This seemed the most sensible way forward from my perspective (and appears to be working well enough). But let me know if you see potential problems, or have other suggestions.

@zeileis
Copy link
Collaborator

zeileis commented Apr 12, 2023

Disclaimer: I'm not particularly fond of plotting the density f(x | by) against x. In most situations I'm more interested in the probability f(by | x). So this is what the "conditional density plot" cdplot() (which has mostly been written by me) plots against x. And the trick is to compute f(by | x) = f(x | by) * f(by)/f(x). So I often feel that plotting f(x | by) against x is the wrong way around.

Concerns: Having said that, I see two potential problems. The first is easier to address than the second, I think.

  • Bandwidth selection: Wouldn't it be the more natural default to use the same bandwidth for all subgroups? Otherwise the scaling of the densities might be harder to compare. In any case, this is what we decided to use for cdplot(). Not sure what other "ridge lines" implementations are doing. (Note: If you decide to use the same bandwidth you could call density() directly rather than via update().
  • Environment handling: Both the update(object, ...) method and your eval(str2lang(object$data), ...) will only work if the data is in the parent frame (or visible from there). (Note: Your evaluation should probably also use envir = parent.frame() for consistency.). This may not be true and unfortunately density() does not store the enclosing environment (as it would be done in a formula). That's why densities computed from auxiliary functions will not be compatible with plot2(). See the example below. Thus, the tradeoff is between the simple convenience of plot2(density(x), by) vs. a more elaborate formula interface like plot2(x ~ by, data = ..., type = "density") or something along those lines.

Example:

## convenience function returning a density
log_density <- function(x) density(log(x))

## example data
y <- 1:10
by <- rep(1:2, each = 5)

## set up density and try to plot
d <- log_density(y)
plot2(d, by = by) ## Error in log(x) : non-numeric argument to mathematical function

## or even worse: this would be completely misleading if you had eval(..., envir = parent.frame())
x <- by
plot2(d, by = x)

@grantmcdermott
Copy link
Owner Author

These are all great comments @zeileis. Some quick responses:

  1. cdplot: It would be super if we could support cdplot. This strikes me as an obvious candidate for faceting once the internals of Idea: Handle plots with factors by introducing mfrow/facets #12 are resolved.
  2. bandwidth: Fully agree with you about this. I'll fix it shortly. (I think we still need update instead of density to capture the other arguments like "kernel" and ..., though right? That, or some manual implementation of what update does internally.)
  3. environment: Yeah, this is tricky because of the limitations of density itself (as you mention). Given your third example, I'm reluctant to go all in on the envir = parent.frame() passing. I might just add a tryCatch error message in there.

P.S. I'm not sure about the plo2t(x ~ by, type = "density") syntax because it breaks the | by API the we have elsewhere. But I might be persuaded by a one-side formula version, i.e. plot2(~ x | by, type = "density"). Will think on this some more.

@zeileis
Copy link
Collaborator

zeileis commented Apr 14, 2023

Thanks for the follow-up!

  1. cdplot: OK, will work on that hopefully today or tomorrow. The week was rather busy.
  2. Bandwidth/kernel: You are right. I thought that the kernel was provided explicitly in the "density" object but it isn't. So update() is the best we can do. Of course, the kernel argument might also come from somewhere where update() cannot find it without the right environment.
  3. Environment: I think that envir = parent.frame() is the only consistent choice because from my reading this is what update() does. Otherwise you might search for the x argument in a different environment than for the other variables (bandwidth, kernel, etc.). I can try to cook up an example if you need more convincing ;-)
  4. Formula: My gut feeling was that a one-sided formula is more confusing but you are probably right that it is more consistent. We still might support a type argument there to choose between histogram and density.
  5. Area shading: I think it might be a nice touch to optionally add some semi-transparent area shading below the density or histogram line. The area might also be opaque if we order the by groups (e.g., by median) from highest to lowest, which would yield a kind of ridge lines plot.

@grantmcdermott
Copy link
Owner Author

grantmcdermott commented Apr 14, 2023

I'm convinced ;-)

I believe that my latest changes address the main shortcomings, so feel free to "Squash and merge" if you agree.

Quickly on 4 and 5, which I'll leave for a future set of PRs.

  1. Supporting extra type arguments is a cool a idea. We should look at supporting type = "hist" too.
  2. Yeah, this would be great. I need to figure out better support for polygons in general, though.

EDIT: forgot to include an example of the updated grouped density plot (i.e., with the joint bandwidth calc).

library(plot2)
plot2(density(iris$Sepal.Width), by = iris$Species)

Created on 2023-04-13 with reprex v2.0.2

@zeileis zeileis merged commit d6de3a5 into main Apr 14, 2023
5 checks passed
@zeileis
Copy link
Collaborator

zeileis commented Apr 14, 2023

Looks good, thanks. I'll post a wishlist item for 4/5.

zeileis pushed a commit that referenced this pull request Apr 14, 2023
* Add density method

* Import methods and update Namepsace

* Update examples

* Add grouped density plot example to README

* Add tests

* NEWS

* Use joint bandwidth

* update tests

* Add envir
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

plot2(density(x)) doesn't work
2 participants