Skip to content

Loading…

Small cleanup of stat density #554

Closed
wants to merge 3 commits into from

2 participants

@jiho

To keep the code clear and the documentation in line with the code.

jiho added some commits
@jiho jiho Remove double spaces 229bfc0
@jiho jiho Remove unused code b5aa707
@jiho jiho Allow density to expand scale bounds
In the previous implementation, density was estimated only within the
range of the scale, which is determined by default by the range of the
data. So, to see the effect of not trimming the data, the user had to
explicitly expand the scale.

In the new implementation, with trim = FALSE, the density is indeed
estimated up to 3 bandwidths after the bounds of the data, and this
expands the scale. But trim = TRUE by default (as was said in the
documentation but not respected in the code) so the final aspect should
stay the same in most circumstances.
538d75a
@jiho

The most important change is explained in defail in the commit message of 538d75a

@hadley

I'm pretty sure you have to explicitly supply the range of the density, otherwise stacked density don't work because they're not estimated on the same points.

hum, probably indeed. But in that case, should we extend this range by cut bandwidth beyond the range of the full dataset, to allow the "normal" behaviour of density?
Otherwise, in the current implementation, there's no difference between trim=TRUE or FALSE (and this probably explains the discrepancy between the documentation and the code regarding the default value of trim)

Owner

Hmmm, I don't think that will work either because we're not setting the bandwidth and so it will vary by group. Maybe trim should just be deprecated?

Alternatively, we could switch to some other density method that would work with all the data simultaneously.

Or it could be computed with bw.nrd and then passed to density(); but that probably adds another layer of computation and complexity. On the other hand, I can't remember a situation where I needed trim indeed.

If trim is suppressed, the from and to arguments must be reinstated and there probably should be a message in the help indicating that, in that case, the area under the curve is not actually 1 (since this is what you would expect from a density estimation).

Owner

Maybe we should move to a model more like stat_bin - i.e. compute the bandwidth based on the complete data, and output a message saying what was used. Then trim could actually work because there would be a common bandwidth across all groups.

@hadley
Owner

See discussion of #595 - any fix is going to require backward incompatible changes.

@hadley
Owner

Unfortunately that change would break existing code so isn't suitable for inclusion in a very mature package like ggplot2.

@hadley hadley closed this
@jiho jiho deleted the jiho:fix/clean-stat-density branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on May 11, 2012
  1. @jiho

    Remove double spaces

    jiho committed
  2. @jiho

    Remove unused code

    jiho committed
  3. @jiho

    Allow density to expand scale bounds

    jiho committed
    In the previous implementation, density was estimated only within the
    range of the scale, which is determined by default by the range of the
    data. So, to see the effect of not trimming the data, the user had to
    explicitly expand the scale.
    
    In the new implementation, with trim = FALSE, the density is indeed
    estimated up to 3 bandwidths after the bounds of the data, and this
    expands the scale. But trim = TRUE by default (as was said in the
    documentation but not respected in the code) so the final aspect should
    stay the same in most circumstances.
Showing with 5 additions and 8 deletions.
  1. +5 −8 R/stat-density.r
View
13 R/stat-density.r
@@ -4,11 +4,11 @@
#' @param kernel kernel used for density estimation, see
#' \code{\link{density}} for details
#' @param trim if \code{TRUE}, the default, densities are trimmed to the
-#' actually range of the data. If \code{FALSE}, they are extended by the
+#' actually range of the data. If \code{FALSE}, they are extended by the
#' default 3 bandwidths (as specified by the \code{cut} parameter to
#' \code{\link{density}})
#' @param na.rm If \code{FALSE} (the default), removes missing values with
-#' a warning. If \code{TRUE} silently removes missing values.
+#' a warning. If \code{TRUE} silently removes missing values.
#' @inheritParams stat_identity
#' @return data.frame with additional columns:
#' \item{density}{density estimate}
@@ -85,7 +85,7 @@
#' qplot(length, data=movies, geom="density", weight=rating/sum(rating))
#' }
stat_density <- function (mapping = NULL, data = NULL, geom = "area", position = "stack",
-adjust = 1, kernel = "gaussian", trim = FALSE, na.rm = FALSE, ...) {
+adjust = 1, kernel = "gaussian", trim = TRUE, na.rm = FALSE, ...) {
StatDensity$new(mapping = mapping, data = data, geom = geom, position = position,
adjust = adjust, kernel = kernel, trim = trim, na.rm = na.rm, ...)
}
@@ -93,18 +93,15 @@ adjust = 1, kernel = "gaussian", trim = FALSE, na.rm = FALSE, ...) {
StatDensity <- proto(Stat, {
objname <- "density"
- calculate <- function(., data, scales, adjust=1, kernel="gaussian", trim=FALSE, na.rm = FALSE, ...) {
+ calculate <- function(., data, scales, adjust=1, kernel="gaussian", trim=TRUE, na.rm = FALSE, ...) {
data <- remove_missing(data, na.rm, "x", name = "stat_density",
finite = TRUE)
n <- nrow(data)
if (n < 3) return(data.frame())
if (is.null(data$weight)) data$weight <- rep(1, n) / n
-
- range <- scale_dimension(scales$x, c(0, 0))
- xgrid <- seq(range[1], range[2], length=200)
- dens <- density(data$x, adjust=adjust, kernel=kernel, weight=data$weight, from=range[1], to=range[2])
+ dens <- density(data$x, adjust=adjust, kernel=kernel, weight=data$weight)
densdf <- as.data.frame(dens[c("x","y")])
densdf$scaled <- densdf$y / max(densdf$y, na.rm = TRUE)
Something went wrong with that request. Please try again.