Categorical variables not specified explicitly as factors should be ordered as they are in the original data #357

Closed
jiho opened this Issue · 5 comments

Jean-Olivier Irisson

Example:

library(ggplot2)
d <- data.frame(size=c("S", "M", "L", "XL"), value=c(3,6,7,9), stringsAsFactors=FALSE)
class(d$size)  # "character"
ggplot(d) + geom_point(aes(x=size, y=value))

Here the order in the plot is L, M, S, XL: the alphabetical order. I guess ggplot converts d$size into a factor even though I forced it to be a character vector.

I think that when categorical data are coded as factors, of course ggplot should use the levels' order.

When data are categorical in nature but not explicitly coded as a factor, ggplot should respect the order in the original data because it often has a meaning. That is what novice users expect, at least, and for those users it is often conceptually easier to reorder the original data than to change the order of levels in a factor.

In practice, when converting the data for the plot, ggplot should do

converted <- factor(mydata, levels=unique(mydata))
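A small sketch of the difference, using base R only. The vector name `sizes` is just for illustration, and this shows what the proposed conversion would produce, not how ggplot actually handles the data internally.

```r
# Illustrative character vector (same values as the example above)
sizes <- c("S", "M", "L", "XL")

# Default conversion: levels are sorted alphabetically
alphabetical <- factor(sizes)
levels(alphabetical)   # "L" "M" "S" "XL"

# Proposed conversion: levels follow the order of first appearance
as_in_data <- factor(sizes, levels = unique(sizes))
levels(as_in_data)     # "S" "M" "L" "XL"
```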

If you point me towards where the data is initially handled, I might even be able to fix this myself.

Hadley Wickham
Owner

This always confuses me - if your data is c(3, 1, 5) you don't expect the numbers to be ordered differently. I think it's better for people to learn how factors work in R than for ggplot2 to do something magical. I suspect if I did make the change, there would be just as many (if not more) people complaining that ggplot2 no longer ordered character variables alphabetically.

Jean-Olivier Irisson

Factors are useful but they serve a specific purpose, which I would summarize as: remember the number and possibly order of levels even on a subset of the data. This is mandatory for many statistical analyses and they are an important aspect of R as a statistical language.

But R has grown into more than just a "statistics package". It is used for all sorts of data-centric tasks, and ggplot in particular can be used to explore any data, not specifically with a statistical point of view (it is used by plenty of physicists in my lab who don't want to hear about stats).

My argument is that the current behaviour forces people to know and care about factors even though one can use R efficiently without them. The existence of default.stringsAsFactors(), which lets you avoid having every string automatically converted into a factor, reinforces my belief that factors are not mandatory.

Another way to look at it is:

  • if you don't care about the order of the categories then either solution is fine
  • if you do care about the order, the proposed behaviour allows you to either:

    • set it in your data (either sort your dataset before reading it in R, or use sort()/order() in R)
    • use factors, if you know about them, and set the order of their levels

    In contrast, the current behaviour forces you to know about factors (and if you don't, appears to be doing something "magical" and surprising with your data).
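The two options could be sketched roughly as follows, in base R. The data frame is the one from the first example; sorting by decreasing `value` and the level order S, M, L, XL are only assumptions chosen for illustration.

```r
d <- data.frame(size = c("S", "M", "L", "XL"), value = c(3, 6, 7, 9),
                stringsAsFactors = FALSE)

# Option 1: set the order in the data itself, e.g. by sorting the rows.
# Under the proposed behaviour, the axis would follow unique(d_sorted$size).
d_sorted <- d[order(d$value, decreasing = TRUE), ]
unique(d_sorted$size)   # "XL" "L" "M" "S"

# Option 2: use a factor and set the order of its levels explicitly.
# This already works with the current behaviour.
d$size <- factor(d$size, levels = c("S", "M", "L", "XL"))
levels(d$size)          # "S" "M" "L" "XL"
```

Passing the `d` from option 2 to the ggplot() call from the first example should place the sizes on the axis in that level order.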

So I think the proposed behaviour is more versatile and has a gentler learning curve than the current one, while maintaining all functionality. The only downside is that ggplot itself won't automatically order strings alphabetically, but you usually get this by leaving default.stringsAsFactors() at its default value of TRUE when reading your data or creating a data.frame for plotting (i.e. feeding ggplot factors instead of strings).

To finish up, I would say that I, personally, know and use factors; but even so, I am often bitten by strange errors that come from using them. Mainly, though, my suggestion comes from trying to teach people new to R/ggplot, or non-statisticians, for whom factors are always a difficult step (leading to the inevitable: "why do I need to care about that when I just want to plot stuff" ;) ).

kohske takahashi
Collaborator

I don't know which is better, but I like the current behavior.
@jiho, what do you think the order of the variable should be if some elements are duplicated:

d <- data.frame(size=c("S", "M", "L", "M", "S"), value=c(3,6,7,9,0), stringsAsFactors=FALSE)

Jean-Olivier Irisson

The order of the first occurrence of each level, i.e. what

 unique(c("S", "M", "L", "M", "S"))

returns.
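Worked through in base R for the duplicated case (a sketch; the names `sizes` and `f` are illustrative):

```r
sizes <- c("S", "M", "L", "M", "S")

# unique() keeps the first occurrence of each value, in order
unique(sizes)    # "S" "M" "L"

# Repeated values map onto the same level
f <- factor(sizes, levels = unique(sizes))
levels(f)        # "S" "M" "L"
as.integer(f)    # 1 2 3 2 1
```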

Hadley Wickham
Owner

While I can see your point, I really think that would be counter-intuitive behaviour. Some things are confusing because you need to learn a bit more about how R works. It's more effort now, but pays off in the future.

Hadley Wickham hadley closed this