library(ggplot2)
d <- data.frame(size=c("S", "M", "L", "XL"), value=c(3,6,7,9), stringsAsFactors=FALSE)
ggplot(d) + geom_point(aes(x=size, y=value))
Here the order in the plot is L, M, S, XL: the alphabetical order. I guess ggplot converts d$size into a factor even though I forced it to be a character vector.
I think that when categorical data are coded as factors, of course ggplot should use the levels' order.
When data are categorical in nature but not explicitly coded as a factor, ggplot should respect the order in the original data because it often has a meaning. That is what novice users expect, at least, and for those users it is often conceptually easier to reorder the original data than to change the order of levels in a factor.
In practice, when converting the data for the plot, ggplot should do
converted <- factor(mydata, levels=unique(mydata))
If you point me towards where the data is initially handled, I might even be able to fix this myself.
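Until then, the conversion proposed above can be applied by hand before plotting. This is only a minimal sketch in base R using the data frame from the example; the final ggplot call is left as a comment because it assumes ggplot2 is installed and loaded.

```r
# Data from the example above, kept as character strings
d <- data.frame(size = c("S", "M", "L", "XL"),
                value = c(3, 6, 7, 9),
                stringsAsFactors = FALSE)

# Default conversion: levels are the sorted unique values
default_levels <- levels(factor(d$size))

# Proposed conversion: levels follow the order of first appearance
d$size <- factor(d$size, levels = unique(d$size))
appearance_levels <- levels(d$size)

print(default_levels)     # "L"  "M"  "S"  "XL"
print(appearance_levels)  # "S"  "M"  "L"  "XL"

# With ggplot2 loaded, the x axis should then follow the factor levels:
# ggplot(d) + geom_point(aes(x = size, y = value))
```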
This always confuses me - if your data is c(3, 1, 5) you don't expect the numbers to be ordered differently. I think it's better for people to learn how factors work in R than for ggplot2 to do something magical. I suspect if I did make the change, there would be just as many (if not more) people complaining that ggplot2 no longer ordered character variables alphabetically.
Factors are useful but they serve a specific purpose, which I would summarize as: remember the number and possibly order of levels even on a subset of the data. This is mandatory for many statistical analyses and they are an important aspect of R as a statistical language.
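To illustrate that purpose, here is a small sketch (the variable names are made up for the example) showing that a subset of a factor still carries all the declared levels, in the declared order, until they are explicitly dropped.

```r
# A factor with an explicit level order
sizes <- factor(c("S", "M", "L", "M", "S"), levels = c("S", "M", "L"))

# Keep only the elements that are not "M"
subset_sizes <- sizes[sizes != "M"]

# The subset still knows about all three levels, in the declared order
subset_levels <- levels(subset_sizes)

# "M" is still counted, with a count of zero
subset_counts <- table(subset_sizes)

# Unused levels are only forgotten when explicitly dropped
dropped_levels <- levels(droplevels(subset_sizes))

print(subset_levels)   # "S" "M" "L"
print(subset_counts)
print(dropped_levels)  # "S" "L"
```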
But R has grown into more than just a "statistics package". It is used for all sorts of data-centric tasks, and ggplot in particular can be used to explore any data, not specifically with a statistical point of view (it is used by plenty of physicists in my lab who don't want to hear about stats).
My argument is that the current behaviour forces people to know and care about factors, while one can use R efficiently without them. The existence of default.stringsAsFactors(), which lets you avoid having every string automatically converted into a factor, reinforces my view that factors are not mandatory.
Another way to look at it is:
if you do care about the order the proposed behaviour allows you to
In contrast, the current behaviour forces you to know about factors (and if you don't, appears to be doing something "magical" and surprising with your data).
So I think the proposed behaviour is more versatile and has a gentler learning curve than the current one, while maintaining all functionality. The only downside is that ggplot itself won't automatically order strings alphabetically, but you usually get this functionality by leaving default.stringsAsFactors() at its default value of TRUE when reading your data or creating a data.frame for plotting in ggplot (i.e. feeding ggplot factors instead of strings).
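As a rough illustration of that last point, the sketch below builds the same data.frame both ways. The stringsAsFactors argument is passed explicitly rather than relying on the session default, since that default has differed between R versions.

```r
sizes <- c("S", "M", "L", "XL")

# Strings kept as they are
as_strings <- data.frame(size = sizes, stringsAsFactors = FALSE)

# Strings converted to a factor; levels are sorted alphabetically
as_factors <- data.frame(size = sizes, stringsAsFactors = TRUE)

print(class(as_strings$size))   # "character"
print(class(as_factors$size))   # "factor"
print(levels(as_factors$size))  # "L"  "M"  "S"  "XL"
```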
To finish up, I would say that I, personally, know and use factors; but even so, I am often bitten by strange errors that come from using them. Mainly, though, my suggestion comes from trying to teach people who are new to R/ggplot or are not statisticians: factors are always a difficult step (leading to the inevitable: "why do I need to care about that when I just want to plot stuff" ;) ).
I don't know which is better, but personally I like the current behavior.
@jiho, what do you think the order of the variable should be if some elements are duplicated:
d <- data.frame(size=c("S", "M", "L", "M", "S"), value=c(3,6,7,9,0), stringsAsFactors=FALSE)
The order of the first occurrence of each level, i.e. what unique(c("S", "M", "L", "M", "S")) returns: S, M, L.
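For the data frame with duplicated sizes, the proposed rule could be sketched as follows. This is only an illustration in base R of the ordering being suggested, not of what ggplot does internally; first_seen is a name made up for the example.

```r
# Data with repeated sizes, kept as character strings
d <- data.frame(size = c("S", "M", "L", "M", "S"),
                value = c(3, 6, 7, 9, 0),
                stringsAsFactors = FALSE)

# unique() keeps the first occurrence of each value, in order
first_seen <- unique(d$size)

# Use that order for the factor levels
d$size <- factor(d$size, levels = first_seen)

print(first_seen)          # "S" "M" "L"
print(as.integer(d$size))  # 1 2 3 2 1
```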
While I can see your point, I really think that would be counter-intuitive behaviour. Some things are confusing because you need to learn a bit more about how R works. It's more effort now, but pays off in the future.