---
kind: article
feed: true
title: "R: The Beautiful Parts"
created_at: 2012-03-30 05:00 GMT
---

One of the biggest criticisms levelled at R is the unfriendliness of its syntax. I think this is a valid point, as R can be written both as a curly-brace language and as a lisp dialect. The examples below illustrate both styles of programming.

# Curly brace type programming
my.var      <- 7

my.function <- function(x){
  (x / 2) / x
}


# Lisp-type functional programming
sqrt(
  '*'(
    '+'(1, 2),
    3))

# Here `count` is a function passed as an argument
by(my.data, my.data$year, count)

This leads to confusion, as there is apparently no 'right' way to program in R. Writing from experience, you can write all of your scripts using declarative statements and loops within nested loops. Alternatively, you can take advantage of the functional nature of R and never write a single loop again. The programming style is largely left as a choice to the user.

The problem with this situation is that few R users have experience with functional programming, and fewer still with a lisp. Many users will therefore program R as if it were Python, Java, Perl and so forth, which the R syntax permits. A declarative programming style in R is, however, more verbose and more error prone. I say error prone because you will likely have to write multiple groups of nested loops using indexing variables, which are hard to debug and easy to confuse. Finally, R's non-syntax-related error messages can be extremely terse and even plain unhelpful, which compounds an already frustrating experience.

On the opposing side there is the functional programming approach. This avoids the common problem of nested loops and indexing variables. It will also inevitably cut down the amount of code you need to write, and therefore maintain and debug. Writing from my own experience: as difficult and frustrating as writing declarative R is, writing functional R is the opposite, simple and compact.
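To make the contrast concrete, here is a small sketch (the data frame and its column names are invented for the example) summing each column of a data frame first with nested loops and indexing variables, then with a single functional call:

```r
# Invented example data: two numeric columns
measurements <- data.frame(a = 1:3, b = 4:6)

# Declarative style: nested loops and indexing variables
totals <- numeric(ncol(measurements))
for (i in seq_along(measurements)) {
  for (j in seq_len(nrow(measurements))) {
    totals[i] <- totals[i] + measurements[j, i]
  }
}
totals
#=> 6 15

# Functional style: no indexing variables at all
sapply(measurements, sum)
#=>  a  b
#    6 15
```

Both produce the same totals, but the functional version leaves nothing to get wrong about loop bounds or index order.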

The title of this post obviously draws inspiration from Douglas Crockford's JavaScript: The Good Parts. However, I think the best bits of R aren't just 'good'; I think they can produce beautiful code. I assume that the majority of people reading this wish to use R for manipulating, analysing, and visualising data. I'm not advocating writing a web server in R; there are obviously languages more suited to that. I think that data analysis, using a specific set of functions and libraries, is easier in R than in many other languages.

Use only data frames and vectors

Use only vectors and data frames whenever possible. Data frames are functionally columns and rows of vectors. Sticking to just these two data structures will simplify your code and reduce the number of functions you need to remember. Statistical functions, such as lm, often return lists; however, you'll nearly always be interested in the data retrieved using the $/[[ operators or a helper function. For example:

linear.model <- lm(y ~ x,
                   data.frame(x = 1:6,
                              y = (1:6)*4))

mode(linear.model)
#=> "list"

# Both of these fetch the fitted coefficients for the model
linear.model[['coefficients']]
coefficients(linear.model)

mode(coefficients(linear.model))
#=> "numeric"

Use data frames in the long format

Imagine a case where I take seedlings from several parent plants, expose them to two treatments, then measure their height and weight after a month. I could store the resulting data as follows, a layout I have found to be quite common for data measured using a spreadsheet program:

sample  cntrl_wgt trt_1_wgt trt_2_wgt cntrl_hgt trt_1_hgt trt_2_hgt
1       10        5         15        8         6         10
2       6         3         9         4         2         6
3       ...

This is called 'wide' formatted data. Instead store your data in 'long' format as follows:

parent treatment characteristic value
1      control   weight         10
1      control   height         8
1      1         weight         6
1      1         height         2
1      2         weight         15
1      2         height         10
2      control   weight         6
2      control   height         4
...

The advantages of this format are simple to explain. Imagine requiring the weight measurements across all treatments:

# Fetching weight data from the wide format
data[,c('cntrl_wgt','trt_1_wgt','trt_2_wgt')] 

# Fetching weight data from the long format
subset(data, characteristic == 'weight')$value

In the case of the wide format I have to manually select the columns containing the weight measurements. With the long format, on the other hand, I can just select them using the subset function. Now imagine I extend this case so that I take the same set of measurements after 1 month and 2 months. The data now look like this:

sample cntrl_mnth_1_wgt cntrl_mnth_2_wgt cntrl_mnth_1_hgt ...
1      10               13                8                ...
2      ...

The number of columns has now essentially doubled. The long data format, on the other hand, looks like this:

parent treatment month characteristic value
1      control   1     weight         10
1      control   1     height         8
1      control   2     weight         13
1      control   2     height         10
...

All that's required is an extra column for each extra factor. If I decided to measure plants at 1, 2 and 3 months, the wide format would triple in the number of columns while the long format would still only require a single extra column. Now again consider subsetting the data:

# I now have to enumerate all the extra columns I require
data[,c('cntrl_mnth_1_wgt','trt_1_mnth_1_wgt','trt_2_mnth_1_wgt',
        'cntrl_mnth_2_wgt','trt_1_mnth_2_wgt','trt_2_mnth_2_wgt')] 

# Subsetting data from the long format is still exactly the same
subset(data, characteristic == 'weight')$value

# I think it is arguably simpler and more readable to fetch more
# specific subsets from long formatted data also
subset(data, characteristic == 'weight' &
                  treatment == 1        &
                      month == 2        &
                     parent == 4)

I think this serves as an example that if you follow the R way of doing things your code stays compact and simple. The alternative is to end up fighting against R to get the results you need.
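If your data already exists in the wide format, base R's reshape function can convert it to long. This is only a sketch, using a cut-down version of the single-month table above; friendlier interfaces exist in the reshape and plyr packages:

```r
# Cut-down wide-format data from the single-month example
wide <- data.frame(sample    = 1:2,
                   cntrl_wgt = c(10, 6),
                   trt_1_wgt = c(5, 3),
                   trt_2_wgt = c(15, 9))

# Stack the three weight columns into a single `value` column,
# recording which treatment each row came from
long <- reshape(wide,
                direction = 'long',
                idvar     = 'sample',
                varying   = c('cntrl_wgt', 'trt_1_wgt', 'trt_2_wgt'),
                v.names   = 'value',
                timevar   = 'treatment',
                times     = c('control', '1', '2'))

subset(long, treatment == 'control')$value
#=> 10 6
```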

Prefer with/within over attach

Accessing each column vector of a data frame using the $ name prefix is cumbersome. The attach function can be used to get around this by adding the column vectors directly into the R search path. Using this approach does not, however, add any changes made to these variables back into the parent data frame. For example:

my.data <- data.frame(x = 1:3,
                      y = (1:3)*2 + 2)
my.data
#=>    x y
#    1 1 4
#    2 2 6
#    3 3 8

# Calculate the x*y product
# Using the `$` verbose notation
my.data$z <- my.data$x * my.data$y
my.data
#=>    x y  z
#    1 1 4  4
#    2 2 6 12
#    3 3 8 24

# Calculate using attach
# The x and y variables can be accessed directly
attach(my.data)
z <- x * y
detach(my.data)

# The new variable `z` is however not part of my.data
my.data
#=>    x y
#    1 1 4
#    2 2 6
#    3 3 8

The with and within functions allow you to perform many of the operations you might need on a data frame without the verbose syntax, while within also keeps the results as part of the data frame. For example, the following highlights using within to perform the same operations:

# Using `within` returns the updated data frame
my.data <- within(my.data, z <- x * y)
my.data
#=>    x y  z
#    1 1 4  4
#    2 2 6 12
#    3 3 8 24

# Multiple, multi-line operations using curly brace expressions
my.data <- within(my.data,{
  z1 <- x * y
  z2 <- x + y
})
my.data
#=>    x y  z z1 z2
#    1 1 4  4  4  5
#    2 2 6 12 12  8
#    3 3 8 24 24 11

The with function is similar to within, except that the result of the last expression is returned rather than an updated copy of the passed data frame. This is useful for creating models or plots from data frames. It can be illustrated as follows:

# The result of the expression is returned rather
# than updating the data frame
with(my.data, x * y)
#=> 4 12 24

# Create a linear model
with(my.data, {
  z <- x^2 + 1
  lm(y ~ z)
})
#=> The linear model is returned
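The same pattern works for plotting. A minimal sketch (my.data is re-created here so the snippet is self-contained, and a null pdf device is opened purely so it runs non-interactively):

```r
my.data <- data.frame(x = 1:3, y = c(4, 6, 8))

# A null pdf device, so the example runs without a display
pdf(file = NULL)

# `with` exposes the columns x and y directly to plot
result <- with(my.data, plot(x, y, main = 'y against x'))

dev.off()

# plot is called for its side effect and returns NULL
```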

apply

anonymous functions

conditionals if/switch

reshape

plyr

examples from survey