### Introduction

In this post, I'll give a quick tour of R's various `apply` operations. These utilities map a function over a data structure and are usually a more elegant way of doing so than an explicit loop. For starters, we'll just create a simple numerical matrix creatively called `M`.

In [22]:
M <- matrix(c(1:20), ncol=4)

In [23]:
M

1,6,11,16
2,7,12,17
3,8,13,18
4,9,14,19
5,10,15,20


### apply

Now suppose that we want to get the maximum value of each column in `M`. We can simply call `apply` on our matrix, passing in three arguments: the target matrix (`M` in this case), the `MARGIN`, which gives the subscripts over which our function will be applied, and, lastly, the function, which is of course `max`.

In [29]:
apply(M, 2, max)

For matrices, `2` indicates columns and `1` indicates rows:

In [30]:
apply(M, 1, max)

and we can even indicate both columns and rows with `c(1,2)`.

In [31]:
apply(M, c(1,2), max)

1,6,11,16
2,7,12,17
3,8,13,18
4,9,14,19
5,10,15,20


which basically just recapitulates our matrix, since each cell is its own max. It's important to note that the `MARGIN` argument names the dimension that is kept in the result, not the one that the function collapses. Thus, `MARGIN = 1` means that we want to **preserve** the rows (one result per row), while `MARGIN = 2` means that we want to **preserve** the columns (one result per column).
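To make the "preserved dimension" idea concrete, here is a minimal sketch that rebuilds the same `M` and checks that the length of each result matches the dimension named by the margin. The variable names `col_max` and `row_max` are mine, chosen just for illustration.

```r
# Rebuild the matrix from the start of the post: 5 rows, 4 columns.
M <- matrix(1:20, ncol = 4)

col_max <- apply(M, 2, max)  # margin 2: one result per column
row_max <- apply(M, 1, max)  # margin 1: one result per row

print(col_max)  # 5 10 15 20
print(row_max)  # 16 17 18 19 20

# The result has as many entries as the preserved dimension.
print(length(col_max) == ncol(M))  # TRUE
print(length(row_max) == nrow(M))  # TRUE
```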

Of course, we can also `apply` lambdas (anonymous functions) of our own, defining them at the time we pass them to `apply`. Interestingly, you don't have to write an explicit `return()` in such functions, because R automatically returns the value of the last expression that was evaluated. Compare these scenarios:

In [57]:
apply(M, 1, function(x) { min(x)})

In [59]:
apply(M, 1, function(x) { 0; min(x)})

In [60]:
apply(M, 1, function(x) { 0 ; min(x); 0})
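As a rough sketch of what those calls give back, the snippet below stores two of the results and adds one variant with an explicit `return()`, which is my own addition for contrast. Only the last evaluated expression (or the value passed to `return()`) comes back.

```r
M <- matrix(1:20, ncol = 4)

# min(x) is the last expression, so it is the return value.
last_is_min <- apply(M, 1, function(x) { 0; min(x) })

# The trailing 0 is the last expression, so every row yields 0.
last_is_zero <- apply(M, 1, function(x) { 0; min(x); 0 })

# An explicit return() exits early; the trailing 0 is never reached.
explicit <- apply(M, 1, function(x) { return(min(x)); 0 })

print(last_is_min)   # 1 2 3 4 5
print(last_is_zero)  # 0 0 0 0 0
print(explicit)      # 1 2 3 4 5
```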

### lapply

Several variants of the `apply` function exist, and we'll go through them in the rest of this post. The first is `lapply`, which behaves differently in that it returns a list with one element per element of the input (20 for our matrix), each holding the result of applying the specified function to that value. In a sense, it's like calling `apply(M, c(1,2), ...)` above except that it returns a list instead of an equivalently-shaped matrix.

In [130]:
lapply(M, max)

In [80]:
lapply(M, function(x) {x + 2})

We don't have to specify a margin here because `lapply` coerces our matrix into a list of its individual elements.

In [84]:
is.list(lapply(M, function(x) {x + 2}))
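Here is a small sketch of that element-by-element behavior. For contrast, it also converts `M` to a data frame with `as.data.frame` (my own choice for illustration, not something used elsewhere in this post) so that `lapply` walks over columns instead of cells.

```r
M <- matrix(1:20, ncol = 4)

# Over the bare matrix, lapply visits every cell: 20 results.
per_cell <- lapply(M, max)
print(length(per_cell))  # 20

# A data frame is a list of columns, so lapply visits 4 columns.
per_column <- lapply(as.data.frame(M), max)
print(length(per_column))  # 4
print(names(per_column))   # "V1" "V2" "V3" "V4"
print(per_column$V4)       # 20
```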

### sapply

`sapply` is a wrapper for `lapply`. It's the "simple" apply because it attempts to simplify the returned values into the most basic data structure possible. In this case, it gives us a vector as opposed to a list.

In [87]:
is.vector(sapply(M, max))

We can turn off this simplification behavior with the `simplify` flag.

In [96]:
is.list(sapply(M, max, simplify=FALSE))
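What "most basic data structure" means depends on what the function returns. The sketch below uses toy inputs of my own to show three outcomes: a vector, a matrix, and a list left as-is when the results have different lengths.

```r
M <- matrix(1:20, ncol = 4)

# One value per element: simplified to a plain vector.
as_vector <- sapply(M, max)
print(is.vector(as_vector))  # TRUE

# Two values per element: simplified to a matrix, one column per input.
as_matrix <- sapply(1:3, function(i) c(i, i^2))
print(dim(as_matrix))  # 2 3

# Results of different lengths cannot be simplified, so a list is returned.
as_list <- sapply(1:3, function(i) seq_len(i))
print(is.list(as_list))  # TRUE
```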

### vapply

`vapply` is very similar to `sapply` but is generally preferred for two reasons: it can be slightly faster, and it leads to more resilient code because it requires us to specify the type (and length) of each return value, which prevents some insidious bugs from popping up later. For example, here we run `vapply` on our numerical matrix and tell it to expect `character` output and, just as we would want, we get an error when that is not what comes back.

In [100]:
vapply(M, max, "")

ERROR: Error in vapply(M, max, ""): values must be type 'character',
 but FUN(X[[1]]) result is type 'integer'


And just for completeness' sake, we can pass a lambda that returns a string and see that the type check now passes.

In [101]:
vapply(M, function(x) {'a'}, "")
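A minimal sketch of why the declared type helps: the template `integer(1)` below says "each call returns exactly one integer". The empty-input comparison at the end is my own illustration of a case where `sapply` and `vapply` return different types.

```r
M <- matrix(1:20, ncol = 4)

# The template matches what max() returns for integers, so this succeeds.
ok <- vapply(M, max, integer(1))
print(ok[1:5])  # 1 2 3 4 5

# A mismatched template fails loudly; capture the message instead of stopping.
msg <- tryCatch(
  vapply(M, max, character(1)),
  error = function(e) conditionMessage(e)
)
print(msg)

# On empty input, sapply falls back to an empty list,
# while vapply still returns the declared type.
print(class(sapply(integer(0), max)))              # "list"
print(class(vapply(integer(0), max, integer(1))))  # "integer"
```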

### tapply

Now at this point we depart a bit from the theme and get into some more unusual members of the `apply` family. `tapply` allows us to apply a function over groups of values, such as a column of a data frame split up by a factor.

We'll pull out the `iris` dataset for this one:

In [104]:
head(iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


What `tapply` lets us do is specify a vector of values (here, a single column of our data frame) as well as an `INDEX` argument, which refers to a factor by which to group those values before applying our passed function. Conceptually, this is a lot like SQL's `GROUP BY` clause.

In [107]:
tapply(iris$Sepal.Length, iris$Species, mean)

Note how we get the mean sepal length for each species in the `iris` dataset.
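To show the grouping step on its own, here is a sketch with a tiny made-up vector (`scores` and `team` are hypothetical names). It compares `tapply` against splitting by hand with `split` and then applying `mean` to each piece, which is roughly what happens conceptually.

```r
# Toy data: four values in two groups.
scores <- c(1, 2, 3, 4)
team <- factor(c("a", "a", "b", "b"))

by_team <- tapply(scores, team, mean)
print(by_team)  # a = 1.5, b = 3.5

# Roughly the same thing by hand: split into groups, then apply.
by_hand <- sapply(split(scores, team), mean)
print(by_hand)  # a = 1.5, b = 3.5

# The iris call from above, rounded for readability.
print(round(tapply(iris$Sepal.Length, iris$Species, mean), 3))
```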

### mapply

Lastly, we'll cover `mapply`, which stands for "multivariate apply." Its purpose is to be like an `sapply` that walks over multiple vectors in step with each other. It's sort of like mapping a function over Python's `zip`. For a quick illustration, we take two vectors, one of which contains the numbers 1 through 6 and the other the numbers 6 through 1. Using `mapply` we can sum the corresponding values at each position (e.g. 1+6, 2+5, 3+4, etc.).

In [118]:
mapply(sum, c(1:6), c(6:1))

An interesting behavior is that, if the vectors have different lengths, the shorter vector is recycled from its beginning until it matches the length of the longer one, like so:

In [121]:
mapply(sum, c(1:6), c(5:1))

“longer argument not a multiple of length of shorter”

Our first vector had the numbers 1 through 6, but our second vector only had the numbers 5 through 1. Therefore, on that last `mapply` iteration, it received a 6 from the first vector and, having run out of values in the second vector, started over at its first value (5 in this case, giving us 11). R also emits the warning shown above because 6 is not a multiple of 5.
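To see exactly which value gets reused when the lengths differ, this sketch stretches the shorter vector by hand with `rep_len` and compares it with what `mapply` produces. The name `short` is mine, and `suppressWarnings` is only there to keep the output tidy.

```r
short <- 5:1

# Recycling restarts from the first element of the shorter vector.
print(rep_len(short, 6))  # 5 4 3 2 1 5

# mapply pairs 1:6 with the recycled values: 1+5, 2+4, 3+3, 4+2, 5+1, 6+5.
res <- suppressWarnings(mapply(sum, 1:6, short))
print(res)  # 6 6 6 6 6 11
```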

We can pass in our own lambdas too but, of course, we have to declare one argument for each vector being processed.

In [127]:
mapply(function(x,y) {x*y}, c(1:6), c(6:1))
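As a quick sketch of the "one argument per vector" rule, the first call below stores the result of the example above, and the second (with a made-up third vector) shows that the same pattern extends to three inputs.

```r
# Two vectors, two arguments: element-wise products.
products <- mapply(function(x, y) { x * y }, 1:6, 6:1)
print(products)  # 6 10 12 12 10 6

# Three vectors, three arguments: x * y + z at each position.
combined <- mapply(function(x, y, z) { x * y + z }, 1:3, 4:6, 7:9)
print(combined)  # 11 18 27
```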

### Conclusion

So that's a handy overview of the various functions in R's `apply` family. They can make your code more concise and your style more legible and declarative, and sometimes a bit faster too. Next time you're doing an analysis, see if one of these wouldn't come in handy.