![](logo.png)

# <font color='Red'>Advanced Programming with R</font>

> ### Apply Functions
> ### Built-in R Features
> ### Math Functions
> ### Regular Expressions

# <font color='Red'>Apply Functions</font>

In this lecture we will learn about 3 different apply() functions. The basic idea of an apply() function is to apply a function to each element of an iterable object, such as a vector or a list.

Let's start with lapply():

## lapply()

The built-in R function **lapply()** will apply a function over a list or vector:
    
    lapply(X, FUN, ...)

Where X is your list/vector and FUN is your function. For more info you can use help(lapply).

In [2]:
help(lapply)

Let's see how we can use this in its most practical use case by applying a custom function to a vector. First I want to show you a quick function (we will go over more utilities like this one later) which will allow us to pick a random sample from a vector:

In [3]:
sample(1:10, 1)
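
Because sample() draws at random, the value changes on every run. As a small sketch (the seed value 42 is an arbitrary choice, not something from the lecture), calling set.seed() before sample() should make the draw repeatable:

```r
# set.seed() fixes the random number generator state so that
# sample() returns the same draw each time the code is run.
set.seed(42)
first_draw <- sample(1:10, 1)

set.seed(42)
second_draw <- sample(1:10, 1)

# Same seed, same draw
identical(first_draw, second_draw)
```

This is handy when you want the random examples in this lecture to give the same output every time.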

In [4]:
# Vector
v <- c(1,2,3,4,5)

# Custom Function
addrand <- function(x){
    
    #Get random number
    ran <- sample(1:10, 1)
    
    return(x+ran)
    
}

lapply(v, addrand)

## Anonymous Functions

So you noticed that in the last example we had to write out an entire function to apply to the vector, but in reality that function is just doing something pretty simple, adding a random number. Do we really want to formally define an entire function for this? We don't want to, especially if we only plan to use this function a single time.

To address this issue, we can create an anonymous function (because it won't have a defined name). Here's the syntax for an anonymous function in R:

    function(a){code here}

This is a similar idea to lambda expressions in Python. So for example we can rewrite the previous function as an anonymous function and use lapply() with it:

In [5]:
# Show vector
v

In [6]:
# Anonymous Function Setup

function(a){a+sample(1:10, 1)}

In [7]:
# Anonymous Function with lapply()

lapply(v, function(a){a+sample(1:10, 1)})

In [8]:
# Anonymous Function with lapply()

lapply(v, function(x){x+2})
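
As an aside, newer versions of R also accept a backslash shorthand for anonymous functions. The sketch below assumes R 4.1.0 or later; on older versions the `\(x)` form is a syntax error, so stick with function(x) there:

```r
v <- c(1, 2, 3, 4, 5)

# Shorthand anonymous function (assumes R >= 4.1.0)
short_result <- lapply(v, \(x) x + 2)

# Equivalent long form used in the cells above
long_result <- lapply(v, function(x) { x + 2 })

# Both calls produce the same list
identical(short_result, long_result)
```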

Now what if our original function had multiple arguments? lapply() actually lets us deal with that by simply adding them in like this:

In [14]:
add_choice <- function(num, choice){
    
    return(num+choice)
    
}

add_choice(2,3)

In [16]:
# Using lapply() with multiple arguments
lapply(v, add_choice, choice=10)

You can do this with several arguments; just keep adding them after choice=10.
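
Here is a minimal sketch of passing more than one extra argument. The function scale_and_shift and its argument names are made up for illustration; the point is only that every named argument after FUN is forwarded to each call:

```r
v <- c(1, 2, 3, 4, 5)

# Hypothetical function with two extra arguments
scale_and_shift <- function(num, multiplier, offset) {
  num * multiplier + offset
}

# multiplier and offset are passed along to every call of scale_and_shift
result <- lapply(v, scale_and_shift, multiplier = 2, offset = 1)

unlist(result)  # 3 5 7 9 11
```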

## sapply() vs. lapply()
Notice that lapply() returned a list. We can use sapply() instead, which simplifies the result by returning a vector or matrix where possible. For example:

In [17]:
help(sapply)

In [18]:
# Use sapply instead of lapply for adding choices to v

sapply(v, add_choice, choice=10)

NOTE: See the difference in the output of sapply(). This changed from a list to a vector.

In [19]:
# Show difference between lapply and sapply output
lapp <- lapply(v,add_choice,choice=10)
sapp <- sapply(v,add_choice,choice=10)

class(lapp) # a list
class(sapp) # vector of numerics

## sapply() limitations
sapply() can only simplify to a vector when the applied function returns a result of the same length for every element. If some calls return nothing (a zero-length result), sapply() falls back to returning a list, just like lapply(). For example:

In [20]:
# Checks for even numbers
nums <- c(1,2,3,4,5)

even <- function(x) {
  return(x[(x %% 2 == 0)])
}

sapply(nums, even)

In [21]:
lapply(nums, even)
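
To make the limitation concrete, the sketch below checks what sapply() hands back and shows two ways to get just the even values. The variable names evens_subset and evens_filter are only for illustration:

```r
nums <- c(1, 2, 3, 4, 5)

even <- function(x) {
  x[x %% 2 == 0]
}

# Odd inputs give a zero-length result, so sapply() cannot build a vector
res <- sapply(nums, even)
class(res)    # "list"
lengths(res)  # 0 1 0 1 0

# Two alternatives that return only the even values as a vector
evens_subset <- nums[nums %% 2 == 0]                   # logical subsetting
evens_filter <- Filter(function(x) x %% 2 == 0, nums)  # keep elements where the test is TRUE
```

When the goal is to keep or drop elements, subsetting or Filter() is usually a better fit than an apply function.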

## Apply to Dataframes

The apply functions can be used to add new columns to a dataframe or change existing ones:

In [23]:
## Apply to Dataframes

p_yield = read.csv('peanut_yield.txt', sep='\t', header=TRUE)

In [24]:
head(p_yield)

Year,Location,Name,Label,NC_Accession,Plot_Yield,Yield
<int>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>
2014,LEW,ATP,Advanced Testing Program - Yield,ACI WT09-0761,9.0,2733
2014,RMT,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085
2014,WHI,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT11-0351,15.3,4632
2015,RMT,ATP,Advanced Testing Program - Yield,ACI WT11-0351,10.7,3226
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT12-0226,13.6,4104


In [26]:
p_yield['Test_Yield'] <- lapply(p_yield['Plot_Yield'], function(x){x*302.5})

In [27]:
head(p_yield)

Year,Location,Name,Label,NC_Accession,Plot_Yield,Yield,Test_Yield
<int>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,"<dbl[,1]>"
2014,LEW,ATP,Advanced Testing Program - Yield,ACI WT09-0761,9.0,2733,2722.5
2014,RMT,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085,3085.5
2014,WHI,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085,3085.5
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT11-0351,15.3,4632,4628.25
2015,RMT,ATP,Advanced Testing Program - Yield,ACI WT11-0351,10.7,3226,3236.75
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT12-0226,13.6,4104,4114.0


In [29]:
p_yield['Test_Yield2'] <- sapply(p_yield['Plot_Yield'], function(x){x*302.5})

In [30]:
head(p_yield)

Year,Location,Name,Label,NC_Accession,Plot_Yield,Yield,Test_Yield,Test_Yield2
<int>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,"<dbl[,1]>",<dbl>
2014,LEW,ATP,Advanced Testing Program - Yield,ACI WT09-0761,9.0,2733,2722.5,2722.5
2014,RMT,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085,3085.5,3085.5
2014,WHI,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085,3085.5,3085.5
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT11-0351,15.3,4632,4628.25,4628.25
2015,RMT,ATP,Advanced Testing Program - Yield,ACI WT11-0351,10.7,3226,3236.75,3236.75
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT12-0226,13.6,4104,4114.0,4114.0
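
For simple arithmetic like this, an apply function is not strictly needed, because arithmetic on a column is already vectorized. The sketch below uses a small made-up data frame (df) standing in for the first rows of p_yield, so it runs without the data file:

```r
# Hypothetical stand-in for a few rows of p_yield
df <- data.frame(Plot_Yield = c(9.0, 10.2, 15.3))

# Arithmetic is vectorized, so the whole column is converted in one step
df$Test_Yield <- df$Plot_Yield * 302.5

# The same values via sapply() over the column vector
df$Test_Yield2 <- sapply(df$Plot_Yield, function(x) x * 302.5)

# Both columns hold the same numbers
all.equal(df$Test_Yield, df$Test_Yield2)
```

Note that df$Plot_Yield pulls the column out as a plain vector, whereas p_yield['Plot_Yield'] keeps it as a one-column dataframe.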


In [50]:
# If else statements using sapply, lapply

map_condition <- function(x){
    if (x > 3000){
        return('Yes')
    }else{
        return('No')
    }
}

map_condition(3100)

In [54]:
# Use lapply to map the function map_condition
head(lapply(p_yield[['Yield']], map_condition))
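
The same mapping with sapply() gives a character vector instead of a list. This sketch uses three made-up yield values in place of p_yield[['Yield']]:

```r
map_condition <- function(x) {
  if (x > 3000) {
    return('Yes')
  } else {
    return('No')
  }
}

# Hypothetical yields standing in for p_yield[['Yield']]
yields <- c(2733, 3085, 4632)

# sapply() calls map_condition once per element and returns a character vector
flags <- sapply(yields, map_condition)
flags  # "No" "Yes" "Yes"
```

The apply function is needed here because if() only tests a single value at a time, so map_condition cannot be called on the whole vector directly.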

## ifelse() function

Instead of having to write a function using an IF, ELSE statement, there is a built-in R function called **ifelse()** that allows you to code a binary result based on a condition. Let's look at an example to create a new data column from an existing data column in a dataframe. 

In [42]:
# Use 'Yield' > 3000 to determine whether you should select a line or not
p_yield['Select'] <- ifelse(p_yield['Yield'] > 3000, 'Select', 'Drop')
head(p_yield)

Year,Location,Name,Label,NC_Accession,Plot_Yield,Yield,Test_Yield,Test_Yield2,Select
<int>,<fct>,<fct>,<fct>,<fct>,<dbl>,<int>,"<dbl[,1]>",<dbl>,"<chr[,1]>"
2014,LEW,ATP,Advanced Testing Program - Yield,ACI WT09-0761,9.0,2733,2722.5,2722.5,Drop
2014,RMT,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085,3085.5,3085.5,Select
2014,WHI,ATP,Advanced Testing Program - Yield,ACI WT09-0761,10.2,3085,3085.5,3085.5,Select
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT11-0351,15.3,4632,4628.25,4628.25,Select
2015,RMT,ATP,Advanced Testing Program - Yield,ACI WT11-0351,10.7,3226,3236.75,3236.75,Select
2015,LEW,ATP,Advanced Testing Program - Yield,ACI WT12-0226,13.6,4104,4114.0,4114.0,Select


In [57]:
# Multiple conditions
head(ifelse(p_yield['Yield'] > 3000, 'Select', ifelse(p_yield['Yield'] < 2000, 'Drop', 'Keep')), 10)

Yield
Keep
Select
Select
Select
Select
Select
Select
Drop
Keep
Keep
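
To see exactly where the boundaries of a nested ifelse() fall, here is a small sketch on a plain vector of made-up yields, including the edge values 2000 and 3000:

```r
# Hypothetical yields, including the boundary values
yields <- c(1500, 2000, 2500, 3000, 3500)

# The first test that is TRUE decides the label; everything else falls through
labels <- ifelse(yields > 3000, 'Select',
                 ifelse(yields < 2000, 'Drop', 'Keep'))

labels  # "Drop" "Keep" "Keep" "Keep" "Select"
```

Because the tests use strict inequalities, values of exactly 2000 or 3000 end up as 'Keep'.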


## Other apply() type functions
There are actually quite a few apply() type functions in R. We've gone over everything you need to know for now, but if you're curious to find out more about them, you can check out this documentation or this excellent StackOverflow answer, copied here below:

R has many *apply* functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply* family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply* function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply* function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

* **Apply** - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

>#### Two dimensional matrix
    M <- matrix(seq(1,16), 4, 4)

>#### apply min to rows
    apply(M, 1, min)
    [1] 1 2 3 4

>#### apply max to columns
    apply(M, 2, max)
    [1]  4  8 12 16

>#### Three-dimensional array
    M <- array(seq(32), dim = c(4,4,2))

>#### Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
    apply(M, 1, sum)

>#### Result is one-dimensional
    [1] 120 128 136 144

>#### Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
    apply(M, c(1,2), sum)

>#### Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48

If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
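
As a quick check of that advice, the sketch below compares apply() with the dedicated functions on the same 4x4 matrix. The variable names are only for illustration:

```r
M <- matrix(seq(1, 16), 4, 4)

# Row sums two ways
row_sums_apply <- apply(M, 1, sum)
row_sums_fast <- rowSums(M)

# Column means two ways
col_means_apply <- apply(M, 2, mean)
col_means_fast <- colMeans(M)

# Both approaches agree
all.equal(row_sums_apply, row_sums_fast)
all.equal(col_means_apply, col_means_fast)
```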


* **Lapply** - When you want to apply a function to each element of a list in turn and get a list back.

This is the workhorse of many of the other *apply* functions. Peel back their code and you will often find lapply underneath.

    x <- list(a = 1, b = 1:3, c = 10:100)

    lapply(x, FUN = length)
    $a
    [1] 1
    $b
    [1] 3
    $c
    [1] 91

    lapply(x, FUN = sum)
    $a
    [1] 1
    $b
    [1] 6
    $c
    [1] 5005

* **Sapply** - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

If you find yourself typing unlist(lapply(...)), stop and consider sapply.

    x <- list(a = 1, b = 1:3, c = 10:100)

Compare with above; a named vector, not a list 
    
    sapply(x, FUN = length)  
    a  b  c   
    1  3 91

    sapply(x, FUN = sum)   
    a    b    c    
    1    6 5005 

In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

    sapply(1:5,function(x) rnorm(3,x))
   
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

    sapply(1:5,function(x) matrix(x,2,2))
   
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

    sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
   
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
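
One way to see the three behaviors side by side is to look at the dimensions of each result. This is only a sketch; the variable names are made up:

```r
# Vectors of length 3 become the columns of a 3 x 5 matrix
m_vec <- sapply(1:5, function(x) rnorm(3, x))
dim(m_vec)   # 3 5

# Each 2 x 2 matrix is flattened into a column of length 4
m_flat <- sapply(1:5, function(x) matrix(x, 2, 2))
dim(m_flat)  # 4 5

# With simplify = "array" the matrices are stacked into a 2 x 2 x 5 array
m_arr <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")
dim(m_arr)   # 2 2 5
```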

* **Vapply** - When you want to use sapply but perhaps need to squeeze some more speed out of your code.

For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

    x <- list(a = 1, b = 1:3, c = 10:100)

Note that since the advantage here is mainly speed, this example is only for illustration. We're telling R that everything returned by length() should be an integer of length 1.

    vapply(x, FUN = length, FUN.VALUE = 0L) 
    a  b  c  
    1  3 91
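
Besides speed, the template given in FUN.VALUE also acts as a check on what the function returns. The sketch below (variable names are illustrative) deliberately passes a template that does not match, and catches the resulting error with tryCatch():

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Matches the template: each length() result is a single integer
ok <- vapply(x, FUN = length, FUN.VALUE = integer(1))

# Mismatched template: length() returns an integer, not a character string,
# so vapply() stops with an error instead of quietly changing the type
bad <- tryCatch(
  vapply(x, FUN = length, FUN.VALUE = character(1)),
  error = function(e) "error"
)

ok   # named integer vector: a = 1, b = 3, c = 91
bad  # "error"
```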

* **Mapply** - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

This is multivariate in the sense that your function must accept multiple arguments.

Sums the 1st elements, the 2nd elements, etc. 

    mapply(sum, 1:5, 1:5, 1:5) 
    [1]  3  6  9 12 15

To do rep(1,4), rep(2,3), etc.
    
    mapply(rep, 1:4, 4:1)   
    [[1]]
    [1] 1 1 1 1

    [[2]]
    [1] 2 2 2

    [[3]]
    [1] 3 3

    [[4]]
    [1] 4

* **Map** - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

    Map(sum, 1:5, 1:5, 1:5)
    [[1]]
    [1] 3

    [[2]]
    [1] 6

    [[3]]
    [1] 9

    [[4]]
    [1] 12

    [[5]]
    [1] 15


* **Rapply** - For when you want to apply a function to each element of a nested list structure, recursively.

To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

Append ! to string, otherwise increment

    myFun <- function(x){
        if (is.character(x)){
            return(paste(x,"!",sep=""))
        }
        else{
            return(x + 1)
        }
    }

A nested list structure
    
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
          b = 3, c = "Yikes", 
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))


Result is named vector, coerced to character           
    
    rapply(l,myFun)

Result is a nested list like l, with values altered

    rapply(l, myFun, how = "replace")
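
rapply() can also restrict the function to leaves of a given class through its classes argument, which avoids the is.character() test inside the function. A minimal sketch on the same nested list (the name shout is made up):

```r
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Only character leaves are passed to toupper(); with how = "replace"
# the other leaves are left as they are
shout <- rapply(l, toupper, classes = "character", how = "replace")

shout$a$a1     # "BOO"
shout$b        # 3
shout$d$b2$a3  # "HEY"
```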


* **Tapply** - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

A vector:

    x <- 1:20

A factor (of the same length!) defining groups:

    y <- factor(rep(letters[1:5], each = 4))

Add up the values in x within each subgroup defined by y:

    tapply(x, y, sum)
     a  b  c  d  e
    10 26 42 58 74

More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.) Hence its black sheep status.
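
To make the split-apply-combine connection concrete, here is a sketch that does the same grouping by hand with split() and sapply() and compares it with tapply(). The variable names are only for illustration:

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))

# Split: break x into a list of groups defined by y
groups <- split(x, y)

# Apply and combine: sum each group and simplify to a named vector
by_hand <- sapply(groups, sum)

# tapply() does all three steps in one call
by_tapply <- tapply(x, y, sum)

all(by_hand == by_tapply)
```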

# <font color='Red'>Built-in R Features</font>

We've discussed or used most of these built-in R functions in lectures or exercises, but this is a good time to reiterate the importance of using these functions instead of writing your own.

    seq(): Create sequences
    sort(): Sort a vector
    rev(): Reverse elements in object
    str(): Show the structure of an object
    append(): Merge objects together (works on vectors and lists)
    strsplit(): Split the elements of a string using a delimiter

In [71]:
# Create a vector
vec <- c(4,6,1,2,-1,0,3)

In [72]:
# Creating sequences using start, end, and step
seq(0, 100, by= 3)

In [73]:
# Sort the elements of vec
sort(vec)

In [74]:
# Reverse the order of a sorted vec
rev(sort(vec))

In [76]:
# Check the structure of vec
str(vec)

 num [1:7] 4 6 1 2 -1 0 3


In [79]:
# Append elements on to a sorted vec
append(sort(vec), 9)

In [106]:
# Split string elements to form a vector
my_string <- 'Bailey-Wynne-Bailey II-Gregory-Emery'
# strsplit() returns a list, so unlist() flattens the result into a character vector
new_vec <- unlist(strsplit(my_string,"-"))

In [173]:
# Pull out elements of new_vec
new_vec[1]
# To convert to a list
as.list(new_vec)[1]
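
strsplit() returns a list because it can split several strings at once, giving one list element per input string. A small sketch with made-up input (names_vec and parts are illustrative names):

```r
# Two strings to split, each with a different number of pieces
names_vec <- c('Bailey-Wynne', 'Gregory-Emery-Bailey II')

# fixed = TRUE treats '-' as a literal character rather than a regular expression
parts <- strsplit(names_vec, '-', fixed = TRUE)

class(parts)    # "list"
lengths(parts)  # 2 3

# First piece of each string, simplified to a character vector
first_parts <- sapply(parts, function(p) p[1])
first_parts     # "Bailey" "Gregory"
```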

# <font color='Red'>Math Functions</font>

We've discussed or used most of these built-in R math functions in lectures or exercises, but this is a good time to reiterate the importance of using these functions instead of writing your own.

    abs(): computes the absolute value.
    sum(): returns the sum of all the values present in the input.
    mean(): computes the arithmetic mean.
    round(): rounds values to a given number of decimal places (set with the digits argument)

In [63]:
# Create a vector
vec <- c(-1,0,1,2,3,4,6)

In [64]:
# Using abs to find the absolute value of a number
abs(-2)

In [65]:
# Sum of vec
sum(vec)

In [66]:
# Mean of vec
mean(vec)

In [67]:
# Round the mean of vec to two decimal places
round(mean(vec), 2)

# <font color='Red'>Regular Expressions</font>

Regular expressions is a general term which covers the idea of pattern searching, typically in a string (or a vector of strings).

For now, we'll learn five useful functions for regular expressions and pattern searching:

* grepl(), which returns a logical indicating if the pattern was found in a string or in each element of a vector

* gregexpr(), which returns the starting positions of every match of the pattern within a string

* grep(), which returns a vector of the indices of the elements of a vector that match the pattern

* sub(), which replaces the first instance of the pattern with the replacement string in a string

* gsub(), which replaces all instances of the pattern with the replacement string in a string

For each of these functions you'll pass in a pattern and then the object you want to search (sub() and gsub() also take the replacement string, between the pattern and the object).

Let's see some quick examples:

## Match Pattern In A String

In [167]:
base_pair = c('A','C','G','T')
sequence <- paste(sample(base_pair, 100, replace=TRUE), collapse="")

In [134]:
# Determine if a pattern is within the sequence - Use grepl(pattern, x)
grepl(toupper('gtg'), sequence)

In [137]:
grepl(toupper('cat'), sequence)

In [140]:
grepl(toupper('tgtcca'), sequence)

In [152]:
sequence

## Find Index In A String

In [153]:
# Use gregexpr to index within sequence
gregexpr(toupper('cag'), sequence)
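
Since the sequence above is random, the output changes on every run. The sketch below uses a short fixed string (dna, a made-up example) so the positions can be checked by eye, and shows how to pull the matches out of the list that gregexpr() returns:

```r
dna <- 'CAGTAGAG'

# gregexpr() returns a list with one element per input string
m <- gregexpr('AG', dna)

# Starting positions of each match in the first (and only) string
starts <- as.integer(m[[1]])
starts  # 2 5 7

# regmatches() extracts the matched text itself
found <- regmatches(dna, m)[[1]]
found   # "AG" "AG" "AG"

# When nothing matches, the position is reported as -1
no_match <- gregexpr('TTT', dna)[[1]][1]
no_match  # -1
```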

## Index Vector Elements

In [164]:
base_pair = c('A','C','G','T')
sequence <- sample(base_pair, 100, replace=TRUE)

In [148]:
# Determine the indices of a pattern within the sequence - Use grep(pattern, x)
grep('G', sequence)

## Sub vs Gsub

In [170]:
sequence <- c('ATG','TAG','TTC','CTT','GTG','GAT','GCC')

In [171]:
sub('T', 'U', sequence)

In [172]:
gsub('T', 'U', sequence)
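
The difference only shows up in elements that contain more than one 'T'. Here is the same comparison as a self-contained sketch, with the vector renamed to codons (an illustrative name) and the results stored so they can be compared:

```r
codons <- c('ATG','TAG','TTC','CTT','GTG','GAT','GCC')

# sub() replaces only the first 'T' in each element
first_only <- sub('T', 'U', codons)
first_only  # "AUG" "UAG" "UTC" "CUT" "GUG" "GAU" "GCC"

# gsub() replaces every 'T' in each element
all_matches <- gsub('T', 'U', codons)
all_matches # "AUG" "UAG" "UUC" "CUU" "GUG" "GAU" "GCC"

# Elements where the two results differ had more than one 'T'
codons[first_only != all_matches]  # "TTC" "CTT"
```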