## Fine Tuning: Efficient Base R

R is flexible because you can often solve a single problem in many different ways. Some ways can be several orders of magnitude faster than the others. This chapter teaches you how to write fast base R code.

### Timings - growing a vector
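Growing a vector is one of the deadly sins in R; avoid it whenever you can. Each call to c(x, ...) allocates a new, longer vector and copies the existing elements into it, so building a vector of length n this way does work roughly proportional to n squared. The growing() function defined below generates n random standard normal numbers, but grows the vector by one element on every iteration.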

In [1]:
# The growing() function is defined below:
n <- 30000
# Slow code
growing <- function(n) {
    x <- NULL
    for(i in seq_len(n))  # seq_len() is safe when n is 0, unlike 1:n
        x <- c(x, rnorm(1))
    x
}

# Using the system.time() function, find how long it takes to generate 
# n = 30000 random standard normal numbers using the growing() function. 
# Use the <- trick to store the result in a vector called res_grow.

system.time( res_grow <- growing(n))

   user  system elapsed 
   0.64    0.02    0.66 

### Timings - pre-allocating a vector
The fix is to allocate the full vector once and then fill it in place. The pre_allocate() function defined below creates a numeric vector of length n up front with numeric(n) and assigns each random number to its slot, so the vector does not have to be rebuilt on every iteration.

In [2]:
n <- 30000
# Fast code
pre_allocate <- function(n) {
    x <- numeric(n) # Pre-allocate
    for(i in seq_len(n))  # seq_len() is safe when n is 0, unlike 1:n
        x[i] <- rnorm(1)
    x
}

# Using system.time(), find how long it takes to run pre_allocate(n). 
# Use the <- trick to store the result in the object res_allocate.

system.time(res_allocate <- pre_allocate(n))

# check that pre-allocating the vector is significantly faster 
# than growing the vector!

   user  system elapsed 
   0.03    0.00    0.03 
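
For this particular task the loop is not needed at all, because rnorm() is itself vectorized and can return all n values in one call. The sketch below is an addition to the exercise, not part of it: the names x_loop and x_vec are illustrative, and set.seed() is used only so that both approaches draw the same random numbers (this assumes R's default random number generator settings).

```r
n <- 30000

# Pre-allocated loop, as in the exercise above
pre_allocate <- function(n) {
    x <- numeric(n)
    for(i in seq_len(n))
        x[i] <- rnorm(1)
    x
}

# Same seed before each approach so the draws are comparable
set.seed(42)
x_loop <- pre_allocate(n)

set.seed(42)
x_vec <- rnorm(n)   # one vectorized call, no explicit loop

# Both approaches should produce the same numbers
print(all.equal(x_loop, x_vec))

# Timings vary by machine, so no expected values are shown
print(system.time(pre_allocate(n)))
print(system.time(rnorm(n)))
```

When a vectorized function already exists for the job, calling it once is usually both shorter and faster than any hand-written loop, pre-allocated or not.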

### Vectorized code: multiplication
The following piece of code is written like traditional C or Fortran code. 
Instead of using the vectorized version of multiplication, it uses a for loop.

x <- rnorm(10)

x2 <- numeric(length(x))

for(i in seq_along(x))
    x2[i] <- x[i] * x[i]

Your job is to make this code more "R-like" by vectorizing it. Rewrite that code using a vectorized solution. Hints: Your solution should be a single line of code. You should not use a for loop. The multiplication operator is vectorized!

In [3]:
# Store your answer as x2_imp
x2_imp <- x * x
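# Another way to do it: the power operator is also vectorized.
# (x**2 works too, because the parser translates ** to ^.)
x2_imp_exp <- x^2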

print(x2)
print(x2_imp)
print(x2_imp_exp)

 [1] 0.01905238 0.77723099 2.83592511 0.21349695 0.53524366 0.23198633
 [7] 0.31160668 1.33557329 2.07965055 0.01299281
 [1] 0.01905238 0.77723099 2.83592511 0.21349695 0.53524366 0.23198633
 [7] 0.31160668 1.33557329 2.07965055 0.01299281
 [1] 0.01905238 0.77723099 2.83592511 0.21349695 0.53524366 0.23198633
 [7] 0.31160668 1.33557329 2.07965055 0.01299281
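
Printing the three vectors and comparing them by eye works for ten values but not for a hundred thousand. A minimal sketch of a programmatic check using base R's all.equal() follows; the vector length and the reuse of the exercise's variable names are illustrative choices, not part of the exercise.

```r
x <- rnorm(1e5)

# Loop version, as in the original C-style code
x2 <- numeric(length(x))
for(i in seq_along(x))
    x2[i] <- x[i] * x[i]

# Vectorized versions
x2_imp <- x * x
x2_imp_exp <- x^2

# all.equal() compares numeric vectors up to a small tolerance
# and returns TRUE when they match
print(all.equal(x2, x2_imp))
print(all.equal(x2, x2_imp_exp))
```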


### Vectorized code: calculating a log-sum
A common operation in statistics is to calculate the sum of log probabilities.
The following code calculates the log-sum (the sum of the logs).

In [4]:
# x is a vector of probabilities
n <- 100
total <- 0
x <- runif(n)
for(i in seq_along(x)) 
    total <- total + log(x[i])

# However this piece of code could be significantly improved 
# using vectorized code.

# Find the log-sum of x using the log() and sum() functions to simplify
# the above loop. Store your answer in the object log_sum.
log_sum <- sum(log(x))

print(total)
print(log_sum)

[1] -84.55428
[1] -84.55428
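
One reason to sum logs rather than take the log of a product is numerical: a product of many small probabilities can underflow to zero in double-precision arithmetic. The sketch below illustrates this with made-up values (100 probabilities of 1e-5, chosen only for the example); the names p and log_of_prod are illustrative.

```r
p <- rep(1e-5, 100)

# The product is far below the smallest representable double,
# so it underflows to 0 and its log is -Inf
log_of_prod <- log(prod(p))

# Summing the logs stays well within range
log_sum <- sum(log(p))

print(log_of_prod)
print(log_sum)
```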


### Data frames and matrices - column selection
All values in a matrix must have the same data type, which has efficiency implications when selecting rows and columns. Suppose we have two objects, mat (a matrix) and df (a data frame). Using the microbenchmark() function, how long does it take to select a slice from each of these objects? The benchmark below times mat[, 1], the first column of the matrix, against df[1, ], the first row of the data frame. Note that these are not the same operation: the like-for-like column comparison would be mat[, 1] against df[, 1].

In [21]:
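# install.packages("microbenchmark")
library(microbenchmark)

# Load the gapminder data set from the gapminder package
# install.packages("gapminder")
library(gapminder)
df <- gapminder
n <- nrow(df)
m <- ncol(df)
# Create a matrix of random numbers. Note that nrow = m gives a matrix with
# m rows and n / m columns, so mat does not have the same shape as df.
mat <- matrix(rnorm(n), nrow = m)
# Which is faster, mat[, 1] (a matrix column) or df[1, ] (a data frame row)?
microbenchmark(mat[, 1], df[1, ])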

# fyi - Accessing a row of a data frame is much slower than accessing a row
# of a matrix, and the gap is larger than it is for column access.
# This is because a data frame is stored as a list of columns: each column
# holds a single data type, but a row can span several types, so extracting
# a row means pulling one element out of every column.

expr,time
"mat[, 1]",2500
"mat[, 1]",600
"df[1, ]",139000
"df[1, ]",62500
"mat[, 1]",1400
"mat[, 1]",300
"df[1, ]",54300
"df[1, ]",49900
"mat[, 1]",1400
"mat[, 1]",400

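A like-for-like comparison of column and row selection can be sketched with base R alone. The example below is an illustration under stated assumptions, not the exercise's benchmark: it uses system.time() instead of microbenchmark() so that it needs no extra packages, the matrix and data frame are filled with the same random values, and time_it() is a hypothetical helper introduced only for this sketch.

```r
set.seed(1)
n <- 1000   # rows (illustrative size)
m <- 6      # columns (illustrative size)

# A numeric matrix and a data frame holding the same values
mat <- matrix(rnorm(n * m), nrow = n)
df <- as.data.frame(mat)

# Hypothetical helper: total elapsed seconds for `reps` calls of f()
time_it <- function(f, reps = 2000) {
    system.time(for(i in seq_len(reps)) f())[["elapsed"]]
}

timings <- c(
    mat_col = time_it(function() mat[, 1]),
    df_col  = time_it(function() df[, 1]),
    mat_row = time_it(function() mat[1, ]),
    df_row  = time_it(function() df[1, ])
)

# Timings depend on the machine, so no expected values are shown
print(timings)
```

Comparing mat_col with df_col, and mat_row with df_row, keeps the operation fixed and varies only the data structure, which is the comparison the section heading describes.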