# R - the language


R is a programming language and **free software** environment mostly used for statistical computing. https://www.r-project.org/about.html

R is a high-level programming language, just like Ruby and Python, but more oriented to **table manipulation** and **statistical analysis**.

Please have this **cheatsheet** at hand:
http://github.com/rstudio/cheatsheets/raw/master/base-r.pdf

If you want to use R at home (or even during this lesson), you can use **RStudio**:
https://rstudio.com/

From the **terminal**, you can instead simply type "R".

Later in the class, we will use the **ggplot2** package for plotting. ggplot2 is part of a bigger collection of packages, the **tidyverse**, which provides many more useful packages: https://www.tidyverse.org/


You can get information about a function with the ? character:

In [2]:
?print

print                   package:base                   R Documentation

_P_r_i_n_t _V_a_l_u_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     ‘print’ prints its argument and returns it _invisibly_ (via
     ‘invisible(x)’).  It is a generic function which means that new
     printing methods can be easily added for new ‘class’es.

_U_s_a_g_e:

     print(x, ...)
     
     ## S3 method for class 'factor'
     print(x, quote = FALSE, max.levels = NULL,
           width = getOption("width"), ...)
     
     ## S3 method for class 'table'
     print(x, digits = getOption("digits"), quote = FALSE,
           na.print = "", zero.print = "0",
           right = is.numeric(x) || is.complex(x),
           justify = "none", ...)
     
     ## S3 method for class 'function'
     print(x, useSource = TRUE, ...)
     
_A_r_g_u_m_e_n_t_s:

       x: an object used to select a method.

     ...: further arguments passed to or from other methods.

   quote: logical, indicating wh

In [None]:
print("Ciaooo!")

If you only have a vague idea about a function and don't remember its name, you can search for it:

In [None]:
??select

## Basic operations

Unlike in many programming languages, assignment to a variable in R is done with the "<-" operator (but you can also use the "=" character):

In [None]:
x <- 3
x

# weirdly enough, you can also do this
# 4 -> y
# y

Arithmetic operations are straightforward:

In [None]:
"An addition: 5 + 5 "
5 + 5 

"A subtraction, with a number expressed in scientific notation: 5 - 5e-2"
5 - 5e-2

"Exponentiation: 2^5"
2^5

"Modulo: 28%%6"
28%%6

## Flow Control

R includes all the typical flow control constructs, such as if, for and while.



In [None]:
# in Python you use indentation to delimit blocks
#if (a < 1):
#    print(a)
#    x = 10

# in R you use curly braces
height_in_cm <- 195

if (height_in_cm > 190) {
    print("Oooohh you are so tall")
    print("Great!")
} else {
    print("You are ok")
}


In [None]:
i <- 0
while (i < 5){
    print(paste0("i is still less than 5: ", i))
    i <- i + 1
}
print("Finally i is 5")

In [None]:
### FOR loops are quite special, because they iterate over vectors (see later) and use the keyword "in":

# x is a vector from 1 to 5
x <- 1:5

"x is a vector"
x

# the variable i will iterate over the vector x, you do not need to increase it manually
for (i in x) {
    print(paste0("i is now: ", i))
}

print("Finally i is 5")

### next and break 

When inside a loop (for or while), the special keyword **next** skips the current iteration and jumps to the next one, while **break** exits the loop directly.

In [None]:
for (i in 1:10) {
    if(i == 4) # don't write 4
        next
    if(i == 9) # we stop the loop at 9
        break
    print(i)
}
    

## Functions

Functions can be defined with the **function** keyword

In [None]:
# in Python functions are like this
# def func(base, exponent):
#   result = base ** exponent
#   return(result)

func <- function(base, exponent) {
    result <- base ^ exponent
    return(result)
}

# call the function
func(2, 3)


## Basic data types

R has a few basic data types: integer, numeric, logical and character. There is also the special value NA, which marks missing data.

In [None]:
# Numeric variable
x1 <- 42

# A logical variable
x2 <- FALSE

# A string variable
x3 <- "universe"

# NA means not available: the data is missing
x4 <- NA

# Let's see what is inside x2
x2

# you can always check a variable type with "class"
class(x2)
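
As a small sketch of the types listed above, `class()` can be used to tell them apart. The variable names here are only illustrative. Note the `L` suffix, which is how R marks an integer literal:

```r
# A number without a suffix is stored as "numeric" (double precision)
a_numeric <- 42
# The L suffix asks for an integer
an_integer <- 42L
a_logical <- TRUE
a_string <- "universe"

print(class(a_numeric))   # "numeric"
print(class(an_integer))  # "integer"
print(class(a_logical))   # "logical"
print(class(a_string))    # "character"

# A plain NA is stored as a logical value
print(class(NA))          # "logical"
```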

## Vectors

Vectors are just an array of elements of the same type stored in a single variable. <br>
In R, *vectors are indexed starting from 1.* <br>
The good thing is that R treats them similarly to algebraic vectors, so you can do *vectorized* operations, which is a way to avoid loops.

In [None]:
# Create vectors using the "c" (concatenate) function
numbers <- c(1, 2, 10, 6, 49, 101, 155, 8)
characters <- c("a", "b", "c")
# the colon operator creates integer sequences (also check out the seq() function)
sequence <- 1:10

numbers
characters
sequence
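
To make the idea of vectorized operations concrete, here is a minimal sketch comparing a loop with the equivalent vectorized expression. The vector `prices` and its values are made up for this example:

```r
prices <- c(10, 20, 30)

# Loop version: double each element one at a time
doubled_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
    doubled_loop[i] <- prices[i] * 2
}

# Vectorized version: the operation is applied to every element at once
doubled_vec <- prices * 2

print(doubled_loop)
print(doubled_vec)

# Operations between two vectors of the same length work element by element
print(prices + c(1, 2, 3))
```

Both versions give the same result, but the vectorized one is shorter and usually faster.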

### Indexing vectors

In [None]:
x <- 20:30
x

"The fourth element"
x[4] 

"All but the fourth"
x[-4] 




#### Indexing vectors with another vector

In [None]:
x <- 20:30

"Use a vector to index another vector"
x[c(1, 2, 8)] 

"Elements two to four"
x[2:4] 

"All Elements except one and five."
x[-c(1, 5)] 


### Vector names

In [None]:
x <- 1:10
"The vector without names"
x

# You can give names to the vector elements
names(x) <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten")

"The vector with names"
x

"Only the names (which is a vector itself)"
names(x)
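
Once a vector has names, they can also be used for indexing. A short sketch, using a smaller vector than the one above:

```r
x <- 1:3
names(x) <- c("One", "Two", "Three")

# Select one element by its name
print(x["Two"])

# Select several elements with a vector of names
print(x[c("One", "Three")])
```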


## NA 

NA simply indicates that a value is missing. It is a special value rather than a type of its own (a plain NA is stored as a logical). Be careful: operations on NAs usually return NA!

In [None]:
# a vector containing a missing value
vec <- c(1, 2, 3, NA, 5)

" if we take the mean of the vector, we get a NA"
mean(vec)

" which can be avoided by ignoring the NAs"
mean(vec, na.rm = TRUE)
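
Because comparisons with NA also return NA, missing values cannot be found with `==`. The sketch below uses the base function `is.na()` instead:

```r
vec <- c(1, 2, 3, NA, 5)

# Comparing with == does not work: every result is NA
print(vec == NA)

# is.na() returns TRUE where a value is missing
print(is.na(vec))

# Count the missing values
print(sum(is.na(vec)))

# Keep only the values that are not missing
print(vec[!is.na(vec)])
```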

### Example: plotting

plot() is the basic R function for plotting. The resulting plot depends on the data you provide, but in general it is a scatterplot.

In [None]:
x <- 1:10
y <- c(12, 25, 11, 3, 24, 30, 12, 34, 10, 6)

plot(x, y)
barplot(y)

### Vectorization

As noted before, R automatically **vectorizes** operations that involve vectors and single values (another way to avoid loops).

In [None]:
x <- 1:10

"Logical vector marking which elements are equal to 10."
x == 10

"Elements which are equal to 10."
x[x == 10] 

"All elements less than 10"
x[x < 10] 

"Elements in the set 1, 2, 5."
x[x %in% c(1, 2, 5)]


"Elements not in the set 1, 2, 5."
x[!x %in% c(1, 2, 5)]
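
A comparison such as `x == 10` returns one TRUE/FALSE per element rather than the positions themselves. As a sketch, the base function `which()` turns such a logical vector into positions (the name `is_big` is just illustrative):

```r
x <- 20:30

# Logical vector: one TRUE/FALSE per element
is_big <- x > 27
print(is_big)

# Positions (indices) of the TRUE elements
print(which(is_big))

# The values themselves
print(x[is_big])
```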


This may seem a little counterintuitive (some other languages do not have this property), but it is very useful. For example, you can modify a vector (and also more complex data types) in *one shot*:


In [None]:
x <- 1:10

"Print x"
x

"Print the elements lower than 5"
x[x < 5]

"Set the elements lower than 5 to zero"
x[x < 5] <- 0

x

"Turn the vector elements into percentages, and round them to 2 decimal places"
x / sum(x) * 100
round(x / sum(x) * 100, 2)

## Basic plotting

Vectorization can be extremely useful for plotting. In this example we are going to simulate a linear regression, and show how you can put several plots on the same page with the "par()" function.

In [None]:
# Simulate the height of 100 people
height <- rnorm(n = 100, mean = 175, sd = 10)
hist(height)


In [None]:
# Imagine that weight = 10 + height * 0.4
weight <- 10 + height * 0.4
hist(weight)


In [None]:
# plot weight against height
plot(height, weight)

In [None]:
# Add some variability (noise)
weight <- weight + rnorm(100, 0, 1)
plot(height, weight)

In [None]:

# the function lm simply performs a linear regression with the formula y ~ x
fit <- lm(weight ~ height) 
"Have a look at the linear regression"
fit
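
To check that the regression recovers the values used in the simulation, the estimated coefficients can be extracted with `coef()`. This is a self-contained sketch: it repeats the simulation in one step, and `set.seed(42)` is an arbitrary choice added only to make the run reproducible.

```r
# Fix the random seed so the simulation is reproducible
set.seed(42)

height <- rnorm(n = 100, mean = 175, sd = 10)
weight <- 10 + height * 0.4 + rnorm(100, 0, 1)

fit <- lm(weight ~ height)

# coef() returns the estimated intercept and slope as a named vector
estimates <- coef(fit)
print(estimates)

# The slope should be close to the value used in the simulation (0.4)
print(estimates["height"])
```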


In [None]:
# let's prepare a 2 by 2 plot
par(mfrow = c(2,2))
# plot the histograms of height and weight
hist(height)
hist(weight)
# plot the relationship between height and weight
plot(x = height, y = weight)
# plot again the relationship, and add the regression line
plot(x = height, y = weight)
abline(fit)

## Matrix

Matrices are just 2-dimensional vectors, nothing special really. They are a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

In [None]:
# There are several ways to create a matrix, one is to turn a vector into a matrix
x <- 1:9
m <- matrix(x, nrow = 3, ncol = 3)

"Print the matrix"
m

# by default elements are entered into the matrix by column, but you can do it by row
m <- matrix(x, nrow = 3, ncol = 3, byrow = TRUE)

"Print the matrix by row"
m



### Subsetting matrices

Matrices can be easily subsetted by their indices. <br>
*Warning: selecting only one row or column automatically returns a vector, which can later produce undesired results. Use drop=FALSE to avoid this.*

In [None]:
x <- 1:9
m <- matrix(x, nrow = 3, ncol = 3, byrow = T)
m

"Get the element in position 1, 2"
m[1, 2]

"Get the second and third column"
m[, 2:3]


"Get the first row (be careful, the result is a vector, use drop=FALSE to avoid it)"
m[1, ]
m[1, , drop = FALSE]
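
The difference between the two results is easier to see by inspecting them. A minimal sketch (the variable names are illustrative):

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

row_as_vector <- m[1, ]
row_as_matrix <- m[1, , drop = FALSE]

# The default result has lost its dimensions
print(is.matrix(row_as_vector))  # FALSE
print(dim(row_as_vector))        # NULL

# With drop = FALSE the result is still a 1 x 3 matrix
print(is.matrix(row_as_matrix))  # TRUE
print(dim(row_as_matrix))        # 1 3
```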


### Matrix algebra operations

Matrices can be used to do any kind of algebraic operation.

Check more at: https://www.statmethods.net/advstats/matrix.html

In [None]:
m

"Matrix multiplication"
m %*% m

"transpose"
t(m)

"Determinant"
det(m)

"Sum of the rows"
rowSums(m)
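
One common source of confusion is the difference between `*` (element-wise product) and `%*%` (matrix product). The sketch below also computes an inverse with `solve()`. It uses a small invertible matrix `a` invented for this example, because the matrix built from 1:9 is singular and has no inverse.

```r
a <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)

# Element-wise product: each element is multiplied by itself
print(a * a)

# Matrix product (rows times columns)
print(a %*% a)

# solve() computes the inverse of an invertible matrix
a_inv <- solve(a)

# Multiplying a matrix by its inverse gives the identity matrix
# (up to small rounding errors)
print(round(a_inv %*% a, 10))
```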

### Columns and rows names

In [None]:
rownames(m) <- c("one", "two", "three")
colnames(m) <- c("uno", "dos", "tres")
m

### Manipulating matrices

You can add rows and columns with rbind() and cbind()



In [None]:
"Add a row"
rbind(m, c(10, 10, 10))

"Add a column"
cbind(m, c(10, 10, 10))

"join two matrices by rows"
rbind(m, m)

## Lists

Lists are like vectors, but they can **contain elements of different types**. <br>
A list can be used to store, for example, a person's features (height, weight, name, birth date...), which are of different types.<br>
Of course, lists cannot be used for algebraic operations the way vectors can.

In [None]:
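# as.Date needs an explicit format when the date is not written as year-month-day
pepe <- list(Name = "Pepe", Height = 180, Birth = as.Date("05/05/1955", format = "%d/%m/%Y"), Pets = c("gato1", "gato2", "perro1"))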

pepe

# Lists can be indexed by the dollar symbol $
"Pepes pets, by dollar"
pepe$Pets

# Or, if the element does not have a name, by position using double brackets [[ ]]
"Pepes pets, by index"
pepe[[4]]
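
Single and double brackets behave differently on lists, which is easy to get wrong. A minimal sketch, using a smaller list than the one above (the `City` element and its value are made up for illustration):

```r
person <- list(Name = "Pepe", Height = 180)

# Single brackets return a (smaller) list
print(class(person["Height"]))    # "list"

# Double brackets return the element itself
print(class(person[["Height"]]))  # "numeric"

# Elements can also be selected by name with double brackets
print(person[["Name"]])

# Assigning to a new name adds an element to the list
person$City <- "Madrid"
print(names(person))
```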

## The apply functions

A popular alternative to loops in R are the apply functions:


In [None]:
fruits <- list("mangoes", "bananas", "peaches","oranges", "apples")

"How long is each word? we can use a loop"
l <- c()
for(x in fruits)
    l <- c(l, nchar(x))
l
    
"sapply applies the nchar function on each element and returns a vector"
l <- sapply(fruits, nchar)
l


### apply's anonymous function

If the function to apply is more complex, you can define it directly inside the call:

In [None]:
fruits <- list("mangoes", "bananas", "peaches","oranges", "apples")

"check which words are of 7 characters"
sapply(fruits, function(x) nchar(x) == 7)

# The previous function is "anonymous", because it is only visible inside the apply function.
# But you can also pass a function defined elsewhere, especially if it is very long
    

*Warning: if sapply cannot simplify the result into a vector, it will return a list, just like lapply.*
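
As a sketch of the alternative mentioned in the comment above, the same check can be written as a named function and passed to sapply. The name `has_seven_chars` is hypothetical, chosen only for this example:

```r
fruits <- list("mangoes", "bananas", "peaches", "oranges", "apples")

# A named function defined outside of sapply
# (has_seven_chars is a hypothetical name chosen for this example)
has_seven_chars <- function(word) {
    nchar(word) == 7
}

# Pass the function by name, without parentheses
result <- sapply(fruits, has_seven_chars)
print(result)
```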

### the apply family

- apply iterates over the rows or columns of a matrix
- lapply iterates over a list or vector, applies a function to each element and returns a list of results
- sapply is similar to lapply but tries to simplify the result into a vector. If used carelessly it can return weird results
- vapply is similar to sapply but requires you to specify the type of the return value. It can be safer and faster than sapply
- rapply recursively iterates over nested lists: list(..., list(list(), list()...)...)
- mapply applies a function to each element of same-length vectors: mapply(sum, 1:3, 1:3, 1:3) -> 3 6 9
- tapply applies a function to groups defined by a factor (see next section)
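
A few members of the family listed above can be sketched side by side. The matrix `m` and the vector `words` are made up for this example:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
words <- c("kiwi", "fig", "banana")

# apply: MARGIN = 1 works on rows, MARGIN = 2 on columns
row_totals <- apply(m, 1, sum)
col_totals <- apply(m, 2, sum)
print(row_totals)
print(col_totals)

# lapply always returns a list
lengths_list <- lapply(words, nchar)
print(lengths_list)

# vapply returns a vector and checks that every result is a single integer
lengths_vec <- vapply(words, nchar, FUN.VALUE = integer(1))
print(lengths_vec)

# mapply walks over several vectors in parallel
sums <- mapply(sum, 1:3, 1:3, 1:3)
print(sums)
```

With `vapply`, a result of the wrong type or length raises an error instead of silently changing the shape of the output, which is why it is often considered safer than `sapply`.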

For more information: https://ademos.people.uic.edu/Chapter4.html

## Factors

Factors are categorical variables. They can be very useful when we want to group together elements of vectors, lists, data frames...




In [None]:
# a normal vector
fruits <- c("apple", "apple", "orange", "orange", "apple")

# create a factor out of it
fruits_factor <- factor(fruits)

"the vector"
fruits
"the factor"
fruits_factor



Notice the difference: "fruits" is just a vector of strings, while "fruits_factor" is a factor, a vector of categories. "Levels" shows you the categories contained in the factor.
A factor can also be created from a vector of numbers, in which case R creates the categories just as with the fruit vector. *It does not matter what type is used to create a factor: equal elements will correspond to a single category.*
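
As a small sketch of a factor built from numbers, the example below uses a made-up vector `codes`. `levels()` and `table()` are base R functions for inspecting the categories:

```r
# A vector of numbers with repeated values
codes <- c(3, 1, 3, 2, 1)
codes_factor <- factor(codes)
print(codes_factor)

# levels() lists the categories, one per distinct value
print(levels(codes_factor))

# table() counts how many elements fall in each category
print(table(codes_factor))
```

Note that the levels are stored as strings, even though the original values were numbers.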


Why do we need factors?
Factors are needed to tell R which elements belong to the same group. For example, let's see the **tapply** function:


In [None]:
# let's say we have 5 fruits....

# the vector of fruit types
fruits <- c("apple", "apple", "pineapple", "orange", "pineapple")

# the size of each fruit
sizes <- c(2, 3, 15, 4, 18)

# How can we get the average size of each TYPE of fruit?
# tapply gets the mean for each group!
tapply(X = sizes, INDEX = factor(fruits), FUN = mean)



## Example: The Monty Hall problem

<img src="montyhallproblem.png">

The Monty Hall problem is loosely based on an American television game show from the '60s. The contestant is presented with 3 doors: one hides a valuable car, and the other two hide goats. The contestant has to choose a door and try to win the prize. However:

- after choosing a door, the door is not immediately opened
- the host will instead open one of the two remaining doors, **always** revealing a goat (the host knows what is behind each door)
- the contestant is therefore left with two remaining doors, the chosen one and the alternative
- the contestant is offered the choice to keep the original door or to switch to the alternative

What do you think is the best option for the contestant? To stay or to switch? Or does it not matter? *(hint: think of a similar game but with MANY doors, one car and many goats)*

Let's try to solve the problem by simulation:

In [None]:
# iterations
N <- 100

# Solution with for loop
doors <- c("A", "B", "C")
car <- vector("character", N)  # the doors hiding the car
choice <- vector("character", N) # the doors chosen by the participant
monty <- vector("character", N) # the doors revealed by Monty
closed <- vector("character", N) # the door that is neither the choice nor Monty's
win_stay <- vector("logical", N)  # the times that we win by staying
win_switch <- vector("logical", N) # the times that we win by switching
for(i in 1:N) {
  car[i] <- sample(x = doors, size = 1)
  choice[i] <- sample(x = doors, size = 1)
  monty[i] <- sample(x = doors[!doors %in% c(choice[i], car[i])], size = 1)
  closed[i] <- doors[!doors %in% c(choice[i], monty[i])]
  win_stay[i] <- car[i] == choice[i]
  win_switch[i] <- car[i] == closed[i]
}


table(win_stay)
table(win_switch)

res <- cbind(cumsum(win_stay), cumsum(win_switch))
colnames(res) <- c("Stay", "Switch")
head(res)
matplot(y = res, col=c("red", "blue"), pch=16, ylab="Number of wins", xlab="Iteration", cex=2, cex.axis=2,  cex.lab=2)
legend(x = 10, y = 60, c("Stay", "Switch"),  col=c("red", "blue"), pch = 16, cex=2)


In [None]:
# Solution with vectorized functions
doors <- c("A", "B", "C")
car <- sample(x = doors, size = N, replace = TRUE)
choice <- sample(x = doors, size = N, replace = TRUE)
monty <- sapply(1:N, function(i) sample(x = doors[!doors %in% c(choice[i], car[i])], size = 1))
closed <- sapply(1:N, function(i) doors[!doors %in% c(choice[i], monty[i])])
win_stay <- car == choice
win_switch <- car == closed
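
One way to summarise such a simulation is to compare the proportion of wins for each strategy. The sketch below is self-contained and relies on a shortcut that is an assumption of this example rather than part of the code above: since Monty always opens a goat door, switching wins exactly when the first choice was wrong, so no `sapply` is needed. The seed and the larger `N` are arbitrary choices made only for reproducibility and stability.

```r
# Fix the random seed so the simulation is reproducible
set.seed(123)

N <- 10000
doors <- c("A", "B", "C")
car <- sample(x = doors, size = N, replace = TRUE)
choice <- sample(x = doors, size = N, replace = TRUE)

# Staying wins when the first choice was the car
win_stay <- car == choice
# Switching wins in every other case
win_switch <- car != choice

# mean() of a logical vector gives the proportion of TRUE values
print(mean(win_stay))
print(mean(win_switch))
```

The two proportions should come out at roughly 1/3 for staying and 2/3 for switching.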