# Session 1

By **Paul Rognon & Miquel Torrens i Dinarès**

*Barcelona School of Economics* – 
*Data Science Center*

March 16th, 2023

## Getting started

### What is R?

R refers to both:
- A programming language.
- A program that runs on your computer. Its purpose is to interpret the language and convert it into instructions for the computer (much like the Python interpreter does for Python).

Current release of R is version [4.2.2](https://cran.r-project.org/).

R does have a graphical interface: the R-GUI console. In R-GUI you can submit R language commands one by one.

Most of the time you will want to submit a series of commands (your code) and thus you will write a *script* with extension `.R`. For example you could write the following script and save it with extension `.R`:
```
message <- "Hello World!"
print(message)
```
Then to execute your script, open the GUI and run `source('path to your .R file')`.
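
As a minimal sketch of this workflow, the lines below write a two-line script to a temporary file and then run it with `source()`. The temporary path and the variable name `greeting` are illustrative choices, not part of the example above.

```r
# Write a small script to a temporary .R file (the path is illustrative)
script_path <- tempfile(fileext = ".R")
writeLines(c('greeting <- "Hello World!"', 'print(greeting)'), script_path)

# Run every command in the script, in order
source(script_path)
```

After `source()` finishes, the objects created by the script (here `greeting`) are available in your session.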

You can write a script with any basic text editor. However, it is more convenient to use an IDE (integrated development environment). The most popular IDE for R is [RStudio](https://posit.co/products/open-source/rstudio/). In short, RStudio is an interface with four quadrants: one for editing a script, one displaying the commands run and their printed output (the console), one showing the objects in the environment, and one that displays the working directory, plots, packages and help. [Here](https://rladiessydney.org/courses/ryouwithme/01-basicbasics-1/) is a tour of RStudio. Note that RStudio now supports Python programming too.

### R in Jupyter Notebooks and Google Colab

You can run R code in Jupyter notebooks. You can do it through the Jupyter Notebook application, see this [guide](https://docs.anaconda.com/navigator/tutorials/r-lang/).

Google Colab can also run Jupyter notebooks with R code. To create an R notebook in Colab, click this [link](https://colab.research.google.com/notebook#create=true&language=r). If you go to `Runtime`>`Change runtime type`, you will see the Runtime type is now set to 'R'.

## 1. The R Environment


### 1.1 Introduction
When you open an `R` session, a **console** is displayed, which is where all operations in `R` are run. This is built on an `R` environment that we will call the **workspace**. This is the playing field that will contain everything needed to run your code.

In an `R` workspace there are objects that can store data or functions but also operators. For example: we can create a vector `x` and use the function `mean`.

In [31]:
x <- 0:10  # This defines a numeric vector "x" with integers from 0 to 10
mean(x)  # This computes the mean of the vector

Here we displayed an object storing values, an operator, and a function. To create the object, we used the assignment operator `<-` (right-to-left). We then called a function on this object by placing it inside parentheses; the object is then taken as an *argument* of the function. We also showed how to comment your code: use `#` to let `R` know it should not run the rest of the line.

Functions can have more than one argument. Arguments let you control what the function is run on and how it is run.

In [32]:
pi  # This is a special object built into R
round(pi)  # This function rounds a number to a given number of decimal places (by default, 0)
round(pi, digits = 1)  # The argument "digits" controls the number of decimals
round(pi, digits = 4)

Some functions do not require arguments. For example, you can explore the objects in your workspace using

In [33]:
ls()

You can find out which arguments each function accepts, along with a detailed description of the function, by using the `help()` function, or by typing a `?` in front of the name of the function:

In [34]:
?mean

`R` has a wide array of general functions and operators embedded natively that allow you to do pretty much anything basic. Those functions not readily available that you might require for specific purposes will need to be either (a) installed and loaded from external packages, or (b) coded by you.

Objects storing data can be manually coded from scratch or generated by importing data, either numeric, in text, or any other format (maps, JSON, SQL, etc.).

Finally, don't forget to exit your `R` session once your job is done! Do that using the quit function

In [35]:
q()

Use the command `rm(list = ls())` instead if you don't want to leave the session but want to remove every object in it. The `rm()` function will remove what's inside its parentheses from the workspace, so use it with caution.

### 1.2 General rules to programming (with `R`)

1. **Reproducibility** is everything: so that you and others can keep track of your analysis, it is important to preserve your code. Develop and save your code in an `R` script (file with extension `.R`) or a notebook (Jupyter Notebook `.ipynb`, R Markdown `.Rmd`). Commands submitted to the console are **NOT** saved.
2. Make your code as **robust** as possible: be conservative and write safe code, taking all possibilities into account. Being lazy today will cause problems tomorrow.
3. **Comment your code thoroughly** and clarify everything: someone else might read it and will need to know what's happening (even yourself in the future).
4. Follow a **style guide** and do it **consistently**: spacing, indenting, format..., just like you do when writing natural language. Hadley Wickham, a leading developer of R solutions, published his own [style guide](http://adv-r.had.co.nz/Style.html). It's an example that you can tweak to make it your own: just stick to the basics, stay consistent and write as if it were to be read. Remember: **clean code = happy programmer**.

---
## 2. Input/Output


### 2.1 Importing functions (libraries)

*Libraries* (a.k.a. *packages*) are bundles of functions generally related to one particular topic. For example, the `xtable` package includes a set of functions to convert a table into $\LaTeX$ readable format.

In `R`, packages need to be installed prior to their use and then loaded into the workspace.

In [36]:
# Install the required package (use single or double quotation marks)
install.packages('xtable')

# Load it into your workspace
library(xtable)  # Option 1
require(xtable)  # Option 2

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In effect, `library()` and `require()` do the same thing: they make every exported function of the package available in your workspace. The only practical difference is that if the package is not installed, `library()` stops with an error, while `require()` only issues a warning and returns `FALSE`.

In [37]:
# The package "mombf" is not installed in this session
require(mombf)  # Only a warning (and it returns FALSE)
library(mombf)  # Error

Loading required package: mombf

“there is no package called ‘mombf’”


ERROR: ignored
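
If a script should keep running when a package is missing, one possible pattern is to test for the package first. The sketch below uses `requireNamespace()` from base `R`; the helper name `load_if_available` and the made-up package name are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: attach a package only if it is installed
load_if_available <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)  # "pkg" holds the name as a string
    return(TRUE)
  }
  message('Package "', pkg, '" is not installed; try install.packages("', pkg, '")')
  return(FALSE)
}

ok_stats <- load_if_available("stats")                 # ships with every R installation
ok_fake <- load_if_available("aPackageThatIsMissing")  # made-up name, not installed
```

The function returns a logical value, so the calling code can decide what to do next instead of stopping with an error.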

There are two ways to call a function from an installed package:

In [38]:
data1 <- matrix(1:9, ncol = 3, nrow = 3)
data2 <- xtable(data1)  
data2 <- xtable::xtable(data1) # This is useful when several functions have the same name in your environment
print(data1)
print(data2)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
% latex table generated in R 4.2.2 by xtable 1.8-4 package
% Wed Mar 15 20:33:51 2023
\begin{table}[ht]
\centering
\begin{tabular}{rrrr}
  \hline
 & 1 & 2 & 3 \\ 
  \hline
1 &   1 &   4 &   7 \\ 
  2 &   2 &   5 &   8 \\ 
  3 &   3 &   6 &   9 \\ 
   \hline
\end{tabular}
\end{table}


There are many packages available on repositories. It is quite likely that there is an existing package doing what you want to code. For simplicity and ease, you might just rely on existing packages. Well-developed packages can also run significantly faster than your own code. This is an important consideration when dealing with high-dimensional data.

There are nevertheless downsides to relying on packages. Since packages are generally coded by other users, they might
* change over time (affecting the time-consistency of the code)
* be more prone to errors (in particular libraries with few users)
* cause cross-platform trouble and run into replication problems.

*Comprehensive R Archive Network* (CRAN) is a repository of packages that must comply with some quality standards. However, availability on CRAN is no definitive guarantee that a package is rock solid.

### 2.2 Importing and exporting external data

Here we detail the process of importing and exporting data into an `R` session. Some popular native functions:

```
read.table()  # General function
read.csv()  # Import data from a CSV
read.csv2()  # CSV when "," is decimal point and a ";" is field separator
read.delim()  # Import from a tab-delimited text file
read.fwf()  # Read fixed width format files
readLines()  # Import lines of text
download.file()  # To download a data file from a URL
```

You need to supply the file path to these functions to load the data itself.

In [39]:
# On your personal computer
#file_path <- "Path to the file that you want to create, e.g. C:/Users/Paul/desktop/tips.csv"

# On colab
file_path <- "/content/tips.csv"
download.file("https://raw.githubusercontent.com/barcelonagse-datascience/academic_files/master/data/tips.csv",destfile=file_path)

In [40]:
tips_df <- read.csv(file_path) # tips.csv is comma separated, read.csv parses the csv file into a data frame object
tips_df[1:5,]

  total_bill   tip    sex smoker   day   time  size
       <dbl> <dbl>  <chr>  <chr> <chr>  <chr> <int>
1      16.99  1.01 Female     No   Sun Dinner     2
2      10.34  1.66   Male     No   Sun Dinner     3
3      21.01  3.50   Male     No   Sun Dinner     3
4      23.68  3.31   Male     No   Sun Dinner     2
5      24.59  3.61 Female     No   Sun Dinner     4


The `foreign` package will help you read a number of other data formats. Most commercial software packages have their own reading methods. For example, to import from Excel you can use the function `read_excel()` in the `readxl` package. The `haven` package contains functions to read output files from other popular software such as Stata, SPSS or SAS.


If you use RStudio, some of these imports are automatised. The *Import dataset* button in the Environment quadrant is an extremely helpful and intuitive way to import data.

Similarly to the reading functions, there are equivalent writing functions to save your output data tables into non-`R` generic formats: `write.csv()`,  `write.table()`, or `writeLines()`.
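
As a small sketch of how the writing and reading functions pair up, the block below saves a toy data frame to a temporary CSV file and reads it back. The data frame `scores_df` and the temporary path are made up for illustration.

```r
# A toy data frame (made-up values)
scores_df <- data.frame(id = 1:3, score = c(9.5, 7.25, 8))

# Write it to a temporary CSV file; row.names = FALSE avoids an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(scores_df, csv_path, row.names = FALSE)

# Read it back and compare with the original
scores_back <- read.csv(csv_path)
all.equal(scores_df, scores_back)
```

Leaving `row.names = TRUE` (the default) would add a column of row labels to the file, which then shows up as an additional unnamed column when the file is read again.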

### 2.3 Importing and exporting R files

Sometimes you will want to save an R object which is not a data table to an external file. There are two natural formats to produce `R` outputs: `.RData` or `.rda` (more than one object can be stored), and `.rds` (only one object).

In [41]:
# Write an object on an external file (.RData, rda, rds)
save(data1, file = 'data1.RData')  # .RData or .rda
saveRDS(data2, file = 'data2.rds')  # .rds
save.image(file = 'workspace.RData')  # This saves the entire workspace

Similarly, you might want to read a file produced in `R`. Again, there are two ways to load it, depending on the format:

In [42]:
rm(data1, data2)  # Remove objects in the workspace to prove this works
load('data1.RData')  # This loads into workspace with the name it was saved with
data2 <- readRDS('data2.rds')  # This needs an assignment
data1 <- get(load('data1.RData'))  # This is even safer, to control the names of what you're loading
print(data1)
print(data2)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
% latex table generated in R 4.2.2 by xtable 1.8-4 package
% Wed Mar 15 20:34:00 2023
\begin{table}[ht]
\centering
\begin{tabular}{rrrr}
  \hline
 & 1 & 2 & 3 \\ 
  \hline
1 &   1 &   4 &   7 \\ 
  2 &   2 &   5 &   8 \\ 
  3 &   3 &   6 &   9 \\ 
   \hline
\end{tabular}
\end{table}
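
The difference between the two formats can be sketched with temporary files: `readRDS()` returns the object so you choose its name, whereas `load()` restores objects under the names they were saved with and returns those names. The object `settings` and the scratch environment are illustrative assumptions.

```r
settings <- list(alpha = 0.05, labels = c("low", "high"))  # made-up object

# .rds: one object, named freely when reading
rds_path <- tempfile(fileext = ".rds")
saveRDS(settings, rds_path)
restored <- readRDS(rds_path)

# .RData: objects come back under their original names;
# loading into a separate environment avoids overwriting anything
rdata_path <- tempfile(fileext = ".RData")
save(settings, file = rdata_path)
scratch <- new.env()
loaded_names <- load(rdata_path, envir = scratch)
loaded_names
```

Loading into a scratch environment is one way to inspect what a file contains before bringing its objects into the workspace.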


Other relevant functions for file management (outside of your workspace):

```
dir.create()
file.create()
file.rename()
file.copy()
file.remove()
file.exists()
file.info()
```
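
A short sketch of how these functions combine, run entirely inside the session's temporary directory so nothing on your disk is touched. The folder and file names are invented for the example.

```r
# Work inside the session's temporary directory (names are illustrative)
demo_dir <- file.path(tempdir(), "file_demo")
dir.create(demo_dir, showWarnings = FALSE)

first <- file.path(demo_dir, "first.txt")
second <- file.path(demo_dir, "second.txt")

file.create(first)          # create an empty file
file.rename(first, second)  # "first.txt" becomes "second.txt"
file.copy(second, first)    # copy it back under the old name

both_exist <- file.exists(c(first, second))
size_bytes <- file.info(second)[["size"]]  # empty file

file.remove(first, second)  # clean up
none_left <- !any(file.exists(c(first, second)))
```

Each of these functions returns a logical value indicating success, which is useful for checking that a step worked before moving on.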



---
## 3. Objects and object classes

The **class** of an object determines (a) its structure and (b) what/how functions can be applied to it.

### 3.1 Common object classes

We will see some of the most common object classes in R.

1. Vectors are single-valued objects or a collection of values of the same type. Basic (‘atomic’) vector types are:

  a. `integer`

  b. `double`
  
  c. `character` (equivalent to string in other languages)

  d. `logical` (`TRUE` or `FALSE`)

  e. `complex`

Integers have no decimal part while doubles do. The integer and double types are gathered under the umbrella term `numeric`.

Note that `factors` are a special type of vector that can only take a specific set of values (known as `levels`), as in a categorical variable.


In [43]:
# Types of vectors
# [c() creates a vector: it is a function to concatenate elements]
x1 <- c(1.4, 2.3, 3.2, 4.1)  # numeric
x2 <- c(1L, 2L, 3L, 4L)  # integer
x3 <- c('A', 'B', 'C', 'A')  # character
x4 <- factor(c('A', 'B', 'C', 'A'))  # declare it as factor with levels "A", "B" and "C"
x5 <- c(TRUE, FALSE, TRUE, FALSE)  # logical

# The "class" function tells you what class an object is
class(x1)  # To obtain a boolean, try: is.numeric(x1)
class(x2)  # Try: is.integer(x2)
class(x3)  # Try: is.character(x3)
class(x4)  # Try: is.factor(x4)
class(x5)  # Try: is.logical(x5)

Some classes can be *coerced* into other classes: for example a numeric vector can be coerced into a character vector, and vice versa if the elements of the vector are digits. Numeric vectors can also be coerced into logical: zero becomes `FALSE` and any other number becomes `TRUE`. Use the functions
- `as.numeric()`,
- `as.integer()`,
- `as.character()`,
- `as.factor()`,
- `as.logical()`,

to convert objects into other classes.
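
A few coercions, as a quick sketch with made-up values. When an element cannot be converted, `R` returns `NA` (with a warning, silenced here with `suppressWarnings()`).

```r
num_to_chr <- as.character(c(1.5, 2, 3))            # "1.5" "2" "3"
chr_to_num <- as.numeric(c("10", "2.5"))            # 10 2.5
dbl_to_int <- as.integer(3.9)                       # 3: the decimal part is dropped
chr_to_lgl <- as.logical(c("TRUE", "false", "T"))   # TRUE FALSE TRUE

# Text that is not a number cannot be converted: the result is NA
partly_bad <- suppressWarnings(as.numeric(c("7", "seven")))  # 7 NA
```

Checking for `NA` values after a coercion (for example with `anyNA()`) is a cheap way to detect elements that could not be converted.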

If some class is not directly coercible into another, you may be able to use a coercible middle class to do the trick. For example, if `x` is a factor whose levels are numbers (here 4, 5 and 6), you may coerce it into numeric using `character` as an intermediate step:

In [44]:
x <- as.factor(c(4:6, 6:4))  # x is a factor
cat('* Class of "x":', class(x), '\n')
print(x)  # This will print the levels

# as.numeric() on a factor returns the internal level codes, not the labels
y <- as.numeric(x)
print(y)  # WRONG result

# Use character as a translator
z <- as.numeric(as.character(x))  # factor -> character -> numeric
cat('* Class of "z":', class(z), '\n')
print(z)

* Class of "x": factor 
[1] 4 5 6 6 5 4
Levels: 4 5 6
[1] 1 2 3 3 2 1
* Class of "z": numeric 
[1] 4 5 6 6 5 4


2. `list`

    A `list` is a generalisation of a vector. It allows you to hold different objects (possibly of different classes) in one single object, using *slots*, with one object in each slot.

In [45]:
# Create a list
a <- list('a' = x, 'b' = z) # with two "slots", with names "a" and "b"
print(a)

$a
[1] 4 5 6 6 5 4
Levels: 4 5 6

$b
[1] 4 5 6 6 5 4



3. `data.frame`

    Rectangular data structure similar to a matrix, with rows and columns, where each column can be of a different type.

In [46]:
# Create a data.frame
b <- data.frame('col1' = x, 'col2' = y, 'col3' = z)
print(b)

  col1 col2 col3
1    4    1    4
2    5    2    5
3    6    3    6
4    6    3    6
5    5    2    5
6    4    1    4


4. `matrix` and `array`

    Matrices are similar to a `data.frame`; they are less flexible (all their elements have to be of the same type) but computations with matrices are more efficient (faster). An `array` is a generalisation of a matrix (it can have more than two dimensions).


In [47]:
# Create a matrix of integers
m <- matrix(1:12, ncol = 3)
print(m)

# Create a matrix of characters
m <- matrix(rep('1',12), ncol = 3)
print(m)

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
     [,1] [,2] [,3]
[1,] "1"  "1"  "1" 
[2,] "1"  "1"  "1" 
[3,] "1"  "1"  "1" 
[4,] "1"  "1"  "1" 


These classes are coercible as well, using `as.list()`, `as.data.frame()`, and `as.matrix()`, but **BE CAREFUL** in any type of class coercion, and always inspect the coerced object to make sure the result is in the correct desired format.
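
To see why inspection matters, here is a small sketch with two made-up data frames. Because a matrix can hold only one type, a single character column turns the whole matrix into text.

```r
mixed_df <- data.frame(id = 1:3, label = c("a", "b", "c"))  # integer + character
numeric_df <- data.frame(u = 1:3, v = 4:6)                  # integers only

mixed_mat <- as.matrix(mixed_df)
numeric_mat <- as.matrix(numeric_df)

typeof(mixed_mat)        # "character": the "id" column is now text as well
is.numeric(numeric_mat)  # TRUE: all columns were numeric, so the matrix is too
str(mixed_mat)           # always inspect the result
```

Any arithmetic on `mixed_mat` would now fail, even on the column that originally held numbers.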

When you program at a more advanced level, you will encounter more classes that stem from this basic set, some of them more complex. You will also learn to create your own classes.

### 3.2 Functions applicable to a class

Functions are another type of `R` object; we will learn more about them later. It is important to emphasize here that the same function can behave differently depending on the class of the object it is applied to. For example, the function `mean()` can be applied to an object of class `numeric`, but applying it to an object of class `character` or `factor` returns `NA` with a warning:

In [48]:
x1 <- c(1, 7, 4, 8, 6, 0)
x2 <- as.factor(c('a', 'c', 'd', 'b', 'b', 'd'))
print(mean(x1))  # OK
print(mean(x2))  # Can't do it

[1] 4.333333


“argument is not numeric or logical: returning NA”


[1] NA


**Note:** characters require quotation marks, otherwise it would be as if you were creating a vector with objects `a`, `b`, `c` and `d`, which do not exist. Recall that in `R` single (`'`) and double (`"`) quotation marks are equivalent.

Other functions are applicable to both classes, but mind you they might produce different types of output:

In [49]:
cat('* Summary of object "x1"\n')
summary(x1)
cat('* Summary of object "x2"\n')
summary(x2)

* Summary of object "x1"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.750   5.000   4.333   6.750   8.000 

* Summary of object "x2"


You can find out if a function is usable on your object class with the function `methods()`, which will display all classes that a given (native) function can handle:

In [50]:
methods(summary)

 [1] summary.aov                         summary.aovlist*                   
 [3] summary.aspell*                     summary.check_packages_in_dir*     
 [5] summary.connection                  summary.data.frame                 
 [7] summary.Date                        summary.default                    
 [9] summary.ecdf*                       summary.factor                     
[11] summary.glm                         summary.infl*                      
[13] summary.lm                          summary.loess*                     
[15] summary.manova                      summary.matrix                     
[17] summary.mlm*                        summary.nls*                       
[19] summary.packageStatus*              summary.POSIXct                    
[21] summary.POSIXlt                     summary.ppr*                       
[23] summary.prcomp*                     summary.princomp*                  
[25] summary.proc_time                   summary.rlang_error*               
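
The names in that list follow the pattern `function.class`. As a sketch of what this means in practice, the same call to `summary()` is routed to a different method depending on the class of its argument. The vectors below are made up for illustration.

```r
num_vec <- c(1, 7, 4, 8, 6, 0)
fac_vec <- factor(c('a', 'c', 'd', 'b', 'b', 'd'))

num_summary <- summary(num_vec)  # numeric input: quartiles and mean
fac_summary <- summary(fac_vec)  # factor input: handled by summary.factor, counts per level

names(num_summary)
fac_summary

# A specific method can be retrieved explicitly
factor_method <- getS3method("summary", "factor")
```

You rarely need to call a method such as `summary.factor` directly: calling the generic `summary()` is enough, and `R` picks the method for you.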

### 3.3 Quick object inspection

Sometimes we have an object of a given class that we want to inspect, as in learning more about it beyond just printing it. The function `str` returns the structure of an object.

In [51]:
set.seed(666)  # This function fixes the random seed, so the "random" draws are reproducible
x <- rnorm(10000, mean = 0, sd = 1)  # 10K random values from a standard normal
y <- list(A = 1:100, B = letters[1:22], C = rnorm(10), D = c(TRUE, TRUE, FALSE))
z <- data.frame(col1 = 1:100, col2 = 101:200, col3 = 201:300)
w <- as.matrix(z)

cat('* Structure of object "x"\n')
str(x)  # This function provides the structure of the object
cat('* Structure of object "y"\n')
str(y)
cat('* Structure of object "z"\n')
str(z)

* Structure of object "x"
 num [1:10000] 0.753 2.014 -0.355 2.028 -2.217 ...
* Structure of object "y"
List of 4
 $ A: int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
 $ B: chr [1:22] "a" "b" "c" "d" ...
 $ C: num [1:10] 0.413 -0.8354 0.8998 -0.0885 -0.6267 ...
 $ D: logi [1:3] TRUE TRUE FALSE
* Structure of object "z"
'data.frame':	100 obs. of  3 variables:
 $ col1: int  1 2 3 4 5 6 7 8 9 10 ...
 $ col2: int  101 102 103 104 105 106 107 108 109 110 ...
 $ col3: int  201 202 203 204 205 206 207 208 209 210 ...


The function `summary` that we mentioned above applies to most classes.

In [52]:
print(summary(x))
print(summary(y))
print(summary(z))

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.73887 -0.65872  0.02748  0.01521  0.68787  3.65146 
  Length Class  Mode     
A 100    -none- numeric  
B  22    -none- character
C  10    -none- numeric  
D   3    -none- logical  
      col1             col2            col3      
 Min.   :  1.00   Min.   :101.0   Min.   :201.0  
 1st Qu.: 25.75   1st Qu.:125.8   1st Qu.:225.8  
 Median : 50.50   Median :150.5   Median :250.5  
 Mean   : 50.50   Mean   :150.5   Mean   :250.5  
 3rd Qu.: 75.25   3rd Qu.:175.2   3rd Qu.:275.2  
 Max.   :100.00   Max.   :200.0   Max.   :300.0  


### 3.4 Object attributes
Object attributes also give you information on objects. They are specific to classes.

Matrices and data.frames have dimensions and names

In [53]:
dim(z)  # Dimension
nrow(z)  # No. of rows
ncol(z)  # No. of columns
colnames(z)  # Names of the columns
head(rownames(z))  # Names of the rows
rownames(z) <- paste('row', 1:nrow(z), sep = '')  # All these can be changed
head(rownames(z))

Lists and vectors can also have names but no dimension (they have length).

In [54]:
names(y)
length(y)

In [55]:
names(x)  # This vector has no names, but they can be assigned
length(x)
names(x) <- paste('val', 1:length(x), sep = '')
head(names(x))

NULL

In [56]:
# For non-continuous vectors, this is a really useful function
v <- c('A', 'B', 'A', 'C', 'C', 'A', 'A', 'B', 'C')
table(v)  # Absolute frequencies (VERY USEFUL function)

v
A B C 
4 2 3 

### 3.5 Indexing
The ways to access elements of multi-valued objects depend on their class.

Vectors: use single square brackets (access either by position or by name)

In [60]:
x[1:2]
x['val1']  # same
head(x[-1])  # negative indices used to drop elements: all elements except the first
head(x[-(1:5)])  # return every element except those in 1 to 5
x[11:20][7]  # you can "double" subsets: return element 7 from the set 11 to 20

Lists: double square brackets (position or name)

In [61]:
y[[1]]  # First element of the list is a vector
y[[1]] <- y[[1]][1:10]  # Lets shorten it: keep first ten values (reassignment)
y[[1]]  # Repeat: now shorter vector
y[['A']]  # same result
y$A  # same, but NOT recommended: "partial matching" can cause errors

Why not use the "$" operator? Let's see what can happen

In [62]:
names(y) <- c('Abcd', 'Defg', 'Hijk', 'Lmno')  # Let's set new names
y[['A']]  # NULL (well done!): a slot named "A" does not exist
y$A  # Returns an element, WRONG: I call for 'A' and it gives me something with a different name

NULL
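
The same behaviour can be reproduced on a fresh, made-up list. Note that `[[` can also match partially, but only if you ask for it explicitly with `exact = FALSE`.

```r
demo_list <- list(Abcd = 1:3, Defg = c("u", "v"))  # illustrative list

exact_lookup <- demo_list[["A"]]    # NULL: no slot is called exactly "A"
partial_lookup <- demo_list$A       # 1 2 3: "$" silently matched "Abcd"

# With "[[", partial matching is opt-in
explicit_partial <- demo_list[["A", exact = FALSE]]  # 1 2 3
```

With `[[`, a typo in a name gives `NULL` rather than the content of some other slot, which makes mistakes easier to spot.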

data.frames and matrices: single square brackets with two entries separated by a comma; the first entry is for ROWS, the second for COLUMNS. Omitting an entry retrieves an entire row or column.

In [64]:
z[1, 1]  # element 1 in column 1
head(z[, 1])  # column 1
head(z[1, ])  # row 1
head(z[, 'col1'])
head(z['row1', ])

      col1  col2  col3
     <int> <int> <int>
row1     1   101   201

      col1  col2  col3
     <int> <int> <int>
row1     1   101   201
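
One detail worth knowing when you select a single column (or a single row of a matrix): by default the result is simplified to a plain vector. The sketch below, with made-up objects `demo_df` and `demo_mat`, shows how the `drop = FALSE` argument keeps the original structure.

```r
demo_df <- data.frame(col1 = 1:3, col2 = 4:6)
demo_mat <- as.matrix(demo_df)

class(demo_df[, 'col1'])                # "integer": simplified to a vector
class(demo_df[, 'col1', drop = FALSE])  # "data.frame": structure preserved

dim(demo_mat[1, ])                      # NULL: a plain vector has no dimensions
dim(demo_mat[1, , drop = FALSE])        # 1 2: still a matrix with one row
```

This matters in code that expects a matrix or data frame regardless of how many rows or columns were selected.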


You can also use "$" in data.frames to retrieve columns, but not for matrices

In [69]:
head(z$col1)  # In data.frame: NOT recommended because of partial matching again
try(head(w$col1))  # In matrix: ERROR (we introduce "try()" later on)

colnames(z) <- c('Abcd', 'Defg', 'Hijk')  # Same exercise: partial matching
#head(z[, 'A'])  # ERROR: good!
head(z$A)  # Again: it gives you something you're not asking for

Error in w$col1 : $ operator is invalid for atomic vectors


Generally, avoid the use of the `$` operator: it's shorter code but more prone to errors (code of lower quality and robustness), especially in lists and `data.frame`'s. Instead, use either position indexing or (if named) the full name in quotes.

---
## 4. Operators

We provide a list of the native `R` operators with an example:

### 4.1 Basic operators

*   **Arithmetic** operators

In [70]:
# Numeric and integer vectors
3 + 1  # addition
3 - 1  # subtraction
5 * 2  # multiplication
5 / 2  # division
5 %% 2  # remainder
5 %/% 2  # quotient
2 ** 3  # exponentiation, also: 2^3
exp(1)  # e^1 = 2.718282... (NOT an operator)

# Matrix operators
A <- matrix(1:9, ncol = 3)
B <- matrix(10:18, ncol = 3)
B_transposed <- t(B)
A * B  # elementwise multiplication (!)
A %*% B  # matrix multiplication
A %o% B  # outer product: all pairwise products, here a 3 x 3 x 3 x 3 array (NOT the same as A %*% t(B))

     [,1] [,2] [,3]
[1,]   10   52  112
[2,]   22   70  136
[3,]   36   90  162

     [,1] [,2] [,3]
[1,]  138  174  210
[2,]  171  216  261
[3,]  204  258  312

*   **Relational** operators (returning a boolean)

In [None]:
7 > 2  # greater than
7 < 2  # smaller than
7 >= 2  # greater than or equal to
7 <= 2  # smaller than or equal to
7 == 2  # equal to
7 != 2  # not equal to

*   **Logical** operators

In [None]:
(7 > 2) & (3 > 4)  # AND
(7 > 2) | (3 > 4)  # OR
! (7 > 2)  # NOT
xor(7 > 2, 3 > 4)  # exclusive OR: TRUE when exactly one of the two is TRUE

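If the elements to which we apply a logical operator are vectors (of the same length), `&` and `|` return a logical vector comparing the two input vectors elementwise. The double forms `x && y` and `x || y` are meant for single logical values, as in `if` conditions: earlier versions of `R` inspected only the first element of longer vectors (with a warning in `R` 4.2), and since `R` 4.3.0 passing longer vectors is an error.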
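
A brief sketch of the elementwise behaviour, with two made-up logical vectors. The functions `any()` and `all()` reduce a logical vector to a single value, which is often what you need in a condition.

```r
u <- c(TRUE, TRUE, FALSE)
v <- c(TRUE, FALSE, FALSE)

both <- u & v      # TRUE FALSE FALSE (elementwise AND)
either <- u | v    # TRUE TRUE FALSE  (elementwise OR)

any_both <- any(both)      # TRUE: at least one position is TRUE in both vectors
all_either <- all(either)  # FALSE: the third position is FALSE in both vectors

# The double form, used on single values
single <- (7 > 2) && (3 > 4)  # FALSE
```
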

*   **Miscellaneous** operators

In [None]:
2:8  # Integer sequence of 2 to 8
7 %in% 2:8  # logical value answering "Is 7 an element of the vector 2:8?"

Some other basic operators are already covered in other sections, like the help operator (`?`), component/slot extraction (`$` and `@`), indexing (`[` and `[[`), formula (`~`) and library access (`::`).

*   **Assignment** operators: we cover them in the next subsection.

Like in every language, operators in `R` have *precedence*, i.e. within a single expression some operators are applied before others. Think of math: in an equation we multiply before we add, so the expression `1 + 3 * 4` gives 13 instead of 16, i.e. `*` has precedence over `+`. There is a precedence order for all operators in `R`: explore the precedence list using the command `?Syntax`.
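
A few cases where precedence is easy to get wrong, as a short sketch. When in doubt, parentheses make the intended order explicit.

```r
p1 <- 1 + 3 * 4     # 13: "*" is applied before "+"
p2 <- (1 + 3) * 4   # 16: parentheses force the addition first

p3 <- -2^2          # -4: "^" is applied before the unary minus
p4 <- (-2)^2        # 4

p5 <- 1:3 * 2       # 2 4 6: ":" is applied before "*"
p6 <- 1:(3 * 2)     # 1 2 3 4 5 6
```
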

### 4.2 Assignment

In `R`, objects can be created, re-written and removed without restrictions. When an object is assigned, the expression on the right-hand side is evaluated first and its result is stored under the given name. The standard method of assignment is the use of the operator `<-`, even though there are other options. We illustrate the most common ones:

In [72]:
# Vector
rm(x)  # Remove the object from the workspace
x <- c(1, 7, 4, 8, 6, 0)

# Here the expression "mean(x)" will be evaluated 5 times
x1 <- mean(x)  # Good :-)
x2 = mean(x)  # Works, but NOT recommended: "=" is also used to pass arguments to functions
x3 <<- mean(x)  # Used only in functions to declare an object in a superior environment
assign('x4', mean(x))  # Good: useful when the object has an "unknown" name
mean(x) -> x5  # it works, but standard is to assign right to left (same with ->>)

# Test if the five objects were assigned correctly
# identical() compares exactly two objects, so compare each one against x1
all(sapply(list(x2, x3, x4, x5), identical, x1))  # It returns TRUE if all objects are equal

For more advanced assignment methods, check the functions `get()`, `with()`, and `within()`.
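
As a minimal sketch of the "unknown name" case, `assign()` and `get()` let you work with an object whose name is only known at run time. The name `result_1` is made up for illustration.

```r
# Build the object name as a string, e.g. inside a loop or from user input
obj_name <- paste0("result_", 1)    # "result_1"

assign(obj_name, mean(c(1, 7, 4)))  # same effect as: result_1 <- mean(c(1, 7, 4))
fetched <- get(obj_name)            # retrieve the object from its name

exists("result_1")                  # TRUE: the object is now in the workspace
```

In many situations a named list is a simpler alternative to creating objects with computed names.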

Multiple assignments require multiple operators:

In [74]:
obj1 <- obj2 <- 1:5
obj1
obj2
#obj1, obj2 <- 1:5  # NOT POSSIBLE

**(!) NOTE** – Although you can assign an expression using the `=` symbol, `=` is primarily the symbol used to pass arguments in function calls, and using it for assignment as well makes code ambiguous. That's why it is **not** recommended for assignment despite its widespread use. In short, refrain from using `=` for assignment.
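
The two symbols behave differently inside a function call, which can be sketched as follows. The helper `show_value` and the object name `value` are hypothetical, introduced only for this illustration.

```r
show_value <- function(value) value  # hypothetical helper: returns its argument

if (exists("value", inherits = FALSE)) rm(value)  # start from a clean slate

show_value(value = 5)   # "=" only names the argument
created_by_equals <- exists("value", inherits = FALSE)  # FALSE: nothing was created

show_value(value <- 5)  # "<-" creates "value" in the workspace, then passes it on
created_by_arrow <- exists("value", inherits = FALSE)   # TRUE
```

Inside a call, `=` binds a value to an argument while `<-` performs an actual assignment, so keeping `=` for arguments and `<-` for assignment keeps the two roles visually distinct.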

### 4.3 Reserved words

There are certain reserved words in `R` (keywords) that cannot be assigned or over-written since they are essential built-in elements. We list the most important ones, grouped by purpose:

*   Booleans: `TRUE`, `FALSE`.
*   Control flow: `if`, `else` (`if`-statements), `for`, `in`, `next` (`for`-loops), `while`, `break`, `repeat` (`while`-loops).
*   Functions: `function`, `...`.
*   Particular values: `NA`, `NULL`, `NaN`, `Inf`.

We review most of them later on.

Still, even though the remaining functions and objects in `R` can be over-written (something like `mean <- 3` is perfectly possible), refrain from doing it.
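
A small sketch of both situations. Assigning to a reserved word fails, which we capture here with `try()` so that the script keeps running. Assigning to the name of an ordinary function succeeds, and `R` can still find the function when the name is used in a call, but the workspace now holds a confusing object that is best removed.

```r
# Reserved word: the assignment is rejected
attempt <- try(eval(parse(text = "TRUE <- 0")), silent = TRUE)
was_rejected <- inherits(attempt, "try-error")  # TRUE

# Ordinary function name: the assignment is accepted
mean <- 3
still_works <- mean(c(1, 2, 3))  # a call looks for a *function* named "mean"
rm(mean)                         # remove the confusing object again
```
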

---
## 5. Functions

Functions are often used to encapsulate a sequence of expressions that need to be executed numerous times, with some explicitly specified set of parameters or *arguments* that give it flexibility. Functions are stored in the `R` environment like any other object, in this case with class `function`.

### 5.1 Coding functions

Coding functions in `R` is fairly simple. You only need three basic things: the `function` keyword, the list of arguments (if any) it requires, and a call to `return()` to specify the output (if any).

In [80]:
# Define function to compute the mean of numeric vector
average <- function(x) {  # input "arguments" inside parenthesis,
  n <- length(x)          # commands of the function inside "{" and "}"
  avg <- sum(x) / n
  return(avg)  # "return" hands the value back to whoever called "average"
}

# Use it
avg1 <- mean(1:10)
avg2 <- average(1:10)  # Here you call your first function
identical(avg1, avg2)  # They should be the same

In [81]:
# Function to compute the square of the mean
mean.sq <- function(x, ...) {  # "..." indicates: pass any other arguments to function below
  out <- mean(x, ...)^2  # "..." forwards the extra arguments to "mean", e.g. trim or na.rm
  return(out)  # "return" hands the result back to the caller
}

# Use it
avgsq <- mean.sq(1:10, na.rm = TRUE)  # "na.rm" is an argument of the inner function "mean"

In [82]:
# Function to compute the mode (the most frequent value)
the.mode <- function(x, print.it = TRUE) {  # Some arguments can have a default value
  freqs <- sort(table(x), decreasing = TRUE)  # Nested function
  value <- names(freqs)[1]
  if (print.it == TRUE) {
    cat('The mode is:', value, '\n')
  }
  return(value)
}

# Call this function
(mode1 <- the.mode(c(2, 2, 3, 4, 4, 4, 1)))  # parentheses around a call make R print the output
(mode2 <- the.mode(c(2, 2, 3, 4, 4, 4, 1), print.it = FALSE))  # change default
(mode3 <- the.mode(c('A', 'D', 'B', 'B', 'C')))

The mode is: 4 


The mode is: B 


Note the `table` function, which returns a frequency table.

In [78]:
table(obj1)

obj1
1 2 3 4 5 
1 1 1 1 1 

In [83]:
# You can create a function calling to other functions of yours
mean.and.mode <- function(x, y) {  # Multiple arguments
  obj1 <- average(x)
  obj2 <- the.mode(y)
  return(list(mean = obj1, mode = obj2))
  #return(c(obj1, obj2))  # More simple alternative: return a vector
}

# Use it
mean.and.mode(1:10, c(2, 2, 3, 4, 4, 4, 1))  # Assigning is not mandatory

The mode is: 4 


In [84]:
# Arguments and outputs are not even mandatory for a function
hello.world <- function() {  # Keep the parenthesis though!
  cat('Hello world! My coding teacher is awesome.\n')
}

# Not a very helpful function
hello.world()

Hello world! My coding teacher is awesome.


By default, if no `return()` function is used, a function will return the last expression evaluated (although this is not recommended).

**Important: what happens in a function stays in the function.** Keep in mind that functions have their own *environment*: everything inside `{` and `}` occurs in a separate environment *inferior* to the global one. This means that once you're **inside** the function, what exists in the "superior" environment (the workspace you're in) is known, but the objects created therein live in that separate space and are not poured onto the superior global environment, unless you explicitly tell them to (as one does using `return`, or *double* assignment `<<-`).
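
The following sketch illustrates the point with a made-up object `counter` and two hypothetical functions: one modifies a local copy, the other uses `<<-` to reach the superior environment.

```r
counter <- 0

bump_local <- function() {
  counter <- counter + 1   # creates a NEW local "counter"; the outer one is untouched
  return(counter)
}

bump_outer <- function() {
  counter <<- counter + 1  # modifies "counter" in the superior environment
  return(counter)
}

local_result <- bump_local()  # 1
after_local <- counter        # still 0

outer_result <- bump_outer()  # 1
after_outer <- counter        # now 1
```

Returning the value and assigning it outside the function (`counter <- bump_local()`) is usually easier to follow than `<<-`, because the change is visible where it happens.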

### 5.2 The `apply` family of functions

Sometimes you will encounter situations in which you want to apply a function repeatedly to elements of an object without writing explicit loops. The `apply` functions are efficient and elegant ways to do that. There are more `apply`-type functions, but here we present the four most common ones:

*   `apply`: used to iterate over the dimensions of a `data.frame` or `matrix` or larger `array`-type objects, the simplest case being to apply a function on each row/column
*   `lapply`: applies a function to each element of a `list` (or vector), and always outputs a `list`
*   `tapply`: summarises a vector by groups defined by another vector, i.e. say you want the group-mean of variable `x`, where in one column you have the value of `x` and the group they belong to in another
*   `sapply`: works like `lapply` but simplifies the result to a vector (or matrix) when possible

Check out other functions like `vapply` (similar to `sapply`, with a declared output type), `mapply` (for multiple parallel data structures), and `mclapply` and `mcmapply` (multi-core versions of `lapply` and `mapply` from the `parallel` package).


In [None]:
df1 <- data.frame(x = 1:10, y = 11:20, z = -10:-1)
df2 <- data.frame(x = 1:10, group = c(rep('A', 5), rep('B', 5)))
df3 <- data.frame(x = c('a', 'b', 'c', 'a'), y = c('e', 'e', 'f', 'f'))
l1 <- list(A = 1:10, B = 11:20, C = -10:-1)

# On the columns of df: find the max of each
apply(df1, 2, max)  # "2" is the dimension: in this case is columns (1 = rows)
                    # order of arguments: object, dimension, function to apply
# Check out also "sweep()": same syntax, used to "sweep out" a summary statistic

# Same operation but now the vectors are elements of a list
lapply(l1, max)  # output object is a list (here you need no dimension)

# Compute the mean of "x" as a function of "group"
tapply(df2[, 'x'], df2[, 'group'], mean)

# Similar to "lapply", but the result is simplified to a vector (here: applied to each column of df1)
sapply(df1, class)

# Check how many unique elements are in each column
apply(df3, 2, function(x) { length(unique(x)) })  # This is an "anonymous" function: it is defined inline and never given a name

The `apply` functions are formally *functionals*: functions that take another function as an argument. They are very important and frequently used in `R`, and can become quite complex quickly. I recommend the following link to Hadley Wickham's book if you wish to go deeper into the set of `apply` functions: http://adv-r.had.co.nz/Functionals.html.

### 5.4 Useful miscellaneous functions

Here I introduce a few generic functions that are very helpful in many contexts but are not explicitly introduced anywhere else in this course.

In [95]:
# Replication of elements
x <- rep(1:3, 2)  # replicate 1:3 two times
print(x)

unique(x)  # Returns unique elements in x
duplicated(x)  # Returns logical stating whether elements have been found before
rev(x)  # Return elements in x in reverse

[1] 1 2 3 1 2 3


In [86]:
x <- seq(0, 1, 0.1)  # Return a sequence from 0 to 1, with distance between elements of 0.1
y <- seq(-1e3, 1e3, 1e2)  # From -10^3 to 10^3 every 10^2
format(y, scientific = TRUE)  # This is a very flexible function with various uses
format(y, scientific = FALSE)

In [99]:
x <- c(8,10,4,-5,-2)
diff(x)  # Computes the first differences: x[i + 1] - x[i]

In [93]:
# Subdivision of vectors
x <- 1:10
y <- 3:7
z <- as.factor(c(rep('A', 3), rep('B', 2)))  # Declare character as factor

intersect(x, y)  # Returns intersection set of elements in x and y
union(x, y)  # Returns union set of elements in x or y
setdiff(x, y)  # Returns elements in x not in y

split(x, z)  # group elements in x as a function of a factor z
cut(x, breaks = 2*(0:5))  # assigns each element of x to one of the intervals delimited by the cut points in "breaks", used to "bucket" vectors
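To see how the cut points passed to `cut()` turn into groups, here is a minimal sketch with two buckets. The labels `'low'` and `'high'` and the object names `bins` and `groups` are arbitrary choices for this example:

```r
x <- 1:10

# Two intervals: (0, 5] and (5, 10]; each interval is closed on the right by default
bins <- cut(x, breaks = c(0, 5, 10), labels = c('low', 'high'))
table(bins)  # number of elements falling in each bucket

# The resulting factor can be passed directly to split()
groups <- split(x, bins)
groups
```

If `breaks` is instead a single number, `cut()` divides the range of `x` into that many intervals of equal width.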

In [131]:
# Basic checks of numeric elements
x <- 10.3
is.na(x)  # logical on whether x is NA
is.finite(x)  # logical on whether x is finite (i.e. not +Inf, -Inf, NA or NaN)
ceiling(x)  # Returns closest integer to x from above
floor(x)  # Returns closest integer to x from below

## 6. Dataset manipulation

Often you will want to access, slice or filter a dataset, i.e. clean, subset, merge and, in a nutshell, manipulate it. Here we present basic operations that allow us to perform these tasks.

### 6.1 Filtering, ordering and expanding


Selecting columns and rows with boolean filters

In [114]:
set.seed(666)  # Fixes the random seed, so the random draws below are reproducible
df <- data.frame(x = 1:10, y = rep(c('A', 'B'), 5), z = rnorm(10))

# Select rows (KEY: position of the comma)
df1 <- df[df[, 'x'] > 6, ]  # Select observations where variable x > 6
df2 <- df[df[, 'x'] > 6 & df[, 'y'] == 'A', ]  # Double condition

df1
df2

,x,y,z
,<int>,<chr>,<dbl>
7,7,A,-1.30618526
8,8,B,-0.80251957
9,9,A,-1.79224083
10,10,B,-0.04203245


,x,y,z
,<int>,<chr>,<dbl>
7,7,A,-1.306185
9,9,A,-1.792241


In [119]:
# Select columns (KEY: position of the comma)
df3 <- df[, colnames(df) != 'x']  # Select every column except "x": this is more robust than "df[, -1]" 
df3bis <- df[, colnames(df) %in% c('y', 'z')]  # Identical result using the %in% operator

df3
df3bis

y,z
<chr>,<dbl>
A,0.75331105
B,2.01435467
A,-0.35513446
B,2.02816784
A,-2.21687445
B,0.75839618
A,-1.30618526
B,-0.80251957
A,-1.79224083
B,-0.04203245


y,z
<chr>,<dbl>
A,0.75331105
B,2.01435467
A,-0.35513446
B,2.02816784
A,-2.21687445
B,0.75839618
A,-1.30618526
B,-0.80251957
A,-1.79224083
B,-0.04203245


Ordering the rows by the values of a column and reordering columns  

In [124]:
df4 <- df[order(df[, 'z'], decreasing = TRUE), ]  # Order rows by value of column "z" (highest-to-lowest)
df5 <- df[, c('z', 'y', 'x')]  # Changes order of columns

Adding columns or rows

In [127]:
df6 <- cbind.data.frame(df5, w = 11:20)  # add a column (for matrices, use cbind())
df6[, 'v'] <- 21:30  # add a new column from a new object
df7 <- rbind.data.frame(df5[1:5, ], df5[6:10, ])  # stack rows (for matrices, use rbind())
df7[11, ] <- c(11, 'B', 0)  # add a new observation; careful: c() coerces the values to character, so every column becomes <chr>
df6
df7

z,y,x,w,v
<dbl>,<chr>,<int>,<int>,<int>
0.75331105,A,1,11,21
2.01435467,B,2,12,22
-0.35513446,A,3,13,23
2.02816784,B,4,14,24
-2.21687445,A,5,15,25
0.75839618,B,6,16,26
-1.30618526,A,7,17,27
-0.80251957,B,8,18,28
-1.79224083,A,9,19,29
-0.04203245,B,10,20,30


,z,y,x
,<chr>,<chr>,<chr>
1,0.753311046217783,A,1
2,2.01435466569865,B,2
3,-0.355134460371891,A,3
4,2.02816784264222,B,4
5,-2.21687445114244,A,5
6,0.758396178001042,B,6
7,-1.3061852590117,A,7
8,-0.802519568703793,B,8
9,-1.79224083446114,A,9
10,-0.0420324540227439,B,10
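Note that every column of `df7` is shown as `<chr>`: `c(11, 'B', 0)` mixes numbers and text, so it is a character vector, and assigning it as a row converts the columns. One possible way to append a row while keeping the column types is sketched below on a small stand-in data frame (`d`, `new.row`, `d2` and `d3` are hypothetical names for this example):

```r
d <- data.frame(z = c(0.5, -1.2), y = c('A', 'B'), x = 1:2)

# Build the new observation as a one-row data.frame with matching column names
new.row <- data.frame(z = 0, y = 'B', x = 3L)
d2 <- rbind(d, new.row)
sapply(d2, class)  # column types are preserved

# For comparison: assigning a c() vector coerces every column to character
d3 <- d
d3[3, ] <- c(0, 'B', 3)
sapply(d3, class)
```

Naming the columns in `new.row` also protects you against supplying the values in the wrong order.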


Renaming variables or rows, changing the type of a column, transforming a column

In [129]:
colnames(df5) <- c('Z', 'Y', 'X')
rownames(df5) <- paste('obs', 1:nrow(df5), sep = '')
df5[, 'X'] <- as.integer(df5[, 'X'])  # overwriting is fine here: the type conversion can be undone
df5[, 'Z'] <- df5[, 'Z'] * 3  # modify an existing column
colnames(df5)
rownames(df5)

Some remarks on these operations:
*   Note the difference between `sort()` and `order()`: the former returns the values themselves, sorted; the latter returns the positions (indices) of the elements in sorted order, not the values. That is why we index rows with `order()` above.
*   One helpful function we have not looked at is `subset()`, which does filtering as we did above. It works well for simple commands, but the framework above is more flexible and can deal with complex operations. Feel free to use it though! Try: `subset(airquality, Temp > 80, select = c(Ozone, Temp))`.
*   Notice that we never modified the original object when manipulating the datasets: we always created a new object. You are allowed to overwrite existing objects, but dataset operations are **irreversible**, and you may lose information.
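The difference between `sort()` and `order()` is easy to check on a tiny vector. The sketch below also includes `rank()`, which is sometimes confused with `order()`; the vector `v` is just a toy example:

```r
v <- c(30, 10, 20)

sort(v)      # 10 20 30: the values themselves, sorted
order(v)     # 2 3 1: positions of the elements, from smallest to largest
rank(v)      # 3 1 2: the rank of each element within v
v[order(v)]  # indexing with order() reproduces sort(v)
```

This is why `df[order(df[, 'z']), ]` reorders the rows of a data frame: `order()` supplies row positions, which `sort()` could not do.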

### 6.2. Merge and join

Beyond the straightforward expansion or selection of rows and columns explained before, one may want to *merge* two datasets, e.g. when we have different data on the same set of individuals, or on different years, etc. This usually involves joining two full datasets, instead of just adding rows or columns based on some conditions.

In [130]:
set.seed(666)
df1 <- data.frame(id1 = sample(1:10, replace = FALSE), col1 = rnorm(10), col2 = rchisq(10, df = 1))
df2 <- data.frame(id2 = sample(1:10, replace = FALSE), col3 = rnorm(10), col4 = rchisq(10, df = 1))
head(df1, 5)
head(df2, 5)

# Join datasets using "match"
df3 <- cbind.data.frame(df1, df2[match(df1[, 'id1'], df2[, 'id2']), -1])  # Exclude "id" column to avoid redundancy
head(df3, 5)

# Join datasets using "merge"
colnames(df1)[1] <- 'id'
colnames(df2)[1] <- 'id'
df3 <- merge(df1, df2, by = 'id')  # This is fine but be careful if ID lists and column names are not equal
head(df3, 5)  # The original order is altered

,id1,col1,col2
,<int>,<dbl>,<dbl>
1,5,1.23528305,3.0634462
2,9,-0.08365711,0.8760082
3,4,0.25683143,2.836735
4,1,-1.07362365,0.3991052
5,2,-0.62286788,1.560704


,id2,col3,col4
,<int>,<dbl>,<dbl>
1,1,-0.9564482,0.1428781
2,8,1.1390113,1.5188775
3,10,-1.4151442,0.1644284
4,6,0.2009336,0.4742982
5,7,-1.2343732,0.1326316


,id1,col1,col2,col3,col4
,<int>,<dbl>,<dbl>,<dbl>,<dbl>
8,5,1.23528305,3.0634462,-0.3334186,0.3311042
6,9,-0.08365711,0.8760082,-0.1211411,0.9001866
7,4,0.25683143,2.836735,0.6722846,2.3237402
1,1,-1.07362365,0.3991052,-0.9564482,0.1428781
10,2,-0.62286788,1.560704,2.3682631,0.3960097


,id,col1,col2,col3,col4
,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,-1.0736237,0.3991052,-0.9564482,0.1428781
2,2,-0.6228679,1.560704,2.3682631,0.3960097
3,3,0.2849911,1.3178794,0.2514136,0.6331147
4,4,0.2568314,2.836735,0.6722846,2.3237402
5,5,1.2352831,3.0634462,-0.3334186,0.3311042


The function `match()` looks up each element of the first vector in the second vector and returns the position of its first match there (or `NA` if there is none), in the order of the first vector. It is another hall-of-famer function in `R` and is super useful in a number of contexts. Use it!
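A toy example may help to read the output of `match()`. The vectors `a` and `b` and the object `idx` are arbitrary names for this sketch:

```r
a <- c('b', 'd', 'a')
b <- c('a', 'b', 'c')

idx <- match(a, b)
idx       # 2 NA 1: 'b' is 2nd in b, 'd' is not in b, 'a' is 1st in b
b[idx]    # "b" NA "a": elements of b rearranged to follow the order of a
a %in% b  # TRUE FALSE TRUE: %in% only tells you whether there is a match
```

In the join above, `match(df1[, 'id1'], df2[, 'id2'])` plays the role of `idx`: it gives the rows of `df2` in the order of the ids in `df1`.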

Using `merge()` is also a good option, especially if you're familiar with database management, as it readily handles *inner* (default), *outer* and *cross* joins.
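The different join types are controlled through the `all`, `all.x`, `all.y` and `by` arguments of `merge()`. Here is a minimal sketch on two toy data frames (`left` and `right` are hypothetical names, and their contents are made up for the example):

```r
left  <- data.frame(id = c(1, 2, 3), a = c('x', 'y', 'z'))
right <- data.frame(id = c(2, 3, 4), b = c(TRUE, FALSE, TRUE))

inner <- merge(left, right, by = 'id')                    # inner join: only ids present in both
outer <- merge(left, right, by = 'id', all = TRUE)        # outer join: all ids, NA where data is missing
left.join <- merge(left, right, by = 'id', all.x = TRUE)  # left join: keep every row of "left"
cross <- merge(left, right, by = NULL)                    # cross join: every row of "left" with every row of "right"

inner
outer
```

Use `all.y = TRUE` for the mirror-image right join. In the cross join the two `id` columns are kept apart by suffixes, since they are not used as a key.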