# SESSION B1: R (i)

By **Miquel Torrens i Dinarès**

*Barcelona School of Economics*

*Data Science Center*

January 4, 2022

## 1. The R Environment

Complete user manual: https://cran.r-project.org/doc/manuals/r-release/R-lang.html

### 1.1 Introduction

When you open an `R` session, a **console** is displayed, which is where all operations in `R` are run. This is built on an `R` environment that we will call **workspace**. This is the playing field that will contain everything needed to run your code.

In a nutshell, in an `R` workspace there are *things* (objects) and *actions* (operators and functions). Objects are located inside the workspace, and we apply functions to these objects to obtain some desired output. For example, we can create a vector `x` (object) and compute its `mean` (function):


In [None]:
x <- 0:10  # This defines a numeric vector with integers from 0 to 10
mean(x)  # This computes the mean of the vector

Here we displayed an object, an operator, and a function. To create an object, we used the assignment operator `<-` (right-to-left), and we called a function on this object by placing it inside parentheses. We also showed how to comment your code: use `#` to let `R` know it should not run the rest of the line.

Some functions do not require an object to be called. For example, you can explore the objects in your workspace using


In [None]:
ls()

Most functions also admit **arguments**, which are specifications to control how a function should be run.

In [None]:
pi  # This is a special object built into R
round(pi)  # This function rounds a number to a given number of decimals (by default, 0)
round(pi, digits = 1)  # The argument "digits" controls the number of decimals
round(pi, digits = 4)

You can find out which arguments each function admits, along with a detailed description of the function, by using the `help()` function, or by typing a `?` in front of the name of the function:

In [None]:
?mean
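
Besides the help pages, the argument list of a function can also be inspected from the console. The lines below are a minimal sketch, using the built-in function `sd()` purely as an illustration:

```r
# args() prints the argument list of a function together with its defaults
args(sd)

# formals() returns the same information as a named list
sd_arguments <- formals(sd)
names(sd_arguments)  # "x" "na.rm"

# help() is equivalent to the "?" shortcut
# help('mean')  # NOT RUN: opens the help page
```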

`R` has a wide array of general functions and operators embedded natively that allow you to do pretty much anything basic. Those functions not readily available that you might require for specific purposes will need to be either (a) installed and loaded from external packages, or (b) coded by YOU. Like many things, coding functions can be both fun and painful.
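
As a taste of option (b), here is a minimal sketch of a user-written function. The name `range_width` and what it computes are made up for this example; it is not a function that ships with `R`.

```r
# A hypothetical helper (not part of R): the width of the range of a vector
range_width <- function(v) {
  max(v) - min(v)
}

x <- 0:10
range_width(x)  # 10
```

Once defined, `range_width` is just another object in the workspace, and it will show up in `ls()`.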

Similarly, you will not create all objects from scratch in a workspace: sometimes you will be importing data, either numeric, text, or in any other format (maps, JSON, SQL, etc.). Your workspace is just the place where you will work on them to generate some output of interest.

Finally, don't forget to exit your `R` session once your job is done! Do that using the quit function:

In [None]:
q()

Use the command `rm(list = ls())` if instead you don't want to leave the session but want to remove every object in it.

### 1.2 General rules to programming (with `R`)

1. **Reproducibility** is everything: write your code in an `R` script (file with extension `.R`), NOT in the console
2. **Comment your code thoroughly** and clarify everything: someone else (or yourself in the future) might read it and will need to know what's happening
3. Follow a **style guide** and do it **consistently**: spacing, indenting, format..., just like you do writing natural language. [The one from Google](https://google.github.io/styleguide/Rguide.html) is decent, it's an example that you can tweak to make it your own, just follow the basics and stay consistent. Remember: **clean code = happy programmer**.

## 2. Input/Output


### 2.1 Importing functions (libraries)

Libraries (called *packages* in `R`) are bundles of functions generally related to one particular topic. For example, the `xtable` package includes a set of functions to convert a table into LaTeX-readable format.

In `R`, packages need to be installed prior to their use; once installed, they can be loaded into the workspace.

In [None]:
# Install the required package (use single or double quotation marks)
install.packages('xtable')

# Load it into your workspace
library(xtable)  # Option 1
require(xtable)  # Option 2

Both commands do the same task: attach the package so that its functions can be called directly from your workspace. The only practical difference is that if the package is not installed, `library` stops with an error, while `require` just returns `FALSE` with a warning.

In [None]:
# If the package is not installed: require() only warns, library() gives an error
require(mombf)
library(mombf)
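
The difference between the two commands can be checked without crashing a script. This is a minimal sketch: the package name `notARealPackage` is made up and assumed not to be installed, and `tryCatch()` is used only to keep the error from stopping the run.

```r
# 'notARealPackage' is a made-up name, assumed NOT to be installed
pkg <- 'notARealPackage'

# require() returns FALSE (plus a warning) when the package is missing
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(loaded)

# library() throws an error, which we catch here so the script keeps running
result <- tryCatch({
  library(pkg, character.only = TRUE)
  'package loaded'
}, error = function(e) 'package not available')
print(result)
```

Because `require()` returns a logical value, it is sometimes used inside an `if` condition to decide whether a package needs to be installed first.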

There are two ways to call a function from an installed package:

In [None]:
data1 <- matrix(1:9, ncol = 3, nrow = 3)
data2 <- xtable(data1)  # NOT recommended
data2 <- xtable::xtable(data1)  # RECOMMENDED (double colon)
print(data1)
print(data2)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
% latex table generated in R 4.1.1 by xtable 1.8-4 package
% Fri Oct 29 14:42:54 2021
\begin{table}[ht]
\centering
\begin{tabular}{rrrr}
  \hline
 & 1 & 2 & 3 \\ 
  \hline
1 &   1 &   4 &   7 \\ 
  2 &   2 &   5 &   8 \\ 
  3 &   3 &   6 &   9 \\ 
   \hline
\end{tabular}
\end{table}


### 2.2 Importing and exporting external data

Here we detail the process of importing and exporting data into an `R` session. Some popular native functions:

In [None]:
# NOT RUN
read.csv()  # Import data from a CSV
read.delim()  # Import from a tab-delimited text file (there is no read.txt())
read.fwf()  # Read fixed width format files
read.table()  # General function
readLines()  # Import lines of text
download.file()  # To download a data file from a URL
# END NOT RUN

You need to supply the file path to these functions to load the data itself.

The `foreign` package will help you read a number of other data formats. Most commercial software has its own reading methods. For example, to import from Excel you need the function `read_excel()` in the `readxl` package. The `haven` package contains functions to read output files from other popular dinosaur software such as Stata, SPSS or SAS. If you use RStudio, some of these imports are automatised in the tab *Import dataset*.


Similarly to the reading functions, there are equivalent writing functions to save your output into non-`R` generic formats: `write.csv()`,  `write.table()`, or `writeLines()`.
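
A round trip through a CSV file could look like the sketch below. The data values are made up, and `tempfile()` is used only so that the example does not write into your working directory; in practice you would supply your own file path.

```r
# A small data.frame to export (made-up values)
scores <- data.frame(id = 1:3, score = c(7.5, 9, 6.25))

# tempfile() gives a throwaway path; in practice use your own file path
csv_path <- tempfile(fileext = '.csv')

write.csv(scores, file = csv_path, row.names = FALSE)  # export
scores_back <- read.csv(csv_path)  # import

print(scores_back)
all.equal(scores, scores_back)  # TRUE if the round trip preserved the data
```

Setting `row.names = FALSE` avoids writing the row names as an extra first column, which would otherwise come back as an additional variable when reading the file.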

### 2.3 Importing and exporting R files

Sometimes you will want to save an object created in `R` to an external file. There are two natural formats to produce `R` outputs: `.RData` or `.rda` (more than one object can be stored), and `.rds` (only one object).

In [None]:
# Write an object on an external file (.RData, rda, rds)
save(data1, file = 'data1.RData')  # .RData or .rda
saveRDS(data2, file = 'data2.rds')  # .rds
save.image(file = 'workspace.RData')  # This saves the entire workspace

Similarly, you might want to read a file produced in `R`. Again, there are two ways to load it, depending on the format:

In [None]:
rm(data1, data2)  # Remove objects in the workspace to prove this works
load('data1.RData')  # This loads into workspace with the name it was saved with
data2 <- readRDS('data2.rds')  # This needs an assignment
print(data1)
print(data2)

You can actually run an `R` script itself by supplying the command `source()` with the `.R` file path.
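
As a minimal sketch of how `source()` behaves, the lines below write a tiny script to a temporary file and then run it. The script content and the object name `sourced_value` are made up for the example.

```r
# Write a two-line script to a temporary file (stand-in for your own .R file)
script_path <- tempfile(fileext = '.R')
writeLines(c('sourced_value <- 21 * 2', 'cat("Script finished\\n")'), script_path)

# Run the script: the objects it creates appear in the workspace
source(script_path)
print(sourced_value)
```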

Other relevant functions:

In [None]:
print
cat
message, warning
format
sink
file.copy, file.create, file.remove, file.rename, dir.create
file.exists
file.info
download.file

## 3. Objects and object classes

The **class** of an object determines (a) its structure and (b) what/how functions can be applied to it. 

### 3.1 Common object classes

Here we summarise some of the most common object classes:

1. Vectors: essentially three types (but not only)

  a. `numeric` (float or double) and `integer`

  The difference is that integers hold whole numbers only, and so are less demanding in terms of memory (4 bytes per element instead of 8 for a double).

  b. `character` and `factor`
  
  Factors look like characters but can only take a specific set of values (known as `levels`), as in a categorical variable.

  c. `logical`
  
  Booleans, they can only be `TRUE` or `FALSE`.

2. `list`

  They allow you to have different objects (maybe of different classes) in one single object, using *slots*, with one object in each slot.

3. `data.frame`

  Rectangular data structure similar to a matrix, with rows and columns, where each column can be of a different type.
  
4. `matrix` and `array`

  Matrices are similar to `data.frame`s but less flexible: all their elements have to be of the same class. On the other hand, computations with matrices are more efficient. Arrays are just matrices with potentially more dimensions.


In [None]:
# Types of vectors
# [c() creates a vector: it is a function to concatenate elements]
x1 <- c(1.4, 2.3, 3.2, 4.1)  # numeric
x2 <- c(1L, 2L, 3L, 4L)  # integer
x3 <- c('A', 'B', 'C', 'A')  # character
x4 <- factor(c('A', 'B', 'C', 'A'))  # declare it as factor with levels "A", "B" and "C"
x5 <- c(TRUE, FALSE, TRUE, FALSE)  # logical

# The "class" function tells you what class an object is
class(x1)  # To obtain a boolean, try: is.numeric(x1)
class(x2)  # Try: is.integer(x2)
class(x3)  # Try: is.character(x3)
class(x4)  # Try: is.factor(x4)
class(x5)  # Try: is.logical(x5)
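
To see the memory difference between integers and doubles, one can compare two vectors of the same length. This is a rough sketch: the object names are made up, and the sizes reported by `object.size()` are approximate.

```r
n <- 1e6
as_integer <- integer(n)  # one million integers (all zero)
as_double <- numeric(n)   # one million doubles (all zero)

typeof(as_integer)  # "integer"
typeof(as_double)   # "double"

# object.size() reports the approximate memory used by an object
print(object.size(as_integer), units = 'MB')
print(object.size(as_double), units = 'MB')
```

The double vector takes roughly twice the space of the integer one.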

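Some classes can be *coerced* into other classes: for example, a numeric vector can be coerced into a character vector, and vice versa if the elements of the character vector are numbers. Numeric vectors can also be coerced into logical (`0` becomes `FALSE`, any other number becomes `TRUE`), but a character vector such as `'abc'` cannot be turned into a number: the result is `NA` with a warning. Use the functions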
- `as.numeric()`,
- `as.integer()`,
- `as.character()`,
- `as.factor()`,
- `as.logical()`,

to convert objects into other classes.
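
A few coercions in action, as a minimal sketch with made-up values. `suppressWarnings()` is used only to silence the warning that `R` issues when a value cannot be converted.

```r
as.character(c(1.5, 2, 3))   # numeric -> character: "1.5" "2" "3"
as.numeric(c('10', '2.5'))   # character -> numeric: 10.0 2.5
as.integer(c(TRUE, FALSE))   # logical -> integer: 1 0

# Text that is not a number cannot be converted: it becomes NA
bad <- suppressWarnings(as.numeric(c('7', 'seven')))
print(bad)  # 7 NA
```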

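A direct coercion does not always give what you expect, but you may be able to use an intermediate class to do the trick. For example, if `x` is a factor with values in 4, 5, 6, `as.numeric()` returns the internal level codes (1, 2, 3) instead of the values; to recover the values, go through `character`: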

In [None]:
x <- as.factor(c(4:6, 6:4))  # x is a factor
cat('* Class of "x":', class(x), '\n')
print(x)  # This will print the levels

# Direct coercion returns the internal level codes, not the values
y <- as.numeric(x)
print(y)  # WRONG result: 1 2 3 3 2 1

# Use character as a translator
z <- as.numeric(as.character(x))  # factor -> character -> numeric
cat('* Class of "z":', class(z), '\n')
print(z)

As for the rest of object classes, here's how to create them:

In [None]:
# Create a list
a <- list('a' = x, 'b' = z) # with two "slots", with names "a" and "b"
print(a)

# Create a data.frame
b <- data.frame('col1' = x, 'col2' = y, 'col3' = z)
print(b)

# Create a matrix
m <- matrix(1:12, ncol = 3)
print(m)

These classes are coercible as well, using `as.list()`, `as.data.frame()`, and `as.matrix()`, but **BE CAREFUL** in any type of class coercion, and always inspect the coerced object to make sure the result is in the correct desired format.
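
One typical surprise, sketched with a made-up `data.frame`: since all elements of a matrix must share one class, converting a `data.frame` with mixed columns silently turns the numbers into text.

```r
mixed <- data.frame(id = 1:3, label = c('a', 'b', 'c'))

mixed_matrix <- as.matrix(mixed)
print(mixed_matrix)
class(mixed_matrix[, 'id'])  # "character": the numbers became text

# Converting only the numeric column keeps the numbers as numbers
numbers_only <- as.matrix(mixed[, 'id', drop = FALSE])
class(numbers_only[, 'id'])  # "integer"
```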

In more advanced levels you will encounter more classes that stem from this basic set, and more complex, such as [`S4` objects](http://adv-r.had.co.nz/S4.html). You will also learn to create your own classes.

### 3.2 Functions applicable to a class

Functions are applied to specific classes. For example, the function `mean()` can be applied to an object of class `numeric`, but applying it to an object of class `character` or `factor` returns `NA` with a warning:

In [None]:
x1 <- c(1, 7, 4, 8, 6, 0)
x2 <- as.factor(c('a', 'c', 'd', 'b', 'b', 'd'))
print(mean(x1))  # OK
print(mean(x2))  # Can't do it

[1] 4.333333


“argument is not numeric or logical: returning NA”


[1] NA


(Note: characters require quotation marks, otherwise it would be as if you were creating a vector with objects `a`, `b`, `c` and `d`, which do not exist.)

Other functions are applicable to both classes, but mind you they might produce different types of output:

In [None]:
cat('* Summary of object "x1"\n')
summary(x1)
cat('* Summary of object "x2"\n')
summary(x2)

* Summary of object "x1"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.750   5.000   4.333   6.750   8.000 

* Summary of object "x2"


You can find out if a function is usable on your object class with the function `methods()`, which will display all classes that a given (native) function can handle:

In [None]:
methods(summary)

### 3.3 Quick object inspection

Sometimes you have an object of a given class that you want to inspect.

In [None]:
set.seed(666)  # This function fixes the random seed, so results are reproducible
x <- rnorm(10000, mean = 0, sd = 1)  # 10K random values from a standard normal
y <- list(A = 1:100, B = letters[1:22], C = rnorm(10), D = c(TRUE, TRUE, FALSE))
z <- data.frame(col1 = 1:100, col2 = 101:200, col3 = 201:300)
w <- as.matrix(z)

cat('* Structure of object "x"\n')
str(x)  # This function provides the structure of the object
cat('* Structure of object "y"\n')
str(y)
cat('* Structure of object "z"\n')
str(z)

The function `summary` above applies to most classes, and some other quick-look functions are also useful.

In [None]:
x <- round(x, 4)  # For display purposes

head(x)  # Provide first values
head(x, n = 10)
tail(x)  # Provide last values
x[1]
x[7287]  # Provide value no. 7,287
x[10001]  # Returns NA: element 10K+1 doesn't exist

Note that indexing in `R` is as in natural language: it starts at `1` (Python is at `0`).
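
A quick illustration with a made-up vector, including what happens with index `0`:

```r
v <- c(10, 20, 30)
v[1]          # 10: the first element
v[length(v)]  # 30: the last element
v[0]          # numeric(0): index 0 selects nothing, it is not "the first"
```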


Different classes have different attributes that can be inspected:


In [None]:
# Matrices and data.frames have dimensions and they have names
dim(z)  # Dimension
nrow(z)  # No. of rows
ncol(z)  # No. of columns
colnames(z)  # Names of the columns
head(rownames(z))  # Names of the rows
rownames(z) <- paste('row', 1:nrow(z), sep = '')  # All these can be changed
head(rownames(z))

# Lists can also have names but no dimension (they have length)
names(y)
length(y)

# So do vectors
names(x)  # This vector has no names, but they can be assigned
length(x)
names(x) <- paste('val', 1:length(x), sep = '')
head(names(x))

# For non-continuous vectors, this is a really useful function
v <- c('A', 'B', 'A', 'C', 'C', 'A', 'A', 'B', 'C')
table(v)  # Absolute frequencies

and different operators to access their elements:

In [None]:
# Vectors: single brackets (access either by index or by name)
x[1]
x['val1']  # same
x[-1]  # negative indices used to drop elements: all elements except the first
x[-(1:5)]  # return every element except those in 1 to 5
x[11:20][7]  # you can "double" subsets: return element 7 from the set 11 to 20

# Lists: double brackets (index or name)
y[[1]]  # First element of the list is a vector
y[[1]] <- y[[1]][1:10]  # Let's shorten it: keep first ten values (reassignment)
y[[1]]  # Repeat: now shorter vector
y[['A']]  # same result
y$A  # same, but NOT recommended: "partial matching" can cause errors

# Why not use the "$" operator? Let's see what can happen
names(y) <- c('Abcd', 'Defg', 'Hijk', 'Lmno')  # Let's set new names
y[['A']]  # NULL (well-done!): a slot named "A" does not exist
y$A  # Returns an element, WRONG: I call for 'A' and it gives me something with a different name

# data.frames and matrices: ROWS before *comma*, COLUMNS after
head(z[, 1])  # column 1
head(z[1, ])  # row 1
head(z[1, 1])  # element 1 in column 1
head(z[, 'col1'])
head(z['row1', ])

# Operator "$" works also for data.frames, but not for matrices
head(z$col1)  # In data.frame: NOT recommended
try(head(w$col1))  # In matrix: ERROR

colnames(z) <- c('Abcd', 'Defg', 'Hijk')  # Same exercise: partial matching
head(z[, 'A'])  # ERROR: good!
head(z$A)  # Again: it gives you something you're not asking for

Generally, avoid the use of the `$` operator: it's shorter code but more prone to errors (code of lower quality and robustness), especially in lists and `data.frame`s. Instead, use either indexing or (if named) the full name in quotes.

## 4. Operators

Operators are functions in the form of symbols to perform specific mathematical or logical computations. We provide a list of the native `R` operators with an example:

### 4.1 Basic operators

*   **Arithmetic** operators

In [None]:
# Numeric and integer vectors
3 + 1  # addition
3 - 1  # subtraction
5 * 2  # multiplication
5 / 2  # division
5 %% 2  # remainder
5 %/% 2  # quotient
2 ** 3  # exponentiation, also: 2^3

# Matrix operators
A <- matrix(1:9, ncol = 3)
B <- matrix(10:18, ncol = 3)
A * B  # elementwise multiplication (!)
A %*% B  # matrix multiplication
A %o% B  # outer product

*   **Relational** operators (returning a boolean)

In [None]:
7 > 2  # greater than
7 < 2  # less than
7 >= 2  # greater than or equal to
7 <= 2  # less than or equal to
7 == 2  # equal to
7 != 2  # not equal to

*   **Logical** operators (test if a statement is true)

In [None]:
(7 > 2) & (3 > 4)  # AND
(7 > 2) | (3 > 4)  # OR
! (7 > 2)  # NOT
xor(7 > 2, 3 > 4)  # exclusive OR: TRUE if exactly one is TRUE (a function, not an operator)

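If the elements to which we apply a logical operator are vectors (of the same length), the operator returns a logical vector comparing the two inputs elementwise. The double forms, `x && y` and `x || y`, are meant for single values, as in the condition of an `if`-statement: in older versions of `R` they inspected only the first element of each vector, and since `R` 4.3.0 they raise an error if an input has more than one element.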
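
The following sketch, with two made-up logical vectors `p` and `q`, shows the elementwise operators and how to reduce their result to a single value with `all()` and `any()`:

```r
p <- c(TRUE, TRUE, FALSE)
q <- c(TRUE, FALSE, FALSE)

p & q  # elementwise AND: TRUE FALSE FALSE
p | q  # elementwise OR: TRUE TRUE FALSE

# Collapse a logical vector into a single value with all() and any()
all(p & q)  # FALSE: not every pair is TRUE/TRUE
any(p & q)  # TRUE: at least one pair is

# The double forms are meant for single values, e.g. in an if-condition
if (length(p) == 3 && all(!is.na(p))) {
  message('p has three non-missing values')
}
```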

*   **Miscellaneous** operators

In [None]:
2:8  # Integer sequence of 2 to 8
7 %in% 2:8  # Is an element inside of a vector

Some other basic operators are already covered in other sections, like the help operator (`?`), component/slot extraction (`$` and `@`), indexing (`[` and `[[`), formula (`~`) and library access (`::` or `:::`).

*   **Assignment** operators: we cover them in the next subsection.

Like in every language, operators in `R` have *precedence*, i.e. in the same expression some operators are applied before others. Think of math: in an equation we multiply before we add, thus the expression `x <- 3 * 4 + 1` results in 13 instead of 15. Explore the precedence list in `R` using the command `?Syntax`.
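
A few cases where precedence matters in practice, as a short sketch; when in doubt, parentheses make the intended order explicit:

```r
3 * 4 + 1    # 13: multiplication before addition
3 * (4 + 1)  # 15: parentheses override precedence
-2^2         # -4: exponentiation before the unary minus
(-2)^2       # 4
1:3 * 2      # 2 4 6: the sequence operator ":" before multiplication
1:(3 * 2)    # 1 2 3 4 5 6
```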

### 4.2 Assignment

In `R`, objects can be created, re-written and removed without restrictions. Once an object is declared, the expression we are assigning to it is evaluated. The standard method of assignment is the use of the operator `<-`, even though there are other options. We illustrate the most common ones:

In [None]:
# Vector
rm(x)  # Remove the object from the workspace
x <- c(1, 7, 4, 8, 6, 0)

# Here the expression "mean(x)" will be evaluated 5 times
x1 <- mean(x)  # Good :-)
x2 = mean(x)  # NOT recommended: "=" works here, but it is meant for function arguments
x3 <<- mean(x)  # Used only in functions to declare an object in a superior environment
assign('x4', mean(x))  # Good: useful when the object has an "unknown" name
mean(x) -> x5  # Good, but standard is to assign right to left (same with ->>)

# Test if the five objects were assigned correctly
# [identical() compares two objects at a time, so we check each one against x1]
all(sapply(list(x2, x3, x4, x5), identical, x1))  # TRUE if all objects are equal

For more advanced assignment methods, I recommend you also check the functions `get()`, `with()`, and `within()`.
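
A minimal sketch of `assign()`, `get()` and `with()` working together. The object names (`avg_1`, `avg_2`, `avg_3`, `scores`) and the values are made up for the example.

```r
# Create objects "avg_1", "avg_2", "avg_3" whose names are built at run time
for (k in 1:3) {
  object_name <- paste('avg', k, sep = '_')
  assign(object_name, mean(1:(10 * k)))
}

# get() retrieves an object from its name stored as a character string
get('avg_2')  # mean of 1:20, i.e. 10.5

# with() evaluates an expression using the columns of a data.frame as objects
scores <- data.frame(maths = c(6, 8, 10), physics = c(7, 5, 9))
with(scores, mean(maths - physics))  # 1
```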

Multiple assignments require multiple operators:

In [2]:
obj1 <- obj2 <- 1:5
#obj1, obj2 <- 1:5  # NOT POSSIBLE


**(!) NOTE** – Although you can assign an expression using the `=` symbol, `=` is also the symbol used to pass arguments to functions, so using it for assignment makes code ambiguous to read. That's why it is **not** recommended despite its widespread use. In short, refrain from using `=` for assignment.

### 4.3 Reserved words

There are certain reserved words in `R` (keywords) that cannot be assigned or over-written since they are essential built-in elements. We list the most important ones, grouped by what they are used for:

*   Booleans: `TRUE`, `FALSE`.
*   Control flow: `if`, `else` (`if`-statements), `for`, `in` (`for`-loops), `while`, `repeat` (other loops), `next`, `break` (to skip an iteration or exit a loop).
*   Functions: `function`, `...`.
*   Particular values: `NA`, `NULL`, `NaN`, `Inf`.

We review most of them later on.

Still, even though the rest of the functions and objects in `R` can be over-written (stuff like `mean <- 3` is technically possible), refrain from doing it.
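
The difference between a reserved word and a merely built-in object can be checked directly. This is a minimal sketch: `tryCatch()` is used only so the failed assignment does not stop the run, and `T` is the built-in shorthand for `TRUE`, which is not reserved.

```r
# Reserved words cannot be assigned to: R raises an error
outcome <- tryCatch({
  eval(parse(text = 'TRUE <- 0'))
  'assignment worked'
}, error = function(e) 'assignment refused')
print(outcome)  # "assignment refused"

# "T" is only a regular object that equals TRUE by default: it CAN be overwritten
T <- 0
isTRUE(T)  # FALSE now, which is why TRUE/FALSE are safer than T/F
rm(T)      # remove our copy: T refers to the built-in value again
isTRUE(T)  # TRUE
```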

## 5. Functions

### 5.1 Coding functions

### 5.2 The `apply` family of functions

### 5.3 Useful miscellaneous functions

## 6. Dataset manipulation: subsetting and merging

## 7. Control flows

## 8. Stats and math

## 9. Character manipulation

* grep, grepl, agrep, substr, gsub
* Regular expressions


## 10. Basic data visualization

## 11. Popular packages

* data.table/dplyr/ggplot/shiny

## 12. Relevant topics

* Dates, web scraping, visualization?