In [1]:
options(jupyter.rich_display = FALSE)

# Week 8: Fundamentals of R Programming I

## POP77001 Computer Programming for Social Scientists

### Tom Paskhalis

##### 1 November 2021

##### Module website: [bit.ly/POP77001](https://bit.ly/POP77001)

## Overview

- R objects and operators
- Data structures and types
- Indexing and subsetting
- Attributes

## R background

<table>
    <tr>
        <td><img width="200" height="100" src='../imgs/ihaka_gentleman.jpg'></td>
        <td><img width="200" height="100" src='../imgs/r_logo.png'></td>
    </tr>
</table>

Source: [University of Auckland](https://www.stat.auckland.ac.nz/2008/ihaka-pickering/), [R Project](https://www.r-project.org/)

- S (for **s**tatistics) is a programming language for statistical analysis developed in 1976 at AT&T Bell Labs
- The original S language and its extension S-PLUS were closed source
- In 1991 **R**oss Ihaka and **R**obert Gentleman began developing R, an open-source alternative to S

## R release names (v. 4.1.1 -- "Kick Things")

<div style="text-align: center;">
    <img width="700" height="700" src="../imgs/r_peanuts.jpeg">
</div>

Source: [Twitter](https://twitter.com/WomenInStat/status/1449462277581721601)  
Extra: [More on historical R release names](https://livefreeordichotomize.com/2018/04/23/r-release-names/)

## R basics

- R is an *interpreted* language (like Python and Stata)
- It is geared towards statistical analysis
- R is often used for interactive data analysis (one command at a time)
- But it can also execute entire scripts in *batch* mode

In [2]:
print("Hello World!")

[1] "Hello World!"


## Operators

Key *operators* (*infix* functions) in R are:

- Arithmetic (`+`, `-`, `*`, `^`, `/`, `%/%`, `%%`, `%*%`) 
- Boolean (`&`, `&&`, `|`, `||`, `!`)
- Relational (`==`, `!=`, `>`, `>=`, `<`, `<=`)
- Assignment (`<-`, `<<-`, `=`)
- Membership (`%in%`)
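
A few of the less familiar operators from the list above, as a minimal sketch. The variable names (`elementwise`, `short_circuit`, `inner`) are illustrative only.

```r
# Element-wise `&` compares two vectors position by position
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)    # TRUE FALSE

# `&&` works on single values and stops as soon as the result is known,
# so the call to stop() on the right-hand side is never evaluated
short_circuit <- FALSE && stop("never reached")  # FALSE

# `%*%` is matrix multiplication; for two plain vectors of equal length
# it returns their inner product as a 1 x 1 matrix
inner <- c(1, 2, 3) %*% c(4, 5, 6)               # 32
```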


## Basic mathematical operations in R

In [3]:
1 + 1

[1] 2

In [4]:
5 - 3

[1] 2

In [5]:
6 / 2

[1] 3

In [6]:
4 * 4

[1] 16

In [7]:
## Exponentiation, note that 2 ** 4 also works, but is not recommended
2 ^ 4

[1] 16

## Advanced mathematical operations in R

In [8]:
# Integer division, equivalent to Python's `//`
7 %/% 3

[1] 2

In [9]:
# Modulo operation (remainder of division), equivalent to Python's `%`
7 %% 3

[1] 1
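
With negative operands the two operators behave like their Python counterparts: the quotient is rounded down and the remainder takes the sign of the divisor. A small sketch with illustrative variable names:

```r
x <- -7
y <- 3

quotient  <- x %/% y   # -3 (rounded towards negative infinity)
remainder <- x %% y    # 2  (same sign as the divisor)

# The two results always reassemble into the original number
reassembled <- quotient * y + remainder   # -7
```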

## Basic logical operations in R

In [10]:
3 != 1 # Not equal

[1] TRUE

In [11]:
3 > 3 # Greater than

[1] FALSE

In [12]:
FALSE | TRUE # True if either first or second operand is True, False otherwise

[1] TRUE

In [13]:
F | T # R also treats F and T as Boolean, but it is not recommended due to poor legibility

[1] TRUE

In [14]:
3 > 3 | 3 >= 3 # Combining 2 relational expressions with a Boolean operator

[1] TRUE

## Assignment operations

- `<-` is the standard assignment operator in R
- While `=` is also supported, it is not recommended
- It hides the difference between `<-` and `<<-` (deep assignment)

In [15]:
x <- 3
x

[1] 3

In [16]:
x <- 3
f <- function() {
    x <<- 1 # Modifies the existing variable in an enclosing environment (or creates a new global variable)
}
f()
x

[1] 1

## Membership operations

Operator `%in%` returns `TRUE` if an object on the left side is in a sequence on the right.

In [17]:
"a" %in% "abc" # Note that R strings are not sequences

[1] FALSE

In [18]:
3 %in% c(1, 2, 3) # c(1, 2, 3) is a vector

[1] TRUE

In [19]:
!(3 %in% c(1, 2, 3))

[1] FALSE
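
`%in%` is vectorised over its left-hand side, so it returns one logical value per element. The sketch below uses a hypothetical vector of survey responses (`answers`) to show how this can be used for filtering.

```r
# Hypothetical survey responses
answers <- c("yes", "maybe", "no")

# One TRUE/FALSE per element of `answers`
valid <- answers %in% c("yes", "no")   # TRUE FALSE TRUE

# Negate with `!` to keep only the values that are not in the set
invalid_answers <- answers[!valid]     # "maybe"
```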

## Data structures

Base R data structures can be classified along their *dimensionality* and *homogeneity*

5 main built-in data structures in R:

- Atomic vector (`vector`)
- Matrix (`matrix`)
- Array (`array`)
- List (`list`)
- Data frame (`data.frame`)

## Summary of data structures in R

| Structure    | Description                       | Dimensionality   | Data Type     |
|:-------------|:----------------------------------|:-----------------|:--------------|
| `vector`     | Atomic vector (scalar)            | 1d               | homogeneous   |
| `matrix`     | Matrix                            | 2d               | homogeneous   |
| `array`      | One-, two- or n-dimensional array | 1d/2d/nd         | homogeneous   |
| `list`       | List                              | 1d               | heterogeneous |
| `data.frame` | Rectangular data                  | 2d               | heterogeneous |
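
As a rough illustration of the table, the sketch below builds one object of each kind and inspects its dimensions and element types. All object names (`vec`, `mat`, `arr`, `lst`, `dfr`) are made up for this example.

```r
vec <- c(1.5, 2.5, 3.5)                     # atomic vector
mat <- matrix(1:6, nrow = 2)                # 2 x 3 matrix
arr <- array(1:24, dim = c(2, 3, 4))        # 3-dimensional array
lst <- list(1L, "a", TRUE)                  # list with mixed types
dfr <- data.frame(id = 1:2, name = c("a", "b"),
                  stringsAsFactors = FALSE) # data frame with mixed columns

vec_dim <- dim(vec)   # NULL: a plain vector has no dim attribute
mat_dim <- dim(mat)   # 2 3
arr_dim <- dim(arr)   # 2 3 4

# A data frame is a list whose columns can differ in type
dfr_is_list <- is.list(dfr)          # TRUE
col_types   <- sapply(dfr, typeof)   # "integer" "character"
```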


## Vectors

- *Vector* is the core building block of R
- R has no scalars (they are just vectors of length 1)
- Vectors can be created with `c()` function (short for **c**ombine)

In [20]:
v <- c(8, 10, 12)
v

[1]  8 10 12

In [21]:
v <- c(v, 14) # Vectors are always flattened (even when nested)
v

[1]  8 10 12 14

## Data types

4 common data types that are contained in R structures:

- Character (`character`)
- Integer (`integer`)
- Double/numeric (`double`/`numeric`)
- Logical/boolean (`logical`)

## Character vector

In [22]:
char_vec <- c("apple", "banana", "watermelon")

In [23]:
char_vec

[1] "apple"      "banana"     "watermelon"

In [24]:
# length() function gives the length of an R object (analogous to Python's len())
length(char_vec) 

[1] 3

## Integer vector

In [25]:
# Note the 'L' suffix to make sure you get an integer rather than double
int_vec <- c(300L, 200L, 4L)

In [26]:
int_vec

[1] 300 200   4

In [27]:
# typeof() function returns the type of an R object (analogous to Python's type())
typeof(int_vec)

[1] "integer"

## Numeric vector

In [28]:
# Note that even without a decimal part R treats these numbers as doubles
num_vec <- c(150, 120, 3000)

In [29]:
num_vec

[1]  150  120 3000

In [30]:
typeof(num_vec)

[1] "double"

## Logical vector

In [31]:
log_vec <- c(FALSE, FALSE, TRUE)
log_vec

[1] FALSE FALSE  TRUE

In [32]:
# While more concise, using T/F instead of TRUE/FALSE can be confusing
log_vec2 <- c(F, F, T)
log_vec2

[1] FALSE FALSE  TRUE

In [33]:
typeof(log_vec)

[1] "logical"

## Type coercion in vectors

- All elements of a vector must be of the same type
- If you try to combine vectors of different types, their elements will be *coerced* to the most flexible type

In [34]:
# Note that logical values get coerced to 0/1 for FALSE/TRUE
c(num_vec, log_vec)

[1]  150  120 3000    0    0    1

In [35]:
c(char_vec, int_vec)

[1] "apple"      "banana"     "watermelon" "300"        "200"       
[6] "4"         

In [36]:
# If no natural way of type conversion exists, NAs are introduced
as.numeric(char_vec)

“NAs introduced by coercion”


[1] NA NA NA
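
The ordering from least to most flexible type is logical, integer, double, character. The sketch below checks each step with `typeof()` and shows a common use of logical-to-numeric coercion; the variable names are illustrative.

```r
# Each combination is coerced to the more flexible of the two types
lgl_int <- typeof(c(TRUE, 1L))    # "integer"
int_dbl <- typeof(c(1L, 1.5))     # "double"
dbl_chr <- typeof(c(1.5, "a"))    # "character"

# Because TRUE/FALSE become 1/0, sum() counts and mean() gives a proportion
n_true    <- sum(c(TRUE, FALSE, TRUE))          # 2
prop_true <- mean(c(TRUE, FALSE, TRUE, TRUE))   # 0.75
```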

## NA and NULL values

- In Python we encountered the `None` value
- R makes a distinction between:
    - `NA` - value exists, but is unknown (e.g. survey non-response)
    - `NULL` - object does not exist
- `NA`s are defined for each data type (integer, character, numeric, etc.)

## NA and NULL example

In [37]:
na <- c(NA, NA, NA)
na

[1] NA NA NA

In [38]:
length(na)

[1] 3

In [39]:
null <- c(NULL, NULL, NULL)
null

NULL

In [40]:
length(null)

[1] 0
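
A short sketch of how the two kinds of missingness behave in practice. The `income` vector is a hypothetical survey variable with one non-response.

```r
# NA has a typed variant for each data type
na_types <- c(typeof(NA), typeof(NA_integer_),
              typeof(NA_real_), typeof(NA_character_))
# "logical" "integer" "double" "character"

# Hypothetical survey variable with one non-response
income <- c(30, NA, 50)

missing_flags <- is.na(income)              # FALSE TRUE FALSE
mean_with_na  <- mean(income)               # NA: missingness propagates
mean_observed <- mean(income, na.rm = TRUE) # 40

# Comparing with `==` does not work for NA, use is.na() instead
na_comparison <- NA == NA                   # NA

# NULL simply vanishes when combined with other values
n_elements <- length(c(1, NULL, 2))         # 2
```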

## Vector indexing and subsetting

- Indexing in R starts from **1** (as opposed to 0 in Python)
- To subset a vector, use `[]` to index the elements you would like to select:

```
vector[index]
```

In [41]:
num_vec[1]

[1] 150

In [42]:
num_vec[c(1,3)]

[1]  150 3000

## Summary of vector subsetting

| Value             | Example             | Description                                                   |
|:------------------|:--------------------|:--------------------------------------------------------------|
| Positive integers | `v[c(3, 1)]`        | Returns elements at specified positions                       |
| Negative integers | `v[-c(3, 1)]`       | Omits elements at specified positions                         |
| Logical vectors   | `v[c(FALSE, TRUE)]` | Returns elements where corresponding logical value is `TRUE`  |
| Character vector  | `v[c("c", "a")]`    | Returns elements with matching names (only for named vectors) |
| Nothing           | `v[]`               | Returns the original vector                                   |
| 0 (Zero)          | `v[0]`              | Returns a zero-length vector                                  |
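
Each row of the table can be tried on a small named vector. The vector `v_named` below is a hypothetical example, chosen so that the character-based row also works.

```r
v_named <- c(a = 8, b = 10, c = 12)

by_position <- v_named[c(3, 1)]         # c = 12, a = 8
without     <- v_named[-c(3, 1)]        # b = 10
by_logical  <- v_named[c(FALSE, TRUE)]  # index recycled to F, T, F: b = 10
by_name     <- v_named[c("c", "a")]     # c = 12, a = 8
everything  <- v_named[]                # the original vector
nothing     <- v_named[0]               # zero-length vector
```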


## Generating sequences for subsetting

- You can use `:` operator to generate vectors of indices for subsetting
- `seq()` function provides a generalization of `:` for generating arithmetic progressions


In [43]:
2:4

[1] 2 3 4

In [44]:
# It is similar to Python's object[start:stop:step] syntax, but `to` is inclusive
seq(from = 1, to = 4, by = 2)

[1] 1 3
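
Beyond `by`, a few related helpers are often useful when generating indices. This is a minimal sketch; the `parties` vector is made up for illustration.

```r
# Ask for a number of evenly spaced points instead of a step size
grid <- seq(0, 1, length.out = 5)        # 0.00 0.25 0.50 0.75 1.00

# seq_along() generates one index per element of an existing vector
parties <- c("a", "b", "c")              # hypothetical vector
idx <- seq_along(parties)                # 1 2 3

# seq_len(0) is empty, whereas 1:0 counts down and yields two elements
empty_idx    <- seq_len(0)               # integer(0)
counted_down <- 1:0                      # 1 0
```

The last two lines are the reason `seq_len()` and `seq_along()` are usually preferred over `1:length(x)` when the vector might be empty.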

## Vector subsetting examples

In [45]:
v

[1]  8 10 12 14

In [46]:
v[2:4]

[1] 10 12 14

In [47]:
# Argument names can be omitted for matching by position
v[seq(1,4,2)]

[1]  8 12

In [48]:
# All but the last element
v[-length(v)]

[1]  8 10 12

In [49]:
# Reverse order
v[seq(length(v),1,-1)]

[1] 14 12 10  8

## Vector recycling

For operations that require vectors to be of the same length, R recycles (reuses) the shorter one

In [50]:
c(0, 1) + c(1, 2, 3, 4)

[1] 1 3 3 5

In [51]:
5 * c(1, 2, 3, 4)

[1]  5 10 15 20

In [52]:
c(1, 2, 3, 4)[c(TRUE, FALSE)]

[1] 1 3
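
When the longer length is not a multiple of the shorter one, arithmetic recycling still happens but R issues a warning. The sketch below captures both the result and the fact that a warning was raised; the variable names are illustrative.

```r
# The shorter vector is reused as 1, 2, 1: result is 2 4 4
total <- suppressWarnings(c(1, 2) + c(1, 2, 3))

# Detect that the same operation signals a warning
warned <- tryCatch(
    {
        c(1, 2) + c(1, 2, 3)
        FALSE
    },
    warning = function(w) TRUE
)
```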

## Lists

- As opposed to vectors, *lists* can contain elements of any type
- Lists can also contain nested lists
- Lists are constructed using `list()` function in R

In [53]:
# We can combine different data types in a list and, optionally, name elements (e.g. B below)
l <- list(2:4, "a", B = c(TRUE, FALSE, FALSE), list('x', 1L))
l

[[1]]
[1] 2 3 4

[[2]]
[1] "a"

$B
[1]  TRUE FALSE FALSE

[[4]]
[[4]][[1]]
[1] "x"

[[4]][[2]]
[1] 1



## R object structure

- `str()` - one of the most useful functions in R
- It shows the **str**ucture of an arbitrary R object

In [54]:
str(l)

List of 4
 $  : int [1:3] 2 3 4
 $  : chr "a"
 $ B: logi [1:3] TRUE FALSE FALSE
 $  :List of 2
  ..$ : chr "x"
  ..$ : int 1


## List subsetting

- As with vectors you can use `[]` to subset lists
- This always returns a list (of length one if a single element is selected)
- Components of the list can be individually extracted using `[[` and `$` operators

```
list[index]
list[[index]]
list$name
```

## List subsetting examples

In [55]:
l[3]

$B
[1]  TRUE FALSE FALSE


In [56]:
str(l[3])

List of 1
 $ B: logi [1:3] TRUE FALSE FALSE


In [57]:
l[[3]]

[1]  TRUE FALSE FALSE

In [58]:
# Only works with named elements
l$B

[1]  TRUE FALSE FALSE
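
The same operators can be chained to reach into nested lists. The sketch below rebuilds a list with the same contents under the hypothetical name `items` so that it runs on its own.

```r
items <- list(2:4, "a", B = c(TRUE, FALSE, FALSE), list("x", 1L))

# `[` with several indices returns a shorter list
sub_list <- items[c(1, 3)]        # list of length 2

# `[[` can be chained to extract from a nested list
nested_value <- items[[4]][[1]]   # "x"

# `[[` also accepts a name, which is equivalent to `$`
same_as_dollar <- identical(items[["B"]], items$B)   # TRUE

# Unnamed elements have an empty string as their name
item_names <- names(items)        # "" "" "B" ""
```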

## Attributes

- All R objects can have attributes that contain metadata about them
- Attributes can be thought of as named lists
- Names, dimensions and class are common examples of attributes
- They (and some others) have special functions for getting and setting them
- More generally, attributes can be accessed and modified individually with `attr()` function


## Examples of attributes

In [59]:
v

[1]  8 10 12 14

In [60]:
attr(v, "example_attribute") <- "This is a vector"

In [61]:
attr(v, "example_attribute")

[1] "This is a vector"

In [62]:
# To set names for vector elements we can use names() function
names(v) <- c("a", "b", "c", "d")
v

 a  b  c  d 
 8 10 12 14 
attr(,"example_attribute")
[1] "This is a vector"

In [63]:
# Names of vector elements can be used for subsetting
v["b"]

 b 
10 
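
All attributes of an object can be listed at once with `attributes()`. Note that most attributes are not preserved by subsetting: only names (and, for arrays, dimensions and dimension names) survive. A small sketch with a hypothetical vector `w` and a made-up attribute `"note"`:

```r
w <- c(a = 8, b = 10)
attr(w, "note") <- "hypothetical attribute"

# Both the names and the custom attribute are stored on the object
all_attrs <- names(attributes(w))          # "names" "note"

# After subsetting only the names are kept
kept_attrs <- names(attributes(w[1]))      # "names"
```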

## Factors

- Factors form the basis of categorical data analysis in R
- Values of nominal (categorical) variables represent categories rather than numeric data
- Examples are abundant in social sciences (gender, party, region, etc.)
- Internally, in R factor variables are represented by integer vectors
- With 2 additional attributes:
    - `class()` attribute which is set to `factor`
    - `levels()` attribute which defines allowed values

## Factors example

In [64]:
provinces <- c("Leinster", "Connacht", "Munster", "Ulster")
provinces

[1] "Leinster" "Connacht" "Munster"  "Ulster"  

In [65]:
typeof(provinces)

[1] "character"

In [66]:
# We use factor() function to convert character vector into factor
provinces <- factor(provinces)
provinces

[1] Leinster Connacht Munster  Ulster  
Levels: Connacht Leinster Munster Ulster

In [67]:
class(provinces)

[1] "factor"

In [68]:
# Note that the data type of this vector is integer (and not character)
typeof(provinces)

[1] "integer"

## Factors example continued

In [69]:
# Note that R automatically sorted the categories alphabetically
levels(provinces)

[1] "Connacht" "Leinster" "Munster"  "Ulster"  

In [70]:
# You can change the reference category using relevel() function
provinces <- relevel(provinces, ref = "Leinster")
levels(provinces)

[1] "Leinster" "Connacht" "Munster"  "Ulster"  

In [71]:
# Or define an arbitrary ordering of levels using levels argument in factor() function
provinces <- factor(provinces, levels = c("Ulster", "Munster", "Connacht", "Leinster"))
levels(provinces)

[1] "Ulster"   "Munster"  "Connacht" "Leinster"

In [72]:
# Under the hood factors continue to be integer vectors
as.integer(provinces)

[1] 4 3 2 1
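
Because factors are stored as integer codes, two consequences are worth keeping in mind. The sketch below uses made-up data (`f`, `votes`) to illustrate them.

```r
# 1. Converting a factor of number-like labels gives the codes, not the numbers
f <- factor(c("10", "20", "10"))
codes  <- as.integer(f)                  # 1 2 1
values <- as.numeric(as.character(f))    # 10 20 10

# 2. Levels define the allowed values, so unobserved categories are still counted
votes <- factor(c("yes", "no", "yes"), levels = c("yes", "no", "abstain"))
counts <- table(votes)                   # yes: 2, no: 1, abstain: 0
```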

## Arrays and matrices

- Arrays are vectors with an added class and dimensionality attribute
- These attributes can be accessed using `class()` and `dim()` functions
- Arrays can have an arbitrary number of dimensions
- Matrices are special cases of arrays that have just two dimensions
- Arrays and matrices can be created using `array()` and `matrix()` functions
- Or by adding dimension attribute with `dim()` function

## Array example

In [73]:
# : operator can be used to generate vectors of sequential numbers
a <- 1:12
a

 [1]  1  2  3  4  5  6  7  8  9 10 11 12

In [74]:
class(a)

[1] "integer"

In [75]:
dim(a) <- c(3, 2, 2)
a

, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12


In [76]:
class(a)

[1] "array"

## Matrix example

In [77]:
m <- 1:12

In [78]:
dim(m) <- c(3, 4)
m

     [,1] [,2] [,3] [,4]
[1,] 1    4    7    10  
[2,] 2    5    8    11  
[3,] 3    6    9    12  

In [79]:
# Alternatively, we could use matrix() function
m <- matrix(1:12, nrow = 3, ncol = 4)
m

     [,1] [,2] [,3] [,4]
[1,] 1    4    7    10  
[2,] 2    5    8    11  
[3,] 3    6    9    12  

In [80]:
# Note that length() function displays the length of underlying vector
length(m)

[1] 12
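
By default `matrix()` fills values column by column. A brief sketch of the `byrow` argument and of row and column names, using illustrative object names:

```r
by_col <- matrix(1:6, nrow = 2)                 # first row: 1 3 5
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # first row: 1 2 3

# Row and column names are stored as an attribute and can be used for subsetting
rownames(by_row) <- c("r1", "r2")
colnames(by_row) <- c("a", "b", "c")
cell <- by_row["r2", "b"]                       # 5
```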

## Array and matrix subsetting

- Subsetting higher-dimensional (> 1) structures is a generalisation of vector subsetting
- But, since they are built upon vectors there is a nuance (albeit uncommon)
- They are usually subset in 2 ways:
    - with multiple vectors, where each vector is a sequence of elements in that dimension
    - with 1 vector, in which case subsetting happens from the underlying vector
    
```
array[vector_1, vector_2, ..., vector_n]
array[vector]
```

## Array subsetting example

In [81]:
a

, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12


In [82]:
# Most common way
a[1,2,2]

[1] 10

In [83]:
# Here elements are subset from underlying vector (with repetition)
a[c(1,2,2)]

[1] 1 2 2

## Matrix subsetting example

In [84]:
m

     [,1] [,2] [,3] [,4]
[1,] 1    4    7    10  
[2,] 2    5    8    11  
[3,] 3    6    9    12  

In [85]:
# Subset all rows, first two columns
m[1:nrow(m),1:2]

     [,1] [,2]
[1,] 1    4   
[2,] 2    5   
[3,] 3    6   

In [86]:
# Note that vector recycling also applies here
m[c(TRUE, FALSE), -3]

     [,1] [,2] [,3]
[1,] 1    4    10  
[2,] 3    6    12  
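
Two details of matrix subsetting that often surprise: a single row or column is simplified to a plain vector unless `drop = FALSE` is given, and a single index counts along the underlying vector column by column. A sketch with a matrix rebuilt under the hypothetical name `mat`:

```r
mat <- matrix(1:12, nrow = 3, ncol = 4)

# Selecting one row drops the dimensions and returns a vector
row_vec <- mat[1, ]                 # 1 4 7 10

# drop = FALSE keeps the result as a 1 x 4 matrix
row_mat <- mat[1, , drop = FALSE]

# The 5th element of the underlying vector sits in row 2, column 2
same_cell <- mat[5] == mat[2, 2]    # TRUE
```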

## Data frames

- Data frame is the workhorse of data analysis in R
- Despite their matrix-like appearance, data frames are lists of equal-sized vectors
- Data frames can be created with `data.frame()` function with named vectors as input

Extra: Recall how pandas data frames in Python are dictionaries of equal-length lists/arrays

In [87]:
df <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d"),
    z = c(TRUE, FALSE, FALSE, TRUE)
)
df

  x y z    
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE
4 4 d  TRUE

## Data frames example

In [88]:
# str() function applied to data frame is useful in determining variable types
str(df)

'data.frame':	4 obs. of  3 variables:
 $ x: int  1 2 3 4
 $ y: chr  "a" "b" "c" "d"
 $ z: logi  TRUE FALSE FALSE TRUE


In [89]:
# dim() function behaves as it does for a matrix, showing the number of rows and columns, respectively
dim(df)

[1] 4 3

In [90]:
# In contrast to a matrix, length() of a data frame displays the length of the underlying list
length(df)

[1] 3

## Data frame subsetting

- In subsetting data frames the techniques of subsetting matrices and lists are combined
- If you subset with a single vector, it behaves as a list
- If you subset with two vectors, it behaves as a matrix

## Data frame subsetting example

In [91]:
# Like a list
df[c("x", "z")]

  x z    
1 1  TRUE
2 2 FALSE
3 3 FALSE
4 4  TRUE

In [92]:
# Like a matrix
df[,c("x", "z")]

  x z    
1 1  TRUE
2 2 FALSE
3 3 FALSE
4 4  TRUE

In [93]:
df[df$y == "b",]

  x y z    
2 2 b FALSE
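
The list-like and matrix-like styles differ in what they return. The sketch below rebuilds the same data under the hypothetical name `df_demo` and compares the results.

```r
df_demo <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d"),
    z = c(TRUE, FALSE, FALSE, TRUE),
    stringsAsFactors = FALSE
)

# `[` with one index returns a data frame, `[[` and `$` return the column itself
one_col_df  <- class(df_demo["x"])      # "data.frame"
one_col_vec <- typeof(df_demo[["x"]])   # "integer"

# Matrix-style: filter rows with a logical vector and pick a column by name
y_where_z <- df_demo[df_demo$z, "y"]    # "a" "d"

# Conditions can be combined with element-wise Boolean operators
filtered <- df_demo[df_demo$x > 2 & !df_demo$z, ]   # only the row with x = 3
```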

## R packages

- R's flexibility comes from its rich package ecosystem
- [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/) is the official repository of R packages
- At the moment it contains > 18K external packages
- Use `install.packages("<package_name>")` function to install packages that were released on CRAN
- Check `devtools` package if you need to install a package from other sources (e.g. GitHub, Bitbucket, etc.)
- Type `library(<package_name>)` to load installed packages
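
One possible pattern for installing a package only when it is missing and then loading it. The package name `"stats"` is a placeholder chosen because it ships with R, so nothing is downloaded when the sketch runs; substitute the package you need.

```r
pkg <- "stats"  # placeholder package name

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)
```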

## Help!

R has an inbuilt help facility which provides more information about any function:

In [94]:
?length

In [95]:
help(dim)

- The quality of documentation varies a lot across packages.
- Stack Overflow is a good resource for many standard tasks.
- For custom packages it is often helpful to check the issues page on GitHub.
- E.g. for `ggplot2`: [https://github.com/tidyverse/ggplot2/issues](https://github.com/tidyverse/ggplot2/issues)
- Or, indeed, any search engine [#LMDDGTFY](https://lmddgtfy.net/)

## Next

- Tutorial: R objects, attributes and subsetting
- Next week: Control flow and functions in R
