In [1]:
options(jupyter.rich_display = FALSE)

# Week 9 Tutorial: Fundamentals of R Programming II

## POP77001 Computer Programming for Social Scientists

##### Module website: [bit.ly/POP77001](https://bit.ly/POP77001)

## Code formatting

- Use consistent style and indentation (RStudio indents by 2 spaces, Jupyter Notebook by 4), even though indentation does not affect how programs are executed in R

In [2]:
# Good style
is_positive <- function(num) {
  if (num > 0) {
    res <- TRUE
  } else {
    res <- FALSE
  }
  return(res)
}

In [3]:
# Bad style
is_positive <- function(num) {
if (num > 0) {
res <- TRUE
}
else {
res <- FALSE
}
return(res)
}

## Exercise 1: Iteration

- Below you see a matrix of 30 random observations of 5 variables
- Visually inspect the matrix
- Which variable(s) do you think has (have) the highest standard deviation?
- First, try subsetting individual rows and columns from this matrix
- Check the dimensionality of the matrix using the `dim()`, `nrow()` and `ncol()` functions
- Write a loop that goes over each variable and calculates its standard deviation
- You can use the `sd()` function to calculate the standard deviation
- Save these calculated standard deviations in a vector
- Find the variable with the maximum standard deviation using the `max()` or `which.max()` functions
- Is it the one you thought it would be?
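
Before working with the full matrix, the subsetting and dimension checks listed above can be tried on something small enough to verify by eye. This is only a sketch: `toy` is a hypothetical 3 x 2 matrix, not the matrix used in this exercise.

```r
# Hypothetical toy matrix (not the exercise data): 3 rows, 2 columns,
# filled column by column
toy <- matrix(1:6, nrow = 3, ncol = 2)

toy[2, ]    # second row: 2 5
toy[, 1]    # first column: 1 2 3
toy[3, 2]   # single element in row 3, column 2: 6

dim(toy)    # 3 2
nrow(toy)   # 3
ncol(toy)   # 2

# Subsetting a single row or column drops the matrix structure by default;
# drop = FALSE keeps the result as a matrix
dim(toy[, 1, drop = FALSE])   # 3 1
```

The same indexing pattern, `mat[i, ]` for rows and `mat[, j]` for columns, carries over to the 30 x 5 matrix below.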

In [4]:
# When dealing with random number generation, it's always a good idea to make your code replicable
# by setting the seed with the set.seed() function
set.seed(2021)
# Here we create a matrix of 30 observations of 5 variables
# where each variable is a random draw from a normal distribution with mean 0
# and standard deviation drawn from a uniform distribution between 0 and 10
mat <- mapply(
  function(x) cbind(rnorm(n = 30, mean = 0, sd = x)),
  runif(n = 5, min = 0, max = 10)
)
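
The cell above relies on two ideas that may be easier to see in isolation: resetting the seed reproduces the same draws, and `mapply()` binds equal-length results into the columns of a matrix. The sketch below uses illustrative inputs (`first_draw`, `second_draw` and `toy` are hypothetical names, and the values 1 and 2 are not the exercise data).

```r
# Setting the same seed before each draw reproduces the same random numbers
set.seed(2021)
first_draw <- rnorm(n = 3)
set.seed(2021)
second_draw <- rnorm(n = 3)
identical(first_draw, second_draw)   # TRUE

# mapply() calls the function once per element of the vector it is given;
# results of equal length are bound together as columns of a matrix
toy <- mapply(function(x) rep(x, times = 3), c(1, 2))
dim(toy)   # 3 2
```

In the exercise the vector passed to `mapply()` holds five standard deviations and each call returns 30 draws, which is why the result has 30 rows and 5 columns.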

In [5]:
mat

      [,1]       [,2]        [,3]         [,4]       [,5]       
 [1,]  2.3839361  -9.3570121  -3.38211062 -1.6188191   2.4345828
 [2,] -2.8108812   3.6398922   9.12221934  2.1498588   2.1145327
 [3,]  9.5657865 -13.0553064  -5.65808900 -1.5074431  -0.4870916
 [4,]  4.4413551   3.4095297  -5.55809112 -4.7089982 -11.5910021
 [5,]  0.7666776  12.0176689   0.86723020  0.8672294 -11.6148730
 [6,] -3.0214758  -0.8865633 -11.54534021  6.1809994  -1.3194973
 [7,]  5.0308587  -4.3594766  10.55399527  0.5751735   1.6799590
 [8,]  0.5180495  20.3242580  -1.64929984 -0.2931002  -1.1849911
 [9,]  7.6669813  -1.8116540  16.75477035 -4.4174753  -6.1632955
[10,] -2.7861283   3.2815657  -4.84693805  5.0580929   3.1865921
[11,]  6.1104984  -6.1228633  -1.06513883 -2.3779707  -3.9278496
[12,]  5.3133132   7.2338976   8.62467646 -0.9139698  -1.5229585
[13,]  4.0453120  -2.0328074  -0.09658005 -4.4241866  -2.7919515
[14,]  7.1206917   3.6832976 -12.75733596 -2.2678010  -7.4990240
[15,]  5.8374139   0.4654

In [6]:
# dim() gives us the dimensions of a matrix (or data.frame) in a nrow X ncol format
# nrow() and ncol() are, essentially, shorthands for dim()[1L] and dim()[2L], respectively
dim(mat)

[1] 30  5

In [7]:
# First, let's initialize a variable that will hold the calculated standard deviations
sds <- vector("double", length = ncol(mat))
for (i in 1:ncol(mat)) {
  sds[i] <- sd(mat[,i])
}
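
A small caveat about the loop above: `1:ncol(mat)` works here because the matrix has at least one column, but `1:0` counts down rather than producing an empty sequence. A possible alternative is `seq_len()`, sketched below on a hypothetical matrix with no columns (`empty` and `sds_empty` are illustrative names).

```r
# 1:n counts down when n is 0, so a loop over 1:ncol(x) would run twice
# on a matrix with no columns
1:0          # 1 0
seq_len(0)   # integer(0), so a loop over it is skipped entirely

# Hypothetical empty matrix: 3 rows, 0 columns
empty <- matrix(numeric(0), nrow = 3, ncol = 0)
sds_empty <- vector("double", length = ncol(empty))
for (i in seq_len(ncol(empty))) {
  sds_empty[i] <- sd(empty[, i])
}
length(sds_empty)   # 0
```

For the matrix in this exercise both forms give the same result.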

In [8]:
sds

[1] 4.398089 7.341164 7.050135 3.805983 5.241136

In [9]:
# The second variable has the highest standard deviation
# As you can see from the raw data it is also the variable
# with the highest absolute value (above 20)
max(sds)

[1] 7.341164

In [10]:
# We can also use the which.max() function, which returns the index of the
# (first) maximum element of a vector, and use that index to subset the vector
sds[which.max(sds)]

[1] 7.341164
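
The difference between `max()` and `which.max()` is easiest to see on a short vector with a tie. A minimal sketch with illustrative values (not the standard deviations computed above):

```r
x <- c(3, 7, 7, 1)   # illustrative values with a tied maximum

max(x)               # 7, the largest value
which.max(x)         # 2, the position of the first maximum
which(x == max(x))   # 2 3, all positions that attain the maximum
```

In the exercise, `which.max(sds)` therefore gives the column number of the variable with the largest standard deviation, while `max(sds)` gives the standard deviation itself.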

## Exercise 2: Functions

- As R is a functional language, many iteration routines can be avoided
- For example, instead of writing a loop to calculate the standard deviations above, we are more likely to call `apply(<object_name>, 2, <function_name>)` to calculate the desired summary statistic for each of the variables (more on the `apply` family of functions in the next lecture)
- Apply this function to the matrix from the exercise above
- Now, change 2 in the function call to 1
- What do you see? What do the current numbers show? Does this summary make sense and why?
---
- Let's turn to a more complicated case
- Below you can see another matrix object, but this time it's interspersed with letters
- What is the type of this matrix?
- Write a function that can take this matrix as an input and return a list, where each element is a column of the input matrix
- Internally, you can re-use the loop from the previous exercise
- In addition, while iteratively building your list, try checking whether a column is coercible into numeric

In [11]:
apply(mat, 2, sd)

[1] 4.398089 7.341164 7.050135 3.805983 5.241136

In [12]:
# Changing the margin from 2 to 1 returns row-wise standard deviations
# Given that in the data generation process standard deviations varied
# by column (variable), rather than row, this isn't the most meaningful
# of summary statistics
apply(mat, 1, sd)

 [1] 4.873991 4.273205 8.240744 6.699413 8.362641 6.338739 5.534204 9.417941
 [9] 9.635175 4.322363 4.640642 4.687767 3.250237 8.076607 3.395958 5.972755
[17] 6.235732 3.606126 4.734652 5.145260 5.121549 4.550594 2.435730 4.887203
[25] 6.622245 2.077985 5.900717 7.432548 3.156476 5.029326
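
The role of the second argument of `apply()` can be checked by hand on a matrix small enough to sum mentally. This sketch uses a hypothetical 2 x 3 matrix `toy` and `sum()` instead of `sd()` so that the results are easy to verify.

```r
# Hypothetical toy matrix: row 1 is (1, 3, 5), row 2 is (2, 4, 6)
toy <- matrix(1:6, nrow = 2)

apply(toy, 2, sum)   # margin 2 = columns: 3 7 11
apply(toy, 1, sum)   # margin 1 = rows:    9 12
```

Margin 2 returns one value per column (variable) and margin 1 returns one value per row (observation), which matches the lengths of 5 and 30 in the outputs above.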

In [13]:
set.seed(2021)
mat2 <- cbind(
  letters[sample.int(26, 30, replace = TRUE)],
  mapply(
    function(x) cbind(rnorm(n = 30, mean = 0, sd = x)),
    runif(n = 3, min = 0, max = 10)
  ),
  letters[sample.int(26, 30, replace = TRUE)]
)

In [14]:
mat2

      [,1] [,2]               [,3]               [,4]               [,5]
 [1,] g    -2.26248101185558  -4.67722160983318  -1.50979300274865  w   
 [2,] f    -0.445765619916805 -21.0901358530494  -2.87809056214973  v   
 [3,] n    -2.6434201992139   0.40893792423111   -9.69321155043919  n   
 [4,] z    2.90026773641013   -3.4480824645454   2.83291610610758   v   
 [5,] g    -3.90091658817758  -8.9771249160351   -0.64825005423638  l   
 [6,] l    0.252977008172167  0.97011182624825   -5.54706110264981  v   
 [7,] t    -3.4940171234176   3.99472863747609   2.38858085002964   s   
 [8,] f    -0.849870553544038 -1.5938329162815   -7.10484622112137  m   
 [9,] f    -0.224941456038607 -14.4829217683358  -4.43291989578963  l   
[10,] f    2.64232547402139   -14.0758627365065  2.16178933687776   m   
[11,] n    -4.71446626344588  0.149991103137851  3.62617045505629   y   
[12,] e    -3.4760147063997   -1.73297205751811  -1.31366103638704  v   
[13,] o    2.44733178666701   3.66418632154723   -0

In [15]:
apply(mat2, 2, sd)

“NAs introduced by coercion”
“NAs introduced by coercion”


[1]       NA 2.659355 8.409539 3.956232       NA

In [16]:
# As matrices in R can only contain elements of the same type,
# this matrix was coerced into character type
typeof(mat2)

[1] "character"
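
The coercion seen above follows R's general rule that combining values of different types converts them to the most flexible type involved. A short sketch with illustrative values (`mixed` is a hypothetical name):

```r
# Combining types coerces towards the most flexible one:
# logical < integer < double < character
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 2.5))    # "double"
typeof(c(1, "a"))     # "character"

# The same happens when a character column is bound to a numeric one
mixed <- cbind(c("a", "b"), c(1.5, 2))
typeof(mixed)            # "character"
mixed[, 2]               # "1.5" "2"
as.numeric(mixed[, 2])   # 1.5 2.0
```

This is why the numeric columns of `mat2` are printed as strings, and why they can be recovered with `as.numeric()`.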

In [17]:
as_list <- function(mat) {
  ncols <- ncol(mat)
  # Somewhat counterintuitively, the vector() function can also initialize an empty list
  lst <- vector("list", length = ncols)
  for (i in 1:ncols) {
    # Here we check whether a column is coercible into numeric
    # by looking for any NAs in the converted vector
    # (this assumes no missing data in the numeric columns)
    # suppressWarnings() silences the warning message
    # “NAs introduced by coercion”
    if (!(NA %in% suppressWarnings(as.numeric(mat[,i])))) {
      lst[[i]] <- as.numeric(mat[,i])
    } else {
      lst[[i]] <- mat[,i]
    }
  }
  return(lst)
}
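
The function above assumes that numeric columns contain no missing values. One possible way to relax that assumption is to flag only those `NA`s that appear as a result of the conversion. The helper below is a hypothetical sketch (`is_coercible()` is not part of the tutorial solution or of base R).

```r
# Hypothetical helper: TRUE if every non-missing element of x converts
# to a number, so NAs already present in x are not counted as failures
is_coercible <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  # An element fails only if it was not NA before conversion but is NA after
  !any(is.na(converted) & !is.na(x))
}

is_coercible(c("1.5", "2", NA))   # TRUE
is_coercible(c("a", "2"))         # FALSE
```

Inside `as_list()`, this helper could replace the `NA %in% ...` condition if columns with genuine missing values need to be handled.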

In [18]:
lst2 <- as_list(mat2)

In [19]:
lst2

[[1]]
 [1] "g" "f" "n" "z" "g" "l" "t" "f" "f" "f" "n" "e" "o" "g" "i" "w" "l" "s" "r"
[20] "c" "n" "h" "z" "d" "e" "v" "p" "b" "s" "d"

[[2]]
 [1] -2.26248101 -0.44576562 -2.64342020  2.90026774 -3.90091659  0.25297701
 [7] -3.49401712 -0.84987055 -0.22494146  2.64232547 -4.71446626 -3.47601471
[13]  2.44733179 -3.41233180 -1.45127296 -3.80137429 -3.08707973 -3.49219631
[19] -0.20902770  1.21169799  0.27940912  4.22566562 -0.82850551  5.08938865
[25] -0.08252849 -1.90168846  3.54220275 -1.74181243  0.74991426  1.66116699

[[3]]
 [1]  -4.6772216 -21.0901359   0.4089379  -3.4480825  -8.9771249   0.9701118
 [7]   3.9947286  -1.5938329 -14.4829218 -14.0758627   0.1499911  -1.7329721
[13]   3.6641863  -7.0744949   2.1635246  -9.1958076   5.2829439  15.1150230
[19]  -2.3556142  -9.8714153  -3.2556208  -0.4019135 -13.0657402  13.9320412
[25]  -9.7172363  -2.2152006  -9.3409793 -13.0188889   9.1807729   3.3744389

[[4]]
 [1]  -1.50979300  -2.87809056  -9.69321155   2.83291611  -0.64825005
 [6

## Week 9: Assignment 4

- Practice subsetting, conditional statements and functions in R
- Due at 11:00 on Monday, 15th November