# Lab 8

This lab reviews regular expressions, then focuses on functions, conditions, and vectors.

## Table of Contents
* [Review/Explore](#Review/Explore)
* [Exercises](#Exercises)

In [None]:
library(tidyverse)
library(stringr)

## Review/Explore

### Previous HW

There were many questions about last week's HW assignment, and I saw several inefficient solutions, so I thought it would be worthwhile to go over a few of the questions together.

**Problem 2**

The word-boundary character class \b matches the beginning and end of a word. Use this character class to write a regular expression re2 such that str_count(s, re2) counts the number of words in the string s. Here a word is defined as a consecutive string of letters, numbers, or underscores.

In [None]:
# To solve this, take a quick look at what the word-boundary regex does, try to think in terms of what we want
tst = "\\b"
str_view("this is a word",tst)

In [None]:
#Based on this, what we want, is a pattern that is something like:
# word-boundary + any number of letters/characters + word-boundary
#Hence, the solution
sln = "\\b\\w+\\b"

In [None]:
str_view_all("this is a word",sln)
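
To tie this back to the question, here is a minimal sketch of the count itself. The sample strings are made up for illustration, and re2 is simply the variable name the problem asks for.

```r
library(stringr)

# Same pattern as sln, stored under the name the problem asks for
re2 <- "\\b\\w+\\b"

# Each maximal run of letters, digits, or underscores counts as one word
str_count("this is a word", re2)
#> [1] 4
str_count("snake_case counts as 1 word", re2)
#> [1] 5
```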

**Problem 3**

Write a regular expression which matches any word containing exactly two vowels, regardless of case. Store this regular expression in a variable named re3.

In [None]:
# Since we need exactly two of something, we know we need to use the {} brackets
# Also, note that within each word we want a vowel surrounded by 0 or more consonants
# Here's how we can find one vowel (both cases are listed, since the match should ignore case)
vow = "[aeiouAEIOU]"
# Here's how we can find 0 or more consonants (anything that is not a vowel)
ocon = "[^aeiouAEIOU]*"
# Now we can combine this together with curly brackets surrounded by word-boundaries
sln = str_c("\\b(", ocon, vow, ocon, "){2}\\b")
sln

In [None]:
# Let's take it for a spin
str_detect(c('aba', 'aa','fkbkbfgo','aaa','thirteen'),sln)

**Problem 4**

Write a regular expression which matches proper nouns. A proper noun is defined as one or more capitalized words, optionally separated by the word(s) "and", "of", "the", and/or "by". Store your expression in a variable called re4. (If a capitalized word occurs at the beginning of a sentence, you may assume it is part of a proper noun.)

In [None]:
# Let's break this problem down. First, we know we need to find words that begin with a capital letter, so let's do that
#Ok so we want one or more capital letters to start, followed by 0 or more word characters (\\w)
tst = "[A-Z]+[\\w]*"
str_view_all("the Jabba Monster Is great",tst)

In [None]:
# Great, so now let's try to add the second condition: we want words that follow this pattern and are followed by 0 or more
# of the key words given in the problem
tst = "[A-Z]+[\\w]* ((and|of|the|or|by))*"
str_view_all("the Jabba the Monster Is great",tst)

In [None]:
#Alright, but we want to connect these separate matches together somehow, so let's find 0 or more of this same pattern
# Also, let's add a space after the key words (and, of, etc.)
tst = "([A-Z]+[\\w]* ((and|of|the|or|by) )*)*"
str_view_all(c("test Test Test of Test", "Test the Test    of Test", "The Test of the or by Test of Test"),tst)

In [None]:
# Great, so the last step is making sure we only continue with this pattern when it is followed by another word
# that also starts with a capital letter
sln = "([A-Z]+[\\w']* ((and|of|the|or|by) )*)*[A-Z]+[\\w']*"
str_view_all(c("test Test Test of Test", "Test the Test of Test", "The Test of the or by Test of Test"),sln)

### Conditions

R has two main conditional tools: the if statement and the ifelse() function. An if statement evaluates a single TRUE/FALSE condition and can be chained with else if (or nested) to express more complex logic, while ifelse() is vectorized and tests every element of a vector at once.

In [None]:
#If I wanted to identify even numbers between 10 and 30 and even numbers between 50 and 60 
tst_number = 52

if (tst_number%%2==0) {
    if (tst_number>=10 && tst_number<=30) {
        print('got one here sir!')
    } else if (tst_number>=50 && tst_number<=60){
        print('got a big one here sir!')
    } else {
        print('sorry buddy, try again')
    }
} else {
    print('sorry buddy, try again')
}

In [None]:
#ifelse() is a vectorized function, so it can condense multiple lines of if/else code into a single call (very useful sometimes!)

#If I wanted to make a new column that flags each number as even or odd
#First, setup data
nums = c(1:10)
dta = data.frame(nums)
#Now, create new column
dta$evnflg = ifelse(dta$nums %% 2 == 0, 'even', 'odd')
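
As a rough sketch of how the two ideas combine, the nested range check from the if example can also be written with ifelse() so that it runs over a whole vector at once. The test values and the variable names tst_numbers, is_even, and msg are made up for illustration.

```r
# Made-up test values; the ranges are the ones used in the if/else example
tst_numbers <- c(9, 12, 31, 52, 60, 61)

is_even <- tst_numbers %% 2 == 0

# Nested ifelse(): the inner call only decides elements that failed the first test
msg <- ifelse(is_even & tst_numbers >= 10 & tst_numbers <= 30, "got one here sir!",
       ifelse(is_even & tst_numbers >= 50 & tst_numbers <= 60, "got a big one here sir!",
              "sorry buddy, try again"))
msg
```

Note the single & here: inside ifelse() the comparison has to stay element-wise, whereas a plain if condition only ever looks at one value.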

### Functions

For this section we will explore functions by making a game together! The game will be based on a randomly created board of black and white tiles. Multiple players are added to the board on the far right end. The goal of the game is to get to the far left end of the board first. Each player can take one step at a time, and cannot step on black squares.

Let the races begin!

In [None]:
# Make a function to create a board of a specific size
make_board = function(height,width) {
    size = height*width
    brd = matrix(sample(c(1,0,0),size,replace = TRUE), nrow=height)
    return(brd)
}

In [None]:
#Make a function to place the players on the board
set_players = function(players,board) {
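    # Players are stored as 2, 3, ... because 0 and 1 already mean white and black tiles
    play = (1:players)+1
    # image() draws matrix rows along the x-axis, so the last row is the far right edge
    max_board = nrow(board)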
    for (i in seq(players)){
        board[max_board,i] = play[i]
    }
    return(board)
}

In [None]:
# Make a function to print the board we create
print_board = function(brd,players) {
    colors = c('white','black',colors()[10:(10+players-1)])
    image(brd,col=colors,axes = FALSE)
}

In [None]:
# Make a function to take a step in the game
take_step = function(brd, playernum,step) {
    rw = which(brd==playernum,arr.ind=TRUE)[1]
    cl = which(brd==playernum,arr.ind=TRUE)[2]
    if (step == 'left') {
        brd[rw,cl] = 0
        brd[rw,cl-1] = playernum
        return(brd)
    } else if (step == 'right') {
        brd[rw,cl] = 0
        brd[rw,cl+1] = playernum
        return(brd)
    } else if (step == 'up') {
        brd[rw,cl] = 0
        brd[rw-1,cl] = playernum
        return(brd)
    } else if (step == 'down') {
        brd[rw,cl] = 0
        brd[rw+1,cl] = playernum
        return(brd)
    } else {
        print("Invalid step type! Only up, down, left, or right steps are allowed.")
        # Hand back the unchanged board so brd = take_step(...) is not overwritten by the message
        return(brd)
    }
}

In [None]:
# Alright, now we are ready to play the game

#Let's first just setup a 10 x 10 board to play on with 2 players
set.seed(12)
brd = make_board(10,10)
plyrs = 2
# Now we can add two players to the board
brd = set_players(plyrs,brd)
# Let's see what the board looks like now
print_board(brd,plyrs)

In [None]:
# Let's move player 2 (3 in our matrix) up
brd = take_step(brd,3,'up')
print_board(brd,plyrs)

In [None]:
#Alright now lets move player 1 to the right
brd = take_step(brd,2,'right')
print_board(brd,plyrs)
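
The rules say players cannot step on black squares, but take_step() does not check this. Below is one possible sketch of such a check. The helper name can_step and the small demo board are hypothetical, and it assumes the same encoding as above (0 = white, 1 = black, 2 and up = players).

```r
# Hypothetical helper: TRUE only if the target square is on the board and white (0)
can_step = function(brd, playernum, step) {
    pos = which(brd == playernum, arr.ind = TRUE)
    if (nrow(pos) != 1) return(FALSE)   # player missing or duplicated
    rw = pos[1, 1]
    cl = pos[1, 2]
    # Same direction convention as take_step(); unknown steps fall through to NULL
    target = switch(step,
        left  = c(rw, cl - 1),
        right = c(rw, cl + 1),
        up    = c(rw - 1, cl),
        down  = c(rw + 1, cl),
        NULL)
    if (is.null(target)) return(FALSE)
    on_board = target[1] >= 1 && target[1] <= nrow(brd) &&
               target[2] >= 1 && target[2] <= ncol(brd)
    # && short-circuits, so the board is only indexed when the target is in range
    on_board && brd[target[1], target[2]] == 0
}

# Tiny demo board: player 1 (stored as 2) with a black tile directly "up"
demo = matrix(0, nrow = 3, ncol = 3)
demo[3, 1] = 2
demo[2, 1] = 1
can_step(demo, 2, 'up')
#> [1] FALSE
can_step(demo, 2, 'right')
#> [1] TRUE
can_step(demo, 2, 'down')
#> [1] FALSE
```

A caller could then wrap each move as if (can_step(brd, 2, 'right')) brd = take_step(brd, 2, 'right').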

## Exercises

### Section 19

In [None]:
# Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, 
# and na.rm was FALSE?

# Solution:
#First, note that "a single missing value" means that the vector x has at least one element equal to NA.
#If there were any NA values and na.rm = FALSE, then range() would return NA for both endpoints, so every element 
#of the result would be NA. I can confirm this by testing a version of the function that exposes this choice as an argument

rescale01_alt <- function(x, finite = TRUE) {
  rng <- range(x, na.rm = finite, finite = finite)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01_alt(c(NA, 1:5), finite = FALSE)
#> [1] NA NA NA NA NA NA
rescale01_alt(c(NA, 1:5), finite = TRUE)
#> [1]   NA 0.00 0.25 0.50 0.75 1.00

In [None]:
# Write both_na(), a function that takes two vectors of the same length and returns the number of
# positions that have an NA in both vectors.

# Solution:
both_na <- function(x, y) {
  sum(is.na(x) & is.na(y))
}
both_na(c(NA, NA,  1, 2),
        c(NA,  1, NA, 2))
#> [1] 1
both_na(c(NA, NA,  1, 2, NA, NA, 1), 
        c(NA,  1, NA, 2, NA, NA, 1))
#> [1] 3

In [None]:
# What do the following functions do? Why are they useful even though they are so short?
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0

# Solution:
#The function is_directory checks whether the path in x is a directory. The function is_readable checks whether 
#the path in x is readable, meaning that the file exists and the user has permission to open it. These functions 
#are useful even though they are short because their names make it much clearer what the code is doing.

In [None]:
# Read the source code for each of the following three functions, puzzle out what they do, and then
# brainstorm better names.

f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}

f3 <- function(x, y) {
  rep(y, length.out = length(x))
}
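
One possible reading of the three functions, with suggested names. The names has_prefix, drop_last, and recycle_to are only proposals and are not part of the original exercise.

```r
# f1: does each string start with the given prefix?
has_prefix <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

# f2: drop the last element (NULL for vectors of length 0 or 1)
drop_last <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}

# f3: recycle y until it is as long as x
recycle_to <- function(x, y) {
  rep(y, length.out = length(x))
}

has_prefix(c("prefix_a", "other"), "pre")
#> [1]  TRUE FALSE
drop_last(1:3)
#> [1] 1 2
recycle_to(1:5, 1:2)
#> [1] 1 2 1 2 1
```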
    

In [None]:
# Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?

# Solution:
#rnorm samples from the univariate normal distribution, while MASS::mvrnorm samples from the 
#multivariate normal distribution. The main arguments in rnorm are n, mean, sd. The main arguments 
#in MASS::mvrnorm are n, mu, Sigma. To be consistent they should have the same names. However, this is 
#difficult. In general, it is better to be consistent with more widely used functions, e.g. mvrnorm should 
#follow the conventions of rnorm. However, while mean is correct in the multivariate case, sd does not make 
#sense in the multivariate case. Both functions are internally consistent though; it would be bad to have mu 
#and sd or mean and Sigma.

In [None]:
# What’s the difference between if and ifelse()? Carefully read the help and construct three examples
# that illustrate the key differences.

# Solution:
# The keyword if tests a single condition, while ifelse tests each element.
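
The exercise asks for three examples, so here is one possible set. The values are made up for illustration, and the tryCatch() wrapper is only there so that the failing if does not stop the script.

```r
x <- c(-1, 0, 1)

# 1. ifelse() is vectorized: one answer per element of the test
ifelse(x > 0, "positive", "not positive")
#> [1] "not positive" "not positive" "positive"

# 2. if returns whichever branch runs, in full; ifelse() returns something shaped like the test
if (TRUE) c(1, 2) else 0
#> [1] 1 2
ifelse(TRUE, c(1, 2), 0)
#> [1] 1

# 3. A missing condition is an error for if, but simply NA for ifelse()
tryCatch(if (NA) "yes" else "no", error = function(e) "error")
#> [1] "error"
ifelse(NA, "yes", "no")
#> [1] NA
```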

In [None]:
# Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three,
# it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns
# “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.

#Solution:
fizzbuzz <- function(x) {
  stopifnot(length(x) == 1)
  stopifnot(is.numeric(x))
  # this could be made more efficient by minimizing the
  # number of tests
  if (!(x %% 3) && !(x %% 5)) {
    "fizzbuzz"
  } else if (!(x %% 3)) {
    "fizz"
  } else if (!(x %% 5)) {
    "buzz"
  } else {
    # not divisible by three or five: return the number itself
    x
  }
}
fizzbuzz(6)
#> [1] "fizz"
fizzbuzz(10)
#> [1] "buzz"
fizzbuzz(15)
#> [1] "fizzbuzz"
fizzbuzz(2)
#> [1] 2

In [None]:
# What does this switch() call do? What happens if x is “e”?
# switch(x, a = , b = "ab", c = , d = "cd")

# Solution:
# It returns "ab" for a or b, "cd" for c or d, and NULL (invisibly) for e. An empty argument falls through 
# to the next one, so switch() returns the first non-missing value at or after the name it matches.
x = "e"
switch(x, a = , b = "ab", c = , d = "cd")
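
A small sketch of the fall-through behaviour, plus one way to avoid the silent NULL: an unnamed final argument acts as a default. The helper name classify and the "other" label are made up for illustration.

```r
# Hypothetical wrapper around the same switch() call, with an added default
classify <- function(x) {
  switch(x, a = , b = "ab", c = , d = "cd", "other")
}

classify("a")   # falls through to the value given for b
#> [1] "ab"
classify("d")
#> [1] "cd"
classify("e")   # no name matches, so the unnamed default is used
#> [1] "other"

# Without the default, an unmatched name gives NULL
is.null(switch("e", a = , b = "ab", c = , d = "cd"))
#> [1] TRUE
```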

In [None]:
# What does commas(letters, collapse = "-") do? Why?
commas <- function(...) {
  stringr::str_c(..., collapse = ", ")
}

# Solution:
#This errors out because '...' passes the extra collapse argument straight through to str_c(). 
# R therefore tries to execute str_c(letters, collapse = "-", collapse = ", "), which causes an error because collapse is supplied twice
commas(letters,collapse="-")
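
To inspect the failure without halting the notebook, the call can be wrapped in tryCatch(). This is only a sketch: the variable name msg is made up, and the exact wording of the message may vary between R versions, so it is printed rather than assumed.

```r
commas <- function(...) {
  stringr::str_c(..., collapse = ", ")
}

# Normal use: every argument in ... is pasted, then collapsed with ", "
commas(letters[1:3])
#> [1] "a, b, c"

# Capture the error message instead of stopping the script
msg <- tryCatch(commas(letters, collapse = "-"),
                error = function(e) conditionMessage(e))
msg
```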

In [None]:
# It’d be nice if you could supply multiple characters to the pad argument, e.g. rule("Title", pad =
# "-+"). Why doesn’t this currently work? How could you fix it?

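# Solution:
# str_dup(pad, width) repeats the whole pad string width times, so a two-character pad produces a line
# twice as long as intended. One fix is to repeat pad only as often as needed and then cut to width.
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  padding <- stringr::str_dup(pad, ceiling(width / nchar(pad)))
  cat(title, " ", substr(padding, 1, width), "\n", sep = "")
}
rule("Important output")
#> Important output ------------------------------------------------------
rule("Important output", pad = "-+")
# Prints the title followed by an alternating -+-+ line of the same width as above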

In [None]:
# The default value for the method argument to cor() is c("pearson", "kendall", "spearman").
# What does that mean? What value is used by default?

# Solution:
# It means that the method argument can take one of those three values. The first value, "pearson", is used by default.
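
A common way to get this behaviour in your own functions is match.arg(). The sketch below uses a hypothetical function pick_method with the same default vector as cor(); it illustrates the idiom and is not a copy of the cor() source.

```r
# Hypothetical stand-in for a function with the same default as cor()
pick_method <- function(method = c("pearson", "kendall", "spearman")) {
  # With the default untouched, match.arg() returns the first choice;
  # otherwise it checks the supplied value against the allowed ones
  match.arg(method)
}

pick_method()
#> [1] "pearson"
pick_method("spearman")
#> [1] "spearman"

# Anything outside the three choices is rejected with an error
tryCatch(pick_method("euclid"), error = function(e) "not an allowed method")
#> [1] "not an allowed method"
```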

### Section 20

In [None]:
# Describe the difference between is.finite(x) and !is.infinite(x).

# Solution:
#is.finite considers only a number to be finite, and considers missing (NA), not a number (NaN), 
#and positive and negative infinity to be not finite. However, since is.infinite only considers Inf 
#and -Inf to be infinite, !is.infinite considers 0 as well as missing and not-a-number to be not infinite.
x <- c(0, NA, NaN, Inf, -Inf)
is.finite(x)
#> [1]  TRUE FALSE FALSE FALSE FALSE
!is.infinite(x)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE

In [None]:
# Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work?

# Solution:
#Instead of checking for exact equality, it checks that two numbers are within a certain tolerance, tol. 
#By default the tolerance is set to the square root of .Machine$double.eps, which is the smallest positive 
#floating point number x such that 1 + x != 1.
dplyr::near
#> function (x, y, tol = .Machine$double.eps^0.5) 
#> {
#>     abs(x - y) < tol
#> }
#> <environment: namespace:dplyr>
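
A short base R sketch of why the tolerance matters. It reproduces the comparison from the near() source by hand, so it does not need dplyr; the variable name eps is just shorthand.

```r
eps <- .Machine$double.eps

# eps is the gap between 1 and the next representable double
1 + eps == 1
#> [1] FALSE
1 + eps / 2 == 1
#> [1] TRUE

# Exact equality fails after a round trip through sqrt()
sqrt(2)^2 == 2
#> [1] FALSE

# The same test near() performs, written out by hand
abs(sqrt(2)^2 - 2) < eps^0.5
#> [1] TRUE
```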

In [None]:
# What functions from the readr package allow you to turn a string into logical, integer, and double vector?

# Solution:
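# The functions parse_logical, parse_integer, and parse_double. The more lenient parse_number also returns a
# double and ignores non-numeric characters such as the grouping comma in "1,000", which is why it is used below.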
parse_logical(c("TRUE", "FALSE", "1", "0", "true", "t", "NA"))
#> [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE    NA
parse_integer(c("1235", "0134", "NA"))
#> [1] 1235  134   NA
parse_number(c("1.0", "3.5", "1,000", "NA"))
#> [1]    1.0    3.5 1000.0     NA

In [None]:
# What does mean(is.na(x)) tell you about a vector x? What about sum(!is.finite(x))?

# Solution:
# The expression mean(is.na(x)) calculates the proportion of missing values in a vector
x <- c(1:10, NA, NaN, Inf, -Inf)
mean(is.na(x))
#> [1] 0.143

# The expression sum(!is.finite(x)) counts the values that are NA, NaN, or infinite.
sum(!is.finite(x))
#> [1] 4

In [None]:
# Compare and contrast setNames() with purrr::set_names().

# Solution:
# These are simple functions, so we can simply print out their source code:
setNames
#> function (object = nm, nm) 
#> {
#>     names(object) <- nm
#>     object
#> }
#> <bytecode: 0x7fc4bff5e808>
#> <environment: namespace:stats>

purrr::set_names
#> function (x, nm = x, ...) 
#> {
#>     set_names_impl(x, x, nm, ...)
#> }
#> <bytecode: 0x7fc4c4243c28>
#> <environment: namespace:rlang>

#From the code we can see that set_names adds a few sanity checks: x has to be a vector, 
#and the lengths of the object and the names have to be the same.

In [None]:
# Why is x[-which(x > 0)] not the same as x[x <= 0]?

# Solution:
#which(x > 0) returns the indexes of the values for which the comparison is TRUE and ignores NA, so 
#x[-which(x > 0)] drops only those positions. Thus it keeps NA and NaN because the comparison is not TRUE. 
#x <= 0 works slightly differently. If x <= 0 returns TRUE or FALSE it works the same way. However, if the 
#comparison generates an NA, then it will always keep that entry, but set it to NA. This is why the last two 
#values of x[x <= 0] are NA rather than c(NaN, NA).
x <- c(-5:5, Inf, -Inf, NaN, NA)
x[-which(x > 0)]
#> [1]   -5   -4   -3   -2   -1    0 -Inf  NaN   NA
-which(x > 0)
#> [1]  -7  -8  -9 -10 -11 -12
x[x <= 0]
#> [1]   -5   -4   -3   -2   -1    0 -Inf   NA   NA
x <= 0
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [12] FALSE  TRUE    NA    NA
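
There is a second difference worth knowing about, sketched here with a made-up vector: when nothing satisfies the condition, which() returns an empty index, and negating an empty index still selects nothing.

```r
x <- c(-2, -1, 0)

# No element is positive, so which() returns an empty index
which(x > 0)
#> integer(0)

# Negating an empty index is still empty, so everything is dropped
x[-which(x > 0)]
#> numeric(0)

# The logical version keeps all three values, as intended
x[x <= 0]
#> [1] -2 -1  0
```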

In [None]:
# What happens if you subset a tibble as if you’re subsetting a list? What are the key differences between
# a list and a tibble?

# Solution:
#The examples below look at the underlying vector behaviour. When you subset with positive integers that are 
#larger than the length of the vector, NA values are returned for those integers.
(1:10)[11:12]
#> [1] NA NA

#When a vector is subset with [[ and a name that doesn’t exist, an error is generated (single-bracket [ would return NA instead).
c(a = 1, 2)[["b"]]
#> Error in c(a = 1, 2)[["b"]]: subscript out of bounds
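
For the tibble part of the question, a minimal sketch is given below. The tibble tb and its columns are made up for illustration. A tibble is a list of columns, so list-style subsetting works; the main difference from a plain list is that all columns must have the same length.

```r
library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))

# A tibble is built on a list, with one element per column
is.list(tb)
#> [1] TRUE
length(tb)
#> [1] 2

# [[ pulls out a single column as a plain vector, just like a list element
tb[["x"]]
#> [1] 1 2 3

# [ keeps the container: the result is a one-column tibble
is_tibble(tb["x"])
#> [1] TRUE
```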

In [None]:
# What does hms::hms(3600) return? How does it print? What primitive type is the augmented vector
# built on top of? What attributes does it use?

# Solution:
x <- hms::hms(3600)
class(x)
#> [1] "hms"      "difftime"
x
#> 01:00:00

typeof(x)
#> [1] "double"

attributes(x)
#> $units
#> [1] "secs"
#> 
#> $class
#> [1] "hms"      "difftime"

In [None]:
# Try and make a tibble that has columns with different lengths. What happens?

# Solution:
#If I try to create a tibble with a scalar and a column of a different length there are no issues, 
#and the scalar is repeated to the length of the longer vector.

tibble(x = 1, y = 1:5)
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <int>
#> 1    1.     1
#> 2    1.     2
#> 3    1.     3
#> 4    1.     4
#> 5    1.     5

#However, if I try to create a tibble with two vectors of different lengths (other than one), the tibble 
#function throws an error.

tibble(x = 1:3, y = 1:4)
#> Error: Column `x` must be length 1 or 4, not 3