In [1]:
library(tidyverse)
install.packages('nycflights13')
library(nycflights13)

remotes::install_github("bradleyboehmke/harrypotter")
install.packages("tidytext")
library(harrypotter)


# Lecture 18: More on functions and iterations
<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will understand:**
- Function scope
- [Functional programming](#Functional-programming) (FP): functions that operate on other functions.
</div>

These notes correspond to Chapter 27 of the book.

In [1]:
library(tidyverse)
# install.packages('nycflights13')
library(nycflights13)

# remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Scope

Scoping refers to how R looks up the value associated with an object referred to by name. There are two types of scoping – lexical and dynamic – but we will concern ourselves only with lexical scoping here. There are four keys to understanding scoping:

- environments
- name masking
- variables vs functions
- dynamic look up

An environment can be thought of as a context in which names are associated with objects. Each time a function is called, it generates a new environment for the computation.

Consider the following examples:

In [2]:
ls()

In [3]:
f1 = function() {
  f1_message = "I'm defined inside of f!"  # `message` is a function in base
  x = 10
  ls()
}
f1()
ls()

In [4]:
exists('f1') # same idea as 'f1' %in% ls()

In [None]:
# what about f1_message?
exists('f1_message')

In [7]:
environment() # here we are in the global environment

<environment: R_GlobalEnv>

In [8]:
f2 = function(){
  # here we are in a local environment created for this call;
  # each call to f2 gets a different one
  environment()
}
f2()

<environment: 0x562fd792cba8>

In [9]:
rm(f1, f2)

In [10]:
exists('f1')

Lexical scoping determines where, and in what order, `R` looks for object names.
When we call `f1` above, `R` first looks for the name `f1` in the current environment, which happens to be the global environment. The call to `ls()`, however, happens within the environment created by the function call and hence returns only the objects defined in that local environment.

When a function is called, its new environment is nested within the environment where the function was defined, referred to as the “parent environment”. When an object is referenced, R first looks in the current environment and then moves up through the parent environments until it finds a value bound to that name.
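
The lookup chain can be sketched with a function defined inside another function. All names here (`a`, `b`, `c_val`, `outer_fn`, `inner_fn`) are hypothetical and chosen only for illustration:

```r
# Hypothetical names, for illustration only.
a <- "global a"

outer_fn <- function() {
  b <- "outer b"
  inner_fn <- function() {
    c_val <- "inner c"
    # c_val is found in inner_fn's own environment,
    # b one level up (outer_fn), and a one level above that.
    paste(c_val, b, a, sep = " | ")
  }
  inner_fn()
}

lookup_result <- outer_fn()
print(lookup_result)
```

Each name is resolved in the nearest environment that binds it, which is why `inner_fn` can see all three values.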

Name masking refers to the notion that objects of the same name can exist in different environments. Consider these examples:



In [24]:
#  Example 3 -- lexical scoping
y = 'Y - I came from outside of f!'
x = 'X - I came from outside of f!'
f3 = function(){
  x =  'I came from inside of f!'
  print(paste("x = ", x, "and y = ", y))
}
f3()

print(paste("outside-x =", x, "and outside-y =", y))

[1] "x =  I came from inside of f! and y =  Y - I came from outside of f!"
[1] "outside-x = X - I came from outside of f! and outside-y = Y - I came from outside of f!"


* `x` is redefined inside the function environment, so that local value is used.
* `y` is not, so R searches for `y` in the parent environment and keeps moving up until it finds it.
* The `x` inside `f3` does not change the `x` in the global environment unless we explicitly write code to do that.

In [23]:
#  Example 4 -- assigning to outside scope

x = 'X - I came from outside of f!'
f4 = function(){
  x <<-  'global scope x value is now changed' ## super assignment operator is used here
  print(paste("x = ", x))
}
f4()
print(paste("outside-x = ", x))

[1] "x =  global scope x value is now changed"
[1] "outside-x =  global scope x value is now changed"


In [18]:
#  Example 5 -- masking
mean = function(x){ 
    sum(x)
}
mean(1:10)

In [19]:
base::mean(1:10)

In [20]:
rm(mean)

In [None]:
mean(1:10)

R also uses dynamic lookup, meaning values are looked up when a function is called, not when it is created. In Example 3 above, `y` was defined in the global environment rather than within the function body, so the result of `f3` depends on the current value of `y` in the global environment. You should generally avoid this, but there are occasions where it can be useful.


In [25]:
# Example 6 -- dynamic lookup
y = "I have been reinvented!"
f3()

[1] "x =  I came from inside of f! and y =  I have been reinvented!"
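
One way to avoid depending on the global environment is to pass the value in as an argument. This is only a sketch; `f3_explicit` is a hypothetical variant of `f3`, not part of the lecture code:

```r
# Hypothetical variant of f3: y is an argument, so the result depends
# only on what the caller passes in, not on the global environment.
f3_explicit <- function(y) {
  x <- "I came from inside of f!"
  paste("x = ", x, "and y = ", y)
}

y <- "I have been reinvented!"   # the global y is ignored by f3_explicit
explicit_result <- f3_explicit("a value I chose")
print(explicit_result)
```

Because `y` is now a parameter, reassigning the global `y` no longer changes what the function produces.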


### Anonymous functions
In the last lesson, we wrote a function, `z_score`, and then applied it to multiple columns using the `across()` verb. Often, we want to apply a relatively simple function that we are only going to use once. In these cases, we can define an "anonymous" function that only exists temporarily. 

In [57]:
mpg %>% summarise(across(where(is.numeric), \(x) median(x, na.rm = TRUE))) %>% print

# A tibble: 1 × 5
  displ  year   cyl   cty   hwy
  <dbl> <dbl> <dbl> <dbl> <dbl>
1   3.3 2004.     6    17    24

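The `\(x)` syntax is shorthand for `function(x)`. As a small sketch (the names `square`, `named_result`, `long_result` and `short_result` are made up for this example), the three calls below should give the same result:

```r
# Three equivalent ways to square each element with sapply().
square <- function(x) x^2                       # named function
named_result <- sapply(1:3, square)
long_result  <- sapply(1:3, function(x) x^2)    # anonymous, long form
short_result <- sapply(1:3, \(x) x^2)           # anonymous, shorthand (R >= 4.1)
print(short_result)
```

The anonymous versions never create a name in the current environment, which is convenient for one-off helpers.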

## `list()`

We have used `list()` before; now we will look at it more closely.

In [31]:
x <- list('a', 1, FALSE, pi, list(1:3))
x

As the printout suggests, you can think of a list as a "vector of vectors". For this reason, they are sometimes referred to as "recursive vectors".

The `str` command will print out the **str**ucture of a vector:

In [32]:
str(x) 

List of 5
 $ : chr "a"
 $ : num 1
 $ : logi FALSE
 $ : num 3.14
 $ :List of 1
  ..$ : int [1:3] 1 2 3


You can name each individual entry of a list:

In [32]:
x_named <- list(a = 1, b = 2, c = 3, 4)
names(x_named)

In [33]:
str(x_named)

List of 4
 $ a: num 1
 $ b: num 2
 $ c: num 3
 $  : num 4


In [34]:
x_named$a

### Subsetting lists
Subsetting lists is a little more complex than subsetting atomic vectors. We will use the following example list:

In [35]:
str(example_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5)))

List of 4
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5


#### `[]`
The `[]` operator extracts a sub-list. That is, the return type will always be a list:

In [41]:
x <- list('a', 1, FALSE, pi, list(1:3))
x[1]

#### `[[]]`
The double-brackets will extract a single component from the list:

In [40]:
x[[1]]
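
The same distinction applies to named lists. A short sketch using `example_list` from above (redefined here so the block is self-contained; the result names are hypothetical):

```r
example_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

sub_list  <- example_list["a"]         # single brackets: a list of length 1
component <- example_list[["a"]]       # double brackets: the integer vector itself
same_comp <- example_list$a            # `$` is shorthand for [[ ]] with a name
nested    <- example_list[["d"]][[2]]  # chain [[ ]] to reach nested elements

str(sub_list)
str(component)
print(nested)
```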

### Data frames are lists
Many data types in R are actually lists plus some additional attributes. For example, tibbles and data frames are both lists:

In [43]:
typeof(mpg)

The `names()` of a tibble/data frame correspond to columns. This means we can use the list indexing methods shown above to access columns:

In [44]:
str(mpg)

tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
 $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
 $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr [1:234] "f" "f" "f" "f" ...
 $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr [1:234] "p" "p" "p" "p" ...
 $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...


In [45]:
names(mpg)
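
To see list indexing on columns without loading `mpg`, here is a sketch on a tiny hypothetical data frame (`toy_df` and its values are invented for illustration):

```r
# Hypothetical toy data frame (not the mpg data), so the sketch is self-contained.
toy_df <- data.frame(model = c("a4", "a4", "tt"), cty = c(18L, 21L, 20L))

col_vec <- toy_df[["cty"]]   # the column as a plain integer vector
col_dol <- toy_df$cty        # the same column via `$`
one_col <- toy_df["cty"]     # single brackets: still a data frame, one column

print(typeof(toy_df))
print(col_vec)
print(class(one_col))
```

As with any list, `[[` and `$` return the component itself, while `[` returns a smaller object of the same kind.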

## `map()` 
In the last lecture, we learned about iteration. An alternative to writing a for loop is to use the `map(seq, f)` function from purrr. This takes a function `f` and "maps" it over each element of a sequence (list or vector) `seq`, returning the results as a list.

![map](https://d33wubrfki0l68.cloudfront.net/f0494d020aa517ae7b1011cea4c4a9f21702df8b/2577b/diagrams/functionals/map.png)

In [62]:
my_list <- c(1:5)

print(my_list)

double_me <- function(x){
    return(x*2)
}

# double the number using map() and double_me function
my_list_doubled <- map(my_list, double_me)

# Print the result
my_list_doubled

[1] 1 2 3 4 5
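
Note that `map()` always returns a list, even when every result is a single number. If a plain vector is preferred, purrr also provides typed variants such as `map_dbl()`. A minimal sketch (the names `as_list` and `as_vector` are made up here):

```r
library(purrr)

double_me <- function(x) x * 2

as_list   <- map(1:5, double_me)      # always returns a list
as_vector <- map_dbl(1:5, double_me)  # returns a double vector instead

str(as_list)
print(as_vector)
```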


## Lists and functional programming
Lists are useful because they can hold a sequence of values of any type, including whole data frames. Let's see an example of combining all the Harry Potter books into one data frame:

First, we need to load the library and get the names of all the books:

In [46]:
hp <- ls('package:harrypotter')
hp

In [50]:
hp[1]

`hp` is a character vector containing the seven titles in the package. To access the text of any one of them, we can write:

In [51]:
getExportedValue('harrypotter', hp[2]) %>% str

 chr [1:37] "The two men appeared out of nowhere, a few yards apart in the narrow, moonlit lane. For a second they stood qui"| __truncated__ ...


In [52]:
for (title in hp) {
    print(title)
}

[1] "chamber_of_secrets"
[1] "deathly_hallows"
[1] "goblet_of_fire"
[1] "half_blood_prince"
[1] "order_of_the_phoenix"
[1] "philosophers_stone"
[1] "prisoner_of_azkaban"


In [66]:
tbl_from_title <- function(title) {
    text <- getExportedValue('harrypotter', title)
    tibble(title=title, text=text)
}

hp %>% map(tbl_from_title) %>% str

List of 7
 $ : tibble [19 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ title: chr [1:19] "chamber_of_secrets" "chamber_of_secrets" "chamber_of_secrets" "chamber_of_secrets" ...
 $ : tibble [37 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ title: chr [1:37] "deathly_hallows" "deathly_hallows" "deathly_hallows" "deathly_hallows" ...
  ..$ text : chr [1:37] "The two men appeared out of nowhere, a few yards apart in the narrow, moonlit lane. For a second they stood qui"| __truncated__ "Harry was bleeding. Clutching his right hand in his left and swearing under his breath, he shouldered open his "| __truncated__ "The sound of the front door slamming echoed up the stairs and a voice roared, \"Oh! You!\"Sixteen years of bein"| __truncated__ "Harry ran back upstairs to his bedroom, arriving at the window just in time to see the Dursleys' car swinging o"| __truncated__ ...
 $ : tibble [37 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ title: chr [1:37] "goblet_of_fire" "goblet_of_fire" "goblet_of_fire" "goblet_of_f

In [54]:
for (item in x){
  print(item)
}

[1] "a"
[1] 1
[1] FALSE
[1] 3.141593
[[1]]
[1] 1 2 3



In [53]:
print(hp[1])
print(hp[2])
print(hp[3])

[1] "chamber_of_secrets"
[1] "deathly_hallows"
[1] "goblet_of_fire"


By tweaking this for loop, we could make it create a dataset of all the chapters in HP:

In [None]:
df <- tibble()
for (title in hp) {
    df <- bind_rows(df, tibble(title=title, text=getExportedValue('harrypotter', title)))
}
df %>% print

# A tibble: 200 × 2
   title              text                                                      
   <chr>              <chr>                                                     
 1 chamber_of_secrets "THE WORST BIRTHDAY　　Not for the first time, an argumen…
 3 chamber_of_secrets "THE BURROW　　Ron.l\" breathed Harry, creeping to the wi…
 4 chamber_of_secrets "AT FL0VRR 11 $ HAND BLOTTS　　ife at the Burrow was as d…
 5 chamber_of_secrets "THE WHOMPING　　WILLOW　　he end of the summer vacation …
 6 chamber_of_secrets "GILDEROY LOCKHART　　he next day, however, Harry barely …
 7 chamber_of_secrets "Harry looked bemusedly at the photograph Colin was brand…
 8 chamber_of_secrets "　　\"What are you talking about, Harry? Perhaps you're …
 9 chamber_of_secrets "THE WRTITING ON THE WALL　　What's g

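Finally, we take the list of data frames produced by `map()` and combine it into one large data frame. Calling `setNames(hp)` first names each element after its title, so that `list_rbind()` can store that name in a `title` column: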

In [70]:
hp %>% setNames(hp)

In [79]:
hp %>% 
    setNames(hp) %>% 
    map(\(title) tibble(text = getExportedValue('harrypotter', title))) %>%
    list_rbind(names_to = "title") %>% print

# A tibble: 200 × 2
   title              text                                                      
   <chr>              <chr>                                                     
 1 chamber_of_secrets "THE WORST BIRTHDAY　　Not for the first time, an argumen…
 3 chamber_of_secrets "THE BURROW　　Ron.l\" breathed Harry, creeping to the wi…
 4 chamber_of_secrets "AT FL0VRR 11 $ HAND BLOTTS　　ife at the Burrow was as d…
 5 chamber_of_secrets "THE WHOMPING　　WILLOW　　he end of the summer vacation …
 6 chamber_of_secrets "GILDEROY LOCKHART　　he next day, however, Harry barely …
 7 chamber_of_secrets "Harry looked bemusedly at the photograph Colin was brand…
 8 chamber_of_secrets "　　\"What are you talking about, Harry? Perhaps you're …
 9 chamber_of_secrets "THE WRTITING ON THE WALL　　What's g

In [None]:
?list_rbind
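
To see what `names_to` does in isolation, here is a sketch on a small hypothetical named list (`pieces`, `book_a` and `book_b` are invented for this example; `list_rbind()` requires purrr 1.0.0 or later):

```r
library(purrr)
library(tibble)

# Hypothetical toy input: a named list of small tibbles.
pieces <- list(
  book_a = tibble(text = c("chapter 1", "chapter 2")),
  book_b = tibble(text = "chapter 1")
)

# names_to stores each element's name in a new column called "title".
combined <- list_rbind(pieces, names_to = "title")
print(combined)
```

Without `names_to`, the rows would still be stacked, but there would be no record of which list element each row came from.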

### Example: raw NCAA data
Let us analyse NCAA data. These data come from a much larger dataset spread across many files. You can load the raw data here:

In [None]:
u <- "https://datasets.stats306.org/ncaa/ncaa_games_2002.csv.gz"  # contains data for 2002-2019
read_csv(u)

Rows: 27708 Columns: 11
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): opponent_name, game_date, location, neutral_site_location, game_length
dbl (6): score, opponent_score, attendence, opponent_id, year, school_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


opponent_name,game_date,score,opponent_score,location,neutral_site_location,game_length,attendence,opponent_id,year,school_id
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Sul Ross St.,11/16/2001,93,59,Home,,,2041,1390,2002,26172
Texas St.,11/25/2001,89,99,Home,,,1493,670,2002,26172
Loyola Chicago,11/29/2001,66,86,Away,,,1128,371,2002,26172
Illinois,12/01/2001,56,80,Away,,,16500,301,2002,26172
Texas,12/05/2001,64,89,Away,,,6099,703,2002,26172
UTEP,12/08/2001,56,82,Away,,,6203,704,2002,26172
Lamar,12/15/2001,67,69,Home,,,1342,346,2002,26172
San Francisco,12/18/2001,80,75,Home,,,1360,629,2002,26172
Denver,12/21/2001,81,79,Home,,,1415,183,2002,26172
Wayland Baptist,12/28/2001,92,83,Home,,,2096,,2002,26172


Let's think about how we could combine these data into one big table for further analysis. First, we'll use a for loop and `bind_rows()`:

In [84]:
# for loop way
link = 'https://datasets.stats306.org/ncaa/ncaa_games_{year}.csv.gz'

tbl = tibble()
for(year in 2002:2004){
  tbl = bind_rows(tbl, read_csv(str_replace(link, '\\{year\\}', as.character(year))))
}

Rows: 27708 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): opponent_name, game_date, location, neutral_site_location, game_length
dbl (6): score, opponent_score, attendence, opponent_id, year, school_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 27253 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): opponent_name, game_date, location, neutral_site_location, game_length
dbl (6): score, opponent_score, attendence, opponent_id, year, school_id

ℹ Use `spec()` to retrieve the full column specification for this data.

In [85]:
tbl %>% glimpse

Rows: 82,510
Columns: 11
$ opponent_name         <chr> "Sul Ross St.", "Texas St.", "Loyola Chicago", "…
$ game_date             <chr> "11/16/2001", "11/25/2001", "11/29/2001", "12/01…
$ score                 <dbl> 93, 89, 66, 56, 64, 56, 67, 80, 81, 92, 102, 69,…
$ opponent_score        <dbl> 59, 99, 86, 80, 89, 82, 69, 75, 79, 83, 98, 82, …
$ location              <chr> "Home", "Home", "Away", "Away", "Away", "Away", …
$ neutral_site_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ game_length           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "2 OT", …
$ attendence            <dbl> 2041, 1493, 1128, 16500, 6099, 6203, 1342, 1360,…
$ opponent_id           <dbl> 1390, 670, 371, 301, 703, 704, 346, 629, 183, NA…
$ year                  <dbl> 2002, 2002, 2002, 2002, 2002, 2002,

Next, we will simplify this using `map()` and `list_rbind()`. First, note that `str_c()` is vectorized, so it can build all of the URLs in a single call:

In [60]:
str_c('a', 1:3, 'c')

In [90]:
# map way

str_c('https://datasets.stats306.org/ncaa/ncaa_games_', 2002:2004, '.csv.gz') %>% 
  map(read_csv) %>% list_rbind %>% glimpse

Rows: 27708 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): opponent_name, game_date, location, neutral_site_location, game_length
dbl (6): score, opponent_score, attendence, opponent_id, year, school_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 27253 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): opponent_name, game_date, location, neutral_site_location, game_length
dbl (6): score, opponent_score, attendence, opponent_id, year, school_id

ℹ Use `spec()` to retrieve the full column specification for this data.

Rows: 82,510
Columns: 11
$ opponent_name         <chr> "Sul Ross St.", "Texas St.", "Loyola Chicago", "…
$ game_date             <chr> "11/16/2001", "11/25/2001", "11/29/2001", "12/01…
$ score                 <dbl> 93, 89, 66, 56, 64, 56, 67, 80, 81, 92, 102, 69,…
$ opponent_score        <dbl> 59, 99, 86, 80, 89, 82, 69, 75, 79, 83, 98, 82, …
$ location              <chr> "Home", "Home", "Away", "Away", "Away", "Away", …
$ neutral_site_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ game_length           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "2 OT", …
$ attendence            <dbl> 2041, 1493, 1128, 16500, 6099, 6203, 1342, 1360,…
$ opponent_id           <dbl> 1390, 670, 371, 301, 703, 704, 346, 629, 183, NA…
$ year                  <dbl> 2002, 2002, 2002, 2002, 2002, 2002,
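
The column-specification messages above are repeated once per file. As the message itself suggests, passing `show_col_types = FALSE` to `read_csv()` quiets them. The sketch below shows this on three tiny hypothetical CSV files written to a temporary directory, so it runs offline; the file names, columns and values are invented and are not the NCAA data:

```r
library(readr)
library(purrr)
library(stringr)

# Write three tiny hypothetical CSV files so the sketch runs offline.
tmp_dir <- tempfile("ncaa_demo_")
dir.create(tmp_dir)
for (yr in 2002:2004) {
  write_csv(
    data.frame(year = yr, score = c(70, 80)),
    file.path(tmp_dir, str_c("games_", yr, ".csv"))
  )
}

# str_c() is vectorized, so this builds all three paths at once.
paths <- file.path(tmp_dir, str_c("games_", 2002:2004, ".csv"))

# The anonymous function lets us pass an extra argument to read_csv().
all_games <- list_rbind(
  map(paths, \(p) read_csv(p, show_col_types = FALSE))
)

print(dim(all_games))
```

The same pattern should carry over to the URLs used above, since `read_csv()` accepts a URL wherever it accepts a path.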

Do you find the `map()` way easier to use? Easier to read? More enjoyable to write? (Hopefully at least one of the three.)