In [1]:
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.0.6     ✔ dplyr   1.0.4
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



- What are the primary and foreign keys in the two tables below?
- Show that they are keys!

In [2]:
# US panda births
us_born_pandas = read_csv("data/us_born_pandas.csv")

# Current pandas in the United States
us_current_pandas = read_csv("data/us_current_pandas.csv")


── Column specification ────────────────────────────────────────────────────────
cols(
  name = col_character(),
  birth_date = col_character(),
  birth_location = col_character()
)



── Column specification ────────────────────────────────────────────────────────
cols(
  name = col_character(),
  location = col_character(),
  sex = col_character()
)




In [3]:
us_born_pandas
us_current_pandas

name,birth_date,birth_location
<chr>,<chr>,<chr>
Hua Mei,8/21/99,San Diego Zoo
Mei Sheng,8/19/03,San Diego Zoo
Su Lin,8/2/05,San Diego Zoo
Yun Zi,8/5/09,San Diego Zoo
Zhen Zhen,8/3/07,San Diego Zoo
Xiao Liwu,7/29/12,San Diego Zoo
Mei Lan,9/6/06,Atlanta Zoo
Xi Lan,8/30/08,Atlanta Zoo
Po,10/3/10,Atlanta Zoo
Mei Lun,7/15/13,Atlanta Zoo


name,location,sex
<chr>,<chr>,<chr>
Le Le,Memphis Zoo,male
Ya Ya,Memphis Zoo,female
Lun Lun,Atlanta Zoo,female
Mei Lan,Atlanta Zoo,male
Yang Yang,Atlanta Zoo,male
Tian Tian,Smithsonian National Zoo,male
Mei Xiang,Smithsonian National Zoo,female
Xiao Qi Ji,Smithsonian National Zoo,male


In [None]:
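# a proof that name is a primary key:
# count the rows per name and keep any name that appears more than once;
# an empty result means every name identifies exactly one row
us_current_pandas %>%
    count(name) %>%
    filter(n > 1)

us_born_pandas %>%
    count(name) %>%
    filter(n > 1)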
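
A variable is a primary key when each of its values identifies exactly one row, so counting by that variable and keeping counts above one should return zero rows. The sketch below shows what a failed check looks like; `toy_pandas` is a made-up tibble for illustration, not one of the course files.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: "Po" appears twice, so name alone is not a key here
toy_pandas <- tibble(
    name = c("Po", "Po", "Mei Lan"),
    zoo  = c("Atlanta Zoo", "Memphis Zoo", "Atlanta Zoo")
)

# Any row returned here is a value that violates uniqueness
dup_name <- toy_pandas %>%
    count(name) %>%
    filter(n > 1)

# The pair (name, zoo) is unique in this toy data, so this returns zero rows
dup_name_zoo <- toy_pandas %>%
    count(name, zoo) %>%
    filter(n > 1)
```

An empty result from the filtered count is what justifies calling the variable (or variable group) a key.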

### Inner Join

Observations are matched whenever the keys are equal; observations whose key has no match in the other table are dropped.

![](https://d33wubrfki0l68.cloudfront.net/3abea0b730526c3f053a3838953c35a0ccbe8980/7f29b/diagrams/join-inner.png)
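
The matching rule can be tried on a tiny example first. This is a minimal sketch with two made-up tibbles, `x` and `y`, whose names and values are assumptions chosen for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables sharing the variable "key"
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Only keys 1 and 2 appear in both tables, so only those rows survive;
# key 3 (only in x) and key 4 (only in y) are dropped
joined <- inner_join(x, y, by = "key")
```

The result has one row per matched key and the columns of both tables, with the key column appearing once.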

Before we inner join the two datasets, answer the following:
- How many observations in the resulting dataset?
- How many variables in the resulting dataset?
- Will there be missing values? If so, where?
- How would you describe the resulting dataset in words?

In [4]:
inner_join(us_current_pandas, us_born_pandas, by = "name")

name,location,sex,birth_date,birth_location
<chr>,<chr>,<chr>,<chr>,<chr>
Mei Lan,Atlanta Zoo,male,9/6/06,Atlanta Zoo
Xiao Qi Ji,Smithsonian National Zoo,male,8/21/20,Smithsonian National Zoo


In [5]:
# if we don't specify a key, inner_join will use all shared variable names
inner_join(us_current_pandas, us_born_pandas)

Joining, by = "name"



name,location,sex,birth_date,birth_location
<chr>,<chr>,<chr>,<chr>,<chr>
Mei Lan,Atlanta Zoo,male,9/6/06,Atlanta Zoo
Xiao Qi Ji,Smithsonian National Zoo,male,8/21/20,Smithsonian National Zoo


In [6]:
# we can also use the pipe
# (the pipe passes the result on its left as the first argument of the next function)
us_current_pandas %>%
    inner_join(us_born_pandas, by = "name")

name,location,sex,birth_date,birth_location
<chr>,<chr>,<chr>,<chr>,<chr>
Mei Lan,Atlanta Zoo,male,9/6/06,Atlanta Zoo
Xiao Qi Ji,Smithsonian National Zoo,male,8/21/20,Smithsonian National Zoo


In [7]:
# what if the datasets share more than one variable?
(us_born_pandas2 = read_csv("data/us_born_pandas2.csv"))


── Column specification ────────────────────────────────────────────────────────
cols(
  name = col_character(),
  birth_date = col_character(),
  birth_location = col_character(),
  sex = col_character()
)




name,birth_date,birth_location,sex
<chr>,<chr>,<chr>,<chr>
Hua Mei,8/21/99,San Diego Zoo,female
Mei Sheng,8/19/03,San Diego Zoo,female
Su Lin,8/2/05,San Diego Zoo,female
Yun Zi,8/5/09,San Diego Zoo,male
Zhen Zhen,8/3/07,San Diego Zoo,female
Xiao Liwu,7/29/12,San Diego Zoo,male
Mei Lan,9/6/06,Atlanta Zoo,male
Xi Lan,8/30/08,Atlanta Zoo,male
Po,10/3/10,Atlanta Zoo,female
Mei Lun,7/15/13,Atlanta Zoo,female


In [8]:
inner_join(us_current_pandas, us_born_pandas2, by = "name")

name,location,sex.x,birth_date,birth_location,sex.y
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Mei Lan,Atlanta Zoo,male,9/6/06,Atlanta Zoo,male
Xiao Qi Ji,Smithsonian National Zoo,male,8/21/20,Smithsonian National Zoo,male


- Since we are joining along only one variable (```name```), ```us_current_pandas$sex``` and ```us_born_pandas2$sex``` are treated as two distinct variables, which is why the output contains both ```sex.x``` and ```sex.y```
- If a variable (or variable group) is a primary key, then so is any variable group containing that variable (or variable group)
 * Think about why this is true!
- Join the two datasets along all shared variables using two different, distinct blocks of code.
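
The ```.x``` and ```.y``` endings are the default value of the ```suffix``` argument of ```inner_join()```. A small sketch on made-up tibbles (```current``` and ```born``` are hypothetical, not the course data) shows how to pick more readable endings when a shared variable is not part of the key.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables that share both "name" and "sex"
current <- tibble(name = c("Mei Lan", "Le Le"), sex = c("male", "male"))
born    <- tibble(name = c("Mei Lan", "Po"),    sex = c("male", "female"))

# Joining on name only keeps both sex columns; suffix controls how they are renamed
by_name <- inner_join(current, born, by = "name", suffix = c("_current", "_born"))
```

Here ```by_name``` has the columns ```name```, ```sex_current```, and ```sex_born```.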

In [9]:
inner_join(us_current_pandas, us_born_pandas2, by = c("name", "sex"))

name,location,sex,birth_date,birth_location
<chr>,<chr>,<chr>,<chr>,<chr>
Mei Lan,Atlanta Zoo,male,9/6/06,Atlanta Zoo
Xiao Qi Ji,Smithsonian National Zoo,male,8/21/20,Smithsonian National Zoo


In [10]:
inner_join(us_current_pandas, us_born_pandas2)

Joining, by = c("name", "sex")



name,location,sex,birth_date,birth_location
<chr>,<chr>,<chr>,<chr>,<chr>
Mei Lan,Atlanta Zoo,male,9/6/06,Atlanta Zoo
Xiao Qi Ji,Smithsonian National Zoo,male,8/21/20,Smithsonian National Zoo


In [11]:
# let's make things a little messier...
(us_current_pandas2 <- rename(us_current_pandas, panda_name = name, sex_of_panda = sex))

panda_name,location,sex_of_panda
<chr>,<chr>,<chr>
Le Le,Memphis Zoo,male
Ya Ya,Memphis Zoo,female
Lun Lun,Atlanta Zoo,female
Mei Lan,Atlanta Zoo,male
Yang Yang,Atlanta Zoo,male
Tian Tian,Smithsonian National Zoo,male
Mei Xiang,Smithsonian National Zoo,female
Xiao Qi Ji,Smithsonian National Zoo,male


In [12]:
# if the names of the variables aren't exactly the same, we have to specify this
inner_join(us_born_pandas2, us_current_pandas2, by = c("name" = "panda_name", "sex" = "sex_of_panda"))

name,birth_date,birth_location,sex,location
<chr>,<chr>,<chr>,<chr>,<chr>
Mei Lan,9/6/06,Atlanta Zoo,male,Atlanta Zoo
Xiao Qi Ji,8/21/20,Smithsonian National Zoo,male,Smithsonian National Zoo


In [None]:
# why doesn't the following code work?!
inner_join(us_born_pandas2, us_current_pandas2, by = c("panda_name" = "name", "sex_of_panda" = "sex"))

Attempt
- Use the ```summarize()``` and ```filter()``` commands on the ```flights``` dataset to get a dataset called ```avg_delay_hour``` containing the mean departure delay each hour of the year for flights whose ```origin``` is JFK.
 * Hint: First filter for flights whose ```origin``` is JFK. The variables in the resulting dataset should be mean_delay, year, month, day, and hour
 * Is there a primary key?
- Use the ```weather``` dataset as a starting point to get a dataset called ```JFK_weather``` representing the weather at every hour at JFK airport.
 * Is there a primary key?
- Think about how you might join ```avg_delay_hour``` and ```JFK_weather``` to a new dataset ```weather_and_delay```
 * What would be a good primary key for weather in this scenario?
 * How many rows are in ```avg_delay_hour```, ```JFK_weather```, and ```weather_and_delay```? What happened? 

In [13]:
avg_delay_hour <- flights %>%
    filter(origin == "JFK") %>%
    group_by(year, month, day, hour) %>%
    summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    print()

`summarise()` has grouped output by 'year', 'month', 'day'. You can override using the `.groups` argument.



# A tibble: 6,935 x 5
# Groups:   year, month, day [365]
    year month   day  hour mean_delay
   <int> <int> <int> <dbl>      <dbl>
 1  2013     1     1     5      0.333
 2  2013     1     1     6     -1.06 
 3  2013     1     1     7      4.12 
 4  2013     1     1     8      1.09 
 5  2013     1     1     9      3.67 
 6  2013     1     1    10     -2    
 7  2013     1     1    11      8.33 
 8  2013     1     1    12      6    
 9  2013     1     1    13     25.9  
10  2013     1     1    14      9.19 
# … with 6,925 more rows


In [15]:
JFK_weather <- weather %>%
    filter(origin == "JFK") %>%
    print(width = Inf)

# A tibble: 8,706 x 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 JFK     2013     1     1     1  39.0  26.1  59.4      260       12.7
 2 JFK     2013     1     1     2  39.0  26.1  59.4      270       11.5
 3 JFK     2013     1     1     3  39.9  27.0  59.5      260       15.0
 4 JFK     2013     1     1     4  39.9  28.0  62.2      250       17.3
 5 JFK     2013     1     1     5  39.0  27.0  61.6      260       15.0
 6 JFK     2013     1     1     6  37.9  27.0  64.3      260       13.8
 7 JFK     2013     1     1     7  39.0  28.0  64.4      260       1

In [18]:
inner_join(JFK_weather, avg_delay_hour, by = c("year", "month", "day", "hour"))

origin,year,month,day,hour,temp,dewp,humid,wind_dir,wind_speed,wind_gust,precip,pressure,visib,time_hour,mean_delay
<chr>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>,<dbl>
JFK,2013,1,1,5,39.02,26.96,61.63,260,14.96014,,0,1012.1,10,2013-01-01 05:00:00,0.3333333
JFK,2013,1,1,6,37.94,26.96,64.29,260,13.80936,,0,1012.6,10,2013-01-01 06:00:00,-1.0625000
JFK,2013,1,1,7,39.02,28.04,64.43,260,13.80936,,0,1012.5,10,2013-01-01 07:00:00,4.1250000
JFK,2013,1,1,8,39.92,26.96,59.50,260,17.26170,,0,1012.6,10,2013-01-01 08:00:00,1.0869565
JFK,2013,1,1,9,39.92,26.96,59.50,260,16.11092,,0,1013.0,10,2013-01-01 09:00:00,3.6666667
JFK,2013,1,1,10,41.00,28.04,59.65,260,16.11092,,0,1012.8,10,2013-01-01 10:00:00,-2.0000000
JFK,2013,1,1,11,41.00,26.96,57.06,270,14.96014,,0,1011.7,10,2013-01-01 11:00:00,8.3333333
JFK,2013,1,1,13,37.94,26.60,64.70,340,14.96014,,0,,10,2013-01-01 13:00:00,25.9166667
JFK,2013,1,1,14,39.02,24.08,54.68,310,11.50780,,0,1011.2,10,2013-01-01 14:00:00,9.1875000
JFK,2013,1,1,15,39.02,23.00,52.26,290,12.65858,,0,1011.7,10,2013-01-01 15:00:00,6.2692308
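
The printed rows jump from hour 11 to hour 13, even though ```avg_delay_hour``` has a row for hour 12. One way to find rows that an inner join dropped is ```anti_join()```, which keeps the rows of the first table that have no match in the second. The sketch below uses made-up tibbles (```toy_delay``` and ```toy_weather``` are hypothetical stand-ins, with invented values) rather than the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for avg_delay_hour and JFK_weather
toy_delay <- tibble(
    year = 2013, month = 1, day = 1,
    hour = c(11, 12, 13),
    mean_delay = c(8, 6, 26)
)
toy_weather <- tibble(
    year = 2013, month = 1, day = 1,
    hour = c(11, 13),          # no weather observation for hour 12
    temp = c(41, 38)
)

keys <- c("year", "month", "day", "hour")

# Rows present in both tables
matched <- inner_join(toy_weather, toy_delay, by = keys)

# Delay rows with no matching weather row
unmatched <- anti_join(toy_delay, toy_weather, by = keys)
```

Running the same ```anti_join()``` on the real tables, in both directions, would be one way to account for the difference in row counts.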
