# Lecture 09: Relational Data
<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Learn about [keys](#Keys)
* Different types of [relations](#Relations) between tables
* [Commands for joining related tables together](#Outer-joins)
</div>


We have already spent a lot of time analyzing the `flights` table. In fact, there are four other tables in `nycflights13` that contain related information about these flights:

In [2]:
install.packages('nycflights13')
library(tidyverse)
library(nycflights13)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [None]:
print(airlines)

# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         


In [None]:
airports %>% filter(faa == "DTW")

  faa name                   lat      lon       alt tz dst tzone           
1 DTW Detroit Metro Wayne Co 42.21244 -83.35339 645 -5 A   America/New_York

In [None]:
print(planes)

# A tibble: 3,322 x 9
   tailnum  year type          manufacturer   model  engines seats speed engine 
   <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>  
 1 N10156   2004 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 2 N102UW   1998 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 3 N103US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 4 N104UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 5 N10575   2002 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 6 N105UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 7 N107US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 8 N108UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 9 N109UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
10 N110UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
# … wi

In [None]:
print(weather)

# A tibble: 26,115 x 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
#   pressure <dbl>, visib <dbl>,

In [None]:
weather$origin %>% unique

Together with `flights`, these four tables form a *relational database*. The relationships between the tables can be drawn as follows:
![table relationships](http://r4ds.had.co.nz/diagrams/relational-nycflights.png)

The particular relationships in this database are:
- `flights` connects to `planes` via `tailnum`.
- `flights` connects to `airlines` via `carrier`.
- `flights` connects to `airports` twice: via `origin` and `dest`.
- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour`.

## Keys
The "key" to understanding relational databases is... keys.



### Primary Key
A *primary key* is a variable (or set of variables) that uniquely identifies an observation in its own table: there is **at most** one row in the table for any combination of values of the key columns.

When more than one variable is needed, the key is called a *compound key*.
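As a small illustration of the difference between a single-column key and a compound key, here is a minimal sketch on a made-up table. The `grades` table and its column names are hypothetical and are not part of `nycflights13`.

```r
library(dplyr)

# Hypothetical toy table: one row per (student, course) pair.
grades <- tribble(
  ~student, ~course, ~grade,
  "ann",    "math",  90,
  "ann",    "art",   85,
  "bob",    "math",  70
)

# `student` alone repeats ("ann" appears twice), so it is not a primary key.
dup_student <- grades %>% count(student) %>% filter(n > 1)

# No (student, course) pair repeats, so the two columns form a compound key.
dup_pair <- grades %>% count(student, course) %>% filter(n > 1)

nrow(dup_student)  # 1 duplicated value
nrow(dup_pair)     # 0 duplicated combinations
```

An empty result from `count(...) %>% filter(n > 1)` is what we expect for a valid key.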

In the `planes` table, each airplane is identified by its `tailnum`:

In [None]:
print(planes)

# A tibble: 3,322 x 9
   tailnum  year type          manufacturer   model  engines seats speed engine 
   <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>  
 1 N10156   2004 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 2 N102UW   1998 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 3 N103US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 4 N104UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 5 N10575   2002 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 6 N105UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 7 N107US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 8 N108UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 9 N109UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
10 N110UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
# … wi

The tail number of an airplane is assigned by a government agency and is unique: no two planes can have the same tail number. Thus, `tailnum` should be a primary key in this table.

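To check that one or more variables constitute a primary key, we can count the distinct combinations of their values and check that this equals the number of rows in the data set. Equivalently, we can count the rows for each combination and confirm that no combination occurs more than once: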

In [31]:
# planes %>% print
planes %>% summarize(n = n(), nd = n_distinct(tailnum))

n,nd
<int>,<int>
3322,3322


In [33]:
planes %>% count(tailnum) %>% filter(n > 1)

tailnum,n
<chr>,<int>


Compare with `flights`, where `tailnum` does *not* uniquely identify each row. (There are many flights present for the same airplane.)

In [None]:
flights %>% count(tailnum) %>% filter(n > 1) %>% print

# A tibble: 3,873 × 2
   tailnum     n
   <chr>   <int>
 1 D942DN      4
 2 N0EGMQ    371
 3 N10156    153
 4 N102UW     48
 5 N103US     46
 6 N104UW     47
 7 N10575    289
 8 N105UW     45
 9 N107US     41
10 N108UW     60
# ℹ 3,863 more rows


What is the primary key for the `flights` table?

In [34]:
flights %>% count(year, month, day, dep_time, tailnum) %>% filter(n > 1)

year,month,day,dep_time,tailnum,n
<int>,<int>,<int>,<int>,<chr>,<int>
2013,1,2,,N10575,2
2013,1,2,,,2
2013,1,3,,,2
2013,1,4,,,2
2013,1,9,,,2
2013,1,10,,,2
2013,1,12,,,2
2013,1,13,,,7
2013,1,15,,,2
2013,1,16,,N12195,2


We might guess that `year`, `month`, `day`, and `tailnum` are sufficient to identify each row in `flights`, but this is not true:

In [35]:
flights %>% summarize(n = n(), nd = n_distinct(year, month, day, tailnum))

n,nd
<int>,<int>
336776,251727


In fact, even restricting to the exact *minute* that an airplane departed is not sufficient:

In [36]:
flights %>% summarize(n=n(), nd=n_distinct(tailnum, time_hour, minute))

n,nd
<int>,<int>
336776,336367


In [37]:
flights %>% summarize(n=n(), nd = n_distinct(tailnum, year, month, day, hour, minute))

n,nd
<int>,<int>
336776,336367


This says that there are certain airplanes that are marked as having departed more than once in the same year, month, day, hour and minute. We can inspect these rows as follows:

In [None]:
count(flights, tailnum, time_hour, minute) %>% filter(n > 1) %>% print

# A tibble: 298 × 4
   tailnum time_hour           minute     n
   <chr>   <dttm>               <dbl> <int>
 1 N11119  2013-06-10 16:00:00     55     2
 2 N11192  2013-08-26 08:00:00     30     2
 3 N12563  2013-02-04 16:00:00     19     2
 4 N12564  2013-01-13 20:00:00      0     2
 5 N12900  2013-07-10 21:00:00     29     2
 6 N13969  2013-01-28 07:00:00     59     2
 7 N14148  2013-03-12 06:00:00     30     2
 8 N14558  2013-04-19 13:00:00     29     2
 9 N14916  2013-02-11 13:00:00     15     2
10 N14974  2013-07-26 06:00:00     30     2
# ℹ 288 more rows


In [None]:
count(flights, tailnum, year, month, day, hour, minute) %>% filter(n > 1) %>% print

# A tibble: 298 × 7
   tailnum  year month   day  hour minute     n
   <chr>   <int> <int> <int> <dbl>  <dbl> <int>
 1 N11119   2013     6    10    16     55     2
 2 N11192   2013     8    26     8     30     2
 3 N12563   2013     2     4    16     19     2
 4 N12564   2013     1    13    20      0     2
 5 N12900   2013     7    10    21     29     2
 6 N13969   2013     1    28     7     59     2
 7 N14148   2013     3    12     6     30     2
 8 N14558   2013     4    19    13     29     2
 9 N14916   2013     2    11    13     15     2
10 N14974   2013     7    26     6     30     2
# ℹ 288 more rows


These likely indicate data entry errors.

## 🤔 Quiz

What column(s) constitute a primary key in the `mpg` table?

<ol style="list-style-type: upper-alpha;">
    <li><code>manufacturer</code> and <code>model</code></li>
    <li><code>manufacturer</code>, <code>year</code>, and <code>model</code></li>
    <li><code>manufacturer</code>, <code>year</code>, <code>displ</code>, and <code>model</code></li>
    <li><code>manufacturer</code>, <code>year</code>, <code>displ</code>, <code>trans</code>, and <code>model</code></li>
    <li>None of these</li>
</ol>



In [42]:
# primary key in mpg
mpg %>% count(manufacturer, model, year, displ, trans) %>% filter(n > 1)

manufacturer,model,year,displ,trans,n
<chr>,<chr>,<int>,<dbl>,<chr>,<int>
chevrolet,c1500 suburban 2wd,2008,5.3,auto(l4),3
chevrolet,k1500 tahoe 4wd,2008,5.3,auto(l4),2
dodge,caravan 2wd,1999,3.3,auto(l4),2
dodge,caravan 2wd,1999,3.8,auto(l4),2
dodge,caravan 2wd,2008,3.3,auto(l4),3
dodge,dakota pickup 4wd,2008,4.7,auto(l5),3
dodge,durango 4wd,2008,4.7,auto(l5),3
dodge,ram 1500 pickup 4wd,2008,4.7,auto(l5),3
dodge,ram 1500 pickup 4wd,2008,4.7,manual(m6),3
ford,explorer 4wd,1999,4.0,auto(l5),2


## Relations
A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:
* `flights$tailnum` is a foreign key that corresponds to the primary key `planes$tailnum`.
* `flights$carrier` is a foreign key that corresponds to the primary key `airlines$carrier`.
* `flights$origin` is a foreign key that corresponds to the primary key `airports$faa`.
* `flights$dest` is a foreign key that corresponds to the primary key `airports$faa`.
  
A **primary key** and the corresponding **foreign key** in another table form a *relation*. Relations come in several forms:
- *One-to-many* (the most common): For example, each flight has one plane, but each plane has many flights.
- *Many-to-many*: For example, each airline flies to many airports; each airport hosts many airlines.
- *One-to-one*: Each row in one table corresponds uniquely to a row in a second table. This is relatively uncommon because you could just as easily combine the two tables into one.
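To make the idea of a foreign key concrete, the following sketch checks whether every foreign-key value has a matching primary key. The `fleet` and `trips` tables are hypothetical stand-ins for `planes` and `flights`; the approach shown is one simple option, not the only way to do it.

```r
library(dplyr)

# Hypothetical toy tables, not part of nycflights13.
# `fleet$tail` plays the role of a primary key, `trips$tail` of a foreign key.
fleet <- tribble(
  ~tail, ~seats,
  "N1",  55,
  "N2",  182
)
trips <- tribble(
  ~trip, ~tail,
  1,     "N1",
  2,     "N1",
  3,     "N2",
  4,     "N9"   # no such plane in `fleet`
)

# One-to-many: the same plane may appear in several trips.
trips_per_plane <- trips %>% count(tail)

# Foreign-key values with no matching primary key in `fleet`.
orphans <- trips %>% filter(!tail %in% fleet$tail)

print(trips_per_plane)
print(orphans)
```

Rows like the one in `orphans` are worth knowing about before joining, because they will have no match in the other table.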

In [3]:
x = tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y = tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

In [47]:
x
y

key,val_x
<dbl>,<chr>
1,x1
2,x2
3,x3


key,val_y
<dbl>,<chr>
1,y1
2,y2
4,y3


## Joins
Joins are the way we combine or "merge" two data tables based on keys.
To understand how joins work we'll study these two simple tables:

![simple tables](http://r4ds.had.co.nz/diagrams/join-setup.png)

### Inner joins
Inner joins match a pair of observations whenever their keys are equal:
![match example](https://r4ds.hadley.nz/diagrams/join/inner.png)

In [4]:
# older way of joining
x %>% inner_join(y, by = "key")

key,val_x,val_y
<dbl>,<chr>,<chr>
1,x1,y1
2,x2,y2


In [5]:
# using join_by is preferred; column names are usually written unquoted
x %>% inner_join(y, join_by(key))

key,val_x,val_y
<dbl>,<chr>,<chr>
1,x1,y1
2,x2,y2


In [6]:
# equivalent: state explicitly which column of x must equal which column of y
x %>% inner_join(y, join_by(key == key))

key,val_x,val_y
<dbl>,<chr>,<chr>
1,x1,y1
2,x2,y2


Note that there is no row for `key=3` or `key=4`: with an inner join, unmatched rows are not included in the result. For this reason, inner joins are used less often in data analysis, since it is easy to lose observations without noticing.

#### Differences between `by` and `join_by`
* `by = "x"` corresponds to `join_by(x)`.
* `by = c("a" = "x")` corresponds to `join_by(a == x)`.
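The correspondence above can be checked on a pair of tables whose key columns have different names. This is a minimal sketch assuming dplyr 1.1 or later (where `join_by()` is available); the tables `a` and `b` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables whose key columns are named differently.
a <- tribble(
  ~id, ~val_a,
    1, "a1",
    2, "a2"
)
b <- tribble(
  ~code, ~val_b,
      1, "b1",
      3, "b3"
)

# Older style: a named character vector maps a$id to b$code.
old <- inner_join(a, b, by = c("id" = "code"))

# Newer style: the same mapping written as an expression.
new <- inner_join(a, b, join_by(id == code))

# Both results keep the key under the name used in the left table (`id`).
print(old)
print(new)
```

In both cases only the row with `id` equal to 1 survives, since that is the only key value present in both tables.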


### Outer joins
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
- A left join keeps all observations in `x`.
- A right join keeps all observations in `y`.
- A full join keeps all observations in both `x` and `y`.

![match example](https://r4ds.hadley.nz/diagrams/join/left.png)

![match example](https://r4ds.hadley.nz/diagrams/join/right.png)

![match example](https://r4ds.hadley.nz/diagrams/join/full.png)
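The left and right joins pictured above can be tried on the same toy tables `x` and `y`. The sketch below redefines them so that it runs on its own, and assumes dplyr 1.1 or later for `join_by()`.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# Left join: every row of x is kept; key 3 has no match, so val_y is NA there.
left <- x %>% left_join(y, join_by(key))

# Right join: every row of y is kept; key 4 has no match, so val_x is NA there.
right <- x %>% right_join(y, join_by(key))

print(left)
print(right)
```

Unmatched rows are filled with `NA` rather than dropped, which is the main difference from an inner join.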

In [57]:
x %>% full_join(y, join_by(key))

key,val_x,val_y
<dbl>,<chr>,<chr>
1,x1,y1
2,x2,y2
3,x3,
4,,y3


In [53]:
x
y

key,val_x
<dbl>,<chr>
1,x1
2,x2
3,x3


key,val_y
<dbl>,<chr>
1,y1
2,y2
4,y3


Left joins are the most common. Use them to look up additional data in another table while preserving all of your original observations, including rows of the left table that have no match in the other table.

### Example
The `flights` table has a `carrier` column which is a two-letter code for the airline. The `airlines` table maps these codes to recognizable airline names.

How many flights are there per carrier name (full name of the carrier) in the dataset?

In [62]:
# flights per carrier

flights %>%
  count(carrier) %>%
  left_join(airlines, join_by(carrier)) %>%
  select(name, n)

name,n
<chr>,<int>
Endeavor Air Inc.,18460
American Airlines Inc.,32729
Alaska Airlines Inc.,714
JetBlue Airways,54635
Delta Air Lines Inc.,48110
ExpressJet Airlines Inc.,54173
Frontier Airlines Inc.,685
AirTran Airways Corporation,3260
Hawaiian Airlines Inc.,342
Envoy Air,26397


## 🤔 Quiz

How many flights departing from LGA were operated by JetBlue Airways?

<ol style="list-style-type: upper-alpha;">
    <li>46087</li>
    <li>26397</li>
    <li>2276</li>
    <li>6002</li>
    <li>43939</li>
</ol>

In [66]:
# jetblue from LGA
flights %>%
  filter(origin == 'LGA') %>%
  left_join(airlines) %>%
  filter(name == 'JetBlue Airways') %>%
  nrow

Joining with `by = join_by(carrier)`


### Duplicate foreign keys
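A primary key is unique within its own table, but a foreign key need not be: the same key value can appear in many rows of the table that refers to it.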

##### One table has duplicate keys
This is useful when you want to add in additional information as there is typically a one-to-many relationship.
![duplicate keys](http://r4ds.had.co.nz/diagrams/join-one-to-many.png)

In fact, we already saw an example of this above when we joined `airlines` to `flights`: each airline operates potentially many flights.
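A one-to-many lookup can be sketched on made-up data as follows. The `carriers` and `legs` tables are hypothetical, and the `relationship` argument assumes dplyr 1.1.0 or later; it is optional and only asks dplyr to raise an error if the key turns out not to be unique in the lookup table.

```r
library(dplyr)

# Hypothetical lookup table: `code` is a primary key (one row per carrier).
carriers <- tribble(
  ~code, ~name,
  "AA",  "Airline A",
  "BB",  "Airline B"
)

# Hypothetical fact table: `code` is a foreign key and repeats.
legs <- tribble(
  ~leg, ~code,
     1, "AA",
     2, "AA",
     3, "BB"
)

# Each leg picks up the name of its carrier; the carrier's row is reused
# for every leg that refers to it, so the row count of `legs` is unchanged.
joined <- legs %>%
  left_join(carriers, join_by(code), relationship = "many-to-one")

print(joined)
```

Because the key is unique on the lookup side, the join adds columns without adding rows.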

### Exercise
What is the most common model of airplane used by each carrier?

In [72]:
# most common model

flights %>%
  inner_join(planes, join_by(tailnum)) %>%
  group_by(carrier, model) %>%
  summarise(n = n()) %>%
  top_n(1)

`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.
Selecting by n


carrier,model,n
<chr>,<chr>,<int>
9E,CL-600-2D24,10580
AA,767-223,4257
AS,737-890,346
B6,A320-232,34063
DL,MD-88,10191
EV,EMB-145LR,28027
F9,A320-214,617
FL,717-200,2774
HA,A330-243,342
MQ,G1159B,486


##### Both tables have duplicate keys
This represents a many-to-many join and is usually an error since the key does not uniquely identify observations in either table. Joining duplicated keys results in the Cartesian product of all the possible matches:
![cartesian](http://r4ds.had.co.nz/diagrams/join-many-to-many.png)

Be careful when doing many-to-many merges. It's possible to generate huge tables by accident and crash R.
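The row blow-up is easy to see on a small example. In this sketch the tables `x2` and `y2` are hypothetical, and the `relationship` argument assumes dplyr 1.1.0 or later; it is used here only to state that the many-to-many match is intentional.

```r
library(dplyr)

# Hypothetical toy tables in which key 2 is duplicated on both sides.
x2 <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y2 <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)

# Every x row with key 2 is paired with every y row with key 2.
out <- inner_join(x2, y2, join_by(key), relationship = "many-to-many")

nrow(out)                     # 1 + 2 * 2 + 1 = 6 rows
out %>% filter(key == 2)      # the 4 pairings for the duplicated key
```

With two duplicates on each side the effect is mild, but with thousands of duplicates per key the product grows very quickly.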

## 🤔 Quiz
To understand what the weather conditions were when each flight departed, I will join the weather table to the first 100 rows of the flights table using the command

```
flights %>% slice(1:100) %>% left_join(weather, by=c("origin", "hour"))
```

How many rows does the resulting table have?

<ol style="list-style-type: upper-alpha;">
    <li>100</li>
    <li>36344</li>
    <li>336776</li>
    <li>Zero</li>
    <li><code>NA</code></li>
</ol>

In [73]:
flights %>% slice(1:100) %>% left_join(weather, by=c("origin", "hour")) %>% nrow

“Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 8708 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =


### Defining the key columns
When we do a join using `left_join()` without specifying the key, dplyr takes as the key whatever column names the two tables have in common:

In [None]:
left_join(flights, planes) %>% print

Joining, by = c("year", "tailnum")


# A tibble: 336,776 x 26
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# … with 336,766 more rows,

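This is called a *natural join*. Notice that here the natural join matched on `year` as well as `tailnum`, even though `year` is the year of the flight in `flights` and the year of manufacture in `planes`, so it is safer to state the key explicitly, for example with `join_by(tailnum)`. If the key column(s) are named differently in the two tables, we must specify the mapping between them through the `by =` argument, either as a named character vector or with `join_by()`.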
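The following sketch shows how a natural join can pick up more columns than intended. The `rides` and `vehicles` tables are hypothetical; `year` deliberately means something different in each, and the example assumes dplyr 1.1 or later for `join_by()`.

```r
library(dplyr)

# Hypothetical toy tables that share two column names: `id` and `year`.
rides <- tribble(
  ~id, ~year, ~dist,
  "A",  2013,   100,
  "B",  2013,   200
)
vehicles <- tribble(
  ~id, ~year, ~seats,   # here `year` is the year the vehicle was built
  "A",  2004,     50,
  "B",  2013,    180
)

# Natural join: matches on both `id` and `year`, so vehicle "A" is not found.
natural <- left_join(rides, vehicles)

# Explicit key: matches on `id` only; the two `year` columns are kept apart
# as `year.x` (from rides) and `year.y` (from vehicles).
explicit <- left_join(rides, vehicles, join_by(id))

print(natural)
print(explicit)
```

The message printed by the natural join lists the columns it used, which is a quick way to spot this kind of accidental match.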


Consider joining `airports` to `flights`:
```
> left_join(flights, airports)
Error: `by` required, because the data sources have no common variables
Traceback:

1. left_join(flights, airports)
2. left_join.tbl_df(flights, airports)
3. common_by(by, x, y)
4. common_by.NULL(by, x, y)
5. bad_args("by", "required, because the data sources have no common variables")
6. glubort(fmt_args(args), ..., .envir = .envir)
7. .abort(text)
```

This produces an error because `airports` and `flights` do not have any columns in common. Indeed, the three-character FAA code is called `faa` in `airports`, but appears as either `origin` or `dest` in `flights`. To fix the error, we must specify which of `origin` or `dest` should be matched:

In [None]:
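# join_by: match each flight's destination code to the airport's FAA code
flights %>% left_join(airports, join_by(dest == faa)) %>% print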

## 🤔 Quiz

How many flights were bound to the Hawaii timezone in this dataset?

<ol style="list-style-type: upper-alpha;">
    <li>560</li>
    <li>707</li>
    <li>710</li>
    <li>659</li>
    <li>500</li>
</ol>


In [None]:
# your query

## Filtering joins

Filtering joins allow us to filter rows on one table based on their presence or absence in another table. We've already seen some examples of this on the problem sets:

In [None]:
dest_top6 <- count(flights, dest) %>% top_n(6)
filter(flights, dest %in% dest_top6$dest) %>% nrow

Selecting by n


### Semi-join
`semi_join(x, y)` keeps all the observations in `x` that have a match in `y`. Unlike an inner join, it never adds columns from `y` and never duplicates rows of `x`.
![semi join](http://r4ds.had.co.nz/diagrams/join-semi.png)
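On the toy tables `x` and `y`, a semi-join can be sketched as follows. The tables are redefined so the block runs on its own, and `join_by()` assumes dplyr 1.1 or later.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# Keep the rows of x whose key also occurs in y; only x's columns are returned.
kept <- x %>% semi_join(y, join_by(key))

# For a single key column, this matches the filter(... %in% ...) idiom.
same <- x %>% filter(key %in% y$key)

print(kept)
print(same)
```

The semi-join form becomes more convenient than `%in%` when the key consists of several columns.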

In [None]:
# find the total count of flights that have plane information
