# Lecture 09: Relational Data
<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Learn about [keys](#Keys)
* Different types of [relations](#Relations) between tables
* [Commands for joining related tables together](#Outer-joins)
</div>


We have already spent a lot of time analyzing the `flights` table. In fact, there are four other tables in `nycflights13` that contain related information about these flights:

In [37]:
# install.packages('nycflights13')
library(tidyverse)
library(nycflights13)

In [None]:
print(airlines)

# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         


In [38]:
airports %>% filter(faa == "DTW")

faa,name,lat,lon,alt,tz,dst,tzone
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
DTW,Detroit Metro Wayne Co,42.21244,-83.35339,645,-5,A,America/New_York


In [39]:
print(planes)

# A tibble: 3,322 × 9
   tailnum  year type                   manuf…¹ model engines seats speed engine
   <chr>   <int> <chr>                  <chr>   <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi engi… EMBRAER EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi engi… EMBRAER EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi engi… AIRBUS… A320…       2   18

In [40]:
print(weather)

# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_g…¹
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>    <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4        NA
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06       NA
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5        NA
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7        NA
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7        NA
 6 EWR     2013     1     

Together with `flights`, these four tables form a *relational database*. The relationships can be graphed like so:
![table relationships](http://r4ds.had.co.nz/diagrams/relational-nycflights.png)

The particular relationships in this database are:
- `flights` connects to `planes` via `tailnum`.
- `flights` connects to `airlines` via `carrier`.
- `flights` connects to `airports` twice: via `origin` and `dest`.
- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour`.

## Keys
The "key" to understanding relational databases is... keys. 



### Primary Key
A *primary key* is a variable (or set of variables) that uniquely identifies an observation in its own table: there is **at most** one row in the table that corresponds to any setting of the columns which comprise the key.

In the `planes` table, each airplane is identified by its `tailnum`:

In [None]:
print(planes)

# A tibble: 3,322 x 9
   tailnum  year type          manufacturer   model  engines seats speed engine 
   <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>  
 1 N10156   2004 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 2 N102UW   1998 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 3 N103US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 4 N104UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 5 N10575   2002 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 6 N105UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 7 N107US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 8 N108UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 9 N109UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
10 N110UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
# … wi

The tail number of an airplane is assigned by a government agency and is unique: no two planes can have the same tail number. Thus, `tailnum` should be a primary key in this table. 

To check that one or more variables constitute a primary key, we can count the distinct combinations of values of those variables and check that this count equals the number of rows in the data set:

In [41]:
# planes %>% print
planes %>% summarize(n = n(), nd = n_distinct(tailnum))

n,nd
<int>,<int>
3322,3322
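
The same check can also be phrased as a search for key values that occur more than once, which has the advantage of showing which values repeat. A minimal sketch on a made-up table (`toy` and its columns `id` and `grp` are hypothetical and not part of `nycflights13`):

```r
library(dplyr)

# Hypothetical table: `id` is unique, `grp` is not
toy <- tibble(id = c(1, 2, 3, 4), grp = c("a", "a", "b", "b"))

# No rows returned: no value of `id` repeats, so `id` can serve as a primary key
dup_id <- toy %>% count(id) %>% filter(n > 1)

# Two rows returned ("a" and "b" each repeat), so `grp` cannot be a primary key
dup_grp <- toy %>% count(grp) %>% filter(n > 1)
```

An empty result means the candidate key is unique; a non-empty result lists the offending values together with how often they occur.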


Compare with `flights`, where `tailnum` does *not* uniquely identify each row. (There are many flights present for the same airplane.)

In [43]:
count(flights, tailnum) %>% print

# A tibble: 4,044 × 2
   tailnum     n
   <chr>   <int>
 1 D942DN      4
 2 N0EGMQ    371
 3 N10156    153
 4 N102UW     48
 5 N103US     46
 6 N104UW     47
 7 N10575    289
 8 N105UW     45
 9 N107US     41
10 N108UW     60
# … with 4,034 more rows


What is the primary key for the `flights` table?

In [47]:
flights %>% 
  filter(!is.na(tailnum)) %>% 
  count(year, month, day, dep_time, tailnum) %>%
  filter(n > 1, !is.na(dep_time)) %>% print

# A tibble: 0 × 6
# … with 6 variables: year <int>, month <int>, day <int>, dep_time <int>,
#   tailnum <chr>, n <int>


We might guess that `year`, `month`, `day`, and `tailnum` are sufficient to identify each row in `flights`, but this is not true:

In [48]:
flights %>% summarize(n=n(), nd=n_distinct(year, month, day, tailnum))

n,nd
<int>,<int>
336776,251727


In fact, even restricting to the exact scheduled departure hour and *minute* (`time_hour` and `minute`) of an airplane is not sufficient:

In [49]:
flights %>% summarize(n=n(), nd=n_distinct(tailnum, time_hour, minute))

n,nd
<int>,<int>
336776,336367


This says that certain airplanes are recorded as scheduled to depart more than once in the same year, month, day, hour and minute (`time_hour` and `minute` are derived from the scheduled departure time). We can inspect these rows as follows:

In [50]:
group_by(flights, tailnum, time_hour, minute) %>% 
  count %>% 
  filter(n > 1) %>% 
  arrange(tailnum, time_hour) %>% print

# A tibble: 298 × 4
# Groups:   tailnum, time_hour, minute [298]
   tailnum time_hour           minute     n
   <chr>   <dttm>               <dbl> <int>
 1 N11119  2013-06-10 16:00:00     55     2
 2 N11192  2013-08-26 08:00:00     30     2
 3 N12563  2013-02-04 16:00:00     19     2
 4 N12564  2013-01-13 20:00:00      0     2
 5 N12900  2013-07-10 21:00:00     29     2
 6 N13969  2013-01-28 07:00:00     59     2
 7 N14148  2013-03-12 06:00:00     30     2
 8 N14558  2013-04-19 13:00:00     29     2
 9 N14916  2013-02-11 13:00:00     15     2
10 N14974  2013-07-26 06:00:00     30     2
# … with 288 more rows


These likely indicate data entry errors.

## 🤔 Quiz

What column(s) constitute a primary key in the `mpg` table?

<ol style="list-style-type: upper-alpha;">
    <li><code>manufacturer</code> and <code>model</code></li>
    <li><code>manufacturer</code>, <code>year</code>, and <code>model</code></li>
    <li><code>manufacturer</code>, <code>year</code>, <code>displ</code>, and <code>model</code></li>
    <li><code>manufacturer</code>, <code>year</code>, <code>displ</code>, <code>trans</code>, and <code>model</code></li>
    <li>None of these</li>
</ol>



In [57]:
# primary key in mpg 
count(mpg, across(everything())) %>% 
  filter(n > 1) %>% print

# A tibble: 9 × 12
  manufacturer model displ  year   cyl trans drv     cty   hwy fl    class     n
  <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
1 chevrolet    c150…   5.3  2008     8 auto… r        14    20 r     suv       2
2 dodge        cara…   3.3  1999     6 auto… f        16    22 r     mini…     2
3 dodge        cara…   3.3  2008     6 auto… f        17    24 r     mini…     2
4 dodge        dako…   4.7  2008     8 auto… 4        14    19 r     pick…     2
5 dodge        dura…   4.7  2008     8 auto… 4        13    17 r     suv       2
6 dodge        ram …   4.7  2008     8 auto… 4        13    17

## Relations
A *foreign key* is a variable (or set of variables) in one table that refers to the primary key of another table; for example, `tailnum` in `flights` is a foreign key that refers to the primary key `tailnum` of `planes`. A primary key and the corresponding foreign key in another table form a *relation*. Relations come in several forms:
- *One-to-many* (the most common): for example, each flight has one plane, but each plane has many flights.
- *Many-to-many*: for example, each airline flies to many airports; each airport hosts many airlines.
- *One-to-one*: each row in one table corresponds uniquely to a row in a second table. This is relatively uncommon because you could just as easily combine the two tables into one.

In [58]:
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

In [59]:
x
y

key,val_x
<dbl>,<chr>
1,x1
2,x2
3,x3


key,val_y
<dbl>,<chr>
1,y1
2,y2
4,y3


## Joins
Joins are the way we combine or "merge" two data tables based on keys.
To understand how joins work we'll study these two simple tables:
![simple tables](http://r4ds.had.co.nz/diagrams/join-setup.png)

### Inner joins
Inner joins match a pair of observations whenever their keys are equal:
![match example](https://r4ds.hadley.nz/diagrams/join/inner.png)

In [62]:
x %>% inner_join(y, by = "key")

key,val_x,val_y
<dbl>,<chr>,<chr>
1,x1,y1
2,x2,y2


Note that there is no row for `key=3` or `key=4`: with an inner join, unmatched rows are not included in the result. For this reason, inner joins are used less often in data analysis: it is easy to lose observations without noticing.

### Outer joins
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
- A left join keeps all observations in x.
- A right join keeps all observations in y.
- A full join keeps all observations in x and y.

![match example](https://r4ds.hadley.nz/diagrams/join/left.png)

![match example](https://r4ds.hadley.nz/diagrams/join/right.png)

![match example](https://r4ds.hadley.nz/diagrams/join/full.png)

Left joins are the most common. Use them to look up additional data in another table while preserving every original observation, including rows of the left table that have no match in the other table.
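
The three outer joins can be compared directly on the small `x` and `y` tables from above. This is a minimal sketch that redefines the two tables so it runs on its own; the comments describe which keys are expected to survive each join.

```r
library(dplyr)

x <- tribble(~key, ~val_x, 1, "x1", 2, "x2", 3, "x3")
y <- tribble(~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

left  <- left_join(x, y, by = "key")   # keys 1, 2, 3; val_y is NA for key 3
right <- right_join(x, y, by = "key")  # keys 1, 2, 4; val_x is NA for key 4
full  <- full_join(x, y, by = "key")   # keys 1, 2, 3, 4
```

In every case the unmatched side is filled with `NA`, which is how you can tell afterwards which rows had no partner.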

### Example
The `flights` table has a `carrier` column which is a two-letter code for the airline. The `airlines` table maps these codes to recognizable airline names. 

How many flights are there per carrier in the dataset?

In [64]:
# flights per carrier
flights %>% count(carrier) %>% left_join(airlines)

Joining, by = "carrier"


carrier,n,name
<chr>,<int>,<chr>
9E,18460,Endeavor Air Inc.
AA,32729,American Airlines Inc.
AS,714,Alaska Airlines Inc.
B6,54635,JetBlue Airways
DL,48110,Delta Air Lines Inc.
EV,54173,ExpressJet Airlines Inc.
F9,685,Frontier Airlines Inc.
FL,3260,AirTran Airways Corporation
HA,342,Hawaiian Airlines Inc.
MQ,26397,Envoy Air


## 🤔 Quiz

How many flights departing from EWR were operated by Envoy Air?

<ol style="list-style-type: upper-alpha;">
    <li>46087</li>
    <li>26397</li>
    <li>2276</li>
    <li>18460</li>
    <li>43939</li>
</ol>

In [67]:
# envoy flights from ewr
flights %>%
  left_join(airlines) %>% 
  filter(origin == 'EWR', name == 'Envoy Air') %>% count

Joining, by = "carrier"


n
<int>
2276


### Duplicate keys
Although we have defined primary keys in terms of uniqueness, the columns used to join two tables need not be unique in both tables. There are several possibilities:

##### One table has duplicate keys
This is the typical one-to-many relationship. It is useful when you want to add information from the table with unique keys to each matching row of the other table.
![duplicate keys](http://r4ds.had.co.nz/diagrams/join-one-to-many.png)

In fact, we already saw an example of this above when we joined `airlines` to `flights`: each airline operates potentially many flights.
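
A minimal sketch of a one-to-many join on made-up data (the tables `vehicles` and `trips` and their columns are hypothetical): the key is unique in one table and repeated in the other, so the join copies each vehicle's attributes onto every matching trip without changing the number of trips.

```r
library(dplyr)

# Hypothetical tables: `vehicle_id` is unique in `vehicles` but repeats in `trips`
vehicles <- tibble(vehicle_id = c("v1", "v2"), seats = c(4, 7))
trips    <- tibble(trip = 1:4, vehicle_id = c("v1", "v1", "v2", "v1"))

# Each trip picks up the attributes of its vehicle; the row count of `trips` is preserved
joined <- left_join(trips, vehicles, by = "vehicle_id")
```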

### Exercise
What is the most common model of airplane used by each carrier?

In [74]:
# most common model
flights %>% 
  left_join(planes, by = 'tailnum') %>% 
  group_by(carrier, model) %>% 
  summarise(n = n()) %>% top_n(1)

`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.
Selecting by n


carrier,model,n
<chr>,<chr>,<int>
9E,CL-600-2D24,10580
AA,,22558
AS,737-890,346
B6,A320-232,34063
DL,MD-88,10191
EV,EMB-145LR,28027
F9,A320-214,617
FL,717-200,2774
HA,A330-243,342
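MQ,,25397

The blank `model` entries for AA and MQ are `NA`: for these carriers the largest group most likely consists of flights whose `tailnum` has no match in `planes`.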


##### Both tables have duplicate keys
This represents a many-to-many join and is usually an error since the key does not uniquely identify observations in either table. Joining duplicated keys results in the Cartesian product of all the possible matches:
![cartesian](http://r4ds.had.co.nz/diagrams/join-many-to-many.png)

Be careful when doing many-to-many merges. It's possible to generate huge tables by accident and crash R.
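
To see how quickly rows multiply, here is a minimal sketch with two made-up tables (`a` and `b` are hypothetical) in which the key value 1 appears twice on each side. `suppressWarnings()` is used only because recent versions of dplyr may warn about many-to-many matches.

```r
library(dplyr)

# Hypothetical tables in which key value 1 appears twice on each side
a <- tibble(key = c(1, 1, 2), val_a = c("a1", "a2", "a3"))
b <- tibble(key = c(1, 1, 2), val_b = c("b1", "b2", "b3"))

# Key 1 contributes 2 x 2 = 4 rows and key 2 contributes 1 x 1 = 1 row
ab <- suppressWarnings(inner_join(a, b, by = "key"))
nrow(ab)
```

Three rows joined with three rows already give five; with thousands of duplicates per key on each side, the product grows accordingly.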

## 🤔 Quiz
To understand what the weather conditions were when each flight departed, I will join the weather table to the first 100 rows of the flights table using the command

```
flights %>% slice(1:100) %>% left_join(weather, by=c("origin", "hour"))
```

How many rows does the resulting table have?

<ol style="list-style-type: upper-alpha;">
    <li>100</li>
    <li>36344</li>
    <li>336776</li>
    <li>Zero</li>
    <li><code>NA</code></li>
</ol>

In [79]:
flights %>% 
  slice(1:100) %>% 
  left_join(weather) %>% count

Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")


n
<int>
100


In [None]:
# flights %>% slice(1:100) %>% left_join(weather, by=c("origin", "hour")) 

### Defining the key columns
When we do a join using `left_join()` without specifying `by`, dplyr takes as the key all of the column names that the two tables have in common:

In [None]:
left_join(flights, planes) %>% print
   # select(year, month, day, origin, type, tailnum, manufacturer) %>% 
   # print

Joining, by = c("year", "tailnum")


# A tibble: 336,776 x 26
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# … with 336,766 more rows,

This is called a *natural join*. If the key column(s) are named differently in the two tables, we must specify the mapping between the two using the `by=` parameter.
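
One caveat of a natural join is that columns sharing a name are matched even when they mean different things. In the join above, `year` is part of the key, although `year` in `planes` records when the plane was manufactured. A minimal sketch with made-up miniature tables (`fl` and `pl` are hypothetical stand-ins, not the real data) shows how `by` restricts the key:

```r
library(dplyr)

# Hypothetical miniature tables: `year` is the flight year in `fl`
# and the year of manufacture in `pl`
fl <- tibble(year = c(2013, 2013), tailnum = c("N1", "N2"))
pl <- tibble(tailnum = c("N1", "N2"), year = c(1999, 2013), seats = c(50, 180))

# Natural join: matches on year AND tailnum, so plane N1 (built 1999) finds no match
natural <- suppressMessages(left_join(fl, pl))

# Explicit key: matches on tailnum only; the two year columns get suffixes
by_tail <- left_join(fl, pl, by = "tailnum")
names(by_tail)
```

With the explicit key, both planes are matched and the clashing columns are kept apart as `year.x` and `year.y`.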


Consider joining `airports` to `flights`:
```
> left_join(flights, airports)
Error: `by` required, because the data sources have no common variables
Traceback:

1. left_join(flights, airports)
2. left_join.tbl_df(flights, airports)
3. common_by(by, x, y)
4. common_by.NULL(by, x, y)
5. bad_args("by", "required, because the data sources have no common variables")
6. glubort(fmt_args(args), ..., .envir = .envir)
7. .abort(text)
```

In [86]:
left_join(flights, airports, by=c("origin" = "faa")) %>% print

# A tibble: 336,776 × 26
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554       

In [84]:
airports %>% print

# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     

Calling `left_join(flights, airports)` without `by` produces an error because `airports` and `flights` do not have any columns in common. Indeed, the three-letter FAA code is called `faa` in `airports`, but appears as either `origin` or `dest` in `flights`. To fix the error, we must specify which of `origin` or `dest` should be matched:

In [None]:
# join_by
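
The cell above is only a placeholder. As a hedged sketch of what it may have been meant to show: in dplyr 1.1.0 and later (an assumption about the installed version), the mapping between differently named key columns can also be written with `join_by()`. The miniature tables `fl` and `ap` below are hypothetical stand-ins for `flights` and `airports`.

```r
library(dplyr)  # join_by() assumes dplyr >= 1.1.0

# Hypothetical miniature versions of `flights` and `airports`
fl <- tibble(flight = c(1, 2), origin = c("EWR", "JFK"), dest = c("DTW", "EWR"))
ap <- tibble(faa = c("EWR", "JFK", "DTW"), name = c("Newark", "Kennedy", "Detroit"))

# Named-vector form: match `origin` in fl against `faa` in ap
by_origin <- left_join(fl, ap, by = c("origin" = "faa"))

# join_by() form: match `dest` in fl against `faa` in ap
by_dest <- left_join(fl, ap, by = join_by(dest == faa))
```

Both forms state explicitly which column of the left table pairs with which column of the right table; they differ only in syntax.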

## 🤔 Quiz

How many flights in February were bound for tropical destinations?
<img src='https://camo.githubusercontent.com/5a1a728ea646d55a3e73924f15bb2ba116d06b4fe1ac1aaa0628bb891a43c496/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f7468756d622f622f62302f576f726c645f6d61705f696e6469636174696e675f74726f706963735f616e645f73756274726f706963732e706e672f36343070782d576f726c645f6d61705f696e6469636174696e675f74726f706963735f616e645f73756274726f706963732e706e67' />
https://kids.britannica.com/kids/article/latitude-and-longitude/353366


<ol style="list-style-type: upper-alpha;">
    <li>56</li>
    <li>18</li>
    <li>71</li>
    <li>39</li>
    <li>50</li>
</ol>

(Note: there is a quantitative definition of what it means for a location on Earth to be tropical. It does not have to do with sitting on the beach.)

In [90]:
# tropical flights: destination latitude within 23.5 degrees of the equator
flights %>% left_join(airports, by = c("dest" = "faa")) %>%
  filter(month == 2, abs(lat) <= 23.5) %>% count

n
<int>
56


## Filtering joins

Filtering joins allow us to filter the rows of one table based on whether they have a match in another table. We've already seen some examples of this on the problem sets:

In [None]:
dest_top6 <- count(flights, dest) %>% top_n(6)
filter(flights, dest %in% dest_top6$dest) %>% nrow

Selecting by n


### Semi-join
`semi_join(x, y)` keeps all the observations in `x` that are also in `y`.
![semi join](http://r4ds.had.co.nz/diagrams/join-semi.png)
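
A minimal sketch of `semi_join()` on the small `x` and `y` tables used earlier (redefined here so the example runs on its own). Unlike a mutating join, the result contains only columns from `x`.

```r
library(dplyr)

x <- tribble(~key, ~val_x, 1, "x1", 2, "x2", 3, "x3")
y <- tribble(~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

# Keeps the rows of x whose key appears in y; no columns from y are added
kept <- semi_join(x, y, by = "key")
```

Because no columns are added and rows of `x` are never duplicated, a semi-join behaves like a `filter()` whose condition lives in another table.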

In [94]:
# find all the flights with destinations in the top 6
flights %>% 
  semi_join(count(flights, dest) %>% 
    top_n(6)) %>% count

Selecting by n
Joining, by = "dest"


n
<int>
94326
