-
Notifications
You must be signed in to change notification settings - Fork 4
/
05-joins.Rmd
323 lines (219 loc) · 10.6 KB
/
05-joins.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
# Data Relations {#joins}
<div class="right meme"><img src="images/memes/joins.png"
alt="3-panel meme. 1: dog looks at a baby in a chair, text reads 'An SQL query goes into a bar, walks up to two tables and asks...'; 2: baby looks at dog, text reads 'Can I join you?'; 3: dog and baby look at camera, no text" /></div>
## Learning Objectives {#ilo-joins}
1. Be able to use the 4 mutating join verbs: [(video)](https://youtu.be/WV0yg6f3DNM){class="video"}
+ [`left_join()`](#left_join)
+ [`right_join()`](#right_join)
+ [`inner_join()`](#inner_join)
+ [`full_join()`](#full_join)
2. Be able to use the 2 filtering join verbs: [(video)](https://youtu.be/ijoCEKifefQ){class="video"}
+ [`semi_join()`](#semi_join)
+ [`anti_join()`](#anti_join)
3. Be able to use the 2 binding join verbs: [(video)](https://youtu.be/8RWdNhbVZ4I){class="video"}
+ [`bind_rows()`](#bind_rows)
+ [`bind_cols()`](#bind_cols)
4. Be able to use the 3 set operations: [(video)](https://youtu.be/c3V33ElWUYI){class="video"}
+ [`intersect()`](#intersect)
+ [`union()`](#union)
+ [`setdiff()`](#setdiff)
## Setup {#setup-joins}
1. Open your `reprores-class-notes` project
1. Create a new R Markdown file called `05-joins.Rmd`
1. Update the YAML header
1. Replace the setup chunk with the one below:
```{r setup-joins, results = 'hide', warning = FALSE, message = FALSE, verbatim = "r setup, include = FALSE"}
knitr::opts_chunk$set(echo = TRUE)
# packages needed for this chapter
library(tidyverse) # loads dplyr for join functions
```
Download the [Data transformation cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf).
## Create Demo Data
First, we'll create two small data tables.
`subject` has id, gender and age for subjects 1-5. Age and gender are missing for subject 3.
```{r subject}
subject <- tibble(
id = 1:5,
gender = c("m", "m", NA, "nb", "f"),
age = c(19, 22, NA, 19, 18)
)
```
`r knitr::kable(subject)`
`exp` has subject id and the score from an experiment. Some subjects are missing, some completed twice, and some are not in the subject table.
```{r exp}
exp <- tibble(
id = c(2, 3, 4, 4, 5, 5, 6, 6, 7),
score = c(10, 18, 21, 23, 9, 11, 11, 12, 3)
)
```
`r knitr::kable(exp)`
## Mutating Joins
`r glossary("Mutating joins")` act like the `mutate()` function in that they add new columns to one table based on values in another table.
All the mutating joins have this basic syntax:
`****_join(x, y, by = NULL, suffix = c(".x", ".y")`
* `x` = the first (left) table
* `y` = the second (right) table
* `by` = what columns to match on. If you leave this blank, it will match on all columns with the same names in the two tables.
* `suffix` = if columns have the same name in the two tables, but you aren't joining by them, they get a suffix to make them unambiguous. This defaults to ".x" and ".y", but you can change it to something more meaningful.
::: {.info data-latex=""}
You can leave out the `by` argument if you're matching on all of the columns with the same name, but it's good practice to always specify it so your code is robust to changes in the loaded data.
:::
### left_join() {#left_join}
```{r img-left-join, echo=FALSE, fig.width=4, fig.cap="Left Join", class="join"}
knitr::include_graphics("images/joins/left_join.png")
```
A `left_join` keeps all the data from the first (left) table and joins anything that matches from the second (right) table. If the right table has more than one match for a row in the right table, there will be more than one row in the joined table (see ids 4 and 5).
```{r left_join}
left_join(subject, exp, by = "id")
```
```{r img-left-join-rev, echo=FALSE, fig.width=4, fig.cap="Left Join (reversed)", class="join"}
knitr::include_graphics("images/joins/left_join_rev.png")
```
The order of tables is swapped here, so the result is all rows from the `exp` table joined to any matching rows from the `subject` table.
```{r left_join-2}
left_join(exp, subject, by = "id")
```
### right_join() {#right_join}
```{r img-right-join, echo=FALSE, fig.width=4, fig.cap="Right Join", class="join"}
knitr::include_graphics("images/joins/right_join.png")
```
A `right_join` keeps all the data from the second (right) table and joins anything that matches from the first (left) table.
```{r right_join}
right_join(subject, exp, by = "id")
```
::: {.info data-latex=""}
This table has the same information as `left_join(exp, subject, by = "id")`, but the columns are in a different order (left table, then right table).
:::
### inner_join() {#inner_join}
```{r img-inner-join, echo=FALSE, fig.width=4, fig.cap="Inner Join", class="join"}
knitr::include_graphics("images/joins/inner_join.png")
```
An `inner_join` returns all the rows that have a match in the other table.
```{r inner_join}
inner_join(subject, exp, by = "id")
```
### full_join() {#full_join}
```{r img-full-join, echo=FALSE, fig.width=4, fig.cap="Full Join", class="join"}
knitr::include_graphics("images/joins/full_join.png")
```
A `full_join` lets you join up rows in two tables while keeping all of the information from both tables. If a row doesn't have a match in the other table, the other table's column values are set to `NA`.
```{r full_join}
full_join(subject, exp, by = "id")
```
## Filtering Joins
`r glossary("Filtering joins")` act like the `filter()` function in that they remove rows from the data in one table based on the values in another table. The result of a filtering join will only contain rows from the left table and have the same number or fewer rows than the left table.
### semi_join() {#semi_join}
```{r img-semi-join, echo=FALSE, fig.width=4, fig.cap="Semi Join", class="join"}
knitr::include_graphics("images/joins/semi_join.png")
```
A `semi_join` returns all rows from the left table where there are matching values in the right table, keeping just columns from the left table.
```{r semi_join}
semi_join(subject, exp, by = "id")
```
::: {.info data-latex=""}
Unlike an inner join, a semi join will never duplicate the rows in the left table if there is more than one matching row in the right table.
:::
```{r img-semi-join-rev, echo=FALSE, fig.width=4, fig.cap="Semi Join (Reversed)", class="join"}
knitr::include_graphics("images/joins/semi_join_rev.png")
```
Order matters in a semi join.
```{r semi_join-2}
semi_join(exp, subject, by = "id")
```
### anti_join() {#anti_join}
```{r img-anti-join, echo=FALSE, fig.width=4, fig.cap="Anti Join", class="join"}
knitr::include_graphics("images/joins/anti_join.png")
```
An `anti_join` return all rows from the left table where there are *not* matching values in the right table, keeping just columns from the left table.
```{r anti_join}
anti_join(subject, exp, by = "id")
```
```{r img-anti-join-rev, echo=FALSE, fig.width=4, fig.cap="Anti Join (Reversed)", class="join"}
knitr::include_graphics("images/joins/anti_join_rev.png")
```
Order matters in an anti join.
```{r anti_join-2}
anti_join(exp, subject, by = "id")
```
## Binding Joins
`r glossary("Binding joins")` bind one table to another by adding their rows or columns together.
### bind_rows() {#bind_rows}
You can combine the rows of two tables with `bind_rows`.
Here we'll add subject data for subjects 6-9 and bind that to the original subject table.
```{r bind_rows}
new_subjects <- tibble(
id = 6:9,
gender = c("nb", "m", "f", "f"),
age = c(19, 16, 20, 19)
)
bind_rows(subject, new_subjects)
```
The columns just have to have the same names, they don't have to be in the same order. Any columns that differ between the two tables will just have `NA` values for entries from the other table.
If a row is duplicated between the two tables (like id 5 below), the row will also be duplicated in the resulting table. If your tables have the exact same columns, you can use `union()` (see below) to avoid duplicates.
```{r bind-rows-union}
new_subjects <- tibble(
id = 5:9,
age = c(18, 19, 16, 20, 19),
gender = c("f", "nb", "m", "f", "f"),
new = c(1,2,3,4,5)
)
bind_rows(subject, new_subjects)
```
### bind_cols() {#bind_cols}
You can merge two tables with the same number of rows using `bind_cols`. This is only useful if the two tables have their rows in the exact same order. The only advantage over a left join is when the tables don't have any IDs to join by and you have to rely solely on their order.
```{r bind_cols}
new_info <- tibble(
colour = c("red", "orange", "yellow", "green", "blue")
)
bind_cols(subject, new_info)
```
## Set Operations
`r glossary("Set operations")` compare two tables and return rows that match (intersect), are in either table (union), or are in one table but not the other (setdiff).
### intersect() {#intersect}
`intersect()` returns all rows in two tables that match exactly. The columns don't have to be in the same order.
```{r intersect}
new_subjects <- tibble(
id = seq(4, 9),
age = c(19, 18, 19, 16, 20, 19),
gender = c("f", "f", "m", "m", "f", "f")
)
intersect(subject, new_subjects)
```
::: {.warning data-latex=""}
If you've forgotten to load dplyr or the tidyverse, `r glossary("base R")` also has an `intersect()` function. The error message can be confusing and looks something like this:
```{r base-intersect, error = TRUE}
base::intersect(subject, new_subjects)
```
:::
### union() {#union}
`union()` returns all the rows from both tables, removing duplicate rows.
```{r union}
union(subject, new_subjects)
```
::: {.warning data-latex=""}
If you've forgotten to load dplyr or the tidyverse, `r glossary("base R")` also has a `union()` function. You usually won't get an error message, but the output won't be what you expect.
```{r base-union}
base::union(subject, new_subjects)
```
:::
### setdiff() {#setdiff}
`setdiff` returns rows that are in the first table, but not in the second table.
```{r setdiff}
setdiff(subject, new_subjects)
```
Order matters for `setdiff`.
```{r setdiff-order}
setdiff(new_subjects, subject)
```
::: {.warning data-latex=""}
If you've forgotten to load dplyr or the tidyverse, `r glossary("base R")` also has a `setdiff()` function. You usually won't get an error message, but the output might not be what you expect because the base R `setdiff()` expects columns to be in the same order, so id 5 here registers as different between the two tables.
```{r base-setdiff}
base::setdiff(subject, new_subjects)
```
:::
## Glossary {#glossary-joins}
`r glossary_table()`
## Further Resources {#resources-joins}
* [Chapter 13: Relational Data](http://r4ds.had.co.nz/relational-data.html) in *R for Data Science*
* [Cheatsheet for dplyr join functions](http://stat545.com/bit001_dplyr-cheatsheet.html)
* [Lecture slides on dplyr two-table verbs](slides/05_joins_slides.pdf)