-
Notifications
You must be signed in to change notification settings - Fork 3
/
lesson_joins.Rmd
217 lines (169 loc) · 7.05 KB
/
lesson_joins.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
---
title: Multi-table joins
minutes: ?20
---
```{r, include=FALSE}
# Silently load in the data so the rest of the lesson works
library(tidyverse)
library(gapminder)
```
> ## Learning objectives
>
> - Focus on the third tidy data principle
> - Each variable forms a column.
> - Each observation forms a row.
> - Each type of observational unit forms a table.
> - Be able to use `dplyr`'s join functions to merge tables
>
### Joins
The third tidy data maxim states that each observation type gets its own table. The idea of multiple tables within a dataset will be familiar to anyone who has worked with a relational database but may seem foreign to those who have not.
The idea is this: Suppose we conduct a behavioral experiment that puts individuals in groups, and we measure both individual- and group-level variables. We should have a table for the individual-level variables and a separate table for the group-level variables. Then, should we need to merge them, we can do so using the `join` functions of `dplyr`.
The join functions are nicely illustrated in RStudio's [Data wrangling cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Each function takes two data.frames and, optionally, the name(s) of columns on which to match. If no column names are provided, the functions match on all shared column names.
The different join functions control what happens to rows that exist in one table but not the other.
- `left_join` keeps all the entries that are present in the left (first) table and excludes any that are only in the right table.
- `right_join` keeps all the entries that are present in the right table and excludes any that are only in the left table.
- `inner_join` keeps only the entries that are present in both tables. `inner_join` is the only function that guarantees you won't generate any missing entries.
- `full_join` keeps all of the entries in both tables, regardless of whether or not they appear in the other table.
![dplyr joins, via RStudio](img/dplyr-joins.png)
# Practice with joins
## Fabricate some example data
```{r}
set.seed(12345)
x <- data.frame(key= LETTERS[c(1:3, 5)],
value1 = sample(1:10, 4),
stringsAsFactors = FALSE)
y <- data.frame(key= LETTERS[c(1:4)],
value2 = sample(1:10, 4),
stringsAsFactors = FALSE)
x
y
```
## "Joining" joins
```{r}
# What's in both x and y?
inner_join(x, y, by = "key")
```
```{r}
# What's in X and bring with it the stuff that matches in Y
left_join(x, y, by = "key")
```
```{r}
# What's in Y and bring with it the stuff that matches in Y
right_join(x, y, by = "key")
```
```{r}
# Give me everything!
full_join(x, y, by = "key")
```
## Filtering "joins"
```{r}
# Give me the stuff in X that is also in Y
semi_join(x, y, by = "key")
```
```{r}
# Give me the stuff in X that is not in Y
anti_join(x, y, by = "key")
```
```{r}
# Want everything that doesn't match?
full_join(anti_join(x,y, by = "key"), anti_join(y,x, by = "key"), by= "key")
```
```{r}
# keys with different names?
x <- data.frame(keyX = LETTERS[c(1:3, 5)],
value1 = sample(1:10, 4),
stringsAsFactors = FALSE)
y <- data.frame(keyY = LETTERS[c(1:4)],
value2 = sample(1:10, 4),
stringsAsFactors = FALSE)
x
y
```
```{r, error = TRUE}
full_join(x, y) #should error out
full_join(x, y, by=c("keyX" = "keyY"))
```
# Set Operations
```{r setup_dataframes}
df1 <- data_frame(x = LETTERS[1:2],
y = c(1L, 1L))
df2 <- data_frame(x = LETTERS[1:2],
y = 1:2)
df1
df2
```
```{r}
# Which rows are common in both datasets?
dplyr::intersect(df1, df2)
```
```{r}
#Want all unique rows between both datasets?
dplyr::union(df1, df2)
```
```{r}
#What's unique to df1?
dplyr::setdiff(df1, df2)
```
```{r}
#What's unique to df2?
dplyr::setdiff(df2, df1)
```
# Practice with joins using gapminder
We will practice on our continents data.frame from module 2 and the gapminder data.frame. Note how these are tidy data: We have observations at the level of continent and at the level of country, so they go in different tables. The continent column in the gapminder data.frame allows us to link them now. If continents data.frame isn't in your Environment, load it and recall what it consists of:
```{r}
load('data/continents.RDA')
continents
```
We can join the two data.frames using any of the `dplyr` functions. We will pass the results to `str` to avoid printing more than we can read, and to get more high-level information on the resulting data.frames.
```{r}
left_join(gapminder, continents)
right_join(gapminder, continents)
```
These operations produce slightly different results, either 1704 or 1705 observations. Can you figure out why? Antarctica contains no countries so doesn't appear in the gapminder data.frame. When we use `left_join` it gets filtered from the results, but when we use `right_join` it appears, with missing values for all of the country-level variables:
```{r}
right_join(gapminder, continents) %>%
filter(continent == "Antarctica")
```
There's another problem in this data.frame -- it has two population measures, one by continent and one by country and it's not clear which is which! Let's rename a couple of columns.
```{r}
right_join(gapminder, continents) %>%
rename(country_pop = pop, continent_pop = population)
```
> #### Challenge -- Putting the pieces together
>
> A colleague suggests that the more land area an individual has, the greater their gdp will be and that this relationship will be observable at any scale of observation. You chuckle and mutter "Not at the continental scale," but your colleague insists. Test your colleague's hypothesis by:
>
> - Calculating the total GDP of each continent,
> - Hint: Use `dplyr`'s `group_by` and `summarize`
> - Joining the resulting data.frame to the `continents` data.frame,
> - Calculating the per-capita GDP for each continent, and
> - Plotting per-capita gdp versus population density.
>
## Challenge solutions
> #### Solution to Challenge -- Putting the pieces together
>
>
> ```{r, Putting the pieces together - solution}
> library(ggplot2)
>
> gapminder %>%
> mutate(GDP = gdpPercap * pop) %>% # Calculate country-level GDP
> group_by(continent) %>% # Group by continent
> summarize(cont_gdp = sum(GDP)) %>% # Calculate continent-level GDP
> # Join the continent-GDP data.frame to the continents data.frame
> left_join(continents) %>%
> # Calculate continent-level per-capita GDP
> mutate(per_cap = cont_gdp / population) %>%
> # Plot gdp versus land area
> ggplot(aes(x = area_km2, y = per_cap)) +
> # Draw points
> geom_point() +
> # And label them
> geom_text(aes(label = continent), nudge_y = 5e3)
> ```
>
<br>
This lesson is adapted from the Software Carpentry: R for Reproducible
Scientific Analysis [Multi-Table Joins materials](http://data-lessons.github.io/gapminder-R/12-joins.html)
and Brandon Hurr's dplyr II: Joins and Set Ops [presentation](https://github.com/bhive01/2016-02-D-RUG-dplyr-II-talk)
to the Davis R UsersGroup on Februrary 2, 2016.