-
Notifications
You must be signed in to change notification settings - Fork 9
/
example2_join.Rmd
102 lines (84 loc) · 2.94 KB
/
example2_join.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
---
title: "Example 2: Join tables"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Example 2: Join tables}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
This post has referred to a vignette from `dplyr`, you can find it in <https://dplyr.tidyverse.org/articles/two-table.html>. We'll try to display how to join data tables in this vignette. First, load the packages we need and get some data.
```{r setup}
library(tidyfst)
library(nycflights13)
flights2 <- flights %>%
select_dt(year,month,day, hour, origin, dest, tailnum, carrier)
```
Do a left join with a simple:
```{r}
flights2 %>%
left_join_dt(airlines)
```
## Controlling how the tables are matched
Join works the same as `dplyr`:
```{r}
flights2 %>% left_join_dt(weather)
flights2 %>% left_join_dt(planes, by = "tailnum")
flights2 %>% left_join_dt(airports, c("dest" = "faa"))
flights2 %>% left_join_dt(airports, c("origin" = "faa"))
```
## Types of join
```{r}
df1 <- data.table(x = c(1, 2), y = 2:1)
df2 <- data.table(x = c(1, 3), a = 10, b = "a")
df1 %>% inner_join_dt(df2)
df1 %>% left_join_dt(df2)
df1 %>% right_join_dt(df2)
df1 %>% full_join_dt(df2)
```
If all you have is a data.frame or tibble, you have no need to change the format. Feed the data directly:
```{r}
df1 <- data.frame(x = c(1, 1, 2), y = 1:3)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% left_join_dt(df2)
```
The "_dt" suffix should remind you that this is backed up by `data.table` and will always return a data.table in the end.
## Filtering joins
Filtering joins have also been supported in `tidyfst`.
```{r}
flights %>%
anti_join_dt(planes, by = "tailnum") %>%
count_dt(tailnum, sort = TRUE)
```
Other examples (`semi_join_dt()` and `anti_join_dt()` never duplicate; they only ever remove observations.):
```{r}
df1 <- data.frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))
# Four rows to start with:
df1 %>% nrow()
# And we get four rows after the join
df1 %>% inner_join_dt(df2, by = "x") %>% nrow()
# But only two rows actually match
df1 %>% semi_join_dt(df2, by = "x") %>% nrow()
```
## Set operations
For set operations, wrap `data.table`'s function directly, but the functions will automatically turn any data.frame into data.table. Examples are listed as below:
```{r}
x = iris[c(2,3,3,4),]
x2 = iris[2:4,]
y = iris[c(3:5),]
intersect_dt(x, y) # intersect
intersect_dt(x, y, all=TRUE) # intersect all
setdiff_dt(x, y) # except
setdiff_dt(x, y, all=TRUE) # except all
union_dt(x, y) # union
union_dt(x, y, all=TRUE) # union all
setequal_dt(x, x2, all=FALSE) # setequal
setequal_dt(x, x2)
```
For more details, just find the help from `data.table` using `?setops`.