# Tidyverse: Data Wrangling

## 1. Introduction

### 1.1. Tidyverse
This topic introduces the [`tidyverse`] library, a collection of R packages designed for data science. `library(tidyverse)` loads the following core packages:
- `ggplot2` for data visualization,
- `dplyr` for data manipulation,
- `tidyr` for data tidying,
- `readr` for data import,
- `purrr` for functional programming,
- `tibble` for tibbles, a modern re-imagining of data frames,
- `stringr` for strings,
- `forcats` for factors.

Tidyverse also includes many other useful packages such as `lubridate` and `magrittr`, but in the tidyverse 1.3.0 release used here each of them needs to be loaded separately.

[`tidyverse`]: https://www.tidyverse.org/packages/

In [1]:
library(tidyverse)

"package 'tidyverse' was built under R version 3.6.3"
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.1     v dplyr   1.0.0
v tidyr   1.1.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0

"package 'ggplot2' was built under R version 3.6.3"
"package 'tibble' was built under R version 3.6.3"
"package 'tidyr' was built under R version 3.6.3"
"package 'readr' was built under R version 3.6.3"
"package 'purrr' was built under R version 3.6.3"
"package 'dplyr' was built under R version 3.6.3"
"package 'forcats' was built under R version 3.6.3"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr...

### 1.2. The pipe
The pipe `%>%` from the `magrittr` package offers a different way of writing function calls. It passes the object on its left as the first argument of the function on its right, so code reads from left to right instead of from the inside out. Core tidyverse packages such as `dplyr` re-export the pipe, so loading `magrittr` explicitly is not required.



In [None]:
library(magrittr)

In [3]:
floor(sqrt(15))

In [4]:
15 %>% sqrt %>% floor
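
As a minimal sketch of what the pipe does with additional arguments, the example below compares a nested call with its piped form and uses the `.` placeholder to send the left-hand side to an argument other than the first. The variable names are illustrative only.

```r
library(magrittr)

# x %>% f(y) is evaluated as f(x, y)
nested <- round(sqrt(15), 2)
piped  <- 15 %>% sqrt() %>% round(2)
identical(nested, piped)

# The dot placeholder sends the left-hand side to a named argument instead
by_three <- 3 %>% seq(1, 10, by = .)
by_three
```

Writing `sqrt()` with parentheses behaves the same as the bare `sqrt` used above; the parenthesised form makes it easier to see where extra arguments would go.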

## 2. Data exploration

### 2.1. Analysis of observations

In [5]:
library(dplyr)

In [6]:
mtcars <- read.csv('../data/mtcars.csv')

#### Top and bottom rows

In [7]:
mtcars %>% head

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
1,Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
2,Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
3,Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
4,Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
5,Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
6,Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [8]:
mtcars %>% tail(3)

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
30,Ferrari Dino,19.7,6,145,175,3.62,2.77,15.5,0,1,5,6
31,Maserati Bora,15.0,8,301,335,3.54,3.57,14.6,0,1,5,8
32,Volvo 142E,21.4,4,121,109,4.11,2.78,18.6,1,1,4,2


#### Slicing specific rows

In [9]:
mtcars %>% slice(1:3, 6)

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


### 2.2. Analysis of attributes

In [10]:
library(dplyr)

In [11]:
mtcars <- read.csv('../data/mtcars.csv')

In [12]:
mtcars %>% names

In [13]:
mtcars %>% dim

#### Selecting columns
You can select columns in a data frame either by name or by index.

In [14]:
mtcars$model

In [15]:
# selecting a range of columns by name, plus one extra column
mtcars %>% select(model:cyl, gear) %>% head(3)

,model,mpg,cyl,gear
,<fct>,<dbl>,<int>,<int>
1,Mazda RX4,21.0,6,4
2,Mazda RX4 Wag,21.0,6,4
3,Datsun 710,22.8,4,4


In [16]:
# selecting columns by index, plus one column by name
mtcars %>% select(1:5, gear) %>% head

,model,mpg,cyl,disp,hp,gear
,<fct>,<dbl>,<int>,<dbl>,<int>,<int>
1,Mazda RX4,21.0,6,160,110,4
2,Mazda RX4 Wag,21.0,6,160,110,4
3,Datsun 710,22.8,4,108,93,4
4,Hornet 4 Drive,21.4,6,258,110,3
5,Hornet Sportabout,18.7,8,360,175,3
6,Valiant,18.1,6,225,105,3
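
Besides names and indices, `select()` also accepts selection helpers. The sketch below uses the built-in `datasets::mtcars` (which keeps the car model in row names rather than in a `model` column) so that it runs without the CSV file; `d_cols` and `without_range` are illustrative names.

```r
library(dplyr)

# Built-in copy of the data; no `model` column here
cars <- datasets::mtcars

# Columns whose names start with "d"
d_cols <- cars %>% select(starts_with("d"))
names(d_cols)

# Drop a range of columns with a minus sign
without_range <- cars %>% select(-(mpg:hp))
names(without_range)
```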


#### Unique values

In [17]:
mtcars %>% select(cyl) %>% unique %>% arrange(cyl)

cyl
<int>
4
6
8


In [18]:
mtcars %>% select(cyl, am) %>% unique %>% arrange(am, cyl)

cyl,am
<int>,<int>
4,0
6,0
8,0
4,1
6,1
8,1
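
An alternative to combining `select()` with `unique()` is dplyr's `distinct()`, which does both steps in one call. This is a small sketch on the built-in `datasets::mtcars`, not a replacement for the cells above.

```r
library(dplyr)

# Unique values of one column
cyl_values <- datasets::mtcars %>%
    distinct(cyl) %>%
    arrange(cyl)

# Unique combinations of two columns
cyl_am <- datasets::mtcars %>%
    distinct(cyl, am) %>%
    arrange(am, cyl)

cyl_values
cyl_am
```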


### 2.3. Basic statistics

In [19]:
library(dplyr)

In [20]:
mtcars <- read.csv('../data/mtcars.csv')

In [21]:
mtcars %>% select(2:5) %>% summary

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  

In [22]:
mtcars$mpg %>% mean

In [23]:
mtcars$mpg %>% quantile

The `describe()` function from the `Hmisc` library provides even more details about each column.

In [None]:
Hmisc::describe(mtcars$mpg)

In [None]:
Hmisc::describe(mtcars$model)
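
When only a few statistics are needed, `summarise()` can collect them in a single one-row data frame. The sketch below assumes the built-in `datasets::mtcars`; the output column names are chosen for illustration.

```r
library(dplyr)

# Several statistics of one column in a single call
mpg_stats <- datasets::mtcars %>%
    summarise(
        n = n(),
        mean_mpg = mean(mpg),
        median_mpg = median(mpg),
        sd_mpg = sd(mpg))

mpg_stats
```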

### 2.4. Sorting and filtering

In [28]:
library(dplyr)

In [29]:
mtcars <- read.csv('../data/mtcars.csv')

#### Sorting

In [30]:
mtcars %>% arrange(cyl, mpg) %>% head

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
1,Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2
2,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
4,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
5,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
6,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2


In [31]:
mtcars %>% arrange(cyl, desc(mpg)) %>% head

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
1,Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
2,Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
3,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
4,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
5,Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
6,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2


In [32]:
# desc() takes a single column, so only cyl is sorted in descending order here
mtcars %>% arrange(desc(cyl)) %>% head

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
1,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
2,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
3,Merc 450SE,16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
4,Merc 450SL,17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
5,Merc 450SLC,15.2,8,275.8,180,3.07,3.78,18.0,0,0,3,3
6,Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
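
To sort several columns in descending order, each column gets its own `desc()` call. A minimal sketch on the built-in `datasets::mtcars` (row names instead of a `model` column):

```r
library(dplyr)

# Highest cyl first; within each cyl, highest mpg first
sorted <- datasets::mtcars %>%
    arrange(desc(cyl), desc(mpg))

head(sorted[, c("cyl", "mpg")], 3)
```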


#### Filtering data

In [33]:
mtcars %>% filter(mpg>=30)

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


In [34]:
mtcars %>% filter(mpg>=30 & carb==2)

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


In [35]:
mtcars %>% filter(mpg>=30 | disp<=90)

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


In [36]:
mtcars %>% 
    filter(disp>300) %>% 
    filter(cyl==6 | cyl==8)

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Duster 360,14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
Cadillac Fleetwood,10.4,8,472,205,2.93,5.25,17.98,0,0,3,4
Lincoln Continental,10.4,8,460,215,3.0,5.424,17.82,0,0,3,4
Chrysler Imperial,14.7,8,440,230,3.23,5.345,17.42,0,0,3,4
Dodge Challenger,15.5,8,318,150,2.76,3.52,16.87,0,0,3,2
AMC Javelin,15.2,8,304,150,3.15,3.435,17.3,0,0,3,2
Camaro Z28,13.3,8,350,245,3.73,3.84,15.41,0,0,3,4
Pontiac Firebird,19.2,8,400,175,3.08,3.845,17.05,0,0,3,2
Ford Pantera L,15.8,8,351,264,4.22,3.17,14.5,0,1,5,4
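
Chained filters and `|` comparisons on one column can often be written more compactly. The sketch below, again on the built-in `datasets::mtcars`, shows that comma-separated conditions act like `&`, that `%in%` replaces repeated `==` tests, and that `between()` expresses an inclusive range.

```r
library(dplyr)

cars <- datasets::mtcars

# Two chained filters ...
chained <- cars %>%
    filter(disp > 300) %>%
    filter(cyl == 6 | cyl == 8)

# ... are equivalent to one filter with comma-separated conditions
combined <- cars %>% filter(disp > 300, cyl %in% c(6, 8))

# between() keeps values inside an inclusive range
mid_range <- cars %>% filter(between(mpg, 20, 25))

isTRUE(all.equal(chained, combined))
```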


### 2.5. Adding columns

In [37]:
library(dplyr)

In [38]:
mtcars <- read.csv('../data/mtcars.csv')

In [39]:
mtcars %>% mutate(new=0) %>% head

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,new
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>
1,Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4,0
2,Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4,0
3,Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1,0
4,Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1,0
5,Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2,0
6,Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1,0


In [40]:
mtcars %>% mutate(new1=gear+carb, new2='nothing') %>% head

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,new1,new2
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<chr>
1,Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4,8,nothing
2,Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4,8,nothing
3,Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1,5,nothing
4,Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1,4,nothing
5,Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2,5,nothing
6,Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1,4,nothing


In [41]:
# mutate with case_when(): conditions are checked in order, TRUE is the fallback
df <- data.frame(number=1:10)
df %>%
    mutate(
        group = case_when(
            number<=5 ~ 'group 1',
            number>5 & number<8 ~ 'group 2',
            TRUE ~ 'group 3'))

number,group
<int>,<chr>
1,group 1
2,group 1
3,group 1
4,group 1
5,group 1
6,group 2
7,group 2
8,group 3
9,group 3
10,group 3
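
For a simple two-way condition, `if_else()` is a lighter alternative to `case_when()`. The sketch below uses the built-in `datasets::mtcars`, where `am` is 0 for automatic and 1 for manual transmission; the conversion factor to kilometres per litre is an approximation, and the new column names are illustrative.

```r
library(dplyr)

labelled <- datasets::mtcars %>%
    mutate(
        # two-way label derived from an existing column
        transmission = if_else(am == 1, "manual", "automatic"),
        # miles per US gallon to km per litre (approximate factor)
        kpl = mpg * 0.4251)

labelled %>% select(mpg, kpl, am, transmission) %>% head(3)
```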


## 3. Data tidying

### 3.1. Data aggregation

In [42]:
library(dplyr)

#### Basic grouping

In [43]:
fish <- read.csv('../data/us_fishery_foreign_trade.csv')
fish %>% head

,year,month,product,country,value,feature,unit
,<int>,<int>,<fct>,<fct>,<int>,<fct>,<fct>
1,2010,1,SABLEFISH FRESH,UNITED ARAB EMIRATES,2297,EXP Quantity,kg
2,2010,1,SABLEFISH FRESH,JAPAN,16025,EXP Quantity,kg
3,2010,1,SABLEFISH FRESH,JAPAN,63437,EXP Quantity,kg
4,2010,1,MONKFISH FRESH,CANADA,579,EXP Quantity,kg
5,2010,1,MONKFISH FRESH,CANADA,7975,EXP Quantity,kg
6,2010,1,MONKFISH FRESH,NETHERLANDS,389,EXP Quantity,kg


In [44]:
fish %>% group_by(feature) %>% summarise(mean_value=mean(value))

`summarise()` ungrouping output (override with `.groups` argument)



feature,mean_value
<fct>,<dbl>
EXP Quantity,38093.38
EXP Value,273375.8
IMP Quantity,28478.83
IMP Value,215764.03


In [45]:
# number of rows and total value in each group
fish %>%
    group_by(feature, unit) %>%
    summarise(count=n(), sum_value=sum(value))

`summarise()` regrouping output by 'feature' (override with `.groups` argument)



feature,unit,count,sum_value
<fct>,<fct>,<int>,<int>
EXP Quantity,kg,4007,152640167
EXP Value,USD,4007,1095416826
IMP Quantity,kg,4730,134704872
IMP Value,USD,4730,1020563847
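
The regrouping message printed by `summarise()` can be avoided by stating the desired grouping of the result through the `.groups` argument (available from dplyr 1.0.0). A minimal sketch on the built-in `datasets::mtcars`, since the fishery file is not assumed to be available:

```r
library(dplyr)

by_cyl_am <- datasets::mtcars %>%
    group_by(cyl, am) %>%
    # .groups = "drop" returns an ungrouped result and silences the message
    summarise(count = n(), mean_mpg = mean(mpg), .groups = "drop")

is_grouped_df(by_cyl_am)
by_cyl_am
```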


#### Advanced grouping

In [46]:
mtcars <- read.csv('../data/mtcars.csv')
mtcars %>% head

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<fct>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>
1,Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
2,Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
3,Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
4,Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
5,Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
6,Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [47]:
# get the ranks for each group
mtcars %>%
    select(vs, mpg) %>%
    group_by(vs) %>%
    mutate(rank_mpg=dense_rank(mpg)) %>%
    arrange(vs, mpg)

vs,mpg,rank_mpg
<int>,<dbl>,<int>
0,10.4,1
0,10.4,1
0,13.3,2
0,14.3,3
0,14.7,4
0,15.0,5
0,15.2,6
0,15.2,6
0,15.5,7
0,15.8,8


In [48]:
# apply functions to multiple variables
# a named list replaces the deprecated funs()
mtcars %>%
    group_by(vs) %>%
    summarise_at(
        vars(mpg, hp),
        list(mean = mean, max = max, min = min))

vs,mpg_mean,hp_mean,mpg_max,hp_max,mpg_min,hp_min
<int>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>
0,16.61667,189.72222,26.0,335,10.4,91
1,24.55714,91.35714,33.9,123,17.8,52


In [49]:
# apply functions to variables that meet a given condition
# a named list replaces the deprecated funs()
mtcars %>%
    select(vs, mpg, hp, cyl) %>%
    mutate(cyl = as.factor(cyl)) %>%
    group_by(vs) %>%
    summarise_if(
        is.numeric,
        list(mean = mean, median = median))

vs,mpg_mean,hp_mean,mpg_median,hp_median
<int>,<dbl>,<dbl>,<dbl>,<dbl>
0,16.61667,189.72222,15.65,180
1,24.55714,91.35714,22.8,96
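
From dplyr 1.0.0 onwards, `across()` inside `summarise()` covers the same ground as the scoped `summarise_at()` and `summarise_if()` verbs. The sketch below uses the built-in `datasets::mtcars`; note that the resulting columns are ordered by variable first (`mpg_*`, then `hp_*`), which differs from the ordering produced by `summarise_at()`.

```r
library(dplyr)

by_vs <- datasets::mtcars %>%
    group_by(vs) %>%
    summarise(
        # apply each function in the named list to each selected column
        across(c(mpg, hp), list(mean = mean, max = max)),
        .groups = "drop")

names(by_vs)
```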


### 3.2. Unpivoting

In [50]:
library(tidyr)

In [51]:
wide = data.frame(
    Color= c('Red', 'Green', 'Blue'),
    'Q1.2020'=c(1000, 1500, 2000),
    'Q2.2020'=c(1200, 1500, 2200),
    'Q3.2020'=c(1500, 1575, 2000),
    'Q4.2020'=c(1700, 1800, 2800)
)
wide

Color,Q1.2020,Q2.2020,Q3.2020,Q4.2020
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Red,1000,1200,1500,1700
Green,1500,1500,1575,1800
Blue,2000,2200,2000,2800


In [52]:
wide %>% pivot_longer(
    cols=c(Q1.2020, Q2.2020, Q3.2020, Q4.2020),
    names_to='Quarter',
    values_to='Sales')

Color,Quarter,Sales
<fct>,<chr>,<dbl>
Red,Q1.2020,1000
Red,Q2.2020,1200
Red,Q3.2020,1500
Red,Q4.2020,1700
Green,Q1.2020,1500
Green,Q2.2020,1500
Green,Q3.2020,1575
Green,Q4.2020,1800
Blue,Q1.2020,2000
Blue,Q2.2020,2200
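
Column names that encode more than one variable can be split while unpivoting. The sketch below assumes a smaller version of the `wide` table and uses `names_sep` to separate the quarter from the year; `starts_with("Q")` avoids listing every column by hand. The names `Quarter`, `Year` and `long_sales` are illustrative.

```r
library(dplyr)
library(tidyr)

wide <- data.frame(
    Color = c("Red", "Green", "Blue"),
    Q1.2020 = c(1000, 1500, 2000),
    Q2.2020 = c(1200, 1500, 2200))

long_sales <- wide %>%
    pivot_longer(
        cols = starts_with("Q"),
        names_to = c("Quarter", "Year"),
        names_sep = "\\.",      # split "Q1.2020" at the literal dot
        values_to = "Sales")

long_sales
```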


### 3.3. Pivot table

In [53]:
library(tidyr)

In [54]:
long = data.frame(
    Market=c('Asian', 'Asian', 'Asian', 'Asian', 'Europe', 'Europe', 'Europe', 'Europe'),
    Color=c('Red', 'Red', 'Blue', 'Blue', 'Red', 'Red', 'Blue', 'Blue'),
    Size=c('Large', 'Small', 'Large', 'Small','Large', 'Small', 'Large', 'Small'),
    Price=c(17, 11, 19, 13, 18, 12, 20, 14),
    Sales=c(68000, 44000, 57000, 52000, 81000, 72000, 90000, 77000)
)
long

Market,Color,Size,Price,Sales
<fct>,<fct>,<fct>,<dbl>,<dbl>
Asian,Red,Large,17,68000
Asian,Red,Small,11,44000
Asian,Blue,Large,19,57000
Asian,Blue,Small,13,52000
Europe,Red,Large,18,81000
Europe,Red,Small,12,72000
Europe,Blue,Large,20,90000
Europe,Blue,Small,14,77000


In [55]:
long %>% pivot_wider(
    id_cols=c('Market', 'Color'),
    names_from='Size',
    values_from='Sales'
)

Market,Color,Large,Small
<fct>,<fct>,<dbl>,<dbl>
Asian,Red,68000,44000
Asian,Blue,57000,52000
Europe,Red,81000,72000
Europe,Blue,90000,77000


In [56]:
long %>% pivot_wider(
    id_cols='Market',
    names_from='Size',
    values_from='Price',
    values_fn=mean
)

Market,Large,Small
<fct>,<dbl>,<dbl>
Asian,18,12
Europe,19,13


In [57]:
long %>% pivot_wider(
    id_cols='Market',
    names_from='Size',
    values_from=c('Sales', 'Price'),
    values_fn=list('Sales'=sum, 'Price'=mean)
)

Market,Sales_Large,Sales_Small,Price_Large,Price_Small
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Asian,125000,96000,18,12
Europe,171000,149000,19,13


### 3.4. Appending data

In [58]:
library(dplyr)
library(readxl)

"package 'readxl' was built under R version 3.6.3"


In [59]:
# suppress all warnings (this also hides genuine problems, so use with care)
options(warn=-1)

In [60]:
path = '../data/world_population.xlsx'
sheets = excel_sheets(path)
sheets

In [61]:
world1 = read_excel(path, sheet=sheets[1])
world2 = read_excel(path, sheet=sheets[2])
world3 = read_excel(path, sheet=sheets[3])
world4 = read_excel(path, sheet=sheets[4])
world5 = read_excel(path, sheet=sheets[5])
world6 = read_excel(path, sheet=sheets[6])

In [62]:
world <- bind_rows(world1, world2, world3, world4, world5, world6)
world %>% dim
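
`bind_rows()` also accepts a list of data frames, which avoids one variable per sheet. The sketch below uses two small in-memory tables as hypothetical stand-ins for workbook sheets; with a real workbook, the list could be built with something like `lapply(sheets, function(s) read_excel(path, sheet = s))`.

```r
library(dplyr)

# Hypothetical stand-ins for sheets read from a workbook
tables <- list(
    sheet_a = data.frame(country = c("A", "B"), population = c(10, 20)),
    sheet_b = data.frame(country = "C", population = 30, area = 5))

# .id records which list element each row came from;
# columns missing from a table are filled with NA
stacked <- bind_rows(tables, .id = "sheet")
stacked
```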

### 3.5. Joining dataframes

In [63]:
library(dplyr)

In [64]:
math <- data.frame(
    student_id=seq(1, 5),
    maths=c(10, 8, 7, 8.5, 9))

physics <- data.frame(
    student_id=seq(2, 10, 2),
    physics=c(7, 3, 6, 8, 9.5)) %>% mutate(id=student_id)

In [65]:
math

student_id,maths
<int>,<dbl>
1,10.0
2,8.0
3,7.0
4,8.5
5,9.0


In [66]:
physics

student_id,physics,id
<dbl>,<dbl>,<dbl>
2,7.0,2
4,3.0,4
6,6.0,6
8,8.0,8
10,9.5,10


In [67]:
math %>% left_join(physics, by='student_id')

student_id,maths,physics,id
<dbl>,<dbl>,<dbl>,<dbl>
1,10.0,,
2,8.0,7.0,2.0
3,7.0,,
4,8.5,3.0,4.0
5,9.0,,


In [68]:
math %>% inner_join(physics, by='student_id')

student_id,maths,physics,id
<dbl>,<dbl>,<dbl>,<dbl>
2,8.0,7,2
4,8.5,3,4


In [69]:
math %>% left_join(physics, by=c('student_id'='id'))

student_id,maths,student_id.y,physics
<dbl>,<dbl>,<dbl>,<dbl>
1,10.0,,
2,8.0,2.0,7.0
3,7.0,,
4,8.5,4.0,3.0
5,9.0,,
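
Two other join types are worth knowing alongside `left_join()` and `inner_join()`: `full_join()` keeps every key from both tables, and `anti_join()` keeps only the rows of the first table that have no match in the second. The sketch below rebuilds simplified versions of the two tables (without the extra `id` column) so that it runs on its own.

```r
library(dplyr)

math <- data.frame(
    student_id = 1:5,
    maths = c(10, 8, 7, 8.5, 9))

physics <- data.frame(
    student_id = seq(2L, 10L, by = 2L),
    physics = c(7, 3, 6, 8, 9.5))

# every student that appears in either table
all_students <- math %>% full_join(physics, by = "student_id")

# students with a maths score but no physics score
math_only <- math %>% anti_join(physics, by = "student_id")

all_students
math_only
```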


## Resources
- *r4ds.had.co.nz - [R for Data Science](https://r4ds.had.co.nz/)*