# Tidyverse (part 2)

### Pipes

Pipes are supplied by the package magrittr, and are (normally) loaded with `library(tidyverse)`. The main innovation is the "pipe" operator `%>%`, which lets you "chain" commands without nesting brackets or using intermediate variables.

In [1]:
library(tidyverse)
library(readxl)

gcdkit.dir<-"C:\\Users\\moje4671\\R\\win-library\\3.6\\GCDkit\\"
sazxlFile <- paste(gcdkit.dir,"Test_data\\sazava.xls",sep="")

sazava_tbl<- read_xls(sazxlFile)

"package 'tidyverse' was built under R version 3.6.3"
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.4     v dplyr   1.0.2
v tidyr   1.0.2     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.0

"package 'ggplot2' was built under R version 3.6.3"
"package 'tibble' was built under R version 3.6.3"
"package 'tidyr' was built under R version 3.6.3"
"package 'readr' was built under R version 3.6.3"
"package 'purrr' was built under R version 3.6.3"
"package 'dplyr' was built under R version 3.6.3"
"package 'forcats' was built under R version 3.6.3"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()

A common situation is that you have to perform several operations on your data. Consider, for instance, the following case:

In [4]:
sazava_tbl[sazava_tbl[,"SiO2"]<55,"Al2O3"]

Al2O3
<dbl>
17.57
18.23
13.34
14.17
21.64


This can also be written using the dplyr verbs that operate on tibbles:

In [6]:
select(filter(sazava_tbl,SiO2<55),Al2O3)

Al2O3
<dbl>
17.57
18.23
13.34
14.17
21.64


Note, incidentally, that dplyr verbs use "data masking" (and, for `select`, the related "tidy selection"), meaning that column names can be written without quotes. In some cases, such as `select`, quoted and unquoted names can be used interchangeably:

In [9]:
select(filter(sazava_tbl,SiO2<55),Al2O3)
select(filter(sazava_tbl,SiO2<55),"Al2O3")

Al2O3
<dbl>
17.57
18.23
13.34
14.17
21.64


Al2O3
<dbl>
17.57
18.23
13.34
14.17
21.64


This feature is great when working interactively, but may become annoying when programming (for instance when a column name is stored in a variable), and does not always behave entirely predictably. See, for instance,
https://stackoverflow.com/questions/65671975/tibbles-and-data-defined-column-names/65672042#65672042
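
When the column name is held in a character variable, the unquoted form no longer applies directly. Below is a minimal sketch of one common workaround, assuming dplyr 1.0 or later: the `.data` pronoun for data-masking verbs such as `filter`, and `all_of()` for `select`. The tibble `toy` and the variables `filter_col` and `keep_col` are hypothetical names with made-up values, not part of the Sazava dataset.

```r
library(dplyr)

# Hypothetical toy data, for illustration only (not the Sazava dataset)
toy <- tibble(SiO2 = c(50, 60, 52), Al2O3 = c(17, 15, 18))

# Column names stored as character strings, as typically happens inside functions
filter_col <- "SiO2"
keep_col <- "Al2O3"

# .data[[...]] looks the string up as a column inside filter();
# all_of() does the equivalent job for select()
res <- toy %>%
  filter(.data[[filter_col]] < 55) %>%
  select(all_of(keep_col))

res
```

Writing `filter(toy, filter_col < 55)` instead would compare the string `"SiO2"` itself with 55, which is one way the behaviour can surprise.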

As we all know, piling up operations may lead to clumsy and unreadable code:

In [10]:
sazava_tbl[sazava_tbl[,"SiO2"]<55,"Al2O3"] / sazava_tbl[sazava_tbl[,"SiO2"]<55,"CaO"]*2

Al2O3
<dbl>
3.542339
4.264327
1.822404
2.477273
3.147636


So the usual cure is to play with intermediate variables:

In [12]:
idx <- sazava_tbl[,"SiO2"]<55
al <- sazava_tbl[idx,"Al2O3"]
ca <- sazava_tbl[idx,"CaO"]

al/ca*2

Al2O3
<dbl>
3.542339
4.264327
1.822404
2.477273
3.147636


This is still a bit unwieldy, and ends up polluting the workspace with lots of intermediate variables, which has been known to cause trouble. This is where the pipe comes in handy.

The pipe is simply an operator (an infix function) that connects its left-hand side (lhs) and right-hand side (rhs). The output of the lhs expression becomes the (first) input of the rhs function, so pipes work with any function that takes a sensible first argument. For example:


In [32]:
1:10 %>% rnorm %>% mean

which corresponds to (up to the random draw)

In [33]:
mean(rnorm(1:10))
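
Because `rnorm` returns different numbers on each call, the two forms above cannot be compared directly. The following small sketch uses deterministic functions instead to check the correspondence; the vector `x` is an arbitrary example.

```r
library(magrittr)

x <- c(1, 4, 9)

# x %>% f is f(x): the left-hand side becomes the first argument
piped  <- x %>% sqrt %>% sum
nested <- sum(sqrt(x))
identical(piped, nested)

# Extra arguments stay on the right-hand side: x %>% f(y) is f(x, y)
piped_args  <- x %>% head(2)
nested_args <- head(x, 2)
identical(piped_args, nested_args)
```

Both comparisons should return `TRUE`: the pipe only changes how the call is written, not what is computed.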

Therefore, the earlier `select(filter(...))` command can be recast as follows:

In [15]:
sazava_tbl %>% filter(SiO2<55) %>% select(Al2O3)

Al2O3
<dbl>
17.57
18.23
13.34
14.17
21.64


We can extend the chain with the `mutate` command, which calculates a new variable:

In [16]:
sazava_tbl %>% filter(SiO2<55) %>% select(Al2O3,CaO) %>% mutate(AlCa = Al2O3/CaO*2)

Al2O3,CaO,AlCa
<dbl>,<dbl>,<dbl>
17.57,9.92,3.542339
18.23,8.55,4.264327
13.34,14.64,1.822404
14.17,11.44,2.477273
21.64,13.75,3.147636


This is equivalent to the following, written with intermediate variables:

In [21]:
intermediate1 <- filter(sazava_tbl,SiO2<55)
intermediate2 <- select(intermediate1,Al2O3,CaO)
intermediate3 <- mutate(intermediate2,AlCa = Al2O3/CaO*2)
intermediate3

Al2O3,CaO,AlCa
<dbl>,<dbl>,<dbl>
17.57,9.92,3.542339
18.23,8.55,4.264327
13.34,14.64,1.822404
14.17,11.44,2.477273
21.64,13.75,3.147636


Assignment can be done using the slightly uncommon rightward variant of the assignment operator, `->`:

In [22]:
sazava_tbl %>% filter(SiO2<55) %>% select(Al2O3,CaO) %>% mutate(AlCa = Al2O3/CaO*2) -> result

Or in the more common (but perhaps less readable) form:

In [28]:
result <- sazava_tbl %>% filter(SiO2<55) %>% select(Al2O3,CaO) %>% mutate(AlCa = Al2O3/CaO*2)

For simple in-place replacement, one may use the assignment pipe `%<>%` of magrittr. It is **not** attached by `library(tidyverse)`: you need to load magrittr explicitly to access the more evolved pipes, of which there are several types, not covered here.

In [29]:
library(magrittr)
result %<>% select(AlCa)
result

AlCa
<dbl>
3.542339
4.264327
1.822404
2.477273
3.147636
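
As a quick check of what `%<>%` does, the sketch below compares it with the usual "pipe, then assign back" form on an arbitrary numeric vector (`a` and `b` are illustrative names, not objects from this notebook).

```r
library(magrittr)

a <- c(4, 9, 16)
b <- a

# Usual form: pipe, then assign the result back to the same name
a <- a %>% sqrt

# Assignment pipe: pipe and overwrite the left-hand side in one step
b %<>% sqrt

identical(a, b)
```

Since `%<>%` silently overwrites its left-hand side, it is probably best kept for short, obvious updates.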


Pipes are also newline-friendly (as long as each line ends with `%>%`), so you can write very legible code:

In [30]:
sazava_tbl %>% 
  filter(SiO2<55) %>% 
  select(Al2O3,CaO) %>% 
  mutate(AlCa = Al2O3/CaO*2) 

Al2O3,CaO,AlCa
<dbl>,<dbl>,<dbl>
17.57,9.92,3.542339
18.23,8.55,4.264327
13.34,14.64,1.822404
14.17,11.44,2.477273
21.64,13.75,3.147636


Finally, in pipe chains, `.` can be used as a shorthand to refer to the "current" value that gets passed through the pipe. So a neat way to assign the result of a pipe is:

In [None]:
sazava_tbl %>% 
  filter(SiO2<55) %>% 
  select(Al2O3,CaO) %>% 
  mutate(AlCa = Al2O3/CaO*2) %>%
  {.} -> result

... which does nothing more than the previous versions, but in a very clean way (you see what goes into the pipe, and what comes out).
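
The `.` shorthand is also useful when the piped value should not go into the first argument. A minimal sketch, using base `paste` and an arbitrary vector `ids` (both chosen here purely for illustration):

```r
library(magrittr)

ids <- c(1, 2, 3)

# Without a dot, the lhs becomes the first argument:
# equivalent to paste(ids, "sample", sep = "_")
first_arg <- ids %>% paste("sample", sep = "_")

# With a dot as an argument, the lhs goes where the dot is:
# equivalent to paste("sample", ids, sep = "_")
dot_arg <- ids %>% paste("sample", ., sep = "_")

first_arg
dot_arg
```

The same pattern applies to functions whose data argument is not the first one.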