# The Tidyverse for GCDkit users

### A gentle introduction

The tidyverse (https://www.tidyverse.org) is "*an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures*". This tutorial introduces the basics for R/GCDkit users.

The tidyverse packages excel at two things: (i) data management, (ii) plotting. Both tasks are routine when working with geochemical data, so the tidyverse functions are a natural fit.

## Installation

The tidyverse is a collection of half a dozen or so packages. Assuming they have been installed beforehand (with `install.packages`), most of them can be loaded in one go with:

In [3]:
library(tidyverse)

"package 'tidyverse' was built under R version 3.6.3"
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.4     v dplyr   1.0.2
v tidyr   1.0.2     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.0

"package 'ggplot2' was built under R version 3.6.3"
"package 'tibble' was built under R version 3.6.3"
"package 'tidyr' was built under R version 3.6.3"
"package 'readr' was built under R version 3.6.3"
"package 'purrr' was built under R version 3.6.3"
"package 'dplyr' was built under R version 3.6.3"
"package 'forcats' was built under R version 3.6.3"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr
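
If you are not sure whether the packages are already installed, the check can be scripted. The following is only a sketch: it assumes a working internet connection and a configured CRAN mirror, and it uses `tidyverse_packages()` (a helper of the tidyverse package itself) to list what belongs to the collection.

```r
# Install the tidyverse meta-package only if it is missing, then attach it.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
library(tidyverse)

# Names of all packages that belong to the tidyverse (core and non-core).
tidy_pkgs <- tidyverse_packages()
tidy_pkgs
```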

Each component of the tidyverse has online documentation on its own website, along with a very convenient "cheat sheet" (in PDF format). The main packages are:

### Tidyverse core
All of these are loaded by `library(tidyverse)`:
- **tibble**. Supplies a new type of data structure, a "tibble" (more on that below). https://tibble.tidyverse.org
- **readr**. High-level functions for reading files. https://readr.tidyverse.org 
- **tidyr**. Cleans and reshapes data. https://tidyr.tidyverse.org
- **magrittr**. Supplies pipe operators, very pleasant "syntactic sugar" for chaining operations. https://magrittr.tidyverse.org . `library(tidyverse)` makes the basic pipe `%>%` available; the more advanced possibilities of magrittr require loading it separately (`library(magrittr)`).
- **dplyr**. Data transformation, offers an SQL-like syntax to operate on tibbles. https://dplyr.tidyverse.org
- **ggplot2**. High-level plotting functions. https://ggplot2.tidyverse.org
- See also **stringr** (string manipulation), **forcats** (factors), and **purrr** (map/reduce, i.e. an arguably better version of the "apply" family), and also **lubridate** to work with dates, etc. https://www.tidyverse.org/packages/

There are rare cases where individual packages must be loaded by hand (see GitHub etc.), mostly as a result of bugs or "unexpected features". This should rarely be an issue.

### More tidyverse
Not loaded with `library(tidyverse)`:
- **readxl**. Reads Excel files. Very efficient and much easier to use than the regular ODBC library. https://readxl.tidyverse.org

### Friends of tidyverse
These are not officially part of the tidyverse and are maintained by different authors, but they play well with it.
- **clipr**. Read and write to/from the clipboard, much better than the regular R functions. https://github.com/mdlincoln/clipr
- **R6**. Supplies Python-like classes to R. https://r6.r-lib.org



The tidyverse supplies a lot of "syntactic sugar" (for instance, column names can indifferently be quoted or left unquoted). This makes it very easy to use for inline processing (notebook or script), less so for programming. Also, tidyverse code tends to be more verbose (but arguably more readable) than the equivalent plain R code.
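
To make the point about quoting concrete, here is a minimal sketch using dplyr's `select()` and `filter()`, which are introduced further down. The small table and the helper `keep_above()` are made up for illustration; `.data` and `all_of()` are the standard tools for passing column names held in variables.

```r
library(tidyverse)

tbl <- tibble(a = c(1, 2, 3),
              b = c(NA, 4, 5),
              d = c("blue", "green", "red"))

# Inline use: the column name can be bare or quoted.
bare   <- select(tbl, a)
quoted <- select(tbl, "a")

# Programming: the column name arrives as a string in a variable,
# so it has to be looked up explicitly.
keep_above <- function(data, column, threshold) {
  # .data[[column]] fetches the column by its string name inside filter()
  filter(data, .data[[column]] > threshold)
}
above_one <- keep_above(tbl, "a", 1)

# all_of() plays the same role for select().
wanted <- c("a", "d")
subset_cols <- select(tbl, all_of(wanted))
```

The extra ceremony in `keep_above()` is the price of the convenient unquoted syntax: inside a function, dplyr cannot guess that `column` is a variable rather than a column name.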

## Some key components

## Tibbles

Tibbles (a pun on "table") are R data frames, with improved functionalities. As they say on the web site, tibbles "*are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.*"

A tibble is created in the usual way:

In [7]:
ds <- tibble (a=c(1,2,3),
              b=c(NA,4,5),
              d=c("blue","green","red"))
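
For small tables typed by hand, the tibble package also offers `tribble()`, which takes the data row by row. The sketch below is an alternative way of writing the same table (the name `ds_rowwise` is ours).

```r
library(tidyverse)

# Same content as `ds`, written row by row; column names are given as ~name.
ds_rowwise <- tribble(
  ~a, ~b, ~d,
   1, NA, "blue",
   2,  4, "green",
   3,  5, "red"
)
ds_rowwise
```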

The first obvious difference with a data frame is that a tibble has a nicer `print` method (*well, it is not visible in Jupyter because data frames and tibbles are formatted in the same way, but try in Rgui...*):

In [8]:
ds

a,b,d
<dbl>,<dbl>,<chr>
1,,blue
2,4.0,green
3,5.0,red


However, the more important differences show up best when compared with the behaviour of ordinary data frames.

In [32]:
gcdkit.dir<-"C:\\Users\\moje4671\\R\\win-library\\3.6\\GCDkit\\"
sazFile <- paste(gcdkit.dir,"data\\sazava.txt",sep="")

In [38]:
sazava.df<-read.table(sazFile)
sazava.df

,Intrusion,Locality,Petrology,Outcrop,Symbol,Colour,SiO2,TiO2,Al2O3,FeO,...,Dy,Ho,Er,Tm,Yb,Lu,Y,Cs,Ta,Hf
,<fct>,<fct>,<fct>,<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>
Sa-1,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,59.98,0.63,16.42,5.46,...,,,,,,,25,,,
Sa-2,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,55.17,0.71,17.0,5.26,...,,,,,,,30,,,
Sa-3,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,55.09,0.75,17.59,5.81,...,,,,,,,30,,,
Sa-4,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,50.72,0.83,17.57,7.65,...,5.8,1.03,2.8,0.43,2.88,0.43,38,5.7,0.5,2.5
Sa-7,Sazava,Teletin,bi-amph tonalite,disused quarry,10,1,57.73,0.95,18.82,5.43,...,2.7,0.56,1.64,0.24,1.52,0.25,24,6.6,0.6,3.6
SaD-1,basic,Teletin,bi-amph quartz diorite,disused quarry,8,1,52.9,1.35,18.23,7.24,...,,,,,,,36,2.3,1.1,1.8
Gbs-1,basic,Pecerady,px-amph gabbro,disused quarry,19,1,49.63,0.76,13.34,5.69,...,,,,,,,20,,,
Gbs-20,basic,Pecerady,px-amph gabbro,disused quarry,19,1,51.72,0.67,14.17,6.43,...,,,,,,,19,,,
Gbs-2,basic,Vavretice,amph-bi qtz gabbrodiorite,disused quarry,19,1,48.84,0.34,21.64,2.74,...,,,,,,,10,,,
Gbs-3,basic,Brtnice,amph-bi qtz gabbrodiorite,water supply gallery,19,1,55.8,0.8,16.98,6.22,...,,,,,,,42,,,


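By default, reading a file into a data frame converts strings to factors in R versions before 4.0.0, such as the 3.6 series used here (this can be overridden with `stringsAsFactors = FALSE`; from R 4.0.0 onwards, strings are left as they are by default).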
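
The effect of the `stringsAsFactors` argument can be checked on a tiny inline table. This is only a sketch: the text below stands in for a data file and reuses two samples from the table above; setting the argument explicitly makes the result independent of the R version.

```r
# Two samples, with one fewer item in the header line than in the data rows,
# so read.table() uses the first column as row names.
txt <- "Intrusion SiO2
Sa-1 Sazava 59.98
Gbs-1 basic 49.63"

as_factors <- read.table(text = txt, stringsAsFactors = TRUE)
as_strings <- read.table(text = txt, stringsAsFactors = FALSE)

class(as_factors$Intrusion)   # text column stored as a factor
class(as_strings$Intrusion)   # text column kept as character
```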

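Tibbles do not like row names (in fact, they actively discourage using them), so it is not straightforward to load files with one fewer item in the first line (as you would do in plain R). However, `readxl` reads the Excel version of the same file without trouble, storing the sample names in an ordinary first column (automatically named `...1`, since its header cell is empty):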

In [47]:
sazxlFile <- paste(gcdkit.dir,"Test_data\\sazava.xls",sep="")
library(readxl)
sazava_tbl<- read_xls(sazxlFile)
sazava_tbl

New names:
* `` -> ...1



...1,Intrusion,Locality,Petrology,Outcrop,Symbol,Colour,SiO2,TiO2,Al2O3,...,Dy,Ho,Er,Tm,Yb,Lu,Y,Cs,Ta,Hf
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Sa-1,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,59.98,0.63,16.42,...,,,,,,,25,,,
Sa-2,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,55.17,0.71,17.0,...,,,,,,,30,,,
Sa-3,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,55.09,0.75,17.59,...,,,,,,,30,,,
Sa-4,Sazava,Mrac,bi-amph quartz diorite,working quarry,10,4,50.72,0.83,17.57,...,5.8,1.03,2.8,0.43,2.88,0.43,38,5.7,0.5,2.5
Sa-7,Sazava,Teletín,bi-amph tonalite,disused quarry,10,1,57.73,0.95,18.82,...,2.7,0.56,1.64,0.24,1.52,0.25,24,6.6,0.6,3.6
SaD-1,basic,Teletín,bi-amph quartz diorite,disused quarry,8,1,52.9,1.35,18.23,...,,,,,,,36,2.3,1.1,1.8
Gbs-1,basic,Pecerady,px-amph gabbro,disused quarry,19,1,49.63,0.76,13.34,...,,,,,,,20,,,
Gbs-20,basic,Pecerady,px-amph gabbro,disused quarry,19,1,51.72,0.67,14.17,...,,,,,,,19,,,
Gbs-2,basic,Vavretice,amph-bi qtz gabbrodiorite,disused quarry,19,1,48.84,0.34,21.64,...,,,,,,,10,,,
Gbs-3,basic,Brtnice,amph-bi qtz gabbrodiorite,water supply gallery,19,1,55.8,0.8,16.98,...,,,,,,,42,,,


As you see, the textual columns are of type `<chr>` and not `<fct>`: tibbles perform no automatic type conversion.
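
If the data already sit in a data frame with row names, they can be moved into a regular column during conversion. The following is a sketch on a two-sample extract of the table above; the column name `Sample` is our choice.

```r
library(tidyverse)

# Small data frame with sample names stored as row names.
small.df <- data.frame(SiO2 = c(59.98, 55.17),
                       TiO2 = c(0.63, 0.71),
                       row.names = c("Sa-1", "Sa-2"))

# Move the row names into a regular column while converting to a tibble.
small_tbl <- as_tibble(small.df, rownames = "Sample")
small_tbl

# Going back, for plain R functions that insist on row names.
back.df <- column_to_rownames(small_tbl, var = "Sample")
```

In the same spirit, the automatically named first column of `sazava_tbl` could presumably be given a proper name with ``rename(sazava_tbl, Sample = `...1`)``.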

As we mentioned, tibbles are "lazy and surly: they do less and they complain more".

**Tibbles are lazy:** one of the best features of tibbles is that they are type-stable: subsetting a tibble with `[` always returns a tibble, unless you explicitly ask otherwise. So, for instance:

In [48]:
sazava.df[,5]
# A vector

In [49]:
sazava_tbl[,6]
# A tibble

Symbol
<dbl>
10
10
10
10
10
8
19
19
19
19


In [50]:
sazava.df[1,7]
# A scalar

In [51]:
sazava_tbl[1,8]
# Still a tibble !

SiO2
<dbl>
59.98


In [52]:
sazava.df[1,NULL]
# A dataframe

Sa-1


In [54]:
sazava_tbl[1,NULL]
# Still a tibble (but jupyter does not print it, try in Rgui)
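
Why this matters is easiest to see in a small function. The sketch below uses a made-up helper, `n_rows()`, on a toy table: with a data frame the result depends on how many columns are selected, with a tibble it does not.

```r
library(tidyverse)

df  <- data.frame(a = c(1, 2, 3), b = c(NA, 4, 5))
tbl <- as_tibble(df)

class(df[, 1])                # plain R simplifies a single column to a vector
class(df[, 1, drop = FALSE])  # drop = FALSE keeps the data frame
class(tbl[, 1])               # a tibble stays a tibble

# Hypothetical helper: count the rows of a selection of columns.
n_rows <- function(x, cols) nrow(x[, cols])

rows_tbl <- n_rows(tbl, "a")  # works: the selection is still a table
rows_df  <- n_rows(df, "a")   # NULL: the selection collapsed to a vector
```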

If you want to convert the contents of a tibble to a vector, you can do it explicitly with the tidyverse function `as_vector`, supplied by purrr (NOT the same as plain R `as.vector`!):

In [83]:
as_vector(sazava_tbl[,6])

Or you can use the `[[` or `$` operator (however, as this performs an implicit conversion, it is discouraged):

In [84]:
sazava_tbl$SiO2

In [86]:
sazava_tbl[["SiO2"]]
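
Another explicit option, sketched here on a toy table, is dplyr's `pull()`, which extracts one column as a vector and accepts the column name unquoted.

```r
library(tidyverse)

tbl <- tibble(a = c(1, 2, 3), b = c(NA, 4, 5))

# Extract column `a` as a plain numeric vector.
a_vec <- pull(tbl, a)
a_vec

# Same values as the [[ and $ forms.
identical(a_vec, tbl[["a"]])
identical(a_vec, tbl$a)
```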

**Tibbles are surly:** they complain more and won't let you, for instance, use partial matching:

In [59]:
sazava.df$Si
# Automatically expanded to SiO2

In [61]:
sazava_tbl$Si

"Unknown or uninitialised column: `Si`."


NULL
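
The same comparison can be reproduced without the data file. A minimal sketch on a one-column toy table:

```r
library(tidyverse)

df  <- data.frame(SiO2 = c(59.98, 55.17))
tbl <- as_tibble(df)

from_df  <- df$Si    # partial matching: silently returns the SiO2 column
from_tbl <- tbl$Si   # NULL, with a warning about an unknown column

# With [[, plain data frames also match exactly by default.
exact_df <- df[["Si"]]
```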

Tibbles can also deal with non-syntactic variable names, as long as you protect them with backticks, and they won't try to convert them to something "sensible":

In [62]:
tibble(a=c(1,2,3),`a silly and +-*? dangerous "title"`=c(4,5,6))

a,"a silly and +-*? dangerous ""title"""
<dbl>,<dbl>
1,4
2,5
3,6


In [79]:
tbl<-tibble(a=c(1,2,3),`a silly and +-*? dangerous "title"`=c(4,5,6))
tbl$`a silly and +-*? dangerous "title"`

In [63]:
data.frame(a=c(1,2,3),`a silly and +-*? dangerous "title"`=c(4,5,6))

a,a.silly.and......dangerous..title.
<dbl>,<dbl>
1,4
2,5
3,6


(whether you regard the above as a good or a bad feature is another matter...)
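
For completeness, plain data frames can be told to keep such names too, through the `check.names` argument of `data.frame()`. A short sketch:

```r
# check.names = FALSE stops data.frame() from "repairing" the column names.
odd.df <- data.frame(a = c(1, 2, 3),
                     `a silly and +-*? dangerous "title"` = c(4, 5, 6),
                     check.names = FALSE)
names(odd.df)

# The column is then reached with backticks, as for the tibble.
odd_values <- odd.df$`a silly and +-*? dangerous "title"`
```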

**Tibbles are still data.frames**, so you can perform normal data frame operations on them (subset, assign, etc.):

In [65]:
tbl <- tibble (a=c(1,2,3),
              b=c(NA,4,5),
              d=c("blue","green","red"))
tbl[1,]


a,b,d
<dbl>,<dbl>,<chr>
1,,blue


In [67]:
tbl[1,1]<-2
tbl

a,b,d
<dbl>,<dbl>,<chr>
2,,blue
2,4.0,green
3,5.0,red


In [70]:
tbl[tbl[,"a"]>2,]

a,b,d
<dbl>,<dbl>,<chr>
3,5,red


However, **tibbles come with more capable functions** to perform basic tasks:

In [71]:
tbl <- tibble (a=c(1,2,3),
              b=c(NA,4,5),
              d=c("blue","green","red"))
df <- data.frame (a=c(1,2,3),
              b=c(NA,4,5),
              d=c("blue","green","red"))

In [74]:
other_tbl <- tibble (b=c(5,5,7),
              a=c(0,4,3),
              e=c("white","green","purple"))
other.df <- data.frame (b=c(5,5,7),
              a=c(0,4,3),
              e=c("white","green","purple"))

In [76]:
rbind(df,other.df)
# oops

ERROR: Error in match.names(clabs, names(xi)): names do not match previous names


In [73]:
bind_rows(tbl,other_tbl)
# Works as planned

a,b,d,e
<dbl>,<dbl>,<chr>,<chr>
1,,blue,
2,4.0,green,
3,5.0,red,
0,5.0,,white
4,5.0,,green
3,7.0,,purple
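
When stacking tables from different sources, it is often useful to remember where each row came from. `bind_rows()` can record this through its `.id` argument. The sketch below rebuilds the two toy tables; the labels `first` and `second` and the column name `source` are arbitrary.

```r
library(tidyverse)

tbl <- tibble(a = c(1, 2, 3),
              b = c(NA, 4, 5),
              d = c("blue", "green", "red"))
other_tbl <- tibble(b = c(5, 5, 7),
                    a = c(0, 4, 3),
                    e = c("white", "green", "purple"))

# Names of the list elements end up in the new `source` column.
combined <- bind_rows(list(first = tbl, second = other_tbl), .id = "source")
combined
```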


There are also more explicit functions for subsetting:

In [87]:
tbl <- tibble (a=c(1,2,3),
              b=c(NA,4,5),
              d=c("blue","green","red"))
tbl[tbl[,"a"]>2,]

a,b,d
<dbl>,<dbl>,<chr>
3,5,red


This can be replaced by

In [88]:
filter(tbl,a>2)

a,b,d
<dbl>,<dbl>,<chr>
3,5,red


and likewise

In [89]:
tbl[,"a"]
select(tbl,"a")

a
<dbl>
1
2
3


a
<dbl>
1
2
3


which means that we can combine them into

In [92]:
filter(select(tbl,"a"),a<3)

a
<dbl>
1
2


As we will see later, they work very well with the "pipe" operator.
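
As a foretaste, here is a sketch of the nested call above rewritten with the magrittr pipe `%>%`, which passes the result of the left-hand side as the first argument of the function on the right.

```r
library(tidyverse)

tbl <- tibble(a = c(1, 2, 3),
              b = c(NA, 4, 5),
              d = c("blue", "green", "red"))

# Nested form: has to be read from the inside out.
nested <- filter(select(tbl, "a"), a < 3)

# Piped form: reads from left to right, one step per line.
piped <- tbl %>%
  select("a") %>%
  filter(a < 3)
```

Both forms give the same result. R 4.1.0 and later also provide a native pipe, `|>`, which would work here in the same way.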

So, in summary:
- Tibbles are modified data frames.
- They are lazy and surly (they do less and they complain more), forcing you to write more explicit code and not rely on implicit conversions.
- They come with functions that mirror plain R functions, adapted for tibbles and, in general, more capable. Most of them have names with underscores (e.g. `as_vector`, `bind_rows`...).