# Introduction to Tidyverse - Part 2 

## Transforming Data: dplyr Basics

* Load the tidyverse by running:

In [1]:
library(tidyverse, warn.conflicts = FALSE)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


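* "warn.conflicts = FALSE" asks library() not to print its own warnings about name conflicts when attaching the package. As the output above shows, the tidyverse still prints its Conflicts summary.
* That one line of code loads the core tidyverse, the packages that you will use in almost every data analysis.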

### The penguins Data Frame

In addition to the tidyverse, we will use the Palmer penguins dataset, which contains body measurements for
penguins on three islands in the Palmer Archipelago.

### Loading the dataset

In [2]:
penguins <- read.csv("./penguins.csv")

### Viewing the data frame

In [3]:
penguins

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007
4,Adelie,Torgersen,,,,,,2007
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007
9,Adelie,Torgersen,34.1,18.1,193,3475,,2007
10,Adelie,Torgersen,42.0,20.2,190,4250,,2007


* For an alternative view, where you can see all variables and the first few observations of each variable, use glimpse().

Variables:
* species: A penguin’s species (Adelie, Chinstrap, or Gentoo)
* flipper_length_mm: The length of a penguin’s flipper, in millimeters
* body_mass_g: The body mass of a penguin, in grams

In [4]:
glimpse(penguins)

Observations: 344
Variables: 9
$ rowid             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A...
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge...
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34....
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18....
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ...
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347...
$ sex               <fct> male, female, female, NA, female, male, female, m...
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2...


* The variable names are followed by abbreviations that tell
you the type of each variable: <int> is short for integer, <dbl> is short
for double (aka real numbers), and <fct> is short for factor (categorical
values with a fixed set of levels). In other data frames you may also see
<chr> for character (aka strings) and <dttm> for date-time. These are
important because the operations you can perform on a column depend so
much on its “type.”
* dplyr’s verbs are organized into four groups based on what they operate on:
rows, columns, groups, and tables. 

## Rows
The most important verbs that operate on rows of a dataset are filter(),
which changes which rows are present without changing their order, and
arrange(), which changes the order of the rows without changing which
are present. Both functions affect only the rows, and the columns are left
unchanged. We’ll also discuss distinct(), which finds rows with
unique values, but unlike arrange() and filter(), it can also
optionally modify the columns.


#### filter()

In [5]:
penguins %>% filter(bill_depth_mm > 17)

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007
9,Adelie,Torgersen,34.1,18.1,193,3475,,2007
10,Adelie,Torgersen,42.0,20.2,190,4250,,2007
11,Adelie,Torgersen,37.8,17.1,186,3300,,2007


* TIP: In RStudio, press Ctrl + Shift + M to insert the pipe operator " %>% "

In [6]:
penguins  %>% filter(year == 2007)

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007
4,Adelie,Torgersen,,,,,,2007
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007
9,Adelie,Torgersen,34.1,18.1,193,3475,,2007
10,Adelie,Torgersen,42.0,20.2,190,4250,,2007


In [7]:
penguins  %>% filter(island == "Dream")

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
31,Adelie,Dream,39.5,16.7,178,3250,female,2007
32,Adelie,Dream,37.2,18.1,178,3900,male,2007
33,Adelie,Dream,39.5,17.8,188,3300,female,2007
34,Adelie,Dream,40.9,18.9,184,3900,male,2007
35,Adelie,Dream,36.4,17.0,195,3325,female,2007
36,Adelie,Dream,39.2,21.1,196,4150,male,2007
37,Adelie,Dream,38.8,20.0,190,3950,male,2007
38,Adelie,Dream,42.2,18.5,180,3550,female,2007
39,Adelie,Dream,37.6,19.3,181,3300,female,2007
40,Adelie,Dream,39.8,19.1,184,4650,male,2007


* We can combine multiple conditions with & ("and") or | ("or")

In [8]:
penguins  %>% filter(bill_length_mm > 35 & island == "Torgersen")

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007
10,Adelie,Torgersen,42.0,20.2,190,4250,,2007
11,Adelie,Torgersen,37.8,17.1,186,3300,,2007
12,Adelie,Torgersen,37.8,17.3,180,3700,,2007
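
As a minimal sketch of the ways conditions can be combined, the block below uses a small made-up data frame called toy (a stand-in for penguins.csv, not the real data). It assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins.csv
toy <- data.frame(
  island = c("Torgersen", "Dream", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 34.0, 47.5, 49.0),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with "and", just like &
both <- toy %>% filter(bill_length_mm > 35, island == "Dream")

# | means "or": keep rows that satisfy at least one condition
either <- toy %>% filter(bill_length_mm > 45 | island == "Torgersen")

# %in% is shorthand for several == comparisons joined by |
two_islands <- toy %>% filter(island %in% c("Dream", "Biscoe"))
```

Here both keeps one row, while either and two_islands keep three rows each.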


* TIP: In RStudio, press Alt + - to insert the assignment operator "<-"

* We can save the result of a filtering operation as a new data frame under a different name

In [9]:
female_penguins  <- penguins  %>% filter(sex == "female")
male_penguins  <- penguins  %>% filter(sex != "female")

In [10]:
female_penguins

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007
13,Adelie,Torgersen,41.1,17.6,182,3200,female,2007
16,Adelie,Torgersen,36.6,17.8,185,3700,female,2007
17,Adelie,Torgersen,38.7,19.0,195,3450,female,2007
19,Adelie,Torgersen,34.4,18.4,184,3325,female,2007
21,Adelie,Biscoe,37.8,18.3,174,3400,female,2007
23,Adelie,Biscoe,35.9,19.2,189,3800,female,2007


In [11]:
male_penguins

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007
14,Adelie,Torgersen,38.6,21.2,191,3800,male,2007
15,Adelie,Torgersen,34.6,21.1,198,4400,male,2007
18,Adelie,Torgersen,42.5,20.7,197,4500,male,2007
20,Adelie,Torgersen,46.0,21.5,194,4200,male,2007
22,Adelie,Biscoe,37.7,18.7,180,3600,male,2007
24,Adelie,Biscoe,38.2,18.1,185,3950,male,2007
25,Adelie,Biscoe,38.8,17.2,180,3800,male,2007
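
Note that penguins with a missing sex appear in neither result: filter() keeps only the rows where the condition is TRUE, and a comparison with NA is NA. The sketch below illustrates this on a small made-up data frame (toy is hypothetical, not the real file) and shows one way to ask for the missing values explicitly.

```r
library(dplyr)

# Hypothetical toy data with one missing sex value
toy <- data.frame(
  rowid = 1:4,
  sex = c("male", "female", NA, "male"),
  stringsAsFactors = FALSE
)

# The NA row matches neither sex == "female" nor sex != "female"
not_female <- toy %>% filter(sex != "female")

# is.na() selects the rows with a missing value
unknown_sex <- toy %>% filter(is.na(sex))

# Keep the non-female rows together with the rows of unknown sex
not_female_or_unknown <- toy %>% filter(sex != "female" | is.na(sex))
```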


#### arrange()
* arrange() changes the order of the rows based on the values of one or
more columns.

In [12]:
penguins  %>% arrange(bill_length_mm)

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
143,Adelie,Dream,32.1,15.5,188,3050,female,2009
99,Adelie,Dream,33.1,16.1,178,2900,female,2008
71,Adelie,Torgersen,33.5,19.0,190,3600,female,2008
93,Adelie,Dream,34.0,17.1,185,3400,female,2008
9,Adelie,Torgersen,34.1,18.1,193,3475,,2007
19,Adelie,Torgersen,34.4,18.4,184,3325,female,2007
55,Adelie,Biscoe,34.5,18.1,187,2900,female,2008
15,Adelie,Torgersen,34.6,21.1,198,4400,male,2007
81,Adelie,Torgersen,34.6,17.2,189,3200,female,2008
53,Adelie,Biscoe,35.0,17.9,190,3450,female,2008


* We can use desc() on a column inside of arrange() to reorder the
data frame based on that column in descending (big-to-small) order

In [13]:
penguins  %>% arrange(desc(bill_length_mm))

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
186,Gentoo,Biscoe,59.6,17.0,230,6050,male,2007
294,Chinstrap,Dream,58.0,17.8,181,3700,female,2007
254,Gentoo,Biscoe,55.9,17.0,228,5600,male,2009
340,Chinstrap,Dream,55.8,19.8,207,4000,male,2009
268,Gentoo,Biscoe,55.1,16.0,230,5850,male,2009
216,Gentoo,Biscoe,54.3,15.7,231,5650,male,2008
308,Chinstrap,Dream,54.2,20.8,201,4300,male,2008
316,Chinstrap,Dream,53.5,19.9,205,4500,male,2008
260,Gentoo,Biscoe,53.4,15.8,219,5500,male,2009
306,Chinstrap,Dream,52.8,20.0,205,4550,male,2008
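
arrange() also accepts several columns, using each additional column to break ties in the preceding ones. The following is a small sketch on made-up data (toy is a hypothetical stand-in for penguins) that also shows where missing values end up.

```r
library(dplyr)

# Hypothetical toy data with one missing bill length
toy <- data.frame(
  species = c("Gentoo", "Adelie", "Adelie", "Chinstrap"),
  bill_length_mm = c(47.5, 39.1, NA, 49.0),
  stringsAsFactors = FALSE
)

# Sort by species first, then by longest bill within each species
sorted <- toy %>% arrange(species, desc(bill_length_mm))

# Missing values are placed at the end, even with desc()
longest_first <- toy %>% arrange(desc(bill_length_mm))
```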


#### distinct()


In [14]:
penguins  %>% distinct(species)

species
Adelie
Gentoo
Chinstrap


In [15]:
penguins  %>% distinct(island)

island
Torgersen
Biscoe
Dream


* If you want to find the number of occurrences instead, you’re better off
swapping distinct() for count(). With the sort = TRUE argument, the counts
are arranged in descending order.

#### count()

In [16]:
penguins %>% count(species, sort = TRUE)

species,n
Adelie,152
Gentoo,124
Chinstrap,68
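
Both distinct() and count() accept more than one column. The sketch below, on a made-up data frame (toy is hypothetical), shows unique combinations of two columns, the .keep_all argument that keeps the remaining columns, and counts per combination.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins.csv
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  island = c("Torgersen", "Dream", "Dream", "Biscoe"),
  year = c(2007, 2007, 2008, 2007),
  stringsAsFactors = FALSE
)

# Unique species/island combinations; the other columns are dropped
pairs <- toy %>% distinct(species, island)

# .keep_all = TRUE keeps all columns of the first row found per combination
first_rows <- toy %>% distinct(species, island, .keep_all = TRUE)

# count() reports how many rows fall into each combination
pair_counts <- toy %>% count(species, island, sort = TRUE)
```

This is the sense in which distinct() can modify the columns: without .keep_all = TRUE, only the columns named in the call are returned.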


## Columns
There are four important verbs that affect the columns without changing the
rows:
* mutate() creates new columns that are derived from the existing columns,
* select() changes which columns are present,
* rename() changes the names of the columns,
* relocate() changes the positions of the columns.

#### mutate()

* Let's convert the body mass to kilograms for demonstration purposes

In [17]:
penguins  %>% mutate(body_mass_KG = body_mass_g / 1000)

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_KG
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007,3.750
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007,3.800
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007,3.250
4,Adelie,Torgersen,,,,,,2007,
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007,3.450
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007,3.650
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007,3.625
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007,4.675
9,Adelie,Torgersen,34.1,18.1,193,3475,,2007,3.475
10,Adelie,Torgersen,42.0,20.2,190,4250,,2007,4.250


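* The ".before" and ".after" arguments of mutate() specify the position of the newly created columns; by default, new columns are added on the right-hand side. These arguments were added in dplyr 1.0.0, which is newer than the 0.8.0.1 version loaded above.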
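
A short sketch of positioning a new column with mutate(), assuming dplyr 1.0.0 or later is installed. The toy data frame and the lower-case column name body_mass_kg are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins.csv
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000),
  year = c(2007, 2008),
  stringsAsFactors = FALSE
)

# Put the new column first
front <- toy %>% mutate(body_mass_kg = body_mass_g / 1000, .before = 1)

# Put the new column right after the column it was derived from
beside <- toy %>% mutate(body_mass_kg = body_mass_g / 1000, .after = body_mass_g)
```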

#### select()

In [18]:
penguins  %>% select(species, island, bill_length_mm)

species,island,bill_length_mm
Adelie,Torgersen,39.1
Adelie,Torgersen,39.5
Adelie,Torgersen,40.3
Adelie,Torgersen,
Adelie,Torgersen,36.7
Adelie,Torgersen,39.3
Adelie,Torgersen,38.9
Adelie,Torgersen,39.2
Adelie,Torgersen,34.1
Adelie,Torgersen,42.0


##### Selecting a range of columns

In [19]:
penguins  %>% select(island : flipper_length_mm)

island,bill_length_mm,bill_depth_mm,flipper_length_mm
Torgersen,39.1,18.7,181
Torgersen,39.5,17.4,186
Torgersen,40.3,18.0,195
Torgersen,,,
Torgersen,36.7,19.3,193
Torgersen,39.3,20.6,190
Torgersen,38.9,17.8,181
Torgersen,39.2,19.6,195
Torgersen,34.1,18.1,193
Torgersen,42.0,20.2,190


##### Selecting all columns except those from island to flipper_length_mm

In [20]:
penguins %>% select(-(island:flipper_length_mm))

rowid,species,body_mass_g,sex,year
1,Adelie,3750,male,2007
2,Adelie,3800,female,2007
3,Adelie,3250,female,2007
4,Adelie,,,2007
5,Adelie,3450,female,2007
6,Adelie,3650,male,2007
7,Adelie,3625,female,2007
8,Adelie,4675,male,2007
9,Adelie,3475,,2007
10,Adelie,4250,,2007


##### Renaming variables as you select() them with new_name = old_name

In [21]:
penguins  %>% select(YEAR = year)

YEAR
2007
2007
2007
2007
2007
2007
2007
2007
2007
2007


#### rename()

In [22]:
penguins  %>% rename(YEAR = year)

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,YEAR
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007
4,Adelie,Torgersen,,,,,,2007
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007
9,Adelie,Torgersen,34.1,18.1,193,3475,,2007
10,Adelie,Torgersen,42.0,20.2,190,4250,,2007
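
The fourth column verb, relocate(), moves columns without renaming or dropping them. The following is a minimal sketch on made-up data (toy is hypothetical), assuming dplyr 1.0.0 or later, where relocate() is available.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins.csv
toy <- data.frame(
  rowid = 1:2,
  species = c("Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe"),
  year = c(2007, 2008),
  stringsAsFactors = FALSE
)

# By default relocate() moves the named columns to the front
year_first <- toy %>% relocate(year)

# .after (or .before) places them relative to another column
year_after_species <- toy %>% relocate(year, .after = species)
```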


## Groups
dplyr gets even more powerful when you add in the ability to work with
groups. In this section, we’ll focus on the most important functions:
group_by(), summarize(), and the slice family of functions.

#### group_by()

In [23]:
penguins  %>% group_by(sex)

"Factor `sex` contains implicit NA, consider using `forcats::fct_explicit_na`"

rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
1,Adelie,Torgersen,39.1,18.7,181,3750,male,2007
2,Adelie,Torgersen,39.5,17.4,186,3800,female,2007
3,Adelie,Torgersen,40.3,18.0,195,3250,female,2007
4,Adelie,Torgersen,,,,,,2007
5,Adelie,Torgersen,36.7,19.3,193,3450,female,2007
6,Adelie,Torgersen,39.3,20.6,190,3650,male,2007
7,Adelie,Torgersen,38.9,17.8,181,3625,female,2007
8,Adelie,Torgersen,39.2,19.6,195,4675,male,2007
9,Adelie,Torgersen,34.1,18.1,193,3475,,2007
10,Adelie,Torgersen,42.0,20.2,190,4250,,2007


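* group_by() doesn’t change the data; it only marks the data frame as grouped by sex, so that subsequent operations work on each group separately.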

#### summarize()

* Let's calculate the average body mass of male and female penguins

In [24]:
penguins  %>% group_by(sex) %>% summarise(average_mass = mean(body_mass_g))

"Factor `sex` contains implicit NA, consider using `forcats::fct_explicit_na`"

sex,average_mass
female,3862.273
male,4545.685
,
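
The average for the group with missing sex is empty because mean() returns NA as soon as one of its inputs is NA, and that group contains a penguin with a missing body mass (row 4). One possible way to handle this, sketched on made-up data (toy is hypothetical), is to pass na.rm = TRUE and report the group sizes with n().

```r
library(dplyr)

# Hypothetical toy data with a missing sex and a missing body mass
toy <- data.frame(
  sex = c("female", "female", "male", "male", NA),
  body_mass_g = c(3800, 3250, 3750, NA, 4250),
  stringsAsFactors = FALSE
)

by_sex <- toy %>%
  group_by(sex) %>%
  summarise(
    average_mass = mean(body_mass_g, na.rm = TRUE),  # ignore missing masses
    n = n()                                          # rows per group, missing masses included
  )
```

summarise() and summarize() are the same function, so either spelling works.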


#### The slice_ Functions

* df |> slice_head(n = 1) takes the first row from each group.
* df |> slice_tail(n = 1) takes the last row in each group.
* df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
* df |> slice_max(x, n = 1) takes the row with the largest value of column x.
* df |> slice_sample(n = 1) takes one random row.
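
A brief sketch of the slice functions on made-up data (toy is hypothetical). It assumes dplyr 1.0.0 or later, where these functions are available, and uses %>% as in the rest of this notebook; the native |> pipe shown in the list needs R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins.csv
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3750, 3250, 5000, 5600),
  stringsAsFactors = FALSE
)

# Heaviest penguin of each species
heaviest <- toy %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1) %>%
  ungroup()

# Lightest penguin overall (no grouping)
lightest <- toy %>% slice_min(body_mass_g, n = 1)
```

On a grouped data frame the slice functions work within each group; on an ungrouped one they treat the whole table as a single group.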