# The tidyverse family
___

Code that demonstrates how the `tidyverse` packages work, with a focus on `dplyr` and `plyr`.

> The tidyverse is a set of packages that work in harmony because they share common data representations and API design.

Useful links:
* [GitHub for tidyverse](https://github.com/tidyverse/tidyverse)

## Generate reference data

In [2]:
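# Three columns, each a random permutation of five consecutive integers.
X <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = sample(11:15))
X <- X[sample(1:5), ]      # shuffle the row order
X$var2[c(1, 3)] <- NA      # introduce two missing values
X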

Unnamed: 0,var1,var2,var3
4,1,,11
1,5,10.0,12
2,3,,14
5,4,9.0,15
3,2,7.0,13
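
Because `sample()` draws a random permutation, the table above will differ between runs. A minimal sketch, assuming reproducibility is wanted, is to fix the seed before building the data frame. The seed value 42 and the names `X1` and `X2` are arbitrary choices for illustration and are not part of the original notebook.

```r
# Fix the random seed so sample() returns the same permutations on every run.
# The seed value (42) is an arbitrary choice for illustration.
set.seed(42)
X1 <- data.frame(var1 = sample(1:5), var2 = sample(6:10), var3 = sample(11:15))

set.seed(42)
X2 <- data.frame(var1 = sample(1:5), var2 = sample(6:10), var3 = sample(11:15))

# Both data frames are identical because the seed was reset before each build.
identical(X1, X2)
```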


In [6]:
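# Air pollution and weather variables in Chicago.
# Create the target directory if needed, then download the file only once.
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/chicago.RDS")) {
    # extra = "-L" tells curl to follow the redirect that GitHub issues for ?raw=true.
    download.file("https://github.com/DataScienceSpecialization/courses/blob/master/03_GettingData/dplyr/chicago.rds?raw=true",
                  "data/chicago.RDS", extra = "-L", method = "curl", mode = "wb")
}
chicago <- readRDS("data/chicago.RDS")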

In [7]:
str(chicago)

'data.frame':	6940 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...


In [8]:
head(chicago)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,31.5,31.5,1987-01-01,,34.0,4.25,19.9881
chic,33.0,29.875,1987-01-02,,,3.304348,23.19099
chic,33.0,27.375,1987-01-03,,34.16667,3.333333,23.81548
chic,29.0,28.625,1987-01-04,,47.0,4.375,30.43452
chic,32.0,28.875,1987-01-05,,,4.75,30.33333
chic,40.0,35.125,1987-01-06,,48.0,5.833333,25.77233


In [9]:
names(chicago)

In [5]:
?download.file

download.file {utils}, R Documentation

Arguments:
* `url` - A character string naming the URL of a resource to be downloaded.
* `destfile` - A character string with the name where the downloaded file is saved. Tilde-expansion is performed.
* `method` - Method to be used for downloading files. Current download methods are "internal", "wininet" (Windows only), "libcurl", "wget" and "curl", and there is a value "auto": see ‘Details’ and ‘Note’. The method can also be set through the option "download.file.method": see options().
* `quiet` - If TRUE, suppress status messages (if any), and the progress bar.
* `mode` - character. The mode with which to write the file. Useful values are "w", "wb" (binary), "a" (append) and "ab". Only used for the "internal" method.
* `cacheOK` - logical. Is a server-side cached value acceptable?
* `extra` - character vector of additional command-line arguments for the "wget" and "curl" methods.


## plyr / dplyr
Some of the key verbs in the `dplyr` universe are:
* `select` - return a subset of the columns of a data frame
* `filter` - extract a subset of rows from a data frame based on logical conditions
* `arrange` - reorder rows of a data frame
* `rename` - rename variables in a data frame
* `mutate` - add new variables / columns or transform existing variables
* `summarise` / `summarize` - generate summary statistics of different variables in the data frame
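
As a quick orientation, the verbs listed above can be sketched on a small data frame. The data frame `df` and its columns are hypothetical and are not part of the chicago dataset; the sketch only illustrates the shape of each call.

```r
library(dplyr)

# Hypothetical toy data, not part of the chicago dataset.
df <- data.frame(id = 1:4, temp = c(31.5, 85, 40, 90), pm = c(10, 35, NA, 40))

cols    <- select(df, id, temp)                              # keep two columns
hot     <- filter(df, temp > 80)                             # keep rows where temp exceeds 80
sorted  <- arrange(df, desc(temp))                           # order rows by temp, descending
renamed <- rename(df, temperature = temp)                    # syntax is new_name = old_name
scaled  <- mutate(df, temp_c = (temp - 32) * 5 / 9)          # add a derived column
stats   <- summarise(df, mean_pm = mean(pm, na.rm = TRUE))   # collapse to a one-row summary
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes the calls easy to chain.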

In [10]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## Pipeline operator
The pipe operator `%>%` passes the result of one call as the first argument of the next, so operations can be chained without redundant temporary variables.

In [56]:
chicago %>% mutate(month = as.POSIXlt(date)$mon + 1) %>% group_by(month) %>% summarize(pm25 = mean(pm25tmean2, na.rm = TRUE))

month,pm25
1,17.76996
2,20.37513
3,17.40818
4,13.85879
5,14.0742
6,15.86461
7,16.57087
8,16.9338
9,15.91279
10,14.23557
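
To make the effect of the pipe concrete, the following sketch writes the same computation twice, once as nested calls and once as a pipeline. It uses the built-in `mtcars` data rather than `chicago` so that it runs without the download; the variable names `nested` and `piped` are illustrative.

```r
library(dplyr)

# Nested form: has to be read from the inside out.
nested <- summarise(group_by(mtcars, cyl), mpg = mean(mpg))

# Piped form: reads top to bottom; x %>% f(y) is evaluated as f(x, y).
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

# Both forms produce the same group means.
all.equal(nested$mpg, piped$mpg)
```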


## Usage of arrange

In [4]:
arrange(X, var1)

var1,var2,var3
1,,11
2,7.0,13
3,,14
4,9.0,15
5,10.0,12


In [5]:
arrange(X, desc(var1))

var1,var2,var3
5,10.0,12
4,9.0,15
3,,14
2,7.0,13
1,,11


In [21]:
chicago <- arrange(chicago, date)
head(chicago)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,31.5,31.5,1987-01-01,,34.0,4.25,19.9881
chic,33.0,29.875,1987-01-02,,,3.304348,23.19099
chic,33.0,27.375,1987-01-03,,34.16667,3.333333,23.81548
chic,29.0,28.625,1987-01-04,,47.0,4.375,30.43452
chic,32.0,28.875,1987-01-05,,,4.75,30.33333
chic,40.0,35.125,1987-01-06,,48.0,5.833333,25.77233


In [22]:
tail(chicago)

Unnamed: 0,city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
6935,chic,35,29.6,2005-12-26,8.4,8.5,14.041667,16.81944
6936,chic,40,33.6,2005-12-27,23.56,27.0,4.46875,23.5
6937,chic,37,34.5,2005-12-28,17.75,27.5,3.260417,19.28563
6938,chic,35,29.4,2005-12-29,7.45,23.5,6.794837,19.97222
6939,chic,36,31.0,2005-12-30,15.05714,19.2,3.03442,22.80556
6940,chic,35,30.1,2005-12-31,15.0,23.5,2.53125,13.25


In [24]:
chicago <- arrange(chicago, desc(date))
head(chicago)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,35,30.1,2005-12-31,15.0,23.5,2.53125,13.25
chic,36,31.0,2005-12-30,15.05714,19.2,3.03442,22.80556
chic,35,29.4,2005-12-29,7.45,23.5,6.794837,19.97222
chic,37,34.5,2005-12-28,17.75,27.5,3.260417,19.28563
chic,40,33.6,2005-12-27,23.56,27.0,4.46875,23.5
chic,35,29.6,2005-12-26,8.4,8.5,14.041667,16.81944
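
Two details of `arrange` are easy to miss in the outputs above: several sort keys can be given at once, and missing values are placed at the end whether the order is ascending or descending. A minimal sketch on a hypothetical data frame `df`:

```r
library(dplyr)

# Hypothetical toy data with ties in `grp` and a missing value in `val`.
df <- data.frame(grp = c("b", "a", "b", "a"), val = c(2, NA, 5, 1))

# Sort by grp ascending, then by val descending within each grp.
by_two_keys <- arrange(df, grp, desc(val))

# Missing values end up last in both ascending and descending order.
asc <- arrange(df, val)
dsc <- arrange(df, desc(val))
```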


## Usage of rename

In [28]:
chicago <- rename(chicago, pm = pm25tmean2, dewpoint = dptp)

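ERROR: Error: `pm25tmean2`, `dptp` contains unknown variables

This error most likely comes from re-running the cell after the rename had already been applied: by then the columns are called `pm` and `dewpoint`, as the next output shows.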


In [29]:
head(chicago)

city,tmpd,dewpoint,date,pm,pm10tmean2,o3tmean2,no2tmean2
chic,35,30.1,2005-12-31,15.0,23.5,2.53125,13.25
chic,36,31.0,2005-12-30,15.05714,19.2,3.03442,22.80556
chic,35,29.4,2005-12-29,7.45,23.5,6.794837,19.97222
chic,37,34.5,2005-12-28,17.75,27.5,3.260417,19.28563
chic,40,33.6,2005-12-27,23.56,27.0,4.46875,23.5
chic,35,29.6,2005-12-26,8.4,8.5,14.041667,16.81944


In [30]:
chicago <- rename(chicago, pm25tmean2 = pm, dptp = dewpoint)
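
A minimal sketch of the `new_name = old_name` syntax, and of what happens when the same rename is run twice, using a hypothetical two-row data frame. The `tryCatch` wrapper is only there so the failure can be inspected instead of stopping the script.

```r
library(dplyr)

# Hypothetical stand-in with two of the chicago column names.
df <- data.frame(pm25tmean2 = c(15, 7.45), dptp = c(30.1, 29.4))

# First run succeeds: the syntax is new_name = old_name.
df <- rename(df, pm = pm25tmean2, dewpoint = dptp)

# Second run fails because pm25tmean2 and dptp no longer exist.
second_run <- tryCatch(
  rename(df, pm = pm25tmean2, dewpoint = dptp),
  error = function(e) e
)
inherits(second_run, "error")
```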

## Usage of mutate

In [34]:
chicago <- mutate(chicago, pm25detrend = pm25tmean2 - mean(pm25tmean2, na.rm = TRUE))
head(chicago)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2,pm25detrend
chic,35,30.1,2005-12-31,15.0,23.5,2.53125,13.25,-1.230958
chic,36,31.0,2005-12-30,15.05714,19.2,3.03442,22.80556,-1.173815
chic,35,29.4,2005-12-29,7.45,23.5,6.794837,19.97222,-8.780958
chic,37,34.5,2005-12-28,17.75,27.5,3.260417,19.28563,1.519042
chic,40,33.6,2005-12-27,23.56,27.0,4.46875,23.5,7.329042
chic,35,29.6,2005-12-26,8.4,8.5,14.041667,16.81944,-7.830958
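
Within a single `mutate` call, a column created earlier in the argument list can be used by a later argument. The sketch below shows this on a hypothetical one-column data frame; the column names `centered` and `above_avg` are illustrative.

```r
library(dplyr)

# Hypothetical toy data with one missing value.
df <- data.frame(pm = c(10, 20, NA, 30))

# `centered` is created first and then reused to build `above_avg`.
df <- mutate(df,
             centered  = pm - mean(pm, na.rm = TRUE),
             above_avg = centered > 0)
```

Missing values propagate: the row where `pm` is `NA` has `NA` in both new columns.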


## Usage of select

In [11]:
names(chicago)

Select a contiguous range of columns by name:

In [13]:
head(select(chicago, city:dptp))

city,tmpd,dptp
chic,31.5,31.5
chic,33.0,29.875
chic,33.0,27.375
chic,29.0,28.625
chic,32.0,28.875
chic,40.0,35.125


Select all columns except the named one:

In [14]:
head(select(chicago, -date))

city,tmpd,dptp,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,31.5,31.5,,34.0,4.25,19.9881
chic,33.0,29.875,,,3.304348,23.19099
chic,33.0,27.375,,34.16667,3.333333,23.81548
chic,29.0,28.625,,47.0,4.375,30.43452
chic,32.0,28.875,,,4.75,30.33333
chic,40.0,35.125,,48.0,5.833333,25.77233
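
Besides ranges and exclusions, `select` accepts helper functions that match column names by pattern. A short sketch, assuming a one-row stand-in data frame that reuses some of the chicago column names:

```r
library(dplyr)

# One-row stand-in with a subset of the chicago column names.
df <- data.frame(city = "chic", tmpd = 31.5, dptp = 31.5,
                 pm25tmean2 = NA_real_, pm10tmean2 = 34, o3tmean2 = 4.25)

pollutants <- select(df, ends_with("tmean2"))   # names ending in "tmean2"
pm_only    <- select(df, starts_with("pm"))     # names starting with "pm"
no_range   <- select(df, -(city:dptp))          # drop a contiguous range of columns
```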


## Usage of filter

In [16]:
chic.f <- filter(chicago, pm25tmean2 > 30)
head(chic.f)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,23,21.9,1998-01-17,38.1,32.46154,3.180556,25.3
chic,28,25.8,1998-01-23,33.95,38.69231,1.75,29.3763
chic,55,51.3,1998-04-30,39.4,34.0,10.786232,25.3131
chic,59,53.7,1998-05-01,35.4,28.5,14.295125,31.42905
chic,57,52.0,1998-05-02,33.3,35.0,20.662879,26.79861
chic,57,56.0,1998-05-07,32.1,34.5,24.270422,33.99167


In [19]:
# Combine conditions on more than one column with &
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(chic.f)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,81,71.2,1998-08-23,39.6,59.0,45.86364,14.32639
chic,81,70.4,1998-09-06,31.5,50.5,50.6625,20.3125
chic,82,72.2,2001-07-20,32.3,58.5,33.0038,33.675
chic,84,72.9,2001-08-01,43.7,81.5,45.17736,27.44239
chic,85,72.6,2001-08-08,38.8375,70.0,37.98047,27.62743
chic,84,72.6,2001-08-09,38.2,66.0,36.73245,26.46742
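
Two behaviours of `filter` are worth a small sketch: comma-separated conditions are combined with AND, and rows where the condition evaluates to `NA` are dropped, which differs from base subsetting with `[`. The data frame `df` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one missing pm value.
df <- data.frame(pm = c(35, NA, 40, 20), temp = c(85, 90, 70, 95))

# Comma-separated conditions are combined with AND, the same as using &.
with_comma <- filter(df, pm > 30, temp > 80)
with_amp   <- filter(df, pm > 30 & temp > 80)

# filter() drops rows where the condition is NA ...
dplyr_rows <- filter(df, pm > 30)

# ... whereas base subsetting keeps them as all-NA rows.
base_rows <- df[df$pm > 30, ]
```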


## Usage of group_by

In [36]:
chicago <- mutate(chicago, tempcat = factor(1 * (tmpd > 80), labels = c("cold", "hot")))
head(chicago)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2,pm25detrend,tempcat
chic,35,30.1,2005-12-31,15.0,23.5,2.53125,13.25,-1.230958,cold
chic,36,31.0,2005-12-30,15.05714,19.2,3.03442,22.80556,-1.173815,cold
chic,35,29.4,2005-12-29,7.45,23.5,6.794837,19.97222,-8.780958,cold
chic,37,34.5,2005-12-28,17.75,27.5,3.260417,19.28563,1.519042,cold
chic,40,33.6,2005-12-27,23.56,27.0,4.46875,23.5,7.329042,cold
chic,35,29.6,2005-12-26,8.4,8.5,14.041667,16.81944,-7.830958,cold


In [47]:
hotcold <- group_by(chicago, tempcat)
head(hotcold)
tail(hotcold)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2,pm25detrend,tempcat
chic,35,30.1,2005-12-31,15.0,23.5,2.53125,13.25,-1.230958,cold
chic,36,31.0,2005-12-30,15.05714,19.2,3.03442,22.80556,-1.173815,cold
chic,35,29.4,2005-12-29,7.45,23.5,6.794837,19.97222,-8.780958,cold
chic,37,34.5,2005-12-28,17.75,27.5,3.260417,19.28563,1.519042,cold
chic,40,33.6,2005-12-27,23.56,27.0,4.46875,23.5,7.329042,cold
chic,35,29.6,2005-12-26,8.4,8.5,14.041667,16.81944,-7.830958,cold


city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2,pm25detrend,tempcat
chic,40.0,35.125,1987-01-06,,48.0,5.833333,25.77233,,cold
chic,32.0,28.875,1987-01-05,,,4.75,30.33333,,cold
chic,29.0,28.625,1987-01-04,,47.0,4.375,30.43452,,cold
chic,33.0,27.375,1987-01-03,,34.16667,3.333333,23.81548,,cold
chic,33.0,29.875,1987-01-02,,,3.304348,23.19099,,cold
chic,31.5,31.5,1987-01-01,,34.0,4.25,19.9881,,cold


In [44]:
summarise(hotcold, pm25tmean2 = mean(pm25tmean2, na.rm = TRUE), o3 = max(o3tmean2), no2 = median(no2tmean2))

tempcat,pm25tmean2,o3,no2
cold,15.97807,66.5875,24.54924
hot,26.48118,62.969656,24.9387
,47.7375,9.416667,37.44444
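
The third row in the output above has an empty `tempcat`: rows where `tmpd` is missing get `NA` as their category, and `group_by` keeps them as a separate group. The sketch below reproduces that on a hypothetical four-row data frame and also contrasts `max()` with and without `na.rm = TRUE`; the column names `o3_max` and `o3_max_rm` are illustrative.

```r
library(dplyr)

# Hypothetical toy data: one missing temperature and one missing ozone value.
df <- data.frame(tmpd = c(70, 85, NA, 90), o3 = c(10, 20, 30, NA))

# Same categorisation as in the notebook; a missing tmpd yields an NA category.
df <- mutate(df, tempcat = factor(1 * (tmpd > 80), labels = c("cold", "hot")))

out <- df %>%
  group_by(tempcat) %>%
  summarise(n         = n(),                      # number of rows per group
            o3_max    = max(o3),                  # NA if any value in the group is NA
            o3_max_rm = max(o3, na.rm = TRUE))    # ignores missing values
out
```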


In [53]:
# organise by year
chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarise(years, pm25 = mean(pm25tmean2, na.rm = TRUE), o3 = max(o3tmean2))

year,pm25,o3
1987,,62.96966
1988,,61.67708
1989,,59.72727
1990,,52.22917
1991,,63.10417
1992,,50.8287
1993,,44.30093
1994,,52.17844
1995,,66.5875
1996,,58.39583
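
The `+ 1900` and `+ 1` offsets used with `as.POSIXlt` come from how that class stores its fields: `$year` counts years since 1900 and `$mon` runs from 0 to 11. A base R sketch with two dates taken from the data, plus an alternative based on `format()` that avoids the offsets:

```r
d  <- as.Date(c("1987-01-01", "2005-12-31"))
lt <- as.POSIXlt(d)

month_lt <- lt$mon + 1       # months are stored as 0-11, so add 1
year_lt  <- lt$year + 1900   # years are stored as years since 1900

# Alternative: format the date as text and convert to integer.
month_fmt <- as.integer(format(d, "%m"))
year_fmt  <- as.integer(format(d, "%Y"))
```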


## Split-apply-combine

In [9]:
?ddply

ddply {plyr}, R Documentation

Arguments:
* `.data` - data frame to be processed
* `.variables` - variables to split data frame by, as as.quoted variables, a formula or character vector
* `.fun` - function to apply to each piece
* `...` - other arguments passed on to .fun
* `.progress` - name of the progress bar to use, see create_progress_bar
* `.inform` - produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging
* `.drop` - should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default)
* `.parallel` - if TRUE, apply function in parallel, using parallel backend provided by foreach
* `.paropts` - a list of additional options passed into the foreach function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages: use the .export and .packages arguments to supply them so that all cluster nodes have the correct environment set up for computing.


In [11]:
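# ddply comes from plyr. Loading plyr after dplyr masks several dplyr verbs
# (for example rename and summarize), so load it only when it is needed.
library(plyr)
# Split InsectSprays by spray, apply summarize to each piece, and combine the per-spray sums.
ddply(InsectSprays, .(spray), summarize, sum = sum(count))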

spray,sum
A,174
B,184
C,25
D,59
E,42
F,200
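
The same split-apply-combine steps can be sketched in base R, which may help to see what `ddply` does internally. This is an illustration using base functions, not how `plyr` is implemented; the variable names are illustrative.

```r
# Split the counts by spray, apply sum to each piece, combine into a named vector.
pieces     <- split(InsectSprays$count, InsectSprays$spray)
sums_split <- sapply(pieces, sum)

# tapply performs the same three steps in one call.
sums_tapply <- tapply(InsectSprays$count, InsectSprays$spray, sum)

# aggregate returns a data frame, closer in shape to the ddply result.
sums_df <- aggregate(count ~ spray, data = InsectSprays, FUN = sum)
sums_df
```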


In [12]:
?ave

ave {stats}, R Documentation

Arguments:
* `x` - A numeric.
* `...` - Grouping variables, typically factors, all of the same length as x.
* `FUN` - Function to apply for each factor level combination.


In [14]:
# ave() returns one value per row, so each spray's sum is repeated for every observation.
spraySums <- ddply(InsectSprays, .(spray), summarize, sum = ave(count, FUN = sum))
spraySums

spray,sum
A,174
A,174
A,174
A,174
A,174
A,174
A,174
A,174
A,174
A,174
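
The row-preserving behaviour of `ave` can also be seen without `ddply`, by passing the grouping variable directly. A base R sketch; the names `group_sum`, `sprays`, `spray_sum` and `share` are illustrative and not from the original notebook.

```r
# ave() returns a vector as long as its input, with each element replaced
# by the result of FUN for that element's group.
group_sum <- ave(InsectSprays$count, InsectSprays$spray, FUN = sum)

length(group_sum) == nrow(InsectSprays)

# Attach the group totals as a column, e.g. to compute each row's share of its group.
sprays <- transform(InsectSprays,
                    spray_sum = group_sum,
                    share     = count / group_sum)
head(sprays, 3)
```

This is useful when the group statistic is needed next to each observation rather than as a one-row-per-group summary.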
