![](logo.png)

# <font color='red'>Introduction to Tidyverse</font>

This is a brief introduction to the Tidyverse, focusing mainly on the following packages:

> ### dplyr
> ### tidyr

Other packages within the Tidyverse world to explore include:

> ### ggplot2 (Covered in data visualizations)
> ### readr
> ### purrr
> ### tibble
> ### stringr
> ### forcats

To learn more about these packages, please visit the [Tidyverse](https://www.tidyverse.org/packages/) packages page.

# <font color='red'>dplyr package</font>

We'll be covering the following functions:

filter() (and slice())
<br>arrange()
<br>select() (and rename())
<br>distinct()
<br>mutate() (and transmute())
<br>summarise()
<br>sample_n() and sample_frac()

## Installing
You can install dplyr using

In [None]:
install.packages('dplyr')

In [None]:
# Load it into the session with
library(dplyr)

## Example Data
Let's use some flight data for our examples. We'll download the nycflights13 data package:

In [None]:
install.packages('nycflights13',repos = 'http://cran.us.r-project.org')

In [None]:
library(nycflights13)
summary(flights)

In [None]:
# Notice how large the data frame is:
dim(flights)

## filter()
filter() allows you to select a subset of rows in a data frame. The first argument is the data frame. The second and subsequent arguments are logical expressions evaluated against its columns; a row is kept only if every expression is TRUE.

For example, we can select all flights on November 3rd that were from American Airlines (AA) with:

In [None]:
head(filter(flights,month==11,day==3,carrier=='AA'))

This is a lot simpler than the equivalent base R approach, which uses bracket indexing and repeats the data frame name in every condition:

In [None]:
head(flights[flights$month == 11 & flights$day == 3 & flights$carrier == 'AA', ])
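The row-selection rules described above can be sketched on a small, made-up data frame (`toy` and its values are illustrative assumptions, not part of the flights data). The sketch shows that comma-separated conditions act as a logical AND, that `|` is needed for OR, and one way filter() and bracket indexing differ when a condition evaluates to NA:

```r
library(dplyr)

# Hypothetical toy data; one month value is missing on purpose
toy <- data.frame(
  month   = c(11, 11, 12, NA),
  carrier = c("AA", "UA", "AA", "AA")
)

# Comma-separated conditions are combined with a logical AND
and_rows <- filter(toy, month == 11, carrier == "AA")

# Use | when either condition may hold
or_rows <- filter(toy, month == 12 | carrier == "UA")

# filter() drops rows where the condition is NA,
# while bracket indexing returns an all-NA row for them
n_filter  <- nrow(filter(toy, month == 11))   # 2
n_bracket <- nrow(toy[toy$month == 11, ])     # 3
```

If the columns you filter on can contain missing values, this difference is worth keeping in mind when translating base R code to dplyr.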

## slice()
We can select rows by position using slice():

In [None]:
slice(flights, 1:10)

## arrange()
arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

In [None]:
head(arrange(flights,year,month,day,air_time))

You can add desc() to arrange in descending order:

In [None]:
head(arrange(flights,desc(dep_delay)))
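As a small illustration of tie-breaking, the sketch below sorts a made-up data frame (`toy` is an assumption for this example) by month in ascending order and, within each month, by delay in descending order:

```r
library(dplyr)

# Hypothetical toy data with ties in the month column
toy <- data.frame(
  month     = c(2, 1, 1, 2),
  dep_delay = c(5, 30, 10, 50)
)

# month ascending; ties in month are broken by dep_delay descending
sorted <- arrange(toy, month, desc(dep_delay))
sorted$dep_delay   # 30 10 50 5
```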

## select()
Often you work with large datasets that have many columns, only a few of which are of interest to you. select() lets you zoom in on a useful subset by column name, including with operations, such as ranges and negation, that usually only work on numeric column positions:

In [None]:
head(select(flights,carrier))
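The name-based operations that select() supports might look like the following sketch. The one-row data frame `toy` is a made-up stand-in that borrows a few column names from flights; `ends_with()` is one of the selection helpers that dplyr provides:

```r
library(dplyr)

# Hypothetical one-row data frame reusing a few flights column names
toy <- data.frame(year = 2013, month = 1, day = 1,
                  dep_delay = 2, arr_delay = 11, carrier = "UA")

range_cols  <- names(select(toy, year:day))            # a range of adjacent columns
other_cols  <- names(select(toy, -(year:day)))         # everything except that range
helper_cols <- names(select(toy, ends_with("delay")))  # columns matched by a helper
```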

## rename()
You can use rename() to rename columns, using the syntax new_name = old_name. Note that this is not done "in place": rename() returns a new data frame, so you need to assign the result if you want to keep it.

In [None]:
head(rename(flights,airline_car = carrier))
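To make the "not in place" point concrete, here is a minimal sketch on a made-up data frame (`toy` and `renamed` are illustrative names):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(carrier = c("AA", "UA"), dep_delay = c(5, -2))

# rename() returns a modified copy; the syntax is new_name = old_name
renamed <- rename(toy, airline_car = carrier)

names(toy)      # "carrier" "dep_delay"  (original is unchanged)
names(renamed)  # "airline_car" "dep_delay"
```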

## distinct()
A common use of select() is to find the values taken by a set of variables. This is particularly useful in conjunction with the distinct() verb, which returns only the unique rows of a table:

In [None]:
distinct(select(flights,carrier))
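distinct() also accepts column names directly, which can save the separate select() call. A brief sketch on made-up data (`toy` is an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data with repeated carrier/origin combinations
toy <- data.frame(
  carrier = c("AA", "AA", "UA", "UA"),
  origin  = c("JFK", "JFK", "EWR", "JFK"),
  delay   = c(5, 7, 1, 3)
)

carriers <- distinct(toy, carrier)           # unique values of one column
combos   <- distinct(toy, carrier, origin)   # unique combinations of two columns

# .keep_all = TRUE keeps the first full row for each distinct carrier
first_rows <- distinct(toy, carrier, .keep_all = TRUE)
```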

## mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

In [None]:
head(mutate(flights, new_col = arr_delay-dep_delay))
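One convenient property of mutate() is that a column created earlier in the call can be used later in the same call. The sketch below assumes a made-up data frame `toy` with a few flights-style columns; `gain` and `gain_per_hour` are illustrative names:

```r
library(dplyr)

# Hypothetical toy data; air_time is in minutes
toy <- data.frame(
  arr_delay = c(11, 20, 33),
  dep_delay = c(2, 4, 2),
  air_time  = c(120, 60, 30)
)

out <- mutate(toy,
  gain          = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)  # uses gain, defined one line above
)
out$gain  # 9 16 31
```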

## transmute()
Use transmute() if you only want to keep the new columns:

In [None]:
head(transmute(flights, new_col = arr_delay-dep_delay))

## summarise()
You can use summarise() to collapse a data frame into a single row using functions that aggregate results, such as mean(). Functions like mean() return NA when any input value is NA, so remember to pass na.rm=TRUE to remove NA values first.

In [None]:
summarise(flights,avg_air_time=mean(air_time,na.rm=TRUE))
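The effect of na.rm can be sketched on a small made-up data frame (`toy` is an assumption). The last step also pairs summarise() with group_by(), a dplyr verb not covered in this section, to get one summary row per group rather than one row overall:

```r
library(dplyr)

# Hypothetical toy data with one missing air_time
toy <- data.frame(
  carrier  = c("AA", "AA", "UA", "UA"),
  air_time = c(100, NA, 200, 300)
)

with_na    <- summarise(toy, avg = mean(air_time))                # avg is NA
without_na <- summarise(toy, avg = mean(air_time, na.rm = TRUE))  # avg is 200

# One row per carrier; n() counts the rows in each group
by_carrier <- toy %>%
  group_by(carrier) %>%
  summarise(avg_air_time = mean(air_time, na.rm = TRUE), flights = n())
```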

## sample_n() and sample_frac()
You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.

In [None]:
sample_n(flights,10)

In [None]:
# .005% of the data
sample_frac(flights,0.00005) # USE replace=TRUE for bootstrap sampling
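Because the rows are drawn at random, results change from run to run unless you fix the random seed. The sketch below uses a made-up data frame (`toy`) and an arbitrary seed value. It also shows slice_sample(), which recent dplyr versions (1.0.0 and later) offer as the successor to both functions:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(id = 1:100)

# Setting the same seed before each call reproduces the same sample
set.seed(42)
a <- sample_n(toy, 5)
set.seed(42)
b <- sample_n(toy, 5)
identical(a, b)  # TRUE

tenth <- sample_frac(toy, 0.1)              # 10% of 100 rows
boot  <- sample_n(toy, 100, replace = TRUE) # bootstrap-style resample

# slice_sample() covers both cases via n = or prop =
s_n    <- slice_sample(toy, n = 5)
s_prop <- slice_sample(toy, prop = 0.1)
```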

# <font color='red'>tidyr package</font>

Now that we've learned about dplyr, we can begin to learn about tidyr, a complementary package that will help us create tidy data sets. So what do we mean when we say "tidy data"?

Tidy data is a data set in which every row is an observation and every column is a variable, so that every cell holds the value of a specific variable for a specific observation. Having your data in this format helps you understand it and lets you analyze or visualize it quickly and efficiently.

After viewing this lecture, you can reference this handy cheatsheet on [data wrangling](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

## Installing tidyr

In [None]:
install.packages('tidyr',repos = 'http://cran.us.r-project.org')

In [None]:
library(tidyr)
library(data.table)

## Data.frames versus data.tables
All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.

data.frame is part of base R.

data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.

However, data.table syntax differs from the standard R syntax for data.frames, while being hard for the untrained eye to tell apart at a glance. If you read a code snippet with no other context to indicate that it works with data.tables and try to apply it to a data.frame, it may fail or produce unexpected results.

So what are some of the practical differences? Here are a few:

* Much faster and very intuitive grouped (by) operations
* You won't accidentally print out a huge table and have to press Ctrl-C; data.table truncates printed output to prevent this sort of accident
* Faster and better file reading with fread()
* The package also provides a number of other utility functions, like %between% or rbindlist(), that make life easier
* Generally faster for many basic operations, since many data.frame operations copy the entire object needlessly
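A few of the points above can be illustrated with a short sketch. The table `dt` and its values are made up for this example; the calls themselves (the `dt[i, j, by]` form, `%between%`, `rbindlist()`) come from the data.table package:

```r
library(data.table)

# Hypothetical toy table
dt <- data.table(
  carrier = c("AA", "AA", "UA", "UA"),
  delay   = c(5, 15, 25, 35)
)

# A data.table is also a data.frame
is.data.frame(dt)  # TRUE

# General form is dt[i, j, by]: here, mean delay per carrier
avg <- dt[, .(avg_delay = mean(delay)), by = carrier]

# %between% keeps rows whose value lies in the closed range
mid_rows <- dt[delay %between% c(10, 30)]

# rbindlist() stacks a list of tables into one
stacked <- rbindlist(list(dt, dt))
```

Note how `dt[, ..., by = carrier]` would not work on a plain data.frame, which is the kind of syntax difference the paragraph above warns about.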

## Using tidyr

We'll cover some of the most useful functions in tidyr, including the following:

gather()
<br>spread()
<br>separate()
<br>unite()

These functions perform the following actions:

![](tidyr.png)

## Example Data Set
Let's create some fake data that needs to be cleaned using tidyr:

In [None]:
comp <- c(1,1,1,2,2,2,3,3,3)
yr <- c(1998,1999,2000,1998,1999,2000,1998,1999,2000)
q1 <- runif(9, min=0, max=100)
q2 <- runif(9, min=0, max=100)
q3 <- runif(9, min=0, max=100)
q4 <- runif(9, min=0, max=100)

df <- data.frame(comp=comp,year=yr,Qtr1 = q1,Qtr2 = q2,Qtr3 = q3,Qtr4 = q4)

In [None]:
df

## gather() and spread()
People sometimes think of these operations as analogous to pivot tables in Excel. Let's see some examples of how to use them:

## gather()
The gather() function collapses multiple columns into key-value pairs. The data frame above is considered wide because each quarter is stored as its own column. To restructure the time component as a single variable, we gather the quarter names into one column (Quarter) and the values associated with each quarter into a second column (Revenue).

In [None]:
# Using Pipe Operator
head(df %>% gather(Quarter,Revenue,Qtr1:Qtr4))

In [None]:
# With just the function
head(gather(df,Quarter,Revenue,Qtr1:Qtr4))
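Since tidyr 1.0.0, gather() has been superseded by pivot_longer(), although gather() still works. As a rough sketch of the equivalent call, on a smaller made-up data frame (`wide` is an assumption, with fixed values instead of random ones so the result is predictable):

```r
library(tidyr)

# Hypothetical wide data: one column per quarter
wide <- data.frame(comp = c(1, 2), year = c(1998, 1998),
                   Qtr1 = c(10, 30), Qtr2 = c(20, 40))

# gather(): key column name, value column name, then the columns to collapse
long_gather <- gather(wide, Quarter, Revenue, Qtr1:Qtr2)

# pivot_longer(): the same reshaping with named arguments
long_pivot <- pivot_longer(wide, cols = Qtr1:Qtr2,
                           names_to = "Quarter", values_to = "Revenue")
```

Both results have one row per company, year, and quarter. The row order differs between the two functions, and pivot_longer() returns a tibble rather than a plain data frame.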

## spread()
This is the complement of gather(), which is why it's called spread():

In [None]:
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocks

In [None]:
stocksm <- stocks %>% gather(stock, price, -time)

In [None]:
stocksm %>% spread(stock, price)

In [None]:
stocksm %>% spread(time, price)
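In the same way, spread() has been superseded by pivot_wider() since tidyr 1.0.0. A minimal sketch of the correspondence, using a made-up long table (`long`, with fixed prices rather than random draws):

```r
library(tidyr)

# Hypothetical long data: one row per date and stock
long <- data.frame(
  time  = rep(as.Date("2009-01-01") + 0:1, times = 2),
  stock = rep(c("X", "Y"), each = 2),
  price = c(1.0, 1.5, 2.0, 2.5)
)

# spread(): key column, then value column
wide_spread <- spread(long, stock, price)

# pivot_wider(): the same reshaping with named arguments
wide_pivot <- pivot_wider(long, names_from = stock, values_from = price)
```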

## separate() and unite()
## separate()
Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns. By default it splits on any run of non-alphanumeric characters.

In [None]:
df <- data.frame(x = c(NA, "a.x", "b.y", "c.z"))
df

In [None]:
df %>% separate(x, c("ABC", "XYZ"))
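The two ways of specifying the split point can be sketched as follows. The data frame `toy` and the new column names are made up for illustration; the `sep` argument accepts either a regular expression or a numeric position:

```r
library(tidyr)

# Hypothetical toy data
toy <- data.frame(x = c("a.x", "b.y", "c.z"))

# Explicit regular expression: split on a literal dot
by_regex <- separate(toy, x, c("ABC", "XYZ"), sep = "\\.")

# Numeric position: split after the first character
by_pos <- separate(toy, x, c("first", "rest"), sep = 1)
by_pos$rest  # ".x" ".y" ".z"
```

Passing an explicit `sep` is a reasonable habit when the values may contain other punctuation that the default pattern would also split on.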

## unite()
unite() is a convenience function that pastes multiple columns together into one.

In [None]:
head(mtcars)

In [None]:
# unite_() is deprecated; unite() takes the column names unquoted
unite(mtcars, "vs.am", vs, am, sep = '.')

In [None]:
# separate() is the complement of unite()
mtcars %>%
  unite(vs_am, vs, am) %>%
  separate(vs_am, c("vs", "am"))