[![AnalyticsDojo](https://s3.amazonaws.com/analyticsdojo/logo/final-logo.png)](http://www.analyticsdojo.com)
<center><h1>Introduction to R - Tidyverse </h1></center>
<center><h3><a href = 'http://www.analyticsdojo.com'>www.analyticsdojo.com</a></h3></center>

### Links: [local](http://localhost:8888/notebooks/classes/05-intro-r2/intro-r-tidyverse.ipynb) [github](https://github.com/AnalyticsDojo/materials/blob/master/analyticsdojo/classes/05-intro-r2/intro-r-tidyverse.ipynb) [slides](http://nbviewer.jupyter.org/format/slides/github/AnalyticsDojo/materials/blob/master/analyticsdojo/classes/05-intro-r2/intro-r-tidyverse.ipynb#/)

In [8]:
library(dplyr)
library(ggplot2)
library(tidyr)

In [11]:
install.packages("nycflights13") #the package name must be a quoted string
library(nycflights13)


## Overview
- What is the Tidyverse?
- Subpackages
- Piping
- `dplyr`

# What is the Tidyverse?

## Tidyverse
- "The tidyverse is a set of packages that work in harmony because they share common data representations and API design." -Hadley Wickham
- The packages include `dplyr`, `tibble`, `tidyr`, `readr`, `purrr`, and more.


![](http://r4ds.had.co.nz/diagrams/data-science-explore.png)
- From [R for Data Science](http://r4ds.had.co.nz/explore-intro.html) by [Hadley Wickham](https://github.com/hadley)

## Piping
- `%>%` (the pipe operator) helps you write cleaner code.
- It is loaded by default with the `tidyverse`, but it comes from the `magrittr` package.
- The output of one command is piped into the next command as its first argument, so there is no need to save it in an intermediate throwaway variable.
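As a minimal sketch of the point above, the three forms below should compute the same summary. It uses the built-in `mtcars` data set purely for illustration, and the variable names (`nested`, `stepwise`, `piped`) are hypothetical.

```r
library(dplyr)

# 1. Nested calls: correct, but must be read from the inside out.
nested <- summarise(group_by(filter(mtcars, cyl == 8), gear), mean_mpg = mean(mpg))

# 2. Intermediate throwaway variable: readable, but clutters the workspace.
tmp <- filter(mtcars, cyl == 8)
tmp <- group_by(tmp, gear)
stepwise <- summarise(tmp, mean_mpg = mean(mpg))

# 3. Pipe: each result becomes the first argument of the next call.
piped <- mtcars %>%
  filter(cyl == 8) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg))

# Check that all three approaches agree.
all.equal(nested, piped)
all.equal(stepwise, piped)
```

The piped version reads top to bottom in the order the operations happen, which is the main readability gain.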

In [None]:
mpg<-mpg

#This gives a data frame with 70 observations: only the 8-cylinder cars
mpg.8cyl<-mpg %>% 
  filter(cyl == 8)

#This takes the mean city MPG by manufacturer 
mpg.8cyl %>% #This starts with our saved data frame.
  group_by(manufacturer) %>% 
  summarise(citympg = mean(cty))

```{r}
# A tibble: 11 x 2
   manufacturer  citympg
          <chr>    <dbl>
1          audi 16.00000
2     chevrolet 13.64286
3         dodge 11.57143
4          ford 13.13333
5          jeep 12.20000
6    land rover 11.50000
7       lincoln 11.33333
8       mercury 13.00000
9        nissan 12.00000
10      pontiac 16.00000
11       toyota 12.66667
```

## `dplyr`
- ["A fast, consistent tool for working with data frame like objects, both in memory and out of memory."](https://cran.r-project.org/web/packages/dplyr/index.html)
- Subset observations by their values using `filter()`.
- Reorder rows using `arrange()`.
- Select columns using `select()`.
- Create or recode variables using `mutate()`.
- Summarize variables using `summarise()`.
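The five verbs listed above can be combined in one workflow. The following is a rough sketch on the built-in `mtcars` data set; the names `small_cars`, `by_cylinder`, and `wt_kg` are made up for this illustration.

```r
library(dplyr)

small_cars <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%                  # filter(): keep 4- and 6-cylinder cars
  select(mpg, cyl, wt) %>%                      # select(): keep three columns
  mutate(wt_kg = wt * 1000 * 0.45359237) %>%    # mutate(): wt is in 1000 lbs, convert to kg
  arrange(cyl, desc(mpg))                       # arrange(): sort by cyl, then mpg descending

by_cylinder <- small_cars %>%
  group_by(cyl) %>%
  summarise(n = n(),                            # summarise(): one row per group
            mean_mpg = mean(mpg),
            mean_wt_kg = mean(wt_kg))

head(small_cars)
by_cylinder
```

Each verb takes a data frame as its first argument and returns a data frame, which is why they chain together so easily.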

In [None]:
#Filter to only those cars that have 8 cylinders
mpg.8cyl<-mpg %>% 
  filter(cyl == 8)

#Alt Syntax
mpg.8cyl<-filter(mpg, cyl == 8)

#Flights on January 1 (month == 1 and day == 1)
flight11<-filter(flights, month == 1, day == 1) 

In [None]:
#Sort cars by MPG highway(hwy) then city(cty)
mpgsort<-arrange(mpg, hwy, cty)

In [None]:
#From the documentation https://cran.r-project.org/web/packages/dplyr/dplyr.pdf  
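select(iris, starts_with("Petal")) #returns columns that start with "Petal"
select(iris, ends_with("Width")) #returns columns that end with "Width"
select(iris, contains("etal")) #returns columns whose names contain "etal"
select(iris, matches(".t.")) #returns columns whose names match the regular expression ".t."
select(iris, Petal.Length, Petal.Width) #returns the two named columns
vars <- c("Petal.Length", "Petal.Width")
select(iris, one_of(vars)) #returns the columns named in a character vector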
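In newer releases of `dplyr`, `one_of()` is superseded by `any_of()` and `all_of()`. The sketch below assumes `dplyr` 1.0.0 or later. `Not.A.Column` is a deliberately nonexistent name, added here only to show how the two helpers differ.

```r
library(dplyr)

vars <- c("Petal.Length", "Petal.Width", "Not.A.Column")

# any_of() silently ignores names that are not in the data.
loose <- select(datasets::iris, any_of(vars))
names(loose)

# all_of() is strict: a missing name raises an error, caught here for illustration.
strict_failed <- tryCatch({
  select(datasets::iris, all_of(vars))
  FALSE
}, error = function(e) TRUE)
strict_failed
```

Which helper to use depends on whether a missing column should be tolerated or treated as a mistake.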

In [None]:
#Recoding Data
# See Creating new variables with mutate and ifelse: 
# https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html 
#mtcars stores displacement (disp) in cubic inches; 1 litre = 61.0237 cubic inches
mutate(mtcars, displ_l = disp / 61.0237)


In [None]:
# Example taken from David Ranzolin
# https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/9 
section <- c("MATH111", "MATH111", "ENG111")
grade <- c(78, 93, 56)
student <- c("David", "Kristina", "Mycroft")
gradebook <- data.frame(section, grade, student)

#mutate() returns a new data frame, so here we save each intermediate version.
gradebook2<-mutate(gradebook, Pass.Fail = ifelse(grade > 60, "Pass", "Fail"))  
gradebook3<-mutate(gradebook2, letter = ifelse(grade %in% 60:69, "D",
                                               ifelse(grade %in% 70:79, "C",
                                                      ifelse(grade %in% 80:89, "B",
                                                             ifelse(grade %in% 90:99, "A", "F")))))



In [None]:
#Here we are using piping to do this more effectively. 
gradebook4<-gradebook %>%
  mutate(Pass.Fail = ifelse(grade > 60, "Pass", "Fail")) %>%
  mutate(letter = ifelse(grade %in% 60:69, "D",
                         ifelse(grade %in% 70:79, "C",
                                ifelse(grade %in% 80:89, "B",
                                       ifelse(grade %in% 90:99, "A", "F")))))
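Deeply nested `ifelse()` calls get hard to read. One possible alternative, not part of the original example, is `case_when()` from `dplyr`, which checks its conditions in order. The sketch below rebuilds the same gradebook so that it runs on its own. The name `gradebook5` is hypothetical.

```r
library(dplyr)

gradebook <- data.frame(
  section = c("MATH111", "MATH111", "ENG111"),
  grade   = c(78, 93, 56),
  student = c("David", "Kristina", "Mycroft")
)

gradebook5 <- gradebook %>%
  mutate(
    Pass.Fail = ifelse(grade > 60, "Pass", "Fail"),
    # Conditions are tested top to bottom; TRUE acts as the fallback.
    letter = case_when(
      grade %in% 90:99 ~ "A",
      grade %in% 80:89 ~ "B",
      grade %in% 70:79 ~ "C",
      grade %in% 60:69 ~ "D",
      TRUE             ~ "F"
    )
  )

gradebook5
```

Both columns are created in a single `mutate()` call, and adding another grade band only requires one more line.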




In [None]:
#find the average city and highway mpg
summarise(mpg, mean(cty), mean(hwy))
#find the average city and highway mpg by cylinder
summarise(group_by(mpg, cyl), mean(cty), mean(hwy))
summarise(group_by(mtcars, cyl), m = mean(disp), sd = sd(disp))

# With data frames, you can create and immediately use summaries
by_cyl <- mtcars %>% group_by(cyl)
by_cyl %>% summarise(a = n(), b = a + 1)

## `tibble`
- Tibbles are data frames, but slightly changed so that they work better in the `tidyverse` and with `dplyr`.
- https://github.com/tidyverse/tibble

In [None]:
#Tibble Demo
iris<-as_tibble(read.csv(file="../../data/iris.csv", header=TRUE,sep=","))
#you can see this is of class tbl_df
class(iris)
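To make "slightly changed" a little more concrete, here is a small sketch of one behavioural difference. It uses a toy data frame (`df` and `tbl` are hypothetical names) so that it does not depend on the CSV file above.

```r
library(tibble)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

# A tibble is still a data frame, with extra classes on top.
class(df)
class(tbl)

# Selecting a single column with [ , ]:
# a plain data frame drops to a bare vector, a tibble stays a tibble.
class(df[, "x"])
class(tbl[, "x"])
```

Because tibbles keep their type when subset, code written against them tends to behave more predictably.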