## dplyr & tidyr 

The remainder of the course is about data formatting, summarization and visualization. It is entirely possible to do these things using base R. However, in recent years better alternatives have been developed. Now that the basic R syntax has been explained, it is time to introduce a more consistent and easier-to-read syntax created by Hadley Wickham in the form of the key packages `dplyr`, `tidyr` and `ggplot2`. Most data operations can be performed with `dplyr` and `tidyr`, and `ggplot2` can handle most visualizations. It is still necessary to know the basic R syntax, however, especially for more complicated and specialized statistical analyses, for which many methods exist. Method dispatch will be explained in the visualization section. 

First install and load the packages (installation has already been done for the course):

In [None]:
#install.packages(c("dplyr", "tidyr", "ggplot2"))
library(dplyr)
library(tidyr)
library(ggplot2)

### tidyr & dplyr syntax

Advantages:  

* Piping with `%>%` is more readable: data flows through the operations from left to right
* Tibbles, a nicer implementation of `data.frame`
* A consistent vocabulary of verbs: `filter`, `select`, `mutate`, `summarise` and `arrange`
* Explicit function names (functions do what their names say)
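
To make the first two points concrete, here is a minimal sketch that compares a nested call with the piped equivalent and converts a data frame to a tibble. It uses the built-in `mtcars` data set, which is an assumption made for illustration and is not part of the course material above.

```r
library(dplyr)

# Nested calls have to be read from the inside out
nested <- head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# The same steps with the pipe read from left to right, top to bottom:
# x %>% f(y) is evaluated as f(x, y)
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

# Both forms give the same result
identical(nested, piped)

# as_tibble() turns a data.frame into a tibble, which prints only the
# first rows and shows the type of each column
mtcars_tbl <- as_tibble(mtcars)
mtcars_tbl
```

The piped version lists the operations in the order in which they are applied, which is what makes longer chains easier to follow.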

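Overview (every verb takes a data frame as its first argument; `Petal.Ratio` and `mean.Petal.Ratio` are columns created by the preceding `mutate` and `summarise` steps, so those calls work on the result of the earlier steps rather than on the raw `iris` table):  

* Row selection: `filter(iris, Sepal.Length > 6)`
* Column selection: `select(iris, Species:Petal.Length)`
* Calculations: `mutate(iris, Petal.Ratio = Petal.Width / Petal.Length)`
* Define groups: `group_by(iris, Species)`, so that later steps such as `summarise` work per group
* Summarize per group: `summarise(iris, n = n(), mean.Petal.Ratio = mean(Petal.Ratio))`
* Order the rows of the table: `arrange(iris, desc(mean.Petal.Ratio))`
* Merge tables based on a shared id column: `left_join()`, `full_join()`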
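
The join functions are the only verbs in the overview without arguments, so here is a minimal sketch of how they behave. The two small tables and their column names (`sample_id`, `value`, `treatment`) are hypothetical and invented purely for illustration.

```r
library(dplyr)

# Hypothetical example tables (invented for illustration)
measurements <- tibble(
  sample_id = c("s1", "s2", "s3"),
  value     = c(1.2, 3.4, 5.6)
)
annotation <- tibble(
  sample_id = c("s1", "s2", "s4"),
  treatment = c("control", "drug", "drug")
)

# left_join() keeps every row of the left table;
# rows without a match in the right table get NA in the new columns
left <- left_join(measurements, annotation, by = "sample_id")

# full_join() keeps every row of both tables
full <- full_join(measurements, annotation, by = "sample_id")

left
full
```

Here `s3` has no annotation and `s4` has no measurement, so `left` contains an `NA` treatment for `s3`, while `full` additionally contains a row for `s4` with an `NA` value.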

For example, using the dplyr syntax on `my_dataframe`:  

In [None]:
my_dataframe %>% filter( names %in% c("Margriet", "Elke") ) %>%
  select(names:hobbies) # select columns from names to hobbies

To perform calculations on a data frame and perform a statistical summary:  

In [None]:
my_dataframe %>% 
  # computed over the whole table, before grouping
  mutate(nonsensical_confidence_limits = 1.96 * sd(fakeage) / sqrt(n())) %>%
  group_by(sex) %>%
  summarise(n = n(), total_age = sum(fakeage), CoV = sd(fakeage) / mean(fakeage))

In the example above, the `mutate` call calculates the (deliberately nonsensical) confidence limits and adds them as a new column. Note that `summarise` keeps only the grouping column and the newly computed summaries, so this column does not appear in the final result.  
The `group_by` call groups the data frame by sex, so that R knows which levels to summarise over in the next line. `n = n()` gives the number of rows in each group; two more calculations are then passed as extra arguments to the `summarise` call: the sum of the ages of the males/females, and the coefficient of variation (the standard deviation divided by the mean).  
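
To see what the grouping step changes, the following sketch runs `summarise` with and without `group_by`. The `people` table is a hypothetical stand-in for `my_dataframe`; its values are invented, and only the column names `names`, `sex` and `fakeage` are taken from the examples above.

```r
library(dplyr)

# Hypothetical stand-in for my_dataframe (values invented for illustration)
people <- tibble(
  names   = c("Margriet", "Elke", "Jan", "Piet"),
  sex     = c("F", "F", "M", "M"),
  fakeage = c(30, 40, 20, 60)
)

# With group_by(): one summary row per level of sex
per_sex <- people %>%
  group_by(sex) %>%
  summarise(n = n(), total_age = sum(fakeage), mean_age = mean(fakeage))

# Without group_by(): the whole table collapses into a single row
overall <- people %>%
  summarise(n = n(), total_age = sum(fakeage))

per_sex
overall
```

With these invented values `per_sex` has two rows (one per sex) and `overall` has one row covering all four people.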

* Exercise: study the following code, then rewrite it with pipes and `dplyr` syntax.  
Hint: run it line by line and observe the result. Some functions, like `aggregate`, have not been covered yet, but looking at their output will make it obvious what they do.  

In [None]:
tmp <- iris[ iris$Sepal.Length > 6, ]
tmp <- tmp[, c("Species", "Petal.Width", "Petal.Length")]
tmp$Petal.Ratio <- tmp$Petal.Width / tmp$Petal.Length
tmp2 <- aggregate(Petal.Ratio ~ Species, data = tmp, FUN = length)
tmp <- aggregate(Petal.Ratio ~ Species, data = tmp, FUN = mean)
colnames(tmp)[2] <- "mean.Petal.Ratio"
colnames(tmp2)[2] <- "n"
tmp <- merge(tmp2, tmp, by = "Species")
tmp[ rev(order(tmp$mean.Petal.Ratio)), ]