# Data Tidying (and a bit more) in R
## by Diya Das and Andrey Indukaev

### The goal
Data tidying is a necessary first step for data analysis - it's the process of taking your messily formatted data (missing values, unwieldy coding/organization, etc.) and literally tidying it up so it can be easily used for downstream analyses. To quote Hadley Wickham, "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table."

### The datasets
We are going to use the data from the R package [`nycflights13`](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf). It contains five datasets describing flights departing NYC in 2013. We will load them directly into R from the library, but the repository also includes CSV files that we created for the Python demo; these can be used to load the data into an R session as well.

### But the data are tidy!
So we're going to start by making them a bit untidy. And then we're also going to teach you to do basic manipulations/operations to get statistics that can be used for downstream operations (a bit more than tidying).

In [2]:
# Install the packages once if needed (uncomment the next line), then load them.
#install.packages(c('nycflights13','dplyr','data.table'), repos='http://cran.us.r-project.org')
pckgsToLoad <- c('nycflights13','dplyr','data.table')
invisible(lapply(pckgsToLoad, require, character.only = TRUE, quietly = TRUE))

### Merging data frames: the outline of the logic and basic R functions

Combining different datasets is a very common operation, and R provides several tools for it; as usual, there is often more than one way to get something done.
Adding new rows, that is, joining data frames "vertically", can be seen as a form of merging and is used quite often.
This can be done with the ``rbind`` function. Here is a somewhat artificial example: let's create a data frame with information on flights by United Airlines and American Airlines only, by subsetting the data for each airline separately and then merging the two subsets.

In [4]:
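# Subset the flights of each carrier separately
flightsUA <- flights[flights$carrier == 'UA',]
flightsAA <- flights[flights$carrier == 'AA',]
# The combined data frame should have as many rows as the two subsets together
nrow(flightsUA) + nrow(flightsAA)
fligthsUAandAA <- rbind(flightsUA,flightsAA)
nrow(fligthsUAandAA)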

Nothing special here: the only condition is that the data frames have the same column names (the columns may be in a different order, since they are matched by name).
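
To make the name-matching rule concrete, here is a minimal sketch on two toy data frames. The data frames `a`, `b` and `c_bad` are made up for illustration and are not part of `nycflights13`.

```r
# Toy data frames (hypothetical, not taken from nycflights13)
a <- data.frame(carrier = c("UA", "UA"), dep_delay = c(5, -2))
# Same column names as `a`, but in a different order
b <- data.frame(dep_delay = c(10, 0), carrier = c("AA", "AA"))

# rbind matches the columns of `b` to those of `a` by name
combined <- rbind(a, b)
names(combined)   # column order follows the first data frame
combined

# A data frame with a different column name cannot be bound
c_bad <- data.frame(carrier = "DL", delay = 3)
mismatch <- tryCatch(rbind(a, c_bad), error = function(e) "error")
mismatch
```

The column order of the result follows the first argument, so it is worth putting the data frame with the preferred layout first.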

A useful tip is to use ``do.call`` to merge more than two data frames.
``do.call`` calls a function with the elements of a list as its arguments, so ``do.call(rbind, list(a, b, c))`` is the same as ``rbind(a, b, c)``.

In [7]:
nrow(do.call(rbind, list(flightsUA,flightsAA,fligthsUAandAA)))
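
As a quick check of what ``do.call`` does, the sketch below compares it with a direct call to ``rbind``. The data frames `x`, `y` and `z` are small made-up examples, not part of the flights data.

```r
# Toy data frames (hypothetical)
x <- data.frame(id = 1:2, value = c(0.5, 1.5))
y <- data.frame(id = 3:4, value = c(2.5, 3.5))
z <- data.frame(id = 5L,  value = 4.5)

# Passing the list through do.call ...
via_do_call <- do.call(rbind, list(x, y, z))
# ... gives the same result as writing out the arguments by hand
direct <- rbind(x, y, z)

identical(via_do_call, direct)
nrow(via_do_call)
```

The advantage of ``do.call`` is that the list can have any length and can be built programmatically.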

This technique is really useful when one wants to populate a data frame within a loop.
Each time we append a row to a data frame inside a loop, a new copy of the data frame is created in memory :(
So the solution is to create a list of one-row data frames and then merge them with the ``do.call`` + ``rbind`` combo.
But since ``rbind``, like many native R functions, can be slow and memory-hungry, for large datasets one may want to use
the ``rbindlist`` function from the ``data.table`` package, which does the same job faster (and returns a ``data.table``).

In [8]:
nrow(rbindlist(list(flightsUA,flightsAA,fligthsUAandAA)))
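
The loop pattern described above can be sketched as follows. The helper `summarise_one` and the input `delays_by_carrier` are hypothetical and only serve as an illustration; the ``rbindlist`` step runs only if ``data.table`` is installed.

```r
# Hypothetical helper: build a one-row data frame for one carrier
summarise_one <- function(code, delays) {
  data.frame(carrier = code, n = length(delays), mean_delay = mean(delays))
}

# Toy input (made up for illustration)
delays_by_carrier <- list(UA = c(5, -2, 12), AA = c(10, 0), DL = c(3, 3, 3, 7))

# Collect the one-row data frames in a preallocated list
# instead of calling rbind inside the loop
rows <- vector("list", length(delays_by_carrier))
for (i in seq_along(delays_by_carrier)) {
  rows[[i]] <- summarise_one(names(delays_by_carrier)[i], delays_by_carrier[[i]])
}

# Bind everything once, at the end
result <- do.call(rbind, rows)
result

# Optional: the same step with data.table, if the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  result_dt <- data.table::rbindlist(rows)
}
```

Binding once at the end means the growing data frame is never copied inside the loop; only the list of small pieces is held in memory until the final call.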

In [None]:
flights %>% group_by(origin) %>%
  summarise(avgDepDelay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avgDepDelay)) 