# Data Manipulation and Analysis

Exploring data is an important part of most workflows and one of R's strengths. Python's pandas and similar libraries were inspired by R's data frames. Here we'll use many features of the ubiquitous library [dplyr](https://github.com/hadley/dplyr) to showcase the variety of options available.

dplyr defines a "grammar of data manipulation" that structures operations (verbs) and objects (nouns) consistently, which allows for flexible and interchangeable commands.

We'll focus especially on some of the most common verbs:
* Select
* Filter
* Mutate
* Summarise
* Arrange
* Do

In [None]:
library(dplyr)

In [None]:
library(ggplot2)

In [None]:
data()

In [None]:
df <- airquality

In [None]:
str(df)

In [None]:
glimpse(df)

Every verb takes a data frame as input and returns a data frame. Combined with R's functional style, this makes it easy to chain operations.

In general:
* The first argument is a data frame.
* The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.
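
As a rough illustration of the two points above, the sketch below compares a base R subset with the equivalent dplyr call and then nests two verbs. It uses the built-in `airquality` data directly; the names `base_result`, `dplyr_result`, and `hot_sept` are made up for this example.

```r
library(dplyr)

# Base R: columns must be referenced through the data frame with $
base_result <- airquality[airquality$Month == 9 & airquality$Day == 1, ]

# dplyr: the data frame comes first, then conditions on bare column names
dplyr_result <- filter(airquality, Month == 9, Day == 1)

# Because every verb returns a data frame, calls can be nested (or piped)
hot_sept <- select(filter(airquality, Month == 9, Temp > 90), Day, Temp)
```

Nesting works but reads inside out, which is the motivation for the pipe introduced later in this notebook.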

## Subsetting

In [None]:
filter(df, Month == 9, Day == 1)

In [None]:
filter(df, Month == 9 & Day == 1)

In [None]:
filter(df, !(Wind > 10 | Temp > 60))

In [None]:
head(filter(df, between(Wind, 10, 11)), 3)

In [None]:
head(df, 3)

In [None]:
# Remember to assign the result; dplyr verbs never modify their input in place.
# (airquality only covers May through September.)
may_first <- filter(df, Month == 5, Day == 1)

In [None]:
head(filter(df, Wind == 2 * 4), 3)

In [None]:
filter(df, Wind == (sqrt(2) ^ 2) * 4)

In [None]:
head(filter(df, near(Wind, (sqrt(2) ^ 2) * 4)), 3)
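
The last three cells show why `==` is risky for floating-point values: `sqrt(2) ^ 2` is not stored as exactly 2, so the exact comparison matches no rows while `near()` does. A minimal sketch of the same effect on a single number (the variable `x` is only for illustration):

```r
library(dplyr)

x <- (sqrt(2) ^ 2) * 4

exact_match  <- x == 8                    # FALSE: x is slightly larger than 8
near_match   <- near(x, 8)                # TRUE: equal within a small tolerance
base_r_match <- isTRUE(all.equal(x, 8))   # base R alternative with its own tolerance
```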

### NA values

In [None]:
head(filter(df, is.na(Solar.R)), 3)

*Questions*:
* Is NA ^ 0 missing?
* Is NA | TRUE missing?
* What about FALSE & NA?
* Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
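
One way to explore these questions is to evaluate the expressions directly and test the results with `is.na()`. The sketch below also shows how `filter()` treats a condition that evaluates to `NA`; the `Ozone > 100` threshold and the variable names are arbitrary choices for illustration.

```r
library(dplyr)

# Evaluate each expression and check whether the result is missing
is.na(NA ^ 0)
is.na(NA | TRUE)
is.na(FALSE & NA)
is.na(NA * 0)

# filter() keeps only rows where the condition is TRUE,
# so rows where the condition evaluates to NA are dropped
high_ozone <- filter(airquality, Ozone > 100)

# Ask for the missing rows explicitly if you want to keep them
high_or_missing <- filter(airquality, is.na(Ozone) | Ozone > 100)
```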

### Arrange

In [None]:
head(arrange(df, desc(Month), desc(Day), Temp), 3)

### Select

In [None]:
head(select(df, Solar.R, Temp) == select(df, c(Solar.R, Temp)), 5)

In [None]:
head(select(df, -(Month:Day)))

In [None]:
head(rename(df, Solar = Solar.R), 3)

*Exercise*: Use starts_with(), ends_with(), contains(), and matches() to specify columns. Note that matches() uses regular expressions for pattern matching.
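
As a starting point, here is a minimal sketch of the selection helpers on `airquality`. In dplyr they are spelled with underscores (`starts_with()`, `ends_with()`). The patterns and variable names below are arbitrary examples, not the only reasonable answers.

```r
library(dplyr)

# Columns whose names begin with "S"
s_cols <- names(select(airquality, starts_with("S")))

# Columns whose names end in "h"
h_cols <- names(select(airquality, ends_with("h")))

# contains() matches a literal string, so "." is just a dot here
dot_cols <- names(select(airquality, contains(".")))

# matches() takes a regular expression: names exactly three characters long
three_char_cols <- names(select(airquality, matches("^.{3}$")))
```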

## Mutating and Transmuting

In [None]:
head(
    mutate(df,
           knots = 0.868976 * Wind,
           celsius = (Temp - 32) * (5/9),
           kelvin = celsius + 273.15),
    3)

In [None]:
# transmute() works similarly but returns only the derived columns.
head(
    transmute(filter(df, !is.na(Ozone)),
              pct_max_ozone = Ozone / max(Ozone)),
    3)

*Question*: What happens if we don't filter out NA elements?

In [None]:
filter(mutate(df, new_months = (Month != lag(Month))), new_months == TRUE)

...and more! Ranking, rolling aggregates, etc.
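
A short sketch of what ranking and cumulative aggregates can look like inside `mutate()`, computed per month. The column names `temp_rank`, `running_mean`, and `temp_change` are invented for this example.

```r
library(dplyr)

windowed <- airquality %>%
  group_by(Month) %>%
  mutate(
    temp_rank    = min_rank(desc(Temp)),  # 1 = hottest day of the month
    running_mean = cummean(Temp),         # cumulative mean within the month
    temp_change  = Temp - lag(Temp)       # change from the previous day
  ) %>%
  ungroup()

head(windowed, 3)
```

Unlike the summary functions used with `summarise()`, these window functions return one value per row, so the result has as many rows as the input.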

## Grouping, Aggregation, and the Pipe

In [None]:
by_month <- group_by(df, Month)
summarise(by_month,
          avg_temp = mean(Temp, na.rm = TRUE),
          avg_hot_temp = mean(Temp[Temp > 75], na.rm = TRUE),
          days = n())
# n() counts every row in the group; use sum(!is.na(x)) to count only the non-NA values of x

In [None]:
# Chain operations with the pipe; read %>% as "then"
df %>%
  group_by(Month) %>%
  summarise(avg_temp = mean(Temp, na.rm = TRUE)) %>%
  filter(avg_temp > 80)

In [None]:
df %>% summarise(n_distinct(Month))

In [None]:
df %>% group_by(Month, Day) %>% summarise(n()) %>% head(3)

In [None]:
df %>% group_by(Month, Day) %>% summarise(n()) %>% summarise(n()) %>% head(3)

In [None]:
df %>% group_by(Month, Day) %>% ungroup() %>% summarise(n()) %>% head(3)

In [None]:
# What question does this query answer?
df %>% 
  group_by(Month) %>%
  filter(rank(desc(Temp)) < 4)

In [None]:
# What question does this query answer?
df %>%
  group_by(Temp) %>%
  filter(n() > 10) %>%
  arrange(Month, Day)

There are many aggregation and window functions, including
* first(), last()
* min(), max(), nth(), quantile()
* mean(), median()
* sd(x), IQR(x), mad(x)
* rank()
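
A minimal sketch combining several of the functions listed above in a single `summarise()` call. The output column names are invented for this example, and `names = FALSE` is passed to `quantile()` only to return a plain number.

```r
library(dplyr)

monthly <- airquality %>%
  group_by(Month) %>%
  summarise(
    first_temp  = first(Temp),                           # first recorded day
    last_temp   = last(Temp),                            # last recorded day
    third_temp  = nth(Temp, 3),                          # third recorded day
    median_temp = median(Temp),
    q90_temp    = quantile(Temp, 0.9, names = FALSE),    # 90th percentile
    iqr_temp    = IQR(Temp),
    sd_wind     = sd(Wind)
  )

monthly
```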

*Exercise*: For the months we have data for, in how many cases is the windiest day of the month **below** the 25% quantile or **above** the 75% quantile for temperature?

## Joins

In [None]:
month_nums <- df %>% distinct(Month) %>% arrange(Month)
print(month_nums)

In [None]:
month_names <- c('May', 'June', 'July', 'August', 'September')
months <- data.frame(month_nums, month_names)
glimpse(months)

In [None]:
# Name the key explicitly; without `by`, dplyr joins on all shared column names
joined_df <- inner_join(df, months, by = "Month")

In [None]:
joined_df %>% group_by(Month) %>% summarise(name = first(month_names))

*Question*: When would you not want to perform an inner join?

*Note*: semi_join() and anti_join() are filtering joins: they keep or drop rows of the first table depending on whether they have a match in the second, without adding any columns.

In [None]:
joined_df %>%
  semi_join(joined_df %>% filter(Day == 31) %>% select(Day))
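
To make the difference between join types concrete, the sketch below uses a hypothetical lookup table, `partial_months`, that deliberately leaves out September. It is not part of the notebook's data and is built only for this comparison.

```r
library(dplyr)

# Hypothetical lookup table that is deliberately missing September
partial_months <- data.frame(
  Month = 5:8,
  month_name = c("May", "June", "July", "August")
)

# inner_join() silently drops the September rows
inner <- inner_join(airquality, partial_months, by = "Month")

# left_join() keeps them and fills month_name with NA
left <- left_join(airquality, partial_months, by = "Month")

# anti_join() returns only the rows with no match, here September
unmatched <- anti_join(airquality, partial_months, by = "Month")
```

Comparing `nrow(inner)` with `nrow(left)` is a quick way to check whether an inner join has discarded observations.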

## Summarise() and join
summarise() keeps only the grouping columns and the summary values you ask for. When you want to add other columns back, left_join() is very useful.

In [None]:
# summarise() only keeps necessary columns...
joined_df %>%
    group_by(Month) %>%
    summarise(mean_temp=mean(Temp)) %>% head

In [None]:
# ... therefore, add them back with a left_join
joined_df %>%
    group_by(Month) %>%
    summarise(mean_temp=mean(Temp)) %>%
    left_join(months, by = "Month") %>%          # add back the month names
    select(Month, month_names, mean_temp)        # reorder the columns

## Extract columns from a dataframe

In [None]:
# .$name extracts a single column as a vector; magrittr::extract2("month_names") is an alternative
joined_df %>% .$month_names
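
A brief sketch of a few equivalent ways to extract one column as a vector, shown on `airquality` so that it runs on its own. `pull()` is dplyr's verb for this; the variable names are illustrative.

```r
library(dplyr)

temps_pull    <- airquality %>% pull(Temp)   # dplyr verb for extracting one column
temps_dollar  <- airquality %>% .$Temp       # the form used in the cell above
temps_bracket <- airquality[["Temp"]]        # base R equivalent

# select() is different: it returns a one-column data frame, not a vector
temps_df <- airquality %>% select(Temp)
```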

## Plotting and Visual Analysis

ggplot2 provides a standard and extensible interface for plotting data.

In [None]:
june <- df %>% filter(Month == 6) %>% select(Day, Temp, Wind)

In [None]:
ggplot(data = june, mapping = aes(x = Day, y = Temp)) +
  geom_point(aes(size = Wind), alpha = 1/3) +
  geom_smooth(se = FALSE)

In [None]:
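# Ozone and Solar.R contain missing values, so drop them before aggregating
temps <- df %>%
  group_by(Temp) %>%
  summarise(ozone = mean(Ozone, na.rm = TRUE), rad = sd(Solar.R, na.rm = TRUE))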

In [None]:
ggplot(data = temps, mapping = aes(x = Temp, y = ozone)) + 
  geom_point(aes(size = rad), alpha = 1/2)

We'll talk more about ggplot2 in the Visualization notebook.

*Exercise*: Load the nycflights13 dataset as well as the US precipitation dataset. Join the average precipitation data to the flight delays data frame in a way that makes sense, then make a scatterplot of precipitation vs. delay time.

*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*