# Lab 3: A Deep Dive on `group_by()` and `summarize()` in `dplyr`

The goal of this week's lab is to become more familiar with `dplyr`, specifically `group_by()` and `summarize()`. Consequently, there is some overlap between this week's and last week's materials.

In [None]:
# Install nycflights13 package
install.packages('nycflights13')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# Load packages
library(tidyverse)
library(nycflights13)

## Lab 2 Review
- ggplot
    - Key Components:
      - Data
      - Aesthetic Mapping
      - Layer(s) of (geom)etric Objects
    - Plots Covered:
      - Scatterplot (geom_point)
      - Boxplot (geom_boxplot)
      - Barplot (geom_bar)
      - Histogram (geom_histogram)

- dplyr
    - Workflow: Input Dataframe -> Operation -> New Output Dataframe
    - Core Functions:
      - Select
      - Filter
      - Arrange
      - Mutate
      - Summarise (equivalently, summarize)

## Grouping

Grouping is an incredibly important operation in data science. When working with a data set, you may want to calculate a metric not for every row, but rather for every unique combination, or group, in the data set.

To give a concrete example, consider a subset of columns in the `nycflights13` data set shown below.

In [None]:
flights %>%
  select(year, month, day, dep_delay) %>%
  print()

# A tibble: 336,776 × 4
    year month   day dep_delay
   <int> <int> <int>     <dbl>
 1  2013     1     1         2
 2  2013     1     1         4
 3  2013     1     1         2
 4  2013     1     1        -1
 5  2013     1     1        -6
 6  2013     1     1        -4
 7  2013     1     1        -5
 8  2013     1     1        -3
 9  2013     1     1        -3
10  2013     1     1        -2
# ℹ 336,766 more rows


Given this data set, one could ask: what is the average departure delay for each day of the year?

Before looking at how we can code this in R, think about the usefulness of this information for NYC airport authorities.

A possible answer: this information would allow airport authorities to be more proactive about potential delay waves through proper accommodations in advance.

Now, how can we code this? First, we need to create **groups**. As we are interested in the average departure delay for each day of the year, the days themselves are the groups!

In [None]:
# Showcasing the grouped data set
flights %>%
  group_by(year, month, day) %>%
  print()

# A tibble: 336,776 × 19
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4

Note that the `group_by()` function **does not modify the values in the input data set at all**; it only stores the groups behind the scenes for future computation. `[365]` shows how many unique groups there are (note that 2013 was not a leap year).
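
To check what `group_by()` stores, here is a minimal sketch on a small made-up tibble (the `toy` data and its values are assumptions for illustration, not part of `flights`). `group_vars()` and `n_groups()` report the stored grouping, and `ungroup()` removes it again.

```r
library(dplyr)

# Hypothetical toy data: five flights over two days
toy <- tibble(
  month     = c(1, 1, 1, 1, 1),
  day       = c(1, 1, 1, 2, 2),
  dep_delay = c(2, 4, -1, 10, NA)
)

grouped <- toy %>% group_by(month, day)

group_vars(grouped)  # names of the grouping columns: "month" "day"
n_groups(grouped)    # number of unique (month, day) combinations: 2

# The values themselves are untouched by grouping
identical(grouped$dep_delay, toy$dep_delay)  # TRUE
nrow(grouped) == nrow(toy)                   # TRUE

# ungroup() drops the stored groups
ungrouped <- grouped %>% ungroup()
group_vars(ungrouped)  # character(0)
```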

Having looked at the `group_by()` function, we now see how we can calculate a summary statistic (in our case: average departure delay) *within each group*.

## Summarize
`summarize()` takes in an input data set, performs some calculations, and outputs a new data frame. It is generally used in tandem with `group_by()`.


In [None]:
flights %>%
  group_by(year, month, day) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  print()

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day mean_dep_delay
   <int> <int> <int>          <dbl>
 1  2013     1     1          11.5
 2  2013     1     2          13.9
 3  2013     1     3          11.0
 4  2013     1     4           8.95
 5  2013     1     5           5.73
 6  2013     1     6           7.15
 7  2013     1     7           5.42
 8  2013     1     8           2.55
 9  2013     1     9           2.28
10  2013     1    10           2.84
# ℹ 355 more rows


This new data frame contains one column for each grouping variable (in our case: year, month, day) and one column for each summary (in our case: mean_dep_delay).
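
The message about the `.groups` argument in the output above can be handled explicitly. The sketch below assumes dplyr 1.0.0 or later and uses a made-up `toy` tibble (its values are illustrative only). By default `summarize()` peels off the last grouping variable, while `.groups = "drop"` returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data (values are made up for illustration)
toy <- tibble(
  month     = c(1, 1, 1, 2, 2),
  day       = c(1, 2, 2, 1, 1),
  dep_delay = c(2, 4, NA, 10, 20)
)

# Default: the last grouping variable (day) is dropped,
# so the result is still grouped by month
by_default <- toy %>%
  group_by(month, day) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
group_vars(by_default)  # "month"

# .groups = "drop": the result carries no grouping at all
dropped <- toy %>%
  group_by(month, day) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")
group_vars(dropped)     # character(0)
```

Dropping the groups is a reasonable default when the summary is the end of the pipeline; keeping them is useful when a second `summarize()` should roll the result up by month.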

Another important point is the use of `na.rm = TRUE`. Without it, functions such as `mean()` return `NA` whenever the input contains a missing value, because the result of arithmetic on an unknown value is itself unknown. Let's look at a simple example with a vector of numbers.

In [None]:
### Example of the use of na.rm = TRUE

example_vector <- c(5, 6, 7, NA) # a vector of numbers with NA

example_vector %>% mean() # calculating the mean of a vector with NAs returns NA

example_vector %>% mean(na.rm = TRUE) # with na.rm = TRUE, R removes the NAs and calculates the mean of the remaining numbers

Now, suppose your friend comes to you and asks which month of the year is the busiest in terms of number of flights. Inside `summarize()`, the `dplyr` function `n()` counts the number of rows in each group.

In [None]:
flights %>%
  group_by(month) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  print()

# A tibble: 12 × 2
   month     n
   <int> <int>
 1     7 29425
 2     8 29327
 3    10 28889
 4     3 28834
 5     5 28796
 6     4 28330
 7     6 28243
 8    12 28135
 9     9 27574
10    11 27268
11     1 27004
12     2 24951


Alternatively, you can use `count()`, which groups, counts, and (with `sort = TRUE`) sorts in one step.

In [None]:
flights %>%
  count(month, sort = TRUE) %>%
  print()

# A tibble: 12 × 2
   month     n
   <int> <int>
 1     7 29425
 2     8 29327
 3    10 28889
 4     3 28834
 5     5 28796
 6     4 28330
 7     6 28243
 8    12 28135
 9     9 27574
10    11 27268
11     1 27004
12     2 24951


It is important to note that `n()` counts every row, including rows with missing values. If you want to count only the non-missing values of a column, `sum()` and `is.na()` can be helpful.

For example, your friend comes to you again and asks that you only include flights which are not canceled (i.e., `arr_delay` is not `NA`).

In [None]:
flights %>%
  group_by(month) %>%
  summarize(n_notcanceled = sum(!is.na(arr_delay))) %>%
  arrange(desc(n_notcanceled)) %>%
  print()

# A tibble: 12 × 2
   month n_notcanceled
   <int>         <int>
 1     8         28756
 2    10         28618
 3     7         28293
 4     5         28128
 5     3         27902
 6     4         27564
 7     6         27075
 8    12         27020
 9     9         27010
10    11         26971
11     1         26398
12     2         23611


While July has the most flights overall, August has the most non-canceled flights.
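
The difference between the two counts is easy to verify on a small example. The sketch below uses a made-up tibble (`toy` and its values are assumptions for illustration): `n()` counts rows, `sum(!is.na(x))` counts non-missing values, and, because `TRUE` is treated as 1, `mean(is.na(x))` gives the share of missing values.

```r
library(dplyr)

# Hypothetical toy data: NA in arr_delay marks a canceled flight
toy <- tibble(
  month     = c(1, 1, 1, 2, 2),
  arr_delay = c(5, NA, -3, NA, NA)
)

counts <- toy %>%
  group_by(month) %>%
  summarize(
    n             = n(),                     # all rows in the group
    n_notcanceled = sum(!is.na(arr_delay)),  # rows with a recorded arr_delay
    prop_missing  = mean(is.na(arr_delay))   # share of rows with NA
  )

print(counts)
# month 1: n = 3, n_notcanceled = 2, prop_missing = 1/3
# month 2: n = 2, n_notcanceled = 0, prop_missing = 1
```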

### Sum, Mean, Maximum, Minimum, Median, Quantiles, SD, IQR

With `summarize()`, you can also calculate other summary statistics such as
1. `sum()`: summation;
2. `mean()`: average;
3. `min()` and `max()`: smallest and largest;
4. `median()`: middle value, i.e., the 50th percentile;
5. `quantile()`: `quantile(x, 0.25)` finds the value of `x` below which roughly 25% of all values fall;
6. `sd()`: standard deviation;
7. `IQR()`: interquartile range, i.e., `quantile(x, 0.75) - quantile(x, 0.25)`.
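
To make the quantile-based summaries concrete, here is a small worked example in base R. The vector `x` is made up for illustration, and the values rely on R's default quantile method (`type = 7`); `names = FALSE` only strips the `25%`-style labels from the result.

```r
x <- c(1, 2, 3, 4, 5)  # made-up values

q25 <- quantile(x, 0.25, names = FALSE)  # 2
q75 <- quantile(x, 0.75, names = FALSE)  # 4

IQR(x)                           # 2, the same as q75 - q25
median(x)                        # 3
quantile(x, 0.5, names = FALSE)  # 3, the median is the 50th percentile
```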

In [None]:
flights %>%
  group_by(year, month, day) %>%
  summarize(
    sum = sum(dep_delay, na.rm = TRUE), # total departure delay
    mean = mean(dep_delay, na.rm = TRUE), # mean departure delay
    max = max(dep_delay, na.rm = TRUE), # maximum departure delay
    min = min(dep_delay, na.rm = TRUE), # minimum departure delay
    median = median(dep_delay, na.rm = TRUE), # median departure delay
    q5 = quantile(dep_delay, 0.05, na.rm = TRUE), # 5th percentile of departure delay
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE), # 95th percentile of departure delay
    sd = sd(dep_delay, na.rm = TRUE), # standard deviation of departure delay
    IQR = IQR(dep_delay, na.rm = TRUE), # interquartile range of departure delay
  ) %>%
  print()

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 × 12
# Groups:   year, month [12]
    year month   day   sum  mean   max   min median     q5   q95    sd   IQR
   <int> <int> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
 1  2013     1     1  9678 11.5    853   -15     -1  -8     70.1  45.3  13
 2  2013     1     2 12958 13.9    379   -13      0  -7     85    37.2  17
 3  2013     1     3  9933 11.0    291   -13      0  -7.85  68    31.5  15
 4  2013     1     4  8137  8.95   288   -19     -1  -8     60    27.7  14

### Extracting the First, Last, and N-th Value
We can find the first element of `x` using `first(x)`, the last element using `last(x)`, and the n-th element using `nth(x, n)`. For a plain vector, these are equivalent to `x[1]`, `x[length(x)]`, and `x[n]`. To better understand these functions, consider the simple example below.

In [None]:
x <- c(NA, 8, 7, 10) # an example vector

# both give NA
x[1]
first(x)

# both give 10
x[length(x)]
last(x)

# both give 7
x[3]
nth(x, 3)

# both give NA (position 10 does not exist)
x[10]
nth(x, 10)

Now, let us find the first and 900th arrival time for each day, among non-canceled flights.

In [None]:
# find the first and 900th arrival time for each day

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(year, month, day) %>%
  summarize(
    first_arr = first(arr_time),
    nine_hundredth_arr = nth(arr_time, 900)) %>%
  print()

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day first_arr nine_hundredth_arr
   <int> <int> <int>     <int>              <int>
 1  2013     1     1       830                 NA
 2  2013     1     2       518                 43
 3  2013     1     3       504                434
 4  2013     1     4       505                257
 5  2013     1     5       503                 NA
 6  2013     1     6       451                 NA
 7  2013     1     7       531               2220
 8  2013     1     8       625                 NA
 9  2013     1     9       432                 NA
10  2013     1    10       426

The NAs correspond to days on which there are fewer than 900 non-canceled flights.
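
The same behavior can be reproduced on a short vector. This is a minimal sketch with a made-up vector `x`; it also shows the `default` argument of `nth()`, which supplies a fallback value when the requested position does not exist.

```r
library(dplyr)

x <- c(8, 7, 10)  # made-up vector with only three elements

nth(x, 2)                # 7
nth(x, 5)                # NA, because x has fewer than 5 elements
nth(x, 5, default = -1)  # -1, an explicit fallback instead of NA
length(x) >= 5           # FALSE: a direct check that position 5 exists
```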

**Warning:** The column `first_arr` does not refer to the EARLIEST arrival time for each day. It simply refers to the arrival time of the flight that occurs first in the data set for each day.

What if we want to get the EARLIEST arrival time for each day? We can use the `min()` function.

In [None]:
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(year, month, day) %>%
  summarize(earliest_arr = min(arr_time, na.rm = TRUE)) %>%
  print()

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day earliest_arr
   <int> <int> <int>        <int>
 1  2013     1     1            3
 2  2013     1     2            1
 3  2013     1     3            4
 4  2013     1     4            2
 5  2013     1     5            4
 6  2013     1     6            2
 7  2013     1     7            2
 8  2013     1     8           10
 9  2013     1     9           12
10  2013     1    10            2
# ℹ 355 more rows


Technically, we can also use the `first()` function, but we need to make sure that the data is already ordered by `arr_time`.
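
Why the ordering matters is easiest to see on a tiny example. The sketch below uses a made-up `toy` tibble (an assumption for illustration) whose rows are deliberately not sorted by `arr_time`. It also shows the `order_by` argument of `first()` as one possible alternative to calling `arrange()` first.

```r
library(dplyr)

# Hypothetical toy data: rows are NOT sorted by arr_time
toy <- tibble(
  day      = c(1, 1, 1, 2, 2),
  arr_time = c(830, 12, 455, 2210, 640)
)

# first() follows row order within each group
unsorted <- toy %>%
  group_by(day) %>%
  summarize(first_arr = first(arr_time))
unsorted$first_arr      # 830 2210

# After arrange(), the first row of each group has the smallest arr_time
sorted <- toy %>%
  arrange(arr_time) %>%
  group_by(day) %>%
  summarize(first_arr = first(arr_time))
sorted$first_arr        # 12 640

# Alternative: let first() do the ordering via its order_by argument
via_order_by <- toy %>%
  group_by(day) %>%
  summarize(first_arr = first(arr_time, order_by = arr_time))
via_order_by$first_arr  # 12 640
```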

In [None]:
flights %>%
  filter(!is.na(arr_delay)) %>%
  arrange(arr_time) %>%
  group_by(year, month, day) %>%
  summarize(earliest_arr = first(arr_time)) %>%
  print()

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day earliest_arr
   <int> <int> <int>        <int>
 1  2013     1     1            3
 2  2013     1     2            1
 3  2013     1     3            4
 4  2013     1     4            2
 5  2013     1     5            4
 6  2013     1     6            2
 7  2013     1     7            2
 8  2013     1     8           10
 9  2013     1     9           12
10  2013     1    10            2
# ℹ 355 more rows


## EXERCISES

1. Create a density plot which shows the distribution of the number of flights that a plane makes. Note that a plane is uniquely determined by its `tailnum`. Do not include flights where `tailnum` is `NA`.

In [None]:
## YOUR ANSWER HERE

2. Produce a scatterplot where each point represents an individual plane, with the number of non-canceled flights on the x-axis and the average arrival delay on the y-axis. Recall that a flight is canceled if `arr_delay` is `NA`. Again, do not include flights where `tailnum` is `NA`.

In [None]:
## YOUR ANSWER HERE

3. Which carrier has the longest arrival delay time on average?

In [None]:
## YOUR ANSWER HERE

4. Which carrier has the highest percentage of canceled flights?

In [None]:
## YOUR ANSWER HERE

5. Let us say a departure delay is *extreme* if it is longer than 60 minutes. Find the origin airport which records the highest percentage of *extreme* delays (out of all non-canceled flights).

In [None]:
## YOUR ANSWER HERE

6. (OPTIONAL) Continue from Question 3, where we discovered the carrier with the longest arrival delay time on average.

How does its performance compare with that of other carriers on the same routes? Consider the differences in mean arrival delay between this carrier and the others.

*(This is an open-ended question)*

In [None]:
## YOUR ANSWER HERE