# Lecture 3.2:  Data transformation

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Continue to learn [how to manipulate data](#Data-manipulation), including:
    * Pipes
    * Adding New Variables
    
* We will go through the data set `flights`
    
We will answer questions such as:     
* What days of the year / week are the busiest for flying?    
    
This lecture note corresponds to Chapter 5.5 of your book.
    
    
</div>


Let us load up the `tidyverse` and `nycflights13` packages.



We will start with the `flights` data set that we used in the previous lecture. 

In [2]:
library(tidyverse)
library(nycflights13)
head(flights)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.3     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


# Pipes
Starting now, we will make extensive use of the pipe operator `%>%`. 

### How `%>%` works
Under the hood, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. We can use `%>%` on any function, not just those defined in tidyverse.

Here is an example that prints "hello world" using the pipe. 

In [12]:
"hello world" %>% print()  # prints "hello world"

[1] "hello world"
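To make the rewriting rule concrete, here is a minimal sketch that compares a piped call with its nested equivalent. The vector `x` and the variable names are made up for illustration; `round()` and `paste()` are ordinary base R functions, which also shows that `%>%` is not limited to tidyverse functions.

```r
library(magrittr)  # provides %>%; it is also attached by library(tidyverse)

x <- c(1.234, 5.678)  # illustrative values only

# x %>% f(y) is evaluated as f(x, y)
piped  <- x %>% round(1)
nested <- round(x, 1)
identical(piped, nested)    # TRUE

# x %>% f(y) %>% g(z) is evaluated as g(f(x, y), z)
piped2  <- x %>% round(1) %>% paste("units")
nested2 <- paste(round(x, 1), "units")
identical(piped2, nested2)  # TRUE
```

The piped form reads left to right in the order the steps happen, whereas the nested form has to be read from the inside out.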


We will see the usefulness of the pipe `%>%` later in the lecture, as it greatly simplifies our code.

# Adding New Variables
The `dplyr`/`tidyverse` package offers the `mutate()` and `transmute()` commands to add new variables to data tibbles. The syntax is:
```{r}
<tibble> %>% mutate(<new variable> = <formula for new variable>,
                    <other new variable> = <other formula>)
```
This returns a copy of `<tibble>` with the new variables added on. `transmute()` does the same thing as `mutate()` but only keeps the new variables.
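The difference between the two commands is easiest to see on a tiny table. The sketch below uses a made-up tibble `toy` (its columns `distance` and `hours` are assumptions for illustration and are not part of `nycflights13`).

```r
library(dplyr)

# Hypothetical toy data, not part of nycflights13
toy <- tibble(distance = c(100, 250), hours = c(2, 5))

# mutate() keeps the existing columns and appends the new one
with_mutate <- toy %>% mutate(speed = distance / hours)

# transmute() keeps only the newly created column
with_transmute <- toy %>% transmute(speed = distance / hours)

names(with_mutate)     # "distance" "hours" "speed"
names(with_transmute)  # "speed"
```

In both cases `toy` itself is unchanged; the result has to be assigned to a name if we want to keep it.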

Let us zoom in on a few variables of interest.

In [3]:
my_flights <- select(flights, year:day, dep_time, arr_time, air_time, origin, dest)
head(my_flights)

year,month,day,dep_time,arr_time,air_time,origin,dest
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>
2013,1,1,517,830,227,EWR,IAH
2013,1,1,533,850,227,LGA,IAH
2013,1,1,542,923,160,JFK,MIA
2013,1,1,544,1004,183,JFK,BQN
2013,1,1,554,812,116,LGA,ATL
2013,1,1,554,740,150,EWR,ORD


Use the pipe `%>%` to create the table above. 

In [4]:
flights %>% select(year:day, dep_time, arr_time, air_time, origin, dest) %>% head()

year,month,day,dep_time,arr_time,air_time,origin,dest
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>
2013,1,1,517,830,227,EWR,IAH
2013,1,1,533,850,227,LGA,IAH
2013,1,1,542,923,160,JFK,MIA
2013,1,1,544,1004,183,JFK,BQN
2013,1,1,554,812,116,LGA,ATL
2013,1,1,554,740,150,EWR,ORD


The first version above selects the variables that we are interested in and saves the result in the object `my_flights`; the piped version prints the same rows without saving them.

Additional variables can be added using the `mutate()` function. We already have an `air_time` variable. Let us compute the total time for the flight by subtracting the time of departure `dep_time` from the time of arrival `arr_time`.

In [5]:
mutate(my_flights, total_time = arr_time - dep_time) %>%
    head()

year,month,day,dep_time,arr_time,air_time,origin,dest,total_time
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<int>
2013,1,1,517,830,227,EWR,IAH,313
2013,1,1,533,850,227,LGA,IAH,317
2013,1,1,542,923,160,JFK,MIA,381
2013,1,1,544,1004,183,JFK,BQN,460
2013,1,1,554,812,116,LGA,ATL,258
2013,1,1,554,740,150,EWR,ORD,186


Another way to do the same thing is to chain all the steps with the pipe:

In [6]:
flights %>% mutate(total_time = arr_time - dep_time) %>% 
            select(year, month, day, dep_time, arr_time, air_time, origin, dest, total_time) %>% head()

year,month,day,dep_time,arr_time,air_time,origin,dest,total_time
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<int>
2013,1,1,517,830,227,EWR,IAH,313
2013,1,1,533,850,227,LGA,IAH,317
2013,1,1,542,923,160,JFK,MIA,381
2013,1,1,544,1004,183,JFK,BQN,460
2013,1,1,554,812,116,LGA,ATL,258
2013,1,1,554,740,150,EWR,ORD,186


We notice something odd, though. When we subtract 5h 17m from 8h 30m we should get 3h 13m, i.e. 193 minutes. But instead we get 313 in the first row above.

The issue is that `dep_time` and `arr_time` are in the hour-minute notation, so you cannot add and subtract them like regular numbers. We should first convert these times into the number of minutes elapsed since midnight.

We want to add two new variables, `new_dep` and `new_arr`, but first we need to write a function that does the conversion. The function is given below; we'll learn how it works later in the semester. For now just think of it as a black box that converts times from one format to another.

In [7]:
hourmin2min <- function(hourmin) {
    min <- hourmin %% 100 # remainder after division by 100 (the minutes)
    hour <- (hourmin - min) %/% 100 # integer quotient after division by 100 (the hours)
    return(60*hour + min)
} 

Let us test the function on 530. That's 5h 30min, i.e., 330 minutes since midnight.

In [8]:
hourmin2min(530)

The `hourmin2min` function is **vectorized**: given a vector, it outputs a vector.

In [9]:
hourmin2min(c(430,530,630,730))
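For readers who want to peek inside the black box, the following sketch carries out the same arithmetic by hand for the first flight in the table (departure 517, arrival 830). The variable names are chosen here for illustration only.

```r
dep <- 517  # 5:17am in hour-minute notation
arr <- 830  # 8:30am in hour-minute notation

# %/% is integer division (the hours), %% is the remainder (the minutes)
dep_min <- 60 * (dep %/% 100) + dep %% 100  # 60 * 5 + 17 = 317
arr_min <- 60 * (arr %/% 100) + arr %% 100  # 60 * 8 + 30 = 510

naive   <- arr - dep          # 313, treats hour-minute codes as plain numbers
correct <- arr_min - dep_min  # 193, i.e. 3h 13m
```

Because `%/%` and `%%` work element by element, the same expressions also accept whole vectors, which is what makes the conversion vectorized.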

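Let us now create two new variables, `new_arr` and `new_dep`, obtained from `arr_time` and `dep_time` by converting them into minutes since midnight. The same `mutate()` call could also create a `total_time` column containing their difference, since later variables may refer to earlier ones; here we do that in a separate step first.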

In [10]:
my_flights_new <- mutate(my_flights, new_arr = hourmin2min(arr_time), new_dep = hourmin2min(dep_time))
head(my_flights_new)

year,month,day,dep_time,arr_time,air_time,origin,dest,new_arr,new_dep
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
2013,1,1,517,830,227,EWR,IAH,510,317
2013,1,1,533,850,227,LGA,IAH,530,333
2013,1,1,542,923,160,JFK,MIA,563,342
2013,1,1,544,1004,183,JFK,BQN,604,344
2013,1,1,554,812,116,LGA,ATL,492,354
2013,1,1,554,740,150,EWR,ORD,460,354


Now we can subtract the departure time `new_dep` from the arrival time `new_arr` to get a new variable `total_time`.

In [11]:
my_flights_total <- mutate(my_flights_new, total_time = new_arr - new_dep)
head(my_flights_total)

year,month,day,dep_time,arr_time,air_time,origin,dest,new_arr,new_dep,total_time
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
2013,1,1,517,830,227,EWR,IAH,510,317,193
2013,1,1,533,850,227,LGA,IAH,530,333,197
2013,1,1,542,923,160,JFK,MIA,563,342,221
2013,1,1,544,1004,183,JFK,BQN,604,344,260
2013,1,1,554,812,116,LGA,ATL,492,354,138
2013,1,1,554,740,150,EWR,ORD,460,354,106


Let us do the same thing in a single command using the pipe. Here we keep only the following variables: `dep_time`, `arr_time`, `new_dep`, `new_arr`, and `total_time`.  

In [12]:
mutate(flights, new_arr = hourmin2min(arr_time), new_dep = hourmin2min(dep_time),
       total_time = new_arr - new_dep) %>%
    select(dep_time, arr_time, new_dep, new_arr, total_time) %>% head()

dep_time,arr_time,new_dep,new_arr,total_time
<int>,<int>,<dbl>,<dbl>,<dbl>
517,830,317,510,193
533,850,333,530,197
542,923,342,563,221
544,1004,344,604,260
554,812,354,492,138
554,740,354,460,106


## Up Next - Summarize Function and Case Study on Data Manipulation

Let's now review some tools that we have learnt.  Specifically, we have learnt the functions:
* `filter`
* `arrange`
* `select` 

### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>How many flights were there in 
        months beginning with the letter <code>J</code>?</td>
        <td>How many flights departed on a Monday?</td>
    </tr>
<tr><td>

1. 27,004
2. 57,668
3. 84,672
4. 93,101

</td><td>

1. 46,537
2. 51,812
3. 80,100
4. 101,991

</td>
    </tr></table>

In [11]:
nrow(filter(flights, month %in% which(substr(month.name, 1, 1) == "J")))



### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
What proportion of the flights have a missing departure time?
        </td>
        <td>
Of all the flights that departed in the first week of January, how many have a missing departure time?
        </td>
    </tr>
<tr><td>

1. 0.003
2. 0.025
3. 0.081
4. 0.105


</td><td>

1. None
2. 35
3. 101
4. 6,064

</td>
    </tr></table>
    

In [3]:
#  Beginner question
print(filter(flights, is.na(dep_time)))
8255 / nrow(flights)

# A tibble: 8,255 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
 3  2013     1     1       NA           1500        NA       NA           1825
 4  2013     1     1       NA            600        NA       NA            901
 5

In [4]:
# Advanced question
nrow(filter(flights, is.na(dep_time), month==1, between(day,1,7)))


### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
        What time did the first flight depart on the last day in February?
        </td>
<td>
 
 Of all the flights that departed on or ahead of schedule in the first 15 days of any month, which one was in the air for the *second* shortest amount of time?

</td>
    </tr>
<tr><td>

1. 4:15am
2. 4:57am
3. 5:01am
4. 5:40am

</td><td>

1. EV 4118
2. HA 51
3. EV 4631
4. EV 4619

</td>
    </tr></table>

In [5]:
# Beginner
answer <- arrange(filter(flights, month == 2),
      desc(day),
      dep_time)
print(slice(answer, 1:5))

# A tibble: 5 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     2    28      457            500        -3      639            648
2  2013     2    28      458            501        -3      748            800
3  2013     2    28      522            530        -8      832            831
4  2013     2    28      539            540        -1      836            850
5  2013     2    28      540            545        -5     1015           1023
# … with 11 mor

### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
        How many columns in flights contain the word time?
        </td>
<td>
    
How many column names in flights do *not* contain `s`?

</td>
    </tr>
<tr>
    </tr></table>

In [6]:
# Beginner
ncol(select(flights, contains("time")))
select(flights, contains("time"))

dep_time,sched_dep_time,arr_time,sched_arr_time,air_time,time_hour
<int>,<int>,<int>,<int>,<dbl>,<dttm>
517,515,830,819,227,2013-01-01 05:00:00
533,529,850,830,227,2013-01-01 05:00:00
542,540,923,850,160,2013-01-01 05:00:00
544,545,1004,1022,183,2013-01-01 05:00:00
554,600,812,837,116,2013-01-01 06:00:00
554,558,740,728,150,2013-01-01 05:00:00
555,600,913,854,158,2013-01-01 06:00:00
557,600,709,723,53,2013-01-01 06:00:00
557,600,838,846,140,2013-01-01 06:00:00
558,600,753,745,138,2013-01-01 06:00:00


In [7]:
# Advanced
ncol(flights)-ncol(select(flights, contains("s")))

## Question:  What days of the year, and at what airport, are the busiest for flying?

Let's think about the table we would want to have in order to answer this question. Ideally,
it would look something like this:

    # A tibble: 1,095 x 4
       month   day airport n_sched_departures
       <int> <int> <chr>                <int>
     1     1     1 EWR                    305
     2     1     1 JFK                    297
     3     1     1 LGA                    240
     4     1     2 EWR                    350
     5     1     2 JFK                    321
     6     1     2 LGA                    272
     7     1     3 EWR                    336
     8     1     3 JFK                    318
     9     1     3 LGA                    260
    10     1     4 EWR                    339
    # … with 1,085 more rows

The table we are given has ~337k rows, one for each flight. How do we go from the `flights` table to the one shown above?

## Summaries
`summarize()` can be used to summarize entire data frames by collapsing them into single number summaries. The syntax is:
```{r}
summarize(<grouped tibble>,
          <new variable> = <formula for new variable>,
          <other new variable> = <other formula>)
```

The most basic use of summarize is to compute statistics over the whole data set. 

Let us calculate the average departure delay `dep_delay`. 

In [8]:
tbl <- filter(flights, !is.na(dep_delay))
print(summarize(tbl, delay = mean(dep_delay)))

# A tibble: 1 x 1
  delay
  <dbl>
1  12.6


## Grouping observations
`summarize()` is most useful when combined with `group_by()` to group observations before calculating the summary statistic. The `group_by` function tells `R` how your data are grouped:

In [9]:
?group_by
tbl <- group_by(flights, month)
# print(tbl)
tbl1 <- filter(tbl, !is.na(dep_delay))
print(summarize(tbl1, delay = mean(dep_delay), 
                total_delay = sum(dep_delay)))

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 12 x 3
   month delay total_delay
   <int> <dbl>       <dbl>
 1     1 10.0       265801
 2     2 10.8       256251
 3     3 13.2       370001
 4     4 13.9       385554
 5     5 13.0       366658
 6     6 20.8       567729
 7     7 21.7       618916
 8     8 12.6       363715
 9     9  6.72      182327
10    10  6.24      178909
11    11  5.44      146945
12    12 16.6       449

`summarize()` applies the summary function to each group of data. Remember that it always returns **one row per group**.

In [10]:
print(summarize(group_by(flights, month), mean_dep_delay = mean(dep_delay, na.rm = TRUE)))

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 12 x 2
   month mean_dep_delay
   <int>          <dbl>
 1     1          10.0
 2     2          10.8
 3     3          13.2
 4     4          13.9
 5     5          13.0
 6     6          20.8
 7     7          21.7
 8     8          12.6
 9     9           6.72
10    10           6.24
11    11           5.44
12    12          16.6


It's as if `summarize()` filtered your data for each group, calculated the summary statistic, and
then combined all the results back into one table.

In [11]:
df <- filter(flights, month == 3)
mean(df$dep_delay, na.rm = TRUE)
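The "filter, summarize, combine" picture can be checked on a small example. This is only a sketch on a made-up tibble `toy` that stands in for `flights`; the values are invented so the means are easy to verify by hand.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(month     = c(1, 1, 2, 2, 2),
              dep_delay = c(10, NA, 0, 6, 3))

# Grouped summary: one row per month
by_month <- toy %>%
    group_by(month) %>%
    summarize(delay = mean(dep_delay, na.rm = TRUE))

# Manual version for month 2: filter first, then take the mean
manual <- mean(filter(toy, month == 2)$dep_delay, na.rm = TRUE)

by_month$delay[by_month$month == 2] == manual  # TRUE: both equal 3
```

Doing the manual version for every group and stacking the results would reproduce `by_month`; `group_by()` plus `summarize()` simply does this in one step.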

Many summary functions are available:

* Center: `mean(), median()`
* Spread: `sd(), IQR(), mad()`
* Range: `min(), max(), quantile()`
* Position: `first(), last(), nth()`
* Count: `n(), n_distinct()`
* Logical: `any(), all()`

### Examples
The `n()` function calculates the number of rows in each group:

Let us write some code to output the number of flights for each month.   Then output the number of flights in December.   

In [12]:
tbl <- group_by(flights, month)
print(summarize(tbl, n = n()))
nrow(filter(flights, month == 12))

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 12 x 2
   month     n
   <int> <int>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135


Now we are ready to generate the following table using the tools that we have learnt:

    # A tibble: 1,095 x 4
       month   day airport n_sched_departures
       <int> <int> <chr>                <int>
     1     1     1 EWR                    305
     2     1     1 JFK                    297
     3     1     1 LGA                    240
     4     1     2 EWR                    350
     5     1     2 JFK                    321
     6     1     2 LGA                    272
     7     1     3 EWR                    336
     8     1     3 JFK                    318
     9     1     3 LGA                    260
    10     1     4 EWR                    339
    # … with 1,085 more rows

Use this table to answer the question: which day of the year is busiest, and at what airport?

In [13]:
tbl <- group_by(flights, month, day, origin) 
tbl1 <- summarize(tbl, n_sched_dep = n())
tbl2 <- arrange(tbl1, desc(n_sched_dep))
tbl3 <- group_by(tbl1, origin)
print(tbl1)
print(tbl2)
print(tbl3)
top_n(tbl3, 3, n_sched_dep)

`summarise()` regrouping output by 'month', 'day' (override with `.groups` argument)

# A tibble: 1,095 x 4
# Groups:   month, day [365]
   month   day origin n_sched_dep
   <int> <int> <chr>        <int>
 1     1     1 EWR            305
 2     1     1 JFK            297
 3     1     1 LGA            240
 4     1     2 EWR            350
 5     1     2 JFK            321
 6     1     2 LGA            272
 7     1     3 EWR            336
 8     1     3 JFK            318
 9     1     3 LGA            260
10     1     4 EWR            339
# … with 1,085 more rows
# A tibble: 1,095 x 4
# Groups:   month, day [365]
   month   day origin n_sched_dep
   <int> <int> <chr>

month,day,origin,n_sched_dep
<int>,<int>,<chr>,<int>
4,11,EWR,376
4,15,EWR,377
4,18,EWR,376
7,10,JFK,331
7,11,JFK,332
7,12,JFK,331
9,9,LGA,345
9,12,LGA,345
9,13,LGA,346
9,16,LGA,345
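Note that the result above has four rows for LGA even though we asked for the top 3: `top_n()` works within each group and keeps all rows that tie at the cutoff. The sketch below illustrates this on a made-up tibble `toy` (the counts are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data: three days at each of two airports
toy <- tibble(origin      = c("EWR", "EWR", "EWR", "LGA", "LGA", "LGA"),
              n_sched_dep = c(300, 350, 320, 200, 240, 240))

# Top 1 per airport; the two LGA rows tie at 240, so both are kept
top1 <- toy %>%
    group_by(origin) %>%
    top_n(1, n_sched_dep)

nrow(top1)  # 3 rows: one for EWR, two for LGA
```

Without the `group_by(origin)` step, `top_n()` would rank all rows together and return only the single EWR row with 350 departures.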


#### A shortcut
`summarize(n = n())` on a grouped tibble occurs so often that there is a shortcut for it: the function `count()`.
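As a minimal sketch of the shortcut, the code below compares the long form with `count()` on a made-up tibble `toy` (a stand-in for `flights` with invented values).

```r
library(dplyr)

# Hypothetical toy data: six "flights" spread over three months
toy <- tibble(month = c(1, 1, 2, 3, 3, 3))

# The long way: group, then count rows per group
long_way <- toy %>% group_by(month) %>% summarize(n = n())

# The shortcut: count() groups by month and counts in one step
shortcut <- toy %>% count(month)

all.equal(as.data.frame(long_way), as.data.frame(shortcut))  # TRUE
```

`count()` also accepts `sort = TRUE`, which orders the result from the largest count to the smallest.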

Let us output the number of flights for each month again.  Previously we used the following code, shown here both without and with the pipe:

In [14]:
tbl <- group_by(flights, month)
print(summarize(tbl, n = n()))
flights %>% group_by(month) %>% summarize(n = n()) %>% print()

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 12 x 2
   month     n
   <int> <int>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135


`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 12 x 2
   month     n
   <int> <int>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135


### Question -- output the number of flights for each carrier

In [15]:
top_n(summarize(group_by(flights, carrier), n = n()), 5)

`summarise()` ungrouping output (override with `.groups` argument)

Selecting by n



carrier,n
<chr>,<int>
AA,32729
B6,54635
DL,48110
EV,54173
UA,58665


### Exercise

Use `summarize()`, `count()`, `filter()`, `arrange()` and/or `top_n()` to answer:

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
        Which plane (tail number) flew the most flights in July?
        </td>
    <td>
        How many planes flew at least one flight in January, but none in February?
        </td>
    </tr>
    <tr>
<td></td>
</tr>
</table>

In [16]:
# Beginner
tbl0 <- filter(flights, month == 7, !is.na(tailnum))
#print(tbl0)

tbl1 <- count(tbl0, tailnum)
print(tbl1)

top_n(tbl1, 1)

# A tibble: 3,215 x 2
   tailnum     n
   <chr>   <int>
 1 D942DN      1
 2 N0EGMQ     11
 3 N10156      7
 4 N102UW      6
 5 N103US      6
 6 N104UW      8
 7 N10575     17
 8 N105UW      5
 9 N107US      9
10 N108UW      7
# … with 3,205 more rows


Selecting by n



tailnum,n
<chr>,<int>
N298JB,76
