# Lecture 3.2:  Data transformation

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Continue to learn [how to manipulate data](#Data-manipulation), including:
    * Pipes
    * Adding New Variables
    
* We will go through the data set `flights`
    
We will answer questions such as:     
* What days of the year / week are the busiest for flying?    
    
This lecture note corresponds to Chapter 5.5 of your book.
    
    
</div>


Let us load up the `tidyverse` and `nycflights13` packages.



We will start with the `flights` data set that we used in the previous lecture. 

    library(tidyverse)
    library(nycflights13)
    head(flights)

    ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

    ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
    ✔ tibble  3.1.3     ✔ dplyr   1.0.7
    ✔ tidyr   1.1.3     ✔ stringr 1.4.0
    ✔ readr   2.0.1     ✔ forcats 0.5.1

    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()



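| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dttm>` |
| 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | 762 | 6 | 0 | 2013-01-01 06:00:00 |
| 2013 | 1 | 1 | 554 | 558 | -4 | 740 | 728 | 12 | UA | 1696 | N39463 | EWR | ORD | 150 | 719 | 5 | 58 | 2013-01-01 05:00:00 |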


# Pipes
Starting now, we will make extensive use of the pipe operator `%>%`. 

### How `%>%` works
Under the hood, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. We can use `%>%` on any function, not just those defined in tidyverse.
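As a small sketch of this rewriting rule, the snippet below compares piped and nested calls using base R functions only. The vector `x` and the result names are made up for illustration.

```{r}
library(tidyverse)  # attaches the pipe operator %>%

x <- c(1.234, 5.678)

# x %>% f(y) is rewritten as f(x, y)
piped_once  <- x %>% round(1)
nested_once <- round(x, 1)
identical(piped_once, nested_once)

# x %>% f(y) %>% g(z) is rewritten as g(f(x, y), z)
piped_twice  <- x %>% round(1) %>% paste("cm")
nested_twice <- paste(round(x, 1), "cm")
identical(piped_twice, nested_twice)
```

Both comparisons should return `TRUE`, since the piped and nested forms are the same function calls written in a different order.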

Here is an example of printing "hello world" using the pipe. 

We will see the usefulness of the pipe `%>%` later in the lecture, as it greatly simplifies our code.
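A minimal version of such an example might look like this; the variable name `greeting` is only for illustration.

```{r}
library(tidyverse)  # attaches the pipe operator %>%

greeting <- "hello world"

# Equivalent to print(greeting)
greeting %>% print()
```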

# Adding New Variables
The `dplyr`/`tidyverse` package offers the `mutate()` and `transmute()` commands to add new variables to data tibbles. The syntax is:
```{r}
<tibble> %>% mutate(<new variable> = <formula for new variable>,
                    <other new variable> = <other formula>)
```
This returns a copy of `<tibble>` with the new variables added on. `transmute()` does the same thing as `mutate()` but only keeps the new variables.

Let us zoom in on a few variables of interest.
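One way to do this is with `select()` from the previous lecture. The particular columns chosen below are an assumption, picked because the rest of this section works with `dep_time`, `arr_time`, and `air_time`.

```{r}
library(tidyverse)
library(nycflights13)

# Keep only a handful of columns (this specific choice is an assumption)
my_flights <- select(flights, year, month, day, dep_time, arr_time, air_time)
head(my_flights)
```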

Use the pipe `%>%` to create the table above. 

The code above selects the variables that we are interested in and saves them in the object `my_flights`.  
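A sketch of the piped version follows. As before, the set of selected columns is an assumption made for illustration.

```{r}
library(tidyverse)
library(nycflights13)

# flights %>% select(...) is the same as select(flights, ...)
my_flights <- flights %>%
  select(year, month, day, dep_time, arr_time, air_time)
head(my_flights)
```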

Additional variables can be added using the `mutate()` function. We already have an `air_time` variable. Let us compute the total time for the flight by subtracting the time of departure `dep_time` from the time of arrival `arr_time`.

Another way to do the same thing is to use the pipe twice:
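A minimal sketch of this step, assuming `my_flights` holds the columns selected earlier (the column choice and the name `with_total` are assumptions):

```{r}
library(tidyverse)
library(nycflights13)

my_flights <- select(flights, year, month, day, dep_time, arr_time, air_time)

# Naive difference of the raw HHMM values
with_total <- mutate(my_flights, total_time = arr_time - dep_time)
head(with_total)
```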
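For instance, the selection and the new column can be chained in a single pipeline. The selected columns and the name `with_total` are again assumptions.

```{r}
library(tidyverse)
library(nycflights13)

with_total <- flights %>%
  select(year, month, day, dep_time, arr_time, air_time) %>%  # first pipe: pick columns
  mutate(total_time = arr_time - dep_time)                    # second pipe: add a column
head(with_total)
```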

We notice something odd though. When we subtract 5h 17m from 8h 30m we should get 3h 13m, i.e. 193 minutes. But instead we get 313 minutes below.

The issue is that `dep_time` and `arr_time` are in the hour-minute notation, so you cannot add and subtract them like regular numbers. We should first convert these times into the number of minutes elapsed since midnight.

We want to add two new variables, `new_dep` and `new_arr`, but first we need to write a function that does the conversion. The function is given below; we'll learn how it works later in the semester. For now, just think of it as a black box that converts times from one format to another.
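A minimal sketch of such a function, assuming the times are stored as numbers in `HHMM` form (for example, 517 for 5:17am). This is one plausible implementation, not necessarily the one used in lecture.

```{r}
# Assumed implementation: convert HHMM values to minutes since midnight
hourmin2min <- function(hourmin) {
  hours   <- hourmin %/% 100  # integer division drops the last two digits
  minutes <- hourmin %% 100   # remainder keeps the last two digits
  60 * hours + minutes
}
```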

Let us test the function on 530. That's 5h 30min, i.e., 330 minutes since midnight.

The `hourmin2min` function is **vectorized**: given a vector, it outputs a vector.
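Both points can be checked directly. The definition of `hourmin2min` below is an assumed implementation, repeated here so the snippet runs on its own.

```{r}
# Assumed implementation: convert HHMM values to minutes since midnight
hourmin2min <- function(hourmin) 60 * (hourmin %/% 100) + hourmin %% 100

# A single time: 5h 30min
single <- hourmin2min(530)
single    # 330

# A vector of times gives a vector of results
several <- hourmin2min(c(517, 830, 2359))
several   # 317 510 1439
```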

Let us now create two new variables obtained from `arr_time` and `dep_time` by converting them into minutes since midnight. In the same command, we can also create a new `total_time` column containing their difference.
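A sketch of that single `mutate()` call follows. The `hourmin2min` definition and the columns kept in `my_flights` are assumptions made so the example is self-contained.

```{r}
library(tidyverse)
library(nycflights13)

# Assumed implementation: convert HHMM values to minutes since midnight
hourmin2min <- function(hourmin) 60 * (hourmin %/% 100) + hourmin %% 100

my_flights <- select(flights, year, month, day, dep_time, arr_time, air_time)

# Later arguments of mutate() can use columns created earlier in the same call
my_flights <- mutate(my_flights,
                     new_dep    = hourmin2min(dep_time),
                     new_arr    = hourmin2min(arr_time),
                     total_time = new_arr - new_dep)
head(my_flights)
```

Note that a flight landing after midnight has a smaller `arr_time` than `dep_time`, so this simple difference comes out negative for such flights.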

Now we can subtract the departure time `new_dep` from the arrival time `new_arr` to get a new variable `total_time`.

Let us do the same thing with the pipe, in a single chained command. Here we are only interested in the following variables: `dep_time`, `arr_time`, `new_dep`, `new_arr`, and `total_time`.  
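One possible pipeline is sketched below. The `hourmin2min` definition is an assumed implementation, and the name `times_only` is made up for illustration.

```{r}
library(tidyverse)
library(nycflights13)

# Assumed implementation: convert HHMM values to minutes since midnight
hourmin2min <- function(hourmin) 60 * (hourmin %/% 100) + hourmin %% 100

times_only <- flights %>%
  mutate(new_dep    = hourmin2min(dep_time),
         new_arr    = hourmin2min(arr_time),
         total_time = new_arr - new_dep) %>%
  select(dep_time, arr_time, new_dep, new_arr, total_time)
head(times_only)
```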

## Up Next - Summarize Function and Case Study on Data Manipulation

Let's now review some tools that we have learnt.  Specifically, we have learnt the functions:
* `filter`
* `arrange`
* `select` 

### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>How many flights were there in 
        months beginning with the letter <code>J</code>?</td>
        <td>How many flights departed on a Monday?</td>
    </tr>
<tr><td>

1. 27,004
2. 57,668
3. 84,672
4. 93,101

</td><td>

1. 46,537
2. 51,812
3. 80,100
4. 101,991

</td>
    </tr></table>

### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
What proportion of the flights have a missing departure time?
        </td>
        <td>
Of all the flights that departed in the first week of January, how many have a missing departure time?
        </td>
    </tr>
<tr><td>

1. 0.003
2. 0.025
3. 0.081
4. 0.105


</td><td>

1. None
2. 35
3. 101
4. 6,064

</td>
    </tr></table>
    

### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
        What time did the first flight depart on the last day in February?
        </td>
<td>
 
 Of all the flights that departed on or ahead of schedule in the first 15 days of any month, which one was in the air for the *second* shortest amount of time?

</td>
    </tr>
<tr><td>

1. 4:15am
2. 4:57am
3. 5:01am
4. 5:40am

</td><td>

1. EV 4118
2. HA 51
3. EV 4631
4. EV 4619

</td>
    </tr></table>

### Exercise

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
        How many columns in <code>flights</code> contain the word <code>time</code>?
        </td>
<td>
    
How many column names in flights do *not* contain `s`?

</td>
    </tr>
<tr>
    </tr></table>

## Question:  What days of the year / at what airport are the busiest for flying?

Let's think about the table we would want to have in order to answer this question. Ideally,
it would look something like this:

    # A tibble: 1,095 x 4
       month   day airport n_sched_departures
       <int> <int> <chr>                <int>
     1     1     1 EWR                    305
     2     1     1 JFK                    297
     3     1     1 LGA                    240
     4     1     2 EWR                    350
     5     1     2 JFK                    321
     6     1     2 LGA                    272
     7     1     3 EWR                    336
     8     1     3 JFK                    318
     9     1     3 LGA                    260
    10     1     4 EWR                    339
    # … with 1,085 more rows

The table we are given has ~337k rows, one for each flight. How do we go from the `flights` table to the one shown above?

## Summaries
`summarize()` can be used to summarize entire data frames by collapsing them into single number summaries. The syntax is:
```{r}
summarize(<grouped tibble>,
          <new variable> = <formula for new variable>,
          <other new variable> = <other formula>)
```

The most basic use of summarize is to compute statistics over the whole data set. 

Let us calculate the average time for departure delay `dep_delay`. 
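A minimal sketch: the column name `mean_dep_delay` is made up, and `na.rm = TRUE` is needed because `dep_delay` is missing for some flights.

```{r}
library(tidyverse)
library(nycflights13)

# Collapse the whole table into a single row
overall <- summarize(flights, mean_dep_delay = mean(dep_delay, na.rm = TRUE))
overall
```

Without `na.rm = TRUE`, the result would be `NA`, because `mean()` returns `NA` whenever its input contains missing values.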

## Grouping observations
`summarize()` is most useful when combined with `group_by()` to group observations before calculating the summary statistic. The `group_by` function tells `R` how your data are grouped:
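For example, grouping by month might look like the sketch below. The name `by_month` is an assumption; `group_vars()` and `n_groups()` are `dplyr` helpers for inspecting the grouping.

```{r}
library(tidyverse)
library(nycflights13)

# Grouping does not change the rows; it only records how they are grouped
by_month <- group_by(flights, month)

group_vars(by_month)  # the grouping variable(s)
n_groups(by_month)    # number of distinct groups
```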

`summarize()` applies the summary function to each group of data. Remember that it always returns **one row per group**.

It's as if `summarize()` filtered your data for each group, calculated the summary statistic, and
then combined all the results back into one table.
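To illustrate this "filter, summarize, combine" picture, the sketch below computes the mean departure delay per month and then checks January by hand. All object and column names are made up for the example.

```{r}
library(tidyverse)
library(nycflights13)

# One row per month
monthly <- flights %>%
  group_by(month) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# The same statistic for January, computed by filtering first
jan_by_hand <- flights %>%
  filter(month == 1) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# The January row of the grouped summary matches the hand-computed value
all.equal(monthly$mean_dep_delay[monthly$month == 1],
          jan_by_hand$mean_dep_delay)
```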

Many summary functions are available:

* Center: `mean(), median()`
* Spread: `sd(), IQR(), mad()`
* Range: `min(), max(), quantile()`
* Position: `first(), last(), nth()`
* Count: `n(), n_distinct()`
* Logical: `any(), all()`
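Several of these can be combined in one `summarize()` call. The sketch below groups by `origin`; the output column names are assumptions chosen for readability.

```{r}
library(tidyverse)
library(nycflights13)

by_origin <- flights %>%
  group_by(origin) %>%
  summarize(
    median_delay = median(dep_delay, na.rm = TRUE),  # center
    sd_delay     = sd(dep_delay, na.rm = TRUE),      # spread
    max_delay    = max(dep_delay, na.rm = TRUE),     # range
    first_dep    = first(dep_time),                  # position
    n_flights    = n(),                              # count
    n_dest       = n_distinct(dest),                 # count of distinct values
    any_missing  = any(is.na(dep_time))              # logical
  )
by_origin
```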

### Examples
The `n()` function calculates the number of rows in each group:
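A short sketch, counting rows per carrier (the names `per_carrier` and `n_flights` are made up):

```{r}
library(tidyverse)
library(nycflights13)

# n() takes no arguments; it counts the rows in the current group
per_carrier <- flights %>%
  group_by(carrier) %>%
  summarize(n_flights = n())
per_carrier
```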

Let us write some code to output the number of flights for each month.   Then output the number of flights in December.   
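One way to do this is sketched below; the names `flights_per_month` and `n_flights` are assumptions.

```{r}
library(tidyverse)
library(nycflights13)

# Number of flights for each month
flights_per_month <- flights %>%
  group_by(month) %>%
  summarize(n_flights = n())
flights_per_month

# Number of flights in December only
december <- flights_per_month %>% filter(month == 12)
december
```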

Now we are ready to generate the following table using the tools that we have learnt:

    # A tibble: 1,095 x 4
       month   day airport n_sched_departures
       <int> <int> <chr>                <int>
     1     1     1 EWR                    305
     2     1     1 JFK                    297
     3     1     1 LGA                    240
     4     1     2 EWR                    350
     5     1     2 JFK                    321
     6     1     2 LGA                    272
     7     1     3 EWR                    336
     8     1     3 JFK                    318
     9     1     3 LGA                    260
    10     1     4 EWR                    339
    # … with 1,085 more rows
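A sketch of one way to build this table, assuming that `airport` is the `origin` column under a new name and that each row of `flights` counts as one scheduled departure. The object name `busy` is made up.

```{r}
library(tidyverse)
library(nycflights13)

busy <- flights %>%
  group_by(month, day, airport = origin) %>%              # one group per day and airport
  summarize(n_sched_departures = n(), .groups = "drop")   # count rows, then ungroup
busy
```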

Use this table to answer the question: which day of the year is busiest, and at what airport?
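Sorting the table in descending order of departures puts the busiest day and airport in the first row. The sketch below rebuilds the table so that it runs on its own; the names `busy` and `busiest` are assumptions.

```{r}
library(tidyverse)
library(nycflights13)

busy <- flights %>%
  group_by(month, day, airport = origin) %>%
  summarize(n_sched_departures = n(), .groups = "drop")

# Largest counts first; keep only the top row
busiest <- busy %>%
  arrange(desc(n_sched_departures)) %>%
  head(1)
busiest
```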

#### A shortcut
Grouping and then calling `summarize(n = n())` occurs so often that there is a shortcut for it: the `count()` function.

Let us try to output the number of flights for each month again.  Previously we used the following code to do it:
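The sketch below shows the longer form next to the `count()` shortcut; the object names are made up for illustration.

```{r}
library(tidyverse)
library(nycflights13)

# Long form: group, then count rows in each group
long_form <- flights %>%
  group_by(month) %>%
  summarize(n = n())

# Shortcut: count() groups by month and counts rows in one step
short_form <- flights %>% count(month)

all.equal(long_form$n, short_form$n)
```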

### Question -- output the number of flights for each carrier
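One possible answer uses `count()` directly (the name `per_carrier` is made up):

```{r}
library(tidyverse)
library(nycflights13)

# One row per carrier, with the number of flights in column n
per_carrier <- flights %>% count(carrier)
per_carrier
```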

### Exercise

Use `summarize()`, `count()`, `filter()`, `arrange()` and/or `top_n()` to answer:

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
        Which plane (tail number) flew the most flights in July?
        </td>
    <td>
        How many planes flew at least one flight in January, but none in February?
        </td>
    </tr>
    <tr>
<td></td>
</tr>
</table>