# Lecture 4.1: More about Data Transformation

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Consolidate knowledge we have learned so far
* Learn more about `group_by` and `summarize`
    
</div>


In [1]:
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.3     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



Thus far, we have learnt several functions that are useful for data transformation:

- `filter`, keep the rows that meet some criteria
- `select`, keep the columns of interest
- `mutate`, create new variables
- `summarize`, compute a summary statistic of a variable

<div style="border: 1px double black; padding: 10px; margin: 10px">

**One student pointed out that a statement in Lecture 3.2 was not precise.**
 
* Previously the statement was the following:

`summarize()` applies the summary function to each group of data. Remember that it always returns **one row per group**.

* The more precise statement is:

`summarize()` applies the summary function to each group of data. It returns one row per group when every function in the arguments of `summarize()` returns a single value for each group.
</div>




<div style="border: 1px double black; padding: 10px; margin: 10px">

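**Why does the `slice` function not work as intended after grouping?**
 
On a grouped table, `slice()` selects rows by position *within each group*. So `slice(1:k)` returns the first `k` rows of every group, not the first `k` rows of the table as a whole. If the table is still grouped when we call `slice`, we can get back many more rows than we intended, possibly every row of every group.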

    
Instead, we should use the `ungroup` function to ungroup our data before applying `slice`. Here is the same example that we have gone through in Lecture 3.2. 
</div>
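As a small illustration of this behavior, here is a sketch on a made-up table. The names `toy`, `airport`, and `n_flights` are hypothetical and are not taken from the lecture data.

```r
library(dplyr)

# Hypothetical daily flight counts for two airports
toy <- tibble(
  airport   = c("A", "A", "B", "B"),
  day       = c(1, 2, 1, 2),
  n_flights = c(10, 30, 20, 5)
)

# Still grouped: slice(1) returns the first row of EACH group (2 rows)
per_group <- toy %>%
  group_by(airport) %>%
  arrange(desc(n_flights)) %>%
  slice(1)

# Ungrouped first: slice(1) returns the single busiest row overall (1 row)
overall <- toy %>%
  group_by(airport) %>%
  ungroup() %>%
  arrange(desc(n_flights)) %>%
  slice(1)
```

Comparing `nrow(per_group)` with `nrow(overall)` shows the difference: the grouped version gives one row per airport, the ungrouped version gives one row in total.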




#### Which day of the year is busiest, and at what airport?

## Some Useful Functions in R for Data Transformation

R provides several built-in vectorized functions that can be combined to build more complicated functions. These include:

* **Arithmetic operators** `+, -, *, /, ^`
* **Modular arithmetic operators** `%/%` and `%%` 
* **Logarithms** `log()`, `log10()`, `log2()`
* **Offsets** `lag()` and `lead()`

To do a regular division, we use `/`. To do an integer division, we use `%/%`. Integer division is a division in which the fractional part of the quotient is discarded.

In [2]:
4 / 3   
4 %/% 3 

Sometimes you will find the modulo operator `%%` useful. It returns the remainder of an integer division (not the fractional part); for example, `7 %% 5` is `2`.

In [3]:
1%%5 

In [4]:
3%%5

In [5]:
5%%5

In [8]:
1:20 %% 5

Here the shorter argument, `5`, is recycled to match the length of the longer argument, `1:20`.
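One common use of the two operators together is splitting a time stored as a single `HHMM` number into hours and minutes. The sketch below uses a made-up vector `dep_time` (hypothetical values, chosen only for illustration).

```r
# Hypothetical departure times in HHMM format (e.g. 517 means 5:17)
dep_time <- c(517, 1545, 2359)

hour   <- dep_time %/% 100  # integer division keeps the hours: 5, 15, 23
minute <- dep_time %% 100   # the remainder keeps the minutes: 17, 45, 59

# Recycling: the single 5 is reused for every element of 1:20
remainders <- 1:20 %% 5
```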

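You may also find the functions `lag()` and `lead()` useful. With `dplyr` loaded, `lag()` shifts a vector so that each position holds the previous value, and `lead()` shifts it so that each position holds the next value. This makes it easy to compute changes between consecutive observations.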
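A minimal sketch of the two offset functions on a made-up vector `x` (the variable names here are only for illustration):

```r
library(dplyr)

x <- c(1, 3, 6, 10)

previous <- lag(x)       # NA  1  3  6  (each position holds the previous value)
following <- lead(x)     #  3  6 10 NA  (each position holds the next value)
change <- x - lag(x)     # NA  2  3  4  (change from the previous observation)
```

The `NA` appears wherever there is no previous (or next) value to borrow.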

We also have:

* **Logical comparisons** `==, !=, <, <=, >, >=`
* **Cumulative aggregates** `cumsum(), cumprod(), cummin(), cummax()` (`dplyr` also provides `cummean()`)
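To make these concrete, here is a short sketch on a made-up vector `rain` (hypothetical values):

```r
# Hypothetical daily rainfall in millimetres
rain <- c(2, 0, 5, 1)

running_total <- cumsum(rain)  # 2 2 7 8
running_max   <- cummax(rain)  # 2 2 5 5

# A logical comparison is vectorized too: one TRUE/FALSE per element
wet_day <- rain > 0            # TRUE FALSE TRUE TRUE
```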

## Ranking functions
Sometimes we want to *rank* our data by assigning integers for 1st place, 2nd place, and so on. The functions `dense_rank()`, `min_rank()`, and `row_number()` can be used for this purpose.

Note the differences in behavior: 
- The rankings from `dense_rank()` never have gaps.
- The rankings from `min_rank()` skip over 3rd place (because we have two entries tied for 2nd).
- The rankings from `row_number()` break ties by order of appearance, so the first 4.0 GPA gets ranked 5th, and the second 4.0 GPA gets ranked 6th.

By default, the ranking functions rank lowest first. If we want to reverse that and assign rank 1 to the highest entry, we can combine the ranking function with the `desc()` function, e.g. `min_rank(desc(x))`.
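The three ranking functions can be compared side by side on a small vector. The `gpa` values below are hypothetical, chosen so that two entries tie for 2nd place and two 4.0 entries tie at the top, matching the description above.

```r
library(dplyr)

# Hypothetical GPAs, already sorted so the ranks are easy to read
gpa <- c(2.5, 3.0, 3.0, 3.5, 4.0, 4.0)

r_dense <- dense_rank(gpa)       # 1 2 2 3 4 4  (no gaps)
r_min   <- min_rank(gpa)         # 1 2 2 4 5 5  (3rd place is skipped)
r_row   <- row_number(gpa)       # 1 2 3 4 5 6  (ties broken by order of appearance)

# Highest GPA first
r_desc  <- min_rank(desc(gpa))   # 6 4 4 3 1 1
```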

## More Exercises on Summary Functions




Many summarization functions are available:

* Center: `mean(), median()`
* Spread: `sd(), IQR(), mad()`
* Range: `min(), max(), quantile()`
* Position: `first(), last(), nth()`
* Count: `n(), n_distinct()`
* Logical: `any(), all()`

Now, let us try to use some of the summarize functions to create a new table with the variables airports, total flights, mean distance, and standard deviation of the distance.  We want to sort the mean distance in descending order.   Let's try to guess which airport has the largest mean distance before we even proceed! 
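One possible way to build this table is sketched below. It assumes that "airport" means the origin airport (the `origin` column of `flights`); the output column names `total_flights`, `mean_distance`, and `sd_distance` are made up for this sketch.

```r
library(dplyr)
library(nycflights13)

airport_summary <- flights %>%
  group_by(origin) %>%                        # assumption: airport = origin airport
  summarize(
    total_flights = n(),                      # number of flights per airport
    mean_distance = mean(distance, na.rm = TRUE),
    sd_distance   = sd(distance, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_distance))                # largest mean distance first
```

Printing `airport_summary` lets you check your guess: the first row is the airport with the largest mean distance.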

## Using Pipe with ggplot

You can even plot the data by adding a `ggplot` command at the end after manipulating your data.

Let's try to create a table with the mean delay time for each month, and then plot it as a bar chart with one bar per month.
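A sketch of this pipeline, assuming that "delay" means departure delay (`dep_delay`); the names `monthly_delay` and `p_month` are only for illustration.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# One row per month with the mean departure delay
monthly_delay <- flights %>%
  group_by(month) %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))

# Pipe the summary table straight into ggplot(); layers are still added with +
p_month <- monthly_delay %>%
  ggplot(aes(x = factor(month), y = mean_delay)) +
  geom_col() +
  labs(x = "Month", y = "Mean departure delay (minutes)")
```

Note the switch from `%>%` to `+` once we are inside the `ggplot` call. `geom_col()` is used rather than `geom_bar()` because the bar heights are already computed.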

How about a bar chart of mean arrival delay by destination airport for the top 10 airports that have the highest traffic volume?  We will use `group_by`, `summarize`, `arrange`, `slice`, and `ggplot`.
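One way this could look, taking "traffic volume" to mean the number of flights to each destination (an assumption); the names `top10_delay` and `p_top10` are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

top10_delay <- flights %>%
  group_by(dest) %>%
  summarize(
    n_flights      = n(),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(n_flights)) %>%   # busiest destinations first
  slice(1:10)                    # the result is no longer grouped, so this is the overall top 10

# Bars ordered by mean arrival delay
p_top10 <- ggplot(top10_delay, aes(x = reorder(dest, mean_arr_delay), y = mean_arr_delay)) +
  geom_col() +
  labs(x = "Destination", y = "Mean arrival delay (minutes)")
```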

Now, let us try to get a scatter plot of airport distance vs. average arrival delay after grouping by destination airport. We will also overlay a smoothed trend line on the scatter plot.
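A sketch of this plot, using the mean distance to each destination as the "airport distance" (an assumption); `dest_summary` and `p_scatter` are made-up names.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

dest_summary <- flights %>%
  group_by(dest) %>%
  summarize(
    mean_distance  = mean(distance, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(!is.na(mean_arr_delay))   # drop destinations with no recorded arrival delay

p_scatter <- ggplot(dest_summary, aes(x = mean_distance, y = mean_arr_delay)) +
  geom_point() +    # one point per destination airport
  geom_smooth()     # smoothed trend drawn on top of the points
```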

# Find the flight with the worst delay for each day
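One possible approach combines `group_by()` with the ranking functions from earlier. This sketch assumes "worst delay" means the largest arrival delay (`arr_delay`); `worst_per_day` is a made-up name.

```r
library(dplyr)
library(nycflights13)

worst_per_day <- flights %>%
  group_by(year, month, day) %>%
  filter(min_rank(desc(arr_delay)) == 1) %>%   # rank 1 = largest arrival delay that day
  ungroup() %>%
  select(year, month, day, carrier, flight, origin, dest, arr_delay)
```

Flights with a missing `arr_delay` get a rank of `NA` and are dropped by `filter()`. If two flights tie for the worst delay on the same day, both are kept.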

# Summary thus far
Before we move on to the next part of the book, I want to spend some time summarizing and tying together the main ideas from the past few lectures. In chapter 5 we learned about five types of operations for altering data tibbles:
* `filter()`: keep the rows of a data table that satisfy certain logical conditions (and drop the rest).
* `select()`: keep *columns* in a data table by name, range, or logical conditions.
* `arrange()`: sort / reorder the rows of a data table.
* `mutate()`: generate new columns in a data table by applying functions to the existing ones.
* `group_by()` / `summarize()`: group rows together based on one or more variables, and compute summary statistics within each group.

#### `filter()` vs `select()`
Some students were mixing up the use of `filter()` and `select()`.

`filter()` selects the rows based on some specific criterion

`select()` selects the columns of your data set
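The difference is easiest to see from the shape of the result. The table `toy` below is hypothetical and is used only to illustrate the contrast.

```r
library(dplyr)

# Hypothetical mini flight table: 3 rows, 3 columns
toy <- tibble(
  carrier   = c("AA", "UA", "DL"),
  dep_delay = c(5, 40, -2),
  distance  = c(500, 1200, 800)
)

late_rows <- filter(toy, dep_delay > 0)        # fewer rows (2), same 3 columns
two_cols  <- select(toy, carrier, dep_delay)   # same 3 rows, fewer columns (2)
```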

### Common Error: backticks versus single and double quotes
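In R, backticks refer to a *name* (for example a column whose name contains spaces), while single or double quotes create a *character string*. The sketch below uses a made-up table `toy` with a column named like one of the columns in the billionaires data; the values are hypothetical.

```r
library(dplyr)

# Hypothetical table with a column name that contains spaces
toy <- tibble(name = c("A", "B", "C"), `worth in billions` = c(1.5, 12, 60))

# Backticks refer to the column, so this keeps the rows with worth above 10
rich <- filter(toy, `worth in billions` > 10)

# Quotes create a character string, so this compares the text itself with 10
# and never looks at the column; the condition is the same for every row
not_what_we_want <- filter(toy, "worth in billions" > 10)
```

The second call runs without an error, which is what makes this mistake easy to miss.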

### `=` versus `==`

Remember that `=` and `==` mean different things. The former is used for assignment and to pass keyword parameters to functions. The latter is used to test for equality and returns either `TRUE` or `FALSE`.
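A brief sketch of the three uses (variable names are only for illustration):

```r
x = 5                  # assignment: x now holds 5

is_five  <- x == 5     # equality test: TRUE
is_three <- x == 3     # equality test: FALSE

# `=` also passes a named argument to a function
rounded <- round(3.14159, digits = 2)   # 3.14
```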

### Vector versus column versus data table
There is particular confusion about when it is appropriate to use vectors, columns and data tables. We will be discussing these concepts at greater length in the coming weeks, but here are some essentials that you should know:

**Vectors** in R contain multiple values. You create vectors using the `c()` function. If you neglect to do this, R will produce an error and/or do the wrong thing. Some examples of this that I saw include:
```{r}
b = c(1, 3, 5, 2, 4)  # example data (hypothetical)
# a = factor(b, levels=1, 2, 3, 4, 5)    ## wrong: only 1 is passed as levels; 2..5 go to other arguments
# a = factor(b, levels=(1, 2, 3, 4, 5))  ## wrong: syntax error, parentheses alone do not build a vector
a = factor(b, levels=c(1, 2, 3, 4, 5))   ## correct
```

Vectors have a particular type, and all the entries of the vector must be of that same type; if they are not, R will convert them to a common type.
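A quick sketch of this conversion (the vectors are made up for illustration):

```r
nums  <- c(1, 2, 3)        # all numbers: a numeric vector
mixed <- c(1, "a", TRUE)   # mixed input: every entry becomes a character string

class(nums)    # "numeric"
class(mixed)   # "character"
mixed          # "1" "a" "TRUE"
```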

You can think of a data table as a list of vectors. Each column has its own vector. To access a vector of values stored in a column in R, we traditionally use the `$` operator:
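For instance, on a small made-up table (`toy` and its columns are hypothetical):

```r
# Hypothetical small table
toy <- data.frame(
  name = c("A", "B", "C"),
  age  = c(34, 58, 71)
)

ages <- toy$age          # a plain numeric vector: 34 58 71
mean_age <- mean(ages)   # ordinary vector functions work on it (about 54.3)
```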

If working inside one of the `dplyr` functions like `mutate()`, `filter()`, etc., the dataset is specified by the first parameter. So you don't need to use the `$` operator, just specify the column name:
```{r}
filter(flights, flights$arr_delay < 10)  # wrong (although it will work)
filter(flights, arr_delay < 10)  # correct
```

Even though they contain the same information, a column vector is *not the same* as a table containing only that column:
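The following sketch shows the distinction on a made-up table (`toy` is hypothetical):

```r
library(dplyr)

toy <- tibble(name = c("A", "B", "C"), age = c(34, 58, 71))

age_vector <- toy$age           # numeric vector of length 3
age_table  <- select(toy, age)  # tibble with 3 rows and 1 column

is.numeric(age_vector)     # TRUE
is.data.frame(age_table)   # TRUE
is.numeric(age_table)      # FALSE

vector_mean <- mean(age_vector)   # works on the vector
# mean(age_table) would return NA with a warning, because a table is not numeric
```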

## Visualizing Distributions


The file `bil.RData` contains a dataset on [billionaires](https://think.cs.vt.edu/corgis/csv/billionaires/billionaires.html): who they are, where they are from, how & when they made their fortune, etc.

## Visualizing discrete distributions
We already saw how to visualize the distribution of a discrete random variable: make a bar plot. For example, in the `bil` data set, `region` is categorical:

In [34]:
colnames(bil) %>% print

 [1] "age"               "category"          "citizenship"      
 [4] "company.name"      "company.type"      "country code"     
 [7] "founded"           "from emerging"     "gdp"              
[10] "gender"            "industry"          "inherited"        
[13] "name"              "rank"              "region"           
[16] "relationship"      "sector"            "was founder"      
[19] "was political"     "wealth.type"       "worth in billions"
[22] "year"             


Say we are interested in the distribution of the variable `region`. What should we plot to visualize this?

You see that there are NAs in the variable `region`. You could also combine this with what you have learnt by removing the NAs first before plotting.
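A sketch of both steps follows. Because `bil.RData` is not included here, the code uses a tiny hypothetical stand-in called `bil_toy`; with the real data you would start the pipeline from `bil` instead.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for `bil` (values are made up)
bil_toy <- tibble(
  region = c("Europe", "East Asia", NA, "Europe", "North America", NA)
)

p_region <- bil_toy %>%
  filter(!is.na(region)) %>%   # drop rows where region is missing
  ggplot(aes(x = region)) +
  geom_bar()                   # one bar per region, height = number of rows
```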

## Continuous random variables
We cannot directly use a bar plot to visualize a continuous random variable, because every observation potentially has a different value. Instead we create a **histogram**. The command to do this is `geom_histogram()`.

Let's visualize the distribution of wealth among billionaires. 



Most billionaires are worth about \$1-5b. However, the distribution has a "long tail": there are some billionaires who are worth as much as \$60-80b. Interestingly, the wealth distribution among billionaires looks quite a bit like the income distribution in society as a whole. Even the .001% have their 1%.

A histogram is basically a bar plot where the continuous random variable has been *quantized* into one of a finite number of values.
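To illustrate the quantization idea, the sketch below bins a made-up vector by hand with `cut()` and then draws a histogram with the same bin width. The data frame `toy` and its `worth` column are hypothetical stand-ins for the `worth in billions` column of `bil`.

```r
library(ggplot2)

# Hypothetical net worths in billions (not the lecture data)
toy <- data.frame(worth = c(1.0, 1.2, 1.5, 2.3, 2.8, 3.1, 4.6, 7.9, 12.5, 60.0))

# Quantize by hand: assign each value to a 5-billion-wide bin, then count per bin
bins <- cut(toy$worth, breaks = seq(0, 60, by = 5), include.lowest = TRUE)
bin_counts <- table(bins)

# geom_histogram() performs a comparable binning internally and draws the counts as bars
p_hist <- ggplot(toy, aes(x = worth)) +
  geom_histogram(binwidth = 5, boundary = 0)
```

Changing `binwidth` changes how coarse the quantization is, which can noticeably change the shape of the histogram.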

## Typical and Atypical Values
In EDA, it is a good idea to try and get a sense of what constitutes a "typical" value in your data. Let's look at the distribution of the ages of billionaires:

Typical values of `age` in these data range from about 30 to 90. We see a very unusual spike around zero. Let us try to investigate more by filtering the data set to contain only rows with age less than 10.   

These represent missing data where we do not know the person's age. We'll fix this by *recoding* all values of -1 to `NA`:

The `na_if(a, b)` function replaces each element of `a` with `NA` wherever `a == b`.
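A minimal sketch of this recoding on a made-up column (the table `toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical ages, with -1 standing in for "unknown"
toy <- tibble(age = c(45, -1, 67, -1, 82))

recoded <- toy %>%
  mutate(age = na_if(age, -1))   # every -1 becomes NA; other values are unchanged
```

After this step, functions such as `mean(recoded$age, na.rm = TRUE)` ignore the unknown ages instead of treating them as real values.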

A good way to get a sense of typical values is by looking at percentiles. The $p$th percentile of a column is the number $x$ for which $p$% of the values are less than or equal to $x$. The best known example is the *median*, which is the 50th percentile: half the values are at or below the median.

This tells us that 98% of the billionaires are between 32 and 90 years old. Let us redo the visualization with extreme values filtered out.
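A sketch of how such cutoffs can be computed and applied, assuming the 1st and 99th percentiles are used (which leaves the middle 98% of values). The table `toy` is a hypothetical stand-in; with the real data you would use the `age` column of `bil`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical ages (not the lecture data)
toy <- tibble(age = c(NA, 28, 35, 41, 50, 56, 63, 70, 77, 94))

# 1st and 99th percentiles, ignoring missing values
cutoffs <- quantile(toy$age, probs = c(0.01, 0.99), na.rm = TRUE)

# Keep only the rows inside that range (rows with NA age are dropped too)
typical <- toy %>%
  filter(age >= cutoffs[[1]], age <= cutoffs[[2]])

# Histogram of the remaining, more typical ages
p_age <- ggplot(typical, aes(x = age)) +
  geom_histogram(binwidth = 5)
```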