# Lecture 5.2: Tidy Data

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand what makes [tidy data](#Tidy-data) and why we care
* [Gather](#Gather) multiple columns into one
* [Spread](#Spread) one column into several
* [Separate](#Separate) and [unite](#Unite) columns
* Impute [missing values](#Missing-values)
    
This lecture note corresponds to Chapter 12 of your book. 
</div>


    




In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Tidy data
There are many different ways to represent data in a table, but some are better than others.
We say that a data table is "[tidy](http://vita.had.co.nz/papers/tidy-data.pdf)" if:
- Each row represents an observation.
- Each column represents a variable.
- Each value gets its own cell.
- Each different type of data set gets its own table.

Data tables which are not tidy are called messy!
![http://r4ds.had.co.nz/images/tidy-1.png](http://r4ds.had.co.nz/images/tidy-1.png)

In the following, we are going to output several example tables. These `table` data sets (`table1`, `table2`, `table3`, `table4a` and `table4b`) are part of your `tidyverse` package.

In [3]:
print(table1)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583


In the above data set, each row is an observation (one country in one year), each column is a variable, and every entry in a column is of the same type. Therefore this data set is tidy.

Now let us take a look at the following data set, in which the `cases` and `population` columns have been merged into a `type` column and a `count` column.

In [3]:
print(table2)

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

The `type` column holds the names of two variables, `cases` and `population`, and their values share the `count` column. For data to be tidy, each variable needs its own column, so this data set is messy.

How about the following table? 

In [7]:
table3 %>% print

# A tibble: 6 x 3
  country      year rate
* <chr>       <int> <chr>
1 Afghanistan  1999 745/19987071
2 Afghanistan  2000 2666/20595360
3 Brazil       1999 37737/172006362
4 Brazil       2000 80488/174504898
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583


The above table is also messy, because the counts for cases and population are stored together, as a fraction written out in a character string, in the single variable `rate`.
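As a preview of the Unite and separate section below, here is a minimal sketch of how `table3` could be repaired with `separate()`, and how `unite()` glues the pieces back together. The names `sep3` and `back3` are only illustrative.

```r
library(tidyverse)

# Split the "cases/population" string into two columns.
# convert = TRUE turns the resulting strings into integers.
sep3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
print(sep3)

# unite() does the opposite: paste the two columns back into one string.
back3 <- sep3 %>%
  unite(rate, cases, population, sep = "/")
print(back3)
```

After `separate()`, the table has the same four columns as `table1`.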

In [9]:
print(table4a) # cases in one tibble
print(table4b) # population in another one

# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583


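The above data are messy as well: `1999` and `2000` are values of a `year` variable, not variables themselves, so the observations for different years are spread across columns. In addition, cases and population are split over two tables.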

Recapping: tidy data means

* each observation has its own row
* each variable has its own column
* each value has its own cell

## Why we care about tidy data
The tools we have learned so far all live in the `tidyverse`. This means that each tool expects to receive tidy data as input and, where necessary, returns tidy data as output. You can think of tidy data as a sort of contract that everything in `tidyverse` respects. This makes it possible to string many tidyverse commands together using `%>%` without having to worry about whether they all work together.

For instance, let us calculate rate of cases per 10000 people for the data in `table1`.

In [10]:
mutate(table1, rate = cases / population * 10000) # rate of cases per 10000 people

country,year,cases,population,rate
<chr>,<int>,<int>,<int>,<dbl>
Afghanistan,1999,745,19987071,0.372741
Afghanistan,2000,2666,20595360,1.294466
Brazil,1999,37737,172006362,2.19393
Brazil,2000,80488,174504898,4.612363
China,1999,212258,1272915272,1.667495
China,2000,213766,1280428583,1.669488


How would we calculate the `rate` variable using `table2`? 
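One possible answer, sketched here under the assumption that we are allowed to reshape first: `cases` and `population` sit in different rows of `table2`, so `mutate()` cannot combine them directly. The sketch uses `spread()`, which is introduced later in this lecture, to put them side by side. The name `table2_rate` is only illustrative.

```r
library(tidyverse)

# Reshape so that cases and population become columns of the same row,
# then compute the rate exactly as we did for table1.
table2_rate <- table2 %>%
  spread(key = type, value = count) %>%
  mutate(rate = cases / population * 10000)
print(table2_rate)
```

The extra reshaping step is the price we pay for starting from messy data.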

In [12]:
print(table2)

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Summary commands like `summarize` and `count` also preserve tidy data:

In [14]:
count(table1, year, wt = cases) # compute no. of cases for each year

year,n
<int>,<int>
1999,250740
2000,296920


In [16]:
count(table1, year, wt = population) # compute total population for each year

year,n
<int>,<int>
1999,1464908705
2000,1475528841


`ggplot` also expects tidy data. What happens if we plot a messy table?

In [18]:
print(table2)

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

If we plot `table2` directly, ggplot treats the case counts and the population sizes in `count` as data points of one and the same variable, so the resulting plot is hard to interpret.
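The plotting cell is not shown in these notes, so the following is only a guess at the kind of plot meant, not the original code. It contrasts a scatter plot built from the messy `table2` with one built from the tidy `table1`. The aesthetics and the names `p_messy` and `p_tidy` are assumptions.

```r
library(tidyverse)

# Messy input: `count` holds both case counts and population sizes,
# so both kinds of numbers end up on the same y axis.
p_messy <- ggplot(table2, aes(x = year, y = count, colour = country)) +
  geom_point()

# Tidy input: each variable has its own column and can be mapped on its own.
p_tidy <- ggplot(table1, aes(x = year, y = cases, colour = country)) +
  geom_point()

print(p_messy)
print(p_tidy)
```

In the first plot the populations dwarf the case counts, which makes the cases almost impossible to read.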

**Remark** It is extremely important to apply the functions we have learned only to tidy data.

## Creating tidy data
If the data is not already tidy, then we might need to do some work before we can use the tools in `dplyr`. The package `tidyr` inside `tidyverse` allows one to convert data into tidy form.

### Gather
One common problem is when a variable is spread across multiple columns and we need to gather those columns to create a new pair of variables. For example, consider `table4a` from above:

In [19]:
print(table4a)

# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766


Here there is a year variable which is spread across two columns. To become tidy, it should get its own `year` column. We want to *gather* the columns of year into a single column.
![gather illustration](http://r4ds.had.co.nz/images/tidy-9.png)

The command to do this is called `gather()`. To use `gather()` we need to specify three things:

* which existing columns correspond to values of a variable
* what is the name of the variable (the **key**) whose values currently appear as column names.
* what is the name of the variable (the **value**) whose values are currently spread over the cells.

(When using `gather()`, neither the **key** nor **value** column names currently exist in your data. They are "destination" columns in the new table.)
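For `table4a`, the three ingredients above could be filled in as follows. This is a minimal sketch; the name `tidy4a` is only illustrative.

```r
library(tidyverse)

# Columns to gather: `1999` and `2000` (backticks because the names start with a digit).
# key:   the new column that will hold the old column names (the years).
# value: the new column that will hold the old cell contents (the case counts).
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
print(tidy4a)
```

Note that the new `year` column is a character column, because it was built from column names.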

### Exercise
Let us try to transform `table4b` to tidy format:
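One possible solution, offered as a sketch rather than the official answer: the only change from `table4a` is that the cells now hold populations. Here `convert = TRUE` is added so that the gathered years become integers. The name `tidy4b` is only illustrative.

```r
library(tidyverse)

# Gather the two year columns; the cell values are populations this time.
# convert = TRUE converts the gathered key ("1999", "2000") to integers.
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population", convert = TRUE)
print(tidy4b)
```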

### Spread
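Another type of problem is when an observation is scattered across multiple rows. This is the case in `table2`, where each country in each year takes up two rows, one for `cases` and one for `population`.

Here we want to do the opposite of gather: we want to *spread* these rows out into new columns.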
![spread data](http://r4ds.had.co.nz/images/tidy-8.png)

We need to specify two things:

* which existing column (the **key**) has the variable names as values
* which existing column (the **value**) has the values for those variables

What is the key here? What is the value?
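For `table2`, one reading is that the key is `type` (its values `cases` and `population` are variable names) and the value is `count`. A minimal sketch under that reading follows; the name `tidy2` is only illustrative.

```r
library(tidyverse)

# key:   the column whose values become new column names
# value: the column whose values fill the new columns
tidy2 <- table2 %>%
  spread(key = type, value = count)
print(tidy2)
```

The result has one row per country and year, with the same columns as `table1`.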

### Exercise
Convert `table2` to tidy format using `spread()`.

In [5]:
print(table2)

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

## Summary

* `gather` tends to make wide tables narrower and longer
* `spread` tends to make long tables shorter and wider
* `gather` and `spread` are inverses: each one undoes the other.

Sometimes we want to `spread()` data for other reasons. Some tables are more readable if they are put in a non-tidy format. This is often the case with time data.

### Exercise
Use `spread()` to put the `year` variable of `table1` into columns (show `population` only):
```
  country     1999       2000      
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583
```
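One possible approach, given as a sketch: drop `cases` first, since otherwise `spread()` keeps it as an identifying column and each country still occupies two rows, padded with `NA`. The name `pop_wide` is only illustrative.

```r
library(tidyverse)

# Keep only country, year and population, then turn the years into columns.
pop_wide <- table1 %>%
  select(-cases) %>%
  spread(key = year, value = population)
print(pop_wide)
```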

In [8]:
print(table1)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583


Now let us apply the tools that we have learned to the `flights` data set (from the `nycflights13` package).

### Exercise
Re-create the following table which shows monthly departures from the three NYC airports:
```
  origin 1    2    3     4     5     6     7     8     9    10    11   12  
1 EWR    9893 9107 10420 10531 10592 10175 10475 10359 9550 10104 9707 9922
2 JFK    9161 8421  9697  9218  9397  9472 10023  9983 8908  9143 8710 9146
3 LGA    7950 7423  8717  8581  8807  8596  8927  8985 9116  9642 8851 9067
```
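A sketch of one way to build such a table, assuming the `nycflights13` package is installed (it is not loaded by `library(tidyverse)`): count the flights per airport and month, then spread the months into columns. The name `monthly` is only illustrative.

```r
library(tidyverse)
library(nycflights13)  # assumption: provides the `flights` data set

# One row per (origin, month) with the number of departures in n,
# then one column per month.
monthly <- flights %>%
  count(origin, month) %>%
  spread(key = month, value = n)
print(monthly)
```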

### Spreading more than one column
Consider the following simple table:

In [16]:
grades <- tribble(
    ~person, ~exam, ~q1, ~q2, ~q3,
    "alice", "mt1", 1, 2, 3.5,
    "alice", "mt2", .5, 2.5, 1.5,
    "bob", "mt1", 0.0, 1.0, 1.5,
    "bob", "mt2", 1.5, 2.5, 2.0
)

In [17]:
print(grades)

# A tibble: 4 x 5
  person exam     q1    q2    q3
  <chr>  <chr> <dbl> <dbl> <dbl>
1 alice  mt1     1     2     3.5
2 alice  mt2     0.5   2.5   1.5
3 bob    mt1     0     1     1.5
4 bob    mt2     1.5   2.5   2


Suppose we want to expand this into multiple columns `mt1.q1`, `mt1.q2`, and so on. How should we use `spread()`?

We have uncovered a limitation of `spread()`. It can only operate on a single key-value pair. (This is on purpose, in order to keep the command simple.)

### Unite and separate
To `spread()` multiple values at once we'll use the `unite()` command to combine them into a single variable. The `unite()` command does the opposite of `separate()`: it sticks several variables together to form a new variable.

`unite()` has taken each of the values q1, q2, q3 and combined them into a single column. Now we can `spread()` the `q` column to obtain:
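The code cells for these steps are not shown, so the following is a hedged reconstruction rather than the original code. It unites `q1`, `q2`, `q3` into one string column `q`, spreads `q` by exam, and then uses `separate()` to split each exam column back into the target columns `mt1.q1`, `mt1.q2`, and so on. The separator `"_"` and the name `wide_grades` are assumptions.

```r
library(tidyverse)

grades <- tribble(
    ~person, ~exam, ~q1, ~q2, ~q3,
    "alice", "mt1", 1, 2, 3.5,
    "alice", "mt2", .5, 2.5, 1.5,
    "bob", "mt1", 0.0, 1.0, 1.5,
    "bob", "mt2", 1.5, 2.5, 2.0
)

wide_grades <- grades %>%
  unite(q, q1, q2, q3, sep = "_") %>%   # e.g. 1, 2, 3.5 becomes "1_2_3.5"
  spread(key = exam, value = q) %>%     # one string column per exam: mt1, mt2
  separate(mt1, into = c("mt1.q1", "mt1.q2", "mt1.q3"),
           sep = "_", convert = TRUE) %>%
  separate(mt2, into = c("mt2.q1", "mt2.q2", "mt2.q3"),
           sep = "_", convert = TRUE)
print(wide_grades)
```

Because `spread()` only ever sees the single key-value pair `exam` and `q`, its one-pair limitation is respected. The numbers travel through the reshaping as text and are converted back by `convert = TRUE`.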

## Missing Values
Missing values can be:

* **Explicit** (marked as `NA` in our data); or
* **Implicit** (not present in the data).

In [24]:
(stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, NA, 0.35,   0.8, 0.92, 0.17, 2.66)
))
mutate(stocks, last_qtr_ret = lag(return))

year,qtr,return
<dbl>,<dbl>,<dbl>
2015,1,1.88
2015,2,
2015,3,0.35
2015,4,0.8
2016,2,0.92
2016,3,0.17
2016,4,2.66


year,qtr,return,last_qtr_ret
<dbl>,<dbl>,<dbl>,<dbl>
2015,1,1.88,
2015,2,,1.88
2015,3,0.35,
2015,4,0.8,0.35
2016,2,0.92,0.8
2016,3,0.17,0.92
2016,4,2.66,0.17


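In this example we have one explicitly missing value, for the 2nd quarter of 2015. Are there any other missing values? Yes: we do not have an observation for the first quarter of 2016, so that value is implicitly missing. As a result, `lag(return)` pairs the 2nd quarter of 2016 with the 4th quarter of 2015 rather than with the previous quarter.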

The **complete** command makes implicit missing values explicit by considering all combinations of the unique values of the specified variables.
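A minimal sketch with the `stocks` tibble from above, repeated here so the block runs on its own. The name `stocks_complete` is only illustrative.

```r
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, NA, 0.35,   0.8, 0.92, 0.17, 2.66)
)

# All combinations of the 2 years and 4 quarters: 8 rows.
# The missing (2016, 1) row is added with return = NA.
stocks_complete <- stocks %>%
  complete(year, qtr)
print(stocks_complete)
```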

The missing values also become explicit if we **spread** the tibble.

**gather** will keep all these explicitly missing values by default.

If you don't like the default behavior of **gather**, you can turn off explicit missing values using the `na.rm` argument.
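The last three statements can be checked on `stocks` with a short sketch. The names `wide`, `long_all` and `long_obs` are only illustrative.

```r
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, NA, 0.35,   0.8, 0.92, 0.17, 2.66)
)

# Spreading by year creates a cell for (qtr 1, 2016), which is filled with NA.
wide <- stocks %>%
  spread(key = year, value = return)
print(wide)

# Gathering back keeps every cell, including the NA ones.
long_all <- wide %>%
  gather(`2015`, `2016`, key = "year", value = "return")

# With na.rm = TRUE the rows whose value is NA are dropped.
long_obs <- wide %>%
  gather(`2015`, `2016`, key = "year", value = "return", na.rm = TRUE)
print(long_obs)
```

Note that `na.rm = TRUE` drops the originally explicit `NA` for the 2nd quarter of 2015 as well, not only the one that `spread()` introduced.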