In [1]:
install.packages('nycflights13')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [2]:
options(repr.plot.width=8, repr.plot.height=5)
library(tidyverse)
library(nycflights13)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# Lecture 5: Tidy data, missing data

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Gain more experience creating and working with tidy data
* Learn about how R handles missing data.
    
This lecture note corresponds to Chapters 6 and 20 of your book.
</div>


    




## Using `pivot_wider()`
Let's revisit the `gapminder` dataset that we first saw last lecture:

In [3]:
install.packages('gapminder')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [4]:
library(gapminder)
print(gapminder)

# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.

## Missing data in R
The `gapminder` dataset looks very clean: every country has an observation for every year, with no missing values. But the raw data looks like this:

In [5]:
gapminder_unfiltered %>% print

# A tibble: 3,313 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.

### What happens when we reshape the "unfiltered" data?

In [6]:
# pivot unfiltered data wider
gapminder_unfiltered %>% pivot_wider(id_cols = country, names_from = year, values_from = gdpPercap)

country,1952,1957,1962,1967,1972,1977,1982,1987,1992,⋯,1995,1996,1998,1999,2000,2001,2003,2004,2005,2006
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Afghanistan,779.4453,820.8530,853.1007,836.1971,739.9811,786.1134,978.0114,852.3959,649.3414,⋯,,,,,,,,,,
Albania,1601.0561,1942.2842,2312.8890,2760.1969,3313.4222,3533.0039,3630.8807,3738.9327,2497.4379,⋯,,,,,,,,,,
Algeria,2449.0082,3013.9760,2550.8169,3246.9918,4182.6638,4910.4168,5745.1602,5681.3585,5023.2166,⋯,,,,,,,,,,
Angola,3520.6103,3827.9405,4269.2767,5522.7764,5473.2880,3008.6474,2756.9537,2430.2083,2627.8457,⋯,,,,,,,,,,
Argentina,5911.3151,6856.8562,7133.1660,8052.9530,9443.0385,10079.0267,8997.8974,9139.6714,9308.4187,⋯,,,,,,,,,,
Armenia,,,,,,,,,1442.9378,⋯,,,,,,,,,,
Aruba,,,,,4939.7580,7390.3599,10874.9150,17674.3389,25120.5496,⋯,,,,,,,,,,
Australia,10039.5956,10949.6496,12217.2269,14526.1246,16788.6295,18334.1975,19477.0093,21888.8890,23424.7668,⋯,25518.715,26151.132,28169.153,28983.267,29241.515,30043.243,31634.242,32098.506,,
Austria,6137.0765,8842.5980,10750.7211,12834.6024,16661.6256,19749.4223,21597.0836,23687.8261,27042.0187,⋯,27918.819,28602.353,30107.978,31039.619,32008.505,32196.425,32741.187,33455.694,34108,
Azerbaijan,,,,,,,,,3455.5430,⋯,,,,,,,,,,


In [None]:
# pivot unfiltered data wider
gapminder_unfiltered %>% pivot_wider(id_cols = country, names_from = year, values_from = gdpPercap)

# A tibble: 187 × 59
   country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
   <fct>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan…   779.   821.   853.   836.   740.   786.   978.   852.   649.   635.
 2 Albania  1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
 3 Algeria  2449.  3014.  2551.  3247.  4183.  4910.  5745.  5681.  5023.  4797.
 4 Angola   3521.  3828.  4269.  5523.  5473.  3009.  2757.  24

You can see that there are many missing observations in the unfiltered data. In real life, you will mostly get unfiltered data -- how should we handle missing data?

## Missing Values
Missing values can be:

* **Explicit** (marked as `NA` in our data); or
* **Implicit** (not present in the data).

In [11]:
df <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4,
  NA,                 1,         NA
)

In [12]:
df

person,treatment,response
<chr>,<dbl>,<dbl>
Derrick Whitmore,1,7.0
,2,10.0
,3,
Katherine Burke,1,4.0
,1,


The missing values are **explicit** in this table: each missing value is indicated by `NA` in the table.

You can fill in these missing values with `tidyr::fill()`. It takes a set of columns, selected the same way as in `select()`, and replaces each missing value with the most recent non-missing value above it (often called "last observation carried forward").

In [13]:
# fill df for person and response columns
fill(df, person, response)

person,treatment,response
<chr>,<dbl>,<dbl>
Derrick Whitmore,1,7
Derrick Whitmore,2,10
Derrick Whitmore,3,10
Katherine Burke,1,4
Katherine Burke,1,4


In [10]:
# fill in each missing value in gapminder_unfiltered carrying forward
fill(gapminder_unfiltered, gdpPercap) %>% filter(is.na(gdpPercap))

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
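One caveat about the gapminder cell above: on an ungrouped data frame, `fill()` carries values down the rows without regard to which country a row belongs to. The sketch below uses a small made-up table (the `econ` data and its column names are assumptions for illustration, not part of the lecture data) to compare an ungrouped fill, a fill within groups, and the `.direction` argument.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two countries, one missing value each
econ <- tribble(
  ~country, ~year, ~gdp,
  "A",      2000,  100,
  "A",      2001,  NA,
  "B",      2000,  NA,
  "B",      2001,  500
)

# Ungrouped: B's 2000 value is filled from country A's last value
ungrouped <- fill(econ, gdp)
ungrouped$gdp
#> [1] 100 100 100 500

# Grouped: filling stays inside each country, so B's 2000 value remains NA
grouped <- econ %>% group_by(country) %>% fill(gdp) %>% ungroup()
grouped$gdp
#> [1] 100 100  NA 500

# .direction = "downup" fills downward first, then upward, within each group
both <- econ %>%
  group_by(country) %>%
  fill(gdp, .direction = "downup") %>%
  ungroup()
both$gdp
#> [1] 100 100 500 500
```

Whether carrying a value forward (or backward) is appropriate depends on what the data mean, so treat this as a mechanical illustration rather than a recommendation.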


## Encoded missing data

Sometimes data contain a concrete value that actually represents a missing value. You see this often when data are imported from a format that has no way to represent missing values, such as text or CSV, so the file uses a special value like 99 or -999 instead.

To correct for this type of missing value, we can use `dplyr::na_if(x, n)`, which takes a vector `x` and replaces every occurrence of `n` with `NA`:

In [15]:
x = c(-99, 1, 3, -99, 2)
print(x)
na_if(x, -99)

[1] -99   1   3 -99   2
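In practice the sentinel usually lives in a data frame column, so `na_if()` is typically called inside `mutate()`. A minimal sketch, assuming a made-up `readings` table in which -99 means "not recorded":

```r
library(dplyr)

# Hypothetical sensor readings where -99 encodes a missing measurement
readings <- tibble(
  id   = 1:5,
  temp = c(-99, 1, 3, -99, 2)
)

# Replace the sentinel with NA so summaries no longer treat it as real data
cleaned <- readings %>%
  mutate(temp = na_if(temp, -99))

cleaned$temp
#> [1] NA  1  3 NA  2

mean(cleaned$temp, na.rm = TRUE)
#> [1] 2
```

Without this step, the two -99 values would be averaged in as if they were real temperatures.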


## Implicit missing values
A second type of missing data occurs when there are simply no observations in the dataset for a particular combination of columns. For example:

In [16]:
stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

In [17]:
print(stocks)

# A tibble: 7 × 3
   year   qtr price
  <dbl> <dbl> <dbl>
1  2020     1  1.88
2  2020     2  0.59
3  2020     3  0.35
4  2020     4 NA   
5  2021     2  0.92
6  2021     3  0.17
7  2021     4  2.66


This dataset has two missing observations:
- The price in 2020q4 is explicitly missing. (It has an `NA`.)
- The price in 2021q1 is implicitly missing: it does not appear in the dataset.

> An explicit missing value is the presence of an absence.
>
> An implicit missing value is the absence of a presence.


How can we handle implicit missing values? As we have already seen, one option is to use `pivot_wider()`:

In [18]:
# using pivot_wider() on stocks converts implicit missing values to explicit
pivot_wider(stocks, names_from = qtr, values_from = price) %>%
  print

# A tibble: 2 × 5
   year   `1`   `2`   `3`   `4`
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  2020  1.88  0.59  0.35 NA   
2  2021 NA     0.92  0.17  2.66


Alternatively, we can use the `complete()` function, which creates entries for all possible combinations of a set of columns:

In [19]:
# use complete to fill in missing values for stocks
complete(stocks, year, qtr)

year,qtr,price
<dbl>,<dbl>,<dbl>
2020,1,1.88
2020,2,0.59
2020,3,0.35
2020,4,
2021,1,
2021,2,0.92
2021,3,0.17
2021,4,2.66
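`complete()` can also supply a value for the rows it creates, through its `fill` argument. The sketch below uses a made-up `sales` table (an assumption for illustration) where a missing store/product combination plausibly means zero units sold, while an existing `NA` means the count is unknown. The `explicit = FALSE` option, available in recent tidyr versions such as the 1.3.0 loaded above, limits the fill to the newly created rows.

```r
library(tidyr)

# Hypothetical sales records: south/pears is absent (implicit),
# north/pears is present but unknown (explicit NA)
sales <- tribble(
  ~store,  ~product, ~units,
  "north", "apples", 10,
  "north", "pears",  NA,
  "south", "apples", 7
)

# Default: both the new row and the existing NA are filled with 0
filled_all <- complete(sales, store, product, fill = list(units = 0))
filled_all$units
#> [1] 10  0  7  0

# explicit = FALSE: only the newly created south/pears row gets 0
filled_new <- complete(sales, store, product,
                       fill = list(units = 0), explicit = FALSE)
filled_new$units
#> [1] 10 NA  7  0
```

Which behavior you want depends on whether an existing `NA` and an absent row mean the same thing in your data.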


### Fixed value replacement

Sometimes missing values represent some fixed and known value, most commonly 0. You can use `dplyr::coalesce()` to replace them:

In [20]:
x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#> [1] 1 4 5 7 0
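`coalesce()` is not limited to a single replacement value: given several vectors, it returns, position by position, the first value that is not missing. A small sketch with made-up contact data (the vectors below are assumptions for illustration):

```r
library(dplyr)

# Hypothetical phone numbers; prefer home, fall back to mobile, then a label
home   <- c("555-0100", NA,         NA)
mobile <- c("555-0199", "555-0142", NA)

contact <- coalesce(home, mobile, "unknown")
contact
#> [1] "555-0100" "555-0142" "unknown"
```

The arguments are checked in order, so put the most trusted source first.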

### Titanic analysis

On April 15, 1912, the "unsinkable" RMS Titanic sank to the bottom of the Atlantic.

Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in more than 1500 deaths out of 2224 passengers and crew. See: https://en.wikipedia.org/wiki/Titanic

We have a partial list of passengers here. Let's do some analysis.

In [22]:
titanic = read.csv('https://storage.googleapis.com/mbcc/titanic.csv')
titanic %>% head %>% print

  survived pclass    sex age sibsp parch    fare embarked class   who
1        0      3   male  22     1     0  7.2500        S Third   man
2        1      1 female  38     1     0 71.2833        C First woman
3        1      3 female  26     0     0  7.9250        S Third woman
4        1      1 female  35     1     0 53.1000        S First woman
5        0      3   male  35     0     0  8.0500        S Third   man
6        0      3   male  NA     0     0  8.4583        Q Third   man
  adult_male deck embark_town alive alone
1       True      Southampton    no False
2      False    C   Cherbourg   yes False
3      False      Southampton   yes  True
4      False    C Southampton   yes False
5       True      Southampton    no  True
6       True       Queenstown    no  True


## &#129300; Quiz

What is the average age of the passengers (ignoring the decimal part)?

<ol style="list-style-type: upper-alpha;">
    <li>28</li>
    <li>30</li>
    <li>27</li>
    <li>29</li>
    <li>31</li>
</ol>


In [23]:
## solution
titanic %>%
  filter(!is.na(age)) %>%
  summarize(mean_age = mean(age))

mean_age
<dbl>
29.69912


In [24]:
summary(titanic$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.42   20.12   28.00   29.70   38.00   80.00     177 

In [26]:
mean(titanic$age, na.rm = TRUE)

In [28]:
titanic$age %>% mean(na.rm = TRUE)
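The `na.rm = TRUE` argument matters because `NA` is contagious in R: almost any computation involving an unknown value returns an unknown result. A short base R sketch with a made-up `ages` vector (an assumption, not the Titanic data):

```r
# Hypothetical ages with one unknown value
ages <- c(22, 38, NA, 35)

mean(ages)                 # the unknown age makes the mean unknown
#> [1] NA

mean(ages, na.rm = TRUE)   # drop NAs first: (22 + 38 + 35) / 3
#> [1] 31.66667

NA == NA                   # comparing two unknowns is itself unknown
#> [1] NA

is.na(ages)                # use is.na() to test for missingness
#> [1] FALSE FALSE  TRUE FALSE
```

This is also why `filter(age == NA)` never finds the missing rows, while `filter(is.na(age))` does.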

In [29]:
summary(titanic)

    survived          pclass          sex                 age       
 Min.   :0.0000   Min.   :1.000   Length:891         Min.   : 0.42  
 1st Qu.:0.0000   1st Qu.:2.000   Class :character   1st Qu.:20.12  
 Median :0.0000   Median :3.000   Mode  :character   Median :28.00  
 Mean   :0.3838   Mean   :2.309                      Mean   :29.70  
 3rd Qu.:1.0000   3rd Qu.:3.000                      3rd Qu.:38.00  
 Max.   :1.0000   Max.   :3.000                      Max.   :80.00  
                                                     NA's   :177    
     sibsp           parch             fare          embarked        
 Min.   :0.000   Min.   :0.0000   Min.   :  0.00   Length:891        
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91   Class :character  
 Median :0.000   Median :0.0000   Median : 14.45   Mode  :character  
 Mean   :0.523   Mean   :0.3816   Mean   : 32.20                     
 3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00                     
 Max.   :8.000   Max.   :6.0

## &#129300; Quiz

How many `NA` values are there in the `age` column?

<ol style="list-style-type: upper-alpha;">
    <li>177</li>
    <li>178</li>
    <li>714</li>
    <li>There are no NA values</li>
</ol>


#### Multiple ways of finding the answer

In [30]:
sum(is.na(titanic$age))

In [33]:
titanic %>% count(is.na(age))

is.na(age),n
<lgl>,<int>
False,714
True,177


In [34]:
titanic %>% filter(is.na(age)) %>% nrow
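To check every column at once rather than one column at a time, the same `sum(is.na(...))` idea can be applied across the whole data frame. A minimal sketch on a made-up `passengers` table (the data and column values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical passenger records with missing ages and decks
passengers <- tibble(
  age  = c(22, NA, 26, NA),
  deck = c(NA, "C", NA, NA),
  fare = c(7.25, 71.28, 7.92, 8.05)
)

# One row, one column per variable, each holding that variable's NA count
na_counts <- passengers %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts   # age: 2, deck: 3, fare: 0

# Base R equivalent: a named vector of NA counts
colSums(is.na(passengers))
```

Running the same pipeline on `titanic` should show at a glance which columns need attention.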

Replace the `NA` values in the `age` column with the average age:

In [37]:
## solution

titanic %>%
  mutate(age = coalesce(age, mean(age, na.rm = TRUE))) %>%
  count(is.na(age))


is.na(age),n
<lgl>,<int>
False,891
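A single overall mean ignores differences between groups of passengers. One possible refinement, shown here only as a sketch on made-up data (the `passengers` table is an assumption, and mean imputation is used for illustration, not as a recommended method), is to compute the mean within each class by adding `group_by()` before the same `coalesce()` call:

```r
library(dplyr)

# Hypothetical passengers: one missing age in each class
passengers <- tibble(
  pclass = c(1, 1, 1, 3, 3, 3),
  age    = c(40, NA, 50, 20, 30, NA)
)

# Within each class, replace NA with that class's mean age
imputed <- passengers %>%
  group_by(pclass) %>%
  mutate(age = coalesce(age, mean(age, na.rm = TRUE))) %>%
  ungroup()

imputed$age
#> [1] 40 45 50 20 30 25
```

With the overall mean, both missing ages would have become 35; the grouped version gives 45 for first class and 25 for third class.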
