# Lecture 11.2: Statistical Modeling II
<div style="border: 1px double black; padding: 10px; margin: 10px">
    
We will learn more about linear regression.



In [1]:
library(tidyverse)
library(modelr)
library(lubridate)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




### Review
Use the linear model to determine:
1. Expected price of a one-carat diamond with a good cut.
2. If the average city gas mileage for Audis is statistically different from the average city gas mileage for Volkswagens.
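
One possible way to set up both review questions is sketched below. The choice of predictors (`carat + cut` for price, `manufacturer` for city mileage) is an assumption made for illustration; the lecture may have had a different model in mind. The `diamonds` and `mpg` data sets ship with `ggplot2`.

```r
library(ggplot2)   # provides the diamonds and mpg data sets

# 1. Price as a function of carat and cut (an assumed model).
mod_price <- lm(price ~ carat + cut, data = diamonds)
new_diamond <- data.frame(
  carat = 1,
  cut   = factor("Good", levels = levels(diamonds$cut), ordered = TRUE)
)
pred_price <- predict(mod_price, newdata = new_diamond)
pred_price

# 2. Restrict to the two manufacturers; the coefficient on the
#    manufacturer indicator estimates the difference in mean city mileage,
#    and its t-test assesses whether that difference is zero.
audi_vw <- subset(mpg, manufacturer %in% c("audi", "volkswagen"))
mod_cty <- lm(cty ~ manufacturer, data = audi_vw)
coef(summary(mod_cty))
```

The row labelled `manufacturervolkswagen` in the coefficient table holds the estimated difference together with its standard error, $t$-value and $p$-value.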

### Formulas with factors
The situation becomes more interesting when we consider models that contain factors:

For instance, let's say our outcome variable $Y$ is height and our explanatory variable $X$ is sex (Male or Female). Then sex is a categorical variable with two categories. The `lm` function treats sex specially: it creates an indicator variable that takes the value 1 if sex is male and 0 if sex is female.
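
The indicator coding can be seen directly in the model matrix. The sketch below uses a small made-up data frame (`heights`, with invented values) purely for illustration.

```r
# Hypothetical toy data, invented for illustration only.
heights <- data.frame(
  height = c(178, 165, 182, 160, 171, 158),
  sex    = factor(c("Male", "Female", "Male", "Female", "Male", "Female"))
)

# The model matrix shows the indicator column that lm() builds from sex.
X <- model.matrix(height ~ sex, data = heights)
X

fit <- lm(height ~ sex, data = heights)
# (Intercept) is the mean female height; sexMale is the male mean
# minus the female mean.
coef(fit)
```

Because "Female" comes first alphabetically, R uses it as the baseline level, so the generated column is named `sexMale`.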

### Interpreting the output of `summary(lm(...))`

#### Degrees of freedom

#### Residuals

#### Residual standard error
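The residual sum of squares is

$$\text{RSS} = \sum_{i=1}^{n} [y_i - \hat{y}_i(a_1,a_2)]^2.$$

Dividing by the residual degrees of freedom $\text{df}_e$ gives an estimate of the error variance: $\text{RSS}/\text{df}_e \approx \text{var}(\epsilon)$. The residual standard error is the square root of this quantity, $\sqrt{\text{RSS}/\text{df}_e}$.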
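
As a quick check, the residual standard error can be recomputed by hand and compared with what R reports. The data here are simulated (an assumption for illustration), with true error variance 4.

```r
set.seed(1)
# Simulated data: y = 1 + 2x + noise, with noise standard deviation 2.
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

rss  <- sum(residuals(fit)^2)   # residual sum of squares
df_e <- df.residual(fit)        # n minus number of coefficients = 198
rss / df_e                      # estimate of var(epsilon)
sqrt(rss / df_e)                # residual standard error, by hand
sigma(fit)                      # the value reported by summary(fit)
```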

#### $p$- and $t$-values

#### Multiple $R^2$

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$

"Fraction of variance explained."

#### $\bar{R}^2$, a.k.a. adjusted $R^2$

$$\bar{R}^2 = 1 - \frac{\text{RSS}/\text{df}_e}{\text{TSS}/\text{df}_t}$$

Here $\text{df}_t = n - 1$ is the degrees of freedom of the total sum of squares.
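
Both quantities can be reproduced from their definitions. The sketch below uses simulated data (an assumption for illustration) and takes the total degrees of freedom to be $n - 1$.

```r
set.seed(2)
# Simulated data for illustration only.
x <- runif(100)
y <- 3 - x + rnorm(100)
fit <- lm(y ~ x)

rss  <- sum(residuals(fit)^2)    # residual sum of squares
tss  <- sum((y - mean(y))^2)     # total sum of squares
df_e <- df.residual(fit)         # n - 2
df_t <- length(y) - 1            # n - 1

r2     <- 1 - rss / tss
r2_adj <- 1 - (rss / df_e) / (tss / df_t)

# Each pair should agree.
c(by_hand = r2,     from_summary = summary(fit)$r.squared)
c(by_hand = r2_adj, from_summary = summary(fit)$adj.r.squared)
```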

#### $F$-test

### Gapminder
The `gapminder` package contains data from [Gapminder](https://www.gapminder.org/), which was popularised by Swedish statistician Hans Rosling. If you don't know about this data or this person, pause the lecture and take five minutes and [watch one of his videos](https://www.youtube.com/watch?v=jbkSRLYSojo).

In [12]:
#install.packages("gapminder")
library(gapminder)
gapminder %>% print

# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia 

To begin with we will focus on how life expectancy varies by year and by country.

The regression line shows that the overall trend in life expectancy has been upwards over the last fifty years. That's good! But there are some obvious exceptions.

The linear trend is a good fit for most of the non-African and non-Asian countries. However, beginning in the 1990s, a number of African countries have lagged far behind the rest of the world in terms of life expectancy.
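
A plot of the kind described above could be produced roughly as follows. This is a sketch, not necessarily the figure used in the lecture: one faint line per country, with a single pooled regression line on top.

```r
library(ggplot2)
library(gapminder)

# Pooled model: one intercept and one slope for all countries.
pooled <- lm(lifeExp ~ year, data = gapminder)
coef(pooled)

# One line per country, plus the pooled least-squares line.
p <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country), alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)
p
```

Adding `facet_wrap(~ continent)` to `p` separates the continents and makes the African exceptions easier to spot.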

## Interaction terms
To dig deeper we will want to fit a separate linear model to each country. We want our model to be:

$$\text{lifeExp}_{c}(\text{year}) = \alpha_c + \beta_c \cdot \text{year}.$$

Here $c$ indexes countries. To do this we will add an interaction term:

To understand what this does, let's turn to the model matrix:

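The interaction term adds a separate intercept column *and* a separate slope column for every country except one. The remaining country serves as the baseline: its intercept and slope are the `(Intercept)` and `year` coefficients, and every other country's coefficients are offsets from them.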
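
A minimal sketch of such a model and its model matrix is given below. The formula `lifeExp ~ year * country` is one way to write it, assumed here; `year * country` expands to `year + country + year:country`.

```r
library(gapminder)

# Separate intercept and slope per country via an interaction.
mod_int <- lm(lifeExp ~ year * country, data = gapminder)

# Inspect the design matrix that lm() builds from the formula.
X <- model.matrix(lifeExp ~ year * country, data = gapminder)
dim(X)              # 1704 rows; two columns per country
head(colnames(X))   # (Intercept), year, then country indicator columns
tail(colnames(X))   # year:country interaction columns
```

With 142 countries there are 284 columns: the intercept, `year`, 141 country indicators and 141 `year:country` products.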

## Measurements of model quality
So far we have looked at residuals to judge how well the models fit. There are other more general measurements of model quality. To help us look at these we will use the `broom` package for turning models into tidy data:

In [27]:
library(broom)


Attaching package: ‘broom’


The following object is masked from ‘package:modelr’:

    bootstrap




The `broom::glance()` function lets us quickly look at a model and judge how well it fits:

`glance` prints out some technical measurements of how well the model fits. The basic one is `r.squared`. In the simple linear model, it equals the square of the correlation between the predictions $\hat{\mathbf{y}}$ and the observations $\mathbf{y}$:
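
This can be verified on the pooled model of life expectancy against year (chosen here as an illustrative example):

```r
library(gapminder)
library(broom)

mod <- lm(lifeExp ~ year, data = gapminder)

g <- glance(mod)      # one-row tibble of model-level summaries
g$r.squared

# Squared correlation between fitted values and observations.
r2_cor <- cor(fitted(mod), gapminder$lifeExp)^2
r2_cor
```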

To investigate each country individually, we will fit a separate linear model to each one. For that we'll use a new command called `nest()`, which packages up our data frame into a collection of nested data frames, one per group:
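
A sketch of nesting by country follows. Grouping by both `country` and `continent` is an assumption made so that `continent` stays available as an ordinary column; the name `by_country` is hypothetical.

```r
library(dplyr)
library(tidyr)
library(gapminder)

# One row per country; the remaining columns are packed into a
# list-column called `data`, which holds one small tibble per country.
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest()

by_country
by_country$data[[1]]   # the nested tibble for the first country
```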

### Exercise
Use `map()` to run a regression of life expectancy over time for *each* of the 142 countries in this data set. Store the results in a column called `model`.

Plotting the resulting data, we see that most countries are fit pretty well by the linear model. But some countries, especially those in Africa, have a very bad fit:

Let's extract those for further analysis:
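
One way to carry out the steps above is sketched here: fit a model per country with `map()`, pull out $R^2$ with `glance()`, and keep the poorly fitting countries. The names `country_fits` and `bad_fit` are hypothetical, and the cutoff of 0.25 is an arbitrary threshold chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(gapminder)

country_fits <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # One linear model per country, stored in a list-column.
    model     = map(data, ~ lm(lifeExp ~ year, data = .x)),
    # R^2 of each country's model.
    r.squared = map_dbl(model, ~ glance(.x)$r.squared)
  )

# Keep countries whose linear fit is poor (assumed cutoff: 0.25).
bad_fit <- country_fits %>%
  filter(r.squared < 0.25) %>%
  arrange(r.squared)

bad_fit %>% select(country, continent, r.squared)
```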