#### Code a simple one-variable regression
For the first coding exercise, you'll create a formula to define a one-variable modeling task, and then fit a linear model to the data. You are given the rates of male and female unemployment in the United States over several years.

The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.

Recall the calling interface for lm() is:

lm(formula, data = ___)
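
As a quick illustration of this interface and of reading the coefficient sign, the sketch below fits a one-variable model on R's built-in `cars` data set. The data set and the names `fmla_cars`, `cars_model`, and `slope` are assumptions made for illustration; they are not part of this exercise.

```r
# Illustration only: uses the built-in cars data, not the unemployment data
fmla_cars <- dist ~ speed                  # outcome ~ input
cars_model <- lm(fmla_cars, data = cars)   # lm(formula, data = ___)

# Extract the coefficient for the input variable
slope <- coef(cars_model)["speed"]

# A positive sign means the outcome increases as the input increases
sign(slope)
```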



In [2]:
# load data
unemployment <- readRDS("data/unemployment.rds")

In [3]:
# unemployment is loaded in the workspace
summary(unemployment)
#  male_unemployment female_unemployment
#  Min.   :2.900     Min.   :4.000      
#  1st Qu.:4.900     1st Qu.:4.400      
#  Median :6.000     Median :5.200      
#  Mean   :5.954     Mean   :5.569      
#  3rd Qu.:6.700     3rd Qu.:6.100      
#  Max.   :9.800     Max.   :7.900

# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- as.formula("female_unemployment ~ male_unemployment")

# Print it
fmla
# female_unemployment ~ male_unemployment

# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)

# Print it
unemployment_model
# Call:
# lm(formula = fmla, data = unemployment)

# Coefficients:
#       (Intercept)  male_unemployment  
#            1.4341             0.6945


#### Examining a model
Let's look at the model unemployment_model that you have just created. There are several ways to examine a model, and each provides different information. We will use summary() from base R, glance() from the broom package, and wrapFTest() from the sigr package.

In [4]:
# Load broom (for glance()) and sigr (for wrapFTest())
library(broom)
library(sigr)

# Call summary() on unemployment_model to get more details
summary(unemployment_model)
# Call:
# lm(formula = fmla, data = unemployment)

# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.77621 -0.34050 -0.09004  0.27911  1.31254 

# Coefficients:
#                   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        1.43411    0.60340   2.377   0.0367 *  
# male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Residual standard error: 0.5803 on 11 degrees of freedom
# Multiple R-squared:  0.8213,	Adjusted R-squared:  0.8051 
# F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
#   r.squared adj.r.squared     sigma statistic      p.value df    logLik
# 1 0.8213157     0.8050716 0.5802596  50.56108 1.965985e-05  2 -10.28471
#        AIC      BIC deviance df.residual
# 1 26.56943 28.26428 3.703714          11

# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
# [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
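
The values printed by summary() can also be pulled out programmatically. The following is a minimal sketch using only base R; the built-in `cars` data and the names `cars_model`, `model_summary`, and `coef_table` are stand-ins chosen for illustration, not part of the exercise.

```r
# Illustration only: built-in cars data stands in for the exercise data
cars_model <- lm(dist ~ speed, data = cars)
model_summary <- summary(cars_model)

# Coefficient table as a numeric matrix
# (columns: Estimate, Std. Error, t value, Pr(>|t|))
coef_table <- coef(model_summary)
coef_table["speed", "Estimate"]

# Goodness-of-fit values stored on the summary object
model_summary$r.squared     # multiple R-squared
model_summary$sigma         # residual standard error
model_summary$fstatistic    # F value with numerator and denominator df
```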




#### Predicting from the unemployment model
In this exercise, you will use your unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. You will also use your model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.

The predict() interface for lm models takes the form

predict(model, newdata)

You will use the ggplot2 package to make the plots, so you will add the prediction column to the unemployment data frame. You will plot outcome versus prediction and compare the points to the line that represents perfect predictions (that is, where the outcome equals the predicted value).

The ggplot2 command to plot a scatterplot of dframe\$outcome versus dframe\$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred, is as follows:

ggplot(dframe, aes(x = pred, y = outcome)) + 
     geom_point() + 
     geom_abline(color = "blue")

In [None]:
# newrates is in your workspace
newrates
#   male_unemployment
# 1                 5

# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model, unemployment)

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)

# Print it
pred
#        1 
# 4.906757

![linear_reg_1](./figures/linear_reg_1.png)
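
To see what predict() computes for a one-variable linear model, the prediction for newrates can be reproduced by hand from the fitted coefficients. This sketch hard-codes the rounded coefficient values printed by summary(unemployment_model) above; the variable names are chosen for illustration. The result should agree with the predict() output (4.906757) up to rounding.

```r
# Coefficients copied from the printed model summary (rounded values)
intercept <- 1.43411
slope <- 0.69453

# Male unemployment rate in newrates
male_rate <- 5

# Linear model prediction: intercept + slope * input
manual_pred <- intercept + slope * male_rate
manual_pred
```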

#### Multivariate linear regression - build the model
In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.
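
With more than one input, it can be convenient to build the formula from variable names stored as strings. The following is a minimal sketch of that approach; the helper names `outcome`, `inputs`, `fmla_string`, and `fmla_built` are assumptions for illustration, and writing the formula out literally works just as well.

```r
# Variable names taken from the exercise
outcome <- "blood_pressure"
inputs <- c("weight", "age")

# Paste the pieces into "blood_pressure ~ weight + age"
fmla_string <- paste(outcome, "~", paste(inputs, collapse = " + "))

# Convert the string to a formula object
fmla_built <- as.formula(fmla_string)
fmla_built

# List the variables the formula refers to (outcome first, then inputs)
all.vars(fmla_built)
```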

In [None]:
# bloodpressure is in the workspace
summary(bloodpressure)
#  blood_pressure       age            weight   
#  Min.   :128.0   Min.   :46.00   Min.   :167  
#  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
#  Median :153.0   Median :64.00   Median :194  
#  Mean   :150.1   Mean   :62.45   Mean   :195  
#  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
#  Max.   :168.0   Max.   :74.00   Max.   :220

# Create the formula and print it
fmla <- blood_pressure ~ weight + age
fmla
# blood_pressure ~ weight + age

# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model
# Call:
# lm(formula = fmla, data = bloodpressure)

# Coefficients:
# (Intercept)       weight          age  
#     30.9941       0.3349       0.8614

summary(bloodpressure_model)
# Call:
# lm(formula = fmla, data = bloodpressure)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -3.4640 -1.1949 -0.4078  1.8511  2.6981 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)   
# (Intercept)  30.9941    11.9438   2.595  0.03186 * 
# weight        0.3349     0.1307   2.563  0.03351 * 
# age           0.8614     0.2482   3.470  0.00844 **
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Residual standard error: 2.318 on 8 degrees of freedom
# Multiple R-squared:  0.9768,	Adjusted R-squared:  0.9711 
# F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07
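
Each coefficient is the change in predicted blood pressure for a one-unit increase in that input while the other input is held fixed. The sketch below applies the printed (rounded) coefficients to a hypothetical patient with weight 190 and age 60; these input values are invented for illustration and lie within the ranges shown by summary(bloodpressure).

```r
# Coefficients copied from the printed bloodpressure_model output (rounded)
b_intercept <- 30.9941
b_weight <- 0.3349
b_age <- 0.8614

# Hypothetical patient (illustrative values, not from the data set)
weight <- 190
age <- 60

# Prediction: intercept + sum of coefficient * input
manual_bp <- b_intercept + b_weight * weight + b_age * age
manual_bp

# Effect of one additional year of age with weight held fixed
bp_one_year_older <- b_intercept + b_weight * weight + b_age * (age + 1)
bp_one_year_older - manual_bp   # equals the age coefficient
```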

#### Multivariate linear regression - prediction
Now you will make predictions using the blood pressure model bloodpressure_model that you fit in the previous exercise.

You will also compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:

ggplot(dframe, aes(x = pred, y = outcome)) + 
     geom_point() + 
     geom_abline(color = "blue")

In [None]:
# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)

# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) +
    geom_point() +
    geom_abline(color = "blue")

![linear_reg_2](./figures/linear_reg_2.png)
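
The graphical comparison can be complemented with a numeric one. For a linear model with an intercept, evaluated on its own training data, the squared correlation between predictions and outcomes equals the R-squared reported by summary(). This sketch checks that on the built-in `cars` data, used here only as a stand-in for the exercise data; the names `cars_model` and `cars_pred` are illustrative.

```r
# Illustration only: built-in cars data stands in for the exercise data
cars_model <- lm(dist ~ speed, data = cars)

# Predictions on the training data
cars_pred <- predict(cars_model, cars)

# Squared correlation between predictions and actual outcomes
r2_from_cor <- cor(cars_pred, cars$dist)^2

# R-squared as reported by summary()
r2_from_summary <- summary(cars_model)$r.squared

# The two values agree up to floating-point error
all.equal(r2_from_cor, r2_from_summary)
```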