#### Graphically evaluate the unemployment model
In this exercise you will graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment.

You will plot the model's predictions against the actual female_unemployment. Recall that the command has the form:

```
ggplot(dframe, aes(x = pred, y = outcome)) + 
       geom_point() +  
       geom_abline()
```
Then you will calculate the residuals:

```
residuals <- actual outcome - predicted outcome
```

and plot the predictions against the residuals. The residual plot takes a slightly different form: you compare the residuals to the horizontal line y = 0 (using geom_hline()) rather than to the line y = x. The command will be provided.

The data frame unemployment and model unemployment_model are available in the workspace.
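
Because the exercise data is not reproduced here, the two plots can be sketched on a stand-in data set. The example below is only an illustration: it assumes the built-in cars data and a hypothetical model toy_model in place of unemployment and unemployment_model.

```r
library(ggplot2)

# Assumption: the built-in cars data stands in for the unemployment data
toy <- cars
toy_model <- lm(dist ~ speed, data = toy)

# Add predictions and residuals (actual outcome minus predicted outcome)
toy$predictions <- predict(toy_model)
toy$residuals <- toy$dist - toy$predictions

# Predictions vs. actual outcome, compared to the line y = x
p_pred <- ggplot(toy, aes(x = predictions, y = dist)) +
  geom_point() +
  geom_abline(color = "blue")

# Predictions vs. residuals, compared to the horizontal line y = 0
p_resid <- ggplot(toy, aes(x = predictions, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")

print(p_pred)
print(p_resid)
```

In a well-behaved fit the points scatter evenly around both reference lines, with no systematic curve in the residual plot.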

![linear_reg_3](./figures/linear_reg_3.png)

#### The gain curve to evaluate the unemployment model
In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model's predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

For situations where order is more important than exact values, the gain curve helps you check if the model's predictions sort in the same order as the true outcome.

Calls to the function GainCurvePlot() look like:

```
GainCurvePlot(frame, xvar, truthvar, title)
```
where

- frame is a data frame
- xvar and truthvar are strings naming the prediction and actual outcome columns of frame
- title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.
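
To build some intuition for the relative Gini coefficient, here is a rough base-R sketch. It is not the WVPlots implementation, and the helper names gain_area() and relative_gini_sketch() are made up for this illustration. It compares the area between the model's gain curve and the diagonal with the same area for a perfect ordering; the exact value WVPlots reports may differ slightly.

```r
# Hypothetical helper: approximate area between a gain curve and the diagonal.
# Assumes truth is non-negative with a positive sum.
gain_area <- function(sort_by, truth) {
  ord <- order(sort_by, decreasing = TRUE)       # highest score first
  n <- length(truth)
  frac_items <- seq_len(n) / n                   # fraction of rows seen so far
  frac_truth <- cumsum(truth[ord]) / sum(truth)  # fraction of outcome captured
  sum(frac_truth - frac_items) / n               # simple Riemann sum
}

# Hypothetical helper: model area relative to the area of a perfect ordering
relative_gini_sketch <- function(pred, truth) {
  gain_area(pred, truth) / gain_area(truth, truth)
}

truth <- c(1, 2, 3, 4, 5)
gini_same <- relative_gini_sketch(c(10, 20, 30, 40, 50), truth)      # same order
gini_reversed <- relative_gini_sketch(c(50, 40, 30, 20, 10), truth)  # reversed
c(same = gini_same, reversed = gini_reversed)
```

Predictions that sort the rows exactly like the true outcome give a ratio of 1, while the reversed ordering gives a negative value.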

```
# Load the package WVPlots
library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
```

![linear_reg_4](./figures/linear_reg_4.png)

#### Calculate RMSE
In this exercise you will calculate the RMSE of your unemployment model. In the previous coding exercises, you added two columns to the unemployment dataset:

- the model's predictions (predictions column)
- the residuals between the predictions and the outcome (residuals column)

You can calculate the RMSE from a vector of residuals, res, as:
```
RMSE=sqrt(mean(res^2))
```
You want RMSE to be small. How small is "small"? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.
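
One way to see why the standard deviation is a natural yardstick: it is (up to a small-sample factor) the RMSE of a "null" model that always predicts the mean of the outcome. The sketch below checks this on the built-in cars data, which is an assumption standing in for the exercise data.

```r
# Assumption: cars$dist stands in for the outcome column
y <- cars$dist
n <- length(y)

# RMSE of the null model that always predicts the mean
rmse_baseline <- sqrt(mean((y - mean(y))^2))

# sd() divides by n - 1 rather than n, hence the correction factor
sd_y <- sd(y)
same_up_to_factor <- all.equal(rmse_baseline, sd_y * sqrt((n - 1) / n))

# RMSE of an actual linear model on the same data, for comparison
rmse_model <- sqrt(mean(residuals(lm(dist ~ speed, data = cars))^2))

c(rmse_model = rmse_model, rmse_baseline = rmse_baseline, sd = sd_y)
```

A model whose RMSE is not clearly below the standard deviation is doing little better than predicting the average.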

```
# unemployment is in the workspace
summary(unemployment)

# For convenience put the residuals (actual minus predicted) in the variable res
res <- unemployment$female_unemployment - unemployment$predictions

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
# [1] 0.5337612

# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
# [1] 1.314271
```

#### Calculate R-Squared
Now that you've calculated the RMSE of your model's predictions, you will examine how well the model fits the data: that is, how much of the variance it explains. You can do this using R^2.

Suppose y is the true outcome, p is the prediction from the model, and res=y−p are the residuals of the predictions.

Then the total sum of squares tss ("total variance") of the data is:

```
tss = sum((y - y_bar)^2)
```
where y_bar is the mean value of y.

The residual sum of squared errors of the model, rss, is:
```
rss=sum(res^2)
```
R^2 (R-Squared), the "variance explained" by the model, is then:

1 − rss/tss

After you calculate R^2, you will compare what you computed with the R^2 reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the R^2 of the model on the training data.

The data frame unemployment is in your workspace, with the columns predictions and residuals that you calculated in a previous exercise.
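
The same calculation can be tried end to end on a data set that ships with R. This is a sketch only: it assumes the built-in cars data and checks the hand-computed value against summary() from base R instead of glance().

```r
# Assumption: cars stands in for the unemployment data
toy_model <- lm(dist ~ speed, data = cars)

y <- cars$dist           # true outcome
p <- predict(toy_model)  # predictions on the training data
res <- y - p             # residuals

tss <- sum((y - mean(y))^2)  # total sum of squares
rss <- sum(res^2)            # residual sum of squares
rsq <- 1 - rss / tss         # R-squared by hand

# R-squared as reported by base R for the same model
rsq_lm <- summary(toy_model)$r.squared
all.equal(rsq, rsq_lm)
```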

```
# glance() comes from the broom package
library(broom)

# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
# [1] 5.569231

# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
# [1] 20.72769

# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals^2))
# [1] 3.703714

# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss / tss)
# [1] 0.8213157

# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
# [1] 0.8213157
```

#### Correlation and R-squared
The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:

- the predictions of a regression model that minimizes squared-error (like linear regression) and
- the true outcomes of the training data

then the square of the correlation is the same as R^2. You will verify that in this exercise.
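
The condition on the model matters. The sketch below, which assumes the built-in cars data as a stand-in, first confirms the equality for least-squares predictions on the training data, then distorts the predictions with an arbitrary linear transformation: the correlation is unchanged, but 1 − rss/tss is not.

```r
# Assumption: cars stands in for the exercise data
y <- cars$dist
p <- predict(lm(dist ~ speed, data = cars))

# Hypothetical helper: R-squared computed as 1 - rss/tss
rsq_from <- function(pred, y) 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

# Least-squares predictions: squared correlation equals R-squared
rho2 <- cor(p, y)^2
rsq <- rsq_from(p, y)
all.equal(rho2, rsq)

# Distorted predictions: same correlation, worse R-squared
p_bad <- 2 * p + 3
rho2_bad <- cor(p_bad, y)^2
rsq_bad <- rsq_from(p_bad, y)
c(rho2 = rho2, rsq = rsq, rho2_bad = rho2_bad, rsq_bad = rsq_bad)
```

Correlation ignores shifts and rescaling of the predictions, so it only coincides with R^2 when the predictions are already the least-squares fit on the data being scored.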

```
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))

# Square rho: rho2 and print it
(rho2 <- rho^2)
# [1] 0.8213157

# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
# [1] 0.8213157
```

#### Generating a random test/train split
For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a data set dframe of size N, and you want a random subset of approximately size 100·X% of N (where X is between 0 and 1), then:

- Generate a vector of uniform random numbers: gp = runif(N).
- dframe[gp < X,] will be about the right size.
- dframe[gp >= X,] will be the complement.
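
The recipe above can be sketched on any data frame. The example below assumes the built-in iris data as a stand-in, and adds a set.seed() call (not part of the recipe) so that the split is reproducible.

```r
set.seed(123)  # assumption: fixed seed, only for reproducibility

dframe <- iris     # stand-in data frame
N <- nrow(dframe)
X <- 0.75          # target fraction for the first subset

# One uniform random number per row
gp <- runif(N)

# Rows with gp below X form the subset; the rest form the complement
train <- dframe[gp < X, ]
test <- dframe[gp >= X, ]

c(train = nrow(train), test = nrow(test))
```

Because each row is assigned independently, the subset size is only approximately X·N and changes from run to run unless the seed is fixed.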

```
# mpg is in the workspace
summary(mpg)
#  manufacturer          model               displ            year     
#  Length:234         Length:234         Min.   :1.600   Min.   :1999  
#  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
#  Mode  :character   Mode  :character   Median :3.300   Median :2004  
#                                        Mean   :3.472   Mean   :2004  
#                                        3rd Qu.:4.600   3rd Qu.:2008  
#                                        Max.   :7.000   Max.   :2008  
#       cyl           trans               drv                 cty       
#  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
#  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
#  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
#  Mean   :5.889                                         Mean   :16.86  
#  3rd Qu.:8.000                                         3rd Qu.:19.00  
#  Max.   :8.000                                         Max.   :35.00  
#       hwy             fl               class          
#  Min.   :12.00   Length:234         Length:234        
#  1st Qu.:18.00   Class :character   Class :character  
#  Median :24.00   Mode  :character   Mode  :character  
#  Mean   :23.44                                        
#  3rd Qu.:27.00                                        
#  Max.   :44.00
dim(mpg)
# [1] 234  11

# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
# [1] 234

# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))

# Create the vector of N uniform random variables: gp
gp <- runif(N)

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
# [1] 175
nrow(mpg_test)
# [1] 59
```

#### Train a model using test/train split
Now that you have split the mpg dataset into mpg_train and mpg_test, you will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

```
# mpg_train is in the workspace
summary(mpg_train)
#  manufacturer          model               displ            year     
#  Length:180         Length:180         Min.   :1.600   Min.   :1999  
#  Class :character   Class :character   1st Qu.:2.500   1st Qu.:1999  
#  Mode  :character   Mode  :character   Median :3.400   Median :2008  
#                                        Mean   :3.558   Mean   :2004  
#                                        3rd Qu.:4.600   3rd Qu.:2008  
#                                        Max.   :7.000   Max.   :2008  
#       cyl           trans               drv                 cty       
#  Min.   :4.000   Length:180         Length:180         Min.   : 9.00  
#  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
#  Median :6.000   Mode  :character   Mode  :character   Median :16.00  
#  Mean   :6.022                                         Mean   :16.58  
#  3rd Qu.:8.000                                         3rd Qu.:19.00  
#  Max.   :8.000                                         Max.   :33.00  
#       hwy             fl               class          
#  Min.   :12.00   Length:180         Length:180        
#  1st Qu.:18.00   Class :character   Class :character  
#  Median :24.00   Mode  :character   Mode  :character  
#  Mean   :23.11                                        
#  3rd Qu.:27.00                                        
#  Max.   :44.00

# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
# cty ~ hwy

# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, mpg_train)

# Use summary() to examine the model
summary(mpg_model)
# Call:
# lm(formula = fmla, data = mpg_train)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.8400 -0.8305 -0.1551  0.5865  4.8140 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.13375    0.38309   2.959   0.0035 ** 
# hwy          0.66825    0.01608  41.564   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Residual standard error: 1.251 on 178 degrees of freedom
# Multiple R-squared:  0.9066,	Adjusted R-squared:  0.9061 
# F-statistic:  1728 on 1 and 178 DF,  p-value: < 2.2e-16
```

#### Evaluate a model using test/train split
Now you will test the model mpg_model on the test data, mpg_test. The functions rmse() and r_squared(), which calculate RMSE and R-squared, have been provided for convenience:

```
rmse(predcol, ycol)
r_squared(predcol, ycol)
```
where:

- predcol: The predicted values
- ycol: The actual outcome

You will also plot the predictions vs. the outcome.

Generally, model performance is better on the training data than the test data (though sometimes the test set "gets lucky"). A slight difference in performance is okay; if performance on the training data is substantially better, the model is probably overfitting the training data.
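
The text does not show how the provided rmse() and r_squared() helpers are defined. The definitions below are an assumption: one plausible implementation that matches the call signatures above and the RMSE and R-squared formulas from the earlier exercises.

```r
# Assumed definition: root mean squared error of predictions vs. outcome
rmse <- function(predcol, ycol) {
  res <- ycol - predcol
  sqrt(mean(res^2))
}

# Assumed definition: R-squared as 1 - rss/tss
r_squared <- function(predcol, ycol) {
  tss <- sum((ycol - mean(ycol))^2)  # total sum of squares
  rss <- sum((ycol - predcol)^2)     # residual sum of squares
  1 - rss / tss
}

# Tiny check on made-up numbers
toy_pred <- c(1, 2, 4)
toy_y <- c(1, 2, 3)
rmse(toy_pred, toy_y)
r_squared(toy_pred, toy_y)
```

Note that r_squared() computed this way on test data can be negative when the model predicts worse than the test-set mean.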

```
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, mpg_train)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, mpg_test)

# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
# [1] 1.243958
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
# [1] 1.277228

# Evaluate the r-squared on both training and test data and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
# [1] 0.9065908
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
# [1] 0.9251412

# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()
```

![linear_reg_5](./figures/linear_reg_5.png)

#### Create a cross validation plan
There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

```
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
```
where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren't used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

- train: the indices of dframe that will form the training set
- app: the indices of dframe that will form the test (or application) set

In this exercise you will create a 3-fold cross-validation plan for the data set mpg.
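
To make the structure of such a plan concrete, here is a base-R sketch that builds a list of the same shape. It is not how vtreat implements kWayCrossValidation(); the helper name make_kway_plan() is hypothetical and the fold assignment is just one simple choice.

```r
# Hypothetical helper: build a k-way plan as a list of train/app index vectors
make_kway_plan <- function(n_rows, n_splits) {
  # Assign each row to one fold, in random order, with near-equal fold sizes
  fold_id <- sample(rep(seq_len(n_splits), length.out = n_rows))
  lapply(seq_len(n_splits), function(i) {
    list(train = which(fold_id != i),  # rows used to fit the model
         app   = which(fold_id == i))  # rows held out for prediction
  })
}

set.seed(2024)  # assumption: fixed seed, only for reproducibility
plan <- make_kway_plan(234, 3)
str(plan)
```

Every row index appears in exactly one app vector, so each row is predicted by a model that never saw it during fitting.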

```
# Load the package vtreat
library(vtreat)

# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)

# Examine the split plan
str(splitPlan)
# List of 3
#  $ :List of 2
#   ..$ train: int [1:156] 2 3 4 6 8 9 12 13 14 15 ...
#   ..$ app  : int [1:78] 189 208 125 155 181 97 173 148 227 45 ...
#  $ :List of 2
#   ..$ train: int [1:156] 1 3 4 5 7 8 9 10 11 13 ...
#   ..$ app  : int [1:78] 180 183 91 34 166 143 100 195 153 25 ...
#  $ :List of 2
#   ..$ train: int [1:156] 1 2 5 6 7 10 11 12 15 17 ...
#   ..$ app  : int [1:78] 105 178 76 168 22 158 101 129 14 188 ...
#  - attr(*, "splitmethod")= chr "kwaycross"
```

#### Evaluate a modeling procedure using n-fold cross-validation
In this exercise you will use splitPlan, the 3-fold cross validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

```
# Initialize a column of the appropriate length
dframe$pred.cv <- 0 

# k is the number of folds
# splitPlan is the cross validation plan

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data 
  # from this split 
  # (lm, in this case)
  model <- lm(fmla, data = dframe[split$train,])

  # make predictions on the 
  # application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}
```

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.
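
For readers without the exercise workspace, the comparison between cross-validation and training performance can be sketched with base R alone. The example assumes the built-in cars data and a simple random fold assignment in place of a vtreat split plan.

```r
set.seed(42)  # assumption: fixed seed so the folds are reproducible

dframe <- cars  # stand-in data: predict dist from speed
k <- 3          # number of folds

# Randomly assign each row to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(dframe)))

# Cross-validation predictions: each row is predicted by a model
# fit on the other folds
dframe$pred.cv <- 0
for (i in seq_len(k)) {
  train_rows <- which(fold_id != i)
  app_rows <- which(fold_id == i)
  model <- lm(dist ~ speed, data = dframe[train_rows, ])
  dframe$pred.cv[app_rows] <- predict(model, newdata = dframe[app_rows, ])
}

# Predictions from a model fit on all the data
dframe$pred <- predict(lm(dist ~ speed, data = dframe))

rmse_train <- sqrt(mean((dframe$pred - dframe$dist)^2))
rmse_cv <- sqrt(mean((dframe$pred.cv - dframe$dist)^2))
c(train = rmse_train, cv = rmse_cv)
```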

```
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train,])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app,])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
# [1] 1.247045

# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
# [1] 1.260323
```