Linear Regression Solutions.Rmd

---
title: "Linear Regression"
author: "Dr. Christian Haas"
date: "August 28, 2020"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# R Linear Regression

This is an overview of how to run and analyze linear regression in R. We will cover simple linear regression, multiple regression, analysis of assumptions, and including qualitative predictors.

## Simple Linear Regression

Simple Linear Regression is the simplest form of regression. It consists of one predictor variable x and one target variable y.

The 'lm' function in R is used to calculate both simple and multiple linear regression models if y is a numerical/continuous/quantitative variable.

On another note: We will start using the tidyverse packages to build clean and reproducible data 'pipelines', as well as the caret package to use a consistent interface for machine learning tasks. While we only use the full functionality of caret in later sections (e.g., resampling techniques), the general processes are the same: prepare/pre-process the data, define the model, train the model, evaluate the model.

```{r simple regression}

# set your working directory to the current file directory 
tryCatch({
  library(rstudioapi)
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
  }, error=function(cond){message(paste("cannot change working directory"))
})

# first, load some required libraries
library(MASS) # large collection of data sets and functions
library(ISLR)
library(ggplot2)

# we also load a package for a systematic and clean interface to build machine learning models
library(caret)
library(tidyverse)
library(tidymodels)

# Simple Linear Regression

#dataset <- Boston # Boston is a pre-loaded dataset provided by the MASS package

# print out the column names of the data set
names(Carseats)

# as it's an imported data set from the MASS library, we can use following command to get additional information about the variables
#?Boston

## Let's build a simple linear regression model.
# we want to see if the variable lstat (x, lower status of the population) significantly affects medv (y, median home value)
# we use the caret package functionality. you will see in later sections that this will make it much easier to specify additional steps in model training and evaluation

## We will use a pipeline approach consisting of following steps:
# 1) data pre-processing, including feature engineering (optional)
# 2) specifying the model and additional processing steps (required)
# 3) train the model (required)
# 4) evaluate the model (required)
# 5) use the model to predict 'new' data (required)

# we also introduce the trainControl parameter which lets us set a resampling strategy. for now, we use 'none'

# step 1: we use the Boston dataset directly
# step 2: specify model
target_var <- 'Sales'
model_form <- Sales ~ Price
model_type <- 'lm'

# here, we add a pre-processing step that removes variables with zero variance (zv), i.e., constant variables
model_recipe <- recipe(model_form, data = Carseats) %>%  
  step_zv(all_predictors())

# step 3: specify training parameters and train

# we're not using any resampling for now
trControl <- trainControl(method = 'none')

# let's fit the model
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)

# step 4: evaluate the model
# calling the summary function on the lm object will give you the basic regression output
summary(lm_fit)
lm_fit$finalModel

# to get the names and coefficients of the output lm model, use these functions
names(lm_fit$finalModel)
coef(lm_fit$finalModel)

# step 5: get the predictions and evaluate the model performance
# we can predict the regression output for new values of x by using the generic 'predict' function 

lm_pred <- lm_fit %>% predict(newdata = Carseats)

# we can get the MSE and RMSE from the model output:
postResample(lm_pred, Carseats[[target_var]])
# a note: caret uses a slightly different version to calculate R2, namely it calculates the correlation between observed and predicted values and squares this correlation r. this can lead to slight differences in R2 compared to what we're used to from the lm() models. However, for all intents and purposes of this class, we don't need to worry about these small differences

# to get the confidence intervals for the parameters, you can use the confint functions
confint(lm_fit$finalModel)

# sidenote: note that when we want to predict new data points, we have to distinguish between confidence intervals 
lm_fit$finalModel %>% predict(newdata = data.frame(Price=(c(7,12))), interval="confidence") # confidence is the for actual values of the parameters
lm_fit$finalModel %>% predict(newdata = data.frame(Price=(c(7,12))), interval="prediction") # prediction intervals are larger than confidence intervals, as we have additional uncertainty when predicting new data


## let's use some graphical plots

# start with a simple scatterplot and a regression line. Use ggplot for nicer graphs 
Carseats %>% ggplot(aes(x = Price, y = Sales)) + geom_point() + xlab("Price") + ylab ("Sales")


Carseats %>% ggplot(aes(x = Price, y = Sales)) + geom_point() + xlab("Price") + ylab ("Sales") + geom_smooth(method='lm')

# to plot the relevant diagnostics plots, use the plot function on the lm object
library(ggfortify)
autoplot(lm_fit$finalModel)

```

## Extension to Multiple Linear Regression and Interaction Terms

In R, extending a simple to a multiple regression is easy. We simply include additional predictor variables in our model formulation.

Note that '.' can be used to indicate all the columns that are not explicitly specified. This can be helpful when working with data sets with many columns, at least if all of them need to be included as predictor variables.

```{r multiple regression}
# build a model with 2 predictors
trControl <- trainControl(method = 'none')

target_var <- 'Sales'
model_form <- Sales ~ Price+Urban+US
model_type <- 'lm'

model_recipe <- recipe(model_form, data = Carseats) %>%  step_zv(all_predictors())

# step 3: specify training parameters and train
# let's fit the model
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)

# step 4: evaluate the model
summary(lm_fit)


# build a model using all columns other than 'Sales' as predictor variables
model_form <- Sales ~ .

model_recipe <- recipe(model_form, data = Carseats) %>%  step_zv(all_predictors())

# step 3: specify training parameters and train
# let's fit the model
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)

# step 4: evaluate the model
summary(lm_fit)

# we can also use the vif (variance inflation factor) function to see which predictors have the most influence in the regression. the built-in version of varImp() in caret uses the t statistics for regression
lm_fit_importance <- varImp(lm_fit, scale = FALSE) # scale=TRUE scales everything between 0 and 100, where 100 indicates the highest importance 
plot(lm_fit_importance)

lm_fit_importance <- varImp(lm_fit)
plot(lm_fit_importance)

# if we want to manually exclude a non-significant variable, we can do it by removing the variable in the recipe:
model_form <- Sales ~ .

#step_rm removes one or more variables
model_recipe <- recipe(model_form, data = Carseats) %>%
  step_zv(all_predictors()) %>%
  step_rm(age)

lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)

summary(lm_fit)

## Interaction Terms

# we can include interaction terms between two variables by using * instead of +
model_form <- Sales ~ .

model_recipe <- recipe(model_form, data = Carseats) %>%
  step_interact(terms = ~ Price * Age) %>%
  step_zv(all_predictors())

lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)

summary(lm_fit)

```

## In-class Exercise 1:

We are going to build a regression model using the WinPercentage as dependent variable and the other two variables OffPassYds and DefYds as predictor variables.

```{r exercise 1}

# load the NFl.csv dataset
dataset <- read_csv("NFL.csv")

# build a regression model as described above
target_var <- 'WinPercentage'
model_form <- WinPercentage ~ OffPassYds + DefYds
model_type <- 'lm'

model_recipe <- recipe(model_form, data = dataset) %>%  step_zv(all_predictors())

# step 3: specify training parameters and train
# let's fit the model
lm_fit <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)

# step 4: evaluate the model
summary(lm_fit)

lm_pred <- lm_fit %>% predict(newdata = dataset)
postResample(lm_pred, dataset[[target_var]])

# plot the diagnostic plots. Do you see any issues?
autoplot(lm_fit$finalModel)

```


## Non-linear Transformations

An easy way to use regression for non-linear applications is through transformations of the input data. Two common transformations are polynomial or logarithmic transformations.

```{r non-linear transformations}
# Non-linear Transformations of the Predictors

# to include polynomial terms, we can use the step_poly function in the recipe

dataset <- Boston
target_var <- 'medv'
model_form <- medv ~ lstat
model_type <- 'lm'

model_recipe <- recipe(model_form, data = dataset) %>%
  step_zv(all_predictors()) %>%
  step_poly(lstat, degree=2)

# step 3: specify training parameters and train

# we're not using any resampling for now
trControl <- trainControl(method = 'none')

# let's fit the model
lm_fit_poly <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)

# step 4: evaluate the model
# calling the summary function on the lm object will give you the basic regression output
summary(lm_fit_poly)

# step 5: get the predictions and evaluate the model performance
# we can predict the regression output for new values of x by using the generic 'predict' function 
lm_pred_poly <- lm_fit_poly %>%  predict(newdata = dataset)
# we can get the MSE and RMSE from the model output:
postResample(lm_pred_poly, dataset[[target_var]])


# we can actually use the anova function to see if adding the quadratic variable significantly decreases the RSS
model_recipe <- recipe(model_form, data = dataset) %>%
  step_zv(all_predictors()) 

lm_fit <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)

anova(lm_fit$finalModel, lm_fit_poly$finalModel) # if the result is significant, it means that the models lead to significantly different explained variances

# plot the diagnostics
autoplot(lm_fit_poly$finalModel)

# alternative formulation for higher-level polynomials
model_recipe <- recipe(model_form, data = dataset) %>%
  step_zv(all_predictors()) %>%
  step_poly(lstat, degree=5)

# step 3: specify training parameters and train

# we're not using any resampling for now
trControl <- trainControl(method = 'none')

# let's fit the model
lm_fit_poly <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)
summary(lm_fit_poly)

# last but not least, the logarithmic transformation is often useful for non-linear variables

model_recipe_log <- recipe(model_form, data = dataset) %>%
  step_zv(all_predictors()) %>%
  step_log(lstat)

# step 3: specify training parameters and train
# we're not using any resampling for now
trControl <- trainControl(method = 'none')
# let's fit the model
lm_fit_log <- train(model_recipe_log, data = dataset, method = model_type, trControl = trControl)
summary(lm_fit_log)

# we can plot the models and compare them
Boston %>% ggplot(mapping = aes(x = medv, y = lstat)) + geom_point() + 
  geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
  geom_smooth(method = "lm", se=FALSE, color="blue", formula = y ~ poly(x, 2, raw = TRUE)) +
  geom_smooth(method = "lm", se=FALSE, color="red", formula = y ~ poly(x, 5, raw = TRUE)) +
  geom_smooth(method = "lm", se=FALSE, color="green", formula = y ~ log(x)) 
  

```

## In-class Exercise 2:

We are going to build a regression model using Score (y) as a function of the Price and Type of restaurant.

```{r exercise 2}
# load the RestaurantRatings.csv dataset
dataset <- read_csv("RestaurantRatings.csv")

library(GGally)
dataset %>% select(-c("Restaurant")) %>% ggpairs()

# build a regression model as described above
target_var <- 'Score'
model_form <- Score ~ Type + Price
model_type <- 'lm'

model_recipe <- recipe(model_form, data = dataset) %>%
  step_zv(all_predictors())

# step 3: specify training parameters and train
# we're not using any resampling for now
trControl <- trainControl(method = 'none')

# let's fit the model
lm_fit <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)

# which variables, if any, are significant? how much variance is explained by the model?
summary(lm_fit)

# plot the diagnostic plots. Do you see any issues?
autoplot(lm_fit$finalModel)

# try a non-linear transformation on a variable. Does it increase the model performance? Do the diagnostic plots look better now?
model_recipe <- recipe(model_form, data = dataset) %>%
  step_log(Price) %>%
  step_zv(all_predictors())

# which variables, if any, are significant? how much variance is explained by the model?
lm_fit_log <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)

# which variables, if any, are significant? how much variance is explained by the model?
summary(lm_fit_log)

autoplot(lm_fit_log$finalModel)

```