-
Notifications
You must be signed in to change notification settings - Fork 0
/
Linear Regression Solutions.Rmd
360 lines (243 loc) · 13.5 KB
/
Linear Regression Solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
---
title: "Linear Regression"
author: "Dr. Christian Haas"
date: "August 28, 2020"
output: html_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# R Linear Regression
This is an overview of how to run and analyze linear regression in R. We will cover simple linear regression, multiple regression, analysis of assumptions, and including qualitative predictors.
## Simple Linear Regression
Simple Linear Regression is the simplest form of regression. It consists of one predictor variable x and one target variable y.
The 'lm' function in R is used to calculate both simple and multiple linear regression models if y is a numerical/continuous/quantitative variable.
On another note: We will start using the tidyverse packages to build clean and reproducible data 'pipelines', as well as the caret package to use a consistent interface for machine learning tasks. While we only use the full functionality of caret in later sections (e.g., resampling techniques), the general processes are the same: prepare/pre-process the data, define the model, train the model, evaluate the model.
```{r simple regression}
# set your working directory to the current file directory
tryCatch({
library(rstudioapi)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
}, error=function(cond){message(paste("cannot change working directory"))
})
# first, load some required libraries
library(MASS) # large collection of data sets and functions
library(ISLR)
library(ggplot2)
# we also load a package for a systematic and clean interface to build machine learning models
library(caret)
library(tidyverse)
library(tidymodels)
# Simple Linear Regression
#dataset <- Boston # Boston is a pre-loaded dataset provided by the MASS package
# print out the column names of the data set
names(Carseats)
# as it's an imported data set from the MASS library, we can use following command to get additional information about the variables
#?Boston
## Let's build a simple linear regression model.
# we want to see if the variable lstat (x, lower status of the population) significantly affects medv (y, median home value)
# we use the caret package functionality. you will see in later sections that this will make it much easier to specify additional steps in model training and evaluation
## We will use a pipeline approach consisting of following steps:
# 1) data pre-processing, including feature engineering (optional)
# 2) specifying the model and additional processing steps (required)
# 3) train the model (required)
# 4) evaluate the model (required)
# 5) use the model to predict 'new' data (required)
# we also introduce the trainControl parameter which lets us set a resampling strategy. for now, we use 'none'
# step 1: we use the Boston dataset directly
# step 2: specify model
target_var <- 'Sales'
model_form <- Sales ~ Price
model_type <- 'lm'
# here, we add a pre-processing step that removes variables with zero variance (zv), i.e., constant variables
model_recipe <- recipe(model_form, data = Carseats) %>%
step_zv(all_predictors())
# step 3: specify training parameters and train
# we're not using any resampling for now
trControl <- trainControl(method = 'none')
# let's fit the model
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)
# step 4: evaluate the model
# calling the summary function on the lm object will give you the basic regression output
summary(lm_fit)
lm_fit$finalModel
# to get the names and coefficients of the output lm model, use these functions
names(lm_fit$finalModel)
coef(lm_fit$finalModel)
# step 5: get the predictions and evaluate the model performance
# we can predict the regression output for new values of x by using the generic 'predict' function
lm_pred <- lm_fit %>% predict(newdata = Carseats)
# we can get the MSE and RMSE from the model output:
postResample(lm_pred, Carseats[[target_var]])
# a note: caret uses a slightly different version to calculate R2, namely it calculates the correlation between observed and predicted values and squares this correlation r. this can lead to slight differences in R2 compared to what we're used to from the lm() models. However, for all intents and purposes of this class, we don't need to worry about these small differences
# to get the confidence intervals for the parameters, you can use the confint functions
confint(lm_fit$finalModel)
# sidenote: note that when we want to predict new data points, we have to distinguish between confidence intervals
lm_fit$finalModel %>% predict(newdata = data.frame(Price=(c(7,12))), interval="confidence") # confidence is the for actual values of the parameters
lm_fit$finalModel %>% predict(newdata = data.frame(Price=(c(7,12))), interval="prediction") # prediction intervals are larger than confidence intervals, as we have additional uncertainty when predicting new data
## let's use some graphical plots
# start with a simple scatterplot and a regression line. Use ggplot for nicer graphs
Carseats %>% ggplot(aes(x = Price, y = Sales)) + geom_point() + xlab("Price") + ylab ("Sales")
Carseats %>% ggplot(aes(x = Price, y = Sales)) + geom_point() + xlab("Price") + ylab ("Sales") + geom_smooth(method='lm')
# to plot the relevant diagnostics plots, use the plot function on the lm object
library(ggfortify)
autoplot(lm_fit$finalModel)
```
## Extension to Multiple Linear Regression and Interaction Terms
In R, extending a simple to a multiple regression is easy. We simply include additional predictor variables in our model formulation.
Note that '.' can be used to indicate all the columns that are not explicitly specified. This can be helpful when working with data sets with many columns, at least if all of them need to be included as predictor variables.
```{r multiple regression}
# build a model with 2 predictors
trControl <- trainControl(method = 'none')
target_var <- 'Sales'
model_form <- Sales ~ Price+Urban+US
model_type <- 'lm'
model_recipe <- recipe(model_form, data = Carseats) %>% step_zv(all_predictors())
# step 3: specify training parameters and train
# let's fit the model
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)
# step 4: evaluate the model
summary(lm_fit)
# build a model using all columns other than 'Sales' as predictor variables
model_form <- Sales ~ .
model_recipe <- recipe(model_form, data = Carseats) %>% step_zv(all_predictors())
# step 3: specify training parameters and train
# let's fit the model
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)
# step 4: evaluate the model
summary(lm_fit)
# we can also use the vif (variance inflation factor) function to see which predictors have the most influence in the regression. the built-in version of varImp() in caret uses the t statistics for regression
lm_fit_importance <- varImp(lm_fit, scale = FALSE) # scale=TRUE scales everything between 0 and 100, where 100 indicates the highest importance
plot(lm_fit_importance)
lm_fit_importance <- varImp(lm_fit)
plot(lm_fit_importance)
# if we want to manually exclude a non-significant variable, we can do it by removing the variable in the recipe:
model_form <- Sales ~ .
#step_rm removes one or more variables
model_recipe <- recipe(model_form, data = Carseats) %>%
step_zv(all_predictors()) %>%
step_rm(age)
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)
summary(lm_fit)
## Interaction Terms
# we can include interaction terms between two variables by using * instead of +
model_form <- Sales ~ .
model_recipe <- recipe(model_form, data = Carseats) %>%
step_interact(terms = ~ Price * Age) %>%
step_zv(all_predictors())
lm_fit <- train(model_recipe, data = Carseats, method = model_type, trControl = trControl)
summary(lm_fit)
```
## In-class Exercise 1:
We are going to build a regression model using the WinPercentage as dependent variable and the other two variables OffPassYds and DefYds as predictor variables.
```{r exercise 1}
# load the NFl.csv dataset
dataset <- read_csv("NFL.csv")
# build a regression model as described above
target_var <- 'WinPercentage'
model_form <- WinPercentage ~ OffPassYds + DefYds
model_type <- 'lm'
model_recipe <- recipe(model_form, data = dataset) %>% step_zv(all_predictors())
# step 3: specify training parameters and train
# let's fit the model
lm_fit <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)
# step 4: evaluate the model
summary(lm_fit)
lm_pred <- lm_fit %>% predict(newdata = dataset)
postResample(lm_pred, dataset[[target_var]])
# plot the diagnostic plots. Do you see any issues?
autoplot(lm_fit$finalModel)
```
## Non-linear Transformations
An easy way to use regression for non-linear applications is through transformations of the input data. Two common transformations are polynomial or logarithmic transformations.
```{r non-linear transformations}
# Non-linear Transformations of the Predictors
# to include polynomial terms, we can use the step_poly function in the recipe
dataset <- Boston
target_var <- 'medv'
model_form <- medv ~ lstat
model_type <- 'lm'
model_recipe <- recipe(model_form, data = dataset) %>%
step_zv(all_predictors()) %>%
step_poly(lstat, degree=2)
# step 3: specify training parameters and train
# we're not using any resampling for now
trControl <- trainControl(method = 'none')
# let's fit the model
lm_fit_poly <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)
# step 4: evaluate the model
# calling the summary function on the lm object will give you the basic regression output
summary(lm_fit_poly)
# step 5: get the predictions and evaluate the model performance
# we can predict the regression output for new values of x by using the generic 'predict' function
lm_pred_poly <- lm_fit_poly %>% predict(newdata = dataset)
# we can get the MSE and RMSE from the model output:
postResample(lm_pred_poly, dataset[[target_var]])
# we can actually use the anova function to see if adding the quadratic variable significantly decreases the RSS
model_recipe <- recipe(model_form, data = dataset) %>%
step_zv(all_predictors())
lm_fit <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)
anova(lm_fit$finalModel, lm_fit_poly$finalModel) # if the result is significant, it means that the models lead to significantly different explained variances
# plot the diagnostics
autoplot(lm_fit_poly$finalModel)
# alternative formulation for higher-level polynomials
model_recipe <- recipe(model_form, data = dataset) %>%
step_zv(all_predictors()) %>%
step_poly(lstat, degree=5)
# step 3: specify training parameters and train
# we're not using any resampling for now
trControl <- trainControl(method = 'none')
# let's fit the model
lm_fit_poly <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)
summary(lm_fit_poly)
# last but not least, the logarithmic transformation is often useful for non-linear variables
model_recipe_log <- recipe(model_form, data = dataset) %>%
step_zv(all_predictors()) %>%
step_log(lstat)
# step 3: specify training parameters and train
# we're not using any resampling for now
trControl <- trainControl(method = 'none')
# let's fit the model
lm_fit_log <- train(model_recipe_log, data = dataset, method = model_type, trControl = trControl)
summary(lm_fit_log)
# we can plot the models and compare them
Boston %>% ggplot(mapping = aes(x = medv, y = lstat)) + geom_point() +
geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
geom_smooth(method = "lm", se=FALSE, color="blue", formula = y ~ poly(x, 2, raw = TRUE)) +
geom_smooth(method = "lm", se=FALSE, color="red", formula = y ~ poly(x, 5, raw = TRUE)) +
geom_smooth(method = "lm", se=FALSE, color="green", formula = y ~ log(x))
```
## In-class Exercise 2:
We are going to build a regression model using Score (y) as a function of the Price and Type of restaurant.
```{r exercise 2}
# load the RestaurantRatings.csv dataset
dataset <- read_csv("RestaurantRatings.csv")
library(GGally)
dataset %>% select(-c("Restaurant")) %>% ggpairs()
# build a regression model as described above
target_var <- 'Score'
model_form <- Score ~ Type + Price
model_type <- 'lm'
model_recipe <- recipe(model_form, data = dataset) %>%
step_zv(all_predictors())
# step 3: specify training parameters and train
# we're not using any resampling for now
trControl <- trainControl(method = 'none')
# let's fit the model
lm_fit <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)
# which variables, if any, are significant? how much variance is explained by the model?
summary(lm_fit)
# plot the diagnostic plots. Do you see any issues?
autoplot(lm_fit$finalModel)
# try a non-linear transformation on a variable. Does it increase the model performance? Do the diagnostic plots look better now?
model_recipe <- recipe(model_form, data = dataset) %>%
step_log(Price) %>%
step_zv(all_predictors())
# which variables, if any, are significant? how much variance is explained by the model?
lm_fit_log <- train(model_recipe, data = dataset, method = model_type, trControl = trControl)
# which variables, if any, are significant? how much variance is explained by the model?
summary(lm_fit_log)
autoplot(lm_fit_log$finalModel)
```