Data on driving incidents was collected for all 52 states in the United states. This lab will explore whether or not state-level factors contribute to Car insurance premiums. The original goal of this dataset was to determine which state had the worst drivers.
Instead of a private repo per student, we're going to work collaboratively in a single repository. You can work by yourself or with another student in a team.
On the GitHub repository page for this lab, click "Fork" in the top right corner. This will create a copy of the lab repository in your own account. You will then clone this repository to RStudio. If you're working in a team, only one of you needs to fork the repository. Once you have cloned the repository, create a new .Rmd file with a name like "Lab02_teamname.Rmd", specific to your team. In that R Markdown file, complete the Lab Tasks and Discussion items outlined below. Then commit and push your work to GitHub. Your work will go to your forked version of the repository. Once you're ready, submit a pull request to merge your work back into the main class repository.
You must submit your pull request by 2019-09-27 at 23:59:59.
-
Read in the data set
data/bad-drivers.csv
- Name your dataset, for example,
badDrivers <- read.csv("./data/bad-drivers.csv")
- (recommended) rename the columns to shorter nicknames
- Name your dataset, for example,
-
Exploratory data analysis
- Create a draftsman plot showing all pairwise comparisons between columns in the data.
- Present a brief description of trends you see in the data, and how they may influence fitting a model.
- Plot the estimated distribution of Percentage of Drivers Involved in Fatal Collisions who were speeding using a histogram.
- If you include the above covariate as an explanatory variable in your regression (part of the X), will the distribution impact your model fit?
-
Regression analysis
- The target variable for our regression models is
Car Insurance Premiums (CIP)
, measured in dollars. - Pick a covariate you feel is most related to Car Insurance Premiums (I'll call this
M
for most related) - Fit a simple linear regression
(lm)
model that related CIP to M and save this model asreg01
. - Fit a multiple linear regression model
(lm)
that includes M and save this asreg02
. - Fit a polynomial regression model
(lm)
relating CIP to M and save this asreg03
. - Pick a model from REG01, REG02, and REG03. Plot M by CIP, and overlay the chosen model's fitted values.
- Describe your model. Do all the variables significantly contribute to predicting CIP? Interpret the coefficients, their direction (positive or negative) and how they relate to CIP.
- How does your multiple regression model compare the your simple linear regression, and how would communicate these results to an audience?
- The target variable for our regression models is
-
Hold-out
- Randomly select, and remove, 10 states from the training set. Store these 10 states in a dataset called
holdOut
and remaining 41 states in a dataset calledtraining
.badDrivers[c(1,2,3),]
selects the first, second, and third observation from your dataset.- Take a look at the
sample
command in R
- Re-train REG01,REG02,REG03 on
training
- For REG01, 02, and 03, compute the mean-squared error (MSE) on
holdOut
- Which model would you select and why?
- Randomly select, and remove, 10 states from the training set. Store these 10 states in a dataset called
-
Cross-validation
- For REG01, REG02, REG03
- Split your data into 5 training/testing sets (note, one dataset will have 11 observations)
- Create an empty data.frame called
crossValResults
that has 3 columns (one for each model) and 5 rows (one for each test MSE) - Program a for loop that
- trains your model on 4 pieces of the data
- tests, or makes predictions, on the "held-out" dataset.
- computes the MSE on the "held-out" dataset
- stores the test MSE in
crossValResults
.
- When completed, you should have computed 15 MSEs, 5 for every regression model stored as columns in a data frame.
- Compute the CV error for your regression models, the MSE averaged over each test set.
- How does the CV error compare to the hold-out error?
- How does the Cross-validation MSE compare between your simple and multiple regression?
- For REG01, REG02, REG03