# Reducing traffic mortality in the US

In [None]:
# This is a code cell without any tag. You can put convenience code here,
# but it won't be included in any way in the final project.
# For example, to be able to run tests locally in the notebook
# you need to install the following:
# install.packages("devtools")
# install.packages("testthat")
# devtools::install_github('datacamp/IRkernel.testthat')

# This allows .... to be used as placeholder value in the sample code cells
.... <- NULL 

## 1. Identifying the raw data files and determining their format

![](img/car_accident.jpg)

While the rate of fatal road accidents has been decreasing steadily since the 1980s, the past ten years have seen this reduction stagnate. Coupled with the increased number of miles driven in the nation, the total number of traffic-related fatalities has now reached a ten-year high and is rapidly increasing.

![](img/accident-history.png)

At the request of the US Department of Transportation, we are investigating how to derive a strategy to reduce the incidence of road accidents across the nation. Looking at the demographics of traffic accident victims for each US state, we find a lot of variation between states. We now want to understand whether there are patterns in this variation in order to derive suggestions for a policy action plan. In particular, instead of implementing a costly nationwide plan, we want to focus on groups of states with similar profiles. How can we find such groups in a statistically sound way and communicate the result effectively?

To accomplish these tasks, we will make use of data wrangling, plotting, dimensionality reduction, and unsupervised clustering.

The data given to us was originally collected by the National Highway Traffic Safety Administration and the National Association of Insurance Commissioners. This particular dataset was compiled and released as a [CSV file](https://github.com/fivethirtyeight/data/tree/master/bad-drivers) by FiveThirtyEight under the [CC BY 4.0 license](https://github.com/fivethirtyeight/data).

**One** sentence that summarizes the code the student will write in this task.

- The specific task instructions go in a bullet point list. One sentence ideally (max 2) per bullet.
- Try to map code cell comments to instruction bullets.
- At most 4 bullets.

<hr>

## Good to know

The `@instructions` for **task 1** should include a "Good to know" section where you direct the student to resources that could be useful _throughout_ the Project, as well as the recommended prerequisites. You should also link to resources that are helpful for task 1 specifically. These resources could be external documentation, DataCamp courses and exercises, cheat sheets, Stack Overflow answers, etc. Recommended format for task 1 below.

This Project lets you practice the skills from [Introduction to the Tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse), including filtering, grouping and summarizing data, and visualizing with ggplot2. We recommend that you take that course before starting this Project.

Helpful links:
- tidyverse [cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Tidyverse+Cheat+Sheet.pdf)
- dplyr's `mutate()` function [documentation](https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/mutate)
- Mutate [exercises](https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-wrangling-1?ex=10) in the Introduction to the Tidyverse course

Hints are meant for students who are stuck. Since students can't view solutions in Projects, clicking the hint button is their last resort. We often recommend including code scaffolding (example below).

You can read `path_to/my_data.csv` into a data frame named `my_data` like this after loading the tidyverse package:

```r
my_data <- read_csv("path_to/my_data.csv")
```

In [None]:
# This is the sample code the student will see. It should
# consist of up to 10 lines of code and comments, and the
# student should have to complete at most 5 lines of code.

# Rule of thumb: each bullet point in @instructions should
# correspond to a comment in the @sample_code.

# Indicate missing code with ....
like_this <- ....
# or when a line or more is required, like this:
# .... YOUR CODE FOR TASK 1 ....

In [None]:
# Checking the name of the current directory to confirm that we are in the right place
current_dir <- getwd() 
print(current_dir)

# Listing all files in this directory to see the name of the main data file
file_list <- list.files() 
print(file_list)

# Listing the contents of the "datasets" directory
file_list <- list.files("datasets")
print(file_list)

# Studying the first 20 lines of this file to understand how to read it in as a dataframe in the next cell
accidents_head <- readLines('datasets/road-accidents.csv', n=20) 
print(accidents_head)

In [None]:
# These packages need to be loaded in the first @tests cell. 
library(testthat) 
library(IRkernel.testthat)

# Then follows one or more tests of the student's code. 
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.

run_tests({
    test_that("the answer is correct", {
    expect_true(like_this == "missing part filled in", 
        info = "The student will see this info if the test fails.")
    })
    # You can have more than one test
})

## 2. Reading in and getting an overview of the data

After peeking at the beginning of the file, we now know the data format, and the next step is to import the data as a data frame. After this, we will orient ourselves to get to know the data that we are dealing with.

In [None]:
# Importing the road accident data as a dataframe using tidyverse
library(tidyverse)
car_acc <- read_delim(file = 'datasets/road-accidents.csv', comment = '#', delim = '|')

# Saving the number of rows and columns as a vector
rows_and_cols <- dim(car_acc)

# Saving an overview of the data frame
# This overview should include the number of rows and columns, the column names and data types
# str() prints its summary and returns NULL, so the printed text is captured to keep it
car_acc_structure <- capture.output(str(car_acc))
cat(car_acc_structure, sep = "\n")

# Displaying the last six rows of the data frame. 
# Comparing the column data types from previously with this output.
# Does it make sense? A quick data type sanity check like this can save us major headaches down the line.
tail(car_acc)

## 3. Creating a textual and a graphical summary of the data

We now have an idea of what the dataset looks like. To further familiarize ourselves with this data, we will calculate summary statistics and produce a graphical overview of the data. The graphical overview gives a sense of the distribution of variables within the data, and could consist of one histogram per column. It is often a good idea to also explore the pairwise relationships between all columns in the dataset using pairwise scatterplots (sometimes referred to as a "scatterplot matrix").
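
As a rough illustration of the two kinds of overview mentioned above, the sketch below draws one histogram per column and then a scatterplot matrix. It uses base R graphics on a small made-up data frame; `toy` and its column names are hypothetical stand-ins, not the project data.

```r
set.seed(7)
# Hypothetical numeric data standing in for the feature columns
toy <- data.frame(x1 = rnorm(51, mean = 30, sd = 9),
                  x2 = rnorm(51, mean = 30, sd = 5),
                  x3 = rnorm(51, mean = 88, sd = 7))

# One histogram per column, arranged side by side
old_par <- par(mfrow = c(1, ncol(toy)))
for (col_name in names(toy)) {
    hist(toy[[col_name]], main = col_name, xlab = col_name)
}
par(old_par)

# Scatterplot matrix of all column pairs
pairs(toy)
```

Calling `plot()` on a data frame of numeric columns produces the same kind of scatterplot matrix as `pairs()`.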

In [None]:
# Producing summary statistics of all columns in the data frame
summary(car_acc)

# Creating a pairwise scatterplot to explore the data
car_acc %>% 
    select(-state) %>%
    plot()

## 4. Quantifying association of features and fatal accidents

We can already see some potentially interesting relationships between the target variable (the number of fatal accidents) and the feature variables (the remaining three columns).

To quantify the pairwise relationships that we observed in the scatterplots, we can compute the Pearson correlation coefficient matrix. The Pearson correlation coefficient is one of the most common measures of linear correlation between variables, and by convention the following thresholds on its absolute value are often used:

- 0.2 = weak
- 0.5 = medium
- 0.8 = strong
- 0.9 = very strong
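
The thresholds above can be turned into a small lookup, sketched here in base R. The helper `cor_strength()` is hypothetical, and the "negligible" label for values below 0.2 is an assumption added for completeness; it is not part of the convention listed above.

```r
# Hypothetical helper: label the strength of a correlation coefficient
# using the thresholds listed above, applied to its absolute value.
cor_strength <- function(r) {
    cut(abs(r),
        breaks = c(0, 0.2, 0.5, 0.8, 0.9, 1),
        labels = c("negligible", "weak", "medium", "strong", "very strong"),
        right = FALSE, include.lowest = TRUE)
}

# Example coefficients, including negative ones
strength <- as.character(cor_strength(c(0.1, -0.35, 0.65, 0.85, -0.95)))
print(strength)
```

The sign of the coefficient only gives the direction of the relationship, which is why the lookup uses `abs(r)`.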

In [None]:
# Computing the correlation for all column pairs 
car_acc %>% select(-state) %>% cor()

## 5. Fitting a multivariate linear regression

From the correlation table, we see that the number of fatal accidents is most strongly correlated with alcohol consumption (first row). In addition, we also see that some of the features are correlated with each other; for instance, speeding and alcohol consumption are positively correlated. We therefore want to compute the association of the target with each feature while adjusting for the effect of the remaining features. This can be done using a multivariate linear regression.

Both the multivariate regression and the correlation measure how strongly the features are associated with the outcome (fatal accidents). When comparing the regression coefficients with the correlation coefficients we will see that they are slightly different. The reason for this is that the multiple regression computes the association of a feature with an outcome, given the association with all other features, which is not accounted for when calculating the correlation coefficients.

A particularly interesting case is when the correlation coefficient and the regression coefficient of the same feature have opposite signs. How can this be? For example, when a feature A has a positive direct effect on the outcome Y but is also positively correlated with a different feature B that has a negative effect on Y, then the indirect association (A->B->Y) can overwhelm the direct one (A->Y). In such a case, the regression coefficient of feature A could be positive, while the correlation coefficient is negative. This is sometimes called a *masking* relationship. Let's see if the multivariate regression can reveal such a phenomenon.
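
To make the masking idea concrete, here is a minimal simulation sketch. All variables (`a`, `b`, `y`) and effect sizes are made up for illustration and are unrelated to the accident data: `a` has a positive direct effect on `y`, but it is strongly correlated with `b`, which has a larger negative effect.

```r
set.seed(123)
n <- 500

# Hypothetical features: a is strongly positively correlated with b
b <- rnorm(n)
a <- 0.9 * b + rnorm(n, sd = 0.3)

# Outcome: positive direct effect of a, larger negative effect of b
y <- 1 * a - 2 * b + rnorm(n, sd = 0.5)

# Marginal association of a with y (ignores b)
cor_ay <- cor(a, y)

# Association of a with y after adjusting for b
fit <- lm(y ~ a + b)
coef_a <- unname(coef(fit)["a"])

print(c(correlation = cor_ay, regression_coefficient = coef_a))
```

In this setup, the correlation between `a` and `y` comes out negative while the regression coefficient of `a` is positive, because the regression separates the direct effect of `a` from the part that runs through `b`.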

In [None]:
# Fitting a linear regression model
fit_reg <- lm(drvr_fatl_col_bmiles ~ perc_fatl_speed + perc_fatl_alcohol + perc_fatl_1st_time,
              data = car_acc)
# Printing the regression coefficients
coef(fit_reg)

## 6. Performing PCA on standardized data

We have learned that alcohol consumption is weakly associated with the number of fatal accidents across states. This could lead us to conclude that alcohol consumption should be a focus for further investigation and that strategies should perhaps divide states into high versus low alcohol consumption in accidents. But there are also associations between alcohol consumption and the other two features, so it might be worth trying to split the states in a way that accounts for all three features.

One way of clustering the data is to use PCA to visualize the data in a reduced-dimensional space where we can try to pick up patterns by eye. PCA uses the absolute variance to calculate the overall variance explained for each principal component, so it is important that the features are on a similar scale (unless we have a particular reason why one feature should be weighted more).

We will use the appropriate scaling function to standardize the features to be centered with mean 0 and scaled to a standard deviation of 1.
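
The effect of standardizing can be sketched on made-up data with two independent features on very different scales. The data frame `toy` and the helper `pve_first()` are hypothetical; `scale()` and `princomp()` are the base R functions assumed here.

```r
set.seed(42)
# Hypothetical features on very different scales
toy <- data.frame(small = rnorm(50, mean = 5, sd = 1),
                  large = rnorm(50, mean = 500, sd = 100))

# Centering to mean 0 and scaling to standard deviation 1
toy_std <- as.data.frame(scale(toy))
col_means <- colMeans(toy_std)
col_sds <- apply(toy_std, 2, sd)

# Hypothetical helper: proportion of variance explained by the first component
pve_first <- function(x) {
    v <- princomp(x)$sdev^2
    unname(v[1] / sum(v))
}

pve_raw <- pve_first(toy)      # dominated by the large-scale feature
pve_std <- pve_first(toy_std)  # both features contribute comparably
print(c(raw = pve_raw, standardized = pve_std))
```

Without standardizing, the first component essentially reproduces the feature with the largest variance, which is why the scaling step comes before PCA.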

In [None]:
# Centering and standardizing the feature columns
# as.numeric() drops the matrix attributes that scale() returns
car_acc_standised <- car_acc %>% 
                mutate( perc_fatl_speed=as.numeric(scale(perc_fatl_speed)),
                        perc_fatl_alcohol=as.numeric(scale(perc_fatl_alcohol)),
                        perc_fatl_1st_time=as.numeric(scale(perc_fatl_1st_time)) )

# Performing PCA on the standardized features
pca_fit <- princomp( car_acc_standised[-c(1,2)]  )

# Plotting the proportion of variance explained by each principal component (PC)
pr_var <- pca_fit$sdev^2
pve <- pr_var/sum(pr_var)
plot( pve , xlab="Principal Component",
      ylab="Proportion of Variance Explained", type="b",ylim=c(0,1))

# Computing the cumulative variance explained by the first two principal components
cve <- cumsum(pve)
cve_12 <- cve[2]
print(cve_12)

## 7. Visualizing the data using the first two principal components

The first two principal components enable visualization of the data in two dimensions while capturing a high proportion of the variation (79%). This enables us to use our eyes to try to discern patterns in the data. Although clustering algorithms are becoming increasingly efficient, human pattern recognition is an easily accessible and very efficient method of assessing clusters in data.

We will create a scatter plot of the first two principal components, and determine how many clusters there are.

In [None]:
# Plotting the first two principal components in a scatterplot
plot(pca_fit$scores[,1],pca_fit$scores[,2],pch=16) 

## 8. Finding clusters of similar states in the data

It was not entirely clear from the PCA scatter plot how many clusters there are in the data. We will therefore try to quantify the number of clusters using KMeans clustering. KMeans clustering can be used to assist with the identification of a number of clusters by creating a scree plot and finding the elbow, which is an indication of when the addition of more clusters does not contribute much additional explanatory power.
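
The elbow idea is easier to see on data where the answer is known. The sketch below runs `kmeans()` on made-up points drawn around three hypothetical centers and records the total within-cluster sum of squares for each k; the data and the range of k are assumptions chosen for illustration.

```r
set.seed(1)
# Hypothetical data: three well-separated 2-D blobs of 30 points each
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
    cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1 to 6
k_vec <- 1:6
inertias <- sapply(k_vec, function(k) {
    kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# Reduction in inertia gained by each additional cluster
inertia_drop <- -diff(inertias)
print(round(inertia_drop, 1))

plot(k_vec, inertias, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Here the reduction is large up to k = 3 and small afterwards, which is the elbow. On real data the bend is often much less pronounced.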

In [None]:
# Creating a loop that applies the k-means method using k=1 to k=10 clusters
# and, for each k, computes the explanatory power using the within-cluster sum of squares
k_vec <- 1:10
inertias <- rep(NA, length(k_vec))
mykm <- list()
set.seed(1)
for (k in k_vec) {
    mykm[[k]] <- kmeans(car_acc_standised[-c(1,2)], k, nstart=50)
    inertias[k] <- mykm[[k]]$tot.withinss
}
plot(k_vec, inertias, type="b")

## 9. Using KMeans to visualize clusters in the PCA visualization

There is no clear elbow in the scree plot; both 2 and 3 clusters seem like reasonable choices. We continue our analysis using 3 clusters.

In [None]:
# Colouring the points of the principal component plot according to cluster number
cluster_id <- as.factor(mykm[[3]]$cluster)
plot(pca_fit$scores[,1],pca_fit$scores[,2],col=cluster_id,pch=16) 

## 10. Visualizing the feature differences between the clusters

Next, we want to understand in what way the three clusters of states differ. It helps to use the unscaled data here. This will be a comparison of the three clusters in terms of all three features. A visualization will help us make this comparison.
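
Alongside a plot, a table of per-cluster averages on the unscaled data can make the differences easier to read. This is a minimal base R sketch on made-up values; the data frame `toy`, its columns, and the cluster labels are all hypothetical.

```r
# Hypothetical unscaled data: two features and a cluster label per state
toy <- data.frame(
    state   = c("A", "B", "C", "D", "E", "F"),
    speed   = c(30, 34, 20, 22, 40, 42),
    alcohol = c(25, 27, 33, 31, 29, 30),
    cluster = factor(c(1, 1, 2, 2, 3, 3))
)

# Mean of each feature within each cluster
cluster_means <- aggregate(cbind(speed, alcohol) ~ cluster, data = toy, FUN = mean)
print(cluster_means)
```

Because the values are unscaled, the means stay in the original units (percentages in the accident data), which keeps them interpretable.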

In [None]:
# Assigning the cluster labels to the data frame with the unscaled data.
# To use the violin plot, we need to transform the data to the long format
car_acc$cluster <- cluster_id
car_acc %>%  select(-drvr_fatl_col_bmiles) %>% 
                gather(key=feature,value=percent,-state,-cluster) %>% 
                ggplot( aes(x=feature,y=percent , fill=cluster) ) +
                geom_violin( ) +
                coord_flip()

In [None]:
# One or more tests of the student's code. 
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
run_tests({
    test_that("the answer is correct", {
    expect_true(like_this == "missing part filled in", 
        info = "The student will see this info if the test fails.")
    })
    # You can have more than one test
})

## 11. Find out which clusters have the highest incidence of fatal accidents

Now it is clear that different groups of states may require different interventions, but which cluster should we help first? A reasonable approach is to prioritise a cluster where an intervention could save the most lives while involving relatively few states.

The number of miles driven in each state is available in another text file, `datasets/miles-driven.csv`, whose fields are separated by `|`. We will join these values to the data frame, compute the number of fatal collisions in each state, and create a bar plot of the total number of fatal collisions within each cluster.
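
The conversion from a fatality rate to a fatality count, and the trade-off between lives saved and number of states, can be sketched on toy numbers. Everything in this example is hypothetical: the states, the values, and the data frame names `toy_acc` and `toy_miles` are made up, and only the column names `drvr_fatl_col_bmiles` and `million_miles_annually` are taken from the project's code. Because the rate is expressed per billion miles while the mileage is given in millions of miles, the product is divided by 1000.

```r
library(dplyr)

# Hypothetical fatality rates (fatal collisions per billion miles) and clusters
toy_acc <- tibble(
  state = c("A", "B", "C", "D"),
  drvr_fatl_col_bmiles = c(20, 10, 15, 5),
  cluster = factor(c(1, 1, 2, 2))
)

# Hypothetical annual mileage, in millions of miles
toy_miles <- tibble(
  state = c("A", "B", "C", "D"),
  million_miles_annually = c(50000, 100000, 20000, 40000)
)

toy_joined <- toy_acc %>%
  left_join(toy_miles, by = "state") %>%
  # (count per billion miles) * (million miles) / 1000 = count
  mutate(num_drvr_fatl_col = drvr_fatl_col_bmiles * million_miles_annually / 1000)

# Total count per cluster and the average count per state in the cluster
toy_summary <- toy_joined %>%
  group_by(cluster) %>%
  summarise(n_states = n(),
            total = sum(num_drvr_fatl_col),
            per_state = total / n_states)

print(toy_summary)
```

Dividing the cluster total by the number of states gives a rough sense of how many fatalities an intervention would target per state involved, which is one way to weigh the two criteria against each other.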

**One** sentence that summarizes the code the student will write in this task.
- The specific task instructions go in a bullet point list. One sentence ideally (max 2) per bullet.
- Try to map code cell comments to instruction bullets.
- At most 4 bullets.

<hr>

Provide more info (if necessary) and include links to external resources under the horizontal ruler. The instructions should at most have 600 characters. Example format for links below.

Helpful links:
- dplyr's mutate() function [documentation](https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/mutate)
- Mutate [exercises](https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-wrangling-1?ex=10) in the Introduction to the Tidyverse course

Hints are meant for students who are stuck. Since students can't view solutions in Projects, clicking the hint button is their last resort. We often recommend including code scaffolding (example below).

You can read `path_to/my_data.csv` into a data frame named `my_data` like this after loading the tidyverse package:

```r
my_data <- read_csv("path_to/my_data.csv")
```

In [None]:
# This is the sample code the student will see. It should
# consist of up to 10 lines of code and comments, and the
# student should have to complete at most 5 lines of code.

# Rule of thumb: each bullet point in @instructions should
# correspond to a comment in the @sample_code.

# Indicate missing code with ...
like_this <- ....
# or when a line or more is required, like this:
# .... YOUR CODE FOR TASK 11 ....

In [None]:
# Read in the annual miles file and join it with the car accident data frame
miles_driven <- read_delim(file = "datasets/miles-driven.csv", delim = "|")
carr_acc_joined <- left_join(car_acc, miles_driven, by = "state")

# Compute the number of fatal collisions per state: the rate is per billion
# miles and the mileage is in millions of miles, hence the division by 1000
carr_acc_joined <- carr_acc_joined %>%
    mutate(num_drvr_fatl_col = drvr_fatl_col_bmiles * million_miles_annually / 1000)

# Summarise the number of fatal collisions for each cluster
carr_acc_joined_summ <- carr_acc_joined %>%
    group_by(cluster) %>%
    select(cluster, num_drvr_fatl_col) %>%
    summarise(count = n(),
              mean = mean(num_drvr_fatl_col),
              min = min(num_drvr_fatl_col),
              q25 = quantile(num_drvr_fatl_col, 0.25),
              q50 = quantile(num_drvr_fatl_col, 0.50),
              q75 = quantile(num_drvr_fatl_col, 0.75),
              max = max(num_drvr_fatl_col),
              sum = sum(num_drvr_fatl_col))
print(carr_acc_joined_summ)

# Finally, show the summed number of fatal collisions in each cluster
carr_acc_joined_summ %>%
    ggplot(aes(x = cluster, y = sum)) +
    geom_bar(aes(fill = cluster), stat = "identity", show.legend = FALSE)

In [None]:
# One or more tests of the student's code. 
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
run_tests({
    test_that("the answer is correct", {
    expect_true(like_this == "missing part filled in", 
        info = "The student will see this info if the test fails.")
    })
    # You can have more than one test
})

## 12. Making a decision when there is no clear correct choice

As we can see, there is no obvious correct choice regarding which cluster is the most important to focus on. Yet, we can still argue for a certain cluster and motivate this using our findings above. Which cluster should be a focus for policy intervention and further investigation? 

# We suggest that every answer would be correct here.
# It's just a way to get students to reflect on the advantages of each choice.
# Is this approach OK for the last question, considering there
# are already many regular tasks in the notebook?

cluster_num %in% 1:3
# The info below would show up as the solution

# There are several ways to justify each cluster choice.
# - 1, Red = the largest total number of people in accidents.
# - 2, Green = the highest alcohol consumption among fatal cases, the feature that had the strongest association with accidents.
# - 3, Blue = the cluster with the fewest states, good for a pilot effort.

**One** sentence that summarizes the code the student will write in this task.
- The specific task instructions go in a bullet point list. One sentence ideally (max 2) per bullet.
- Try to map code cell comments to instruction bullets.
- At most 4 bullets.

<hr>

Provide more info (if necessary) and include links to external resources under the horizontal ruler. The instructions should at most have 600 characters. Example format for links below.

Helpful links:
- dplyr's mutate() function [documentation](https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/mutate)
- Mutate [exercises](https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-wrangling-1?ex=10) in the Introduction to the Tidyverse course

Hints are meant for students who are stuck. Since students can't view solutions in Projects, clicking the hint button is their last resort. We often recommend including code scaffolding (example below).

You can read `path_to/my_data.csv` into a data frame named `my_data` like this after loading the tidyverse package:

```r
my_data <- read_csv("path_to/my_data.csv")
```

In [None]:
# This is the sample code the student will see. It should
# consist of up to 10 lines of code and comments, and the
# student should have to complete at most 5 lines of code.

# Rule of thumb: each bullet point in @instructions should
# correspond to a comment in the @sample_code.

# Indicate missing code with ...
like_this <- ....
# or when a line or more is required, like this:
# .... YOUR CODE FOR TASK 12 ....

In [None]:
# One or more tests of the student's code. 
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to 
# give the student a hint on how to resolve these errors.
run_tests({
    test_that("the answer is correct", {
    expect_true(like_this == "missing part filled in", 
        info = "The student will see this info if the test fails.")
    })
    # You can have more than one test
})