#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Cross Validation

In past colabs we have practiced dividing our data into testing and training sets in order to evaluate the performance of our models. In some cases, especially when there is limited data available, we can't be too sure that our model isn't overfitting.

Cross validation is a [resampling](https://en.wikipedia.org/wiki/Resampling_(statistics)) technique that we use in machine learning to measure the quality of our model by running multiple iterations of training and testing using different subsets of the data.

The most common algorithm used for cross validation in machine learning is k-fold cross validation. This colab will focus primarily on k-fold cross validation.
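
To make the idea concrete, here is a minimal sketch of how k-fold splitting divides rows into training and testing subsets. The ten-row array is a hypothetical stand-in for a real dataset; only the row indices matter.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified only by their row index.
X = np.arange(10).reshape(-1, 1)

# Five folds: every row lands in a test set exactly once.
kfold = KFold(n_splits=5)
folds = list(kfold.split(X))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f'fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}')
```

Each iteration trains on eight rows and tests on the remaining two, and across all five iterations every row is tested once.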


## Overview

### Learning Objectives

* Understand cross validation and how it is different from simple test/train splits.
* Build a synthetic column.
* Use k-fold cross validation libraries.


### Prerequisites

* Linear Regression with scikit-learn

### Estimated Duration

60 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There are 2 exercises in this Colab so there are 6 points available. The grading scale will be 3 points.

## Obtain and Prepare the Data

### Obtain the Data

Montgomery County, Maryland provides [salary data](https://catalog.data.gov/dataset?tags=salary-and-gender) for government employees by year. For this colab we will use the 2017 dataset.

The code box below downloads the dataset and saves it in a file called 'Employee_Salaries_-_2017.csv'.

In [0]:
import urllib.request
import os

url = 'https://data.montgomerycountymd.gov/api/views/2qd6-mr43/rows.csv?accessType=DOWNLOAD'
file_name = 'Employee_Salaries_-_2017.csv'

urllib.request.urlretrieve(url, file_name)

if file_name not in os.listdir('./'):
  raise Exception(f'{file_name} was not downloaded to the correct directory')

Once the file is downloaded we can read it into a `DataFrame` and take a peek at the first bit of data.

In [0]:
import pandas as pd

salary_data = pd.read_csv(file_name)
salary_data.head()

### Build a Synthetic Column

Let's take a look at just the column names and data types.

In [0]:
salary_data.dtypes

The only numeric columns are related to pay, but there is one other column that looks like it could be converted to a number and might be predictive of pay: 'Date First Hired'.

In [0]:
salary_data['Date First Hired'].sample(10)

We can convert the dates from strings to datetimes using the pandas `to_datetime` function.

In [0]:
pd.to_datetime(salary_data['Date First Hired']).sample(10)

It would be useful to convert this data to a numeric variable. Since we know we are working with data from 2017, we can choose a reference date and subtract each first-hired date from it to create a 'Tenure' column measured in days.
 
Some employees were likely hired during 2017, so using January 1st, 2017 as the reference date would give them a negative tenure. For simplicity we will subtract the first-hired date from January 1st, 2018 instead.

In [0]:
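reference_date = pd.to_datetime('01/01/2018')
hire_dates = pd.to_datetime(salary_data['Date First Hired'])

# Tenure is the number of whole days between the hire date and the reference date.
salary_data['Tenure'] = (reference_date - hire_dates).dt.days.astype('int64')
salary_data[['Date First Hired', 'Tenure']].sample(10)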
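
As a sanity check on the date arithmetic, here is a small worked example on made-up hire dates. The three dates and the `MM/DD/YYYY` format string are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical hire dates, assumed to be in MM/DD/YYYY format.
toy = pd.DataFrame(
    {'Date First Hired': ['12/26/2017', '01/01/2017', '01/01/2008']}
)

reference_date = pd.to_datetime('01/01/2018')
hired = pd.to_datetime(toy['Date First Hired'], format='%m/%d/%Y')

# Subtracting datetimes yields timedeltas; .dt.days extracts whole days.
toy['Tenure'] = (reference_date - hired).dt.days
print(toy)
```

The three tenures come out to 6, 365, and 3653 days. The last value is ten years of 365 days plus three leap days (2008, 2012, and 2016).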

Let's also describe the data and check out the distribution of our new 'Tenure' column.

In [0]:
salary_data['Tenure'].describe()

We can see that the minimum tenure is 6 days and that the 50th percentile is 4263 days, which is just over 11.5 years. That doesn't seem too crazy for a government job.
 
Check out that max value though! 19086 days, which is just over 52 years.
 
Let's look closer.

In [0]:
salary_data[salary_data['Tenure'] == 19086]

A quick search on the internet brings up [meeting minutes](https://www.montgomerycountymd.gov/HHS-Program/Resources/Files/PHSDocs/COH/June%202016%20approved%20COH%20Minutes--FINAL.pdf) from 2016 where our outlier is being congratulated for 50 years of service to Montgomery County.
 
Our new column seems to be legit.

### Visualize the Data

Our dataset contains salary data for every government employee. This includes a wide range of job roles, each with their own competitive pay ranges.

Let's look at a visualization of salary and tenure.

In [0]:
import matplotlib.pyplot as plt

plt.plot(salary_data['Tenure'], salary_data['Current Annual Salary'], 'b.')
plt.show()

### Narrow the Problem Scope

This data would be hard to predict reliably with a single linear function, since it mixes many job roles with different pay ranges. Let's explore the data to see what the most common job roles are. We can then build a model for just one role.

In [0]:
salary_data.groupby('Employee Position Title')['Tenure'].count().sort_values(ascending=False).head()

'Police Officer III' seems to be the clear winner with 'Firefighter/Rescuer III' and 'Bus Operator' not far behind.
 
We will limit our model to predict the pay of people with 'Police Officer III' roles.

In [0]:
police_officers = salary_data[salary_data['Employee Position Title'] == 'Police Officer III']

police_officers.describe()

And visualizing that data.

In [0]:
import matplotlib.pyplot as plt

plt.plot(police_officers['Tenure'], police_officers['Current Annual Salary'], 'b.')
plt.show()

This data looks much better and might be a good fit for a linear model. There do seem to be some obvious pay bands and an overall salary cap just over $90,000.

Another interesting salary-related data point is the actual gross pay that the officer received over the course of the year. Where the salary is promised pay, gross pay is actual pay: it includes overtime and excludes days of unpaid leave and days before the officer started work (if the officer started in 2017).

In [0]:
import matplotlib.pyplot as plt

plt.plot(police_officers['Tenure'], police_officers['2017 Gross Pay Received'], 'b.')
plt.show()

Here we can see more pronounced bands, but they aren't pay bands. The bands are groups of officers with the same tenure. Likely this is a class of officers starting at the same time. Also, most people start a new job on Mondays.

So which do we want to try to predict? Salary is likely more predictable since gross pay can be affected by an individual's willingness to perform overtime work.

A quick look at a correlation matrix also indicates that we'd have better luck predicting salary.

In [0]:
police_officers[['Tenure', 'Current Annual Salary', '2017 Gross Pay Received']].corr()
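
For readers who want to see what `corr()` computes, the sketch below builds a small synthetic dataset and compares the pandas result with a Pearson correlation calculated by hand. All numbers are randomly generated stand-ins (the column names `Salary` and `Gross` are hypothetical), chosen so that the gross pay column is noisier than the salary column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: salary grows with tenure, gross pay adds extra noise
# (a rough stand-in for overtime and unpaid leave).
tenure = rng.uniform(0, 10000, size=200)
salary = 50000 + 4 * tenure + rng.normal(0, 5000, size=200)
gross = salary + rng.normal(0, 15000, size=200)

df = pd.DataFrame({'Tenure': tenure, 'Salary': salary, 'Gross': gross})
corr = df.corr()

# Pearson correlation: covariance divided by the product of standard deviations.
manual = np.cov(tenure, salary)[0, 1] / (tenure.std(ddof=1) * salary.std(ddof=1))

print(corr)
print(manual)
```

The noisier column correlates less strongly with tenure, which is the same pattern that makes salary the easier target to predict.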

## k-Fold Cross Validation

### Create a Pipeline

For this lab we are going to use the [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) to find our regression line. Before we do that it is a good idea to scale our feature data using the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Since we'll be using k-fold cross validation to explore our model, we can bundle the data preprocessing and model fitting into a single `Pipeline`. The pipeline is refit on each training fold, so the scaler only ever sees that fold's training data.

In [0]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

estimator = Pipeline(
  steps=[
    ['scale', StandardScaler()],
    ['regressor', SGDRegressor(random_state=562)],
  ]
)
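
To see what fitting a pipeline actually does, here is a minimal sketch on synthetic data (randomly generated numbers, not the salary dataset). It checks that the scaler inside the pipeline learned its mean from the training rows only. The steps are written as tuples, which is the form most commonly seen in the scikit-learn documentation.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic feature and target, loosely shaped like tenure and salary.
X = rng.uniform(0, 10000, size=(100, 1))
y = 50000 + 4 * X[:, 0] + rng.normal(0, 1000, size=100)

X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('regressor', SGDRegressor(random_state=0)),
])

# fit() fits the scaler on X_train, transforms it, then fits the regressor.
pipeline.fit(X_train, y_train)

# predict() reuses the training-set scaling on the unseen rows.
predictions = pipeline.predict(X_test)

scaler = pipeline.named_steps['scale']
print(scaler.mean_[0], X_train.mean())
```

Because the scaling statistics come from the training rows alone, no information about the test rows leaks into the model.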

### Shuffle the Data

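We will also want to shuffle our data before sending it into the cross validation function. By default, k-fold splitting in scikit-learn takes consecutive rows for each fold without shuffling, so any ordering in the data would otherwise carry over into the folds.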

In [0]:
police_officers = police_officers.sample(frac=1.0, random_state=324)
police_officers.head()
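
The sketch below illustrates why shuffling matters, using a hypothetical feature that arrives sorted. It also shows an alternative to shuffling the `DataFrame`: asking the `KFold` splitter to shuffle for us.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical feature sorted in ascending order.
X = np.arange(20).reshape(-1, 1)

# Without shuffling, each test fold is a contiguous block of rows.
unshuffled = [test.tolist() for _, test in KFold(n_splits=5).split(X)]

# With shuffle=True the splitter randomizes the row assignment itself.
shuffled = [
    test.tolist()
    for _, test in KFold(n_splits=5, shuffle=True, random_state=324).split(X)
]

print('unshuffled first test fold:', unshuffled[0])
print('shuffled first test fold:  ', shuffled[0])
```

In the unshuffled case the first test fold holds only the four smallest values, so the model would be tested on a range it barely saw during training.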

### Calculate Scores

Now we can calculate our cross validation scores. We'll use the `cross_val_score` function which uses k-Fold cross validation by default.

We pass the function:

1. Estimator (it will be trained k-times)
1. Feature data
1. Target data
1. The number of folds.

In the case below we choose 5 folds, so each iteration holds out 20% of our data for testing and trains on the remaining 80%. Other common values are 10 folds and even 4 folds. There isn't a single correct number of folds. More folds mean more iterations of training and scoring, so your data size and processing cost might dictate fewer folds. Going above 10 folds is usually not very beneficial: each test fold becomes small, which increases the chance that the testing data for a given fold looks very different from the training data, so you might not get an accurate view of your model's performance.
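
The trade-off can be tabulated with a quick sketch. It assumes a dataset of 100 placeholder rows and reports, for a few values of k, how many times the model is trained and what fraction of the data is held out each time.

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 placeholder rows; only the row count matters here.
X = np.zeros((100, 1))

summary = {}
for k in (4, 5, 10):
    splits = list(KFold(n_splits=k).split(X))
    test_fraction = len(splits[0][1]) / len(X)
    # (number of train/score iterations, fraction held out per iteration)
    summary[k] = (len(splits), test_fraction)
    print(f'k={k}: {len(splits)} fits, {test_fraction:.0%} held out per fit')
```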

In [0]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    estimator,
    police_officers[['Tenure']],
    police_officers['Current Annual Salary'],
    cv=5
)

scores

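After scoring the model with 5 folds we get 5 scores, one per fold. Since we didn't pass a `scoring` argument, `cross_val_score` uses the estimator's own `score` method, which for an [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) is the R-squared score. Other kinds of models return different default scores; classifiers, for example, report accuracy.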

We can now take the mean of the scores to get a more balanced view of our model's performance.

In [0]:
scores.mean()
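
If R-squared isn't the metric we care about, `cross_val_score` accepts a `scoring` argument. The sketch below uses synthetic data and a plain `LinearRegression` as a stand-in model (both are assumptions for illustration) to show the default score next to an explicit mean absolute error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic linear data with some noise.
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)

model = LinearRegression()

# Default: the estimator's own score method (R-squared for regressors).
default_scores = cross_val_score(model, X, y, cv=5)

# The same metric requested by name.
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

# Error metrics are negated so that higher is always better.
mae_scores = cross_val_score(
    model, X, y, cv=5, scoring='neg_mean_absolute_error')

print(default_scores.mean(), -mae_scores.mean())
```

Note the `neg_` prefix: scikit-learn negates error metrics so that a larger score always means a better model.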

### Comparing to Standard Test/Train Splitting

Let's compare our mean r-squared score to the score that we would have gotten from a standard test/train split of data.

In [0]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

estimator = Pipeline(
  steps=[
    ['scale', StandardScaler()],
    ['regressor', SGDRegressor(random_state=343)],
  ]
)

X_train, X_test, y_train, y_test = train_test_split(
    police_officers[['Tenure']], 
    police_officers['Current Annual Salary'], 
    test_size=0.2, 
    random_state=1234
)

estimator.fit(X_train, y_train)
estimator.score(X_test, y_test)

In this case, at least when this colab was created, the mean cross validation score was actually higher than the single hold-out test score. This won't always be the case: depending on the random states and the number of times you run each code cell, you might see the opposite.
 
What does the difference between the cross validation score and hold-out score tell us?
 
If the cross validation score is higher, then it means that our model probably performs better on unknown/new data than we would have thought by simply doing holdout.
 
If the cross validation score is lower, then our model is likely to just be too well adjusted to the single set of holdout test data.
 
The cross validation score gives us a better idea of how our model would actually perform than the single hold-out score.
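
One way to get a feel for this is to repeat a single hold-out split with several random states and watch the score move around. The sketch below does that on a small synthetic dataset, with `LinearRegression` as a stand-in model; the data, model, and seeds are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# A small, noisy synthetic dataset.
X = rng.uniform(0, 10, size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5.0, size=60)

# Score ten different single hold-out splits.
holdout_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

# Score once with 5-fold cross validation.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(f'hold-out range: {min(holdout_scores):.3f} to {max(holdout_scores):.3f}')
print(f'cross validation mean: {cv_scores.mean():.3f}')
```

The spread between the best and worst hold-out score shows how much a single split can depend on which rows happen to land in the test set.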

### Next Steps

We now have a mean cross validation score and have some idea of how our model will perform, so what's next?
 
There are two options:
 
* Train the model on the entire dataset.
* Train the model on a subset of the dataset.
 
The first option isn't too bad, but it does still expose you to a slight risk that your model will end up overfitting. Since you are training with all of the data there is no way to go back and do one last check to ensure that your model is sane.
 
The second option is preferred. In this case, you still do a train-test split first. Then you perform cross validation on just the training data. Once you get the cross validation score you then train a new model on just the training data and do one final sanity check using the testing data.

In [0]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split

estimator = Pipeline(
  steps=[
    ['scale', StandardScaler()],
    ['regressor', SGDRegressor(random_state=343)],
  ]
)

X_train, X_test, y_train, y_test = train_test_split(
    police_officers[['Tenure']], 
    police_officers['Current Annual Salary'], 
    test_size=0.2, 
    random_state=1234
)

scores = cross_val_score(estimator, X_train, y_train, cv=5)

estimator = Pipeline(
  steps=[
    ['scale', StandardScaler()],
    ['regressor', SGDRegressor(random_state=343)],
  ]
)

estimator.fit(X_train, y_train)
score = estimator.score(X_test, y_test)

print(f'cross validation score (min): {scores.min()}')
print(f'cross validation score: {scores.mean()}')
print(f'final score {score}')

In this case again the final score differs quite a bit from our cross validation score. This hints that we still might be overfitting to our training data and not generalizing well despite our cross validation score.
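
Another way to look for overfitting is to compare training and validation scores within each fold. The sketch below uses scikit-learn's `cross_validate` function with `return_train_score=True` on synthetic data, with `LinearRegression` as a stand-in model (both assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)

# Synthetic linear data with noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=100)

results = cross_validate(
    LinearRegression(), X, y, cv=5, return_train_score=True)

train_mean = results['train_score'].mean()
test_mean = results['test_score'].mean()

print(f'mean train score: {train_mean:.3f}')
print(f'mean test score:  {test_mean:.3f}')
```

A training score that sits well above the validation score is a common sign that the model fits the training folds more closely than it generalizes.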

# Resources

* [scikit-learn Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html)

# Exercises

## Exercise 1

`cross_val_score` isn't limited to k-fold validation. It can perform many other splits on the data.
 
Use the [ShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit) in `cross_val_score` instead of k-fold.
 
Set the parameters to:
 
* 5 splits
* 30% test size
* 56789 random state
 
Calculate the mean of the returned scores and store the mean in a variable called `mean_score`.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

In [0]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split

estimator = Pipeline(
  steps=[
    ['scale', StandardScaler()],
    ['regressor', SGDRegressor(random_state=343)],
  ]
)

X_train, X_test, y_train, y_test = train_test_split(
    police_officers[['Tenure']], 
    police_officers['Current Annual Salary'], 
    test_size=0.2, 
    random_state=1234
)

scores = cross_val_score(estimator, X_train, y_train, cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=56789))

estimator = Pipeline(
  steps=[
    ['scale', StandardScaler()],
    ['regressor', SGDRegressor(random_state=343)],
  ]
)

estimator.fit(X_train, y_train)
score = estimator.score(X_test, y_test)

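# The exercise asks for the mean score to be stored in `mean_score`.
mean_score = scores.mean()

print(f'cross validation score (min): {scores.min()}')
print(f'cross validation score (mean): {mean_score}')
print(f'final score: {score}')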

**Validation**

In [0]:
# If the solution can be auto-graded, perform the autograding here.

## Exercise 2

In classification problems we may have unbalanced classes and want an even distribution of those classes in our training and testing data. This is called *stratification* and for classification problems scikit-learn provides the [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) splitter. This balances our target data across folds.

Using the [digits dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) create an [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) and run stratified cross validation over the digit targets.

Cross validate over 5 folds. Save the mean validation score in a variable called `mean_score`.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

In [0]:
# TODO(joshmcadams)
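
Until the answer key is filled in, here is one possible sketch of a solution. It is not the author's official answer: the scaling step, the step names, and `random_state=0` are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# Scale the pixel features before the SGD classifier (an assumed choice).
classifier = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('classifier', SGDClassifier(random_state=0)),
])

# Stratify on the digit targets so each fold has a similar class mix.
splitter = StratifiedKFold(n_splits=5)

scores = cross_val_score(classifier, digits.data, digits.target, cv=splitter)
mean_score = scores.mean()

print(mean_score)
```

Because the estimator is a classifier, each of the five scores is an accuracy rather than an R-squared value.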

**Validation**

In [0]:
# TODO(joshmcadams)

## Exercise 3: Challenge (Ungraded)

There are times when we have insights about our data that we can feed our model. In some cases we are aware of tranches in the data that have different characteristics and we'd like those characteristics reflected in our testing and training data.
 
In classification problems we may have unbalanced classes and want an even distribution of those classes in our training and testing data. This is called *stratification* and for classification problems scikit-learn provides the [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) splitter. This balances our target data across folds.
 
However, there are other times when we need to balance our feature data across folds. Thinking about the salary data we have dealt with in this colab, the [gender pay gap](https://en.wikipedia.org/wiki/Gender_pay_gap) comes to mind.
 
Let's see if we can find any evidence of a gap. First let's see the distribution of female and male officers.

In [0]:
police_officers.groupby(by='Gender')['Gender'].count()

About 26% of the officers are female.

Now we can calculate an average salary by years of tenure.

In [0]:
max_years = int(police_officers['Tenure'].max() / 365) + 1

bins = list(range(0, 365*max_years+1, 365))

labels = list(range(0, max_years))

police_officers['Tenure Years'] = pd.cut(police_officers['Tenure'], bins=bins, labels=labels)

police_officers.sample(10)[['Tenure', 'Tenure Years']]

And finally, we can see on average if there is a gap.

In [0]:
females = police_officers[police_officers['Gender'] == 'F'].groupby(by='Tenure Years')['Current Annual Salary'].mean()
males = police_officers[police_officers['Gender'] == 'M'].groupby(by='Tenure Years')['Current Annual Salary'].mean()
(females-males).mean()

In this case we see a gap of over $1000 in average salary between female and male officers with the same number of years of tenure.
 
If we want this reflected in our model, each of our folds needs to contain a similar mix of male and female officers. Unfortunately, this isn't quite as easy as stratifying target data.
 
There are a few approaches. One is to pre-split the data and use the [PredefinedSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit) class to identify test folds per iteration of cross validation.
 
Another is to write a custom [CV Splitter](https://scikit-learn.org/stable/glossary.html#term-cv-splitter) and implement `split` and `get_n_splits`.
 
In this challenge you are tasked with creating a 5-split CV Splitter or PredefinedSplit that keeps the feature data roughly 3:1 male:female.
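
As a starting point, the sketch below shows the mechanics of `PredefinedSplit` on toy data. The 20 gender labels are hypothetical, and borrowing `StratifiedKFold` to stratify on a feature column instead of the target is just one possible way to build the fold assignments.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit, StratifiedKFold

# Hypothetical gender labels with a 3:1 male:female ratio.
gender = np.array(['M'] * 15 + ['F'] * 5)
X = np.arange(len(gender)).reshape(-1, 1)

# Assign every row to a test fold, stratifying on the feature
# rather than on the target.
test_fold = np.empty(len(gender), dtype=int)
stratifier = StratifiedKFold(n_splits=5)
for fold, (_, test_idx) in enumerate(stratifier.split(X, gender)):
    test_fold[test_idx] = fold

# PredefinedSplit replays those assignments as a CV splitter.
splitter = PredefinedSplit(test_fold)

ratios = []
for train_idx, test_idx in splitter.split():
    males = int((gender[test_idx] == 'M').sum())
    females = int((gender[test_idx] == 'F').sum())
    ratios.append((males, females))

print(ratios)
```

The resulting `splitter` can be passed as the `cv` argument of `cross_val_score`, and each of its five test folds contains three male rows and one female row.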

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
# TODO(joshmcadams)

**Validation**

In [0]:
# TODO(joshmcadams)