# Data Science: Bridging Principles and Practice
## Scikit-Learn Template

<img src="images/office-and-workers-in-barcelona-spain.jpg" />

<br>

*In this notebook, we will walk through solving a classification problem using machine learning. To do so, we will introduce the Scikit-Learn machine learning library for Python.*

### Table of Contents

<a href="#sectioncase">Case Study: Employee Attrition at IBM</a>

<ol start="9">
    <li><a href="#section9">Machine Learning</a>
        <ol type=a>
            <br>
            <li><a href="#section9a">The K-Nearest Neighbors Algorithm</a></li>
            <br>
            <li><a href="#section9b">Using Scikit-Learn: An Example</a></li>
            <br>
            <li><a href="#section9c">Using Scikit-Learn: KNN</a></li>            
        </ol>
    </li>
    </ol>

In [None]:
# run this cell to import some necessary software
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import sklearn
import seaborn as sns

from sklearn.model_selection import train_test_split
from mpl_toolkits.mplot3d import Axes3D 

# set the random seed for reproducibility
np.random.seed(28)

## Overview <a id="section9b"></a>

[Scikit-Learn](https://scikit-learn.org/stable/index.html) is free, open-source software for machine learning in Python. The software includes tools to create, train, and evaluate machine learning models.

Overview of using Scikit-Learn for machine learning:

1. Load the data
2. Choose your explanatory variables (what you're going to use to make predictions) and response variable (what you want to predict)
3. Split the data into training and testing sets (and, if you're doing a lot of parameter tuning, a validation set)
4. Create a model in Python
5. Fit the data to the model
6. Make predictions using your fitted model
7. Score the accuracy of your model
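
As a rough illustration of how these steps fit together, here is a minimal sketch that runs the whole workflow on synthetic data. The arrays `X_demo` and `y_demo` and the other `demo_` names are made up for this example and stand in for data you would load from a file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# steps 1-2: synthetic stand-in for loaded data (hypothetical values)
rng = np.random.RandomState(28)
X_demo = rng.normal(size=(100, 2))  # two explanatory variables
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# step 3: split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=28
)

# steps 4-5: create the model, then fit it to the training data
demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)

# step 6: make predictions for the test data
demo_predictions = demo_model.predict(X_te)

# step 7: score the model on the test data
demo_score = demo_model.score(X_te, y_te)
print(demo_score)
```

The rest of this template walks through the same steps one cell at a time, using your own data.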



## Before you use this template
- This template assumes that your dataset is already clean.
- You may choose to clean the data using Python or other software, like Excel.
- Every dataset will have different cleaning needs. Common ones include dealing with missing values (removing or imputing them), converting values to standard data types, and feature engineering (i.e. creating new columns of values calculated from other variables).
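
To make those cleaning tasks concrete, here is a small sketch using pandas. The `raw` DataFrame and its column names are hypothetical; your own data will need different steps, so treat this as an illustration rather than a recipe.

```python
import numpy as np
import pandas as pd

# hypothetical raw data: one missing value, and numbers stored as text
raw = pd.DataFrame({
    "age": [34, np.nan, 45, 29],
    "income": ["52000", "61000", "58000", "47000"],
    "years_at_company": [4, 10, 8, 2],
})

# missing values: remove the affected rows
# (imputing, e.g. filling with the column mean, is the alternative)
clean = raw.dropna().copy()

# data types: convert the text column to integers
clean["income"] = clean["income"].astype(int)

# feature engineering: a new column calculated from existing ones
clean["income_per_year_at_company"] = clean["income"] / clean["years_at_company"]

clean
```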

<div class="alert alert-info"><b> Why don't we have a template for data cleaning? </b></div>


## 1. Load the data


In [None]:
# load the data
# fill in the ... with the path to the data file. Don't forget the file extension

In [None]:
# run this cell to load the data
data = pd.read_csv("...")

# show the first 5 rows of the data
data.head()

## 3. Select the explanatory and response variables
- explanatory variables should be the names of the appropriate columns, each enclosed in quotation marks, listed inside the square brackets and separated by commas 
- response variable should be the name of the appropriate column, enclosed in quotation marks
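
For example, with a made-up DataFrame (the `toy` table and its column names are hypothetical), the selection might look like the sketch below. A list of column names gives back a DataFrame, while a single column name gives back a Series.

```python
import pandas as pd

# hypothetical data set with three columns
toy = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4],
    "hours_slept": [8, 7, 6, 5],
    "exam_score": [55, 65, 70, 80],
})

# explanatory variables: a list of column names, each in quotation marks
toy_expl_vars = ["hours_studied", "hours_slept"]

# response variable: a single column name in quotation marks
toy_resp_var = "exam_score"

toy_X = toy.loc[:, toy_expl_vars]  # DataFrame with 4 rows and 2 columns
toy_y = toy[toy_resp_var]          # Series with 4 items

print(type(toy_X).__name__, toy_X.shape)
print(type(toy_y).__name__, toy_y.shape)
```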

In [None]:
# choose explanatory and response variables
expl_vars = [...]

resp_var = ...

In [None]:
# create the X DataFrame
X = data.loc[:, expl_vars]

# show the first 5 rows
X.head()

In [None]:
# create the y array
y = data[resp_var]

# show the first 5 items
y.head()

## 4. Split into train, test and (optionally) validation sets

- the random seed can be any number, as long as it's consistent
- use a validation set if you want to go through the full model selection process, including tuning hyperparameters. See Notebook 08 (Model Selection) for an example.
- running only the first cell will put 80% of the data in the training set and 20% in the test set
- running the first and second cells will put 60% of the data in the training set, 20% in the test set, and 20% in the validation set
- to change the proportions of how much data goes in each set, edit the train_size and test_size arguments
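
One way to arrive at the 60/20/20 proportions is to split twice: first hold out 20% as the test set, then split the remaining 80% so that a quarter of it becomes the validation set. The sketch below checks that arithmetic on 100 made-up rows; `X_all`, `y_all`, and the shortened variable names are placeholders for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows so the percentages are easy to read
X_all = np.arange(200).reshape(100, 2)
y_all = np.arange(100)

# first split: 80% training, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, train_size=0.8, test_size=0.2, random_state=28
)

# second split, applied to the training set only:
# 75% of 80 rows = 60 stay in training, 25% of 80 rows = 20 become validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, train_size=0.75, test_size=0.25, random_state=28
)

print(len(X_tr), len(X_va), len(X_te))  # 60 20 20
```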

In [None]:
# set the random seed
np.random.seed(28)

In [None]:

# run this cell to split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

In [None]:
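# if you want to create a validation set, delete the # from the beginning of the last line and run the cell
# this splits the training set again: 75% of it stays in training and 25% becomes the validation set

# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.75, test_size=0.25)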

## 5. Import and create the model


In [None]:
# import the code that creates linear regression models
from sklearn.linear_model import LinearRegression

# create a new, untrained model
lr_model = LinearRegression(fit_intercept=True)


## 6. Fit the model

In [None]:
# fit the model
lr_model.fit(X_train, y_train)

## 7. Make predictions with fitted model



In [None]:
# save the predictions to a variable
lr_predictions = lr_model.predict(X_test)
# show the predictions
lr_predictions

## 8. Score the model

- Note: the `score` method returns a different metric depending on which algorithm you use. The [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for each type of model explains what `score` returns; scroll down to the description of the `score` method.
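
For `LinearRegression` specifically, `score` returns the coefficient of determination, $R^2$. The sketch below, on made-up data (`X_toy` and `y_toy` are hypothetical), checks that `score` matches both `sklearn.metrics.r2_score` and the formula $1 - SS_{res}/SS_{tot}$ computed by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# made-up data: a noisy straight line
rng = np.random.RandomState(28)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = 2.0 * X_toy[:, 0] + 1.0 + rng.normal(scale=1.0, size=50)

toy_lr = LinearRegression().fit(X_toy, y_toy)
toy_pred = toy_lr.predict(X_toy)

# three ways to get the same number
toy_score = toy_lr.score(X_toy, y_toy)
toy_r2 = r2_score(y_toy, toy_pred)
ss_res = np.sum((y_toy - toy_pred) ** 2)       # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)   # total sum of squares
toy_manual = 1 - ss_res / ss_tot

print(toy_score, toy_r2, toy_manual)
```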


In [None]:
# save the score to a variable
lr_score = lr_model.score(X_test, y_test)
# show the score
lr_score

### 9c. Using Scikit-Learn: KNN <a id="section9c"></a>

Now it's time to return to our IBM employee attrition data set to see if we can use a k-nearest neighbors model to predict whether or not an employee will leave the company. This section follows the steps in section 9b very closely. If you're stuck, check the `LinearRegression` code for some hints.

As a reminder, here's what our attrition data looks like:

In [None]:
# the first five rows of the attrition data
attrition.head()

Remember, our steps are:

1. Choose your explanatory variables (what you're going to use to make predictions) and response variable (what you want to predict)
2. Split the data into training and testing sets
3. Create a model in Python
4. Fit the data to the model
5. Make predictions using your fitted model
6. Score the accuracy of your model

Start by choosing explanatory and response variables. For this exercise, we'll try predicting attrition based on **monthly income**, **number of years in current role**, and **overtime eligibility**.

<div class="alert alert-warning">
    <b>EXERCISE</b>: Choose explanatory and response variables.
    <ul>
        <li>explanatory variables must be numerical feature column names that could help predict attrition. For this exercise, we'll be using the columns with data on the employee's monthly income, years in current role, and overtime status. The names should be strings (i.e. inside quotation marks), listed within the square brackets and separated by commas</li>
        <li>the response variable should be the name of the column that says whether or not an employee left by attrition, inside quotation marks</li>
    </ul>
    </div>

In [None]:
# choose explanatory and response variables
expl_vars_attrition = ["MonthlyIncome", "YearsInCurrentRole", "OverTime"]

resp_var_attrition = "Attrition"

In [None]:
# run this cell to check your answer for some common mistakes
check("tests/09/vars.ok")

Now, split the data into training and test sets. If you've chosen your variables correctly, you can just run the next few cells. The training data will be called `X_train_att` and `y_train_att`, while the test data will be called `X_test_att` and `y_test_att`.

In [None]:
# once you've chosen your variables, just run this cell
# create the X DataFrame
X_attrition = attrition.loc[:, expl_vars_attrition]

# show the first 5 rows
X_attrition.head()

In [None]:
# once you've chosen your variables, just run this cell
# create the y array
y_attrition = attrition[resp_var_attrition]

# show the first 5 items
y_attrition.head()

In [None]:
# once you've chosen your variables, just run this cell
# split the data into training and test
X_train_att, X_test_att, y_train_att, y_test_att = train_test_split(X_attrition, y_attrition,
                                                                    train_size=0.8, test_size=0.2)

The next step is to create the KNN model. The first line of the next cell imports the code for the model, which is called `KNeighborsClassifier`. For now, we'll use the default number of neighbors, which is 5.
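
As a quick check of that default, the sketch below creates one classifier with no arguments and one with a different number of neighbors. The names `default_knn` and `custom_knn` are used only for this illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# no arguments: the model uses the default number of neighbors
default_knn = KNeighborsClassifier()
print(default_knn.n_neighbors)  # 5

# the n_neighbors argument sets a different number of neighbors
custom_knn = KNeighborsClassifier(n_neighbors=3)
print(custom_knn.n_neighbors)  # 3
```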

<div class="alert alert-warning">
    <b>EXERCISE:</b> Create the KNN model. Replace the ellipses with a model creation call expression, which is the model type followed by parentheses (don't put anything in the parentheses to use the default number of neighbors). We will assign the created model the name of <code>knn</code> using an equals sign.
    </div>

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# create the KNeighborsClassifier
knn = KNeighborsClassifier()


In [None]:
# run this cell to check your answer for some common mistakes
check("tests/09/model.ok")

With the model created, it can now be fit. 

<div class="alert alert-warning">
    <b>EXERCISE:</b> Fit the KNN model to the training data. Use the <code>fit</code> method on the model to train it. <code>fit</code> takes two arguments: the training feature matrix (<code>X_train_att</code>) and the list of classes for the training data (<code>y_train_att</code>).
    </div>

In [None]:
# fit the knn model to the training data
# look at the code for the linear regression model for a hint
knn.fit(X_train_att, y_train_att)

A fitted model can make predictions. Use the fitted model's `predict` method to predict attrition for the test data. `predict` takes only one argument: an X matrix of feature data with the same features (columns) that the model was trained on.

<div class="alert alert-warning">
    <b>EXERCISE:</b> Use <code>knn.predict</code> to make predictions for the test data (<code>X_test_att</code>). Check the code for the linear regression model predictions for a hint on how the code expression should look.
    </div>

In [None]:
# use the fitted knn model to predict attrition for the test data
# look at the code for the linear regression model for a hint
knn_predictions = ...
knn_predictions

In [None]:
# run this cell to check your answer for some common mistakes
check("tests/09/pred.ok")

Finally, let's evaluate the model. 

For KNN, the `score` method [returns the mean accuracy on the data and labels](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score). Put another way, the score for KNN is the proportion of employees correctly labeled by the model: the number of correctly labeled employees divided by the total number of employees. So, a score of 1.0 indicates a perfect model, and a score of 0.0 indicates a model that got all classifications wrong.
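
To see that definition in action, the sketch below fits a KNN model on a tiny made-up data set (all of the `toy_` names and values are hypothetical) and compares the result of `score` with the proportion of correct predictions computed by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up training data: class 0 clusters near 1, class 1 clusters near 11
X_toy_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy_train = np.array([0, 0, 0, 1, 1, 1])

# made-up test data; the last point sits near class 0 but is labeled 1
X_toy_test = np.array([[0.5], [1.5], [10.5], [4.0]])
y_toy_test = np.array([0, 0, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy_train, y_toy_train)
toy_pred = toy_knn.predict(X_toy_test)

# accuracy by hand: correct predictions divided by total predictions
manual_accuracy = np.mean(toy_pred == y_toy_test)

# accuracy from the score method
method_accuracy = toy_knn.score(X_toy_test, y_toy_test)

print(manual_accuracy, method_accuracy)  # 0.75 0.75
```

Three of the four test points are labeled correctly, so both calculations give 0.75.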

<div class="alert alert-warning">
    <b>EXERCISE:</b> Use the model's <code>score</code> method to evaluate its accuracy on the test data. <code>score</code> requires 2 arguments: a matrix of feature data (X) and an array of response variables (y).
    </div>

In [None]:
# use the fitted knn model to score predictions for the test data
# look at the code for the linear regression model for a hint
knn_score = ...
# show the score
knn_score

In [None]:
# run this cell to check your answer for some common mistakes
check("tests/09/score.ok")

#### Visualization
When evaluating the KNN model, it can also be helpful to visualize the data so we can see the scope of the problem we're trying to solve. 

The following scatter plots plot our explanatory variables against each other. Every point represents a single employee, and the color and shape of the point indicate whether or not they left IBM: an orange circle indicates attrition and a blue triangle indicates the employee remained at IBM.

Because KNN assumes that points of the same class are near to each other, the ideal KNN scatter plot would have all points of the same type located next to each other on the plot. 
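
To make that assumption concrete, here is a simplified from-scratch sketch of how a KNN prediction can be made for one new point: measure the distance to every training point, keep the `k` closest, and take a majority vote. It assumes Euclidean distance and made-up points; the function `knn_predict_one` is written for this illustration, and Scikit-Learn's own implementation is more optimized and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, new_point, k=5):
    """Predict the class of new_point by majority vote of its k nearest neighbors."""
    # Euclidean distance from new_point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count the classes among those neighbors and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# made-up points: class 0 in the lower left, class 1 in the upper right
X_pts = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_pts = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_pts, y_pts, np.array([1.2, 1.4]), k=3))  # 0
print(knn_predict_one(X_pts, y_pts, np.array([8.4, 8.2]), k=3))  # 1
```

A new point surrounded by points of the other class would be outvoted, which is why well-separated groups on the scatter plot suit KNN.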

This plot shows monthly income and years in current role:

In [None]:
# make the plots bigger
sns.set(rc={'figure.figsize':(11,8)})

# create the scatter plot
sns.scatterplot(x=attrition["MonthlyIncome"], 
                # add a bit of jitter to YearsInCurrentRole so the points overlap less
                y=attrition["YearsInCurrentRole"] + np.random.normal(scale = 0.3, size=attrition.shape[0]), 
                data=attrition,
                hue="Attrition", style="Attrition", markers={0:"^", 1:"o"}, palette={0:"#003262", 1:"#FDB515"});

Sometimes, adding a third dimension can make a difference for the KNN algorithm. The next plot shows all three of our chosen explanatory variables: monthly income, years in current role, and overtime. Note that overtime is a binary value (employees are either eligible for overtime or not), so the overtime axis splits the points into two planes (one for each possible value of "OverTime").

In [None]:
# make the 3D scatter plot
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')

att_pos = attrition[attrition.Attrition == 1]
att_neg = attrition[attrition.Attrition == 0]

x_label = "YearsInCurrentRole"
y_label = "MonthlyIncome"
z_label = "OverTime"

# plot each group separately; a little jitter on the z-axis keeps the points from overlapping
for subset, m, l, c in [(att_neg, "^", "Attrition = 0", "#003262"), (att_pos, "o", "Attrition = 1", "#FDB515")]:
    ax.scatter(subset[x_label], subset[y_label], (subset[z_label] + np.random.normal(scale=0.05, size=subset.shape[0])),
               marker=m, label=l, color=c)

ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.set_zlabel(z_label)

plt.legend();

<div class="alert alert-warning">
    <b>QUESTION:</b> Based on the model score, how well is our classifier working? Based on the scatter plot, which employees are most likely to be misclassified, and how can you tell? (Hint: think about how KNN makes predictions.)
    </div>

YOUR ANSWER HERE

#### References
- [IBM HR Analytics Employee Attrition & Performance mock data set](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/home) is made available under the [Open Database License](http://opendatacommons.org/licenses/odbl/1.0/). Any rights in individual contents of the database are licensed under the [Database Contents License](http://opendatacommons.org/licenses/dbcl/1.0/).