# Data Science: Bridging Principles and Practice
## Scikit-Learn Template

<img src="images/ml.jpg" />

## Overview <a id="section9b"></a>

[Scikit-Learn](https://scikit-learn.org/stable/index.html) is free, open-source software for machine learning in Python. The software includes tools to create, train, and evaluate machine learning models.

Overview of using Scikit-Learn for machine learning:

1. Load the data
2. Choose your explanatory variables (what you're going to use to make predictions) and response variable (what you want to predict)
3. Split the data into training and testing sets (and, if you're doing a lot of parameter tuning, a validation set)
4. Create a model in Python
5. Fit the model to the data
6. Make predictions using your fitted model
7. Score the accuracy of your model
8. Visualize your results
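
As a rough illustration of steps 1 through 7, here is a minimal sketch that runs the whole workflow on a small synthetic dataset. The column names (`hours`, `score`), the `toy_` variable names, and the data itself are made up for this example and are not part of the template; a real project would load a file in step 1 instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. "load" the data: a synthetic table stands in for a real file (assumption)
rng = np.random.default_rng(28)
toy = pd.DataFrame({"hours": rng.uniform(0, 10, size=100)})
toy["score"] = 50 + 4 * toy["hours"] + rng.normal(0, 2, size=100)

# 2. choose the explanatory and response variables
X_toy = toy[["hours"]]
y_toy = toy["score"]

# 3. split into training (80%) and test (20%) sets
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=28
)

# 4. create a new, untrained model
toy_model = LinearRegression()

# 5. fit the model to the training data
toy_model.fit(X_toy_train, y_toy_train)

# 6. make predictions for the test set
toy_predictions = toy_model.predict(X_toy_test)

# 7. score the model on the test set
toy_score = toy_model.score(X_toy_test, y_toy_test)
print(round(toy_score, 3))
```

The rest of this template walks through the same steps one at a time, using your own data.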



## Before you use this template
- This template assumes that your dataset is already clean.
- You may choose to clean the data using Python or other software, such as Excel. For tools to clean data using Python, please see the Data-Cleaning-template notebook.


In [None]:
# run this cell to import some necessary software
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import sklearn
import seaborn as sns

from sklearn.model_selection import train_test_split

# set the random seed for reproducibility
np.random.seed(28)


## 1. Load the clean data


- If you're working in the Haas Executive Education cloud server, you will need to upload your dataset. Go to [bee.haas.berkeley.edu](https://bee.haas.berkeley.edu). Click on the "haas-ds-online" folder, then the "data" folder. Then, click the "Upload" button at the top right. The relative path to your dataset is now "data/name-of-your-file"
- If you're on your computer, make sure you know the relative path of your data file. Putting it in the same directory as your notebook will help.
- Replace the ... with the relative path of your data file. Don't forget the file extension.
- Use the first code cell if your data is in a csv (comma separated values) file.  Use the second code cell if your data is in an Excel file.
- If your data is in a different file type, you can see if there are functions to read it with Pandas at [this link](https://pandas.pydata.org/docs/user_guide/io.html). Note that csv and Excel files tend to be the easiest to work with in Pandas.
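
To see what `pd.read_csv` produces without needing a file on disk, here is a small sketch that reads CSV text from memory. The `io.StringIO` object stands in for a file path, and the column names and values are invented for illustration.

```python
import io

import pandas as pd

# made-up CSV content; in the template you would pass a relative file path instead
csv_text = io.StringIO(
    "sqft,bedrooms,price\n"
    "850,2,200000\n"
    "1200,3,310000\n"
    "1500,3,365000\n"
)

# read_csv accepts a path or a file-like object and returns a DataFrame
example_data = pd.read_csv(csv_text)

# check the number of rows and columns, and the type pandas inferred for each column
print(example_data.shape)
print(example_data.dtypes)
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed the way you expected.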

In [None]:
# run this cell to load the data
data = pd.read_csv(...)

# show the first 5 rows of the data
data.head()

In [None]:
# run this cell instead if your data is in an Excel file
data = pd.read_excel(...)

# show the first 5 rows of the data
data.head()

## 2. Select the explanatory and response variables
- explanatory variables should be the names of the appropriate columns, each enclosed in quotation marks, listed inside the square brackets and separated by commas 
- response variable should be the name of the appropriate column, enclosed in quotation marks
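
For example, with a hypothetical housing table, the selection might look like the sketch below. The column names `sqft`, `bedrooms`, and `price`, and the `example_` variable names, are assumptions made for illustration only.

```python
import pandas as pd

# a tiny made-up table standing in for your loaded data
example_data = pd.DataFrame(
    {
        "sqft": [850, 1200, 1500],
        "bedrooms": [2, 3, 3],
        "price": [200000, 310000, 365000],
    }
)

# explanatory variables: a list of column names, each in quotation marks
example_expl_vars = ["sqft", "bedrooms"]

# response variable: a single column name in quotation marks
example_resp_var = "price"

# selecting with a list of names gives a DataFrame; a single name gives a Series
X_example = example_data.loc[:, example_expl_vars]
y_example = example_data[example_resp_var]

print(type(X_example).__name__, X_example.shape)
print(type(y_example).__name__, y_example.shape)
```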

In [None]:
# choose explanatory and response variables
expl_vars = [...]

resp_var = ...

In [None]:
# create the X DataFrame
X = data.loc[:, expl_vars]

# show the first 5 rows
X.head()

In [None]:
# create the y Series
y = data[resp_var]

# show the first 5 items
y.head()

## 3. Split into train, test and (optionally) validation sets

- the random seed can be any number, as long as it's consistent
- use a validation set if you want to go through the full model selection process, including tuning hyperparameters. See Notebook 08 (Model Selection) for an example.
- running only the first split cell will put 80% of the data in the training set and 20% in the test set
- running both split cells will put 60% of the data in the training set, 20% in the test set, and 20% in the validation set (the second cell splits the training set again, so run it only once)
- to change the proportions of how much data goes in each set, edit the train_size and test_size arguments
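
One way to arrive at the 60/20/20 proportions is to split twice: first hold out 20% as the test set, then take 25% of the remaining 80% as the validation set. The sketch below checks the resulting sizes on made-up data; the `_demo` names and the use of `random_state` (instead of a global seed) are choices made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with 2 columns each, plus a made-up response
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# first split: 80% kept for training/validation, 20% held out for testing
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=28
)

# second split: 75% of the remaining 80% is training (60% overall),
# 25% of it is validation (20% overall)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, train_size=0.75, test_size=0.25, random_state=28
)

# sizes of the training, validation, and test sets
print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Because the three sets come from two successive splits of the same rows, no row appears in more than one set.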

In [None]:
# set the random seed
np.random.seed(28)

In [None]:

# run this cell to split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

In [None]:
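# if you want to create a validation set, delete the # from the beginning of the last line and run the cell once
# this splits the training set again: 75% of it stays in training, 25% becomes the validation set

# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.75, test_size=0.25)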

## 4. Import and create the model


In [None]:
# import the code that creates linear regression models
from sklearn.linear_model import LinearRegression

# create a new, untrained model
# (the `normalize` argument was removed in scikit-learn 1.2, so it is not passed here)
lr_model = LinearRegression(fit_intercept=True)

## 5. Fit the model

In [None]:
# fit the model
lr_model.fit(X_train, y_train)

## 6. Make predictions with fitted model


For the training set:

In [None]:
# save the predictions to a variable
lr_train_predictions = lr_model.predict(X_train)
# show the predictions
lr_train_predictions

For the validation set (skip this cell if you didn't create a validation set; otherwise it will raise an error):

In [None]:
# If you're using a validation set:
# save the predictions to a variable
lr_val_predictions = lr_model.predict(X_val)
# show the predictions
lr_val_predictions

For the test set:

In [None]:
# save the predictions to a variable
lr_test_predictions = lr_model.predict(X_test)
# show the predictions
lr_test_predictions

## 7. Score the model

- Note: the `score` method returns a different metric depending on the type of model. For regression models such as `LinearRegression`, it returns the coefficient of determination (R²); for classifiers, it returns the mean accuracy. The [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for each type of model explains what `score` returns if you scroll down to that method.
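
As a quick check of what `score` means for a linear regression, the sketch below compares it with `sklearn.metrics.r2_score` (the coefficient of determination) on made-up data. The `_demo` names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# made-up data: one explanatory variable with a noisy linear relationship
rng = np.random.default_rng(28)
X_score_demo = rng.uniform(0, 10, size=(50, 1))
y_score_demo = 3 * X_score_demo[:, 0] + rng.normal(0, 1, size=50)

# fit a linear regression model
score_demo_model = LinearRegression().fit(X_score_demo, y_score_demo)

# the model's own score method
score_from_method = score_demo_model.score(X_score_demo, y_score_demo)

# R^2 computed from the actual values and the model's predictions
score_from_metric = r2_score(y_score_demo, score_demo_model.predict(X_score_demo))

# the two values agree up to floating-point rounding
print(np.isclose(score_from_method, score_from_metric))
```

A score of 1.0 would mean the predictions match the actual values exactly; values near 0 mean the model does little better than always predicting the mean.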


In [None]:
# save the score to a variable
lr_train_score = lr_model.score(X_train, y_train)

# show the score
lr_train_score

In [None]:
# If you're using a validation set:
# save the score to a variable
lr_val_score = lr_model.score(X_val, y_val)

# show the score
lr_val_score

In [None]:
# save the score to a variable
lr_score = lr_model.score(X_test, y_test)

# show the score
lr_score

## 8. Visualize Results

The following cell shows code to plot predicted vs actual values for the training data. The cell after that has the same code with references to the training data removed, if you want to plot validation or test data.

Note: scatter plots work well for problems that work with continuous, numerical data, like regression problems. Other types of visualizations may be appropriate for other problems. Check out the [Seaborn Python visualization library](https://seaborn.pydata.org/tutorial.html) for example code for other types of plots.

In [None]:
# make the blank subplots and increase their size 
f, [ax1, ax2] = plt.subplots(2, figsize=(12, 12))

# plot the actual values (x) against the predicted values (y)
sns.regplot(x=y_train, y=lr_train_predictions, ax=ax1, color="#003262") 
ax1.set_xlabel("actual")
ax1.set_ylabel("predicted")
ax1.set_title("Predicted vs. Actual Values")

# plot the actual values (x) against the error for each prediction (y)
sns.scatterplot(x=y_train, y=y_train - lr_train_predictions, ax=ax2, color="#FDB515") 
ax2.set_xlabel("actual")
ax2.set_ylabel("error")
ax2.set_title("Error")
ax2.hlines(y=0, xmin=min(y_train), xmax=max(y_train));

In [None]:
# make the blank subplots and increase their size 
f, [ax1, ax2] = plt.subplots(2, figsize=(12, 12))

# plot the actual values (x) against the predicted values (y)
sns.regplot(x=..., y=..., ax=ax1, color="#003262") 
ax1.set_xlabel("actual")
ax1.set_ylabel("predicted")
ax1.set_title("Predicted vs. Actual Values")

# plot the actual values (x) against the error for each prediction (y)
sns.scatterplot(x=..., y=... - ..., ax=ax2, color="#FDB515") 
ax2.set_xlabel("actual")
ax2.set_ylabel("error")
ax2.set_title("Error")
ax2.hlines(y=0, xmin=min(...), xmax=max(...));

### Using a different kind of model

Scikit-Learn's ["Choosing the right estimator" flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) is a nice tool, but ultimately your algorithm choice should draw strongly from your knowledge of the domain, the problem, and the available algorithms.

You can find the [list of supervised and unsupervised learning algorithms in Scikit-Learn here](https://scikit-learn.org/stable/user_guide.html). To use a different algorithm, click through the link to the algorithm you want; on each page, you'll find example code showing how to use it. For most algorithms, the code below will work as long as you fill in the import statement (saying what model to import and where it's located) and fill in the code to create a new, untrained instance of that model.

In [None]:
# fill in the import and the model, then fit, predict, and score

from ... import ...

model = ...
model.fit(X_train, y_train)
model.predict(X_test)
model.score(X_test, y_test)
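
As one filled-in instance of this pattern, the sketch below swaps in a decision tree regressor. The choice of `DecisionTreeRegressor`, the `max_depth=4` setting, and the synthetic data are all assumptions made for illustration; they are not recommendations for your problem.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# made-up data: two explanatory variables and a noisy response
rng = np.random.default_rng(28)
X_tree_demo = rng.uniform(0, 10, size=(200, 2))
y_tree_demo = 2 * X_tree_demo[:, 0] - X_tree_demo[:, 1] + rng.normal(0, 0.5, size=200)

X_tree_train, X_tree_test, y_tree_train, y_tree_test = train_test_split(
    X_tree_demo, y_tree_demo, test_size=0.2, random_state=28
)

# create a new, untrained model; only this line and the import differ from the linear regression code
tree_model = DecisionTreeRegressor(max_depth=4, random_state=28)

# fit, predict, and score use the same method names as before
tree_model.fit(X_tree_train, y_tree_train)
tree_predictions = tree_model.predict(X_tree_test)
tree_score = tree_model.score(X_tree_test, y_tree_test)
print(round(tree_score, 3))
```

Because Scikit-Learn models share the `fit`, `predict`, and `score` methods, the rest of the template usually needs no changes beyond the variable names.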

#### References
- header image credit: "Machine Learning & Artificial Intelligence", [Mike Mackenzie](https://www.flickr.com/photos/mikemacmarketing/42271822770). [CC BY 2.0](https://creativecommons.org/licenses/by/2.0/)

Notebook author: Keeley Takimoto