# Scikit-learn Intro

## Resources
+ [Scikit-learn docs][1]
+ [Hands-On Machine Learning with Scikit-Learn][2] - good book

[1]: http://scikit-learn.org/stable/
[2]: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

# Typical Workflow for Beginners
* Find dataset
    * [Kaggle Datasets](https://www.kaggle.com/datasets)
    * [data.world](https://data.world/)
    * [data.gov](https://www.data.gov/)
 

* Read data into Pandas

* Clean data

* Exploratory data analysis with basic statistics and visualizations

* Define Problem

* Extract to NumPy

* Train and Evaluate model with Scikit-Learn

# Ames Housing Data

* Compiled by Professor Dean De Cock; covers homes sold in Ames, Iowa from 2006 to 2010
* Original dataset has 79 features and 1460 samples
* For simplicity, we will only look at 8 features
* Predict sale price
* Evaluation metric - R^2 (coefficient of determination), a standardized form of squared error

# Read in Data

In [None]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 100
import seaborn as sns

%matplotlib inline

In [None]:
housing = pd.read_csv('data/housing.csv')
housing.head()

# Some Quick EDA

In [None]:
sns.pointplot(x='OverallQual', y='SalePrice', data=housing, ci=0)

In [None]:
sns.barplot(x='GarageType', y='SalePrice', data=housing, ci=0)

In [None]:
sns.regplot(x='GrLivArea', y='SalePrice', data=housing.sample(frac=.3))

# What are we predicting
In this problem, we want to predict the final sale price of the house.

### Remove target variable from DataFrame and assign to its own variable
The **`pop`** method will remove a column and return its values.

In [None]:
sale_price = housing.pop('SalePrice')

### Verify that it worked

In [None]:
sale_price.head()

In [None]:
housing.head()

# Remedying missing values
* Replacing numeric missing values with the median, mean, or mode

In [None]:
housing.isna().sum()

# Make a copy of the data to keep original the same

In [None]:
housing_ml = housing.copy()

# Categorical vs Continuous features

* Each feature (column) is either categorical or continuous
* Categorical features take on a limited set of distinct values and are usually strings (though they can be numbers as well)
* Continuous features can take on any value and are always numeric

# Strategy for missing values
* Drop rows or columns with missing values
* Fill with mean, median, mode (Continuous)
* Fill with most common or another string (Categorical)
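
The three strategies above can be sketched on a tiny DataFrame. The values below are made up for illustration and are not taken from the housing file; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the housing file)
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
    'GarageType': ['Attchd', None, 'Detchd', 'Attchd'],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill a continuous column with its median
filled = toy.copy()
filled['LotFrontage'] = filled['LotFrontage'].fillna(filled['LotFrontage'].median())

# Strategy 3: fill a categorical column with its most common value
most_common = filled['GarageType'].mode()[0]
filled['GarageType'] = filled['GarageType'].fillna(most_common)

print(dropped.shape)
print(filled)
```

Dropping rows is the simplest option but throws away data, which is why the cells that follow fill the missing values instead.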

### Fill LotFrontage with median

In [None]:
med = housing_ml['LotFrontage'].median()
med

In [None]:
housing_ml['LotFrontage'] = housing_ml['LotFrontage'].fillna(med)

In [None]:
housing_ml.isna().sum()

### Examine Garage Type

In [None]:
housing_ml['GarageType'].value_counts()

### Fill missing garage type values with 'Missing'
Create a new category for garage type.

In [None]:
housing_ml['GarageType'] = housing_ml['GarageType'].fillna('Missing')

## Verify no missing values

In [None]:
housing_ml.isna().sum()

# Encode categorical features
We can only have numeric values when using Scikit-Learn. We have two string columns:

In [None]:
housing_ml.head()

### String columns have a data type of 'object' in Pandas

In [None]:
housing_ml.dtypes

### No standard way of handling them in Scikit-Learn

### Use Pandas `get_dummies` function
For each object or category data type column in your DataFrame, pandas will encode it.

### How the encoding works
For each unique value in the object/category columns, a new column will be created in the DataFrame. The new column name will have the unique value appended to the old column name and connected by an **underscore**.

In [None]:
housing_num = pd.get_dummies(housing_ml)
housing_num.head()

## New columns for each unique value
Notice that the `RoofMatl` column no longer exists. It is replaced by 8 columns, for each of its unique values. All row values will be 0 except for the old column value, which will now be 1. This is sometimes referred to as **one hot encoding**.

In [None]:
housing_ml['RoofMatl'].unique()
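
The naming convention is easier to see on a small example. The sketch below uses a made-up three-row DataFrame (the values are hypothetical) with one numeric and one string column.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'GrLivArea': [1500, 2000, 1200],
    'GarageType': ['Attchd', 'Detchd', 'Attchd'],
})

# Numeric columns pass through unchanged; string columns are expanded
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

The original `GarageType` column is gone and one `GarageType_<value>` column exists per unique value. Recent pandas versions store these indicator columns as `True`/`False` rather than 1/0, which scikit-learn treats the same way.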

# Data is Prepared for Machine Learning
We filled in the missing values and encoded the categorical variables. We are now ready to do machine learning.

# Extracting to Scikit-Learn
Even though you can make scikit-learn work with Pandas objects, it's best to get the underlying NumPy arrays as the documentation explicitly works with NumPy.

# Begin with the simplest model
Use a very simple model to form a baseline. This will help you track your progress.

### Choose a single predictor - `TotalBsmtSF`
Let's choose a single predictor variable (feature) such as `TotalBsmtSF`.

### Use the `values` attribute

In [None]:
X = housing_num['TotalBsmtSF'].values
y = sale_price.values

In [None]:
type(X)

In [None]:
type(y)

In [None]:
X

In [None]:
y

# Training a model in Scikit-Learn in 3 steps

## Step 1: Import your Model (Estimator)

### We need to use a Regression model and NOT Classification
The simplest regression model is linear regression.

Let's take a look at the [Scikit-Learn API][1] to see the estimators and where they are located.

[1]: http://scikit-learn.org/stable/modules/classes.html

In [None]:
from sklearn.linear_model import LinearRegression

## Step 2: Instantiate the Estimator
When we import the estimator, we have only imported a **blueprint**. We haven't built our model yet. We must create an instance of it. This is called **instantiation** and is a common term in object-oriented programming.

In [None]:
lr = LinearRegression()

## Step 3: Train the model with the `fit` method
The `fit` method takes two arguments - the X and y arrays. The algorithm uses the input and target arrays to meet its objective, which here is minimizing the squared error.

In [None]:
lr.fit(X, y)

## AHHH!! Another NumPy Gotcha - The input data must be a 2 dimensional array
Even though we are only using a single predictor, scikit-learn requires the input to be a 2-dimensional array. The error message even suggests the code to make it 2-dimensional.

In [None]:
X.shape

In [None]:
X.ndim

In [None]:
X = X.reshape(-1, 1)
X.shape

In [None]:
X.ndim
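
As an alternative to `reshape`, selecting the column with a list (double brackets) returns a one-column DataFrame, whose `values` are already 2-dimensional. The sketch below shows both routes on a made-up three-row DataFrame.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({'TotalBsmtSF': [850, 1200, 0]})

X_1d = toy['TotalBsmtSF'].values      # Series -> 1-dimensional, shape (3,)
X_reshaped = X_1d.reshape(-1, 1)      # reshaped to one column, shape (3, 1)
X_2d = toy[['TotalBsmtSF']].values    # one-column DataFrame -> shape (3, 1)

print(X_1d.shape, X_reshaped.shape, X_2d.shape)
```

Both 2-dimensional arrays hold the same numbers; which route to use is a matter of taste.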

## Now we can train the model

In [None]:
lr.fit(X, y)

# Our model is ready to use - Let's make some predictions
We can get predictions if we pass in values for total basement square feet. Let's make predictions for 100, 1,000, and 10,000 square foot basements.

In [None]:
# predict expects a 2-dimensional array: one row per house, one column per feature
lr.predict([[100]])

In [None]:
lr.predict([[1000]])

In [None]:
lr.predict([[10000]])
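
Because `predict` works on a 2-dimensional array with one row per house, several values can be predicted in a single call. The sketch below is self-contained: it fits a separate model on made-up data that follows `y = 100 * x + 50000` exactly, so the numbers are illustrative and unrelated to the housing results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data on an exact line (hypothetical numbers)
X_toy = np.array([[500], [1000], [1500], [2000]])
y_toy = 100 * X_toy.ravel() + 50000

model = LinearRegression().fit(X_toy, y_toy)

# Three basements at once: shape (3, 1) in, shape (3,) out
new_basements = np.array([[100], [1000], [10000]])
preds = model.predict(new_basements)
print(preds.shape)
print(preds)
```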

### Make a prediction for all inputs
We can pass in our input data to get the predicted value for each house in our dataset.

In [None]:
lr.predict(X)

# How well did our model do?
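Linear regression minimizes the squared error. R-squared is a standardized version of that error: it compares the model's squared error with that of always predicting the mean. An R-squared value of 1 means a perfect fit, and a value of 0 means the model does no better than predicting the mean.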

### Use the `score` method to find out the R-squared value
To score it, we use our `y` as the ground truth.

In [None]:
lr.score(X, y)
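
To see what `score` computes for a regressor, R-squared can be worked out by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses randomly generated toy data (an assumption for illustration, not the housing data) and checks the hand calculation against `score` and `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data: a noisy line (hypothetical numbers)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 2000, size=(50, 1))
y_toy = 100 * X_toy.ravel() + 50000 + rng.normal(0, 20000, size=50)

model = LinearRegression().fit(X_toy, y_toy)
preds = model.predict(X_toy)

ss_res = np.sum((y_toy - preds) ** 2)         # squared error of the model
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

# All three numbers should agree
print(r2_manual, model.score(X_toy, y_toy), r2_score(y_toy, preds))
```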

# Your Turn
* Choose a different predictor variable
* Extract to NumPy
* Make 2-dimensional
* Fit with linear regression
* Predict for various values
* Score

In [None]:
# your code here

# Different Models
There are dozens and dozens of machine learning models in Scikit-Learn and they all work very similarly.

## Let's use a decision tree model
Make sure we know what data we are using - easy to lose track of in the notebooks!

In [None]:
X = housing_num['TotalBsmtSF'].values
X = X.reshape(-1, 1)
y = sale_price.values

## Step 1: Import the Model

In [None]:
from sklearn.tree import DecisionTreeRegressor

## Step 2: Instantiate the Model

In [None]:
dtr = DecisionTreeRegressor()

## Step 3: Train the Model

In [None]:
dtr.fit(X, y)

## Make predictions and score

In [None]:
# predict expects a 2-dimensional array, even for a single value
dtr.predict([[1000]])

In [None]:
dtr.predict(X)

In [None]:
dtr.score(X, y)

## Wow, that's quite accurate!

# But, is something wrong?

# Are we cheating?

# Yes, we are cheating
We are scoring ourselves on the training set, the same data the model was fit on.

# The only thing that matters is our performance on unseen data

## Use Cross Validation to determine error rate we are likely to see in the real world

![](images/kfold.png)

### Function `cross_val_score` automates this process for us
This is not an estimator. It is a helper function. Pass it your estimator, data, and number of folds. It returns the evaluation metric on each fold.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score(dtr, X, y, cv=5)

### Average the scores

In [None]:
scores = cross_val_score(dtr, X, y, cv=5)
scores.mean()
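
What `cross_val_score` does for a regressor can be sketched by hand: split the rows into folds, fit a fresh copy of the estimator on all folds but one, and score it on the fold that was held out. The sketch below uses toy data and `LinearRegression` (chosen here only because it is deterministic) and compares the manual loop with the helper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data: a noisy line (hypothetical numbers)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 2000, size=(100, 1))
y_toy = 100 * X_toy.ravel() + 50000 + rng.normal(0, 20000, size=100)

estimator = LinearRegression()
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    model = clone(estimator)                       # fresh, unfitted copy
    model.fit(X_toy[train_idx], y_toy[train_idx])  # train on four folds
    manual_scores.append(model.score(X_toy[test_idx], y_toy[test_idx]))  # score on the fifth

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(estimator, X_toy, y_toy, cv=5)
print(manual_scores)
print(helper_scores)
```

Each score is measured on rows the model never saw during fitting, which is why it is a fairer estimate than scoring on the training set.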

# Wow that's pretty terrible!

## Overfitting
Our decision tree model overfit the data. It **memorized** the training data. By default, a decision tree keeps splitting until every leaf is pure or holds fewer than `min_samples_split` samples, which in practice often leaves just a single data point in each leaf.

## Too much flexibility
The decision tree has too much flexibility to fit the training data.

## Must Generalize to Unseen data
A good machine learning model will generalize to unseen data.
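
The gap between training score and cross-validated score is easy to reproduce on synthetic data. The sketch below is an illustration under assumed data (a noisy sine curve, not the housing data): an unconstrained tree fits the training rows perfectly, yet scores noticeably lower on held-out folds.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Toy data: a sine curve plus noise (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy.ravel()) + rng.normal(0, 0.5, size=200)

# Default settings: no limit on depth
deep = DecisionTreeRegressor(random_state=0)
deep.fit(X_toy, y_toy)

train_score = deep.score(X_toy, y_toy)                        # scored on data it memorized
cv_score = cross_val_score(deep, X_toy, y_toy, cv=5).mean()   # scored on unseen folds
print(train_score, cv_score)
```

The tree has learned the noise along with the signal, and the noise does not repeat in unseen data.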

## Put constraints on decision tree when instantiating it
So far, all of our machine learning models have been instantiated with the default values. We can change these defaults as if they were knobs on a control panel.

## Press shift + tab + tab to see the default values
Inspecting the estimator below shows that we have the following default values for these parameters.

* max_depth=None
* min_samples_split=2
* min_samples_leaf=1
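
Outside the notebook, where the shift + tab tooltip is not available, the same defaults can be read programmatically. A minimal sketch using the estimator's `get_params` method:

```python
from sklearn.tree import DecisionTreeRegressor

# get_params returns a dict of every constructor parameter and its current value
params = DecisionTreeRegressor().get_params()
for name in ['max_depth', 'min_samples_split', 'min_samples_leaf']:
    print(name, '=', params[name])
```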

## Change `max_depth` and check results with `cross_val_score`
Let's limit how deep the decision tree can go.

In [None]:
dtr = DecisionTreeRegressor(max_depth=4)

In [None]:
scores = cross_val_score(dtr, X, y, cv=5)
scores

In [None]:
scores.mean()

## Much better!!

# Your Turn - Can you tweak the parameters to get a better cross validation score?

In [None]:
# your code here

# Use all the variables
So, we have all these variables. Let's go ahead and see how much better we can predict with all of them. Let's also use a random forest, which is very popular and usually gives good results.

In [None]:
X = housing_num.values

In [None]:
X.shape

## Do the three steps

In [None]:
# import
from sklearn.ensemble import RandomForestRegressor

In [None]:
# instantiate
rfr = RandomForestRegressor()

In [None]:
# train
rfr.fit(X, y)

In [None]:
# get cross validation score
scores = cross_val_score(rfr, X, y, cv=5)
scores

In [None]:
scores.mean()

# Pretty good

# Can you tweak the parameters to get better results?

In [None]:
# your code here

# Grid Search

# Normalization