# Bias-Variance Tradeoff


## Agenda

1. Explain what bias, variance, and error are in the context of statistical modeling
2. Defining Error: prediction error and irreducible error
3. Define prediction error as a combination of bias and variance
4. Explore the bias-variance tradeoff
5. Train-test split



# 1. Explain what bias, variance, and error are in the context of statistical modeling

![which model is better](img/which_model_is_better.png)

https://towardsdatascience.com/cultural-overfitting-and-underfitting-or-why-the-netflix-culture-wont-work-in-your-company-af2a62e41288


# What makes a model good?

- We don’t ultimately care about how well your model fits your data.

- What we really care about is how well your model describes the process that generated your data.

- Why? Because the data set you have is but one sample from a universe of possible data sets, and you want a model that would work for any data set from that universe.

# What is a “Model”?

 - A “model” is a general specification of relationships among variables, e.g., linear regression:

$$\Large Price = \beta_1 X_1 + \beta_0 + \epsilon$$

 - Each model makes assumptions about how the variables interact. 
 - A 'trained model' applies these assumptions to the training data to learn how the variables relate.
 - In linear regression, the learning results in a set of parameters that define the best-fit linear equation.
 - The better the model learns from this training data, the more precisely it will reflect the real-world process that generated the data.
 - The model will then perform more accurately on unseen samples.
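
To make the difference between a model specification and a trained model concrete, here is a minimal sketch on synthetic data. The "true" coefficients (slope 3, intercept 50), the noise scale, and the sample size are illustrative assumptions, not values from any real data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = 3 * x + 50 + noise
true_beta_1, true_beta_0 = 3.0, 50.0
X = rng.uniform(0, 10, size=(500, 1))
y = true_beta_1 * X[:, 0] + true_beta_0 + rng.normal(0, 2, size=500)

# Training estimates the parameters of the specified linear form
model = LinearRegression().fit(X, y)
learned_beta_1 = model.coef_[0]
learned_beta_0 = model.intercept_

print(f"learned slope: {learned_beta_1:.2f}, learned intercept: {learned_beta_0:.2f}")
```

The learned parameters should land close to, but not exactly on, the values used to generate the data, because the training set is only one noisy sample.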


# Remember Expected Value?
- The expected value of a quantity is the average of that quantity across all possible outcomes, weighted by the probability of each outcome.

![6 sided die](https://media.giphy.com/media/sRJdpUSr7W0AiQ3RcM/giphy.gif)

- For a 6-sided die, another way to think about the expected value is as the arithmetic mean of a very large number of independent rolls.

### The expected value of a 6-sided die is:

In [4]:
# code

In [1]:
#__SOLUTION__

probs = 1/6
rolls = range(1,7)

expected_value = sum([probs * roll for roll in rolls])
expected_value


3.5
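
The "mean of a very large number of rolls" view can also be checked by simulation. This is a rough sketch; the seed and the number of rolls are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Roll a fair six-sided die many times (integers 1 through 6)
n_rolls = 100_000
simulated_rolls = rng.integers(1, 7, size=n_rolls)

# The sample mean should sit close to the expected value of 3.5
sample_mean = simulated_rolls.mean()
print(sample_mean)
```

The simulated mean will not be exactly 3.5, but it should get closer as `n_rolls` grows.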

Suppose we created a model which always predicted that the die roll would be 3.

The **bias** of our model would be the difference between our expected prediction (3) and the expected value (3.5).

What would the **variance** of our model be?
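
One way to reason about this question is to compute both quantities directly for the always-predict-3 model. This is a small sketch; the number of predictions is an arbitrary assumption.

```python
import numpy as np

# A "model" that ignores the data and always predicts 3
n_predictions = 1000
predictions = np.full(n_predictions, 3.0)

# Bias: expected prediction minus the expected value of a die roll
expected_value = 3.5
bias = predictions.mean() - expected_value

# Variance: how much the predictions themselves vary
variance = predictions.var()

print(bias, variance)
```

Because the prediction never changes, it is off by the same amount every time and does not vary at all.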


# 2. Defining Error: prediction error and irreducible error



### Regression fit statistics are often called “error”
 - Sum of Squared Errors (SSE)
 - Mean Squared Error (MSE) 
 
 Both are calculated using residuals

![residuals](img/residuals.png)
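
As a quick illustration of how both statistics come from residuals, here is a minimal sketch. The observed values and predictions are made-up numbers.

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred      # one residual per observation
sse = np.sum(residuals ** 2)     # Sum of Squared Errors
mse = sse / len(y_true)          # Mean Squared Error

print(sse, mse)
```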


This error can be broken up into parts:

$Total\ Error = Residual = Prediction\ Error + Irreducible\ Error$

![defining error](img/defining_error.png)

There will always be some random, irreducible error inherent in the data.  Real data always has noise.

The goal of modeling is to reduce the prediction error, which is the difference between our model and the real-world process that generated our data.

# 3. Define prediction error as a combination of bias and variance

Our prediction error can be further broken down into error due to bias and error due to variance.

$\Large Prediction\ Error = Model\ Bias^2 + Model\ Variance $

So our total error can be thought of as a combination of bias, variance, and irreducible error.

$\Large Total\ Error = Model\ Bias^2 + Model\ Variance + Irreducible\ Error$


**Model Bias** is the expected prediction error of your expected trained model.

> In other words, if you were to train multiple models on different samples, what would the average prediction error be?

**Model Variance** is the expected variation in predictions relative to your expected trained model.

> In other words, it is a measure of how much your model's predictions vary at any given point across those training samples.
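
For squared-error loss, this breakdown is often written as follows for a single input $x$. The notation is an assumption introduced here for illustration: $f$ is the true process, $\hat{f}$ is a model trained on a random training set, $y = f(x) + \epsilon$, and $\sigma^2$ is the variance of the zero-mean noise $\epsilon$.

$$\Large E\left[(y - \hat{f}(x))^2\right] = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma^2$$

The three terms on the right line up with $Model\ Bias^2$, $Model\ Variance$, and $Irreducible\ Error$.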

**Let's do a thought experiment:**

# Thought Experiment

1. Imagine you've collected 23 different training sets for the same problem.
2. Now imagine using one algorithm to train 23 models, one for each of your training sets.
3. Bias vs. variance refers to the accuracy vs. consistency of the models trained by your algorithm.

![target_bias_variance](img/target.png)

http://scott.fortmann-roe.com/docs/BiasVariance.html
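
The thought experiment can be roughly simulated. The sketch below assumes a made-up sine-shaped process, 20 rows per training set, and polynomial fits of degree 1 and 6 as the "simple" and "flexible" algorithms. None of these choices come from the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_process(x):
    # Hypothetical real-world process that generates the data
    return np.sin(2 * np.pi * x)

def sample_training_set(n=20, noise_sd=0.3):
    # One noisy training set drawn from the same process
    x = rng.uniform(0, 1, size=n)
    y = true_process(x) + rng.normal(0, noise_sd, size=n)
    return x, y

x0 = 0.25      # fixed point where every trained model makes a prediction
n_sets = 23    # one model per training set

results = {}
for degree in (1, 6):
    preds = []
    for _ in range(n_sets):
        x, y = sample_training_set()
        coefs = np.polyfit(x, y, deg=degree)   # same algorithm every time
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_process(x0)) ** 2   # accuracy on average
    variance = preds.var()                             # consistency across models
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

With only 23 training sets the estimates are noisy, so the exact numbers depend on the seed. The straight-line fit should still show a clearly larger squared bias than the flexible fit.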



# 4.  Explore Bias Variance Tradeoff

**High bias** algorithms tend to be less complex, with simple or rigid underlying structure.

+ They train models that are consistent, but inaccurate on average.
+ These include linear or parametric algorithms such as linear regression and naive Bayes.

On the other hand, **high variance** algorithms tend to be more complex, with flexible underlying structure.

+ They train models that are accurate on average, but inconsistent.
+ These include non-linear or non-parametric algorithms such as decision trees and nearest neighbors.



While we build our models, we have to keep this relationship in mind. If we build complex models, we risk overfitting: their predictions will vary greatly when they are introduced to new data. If our models are too simple, the predictions as a whole will be inaccurate.

The goal is to build a model with enough complexity to be accurate, but not so much that it becomes erratic.

![optimal](img/optimal_bias_variance.png)
http://scott.fortmann-roe.com/docs/BiasVariance.html
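
To see the tradeoff numerically, we can sweep model complexity and compare training and test error. This is a sketch under assumed conditions: synthetic sine-shaped data, polynomial degree as the complexity knob, and an arbitrary seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical noisy, non-linear data
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

train_mse, test_mse = {}, {}
for degree in (1, 3, 6, 10):
    # Higher degree means a more flexible (more complex) model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Training error can only go down as the degree grows. The thing to watch is whether test error keeps following it or starts to climb.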

### Let's take a look at our familiar King County housing data. 

In [35]:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.read_csv('data/kc_housing.csv', index_col='id')
df.head()

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [36]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Let's generate random subsets of our data
df = pd.read_csv('data/kc_housing.csv', index_col='id')

# Date is not in a usable format, so we drop it for now, along with the location columns.
df_low_var = df.drop(['date', 'zipcode', 'lat', 'long'], axis=1)

r_2 = []
low_var_rmse = []
for i in range(100):
    
    df_sample = df_low_var.sample(5000, replace=True)
    y = df_sample.price
    X = df_sample.drop('price', axis=1)
    
    lr = LinearRegression()
    lr.fit(X, y)
    y_hat = lr.predict(X)
    low_var_rmse.append(np.sqrt(mean_squared_error(y, y_hat)))
    r_2.append(lr.score(X,y))
    
    


In [37]:
print(f'low variance model: mean of RMSE {np.mean(low_var_rmse)}')
print(f'low variance model: variance of RMSE {np.var(low_var_rmse)}')

low variance model: mean of RMSE 213324.0264883605
low variance model: variance of RMSE 72481195.82350436


In [38]:
from sklearn.preprocessing import PolynomialFeatures


df = pd.read_csv('data/kc_housing.csv', index_col='id')
# Date is not in a usable format, so we drop it for now, along with the location columns.
df = df.drop(['date', 'zipcode', 'lat', 'long'], axis=1)

pf = PolynomialFeatures(2)

df_poly = pd.DataFrame(pf.fit_transform(df.drop('price', axis=1)))
df_poly.index = df.index
df_poly['price'] = df['price']

cols = list(df_poly)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('price')))

df_poly = df_poly.loc[:,cols]

df_poly.head(10)

id,price,0,1,2,3,4,5,6,7,8,...,126,127,128,129,130,131,132,133,134,135
7129300520,221900.0,1.0,3.0,1.0,1180.0,5650.0,1.0,0.0,0.0,3.0,...,3822025.0,0.0,2619700.0,11045750.0,0.0,0.0,0.0,1795600.0,7571000.0,31922500.0
6414100192,538000.0,1.0,3.0,2.25,2570.0,7242.0,2.0,0.0,0.0,3.0,...,3806401.0,3884441.0,3297190.0,14903689.0,3964081.0,3364790.0,15209249.0,2856100.0,12909910.0,58354320.0
5631500400,180000.0,1.0,2.0,1.0,770.0,10000.0,1.0,0.0,0.0,3.0,...,3736489.0,0.0,5257760.0,15583846.0,0.0,0.0,0.0,7398400.0,21928640.0,64995840.0
2487200875,604000.0,1.0,4.0,3.0,1960.0,5000.0,1.0,0.0,0.0,5.0,...,3861225.0,0.0,2672400.0,9825000.0,0.0,0.0,0.0,1849600.0,6800000.0,25000000.0
1954400510,510000.0,1.0,3.0,2.0,1680.0,8080.0,1.0,0.0,0.0,3.0,...,3948169.0,0.0,3576600.0,14908461.0,0.0,0.0,0.0,3240000.0,13505400.0,56295010.0
7237550310,1225000.0,1.0,4.0,4.5,5420.0,101930.0,1.0,0.0,0.0,3.0,...,4004001.0,0.0,9524760.0,203961930.0,0.0,0.0,0.0,22657600.0,485186800.0,10389720000.0
1321400060,257500.0,1.0,3.0,2.25,1715.0,6819.0,2.0,0.0,0.0,3.0,...,3980025.0,0.0,4464810.0,13603905.0,0.0,0.0,0.0,5008644.0,15260922.0,46498760.0
2008000270,291850.0,1.0,3.0,1.5,1060.0,9711.0,1.0,0.0,0.0,3.0,...,3853369.0,0.0,3238950.0,19062693.0,0.0,0.0,0.0,2722500.0,16023150.0,94303520.0
2414600126,229500.0,1.0,3.0,1.0,1780.0,7470.0,1.0,0.0,0.0,3.0,...,3841600.0,0.0,3488800.0,15901480.0,0.0,0.0,0.0,3168400.0,14441140.0,65820770.0
3793500160,323000.0,1.0,3.0,2.5,1890.0,6560.0,2.0,0.0,0.0,3.0,...,4012009.0,0.0,4787170.0,15162710.0,0.0,0.0,0.0,5712100.0,18092300.0,57304900.0


In [39]:
r_2 = []
high_var_rmse = []
for i in range(100):
    
    df_sample = df_poly.sample(1000, replace=True)
    y = df_sample.price
    X = df_sample.drop('price', axis=1)
    
    lr = LinearRegression()
    lr.fit(X, y)
    y_hat = lr.predict(X)
    high_var_rmse.append(np.sqrt(mean_squared_error(y, y_hat)))
    r_2.append(lr.score(X,y))
    

In [40]:
print(f'lo variance mean {np.mean(low_var_rmse)}')
print(f'Hi variance mean {np.mean(high_var_rmse)}')

print(f'lo variance variance {np.var(low_var_rmse)}')
print(f'Hi variance variance {np.var(high_var_rmse)}')

lo variance mean 213324.0264883605
Hi variance mean 157307.0655871797
lo variance variance 72481195.82350436
Hi variance variance 286694300.50911325


![which_model](img/which_model_is_better_2.png)

# 5. Train Test Split

It is hard to know if your model is too simple or complex by just using it on training data.

We can hold out part of our sample as a test set and use it to monitor our prediction error on data the model has not seen.

This allows us to evaluate whether our model has the right balance of bias/variance. 

<img src='img/testtrainsplit.png' width =550 />

* **training set**: a subset used to train a model.
* **test set**: a subset used to test the trained model.


### Should you ever train on your test set?  


![no](https://media.giphy.com/media/d10dMmzqCYqQ0/giphy.gif)


**Never train on test data.** If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. 

##### [Link](https://datascience.stackexchange.com/questions/38395/standardscaler-before-and-after-splitting-data) about data leakage and scalers
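
As a hedged illustration of the scaler issue, the sketch below fits a `StandardScaler` on the training rows only and reuses those statistics on the test rows. The feature matrix is synthetic and its shape is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100, scale=20, size=(1000, 3))   # hypothetical features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)         # reuse those statistics, no refit

print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

Fitting the scaler on all rows before splitting would let information about the test set leak into the training features.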

**How do we know if our model is overfitting or underfitting?**


If our model is not performing well on the training data, we are probably underfitting.


To know if our model is overfitting, we need to test it on unseen data and measure its performance there.

If the model performs much worse on the unseen data than on the training data, it is probably overfitting.
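
Here is a small sketch of that check, using an unconstrained decision tree as a deliberately flexible model. The data is synthetic, and the noise level and seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical noisy data
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# With no depth limit, the tree can memorize the training rows
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)

print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

A near-perfect training score paired with a much lower test score is the overfitting pattern to look for.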

<img src='https://developers.google.com/machine-learning/crash-course/images/WorkflowWithTestSet.svg' width=500/>

Let's go back to our KC housing data without the polynomial transformation.

In [359]:
df = pd.read_csv('data/kc_housing.csv', index_col='id')

# Date is not in a usable format, so we drop it for now, along with the location columns.
df = df.drop(['date', 'zipcode', 'lat', 'long'], axis=1)
df.head()

id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15
7129300520,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,1340,5650
6414100192,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,1690,7639
5631500400,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,2720,8062
2487200875,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,1360,5000
1954400510,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,1800,7503


Now, we create a train-test split using `train_test_split` from sklearn's `model_selection` module.

In [421]:
from sklearn.model_selection import train_test_split


y = df.price
X = df.drop('price', axis=1)

# Here is the convention for a traditional train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43, test_size=.25)

In [422]:
# Instantiate your linear regression object
lr = LinearRegression()

In [423]:
# fit the model on the training set
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [424]:
# Check the R^2 of the training data
lr.score(X_train, y_train)

0.6573692385436587

An R-squared of .65 means the model explains roughly 65% of the variance in price on the training data, which is a fairly high amount.

### Knowledge check
How would you describe the bias of the model based on the above training R^2?

In [425]:
# Your answer here

In [426]:
#__SOLUTION__
"A model with a .65 R^2 is approaching a low bias model."

'A model with a .65 R^2 is approaching a low bias model.'

Next, we test how well the model performs on the unseen test data. Remember, we do not fit the model again: it has already learned its parameters from the training set.


In [427]:
lr.score(X_test, y_test)

0.641985077406776

The difference between the train and test scores is small.

What does that indicate about variance?

In [428]:
#__SOLUTION__
'The model has low variance'

'The model has low variance'

# Now, let's try the same thing with our complex, polynomial model.

In [438]:
df = pd.read_csv('data/kc_housing.csv', index_col='id')
# Date is not in a usable format, so we drop it for now, along with the location columns.
df = df.drop(['date', 'zipcode', 'lat', 'long'], axis=1)
df.head()

id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15
7129300520,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,1340,5650
6414100192,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,1690,7639
5631500400,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,2720,8062
2487200875,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,1360,5000
1954400510,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,1800,7503


In [464]:
poly_2 = PolynomialFeatures(2)

df_poly = pd.DataFrame(
            poly_2.fit_transform(df.drop('price', axis=1))
                      )

X = df_poly
y = df.price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.25)

# Always fit on the training set
lr.fit(X_train, y_train)

lr.score(X_train, y_train)

0.7551982375870033

In [465]:
# A higher training R^2 than the simple model indicates lower bias

In [466]:
lr.score(X_test, y_test)

0.7100018682109852

In [468]:
# The larger gap between train and test R^2 indicates higher variance

# Kfolds 

For a more rigorous cross-validation, we turn to K-folds.

![kfolds](img/k_folds.png)

[image via sklearn](https://scikit-learn.org/stable/modules/cross_validation.html)

In this process, we split the dataset into train and test as usual, then repeatedly split the train set into smaller training and validation folds.

K-folds holds out one fold of the data, trains on the remaining folds, then calculates a score on the held-out fold. It repeats this process until each fold has served as the held-out set.

We tune our parameters on the training set using K-folds, then validate on the test data. This allows us to build our model and check whether it is overfit without touching the test data set, which keeps the final test score an honest estimate of performance on unseen data.

In [479]:
X = df.drop('price', axis=1)
y = df.price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.25)



In [480]:
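from sklearn.model_selection import KFold

kf = KFold(n_splits=5)

train_r2 = []
test_r2 = []
# Note: kf.split(X, y) folds the full data set, and the loop below reuses
# the names X_train/X_test, overwriting the earlier hold-out split.
for train_ind, test_ind in kf.split(X,y):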
    
    X_train, y_train = X.iloc[train_ind], y.iloc[train_ind]
    X_test, y_test = X.iloc[test_ind], y.iloc[test_ind]
    
    lr.fit(X_train, y_train)
    train_r2.append(lr.score(X_train, y_train))
    test_r2.append(lr.score(X_test, y_test))

In [482]:
# Mean train r_2
np.mean(train_r2)

0.6543164995590857

In [483]:
# Mean test r_2
np.mean(test_r2)

0.6468201186632571

In [484]:
# Test out our polynomial model
poly_2 = PolynomialFeatures(2)

df_poly = pd.DataFrame(
            poly_2.fit_transform(df.drop('price', axis=1))
                      )

X = df_poly
y = df.price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.25)

In [485]:
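from sklearn.model_selection import KFold

kf = KFold(n_splits=5)

train_r2 = []
test_r2 = []
# Note: kf.split(X, y) folds the full data set, and the loop below reuses
# the names X_train/X_test, overwriting the earlier hold-out split.
for train_ind, test_ind in kf.split(X,y):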
    
    X_train, y_train = X.iloc[train_ind], y.iloc[train_ind]
    X_test, y_test = X.iloc[test_ind], y.iloc[test_ind]
    
    lr.fit(X_train, y_train)
    train_r2.append(lr.score(X_train, y_train))
    test_r2.append(lr.score(X_test, y_test))

In [486]:
# Mean train r_2
np.mean(train_r2)

0.7530146190048036

In [487]:
# Mean test r_2
np.mean(test_r2)

0.7305072362988075

Once we have an acceptable model, we train it on the entire training set and score it on the test set to validate.
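
The full workflow (cross-validate on the training set, refit, then score once on the test set) might look like the sketch below. It uses `make_regression` data as a stand-in for the housing data and sklearn's `cross_validate` helper instead of a hand-written loop. The fold count, noise level, and seeds are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=1000, n_features=10, noise=25.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Cross-validate on the training portion only; the test set stays untouched
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    LinearRegression(), X_train, y_train,
    cv=kf, scoring="r2", return_train_score=True,
)
mean_train_r2 = cv_results["train_score"].mean()
mean_val_r2 = cv_results["test_score"].mean()

# Refit on the entire training set, then score once on the held-out test set
final_model = LinearRegression().fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)

print(mean_train_r2, mean_val_r2, test_r2)
```

Note that `cross_validate` labels the held-out fold scores `test_score`, even though here they come from validation folds inside the training set.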

