### Codio Activity 8.5: Comparing Complexity and Variance

**Expected Time: 60 Minutes**

**Total Points: 35**

In this activity, you will explore the effect of model complexity on the variance of a model's predictions.  Continuing with the automotive data, you will build models on a subset of 10 vehicles.  You will then evaluate each model's error on the entire dataset and investigate how the variance of that error changes with model complexity.

#### Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import plotly.express as px

In [3]:
auto = pd.read_csv('../data/auto.csv')

In [6]:
auto.head()

|  | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


### The Sample

Below, a sample of ten vehicles is drawn from the data.  These ten rows form the **training** data and are split into `X_train` and `y_train`.  Use this smaller dataset to build your models, then explore their performance on the entire dataset.

In [9]:
X = auto.loc[:,['horsepower']]
y = auto['mpg']
sample = auto.sample(10, random_state = 22)
X_train = sample.loc[:, ['horsepower']]
y_train = sample['mpg']

In [11]:
X_train

|  | horsepower |
|---|---|
| 280 | 88.0 |
| 57 | 80.0 |
| 46 | 100.0 |
| 223 | 110.0 |
| 303 | 90.0 |
| 73 | 140.0 |
| 98 | 100.0 |
| 250 | 105.0 |
| 254 | 100.0 |
| 337 | 110.0 |


In [13]:
y_train

280    22.3
57     25.0
46     19.0
223    17.5
303    28.4
73     13.0
98     18.0
250    19.2
254    20.5
337    23.5
Name: mpg, dtype: float64

In [15]:
X.shape

(392, 1)

[Back to top](#Index:) 

### Problem 1

#### Iterate on Models

**20 Points**

Complete the code below according to the following instructions:

- Assign the `horsepower` column of `auto` to the variable `X` below, keeping it as a one-column DataFrame.
- Assign the values in the `mpg` column of `auto` to the variable `y` below.

Use a `for` loop to iterate over the degrees one through ten. For each iteration `i`:

- Use `Pipeline` to create a pipeline object with two steps, each given as a tuple. The first tuple pairs the string identifier `quad_features` with an instance of `PolynomialFeatures` of degree `i` with `include_bias = False`. The second tuple pairs the string identifier `quad_model` with an instance of `LinearRegression`. Assign the pipeline object to the variable `pipe`.
- Use the `fit` function on `pipe` to train your model on `X_train` and `y_train`.
- Use the `predict` function to generate predictions for every row of `X`. Assign the result to `preds`.
- Store `preds` in `model_predictions` under the key for degree `i`.
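
To make the first pipeline step concrete, here is a minimal sketch of what `PolynomialFeatures` produces for a single column. The toy array `demo` is made up for illustration and is not part of the activity data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical one-column input, standing in for `horsepower`
demo = np.array([[2.0], [3.0]])

# degree=3 with include_bias=False yields the columns x, x^2, x^3
# (no leading column of ones)
expanded = PolynomialFeatures(degree=3, include_bias=False).fit_transform(demo)
print(expanded)
```

`LinearRegression` fits its own intercept by default, which is why the bias column can be left out of the features.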

In [51]:
### GRADED

### YOUR SOLUTION HERE
model_predictions = {f'degree_{i}': None for i in range(1, 11)}

print("Starting Dictionary of Predictions\n", model_predictions)

X = auto[['horsepower']]
y = auto['mpg']

#for 1, 2, 3, ..., 10
for i in range(1, 11):
    #create pipeline
    pipe = Pipeline([
        ('quad_features', PolynomialFeatures(degree=i, include_bias=False)),
        ('quad_model', LinearRegression())
    ])
    
    #fit pipeline on training data
    pipe.fit(X_train, y_train)
    
    #make predictions on all data
    preds = pipe.predict(X)
    
    #assign to model_predictions
    model_predictions[f'degree_{i}'] = preds

# Answer check
model_predictions['degree_1'][:10]

Starting Dictionary of Predictions
 {'degree_1': None, 'degree_2': None, 'degree_3': None, 'degree_4': None, 'degree_5': None, 'degree_6': None, 'degree_7': None, 'degree_8': None, 'degree_9': None, 'degree_10': None}


array([14.903953,  7.656239, 10.762402, 10.762402, 12.833177,
        0.822681, -3.733024, -2.697637, -4.768412,  2.479301])

[Back to top](#Index:) 

### Problem 2

#### DataFrame of Predictions

**5 Points**

Use the `model_predictions` dictionary to create a DataFrame of the 10 models' predictions.  Assign your solution to `pred_df` below as a DataFrame.

In [41]:
### GRADED

### YOUR SOLUTION HERE
pred_df = pd.DataFrame(model_predictions)

# Answer check
print(type(pred_df))
pred_df.head(10)

<class 'pandas.core.frame.DataFrame'>


|  | degree_1 | degree_2 | degree_3 | degree_4 | degree_5 | degree_6 | degree_7 | degree_8 | degree_9 | degree_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.903953 | 14.959892 | 15.704485 | 32.550328 | 97.807527 | 101.886398 | 103.934544 | 103.117944 | 98.288488 | 87.83473 |
| 1 | 7.656239 | 9.465786 | 0.931088 | -372.035448 | -3456.141665 | -4370.275872 | -5342.443862 | -6208.274949 | -6618.861218 | -5878.338979 |
| 2 | 10.762402 | 11.618435 | 9.428697 | -61.767623 | -516.945175 | -606.298592 | -688.570562 | -746.836711 | -752.164365 | -655.409764 |
| 3 | 10.762402 | 11.618435 | 9.428697 | -61.767623 | -516.945175 | -606.298592 | -688.570562 | -746.836711 | -752.164365 | -655.409764 |
| 4 | 12.833177 | 13.221841 | 13.121121 | 13.003201 | 12.998835 | 13.007348 | 12.999361 | 12.999488 | 12.999649 | 12.99976 |
| 5 | 0.822681 | 5.796354 | -36.418588 | -2829.974843 | -37292.470511 | -55083.737747 | -78497.67681 | -105646.83052 | -128082.313347 | -121002.393396 |
| 6 | -3.733024 | 4.164672 | -81.244311 | -6990.01697 | -110848.137856 | -179675.982201 | -280878.53241 | -413685.147737 | -544562.255622 | -540348.68566 |
| 7 | -2.697637 | 4.478285 | -69.372104 | -5796.729901 | -88388.42593 | -140373.95352 | -215012.775799 | -310430.483703 | -401212.190234 | -393585.878483 |
| 8 | -4.768412 | 3.884721 | -94.197888 | -8355.125483 | -137595.304912 | -227545.258945 | -362900.09423 | -545064.004823 | -730626.129138 | -733374.620936 |
| 9 | 2.479301 | 6.551268 | -24.47263 | -1910.516962 | -23324.325388 | -33238.424249 | -45711.6187 | -59441.518101 | -69876.544283 | -64916.073089 |
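
The large magnitudes in the higher-degree columns are consistent with flexible polynomials, fit on only a few points, behaving erratically away from those points. The sketch below illustrates the idea on synthetic data: the arrays `x_all` and `y_all`, the linear ground truth, and the noise level are all assumptions made for illustration, not the automotive data. Each model is fit on 10 points and scored with `mean_squared_error` on both the training points and the full set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic "full" dataset: a noisy line on [-1, 1] (assumed for illustration)
x_all = np.linspace(-1, 1, 200).reshape(-1, 1)
y_all = 1.0 + 2.0 * x_all.ravel() + rng.normal(scale=0.3, size=200)

# Train on 10 randomly chosen rows, mirroring the 10-vehicle sample
train_idx = rng.choice(200, size=10, replace=False)
x_small, y_small = x_all[train_idx], y_all[train_idx]

train_mse, full_mse = {}, {}
for degree in (1, 3, 5):
    pipe = Pipeline([
        ('poly_features', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear_model', LinearRegression()),
    ])
    pipe.fit(x_small, y_small)
    # Error on the points the model was fit to
    train_mse[degree] = mean_squared_error(y_small, pipe.predict(x_small))
    # Error on every row, including rows the model never saw
    full_mse[degree] = mean_squared_error(y_all, pipe.predict(x_all))

print("train MSE:", train_mse)
print("full MSE: ", full_mse)
```

Because each polynomial model contains the lower-degree ones, training error should not increase with degree (up to numerical precision). Error on the full set carries no such guarantee.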


[Back to top](#Index:) 

### Problem 3

#### DataFrame of Errors

**5 Points**

Now, determine the error for each model and create a DataFrame of these errors.  One way to do this is to use your prediction DataFrame's `.subtract` method to subtract `y` from each column, row by row.  Assign the DataFrame of errors as `error_df` below.

In [45]:
### GRADED

### YOUR SOLUTION HERE
error_df = pred_df.subtract(y, axis=0)

# Answer check
print(type(error_df))
error_df.head(10)

<class 'pandas.core.frame.DataFrame'>


|  | degree_1 | degree_2 | degree_3 | degree_4 | degree_5 | degree_6 | degree_7 | degree_8 | degree_9 | degree_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.096047 | -3.040108 | -2.295515 | 14.550328 | 79.807527 | 83.886398 | 85.934544 | 85.117944 | 80.288488 | 69.83473 |
| 1 | -7.343761 | -5.534214 | -14.068912 | -387.035448 | -3471.141665 | -4385.275872 | -5357.443862 | -6223.274949 | -6633.861218 | -5893.338979 |
| 2 | -7.237598 | -6.381565 | -8.571303 | -79.767623 | -534.945175 | -624.298592 | -706.570562 | -764.836711 | -770.164365 | -673.409764 |
| 3 | -5.237598 | -4.381565 | -6.571303 | -77.767623 | -532.945175 | -622.298592 | -704.570562 | -762.836711 | -768.164365 | -671.409764 |
| 4 | -4.166823 | -3.778159 | -3.878879 | -3.996799 | -4.001165 | -3.992652 | -4.000639 | -4.000512 | -4.000351 | -4.00024 |
| 5 | -14.177319 | -9.203646 | -51.418588 | -2844.974843 | -37307.470511 | -55098.737747 | -78512.67681 | -105661.83052 | -128097.313347 | -121017.393396 |
| 6 | -17.733024 | -9.835328 | -95.244311 | -7004.01697 | -110862.137856 | -179689.982201 | -280892.53241 | -413699.147737 | -544576.255622 | -540362.68566 |
| 7 | -16.697637 | -9.521715 | -83.372104 | -5810.729901 | -88402.42593 | -140387.95352 | -215026.775799 | -310444.483703 | -401226.190234 | -393599.878483 |
| 8 | -18.768412 | -10.115279 | -108.197888 | -8369.125483 | -137609.304912 | -227559.258945 | -362914.09423 | -545078.004823 | -730640.129138 | -733388.620936 |
| 9 | -12.520699 | -8.448732 | -39.47263 | -1925.516962 | -23339.325388 | -33253.424249 | -45726.6187 | -59456.518101 | -69891.544283 | -64931.073089 |
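
The `axis=0` argument matters when subtracting a Series from a DataFrame. A minimal sketch on made-up numbers (the names `toy_preds` and `toy_y` are hypothetical) shows how the row-wise alignment works and what happens without it:

```python
import pandas as pd

# Made-up predictions for two models on three rows
toy_preds = pd.DataFrame({'degree_1': [10.0, 12.0, 14.0],
                          'degree_2': [11.0, 12.5, 13.0]})
# Made-up targets sharing the same row index (0, 1, 2)
toy_y = pd.Series([9.0, 13.0, 14.0])

# axis=0 matches the Series to the rows, so toy_y is subtracted from every column
toy_errors = toy_preds.subtract(toy_y, axis=0)
print(toy_errors)

# Without axis=0 the Series index is matched against the column labels,
# which do not overlap here, so every entry of the result is NaN
misaligned = toy_preds.subtract(toy_y)
print(misaligned.isna().all().all())
```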


[Back to top](#Index:) 

### Problem 4

#### Mean and Variance of Model Errors

**5 Points**


Using the DataFrame of errors, examine the mean and variance of each model's error.  Which degree of model has the highest variance?  Assign your response as an integer to `highest_var_degree` below.

In [49]:
### GRADED

### YOUR SOLUTION HERE
# Calculate the mean of each column
means = pred_df.mean()
print(means)

# Calculate the variance of each column
variances = pred_df.var()
print(variances)

# degree_10 has the largest variance in the printed output (8.825161e+09)
highest_var_degree = 10


# Answer check
print(type(highest_var_degree))
print(highest_var_degree)

degree_1        20.190769
degree_2        21.002833
degree_3        18.274483
degree_4      -260.748039
degree_5     -3666.621907
degree_6     -5835.795650
degree_7     -8913.248709
degree_8    -12853.885820
degree_9    -16646.498288
degree_10   -16391.571881
dtype: float64
degree_1     6.353133e+01
degree_2     5.323108e+01
degree_3     4.074620e+02
degree_4     1.347811e+06
degree_5     3.401180e+08
degree_6     9.034403e+08
degree_7     2.245584e+09
degree_8     4.974680e+09
degree_9     8.818810e+09
degree_10    8.825161e+09
dtype: float64
<class 'int'>
10
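
The task statement asks for statistics of the errors. As a minimal sketch on a made-up error frame (`toy_errors` and its values are hypothetical), the column means, the column variances, and the degree with the largest variance could be obtained as follows:

```python
import pandas as pd

# Made-up errors for two models on three rows
toy_errors = pd.DataFrame({'degree_1': [-1.0, 0.0, 1.0],
                           'degree_2': [-10.0, 0.0, 10.0]})

# Mean error per model (column)
error_means = toy_errors.mean()

# Sample variance per model; pandas uses ddof=1 by default
error_vars = toy_errors.var()

# Label of the column with the largest variance, then its degree as an integer
top_column = error_vars.idxmax()
top_degree = int(top_column.split('_')[1])

print(error_means)
print(error_vars)
print(top_degree)
```

Reading the degree off with `idxmax` avoids hard-coding the answer, although a plain integer assignment is just as valid for the graded variable.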
