### Codio Activity 8.5: Comparing Complexity and Variance

**Expected Time: 60 Minutes**

**Total Points: 35**

In this activity, you will explore how model complexity affects the variance of a model's prediction errors. Continuing with the automotive data, you will fit models on a sample of 10 vehicles, evaluate their errors on the entire dataset, and investigate how the variance of those errors changes with model complexity.

#### Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import plotly.express as px

In [2]:
auto = pd.read_csv("data/auto.csv")

In [3]:
auto.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


### The Sample

Below, a sample of ten vehicles is drawn from the data to serve as the **training** data. The sample is then split into `X_train` and `y_train`. Use this smaller dataset to fit your models, and evaluate their performance on the entire dataset (`X` and `y`).

In [4]:
X = auto.loc[:, ["horsepower"]]
y = auto["mpg"]
sample = auto.sample(10, random_state=22)
X_train = sample.loc[:, ["horsepower"]]
y_train = sample["mpg"]

In [5]:
X_train

| | horsepower |
|---|---|
| 280 | 88.0 |
| 57 | 80.0 |
| 46 | 100.0 |
| 223 | 110.0 |
| 303 | 90.0 |
| 73 | 140.0 |
| 98 | 100.0 |
| 250 | 105.0 |
| 254 | 100.0 |
| 337 | 110.0 |


In [6]:
y_train

280    22.3
57     25.0
46     19.0
223    17.5
303    28.4
73     13.0
98     18.0
250    19.2
254    20.5
337    23.5
Name: mpg, dtype: float64

In [7]:
X.shape

(392, 1)

[Back to top](#Index:) 

### Problem 1

#### Iterate on Models

**20 Points**

In this problem, you will again build polynomial models of degree 1 through 10. Use a `Pipeline`, and be sure to set `include_bias=False` in your transformer. Fit each pipeline on the training data, then assign its predictions for the entire dataset (`X`) to the appropriate key in the `model_predictions` dictionary.

In [28]:
model_predictions = {f"degree_{i}": None for i in range(1, 11)}
for degree in range(1, 11):
    # Expand horsepower into polynomial terms, then fit ordinary least squares.
    pipeline = Pipeline(
        [
            ("transform", PolynomialFeatures(degree=degree, include_bias=False)),
            ("regression", LinearRegression()),
        ]
    ).fit(X_train, y_train)

    # Predict on the entire dataset; X is auto.loc[:, ["horsepower"]].
    model_predictions[f"degree_{degree}"] = pipeline.predict(X)

# Answer check
model_predictions["degree_1"][:10]

array([14.90395265,  7.65623939, 10.76240222, 10.76240222, 12.83317743,
        0.82268118, -3.7330243 , -2.69763669, -4.7684119 ,  2.47930135])
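
To see how error on the full dataset compares with error on the ten training rows, a sketch along the following lines may help. It uses synthetic data as a stand-in for `auto.csv`; the coefficients, noise level, and names such as `X_full`, `X_small`, and `mse_by_degree` are illustrative assumptions, so the numbers will not match the activity's outputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the horsepower/mpg data (an assumption, not auto.csv).
rng = np.random.default_rng(0)
X_full = rng.uniform(50, 200, size=(200, 1))
y_full = 40 - 0.15 * X_full[:, 0] + rng.normal(0, 3, size=200)

# Ten-row training sample, mirroring the setup in this activity.
train_idx = rng.choice(len(X_full), size=10, replace=False)
X_small, y_small = X_full[train_idx], y_full[train_idx]

mse_by_degree = {}
for degree in range(1, 11):
    pipe = Pipeline(
        [
            ("transform", PolynomialFeatures(degree=degree, include_bias=False)),
            ("regression", LinearRegression()),
        ]
    ).fit(X_small, y_small)

    # Error on the rows used for fitting versus error on every row.
    train_mse = mean_squared_error(y_small, pipe.predict(X_small))
    full_mse = mean_squared_error(y_full, pipe.predict(X_full))
    mse_by_degree[degree] = (train_mse, full_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:14.4f} | full MSE {full_mse:18.4f}")
```

Because each higher-degree model contains the lower-degree ones, the training MSE should generally not increase with degree, while the full-data MSE tends to grow once the polynomial has enough freedom to chase the ten training points.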

[Back to top](#Index:) 

### Problem 2

#### DataFrame of Predictions

**5 Points**

Use the `model_predictions` dictionary to create a DataFrame of the 10 models' predictions. Assign your solution to `pred_df` below.

In [9]:
### GRADED

### YOUR SOLUTION HERE
pred_df = pd.DataFrame(model_predictions)

# Answer check
print(type(pred_df))
pred_df.head()

<class 'pandas.core.frame.DataFrame'>


| | degree_1 | degree_2 | degree_3 | degree_4 | degree_5 | degree_6 | degree_7 | degree_8 | degree_9 | degree_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.903953 | 14.959892 | 15.704485 | 32.550328 | 97.807527 | 101.886397 | 103.934543 | 103.117944 | 98.288488 | 87.83473 |
| 1 | 7.656239 | 9.465786 | 0.931088 | -372.035448 | -3456.141665 | -4370.275875 | -5342.443862 | -6208.27495 | -6618.861218 | -5878.338979 |
| 2 | 10.762402 | 11.618435 | 9.428697 | -61.767623 | -516.945175 | -606.298593 | -688.570562 | -746.836712 | -752.164365 | -655.409764 |
| 3 | 10.762402 | 11.618435 | 9.428697 | -61.767623 | -516.945175 | -606.298593 | -688.570562 | -746.836712 | -752.164365 | -655.409764 |
| 4 | 12.833177 | 13.221841 | 13.121121 | 13.003201 | 12.998835 | 13.007347 | 12.999361 | 12.999488 | 12.999649 | 12.99976 |


[Back to top](#Index:) 

### Problem 3

#### DataFrame of Errors

**5 Points**

Now, compute the error (prediction minus actual `mpg`) for each model and create a DataFrame of these errors. One way to do this is to use your prediction DataFrame's `.subtract` method with `axis=0` to subtract `y` from each column. Assign the DataFrame of errors to `error_df` below.

In [19]:
### GRADED

### YOUR SOLUTION HERE
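# Subtract the actual mpg from every prediction column; axis=0 aligns on the row index.
# An alternative positional form: pred_df.sub(auto[["mpg"]].to_numpy())
error_df = pred_df.subtract(auto["mpg"], axis=0)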

# Answer check
print(type(error_df))
error_df.head()

<class 'pandas.core.frame.DataFrame'>


| | degree_1 | degree_2 | degree_3 | degree_4 | degree_5 | degree_6 | degree_7 | degree_8 | degree_9 | degree_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.096047 | -3.040108 | -2.295515 | 14.550328 | 79.807527 | 83.886397 | 85.934543 | 85.117944 | 80.288488 | 69.83473 |
| 1 | -7.343761 | -5.534214 | -14.068912 | -387.035448 | -3471.141665 | -4385.275875 | -5357.443862 | -6223.27495 | -6633.861218 | -5893.338979 |
| 2 | -7.237598 | -6.381565 | -8.571303 | -79.767623 | -534.945175 | -624.298593 | -706.570562 | -764.836712 | -770.164365 | -673.409764 |
| 3 | -5.237598 | -4.381565 | -6.571303 | -77.767623 | -532.945175 | -622.298593 | -704.570562 | -762.836712 | -768.164365 | -671.409764 |
| 4 | -4.166823 | -3.778159 | -3.878879 | -3.996799 | -4.001165 | -3.992653 | -4.000639 | -4.000512 | -4.000351 | -4.00024 |
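
The `axis=0` argument matters here. The toy example below is a minimal sketch with made-up values; the names `preds`, `actual`, `errors`, and `misaligned` are illustrative and not part of the activity. It shows what happens with and without `axis=0` when subtracting a Series from a DataFrame.

```python
import pandas as pd

# Toy predictions for two hypothetical models over three rows.
preds = pd.DataFrame(
    {"degree_1": [10.0, 12.0, 14.0], "degree_2": [11.0, 12.5, 13.0]}
)
actual = pd.Series([9.0, 13.0, 14.0], name="mpg")

# axis=0 aligns the Series with the row index, so each column has y subtracted.
errors = preds.subtract(actual, axis=0)

# The default (axis="columns") aligns the Series index with the column labels
# instead; none of them match here, so every entry becomes NaN.
misaligned = preds.subtract(actual)

print(errors)
print(misaligned.isna().all().all())
```

With the default alignment, the row labels `0, 1, 2` are matched against the column names, which is why an all-NaN result is a common symptom of a missing `axis=0`.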


[Back to top](#Index:) 

### Problem 4

#### Mean and Variance of Model Errors

**5 Points**


Using the DataFrame of errors, examine the mean and variance of each model's error. Which degree of model has the highest error variance? Assign your response as an integer to `highest_var_degree` below.

In [25]:
### GRADED

### YOUR SOLUTION HERE
# Columns run degree_1..degree_10, so the argmax position plus 1 is the degree.
highest_var_degree = int(error_df.var().argmax()) + 1

# Answer check
print(type(highest_var_degree))
print(highest_var_degree)
error_df.describe()

<class 'int'>
10


| | degree_1 | degree_2 | degree_3 | degree_4 | degree_5 | degree_6 | degree_7 | degree_8 | degree_9 | degree_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | -3.25515 | -2.443086 | -5.171435 | -284.193957 | -3690.067826 | -5859.241569 | -8936.694627 | -12877.331739 | -16669.944206 | -16415.017799 |
| std | 5.253192 | 4.615428 | 16.741573 | 1158.805396 | 18440.159765 | 30055.270273 | 47385.669183 | 70529.557094 | 93906.721832 | 93940.563487 |
| min | -21.8038 | -18.09849 | -124.283091 | -9923.586428 | -169225.80724 | -285400.025062 | -464193.399475 | -710760.602885 | -969927.050836 | -984930.984421 |
| 25% | -6.347131 | -5.170967 | -5.571303 | -70.409044 | -5.403283 | -5.315081 | -5.895624 | -8.191246 | -13.452847 | -24.162201 |
| 50% | -3.088512 | -2.485808 | -2.182942 | -6.443344 | 0.910943 | 0.698704 | 0.001561 | -1.248205 | -3.561982 | -4.50024 |
| 75% | 0.416447 | 0.566694 | 1.727216 | 0.351627 | 35.961772 | 18.698351 | 7.030018 | 3.429995 | 1.067587 | 0.646789 |
| max | 11.914449 | 12.695805 | 22.645621 | 19.763955 | 2180.582468 | 899.832824 | 339.728473 | 103.05424 | 85.288488 | 74.83473 |
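
Taking the argmax position and adding 1 relies on the columns being in degree order. A slightly more defensive variant, sketched below on a made-up error table, reads the degree from the column label instead; the names `errors`, `variances`, `worst_column`, and `worst_degree` are hypothetical.

```python
import pandas as pd

# Toy error table; the column names follow the "degree_<n>" pattern used above.
errors = pd.DataFrame(
    {
        "degree_1": [-1.0, 0.5, 1.0],
        "degree_2": [-0.5, 0.0, 0.5],
        "degree_3": [-20.0, 1.0, 30.0],
    }
)

variances = errors.var()            # sample variance (ddof=1) of each column
worst_column = variances.idxmax()   # label of the highest-variance column
worst_degree = int(worst_column.split("_")[1])  # "degree_3" -> 3

print(variances)
print(worst_column, worst_degree)
```

This gives the same answer as the positional approach whenever the columns are ordered by degree, and still works if they are reordered.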



#### Boxplots of Errors by Degree

The code below creates a boxplot of the errors for each degree. The plot should demonstrate an important idea: as model complexity grows, so does the variance of the errors on unseen data.

In [12]:
px.box(error_df)