# Variance In Linear Regression

### Introduction

In this lesson, we'll return to the topic of where error comes from in machine learning models.  Remember that we have error whenever our machine learning model makes a prediction different from the actual outcome.  And in regression problems, we have used SSE or MSE to quantify this error.  

### Reviewing sources of error

We have also identified the three sources of error.   

1. Irreducible error - this is from randomness in our future or holdout data.  We cannot predict this randomness, and thus we will always have an amount of irreducible error.

2. Variance - this is from randomness in our training data.  Our model fits to the training data, and as it fits to randomness in our training data, we train an incorrect model, which varies as the training data varies.

3. Bias - this is a systematic problem that persists over the long run.  It could be due to a problem in our training data (like using an unrepresentative dataset).  In linear regression, it often refers to ommitted variable bias, or underfitting, where our model is not flexible enough to discover the underlying pattern producing the data.

### Focusing on Variance

In this lesson, we'll focus on regularization, which is a technique for reducing variance in linear models.  Remember variance occurs when our models are too flexible and thus overfit to the randomness in the data.  One technique, for handling variance is simply to remove the less important features.

In linear models, the variance can also be due to multicollinearity.  Remember that multicollinearity occurs when we have highly correlated features.  When that occurs, the model will see a similar effect as both features change similarly.  And from model to model will attribute this effect differently.  

Let's see this with our Airbnb data.

### Loading the Data

Let's start by loading up our training data.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/regularization-in-regression/master/listings_train_df.csv"
df = pd.read_csv(url, index_col = 0)
df_subset = df[df['price'] < 320]

We'll remove listings who are outliers, with prices over 320 dollars. 

In [3]:
import numpy as np
X_train = df.drop('price', axis = 1)
y_train = np.log(df['price'])


Then we load up the test data.

In [4]:
url = "https://raw.githubusercontent.com/jigsawlabs-student/regularization-in-regression/master/listings_test_df.csv"
df_test = pd.read_csv(url, index_col = 0)
X_test =  df_test.drop('price', axis = 1)
y_test = np.log(df_test['price'])

### Training our models

Now that we've loaded our data, let's train multiple models and see how the coefficients of our parameters change from model to model.  The larger the change, the larger the difference in our models and the higher our variance.  Eventually, we'll try to reduce the variance in coefficients that change the most.

Now one issue we have is that larger coefficients will have larger changes from model to model.  And these larger coefficients will depend on the scale of the underlying data (remember that there is a difference between increasing a listing by one foot and by one meter).

So let's begin by scaling our data.  And then we'll fit ten different models on different samples of the data.

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
transformed_X = pd.DataFrame(scaler.fit_transform(X_train), 
                             columns = X_train.columns)

In [6]:
transformed_X[:2]

Unnamed: 0,id,host_id,host_is_superhost,host_has_profile_pic,host_identity_verified,latitude,longitude,is_location_exact,accommodates,guests_included,...,first_reviewWeek_is_na,first_reviewDay_is_na,first_reviewDayofweek_is_na,first_reviewDayofyear_is_na,last_reviewYear_is_na,last_reviewMonth_is_na,last_reviewWeek_is_na,last_reviewDay_is_na,last_reviewDayofweek_is_na,last_reviewDayofyear_is_na
0,0.008381,-0.165645,-0.391608,0.060282,-0.792188,1.62127,0.151553,0.587728,-0.420238,-0.395254,...,-0.458112,-0.458112,-0.458112,-0.458112,-0.457578,-0.457578,-0.457578,-0.457578,-0.457578,-0.457578
1,-0.194988,0.526644,-0.391608,0.060282,-0.792188,-0.250794,0.377464,0.587728,-0.420238,-0.395254,...,2.182873,2.182873,2.182873,2.182873,2.18542,2.18542,2.18542,2.18542,2.18542,2.18542


In [7]:
y_train = y_train.reset_index()[['price']]
y_train[:2]

Unnamed: 0,price
0,3.496508
1,3.218876


In [57]:
from sklearn.linear_model import LinearRegression
linear_models = []
samples = []
for i in range(10):
    X_sample = transformed_X.sample(15000, random_state = i)
    y_sample = y_train.loc[X_sample.index]
    model = LinearRegression().fit(X_sample, y_sample)
    linear_models.append(model)
    samples.append((X_sample, y_sample))

Now let's look at the coefficients in the model.

In [75]:
import numpy as np
stacked_coef = np.vstack([model.coef_ for model in linear_models])
coef_df = pd.DataFrame(stacked_coef, columns = X_sample.columns)
coef_df[:5]

Unnamed: 0,id,host_id,host_is_superhost,host_has_profile_pic,host_identity_verified,latitude,longitude,is_location_exact,accommodates,guests_included,...,first_reviewWeek_is_na,first_reviewDay_is_na,first_reviewDayofweek_is_na,first_reviewDayofyear_is_na,last_reviewYear_is_na,last_reviewMonth_is_na,last_reviewWeek_is_na,last_reviewDay_is_na,last_reviewDayofweek_is_na,last_reviewDayofyear_is_na
0,0.015274,0.019837,0.013542,0.001779,0.000676,0.010243,-0.02693,-0.00418,0.155194,0.048866,...,-329.758153,-329.758153,-329.758153,-329.758153,56561800000.0,56561800000.0,56561800000.0,-124382900000.0,-108279100000.0,62976540000.0
1,0.017324,0.014383,0.013798,9.6e-05,0.002851,0.015629,-0.020208,-0.00334,0.153367,0.044605,...,-376.873147,-376.873147,-376.873147,-376.873147,19295150000.0,19295150000.0,19295150000.0,-33570160000.0,-62226930000.0,37911650000.0
2,0.014882,0.015451,0.016353,-0.001455,0.003948,0.004409,-0.024528,-0.004426,0.154365,0.045218,...,-312.932209,-312.931768,-312.931768,-312.931768,-18378080000.0,-18378080000.0,-18378080000.0,7082893000.0,40787060000.0,7264285000.0
3,0.008907,0.021261,0.014576,-0.000664,0.003518,0.003209,-0.028299,-0.005408,0.153036,0.045907,...,-362.684512,-362.684512,-362.684512,-362.684512,1513171000.0,1513171000.0,1513171000.0,-2954910000.0,-808987700.0,-775614000.0
4,0.014148,0.01654,0.014186,0.000649,-0.000865,0.009958,-0.019772,-0.003379,0.149927,0.048519,...,-315.919063,-315.919063,-315.919063,-315.919063,-8050963000.0,-8050963000.0,-8050963000.0,20380130000.0,16249120000.0,-12476350000.0


While some of the coefficients stay fairly consistent.  Others appear to widely vary.  For example, take a look at how some of the `is_na` columns vary between model and model.  It's hard to know which to believe.

Moreover, remember that variance in our model is a sign of our model overfitting to the randomness in the data.  We can quantify this variance, by well looking at the variance.

In [59]:
coef_df.var()

id                            6.162351e-06
host_id                       8.447565e-06
host_is_superhost             2.103953e-06
host_has_profile_pic          1.162555e-06
host_identity_verified        3.969869e-06
                                  ...     
last_reviewMonth_is_na        1.181411e+21
last_reviewWeek_is_na         1.181411e+21
last_reviewDay_is_na          9.984990e+22
last_reviewDayofweek_is_na    1.322295e+22
last_reviewDayofyear_is_na    1.318559e+23
Length: 321, dtype: float64

That's a lot of variance.

Remember that variance occurs when we are overfitting to the randomness in the training data.  We can see by looking at how our models perform on the training data and test data.

In [60]:
train_scores = []

for model, sample in zip(linear_models, samples):
    X_sample_train, y_sample_train = sample
    score = model.score(X_sample_train, y_sample_train)
    train_scores.append(score)

In [61]:
train_scores[:4]

[0.5942207451864707, 0.5943320145546616, 0.592339544092867, 0.5953252833742946]

> So we see that from model to model, we have fairly high scores on the training data.

In [69]:
transformed_X_test = scaler.transform(X_test)

In [70]:
test_scores = []
for model in linear_models:
    score = model.score(transformed_X_test, y_test)
    test_scores.append(score)

In [71]:
test_scores[:4]

[-1.3704105317150126e+18,
 -1.2493879652347668e+18,
 -2.472987605610295e+18,
 -1.357788129732622e+16]

We can see that the high scores we saw on the training data do not generalize to our test data.

### Combatting Variance

In the lessons that follow, we'll see a potential solution to variance: regularizing our linear model.  We'll see exactly what this means and how it works in future lessons, but for now, let's just see it's effectiveness.  

We'll use a model called ridge regression.  Ridge regression takes a hyperparameter called alpha, which we'll initialize and pass through our model.

In [54]:
alphas = np.linspace(.01, 5, 100)
alphas[:3]

array([0.01      , 0.06040404, 0.11080808])

Then let's our model and see how it performs.

In [72]:
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas = alphas)
ridge.fit(transformed_X, y_train).score(transformed_X, y_train)

0.5916016838506353

So here we have a model that performs *slightly* worse on the training data.

In [74]:
ridge.score(transformed_X_test, y_test)

0.5835320993438434

However, it generalizes much better to our holdout data.  In the lessons that follow we'll see how this technique works.

### Summary

In this lesson we reviewed the sources of error and the problem of variance in our models.  Our models suffer from variance when they are overfit to the error in the data. Our models are more susceptible to this error when they become overly flexible, due to many features, and multicollinearity.  

We saw this variance by fitting multiple models on different subsets of the same training data.  We saw high degrees of variance in the coefficients of our parameters, and we saw models that did not generalize to our holdout data.  Finally, we saw how, somehow, ridge regression could allow us to train a model that did generalize.