# Variance In Linear Regression

1. Show the problem
    * high degree of variance, 
    * Redefine variance
    * multicollinearity - remember sign flipping, etc.
2. Show the solution
    * Show how model fits worse on training set, but better on the holdout set
    
3. Then in lessons that follow, we'll explain why.

In [129]:
import pandas as pd
df = pd.read_csv('./listings_train_df.csv', index_col = 0)
df_subset = df[df['price'] < 320]

In [130]:
df_subset.shape

(17905, 322)

In [132]:
# df.shape

In [142]:
X_train = df.drop('price', axis = 1)
y_train = np.log(df['price'])
# from sklearn.model_selection import train_test_split


In [161]:
df_test = pd.read_csv('./listings_test_df.csv', index_col = 0)
X_test =  df_test.drop('price', axis = 1)
y_test = np.log(df_test['price'])

Now we expect larger coefficients to have larger amounts of variance, so let's also scale the coefficients, so that we can more easily compare variance between coefficients.

In [143]:
from sklearn.preprocessing import StandardScaler

In [144]:
scaler = StandardScaler()
transformed_X = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)

In [145]:
transformed_X.shape

(17952, 321)

In [151]:
from sklearn.linear_model import LinearRegression
linear_models = []
for i in range(10):
    X_sample = transformed_X.sample(17952, random_state = i)
    y_sample = y.loc[X_sample.index]
    model = LinearRegression().fit(X_sample, y_sample)
    linear_models.append(model)

Now let's look at the coefficients in the model.

In [181]:
import numpy as np
stacked_coef = np.stack([model.coef_ for model in linear_models])
coef_df = pd.DataFrame(stacked_coef, columns = X_sample.columns)
coef_df.var().iloc[-10::]

first_reviewWeek_is_na         4.017760e+04
first_reviewDay_is_na          4.017760e+04
first_reviewDayofweek_is_na    4.017760e+04
first_reviewDayofyear_is_na    4.017760e+04
last_reviewYear_is_na          8.575424e+20
last_reviewMonth_is_na         8.575424e+20
last_reviewWeek_is_na          8.575424e+20
last_reviewDay_is_na           8.952607e+21
last_reviewDayofweek_is_na     3.440071e+21
last_reviewDayofyear_is_na     3.948345e+21
dtype: float64

While some of the coefficients stay fairly consistent.  Others appear to widely vary.  For example, notice that the cancellation policy coefficients range between $-28$ ,and $-.5$.  It's hard to know which to believe.

Moreover, remember that variance in our model is a sign of our model overfitting to the randomness in the data.  Above we randomly sample different subsets of the data, and get wildly different results with certain features.

We can quantify this variance, by well looking at the variance.

That's a lot of variance.

> Note that high amounts of variance is often due to multicollinearity in among features.  A high coefficient in one attribute could lead to a high value in a different feature to offset the effect.

In [154]:
train_scores = []
for model in linear_models:
    score = model.score(transformed_X, y_train)
    train_scores.append(score)

In [155]:
train_scores

[0.5919382427317739,
 0.5919466915768645,
 0.5919461417286767,
 0.5919466117937522,
 0.5919198330867317,
 0.591936567703342,
 0.5919431518680847,
 0.5919462451065304,
 0.5919466920109846,
 0.5919349439074786]

In [158]:
transformed_X_test = scaler.fit_transform(X_test)

In [163]:
test_scores = []
for model in linear_models:
    score = model.score(transformed_X_test, y_test)
    test_scores.append(score)

In [172]:
test_scores[:4]

[-1.6067398303872696e+19,
 -8.902554393370783e+17,
 -5.308387726759547e+19,
 -3.427887085243626e+19]

### Combatting Variance

In [171]:
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas = np.linspace(.1, 10, 100))
ridge.fit(transformed_X, y_train).score(transformed_X_test, y_test)

0.5839889248274908

In [173]:
ridge.score(transformed_X, y_train)

0.5916036599781592

In [182]:
ridge.alpha_

0.1

With machine learning one mechanism is to 

> Ok

While we can see that some of our coefficients, remain fairly consistent, the last two displayed coefficients have a large degress of variance.

Now in our models, we have a preference for a simpler model.

In [56]:
selected_cols = ['cancellation_policy_x0_flexible',
       'cancellation_policy_x0_moderate',
       'cancellation_policy_x0_strict_14_with_grace_period',
       'accommodates', 'room_type_x0_Entire home/apt', 'availability_90',
       'bedrooms', 'neighbourhood_group_cleansed_x0_Mitte',
       'guests_included', 'cleaning_fee',
       'calculated_host_listings_count', 'bathrooms',
       'neighbourhood_cleansed_x0_Moabit West',
       'neighbourhood_cleansed_x0_Osloer Straße',
       'neighbourhood_cleansed_x0_Parkviertel',
       'property_type_x0_Apartment',
       'neighbourhood_cleansed_x0_Wedding Zentrum',
       'neighbourhood_group_cleansed_x0_Friedrichshain-Kreuzberg',
       'last_reviewYear_is_na', 'host_listings_count_is_na',
       'host_sinceElapsed', 'host_sinceYear', 'last_reviewElapsed', 'price']

In [52]:
selected_df = df[selected_cols]

In [53]:
selected_df.to_csv('./bnb_selected_cols.csv')

In [54]:
selected_cols_df = pd.read_csv('./bnb_selected_cols.csv', index_col = 0)

In [55]:
selected_cols_df.columns

Index(['cancellation_policy_x0_flexible', 'cancellation_policy_x0_moderate',
       'cancellation_policy_x0_strict_14_with_grace_period', 'accommodates',
       'room_type_x0_Entire home/apt', 'availability_90', 'bedrooms',
       'neighbourhood_group_cleansed_x0_Mitte', 'guests_included',
       'cleaning_fee', 'calculated_host_listings_count', 'bathrooms',
       'neighbourhood_cleansed_x0_Moabit West',
       'neighbourhood_cleansed_x0_Osloer Straße',
       'neighbourhood_cleansed_x0_Parkviertel', 'property_type_x0_Apartment',
       'neighbourhood_cleansed_x0_Wedding Zentrum',
       'neighbourhood_group_cleansed_x0_Friedrichshain-Kreuzberg',
       'last_reviewYear_is_na', 'host_listings_count_is_na',
       'host_sinceElapsed', 'host_sinceYear', 'last_reviewElapsed', 'price'],
      dtype='object')

In [72]:
# ' guests_included, accommodates

In [58]:
X = selected_df.drop('price', axis = 1)
y = selected_df['price']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [70]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train)
X_test_transformed = scaler.transform(X_test)

In [68]:
scaled_X_train = pd.DataFrame(X_train_transformed, columns = X.columns)
scaled_X[:3]

Unnamed: 0,cancellation_policy_x0_flexible,cancellation_policy_x0_moderate,cancellation_policy_x0_strict_14_with_grace_period,accommodates,room_type_x0_Entire home/apt,availability_90,bedrooms,neighbourhood_group_cleansed_x0_Mitte,guests_included,cleaning_fee,...,neighbourhood_cleansed_x0_Osloer Straße,neighbourhood_cleansed_x0_Parkviertel,property_type_x0_Apartment,neighbourhood_cleansed_x0_Wedding Zentrum,neighbourhood_group_cleansed_x0_Friedrichshain-Kreuzberg,last_reviewYear_is_na,host_listings_count_is_na,host_sinceElapsed,host_sinceYear,last_reviewElapsed
0,-0.823514,-0.681493,1.619946,-1.089895,-0.942669,1.566217,-0.240469,-0.507671,-0.393754,-1.198509,...,-0.122794,-0.130071,0.340114,-0.151219,-0.567515,-0.45306,-0.037592,0.007717,-0.340476,0.458011
1,-0.823514,1.467366,-0.617304,-0.415739,1.060818,-0.700872,-1.834779,-0.507671,-0.393754,0.002345,...,-0.122794,-0.130071,0.340114,-0.151219,-0.567515,2.207212,-0.037592,0.253095,1.216721,-2.20716
2,-0.823514,1.467366,-0.617304,-0.415739,1.060818,0.078439,-0.240469,-0.507671,-0.393754,-0.524448,...,-0.122794,-0.130071,-2.940187,-0.151219,-0.567515,-0.45306,-0.037592,-0.275922,-1.897672,0.458161


In [69]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(scaled_X_train, y_train)

In [71]:
model.score(X_test_transformed, y_test)

0.5033415469476993

In [73]:
import numpy as np
df_coef = pd.DataFrame({'features': X.columns, 'coef': model.coef_, 'abs_coef': np.abs(model.coef_)})
df_coef.sort_values('abs_coef', ascending = False)

Unnamed: 0,features,coef,abs_coef
18,last_reviewYear_is_na,139.217008,139.217008
22,last_reviewElapsed,136.917935,136.917935
0,cancellation_policy_x0_flexible,-20.365775,20.365775
1,cancellation_policy_x0_moderate,-18.703297,18.703297
2,cancellation_policy_x0_strict_14_with_grace_pe...,-17.266226,17.266226
3,accommodates,10.077555,10.077555
4,room_type_x0_Entire home/apt,10.076962,10.076962
19,host_listings_count_is_na,-8.711553,8.711553
20,host_sinceElapsed,-8.476337,8.476337
5,availability_90,6.715953,6.715953


In [74]:
from eli5.sklearn import PermutationImportance
import eli5
import numpy as np

perm = PermutationImportance(model).fit(X_test, y_test)

exp_df = eli5.explain_weights_df(perm, feature_names = list(X_train.columns))



In [75]:
exp_df

Unnamed: 0,feature,weight,std
0,last_reviewYear_is_na,34240120000.0,237797800.0
1,cleaning_fee,1400609000.0,859310400.0
2,cancellation_policy_x0_moderate,810051200.0,67165010.0
3,cancellation_policy_x0_strict_14_with_grace_pe...,583879900.0,34055750.0
4,calculated_host_listings_count,124728100.0,102257700.0
5,host_sinceYear,112099300.0,17587520.0
6,neighbourhood_cleansed_x0_Moabit West,2103706.0,4367836.0
7,neighbourhood_group_cleansed_x0_Friedrichshain...,1966080.0,3380095.0
8,neighbourhood_cleansed_x0_Osloer Straße,956825.6,2574606.0
9,host_listings_count_is_na,353894.4,32105.95


In [76]:
from sklearn.linear_model import LassoCV

In [98]:
alphas=np.linspace(.01, 10, 100)
model = LassoCV(alphas = alphas, n_alphas = 100, max_iter= 100000).fit(X_train, y_train)


In [99]:
df_X = pd.DataFrame({'features': X_train.columns, 'coef':model.coef_, 'coef_abs': np.abs(model.coef_)})

In [109]:
df_X

Unnamed: 0,features,coef,coef_abs
0,cancellation_policy_x0_flexible,-27.10808,27.10808
1,cancellation_policy_x0_moderate,-25.57715,25.57715
2,cancellation_policy_x0_strict_14_with_grace_pe...,-24.00903,24.00903
3,accommodates,6.798253,6.798253
4,room_type_x0_Entire home/apt,20.18709,20.18709
5,availability_90,0.2471423,0.2471423
6,bedrooms,10.13042,10.13042
7,neighbourhood_group_cleansed_x0_Mitte,14.80026,14.80026
8,guests_included,6.232226,6.232226
9,cleaning_fee,0.216338,0.216338


In [139]:
from sklearn.linear_model import Ridge

In [140]:
ridge = Ridge(alpha = 5)

In [141]:
ridge.fit(X_train_transformed, y_train)

Ridge(alpha=5)

In [142]:
ridge.score(X_test_transformed, y_test)

0.5031503207706551

In [143]:
ridge_df = pd.DataFrame({'features': X_train.columns, 'coef': ridge.coef_, 'abs_coef': np.abs(ridge.coef_)})

In [144]:
ridge_df.sort_values('abs_coef', ascending = False)

Unnamed: 0,features,coef,abs_coef
0,cancellation_policy_x0_flexible,-18.455542,18.455542
1,cancellation_policy_x0_moderate,-16.798064,16.798064
2,cancellation_policy_x0_strict_14_with_grace_pe...,-15.421565,15.421565
3,accommodates,10.101354,10.101354
4,room_type_x0_Entire home/apt,10.05807,10.05807
18,last_reviewYear_is_na,8.292196,8.292196
5,availability_90,6.979854,6.979854
6,bedrooms,6.346088,6.346088
7,neighbourhood_group_cleansed_x0_Mitte,6.175305,6.175305
22,last_reviewElapsed,5.955547,5.955547


### Linear Model

In [134]:
from sklearn.linear_model import LinearRegression

linear = LinearRegression().fit(X_train_transformed, y_train)

In [135]:
linear.score(X_test_transformed, y_test)

0.5033415469476993

In [136]:
linear_coef = pd.DataFrame({'features': X_train.columns, 'coef':linear.coef_, 'coef_abs': np.abs(linear.coef_)})

In [137]:
linear_coef.sort_values('coef_abs', ascending = False)

Unnamed: 0,features,coef,coef_abs
18,last_reviewYear_is_na,139.217008,139.217008
22,last_reviewElapsed,136.917935,136.917935
0,cancellation_policy_x0_flexible,-20.365775,20.365775
1,cancellation_policy_x0_moderate,-18.703297,18.703297
2,cancellation_policy_x0_strict_14_with_grace_pe...,-17.266226,17.266226
3,accommodates,10.077555,10.077555
4,room_type_x0_Entire home/apt,10.076962,10.076962
19,host_listings_count_is_na,-8.711553,8.711553
20,host_sinceElapsed,-8.476337,8.476337
5,availability_90,6.715953,6.715953


### The goal

The idea is that we can reduce the complexity of the model without sacrificing too much in terms of the performance of the model.

So we can see that by reducing, if not removing altogether, the influence of the last few columns we were able to achieve a simpler model.

In [101]:
model.score(X_test, y_test)

0.5036268239153381

In [102]:
model.alpha_

0.01

In [105]:
sum(df_X.coef_abs) # lasso model

239.45315102777792

In [107]:
sum(df_coef.abs_coef)

423.8848505366087

So we can see that we cut in half the score coeficients of the model without drastically changing the score of the model.