
Cross Validation

Cross validation is very useful for determining optimal model parameters such as our regularization parameter alpha. It divides the training set into subsets, or folds (by default the sklearn package uses 3), and evaluates a candidate hyperparameter (in this case alpha, our regularization parameter) by its average performance on the held-out folds. For example, if we have 3 splits A, B and C, we can run 3 training and testing combinations and then average the test performance as an overall estimate of model performance for the given parameters. (The three combinations are: train on A+B, test on C; train on A+C, test on B; train on B+C, test on A.) Repeating this across various alpha values lets us determine an optimal regularization parameter. By default, sklearn will even generate candidate alphas for you, or you can explicitly specify the alpha values to check.
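To make the search concrete, here is a minimal sketch of the general idea, written out by hand with sklearn's cross_val_score: score a Lasso model at several candidate alphas and keep the best. It assumes a feature matrix X_train and target y_train like the ones built later in this notebook, and the alpha grid is purely illustrative. (LassoCV, used below, performs this kind of search for you, more efficiently, along its own path of alphas.)

from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

candidate_alphas = [0.1, 1, 10, 100, 1000]  # illustrative values, not tuned

cv_scores = {}
for alpha in candidate_alphas:
    model = Lasso(alpha=alpha)
    # 3-fold CV: fit on two folds, score (R^2 by default) on the held-out fold,
    # repeating so that each fold serves as the test set once
    fold_scores = cross_val_score(model, X_train, y_train, cv=3)
    cv_scores[alpha] = fold_scores.mean()

best_alpha = max(cv_scores, key=cv_scores.get)
print('Best alpha from the grid: {}'.format(best_alpha))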

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
df = pd.read_csv('Housing_Prices/train.csv')
print(len(df))
df.head()
1460
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

from sklearn.linear_model import LassoCV, RidgeCV
#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]

X = df[feats].drop('SalePrice', axis=1)

#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)

y = df.SalePrice

print('Number of X features: {}'.format(len(X.columns)))

#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y)
L1 = LassoCV()
print('Model Details:\n', L1)

L1.fit(X_train, y_train)

print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
count = 0
for num in L1.coef_:
    if num == 0:
        count += 1
print(count)
print('Number of coefficients set to zero: {}'.format(count))
Number of X features: 37
Model Details:
 LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
Optimal alpha: 198489.80980228688
First 5 coefficients:
 [-2.80735194 -0.         -0.          0.25507382  0.        ]
25
Number of coefficients set to zero: 25

Notes on Coefficients and Using Lasso for Feature Selection

The Lasso technique also has a very important and profound effect: feature selection. That is, many of your feature coefficients will be optimized to zero, effectively removing their impact on the model. This can be useful in practice when trying to reduce the number of features in the model. Note that which variables are set to zero can change if multicollinearity is present in the data. That is, if two features within the X space are highly correlated, then which one takes precedence in the model is somewhat arbitrary, and coefficient values can differ substantially between fits (for example, when the model is refit on a different train/test split).
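As a quick check of which features survived, one small sketch (assuming the fitted L1 model and the X frame defined above) pairs each coefficient with its column name and keeps the nonzero ones:

# Pair each coefficient with its column name and keep the nonzero ones
selected = [(col, coef) for col, coef in zip(X.columns, L1.coef_) if coef != 0]
print('Features retained by the Lasso:')
for col, coef in selected:
    print('{}: {:.4f}'.format(col, coef))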

With Normalization

#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]

X = df[feats].drop('SalePrice', axis=1)

#Impute null values
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)

y = df.SalePrice

print('Number of X features: {}'.format(len(X.columns)))

#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y)
L1 = LassoCV(normalize = True)
print('Model Details:\n', L1)
L1.fit(X_train, y_train)

print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
count = 0
for num in L1.coef_:
    if num == 0:
        count += 1
print(count)
print('Number of coefficients set to zero: {}'.format(count))
Number of X features: 37
Model Details:
 LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=True, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
Optimal alpha: 141.0984264427501
First 5 coefficients:
 [ -0.00000000e+00  -5.95275404e+01   0.00000000e+00   1.60217484e-01
   2.00649624e+04]
21
Number of coefficients set to zero: 21

Calculate the Mean Squared Error

Calculate the mean squared error on the test set for each of the two models above.

# Your code here
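One possible approach (a sketch, not the only solution) is to predict on the held-out test set and score the predictions with sklearn's mean_squared_error; here L1 stands for whichever fitted model you are evaluating.

from sklearn.metrics import mean_squared_error

# Score the fitted model on the held-out test set
test_preds = L1.predict(X_test)
print('Test MSE: {}'.format(mean_squared_error(y_test, test_preds)))

Note that train_test_split was called without a fixed random_state above, so the two models were fit and tested on different splits; passing a random_state would make the comparison more direct.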

Repeat this Process for the Ridge Regression Object

# Your code here
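A sketch of one way to start, assuming the X_train/X_test split and the mean_squared_error import from the previous step; the alpha grid is illustrative, and the normalize flag matches the older sklearn version used above (newer releases drop it in favor of scaling the features yourself).

# RidgeCV searches the supplied alpha grid via cross validation
R1 = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], normalize=True)
R1.fit(X_train, y_train)

print('Optimal alpha: {}'.format(R1.alpha_))
print('Test MSE: {}'.format(mean_squared_error(y_test, R1.predict(X_test))))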

Practice Preprocessing and Feature Engineering

Use some of our previous techniques, including normalization, feature engineering, and dummy variables, on the dataset. Then repeat fitting and tuning a model, and observe how performance compares to the models above.

# Your code here
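One possible starting point, sketched under the assumption that df, X, y, and the mean_squared_error import from above are still in scope: dummy-encode the categorical columns, standardize everything, and refit LassoCV on the expanded feature set.

from sklearn.preprocessing import StandardScaler

# Dummy-encode the categorical (object-dtype) columns
dummies = pd.get_dummies(df.select_dtypes(include='object'), drop_first=True)

# Combine the numeric features built earlier with the dummy columns, then scale
X_full = pd.concat([X, dummies], axis=1)
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_full), columns=X_full.columns)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)
L2 = LassoCV()
L2.fit(X_train, y_train)

print('Optimal alpha: {}'.format(L2.alpha_))
print('Test MSE: {}'.format(mean_squared_error(y_test, L2.predict(X_test))))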
