# Predicting Housing Prices in Ames, Iowa
## Part 3: Regularized Regression Models
### Lisa Hwang
#### 1/17/2020

Continuing from the previous notebook where I created a linear regression model from the Ames Housing Dataset, I will next try building regularized regression models using ridge and LASSO. 

### Importing libraries

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

### Importing the data

I'll import my cleaned train dataset from the previous notebook. This has the dummy variables from Condition 2 and Kitchen Quality:
 - Condition 2_Feedr: proximity to condition (adjacent to feeder street)
 - Condition 2_Norm: proximity to condition (normal)
 - Condition 2_PosA: proximity to condition (adjacent to positive off-site feature such as park, greenbelt)
 - Kitchen Qual_Ex: kitchen quality (excellent); note that this column is not present in the dataframe output below
 - Kitchen Qual_Fa: kitchen quality (fair)
 - Kitchen Qual_Gd: kitchen quality (good)
 - Kitchen Qual_TA: kitchen quality (typical/average)

In [2]:
train = pd.read_csv('Datasets/train_clean.csv')
test = pd.read_csv('Datasets/test.csv')

### Reviewing the data

In [3]:
# So I can see all of the columns in the dataframe
pd.set_option('display.max_columns', 200)

In [4]:
# Making sure the import worked
train.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice,Condition 2_Feedr,Condition 2_Norm,Condition 2_PosA,Kitchen Qual_Fa,Kitchen Qual_Gd,Kitchen Qual_TA
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,6,Typ,0,,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,,,,0,3,2010,WD,130500,0,1,0,0,1,0
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,,,,0,4,2009,WD,220000,0,1,0,0,1,0
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,5,Typ,0,,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,,,,0,1,2010,WD,109000,0,1,0,0,1,0
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,7,Typ,0,,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,,,,0,4,2010,WD,174000,0,1,0,0,0,1
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,6,Typ,0,,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,,,,0,3,2010,WD,138500,0,1,0,0,0,1


In [5]:
test.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,,,,0,7,2009,WD


Since there are a lot of columns with strings, I'll start by taking them out. 

In [6]:
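# Keep numeric columns only (public API instead of the private _get_numeric_data);
# [2:] skips the first two numeric columns, Id and PID
num_cols = train.select_dtypes(include='number').columns[2:]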
train_reg = train[num_cols]
train_reg

Unnamed: 0,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Yr Blt,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice,Condition 2_Feedr,Condition 2_Norm,Condition 2_PosA,Kitchen Qual_Fa,Kitchen Qual_Gd,Kitchen Qual_TA
0,60,,13517,6,8,1976,2005,289.0,533.0,0.0,192.0,725.0,725,754,0,1479,0.0,0.0,2,1,3,1,6,0,1976.0,2.0,475.0,0,44,0,0,0,0,0,3,2010,130500,0,1,0,0,1,0
1,60,43.0,11492,7,5,1996,1997,132.0,637.0,0.0,276.0,913.0,913,1209,0,2122,1.0,0.0,2,1,4,1,8,1,1997.0,2.0,559.0,0,74,0,0,0,0,0,4,2009,220000,0,1,0,0,1,0
2,20,68.0,7922,5,7,1953,2007,0.0,731.0,0.0,326.0,1057.0,1057,0,0,1057,1.0,0.0,1,0,3,1,5,0,1953.0,1.0,246.0,0,52,0,0,0,0,0,1,2010,109000,0,1,0,0,1,0
3,60,73.0,9802,5,5,2006,2007,0.0,0.0,0.0,384.0,384.0,744,700,0,1444,0.0,0.0,2,1,3,1,7,0,2007.0,2.0,400.0,100,0,0,0,0,0,0,4,2010,174000,0,1,0,0,0,1
4,50,82.0,14235,6,8,1900,1993,0.0,0.0,0.0,676.0,676.0,831,614,0,1445,0.0,0.0,2,0,3,1,6,0,1957.0,2.0,484.0,0,59,0,0,0,0,0,3,2010,138500,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2042,20,79.0,11449,8,5,2007,2007,0.0,1011.0,0.0,873.0,1884.0,1728,0,0,1728,1.0,0.0,2,0,3,1,7,1,2007.0,2.0,520.0,0,276,0,0,0,0,0,1,2008,298751,0,1,0,0,1,0
2043,30,,12342,4,5,1940,1950,0.0,262.0,0.0,599.0,861.0,861,0,0,861,0.0,0.0,1,0,1,1,4,0,1961.0,2.0,539.0,158,0,0,0,0,0,0,3,2009,82500,0,1,0,0,0,1
2044,50,57.0,7558,6,6,1928,1950,0.0,0.0,0.0,896.0,896.0,1172,741,0,1913,0.0,0.0,1,1,3,1,9,1,1929.0,2.0,342.0,0,0,0,0,0,0,0,3,2009,177000,0,1,0,0,0,1
2045,20,80.0,10400,4,5,1956,1956,0.0,155.0,750.0,295.0,1200.0,1200,0,0,1200,1.0,0.0,1,0,3,1,6,2,1956.0,1.0,294.0,0,189,140,0,0,0,0,11,2009,144000,0,1,0,0,0,1


Are there any nulls?

In [7]:
train_reg.isnull().sum().sort_values(ascending=False).head(10)

Lot Frontage       330
Garage Yr Blt      113
Mas Vnr Area        22
Bsmt Half Bath       1
Bsmt Full Bath       1
Kitchen Qual_TA      0
Total Bsmt SF        0
Full Bath            0
Gr Liv Area          0
Low Qual Fin SF      0
dtype: int64

In [8]:
train_reg.shape

(2047, 43)

There are nulls in the dataframe. The null counts above sum to 467, so dropping every row that contains a null costs at most 467 of the 2,047 rows, which is an acceptable loss. After dropping them, 1,594 rows remain (453 removed, since some rows have nulls in more than one column).
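The row arithmetic can be checked directly from the counts reported in this section. This is a small, dependency-free sketch; the variable names are illustrative and the numbers are copied from the notebook output rather than recomputed from the data.

```python
# Null counts per column, copied from the isnull().sum() output
null_counts = {
    "Lot Frontage": 330,
    "Garage Yr Blt": 113,
    "Mas Vnr Area": 22,
    "Bsmt Half Bath": 1,
    "Bsmt Full Bath": 1,
}
total_rows = 2047         # from train_reg.shape
rows_after_dropna = 1594  # row count reported after dropping nulls

# Upper bound on the loss: every null would sit in a different row
max_rows_lost = sum(null_counts.values())
# Actual loss: a row with nulls in several columns is only dropped once
rows_lost = total_rows - rows_after_dropna

print(max_rows_lost)  # 467
print(rows_lost)      # 453
```

The difference between the two numbers reflects rows that are missing values in more than one column.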

In [9]:
train_reg = train_reg.dropna()

In [10]:
train_reg.describe()

Unnamed: 0,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Yr Blt,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice,Condition 2_Feedr,Condition 2_Norm,Condition 2_PosA,Kitchen Qual_Fa,Kitchen Qual_Gd,Kitchen Qual_TA
count,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0,1594.0
mean,56.19197,69.301129,9675.897742,6.188206,5.5734,1972.429737,1984.77478,101.220828,432.098494,48.85069,584.091593,1065.040778,1160.847553,328.329987,4.193852,1493.371393,0.417817,0.063363,1.571518,0.363237,2.81995,1.031995,6.418444,0.581556,1978.109159,1.861982,497.363864,93.207654,46.638018,21.994981,2.526976,17.468632,2.321205,40.513802,6.200125,2007.782936,184498.786073,0.006274,0.986826,0.001882,0.018193,0.39335,0.505646
std,42.147957,22.390943,4764.658652,1.423844,1.064143,30.551991,21.283342,176.929468,442.868253,165.528734,449.987296,428.406855,374.938853,418.826289,45.836065,473.443863,0.514531,0.248789,0.541062,0.492687,0.790451,0.179572,1.500345,0.621794,25.974189,0.668376,191.763804,124.903172,64.385482,59.500885,25.30893,58.86874,36.77677,426.628191,2.739261,1.325518,82219.414699,0.078982,0.114057,0.043355,0.133692,0.488647,0.500125
min,20.0,21.0,1300.0,1.0,1.0,1879.0,1950.0,0.0,0.0,0.0,0.0,0.0,438.0,0.0,0.0,438.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,1895.0,1.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,12789.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,20.0,59.0,7440.25,5.0,5.0,1953.0,1965.0,0.0,0.0,0.0,231.25,793.0,876.0,0.0,0.0,1141.25,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1959.0,1.0,339.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,130000.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,50.0,69.0,9350.0,6.0,5.0,1975.0,1994.0,0.0,360.0,0.0,492.0,1001.0,1091.0,0.0,0.0,1449.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,482.5,0.0,28.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0,0.0,1.0,0.0,0.0,0.0,1.0
75%,70.0,80.0,11199.5,7.0,6.0,2003.0,2005.0,162.75,717.75,0.0,816.0,1336.0,1405.0,685.0,0.0,1719.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2003.0,2.0,584.75,168.0,69.75,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0,0.0,1.0,0.0,0.0,1.0,1.0
max,190.0,313.0,70761.0,10.0,9.0,2010.0,2010.0,1600.0,2188.0,1474.0,2336.0,3206.0,2726.0,1862.0,1064.0,3672.0,2.0,2.0,3.0,2.0,6.0,3.0,12.0,4.0,2010.0,5.0,1348.0,870.0,547.0,432.0,508.0,480.0,800.0,12500.0,12.0,2010.0,611657.0,1.0,1.0,1.0,1.0,1.0,1.0


### Setting up X and y

In [11]:
X = train_reg.drop(columns=['SalePrice'])  # Every column besides SalePrice; no fillna needed since nulls were dropped above
y = train_reg['SalePrice']

### Train test splitting

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
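No `test_size` is passed to `train_test_split`, so it falls back on its default of holding out 25% of the rows. A quick sketch of the resulting sizes, assuming the 1,594 rows left after dropping nulls and scikit-learn's behavior of rounding the test count up to the next integer:

```python
import math

n_rows = 1594             # rows remaining after dropna()
default_test_size = 0.25  # scikit-learn default when neither test_size nor train_size is given

n_test = math.ceil(default_test_size * n_rows)  # test count is rounded up
n_train = n_rows - n_test                        # the remainder goes to training

print(n_train, n_test)  # 1195 399
```

The training count agrees with the `X_train.shape` output shown later in the notebook.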

### Getting polynomial features

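Next, I'll generate degree-2 polynomial features: every squared term and every two-way interaction term built from our columns (the `PolynomialFeatures` default).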

In [13]:
pf = PolynomialFeatures(include_bias=False)

In [14]:
pf.fit(X_train)
X_train_pf = pf.transform(X_train)
X_test_pf = pf.transform(X_test)

### Standard scaling the values

In [15]:
ss = StandardScaler()
ss.fit(X_train_pf)
X_train_pfs = ss.transform(X_train_pf)
X_test_pfs = ss.transform(X_test_pf)

In [16]:
pf.get_feature_names(X.columns)[:5]  # Looking at some of the column names

['MS SubClass', 'Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond']

In [17]:
X_train.shape  # Original number of columns

(1195, 42)

In [18]:
X_train_pfs.shape  # New number of columns after polynomial features were added

(1195, 945)

We went from 42 columns to 945 columns!
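The count of 945 follows from the 42 predictors in `X_train`: 42 original terms plus one degree-2 term for every pair of columns (including a column paired with itself). The dependency-free sketch below rebuilds the feature names in the order `PolynomialFeatures` emits them for degree 2, as an illustration only; the `columns` list is typed out from the dataframe output above.

```python
from itertools import combinations_with_replacement

# The 42 predictor columns (train_reg without SalePrice), in dataframe order
columns = [
    "MS SubClass", "Lot Frontage", "Lot Area", "Overall Qual", "Overall Cond",
    "Year Built", "Year Remod/Add", "Mas Vnr Area", "BsmtFin SF 1",
    "BsmtFin SF 2", "Bsmt Unf SF", "Total Bsmt SF", "1st Flr SF", "2nd Flr SF",
    "Low Qual Fin SF", "Gr Liv Area", "Bsmt Full Bath", "Bsmt Half Bath",
    "Full Bath", "Half Bath", "Bedroom AbvGr", "Kitchen AbvGr", "TotRms AbvGrd",
    "Fireplaces", "Garage Yr Blt", "Garage Cars", "Garage Area", "Wood Deck SF",
    "Open Porch SF", "Enclosed Porch", "3Ssn Porch", "Screen Porch", "Pool Area",
    "Misc Val", "Mo Sold", "Yr Sold", "Condition 2_Feedr", "Condition 2_Norm",
    "Condition 2_PosA", "Kitchen Qual_Fa", "Kitchen Qual_Gd", "Kitchen Qual_TA",
]

# Degree-1 terms first, then every pair (i <= j) in index order
names = list(columns)
for a, b in combinations_with_replacement(columns, 2):
    names.append(f"{a}^2" if a == b else f"{a} {b}")

print(len(columns))  # 42
print(len(names))    # 945
print(names[204])    # Overall Cond^2
print(names[243])    # Year Built Year Remod/Add
```

The two index lookups match rows that appear in the coefficient tables later in the notebook, which is a useful sanity check on the ordering.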

### LASSO modeling

In [19]:
lcv = LassoCV(max_iter=1500, cv=3)

In [20]:
lcv.fit(X_train_pfs, y_train)   

LassoCV(alphas=None, copy_X=True, cv=3, eps=0.001, fit_intercept=True,
        max_iter=1500, n_alphas=100, n_jobs=None, normalize=False,
        positive=False, precompute='auto', random_state=None,
        selection='cyclic', tol=0.0001, verbose=False)

In [21]:
lcv.score(X_train_pfs, y_train)  # Score on train data

0.9422987783572577

In [22]:
lcv.score(X_test_pfs, y_test)  # Score on test data

0.9243622984910218

Our R-squared values for LASSO regression went down slightly from 0.94 on the train data to 0.92 on the test data. In general, this is a good model since the scores are high and similar, which suggests little overfitting. But what are the coefficients? Which coefficients correspond to which features, and what are their values?
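For reference, the value returned by `.score` on a regressor is the coefficient of determination. A minimal sketch of that calculation, using made-up prices rather than the Ames data (`r_squared` is an illustrative helper, not a scikit-learn function):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Toy sale prices (made up for illustration, not from the Ames data)
y_true = [100_000, 150_000, 200_000, 250_000]
y_pred = [110_000, 140_000, 205_000, 245_000]

print(round(r_squared(y_true, y_pred), 4))  # 0.98
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean price.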

In [23]:
# Creating a dataframe consisting of features and coefficients.
# Names come from X.columns (the 42 predictors): num_cols still contains
# SalePrice, which shifts the labels of the dummy columns by one position.
coef_df = pd.DataFrame({
    'feature': pf.get_feature_names(X.columns),
    'coef': lcv.coef_
})

In [24]:
# Showing coefficients not equal to 0
coef_df[coef_df['coef'] != 0].sort_values('coef')

Unnamed: 0,feature,coef
479,Total Bsmt SF Kitchen Qual_TA,-4715.313567
509,1st Flr SF Kitchen Qual_TA,-4157.609044
790,Garage Yr Blt Kitchen Qual_Gd,-4043.140543
667,Full Bath Kitchen Qual_Gd,-3659.354386
693,Bedroom AbvGr Kitchen AbvGr,-2765.294259
...,...,...
243,Year Built Year Remod/Add,6161.600930
188,Overall Qual Garage Area,6194.459004
170,Overall Qual BsmtFin SF 1,6780.320160
173,Overall Qual Total Bsmt SF,7682.779906


There are 99 variables in our model. The coefficients with the largest positive or negative values are all interaction terms, which can be hard to interpret. In the rows shown, Overall Qual is a multiplier in three of the largest positive terms, and a Kitchen Qual dummy appears in four of the largest negative ones.

In [25]:
# Showing coefficients equal to 0
coef_df[coef_df['coef'] == 0].sort_values('coef')  

Unnamed: 0,feature,coef
0,MS SubClass,-0.0
634,Bsmt Half Bath Screen Porch,-0.0
635,Bsmt Half Bath Pool Area,-0.0
636,Bsmt Half Bath Misc Val,0.0
637,Bsmt Half Bath Mo Sold,0.0
...,...,...
322,Mas Vnr Area Low Qual Fin SF,0.0
323,Mas Vnr Area Gr Liv Area,0.0
325,Mas Vnr Area Bsmt Half Bath,0.0
298,Year Remod/Add Garage Cars,0.0


A total of 846 variables were zeroed out and dropped from the model. With the exception of MS SubClass (type of dwelling in the sale), the rest from this sample are interaction terms that often seem to incorporate Bsmt Half Bath (basement half bathrooms) and Mas Vnr Area (masonry veneer area in square feet). 
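The exact zeros come from the L1 penalty. In the simplified case of an orthonormal design, the LASSO estimate is the least-squares coefficient passed through a soft-thresholding function, which snaps small coefficients to zero. The sketch below illustrates that mechanism only; `soft_threshold` and the sample values are made up, and this is not how `LassoCV` was run in this notebook.

```python
def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; anything within alpha of zero becomes 0."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

ols_coefs = [7500.0, -4700.0, 300.0, -120.0]  # made-up unpenalized coefficients
alpha = 500.0                                  # made-up penalty strength

lasso_like = [soft_threshold(z, alpha) for z in ols_coefs]
print(lasso_like)  # [7000.0, -4200.0, 0.0, 0.0]
```

Large coefficients survive with a reduced magnitude, while small ones are removed entirely, which is why LASSO doubles as a feature selector.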

### Ridge modeling

In [26]:
ridge_cv = RidgeCV(
    scoring="r2",
    cv=5
)

In [27]:
ridge_cv.fit(X_train_pfs, y_train)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=5, fit_intercept=True,
        gcv_mode=None, normalize=False, scoring='r2', store_cv_values=False)

In [28]:
ridge_cv.score(X_train_pfs, y_train)

0.9642715656112831

In [29]:
ridge_cv.score(X_test_pfs, y_test)

0.8863352998016714

Taking a look at the coefficients:

In [30]:
# Creating a dataframe with feature names and their coefficients.
# Names come from X.columns (the 42 predictors): num_cols still contains
# SalePrice, which shifts the labels of the dummy columns by one position.
coef_ridge = pd.DataFrame({
    'feature': pf.get_feature_names(X.columns),
    'coef': ridge_cv.coef_
})

In [31]:
coef_ridge[coef_ridge['coef'] != 0].sort_values('coef')

Unnamed: 0,feature,coef
667,Full Bath Kitchen Qual_Gd,-9276.675893
64,MS SubClass TotRms AbvGrd,-8345.021544
204,Overall Cond^2,-7951.133392
362,BsmtFin SF 1 Bedroom AbvGr,-7630.830719
866,Enclosed Porch Kitchen Qual_TA,-6638.169663
...,...,...
185,Overall Qual Fireplaces,7323.431959
333,Mas Vnr Area Garage Cars,8185.021265
174,Overall Qual 1st Flr SF,8815.124767
177,Overall Qual Gr Liv Area,8853.650866


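There are 912 nonzero coefficients in this model, far more than the 99 that LASSO kept, because ridge shrinks coefficients without forcing them to exactly zero. The coefficients with the largest positive or negative values are again all interaction terms, which can be hard to interpret. In the rows shown, Overall Qual is a multiplier in three of the largest positive terms, and a Kitchen Qual dummy appears in two of the largest negative ones.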
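Ridge retains so many terms because an L2 penalty rescales coefficients instead of thresholding them. Under the simplifying assumption of an orthonormal design, the ridge estimate is the least-squares coefficient divided by 1 + alpha, so it only reaches zero if the unpenalized coefficient was already zero. The helper name and values below are made up for illustration and do not reproduce the `RidgeCV` fit.

```python
def ridge_shrink(z, alpha):
    """Ridge estimate for an orthonormal design: proportional shrinkage."""
    return z / (1.0 + alpha)

ols_coefs = [7500.0, -4700.0, 300.0, -120.0]  # made-up unpenalized coefficients
alpha = 1.0                                    # made-up penalty strength

ridge_like = [ridge_shrink(z, alpha) for z in ols_coefs]
print(ridge_like)                             # [3750.0, -2350.0, 150.0, -60.0]
print(sum(1 for c in ridge_like if c != 0))   # 4
```

Every coefficient is halved here, yet none is dropped, in contrast to an L1 penalty that removes small terms outright.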

In [32]:
print(ridge_cv.score(X_train_pfs, y_train))
print(ridge_cv.score(X_test_pfs, y_test))

0.9642715656112831
0.8863352998016714


Our R-squared values for ridge regression went down from 0.96 on the train data to 0.89 on the test data. This gap suggests overfitting, which is reasonable given the large number of predictor variables in the model.

### Summary
The regularization techniques ridge and LASSO were applied to the Ames housing data after adding polynomial features and scaling. LASSO performed better, with more consistent R-squared values on train (0.9423) and test data (0.9244) than ridge (0.9643 on train, 0.8863 on test). A total of 99 of the 945 candidate variables were kept in the LASSO model.
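The comparison can be made concrete by computing the train-test gap for each model. This sketch simply reuses the scores reported in this notebook, rounded to four decimals; the dictionary layout is illustrative.

```python
# R-squared values reported in this notebook, rounded to four decimals
scores = {
    "LASSO": {"train": 0.9423, "test": 0.9244},
    "ridge": {"train": 0.9643, "test": 0.8863},
}

# A smaller gap means the model generalizes more consistently
gaps = {name: round(s["train"] - s["test"], 4) for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
# LASSO: train-test gap = 0.0179
# ridge: train-test gap = 0.0780
```

Ridge fits the training data more closely but loses roughly four times as much R-squared on unseen data.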