I've been wanting to get some practice with PCA, clustering, and other dimensionality reduction techniques. I found a house price dataset on Kaggle that looks like a great dataset to practice these techniques with. While 81 variables is not technically too many to fit a regression on, it is a lot to take in, and I thought it would be interesting to finally explore feature reduction. Below is my step-by-step process: importing the data, cleaning and standardizing, reducing, and finally fitting a model for Kaggle submission. I iterate over steps a few times, try stuff that eventually doesn't work, and even run a KMeans that is really pointless in the project aside from letting me explore scikit-learn some.

In [1]:
import pandas as pd
import mca

from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix, r2_score, mean_squared_error

import matplotlib.pyplot as plt

First step: I'll import the data and inspect it a bit. It's already split into a train set and a test set, typical of Kaggle. Let's import all of this and then work primarily with the train set. I'll also grab the sample submission now, as I plan to upload to Kaggle and see if a regression using either clusters or principal components is effective.

In [2]:
data_dir = 'data/'
df_train = pd.read_csv(data_dir + 'train.csv')
df_test = pd.read_csv(data_dir + 'test.csv')
df_sample = pd.read_csv(data_dir + 'sample_submission.csv')

df_train.shape

(1460, 81)

In [3]:
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
df_train.dtypes.head()

Id               int64
MSSubClass       int64
MSZoning        object
LotFrontage    float64
LotArea          int64
dtype: object

I cut this to 5 rows, but the dataset is a mix of about half continuous and half categorical variables, 81 columns in total. A perfect candidate for dimensionality reduction! Exploring further:

In [5]:
df_train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


That's a lot to take in. Enough description of the dataframe. Time to check for missing values and correct them if necessary.

In [6]:
df_train.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive

A few of these are almost all null values. Let's completely drop any variables with over 1/3 of their values missing:

In [7]:
df_train = df_train.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)
df_train.shape

(1460, 76)
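
The five columns above were picked by eye from the null counts. A minimal sketch of how the same 1/3 rule could be applied programmatically, using a small made-up frame (the `toy` values are hypothetical, not taken from the Kaggle data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, 85.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, np.nan, np.nan],
})

# Fraction of missing values per column.
missing_frac = toy.isnull().mean()

# Columns where more than one third of the rows are missing.
too_sparse = missing_frac[missing_frac > 1 / 3].index.tolist()

toy_reduced = toy.drop(too_sparse, axis=1)
print(too_sparse)
print(toy_reduced.shape)
```

Computing the list this way keeps the threshold in one place, so the same rule can later be reused on the test set.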

In [8]:
# move the numeric columns into their own frame; df_train keeps the categorical ones
num_cols = df_train.select_dtypes(include=["int64", "float64"]).columns
df_train_cont = df_train[num_cols].copy()
df_train = df_train.drop(num_cols, axis=1)

df_train_cont.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,0,61,0,0,0,0,0,2,2008,208500
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,298,0,0,0,0,0,0,5,2007,181500
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,0,42,0,0,0,0,0,9,2008,223500
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,0,35,272,0,0,0,0,2,2006,140000
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,192,84,0,0,0,0,0,12,2008,250000


So we've split the continuous and the categorical variables in the dataset. Time to fill in the remaining missing values in the continuous set with their respective column means:

In [9]:
imp = Imputer(missing_values='NaN', strategy='mean')

df_train_cont_imp = imp.fit_transform(df_train_cont)
df_train_cont_imp = pd.DataFrame(df_train_cont_imp, columns = df_train_cont.columns)

df_train_cont_imp.isnull().sum().any()

False
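
A note for anyone re-running this: `sklearn.preprocessing.Imputer` was removed in newer scikit-learn releases (0.22 and later) in favor of `sklearn.impute.SimpleImputer`. A minimal sketch of the equivalent mean imputation, on a made-up two-column frame (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
})

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_imp = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imp)
```

Note that `SimpleImputer` expects `np.nan` rather than the string `'NaN'` for numeric data.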

Successfully filled in the null values. Let's split out the target (sale price) and perform some PCA!

In [10]:
y_train = df_train_cont_imp["SalePrice"]
del df_train_cont_imp["SalePrice"]

y_train.head()

0    208500.0
1    181500.0
2    223500.0
3    140000.0
4    250000.0
Name: SalePrice, dtype: float64

Before we perform any dimensionality reduction, let's review where we're at!

df_train:          categorical training variables

df_train_cont_imp:  (no missing) continuous training variables

y_train:            dependent variable for modeling

df_test:            test data for Kaggle;   NEEDS TO GO THROUGH PIPELINE



Objective from here:  

-standardize continuous values

-PCA on continuous

-K-modes clustering on categorical!?

-Combine principal components and clusters of categorical to get X_train

-fit models

In [11]:
scaler = StandardScaler()

# a single fit_transform call standardizes every column at once
X_train = scaler.fit_transform(df_train_cont_imp)
X_train = pd.DataFrame(X_train, columns = df_train_cont_imp.columns)

X_train.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,-1.730865,0.073375,-0.229372,-0.207142,0.651479,-0.5172,1.050994,0.878668,0.511418,0.575425,...,0.351,-0.752176,0.216503,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,0.138777
1,-1.728492,-0.872563,0.451936,-0.091886,-0.071836,2.179628,0.156734,-0.429577,-0.57441,1.171992,...,-0.060731,1.626195,-0.704483,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-0.48911,-0.614439
2,-1.72612,0.073375,-0.09311,0.07348,0.651479,-0.5172,0.984752,0.830215,0.32306,0.092907,...,0.631726,-0.752176,-0.070361,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,0.990891,0.138777
3,-1.723747,0.309859,-0.456474,-0.096897,0.651479,-0.5172,-1.863632,-0.720298,-0.57441,-0.499274,...,0.790804,-0.752176,-0.176048,4.092524,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,-1.367655
4,-1.721374,0.073375,0.633618,0.375148,1.374795,-0.5172,0.951632,0.733308,1.36457,0.463568,...,1.698485,0.780197,0.56376,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,2.100892,0.138777


In [12]:
del X_train['Id']

X_train.shape

(1460, 36)

Time to run PCA on this set. I'll first convert this to an array and then apply sklearn.decomposition.PCA to the X_train dataset.

In [13]:
X_train = X_train.values

pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)

explained_variance = pca.explained_variance_ratio_
print(explained_variance)

X_train_pca = X_train[:,:5]
X_train_pca = pd.DataFrame(X_train_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])

[ 0.19812105  0.08900553  0.0714888   0.05614703  0.04091307]


About 46% of the variance is explained by only 5 principal components. Not too bad... I think. Let's continue with this and try a k-means on the components just for fun!
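
The 46% figure is just the running total of the ratios printed above. Here is a small sketch that sums them, and then, on synthetic stand-in data (not the housing set), shows one way to ask how many components are needed to reach a target such as 80%. The 80% target and the `toy` names are my own assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Ratios printed by the PCA cell above.
reported = np.array([0.19812105, 0.08900553, 0.0714888, 0.05614703, 0.04091307])
reported_cumulative = np.cumsum(reported)
print(reported_cumulative.round(3))

# Synthetic stand-in data: 10 features driven by 3 hidden factors plus noise.
rng = np.random.RandomState(42)
latent = rng.normal(size=(200, 3))
X_toy = np.dot(latent, rng.normal(size=(3, 10))) + 0.5 * rng.normal(size=(200, 10))
X_toy_std = StandardScaler().fit_transform(X_toy)

# Fit with all components, then find the smallest count reaching 80%.
pca_full = PCA().fit(X_toy_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.80)) + 1
print(n_needed)
```

Fitting once with every component and inspecting the cumulative curve makes the choice of `n_components` less arbitrary than picking 5 up front.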

In [14]:
kmeans = KMeans(n_clusters = 3, random_state = 42)

X_train_clusters = kmeans.fit(X_train)
 
X_train_clusters.labels_
X_train_clusters.cluster_centers_

array([[ 1.8471566 ,  1.67522714, -0.68752213, -0.74411962, -0.14086944],
       [-2.20431745, -0.19357077,  0.38318628, -0.00859105, -0.08315146],
       [ 2.21848876, -1.55823088,  0.03890235,  0.87915433,  0.32763546]])

Alright, on to the categorical data. First to clean it of nulls and then dummy-encode it. As this is more a project to explore dimensionality reduction, I'll let the label encoder encode null values as just another value.

In [15]:
encoder = LabelEncoder()
hot_encoder = OneHotEncoder()

#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()

#iterate over categorical cols and transform into int values
for i in df_train:
    df_train_enc[i] = encoder.fit_transform(df_train[i])
    
#encoder into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())

df_train_dummies.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,744,745,746,747,748,749,750,751,752,753
0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


Okay, so we just did the exact opposite of dimensionality reduction. That's what dummy variables do. Let's try multiple correspondence analysis, the categorical counterpart of PCA!
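
To see why the column count balloons: one-hot encoding creates one column per distinct level of each variable. A small sketch with `pd.get_dummies` on a made-up frame (this is an alternative to the LabelEncoder/OneHotEncoder pair used above, not what this notebook ran; filling nulls with a `"Missing"` label is my own assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 2 categorical columns, one with nulls.
toy = pd.DataFrame({
    "LotShape": ["Reg", "Reg", "IR1", "IR1", "IR1"],
    "MasVnrType": ["BrkFace", np.nan, "BrkFace", np.nan, "BrkFace"],
})

# Give missing values one explicit label so they share a single dummy column.
toy_filled = toy.fillna("Missing")

# One new column per distinct level of each variable.
dummies = pd.get_dummies(toy_filled)

print(toy.shape, "->", dummies.shape)
print(list(dummies.columns))
```

Labeling nulls explicitly before encoding is a cautious choice, since how encoders treat raw NaN values can differ between library versions.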

In [16]:
df_train_mca = mca.MCA(df_train_dummies, ncols = 5)
print(df_train_mca.L) #eigenvalues

[ 0.00895017  0.00755986  0.00481571  0.0029876   0.00267983  0.00247353
  0.00244682  0.00078794]


Well, not too valuable really. It's not explaining much of the variance. Let's go back to the original categorical data and take a more traditional approach. First, to view value counts per column:

In [17]:
for i in df_train:
    print(i)
    print(df_train[i].value_counts())

MSZoning
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64
Street
Pave    1454
Grvl       6
Name: Street, dtype: int64
LotShape
Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64
LandContour
Lvl    1311
Bnk      63
HLS      50
Low      36
Name: LandContour, dtype: int64
Utilities
AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64
LotConfig
Inside     1052
Corner      263
CulDSac      94
FR2          47
FR3           4
Name: LotConfig, dtype: int64
LandSlope
Gtl    1382
Mod      65
Sev      13
Name: LandSlope, dtype: int64
Neighborhood
NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Gilbert     79
NridgHt     77
Sawyer      74
NWAmes      73
SawyerW     59
BrkSide     58
Crawfor     51
Mitchel     49
NoRidge     41
Timber      38
IDOTRR      37
ClearCr     28
SWISU       25
StoneBr     25
MeadowV     17
Blmngtn     17
BrDale      16
Veenker     11
NPkVill      9
Blueste  

Several of these categories are overwhelmingly concentrated in one value. As such, they will not provide much useful information for forecasting. I'm going to set a threshold of 75% within one value, i.e. if one value in a category holds >= 75% of the total count, that category gets removed from the dataset. This accounts for:

MSZoning, Street, LandContour, Utilities, LandSlope, Condition1, Condition2, BldgType, RoofStyle, RoofMatl, ExterCond, BsmtQual, BsmtFinType2, Heating, CentralAir, Electrical, Functional, GarageQual, GarageCond, PavedDrive, SaleType, SaleCondition

While several of these would be expected to be useful in predicting price (central air, paved driveway, sale condition), I'm still removing them at this threshold to try to reduce dimensionality. I'll eventually compare to a full-on regression with all variables and see how large the difference is.
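
The 75% rule can also be checked in code rather than by reading the value counts. A minimal sketch on a made-up frame (the `toy` data is hypothetical; the real list above was compiled by hand):

```python
import pandas as pd

# Hypothetical toy frame: Street is 90% one value, LotShape only 60%.
toy = pd.DataFrame({
    "Street": ["Pave"] * 9 + ["Grvl"],
    "LotShape": ["Reg"] * 6 + ["IR1"] * 4,
})

threshold = 0.75

# Share of the most common value in each column.
# value_counts skips nulls by default, so shares are among non-null rows.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns dominated by a single value.
dominated = top_share[top_share >= threshold].index.tolist()
print(dominated)
```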


Let's remove them:

In [18]:
df_train = df_train.drop(["MSZoning", "Street", "LandContour", "Utilities", "LandSlope", "Condition1", "Condition2", "BldgType", 
                          "RoofStyle", "RoofMatl", "ExterCond", "BsmtQual", "BsmtFinType2", "Heating", "CentralAir", "Electrical", 
                          "Functional", "GarageQual", "GarageCond", "PavedDrive", "SaleType", "SaleCondition"], axis=1)
df_train.head()

Unnamed: 0,LotShape,LotConfig,Neighborhood,HouseStyle,Exterior1st,Exterior2nd,MasVnrType,ExterQual,Foundation,BsmtCond,BsmtExposure,BsmtFinType1,HeatingQC,KitchenQual,GarageType,GarageFinish
0,Reg,Inside,CollgCr,2Story,VinylSd,VinylSd,BrkFace,Gd,PConc,TA,No,GLQ,Ex,Gd,Attchd,RFn
1,Reg,FR2,Veenker,1Story,MetalSd,MetalSd,,TA,CBlock,TA,Gd,ALQ,Ex,TA,Attchd,RFn
2,IR1,Inside,CollgCr,2Story,VinylSd,VinylSd,BrkFace,Gd,PConc,TA,Mn,GLQ,Ex,Gd,Attchd,RFn
3,IR1,Corner,Crawfor,2Story,Wd Sdng,Wd Shng,,TA,BrkTil,Gd,No,ALQ,Gd,Gd,Detchd,Unf
4,IR1,FR2,NoRidge,2Story,VinylSd,VinylSd,BrkFace,Gd,PConc,TA,Av,GLQ,Ex,Gd,Attchd,RFn


Let's encode these and get to model fitting. (This just repeats the encoding above, but with the smaller dataset.)

In [19]:
#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()

#iterate over categorical cols and transform into int values
for i in df_train:
    df_train_enc[i] = encoder.fit_transform(df_train[i])
    
#encoder into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())

df_train_dummies.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,391,392,393,394,395,396,397,398,399,400
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Alright, let's combine the two train sets (continuous and categorical). We'll split the Kaggle train set into a train and a test subset so we can test the accuracy as well. After that, we'll fit a regression and see the results:

In [20]:
# join the 5 principal components with the dummy columns, side by side
X_train = pd.concat([X_train_pca, df_train_dummies], axis=1)

X_train.shape

(1460, 406)

In [21]:
X_train_save = X_train #needed for later
y_train_save = y_train #needed for later

X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
print("okay, this is getting messy and difficult to track in jupyter.. lets model and gtfo")

okay, this is getting messy and difficult to track in jupyter.. lets model and gtfo


In [22]:
#linear
clf = LinearRegression().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# R^2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))


Mean squared error: 1063481822160301124330534731776.00
Variance score: -144861071958222913536.00


Well, something is off here. Let's print the predictions next to the true values:

In [23]:
print("True       Predicted")
for i in range(0, len(y_pred)):
    print("%.2f        %.2f" % (y_test.iloc[i], y_pred[i]))

True       Predicted
154500.00        150211.00
325000.00        301361.00
115000.00        108992.00
159000.00        173719.00
315500.00        312825.00
75500.00        138320126160922.00
311500.00        253260.00
146000.00        148172.00
84500.00        138320126158674.00
135500.00        131382.00
145000.00        129060.00
130000.00        112941.00
81000.00        124811.00
214000.00        234873.00
181000.00        184833.00
134500.00        117976.00
183500.00        193918.00
135000.00        123773.00
118400.00        137025.00
226000.00        208449.00
155000.00        160173.00
210000.00        199413.00
173500.00        177851.00
129000.00        120808.00
192000.00        203707.00
153900.00        162917.00
181134.00        202883.00
141000.00        120239.00
181000.00        181342.00
208900.00        191221.00
127000.00        125141.00
284000.00        288563.00
200500.00        138320126372920.00
135750.00        96995.00
255000.00        256505.00
140000.00  

So it appears one of the variables is drastically throwing off a few of the predictions. The majority, while not great, are not too far off from expected.

In [24]:
clf.coef_

array([  1.62516326e+04,   4.40655927e+03,   2.03253328e+03,
        -3.04495955e+03,  -7.80482467e+03,   9.28415933e+15,
         9.28415933e+15,   9.28415933e+15,   9.28415933e+15,
         7.63267190e+15,   7.63267190e+15,   7.63267190e+15,
         7.63267190e+15,   7.63267190e+15,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -5.77037165e+15,  -5.77037165e+15,  -5.77037165e+15,
        -5.77037165e+15,  -5.77037165e+15,  -5.77037165e+15,
        -5.77037165e+15,  -5.77037165e+15,  -1.77139543e+15,
        -7.02077691e+13,

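Looks like a large portion of the dummy variable coefficients are... severe. They come in blocks of identical values whose sizes (4, 5, 25, 8) match the number of levels in the first few categorical columns, which suggests the one-hot columns of each feature are perfectly collinear with the intercept, so the fit is not unique. Let's just cut them and see how the principal components do on their own.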
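
Blocks of identical, enormous coefficients like these are a common symptom of the so-called dummy variable trap: when every level of a category gets its own 0/1 column, those columns always sum to 1 and duplicate the intercept. A small sketch on a made-up column showing the rank deficiency, and how dropping one level per category removes it (the `toy` data and the `drop_first` remedy are my own illustration, not something this notebook ran):

```python
import numpy as np
import pandas as pd

# Hypothetical toy categorical column with 3 levels.
toy = pd.DataFrame({"LotShape": ["Reg", "IR1", "IR2", "Reg", "IR1", "Reg"]})

# Full one-hot encoding vs. encoding with the first level dropped.
full = pd.get_dummies(toy["LotShape"]).to_numpy(dtype=float)
reduced = pd.get_dummies(toy["LotShape"], drop_first=True).to_numpy(dtype=float)

# Add the intercept column that LinearRegression fits by default.
intercept = np.ones((len(toy), 1))
X_full = np.hstack([intercept, full])        # 4 columns
X_reduced = np.hstack([intercept, reduced])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# Rank below the column count means the least-squares solution is not unique.
print(rank_full, X_full.shape[1])
print(rank_reduced, X_reduced.shape[1])
```

A regularized model such as Ridge is another common way to keep such coefficients from blowing up.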

In [37]:
X_train = X_train_save[["pca1","pca2","pca3", "pca4", "pca5"]]
y_train = y_train_save
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)

In [38]:
#linear
clf = LinearRegression().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# R^2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))


Mean squared error: 1570704767.92
Variance score: 0.79


In [40]:
#Lasso
clf = LassoCV().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# R^2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))


Mean squared error: 1611351349.64
Variance score: 0.78


In [41]:
#Ridge
clf = Ridge().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# R^2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))


Mean squared error: 1570874057.65
Variance score: 0.79
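
The three fits above are all scored on a single train/test split, so small differences between them may be noise. `cross_validate` (imported at the top but never used) is one way to get a steadier comparison. A minimal sketch on synthetic data, since the point is only the mechanics (the `toy` arrays and the choice of 5 folds are assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the 5 principal components and the target.
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(150, 5))
y_toy = np.dot(X_toy, [3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=150)

models = {"linear": LinearRegression(), "lasso": LassoCV(cv=5), "ridge": Ridge()}

# Average R^2 over 5 folds for each model.
mean_r2 = {}
for name, model in models.items():
    scores = cross_validate(model, X_toy, y_toy, cv=5, scoring="r2")
    mean_r2[name] = scores["test_score"].mean()
    print(name, round(mean_r2[name], 3))
```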


In [42]:
print("True       Predicted")
for i in range(0, 10):
    print("%.2f        %.2f" % (y_test.iloc[i], y_pred[i]))

True       Predicted
154500.00        147888.31
325000.00        293508.12
115000.00        96145.85
159000.00        162123.22
315500.00        272257.67
75500.00        53688.24
311500.00        213377.88
146000.00        160354.30
84500.00        51587.39
135500.00        138603.64


So in effect, we've reduced an 81-variable dataset (>1000 if we count all possible dummy variables) to only 5 variables thanks to PCA. While 79% of variance explained isn't all that great for something like predicting final house sale prices, I'll take it for now (as I've run out of free time for the week). I'd like to get back into this and further explore ways to fine-tune these categorical variables and fit some non-linear regression models. Till then, let's upload this to Kaggle just to see how it goes!
 
To do this, I'll need to run the df_test data through the pipeline.
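
As an aside, scikit-learn's `Pipeline` can bundle these steps so that the test data passes through exactly the transformations fitted on the training data. A minimal sketch on synthetic data (the `toy` frames, the step names, and the use of `SimpleImputer` are assumptions; the cells that follow do the same thing by hand):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the continuous train and test frames.
rng = np.random.RandomState(0)
cols = ["f%d" % i for i in range(8)]
X_toy_train = pd.DataFrame(rng.normal(size=(100, 8)), columns=cols)
y_toy_train = X_toy_train["f0"] * 3.0 + rng.normal(scale=0.1, size=100)
X_toy_test = pd.DataFrame(rng.normal(size=(20, 8)), columns=cols)

# Knock out a few values so the imputer has something to do.
X_toy_train.iloc[::10, 1] = np.nan
X_toy_test.iloc[::5, 2] = np.nan

# impute -> standardize -> PCA -> regression, fitted on train only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("reg", LinearRegression()),
])
pipe.fit(X_toy_train, y_toy_train)

# The test frame reuses the train means, scales, and components.
preds = pipe.predict(X_toy_test)
print(preds.shape)
```

With this layout there is no risk of accidentally refitting a step on the test set.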

In [44]:
df_test = df_test.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)



Id                  0
MSSubClass          0
MSZoning            4
LotFrontage       227
LotArea             0
Street              0
Alley            1352
LotShape            0
LandContour         0
Utilities           2
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         1
Exterior2nd         1
MasVnrType         16
MasVnrArea         15
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
HalfBath            0
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         1
TotRmsAbvGrd        0
Functional          2
Fireplaces          0
FireplaceQu       730
GarageType         76
GarageYrBlt        78
GarageFinish       78
GarageCars          1
GarageArea          1
GarageQual         78
GarageCond

In [45]:
df_test_cont = pd.DataFrame()

for i in df_test:
    if df_test[i].dtype == "int64" or df_test[i].dtype == "float64":
        df_test_cont[i] = df_test[i]

df_train_cont.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,0,61,0,0,0,0,0,2,2008,208500
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,298,0,0,0,0,0,0,5,2007,181500
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,0,42,0,0,0,0,0,9,2008,223500
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,0,35,272,0,0,0,0,2,2006,140000
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,192,84,0,0,0,0,0,12,2008,250000


In [47]:
df_test_cont_imp = imp.fit_transform(df_test_cont)
df_test_cont_imp = pd.DataFrame(df_test_cont_imp, columns = df_test_cont.columns)

df_test_cont_imp.isnull().sum().any()

False

In [48]:
# reuse the scaler fitted on the training data
X_test = scaler.transform(df_test_cont_imp)
X_test = pd.DataFrame(X_test, columns = df_test_cont_imp.columns)

X_test.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,1.733238,-0.872563,0.451936,0.110763,-0.795151,0.381743,-0.340077,-1.15638,-0.57441,0.053428,...,1.202536,0.365179,-0.704483,-0.359325,-0.116339,1.882709,-0.068692,-0.087688,-0.11911,1.64521
1,1.73561,-0.872563,0.497357,0.37585,-0.071836,0.381743,-0.43944,-1.30174,0.023903,1.051363,...,-0.753188,2.3844,-0.16095,-0.359325,-0.116339,-0.270208,-0.068692,25.116309,-0.11911,1.64521
2,1.737983,0.073375,0.179413,0.332053,-0.795151,-0.5172,0.852269,0.6364,-0.57441,0.761852,...,0.042202,0.939819,-0.191147,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.229111,1.64521
3,1.740356,0.073375,0.361095,-0.054002,-0.071836,0.381743,0.88539,0.6364,-0.463612,0.347326,...,-0.013943,2.121024,-0.16095,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-0.11911,1.64521
4,1.742728,1.492282,-1.228623,-0.552407,1.374795,-0.5172,0.686666,0.345679,-0.57441,-0.39619,...,0.154492,-0.752176,0.533564,-0.359325,-0.116339,2.313293,-0.068692,-0.087688,-1.969111,1.64521


In [49]:
del X_test['Id']
X_test = X_test.values

X_test = pca.transform(X_test)

X_test_pca = X_test[:,:5]
X_test_pca = pd.DataFrame(X_test_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])
X_test_pca.head()

Unnamed: 0,pca1,pca2,pca3,pca4,pca5
0,-2.480251,-1.442848,0.787127,0.232835,-0.824734
1,-1.205705,0.018728,2.461575,-0.715957,-0.199358
2,0.719761,0.213787,-1.04333,-1.509953,-0.001801
3,1.176124,0.54612,-1.11803,-1.286063,-0.589375
4,0.137644,-1.025616,-2.117799,0.992937,0.336105


In [53]:
preds = clf.predict(X_test_pca)

df_sample.SalePrice = preds

# write to CSV
df_sample.to_csv('Linear_Reg_PCA.csv', index=False)

Kaggle results: as expected, not all that great. About the 75th percentile. The project was great practice though. I got some practice with PCA, KMeans clustering, and Multiple Correspondence Analysis, along with all the typical cleaning and scaling done throughout.

As this page eventually turned into a much longer and messier version than I had hoped for, any further exploration into this project will be done and uploaded on a new post!