# EDA and Cleaning - Ames Housing Data

This notebook contains all data cleaning and exploratory data analysis (EDA) performed on the Ames Housing data.

## Initial comments from data description review

- "There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). **I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations)** before assigning it to students."

- "... if the purpose is to once again create a common use model to estimate a “typical” sale, it is in the modeler’s best interest to remove any observations that do not seem typical **(such as foreclosures or family sales)**."

## Interesting features after reading data description:

- Lot Shape
- Land Contour
- Lot Config
- Neighborhood
- Year Built
- Year Remod/Add
- Exter Qual
- Exter Cond
- Overall Qual
- Overall Cond
- Gr Liv Area
- Bedroom AbvGr
- Kitchen Qual
- Garage Area
- Garage Qual
- Garage Cond
- Mo Sold
- Yr Sold
- Sale Type
- Sale Condition

In [518]:
# Import the usual suspects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, RidgeCV

In [519]:
# Import data
test = pd.read_csv('../data/test.csv')
train = pd.read_csv('../data/train.csv')
test.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,...,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,...,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,...,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,...,0,185,0,,,,0,7,2009,WD


In [520]:
# List interesting features from reading data description
interesting = [
    'Lot Shape',
    'Land Contour',
    'Lot Config',
    'Neighborhood',
    'Year Built',
    'Year Remod/Add',
    'Exter Qual',
    'Exter Cond',
    'Overall Qual',
    'Overall Cond',
    'Gr Liv Area',
    'Bedroom AbvGr',
    'Kitchen Qual',
    'Garage Area',
    'Garage Qual',
    'Garage Cond',
    'Mo Sold',
    'Yr Sold',
    'Sale Type',
    'SalePrice',
]

In [521]:
# Keep only interesting features
train = train[interesting]


In [522]:
train.head()

Unnamed: 0,Lot Shape,Land Contour,Lot Config,Neighborhood,Year Built,Year Remod/Add,Exter Qual,Exter Cond,Overall Qual,Overall Cond,Gr Liv Area,Bedroom AbvGr,Kitchen Qual,Garage Area,Garage Qual,Garage Cond,Mo Sold,Yr Sold,Sale Type,SalePrice
0,IR1,Lvl,CulDSac,Sawyer,1976,2005,Gd,TA,6,8,1479,3,Gd,475.0,TA,TA,3,2010,WD,130500
1,IR1,Lvl,CulDSac,SawyerW,1996,1997,Gd,TA,7,5,2122,4,Gd,559.0,TA,TA,4,2009,WD,220000
2,Reg,Lvl,Inside,NAmes,1953,2007,TA,Gd,5,7,1057,3,Gd,246.0,TA,TA,1,2010,WD,109000
3,Reg,Lvl,Inside,Timber,2006,2007,TA,TA,5,5,1444,3,TA,400.0,TA,TA,4,2010,WD,174000
4,IR1,Lvl,Inside,SawyerW,1900,1993,TA,TA,6,8,1445,3,TA,484.0,TA,TA,3,2010,WD,138500


In [523]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Lot Shape       2051 non-null   object 
 1   Land Contour    2051 non-null   object 
 2   Lot Config      2051 non-null   object 
 3   Neighborhood    2051 non-null   object 
 4   Year Built      2051 non-null   int64  
 5   Year Remod/Add  2051 non-null   int64  
 6   Exter Qual      2051 non-null   object 
 7   Exter Cond      2051 non-null   object 
 8   Overall Qual    2051 non-null   int64  
 9   Overall Cond    2051 non-null   int64  
 10  Gr Liv Area     2051 non-null   int64  
 11  Bedroom AbvGr   2051 non-null   int64  
 12  Kitchen Qual    2051 non-null   object 
 13  Garage Area     2050 non-null   float64
 14  Garage Qual     1937 non-null   object 
 15  Garage Cond     1937 non-null   object 
 16  Mo Sold         2051 non-null   int64  
 17  Yr Sold         2051 non-null   i

These columns need to be filled: 'Garage Area', 'Garage Qual', 'Garage Cond'
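A minimal sketch of one way to fill these columns, run on a tiny made-up frame rather than the real data. It assumes that a missing 'Garage Qual' or 'Garage Cond' means the house has no garage, and the 'NA' label and the zero fill are illustrative choices, not something this notebook has settled on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three garage columns (values are made up)
demo = pd.DataFrame({
    'Garage Area': [475.0, np.nan, 0.0],
    'Garage Qual': ['TA', 'Gd', np.nan],
    'Garage Cond': ['TA', 'TA', np.nan],
})

# Assumption: missing area -> 0 square feet, missing quality/condition -> 'NA'
filled = demo.fillna({'Garage Area': 0, 'Garage Qual': 'NA', 'Garage Cond': 'NA'})

# Total number of missing values left after filling
remaining_nulls = int(filled.isna().sum().sum())
print(remaining_nulls)
```

Passing a dict to `fillna` keeps the per-column choices in one place, which makes it easier to apply the same fill to both train and test.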

In [524]:
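# Correlate numeric columns only (recent pandas versions raise on object columns)
train.select_dtypes('number').corr()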

Unnamed: 0,Year Built,Year Remod/Add,Overall Qual,Overall Cond,Gr Liv Area,Bedroom AbvGr,Garage Area,Mo Sold,Yr Sold,SalePrice
Year Built,1.0,0.629116,0.602964,-0.370988,0.258838,-0.042149,0.487177,-0.007083,-0.003559,0.571849
Year Remod/Add,0.629116,1.0,0.584654,0.042614,0.322407,-0.019748,0.398999,0.011568,0.042744,0.55037
Overall Qual,0.602964,0.584654,1.0,-0.08277,0.566701,0.053373,0.563814,0.019242,-0.011578,0.800207
Overall Cond,-0.370988,0.042614,-0.08277,1.0,-0.109804,-0.009908,-0.137917,-0.003144,0.047664,-0.097019
Gr Liv Area,0.258838,0.322407,0.566701,-0.109804,1.0,0.507579,0.490949,0.049644,-0.015891,0.697038
Bedroom AbvGr,-0.042149,-0.019748,0.053373,-0.009908,0.507579,1.0,0.06994,0.068281,-0.011692,0.137067
Garage Area,0.487177,0.398999,0.563814,-0.137917,0.490949,0.06994,1.0,0.009964,-0.003589,0.65027
Mo Sold,-0.007083,0.011568,0.019242,-0.003144,0.049644,0.068281,0.009964,1.0,-0.147494,0.032735
Yr Sold,-0.003559,0.042744,-0.011578,0.047664,-0.015891,-0.011692,-0.003589,-0.147494,1.0,-0.015203
SalePrice,0.571849,0.55037,0.800207,-0.097019,0.697038,0.137067,0.65027,0.032735,-0.015203,1.0


High correlation features: 'Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add'
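As a rough sketch, the same shortlist can be read off programmatically by sorting each feature's correlation with the target. The frame below is made up for illustration; only the column names are borrowed from the data.

```python
import pandas as pd

# Toy data: one feature that tracks price closely and one that does not
demo = pd.DataFrame({
    'Overall Qual': [5, 6, 7, 8, 9],
    'Mo Sold': [3, 7, 1, 7, 3],
    'SalePrice': [110000, 140000, 180000, 230000, 300000],
})

# Correlation of every feature with the target, strongest first
corr_with_price = (
    demo.corr()['SalePrice']
        .drop('SalePrice')
        .sort_values(ascending=False)
)
top_feature = corr_with_price.index[0]
print(corr_with_price)
```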

### **Might be good to include a garage y/n column**
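One possible way to build such a column is sketched below on made-up values. The column name 'Has Garage' is hypothetical, and treating a missing area as "no garage" is an assumption.

```python
import numpy as np
import pandas as pd

# Toy garage areas, including a zero and a missing value
demo = pd.DataFrame({'Garage Area': [475.0, 0.0, np.nan, 246.0]})

# 1 if the house has any garage area, 0 otherwise (missing treated as no garage)
demo['Has Garage'] = (demo['Garage Area'].fillna(0) > 0).astype(int)
print(demo)
```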

## Dropping rows with 'Gr Liv Area' > 4000 per data description suggestion.

In [525]:
train = train[train['Gr Liv Area'] < 4000]
# test = test[test['Gr Liv Area'] < 4000]

In [526]:
# plt.figure(figsize=(10,10))
# sns.pairplot(train, corner=True)
# ;

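'Bedroom AbvGr', 'Mo Sold' and 'Yr Sold' seem to be evenly distributed across sale prices and correlate only weakly with 'SalePrice' (about 0.14, 0.03 and -0.02 in the correlation table above), so I will drop them.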

In [527]:
# Reassign instead of using inplace=True, since train is a filtered slice of the original frame
train = train.drop(columns=['Bedroom AbvGr', 'Mo Sold', 'Yr Sold'])
test = test.drop(columns=['Bedroom AbvGr', 'Mo Sold', 'Yr Sold'])

In [528]:
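# Fill missing garage areas with 0 (assign back rather than calling fillna in place on a column)
train['Garage Area'] = train['Garage Area'].fillna(0)
test['Garage Area'] = test['Garage Area'].fillna(0)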

## Linear Regression

In [529]:
X_test = test[['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']]
X = train[['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']]
y = train['SalePrice']

In [530]:
linreg = LinearRegression()
linreg.fit(X, y)

LinearRegression()

In [531]:
preds = linreg.predict(X_test)

In [532]:
linreg.score(X, y)

0.7940701022936076
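The score above is R² on the same rows the model was fit on, so it is likely optimistic. A minimal sketch of a cross-validated estimate follows, using synthetic data in place of the housing features (the data, coefficients and fold count are all illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, 2.0, 0.0, 1.0, -1.0]) + rng.normal(scale=0.5, size=200)

# R^2 on each of 5 held-out folds
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

# R^2 on the training rows themselves, for comparison
train_score = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

print(cv_scores.mean(), train_score)
```

On the real data, the same call would take `X` and `y` from the cell above; comparing the two numbers gives a rough sense of how much the training score overstates performance.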

In [533]:
test['SalePrice'] = preds

In [534]:
preds.shape

(878,)

In [535]:
test[['Id', 'SalePrice']].to_csv('../data/submission_linreg1.csv', index=False)

## Linear Regression with Standard Scaling

In [536]:
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_test_scaled = ss.transform(X_test)

In [537]:
linreg.fit(X_scaled, y)
preds_scaled = linreg.predict(X_test_scaled)

In [538]:
# Use the predictions from the model fit on scaled features
test['SalePrice'] = preds_scaled

In [539]:
test[['Id', 'SalePrice']].to_csv('../data/submission_linreg_scaled.csv', index=False)
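A side note on what to expect from this submission: for plain least squares with an intercept, standardizing the features rescales the coefficients but should not change the predictions, as long as no columns are perfectly collinear. The sketch below checks this on synthetic data (the feature scales and coefficients are made up).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, loosely like area / quality / garage
rng = np.random.default_rng(0)
loc, scale = [1500, 6, 400], [500, 1.5, 200]
X_demo = rng.normal(loc=loc, scale=scale, size=(100, 3))
y_demo = X_demo @ np.array([100.0, 20000.0, 50.0]) + rng.normal(scale=1000, size=100)
X_new = rng.normal(loc=loc, scale=scale, size=(10, 3))

# Fit on raw features
raw_preds = LinearRegression().fit(X_demo, y_demo).predict(X_new)

# Fit on standardized features, transforming new rows with the same scaler
scaler = StandardScaler().fit(X_demo)
scaled_model = LinearRegression().fit(scaler.transform(X_demo), y_demo)
scaled_preds = scaled_model.predict(scaler.transform(X_new))

# True when both sets of predictions agree up to floating-point error
same = np.allclose(raw_preds, scaled_preds)
print(same)
```

Scaling starts to matter once a penalty is added, as with the Lasso models later in the notebook, because the penalty treats all coefficients on the same footing.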

## Linear Regression, including binarized 'Neighborhood', and Standard Scaler

In [540]:
X_test = test[['Neighborhood', 'Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']]
X = train[['Neighborhood', 'Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']]
y = train['SalePrice']

In [541]:
# Use pandas .get_dummies() to binarize 'Neighborhood'. With help from https://stackoverflow.com/questions/32387266/converting-categorical-values-to-binary-using-pandas
# Drop first binary column
X_test = pd.get_dummies(X_test, drop_first=True)
X = pd.get_dummies(X, drop_first=True)

**I'm getting an error when standard scaling; apparently there are neighborhood columns missing in X or X_test. I will add these columns filled with 0 where necessary.**

In [542]:
X.columns

Index(['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built',
       'Year Remod/Add', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_Greens', 'Neighborhood_GrnHill', 'Neighborhood_IDOTRR',
       'Neighborhood_Landmrk', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
       'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes',
       'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown',
       'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW',
       'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber',
       'Neighborhood_Veenker'],
      dtype='object')

In [543]:
X_test.columns

Index(['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built',
       'Year Remod/Add', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_Greens', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV',
       'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill',
       'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
       'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer',
       'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr',
       'Neighborhood_Timber', 'Neighborhood_Veenker'],
      dtype='object')

In [544]:
X_test['Neighborhood_GrnHill'] = 0
X_test['Neighborhood_Landmrk'] = 0

In [545]:
# A little help here from https://stackoverflow.com/questions/11067027/re-ordering-columns-in-pandas-dataframe-based-on-column-name
# Keeps dummies in same place as X
X_test = X_test.reindex(sorted(X_test.columns), axis=1)
X = X.reindex(sorted(X.columns), axis=1)
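The two manual steps above (adding the missing dummy columns, then sorting) could also be done in one call by reindexing the test frame to the training columns. This is a sketch on toy neighborhoods, not a drop-in replacement that has been run against the real data.

```python
import pandas as pd

# Toy frames: the test set lacks two neighborhoods that appear in train
train_demo = pd.DataFrame({'Neighborhood': ['NAmes', 'GrnHill', 'Landmrk', 'Blmngtn']})
test_demo = pd.DataFrame({'Neighborhood': ['NAmes', 'Blmngtn', 'NAmes']})

X_tr = pd.get_dummies(train_demo, drop_first=True)
X_te = pd.get_dummies(test_demo, drop_first=True)

# Reindex test dummies to the training columns: missing columns are added as 0,
# extra columns are dropped, and the column order matches X_tr
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)

print(list(X_te.columns))
```

One caveat: `drop_first=True` drops whichever category sorts first within each frame, so this only lines up when train and test share the same first category (here 'Blmngtn').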


In [546]:
X.columns

Index(['Garage Area', 'Gr Liv Area', 'Neighborhood_Blueste',
       'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
       'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
       'Neighborhood_Gilbert', 'Neighborhood_Greens', 'Neighborhood_GrnHill',
       'Neighborhood_IDOTRR', 'Neighborhood_Landmrk', 'Neighborhood_MeadowV',
       'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill',
       'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
       'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer',
       'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr',
       'Neighborhood_Timber', 'Neighborhood_Veenker', 'Overall Qual',
       'Year Built', 'Year Remod/Add'],
      dtype='object')

In [547]:
X_test.columns

Index(['Garage Area', 'Gr Liv Area', 'Neighborhood_Blueste',
       'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
       'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
       'Neighborhood_Gilbert', 'Neighborhood_Greens', 'Neighborhood_GrnHill',
       'Neighborhood_IDOTRR', 'Neighborhood_Landmrk', 'Neighborhood_MeadowV',
       'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill',
       'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
       'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer',
       'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr',
       'Neighborhood_Timber', 'Neighborhood_Veenker', 'Overall Qual',
       'Year Built', 'Year Remod/Add'],
      dtype='object')

In [548]:
X.columns == X_test.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [549]:
X.shape

(2049, 32)

In [550]:
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_test_scaled = ss.transform(X_test)

In [551]:
linreg = LinearRegression()

linreg.fit(X_scaled, y)
preds_scaled = linreg.predict(X_test_scaled)

In [552]:
test['SalePrice'] = preds_scaled

In [553]:
test[['Id', 'SalePrice']].to_csv('../data/submission_linreg_scaled_neighbrhd.csv', index=False)

In [554]:
linreg.coef_

array([ 1.20206877e+04,  2.69861637e+04, -1.08123379e+03, -1.87182677e+03,
        3.45006715e+03,  4.24984468e+03,  1.90114592e+03,  7.12190079e+03,
        3.40125078e+03, -3.67078584e+02,  1.22132479e+02,  3.25718961e+03,
        2.15354755e+03, -6.16093181e+02, -2.15516167e+01,  2.31407440e+03,
        6.32190929e+03, -9.95315228e+02,  1.17837953e+03,  6.05905230e+03,
        1.41463450e+04,  1.47079796e+03,  8.52954702e+02,  3.94287425e+03,
       -3.68072367e+02,  1.81162990e+03,  9.82085388e+03,  4.20981769e+03,
        3.43041944e+03,  2.52201396e+04,  1.10150181e+04,  6.10335591e+03])

In [555]:
pd.DataFrame(linreg.coef_, X_test.columns).sort_values(0, ascending=False)

Unnamed: 0,0
Gr Liv Area,26986.163714
Overall Qual,25220.139553
Neighborhood_NridgHt,14146.344981
Garage Area,12020.687689
Year Built,11015.018075
Neighborhood_StoneBr,9820.853883
Neighborhood_Crawfor,7121.900785
Neighborhood_NAmes,6321.909294
Year Remod/Add,6103.35591
Neighborhood_NoRidge,6059.052296


## LinReg on Neighborhoods only

In [556]:
X_test = test['Neighborhood']
X = train['Neighborhood']
y = train['SalePrice']

In [557]:
# Use pandas .get_dummies() to binarize 'Neighborhood'. With help from https://stackoverflow.com/questions/32387266/converting-categorical-values-to-binary-using-pandas
# Drop first binary column
X_test = pd.get_dummies(X_test, drop_first=True)
X = pd.get_dummies(X, drop_first=True)

**I'm getting an error when standard scaling; apparently there are neighborhood columns missing in X or X_test. I will add these columns filled with 0 where necessary.**

In [558]:
X.columns

Index(['Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor',
       'Edwards', 'Gilbert', 'Greens', 'GrnHill', 'IDOTRR', 'Landmrk',
       'MeadowV', 'Mitchel', 'NAmes', 'NPkVill', 'NWAmes', 'NoRidge',
       'NridgHt', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW', 'Somerst',
       'StoneBr', 'Timber', 'Veenker'],
      dtype='object')

In [559]:
X_test.columns

Index(['Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor',
       'Edwards', 'Gilbert', 'Greens', 'IDOTRR', 'MeadowV', 'Mitchel', 'NAmes',
       'NPkVill', 'NWAmes', 'NoRidge', 'NridgHt', 'OldTown', 'SWISU', 'Sawyer',
       'SawyerW', 'Somerst', 'StoneBr', 'Timber', 'Veenker'],
      dtype='object')

In [560]:
X_test['GrnHill'] = 0
X_test['Landmrk'] = 0

In [561]:
# A little help here from https://stackoverflow.com/questions/11067027/re-ordering-columns-in-pandas-dataframe-based-on-column-name
# Keeps dummies in same place as X
X_test = X_test.reindex(sorted(X_test.columns), axis=1)
X = X.reindex(sorted(X.columns), axis=1)


In [562]:
X.columns

Index(['Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor',
       'Edwards', 'Gilbert', 'Greens', 'GrnHill', 'IDOTRR', 'Landmrk',
       'MeadowV', 'Mitchel', 'NAmes', 'NPkVill', 'NWAmes', 'NoRidge',
       'NridgHt', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW', 'Somerst',
       'StoneBr', 'Timber', 'Veenker'],
      dtype='object')

In [563]:
X_test.columns

Index(['Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor',
       'Edwards', 'Gilbert', 'Greens', 'GrnHill', 'IDOTRR', 'Landmrk',
       'MeadowV', 'Mitchel', 'NAmes', 'NPkVill', 'NWAmes', 'NoRidge',
       'NridgHt', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW', 'Somerst',
       'StoneBr', 'Timber', 'Veenker'],
      dtype='object')

In [564]:
X.columns == X_test.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True])

In [565]:
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_test_scaled = ss.transform(X_test)

In [566]:
linreg = LinearRegression()

linreg.fit(X_scaled, y)
preds_scaled = linreg.predict(X_test_scaled)

In [567]:
test['SalePrice'] = preds_scaled

In [568]:
test[['Id', 'SalePrice']].to_csv('../data/submission_linreg_scaled_neighbrhd_only.csv', index=False)

In [569]:
linreg.coef_

array([ -3017.8588369 ,  -9301.72115647, -13870.58452362,   1946.81587901,
          588.66084219,   1002.90671646, -17849.21689879,  -2585.86378544,
         -449.31081253,   2485.12676334, -17915.4538171 ,  -1400.66210985,
       -10779.16414808,  -6421.31477152, -19533.48465478,  -5437.13189796,
        -1169.215938  ,  17526.5873118 ,  28967.34787843, -20333.01877875,
        -8141.02270795, -14313.94162136,  -2552.20657137,   6524.60491394,
        17438.6753209 ,   6145.93947363,   4821.38009926])

In [570]:
pd.DataFrame(linreg.coef_, X_test.columns).sort_values(0, ascending=False)

Unnamed: 0,0
NridgHt,28967.347878
NoRidge,17526.587312
StoneBr,17438.675321
Somerst,6524.604914
Timber,6145.939474
Veenker,4821.380099
GrnHill,2485.126763
ClearCr,1946.815879
Crawfor,1002.906716
CollgCr,588.660842


## LinReg using only location features

First let's go back to the data description and redefine our 'interesting' features to include location proxies only.

MS Zoning, Lot Config, Neighborhood, Condition 1, Condition 2

In [571]:
# Import data
test = pd.read_csv('../data/test.csv')
train = pd.read_csv('../data/train.csv')
test.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'G

In [572]:
# List interesting features from reading data description
interesting = [
    'MS Zoning',
    'Lot Config',
    'Neighborhood',
    'Condition 1',
    'Condition 2',
    'SalePrice',
]

In [573]:
# Keep only interesting features
train = train[interesting]


In [574]:
train.dtypes

MS Zoning       object
Lot Config      object
Neighborhood    object
Condition 1     object
Condition 2     object
SalePrice        int64
dtype: object

In [575]:
X_test = test[interesting[:-1]]
X = train.drop(columns='SalePrice')
y = train['SalePrice']

In [576]:
# Use pandas .get_dummies() to binarize categorical columns. With help from https://stackoverflow.com/questions/32387266/converting-categorical-values-to-binary-using-pandas
# Drop first binary column
X_test = pd.get_dummies(X_test, drop_first=True)
X = pd.get_dummies(X, drop_first=True)

In [577]:
X.columns

Index(['MS Zoning_C (all)', 'MS Zoning_FV', 'MS Zoning_I (all)',
       'MS Zoning_RH', 'MS Zoning_RL', 'MS Zoning_RM', 'Lot Config_CulDSac',
       'Lot Config_FR2', 'Lot Config_FR3', 'Lot Config_Inside',
       'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide',
       'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor',
       'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_Greens',
       'Neighborhood_GrnHill', 'Neighborhood_IDOTRR', 'Neighborhood_Landmrk',
       'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
       'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
       'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU',
       'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst',
       'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker',
       'Condition 1_Feedr', 'Condition 1_Norm', 'Condition 1_PosA',
       'Condition 1_PosN'

In [578]:
X_test.columns

Index(['MS Zoning_FV', 'MS Zoning_I (all)', 'MS Zoning_RH', 'MS Zoning_RL',
       'MS Zoning_RM', 'Lot Config_CulDSac', 'Lot Config_FR2',
       'Lot Config_FR3', 'Lot Config_Inside', 'Neighborhood_Blueste',
       'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
       'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
       'Neighborhood_Gilbert', 'Neighborhood_Greens', 'Neighborhood_IDOTRR',
       'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
       'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
       'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU',
       'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst',
       'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker',
       'Condition 1_Feedr', 'Condition 1_Norm', 'Condition 1_PosA',
       'Condition 1_PosN', 'Condition 1_RRAe', 'Condition 1_RRAn',
       'Condition 1_RRNe', 'Condit

Missing columns in X_test: 

In [579]:
missing = [col for col in X.columns if col not in X_test.columns]
missing

['MS Zoning_C (all)',
 'Neighborhood_GrnHill',
 'Neighborhood_Landmrk',
 'Condition 2_Feedr',
 'Condition 2_PosN',
 'Condition 2_RRAe',
 'Condition 2_RRAn',
 'Condition 2_RRNn']

In [580]:
X_test[missing] = 0
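An alternative I have not used here is scikit-learn's `OneHotEncoder`, which learns the categories from the training data and can encode unseen test categories as all zeros via `handle_unknown='ignore'`. The sketch below uses made-up rows; unlike `get_dummies(drop_first=True)`, it keeps every category level.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames: 'FV' appears only in test, 'C (all)' and 'Feedr' only in train
train_demo = pd.DataFrame({
    'MS Zoning': ['RL', 'RM', 'C (all)'],
    'Condition 2': ['Norm', 'Feedr', 'Norm'],
})
test_demo = pd.DataFrame({
    'MS Zoning': ['RL', 'FV'],
    'Condition 2': ['Norm', 'Norm'],
})

# Categories are learned from train only; unknown test values become all-zero rows
enc = OneHotEncoder(handle_unknown='ignore')
X_tr = enc.fit_transform(train_demo).toarray()
X_te = enc.transform(test_demo).toarray()

print(X_tr.shape, X_te.shape)
```

Because the encoder is fit once on train, the test matrix always has the same columns in the same order, so no manual patching or sorting is needed.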

In [581]:
# A little help here from https://stackoverflow.com/questions/11067027/re-ordering-columns-in-pandas-dataframe-based-on-column-name
# Keeps dummies in same place as X
X_test = X_test.reindex(sorted(X_test.columns), axis=1)
X = X.reindex(sorted(X.columns), axis=1)


In [582]:
X.columns

Index(['Condition 1_Feedr', 'Condition 1_Norm', 'Condition 1_PosA',
       'Condition 1_PosN', 'Condition 1_RRAe', 'Condition 1_RRAn',
       'Condition 1_RRNe', 'Condition 1_RRNn', 'Condition 2_Feedr',
       'Condition 2_Norm', 'Condition 2_PosA', 'Condition 2_PosN',
       'Condition 2_RRAe', 'Condition 2_RRAn', 'Condition 2_RRNn',
       'Lot Config_CulDSac', 'Lot Config_FR2', 'Lot Config_FR3',
       'Lot Config_Inside', 'MS Zoning_C (all)', 'MS Zoning_FV',
       'MS Zoning_I (all)', 'MS Zoning_RH', 'MS Zoning_RL', 'MS Zoning_RM',
       'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide',
       'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor',
       'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_Greens',
       'Neighborhood_GrnHill', 'Neighborhood_IDOTRR', 'Neighborhood_Landmrk',
       'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
       'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood

In [583]:
X_test.columns

Index(['Condition 1_Feedr', 'Condition 1_Norm', 'Condition 1_PosA',
       'Condition 1_PosN', 'Condition 1_RRAe', 'Condition 1_RRAn',
       'Condition 1_RRNe', 'Condition 1_RRNn', 'Condition 2_Feedr',
       'Condition 2_Norm', 'Condition 2_PosA', 'Condition 2_PosN',
       'Condition 2_RRAe', 'Condition 2_RRAn', 'Condition 2_RRNn',
       'Lot Config_CulDSac', 'Lot Config_FR2', 'Lot Config_FR3',
       'Lot Config_Inside', 'MS Zoning_C (all)', 'MS Zoning_FV',
       'MS Zoning_I (all)', 'MS Zoning_RH', 'MS Zoning_RL', 'MS Zoning_RM',
       'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide',
       'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor',
       'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_Greens',
       'Neighborhood_GrnHill', 'Neighborhood_IDOTRR', 'Neighborhood_Landmrk',
       'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
       'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood

In [584]:
X.columns == X_test.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])
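Scanning a long array of `True` values by eye is error-prone. A small sketch of a single-boolean check using `Index.equals`, shown on toy frames:

```python
import pandas as pd

# Empty frames are enough to compare column labels
a = pd.DataFrame(columns=['Garage Area', 'Gr Liv Area', 'Overall Qual'])
b = pd.DataFrame(columns=['Garage Area', 'Gr Liv Area', 'Overall Qual'])
c = pd.DataFrame(columns=['Gr Liv Area', 'Garage Area', 'Overall Qual'])

same_ab = a.columns.equals(b.columns)  # same labels in the same order
same_ac = a.columns.equals(c.columns)  # same labels, different order

print(same_ab, same_ac)
```

On the real frames this would be `X.columns.equals(X_test.columns)`, which also avoids an error when the two frames have a different number of columns.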

In [585]:
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_test_scaled = ss.transform(X_test)

In [586]:
linreg = LinearRegression()

linreg.fit(X_scaled, y)
preds_scaled = linreg.predict(X_test_scaled)

In [587]:
test['SalePrice'] = preds_scaled

In [588]:
test[['Id', 'SalePrice']].to_csv('../data/submission_linreg_scaled_location.csv', index=False)

## Lasso Regression using interesting features and location features

In [589]:
# Import data
test = pd.read_csv('../data/test.csv')
train = pd.read_csv('../data/train.csv')
test.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'G

In [590]:
# List interesting features from reading data description
interesting = [
    'MS Zoning',
    'Lot Config',
    'Neighborhood',
    'Condition 1',
    'Condition 2',
    'Overall Qual',
    'Gr Liv Area',
    'Garage Area',
    'Year Built',
    'Year Remod/Add',
    'SalePrice',
]

In [591]:
# Keep only interesting features
train = train[interesting]


In [592]:
train.dtypes

MS Zoning          object
Lot Config         object
Neighborhood       object
Condition 1        object
Condition 2        object
Overall Qual        int64
Gr Liv Area         int64
Garage Area       float64
Year Built          int64
Year Remod/Add      int64
SalePrice           int64
dtype: object

In [593]:
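# Fill missing garage areas with 0 (assign back rather than calling fillna in place on a column)
train['Garage Area'] = train['Garage Area'].fillna(0)
test['Garage Area'] = test['Garage Area'].fillna(0)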

In [594]:
X_test = test[interesting[:-1]]
X = train.drop(columns='SalePrice')
y = train['SalePrice']

In [595]:
# Use pandas .get_dummies() to binarize categorical columns. With help from https://stackoverflow.com/questions/32387266/converting-categorical-values-to-binary-using-pandas
# Drop first binary column
X_test = pd.get_dummies(X_test, drop_first=True)
X = pd.get_dummies(X, drop_first=True)

In [596]:
X.columns

Index(['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built',
       'Year Remod/Add', 'MS Zoning_C (all)', 'MS Zoning_FV',
       'MS Zoning_I (all)', 'MS Zoning_RH', 'MS Zoning_RL', 'MS Zoning_RM',
       'Lot Config_CulDSac', 'Lot Config_FR2', 'Lot Config_FR3',
       'Lot Config_Inside', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_Greens', 'Neighborhood_GrnHill', 'Neighborhood_IDOTRR',
       'Neighborhood_Landmrk', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
       'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes',
       'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown',
       'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW',
       'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber',
       'Neighborhood_Veenker', '

In [597]:
X_test.columns

Index(['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built',
       'Year Remod/Add', 'MS Zoning_FV', 'MS Zoning_I (all)', 'MS Zoning_RH',
       'MS Zoning_RL', 'MS Zoning_RM', 'Lot Config_CulDSac', 'Lot Config_FR2',
       'Lot Config_FR3', 'Lot Config_Inside', 'Neighborhood_Blueste',
       'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
       'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
       'Neighborhood_Gilbert', 'Neighborhood_Greens', 'Neighborhood_IDOTRR',
       'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
       'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
       'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU',
       'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst',
       'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker',
       'Condition 1_Feedr', 'Condition 1_Norm', 'Condition 1_PosA',
       'Conditio

Missing columns in X_test: 

In [598]:
missing = [col for col in X.columns if col not in X_test.columns]
missing

['MS Zoning_C (all)',
 'Neighborhood_GrnHill',
 'Neighborhood_Landmrk',
 'Condition 2_Feedr',
 'Condition 2_PosN',
 'Condition 2_RRAe',
 'Condition 2_RRAn',
 'Condition 2_RRNn']

In [599]:
X_test[missing] = 0

In [600]:
X.shape

(2051, 57)

In [601]:
X_test.shape

(878, 57)

In [602]:
# A little help here from https://stackoverflow.com/questions/11067027/re-ordering-columns-in-pandas-dataframe-based-on-column-name
# Keeps dummies in same place as X
X_test = X_test.reindex(sorted(X_test.columns), axis=1)
X = X.reindex(sorted(X.columns), axis=1)


In [603]:
X.columns

Index(['Condition 1_Feedr', 'Condition 1_Norm', 'Condition 1_PosA',
       'Condition 1_PosN', 'Condition 1_RRAe', 'Condition 1_RRAn',
       'Condition 1_RRNe', 'Condition 1_RRNn', 'Condition 2_Feedr',
       'Condition 2_Norm', 'Condition 2_PosA', 'Condition 2_PosN',
       'Condition 2_RRAe', 'Condition 2_RRAn', 'Condition 2_RRNn',
       'Garage Area', 'Gr Liv Area', 'Lot Config_CulDSac', 'Lot Config_FR2',
       'Lot Config_FR3', 'Lot Config_Inside', 'MS Zoning_C (all)',
       'MS Zoning_FV', 'MS Zoning_I (all)', 'MS Zoning_RH', 'MS Zoning_RL',
       'MS Zoning_RM', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_Greens', 'Neighborhood_GrnHill', 'Neighborhood_IDOTRR',
       'Neighborhood_Landmrk', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
       'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neigh

In [604]:
X_test.columns

Index(['Condition 1_Feedr', 'Condition 1_Norm', 'Condition 1_PosA',
       'Condition 1_PosN', 'Condition 1_RRAe', 'Condition 1_RRAn',
       'Condition 1_RRNe', 'Condition 1_RRNn', 'Condition 2_Feedr',
       'Condition 2_Norm', 'Condition 2_PosA', 'Condition 2_PosN',
       'Condition 2_RRAe', 'Condition 2_RRAn', 'Condition 2_RRNn',
       'Garage Area', 'Gr Liv Area', 'Lot Config_CulDSac', 'Lot Config_FR2',
       'Lot Config_FR3', 'Lot Config_Inside', 'MS Zoning_C (all)',
       'MS Zoning_FV', 'MS Zoning_I (all)', 'MS Zoning_RH', 'MS Zoning_RL',
       'MS Zoning_RM', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_Greens', 'Neighborhood_GrnHill', 'Neighborhood_IDOTRR',
       'Neighborhood_Landmrk', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
       'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neigh

In [605]:
X.columns == X_test.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])
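
The element-wise comparison above returns one boolean per column, which gets hard to scan as the feature count grows. A minimal sketch of collapsing it into a single check, using small hypothetical frames (`X_demo`, `X_test_demo`) rather than the notebook's data:

```python
import pandas as pd

# Hypothetical stand-ins for X and X_test (not the Ames data)
X_demo = pd.DataFrame({'Garage Area': [500.0, 0.0], 'Gr Liv Area': [1500, 900]})
X_test_demo = pd.DataFrame({'Gr Liv Area': [1200], 'Garage Area': [300.0]})

# Sort columns alphabetically, the same way the notebook does
X_demo = X_demo.reindex(sorted(X_demo.columns), axis=1)
X_test_demo = X_test_demo.reindex(sorted(X_test_demo.columns), axis=1)

# Index.equals gives a single answer: same labels in the same order
same_columns = X_demo.columns.equals(X_test_demo.columns)
print(same_columns)
```

`Index.equals` also works when the two frames have a different number of columns, a case where `==` on the two indexes would raise instead of returning an array.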

In [606]:
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_test_scaled = ss.transform(X_test)

In [607]:
X_scaled.shape

(2051, 57)

In [608]:
X_test_scaled.shape

(878, 57)
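
The scaler is fit on the training features only, and the test features are transformed with the training means and standard deviations. A small sketch of what that does, using made-up arrays (`train_demo`, `test_demo`) as stand-ins for `X` and `X_test`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers standing in for X and X_test
train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_demo = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test_demo)        # reuses the train statistics

print(scaler.mean_)               # per-column training means
print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
print(test_scaled)                # test values expressed in training units
```

Because the test rows are scaled with the training statistics, the test columns will generally not have mean 0 and standard deviation 1 themselves, which is expected.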

In [609]:
lasso = Lasso(alpha=10, max_iter=10_000)

lasso.fit(X_scaled, y)
preds_scaled = lasso.predict(X_test_scaled)

In [610]:
test['SalePrice'] = preds_scaled

In [611]:
test[['Id', 'SalePrice']].to_csv('../data/submission_lasso_alpha_10.csv', index=False)

In [612]:
pd.DataFrame(lasso.coef_, X.columns).sort_values(by=0, ascending=False)

Unnamed: 0,0
Overall Qual,24723.935991
Gr Liv Area,24167.262596
Neighborhood_NridgHt,14589.375079
Garage Area,10570.823521
Year Built,9563.694417
Neighborhood_StoneBr,8932.054079
Neighborhood_NoRidge,6513.491755
Neighborhood_Crawfor,6084.388277
Year Remod/Add,5884.750787
Neighborhood_Somerst,3890.059378
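
The table above lists only the largest coefficients. Since Lasso can shrink coefficients all the way to zero, it can also be useful to count how many features were dropped. A sketch on synthetic data (the feature names and the `alpha` value here are illustrative assumptions, not the notebook's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(0)
cols = ['feat_a', 'feat_b', 'feat_noise']
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=cols)
y_demo = 5.0 * X_demo['feat_a'] - 2.0 * X_demo['feat_b'] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.5).fit(X_demo, y_demo)

# A named Series keeps feature labels attached to the coefficients
coefs = pd.Series(model.coef_, index=cols, name='coefficient').sort_values(ascending=False)
n_dropped = int((coefs == 0).sum())  # features Lasso removed entirely

print(coefs)
print(n_dropped)
```

With the notebook's objects, the equivalent would be `pd.Series(lasso.coef_, index=X.columns)`.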


## Lasso Regression with CV

In [613]:
lassocv = LassoCV(max_iter=10_000)

lassocv.fit(X_scaled, y)
preds_scaled = lassocv.predict(X_test_scaled)
lassocv.alpha_

238.72652365520813
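
`LassoCV` picks `alpha_` by comparing the cross-validated mean squared error across a grid of candidate alphas. A minimal sketch of inspecting that search on synthetic data from `make_regression` (the data and `cv=5` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression problem, not the Ames data
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)

model = LassoCV(cv=5, max_iter=10_000).fit(X_demo, y_demo)

# mse_path_ has one row per candidate alpha and one column per fold
mean_mse = model.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_mse))

print(model.alpha_)              # the alpha LassoCV selected
print(model.alphas_[best_idx])   # the alpha with the lowest mean CV error
```

Plotting `mean_mse` against `model.alphas_` is one way to see whether the chosen alpha sits in a flat region or at the edge of the grid.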

In [614]:
test['SalePrice'] = preds_scaled

In [615]:
test[['Id', 'SalePrice']].to_csv('../data/submission_lasso_cv.csv', index=False)

In [616]:
pd.DataFrame(lassocv.coef_, X.columns).sort_values(by=0, ascending=False)

Unnamed: 0,0
Overall Qual,24890.048671
Gr Liv Area,23977.887783
Neighborhood_NridgHt,13108.77144
Garage Area,10762.578771
Year Built,8380.51234
Neighborhood_StoneBr,8037.617353
Year Remod/Add,5640.722027
Neighborhood_NoRidge,5470.185794
Neighborhood_Crawfor,4516.972711
Lot Config_CulDSac,3118.770075


## Ridge Regression with CV

In [617]:
ridge = RidgeCV()

ridge.fit(X_scaled, y)
preds_scaled = ridge.predict(X_test_scaled)

In [618]:
test['SalePrice'] = preds_scaled

In [619]:
test[['Id', 'SalePrice']].to_csv('../data/submission_ridge_cv.csv', index=False)

In [620]:
pd.DataFrame(ridge.coef_, X.columns).sort_values(by=0, ascending=False)

Unnamed: 0,0
Overall Qual,24560.234634
Gr Liv Area,24085.035322
Neighborhood_NridgHt,14139.83951
Garage Area,10624.597018
Year Built,9435.305169
Neighborhood_StoneBr,8676.718318
Neighborhood_NoRidge,6244.345225
Year Remod/Add,5928.148429
Neighborhood_Crawfor,5690.714516
Neighborhood_Somerst,3437.627133
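
`RidgeCV()` with no arguments only compares the default alphas (0.1, 1.0, 10.0). A sketch of passing a wider, log-spaced grid instead, shown on synthetic data; the grid bounds are an arbitrary assumption:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem, not the Ames data
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 30 candidate alphas between 0.01 and 1000 (bounds chosen for illustration)
alphas = np.logspace(-2, 3, 30)
model = RidgeCV(alphas=alphas).fit(X_demo, y_demo)

print(model.alpha_)  # the alpha selected from the grid
```

If the selected alpha lands on the first or last value of the grid, that is usually a hint to extend the grid in that direction.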


## Lasso Regression using interesting features and location features

Since Lasso has been my best-scoring model so far, I'm going to run it again, this time using only the features from my linear regression (LinReg) model that also scored well, and choosing alpha with cross-validation.
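
One way to compare models locally before submitting is a cross-validated RMSE. This is a sketch under assumptions: it uses synthetic data and a scikit-learn `Pipeline`, neither of which appears in the cells of this notebook, so that scaling is refit inside each fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, not the Ames data
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)

# Scaling lives inside the pipeline, so each fold is scaled with its own training rows
pipe = make_pipeline(StandardScaler(), LassoCV(max_iter=10_000))

# scikit-learn reports negated RMSE so that larger is better; flip the sign back
neg_rmse = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error')
rmse = -neg_rmse

print(rmse.mean())
```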

In [621]:
# Import data
test = pd.read_csv('../data/test.csv')
train = pd.read_csv('../data/train.csv')

In [622]:
# List interesting features from reading data description
interesting = [
    'Neighborhood',
    'Overall Qual',
    'Gr Liv Area',
    'Garage Area',
    'Year Built',
    'Year Remod/Add',
    'SalePrice',
]

In [623]:
# Keep only interesting features (.copy() so later column assignments act on an independent frame)
train = train[interesting].copy()


In [624]:
train.dtypes

Neighborhood       object
Overall Qual        int64
Gr Liv Area         int64
Garage Area       float64
Year Built          int64
Year Remod/Add      int64
SalePrice           int64
dtype: object

In [625]:
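# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the parent frame
train['Garage Area'] = train['Garage Area'].fillna(0)
test['Garage Area'] = test['Garage Area'].fillna(0)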
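
Filling with 0 treats a missing `Garage Area` as "no garage", which is an assumption about the data. A small sketch of counting missing values before and after the fill, on a hypothetical frame named `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing garage area
demo = pd.DataFrame({
    'Garage Area': [480.0, np.nan, 0.0],
    'Gr Liv Area': [1500, 900, 1100],
})

missing_before = demo.isna().sum()  # missing values per column
demo['Garage Area'] = demo['Garage Area'].fillna(0)
missing_after = demo.isna().sum()

print(missing_before)
print(missing_after)
```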

In [626]:
X_test = test[interesting[:-1]]
X = train.drop(columns='SalePrice')
y = train['SalePrice']

In [627]:
# Use pandas .get_dummies() to binarize categorical columns, dropping the first binary column.
# With help from https://stackoverflow.com/questions/32387266/converting-categorical-values-to-binary-using-pandas
X_test = pd.get_dummies(X_test, drop_first=True)
X = pd.get_dummies(X, drop_first=True)

Columns present in X but missing from X_test:

In [628]:
missing = [col for col in X.columns if col not in X_test.columns]
missing

['Neighborhood_GrnHill', 'Neighborhood_Landmrk']

In [629]:
X_test[missing] = 0

In [630]:
X.shape

(2051, 32)

In [631]:
X_test.shape

(878, 32)
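
The three steps above (find the missing dummy columns, add them as zeros, then sort both frames) can also be sketched as a single `reindex` against the training columns. The frames below are hypothetical stand-ins that only borrow a few neighborhood names:

```python
import pandas as pd

# Hypothetical train/test frames; test lacks two neighborhoods seen in train
train_demo = pd.DataFrame({
    'Neighborhood': ['NAmes', 'GrnHill', 'Landmrk', 'Edwards'],
    'Gr Liv Area': [1000, 1500, 1200, 900],
})
test_demo = pd.DataFrame({
    'Neighborhood': ['NAmes', 'Edwards'],
    'Gr Liv Area': [1100, 950],
})

X_demo = pd.get_dummies(train_demo, drop_first=True)
X_test_demo = pd.get_dummies(test_demo, drop_first=True)

# Give the test frame exactly the training columns, in the same order;
# dummy columns that never occur in test are filled with 0
X_test_aligned = X_test_demo.reindex(columns=X_demo.columns, fill_value=0)

print(X_test_aligned.columns.tolist())
```

Note that `reindex` would also silently drop any column that exists only in the test frame, so this shortcut assumes the training columns are the ones the model should see.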

In [632]:
# A little help here from https://stackoverflow.com/questions/11067027/re-ordering-columns-in-pandas-dataframe-based-on-column-name
# Sort columns alphabetically in both frames so X_test's dummies line up with X's
X_test = X_test.reindex(sorted(X_test.columns), axis=1)
X = X.reindex(sorted(X.columns), axis=1)


In [633]:
X.columns == X_test.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [634]:
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_test_scaled = ss.transform(X_test)

In [635]:
X_scaled.shape

(2051, 32)

In [636]:
X_test_scaled.shape

(878, 32)

In [637]:
lasso = LassoCV(max_iter=10_000)

lasso.fit(X_scaled, y)
preds_scaled = lasso.predict(X_test_scaled)

In [638]:
test['SalePrice'] = preds_scaled

In [639]:
test[['Id', 'SalePrice']].to_csv('../data/submission_lasso_neigh_num.csv', index=False)

In [640]:
pd.DataFrame(lasso.coef_, X.columns).sort_values(by=0, ascending=False)

Unnamed: 0,0
Overall Qual,25292.739105
Gr Liv Area,24167.878209
Neighborhood_NridgHt,13807.384528
Garage Area,11402.958529
Neighborhood_StoneBr,9504.286302
Year Built,8579.083115
Neighborhood_NoRidge,6235.96723
Year Remod/Add,5972.238929
Neighborhood_Crawfor,5449.158682
Neighborhood_ClearCr,3476.464373
