<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

### Regression and Classification with the Ames Housing Data

---

You have just joined a new "full stack" real estate company in Ames, Iowa. The strategy of the firm is two-fold:
- Own the entire process from the purchase of the land all the way to sale of the house, and anything in between.
- Use statistical analysis to optimize investment and maximize return.

The company is still small. Although investment is substantial, its short-term goals are oriented towards purchasing existing houses and flipping them rather than constructing entirely new ones. That said, the company has access to a large construction workforce operating at rock-bottom prices.

This project uses the [Ames housing data recently made available on Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import re

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LassoCV, Lasso, RidgeCV, Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from warnings import filterwarnings
filterwarnings('ignore')

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Estimating the value of homes from fixed characteristics.

---

Your superiors have outlined this year's strategy for the company:
1. Develop an algorithm to reliably estimate the value of residential houses based on *fixed* characteristics.
2. Identify characteristics of houses that the company can cost-effectively change/renovate with their construction team.
3. Evaluate the mean dollar value of different renovations.

The company can then use these estimates to buy houses that are likely to sell for more than the purchase price plus the cost of renovations.

Your first job is to tackle #1. You have a dataset of housing sale data with a large number of features describing different aspects of each house. The data and the full description of its features are in two separate files:

    housing.csv
    data_description.txt
    
You need to build a reliable estimator for the price of the house given characteristics of the house that cannot be renovated. Some examples include:
- The neighborhood
- Square feet
- Bedrooms, bathrooms
- Basement and garage space

and many more. 

Some examples of things that **ARE renovate-able:**
- Roof and exterior features
- "Quality" metrics, such as kitchen quality
- "Condition" metrics, such as condition of garage
- Heating and electrical components

and generally anything you deem can be modified without having to undergo major construction on the house.

---

**Your goals:**
1. Perform any cleaning, feature engineering, and EDA you deem necessary.
2. Be sure to remove any houses that are not residential from the dataset.
3. Identify **fixed** features that can predict price.
4. Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
5. Characterize your model. How well does it perform? What are the best estimates of price?

> **Note:** The EDA and feature engineering component of this project is not trivial! Be sure to always think critically and creatively. Justify your actions! Use the data description file!
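The goal of removing non-residential houses can be handled through the `MSZoning` column. The sketch below is a minimal illustration, assuming the zoning codes listed in `data_description.txt` and treating `FV`, `RH`, `RL`, `RP`, and `RM` as residential. The `toy` frame and its values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are illustrative only.
toy = pd.DataFrame({
    'MSZoning': ['RL', 'C (all)', 'RM', 'FV', 'RH'],
    'SalePrice': [208500, 40000, 140000, 250000, 129500],
})

# Zoning codes treated as residential (an assumption based on the data dictionary).
residential_zones = ['FV', 'RH', 'RL', 'RP', 'RM']

# Keep only rows whose zoning code is in the residential list.
residential = toy[toy['MSZoning'].isin(residential_zones)].reset_index(drop=True)
print(residential['MSZoning'].tolist())
```

Applying the same `isin` filter to `df` right after loading would drop commercial sales before any modelling.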

## Load the data

In [3]:
df = pd.read_csv('./housing.csv')

### Read Data Dictionary

In [4]:
import re
with open('data_description.txt') as f:
    reader = f.read()
    data_dict = [i for i in reader.splitlines()]
data_dict[:5]

['MSSubClass: Identifies the type of dwelling involved in the sale.\t',
 '',
 '        20\t1-STORY 1946 & NEWER ALL STYLES',
 '        30\t1-STORY 1945 & OLDER',
 '        40\t1-STORY W/FINISHED ATTIC ALL AGES']

### Create data_dictionary filter

In [5]:
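def dict_filter(string, lines=10, find_one=True):
    # Print `lines` lines of the data dictionary, starting at each line that matches `string`.
    for c, i in enumerate(data_dict):
        if re.search(string, i):
            print('\n'.join(data_dict[c:c+lines]))
            print(' ')
            # Stop after the first match unless all matches were requested.
            if find_one:
                break
dict_filter('Misc')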

MiscFeature: Miscellaneous feature not covered in other categories
		
       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None
		
MiscVal: $Value of miscellaneous feature
 


### Drop many columns
    1. drop columns with only 1 unique value, or with a different value in every row (e.g. an ID)
    2. drop columns with Sale in their name, except SalePrice (data leakage)
    3. drop the 2 columns with the highest share of null values
    4. drop the 10 columns most dominated by a single value

In [6]:
to_drop = df.columns[df.apply(lambda x: x.nunique()==1) | df.apply(lambda x: x.nunique()==len(df))]
df.drop(to_drop, axis=1, inplace = True)

to_drop = df.columns[df.columns.str.contains('[Ss]ale')].difference(['SalePrice'])
df.drop(to_drop, axis=1, inplace=True)

_ = df.apply(lambda x: x.isnull().sum()/df.shape[0]).sort_values(ascending=False)
to_drop = _[(_!=1) & (_!=0)]
df.drop(to_drop.head(2).index, axis=1, inplace=True)

to_drop = df.apply(lambda x: x.value_counts(dropna=False).iloc[0]/df.shape[0]).sort_values(ascending=False).head(10).index
df.drop(to_drop, axis=1, inplace=True)
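The two ratio-based screens in the cell above can be made easier to inspect by computing the ratios explicitly and applying a threshold. This is only a sketch on a made-up `toy` frame; the `0.7` cut-off is an arbitrary assumption, whereas the notebook itself drops a fixed number of columns.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-null column and one column dominated by a single value.
toy = pd.DataFrame({
    'mostly_null': [np.nan, np.nan, np.nan, 1.0],
    'dominant':    ['a', 'a', 'a', 'b'],
    'varied':      [1, 2, 3, 4],
})

# Fraction of missing values per column.
null_fraction = toy.isnull().mean()

# Fraction of rows taken by each column's most common value (NaN counts as a value).
top_value_fraction = toy.apply(
    lambda col: col.value_counts(dropna=False, normalize=True).iloc[0]
)

# Columns whose most common value covers more than 70% of rows (hypothetical threshold).
to_drop = top_value_fraction[top_value_fraction > 0.7].index.tolist()
print(to_drop)
```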

### Some columns to convert binary

In [7]:
binary = df.columns[df.apply(lambda x: x.nunique() == 2)]
df[binary].apply(lambda x: pd.Series([', '.join(x.dropna().unique()), x.isnull().sum()/df.shape[0]], index=['uniq_vals', 'null_vals'])).T

             uniq_vals null_vals
Alley       Grvl, Pave  0.937671
CentralAir        Y, N       0.0


In [8]:
# CentralAir: 'N' -> 0, anything else -> 1
df['CentralAir'] = (df['CentralAir'] != 'N').astype(int)
# Alley: missing (no alley access) -> 0, any alley type -> 1
df['Alley'] = df['Alley'].notna().astype(int)

In [9]:
time = df.columns[df.columns.str.contains('[Yy]r|[Yy]ear')]

### The numeric columns containing nulls

In [10]:
numeric = df.dtypes[df.dtypes!='object'].index.difference(time)
df[numeric].isnull().sum().sort_values(ascending=False).head(3)

LotFrontage    259
MasVnrArea       8
WoodDeckSF       0
dtype: int64

In [11]:
df.loc[:, df.columns.str.contains('Lot')].isnull().sum()

LotFrontage    259
LotArea          0
LotShape         0
LotConfig        0
dtype: int64

    Remove LotFrontage: too many unexplained null values

In [12]:
df.drop('LotFrontage', axis=1, inplace=True)
numeric = numeric.drop('LotFrontage')

In [13]:
(df.MasVnrType.isnull() & df.MasVnrArea.isnull()).sum()
df.loc[:, df.columns.str.contains('MasVnr')].isnull().sum()

MasVnrType    8
MasVnrArea    8
dtype: int64

    Replace nulls with 0, since area for a non-existent feature is 0

In [14]:
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
df[numeric].isnull().sum().sum()

0

### Object columns

In [15]:
categorical = df.columns[df.dtypes=='object']

In [16]:
isnull = df[categorical].isnull().sum().sort_values(ascending=False).head(20)
isnull

Fence           1179
FireplaceQu      690
GarageCond        81
GarageQual        81
GarageFinish      81
GarageType        81
BsmtExposure      38
BsmtFinType2      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrType         8
Electrical         1
LandContour        0
LotConfig          0
RoofStyle          0
LotShape           0
LandSlope          0
Neighborhood       0
Condition1         0
dtype: int64

In [17]:
df[isnull[isnull!=0].index].apply(lambda x: ', '.join(x.dropna().unique()))

Fence                                    MnPrv, GdWo, GdPrv, MnWw
FireplaceQu                                    TA, Gd, Fa, Ex, Po
GarageCond                                     TA, Fa, Gd, Po, Ex
GarageQual                                     TA, Fa, Gd, Ex, Po
GarageFinish                                        RFn, Unf, Fin
GarageType      Attchd, Detchd, BuiltIn, CarPort, Basment, 2Types
BsmtExposure                                       No, Gd, Mn, Av
BsmtFinType2                         Unf, BLQ, ALQ, Rec, LwQ, GLQ
BsmtFinType1                         GLQ, ALQ, Unf, Rec, BLQ, LwQ
BsmtCond                                           TA, Gd, Fa, Po
BsmtQual                                           Gd, TA, Ex, Fa
MasVnrType                           BrkFace, None, Stone, BrkCmn
Electrical                        SBrkr, FuseF, FuseA, FuseP, Mix
dtype: object

In [18]:
df[isnull[isnull==0].index].apply(lambda x: ', '.join(x.dropna().unique()))

LandContour                                    Lvl, Bnk, Low, HLS
LotConfig                       Inside, FR2, Corner, CulDSac, FR3
RoofStyle                Gable, Hip, Gambrel, Mansard, Flat, Shed
LotShape                                       Reg, IR1, IR2, IR3
LandSlope                                           Gtl, Mod, Sev
Neighborhood    CollgCr, Veenker, Crawfor, NoRidge, Mitchel, S...
Condition1      Norm, Feedr, PosN, Artery, RRAe, RRNn, RRAn, P...
dtype: object

In [19]:
df.Fence.value_counts(dropna=False)

NaN      1179
MnPrv     157
GdPrv      59
GdWo       54
MnWw       11
Name: Fence, dtype: int64

In [20]:
dict_filter('Fence')

Fence: Fence quality
		
       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence
	
MiscFeature: Miscellaneous feature not covered in other categories
		
 


### Safe to convert object columns to dummies

In [21]:
df = pd.concat([df, pd.get_dummies(df[categorical])], axis=1)
df.drop(categorical, axis=1, inplace=True)
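The conversion is safe in the sense that `pd.get_dummies` ignores missing values by default, so a house with no fence or no garage simply gets zeros in every dummy column of that feature. A small sketch on a made-up `Fence` series shows the default behaviour next to the explicit `dummy_na=True` alternative:

```python
import numpy as np
import pandas as pd

# Made-up example: the middle house has no fence (NaN).
fence = pd.Series(['MnPrv', np.nan, 'GdWo'], name='Fence')

# Default: the NaN row becomes all zeros, i.e. an implicit "no fence" level.
dummies = pd.get_dummies(fence, prefix='Fence')

# dummy_na=True adds an explicit indicator column for missing values.
dummies_na = pd.get_dummies(fence, prefix='Fence', dummy_na=True)

print(dummies)
print(dummies_na)
```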

In [22]:
df.drop('GarageYrBlt', axis=1, inplace=True)

In [23]:
uniqs = df.apply(lambda x: x.nunique())

### Split data train and test
    Test set is 2010 data

In [24]:
df2010 = df[df.YrSold==2010]
df = df.loc[df.index.difference(df2010.index)]
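The same split can be expressed with a single boolean mask, which keeps the train and test definitions side by side. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up sales with a year-sold column.
toy = pd.DataFrame({'YrSold': [2008, 2010, 2009, 2010],
                    'SalePrice': [100, 200, 300, 400]})

# Rows sold in 2010 form the test set; everything else is training data.
is_test = toy['YrSold'] == 2010
train, test = toy[~is_test], toy[is_test]

print(len(train), len(test))
```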

In [25]:
fixed = ['1stFlrSF', '2ndFlrSF', 'Alley', 'BedroomAbvGr', 'BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'BsmtExposure_Av', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtUnfSF', 'Condition1_Artery', 'Condition1_Feedr', 'Condition1_Norm', 'Condition1_PosA', 'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn', 'Condition1_RRNe', 'Condition1_RRNn', 'Electrical_FuseA', 'Electrical_FuseF', 'Electrical_FuseP', 'Electrical_Mix', 'Electrical_SBrkr', 'EnclosedPorch', 'Exterior1st_CBlock', 'Exterior1st_CemntBd', 'Exterior1st_Wd Sdng', 'Exterior2nd_AsbShng', 'Exterior2nd_HdBoard', 'Exterior2nd_MetalSd', 'Exterior2nd_VinylSd', 'Fireplaces', 'Foundation_BrkTil', 'Foundation_CBlock', 'Foundation_PConc', 'Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood', 'FullBath', 'GarageArea', 'GarageCars', 'GarageType_2Types', 'GarageType_Attchd', 'GarageType_Basment', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Detchd', 'GrLivArea', 'HalfBath', 'HouseStyle_1.5Fin', 'HouseStyle_1.5Unf', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf', 'HouseStyle_2Story', 'HouseStyle_SFoyer', 'HouseStyle_SLvl', 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low', 'LandContour_Lvl', 'LandSlope_Gtl', 'LandSlope_Mod', 'LandSlope_Sev', 'LotArea', 'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg', 'MSSubClass', 'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM', 'MasVnrArea', 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'MoSold', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_IDOTRR', 
'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker', 'OpenPorchSF', 'PavedDrive_N', 'PavedDrive_P', 'PavedDrive_Y', 'RoofStyle_Flat', 'RoofStyle_Gable', 'RoofStyle_Gambrel', 'RoofStyle_Hip', 'RoofStyle_Mansard', 'RoofStyle_Shed', 'ScreenPorch', 'TotRmsAbvGrd', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'YrSold']
variable = ['BsmtCond_Fa', 'BsmtCond_Gd', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtFinType1_ALQ', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType1_Rec', 'BsmtFinType1_Unf', 'BsmtFinType2_ALQ', 'BsmtFinType2_BLQ', 'BsmtFinType2_GLQ', 'BsmtFinType2_LwQ', 'BsmtFinType2_Rec', 'BsmtFinType2_Unf', 'BsmtQual_Ex', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'CentralAir', 'ExterCond_Ex', 'ExterCond_Fa', 'ExterCond_Gd', 'ExterCond_Po', 'ExterCond_TA', 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'Exterior1st_AsbShng', 'Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_BrkFace', 'Exterior1st_HdBoard', 'Exterior1st_ImStucc', 'Exterior1st_MetalSd', 'Exterior1st_Plywood', 'Exterior1st_Stone', 'Exterior1st_Stucco', 'Exterior1st_VinylSd', 'Exterior1st_WdShing', 'Exterior2nd_AsphShn', 'Exterior2nd_Brk Cmn', 'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_CmentBd', 'Exterior2nd_ImStucc', 'Exterior2nd_Other', 'Exterior2nd_Plywood', 'Exterior2nd_Stone', 'Exterior2nd_Stucco', 'Exterior2nd_Wd Sdng', 'Exterior2nd_Wd Shng', 'Fence_GdPrv', 'Fence_GdWo', 'Fence_MnPrv', 'Fence_MnWw', 'FireplaceQu_Ex', 'FireplaceQu_Fa', 'FireplaceQu_Gd', 'FireplaceQu_Po', 'FireplaceQu_TA', 'Functional_Maj1', 'Functional_Maj2', 'Functional_Min1', 'Functional_Min2', 'Functional_Mod', 'Functional_Sev', 'Functional_Typ', 'GarageCond_Ex', 'GarageCond_Fa', 'GarageCond_Gd', 'GarageCond_Po', 'GarageCond_TA', 'GarageFinish_Fin', 'GarageFinish_RFn', 'GarageFinish_Unf', 'GarageQual_Ex', 'GarageQual_Fa', 'GarageQual_Gd', 'GarageQual_Po', 'GarageQual_TA', 'HeatingQC_Ex', 'HeatingQC_Fa', 'HeatingQC_Gd', 'HeatingQC_Po', 'HeatingQC_TA', 'KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'OverallCond', 'OverallQual']

## Modelling without correlation checks

### Random Forest

In [26]:
y = df.SalePrice
X = df[fixed]

model = RandomForestRegressor(100, n_jobs=-1)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.8261430967468163


array([0.85167684, 0.75355833, 0.86218785, 0.8618899 , 0.80140256])

### Lasso

In [27]:
y = df.SalePrice
X = df[fixed]

model = Lasso(alpha=105)
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])
model.fit(X, y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.7955057689694927


array([0.86198737, 0.80352694, 0.83082916, 0.82824685, 0.65293852])
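The cell above fits the scaler on the full training matrix before cross-validating, so every validation fold has already influenced the scaling. One way to avoid that is to put the scaler and the model in a pipeline, so the scaler is refit inside each fold. The sketch below uses synthetic data from `make_regression` and an arbitrary `alpha`; unlike the notebook, it scales every column rather than only the non-binary ones.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the housing features.
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# The scaler is refit on the training portion of every fold.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```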

In [28]:
S = pd.Series(model.coef_, index=X.columns)
new_idx = S[S!=0].map(abs).sort_values().tail(67).index

### Random Forest

In [29]:
y = df.SalePrice
X = df[new_idx]

model = RandomForestRegressor(100, n_jobs=-1)
# model.fit(X, y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.8353184397490298


array([0.85651775, 0.75399192, 0.87599537, 0.86809739, 0.82198978])

## Correlation checks

In [30]:
to_drop = [0]
while len(to_drop) > 0:
    corrtype = df.corr()
    corrs = corrtype.where(~np.triu(np.ones(corrtype.shape)).astype(bool)).abs()
    high = corrs[corrs>0.8].dropna(how='all').dropna(how='all', axis=1)
    high = high.apply(lambda x: x.dropna().index)
    if len(high) == 0: break
    high = set(high.values.ravel()) | set(high.columns.ravel())
    to_drop = df[list(high)].apply(lambda x: np.corrcoef(x, df.SalePrice)[0,1]).map(abs).sort_values().index[0]
    df.drop(to_drop, axis=1, inplace=True)
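The loop above removes one column per pass until no pair of columns has an absolute correlation above 0.8. A simplified variant of the same idea is sketched below. It is not the notebook's exact procedure: the hypothetical helper `drop_correlated` looks at the single most correlated feature pair on each pass and drops whichever member is less correlated with the target. The data is synthetic.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, target, threshold=0.8):
    """Drop, pair by pair, the feature less correlated with `target`."""
    frame = frame.copy()
    while True:
        features = frame.drop(columns=target)
        corr = features.corr().abs()
        # Keep only the strict lower triangle so each pair appears once.
        mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
        pairs = corr.where(mask).stack()
        pairs = pairs[pairs > threshold]
        if pairs.empty:
            return frame
        a, b = pairs.idxmax()
        # Of the two, remove the column that tracks the target less closely.
        target_corr = features[[a, b]].corrwith(frame[target]).abs()
        frame = frame.drop(columns=target_corr.idxmin())

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'a_copy': base + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    'b': rng.normal(size=200),
})
toy['target'] = 2 * toy['a'] + toy['b']

pruned = drop_correlated(toy, 'target')
print(pruned.columns.tolist())
```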

In [31]:
df.shape

(1285, 209)

### Random Forest
    Slight improvement

In [32]:
y = df.SalePrice
X = df.drop('SalePrice', axis=1)

model = RandomForestRegressor(100, n_jobs=-1)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.8498286783732493


array([0.87130619, 0.84068549, 0.87645603, 0.87318342, 0.78751226])

### Lasso

In [33]:
y = df.SalePrice
X = df.drop('SalePrice', axis=1)

model = Lasso(alpha=169)
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])
model.fit(X, y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.8447594275355279


array([0.90521844, 0.83498698, 0.86736469, 0.88124903, 0.73497799])

In [34]:
S = pd.Series(model.coef_, index=X.columns)
new_idx = S[S!=0].map(abs).sort_values().tail(67).index

### Random Forest

In [35]:
y = df.SalePrice
X = df[new_idx]

model = RandomForestRegressor(100, n_jobs=-1)
model.fit(X, y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.8498681632744635


array([0.87041357, 0.82196584, 0.87741304, 0.87017147, 0.8093769 ])

### Ridge

In [36]:
y = df.SalePrice
X = df.drop('SalePrice', axis=1)

model = Ridge(alpha = 19.63)
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])
model.fit(X, y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.8389512868439455


array([0.90040411, 0.83676783, 0.86003491, 0.87682695, 0.72072263])

### XGB Regressor

In [37]:
y = df.SalePrice
X = df[new_idx]

model = XGBRegressor(n_jobs=-1, verbosity=0)
model.fit(X, y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

0.8884112193843936


array([0.89680547, 0.85423059, 0.8998484 , 0.89930515, 0.89186649])

### XGB Regressor with GridSearch

In [38]:
y = df.SalePrice
X = df[new_idx]

model = XGBRegressor(n_jobs=-1, verbosity=0)
gs = GridSearchCV(model, {'learning_rate': np.logspace(-1, -0.5, 5)})

gs.fit(X,y)

scores = cross_val_score(gs.best_estimator_, X, y, cv=5)
print(np.mean(scores))
scores

0.8939679253552051


array([0.90530138, 0.86498187, 0.89454745, 0.89932857, 0.90568036])

### Final Score

In [39]:
gs.score(df2010[new_idx], df2010.SalePrice)

0.9041280309146664
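`gs.score` reports R², which does not say how far off a typical price estimate is in dollars. Error metrics in the target's own units can complement it. The sketch below uses made-up prices and predictions, not the model's actual output:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up sale prices and predictions, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0, 98000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 100000.0])

mae = mean_absolute_error(y_true, y_pred)            # average absolute error in dollars
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large misses more
r2 = r2_score(y_true, y_pred)

print(mae, rmse, r2)
```

On the real data the same calls would take `df2010.SalePrice` and `gs.predict(df2010[new_idx])`.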

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Determine any value of *changeable* property characteristics unexplained by the *fixed* ones.

---

Now that you have a model that estimates the price of a house based on its static characteristics, we can move forward with parts 2 and 3 of the plan: what are the costs/benefits of quality, condition, and renovations?

There are two specific requirements for these estimates:
1. The estimates of effects must be in terms of dollars added or subtracted from the house value. 
2. The effects must be on the variance in price remaining from the first model.

The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. Of that variance in price remaining, how much of it can be explained by the easy-to-change aspects of the property?

---

**Your goals:**
1. Evaluate the effect in dollars of the renovate-able features.
2. How would your company use this second model and its coefficients to determine whether they should buy a property or not? Explain how the company can use the two models you have built to determine if they can make money.
3. Investigate how much of the variance in price remaining is explained by these features.
4. Do you trust your model? Should it be used to evaluate which properties to buy and fix up?
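The two-stage idea (fit on fixed features, then explain the residuals with renovate-able features) can be sketched as follows. The data is synthetic and the variable names are hypothetical. One deliberate choice here is an assumption rather than part of the assignment: the first-stage residuals come from out-of-fold predictions via `cross_val_predict`, because in-sample residuals from a flexible model such as a random forest tend to be artificially small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 300
X_fixed = rng.normal(size=(n, 3))                        # stand-in for fixed features
X_reno = rng.integers(0, 2, size=(n, 2)).astype(float)   # stand-in for renovate-able dummies
price = (50_000 * X_fixed[:, 0] + 20_000 * X_fixed[:, 1]
         + 15_000 * X_reno[:, 0] + rng.normal(scale=5_000, size=n))

# Stage 1: out-of-fold predictions from the fixed features only.
stage1 = RandomForestRegressor(n_estimators=50, random_state=0)
oof_pred = cross_val_predict(stage1, X_fixed, price, cv=5)
residuals = price - oof_pred

# Stage 2: a linear model on the residuals gives coefficients in dollars.
stage2 = Ridge(alpha=1.0).fit(X_reno, residuals)
print(stage2.coef_)
```

Because the second stage is linear and its inputs are unscaled 0/1 dummies, each coefficient reads directly as a dollar effect of that feature on the price left unexplained by stage 1.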

### XGB Regressor

In [40]:
y = df.SalePrice
X = df[new_idx]

model = XGBRegressor(n_jobs=-1, verbosity=0)
gs = GridSearchCV(model, {'learning_rate': np.logspace(-1, -0.5, 5)})

gs.fit(X,y)

scores = cross_val_score(gs.best_estimator_, X, y, cv=5)
print(np.mean(scores))
scores

yhat = gs.predict(X)
res = y-yhat

yy = res
X = df[list(set(df.columns)&set(variable))]

gs.fit(X,yy)

scores = cross_val_score(gs.best_estimator_, X, yy, cv=5)
print(np.mean(scores))
scores

0.8939679253552051
-0.06291220995871086


array([-6.59882862e-02, -6.15254712e-02, -6.63075592e-02, -1.20768916e-01,
        2.91823412e-05])

### Random Forest

In [41]:
y = df.SalePrice
X = df[new_idx]

model = RandomForestRegressor(100, n_jobs=-1)

model.fit(X,y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

yhat = model.predict(X)
res = y-yhat

yy = res
X = df[list(set(df.columns)&set(variable))]

model.fit(X,yy)

scores = cross_val_score(model, X, yy, cv=5)
print(np.mean(scores))
scores

0.8503232447710765
-0.16709875000606775


array([-0.00567811, -0.11835262, -0.36669306, -0.34993428,  0.00516431])

### Lasso

In [42]:
y = df.SalePrice
X = df[new_idx]

model = LassoCV(n_alphas=1000, n_jobs=-1)
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])
model.fit(X,y)
model = Lasso(alpha=model.alpha_)
model.fit(X,y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

yhat = model.predict(X)
res = y-yhat

yy = res
X = df[list(set(df.columns)&set(variable))]
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])

model.fit(X,yy)

scores = cross_val_score(model, X, yy, cv=5)
print(np.mean(scores))
scores

0.8570236166235616
-0.04095540009753895


array([-0.04450073, -0.02898882, -0.06693489, -0.03345198, -0.03090057])

### Ridge

In [43]:
y = df.SalePrice
X = df[new_idx]

ridge = RidgeCV(alphas=np.logspace(-3,1,50))
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])
ridge.fit(X,y)
ridge = Ridge(alpha=ridge.alpha_)
ridge.fit(X,y)

scores = cross_val_score(ridge, X, y, cv=5)
print(np.mean(scores))
scores

yhat = ridge.predict(X)
res = y-yhat

yy = res
X = df[list(set(df.columns)&set(variable))]
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])

ridge.fit(X,yy)

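# Note: `model` is still the Lasso fitted in the previous cell, not `ridge`;
# pass `ridge` here to cross-validate the ridge estimator on the residuals.
scores = cross_val_score(model, X, yy, cv=5)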
print(np.mean(scores))
scores

0.8573548977938014
-0.03781581248727175


array([-0.03458611, -0.03160238, -0.06217189, -0.02485805, -0.03586064])

### Linear Regression

In [44]:
y = df.SalePrice
X = df[new_idx]
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])

model = LinearRegression()
model.fit(X,y)

scores = cross_val_score(model, X, y, cv=5)
print(np.mean(scores))
scores

yhat = model.predict(X)
res = y-yhat

yy = res
X = df[list(set(df.columns)&set(variable))]
ss = StandardScaler()
toscale = X.columns[X.apply(lambda x: x.nunique()!=2)]
X.loc[:, toscale] = ss.fit_transform(X[toscale])

model.fit(X,yy)

scores = cross_val_score(model, X, yy, cv=5)
print(np.mean(scores))
scores

0.8533209220288704
-2.652589872901171e+20


array([-1.40735732e-01, -1.06386273e-01, -1.99091519e-01, -1.73082099e-01,
       -1.32629494e+21])
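The last fold's R² of roughly -1.3e21 is the kind of blow-up that unregularized least squares can produce when the design matrix is rank deficient. One plausible contributor (an assumption, not verified against this data) is that every categorical feature is expanded into a full set of dummies, which are collinear with the intercept. The sketch below shows the rank deficiency on a made-up column and how `drop_first=True` removes it:

```python
import numpy as np
import pandas as pd

# Made-up quality ratings with three levels.
quality = pd.Series(['Ex', 'Gd', 'TA', 'Gd', 'TA', 'Ex'], name='KitchenQual')

full = pd.get_dummies(quality).astype(float)                      # one column per level
reduced = pd.get_dummies(quality, drop_first=True).astype(float)  # first level dropped

def with_intercept(frame):
    # Prepend a column of ones, mimicking the intercept LinearRegression fits by default.
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

rank_full = np.linalg.matrix_rank(with_intercept(full))        # 4 columns, rank 3
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))  # 3 columns, rank 3
print(rank_full, rank_reduced)
```

Ridge and Lasso tolerate this redundancy because the penalty picks one solution, which is consistent with their fold scores staying in a sensible range.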

In [45]:
pd.Series(ridge.coef_, X.columns).sort_values(ascending=False).head(20)

GarageQual_Ex          10012.332247
FireplaceQu_Ex          7557.610738
Exterior2nd_ImStucc     7157.484455
BsmtFinType2_ALQ        5955.832035
GarageCond_TA           4378.700592
BsmtQual_Fa             3915.291528
Functional_Mod          3785.633664
ExterQual_Fa            3707.609669
Functional_Min1         3432.550682
Exterior1st_BrkFace     3364.252505
Functional_Min2         3041.104787
BsmtFinType1_ALQ        2959.067221
ExterCond_Ex            2895.131231
Functional_Typ          2483.070422
KitchenQual_Fa          2357.937924
Exterior1st_WdShing     2301.877600
BsmtCond_Po             2032.738375
Exterior2nd_CmentBd     1851.360756
BsmtFinType2_GLQ        1814.517368
ExterCond_Fa            1636.396486
dtype: float64

In [46]:
pd.Series(ridge.coef_, X.columns).sort_values(ascending=False).tail(20)

Exterior2nd_Stucco    -1413.476121
Fence_MnWw            -1521.628528
GarageQual_Gd         -1638.196247
HeatingQC_Po          -1906.997089
Functional_Maj1       -1985.759760
BsmtFinType1_LwQ      -2180.162244
Exterior2nd_Wd Shng   -2223.711559
GarageQual_Po         -2237.141618
GarageCond_Gd         -2295.867083
Exterior2nd_Plywood   -2354.838261
Fence_GdPrv           -2381.683182
GarageCond_Po         -2486.860330
FireplaceQu_Fa        -3146.284959
Functional_Maj2       -3696.626993
Exterior1st_ImStucc   -3787.823961
GarageQual_TA         -3953.188048
BsmtFinType2_BLQ      -3968.926103
GarageQual_Fa         -4260.335817
Exterior2nd_Stone     -4837.153069
Functional_Sev        -7059.972803
dtype: float64

In [47]:
dict_filter('GarageQ')

GarageQual: Garage quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage
		
GarageCond: Garage condition
 


### Coefficients do not seem reliable
    Garage does not seem reliable in either Ridge or XGB
    Fireplace seems reliable in Ridge (qualitatively) but not XGB
    Kitchen seems reliable in XGB (qualitatively) but not Ridge
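The per-feature lookups in this section all follow one pattern: label a vector of coefficients or importances with column names, sort it, and keep the names containing a substring. A small helper keeps that in one place. The name `values_matching` is hypothetical and the demo values are made up.

```python
import pandas as pd

def values_matching(values, names, pattern):
    """Label `values` with `names`, sort descending, keep labels containing `pattern`."""
    series = pd.Series(values, index=names).sort_values(ascending=False)
    return series[series.index.str.contains(pattern)]

# Made-up coefficients for illustration.
demo = values_matching(
    [3.0, -1.0, 2.0],
    ['GarageQual_Ex', 'KitchenQual_TA', 'GarageCond_TA'],
    'Garage',
)
print(demo)
```

With the fitted objects from this notebook, a call such as `values_matching(ridge.coef_, X.columns, 'Garage')` would give the same kind of listing as the longer expressions.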

In [48]:
pd.Series(ridge.coef_, X.columns).sort_values(ascending=False)[pd.Series(ridge.coef_, X.columns).sort_values(ascending=False).index.str.contains('Garage')]

GarageQual_Ex       10012.332247
GarageCond_TA        4378.700592
GarageCond_Fa        1035.794182
GarageFinish_Unf     -363.869203
GarageFinish_RFn     -566.874891
GarageFinish_Fin    -1145.785389
GarageQual_Gd       -1638.196247
GarageQual_Po       -2237.141618
GarageCond_Gd       -2295.867083
GarageCond_Po       -2486.860330
GarageQual_TA       -3953.188048
GarageQual_Fa       -4260.335817
dtype: float64

In [49]:
pd.Series(gs.best_estimator_.feature_importances_, X.columns).sort_values(ascending=False)[pd.Series(gs.best_estimator_.feature_importances_, X.columns).sort_values(ascending=False).index.str.contains('Garage')]

GarageCond_Fa       0.029195
GarageFinish_RFn    0.026824
GarageQual_Gd       0.014135
GarageCond_TA       0.013967
GarageQual_TA       0.011669
GarageQual_Fa       0.010957
GarageCond_Gd       0.010824
GarageFinish_Fin    0.008166
GarageFinish_Unf    0.007432
GarageQual_Ex       0.003373
GarageQual_Po       0.000000
GarageCond_Po       0.000000
dtype: float32

In [50]:
pd.Series(ridge.coef_, X.columns).sort_values(ascending=False)[pd.Series(ridge.coef_, X.columns).sort_values(ascending=False).index.str.contains('Fire')]

FireplaceQu_Ex    7557.610738
FireplaceQu_Gd     648.581420
FireplaceQu_TA    -256.661314
FireplaceQu_Po    -707.185794
FireplaceQu_Fa   -3146.284959
dtype: float64

In [51]:
pd.Series(gs.best_estimator_.feature_importances_, X.columns).sort_values(ascending=False)[pd.Series(gs.best_estimator_.feature_importances_, X.columns).sort_values(ascending=False).index.str.contains('Fire')]

FireplaceQu_Po    0.045796
FireplaceQu_TA    0.035792
FireplaceQu_Ex    0.026630
FireplaceQu_Gd    0.016848
FireplaceQu_Fa    0.005801
dtype: float32

In [52]:
pd.Series(ridge.coef_, X.columns).sort_values(ascending=False)[pd.Series(ridge.coef_, X.columns).sort_values(ascending=False).index.str.contains('Kitchen')]

KitchenQual_Fa    2357.937924
KitchenQual_Ex     550.888334
KitchenQual_TA     385.742378
dtype: float64

In [53]:
pd.Series(gs.best_estimator_.feature_importances_, X.columns).sort_values(ascending=False)[pd.Series(gs.best_estimator_.feature_importances_, X.columns).sort_values(ascending=False).index.str.contains('Kitchen')]

KitchenQual_Ex    0.020487
KitchenQual_TA    0.016267
KitchenQual_Fa    0.014468
dtype: float32