After trying several models, including a RandomForestRegressor and a neural network, combining them with scikit-learn's PolynomialFeatures, and spending a lot of time grid searching without getting better results, it's time to take the next step in feature engineering. Let's get our hands dirty.

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl

import ipynb.fs.full.plotFunctions as myplt
import ipynb.fs.full.preprocessFunctions as pp
import ipynb.fs.full.tableFunctions as tab

plt.style.use('Solarize_Light2')
%matplotlib widget

In [2]:
pd.options.display.max_rows = 4000 #this is so that the notebook won't truncate results

In [3]:
test = pd.read_csv("test.csv", index_col="Id")
train = pd.read_csv("train.csv", index_col="Id")

We apply the transformations we already settled on, then create new bivariate features to test and new categorical features based on numerical ranges.

In [4]:
train['SalePrice'] = np.log(train['SalePrice'])  # work with the log of the target
outliers = [1299, 524, 935]  # Ids of the rows treated as outliers
train.drop(index=outliers, inplace=True)

Instead of filling all missing numerical values with the mean, we should consider that for some features a missing value likely means the feature does not apply (garage cars, for example), so we fill them with zero.

In [5]:
nums = [col for col in train.select_dtypes(include='number').columns]
train[nums] = train[nums].fillna(0)
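The cell above fills every numerical column with zero. As a hedged variation on the same idea, the sketch below fills with zero only the columns where a missing value plausibly means "feature absent" and uses the median elsewhere. The toy frame and the split of columns into the two groups are assumptions made for illustration, not something taken from the data description.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "GarageCars": [2.0, np.nan, 1.0],     # NaN plausibly means "no garage"
    "MasVnrArea": [120.0, np.nan, 0.0],   # NaN plausibly means "no veneer"
    "LotFrontage": [65.0, np.nan, 80.0],  # NaN is more likely "not recorded"
})

# Assumed split: which columns read as "feature absent" when missing.
absent_means_zero = ["GarageCars", "MasVnrArea"]
measured = ["LotFrontage"]

filled = toy.copy()
filled[absent_means_zero] = filled[absent_means_zero].fillna(0)
# fillna with a Series fills each column with its own median.
filled[measured] = filled[measured].fillna(filled[measured].median())
```

Whether the extra distinction pays off would still have to be checked against the model scores.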

In [6]:
train['LotArea'] = np.log(train['LotArea'])
train['GrLivArea'] = np.log(train['GrLivArea'])
train['OpenPorchSF'] = np.cbrt(train['OpenPorchSF'])

We'll fill missing values in all categorical features with 'NA': a closer look at the data description shows that NA is in fact a category of its own, meaning the property lacks that feature.

In [7]:
cats = [col for col in train.select_dtypes(include='object').columns]
train[cats] = train[cats].fillna('NA')

Old numerical bivariates:

In [8]:
# train['LotFrontageOverArea'] = train['LotFrontage'] * train['LotArea']
train['YearsBTWbuiltAndRemod'] = train['YearRemodAdd'] * train['YearBuilt']  # a product, despite the name
train['RemodAfter1984'] = (train['YearRemodAdd'] >= 1984).astype(int)
train['BsmtUnfPCT'] = train['BsmtUnfSF'] * train['TotalBsmtSF']  # a product, not a percentage
train['2ndFlr'] = (train['2ndFlrSF'] > 0).astype(int)
train['GrOverLotArea'] = train['GrLivArea'] * train['LotArea']  # product of the two log-transformed areas
# train['LowQualFin'] = (train['LowQualFinSF'] > 0).astype(int)
# train['Pool'] = (train['PoolArea'] > 0).astype(int) both highly skewed
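The feature names in the cell above read like differences and ratios, while the code computes products. Purely as a sketch, and not what the rest of this notebook uses (the later bin edges assume the products), literal versions could look like the following. The column names `YearsBuiltToRemod` and `BsmtUnfShare` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "YearBuilt": [1950, 2000],
    "YearRemodAdd": [1990, 2000],
    "BsmtUnfSF": [200.0, 0.0],
    "TotalBsmtSF": [800.0, 0.0],
})

# Years elapsed between construction and remodel (0 if never remodelled).
toy["YearsBuiltToRemod"] = toy["YearRemodAdd"] - toy["YearBuilt"]

# Share of the basement that is unfinished. Replacing a zero total with NaN
# avoids dividing by zero; houses without a basement then get a share of 0.
toy["BsmtUnfShare"] = (
    toy["BsmtUnfSF"].div(toy["TotalBsmtSF"].replace(0, np.nan)).fillna(0.0)
)
```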

New bivariates considering the absence of certain features:

In [9]:


# train['HasGarage'] = (train['GarageArea']>0).astype(int) highly correlated to GarageArea
train['HasBasement'] = (train['TotalBsmtSF']>0).astype(int)
train['HasFinBasement'] = (train['BsmtFinSF1']>0).astype(int)
train['HasFullBath'] = (train['FullBath']>0).astype(int)
train['HasMasVnr'] = (train['MasVnrArea']>0).astype(int)
# train['HasHalfBath'] = (train['HalfBath']>0).astype(int) Highly correlated to HalfBath
train['HasWoodDeck'] = (train['WoodDeckSF']>0).astype(int)
train['HasFireplace'] = (train['Fireplaces']>0).astype(int)
train['HasOpenPorch'] = (train['OpenPorchSF']>0).astype(int)
train['HasScreenPorch'] = (train['ScreenPorch']>0).astype(int)
train['HasEnclosedPorch'] = (train['EnclosedPorch']>0).astype(int)
train['HasKitchen'] = (train['KitchenAbvGr']>0).astype(int)




In [10]:
train['YearsBTWbuiltAndSold'] = train['YrSold'] - train['YearBuilt'] # Highly correlated to YearBuilt and no better correlation to SalePrice
# train['YearsBTWRemodAndSold'] = train['YrSold'] * train['YearRemodAdd'] Highly correlated to YearRemodAdd and no better correlation to SalePrice
train['GarageAreaPerCars'] = np.sqrt(train['GarageArea'] * train['GarageCars'])


So the more years between construction and sale, the cheaper the house seems to be... up to a point: past a hundred years or so, we could think of it as having become cultural heritage.

In [11]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=train, x='YearsBTWbuiltAndSold', y='SalePrice', hue=list(train['SalePrice']))

<AxesSubplot:xlabel='YearsBTWbuiltAndSold', ylabel='SalePrice'>

In [12]:
train['AgeInterval'] = pd.cut(train['YearsBTWbuiltAndSold'], bins=[-1, 20, 60, np.inf], labels=[20, 60, 140]).astype(int)

In [13]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=train, x='AgeInterval', y='SalePrice', hue=list(train['SalePrice']))

<AxesSubplot:xlabel='AgeInterval', ylabel='SalePrice'>

In [14]:
train.groupby('AgeInterval')['SalePrice'].describe()

| AgeInterval | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20 | 562.0 | 12.329585 | 0.309667 | 11.344507 | 12.104178 | 12.281178 | 12.512361 | 13.534473 |
| 60 | 603.0 | 11.883832 | 0.277101 | 10.47195 | 11.736069 | 11.870271 | 12.029931 | 12.860999 |
| 140 | 292.0 | 11.724075 | 0.38006 | 10.460242 | 11.530265 | 11.701475 | 11.916685 | 13.07107 |


We will keep these two options for the categorical binning of the property's age at the time of sale, and we will see later which one does better.
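One simple way to compare a raw feature with a binned version of it is to look at how each correlates with the target. The sketch below does this on synthetic data; the price curve is invented for illustration and only the `pd.cut` call mirrors the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(0, 130, size=500)
# Synthetic log-price that falls with age and then flattens; illustration only.
log_price = 12.4 - 0.6 * np.minimum(age, 60) / 60 + rng.normal(0, 0.2, size=500)
toy = pd.DataFrame({"YearsBTWbuiltAndSold": age, "SalePrice": log_price})

# Same edges and labels as the AgeInterval feature above.
toy["AgeInterval"] = pd.cut(
    toy["YearsBTWbuiltAndSold"], bins=[-1, 20, 60, np.inf], labels=[20, 60, 140]
).astype(int)

# Pearson correlation of each candidate with the target.
comparison = toy.corr()["SalePrice"].drop("SalePrice")
```

Correlation is only a first check; the final decision is better made on the validation score of the model itself.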

### Year Built CAT

In [15]:
plt.figure(figsize=(10, 5))
fig = sns.scatterplot(data=train, x='YearBuilt', y='SalePrice', hue=list(train['SalePrice']))

In [16]:
train['YearBuiltCat'] = pd.cut(train['YearBuilt'], bins=[0, 1919, 1959, 1979, 1999, 2019], labels=[1900,1920, 1960, 1980, 2000]).astype(int)

In [17]:
train.groupby('YearBuiltCat')['SalePrice'].describe()

| YearBuiltCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1900 | 88.0 | 11.777438 | 0.361131 | 10.542706 | 11.585013 | 11.755864 | 11.95518 | 13.07107 |
| 1920 | 397.0 | 11.750925 | 0.339687 | 10.460242 | 11.589887 | 11.767568 | 11.916389 | 12.850555 |
| 1960 | 362.0 | 11.903123 | 0.264532 | 11.041048 | 11.75685 | 11.898188 | 12.051609 | 12.834681 |
| 1980 | 224.0 | 12.23724 | 0.314827 | 11.445717 | 12.061047 | 12.17561 | 12.398732 | 13.534473 |
| 2000 | 386.0 | 12.349751 | 0.309405 | 11.344507 | 12.127706 | 12.320516 | 12.532221 | 13.323927 |


### Year Remod CAT

In [18]:
plt.figure(figsize=(10, 5))
fig = sns.scatterplot(data=train, x='YearRemodAdd', y='SalePrice', hue=list(train['SalePrice']))

In [19]:
train['YearRemodCat'] = pd.cut(train['YearRemodAdd'], bins=[0, 1950, 1964, 1984, 1994, 2019], labels=[1950, 1964, 1984, 1994, 2000]).astype(int)

In [20]:
train.groupby('YearRemodCat')['SalePrice'].describe()

| YearRemodCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1950 | 178.0 | 11.616425 | 0.350995 | 10.460242 | 11.42789 | 11.652687 | 11.826822 | 12.452933 |
| 1964 | 152.0 | 11.793431 | 0.219023 | 10.932982 | 11.686879 | 11.817649 | 11.947786 | 12.323856 |
| 1984 | 289.0 | 11.873906 | 0.273584 | 11.041048 | 11.728037 | 11.8706 | 12.043554 | 12.834681 |
| 1994 | 131.0 | 12.148561 | 0.305672 | 11.445717 | 11.97664 | 12.100712 | 12.354409 | 13.07107 |
| 2000 | 707.0 | 12.21395 | 0.372943 | 10.858999 | 11.956762 | 12.190959 | 12.449019 | 13.534473 |


#### PORCH FEATURES
Put all the porch square footage together and maybe make a categorical feature.

In [21]:
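# Note: OpenPorchSF was cube-rooted earlier, while the other porch columns are still raw square feet.
train['PorchSF'] = train['OpenPorchSF'] + train['EnclosedPorch'] + train['3SsnPorch'] + train['ScreenPorch']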
fig_psf= myplt.plotRegRes(train, 'PorchSF', 'SalePrice')

In [22]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=train, x='PorchSF', y='SalePrice', hue=list(train['SalePrice']))

<AxesSubplot:xlabel='PorchSF', ylabel='SalePrice'>

In [23]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=train, x='OpenPorchSF', y='SalePrice', hue=list(train['SalePrice']))

<AxesSubplot:xlabel='OpenPorchSF', ylabel='SalePrice'>

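We will keep the total SF and split the open porch into four categories. Since OpenPorchSF was cube-rooted earlier, the bin edges are on the cube-root scale: 0 (no porch, or at most 1 SF), 1 (up to 4, i.e. 64 SF), 4 (up to 6, i.e. 216 SF) and 6 (more than 216 SF).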

In [24]:
train['OpenPorchCat'] = pd.cut(train['OpenPorchSF'], bins=[-1, 1, 4, 6, np.inf], labels=[0,1,4, 6]).astype(int)
# train['PorchSFCat'] = pd.cut(train['PorchSF'], bins=[-1, 7.5, 100, 200, 300, np.inf], labels=[0,100, 200, 300, 400]).astype(int) highly correlated to porchSF and no better correlation to SalePrice
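Because `OpenPorchSF` was cube-rooted before binning, the edges `1, 4, 6` are not square feet. A small sketch, with made-up porch areas, of how the edges map back to the original unit:

```python
import numpy as np
import pandas as pd

# Raw open-porch areas in square feet (made-up values).
raw_sf = pd.Series([0, 30, 100, 400])
cbrt_sf = np.cbrt(raw_sf)

edges = [-1, 1, 4, 6, np.inf]  # same edges as above, on the cube-root scale
labels = [0, 1, 4, 6]
cat = pd.cut(cbrt_sf, bins=edges, labels=labels).astype(int)

# Cubing the finite, positive edges gives the thresholds in square feet.
edges_in_sf = [e ** 3 for e in edges[1:-1]]
```

Here `edges_in_sf` is `[1, 64, 216]`, so the four categories split the raw area at 1, 64 and 216 square feet.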

In [25]:
train.groupby('OpenPorchCat')['SalePrice'].describe()

| OpenPorchCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 655.0 | 11.824592 | 0.332102 | 10.47195 | 11.652687 | 11.82408 | 11.982929 | 12.991753 |
| 1 | 419.0 | 12.138798 | 0.346764 | 10.596635 | 11.935869 | 12.128111 | 12.356643 | 13.534473 |
| 4 | 342.0 | 12.241406 | 0.391918 | 11.288531 | 11.982929 | 12.21106 | 12.481149 | 13.521139 |
| 6 | 41.0 | 12.214224 | 0.465391 | 10.460242 | 12.001505 | 12.242887 | 12.452933 | 13.07107 |


### TotalBsmt CAT

In [26]:

plt.figure(figsize=(10, 5))
fig = sns.scatterplot(data=train, x='TotalBsmtSF', y='SalePrice', hue=list(train['SalePrice']))

In [27]:
train['TotalBsmtCat'] = pd.cut(train['TotalBsmtSF'], bins=[-1, 0, 700, 1500, 2000, np.inf], labels=[0,700,1500, 2000, 3000]).astype(int)

In [28]:
train.groupby('TotalBsmtCat')['SalePrice'].describe()

| TotalBsmtCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 37.0 | 11.52968 | 0.287846 | 10.57898 | 11.407565 | 11.530765 | 11.685685 | 12.198544 |
| 700 | 181.0 | 11.677094 | 0.347558 | 10.47195 | 11.418615 | 11.71994 | 11.963568 | 12.524417 |
| 1500 | 1037.0 | 12.008436 | 0.319876 | 10.460242 | 11.794338 | 11.99226 | 12.206073 | 13.07107 |
| 2000 | 174.0 | 12.45364 | 0.317123 | 11.608236 | 12.273731 | 12.460707 | 12.660328 | 13.345507 |
| 3000 | 28.0 | 12.813355 | 0.379176 | 11.834284 | 12.52983 | 12.854463 | 12.991104 | 13.534473 |


### YearsBTWbuiltAndRemod CAT

In [29]:

plt.figure(figsize=(10, 5))
fig = sns.scatterplot(data=train, x='YearsBTWbuiltAndRemod', y='SalePrice', hue=list(train['SalePrice']))

In [30]:
train['YearsBTWbuiltAndRemodCat'] = pd.cut(train['YearsBTWbuiltAndRemod'], bins=[-1, 3.8e6, 3.9e6, 4e6, np.inf], labels=[1,2,3, 4]).astype(int)

In [31]:
train.groupby('YearsBTWbuiltAndRemodCat')['SalePrice'].describe()

| YearsBTWbuiltAndRemodCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 188.0 | 11.655413 | 0.373976 | 10.460242 | 11.473382 | 11.66521 | 11.846714 | 13.07107 |
| 2 | 457.0 | 11.818282 | 0.286181 | 10.47195 | 11.669929 | 11.81857 | 11.976659 | 12.850555 |
| 3 | 446.0 | 12.117514 | 0.314755 | 11.320554 | 11.898188 | 12.089539 | 12.297393 | 13.534473 |
| 4 | 366.0 | 12.355258 | 0.314578 | 11.344507 | 12.131281 | 12.322521 | 12.545219 | 13.323927 |


### MasVnrArea CAT

In [32]:

plt.figure(figsize=(10, 5))
fig = sns.scatterplot(data=train, x='MasVnrArea', y='SalePrice', hue=list(train['SalePrice']))

In [33]:
train['MasVnrAreaCat'] = pd.cut(train['MasVnrArea'], bins=[-1, 0, 200, 400, np.inf], labels=[1,2,3, 4]).astype(int)

In [34]:
train.groupby('MasVnrAreaCat')['SalePrice'].describe()

| MasVnrAreaCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 868.0 | 11.897592 | 0.366933 | 10.460242 | 11.686879 | 11.871473 | 12.122827 | 13.521139 |
| 2 | 293.0 | 12.098934 | 0.291041 | 11.225243 | 11.881035 | 12.089539 | 12.278393 | 13.226723 |
| 3 | 198.0 | 12.248645 | 0.373135 | 11.320554 | 11.999939 | 12.232322 | 12.554264 | 13.229568 |
| 4 | 98.0 | 12.462026 | 0.447726 | 11.326596 | 12.107145 | 12.500563 | 12.810232 | 13.534473 |


#### GENERAL SF FEATURES

Let's combine some of them and build some categories.
For interactive plots, install ipympl (see https://github.com/matplotlib/ipympl/issues/148).

In [35]:

plt.figure(figsize=(10, 5))
fig = sns.scatterplot(data=train, x='GrOverLotArea', y='SalePrice', hue=list(train['SalePrice']))

In [36]:
train['GrOverLotCat'] = pd.cut(train['GrOverLotArea'], bins=[-1, 60, 75, np.inf], labels=[1,2,3]).astype(int)

In [37]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=train, x='GrOverLotCat', y='SalePrice', hue=list(train['SalePrice']))

<AxesSubplot:xlabel='GrOverLotCat', ylabel='SalePrice'>

In [38]:
train.groupby('GrOverLotCat')['SalePrice'].describe()

| GrOverLotCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 179.0 | 11.687826 | 0.346881 | 10.460242 | 11.469411 | 11.736069 | 11.925035 | 12.363076 |
| 2 | 1205.0 | 12.037921 | 0.358598 | 10.596635 | 11.803354 | 12.016726 | 12.273731 | 13.323927 |
| 3 | 73.0 | 12.613593 | 0.383087 | 11.767568 | 12.367341 | 12.567237 | 12.904207 | 13.534473 |


In [39]:
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF'] + train['GarageArea']

In [40]:
%matplotlib widget
plt.figure(figsize=(10, 5))
fig = sns.scatterplot(data=train, x='TotalSF', y='SalePrice', hue=list(train['SalePrice']))

In [41]:
train['TotalSFCat'] = pd.cut(train['TotalSF'], bins=[-1, 2500, 4500, np.inf], labels=[1,2,3]).astype(int)

In [42]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=train, x='TotalSFCat', y='SalePrice', hue=list(train['SalePrice']))

<AxesSubplot:xlabel='TotalSFCat', ylabel='SalePrice'>

In [43]:
train.groupby('TotalSFCat')['SalePrice'].describe()

| TotalSFCat | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 449.0 | 11.647067 | 0.260677 | 10.460242 | 11.522876 | 11.693162 | 11.81303 | 12.154779 |
| 2 | 912.0 | 12.133388 | 0.275011 | 11.0021 | 11.941456 | 12.106252 | 12.323856 | 12.887127 |
| 3 | 96.0 | 12.744012 | 0.304925 | 11.834284 | 12.559366 | 12.742041 | 12.925432 | 13.534473 |


Creating dummies for all binned categories:

In [44]:
# dummy_col = ['OpenPorchCat', 'TotalBsmtCat','MasVnrAreaCat','GrOverLotCat', 'AgeInterval','TotalSFCat','YearsBTWbuiltAndRemodCat',
#              'YearBuiltCat', 'YearRemodCat']
# train = pd.get_dummies(train, columns=dummy_col, drop_first=False)
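If the dummy encoding in the commented-out cell above is re-enabled, one thing to watch is that the test set may not contain every bin. A minimal sketch with toy frames (the values are invented) of aligning the test columns to the training columns:

```python
import pandas as pd

train_toy = pd.DataFrame({"TotalSFCat": [1, 2, 3, 2]})
test_toy = pd.DataFrame({"TotalSFCat": [1, 1, 2]})  # category 3 never appears

train_d = pd.get_dummies(train_toy, columns=["TotalSFCat"], dtype=int)
test_d = pd.get_dummies(test_toy, columns=["TotalSFCat"], dtype=int)

# Align the test columns with the training columns; levels that were never
# seen in the test frame become all-zero columns.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
```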

In [45]:
# from sklearn.preprocessing import OneHotEncoder

# ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

# df = pd.DataFrame(ohe.fit_transform(train[['OpenPorchCat']]))
# df.head()

Adding descriptive stats as features

In [46]:
# colds = [col for col in train.select_dtypes(include=['number']).columns if train[col].nunique()>=10]
# # Add mean and median flag columns for features with at least 10 unique values
# for col in colds:
#     train[col+str('_median_range')] = (train[col] > train[col].median()).astype(np.int8)
#     train[col+str('_mean_range')] = (train[col] > train[col].mean()).astype(np.int8)
#     train[col+str('_q1')] = (train[col] > train[col].quantile(0.25)).astype(np.int8)
#     train[col+str('_q3')] = (train[col] > train[col].quantile(0.75)).astype(np.int8)

In [47]:
# Creating dictionary for custom transformer
# bools ={}
# colds = [col for col in train.select_dtypes(include=['number']).columns if train[col].nunique()>=10]
# # Store mean, median and quartile thresholds for features with at least 10 unique values
# for col in colds:
#     bools[col+str('_median_range')] = (col, train[col].median())
#     bools[col+str('_mean_range')] = (col, train[col].mean())
#     bools[col+str('_q1')] = (col, train[col].quantile(0.25))
#     bools[col+str('_q3')] = (col, train[col].quantile(0.75))
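The dictionary in the commented-out cell above stores `(column, threshold)` pairs so that thresholds learned on the training data can be reused elsewhere. A hedged sketch of a small helper built on that idea follows; the class name `ThresholdFlags` and its interface are hypothetical, and the toy frames are made up.

```python
import pandas as pd


class ThresholdFlags:
    """Hypothetical helper: learn thresholds on one frame, reuse them on another."""

    def __init__(self, min_unique=10):
        self.min_unique = min_unique
        self.thresholds_ = {}

    def fit(self, df):
        self.thresholds_ = {}
        for col in df.select_dtypes(include="number").columns:
            # Skip low-cardinality columns, as in the cell above.
            if df[col].nunique() >= self.min_unique:
                self.thresholds_[f"{col}_median_range"] = (col, df[col].median())
                self.thresholds_[f"{col}_mean_range"] = (col, df[col].mean())
                self.thresholds_[f"{col}_q1"] = (col, df[col].quantile(0.25))
                self.thresholds_[f"{col}_q3"] = (col, df[col].quantile(0.75))
        return self

    def transform(self, df):
        out = df.copy()
        for name, (col, value) in self.thresholds_.items():
            out[name] = (out[col] > value).astype("int8")
        return out


train_toy = pd.DataFrame({"area": range(20), "flag": [0, 1] * 10})
test_toy = pd.DataFrame({"area": [0, 100], "flag": [1, 0]})

flags = ThresholdFlags().fit(train_toy)
test_flagged = flags.transform(test_toy)
```

Fitting on the training frame only keeps the test data from influencing the thresholds.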

We would also like to see how scaling affects our features, so let's apply RobustScaler and MinMaxScaler.
Both scalings seem to have an awful effect on the correlation with SalePrice, so let's not scale anything. Tree-based models are supposed to work regardless of feature scaling.

In [48]:
# from sklearn.preprocessing import RobustScaler, MinMaxScaler
# nums = [col for col in train.select_dtypes(include='number')]
# nums.remove('SalePrice')
# scaler = RobustScaler(quantile_range=(0.5, 0.95))
# mmscaler = MinMaxScaler((-1, 1))
# rscaled = pd.DataFrame(scaler.fit_transform(train[nums]))
# rscaled.columns = nums
# mmscaled = pd.DataFrame(mmscaler.fit_transform(train[nums]))
# mmscaled.columns = nums
# train = pd.concat([train, rscaled.add_suffix('_rs'), mmscaled.add_suffix('_mms')], axis=1)
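One thing that may be worth ruling out before blaming the scalers: Pearson correlation does not change under a positive linear rescaling, so a drop in correlation can also come from how the scaled columns are joined back. Wrapping a NumPy array in `pd.DataFrame` gives it a fresh `0..n-1` index, and `pd.concat(..., axis=1)` matches rows by index label. The sketch below shows the effect on toy data, with a manual min-max scaling standing in for a scaler's output. Whether this explains the result observed in this notebook is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Non-default index, like a frame read with an Id index after dropping rows.
toy = pd.DataFrame(
    {"x": rng.normal(size=6), "y": rng.normal(size=6)},
    index=[1, 2, 4, 5, 7, 9],
)

# Manual min-max scaling to (-1, 1); a stand-in for a scaler's NumPy output.
x = toy["x"].to_numpy()
scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1

misaligned = pd.DataFrame({"x_mms": scaled})                # index 0..5
aligned = pd.DataFrame({"x_mms": scaled}, index=toy.index)  # original index kept

bad = pd.concat([toy, misaligned], axis=1)  # rows matched by label, NaNs appear
good = pd.concat([toy, aligned], axis=1)

corr_raw = toy["x"].corr(toy["y"])
corr_scaled = good["x_mms"].corr(good["y"])  # same as corr_raw up to rounding
```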


#### Categoricals into numericals

In [49]:
train['MSSubClass'] = train['MSSubClass'].astype(str)
train['OverallCond'] = train['OverallCond'].astype(str)
train['OverallQual'] = train['OverallQual'].astype(str)


In [50]:
train['HasBsmt'] = train['BsmtQual'] != 'NA'

In [51]:
train['Conditions'] = train['Condition1'] + '-' + train['Condition2']
#train['Overall'] = train['OverallQual'] + '-' + train['OverallCond'] #this is highly correlated to OverallQual and no better correlation to SalePrice
train['Roof'] = train['RoofStyle'] + '-' + train['RoofMatl']
train['Exterior'] = train['Exterior1st'] + '-' + train['Exterior2nd']
train['External'] = train['ExterQual'] + '-' + train['ExterCond']
train['Basement'] = train['BsmtQual'] + '-' + train['BsmtCond']
train['BasementFin'] = train['BsmtFinType1'] + '-' + train['BsmtFinType2']
train['Garage'] = train['GarageQual'] + '-' + train['GarageCond']
train['GarageTF'] = train['GarageType'] + '-' + train['GarageFinish']
train['HeatingCond'] = train['Heating'] + '-' + train['HeatingQC']
train['Sale'] = train['SaleType'] + '-' + train['SaleCondition']
train['Lot'] = train['LotShape'] + '-' + train['LotConfig']
train['OverallNeigh'] = train['OverallQual'] + '-' + train['Neighborhood']

In [52]:
import category_encoders as ce

meta_cat = [col for col in train.select_dtypes(include='object').columns]
# Assign the result back: fillna(inplace=True) on train[meta_cat] only changes a
# temporary copy and triggers pandas' SettingWithCopyWarning.
train[meta_cat] = train[meta_cat].fillna(value='DNA')
cb_enc = ce.CatBoostEncoder(cols=meta_cat)
cb_enc.fit(train[meta_cat], train['SalePrice'])
train[meta_cat] = cb_enc.transform(train[meta_cat])
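To give an intuition for what a target-based encoder does to these string columns, here is a simplified sketch of an ordered target statistic written in plain pandas. It is not the `category_encoders` implementation: the function name, the smoothing parameter `a` and the exact formula are assumptions made for illustration, and the toy prices are invented.

```python
import pandas as pd


def ordered_target_encode(cat, target, a=1.0):
    """Simplified ordered target statistic (illustrative, not the library code)."""
    prior = target.mean()
    grouped = target.groupby(cat)
    # Sum and count of the target over *earlier* rows of the same category,
    # so a row never sees its own target value.
    prev_sum = grouped.cumsum() - target
    prev_count = grouped.cumcount()
    return (prev_sum + prior * a) / (prev_count + a)


toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "A", "B"],
    "SalePrice": [12.0, 12.4, 11.6, 12.2, 11.8],
})
toy["Neighborhood_enc"] = ordered_target_encode(toy["Neighborhood"], toy["SalePrice"])
```

The first row of each category falls back to the global mean, and later rows drift towards the mean of their own category.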


In [53]:
#Removing highly correlated variables to the ones we have created
AfterBivs=['ExterQual', 'BsmtQual', 'HeatingQC', 'BsmtFinType1', 'SaleCondition', 'EnclosedPorch', 'PoolArea', 'YearsBTWbuiltAndSold'] #Lower correlation
dismiss = ['ScreenPorch', '3SsnPorch', 'BsmtFinSF2', 'BsmtHalfBath', 'MiscVal', 'LowQualFinSF'] #Highly skewed and low correlation
train.drop(columns=AfterBivs+dismiss, inplace=True)

num =[col for col in train.select_dtypes(include='number').columns]

In [54]:
df_numInfo = tab.numTable(train[num], 'SalePrice')
df_numInfo.to_csv('meanfillnumInfo.csv')
df_numInfo.style.background_gradient(subset=['Corr'])

| Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max | Skew | Kurt | % Missing | % Zero | Nuniques | Corr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SalePrice | 1457.0 | 12.023753 | 0.399733 | 10.460242 | 11.77452 | 12.001505 | 12.273731 | 13.534473 | 0.123 | 0.806498 | 0.0 | 0.0 | 662 | 1.0 |
| OverallNeigh | 1457.0 | 12.022962 | 0.316014 | 11.319448 | 11.812644 | 11.982166 | 12.2301 | 13.095577 | 0.423649 | -0.030952 | 0.0 | 0.0 | 97 | 0.857595 |
| TotalSF | 1457.0 | 3029.111187 | 910.116521 | 334.0 | 2388.0 | 2931.0 | 3570.0 | 7685.0 | 0.69035 | 1.141819 | 0.0 | 0.0 | 1082 | 0.855457 |
| OverallQual | 1457.0 | 12.024277 | 0.320807 | 11.124976 | 11.781268 | 11.967459 | 12.220604 | 12.970445 | 0.531248 | 0.1489 | 0.0 | 0.0 | 10 | 0.823441 |
| Neighborhood | 1457.0 | 12.023601 | 0.296225 | 11.46207 | 11.814298 | 12.089574 | 12.162715 | 12.66047 | 0.365441 | -0.497286 | 0.0 | 0.0 | 25 | 0.756733 |
| GrLivArea | 1457.0 | 7.26577 | 0.330362 | 5.811141 | 7.028201 | 7.285507 | 7.482119 | 8.406485 | -0.070542 | 0.098228 | 0.0 | 0.0 | 858 | 0.737248 |
| TotalSFCat | 1457.0 | 1.757721 | 0.56176 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 0.006534 | -0.376124 | 0.0 | 0.0 | 3 | 0.728785 |
| GarageAreaPerCars | 1457.0 | 28.732605 | 12.19877 | 0.0 | 18.330303 | 30.789609 | 33.941125 | 73.647811 | -0.210635 | 0.396596 | 0.0 | 5.56 | 488 | 0.69022 |
| External | 1457.0 | 12.024246 | 0.270501 | 11.171919 | 11.843162 | 11.843162 | 12.313264 | 12.802568 | 0.760101 | 0.264782 | 0.0 | 0.0 | 11 | 0.689615 |
| Basement | 1457.0 | 12.024422 | 0.270124 | 11.384943 | 11.81365 | 12.175649 | 12.175649 | 12.67062 | 0.63884 | 0.113701 | 0.0 | 0.0 | 12 | 0.686538 |


In [55]:
df_numInfo['Nuniques'].sum()

13226

In [56]:
corr_list = train[num].corr().abs().unstack().sort_values(kind="quicksort", ascending=False)
corr_list[corr_list<1].head(25)


GarageCars                GarageAreaPerCars           0.971737
GarageAreaPerCars         GarageCars                  0.971737
                          GarageArea                  0.970474
GarageArea                GarageAreaPerCars           0.970474
Condition1                Conditions                  0.957243
Conditions                Condition1                  0.957243
GarageCond                Garage                      0.956337
Garage                    GarageCond                  0.956337
Lot                       LotShape                    0.955272
LotShape                  Lot                         0.955272
YearsBTWbuiltAndRemodCat  YearsBTWbuiltAndRemod       0.954139
YearsBTWbuiltAndRemod     YearsBTWbuiltAndRemodCat    0.954139
YearRemodAdd              YearRemodCat                0.952013
YearRemodCat              YearRemodAdd                0.952013
YearBuilt                 YearBuiltCat                0.948429
YearBuiltCat              YearBuilt                   0
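In the listing above every pair shows up twice, once as (A, B) and once as (B, A). As a sketch, masking the correlation matrix to its strict upper triangle lists each pair once and drops the diagonal without needing the `< 1` filter. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 1.9, 3.2, 3.9, 5.1],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once and the
# diagonal (always 1) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
```

Masking by position also keeps pairs whose correlation is exactly 1, which a value filter would discard.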

In [57]:
toLog = list(df_numInfo[(df_numInfo['Skew']>1) & (df_numInfo['% Zero']==0)].index)
toLog2 = list(df_numInfo[(df_numInfo['Skew']<-1) & (df_numInfo['% Zero']==0)].index)

toRoot = list(df_numInfo[(df_numInfo['Skew']>1) & (df_numInfo['% Zero']>0)].index)
toRoot3 = list(df_numInfo[(df_numInfo['Skew']<-1) & (df_numInfo['% Zero']>0)].index)
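The selection above relies on the `Skew` and `% Zero` columns produced by the notebook's own `tab.numTable` helper. A rough equivalent using only pandas, on synthetic data, might look like this; the column names and distributions are invented, and the threshold of 1 is taken from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "strictly_positive": rng.lognormal(mean=0.0, sigma=1.0, size=500),
    "with_zeros": np.where(
        rng.random(500) < 0.4, 0.0, rng.lognormal(mean=0.0, sigma=1.0, size=500)
    ),
})

skew = toy.skew()
has_zero = (toy == 0).any()

# Right-skewed columns: log when all values are positive, cube root otherwise.
to_log = list(skew[(skew > 1) & ~has_zero].index)
to_root = list(skew[(skew > 1) & has_zero].index)

# Skew after the candidate transformation, to check that it actually helps.
after = pd.Series({
    **{c: np.log(toy[c]).skew() for c in to_log},
    **{c: np.cbrt(toy[c]).skew() for c in to_root},
})
```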




In [58]:
for col in toLog+toLog2:
    if col not in meta_cat:
        print(col)
        print(np.log(train[col]).skew())

MasVnrAreaCat
0.755999417488788
YearRemodCat
-1.074686198807721


In [59]:
## Skew does not get better in the categorical variables applying log, but we won't do so anyways
for col in toRoot+toRoot3 :
    if col not in meta_cat:
        print(col)
        print(np.cbrt(train[col]).skew())

MasVnrArea
0.7065789239844504
OpenPorchCat
0.12902841441881852
BsmtUnfPCT
-0.3233379209747692
WoodDeckSF
0.2752423770771553
HasScreenPorch
3.1091392632455572
PorchSF
1.0101533551208601
KitchenAbvGr
0.3442922997811717
HasEnclosedPorch
2.044491516874515
GarageYrBlt
-3.880938712008725
HasBasement
-6.039828623281647
HasFullBath
-12.618354617067514
HasKitchen
-38.17066936798464


In [60]:
dismiss = ['ScreenPorch', '3SsnPorch', 'BsmtFinSF2', 'BsmtHalfBath', 'MiscVal', 'LowQualFinSF']
log=['LotArea', 'GrLivArea', 'MasVnrAreaCat']
sqrt=['GarageAreaPerCars']
cbrt=['MasVnrArea', 'BsmtUnfPCT', 'WoodDeckSF', 'OpenPorchSF', 'PorchSF', 'KitchenAbvGr', 'OpenPorchCat']