## Preprocessing

In the file `src/preprocessing.py`, we perform the preprocessing steps necessary for the next phase of this project. This file does the following:

1. Load and merge data, using the `Id` feature as we did previously
2. Assign correct data types, in particular, designating which features are categorical
3. Handle missing values, using the work we did previously
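
The three steps above could look roughly like the following minimal sketch. The two toy tables, their values, and the specific fill rules (a `"None"` label and a median fill) are assumptions made purely for illustration; the actual logic lives in `src/preprocessing.py` and may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables standing in for the real source files.
numeric_part = pd.DataFrame({"Id": [1, 2, 3],
                             "LotArea": [100.0, np.nan, 300.0]})
categorical_part = pd.DataFrame({"Id": [1, 2, 3],
                                 "MSSubClass": [60, 20, 60],
                                 "FireplaceQu": [None, "TA", "Gd"]})

# 1. Merge on the shared `Id` feature and use it as the index.
merged_df = numeric_part.merge(categorical_part, on="Id").set_index("Id")

# 3. Handle missing values (assumed rules: label for categories, median for numbers).
merged_df["FireplaceQu"] = merged_df["FireplaceQu"].fillna("None")
merged_df["LotArea"] = merged_df["LotArea"].fillna(merged_df["LotArea"].median())

# 2. Designate the categorical features, including integer-coded ones.
for col in ["MSSubClass", "FireplaceQu"]:
    merged_df[col] = merged_df[col].astype("category")

print(merged_df.dtypes)
```

Filling the missing labels before casting to `category` avoids having to register `"None"` as a new category afterwards.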

In [1]:
run src/preprocessing.py

### Skew-Normalization

Next, we look at skew-normalizing our data. We have two methods that we have worked with to apply skew-normalization:

- applying a log transform
- applying a Box-Cox transform

In the past, we have seen that the Box-Cox transform removes skew from a dataset more effectively than the log transform. With this data set, however, there is another issue.


In [2]:
import scipy.stats as st

To see this issue, let's look at the `LotArea` feature from the numeric dataset.

In [3]:
box_cox_trans = st.boxcox(numeric_df['LotArea'] + 1)

  llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))
  w = xb - ((xb - xc) * tmp2 - (xb - xa) * tmp1) / denom
  tmp1 = (x - w) * (fx - fv)
  tmp2 = (x - v) * (fx - fw)
  tmp2 = numpy.abs(tmp2)
  p = (x - v) * tmp2 - (x - w) * tmp1
  tmp2 = 2.0 * (tmp2 - tmp1)


Note that applying a Box-Cox transform to this feature raises a `RuntimeWarning`. This is a known, open issue in the `scipy` library and can be tracked here: https://github.com/scipy/scipy/issues/6873. The details of the issue are complicated, but in short, a floating-point arithmetic error is introduced. As such, we cannot easily use the Box-Cox transform on this data set, and we will stick to applying a log transform.

In [4]:
numeric_log_df = np.log(numeric_df + 1)
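
As a rough sanity check on this choice, the sketch below compares the skew left behind by the two transforms. The feature here is synthetic (a log-normal sample named `toy_lot_area`, an assumption for illustration), not the real `LotArea` column, so it only shows the general pattern.

```python
import numpy as np
import scipy.stats as st

# Synthetic right-skewed feature; NOT the real LotArea data.
rng = np.random.default_rng(0)
toy_lot_area = rng.lognormal(mean=9.0, sigma=0.5, size=1000)

# Log transform, shifted by 1 as in the cell above.
log_transformed = np.log(toy_lot_area + 1)

# With no lambda supplied, boxcox returns the transformed data AND the fitted lambda.
box_cox_transformed, fitted_lambda = st.boxcox(toy_lot_area + 1)

raw_skew = st.skew(toy_lot_area)
log_skew = st.skew(log_transformed)
box_cox_skew = st.skew(box_cox_transformed)

print(f"raw: {raw_skew:.3f}, log: {log_skew:.3f}, box-cox: {box_cox_skew:.3f}")
```

On data like this, both transforms remove most of the skew, which suggests that falling back to the log transform costs little.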

### One-hot Encoding

To understand our categorical data from a numerical perspective, and ultimately to use it in a machine learning model, we need to encode it numerically. The standard way to do this is to perform a so-called "One-hot Encoding", also known as encoding with dummy variables. Using Pandas, it is possible to perform this encoding on properly typed data using the function `pd.get_dummies()`.
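
To make the encoding concrete, here is a minimal sketch on a four-row toy frame (the frame `toy_df` and its values are made up for illustration). Passing `dtype=int` is a choice made here so that the output shows 0/1 regardless of the Pandas version.

```python
import pandas as pd

# Toy frame with one properly typed categorical feature.
toy_df = pd.DataFrame(
    {"MSSubClass": pd.Categorical([20, 60, 20, 70])},
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

# One column per category; exactly one 1 in each row.
toy_encoded_df = pd.get_dummies(toy_df, dtype=int)
print(toy_encoded_df)
```

Each new column is named `<feature>_<category>`, which is why the feature name can later be used to select all the columns belonging to one encoded feature.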

In [5]:
categorical_encoded_df = pd.get_dummies(categorical_df)

In [6]:
categorical_encoded_df.shape

(1451, 359)

In [7]:
categorical_encoded_df.sample(5)

Id,MSSubClass_20,MSSubClass_30,MSSubClass_40,MSSubClass_45,MSSubClass_50,MSSubClass_60,MSSubClass_70,MSSubClass_75,MSSubClass_80,MSSubClass_85,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
201,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
579,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
1316,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
783,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
55,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0


Let us now consider the statistical description of an encoded categorical feature. Note that the categorical feature `MSSubClass` has been converted to 15 columns, one for each possible category.

In [8]:
ms_sub_class_encoded_cols = [col for col in categorical_encoded_df.columns if 'MSSubClass' in col]
ms_sub_class_encoded_cols

['MSSubClass_20',
 'MSSubClass_30',
 'MSSubClass_40',
 'MSSubClass_45',
 'MSSubClass_50',
 'MSSubClass_60',
 'MSSubClass_70',
 'MSSubClass_75',
 'MSSubClass_80',
 'MSSubClass_85',
 'MSSubClass_90',
 'MSSubClass_120',
 'MSSubClass_160',
 'MSSubClass_180',
 'MSSubClass_190']

We can use this list to filter the full categorical data frame to simply look at the encoded `MSSubClass` feature.

In [9]:
categorical_encoded_df[ms_sub_class_encoded_cols].head()

Id,MSSubClass_20,MSSubClass_30,MSSubClass_40,MSSubClass_45,MSSubClass_50,MSSubClass_60,MSSubClass_70,MSSubClass_75,MSSubClass_80,MSSubClass_85,MSSubClass_90,MSSubClass_120,MSSubClass_160,MSSubClass_180,MSSubClass_190
1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


Next, we take a sum for each column and over the whole filtered dataframe. Note that to sum over the whole data frame, we request the `.values` of the DataFrame, which converts the data into a plain NumPy array. This is because the `.sum()` method in Pandas reduces along one axis at a time, either down the rows to give one total per column (`.sum(axis=0)`, the default) or across the columns to give one total per row (`.sum(axis=1)`), whereas the `.sum()` method in NumPy, called without an axis, is performed over the entire array.

In [10]:
(categorical_encoded_df[ms_sub_class_encoded_cols].sum(), 
 categorical_encoded_df[ms_sub_class_encoded_cols].values.sum())

(MSSubClass_20     532
 MSSubClass_30      69
 MSSubClass_40       4
 MSSubClass_45      12
 MSSubClass_50     144
 MSSubClass_60     296
 MSSubClass_70      60
 MSSubClass_75      16
 MSSubClass_80      57
 MSSubClass_85      20
 MSSubClass_90      52
 MSSubClass_120     86
 MSSubClass_160     63
 MSSubClass_180     10
 MSSubClass_190     30
 dtype: int64, 1451)
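
The difference between the three kinds of sum is easiest to see on a tiny frame. The sketch below uses a made-up 4x3 one-hot frame (`toy_encoded_df` is an illustrative name) and `.to_numpy()`, which returns the same array as `.values`.

```python
import pandas as pd

# Made-up one-hot frame: each row has exactly one 1.
toy_encoded_df = pd.DataFrame({"A": [1, 0, 0, 1],
                               "B": [0, 1, 0, 0],
                               "C": [0, 0, 1, 0]})

per_column_totals = toy_encoded_df.sum(axis=0)   # one total per column
per_row_totals = toy_encoded_df.sum(axis=1)      # one total per row
grand_total = toy_encoded_df.to_numpy().sum()    # single total over the array

print(per_column_totals.tolist(), per_row_totals.tolist(), grand_total)
```

Because every row of a one-hot encoded feature contains a single 1, the grand total equals the number of rows, which matches the 1451 seen above.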

In [11]:
categorical_encoded_df[ms_sub_class_encoded_cols].shape

(1451, 15)

It is useful to think about the sparsity of the one-hot encoded data. This filtered dataframe, `categorical_encoded_df[ms_sub_class_encoded_cols]`, has a shape of `(1451, 15)`, that is, 21765 values, of which only 1451 are 1 and the rest are 0. In other words, 14 out of 15, or about 93%, of the values in this filtered dataframe are 0.
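
The same sparsity figure can be reproduced with a short sketch. The labels below are randomly generated stand-ins (an assumption for illustration), with the same number of rows and categories as the encoded `MSSubClass` feature.

```python
import numpy as np
import pandas as pd

n_rows, n_categories = 1451, 15

# Random stand-in labels; declaring all categories keeps 15 columns even if one is unobserved.
rng = np.random.default_rng(0)
labels = pd.Series(pd.Categorical(rng.integers(0, n_categories, size=n_rows),
                                  categories=list(range(n_categories))))
encoded = pd.get_dummies(labels, dtype=int)

# Fraction of entries that are zero: (k - 1) / k for a k-category one-hot encoding.
zero_fraction = (encoded.to_numpy() == 0).mean()
print(encoded.shape, round(zero_fraction, 4))
```

The fraction depends only on the number of categories, not on how the rows are distributed among them.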

Next, let's look at the mean, standard deviation, and variance of the filtered dataframe.

In [12]:
stats = pd.DataFrame()
stats['mean'] = categorical_encoded_df[ms_sub_class_encoded_cols].mean()
stats['std'] = categorical_encoded_df[ms_sub_class_encoded_cols].std()
stats['var'] = categorical_encoded_df[ms_sub_class_encoded_cols].var()
stats.sort_values('std', ascending=False)

,mean,std,var
MSSubClass_20,0.366644,0.482054,0.232376
MSSubClass_60,0.203997,0.403106,0.162494
MSSubClass_50,0.099242,0.29909,0.089455
MSSubClass_120,0.059269,0.23621,0.055795
MSSubClass_30,0.047553,0.212893,0.045323
MSSubClass_160,0.043418,0.203867,0.041562
MSSubClass_70,0.041351,0.199169,0.039668
MSSubClass_80,0.039283,0.194335,0.037766
MSSubClass_90,0.035837,0.185949,0.034577
MSSubClass_190,0.020675,0.142344,0.020262


We note that most of the one-hot encoded columns have very little variance. In a moment, we'll restrict our analysis to one-hot encoded features that have a variance greater than 0.2. Below are the first few of these features. Remember that each of these is a Boolean variable indicating whether or not a row has that particular category value.

In [13]:
stats = pd.DataFrame()
stats['mean'] = categorical_encoded_df.mean()
stats['std'] = categorical_encoded_df.std()
stats['var'] = categorical_encoded_df.var()
categorical_encoded_features_significant_variance_stats = stats[stats['var'] > 0.2].sort_values('std', ascending=False)
categorical_encoded_features_insignificant_variance_stats = stats[stats['var'] <= 0.2].sort_values('std', ascending=False)
categorical_encoded_features_significant_variance_stats.head(5)

,mean,std,var
HouseStyle_1Story,0.496899,0.500163,0.250163
HeatingQC_Ex,0.505858,0.500138,0.250138
KitchenQual_TA,0.505858,0.500138,0.250138
FullBath_2,0.524466,0.499573,0.249573
Fireplaces_0,0.472088,0.499392,0.249393
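
The variances in these tables can be cross-checked against a closed form. For a 0/1 column with mean $p$ over $n$ rows, the sample variance reported by Pandas (which divides by $n - 1$) is $p(1-p)\,n/(n-1)$. The sketch below rebuilds a column with the same count of ones as `MSSubClass_20` (532 out of 1451, taken from the column sums above); the row order is arbitrary and does not affect the result.

```python
import pandas as pd

n_rows = 1451
count_ones = 532  # total for MSSubClass_20 from the column sums above

# Rebuild an indicator column with the same number of ones.
indicator = pd.Series([1] * count_ones + [0] * (n_rows - count_ones))

p = indicator.mean()
closed_form_var = p * (1 - p) * n_rows / (n_rows - 1)
sample_var = indicator.var()  # Pandas default: ddof=1

print(round(p, 6), round(sample_var, 6), round(closed_form_var, 6))
```

Since $p(1-p)$ peaks at 0.25 when $p = 0.5$, a well-balanced indicator can show a sample variance very slightly above 0.25, as `HouseStyle_1Story` does.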


### Gelman Scaling 

This data set is mixed: it includes both numerical and categorical features. In his [2007 paper](http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf), Gelman outlines a simple adjustment to the standard scaling technique we have been using that can help with mixed datasets. The standard scaling technique performs the following transformation:

$$Z = \frac{X-\mu}{\sigma}$$

Gelman proposes this alternative:

$$Z_g = \frac{X-\mu}{2\sigma}$$

Here, we explore the implications of this.

In [14]:
numeric_log_df.shape, categorical_df.shape

((1451, 23), (1451, 56))

First, let's look at the statistics for data prepared by each scaling technique. For ease of viewing, we will only look at the first five features.

In [15]:
numeric_first_five_features = numeric_log_df.columns[:5]

In [16]:
numeric_log_std_sc_df = (numeric_log_df - numeric_log_df.mean())/numeric_log_df.std()
numeric_log_gel_sc_df = (numeric_log_df - numeric_log_df.mean())/(2*numeric_log_df.std())

In [17]:
stats = pd.DataFrame()
stats['mean'] = numeric_log_std_sc_df[numeric_first_five_features].mean()
stats['std'] = numeric_log_std_sc_df[numeric_first_five_features].std()
stats['var'] = numeric_log_std_sc_df[numeric_first_five_features].var()
stats

,mean,std,var
LotFrontage,2.199301e-14,1.0,1.0
LotArea,8.004126e-15,1.0,1.0
YearBuilt,-8.855706e-14,1.0,1.0
YearRemodAdd,1.895718e-13,1.0,1.0
MasVnrArea,-8.185504e-16,1.0,1.0


In [18]:
stats = pd.DataFrame()
stats['mean'] = numeric_log_gel_sc_df[numeric_first_five_features].mean()
stats['std'] = numeric_log_gel_sc_df[numeric_first_five_features].std()
stats['var'] = numeric_log_gel_sc_df[numeric_first_five_features].var()
stats

,mean,std,var
LotFrontage,1.099651e-14,0.5,0.25
LotArea,4.002063e-15,0.5,0.25
YearBuilt,-4.427853e-14,0.5,0.25
YearRemodAdd,9.478592e-14,0.5,0.25
MasVnrArea,-4.092752e-16,0.5,0.25
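
The halved standard deviation in this table follows directly from the definition of $Z_g$. As a short derivation, with $\mu$ and $\sigma$ denoting the mean and standard deviation of the unscaled feature $X$:

$$\operatorname{Var}(Z_g) = \operatorname{Var}\left(\frac{X-\mu}{2\sigma}\right) = \frac{\operatorname{Var}(X)}{4\sigma^2} = \frac{\sigma^2}{4\sigma^2} = \frac{1}{4}, \qquad \operatorname{sd}(Z_g) = \sqrt{\tfrac{1}{4}} = \frac{1}{2}$$

The same argument with $\sigma$ in place of $2\sigma$ gives $\operatorname{Var}(Z) = 1$ for standard scaling.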


The tables above confirm the effect of each scaling. With standard scaling, the standard deviation $\sigma$ of a scaled feature is 1 and the variance $\sigma^2$ is also 1. With Gelman scaling, the standard deviation $\sigma$ of a scaled feature is 0.5 and the variance $\sigma^2$ is 0.25. (These variances are the diagonal entries of the covariance matrix of each scaled data set.) Compare this to the standard deviation and variance of the categorical features:

In [19]:
categorical_encoded_features_significant_variance_stats.head(5)

,mean,std,var
HouseStyle_1Story,0.496899,0.500163,0.250163
HeatingQC_Ex,0.505858,0.500138,0.250138
KitchenQual_TA,0.505858,0.500138,0.250138
FullBath_2,0.524466,0.499573,0.249573
Fireplaces_0,0.472088,0.499392,0.249393


Note that with Gelman scaling, we are able to directly compare one-hot encoded categorical features with significant variance to our numerical features. 

Gelman notes:

> Our procedure scales inputs to be comparable with binary variables that are roughly symmetric: if the probability falls between 0.3 and 0.7, then 2 standard deviations will be between 0.9 and 1. Highly skewed binary inputs still create difficulty in interpretation, however; for example, two standard deviations for a 90 per cent/10 per cent binary variable come to only 0.6. Thus, leaving this binary variable unscaled is not quite equivalent to dividing by two standard deviations. One might argue, however, that when considering rare subsets of the population, a full comparison from 0 to 1 could overstate the importance of the predictor in the regression, hence it might be reasonable to consider this two-standard-deviation comparison, which is less than the comparison of the extremes. Our main point, however, is that 2 standard deviations is a more reasonable scaling than 1—even if neither automatic approach solves all problems of interpretation.

Following these guidelines, we will only compare Gelman scaled numeric features and one-hot encoded categorical features with a variance above 0.2. Note that a variance of 0.2 corresponds approximately to features for which "2 standard deviations will be between 0.9 and 1".
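
The numbers in Gelman's quote, and the link to the 0.2 threshold, can be reproduced with a few lines of arithmetic. This sketch uses the population formula $\sqrt{p(1-p)}$ for the standard deviation of a binary variable; the helper name `two_standard_deviations` is made up for illustration.

```python
import math

def two_standard_deviations(p):
    """Two standard deviations of a binary variable with P(X = 1) = p."""
    return 2 * math.sqrt(p * (1 - p))

# Reproduce the cases mentioned in the quote.
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p = {p}: 2 sd = {two_standard_deviations(p):.3f}")

# Two standard deviations at the variance threshold used in this notebook.
variance_threshold = 0.2
two_sd_at_threshold = 2 * math.sqrt(variance_threshold)

# Smallest p whose variance p(1 - p) reaches the threshold.
lower_p = (1 - math.sqrt(1 - 4 * variance_threshold)) / 2
print(f"2 sd at threshold = {two_sd_at_threshold:.3f}, p from {lower_p:.3f} to {1 - lower_p:.3f}")
```

A variance of 0.2 therefore sits just below the 0.3 to 0.7 probability band that Gelman describes, which is why it works as an approximate cutoff.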

In [20]:
categorical_encoded_features_significant_variance = categorical_encoded_df[categorical_encoded_features_significant_variance_stats.index]
categorical_encoded_features_insignificant_variance = categorical_encoded_df[categorical_encoded_features_insignificant_variance_stats.index]

In [21]:
categorical_encoded_features_significant_variance.columns

Index(['HouseStyle_1Story', 'HeatingQC_Ex', 'KitchenQual_TA', 'FullBath_2',
       'Fireplaces_0', 'FireplaceQu_None', 'BedroomAbvGr_3', 'FullBath_1',
       'Fireplaces_1', 'BsmtQual_TA', 'Foundation_PConc', 'OverallCond_5',
       'GarageCars_2', 'Foundation_CBlock', 'BsmtQual_Gd', 'GarageFinish_Unf',
       'BsmtFullBath_0', 'MasVnrType_None', 'GarageType_Attchd',
       'BsmtFullBath_1', 'KitchenQual_Gd', 'ExterQual_TA', 'HalfBath_0',
       'LotShape_Reg', 'MSSubClass_20', 'HalfBath_1', 'Exterior1st_VinylSd',
       'BsmtExposure_No', 'Exterior2nd_VinylSd', 'LotShape_IR1',
       'ExterQual_Gd', 'MasVnrType_BrkFace', 'HouseStyle_2Story',
       'BsmtFinType1_Unf', 'HeatingQC_TA', 'GarageFinish_RFn',
       'BsmtFinType1_GLQ', 'LotConfig_Inside', 'TotRmsAbvGrd_6'],
      dtype='object')

In [22]:
categorical_encoded_features_insignificant_variance.columns

Index(['OverallQual_5', 'GarageType_Detchd', 'FireplaceQu_Gd', 'OverallQual_6',
       'GarageCars_1', 'BedroomAbvGr_2', 'GarageFinish_Fin', 'TotRmsAbvGrd_7',
       'RoofStyle_Gable', 'OverallQual_7',
       ...
       'Exterior1st_AsphShn', 'Condition2_RRAn', 'Exterior1st_ImStucc',
       'Condition2_RRAe', 'RoofMatl_Roll', 'RoofMatl_ClyTile', 'Heating_Floor',
       'Exterior2nd_CBlock', 'Exterior1st_CBlock', 'MiscFeature_TenC'],
      dtype='object', length=320)

### Centering Categorical Features with Significant Variance

We apply one last transformation to the `categorical_encoded_features_significant_variance` dataframe. Namely, we subtract the mean from each column. In doing this, we get an apples-to-apples comparison between the numeric features and the categorical features with significant variance. Simply subtracting the mean is known as centering:

$$Z_c = X - \mu$$

In [23]:
categorical_encoded_features_significant_variance_centered = (categorical_encoded_features_significant_variance - 
                                                              categorical_encoded_features_significant_variance.mean())
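
Centering shifts a column without changing its spread, which is what keeps the comparison with the Gelman scaled numeric features fair. A minimal sketch on a made-up indicator column (the values below are illustrative, not taken from the data):

```python
import pandas as pd

# Made-up balanced indicator column.
toy_indicator = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="HouseStyle_1Story")

# Centering: subtract the column mean.
centered = toy_indicator - toy_indicator.mean()

print(centered.mean(), toy_indicator.var(), centered.var())
```

After centering, the column has mean 0 like the scaled numeric features, while its variance is untouched.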

This work has been added to `src/preprocessing.py` for use as we continue.