## Remove Outliers

In the file `src/preprocessing.py`, we perform the preprocessing steps necessary for the next phase of this project. This file does the following:

1. Load and merge data, using the `Id` feature as we did previously
2. Assign correct data types, in particular, designating which features are categorical
3. Handle missing values, using the work we did previously
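4. Preprocess the features: log-transform and scale the numeric features, and one-hot encode the categorical features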

In [1]:
run src/preprocessing.py

#### Complete Feature Sets

The feature engineering is becoming fairly involved, so it is worth listing the data sets we are creating to keep track of what we have.

##### Standard Scaled Data Set

One data set is the standard scaled data set. For this data set, there is no need to separate the encoded categorical features. Together, these two dataframes comprise a complete data set:

- Log Transformed, Standard Scaled Numerical Features (`numeric_log_std_sc_df`)
- Complete One-hot Encoded Categorical Features (`categorical_encoded_df`)
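
As a minimal sketch of how two such dataframes can be combined into one complete data set, the example below concatenates toy stand-ins along the column axis. The toy values and the column names `LotArea` and `Street_Pave` are illustrative assumptions; the real dataframes come from `src/preprocessing.py` and are assumed to share the same row index.

```python
import pandas as pd

# Toy stand-ins (assumed values) for the numeric and categorical dataframes.
numeric_toy = pd.DataFrame({"LotArea": [-0.5, 0.1, 1.2]}, index=[1, 2, 3])
categorical_toy = pd.DataFrame({"Street_Pave": [1, 1, 0]}, index=[1, 2, 3])

# Concatenating along columns aligns the rows on the shared index.
standard_toy = pd.concat([numeric_toy, categorical_toy], axis=1)
print(standard_toy.shape)
```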

##### Gelman Scaled Data Set

The other data set is the Gelman scaled data set. For this data set, we have separated the encoded categorical features based on a variance threshold. Together, these three dataframes comprise a complete data set:

- Log Transformed, Gelman Scaled Numerical Features (`numeric_log_gel_sc_df`)
- One-hot Encoded Categorical Features with Significant Variance, Centered (`categorical_encoded_features_significant_variance_centered`)
- One-hot Encoded Categorical Features with Insignificant Variance (`categorical_encoded_features_insignificant_variance`)
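
The sketch below illustrates one way the pieces of this data set could be produced. It is not the code in `src/preprocessing.py`, which is not shown here. It assumes that Gelman scaling means centering and dividing by two standard deviations, that the log transform is `np.log1p`, and that the variance threshold is `0.25`; the threshold, the helper names, and the toy columns are all hypothetical.

```python
import numpy as np
import pandas as pd

def standard_scale(df):
    # Center each column and divide by one standard deviation.
    return (df - df.mean()) / df.std()

def gelman_scale(df):
    # Assumed definition: center and divide by two standard deviations.
    return (df - df.mean()) / (2 * df.std())

# Toy numeric feature (assumed values), log-transformed before scaling.
toy = pd.DataFrame({"GrLivArea": [900.0, 1200.0, 1500.0, 2400.0, 4000.0]})
log_toy = np.log1p(toy)
std_toy = standard_scale(log_toy)
gel_toy = gelman_scale(log_toy)

# Toy one-hot encoded features, split on a hypothetical variance threshold.
encoded_toy = pd.DataFrame({"Street_Pave": [1, 1, 1, 1, 0],
                            "CentralAir_Y": [1, 0, 1, 0, 1]})
variances = encoded_toy.var()
threshold = 0.25  # hypothetical value, not taken from the source
significant = encoded_toy.loc[:, variances > threshold]
significant_centered = significant - significant.mean()
insignificant = encoded_toy.loc[:, variances <= threshold]
```

Under these assumptions the Gelman scaled values are exactly half the standard scaled values, which puts the scaled numeric features on a scale comparable to the binary features.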




### Identify Outliers

Next, we will work with the numeric features to identify outliers. As before, we will use the Tukey Method.

In [2]:
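import numpy as np  # harmless if src/preprocessing.py already imported it

def display_outliers(dataframe, col, param=1.5):
    """Return the rows of `dataframe` whose `col` value lies outside the Tukey fences."""
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Fence half-width: `param` times the interquartile range
    tukey_window = param*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]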
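
To make the Tukey rule concrete, here is a small worked example on a toy column. The helper `tukey_outliers` is a compact, hypothetical restatement of the function above so that the snippet runs on its own, and the data are made up.

```python
import numpy as np
import pandas as pd

def tukey_outliers(dataframe, col, param=1.5):
    # Quartiles and fence half-width (param times the interquartile range).
    q1, q3 = np.percentile(dataframe[col], [25, 75])
    window = param * (q3 - q1)
    mask = (dataframe[col] < q1 - window) | (dataframe[col] > q3 + window)
    return dataframe[mask]

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})

# Q1 = 2.25 and Q3 = 4.75, so the fences are -1.5 and 8.5;
# only the value 100.0 falls outside them.
flagged = tukey_outliers(toy, "x")
print(flagged)
```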

In [3]:
print("{:20} {:12} {}".format("Column", "Standard", "Gelman"))
print("--------------------------------------------")
for col in numeric_log_std_sc_df.columns:
    print("{:20} {:12} {}".format(col,
                               str(display_outliers(numeric_log_std_sc_df, col).shape),
                               str(display_outliers(numeric_log_gel_sc_df, col).shape)
                              ))

Column               Standard     Gelman
--------------------------------------------
LotFrontage          (122, 23)    (122, 23)
LotArea              (128, 23)    (128, 23)
YearBuilt            (9, 23)      (9, 23)
YearRemodAdd         (0, 23)      (0, 23)
MasVnrArea           (0, 23)      (0, 23)
BsmtFinSF1           (0, 23)      (0, 23)
BsmtFinSF2           (167, 23)    (167, 23)
BsmtUnfSF            (125, 23)    (125, 23)
TotalBsmtSF          (52, 23)     (52, 23)
FirstFlrSF           (7, 23)      (7, 23)
SecondFlrSF          (0, 23)      (0, 23)
LowQualFinSF         (26, 23)     (26, 23)
GrLivArea            (10, 23)     (10, 23)
GarageYrBlt          (1, 23)      (1, 23)
GarageArea           (84, 23)     (84, 23)
WoodDeckSF           (0, 23)      (0, 23)
OpenPorchSF          (0, 23)      (0, 23)
EnclosedPorch        (207, 23)    (207, 23)
ThreeSsnPorch        (24, 23)     (24, 23)
ScreenPorch          (116, 23)    (116, 23)
PoolArea             (7, 23)      (7, 23)
MiscVal       

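Note that both scaling techniques return the same number of outliers for every feature. This is expected: assuming each scaler only shifts a feature and rescales it by a positive constant, the quartiles and the interquartile range transform in the same way, so the Tukey fences move with the data.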
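
A quick numerical check of this observation on synthetic data is sketched below. The helper `tukey_mask` and the lognormal sample are illustrative assumptions, and Gelman scaling is again assumed to mean dividing by two standard deviations.

```python
import numpy as np

def tukey_mask(values, param=1.5):
    # Boolean mask of the values that fall outside the Tukey fences.
    q1, q3 = np.percentile(values, [25, 75])
    window = param * (q3 - q1)
    return (values < q1 - window) | (values > q3 + window)

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)

standard = (x - x.mean()) / x.std()
gelman = (x - x.mean()) / (2 * x.std())

# Both scalings flag exactly the same positions.
same = np.array_equal(tukey_mask(standard), tukey_mask(gelman))
print(same)
```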

### Count Multiple Outliers

As before, we will count the rows that are outliers for more than one feature.

In [4]:
from collections import Counter

In [5]:
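def multiple_outliers(dataframe, count=2):
    """Return the index labels of rows that are outliers in at least `count` features."""
    raw_outliers = []
    for col in dataframe:
        # Use the Tukey helper defined above
        outlier_df = display_outliers(dataframe, col)
        raw_outliers += list(outlier_df.index)

    # Tally how many features flag each row
    outlier_count = Counter(raw_outliers)
    outliers = [k for k,v in outlier_count.items() if v >= count]
    return outliers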
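
The counting step can be seen in isolation with a toy example. The row labels and the dictionary `outlier_rows_by_feature` below are made up for illustration.

```python
from collections import Counter

# Hypothetical row labels flagged as outliers for each feature.
outlier_rows_by_feature = {
    "LotArea": [3, 7, 9],
    "GrLivArea": [7, 9],
    "GarageArea": [9, 12],
}

# Flatten the labels and count how many features flag each row.
raw = [row for rows in outlier_rows_by_feature.values() for row in rows]
counts = Counter(raw)

at_least_two = sorted(k for k, v in counts.items() if v >= 2)
at_least_three = sorted(k for k, v in counts.items() if v >= 3)
print(at_least_two, at_least_three)
```

Row 9 is flagged by all three features and row 7 by two of them, so raising the threshold from two to three keeps only row 9.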

In [6]:
len(multiple_outliers(numeric_log_std_sc_df)), len(multiple_outliers(numeric_log_gel_sc_df))

(292, 292)

Again, the two scaling techniques return the same number of multiple outliers. Unfortunately, removing this many rows would cost us approximately 20% of our data, which is an unacceptable loss.

In [7]:
len(multiple_outliers(numeric_log_std_sc_df))/numeric_log_std_sc_df.shape[0]

0.20124052377670573

We raise the number of features in which a row must be an outlier and reassess.

In [8]:
print(len(multiple_outliers(numeric_log_std_sc_df, count=4)), len(multiple_outliers(numeric_log_gel_sc_df, count=4)))
print(len(multiple_outliers(numeric_log_std_sc_df, count=4))/numeric_log_std_sc_df.shape[0])

20 20
0.013783597518952447


In [9]:
print(len(multiple_outliers(numeric_log_std_sc_df, count=5)), len(multiple_outliers(numeric_log_gel_sc_df, count=5)))
print(len(multiple_outliers(numeric_log_std_sc_df, count=5))/numeric_log_std_sc_df.shape[0])

7 7
0.004824259131633356


Instances that are outliers in four or more features amount to about 1.4% of the data; instances that are outliers in five or more features amount to about 0.5%. Both represent acceptable losses. Here we remove the instances that are outliers in five or more features.
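
The trade-off between the threshold and the share of rows lost can also be tabulated in one pass. The sketch below is a vectorized alternative on synthetic heavy-tailed data, not the notebook's own code; the helper `outlier_counts` is hypothetical.

```python
import numpy as np
import pandas as pd

def outlier_counts(dataframe, param=1.5):
    # For each row, count the columns in which it lies outside the Tukey fences.
    q1 = dataframe.quantile(0.25)
    q3 = dataframe.quantile(0.75)
    window = param * (q3 - q1)
    mask = (dataframe < q1 - window) | (dataframe > q3 + window)
    return mask.sum(axis=1)

# Synthetic heavy-tailed data standing in for the scaled numeric features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.standard_t(df=3, size=(1000, 8)))

counts = outlier_counts(toy)
# Fraction of rows that would be dropped for each threshold.
fractions = {c: float((counts >= c).mean()) for c in range(1, 6)}
for c, frac in fractions.items():
    print(c, round(frac, 4))
```

The fraction can only shrink as the threshold grows, so the threshold can be chosen as the smallest value whose loss is acceptable.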

In [10]:
numeric_log_std_sc_out_rem_df = numeric_log_std_sc_df.drop(multiple_outliers(numeric_log_std_sc_df, 5))
numeric_log_gel_sc_out_rem_df = numeric_log_gel_sc_df.drop(multiple_outliers(numeric_log_gel_sc_df, 5))
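
If the categorical dataframes are later combined with these numeric dataframes, the same rows presumably need to be dropped from them so that the indexes stay aligned. The source does not show that step; the following is a sketch on toy data, with made-up values and a hypothetical `rows_to_drop` list.

```python
import pandas as pd

numeric_toy = pd.DataFrame({"LotArea": [0.1, 5.0, -0.3, 0.2]}, index=[1, 2, 3, 4])
categorical_toy = pd.DataFrame({"Street_Pave": [1, 0, 1, 1]}, index=[1, 2, 3, 4])

# Hypothetical index labels of the rows identified as multiple outliers.
rows_to_drop = [2]

# Drop the same labels from both dataframes so they stay aligned.
numeric_kept = numeric_toy.drop(rows_to_drop)
categorical_kept = categorical_toy.drop(rows_to_drop)

complete_toy = pd.concat([numeric_kept, categorical_kept], axis=1)
print(complete_toy.shape)
```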