# Outlier Analysis

In [2]:
import pandas as pd
import numpy as np

In [3]:
housing_box_sc = pd.read_pickle('final_box_cox_sc.p')

## Using Tukey's Method to Identify Outliers
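Here I lowered the Tukey parameter (the multiplier applied to the interquartile range) from the conventional 1.5 to 1 in order to flag more outliers. With the 1.5 parameter, only 3 features (RM, B, and MEDV) contained any outliers; with a parameter of 1, LSTAT is flagged as well.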

In [4]:
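def display_outliers(dataframe, col, param=1):
    """Return the rows of `dataframe` whose `col` value lies outside the Tukey fences."""
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Fence width: `param` times the interquartile range (Q3 - Q1)
    tukey_window = param*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    # A row is flagged if it falls below the lower fence or above the upper fence
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]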
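
To make the effect of the parameter concrete, here is a minimal sketch on a toy column. The helper `tukey_fences` and the `toy` data are invented for illustration and are not part of the housing analysis; the fences are computed the same way as in `display_outliers`.

```python
import numpy as np
import pandas as pd

def tukey_fences(values, param=1.5):
    """Return the (lower, upper) Tukey fences for a 1-D collection of values."""
    q1, q3 = np.percentile(values, [25, 75])
    window = param * (q3 - q1)  # fence width: param times the IQR
    return q1 - window, q3 + window

# Toy data, made up for this example only (not from the housing set).
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 12]})

for param in (1.5, 1):
    lower, upper = tukey_fences(toy["x"], param)
    flagged = toy[(toy["x"] < lower) | (toy["x"] > upper)]
    print(param, float(lower), float(upper), list(flagged["x"]))
```

For this toy column the quartiles are 3 and 7, so the fences are (-3, 13) at 1.5 and (-1, 11) at 1. The value 12 is flagged only with the tighter parameter, which is the same mechanism that produces more flagged rows in the housing data.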

In [5]:
for col in housing_box_sc:
    print(col, display_outliers(housing_box_sc, col, param=1.5).shape)

CRIM (0, 12)
INDUS (0, 12)
NOX (0, 12)
RM (26, 12)
AGE (0, 12)
DIS (0, 12)
RAD (0, 12)
TAX (0, 12)
PTRATIO (0, 12)
B (70, 12)
LSTAT (0, 12)
MEDV (50, 12)


In [6]:
for col in housing_box_sc:
    print(col, display_outliers(housing_box_sc, col, param=1).shape)

CRIM (0, 12)
INDUS (0, 12)
NOX (0, 12)
RM (55, 12)
AGE (0, 12)
DIS (0, 12)
RAD (0, 12)
TAX (0, 12)
PTRATIO (0, 12)
B (84, 12)
LSTAT (5, 12)
MEDV (79, 12)


## Identify instances that are an outlier in...

In [11]:
from collections import Counter

In [7]:
raw_outliers = []
for col in housing_box_sc:
    outlier_df = display_outliers(housing_box_sc, col)
    raw_outliers += list(outlier_df.index)

### Exactly 1 feature

In [20]:
outlier_count = Counter(raw_outliers)
outliers1 = [k for k,v in outlier_count.items() if v == 1]
len(outliers1)

95

### Exactly 2 features

In [21]:
outliers2 = [k for k,v in outlier_count.items() if v == 2]
len(outliers2)

56

### More than 2 features

In [23]:
outliersmore = [k for k,v in outlier_count.items() if v > 2]
len(outliersmore)

5

### Proportions of total data

In [28]:
print(len(outliers1)/len(housing_box_sc))
print(len(outliers2)/len(housing_box_sc))
print(len(outliersmore)/len(housing_box_sc))

0.18774703557312253
0.11067193675889328
0.009881422924901186
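
The per-instance counts and proportions can also be obtained in one pass with a boolean mask over the whole frame. This is a sketch of an alternative, not the approach used above: `outlier_feature_counts` and the `toy` frame are hypothetical names and data introduced for illustration. Unlike the `Counter` approach, rows that are never flagged appear here with a count of 0.

```python
import numpy as np
import pandas as pd

def outlier_feature_counts(dataframe, param=1):
    """Count, for each row, how many columns fall outside that column's Tukey fences."""
    q1 = dataframe.quantile(0.25)
    q3 = dataframe.quantile(0.75)
    window = param * (q3 - q1)
    # Compare every cell against the fences of its own column.
    below = dataframe.lt(q1 - window, axis="columns")
    above = dataframe.gt(q3 + window, axis="columns")
    return (below | above).sum(axis=1)

# Toy frame, made up for this example only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 20],
    "b": [5, 5, 6, 6, 7, 7, 8, 8, 30],
    "c": [-9, 1, 1, 2, 2, 3, 3, 4, 4],
})

counts = outlier_feature_counts(toy)
summary = counts.value_counts().sort_index()  # rows per number of outlier features
print(summary)
print(summary / len(toy))                     # the same, as proportions of the data
```

In the toy frame, the first row is an outlier in one column and the last row in two, so `summary` reports seven rows with 0, one row with 1, and one row with 2 outlier features.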


## Discussion
I will definitely drop the 5 instances that are outliers in more than 2 features, as they represent less than 1 percent of the total data.

For the 95 instances that are outliers in exactly 1 feature and the 56 that are outliers in exactly 2, I will likely run the regression both with and without them to see how they affect the performance of the model.
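
A with-and-without comparison could be sketched as follows. The regression itself is not shown in this notebook, so everything here is an assumption: the sketch uses ordinary least squares via `np.linalg.lstsq` on synthetic data, and `fit_rmse`, `demo`, and `flagged` are hypothetical names. In the real analysis, `flagged` would be one of the index lists built above (for example `outliersmore`) and the target would be `MEDV`.

```python
import numpy as np
import pandas as pd

def fit_rmse(frame, target):
    """Fit ordinary least squares with an intercept and return the in-sample RMSE."""
    X = frame.drop(columns=target).to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    y = frame[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))

# Synthetic stand-in data (not the housing set).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
demo["y"] = 2 * demo["x1"] - demo["x2"] + rng.normal(scale=0.1, size=100)
demo.loc[[0, 1, 2], "y"] += 10  # inject three artificial outliers

flagged = [0, 1, 2]  # hypothetical list of outlier index labels
rmse_all = fit_rmse(demo, "y")
rmse_trimmed = fit_rmse(demo.drop(index=flagged), "y")
print(rmse_all, rmse_trimmed)
```

Note that the two in-sample errors are measured on different rows, so they are not directly comparable as model performance. A fairer comparison would score both fitted models on the same held-out set.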