## Identifying and Removing Outliers

In [None]:
%run load_data.py

In [None]:
%matplotlib inline

To identify outliers in the data, we will use [the Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/). This method:

- leverages the interquartile range (IQR)
- isn’t dependent on distributional assumptions
- ignores the mean and standard deviation, which makes it resistant to the influence of extreme values

**Tukey's Method:** flag points that lie more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile, that is, outside the interval `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`, where `IQR = Q3 - Q1`.
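
As a quick sanity check of the rule, here is a minimal sketch on a small, made-up array (the values are hypothetical and not taken from the customer data). It computes the quartiles with `np.percentile`, derives the two fences, and keeps only the points outside them.

```python
import numpy as np

# Hypothetical toy data, chosen so that one value is obviously extreme.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles (NumPy's default linear interpolation).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points strictly outside the fences are flagged as outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

For this array the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5 and only the value 100 is flagged.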

In [None]:
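import numpy as np  # harmless if load_data.py has already imported it

def feature_outliers(dataframe, col, param=1.5):
    """Return the rows of `dataframe` whose `col` value lies outside the Tukey fences."""
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Distance beyond the quartiles at which a point counts as an outlier
    tukey_window = param*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe.loc[tukey_mask]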

In [None]:
feature_outliers(customer_features, 'Grocery')

In [None]:
# Iterate over the columns of the same frame that is being checked
for col in customer_features:
    print(col, feature_outliers(customer_features, col).shape)

What if we count only the rows that are flagged as outliers in more than one feature?

In [None]:
from collections import Counter

In [None]:
def multiple_outliers(dataframe, count=2):
    """Return index labels of rows flagged as outliers in at least `count` columns."""
    raw_outliers = []
    for col in dataframe:
        outlier_df = feature_outliers(dataframe, col)
        raw_outliers += list(outlier_df.index)

    # Number of columns in which each row index was flagged
    outlier_count = Counter(raw_outliers)
    outliers = [k for k,v in outlier_count.items() if v >= count]
    return outliers
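
The same per-row count can also be sketched without `Counter`, by building a boolean frame of flags and summing across columns. This is only an illustrative alternative, not the notebook's method: `outlier_flags` and the `toy` frame are hypothetical names and data, and the sketch assumes every column is numeric. It uses `DataFrame.quantile`, whose default linear interpolation matches `np.percentile`.

```python
import pandas as pd

def outlier_flags(dataframe, param=1.5):
    """Boolean frame: True where a value lies outside its column's Tukey fences."""
    q1 = dataframe.quantile(0.25)
    q3 = dataframe.quantile(0.75)
    window = param * (q3 - q1)
    # Comparing a DataFrame with a Series aligns the Series on the columns.
    return (dataframe < q1 - window) | (dataframe > q3 + window)

# Hypothetical toy frame: row 9 is extreme in "a" and "b", row 8 only in "c".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500],
    "c": [5, 5, 6, 6, 7, 7, 8, 8, 90, 9],
})

flags = outlier_flags(toy)
per_row = flags.sum(axis=1)  # number of columns in which each row is flagged
repeat_offenders = per_row[per_row >= 2].index.tolist()

print(per_row.tolist())
print(repeat_offenders)
```

Here only row 9 is flagged in two columns, so it is the only index returned. A side benefit of this form is that `per_row` keeps the full count for every row, which makes it easy to try different thresholds.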

In [None]:
len(multiple_outliers(customer_features))

In [None]:
len(multiple_outliers(customer_sc_df))

In [None]:
len(multiple_outliers(customer_log_sc_df))

In [None]:
len(multiple_outliers(customer_box_cox_sc_df))

In [None]:
customer_log_sc_df.shape

In [None]:
_, ax = plt.subplots(1,4,figsize=(20,6))

for i, df in enumerate([customer_features, customer_sc_df, customer_log_sc_df, customer_box_cox_sc_df]):
    # Pass the frame by keyword so it is treated as wide-form data across seaborn versions
    sns.boxplot(data=df, ax=ax[i])

In [None]:
customer_features_outliers_removed = customer_features.drop(multiple_outliers(customer_features))
customer_sc_df_outliers_removed = customer_sc_df.drop(multiple_outliers(customer_sc_df))
customer_log_sc_df_outliers_removed = customer_log_sc_df.drop(multiple_outliers(customer_log_sc_df))
customer_box_cox_sc_df_outliers_removed = customer_box_cox_sc_df.drop(multiple_outliers(customer_box_cox_sc_df))

In [None]:
(customer_features_outliers_removed.shape,
 customer_sc_df_outliers_removed.shape,
 customer_log_sc_df_outliers_removed.shape,
 customer_box_cox_sc_df_outliers_removed.shape)