# Identifying and Removing Outliers

To identify outliers in the data, we will use [the Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/).

Under this method, a point is flagged as an outlier if it lies more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile.
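As a quick illustration of the rule, here is a minimal sketch on a made-up array (the values and the variable names are assumptions for illustration, not part of the customer data). It computes the two fences and keeps only the values that fall outside them.

```python
import numpy as np

# Toy data, invented for illustration: eight ordinary values and one extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# First and third quartiles, then the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, flagged)
```

Here the quartiles are 3 and 7, so the fences sit at -3 and 13 and only the value 100 is flagged.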

In [2]:
import pandas as pd
import numpy as np

In [3]:
customer_log_sc_df = pd.read_pickle('final_log_sc.p')

In [4]:
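def display_outliers(dataframe, col, param=1.5):
    """Return the rows of `dataframe` whose `col` value lies outside the Tukey fences."""
    # First and third quartiles of the column
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Fence width: `param` times the interquartile range
    tukey_window = param*(Q3-Q1)
    # Flag values below the lower fence or above the upper fence
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]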

In [5]:
# Only the last expression in a cell is displayed, so the output below is for 'Grocery'.
display_outliers(customer_log_sc_df, 'Milk')
display_outliers(customer_log_sc_df, 'Grocery')

|     | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| --- | --- | --- | --- | --- | --- | --- |
| 75 | 0.806722 | -1.004409 | -6.585828 | 0.849171 | -3.308406 | 0.165965 |
| 154 | -1.554127 | -3.808515 | -3.158292 | -2.325583 | -2.815523 | -3.502289 |


In [4]:
for col in customer_log_sc_df:
    print(col, display_outliers(customer_log_sc_df, col).shape)

Fresh (16, 6)
Milk (4, 6)
Grocery (2, 6)
Frozen (10, 6)
Detergents_Paper (2, 6)
Delicatessen (14, 6)


Some rows may be outliers in more than one feature. To find them, we collect the index of every flagged row across all columns and count how often each index appears.

In [5]:
from collections import Counter

In [6]:
raw_outliers = []
for col in customer_log_sc_df:
    outlier_df = display_outliers(customer_log_sc_df, col)
    raw_outliers += list(outlier_df.index)

In [17]:
raw_outliers

[65,
 66,
 81,
 95,
 96,
 128,
 171,
 193,
 218,
 304,
 305,
 338,
 353,
 355,
 357,
 412,
 86,
 98,
 154,
 356,
 75,
 154,
 38,
 57,
 65,
 145,
 175,
 264,
 325,
 420,
 429,
 439,
 75,
 161,
 66,
 109,
 128,
 137,
 142,
 154,
 183,
 184,
 187,
 203,
 233,
 285,
 289,
 343]

In [7]:
outlier_count = Counter(raw_outliers)
outliers = [k for k,v in outlier_count.items() if v > 1]
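As a cross-check on the `Counter` approach, the same count can be obtained in a vectorized way by building a boolean mask for the whole frame and summing it across columns. The sketch below is an assumption-laden alternative, not the notebook's method: `tukey_outlier_mask` is a hypothetical helper and `toy_df` is invented data standing in for `customer_log_sc_df`.

```python
import pandas as pd

def tukey_outlier_mask(dataframe, param=1.5):
    """Hypothetical helper: True where a value lies outside its column's Tukey fences."""
    q1 = dataframe.quantile(0.25)
    q3 = dataframe.quantile(0.75)
    window = param * (q3 - q1)
    # Compare each column against its own lower and upper fence.
    below = dataframe.lt(q1 - window, axis="columns")
    above = dataframe.gt(q3 + window, axis="columns")
    return below | above

# Toy data, invented for illustration: row 8 is extreme in two columns, row 0 in one.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, -50.0],
    "c": [50.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1.0],
})

mask = tukey_outlier_mask(toy_df)
flags_per_row = mask.sum(axis=1)  # number of columns in which each row is flagged
repeated = flags_per_row[flags_per_row > 1].index.tolist()
print(repeated)
```

With the default linear interpolation, `DataFrame.quantile` and `np.percentile` should give the same quartiles, so applying this helper to the real frame would be expected to flag the same rows as `display_outliers`.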

In [8]:
len(outliers)

5

In [9]:
customer_log_sc_df.shape

(440, 6)

In [18]:
outliers

[65, 66, 128, 154, 75]
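The notebook stops at identifying the repeated outliers. One way to remove them, sketched here under assumptions, is to drop those index labels with `DataFrame.drop`. The frame `toy_df`, the list `toy_outliers`, and the name `cleaned_df` are hypothetical stand-ins for `customer_log_sc_df`, `outliers`, and the cleaned result.

```python
import pandas as pd

# Toy stand-in for customer_log_sc_df, invented for illustration.
toy_df = pd.DataFrame({
    "Fresh": [0.1, -0.2, 3.5, 0.0],
    "Milk": [0.3, 0.1, -4.0, 0.2],
})

# Hypothetical list of index labels flagged in more than one column.
toy_outliers = [2]

# Drop the flagged rows by label; the remaining rows keep their original index labels.
cleaned_df = toy_df.drop(index=toy_outliers)
print(cleaned_df.shape)
```

Applied to the real data, the analogous call would be `customer_log_sc_df.drop(index=outliers)`, which would remove 5 of the 440 rows and leave 435.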