# Identifying and Removing Outliers

To identify outliers in the data, we will use [the Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/).

This means that we flag any point that lies more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile.
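
As a quick illustration of the rule, here is a minimal sketch on a small made-up array (the values are hypothetical and not taken from the customer data). It computes the two quartiles, derives the lower and upper fences, and keeps only the values that fall outside them.

```python
import numpy as np

# Made-up values for illustration only; 100.0 is the obvious odd one out.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# First and third quartiles (NumPy's default linear interpolation).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, flagged)
```

Here the quartiles are 2.0 and 4.0, so the fences sit at -1.0 and 7.0 and only 100.0 is flagged.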

In [1]:
import pandas as pd
import numpy as np

In [2]:
customer_log_sc_df = pd.read_pickle('final_log_sc.p')

In [3]:
def display_outliers(dataframe, col, param=1.5):
    """Return the rows of dataframe whose value in col lies outside the Tukey fences."""
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Distance of each fence from its quartile: param times the interquartile range
    tukey_window = param*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]

In [10]:
display_outliers(customer_log_sc_df, 'Milk')

| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|
| 86 | 0.885721 | 2.855165 | 1.736366 | -0.316992 | 1.815529 | 0.107374 |
| 98 | -1.697763 | -3.150111 | -1.600536 | -0.393251 | -1.605886 | -1.361243 |
| 154 | -1.554127 | -3.808515 | -3.158292 | -2.325583 | -2.815523 | -3.502289 |
| 356 | 0.878632 | -2.984076 | -2.741651 | 0.589193 | -2.66933 | -0.274075 |


In [11]:
display_outliers(customer_log_sc_df, 'Grocery')

| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|
| 75 | 0.806722 | -1.004409 | -6.585828 | 0.849171 | -3.308406 | 0.165965 |
| 154 | -1.554127 | -3.808515 | -3.158292 | -2.325583 | -2.815523 | -3.502289 |


In [12]:
raw_outliers

[65,
 66,
 81,
 95,
 96,
 128,
 171,
 193,
 218,
 304,
 305,
 338,
 353,
 355,
 357,
 412,
 86,
 98,
 154,
 356,
 75,
 154,
 38,
 57,
 65,
 145,
 175,
 264,
 325,
 420,
 429,
 439,
 75,
 161,
 66,
 109,
 128,
 137,
 142,
 154,
 183,
 184,
 187,
 203,
 233,
 285,
 289,
 343]

In [4]:
for col in customer_log_sc_df:
    print(col, display_outliers(customer_log_sc_df, col).shape)

Fresh (16, 6)
Milk (4, 6)
Grocery (2, 6)
Frozen (10, 6)
Detergents_Paper (2, 6)
Delicatessen (14, 6)


What if we count the rows that are flagged as outliers in more than one column?

In [5]:
from collections import Counter

In [6]:
raw_outliers = []
for col in customer_log_sc_df:
    outlier_df = display_outliers(customer_log_sc_df, col)
    raw_outliers += list(outlier_df.index)

In [7]:
outlier_count = Counter(raw_outliers)
outliers = [k for k,v in outlier_count.items() if v > 1]
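
The same per-row count can also be obtained without an explicit loop. The sketch below is an alternative, not the approach used in this notebook: it builds a boolean mask for every column at once and sums it across each row. The helper name `tukey_flags` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def tukey_flags(df, param=1.5):
    """Boolean DataFrame: True where a value lies outside its column's Tukey fences."""
    q1 = df.quantile(0.25)          # per-column first quartile
    q3 = df.quantile(0.75)          # per-column third quartile
    window = param * (q3 - q1)      # per-column fence distance
    # Comparing a DataFrame with a Series aligns the Series on the column labels.
    return (df < q1 - window) | (df > q3 + window)

# Hypothetical toy data: row 4 is extreme in both columns, row 5 only in "b".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0, 2.5],
    "b": [10.0, 11.0, 12.0, 13.0, -50.0, 90.0],
})

flags = tukey_flags(toy)
counts = flags.sum(axis=1)                        # number of columns flagging each row
repeat_outliers = list(counts[counts > 1].index)  # rows flagged in more than one column
print(repeat_outliers)
```

On the toy frame only row 4 is flagged in more than one column, which mirrors the `v > 1` filter applied to the `Counter` above.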

In [13]:
outlier_count

Counter({38: 1,
         57: 1,
         65: 2,
         66: 2,
         75: 2,
         81: 1,
         86: 1,
         95: 1,
         96: 1,
         98: 1,
         109: 1,
         128: 2,
         137: 1,
         142: 1,
         145: 1,
         154: 3,
         161: 1,
         171: 1,
         175: 1,
         183: 1,
         184: 1,
         187: 1,
         193: 1,
         203: 1,
         218: 1,
         233: 1,
         264: 1,
         285: 1,
         289: 1,
         304: 1,
         305: 1,
         325: 1,
         338: 1,
         343: 1,
         353: 1,
         355: 1,
         356: 1,
         357: 1,
         412: 1,
         420: 1,
         429: 1,
         439: 1})

In [14]:
outliers

[65, 66, 128, 154, 75]

In [8]:
len(outliers)

5

In [9]:
customer_log_sc_df.shape

(440, 6)
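
The notebook stops once the five repeat outliers have been identified. One possible way to carry out the removal, sketched here as an assumption rather than the step actually taken, is `DataFrame.drop` with the list of index labels. The frame `toy_df` and the list `toy_outliers` below are hypothetical stand-ins for `customer_log_sc_df` and `outliers`.

```python
import pandas as pd

# Hypothetical stand-in for customer_log_sc_df, with non-default index labels.
toy_df = pd.DataFrame(
    {"Fresh": [0.1, -0.2, 5.0, 0.3], "Milk": [0.0, 0.4, -6.0, -0.1]},
    index=[10, 11, 12, 13],
)

# Hypothetical stand-in for the outliers list built above.
toy_outliers = [12]

# drop returns a new frame; the original is left untouched.
cleaned_df = toy_df.drop(index=toy_outliers)
print(cleaned_df.shape)
```

Applied to the real data, dropping the five labels in `outliers` from the 440-row frame should leave 435 rows.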