# Identifying and Removing Outliers

To identify outliers in the data, we will use [the Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/).

This means that we will look for points that lie more than 1.5 times the interquartile range (IQR = Q3 - Q1) above the third quartile (Q3) or below the first quartile (Q1).
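As a quick illustration of the rule, here is a minimal sketch on a small made-up array. The `values` array and the variable names are hypothetical and are not part of the wholesale customers data; the quartiles use `np.percentile` with its default linear interpolation.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate the fences.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, flagged)
```

Here Q1 is 3.25 and Q3 is 7.75, so the fences sit at -3.5 and 14.5 and only the value 100 is flagged.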

In [1]:
import pandas as pd
import numpy as np

In [2]:
cd ..

/home/jovyan/UCLA_CSX_450_2_2018_W-1/09-wholesale_customers-3


In [3]:
%run src/load_data.py

In [4]:
whos DataFrame

Variable             Type         Data/Info
-------------------------------------------
customer_df          DataFrame         Fresh   Milk  Grocer<...>n\n[440 rows x 6 columns]
customer_final_df    DataFrame            Fresh      Milk  <...>n\n[435 rows x 6 columns]
customer_log_df      DataFrame             Fresh       Milk<...>n\n[440 rows x 6 columns]
customer_log_sc_df   DataFrame            Fresh      Milk  <...>n\n[440 rows x 6 columns]
customer_sc_df       DataFrame            Fresh      Milk  <...>n\n[440 rows x 6 columns]


In [5]:
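def display_outliers(dataframe, col, param=1.5):
    """Return the rows of `dataframe` whose value in `col` lies outside the Tukey fences."""
    # First and third quartiles of the column
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Allowed distance beyond each quartile: param times the interquartile range
    tukey_window = param*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    # A row is an outlier if it falls below the lower fence or above the upper fence
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]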

In [6]:
for col in customer_log_sc_df:
    print(col, display_outliers(customer_log_sc_df, col).shape)

Fresh (16, 6)
Milk (4, 6)
Grocery (2, 6)
Frozen (10, 6)
Detergents_Paper (2, 6)
Delicatessen (14, 6)


What if we count only the rows that are flagged as outliers in more than one column?

In [7]:
from collections import Counter

In [8]:
raw_outliers = []
for col in customer_log_sc_df:
    outlier_df = display_outliers(customer_log_sc_df, col)
    raw_outliers += list(outlier_df.index)

In [9]:
outlier_count = Counter(raw_outliers)
outliers = [k for k,v in outlier_count.items() if v > 1]
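The same count can also be obtained without a `Counter`, by building a boolean mask for every column at once and summing across each row. The following is only a sketch of that alternative: `tukey_flags` and `demo_df` are hypothetical names introduced here, and `DataFrame.quantile` is used in place of `np.percentile` (both default to linear interpolation).

```python
import pandas as pd

def tukey_flags(dataframe, param=1.5):
    """Return a boolean frame: True where a value lies outside its column's Tukey fences."""
    q1 = dataframe.quantile(0.25)
    q3 = dataframe.quantile(0.75)
    window = param * (q3 - q1)
    # Comparing a DataFrame with a Series aligns the Series index on the columns.
    return (dataframe < q1 - window) | (dataframe > q3 + window)

# Hypothetical data: row 9 is extreme in two columns, row 0 in only one.
demo_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [5, 6, 7, 8, 9, 10, 11, 12, 13, -90],
    "c": [50, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Number of columns in which each row is flagged.
flag_counts = tukey_flags(demo_df).sum(axis=1)

# Index labels of rows flagged in more than one column.
repeat_outliers = list(flag_counts[flag_counts > 1].index)
print(repeat_outliers)
```

Summing the boolean frame along `axis=1` gives the per-row flag count directly, which plays the same role as `outlier_count` above.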

In [10]:
len(outliers)

5

In [11]:
customer_log_sc_df.shape

(440, 6)

In [12]:
outliers

[65, 66, 128, 154, 75]
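Once the repeated outliers are known, removing them amounts to dropping those index labels. The sketch below uses a small hypothetical frame (`demo_df`, `demo_outliers`) in place of `customer_log_sc_df` and `outliers`. Dropping the five rows above from the 440-row frame would leave 435 rows, which matches the row count of `customer_final_df` in the `whos` listing, although this notebook does not show how that frame was actually built.

```python
import pandas as pd

# Hypothetical stand-in for customer_log_sc_df; the real frame comes from src/load_data.py.
demo_df = pd.DataFrame({
    "Fresh": [0.1, -3.2, 0.4, 2.9],
    "Milk": [0.0, -2.8, 0.3, 0.2],
})

# Hypothetical index labels flagged in more than one column.
demo_outliers = [1]

# DataFrame.drop removes rows by index label and returns a new frame,
# leaving the original untouched.
demo_clean_df = demo_df.drop(index=demo_outliers)
print(demo_df.shape, demo_clean_df.shape)
```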