# Anomaly Detection

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_rows = None
pd.options.display.max_columns = None

import warnings
warnings.filterwarnings('ignore')

## Continuous Probabilistic Methods Exercises

1. Define a function named get_lower_and_upper_bounds that takes two arguments.
- The first argument is a pandas Series.
- The second argument is the multiplier, which should have a default value of 1.5.

In [2]:
def get_lower_and_upper_bounds(series, multiplier=1.5, extreme_multiplier=3, extreme=False):
    '''
    Takes a Series, a mild and an extreme IQR multiplier, and a flag.
    Returns (upper_bound, lower_bound), or, when extreme is True,
    (upper_outer_bound, upper_bound, lower_bound, lower_outer_bound).
    '''
    # Compute the first quartile
    q1 = series.quantile(0.25)
    # Compute the third quartile
    q3 = series.quantile(0.75)
    # Compute the Interquartile Range
    iqr = q3 - q1
    # Compute the mild outlier upper bound
    upper_bound = (multiplier * iqr) + q3
    # Compute the mild outlier lower bound
    lower_bound = q1 - (multiplier * iqr)
    # Compute the extreme outlier upper bound
    upper_outer_bound = (extreme_multiplier * iqr) + q3
    # Compute the extreme outlier lower bound
    lower_outer_bound = q1 - (extreme_multiplier * iqr)
    # check extreme flag and return appropriate values
    if extreme:
        return upper_outer_bound, upper_bound, lower_bound, lower_outer_bound
    else:
        return upper_bound, lower_bound
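
As a quick sanity check of the IQR rule, the sketch below applies the same arithmetic to a small made-up Series. The values and the helper name `iqr_bounds` are illustrative assumptions, not part of the exercise data. Note that this helper returns the lower fence first, whereas the function above returns the upper bound first.

```python
import pandas as pd

def iqr_bounds(series, multiplier=1.5):
    """Return (lower, upper) IQR fences for a numeric Series (illustrative helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Made-up values: eight ordinary readings and one obvious spike
toy = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 100])

lower, upper = iqr_bounds(toy)
# Keep only the values that fall outside the fences
outliers = toy[(toy < lower) | (toy > upper)]

print(lower, upper)       # q1 = 13, q3 = 17, IQR = 4, so the fences are 7.0 and 23.0
print(outliers.tolist())  # [100]
```

With the default linear interpolation of `Series.quantile`, the quartiles land exactly on the third and seventh sorted values, which keeps the arithmetic easy to verify by hand.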

2. Using lemonade.csv dataset and focusing on continuous variables:

In [3]:
lemonade = pd.read_csv('https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv')
lemonade.head()

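|   | Date | Day | Temperature | Rainfall | Flyers | Price | Sales |
|---|------|-----|-------------|----------|--------|-------|-------|
| 0 | 1/1/17 | Sunday | 27.0 | 2.0 | 15 | 0.5 | 10 |
| 1 | 1/2/17 | Monday | 28.9 | 1.33 | 15 | 0.5 | 13 |
| 2 | 1/3/17 | Tuesday | 34.5 | 1.33 | 27 | 0.5 | 15 |
| 3 | 1/4/17 | Wednesday | 44.1 | 1.05 | 28 | 0.5 | 17 |
| 4 | 1/5/17 | Thursday | 42.4 | 1.0 | 33 | 0.5 | 18 |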


In [17]:
# select_dtypes picks every numeric column regardless of platform integer width
for col in lemonade.select_dtypes(include='number').columns:
    eub,ub,lb,elb = get_lower_and_upper_bounds(lemonade[col],extreme=True)
    print(f'{col} Max: {np.max(lemonade[col])}, {col} Min: {np.min(lemonade[col])}\n')
    print(f'eub: {eub}, ub: {ub}, lb: {lb}, elb: {elb}')
    print('----------------------------\n')

Temperature Max: 212.0, Temperature Min: 15.1

eub: 137.7, ub: 104.7, lb: 16.700000000000003, elb: -16.299999999999997
----------------------------

Rainfall Max: 2.5, Rainfall Min: 0.4

eub: 1.69, ub: 1.3, lb: 0.26, elb: -0.13
----------------------------

Flyers Max: 80, Flyers Min: -38

eub: 103.0, ub: 76.0, lb: 4.0, elb: -23.0
----------------------------

Price Max: 0.5, Price Min: 0.5

eub: 0.5, ub: 0.5, lb: 0.5, elb: 0.5
----------------------------

Sales Max: 534, Sales Min: 7

eub: 60.0, ub: 45.0, lb: 5.0, elb: -10.0
----------------------------



a) Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5.

In [19]:
# select_dtypes picks every numeric column regardless of platform integer width
for col in lemonade.select_dtypes(include='number').columns:
    eub,ub,lb,elb = get_lower_and_upper_bounds(lemonade[col],extreme=True)
    print(f'{col} Max: {np.max(lemonade[col])}, {col} Min: {np.min(lemonade[col])}\n')
    print(f'lb: {lb}')
    print('\nLower Outliers\n')
    print(lemonade[lemonade[col] < lb][col].value_counts().sort_index(ascending=False))
    print('----------------------------\n\n')

Temperature Max: 212.0, Temperature Min: 15.1

lb: 16.700000000000003

Lower Outliers

15.1    1
Name: Temperature, dtype: int64
----------------------------


Rainfall Max: 2.5, Rainfall Min: 0.4

lb: 0.26

Lower Outliers

Series([], Name: Rainfall, dtype: int64)
----------------------------


Flyers Max: 80, Flyers Min: -38

lb: 4.0

Lower Outliers

-38    1
Name: Flyers, dtype: int64
----------------------------


Price Max: 0.5, Price Min: 0.5

lb: 0.5

Lower Outliers

Series([], Name: Price, dtype: int64)
----------------------------


Sales Max: 534, Sales Min: 7

lb: 5.0

Lower Outliers

Series([], Name: Sales, dtype: int64)
----------------------------




- Do these lower outliers make sense? 



- Which outliers should be kept?




b) Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5.

In [20]:
for col in lemonade.select_dtypes(include='number').columns:
    eub,ub,lb,elb = get_lower_and_upper_bounds(lemonade[col],extreme=True)
    print(f'{col} Max: {np.max(lemonade[col])}, {col} Min: {np.min(lemonade[col])}\n')
    print(f'ub: {ub}')
    print('\nUpper Outliers\n')
    # Values above the mild upper bound (q3 + 1.5 * IQR)
    print(lemonade[lemonade[col] > ub][col].value_counts().sort_index(ascending=False))
    print('----------------------------\n\n\n\n\n')

Temperature Max: 212.0, Temperature Min: 15.1

ub: 104.7

Upper Outliers

212.0    1
Name: Temperature, dtype: int64
----------------------------





Rainfall Max: 2.5, Rainfall Min: 0.4

ub: 1.3

Upper Outliers

2.50    1
2.00    1
1.82    2
1.67    1
1.54    7
1.43    7
1.33    9
Name: Rainfall, dtype: int64
----------------------------





Flyers Max: 80, Flyers Min: -38

ub: 76.0

Upper Outliers

80    1
77    1
Name: Flyers, dtype: int64
----------------------------





Price Max: 0.5, Price Min: 0.5

ub: 0.5

Upper Outliers

Series([], Name: Price, dtype: int64)
----------------------------





Sales Max: 534, Sales Min: 7

ub: 45.0

Upper Outliers

534    1
235    1
158    1
143    1
Name: Sales, dtype: int64
----------------------------







- Do these upper outliers make sense? 



- Which outliers should be kept?




c) Using the multiplier of 3, IQR Range Rule, and the lower bounds, identify the outliers below the lower bound in each column of lemonade.csv.

In [None]:
for col in lemonade.select_dtypes(include='number').columns:
    eub,ub,lb,elb = get_lower_and_upper_bounds(lemonade[col],extreme=True)
    print(f'{col} Max: {np.max(lemonade[col])}, {col} Min: {np.min(lemonade[col])}\n')
    print(f'eub: {eub}, ub: {ub}, lb: {lb}, elb: {elb}')
    print('\nLower Outliers\n')
    # Compare against the extreme lower bound (q1 - 3 * IQR), not the mild one
    print(lemonade[lemonade[col] < elb][col].value_counts().sort_index(ascending=False))
    print('----------------------------\n\n\n\n\n')

- Do these lower outliers make sense? 



- Which outliers should be kept?




d) Using the multiplier of 3, IQR Range Rule, and the upper bounds, identify the outliers above the upper bound in each column of lemonade.csv.

In [None]:
for col in lemonade.select_dtypes(include='number').columns:
    eub,ub,lb,elb = get_lower_and_upper_bounds(lemonade[col],extreme=True)
    print(f'{col} Max: {np.max(lemonade[col])}, {col} Min: {np.min(lemonade[col])}\n')
    print(f'eub: {eub}, ub: {ub}, lb: {lb}, elb: {elb}')
    print('\nUpper Outliers\n')
    # Compare against the extreme upper bound (q3 + 3 * IQR)
    print(lemonade[lemonade[col] > eub][col].value_counts().sort_index(ascending=False))
    print('----------------------------\n\n\n\n\n')

- Do these upper outliers make sense?



- Which outliers should be kept?




3. Identify if any columns in lemonade.csv are normally distributed. For normally distributed columns:
- Use a 2 sigma decision rule to isolate the outliers.
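
One rough way to screen a column for normality is to check that its skewness and excess kurtosis are both close to zero. The sketch below is a heuristic under stated assumptions, not a formal test: the helper name `looks_roughly_normal`, the tolerance of 0.5, and the synthetic data are all illustrative choices. A histogram or a formal test such as Shapiro-Wilk (`scipy.stats.shapiro`) would give stronger evidence.

```python
import numpy as np
import pandas as pd

def looks_roughly_normal(series, tol=0.5):
    """Heuristic screen: skewness and excess kurtosis both within +/- tol of 0.

    tol=0.5 is an arbitrary illustrative threshold, not a standard cutoff.
    """
    return bool(abs(series.skew()) < tol and abs(series.kurt()) < tol)

# Synthetic data (assumed for illustration): one bell-shaped, one right-skewed
rng = np.random.default_rng(0)
normal_like = pd.Series(rng.normal(loc=60, scale=15, size=2000))
skewed = pd.Series(rng.exponential(scale=1.0, size=2000))

print(looks_roughly_normal(normal_like))  # bell-shaped sample passes the screen
print(looks_roughly_normal(skewed))       # right-skewed sample fails it
```

On the lemonade data, the same helper could be called on each numeric column before applying a sigma rule, keeping in mind that extreme values inflate both statistics.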

In [None]:
# Applied to every numeric column here; restrict to the columns judged approximately normal
for col in lemonade.select_dtypes(include='number').columns:
    mean, std = lemonade[col].mean(), lemonade[col].std()
    if std == 0:
        # A constant column has no spread, so z-scores are undefined
        print(f'{col}: constant column, skipping\n')
        continue
    # z-score: distance from the column mean in standard deviations
    zscores = (lemonade[col] - mean) / std
    print(f'{col} mean: {mean:.2f}, std: {std:.2f}')
    print('\n2 Sigma Outliers\n')
    print(lemonade[zscores.abs() > 2][col].value_counts().sort_index(ascending=False))
    print('----------------------------\n\n\n\n\n')

- Do these make sense?



- Should certain outliers be kept or removed?




4. Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv.

In [None]:
# Applied to every numeric column here; restrict to the columns judged approximately normal
for col in lemonade.select_dtypes(include='number').columns:
    mean, std = lemonade[col].mean(), lemonade[col].std()
    if std == 0:
        # A constant column has no spread, so z-scores are undefined
        print(f'{col}: constant column, skipping\n')
        continue
    # z-score: distance from the column mean in standard deviations
    zscores = (lemonade[col] - mean) / std
    print(f'{col} mean: {mean:.2f}, std: {std:.2f}')
    print('\n3 Sigma Outliers\n')
    print(lemonade[zscores.abs() > 3][col].value_counts().sort_index(ascending=False))
    print('----------------------------\n\n\n\n\n')
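
The 2 sigma and 3 sigma rules can flag different rows on the same data. The sketch below shows this on a made-up Series; the values and the helper name `sigma_outliers` are illustrative assumptions, and the standard deviation is the pandas default (sample standard deviation, `ddof=1`).

```python
import pandas as pd

def sigma_outliers(series, n_sigma):
    """Return the values whose absolute z-score exceeds n_sigma (illustrative helper)."""
    zscores = (series - series.mean()) / series.std()
    return series[zscores.abs() > n_sigma]

# Made-up values: nine identical readings and one spike
toy = pd.Series([50] * 9 + [100])

# mean = 55, sample std = sqrt(250) ~ 15.81, so the spike sits about 2.85 std above the mean
two_sigma = sigma_outliers(toy, 2)
three_sigma = sigma_outliers(toy, 3)

print(two_sigma.tolist())    # [100]
print(three_sigma.tolist())  # []
```

The spike is caught by the 2 sigma rule but not by the 3 sigma rule, because a single extreme value also inflates the standard deviation it is measured against. This effect is strongest in small samples.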