# Continuous Variable Probabilistic Methods for Identifying Outliers Exercises
Using the repo setup directions, set up a new local and remote repository named anomaly-detection-exercises. The local version of your repo should live inside ~/codeup-data-science.

Save this work in your anomaly-detection-exercises repo. Then add, commit, and push your changes.

File name: continuous_probabilistic_methods.py or continuous_probabilistic_methods.ipynb

Define a function named get_lower_and_upper_bounds that takes two arguments. The first argument is a pandas Series. The second argument is the multiplier, which should default to 1.5.

## 1. Using the lemonade.csv dataset and focusing on continuous variables:


In [2]:
import pandas as pd
import numpy as np

In [6]:
lemon = pd.read_csv('https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv')
lemon.head()

       Date        Day  Temperature  Rainfall  Flyers  Price  Sales
0    1/1/17     Sunday         27.0      2.00      15    0.5     10
1    1/2/17     Monday         28.9      1.33      15    0.5     13
2    1/3/17    Tuesday         34.5      1.33      27    0.5     15
3    1/4/17  Wednesday         44.1      1.05      28    0.5     17
4    1/5/17   Thursday         42.4      1.00      33    0.5     18


In [18]:
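def get_lower_and_upper_bounds(df_series, multiplier=1.5):
    """Print and return the IQR-based (lower_bound, upper_bound) for a Series."""
    # describe() already computes the quartiles we need
    stats = df_series.describe()
    q1 = stats.loc['25%']
    q3 = stats.loc['75%']
    iqr = q3 - q1
    # Values beyond these fences are treated as outliers
    lower_bound = q1 - (multiplier * iqr)
    upper_bound = q3 + (multiplier * iqr)
    print(f'Lower bound: {lower_bound}')
    print(f'Upper bound: {upper_bound}')
    return lower_bound, upper_bound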
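
As a quick check of the arithmetic, the same fences can be computed by hand on a small made-up Series. This is only a sketch and not part of the original exercise: the helper name `iqr_bounds` and the sample values are hypothetical, and it uses `Series.quantile` rather than `describe()`, which should yield the same quartiles under pandas' default linear interpolation.

```python
import pandas as pd

def iqr_bounds(series, multiplier=1.5):
    """Return (lower, upper) IQR fences for a numeric Series (illustrative helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical sample: eight ordinary values and one extreme value.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = iqr_bounds(sample)
flagged = sample[(sample < lower) | (sample > upper)]

print(lower, upper)      # q1 = 3, q3 = 7, IQR = 4, so the fences are -3.0 and 13.0
print(flagged.tolist())  # only the extreme value falls outside the fences
```

Here q1 is 3 and q3 is 7, so the IQR is 4 and the fences sit at 3 - 6 = -3 and 7 + 6 = 13.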


### * Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

In [81]:
def outliers(lemon, mult=1.5, up=True, down=True):
    """Print rows below and/or above the IQR bounds for each numeric column."""
    for col in lemon.select_dtypes('number').columns:
        print(col)
        l, u = get_lower_and_upper_bounds(lemon[col], multiplier=mult)
        if down:
            print()
            print('Lower outliers')
            print(lemon[lemon[col] < l].sort_values(col))
        if up:
            print()
            print('Upper outliers')
            print(lemon[lemon[col] > u].sort_values(col))
        print('--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---')
        print()
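
Printing inside the loop works well for exploration, but returning the flagged rows makes them easier to reuse later. One possible variant is sketched below. The helper name `iqr_outlier_rows` and the toy DataFrame are hypothetical and only loosely modeled on the lemonade columns.

```python
import pandas as pd

def iqr_outlier_rows(df, multiplier=1.5):
    """Map each numeric column to a (lower_rows, upper_rows) pair of DataFrames."""
    result = {}
    for col in df.select_dtypes('number').columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
        result[col] = (df[df[col] < lower], df[df[col] > upper])
    return result

# Made-up data: one impossible negative flyer count and one very large sales value.
toy = pd.DataFrame({
    'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    'Flyers': [30, 32, 35, 31, 33, -38, 34],
    'Sales': [20, 21, 22, 20, 21, 20, 90],
})

rows = iqr_outlier_rows(toy)
print(rows['Flyers'][0])  # rows below the lower fence for Flyers
print(rows['Sales'][1])   # rows above the upper fence for Sales
```

Non-numeric columns such as `Day` are skipped by `select_dtypes('number')`, just as in the printing version.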

In [82]:
outliers(lemon, up=False)

Temperature
Lower bound: 16.700000000000003
Upper bound: 104.7

Lower outliers
         Date     Day  Temperature  Rainfall  Flyers  Price  Sales
364  12/31/17  Sunday         15.1       2.5       9    0.5      7
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Rainfall
Lower bound: 0.26
Upper bound: 1.3

Lower outliers
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Flyers
Lower bound: 4.0
Upper bound: 76.0

Lower outliers
         Date      Day  Temperature  Rainfall  Flyers  Price  Sales
324  11/21/17  Tuesday         47.0      0.95     -38    0.5     20
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Price
Lower bound: 0.5
Upper bound: 0.5

Lower outliers
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
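
The identical lower and upper bounds for Price deserve a note. Whenever q1 equals q3 (every displayed Price is 0.5), the IQR is 0 and both fences collapse onto that single value, whatever the multiplier. A small sketch with a made-up constant Series:

```python
import pandas as pd

price = pd.Series([0.5] * 10)  # hypothetical constant column

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Strict comparisons: values equal to a fence are not flagged.
n_flagged = int(((price < lower) | (price > upper)).sum())
print(iqr, lower, upper, n_flagged)
```

With fences this tight, any value that differs from 0.5 at all would be flagged, so the IQR rule says little about a column like this.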

### * Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?

In [83]:
outliers(lemon, down=False)

Temperature
Lower bound: 16.700000000000003
Upper bound: 104.7

Upper outliers
       Date       Day  Temperature  Rainfall  Flyers  Price  Sales
41  2/11/17  Saturday        212.0      0.91      35    0.5     21
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Rainfall
Lower bound: 0.26
Upper bound: 1.3

Upper outliers
         Date        Day  Temperature  Rainfall  Flyers  Price  Sales
28    1/29/17     Sunday         35.2      1.33      27    0.5     14
354  12/21/17   Thursday         40.5      1.33      23    0.5     15
350  12/17/17     Sunday         32.2      1.33      16    0.5     14
345  12/12/17    Tuesday         33.5      1.33      22    0.5     15
12    1/13/17     Friday         37.5      1.33      19    0.5     15
11    1/12/17   Thursday         38.2      1.33      16    0.5     14
27    1/28/17   Saturday         34.9      1.33      15    0.5     13
1      1/2/17     Monday         28.9      1.33      15    0.5     13

### * Using the multiplier of 3, IQR Range Rule, and the lower bounds, identify the outliers below the lower bound in each column of lemonade.csv. Do these lower outliers make sense? Which outliers should be kept?

In [84]:
outliers(lemon, mult=3, up=False)

Temperature
Lower bound: -16.299999999999997
Upper bound: 137.7

Lower outliers
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Rainfall
Lower bound: -0.13
Upper bound: 1.69

Lower outliers
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Flyers
Lower bound: -23.0
Upper bound: 103.0

Lower outliers
         Date      Day  Temperature  Rainfall  Flyers  Price  Sales
324  11/21/17  Tuesday         47.0      0.95     -38    0.5     20
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Price
Lower bound: 0.5
Upper bound: 0.5

Lower outliers
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
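--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---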

### * Using the multiplier of 3, IQR Range Rule, and the upper bounds, identify the outliers above the upper bound in each column of lemonade.csv. Do these upper outliers make sense? Which outliers should be kept?

In [85]:
outliers(lemon, mult=3, down=False)

Temperature
Lower bound: -16.299999999999997
Upper bound: 137.7

Upper outliers
       Date       Day  Temperature  Rainfall  Flyers  Price  Sales
41  2/11/17  Saturday        212.0      0.91      35    0.5     21
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Rainfall
Lower bound: -0.13
Upper bound: 1.69

Upper outliers
         Date      Day  Temperature  Rainfall  Flyers  Price  Sales
338   12/5/17  Tuesday         22.0      1.82      11    0.5     10
343  12/10/17   Sunday         31.3      1.82      15    0.5     11
0      1/1/17   Sunday         27.0      2.00      15    0.5     10
364  12/31/17   Sunday         15.1      2.50       9    0.5      7
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

Flyers
Lower bound: -23.0
Upper bound: 103.0

Upper outliers
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
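
To compare the two multipliers directly, it can help to count the flagged values side by side. The sketch below uses a hypothetical helper, `count_iqr_outliers`, and made-up values. It only illustrates that the wider 3x fences flag a subset of what the 1.5x fences flag.

```python
import pandas as pd

def count_iqr_outliers(series, multiplier):
    """Count values strictly outside the IQR fences for the given multiplier."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - multiplier * iqr) | (series > q3 + multiplier * iqr)
    return int(outside.sum())

# Hypothetical values: q1 = 12.5, q3 = 17.5, so the IQR is 5.
values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 26, 40])

counts = {m: count_iqr_outliers(values, m) for m in (1.5, 3)}
print(counts)  # 1.5x fences: (5, 25); 3x fences: (-2.5, 32.5)
```

With these values, 26 and 40 fall outside the 1.5x fences, while only 40 falls outside the 3x fences.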

## 2. Identify if any columns in lemonade.csv are normally distributed. For normally distributed columns:

* Use a 2 sigma decision rule to isolate the outliers.
    * Do these make sense?
    * Should certain outliers be kept or removed?
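
The cells that follow apply the sigma rule to every numeric column without first checking normality. One rough screen is to look at sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below uses synthetic data, and the column names `normal_like` and `skewed` are hypothetical. A histogram or a formal normality test would be a reasonable complement.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one drawn from a normal distribution, one clearly right-skewed.
demo = pd.DataFrame({
    'normal_like': rng.normal(loc=60, scale=15, size=5000),
    'skewed': rng.exponential(scale=1.0, size=5000),
})

summary = pd.DataFrame({
    'skew': demo.skew(),             # near 0 for symmetric data
    'excess_kurtosis': demo.kurt(),  # near 0 for normal-like tails
})
print(summary.round(2))
```

Columns whose skewness or excess kurtosis is far from 0 are poor candidates for a z-score rule, because the rule's usual interpretation assumes normality.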

In [59]:
lemon.head()

       Date        Day  Temperature  Rainfall  Flyers  Price  Sales
0    1/1/17     Sunday         27.0      2.00      15    0.5     10
1    1/2/17     Monday         28.9      1.33      15    0.5     13
2    1/3/17    Tuesday         34.5      1.33      27    0.5     15
3    1/4/17  Wednesday         44.1      1.05      28    0.5     17
4    1/5/17   Thursday         42.4      1.00      33    0.5     18


In [67]:
lemon.select_dtypes('number')

     Temperature  Rainfall  Flyers  Price  Sales
0           27.0      2.00      15    0.5     10
1           28.9      1.33      15    0.5     13
2           34.5      1.33      27    0.5     15
3           44.1      1.05      28    0.5     17
4           42.4      1.00      33    0.5     18
...          ...       ...     ...    ...    ...
360         42.7      1.00      33    0.5     19
361         37.8      1.25      32    0.5     16
362         39.5      1.25      17    0.5     15
363         30.9      1.43      22    0.5     13


In [86]:
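def sigma_outliers(df, sig=2):
    """Print rows whose z-score magnitude is at least sig, per numeric column."""
    for col in df.select_dtypes('number').columns:
        # z-score: distance from the column mean in standard deviations
        z = (df[col] - df[col].mean()) / df[col].std()
        print(f' {col} outliers')
        print(df[z.abs() >= sig])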
        print('--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---')
        print()
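
The z-score step can be traced on a small made-up Series. The sketch below (hypothetical names and values) also shows what happens with a constant column: the standard deviation is 0, every z-score becomes NaN, and no row passes the comparison. That is one plausible reason a constant column such as Price would produce an empty result.

```python
import pandas as pd

# Hypothetical readings with one value far from the rest.
readings = pd.Series([10, 12, 11, 13, 12, 11, 50])
z = (readings - readings.mean()) / readings.std()  # pandas std() uses ddof=1
flagged = readings[z.abs() >= 2]
print(z.round(2).tolist())
print(flagged)

# Constant column: the deviations and the standard deviation are all 0.
constant = pd.Series([0.5] * 7)
z_const = (constant - constant.mean()) / constant.std()  # 0 / 0 gives NaN
n_const_flagged = int((z_const.abs() >= 2).sum())        # NaN comparisons are False
print(n_const_flagged)
```

For `readings`, the mean is 17 and the sample standard deviation is about 14.58, so the value 50 has a z-score of roughly 2.26 and is the only one flagged.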

In [87]:
sigma_outliers(lemon)

 Temperature outliers
         Date       Day  Temperature  Rainfall  Flyers  Price  Sales
41    2/11/17  Saturday        212.0      0.91      35    0.5     21
166   6/16/17    Friday         99.3      0.47      77    0.5     41
176   6/26/17    Monday        102.6      0.47      60    0.5     42
181    7/1/17  Saturday        102.9      0.47      59    0.5    143
190   7/10/17    Monday         98.0      0.49      66    0.5     40
198   7/18/17   Tuesday         99.3      0.47      76    0.5     41
202   7/22/17  Saturday         99.6      0.47      49    0.5     42
207   7/27/17  Thursday         97.9      0.47      74    0.5     43
338   12/5/17   Tuesday         22.0      1.82      11    0.5     10
364  12/31/17    Sunday         15.1      2.50       9    0.5      7
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

 Rainfall outliers
         Date        Day  Temperature  Rainfall  Flyers  Price  Sales
0      1/1/17     Sunday         27.0      2.00      15    0.5     10

## 3. Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv

In [70]:
sigma_outliers(lemon, 3)

 Temperature outliers
       Date       Day  Temperature  Rainfall  Flyers  Price  Sales
41  2/11/17  Saturday        212.0      0.91      35    0.5     21

 Rainfall outliers
         Date      Day  Temperature  Rainfall  Flyers  Price  Sales
0      1/1/17   Sunday         27.0      2.00      15    0.5     10
15    1/16/17   Monday         30.6      1.67      24    0.5     12
338   12/5/17  Tuesday         22.0      1.82      11    0.5     10
343  12/10/17   Sunday         31.3      1.82      15    0.5     11
364  12/31/17   Sunday         15.1      2.50       9    0.5      7

 Flyers outliers
         Date      Day  Temperature  Rainfall  Flyers  Price  Sales
324  11/21/17  Tuesday         47.0      0.95     -38    0.5     20

 Price outliers
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []

 Sales outliers
       Date       Day  Temperature  Rainfall  Flyers  Price  Sales
181  7/1/17  Saturday        102.9      0.47      59    0.5    143
18
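
For context, it can be useful to know how many rows a sigma rule would flag if a column really were normal. The sketch below computes the two-sided tail probability from the standard normal distribution using `math.erf`. The row count of 365 is an assumption based on the index running from 0 to 364 in the output above, and `two_sided_tail` is a hypothetical helper name.

```python
import math

def two_sided_tail(k):
    """P(|Z| >= k) for a standard normal variable Z."""
    return 1 - math.erf(k / math.sqrt(2))

n_rows = 365  # assumed: one row per index value 0..364

expected = {k: n_rows * two_sided_tail(k) for k in (2, 3)}
for k, count in expected.items():
    print(f'{k} sigma: about {count:.1f} rows expected under normality')
```

Under normality this gives roughly 16.6 rows beyond 2 sigma and about 1 row beyond 3 sigma. Columns that flag far more or far fewer rows than that may not be normally distributed, or may contain data-entry errors.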