In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

In [None]:
def get_lower_and_upper_bounds(series, df, multiplier=1.5):
    '''
    Takes a column name (series), the dataframe that contains it, and a multiplier.
    Returns two dataframes, in this order: the rows whose value is more than the multiplier
    times the interquartile range above the 75th percentile (upper outliers), and the rows whose
    value is more than the multiplier times the interquartile range below the 25th percentile
    (lower outliers).
    '''
    # get value at the 25th percentile
    q1 = df[series].quantile(0.25)
    # get value at the 75th percentile
    q3 = df[series].quantile(0.75)
    # calculate interquartile range
    iqr = q3 - q1
    # calculate upper and lower limits for the whiskers
    lower_limit = q1 - (multiplier * iqr)
    upper_limit = q3 + (multiplier * iqr)
    # get outliers outside this range
    upper_outliers = df[(df[series] > upper_limit)]
    lower_outliers = df[(df[series] < lower_limit)]
    
    return upper_outliers, lower_outliers
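
To make the limit arithmetic concrete, here is a minimal sketch on a made-up series. The values and the helper name `iqr_fences` are illustrative and not part of the exercise data. It computes the same lower and upper limits as the function above and shows which values fall outside them.

```python
import pandas as pd

def iqr_fences(values, multiplier=1.5):
    """Return (lower_limit, upper_limit) for the IQR rule on a pandas Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# hypothetical toy data: nine ordinary values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

lower_limit, upper_limit = iqr_fences(toy, multiplier=1.5)
# keep only the values outside the two limits
toy_outliers = toy[(toy < lower_limit) | (toy > upper_limit)]
print(lower_limit, upper_limit)
print(toy_outliers.tolist())
```

With pandas' default linear interpolation the quartiles are 3.25 and 7.75, so the limits are -3.5 and 14.5 and only 100 is flagged.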

# Exercise 1

Using the lemonade.csv dataset and focusing on continuous variables:

In [2]:
# pull in data
df = pd.read_csv('https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv')
df.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17
4,1/5/17,Thursday,42.4,1.0,33,0.5,18


In [None]:
df.info()

## Exercise 1a

Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

In [None]:
# loop through columns and for each column that is not object datatype use function to get outliers
for col in df.columns: 
    if df[col].dtype != 'object': 
        upper_outliers, lower_outliers = get_lower_and_upper_bounds(col, df=df, multiplier=1.5)
        print(col) 
        print(lower_outliers) 
        print('-------------------') 

#### The temperature outlier makes sense based on the date, so I would keep it. The negative number of flyers (-38) does not make sense; it is likely a typo and should be 38. Rainfall, price, and sales do not have any lower outliers.
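
If the negative flyer count is treated as a sign typo, the correction could look like the sketch below. It runs on a small made-up frame rather than the real data, and taking the absolute value is an assumption about how the error arose.

```python
import pandas as pd

# hypothetical toy frame standing in for the lemonade data
toy = pd.DataFrame({'Flyers': [15, 27, -38, 33]})

# a flyer count cannot be negative, so flip the sign of any negative entry
negative = toy['Flyers'] < 0
toy.loc[negative, 'Flyers'] = toy.loc[negative, 'Flyers'].abs()

print(toy['Flyers'].tolist())
```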

## Exercise 1b

Use the IQR Range Rule and the lower and upper bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?

In [None]:
# loop through columns and for each column that is not object datatype use function to get outliers
for col in df.columns: 
    if df[col].dtype != 'object': 
        upper_outliers, lower_outliers = get_lower_and_upper_bounds(col, df=df, multiplier=1.5)
        print(col) 
        print(upper_outliers) 
        print('-------------------') 

#### A temperature of 212 is obviously an error and likely a typo. I would probably change it to 21 based on the temperature for prior days. The rainfall does not seem unusual, other than perhaps reflecting a rainy year, so I would keep it. The number of flyers and total sales are also not unusual considering they occur during the hottest period of the year. I might question sales of 534, as that is quite a bit more than the others. However, it happened on July 4th, so they might have been selling at an event with a lot of attendees.

## Exercise 1c

Using the multiplier of 3, IQR Range Rule, and the lower bounds, identify the outliers below the lower bound in each column of lemonade.csv. Do these lower outliers make sense? Which outliers should be kept?

In [None]:
# loop through columns and for each column that is not object datatype use function to get outliers
for col in df.columns: 
    if df[col].dtype != 'object': 
        upper_outliers, lower_outliers = get_lower_and_upper_bounds(col, df=df, multiplier=3)
        print(col) 
        print(lower_outliers) 
        print('-------------------') 

#### The only outlier with a multiplier of 3 is -38 flyers, which I would assume to be a typo and change to 38.

## Exercise 1d

Using the multiplier of 3, IQR Range Rule, and the upper bounds, identify the outliers above the upper bound in each column of lemonade.csv. Do these upper outliers make sense? Which outliers should be kept?

In [None]:
# loop through columns and for each column that is not object datatype use function to get outliers
for col in df.columns: 
    if df[col].dtype != 'object': 
        upper_outliers, lower_outliers = get_lower_and_upper_bounds(col, df=df, multiplier=3)
        print(col) 
        print(upper_outliers) 
        print('-------------------') 

#### As mentioned in the previous comment, I would keep all of these other than the temperature of 212, which is obviously a typo or error.
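
The four loops above differ only in the multiplier and in which side is printed. As a rough sketch, the number of outliers on each side can be summarised in one table per multiplier. The helper name `count_iqr_outliers` and the toy frame are made up for illustration.

```python
import pandas as pd

def count_iqr_outliers(df, multiplier=1.5):
    """Count values below/above the IQR limits for every numeric column."""
    numeric = df.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # comparing a DataFrame with a Series aligns the Series on the column names
    lower = (numeric < q1 - multiplier * iqr).sum()
    upper = (numeric > q3 + multiplier * iqr).sum()
    return pd.DataFrame({'lower': lower, 'upper': upper})

# hypothetical toy data; the string column is ignored by select_dtypes
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 16, 100],
    'b': [-50, 10, 11, 12, 13, 14, 15, 16, 17, 18],
    'label': list('abcdefghij'),
})

counts_15 = count_iqr_outliers(toy, multiplier=1.5)
counts_3 = count_iqr_outliers(toy, multiplier=3)
print(counts_15)
print(counts_3)
```

In this toy frame, column `a` has two upper outliers at a multiplier of 1.5 (16 and 100) but only one at a multiplier of 3 (100), which mirrors how the stricter rule narrows the list.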

In [None]:
df.head()

# Exercise 2

Identify if any columns in lemonade.csv are normally distributed. For normally distributed columns:

* Use a 2 sigma decision rule to isolate the outliers.

* Do these make sense?

* Should certain outliers be kept or removed?

In [None]:
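# plot histograms for each of the columns
fig, axs = plt.subplots(len(df.columns), figsize=(5, 25))
for n, col in enumerate(df.columns):
    df[col].hist(ax=axs[n])
    # set the title on this column's own axes (plt.title only targets the current axes)
    axs[n].set_title(col)
# adjust spacing once, after all subplots are drawn
plt.tight_layout()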

#### Judging by the histograms, temperature, rainfall, and flyers all appear to be fairly normally distributed.
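
Reading normality off a histogram is a judgment call. One rough numeric cross-check, offered here as an assumption rather than something the exercise asks for, is sample skewness: values near 0 suggest a symmetric distribution, while large positive or negative values suggest a long tail. The two toy series below are made up.

```python
import pandas as pd

# hypothetical toy data: one symmetric series, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

symmetric_skew = symmetric.skew()
right_skew = right_skewed.skew()
print(symmetric_skew, right_skew)
```

A skewness near 0 does not prove normality, so this is best treated as a quick screen alongside the histograms.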

In [None]:
# use assign to create new columns with the zscores for each of the normally distributed columns
df = df.assign(temp_zscore = zscore(df.Temperature), 
               rain_zscore = zscore(df.Rainfall), 
               flyers_zscore = zscore(df.Flyers))
df.head()
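
`scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, whereas pandas' `Series.std()` defaults to `ddof=1`, so a hand-written version needs `ddof=0` to match. The sketch below applies the sigma rule to made-up numbers; the helper name `sigma_outliers` is hypothetical.

```python
import pandas as pd

def sigma_outliers(values, n_sigma=2):
    """Return the values lying more than n_sigma standard deviations from the mean."""
    # ddof=0 matches the default used by scipy.stats.zscore
    z = (values - values.mean()) / values.std(ddof=0)
    return values[z.abs() > n_sigma]

# hypothetical toy data: nine similar values and one extreme value
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 40])

two_sigma = sigma_outliers(toy, n_sigma=2)
three_sigma = sigma_outliers(toy, n_sigma=3)
print(two_sigma.tolist())
print(three_sigma.tolist())
```

Here 40 has a z-score of about 2.98, so it is flagged by the 2 sigma rule but not by the 3 sigma rule.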

In [None]:
# get all observations where the temperature is greater than two standard deviations from the mean
df[abs(df.temp_zscore) > 2]

#### For temperature, these values are not uncommon even though they are more than two standard deviations from the mean, with the exception of 212. It may just have been a very hot summer with a few cold days during the winter, so I would keep these but change 212 to 21.

In [None]:
# get all observations where the rainfall is greater than two standard deviations from the mean
df[abs(df.rain_zscore) > 2]

#### For rain, I would keep all observations, as none of them are excessive even though they fall outside two standard deviations from the mean.

In [None]:
# get all observations where the number of flyers is more than two standard deviations from the mean
df[abs(df.flyers_zscore) > 2]

#### The only observation from flyers I would change is -38, as that is impossible and obviously a typo. All others are reasonable even though they fall outside two standard deviations from the mean.

# Exercise 3

Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv

In [None]:
# get all observations where the temperature is greater than three standard deviations from the mean
df[abs(df.temp_zscore) > 3]

In [None]:
# get all observations where the rainfall is greater than three standard deviations from the mean
df[abs(df.rain_zscore) > 3]

In [None]:
# get all observations where the number of flyers is more than three standard deviations from the mean
df[abs(df.flyers_zscore) > 3]