In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [28]:
def get_lower_and_upper_bounds(series, df, multiplier=1.5):
    '''
    Takes a column name (series), a DataFrame (df), and a multiplier, and returns two DataFrames:
    the rows whose value in that column is more than multiplier * IQR above the 75th percentile
    (upper outliers) and the rows whose value is more than multiplier * IQR below the 25th
    percentile (lower outliers), in that order.
    '''
    # get value at the 25th percentile
    q1 = df[series].quantile(0.25)
    # get value at the 75th percentile
    q3 = df[series].quantile(0.75)
    # calculate interquartile range
    iqr = q3 - q1
    # calculate upper and lower limits for the whiskers
    lower_limit = q1 - (multiplier * iqr)
    upper_limit = q3 + (multiplier * iqr)
    # get outliers outside this range
    upper_outliers = df[(df[series] > upper_limit)]
    lower_outliers = df[(df[series] < lower_limit)]
    
    return upper_outliers, lower_outliers
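
As a quick sanity check of the IQR rule used above, here is a minimal sketch on a small made-up series. The helper name `iqr_bounds` and the toy values are assumptions for illustration only; they are not part of the lemonade data or the original notebook.

```python
import pandas as pd


def iqr_bounds(values, multiplier=1.5):
    """Return the (lower, upper) fences for a numeric pandas Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


# Hypothetical toy data: seven ordinary values and one obvious extreme
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

lower, upper = iqr_bounds(toy)

# Keep only the values that fall outside the fences
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper)
print(outliers)
```

Returning the fences themselves, rather than the filtered rows, makes it easy to inspect the cutoffs before deciding what to do with the flagged rows.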

# Exercise 1

Using the lemonade.csv dataset and focusing on continuous variables:

In [5]:
# pull in data
df = pd.read_csv('https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv')
df.head()

     Date        Day  Temperature  Rainfall  Flyers  Price  Sales
0  1/1/17     Sunday         27.0      2.00      15    0.5     10
1  1/2/17     Monday         28.9      1.33      15    0.5     13
2  1/3/17    Tuesday         34.5      1.33      27    0.5     15
3  1/4/17  Wednesday         44.1      1.05      28    0.5     17
4  1/5/17   Thursday         42.4      1.00      33    0.5     18


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         365 non-null    object 
 1   Day          365 non-null    object 
 2   Temperature  365 non-null    float64
 3   Rainfall     365 non-null    float64
 4   Flyers       365 non-null    int64  
 5   Price        365 non-null    float64
 6   Sales        365 non-null    int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 20.1+ KB


## Exercise 1a

Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

In [35]:
# loop through the numeric columns and print the lower outliers for each
for col in df.select_dtypes(include='number').columns:
    _, lower_outliers = get_lower_and_upper_bounds(col, df=df, multiplier=1.5)
    print(col)
    print(lower_outliers)
    print('-------------------')

Temperature
         Date     Day  Temperature  Rainfall  Flyers  Price  Sales
364  12/31/17  Sunday         15.1       2.5       9    0.5      7
-------------------
Rainfall
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
-------------------
Flyers
         Date      Day  Temperature  Rainfall  Flyers  Price  Sales
324  11/21/17  Tuesday         47.0      0.95     -38    0.5     20
-------------------
Price
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
-------------------
Sales
Empty DataFrame
Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
Index: []
-------------------


#### The temperature outlier (15.1 on 12/31/17) is plausible for late December, so I would keep it. The flyer count of -38 does not make sense because a count cannot be negative; it is likely a typo and should be 38. Rainfall, price, and sales do not have any lower outliers.
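
One way the flyer correction could be applied is sketched below. It assumes the negative value is a sign typo, which is a guess rather than something the data confirms. The two rows are copied from the outputs above (index 324 and 364); the names `sample` and `cleaned` are illustrative.

```python
import pandas as pd

# Two rows copied from the lower-outlier output above
sample = pd.DataFrame(
    {
        "Date": ["11/21/17", "12/31/17"],
        "Temperature": [47.0, 15.1],
        "Flyers": [-38, 9],
    },
    index=[324, 364],
)

cleaned = sample.copy()
# Assumption: a negative flyer count is a sign error, so take the absolute value
cleaned["Flyers"] = cleaned["Flyers"].abs()
print(cleaned)
```

Working on a copy keeps the raw values available in case the assumption turns out to be wrong.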

## Exercise 1b

Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?

In [36]:
# loop through the numeric columns and print the upper outliers for each
for col in df.select_dtypes(include='number').columns:
    upper_outliers, _ = get_lower_and_upper_bounds(col, df=df, multiplier=1.5)
    print(col)
    print(upper_outliers)
    print('-------------------')

Temperature
       Date       Day  Temperature  Rainfall  Flyers  Price  Sales
41  2/11/17  Saturday        212.0      0.91      35    0.5     21
-------------------
Rainfall
         Date        Day  Temperature  Rainfall  Flyers  Price  Sales
0      1/1/17     Sunday         27.0      2.00      15    0.5     10
1      1/2/17     Monday         28.9      1.33      15    0.5     13
2      1/3/17    Tuesday         34.5      1.33      27    0.5     15
5      1/6/17     Friday         25.3      1.54      23    0.5     11
6      1/7/17   Saturday         32.9      1.54      19    0.5     13
10    1/11/17  Wednesday         32.6      1.54      23    0.5     12
11    1/12/17   Thursday         38.2      1.33      16    0.5     14
12    1/13/17     Friday         37.5      1.33      19    0.5     15
15    1/16/17     Monday         30.6      1.67      24    0.5     12
16    1/17/17    Tuesday         32.2      1.43      26    0.5     14
19    1/20/17     Friday         31.6      1.43      20

#### A temperature of 212 is clearly an error and likely a typo. I would probably change it to 21 based on the temperatures for prior days. The rainfall outliers do not seem unusual, other than perhaps reflecting a rainy year, so I would keep them. The flyer and sales outliers are also not unusual, considering they occur during the hottest period of the year.
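
A minimal sketch of the temperature correction described above follows. The replacement value of 21.0 is the guess stated in the answer, not a verified reading, and `sample` and `cleaned` are illustrative names. The rows are copied from the outputs above (index 0 and 41).

```python
import pandas as pd

# Row 41 holds the suspicious reading; row 0 is an untouched control
sample = pd.DataFrame(
    {
        "Date": ["1/1/17", "2/11/17"],
        "Temperature": [27.0, 212.0],
        "Sales": [10, 21],
    },
    index=[0, 41],
)

cleaned = sample.copy()
# Assumption: 212.0 is a typo for 21.0 (a guess, not a confirmed value)
cleaned.loc[cleaned["Temperature"] == 212.0, "Temperature"] = 21.0
print(cleaned)
```

If the true value cannot be confirmed, dropping the row or marking the reading as missing would be a more conservative alternative.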