# Continuous Probabilistic Methods

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Exercises

Define a function named `get_lower_and_upper_bounds` that takes two arguments. The first argument is a pandas Series. The second argument is the multiplier, which should default to 1.5.

In [4]:
def get_lower_and_upper_bounds(df, multiplier=1.5):
    #Calculate Q1 value
    q1 = df.quantile(0.25)
    #Calculate Q3 value
    q3 = df.quantile(0.75)
    #Calculate interquartile range value
    iqr = q3 - q1
    
    #Lower Bound
    inner_lower_fence = q1 - (multiplier * iqr)
    
    #Upper bound
    inner_upper_fence = q3 + (multiplier * iqr)
    
    return inner_lower_fence, inner_upper_fence
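
As a quick sanity check, the function can be tried on a small hand-made Series. The sketch below redefines the function with the default multiplier of 1.5 described in the exercise so that it runs on its own. The `toy` values are invented for illustration and are not part of the lemonade data.

```python
import pandas as pd

def get_lower_and_upper_bounds(series, multiplier=1.5):
    """Return the (lower, upper) IQR fences for a pandas Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical toy data: nine ordinary values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

lower, upper = get_lower_and_upper_bounds(toy)
print(lower, upper)  # -3.5 14.5

# Values outside the fences are flagged as outliers
print(toy[(toy < lower) | (toy > upper)].tolist())  # [100]
```

With pandas' default linear interpolation, Q1 is 3.25 and Q3 is 7.75 here, so the IQR is 4.5 and the fences sit 6.75 below Q1 and above Q3.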

## Part 1

In [6]:
url = 'https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv'
lemonade = pd.read_csv(url)
lemonade.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17
4,1/5/17,Thursday,42.4,1.0,33,0.5,18


Using the lemonade.csv dataset and focusing on continuous variables:

### A

Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?
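
For reference, the rule applied in this section can be written as follows, where $Q_1$ and $Q_3$ are the first and third quartiles of a column and $k$ is the multiplier (1.5 here). The symbols are notation introduced for this note; they correspond to `q1`, `q3`, `iqr`, and `multiplier` in the function above.

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 \\
\text{lower bound} &= Q_1 - k \cdot \mathrm{IQR} \\
\text{upper bound} &= Q_3 + k \cdot \mathrm{IQR}
\end{aligned}
```

A value is a lower outlier when it falls below the lower bound and an upper outlier when it falls above the upper bound.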

In [13]:
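# Restrict to numeric columns so that quantile() is not applied to Date and Day
inner_lower_fence, inner_upper_fence = get_lower_and_upper_bounds(lemonade.select_dtypes(np.number), 1.5)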

In [21]:
inner_lower_fence

Temperature    16.70
Rainfall        0.26
Flyers          4.00
Price           0.50
Sales           5.00
dtype: float64

In [26]:
inner_upper_fence

Temperature    104.7
Rainfall         1.3
Flyers          76.0
Price            0.5
Sales           45.0
dtype: float64

In [63]:
lower_outliers = {}
for col in lemonade.select_dtypes(np.number).columns:
    lower_outliers[col] = pd.DataFrame(lemonade[lemonade[col] < inner_lower_fence[col]])    

In [64]:
lower_outliers

{'Temperature':          Date     Day  Temperature  Rainfall  Flyers  Price  Sales
 364  12/31/17  Sunday         15.1       2.5       9    0.5      7,
 'Rainfall': Empty DataFrame
 Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
 Index: [],
 'Flyers':          Date      Day  Temperature  Rainfall  Flyers  Price  Sales
 324  11/21/17  Tuesday         47.0      0.95     -38    0.5     20,
 'Price': Empty DataFrame
 Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
 Index: [],
 'Sales': Empty DataFrame
 Columns: [Date, Day, Temperature, Rainfall, Flyers, Price, Sales]
 Index: []}

There are two lower outliers in the dataset. One for `Temperature`:

In [65]:
lower_outliers['Temperature']

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
364,12/31/17,Sunday,15.1,2.5,9,0.5,7


And one for `Flyers`:

In [66]:
lower_outliers['Flyers']

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
324,11/21/17,Tuesday,47.0,0.95,-38,0.5,20


Let's see if these outliers make sense in the context of our data.

In [45]:
lemonade.sort_values(by = "Temperature").head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
364,12/31/17,Sunday,15.1,2.5,9,0.5,7
338,12/5/17,Tuesday,22.0,1.82,11,0.5,10
5,1/6/17,Friday,25.3,1.54,23,0.5,11
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
23,1/24/17,Tuesday,28.6,1.54,20,0.5,12


The `Temperature` outlier appears to be just a particularly cold day, which is not unusual for December.

In [44]:
lemonade.sort_values(by = "Flyers").head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
324,11/21/17,Tuesday,47.0,0.95,-38,0.5,20
364,12/31/17,Sunday,15.1,2.5,9,0.5,7
338,12/5/17,Tuesday,22.0,1.82,11,0.5,10
343,12/10/17,Sunday,31.3,1.82,15,0.5,11
27,1/28/17,Saturday,34.9,1.33,15,0.5,13


The `Flyers` outlier is the only negative value in the `Flyers` column. It is unclear what distributing a negative number of flyers would mean, so I would recommend dropping this outlier.
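
One way to act on that recommendation is to filter out rows with a negative `Flyers` value. The sketch below is a minimal illustration, not part of the original notebook. It rebuilds three rows from the sorted output above in a small frame named `sample` (a hypothetical name) so that it runs without downloading the CSV.

```python
import pandas as pd

# Three rows copied from the sorted output above (index 324, 364, 338)
sample = pd.DataFrame(
    {
        "Date": ["11/21/17", "12/31/17", "12/5/17"],
        "Day": ["Tuesday", "Sunday", "Tuesday"],
        "Temperature": [47.0, 15.1, 22.0],
        "Rainfall": [0.95, 2.5, 1.82],
        "Flyers": [-38, 9, 11],
        "Price": [0.5, 0.5, 0.5],
        "Sales": [20, 7, 10],
    },
    index=[324, 364, 338],
)

# Keep only rows where the flyer count is non-negative
cleaned = sample[sample["Flyers"] >= 0]
print(cleaned.index.tolist())  # [364, 338]
```

On the full dataset, the same mask would be applied to `lemonade` instead of `sample`.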

### B

Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?

In [49]:
upper_outliers = {}
for col in lemonade.select_dtypes(np.number).columns:
    upper_outliers[col] = pd.DataFrame(lemonade[lemonade[col] > inner_upper_fence[col]])
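
The lower and upper loops share the same shape, so they could be folded into a single helper. The sketch below is one possible arrangement, under the assumption that a dictionary of `(lower_rows, upper_rows)` pairs is a convenient result. The name `find_outliers` and the `toy` frame are hypothetical and invented for illustration.

```python
import numpy as np
import pandas as pd

def find_outliers(df, multiplier=1.5):
    """Hypothetical helper: map each numeric column to its (lower, upper) outlier rows."""
    numeric = df.select_dtypes(np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    return {
        col: (df[df[col] < lower_fence[col]], df[df[col] > upper_fence[col]])
        for col in numeric.columns
    }

# Invented toy data, not taken from lemonade.csv
toy = pd.DataFrame({
    "label": list("abcdefghij"),
    "value": [-50, 2, 3, 4, 5, 6, 7, 8, 9, 100],
})

lower_rows, upper_rows = find_outliers(toy)["value"]
print(lower_rows["label"].tolist())  # ['a']
print(upper_rows["label"].tolist())  # ['j']
```

Selecting the numeric columns inside the helper keeps text columns such as `label` out of the quantile calculation while still returning the full rows.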

In [55]:
upper_outliers['Temperature']

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
41,2/11/17,Saturday,212.0,0.91,35,0.5,21


The outlier in the `Temperature` column far exceeds the highest temperature ever recorded on Earth.

In [57]:
upper_outliers['Rainfall']

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
5,1/6/17,Friday,25.3,1.54,23,0.5,11
6,1/7/17,Saturday,32.9,1.54,19,0.5,13
10,1/11/17,Wednesday,32.6,1.54,23,0.5,12
11,1/12/17,Thursday,38.2,1.33,16,0.5,14
12,1/13/17,Friday,37.5,1.33,19,0.5,15
15,1/16/17,Monday,30.6,1.67,24,0.5,12
16,1/17/17,Tuesday,32.2,1.43,26,0.5,14


In [59]:
upper_outliers['Flyers']

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
166,6/16/17,Friday,99.3,0.47,77,0.5,41
194,7/14/17,Friday,92.0,0.5,80,0.5,40


In [60]:
upper_outliers['Price']

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales


In [61]:
upper_outliers['Sales']

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
181,7/1/17,Saturday,102.9,0.47,59,0.5,143
182,7/2/17,Sunday,93.4,0.51,68,0.5,158
183,7/3/17,Monday,81.5,0.54,68,0.5,235
184,7/4/17,Tuesday,84.2,0.59,49,0.5,534
