# Hypothesis Testing Exercises

### Hypothesis Testing Overview Exercise Questions

For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

#### Has the network latency gone up since we switched internet service providers?



$H_{0}$: The average network latency (in milliseconds) has stayed the same or decreased since we switched internet service providers.

$H_{a}$: The average network latency (in milliseconds) has increased since we switched from internet provider one to internet provider two.

True positive: I reject $H_{0}$, and latency really has increased. In the sample data, average latency is 55 ms before the switch and 90 ms after the switch, and the difference is significant at the 99% confidence level.

True negative: I fail to reject $H_{0}$, and latency really has not increased. In the sample data, average latency is 70 ms before the switch and 70 ms after the switch, so there is no significant increase at the 99% confidence level.

Type I: False positive. I reject $H_{0}$, but I'm incorrect. In the sample data, average latency is 55 ms before the switch and 70 ms after the switch, significant at the 99% confidence level, but the "after" sample was not representative (for example, it coincided with a temporary slowdown from a company software change) and typical latency has not actually increased.

Type II: False negative. I fail to reject $H_{0}$, but I'm incorrect. In the sample data, average latency is 70 ms before the switch and 70 ms after the switch, but the sample periods were not a fair representation of the population, and latency really has increased.


#### Is the website redesign any good? 


$H_{0}$: Average website clicks per hour have stayed the same or decreased since the website redesign was launched.

$H_{a}$: Average website clicks per hour have increased since the website redesign was launched.

True positive: I reject $H_{0}$, and clicks really have increased. In the sample data, average clicks are 10 per hour before the redesign and 20 per hour after it, and the difference is significant at the 99% confidence level.

True negative: I fail to reject $H_{0}$, and clicks really have not increased. In the sample data, average clicks are 10 per hour before the redesign and 10 per hour after it, so there is no significant increase at the 99% confidence level.

Type I: False positive. I reject $H_{0}$, but I'm incorrect. In the sample data, average clicks are 10 per hour before the redesign and 20 per hour after it, significant at the 99% confidence level, but the "after" sample was inflated by a short-lived spike from a product launch and typical clicks have not actually increased.

Type II: False negative. I fail to reject $H_{0}$, but I'm incorrect. In the sample data, average clicks are 10 per hour before the redesign and 10 per hour after it, but the sampled hours were not representative of typical traffic, and clicks really have increased.


#### Is our television ad driving more sales?


$H_{0}$: Sales have stayed the same or decreased since the television ad started airing.

$H_{a}$: Sales have increased since the television ad started airing.

True positive: I reject $H_{0}$, and sales really have increased. In the sample data, sales have increased 30% since the television ad started airing, and the increase is significant at the 99% confidence level.

True negative: I fail to reject $H_{0}$, and sales really have not increased. In the sample data, sales have not increased since the television ad started airing.

Type I: False positive. I reject $H_{0}$, but I'm incorrect. In the sample data, sales have increased 30% since the television ad started airing, significant at the 99% confidence level, but the sample did not reflect the population and there was actually no increase in sales in the population.

Type II: False negative. I fail to reject $H_{0}$, but I'm incorrect. In the sample data, sales have not increased since the television ad started airing, but the sample did not reflect the population and there was a 30% increase in sales in the population.
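
The four outcomes described in the answers above can also be seen in a small simulation. The sketch below is not part of the original exercises. It swaps in a one-tailed z-test with a known standard deviation (a simplification, so that only the Python standard library is needed), and the baseline of 70, standard deviation of 10, sample size of 30, and true increase of 5 are all assumed values chosen for illustration.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05       # significance level (assumed for this sketch)
SIGMA = 10.0       # known population standard deviation (assumed)
N = 30             # observations per sample (assumed)
BASELINE = 70.0    # mean under H_0, e.g. latency in ms (assumed)


def rejects_h0(sample, mu0=BASELINE, sigma=SIGMA, alpha=ALPHA):
    """One-tailed z-test: reject H_0 (mean <= mu0) in favor of mean > mu0."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha


def rejection_rate(true_mean, trials=2000, seed=0):
    """Fraction of simulated samples for which the test rejects H_0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, SIGMA) for _ in range(N)]
        rejections += rejects_h0(sample)
    return rejections / trials


# H_0 is true (no change): every rejection is a Type I error.
type_1_rate = rejection_rate(true_mean=BASELINE)

# H_a is true (mean rose by 5): every non-rejection is a Type II error.
power = rejection_rate(true_mean=BASELINE + 5.0, seed=1)
type_2_rate = 1.0 - power

print(f"Type I rate:  {type_1_rate:.3f}")
print(f"Type II rate: {type_2_rate:.3f}")
```

When $H_{0}$ is true, the rejection rate should land near the chosen significance level. When $H_{a}$ is true, the share of missed detections depends on the effect size, the noise, and the sample size.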


### T-Test Exercises

Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance.

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

In [204]:
# "different" = two-tailed test
# two offices = two-sample (independent) t-test

# H_0 : the average time it takes to sell a home is the same for
# office 1 and office 2

# H_a : the average time it takes to sell a home is different for
# office 1 and office 2

# sample 1
mean1 = 90
sdev1 = 15
ssize1 = 40

# sample 2
mean2 = 100
sdev2 = 20
ssize2 = 50

# equal_var=False runs Welch's t-test, which does not assume equal variances
t, p = stats.ttest_ind_from_stats(mean1, sdev1, ssize1,
                                  mean2, sdev2, ssize2,
                                  equal_var=False)
t, p

(-2.7091418459143854, 0.00811206270346016)
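
The t statistic above can be reproduced by hand from the summary statistics in the exercise. The sketch below applies the textbook formula for Welch's t-test and the Welch-Satterthwaite degrees of freedom using only the standard library. The names `se1_sq`, `se2_sq`, `t_stat`, and `df` are introduced here for illustration, and the p-value is left to SciPy because the standard library has no t distribution.

```python
import math

# Summary statistics from the exercise statement
mean1, sdev1, n1 = 90, 15, 40    # office 1
mean2, sdev2, n2 = 100, 20, 50   # office 2

# Squared standard error contributed by each sample
se1_sq = sdev1 ** 2 / n1
se2_sq = sdev2 ** 2 / n2

# Welch's t statistic: difference in means over the combined standard error
t_stat = (mean1 - mean2) / math.sqrt(se1_sq + se2_sq)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))

print(round(t_stat, 4), round(df, 2))
```

The negative sign only reflects that office 1 was listed first and has the smaller mean. For a two-tailed test the magnitude is what matters.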

In [None]:
# p (0.008) < alpha (0.05), so I reject the null hypothesis:
# the average time it takes to sell a home is different for
# office 1 and office 2

In [None]:
# Load the mpg dataset and use it to answer the following questions:

# Is there a difference in fuel-efficiency in cars from 2008 vs 1999?
# Are compact cars more fuel-efficient than the average car?
# Do manual cars get better gas mileage than automatic cars?

In [205]:
from pydataset import data
mpg = data('mpg')
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


In [206]:
mpg = mpg.dropna()

In [207]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [None]:
# Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

In [210]:
# "difference" = two-tailed test
# two model years = two-sample (independent) t-test

# H_0 : average fuel efficiency is the same for 1999 and 2008 cars

# H_a : average fuel efficiency is different for 1999 and 2008 cars

In [208]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [215]:
year1999_cty = mpg[mpg.year == 1999].cty
year1999_hwy = mpg[mpg.year == 1999].hwy
# fuel efficiency = average of city and highway mileage
year1999 = (year1999_cty + year1999_hwy)/2
year1999.head()

1    23.5
2    25.0
5    21.0
6    22.0
8    22.0
dtype: float64

In [216]:
year2008_cty = mpg[mpg.year == 2008].cty
year2008_hwy = mpg[mpg.year == 2008].hwy
year2008 = (year2008_cty + year2008_hwy)/2
year2008.head()

3     25.5
4     25.5
7     22.5
10    24.0
11    23.0
dtype: float64

In [219]:
t, p = stats.ttest_ind(year1999, year2008, equal_var=True)
round(t, 2), round(p, 2)

(0.22, 0.83)

In [220]:
# p is well above any conventional alpha (e.g. 0.05), so I fail to reject
# the null hypothesis: there is no evidence that fuel efficiency differs
# between 1999 and 2008 cars

In [214]:
# Are compact cars more fuel-efficient than the average car?

In [72]:
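# one group vs. the overall mean = one-sample t-test; "more" = one-tailed test
# H_0 : compact cars are no more fuel-efficient than the average car
# H_a : compact cars are more fuel-efficient than the average car
compact = (mpg['class'] == 'compact')
compact_fuel = (mpg[compact].hwy + mpg[compact].cty)/2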

In [74]:
avg_fuel = (mpg.hwy + mpg.cty)/2

In [82]:
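# compact is the boolean mask, so this is the variance of the mask itself,
# not of compact_fuel; a one-sample t-test needs no equal-variance check
compact.var()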

0.16120098308939518

In [226]:
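t, p = stats.ttest_1samp(compact_fuel, avg_fuel.mean())
# one-tailed test: halve the two-sided p-value and check that t > 0
t, p/2

# t > 0 and p/2 is far below any conventional alpha (e.g. 0.05), so I reject
# the null hypothesis: compact cars are more fuel-efficient than the average car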

(7.896888573132535, 2.0992818971585668e-10)
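
For reference, the one-sample t statistic is the gap between the sample mean and the reference value, divided by the standard error of the sample mean. The sketch below works through that formula with the standard library. It uses only the five compact rows visible in the `mpg.head()` output, so its result will not match the full-data output above, and `mu0 = 20.0` is a hypothetical stand-in for `avg_fuel.mean()`.

```python
import math
from statistics import mean, stdev

# (cty, hwy) pairs for the five compact cars shown in the mpg.head() output;
# an excerpt for illustration only, not the full compact class
compact_rows = [(18, 29), (21, 29), (20, 31), (21, 30), (16, 26)]
compact_fuel = [(cty + hwy) / 2 for cty, hwy in compact_rows]

# Hypothetical reference value standing in for avg_fuel.mean()
mu0 = 20.0

n = len(compact_fuel)
sample_mean = mean(compact_fuel)
standard_error = stdev(compact_fuel) / math.sqrt(n)   # stdev divides by n - 1

# One-sample t statistic with n - 1 degrees of freedom
t_stat = (sample_mean - mu0) / standard_error
print(round(sample_mean, 2), round(t_stat, 2))
```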

In [None]:
# Do manual cars get better gas mileage than automatic cars?

In [182]:
manual = mpg.trans.str.contains('man')

In [184]:
mpg[manual]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
6,audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact
8,audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact
10,audi,a4 quattro,2.0,2008,4,manual(m6),4,20,28,p,compact
13,audi,a4 quattro,2.8,1999,6,manual(m5),4,17,25,p,compact
15,audi,a4 quattro,3.1,2008,6,manual(m6),4,15,25,p,compact
24,chevrolet,corvette,5.7,1999,8,manual(m6),r,16,26,p,2seater
26,chevrolet,corvette,6.2,2008,8,manual(m6),r,16,26,p,2seater
28,chevrolet,corvette,7.0,2008,8,manual(m6),r,15,24,p,2seater


In [185]:
auto = mpg.trans.str.contains('aut')

In [187]:
mpg[auto]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
7,audi,a4,3.1,2008,6,auto(av),f,18,27,p,compact
9,audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact
11,audi,a4 quattro,2.0,2008,4,auto(s6),4,19,27,p,compact
12,audi,a4 quattro,2.8,1999,6,auto(l5),4,15,25,p,compact
14,audi,a4 quattro,3.1,2008,6,auto(s6),4,17,25,p,compact
16,audi,a6 quattro,2.8,1999,6,auto(l5),4,15,24,p,midsize
17,audi,a6 quattro,3.1,2008,6,auto(s6),4,17,25,p,midsize


In [194]:
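manual_fuel = (mpg[manual].hwy + mpg[manual].cty)/2
auto_fuel = (mpg[auto].hwy + mpg[auto].cty)/2
# compare variances to judge the equal_var assumption; a notebook cell only
# displays its last expression, so the output below is auto_fuel.var()
manual_fuel.var()
auto_fuel.var()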

21.942777233382337
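
The variance check matters because `equal_var=True` pools the two sample variances into one, while `equal_var=False` (Welch's test) keeps them separate. The sketch below computes both statistics by hand with the standard library. It uses only the ten-row previews of `mpg[manual]` and `mpg[auto]` shown above, so its numbers are illustrative and will not match the full-data results.

```python
import math
from statistics import mean, variance

# (cty, hwy) pairs copied from the ten-row mpg[manual] and mpg[auto]
# previews in this notebook; excerpts only, not the full groups
manual_rows = [(21, 29), (20, 31), (18, 26), (18, 26), (20, 28),
               (17, 25), (15, 25), (16, 26), (16, 26), (15, 24)]
auto_rows = [(18, 29), (21, 30), (16, 26), (18, 27), (16, 25),
             (19, 27), (15, 25), (17, 25), (15, 24), (17, 25)]

manual_fuel = [(cty + hwy) / 2 for cty, hwy in manual_rows]
auto_fuel = [(cty + hwy) / 2 for cty, hwy in auto_rows]

n1, n2 = len(manual_fuel), len(auto_fuel)
var1, var2 = variance(manual_fuel), variance(auto_fuel)   # sample variances
diff = mean(manual_fuel) - mean(auto_fuel)

# Pooled statistic: what equal_var=True assumes (one shared variance)
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_pooled = diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Welch statistic: what equal_var=False uses (separate variances)
t_welch = diff / math.sqrt(var1 / n1 + var2 / n2)

print(f"variances: {var1:.2f} vs {var2:.2f}")
print(f"t (pooled) = {t_pooled:.3f}, t (Welch) = {t_welch:.3f}")
```

With equal group sizes, as in this excerpt, the two statistics coincide and only the degrees of freedom differ. With unequal group sizes and unequal variances they can diverge, which is when the choice of `equal_var` matters most.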

In [228]:
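# two transmission types = two-sample (independent) t-test
# "better" = one-tailed test
# H_0 : manual cars get the same or worse gas mileage than automatic cars
# H_a : manual cars get better gas mileage than automatic cars
t, p = stats.ttest_ind(manual_fuel, auto_fuel, equal_var=True)
t, p/2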

(4.593437735750014, 3.5771872005728416e-06)
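
Halving the two-sided p-value gives the one-tailed p-value only when the sign of t matches the direction stated in $H_{a}$. The helper below is a hypothetical convenience function (not part of SciPy) that makes that check explicit. It plugs in the t and p/2 values reported above, and the alpha of 0.05 is an assumption.

```python
def one_tailed_p(t_stat, p_two_sided, alternative="greater"):
    """Convert a two-sided t-test p-value to a one-tailed p-value.

    Halving is only valid when the sign of t matches the direction
    stated in H_a; otherwise the one-tailed p-value is 1 - p/2.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    in_direction = t_stat > 0 if alternative == "greater" else t_stat < 0
    return p_two_sided / 2 if in_direction else 1 - p_two_sided / 2


ALPHA = 0.05  # significance level assumed for this sketch

# t and one-tailed p/2 reported in the cell above
t_stat = 4.593437735750014
p_two_sided = 2 * 3.5771872005728416e-06

# H_a: manual > automatic, which matches the positive t statistic
p_greater = one_tailed_p(t_stat, p_two_sided, "greater")
print(p_greater < ALPHA)

# Had H_a pointed the other way, the same data would give it no support
p_less = one_tailed_p(t_stat, p_two_sided, "less")
print(p_less < ALPHA)
```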

In [196]:
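# t > 0 and p/2 is far below any conventional alpha (e.g. 0.05), so I reject
# the null hypothesis: manual cars get better gas mileage than automatic cars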