### For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

### Has the network latency gone up since we switched internet service providers?

$H_0$: Average network latency has stayed the same or decreased since we switched internet service providers.

$H_a$: Average network latency has increased since we switched internet service providers.

True Positive: Rejecting the null hypothesis when the network latency did actually increase.

True Negative: Failing to reject the null hypothesis when the network latency actually stayed the same or decreased.

Type I Error: Rejecting the null hypothesis when the network latency actually stayed the same or decreased.

Type II Error: Failing to reject the null hypothesis when the network latency actually increased.
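The four outcomes above depend on only two things: whether we rejected $H_0$, and whether $H_0$ was actually true. A minimal sketch of that mapping follows; the function name `classify_outcome` is hypothetical and used only for illustration.

```python
def classify_outcome(reject_null: bool, null_is_true: bool) -> str:
    """Name the outcome of a test from the decision and the (usually unknown) truth."""
    if reject_null and not null_is_true:
        return "true positive"
    if not reject_null and null_is_true:
        return "true negative"
    if reject_null and null_is_true:
        return "type I error"
    # Remaining case: we failed to reject a null hypothesis that was false
    return "type II error"


# Latency example: we reject H0, but latency did not actually increase
print(classify_outcome(reject_null=True, null_is_true=True))
```

In practice the truth is not known, so this is a way to reason about the outcomes rather than something we can compute from real data.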

### Is the website redesign any good?

$H_0$: Website performance is the same after the redesign.

$H_a$: Website performance is not the same after the redesign.

True Positive: Rejecting the null hypothesis when website performance actually is different after the redesign.

True Negative: Failing to reject the null hypothesis when website performance actually is the same after the redesign.

Type I Error: Rejecting the null hypothesis when website performance did not actually change after the redesign.

Type II Error: Failing to reject the null hypothesis when website performance really did change after the redesign.

### Is our television ad driving more sales?


$H_0$: Our television ad is not increasing sales.

$H_a$: Our television ad is increasing sales.

True Positive: Rejecting the null hypothesis when the television ad is increasing sales.

True Negative: Failing to reject the null hypothesis when the television ad is not increasing sales.

Type I Error: Rejecting the null hypothesis when the television ad is not increasing sales.

Type II Error: Failing to reject the null hypothesis when the television ad is increasing sales.
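One way to see what a type I error means in practice is to simulate many experiments in which the null hypothesis is true and count how often we reject it anyway. The sketch below uses made-up sales figures (the means, spread, and sample sizes are assumptions, not real data); the rejection rate should land near the chosen alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials = 2000

# Hypothetical weekly sales with and without the ad (arbitrary units).
# Both groups share the same true mean, so H0 is true in every trial.
no_ad = rng.normal(loc=100, scale=10, size=(n_trials, 30))
with_ad = rng.normal(loc=100, scale=10, size=(n_trials, 30))

# One two-sample t-test per row (axis=1)
t, p = stats.ttest_ind(with_ad, no_ad, axis=1)

# One-tailed decision for H_a: the ad increases sales
reject = (p / 2 < alpha) & (t > 0)

# Every rejection here is a type I error, because H0 is true by construction
type_1_rate = reject.mean()
print(type_1_rate)
```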

### Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance.

In [1]:
#We will need to use the stats.ttest_ind() function since we're comparing 
#the means of two different subgroups of a population.

#Check the variance
#variance is equal to standard deviation squared
office1_var = 15**2
office2_var = 20**2

In [4]:
office1_var

225

In [3]:
office2_var

400

#### Are the requirements met?
    1) Both samples are independent of each other
    2) Both samples have n > 30, so by the central limit theorem the sample means are approximately normally distributed
    3) The variances (225 vs. 400) are not equal, so we pass equal_var = False to run Welch's t-test

In [67]:
alpha = 0.05

$H_0$: The average time it takes office 1 to sell houses == The average time it takes office 2 to sell houses.

$H_a$: The average time it takes office 1 to sell houses != The average time it takes office 2 to sell houses.

In [9]:
import numpy as np
import pandas as pd
from scipy import stats
from math import sqrt

In [62]:
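#Doing this the easy way: simulate samples from normal distributions with the
#reported means and standard deviations. The draws are random, so t and p
#will change from run to run.
x1 = stats.norm(90, 15).rvs(40)
x2 = stats.norm(100, 20).rvs(50)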

In [63]:
t, p = stats.ttest_ind(x1, x2, equal_var = False)

In [64]:
t

-4.610987158809246

In [65]:
p

1.3597791650919027e-05

In [66]:
p < alpha

True

#### Since p is less than 0.05, we reject the null hypothesis
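Because the problem gives only summary statistics, the test can also be run on those numbers directly, without simulating samples. A minimal sketch using `scipy.stats.ttest_ind_from_stats`, with `equal_var=False` to match the Welch test above (the variable names `t_welch` and `p_welch` are just illustrative):

```python
from scipy import stats

# Welch's t-test computed from the reported means, standard deviations, and sample sizes
t_welch, p_welch = stats.ttest_ind_from_stats(
    mean1=90, std1=15, nobs1=40,
    mean2=100, std2=20, nobs2=50,
    equal_var=False,
)
print(t_welch, p_welch)
print(p_welch < 0.05)
```

Unlike the simulated version, this gives the same t and p on every run.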

In [59]:
#Doing this the long way
n1 = 40
n2 = 50

#s1 and s2 are the standard deviations
s1 = 15
s2 = 20

#degf is degrees of freedom
degf = n1 + n2 - 2

#s_p is the pooled standard deviation; pooling assumes equal variances,
#unlike the equal_var = False test above
s_p = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / degf)
s_p


17.956702977389302

In [71]:
t = (90 - 100) / (s_p * sqrt(1/n1 + 1/n2))
t

-2.6252287036468456

In [70]:
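#Since the t value is negative, we will use the cdf method to find the probability
#Also, since this is a two-tailed test, we will multiply the result by 2
#Note: the output below came from running this cell (In [70]) before t was
#recomputed in In [71], so it reflects the earlier t of about -4.61. With
#t = -2.625 and 88 degrees of freedom, p is roughly 0.01, still below 0.05.
p = stats.t(degf).cdf(t) * 2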
p

1.3506018028562515e-05

In [72]:
p < 0.05

True

#### Since p is less than 0.05 we reject the null hypothesis
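The long way above pools the two standard deviations, which assumes equal variances. For comparison, here is a sketch of the same hand calculation without that assumption (Welch's t-test), using the Welch-Satterthwaite approximation for the degrees of freedom. The variable names are illustrative.

```python
from math import sqrt
from scipy import stats

n1, n2 = 40, 50
xbar1, xbar2 = 90, 100
s1, s2 = 15, 20

# Squared standard error contributed by each sample mean
se1_sq = s1**2 / n1
se2_sq = s2**2 / n2

# Welch's t statistic: no pooled standard deviation
t_welch = (xbar1 - xbar2) / sqrt(se1_sq + se2_sq)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch = (se1_sq + se2_sq) ** 2 / (
    se1_sq**2 / (n1 - 1) + se2_sq**2 / (n2 - 1)
)

# Two-tailed p-value: twice the area in the lower tail beyond -|t|
p_welch = 2 * stats.t(df_welch).cdf(-abs(t_welch))
print(t_welch, df_welch, p_welch)
```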

### Load the mpg dataset and use it to answer the following questions:

In [16]:
from pydataset import data
mpg = data('mpg')
mpg

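| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
| 231 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
| 233 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |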


### Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

$H_0$: Fuel-efficiency in cars from 2008 == fuel-efficiency in cars from 1999

$H_a$: Fuel-efficiency in cars from 2008 != fuel-efficiency in cars from 1999

In [17]:
alpha = 0.05

In [19]:
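#This will be a two-tailed t-test using two samples.
#x1 will be the sample of cars from 2008
#x2 will be the sample of cars from 1999
#.copy() makes independent DataFrames, so adding a column later does not
#trigger a SettingWithCopyWarning
x1 = mpg[mpg.year == 2008].copy()
x2 = mpg[mpg.year == 1999].copy()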

In [24]:
x1['fuel_efficiency'] = (x1.cty + x1.hwy) / 2
x1_mean = x1.fuel_efficiency.mean()
x1_mean

20.076923076923077

In [23]:
x2['fuel_efficiency'] = (x2.cty + x2.hwy) / 2
x2_mean = x2.fuel_efficiency.mean()
x2_mean

20.22222222222222

In [25]:
x1.shape

(117, 12)

In [26]:
x2.shape

(117, 12)

In [27]:
x1.fuel_efficiency.var()

24.097480106100797

In [28]:
x2.fuel_efficiency.var()

27.122605363984675

### Do they meet the 3 requirements?
    1) Both are independent of each other.
    2) Both samples have n > 30, so by the central limit theorem the sample means are approximately normally distributed
    3) The variances are close enough to be considered equal
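Judging by eye whether two variances are "close enough" is subjective. One common alternative is Levene's test, whose null hypothesis is that the groups have equal variances. The sketch below uses randomly generated stand-ins for the two `fuel_efficiency` columns (the means and spreads are assumptions chosen to resemble the values above), so its numbers are not the notebook's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the 2008 and 1999 fuel_efficiency columns
sample_2008 = rng.normal(loc=20.1, scale=4.9, size=117)
sample_1999 = rng.normal(loc=20.2, scale=5.2, size=117)

# Levene's test: H0 is that both groups have the same variance
stat, p_levene = stats.levene(sample_2008, sample_1999)

# If we fail to reject equal variances, use the pooled test; otherwise use Welch's
equal_var = bool(p_levene > 0.05)
t, p = stats.ttest_ind(sample_2008, sample_1999, equal_var=equal_var)
print(p_levene, equal_var, t, p)
```

With the real data, the same two calls would take `x1.fuel_efficiency` and `x2.fuel_efficiency` in place of the simulated arrays.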

In [29]:
t, p = stats.ttest_ind(x1.fuel_efficiency, x2.fuel_efficiency)

In [30]:
t

-0.21960177245940962

In [31]:
p

0.8263744040323578

In [32]:
p < alpha

False

#### Since p is greater than 0.05, we fail to reject the null hypothesis.

### Are compact cars more fuel-efficient than the average car?

$H_0$: Average compact car fuel-efficiency <= average fuel-efficiency

$H_a$: Average compact car fuel-efficiency > average fuel-efficiency

In [34]:
mpg['fuel_efficiency'] = (mpg.cty + mpg.hwy) / 2
mpg

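| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | fuel_efficiency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | 23.5 |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | 25.0 |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | 25.5 |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | 25.5 |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | 21.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize | 23.5 |
| 231 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize | 25.0 |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize | 21.0 |
| 233 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize | 22.0 |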


In [39]:
#x1 will be compact cars
#x2 will be all cars
x1 = mpg[mpg['class'] == 'compact'].fuel_efficiency

In [38]:
x2 = mpg.fuel_efficiency

In [41]:
x1.shape

(47,)

In [42]:
x2.shape

(234,)

Both have n > 30, so by the central limit theorem the sample means are approximately normally distributed.

In [43]:
overall_mean = x2.mean()

In [44]:
t, p = stats.ttest_1samp(x1, overall_mean)

In [45]:
t

7.896888573132535

In [46]:
p

4.1985637943171336e-10

In [47]:
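#ttest_1samp returns a two-tailed p-value, so for this one-tailed test
#we halve p and confirm that t is positive
(p / 2 < alpha) and (t > 0)

True

#### Since p / 2 is less than 0.05 and t is positive, we reject the null hypothesis.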
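`stats.ttest_1samp` reports a two-tailed p-value by default, while the hypothesis here is one-tailed ("greater than"). A small sketch of the conversion follows. The helper name `one_tailed_p_greater` is hypothetical, and the sample is randomly generated rather than taken from the mpg data.

```python
import numpy as np
from scipy import stats


def one_tailed_p_greater(t_stat, p_two_tailed):
    """Convert a two-tailed p-value to a one-tailed p for H_a: mean > reference."""
    # If t points in the hypothesized direction, keep half of the two-tailed area;
    # otherwise the one-tailed p-value is the complement of that half.
    return p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2


rng = np.random.default_rng(1)
sample = rng.normal(loc=24, scale=3.5, size=47)  # hypothetical compact-car values
reference_mean = 20.0                            # hypothetical overall mean

t, p_two = stats.ttest_1samp(sample, reference_mean)
p_one = one_tailed_p_greater(t, p_two)
print(t, p_two, p_one)
```

Checking the sign of t matters: a very small two-tailed p with a negative t would be evidence in the opposite direction, not support for $H_a$.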

### Do manual cars get better gas mileage than automatic cars?

$H_0$: Average manual car fuel-efficiency <= Average automatic car fuel-efficiency

$H_a$: Average manual car fuel-efficiency > Average automatic car fuel-efficiency

In [50]:
#This test will be a one-tailed, two-sample t-test
#x1 will be manual car fuel efficiency
#x2 will be automatic car fuel efficiency
x1 = mpg[mpg.trans.str.startswith('manual')].fuel_efficiency
x2 = mpg[mpg.trans.str.startswith('auto')].fuel_efficiency

In [51]:
x1.shape

(77,)

In [52]:
x2.shape

(157,)

In [53]:
x1.var()

26.635167464114833

In [54]:
x2.var()

21.942777233382323

### Are the requirements met?
    1) Both are considered independent of each other
    2) Both have n > 30, so by the central limit theorem the sample means are approximately normally distributed
    3) Variances are not close enough for me to say they are equal, 
        but this can be accounted for in the function

In [55]:
t, p = stats.ttest_ind(x1, x2, equal_var = False)

In [56]:
t

4.443514012903071

In [57]:
p

1.795224899991793e-05

In [58]:
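#ttest_ind returns a two-tailed p-value, so for this one-tailed test
#we halve p and confirm that t is positive
(p / 2 < alpha) and (t > 0)

True

#### Since p / 2 is less than 0.05 and t is positive, we reject the null hypothesis.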
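Recent SciPy releases (1.6 and later, as far as I know) let `stats.ttest_ind` compute the one-tailed p-value directly through the `alternative` argument, which avoids halving p and checking the sign of t by hand. The sketch below uses randomly generated stand-ins for the manual and automatic samples (their means and spreads are assumptions), so the numbers will not match the notebook's output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical stand-ins for manual and automatic fuel_efficiency values
manual = rng.normal(loc=22.2, scale=5.2, size=77)
automatic = rng.normal(loc=19.1, scale=4.7, size=157)

# One-tailed Welch test: H_a is that the manual mean is greater
t, p_one = stats.ttest_ind(manual, automatic, equal_var=False, alternative="greater")

# Default two-tailed version, for comparison
t_two, p_two = stats.ttest_ind(manual, automatic, equal_var=False)
print(t, p_one, p_two)
```

When t is positive, `p_one` equals `p_two / 2`, which is the same quantity as the manual halving approach.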