### For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

### Has the network latency gone up since we switched internet service providers?

$H_0$: Network latency has stayed the same or decreased since we switched internet service providers.

$H_a$: Network latency has increased since we switched internet service providers.

True Positive: Rejecting the null hypothesis when the network latency did actually increase.

True Negative: Failing to reject the null hypothesis when the network latency actually stayed the same or decreased.

Type 1 Error: Rejecting the null hypothesis when the network latency actually stayed the same or decreased.

Type 2 Error: Failing to reject the null hypothesis when the network latency actually increased.
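The four outcomes follow from two yes/no facts: whether the null hypothesis is really true, and whether the test rejected it. A minimal sketch of that mapping is below; `classify_outcome` is a hypothetical helper written for illustration, and in practice the truth of the null hypothesis is not known.

```python
def classify_outcome(null_is_true: bool, rejected_null: bool) -> str:
    """Label a test decision, given the (normally unknown) truth about H0."""
    if rejected_null and not null_is_true:
        return "true positive"
    if not rejected_null and null_is_true:
        return "true negative"
    if rejected_null and null_is_true:
        return "type I error"
    # Remaining case: failed to reject a null hypothesis that is false.
    return "type II error"


# Latency example: H0 = latency stayed the same or decreased.
for null_is_true in (True, False):
    for rejected_null in (True, False):
        label = classify_outcome(null_is_true, rejected_null)
        print(f"H0 true: {null_is_true}, rejected H0: {rejected_null} -> {label}")
```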

### Is the website redesign any good?

$H_0$: Website performance is the same after the redesign.

$H_a$: Website performance is not the same after the redesign.

True Positive: Rejecting the null hypothesis when the website performance actually is different after the redesign.

True Negative: Failing to reject the null hypothesis when website performance actually is the same after the redesign.

Type 1 Error: Rejecting the null hypothesis when website performance did not actually change after the redesign.

Type 2 Error: Failing to reject the null hypothesis when the website performance really did change after the redesign.

### Is our television ad driving more sales?


$H_0$: Our television ad is not increasing sales.

$H_a$: Our television ad is increasing sales.

True Positive: Rejecting the null hypothesis when the television ad is increasing sales.

True Negative: Failing to reject the null hypothesis when the television ad is not increasing sales.

Type 1 Error: Rejecting the null hypothesis when the television ad is not increasing sales.

Type 2 Error: Failing to reject the null hypothesis when the television ad is increasing sales.

### Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance.

In [1]:
#We will need to use the stats.ttest_ind() function since we're comparing 
#the means of two different subgroups of a population.

#Check the variance
#variance is equal to standard deviation squared
office1_var = 15**2
office2_var = 20**2

In [2]:
office1_var

225

In [3]:
office2_var

400

#### Are the requirements met?
    1) Both samples are independent of each other
    2) Both samples have n > 30, so by the central limit theorem the sample means are approximately normally distributed
    3) The variances (225 vs. 400) are not equal, so we set equal_var = False in the function

In [4]:
alpha = 0.05

$H_0$: The average time it takes office 1 to sell houses == The average time it takes office 2 to sell houses.

$H_a$: The average time it takes office 1 to sell houses != The average time it takes office 2 to sell houses.

In [62]:
import numpy as np
import pandas as pd
from scipy import stats
from math import sqrt

In [6]:
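#Doing this the easy way: only summary statistics are given, so we simulate
#samples with the stated means and standard deviations. The samples are random,
#so t and p change slightly on every run.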
x1 = stats.norm(90, 15).rvs(40)
x2 = stats.norm(100, 20).rvs(50)

In [7]:
t, p = stats.ttest_ind(x1, x2, equal_var = False)

In [8]:
t

-2.5186754928527946

In [9]:
p

0.013671660823837788

In [10]:
p < alpha

True

#### Since p is less than 0.05, we reject the null hypothesis
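Because the cell above draws random samples, its t and p values differ from run to run. As an alternative sketch, `stats.ttest_ind_from_stats` takes the summary statistics directly, so the result is deterministic. The variable names below are chosen for illustration.

```python
from scipy import stats

# Summary statistics quoted in the problem statement.
mean1, sd1, n1 = 90, 15, 40
mean2, sd2, n2 = 100, 20, 50

# equal_var=False requests Welch's t-test, matching the unequal-variance assumption.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1, sd1, n1, mean2, sd2, n2, equal_var=False
)
print(t_stat, p_value)
print(p_value < 0.05)
```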

In [11]:
#Doing this the long way
n1 = 40
n2 = 50

#s1 and s2 are the standard deviations
s1 = 15
s2 = 20

#degf is degrees of freedom
degf = n1 + n2 - 2

#s_p is the pooled standard deviation (pooling assumes the two variances are equal)
s_p = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
s_p


17.956702977389302

In [12]:
t = (90 - 100) / (s_p * sqrt(1/n1 + 1/n2))
t

-2.6252287036468456

In [13]:
#Since the t value is negative, we will use the cdf method to find the probability
#Also, since this is a two-tailed test, we will multiply the result by 2
p = stats.t(degf).cdf(t) * 2
p

0.01020985244923939

In [14]:
p < 0.05

True

#### Since p is less than 0.05 we reject the null hypothesis
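The long-way calculation above pools the two variances, which corresponds to `equal_var = True`. Since the variances were judged unequal, a hand calculation of Welch's t-test is sketched below for comparison. It uses the unpooled standard error and the Welch-Satterthwaite approximation for the degrees of freedom; the names `t_welch`, `df_welch`, and `p_welch` are illustrative.

```python
from math import sqrt
from scipy import stats

mean1, s1, n1 = 90, 15, 40
mean2, s2, n2 = 100, 20, 50

# Variance of each sample mean.
v1 = s1**2 / n1
v2 = s2**2 / n2

# Welch's t statistic uses the unpooled standard error.
t_welch = (mean1 - mean2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Two-tailed p-value: twice the area beyond |t|.
p_welch = 2 * stats.t(df_welch).sf(abs(t_welch))
print(t_welch, df_welch, p_welch)
```

The conclusion is the same as with the pooled version: p is below 0.05, so the null hypothesis is rejected.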

### Load the mpg dataset and use it to answer the following questions:

In [15]:
from pydataset import data
mpg = data('mpg')
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


### Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

$H_0$: Fuel-efficiency in cars from 2008 == fuel-efficiency in cars from 1999

$H_a$: Fuel-efficiency in cars from 2008 != fuel-efficiency in cars from 1999

In [16]:
alpha = 0.05

In [17]:
#This will be a two tailed t-test using two samples.
#x1 will be the sample of cars from 2008
#x2 will be the sample of cars from 1999
#.copy() gives each subset its own DataFrame, so adding a column below
#does not trigger pandas' SettingWithCopyWarning
x1 = mpg[mpg.year == 2008].copy()
x2 = mpg[mpg.year == 1999].copy()

In [18]:
#Average city and highway mileage into a single fuel-efficiency column
x1['fuel_efficiency'] = (x1.cty + x1.hwy) / 2
x1_mean = x1.fuel_efficiency.mean()
x1_mean

20.076923076923077

In [19]:
x2['fuel_efficiency'] = (x2.cty + x2.hwy) / 2
x2_mean = x2.fuel_efficiency.mean()
x2_mean

20.22222222222222

In [20]:
x1.shape

(117, 12)

In [21]:
x2.shape

(117, 12)

In [22]:
x1.fuel_efficiency.var()

24.097480106100797

In [23]:
x2.fuel_efficiency.var()

27.122605363984675

### Do they meet the 3 requirements?
    1) Both samples are independent of each other.
    2) Both samples have n > 30, so by the central limit theorem the sample means are approximately normally distributed
    3) The variances (24.1 vs. 27.1) are close enough to be considered equal

In [24]:
t, p = stats.ttest_ind(x1.fuel_efficiency, x2.fuel_efficiency)

In [25]:
t

-0.21960177245940962

In [26]:
p

0.8263744040323578

In [27]:
p < alpha

False

#### Since p is greater than 0.05, we fail to reject the null hypothesis.

### Are compact cars more fuel-efficient than the average car?

$H_0$: Average compact car fuel-efficiency <= average fuel-efficiency

$H_a$: Average compact car fuel-efficiency > average fuel-efficiency

In [28]:
mpg['fuel_efficiency'] = (mpg.cty + mpg.hwy) / 2
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,fuel_efficiency
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,23.5
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,25.0
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,21.0
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,22.0


In [29]:
#x1 will be compact cars
#x2 will be all cars
x1 = mpg[mpg['class'] == 'compact'].fuel_efficiency

In [30]:
x2 = mpg.fuel_efficiency

In [31]:
x1.shape

(47,)

In [32]:
x2.shape

(234,)

Both have n > 30, so by the central limit theorem the sample means are approximately normally distributed.

In [33]:
overall_mean = x2.mean()

In [34]:
t, p = stats.ttest_1samp(x1, overall_mean)

In [35]:
t

7.896888573132535

In [36]:
p

4.1985637943171336e-10

In [37]:
p < alpha

True

#### Since t is positive and the one-tailed p-value (p / 2) is less than 0.05, we reject the null hypothesis.
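`stats.ttest_1samp` reports a two-tailed p-value by default, while the hypotheses here are one-sided. A small sketch of the conversion is below. `one_tailed_greater` is a hypothetical helper, and the sample is synthetic stand-in data rather than the compact-car values.

```python
import numpy as np
from scipy import stats


def one_tailed_greater(t_stat, p_two_tailed):
    """Convert a two-tailed p-value to a one-tailed one for H_a: mean > reference."""
    # If t is positive, the observed effect is in the hypothesized direction,
    # so the one-tailed p-value is half the two-tailed one.
    return p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2


rng = np.random.default_rng(1)
sample = rng.normal(loc=24.0, scale=3.5, size=47)  # synthetic, not the mpg data
reference_mean = 20.0

t_stat, p_two = stats.ttest_1samp(sample, reference_mean)
p_one = one_tailed_greater(t_stat, p_two)
print(t_stat, p_two, p_one, p_one < 0.05)
```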

### Do manual cars get better gas mileage than automatic cars?

$H_0$: Average manual car fuel-efficiency <= Average automatic car fuel-efficiency

$H_a$: Average manual car fuel-efficiency > Average automatic car fuel-efficiency

In [38]:
#This test will be a one-tailed, two-sample t-test
#x1 will be manual car fuel efficiency
#x2 will be automatic car fuel efficiency
x1 = mpg[mpg.trans.str.startswith('manual')].fuel_efficiency
x2 = mpg[mpg.trans.str.startswith('auto')].fuel_efficiency

In [39]:
x1.shape

(77,)

In [40]:
x2.shape

(157,)

In [41]:
x1.var()

26.635167464114833

In [42]:
x2.var()

21.942777233382323

### Are the requirements met?
    1) Both samples are considered independent of each other
    2) Both have n > 30, so by the central limit theorem the sample means are approximately normally distributed
    3) The variances (26.6 vs. 21.9) are not close enough for me to say they are equal,
        but this can be accounted for in the function with equal_var = False

In [43]:
t, p = stats.ttest_ind(x1, x2, equal_var = False)

In [44]:
t

4.443514012903071

In [45]:
p

1.795224899991793e-05

In [46]:
p < alpha

True

#### Since t is positive and the one-tailed p-value (p / 2) is less than 0.05, we reject the null hypothesis

# Correlation

### Use the telco_churn data.

In [47]:
telco = pd.read_csv("clean_telco.csv")
telco

Unnamed: 0,id,customer_id,gender,is_senior_citizen,partner,dependents,phone_service,internet_service,contract_int,payment_type,...,tenure_month,has_churned,has_phone,has_internet,has_internet_and_phone,partner_dependents,start_day,phone_type,internet_type,contract_type
0,0,0002-ORFBO,Female,0,Yes,Yes,1,1,1,Mailed check,...,9.0,False,True,True,True,3,2020-05-03,One Line,DSL,1 Year
1,1,0003-MKNFE,Male,0,No,No,2,1,0,Mailed check,...,9.1,False,True,True,True,0,2020-05-03,Two or More Lines,DSL,Month-to-Month
2,2,0004-TLHLJ,Male,0,No,No,1,2,0,Electronic check,...,3.8,True,True,True,True,0,2020-11-03,One Line,Fiber Optic,Month-to-Month
3,3,0011-IGKFF,Male,1,Yes,No,1,2,0,Electronic check,...,12.6,True,True,True,True,1,2020-02-03,One Line,Fiber Optic,Month-to-Month
4,4,0013-EXCHZ,Female,1,Yes,No,1,2,0,Mailed check,...,3.2,True,True,True,True,1,2020-11-03,One Line,Fiber Optic,Month-to-Month
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,7038,9987-LUTYD,Male,0,Yes,Yes,1,0,2,Credit card (automatic),...,43.4,False,True,False,False,3,2017-07-03,One Line,No Internet Service,2 Year
7028,7039,9992-RRAMN,Male,0,No,No,1,0,0,Mailed check,...,1.0,False,True,False,False,0,2021-01-03,One Line,No Internet Service,Month-to-Month
7029,7040,9992-UJOEL,Male,1,Yes,Yes,1,2,1,Bank transfer (automatic),...,47.3,False,True,True,True,3,2017-03-03,One Line,Fiber Optic,1 Year
7030,7041,9993-LHIEB,Female,0,No,No,1,2,1,Mailed check,...,6.7,False,True,True,True,0,2020-08-03,One Line,Fiber Optic,1 Year


### Does tenure correlate with monthly charges?

$H_0$: Tenure does not correlate with monthly charges
    
$H_a$: Tenure does correlate with monthly charges

In [100]:
x = telco.tenure_month
y = telco.monthly_charges
alpha = 0.05

In [49]:
r, p = stats.pearsonr(x,y)

In [50]:
r, p

(0.24602222678861455, 1.8834273042677756e-97)

Since p is less than 0.05, we reject the null hypothesis.

### Does tenure correlate with total charges?

$H_0$: Tenure does not correlate with total charges.
    
$H_a$: Tenure does correlate with total charges.

In [51]:
x = telco.tenure_month
y = telco.total_charges

In [52]:
r, p = stats.pearsonr(x,y)
r, p

(0.8257328669183033, 0.0)
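A p-value printed as `0.0` most likely means the true value is too small to represent as a floating-point number, not that it is exactly zero. To show where the numbers come from, the sketch below computes Pearson's r and its two-tailed p-value by hand and compares them with `stats.pearsonr`. The data are synthetic stand-ins, not the telco columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic stand-ins for tenure and charges (NOT the telco data).
x = rng.uniform(1, 72, size=200)
y = 50 + 0.3 * x + rng.normal(0, 20, size=200)

# Pearson's r: covariance of x and y scaled by both standard deviations.
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

# Under H0 (no linear correlation), this statistic follows a t distribution
# with n - 2 degrees of freedom.
n = len(x)
t_stat = r_manual * np.sqrt((n - 2) / (1 - r_manual**2))
p_manual = 2 * stats.t(n - 2).sf(abs(t_stat))

r_scipy, p_scipy = stats.pearsonr(x, y)
print(r_manual, r_scipy)
print(p_manual, p_scipy)
```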

### What happens if you control for phone and internet service?

In [99]:
### Controlling for phone service
has_phone = telco[telco.has_phone == True]

### Does tenure correlate with monthly charges for people with phone service?

$H_0$: Among customers with phone service, tenure does not correlate with monthly charges.

$H_a$: Among customers with phone service, tenure does correlate with monthly charges.

In [80]:
x = has_phone.tenure_month
y = has_phone.monthly_charges

In [81]:
r, p = stats.pearsonr(x, y)
r, p

(0.24296622457649625, 5.166669584745184e-86)

Since p is less than 0.05, we reject the null hypothesis.

### Does tenure correlate with total charges for people with phone service?

In [82]:
y = has_phone.total_charges

In [83]:
r, p = stats.pearsonr(x, y)
r, p

(0.8296284669385314, 0.0)

### Does tenure correlate with monthly charges for people without phone service?

In [84]:
no_phone = telco[telco.has_phone == False]

In [85]:
x = no_phone.tenure_month
y = no_phone.monthly_charges

In [86]:
r, p = stats.pearsonr(x, y)
r, p

(0.5917977775067615, 1.7360392790538536e-65)

### Does tenure correlate with total charges for people without phone service?

In [87]:
y = no_phone.total_charges

In [88]:
r, p = stats.pearsonr(x, y)
r, p

(0.9542614300389016, 0.0)

### Controlling for internet service

In [89]:
has_internet = telco[telco.has_internet == True]

### Does tenure correlate with monthly charges for people with internet?

In [90]:
x = has_internet.tenure_month
y = has_internet.monthly_charges

In [91]:
r, p = stats.pearsonr(x, y)
r, p

(0.3728532182671543, 2.69748010351122e-181)

### Does tenure correlate with total charges for people with internet?

In [92]:
y = has_internet.total_charges

In [93]:
r, p = stats.pearsonr(x,y)
r, p

(0.932687995564361, 0.0)

### Does tenure correlate with monthly charges for people without internet?

In [94]:
without_internet = telco[telco.has_internet == False]

In [95]:
x = without_internet.tenure_month
y = without_internet.monthly_charges

In [96]:
r, p = stats.pearsonr(x, y)
r, p

(0.346384114942396, 3.8275378977119957e-44)

### Does tenure correlate with total charges for people without internet?

In [97]:
y = without_internet.total_charges

In [98]:
r, p = stats.pearsonr(x,y)
r, p

(0.984290673908761, 0.0)
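The subgroup correlations above repeat the same steps for each value of a control variable. One way to condense them is a small helper that loops over the groups, sketched below. `correlation_by_group` is a hypothetical function, and `demo` is a synthetic DataFrame that only borrows the telco column names.

```python
import numpy as np
import pandas as pd
from scipy import stats


def correlation_by_group(df, group_col, x_col, y_col):
    """Return Pearson's r and p-value of x_col vs. y_col within each group."""
    rows = []
    for value, subset in df.groupby(group_col):
        r, p = stats.pearsonr(subset[x_col], subset[y_col])
        rows.append({group_col: value, "n": len(subset), "r": r, "p": p})
    return pd.DataFrame(rows)


rng = np.random.default_rng(3)
n = 300

# Synthetic stand-in data (NOT the telco dataset).
demo = pd.DataFrame({
    "has_phone": rng.random(n) < 0.9,
    "tenure_month": rng.uniform(1, 72, size=n),
})
demo["monthly_charges"] = 40 + 0.2 * demo.tenure_month + rng.normal(0, 10, size=n)

summary = correlation_by_group(demo, "has_phone", "tenure_month", "monthly_charges")
print(summary)
```

With the real data, the same call could be repeated with `"has_internet"` as the grouping column.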

### Use the employees database.

In [64]:
from env import user, password, host

employees_url = f'mysql+pymysql://{user}:{password}@{host}/employees'

employees_query = """
SELECT employees.emp_no, IF(dept_emp_latest_date.to_date >= CURDATE(), TIMESTAMPDIFF(year, hire_date, CURDATE()), TIMESTAMPDIFF(year, hire_date, dept_emp_latest_date.to_date)) AS years_tenure, salary AS final_salary 
FROM employees JOIN dept_emp_latest_date ON dept_emp_latest_date.emp_no = employees.emp_no 
JOIN salaries ON salaries.emp_no = employees.emp_no 
WHERE salaries.to_date = dept_emp_latest_date.to_date
"""


In [65]:
employees = pd.read_sql(employees_query, employees_url)
employees

Unnamed: 0,emp_no,years_tenure,final_salary
0,10001,35,88958
1,10002,35,72527
2,10003,34,43311
3,10004,34,74057
4,10005,31,94692
...,...,...,...
300174,499995,28,52868
300175,499996,30,69501
300176,499997,35,83441
300177,499998,27,55003


### Is there a relationship between how long an employee has been with the company and their salary?

In [66]:
x = employees.years_tenure
y = employees.final_salary

In [67]:
r, p = stats.pearsonr(x, y)
r, p

(0.32481783990896573, 0.0)

### Is there a relationship between how long an employee has been with the company and the number of titles they have had?


In [72]:
employees_query = """
SELECT titles.emp_no, COUNT(DISTINCT(title)) AS num_titles, IF(dept_emp_latest_date.to_date >= CURDATE(), TIMESTAMPDIFF(year, hire_date, CURDATE()), TIMESTAMPDIFF(year, hire_date, dept_emp_latest_date.to_date)) AS years_tenure
FROM titles
JOIN dept_emp_latest_date ON dept_emp_latest_date.emp_no = titles.emp_no
JOIN employees ON employees.emp_no = titles.emp_no
GROUP BY emp_no;
"""
employees_titles = pd.read_sql(employees_query, employees_url)
employees_titles

Unnamed: 0,emp_no,num_titles,years_tenure
0,10001,1,35
1,10002,1,35
2,10003,1,34
3,10004,2,34
4,10005,2,31
...,...,...,...
300019,499995,1,28
300020,499996,2,30
300021,499997,2,35
300022,499998,2,27


In [73]:
x = employees_titles.years_tenure
y = employees_titles.num_titles

In [74]:
r, p = stats.pearsonr(x, y)
r, p

(0.3510678732106849, 0.0)
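The number of titles is a small integer count, so its relationship with tenure may not be linear. Spearman's rank correlation (`stats.spearmanr`) is a different technique from the Pearson test used above: it only assumes a monotonic relationship. The sketch below compares the two on synthetic stand-in data, not on the employees database.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Synthetic stand-ins (NOT the employees data): longer tenure makes
# additional titles more likely.
years_tenure = rng.integers(1, 36, size=500)
num_titles = 1 + rng.binomial(2, np.clip(years_tenure / 50, 0, 1))

# Spearman works on ranks; Pearson works on the raw values.
rho, p_spearman = stats.spearmanr(years_tenure, num_titles)
r, p_pearson = stats.pearsonr(years_tenure, num_titles)
print(rho, p_spearman)
print(r, p_pearson)
```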

### Use the sleepstudy data. Is there a relationship between days and reaction time?

In [75]:
sleepstudy = data('sleepstudy')
sleepstudy

Unnamed: 0,Reaction,Days,Subject
1,249.5600,0,308
2,258.7047,1,308
3,250.8006,2,308
4,321.4398,3,308
5,356.8519,4,308
...,...,...,...
176,329.6076,5,372
177,334.4818,6,372
178,343.2199,7,372
179,369.1417,8,372


In [76]:
x = sleepstudy.Days
y = sleepstudy.Reaction

In [77]:
r, p = stats.pearsonr(x, y)
r, p

(0.5352302262650253, 9.894096322214812e-15)
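The correlation says the relationship is positive but not how much reaction time changes per day. As a rough sketch, `stats.linregress` fits a straight line and reports the slope. The example uses only the five rows for subject 308 shown in the table above, so it illustrates the call and is not an estimate for the whole study. Note that the full dataset has repeated measurements per subject, which a simple regression does not account for.

```python
from scipy import stats

# The five rows for subject 308 shown in the sleepstudy output above.
days = [0, 1, 2, 3, 4]
reaction = [249.5600, 258.7047, 250.8006, 321.4398, 356.8519]

# Least-squares line: reaction = intercept + slope * days.
fit = stats.linregress(days, reaction)
print(fit.slope, fit.intercept, fit.rvalue, fit.pvalue)
```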