# Stats Overview

For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

1. Has the network latency gone up since we switched internet service providers?
2. Is the website redesign any good?
3. Is our television ad driving more sales?

1.  
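    - H0 = Mean network latency with the new ISP is less than or equal to the mean latency with the old ISP
    - H1 = Mean network latency with the new ISP is higher than with the old ISP
    - 
    - TP = We conclude latency went up, and the new ISP really does have higher latency
    - TN = We conclude latency did not go up, and the new ISP really has equal or lower latency
    - Type 1 error = We conclude latency went up, but the new ISP actually has equal or lower latency
    - Type 2 error = We conclude latency did not go up, but the new ISP actually has higher latency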

2.  
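    - H0 = Website traffic and clicks did not increase after the redesign
    - H1 = Website traffic and clicks increased after the redesign
    - 
    - TP = We conclude traffic and clicks increased, and the redesign really did increase them
    - TN = We conclude traffic and clicks did not increase, and they really did not change
    - Type 1 error = We conclude the redesign is an improvement because traffic and clicks rose, but the rise came from a worse site, e.g. extra menus or hard-to-read pages forcing visitors to click around more
    - Type 2 error = We conclude the redesign made no improvement because traffic did not rise, but the site is actually better: visitors find what they need in fewer clicks, which indirectly reduces traffic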

3.  
    - H0 = The TV ad had no effect on sales
    - H1 = The TV ad had a positive effect on sales
    - 
    - TP = Since implementing the TV ad, sales have increased, and we conclude the ad had a positive effect
    - TN = Since implementing the TV ad, sales did not change, and we conclude the ad had no effect
    - Type 1 error = We credit the TV ad for an increase in sales that was actually caused by other factors, e.g. a booming economy or salespeople working harder
    - Type 2 error = We conclude the TV ad had no effect because sales stayed flat or decreased, when the ad did help and other factors, e.g. an economic recession or high interest rates, held sales down
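
One way to make the type I error concrete is to simulate it. The sketch below is only an illustration for the latency question: the latency distribution (mean 50, standard deviation 10), the sample size, and the number of trials are made-up assumptions. Both samples come from the same distribution, so every rejection is a false positive, and the observed rate should land near the chosen significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
alpha = 0.05
n_trials = 2000

false_positives = 0
for _ in range(n_trials):
    # Both "ISPs" draw latency from the same distribution, so H0 is true.
    old_isp = rng.normal(loc=50, scale=10, size=40)
    new_isp = rng.normal(loc=50, scale=10, size=40)
    t, p = stats.ttest_ind(new_isp, old_isp)
    # One-tailed decision: reject only if latency appears to have gone up.
    if t > 0 and (p / 2) < alpha:
        false_positives += 1

type_1_rate = false_positives / n_trials
print(f"Observed type I error rate: {type_1_rate:.3f}")
```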

# T-test

In [1]:
from pydataset import data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy import stats

1. 
Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance.

In [2]:
# (mean days to sell, standard deviation, sample size) for each office
office_one = 90, 15, 40
office_two = 100, 20, 50

In [3]:
null_hyp = "No significant change in time to sell homes between both offices"
alt_hyp = "There is a significant change in time it takes to sell a home between both offices "

confid = .95
a = 1 - confid

In [4]:
t, p = stats.ttest_ind_from_stats(*office_one, *office_two)
t, p

(-2.6252287036468456, 0.01020985244923939)

In [5]:
if p < a:
    print(alt_hyp)
else:
    print(null_hyp)

There is a significant change in time it takes to sell a home between both offices 
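
As a cross-check on the result above, the t statistic can be rebuilt by hand from the summary statistics. This is a sketch, not part of the original exercise: it assumes the pooled (equal-variance) formula, which is SciPy's default, and then reruns the test with `equal_var=False` (Welch's t-test), since the two standard deviations differ.

```python
import math
from scipy import stats

# Summary statistics from the problem statement: (mean, std dev, sample size)
mean1, sd1, n1 = 90, 15, 40
mean2, sd2, n2 = 100, 20, 50

# Pooled variance and standard error (equal-variance Student's t-test)
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_manual = (mean1 - mean2) / std_err

# SciPy's default (equal_var=True) should agree with the manual value
t_pooled, p_pooled = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind_from_stats(
    mean1, sd1, n1, mean2, sd2, n2, equal_var=False
)

print(t_manual, t_pooled, p_pooled)
print(t_welch, p_welch)
```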


Load the mpg dataset and use it to answer the following questions:



In [6]:
mpg = data("mpg")
mpg.head(2)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact


In [7]:
mpg['avg_mpg'] = (mpg.hwy + mpg.cty) / 2
mpg.head(2)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,avg_mpg
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,25.0


- Is there a difference in fuel-efficiency in cars from 2008 vs 1999?


In [8]:
new_cars = mpg[mpg.year == 2008]
old_cars = mpg[mpg.year == 1999]

null_hyp = "There is no change in fuel-efficiency in 2008 and 1999"
alt_hyp = "There is a significant difference in fuel-efficiency in 2008 vs. 1999"

In [9]:
confid = .95
a = 1 - confid

In [10]:
t, p = stats.ttest_ind(new_cars.avg_mpg, old_cars.avg_mpg)
t, p

(-0.21960177245940962, 0.8263744040323578)

In [11]:
if p < a:
    print(alt_hyp)
else:
    print(null_hyp)

There is no change in fuel-efficiency in 2008 and 1999


- Are compact cars more fuel-efficient than the average car?


In [12]:
def compact(x):
    if x == 'compact':
        return 1
    else:
        return 0

In [13]:
mpg['is_compact'] = mpg['class'].apply(compact)

In [14]:
compact_car = mpg[mpg.is_compact == 1]
non_compact = mpg[mpg.is_compact == 0]

null_hyp = "There is no change in fuel-efficiency whether it is classified as a compact car or not."
alt_hyp = "Compact cars have a higher fuel efficiency than other cars."

confid = .95
a = 1 - confid

In [15]:
t, p = stats.ttest_ind(compact_car.avg_mpg, mpg.avg_mpg) 
t, p

(5.260311926248542, 2.8684546158129373e-07)

In [16]:
if (p/2) < a and t > 0:
    print(alt_hyp)
else:
    print(null_hyp)

Compact cars have a higher fuel efficiency than other cars.
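
The cell above runs a two-sample test of compact cars against the whole dataset, which itself contains the compact cars. Another reading of "the average car" is a one-sample test against the overall mean. The helper below is a hypothetical sketch of that approach; `mean_is_greater` is not from the notebook and the mileage figures are made up. Recent SciPy versions also accept `alternative="greater"`, which avoids halving the p-value by hand.

```python
import numpy as np
from scipy import stats

def mean_is_greater(sample, popmean, alpha=0.05):
    """One-tailed, one-sample t-test: is the sample mean above popmean?"""
    t, p = stats.ttest_1samp(sample, popmean)
    # ttest_1samp reports a two-tailed p-value; halve it and check the sign of t.
    return bool(t > 0 and (p / 2) < alpha)

# Made-up mileage figures, for illustration only (not the mpg data).
all_cars = np.array([14.0, 16.5, 18.0, 19.5, 21.0, 22.5, 24.0, 25.5, 27.0, 28.5])
compacts = np.array([24.0, 25.5, 27.0, 28.5])

print(mean_is_greater(compacts, all_cars.mean()))
```

With the real data the call would look like `mean_is_greater(compact_car.avg_mpg, mpg.avg_mpg.mean())`.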


- Do manual cars get better gas mileage than automatic cars?

In [17]:
mpg.head(1)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,avg_mpg,is_compact
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,23.5,1


In [18]:
def transmission(x):
    if x[0] == "m":
        return 1
    else:
        return 0

In [19]:
mpg['is_manual'] = mpg.trans.apply(transmission)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,avg_mpg,is_compact,is_manual
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,23.5,1,0
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,25.0,1,1
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,25.5,1,1
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,25.5,1,0
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,21.0,1,0


In [20]:
manual = mpg[mpg.is_manual == 1]
automatic = mpg[mpg.is_manual == 0]

null_hyp = "There is no change in fuel-efficiency whether the car has a manual or automatic transmission" 
alt_hyp = "Manual cars have a higher fuel efficiency than automatics."

confid = .95
a = 1 - confid

In [21]:
t, p = stats.ttest_ind(manual.avg_mpg, automatic.avg_mpg)
t, p

(4.593437735750014, 7.154374401145683e-06)

In [22]:
if (p/2) < a and t > 0:
    print(alt_hyp)
else:
    print(null_hyp)

Manual cars have a higher fuel efficiency than automatics.


Q1. Load “Cust_Churn_Telco.csv” data. Using this data answer the following questions:
- Is the mean of monthly charges of customers who churn significantly higher than the mean across all customers?
- Is the mean of monthly charges of customers who churn significantly higher than the mean of those who don't churn?

In [23]:
telco = pd.read_csv("Cust_Churn_Telco - Cust_Churn_Telco.csv")
telco.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


- Is the mean of monthly charges of customers who churn significantly higher than the mean across all customers?

In [24]:
churn = telco[telco.Churn == "Yes"]
stay = telco[telco.Churn == "No"]

null_hyp = "There is no significant difference between the monthly charges of people who churn and the mean across all customers."
alt_hyp = "People who churn have higher monthly charges than the population"

confid = .95
a = 1 - confid

In [25]:
t, p = stats.ttest_1samp(churn.MonthlyCharges, telco.MonthlyCharges.mean())
t, p

(16.965403080505645, 3.7406392993841064e-60)

In [26]:
if (p/2) < a and t > 0:
    print(alt_hyp)
else:
    print(null_hyp)

People who churn have higher monthly charges than the population


- Is the mean of monthly charges of customers who churn significantly higher than the mean of those who don't churn?

In [27]:
null_hyp = "There is no significant difference in the monthly charges of people who churn and people who stay."
alt_hyp = "People who churn have higher monthly charges than people who stay."

confid = .95
a = 1 - confid

In [28]:
t, p = stats.ttest_ind(churn.MonthlyCharges, stay.MonthlyCharges)
t, p

(16.53673801593631, 2.706645606888261e-60)

In [29]:
if (p/2) < a and t > 0:
    print(alt_hyp)
else:
    print(null_hyp)

People who churn have higher monthly charges than people who stay.


Q2. Load Iris dataset from pydataset or sns. Using this data answer the following questions:

In [30]:
iris = data("iris")
iris.sample(5)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
58,4.9,2.4,3.3,1.0,versicolor
139,6.0,3.0,4.8,1.8,virginica
131,7.4,2.8,6.1,1.9,virginica
39,4.4,3.0,1.3,0.2,setosa
83,5.8,2.7,3.9,1.2,versicolor


In [31]:
iris.columns = ['sep_len', 'sep_wid','pet_len','pet_wid','species']
iris.head(2)

Unnamed: 0,sep_len,sep_wid,pet_len,pet_wid,species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa


- Is the sepal length significantly different between Versicolor and Virginica?

In [32]:
versicolor = iris[iris.species == 'versicolor']
virginica = iris[iris.species == 'virginica']
setosa = iris[iris.species == 'setosa']

null_hyp = "There is no significant difference in the sepal length of species versicolor and virginica."
alt_hyp = "There is a significant difference in sepal length of species versicolor and virginica."

confid = .95
a = 1 - confid

In [33]:
t, p = stats.ttest_ind(versicolor.sep_len, virginica.sep_len)
t, p

(-5.629165259719801, 1.7248563024547942e-07)

In [34]:
if p < a:
    print(alt_hyp)
else:
    print(null_hyp) 

There is a significant difference in sepal length of species versicolor and virginica.


- Is the sepal length significantly different between Setosa and Virginica?

In [35]:
null_hyp = "There is no significant difference in the sepal length of species setosa and virginica."
alt_hyp = "There is a significant difference in sepal length of species setosa and virginica."

In [36]:
t, p = stats.ttest_ind(setosa.sep_len, virginica.sep_len)
t, p

(-15.386195820079404, 6.892546060674059e-28)

In [37]:
if p < a:
    print(alt_hyp)
else:
    print(null_hyp) 

There is a significant difference in sepal length of species setosa and virginica.


# Correlation

1. Use the telco_churn data. 

In [42]:
# Rename the columns to snake_case; the cells below rely on these names.
telco.columns = ['cust_id','gender','senior_citizen','partner','dependents','tenure','phone_service','multiple_lines','internet_service','online_security','online_backup','device_protection','tech_support','stream_tv','stream_movies','contract','paperless_bill','payment_method','monthly_charges','total_charges','churn']


- Does tenure correlate with monthly charges? Total charges?

In [43]:
x = telco.tenure
y = telco.monthly_charges

corr, p = stats.pearsonr(x, y)
corr, p

(0.2478998562861525, 4.094044991483017e-99)

In [44]:
non_na_telco = telco[telco.total_charges.notna()]

In [45]:
non_na_telco.sample(3)

Unnamed: 0,cust_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,...,device_protection,tech_support,stream_tv,stream_movies,contract,paperless_bill,payment_method,monthly_charges,total_charges,churn
4146,9821-POOTN,Male,0,Yes,No,35,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,75.2,2576.2,Yes
176,2656-FMOKZ,Female,1,No,No,15,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.45,1145.7,Yes
200,9323-HGFWY,Female,0,Yes,No,27,Yes,No,Fiber optic,Yes,...,No,Yes,Yes,Yes,One year,Yes,Credit card (automatic),101.9,2681.15,No


In [46]:
x = non_na_telco.tenure
y = non_na_telco.total_charges

corr, p = stats.pearsonr(x, y)
corr, p

(0.8258804609332093, 0.0)

- What happens if you control for phone and internet service?

In [47]:
telco_phone = non_na_telco[non_na_telco.phone_service=='Yes']

In [48]:
telco_phone_and_internet = telco_phone[telco_phone.internet_service != "No"]


In [49]:
x = telco_phone_and_internet.tenure
y = telco_phone_and_internet.total_charges

corr, p = stats.pearsonr(x, y)
corr, p

(0.957922977802917, 0.0)
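
The cells above look at a single subgroup (customers with both phone and internet service). A fuller way to control for the two services is to compute the correlation inside every combination of them. The sketch below shows that pattern on a synthetic stand-in for the telco data; every value in `demo`, including the per-service rates, is made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the telco data; values are made up for illustration.
demo = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=n),
    "phone_service": rng.choice(["Yes", "No"], size=n),
    "internet_service": rng.choice(["DSL", "Fiber optic", "No"], size=n),
})
base_rate = demo.internet_service.map({"DSL": 50, "Fiber optic": 80, "No": 20})
demo["total_charges"] = demo.tenure * base_rate + rng.normal(0, 100, size=n)

# One Pearson test per (phone_service, internet_service) combination.
rows = []
for (phone, internet), grp in demo.groupby(["phone_service", "internet_service"]):
    r, p = stats.pearsonr(grp.tenure, grp.total_charges)
    rows.append({"phone_service": phone, "internet_service": internet,
                 "n": len(grp), "r": r, "p": p})

by_group = pd.DataFrame(rows)
print(by_group)
```

On the real data the same loop would run over `non_na_telco` instead of `demo`.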

Use the employees database.


- Is there a relationship between how long an employee has been with the company and their salary?

In [138]:
from env import host, user, password

get_db_url = f'mysql+pymysql://{user}:{password}@{host}/employees'
    
query = """
    SELECT * 
    FROM employees
    JOIN salaries USING(emp_no)
    WHERE to_date > curdate()
"""

salaries = pd.read_sql(query, get_db_url)
df = pd.DataFrame(salaries)

In [181]:
df.head()


Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,salary,from_date,to_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26,88958,2002-06-22,9999-01-01
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21,72527,2001-08-02,9999-01-01
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28,43311,2001-12-01,9999-01-01
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01,74057,2001-11-27,9999-01-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12,94692,2001-09-09,9999-01-01
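
The notebook stops at loading the data, so the following is only a sketch of how the tenure question could be approached. It uses the five rows shown in `df.head()` above, which are far too few for a real conclusion, and measures tenure up to an arbitrary reference date (an assumption; any fixed date works for a correlation).

```python
import pandas as pd
from scipy import stats

# The five rows shown in df.head() above; far too few for a real conclusion.
sample = pd.DataFrame({
    "emp_no": [10001, 10002, 10003, 10004, 10005],
    "hire_date": ["1986-06-26", "1985-11-21", "1986-08-28", "1986-12-01", "1989-09-12"],
    "salary": [88958, 72527, 43311, 74057, 94692],
})

# Tenure in days, measured up to an arbitrary fixed reference date (an assumption).
reference_date = pd.Timestamp("2020-01-01")
sample["tenure_days"] = (reference_date - pd.to_datetime(sample.hire_date)).dt.days

corr, p = stats.pearsonr(sample.tenure_days, sample.salary)
print(corr, p)
```

Applied to the full `df`, the same two steps (derive `tenure_days` from `hire_date`, then call `stats.pearsonr`) would answer the question for all current employees.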


- Is there a relationship between how long an employee has been with the company and the number of titles they have had?

# χ<sup>2</sup>

1. Use the following contingency table to help answer the question of whether using a macbook and being a codeup student are independent of each other.

In [187]:
alpha = .05

In [191]:
data = np.array([[49,20],[1,30]])
data

array([[49, 20],
       [ 1, 30]])

In [189]:
chi2, p, degf, expected = stats.chi2_contingency(data)
chi2, p, degf, expected 

(36.65264142122487,
 1.4116760526193828e-09,
 1,
 array([[34.5, 34.5],
        [15.5, 15.5]]))

In [190]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

We reject the null
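
The `expected` array returned above holds the counts we would see if the two variables were independent: each cell is its row total times its column total, divided by the grand total. The sketch below rebuilds that array by hand for the same table and compares it with SciPy's result. The name `observed` is used here in place of `data` only to keep the example self-contained.

```python
import numpy as np
from scipy import stats

observed = np.array([[49, 20], [1, 30]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / grand_total

chi2, p, degf, expected = stats.chi2_contingency(observed)
print(expected_manual)
print(np.allclose(expected_manual, expected))
```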


2. Choose another 2 categorical variables from the mpg dataset and perform a χ<sup>2</sup> contingency table test with them. *Be sure to state your null and alternative hypotheses.*

In [193]:
mpg.nunique()

manufacturer    15
model           38
displ           35
year             2
cyl              4
trans           10
drv              3
cty             21
hwy             27
fl               5
class            7
avg_mpg         40
is_compact       2
is_manual        2
dtype: int64

In [196]:
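# drv and cyl
# H0: number of cylinders (cyl) and drive type (drv) are independent
# H1: cyl and drv are not independent
mpg.cyl.value_counts()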

4    81
6    79
8    70
5     4
Name: cyl, dtype: int64

In [198]:
observed = pd.crosstab(mpg.cyl, mpg.drv)

In [199]:
chi2, p, degf, expected = stats.chi2_contingency(observed)
chi2, p, degf, expected 

(98.13550541481472,
 6.143348809351039e-19,
 6,
 array([[35.65384615, 36.69230769,  8.65384615],
        [ 1.76068376,  1.81196581,  0.42735043],
        [34.77350427, 35.78632479,  8.44017094],
        [30.81196581, 31.70940171,  7.47863248]]))

In [200]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

We reject the null
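
Several expected counts in the output above are small, largely because only 4 cars have 5 cylinders. A common rule of thumb treats the chi-square approximation as less trustworthy when many expected counts fall below 5, so it may be worth counting them before relying on the p-value. The sketch below copies the expected array from the output above rather than recomputing it.

```python
import numpy as np

# Expected counts copied from the chi2_contingency output above (cyl x drv).
expected = np.array([
    [35.65384615, 36.69230769, 8.65384615],
    [1.76068376, 1.81196581, 0.42735043],
    [34.77350427, 35.78632479, 8.44017094],
    [30.81196581, 31.70940171, 7.47863248],
])

small = expected < 5
n_small = int(small.sum())
share_small = small.mean()
print(f"{n_small} of {expected.size} expected counts are below 5 ({share_small:.0%})")
```

One possible follow-up, not done in the notebook, is to drop or merge the 5-cylinder row and rerun the test.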


3. Use the data from the employees database to answer these questions:

    - Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)


In [229]:
query = '''
    SELECT emp_no, gender, dept_name
    FROM departments
    join dept_emp using(dept_no)
    join employees using(emp_no)
    where to_date > curdate();
'''

gender_vs_sales_marketing = pd.read_sql(query, get_db_url)
df = pd.DataFrame(gender_vs_sales_marketing)

In [230]:
df = df.dropna()

In [233]:
df = df[(df.dept_name == 'Marketing')|(df.dept_name == 'Sales')]

In [235]:
data = pd.crosstab(df.dept_name, df.gender)
data

dept_name,F,M
Marketing,5864,8978
Sales,14999,22702

In [236]:
chi2, p, degf, expected = stats.chi2_contingency(data)
chi2, p, degf, expected 

(0.3240332004060638,
 0.5691938610810126,
 1,
 array([[ 5893.2426013,  8948.7573987],
        [14969.7573987, 22731.2426013]]))

In [237]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

we fail to reject the null


- Is an employee's gender independent of whether or not they are or have been a manager?

In [238]:
query = '''
select emp_no, gender
from dept_manager
left join employees using(emp_no)
;
'''

gender_vs_manager = pd.read_sql(query, get_db_url)
was_is_manager = pd.DataFrame(gender_vs_manager)

In [239]:
query = '''
select emp_no, gender
from employees
;
'''

emp_no_gender = pd.read_sql(query, get_db_url)
employees = pd.DataFrame(emp_no_gender)

In [240]:
employees.sample(5)

Unnamed: 0,emp_no,gender
131559,231535,M
296568,496544,M
167162,267138,M
119203,219179,F
149865,249841,F


In [244]:
data = pd.crosstab(was_is_manager.gender, employees.gender)
data

gender,F,M
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,10,0
M,0,14


In [242]:
chi2, p, degf, expected = stats.chi2_contingency(data)
chi2, p, degf, expected 

(20.06204081632653,
 7.497001612880616e-06,
 1,
 array([[4.16666667, 5.83333333],
        [5.83333333, 8.16666667]]))

In [243]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

We reject the null
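
A caveat on the table above: `pd.crosstab` pairs the two Series by row index, so `was_is_manager.gender` against `employees.gender` appears to compare manager row i with employee row i rather than gender against manager status. A sketch of an alternative follows, using tiny made-up frames (`employees_demo` and `managers_demo` are hypothetical stand-ins for the two query results): flag each employee as a manager or not, then cross gender against that flag.

```python
import pandas as pd
from scipy import stats

# Tiny made-up frames standing in for the two query results above.
employees_demo = pd.DataFrame({
    "emp_no": range(1, 13),
    "gender": ["F", "M", "M", "F", "M", "F", "M", "M", "F", "M", "F", "M"],
})
managers_demo = pd.DataFrame({"emp_no": [2, 4, 9]})

# Flag each employee once, then cross gender against the flag.
employees_demo["is_manager"] = employees_demo.emp_no.isin(managers_demo.emp_no)
observed = pd.crosstab(employees_demo.gender, employees_demo.is_manager)
print(observed)

chi2, p, degf, expected = stats.chi2_contingency(observed)
print(chi2, p, degf)
```

With the real data, `employees.emp_no.isin(was_is_manager.emp_no)` would build the same flag.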
