
## Hypothesis Tests
* One purpose of hypothesis testing is to determine whether a sample statistic is close to or far from a hypothesized value.
* `z-score = (sample_stat - hypothesized_value) / std_error`
* The z-score is a standardized measure of the difference between the sample statistic and the hypothesized value.
* A hypothesis is a statement about an unknown population parameter.
  * Null hypothesis (H0): the existing status quo.
  * Alternative hypothesis (HA): the challenger.
* Hypothesis tests check whether the **sample statistic** lies in the tails of the null distribution.
* The p-value is the probability of obtaining a result at least as extreme as the observed one, assuming the null hypothesis is true.
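As a quick illustration of the z-score and p-value bullets above, here is a minimal sketch of a right-tailed test. The helper name `right_tailed_p_value` is hypothetical; the inputs are the sample proportion, hypothesized value, and bootstrap standard error reported in this notebook's late-shipments example.

```python
from scipy.stats import norm


def right_tailed_p_value(sample_stat, hypothesized_value, std_error):
    """Return (z_score, p_value) for a right-tailed z-test."""
    z_score = (sample_stat - hypothesized_value) / std_error
    # Probability of a z-score at least this large if the null hypothesis is true
    p_value = 1 - norm.cdf(z_score)
    return z_score, p_value


# Values taken from the late-shipments cells in this notebook
z_score, p_value = right_tailed_p_value(0.061, 0.06, 0.007875256418915112)
print(z_score, p_value)
```

For a left-tailed alternative the p-value would be `norm.cdf(z_score)` instead.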


In [1]:
import pandas as pd
from scipy.stats import norm
import numpy as np

In [2]:
late=pd.read_feather('https://assets.datacamp.com/production/repositories/5982/datasets/887ec4bc2bcfd4195e7d3ad113168555f36d3afa/late_shipments.feather')

In [4]:
late.head()

Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2996.0,266644.0,89.0,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83
1,30998.0,Botswana,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,25.0,800.0,32.0,1.6,"Trinity Biotech, Plc",Yes,10.0,559.89,reasonable,1.72
2,69871.0,Vietnam,PMO - US,Direct Drop,EXW,Air,0.0,No,ARV,Adult,...,22925.0,110040.0,4.8,0.08,Hetero Unit III Hyderabad IN,Yes,3723.0,19056.13,expensive,181.57
3,17648.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,152535.0,361507.95,2.37,0.04,"Aurobindo Unit III, India",Yes,7698.0,11372.23,expensive,779.41
4,5647.0,Uganda,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test - Ancillary,...,850.0,8.5,0.01,0.0,Inverness Japan,Yes,56.0,360.0,reasonable,0.01


State the hypotheses.
1. H0: The proportion of late shipments is 6%.
2. HA: The proportion of late shipments is greater than 6%.

In [5]:
# Calculate the sample statistic
sample_stat=(late.late=='Yes').mean()
sample_stat

0.061

In [6]:
# Now create a bootstrap distribution of the late proportion
late_prop=np.empty(1000)
for i in range(1000):
  boot=late.sample(frac=1,replace=True)
  stat=(boot.late=='Yes').mean()
  late_prop[i]=stat


In [7]:
std_error=late_prop.std(ddof=1)
std_error

0.007875256418915112
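The bootstrap loop above can be wrapped in a reusable function. The following is a sketch under stated assumptions: `bootstrap_std_error` is a hypothetical helper, and the 0/1 array is a stand-in built from the 61 late and 939 on-time counts printed later in this notebook, not the DataFrame itself. For a proportion, the result can be compared against the textbook formula sqrt(p(1-p)/n).

```python
import numpy as np


def bootstrap_std_error(values, stat_fn=np.mean, n_boot=1000, seed=0):
    """Estimate the standard error of stat_fn(values) by bootstrapping."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    reps = np.empty(n_boot)
    for i in range(n_boot):
        # Resample the data with replacement, keeping the original sample size
        resample = rng.choice(values, size=len(values), replace=True)
        reps[i] = stat_fn(resample)
    return reps.std(ddof=1)


# Stand-in data: 1 = late, 0 = on time (counts assumed from this notebook)
is_late = np.array([1] * 61 + [0] * 939)
boot_se = bootstrap_std_error(is_late)

# Analytic standard error of a sample proportion, for comparison
p_hat = is_late.mean()
analytic_se = np.sqrt(p_hat * (1 - p_hat) / len(is_late))
print(boot_se, analytic_se)
```

The two estimates should agree closely but not exactly, since the bootstrap value depends on the random resamples.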

In [9]:
# Calculate the z score
z_score=(sample_stat-0.06)/std_error

In [10]:
# Calculate the p-value
1-norm.cdf(z_score)

0.4494781170626434

* A large p-value (here about 0.45, well above a typical significance level such as 0.05) means we fail to reject the null hypothesis.



---

Two types of error

---
1. Chose H0 when HA is actually true: false negative (Type II error)
2. Chose HA when H0 is actually true: false positive (Type I error)
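To make the Type I error concrete, the sketch below simulates many samples from a population where H0 is true and counts how often a right-tailed z-test rejects it. All settings (seed, number of simulations, sample size, the standard normal population) are illustrative assumptions. The false positive rate should land near the chosen alpha.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
alpha = 0.05
n_sims, n_obs = 10_000, 50

# Draw every sample from a population where H0 (mean = 0, sd = 1) is true
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_obs))

# z-score of each sample mean; the standard error is 1 / sqrt(n_obs)
z_scores = samples.mean(axis=1) / (1.0 / np.sqrt(n_obs))
p_values = 1 - norm.cdf(z_scores)

# Every rejection here is a false positive, because H0 is true by construction
type_1_rate = (p_values <= alpha).mean()
print(type_1_rate)
```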



## Hypothesis test using t stat and bootstrap
* Using a t-statistic can be less computationally intensive than bootstrapping and gives similar results for the hypothesis test.
* We want to test whether late packages are heavier than on-time packages, and whether the difference is statistically significant.



* H0: No difference in mean weight between late and on-time packages
* HA: Late packages are heavier on average
* Alpha: 0.05


In [16]:
# First, let's see whether the mean weight differs between late and on-time packages
means=late.groupby('late')['weight_kilograms'].mean()


In [19]:
# Calculate the test statistic: mean weight of late ('Yes') minus on-time ('No') shipments
test_statistic=means['Yes']-means['No']

In [20]:
# Perform a bootstrap to estimate the standard error of the difference in means
rep=np.empty(1000)
for i in range(1000):
  samp=late.sample(frac=1,replace=True)
  boot_means=samp.groupby('late')['weight_kilograms'].mean()
  rep[i]=boot_means['Yes']-boot_means['No']

In [22]:
std_error=np.std(rep)

In [24]:
z_score=(test_statistic-0)/std_error
z_score

2.4757003933587827

In [28]:
from scipy.stats import norm
p_val=1-norm.cdf(z_score,loc=0,scale=1)
p_val

0.006648755708450804

In [27]:
alpha=0.05
p_val<=alpha

True

* In this case the p-value is less than alpha, so we reject the null hypothesis.
* In other words, the data support the claim that late packages are heavier on average.
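The same bootstrap z-test can be sketched with NumPy alone. The data below are synthetic stand-ins (exponentially distributed weights with made-up means), not the shipments dataset; only the group sizes of 61 and 939 are borrowed from this notebook. Rows are resampled as pairs so each weight keeps its late/on-time label.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the shipments dataset)
is_late = np.array([True] * 61 + [False] * 939)
weight = np.where(is_late, rng.exponential(2500, 1000), rng.exponential(1000, 1000))


def diff_in_means(w, flag):
    """Mean weight of flagged (late) rows minus mean weight of the rest."""
    return w[flag].mean() - w[~flag].mean()


observed = diff_in_means(weight, is_late)

reps = np.empty(1000)
for i in range(1000):
    # Resample row indices with replacement so weights and labels stay paired
    idx = rng.integers(0, len(weight), size=len(weight))
    reps[i] = diff_in_means(weight[idx], is_late[idx])

std_error = reps.std(ddof=1)
z_score = (observed - 0) / std_error  # hypothesized difference is 0
p_val = 1 - norm.cdf(z_score)  # right-tailed: late packages heavier
print(z_score, p_val)
```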

### Use t-statistic for the same result

In [29]:
count=late.late.value_counts()

In [35]:
n_yes=late.loc[late.late=='Yes','id'].count()
n_no=late.loc[late.late=='No','id'].count()
print(n_yes,n_no)

61 939


In [37]:
std_yes=late.loc[late.late=='Yes','weight_kilograms'].std()
std_no=late.loc[late.late=='No','weight_kilograms'].std()
print(std_yes,std_no)

2544.688210903328 3154.0395070841696


In [38]:
denominator=np.sqrt((std_yes**2/n_yes)+(std_no**2/n_no))
denominator

341.68543274794337

In [40]:
tstat=test_statistic/denominator
tstat

2.3936661778766433

In [44]:
from scipy.stats import t
pval=1-t.cdf(tstat,df=n_yes+n_no-2)

In [45]:
pval<=alpha

True
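The t-statistic steps above only need summary statistics, so they can be collected into two small helpers. This is a sketch: the function names are hypothetical, the inputs are the standard deviations, group sizes, and t-statistic printed in the cells above, and the degrees of freedom follow this notebook's `n_yes+n_no-2` convention.

```python
import numpy as np
from scipy.stats import t


def unpooled_std_error(std_a, n_a, std_b, n_b):
    """Standard error of a difference in means from per-group std and size."""
    return np.sqrt(std_a**2 / n_a + std_b**2 / n_b)


def right_tailed_t_p_value(t_stat, n_a, n_b):
    """Right-tailed p-value using df = n_a + n_b - 2, as in this notebook."""
    return 1 - t.cdf(t_stat, df=n_a + n_b - 2)


# Values printed in the cells above
se = unpooled_std_error(2544.688210903328, 61, 3154.0395070841696, 939)
p = right_tailed_t_p_value(2.3936661778766433, 61, 939)
print(se, p, p <= 0.05)
```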

#### Running a similar test for salary
* H0: The difference in mean salary is 0
* HA: The salaries of people who started coding as children are higher

In [46]:
stack=pd.read_feather('https://assets.datacamp.com/production/repositories/5982/datasets/c59033b93930652f402e30db77c3b8ef713dd701/stack_overflow.feather')

In [47]:
stack.sample(5)

Unnamed: 0,respondent,main_branch,hobbyist,age,age_1st_code,age_first_code_cut,comp_freq,comp_total,converted_comp,country,...,survey_length,trans,undergrad_major,webframe_desire_next_year,webframe_worked_with,welcome_change,work_week_hrs,years_code,years_code_pro,age_cat
395,9190.0,I am a developer by profession,Yes,23.0,18.0,adult,Monthly,100000.0,7788.0,Pakistan,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Angular;Django;Express;Flask;React.js;Ruby on ...,Django;Express;Flask;React.js,Just as welcome now as I felt last year,40.0,6.0,2.0,Under 30
2226,61860.0,I am a developer by profession,Yes,30.0,13.0,child,Yearly,165000.0,165000.0,United States,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Vue.js,ASP.NET;ASP.NET Core;Django;Express;Flask;Reac...,,30.0,17.0,7.0,Under 30
1766,48715.0,I am a developer by profession,No,37.0,16.0,adult,Monthly,348000.0,27084.0,Pakistan,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Angular.js;React.js;Ruby on Rails;Spring,jQuery;Spring,Somewhat more welcome now than last year,45.0,20.0,15.0,At least 30
1159,26985.0,I am a developer by profession,Yes,23.0,13.0,child,Yearly,60000.0,77556.0,United Kingdom,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Django;Flask;React.js,Angular;Django;Flask;React.js,Just as welcome now as I felt last year,43.0,11.0,6.0,Under 30
223,4577.0,I am a developer by profession,Yes,16.0,5.0,child,Weekly,1000.0,37800.0,Canada,...,Appropriate in length,No,,Express,Flask,Just as welcome now as I felt last year,80.0,11.0,3.0,Under 30




In [50]:
# Mean salary for coders who started as children vs. as adults
means=stack.groupby('age_first_code_cut')['converted_comp'].mean()

In [53]:
# Test statistic: mean salary of 'child' starters minus 'adult' starters
test_stat=means['child']-means['adult']

In [54]:
counts=stack.groupby('age_first_code_cut')['converted_comp'].count()

In [56]:
n_adult=counts['adult']
n_child=counts['child']

In [57]:
std=stack.groupby('age_first_code_cut')['converted_comp'].std()

In [59]:
std_a=std['adult']
std_c=std['child']

In [61]:
denominator=np.sqrt(std_a**2/n_adult+std_c**2/n_child)

In [65]:
t_stat=test_stat/denominator
t_stat

1.8699313316221844

In [66]:
1-t.cdf(t_stat,df=n_adult+n_child-2)

0.030811302165157595
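A hand-computed t-statistic like the one above can be cross-checked against `scipy.stats.ttest_ind`. The sketch below uses synthetic, skewed salary-like values (not the Stack Overflow data; the lognormal parameters and group sizes are made up). With `equal_var=False` SciPy uses the same unpooled standard error, so the statistics should match.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic stand-in salaries (NOT the Stack Overflow data)
child = rng.lognormal(mean=11.2, sigma=0.8, size=400)
adult = rng.lognormal(mean=11.0, sigma=0.8, size=600)

# Manual t-statistic with the unpooled standard error, as in the cells above
manual_t = (child.mean() - adult.mean()) / np.sqrt(
    child.var(ddof=1) / len(child) + adult.var(ddof=1) / len(adult)
)

# Welch's t-test, right-tailed: is the 'child' mean greater than the 'adult' mean?
res = ttest_ind(child, adult, equal_var=False, alternative="greater")
print(manual_t, res.statistic, res.pvalue)
```

SciPy's Welch test uses the Welch-Satterthwaite degrees of freedom rather than `n_adult+n_child-2`, so its p-value can differ slightly from one computed with the simpler degrees of freedom.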

In [71]:
!pip install pingouin


In [72]:
import pingouin