---
title: "Proportion Tests"
format:
  html:
    code-fold: true
jupyter: python3
author: "kakamana"
date: "2023-01-19"
categories: [python, datacamp, statistics, machine learning, hypothesis ]
image: "proportionTest.jpg"

---

# Proportion Tests

Now it’s time to test for differences in proportions between two groups using proportion tests. Through hands-on exercises, you’ll extend your proportion tests to more than two groups with chi-square independence tests, and return to the one-sample case with chi-square goodness-of-fit tests.

This **Proportion Tests** chapter is part of the [DataCamp course: Hypothesis Testing in Python](https://app.datacamp.com/learn/courses/hypothesis-testing-in-python).

These are my notes from learning data science through DataCamp.

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import norm

In [2]:
late_shipments= pd.read_feather('dataset/late_shipments.feather')
late_shipments.head()

| Unnamed: 0 | id | country | managed_by | fulfill_via | vendor_inco_term | shipment_mode | late_delivery | late | product_group | sub_classification | ... | line_item_quantity | line_item_value | pack_price | unit_price | manufacturing_site | first_line_designation | weight_kilograms | freight_cost_usd | freight_cost_groups | line_item_insurance_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36203.0 | Nigeria | PMO - US | Direct Drop | EXW | Air | 1.0 | Yes | HRDT | HIV test | ... | 2996.0 | 266644.0 | 89.0 | 0.89 | Alere Medical Co., Ltd. | Yes | 1426.0 | 33279.83 | expensive | 373.83 |
| 1 | 30998.0 | Botswana | PMO - US | Direct Drop | EXW | Air | 0.0 | No | HRDT | HIV test | ... | 25.0 | 800.0 | 32.0 | 1.6 | Trinity Biotech, Plc | Yes | 10.0 | 559.89 | reasonable | 1.72 |
| 2 | 69871.0 | Vietnam | PMO - US | Direct Drop | EXW | Air | 0.0 | No | ARV | Adult | ... | 22925.0 | 110040.0 | 4.8 | 0.08 | Hetero Unit III Hyderabad IN | Yes | 3723.0 | 19056.13 | expensive | 181.57 |
| 3 | 17648.0 | South Africa | PMO - US | Direct Drop | DDP | Ocean | 0.0 | No | ARV | Adult | ... | 152535.0 | 361507.95 | 2.37 | 0.04 | Aurobindo Unit III, India | Yes | 7698.0 | 11372.23 | expensive | 779.41 |
| 4 | 5647.0 | Uganda | PMO - US | Direct Drop | EXW | Air | 0.0 | No | HRDT | HIV test - Ancillary | ... | 850.0 | 8.5 | 0.01 | 0.0 | Inverness Japan | Yes | 56.0 | 360.0 | reasonable | 0.01 |


### Standardized test statistic for proportions

* $p$: population proportion (unknown population parameter)
* $\hat{p}$: sample proportion (sample statistic)
* $p_{0}$: hypothesized population proportion

$z = \frac{\hat{p} - \text{mean}(\hat{p})}{SE(\hat{p})} = \frac{\hat{p} - p}{SE(\hat{p})}$

Assuming $H_{0}$ is true, $p = p_{0}$, so

$z = \frac{\hat{p} - p_{0}}{SE(\hat{p})}$

### Simplifying the standard error calculations

$SE(\hat{p}) = \sqrt{\frac{p_{0} (1 - p_{0})}{n}}$

Under $H_{0}$, $SE(\hat{p})$ depends only on the hypothesized $p_{0}$ and the sample size $n$.

Assuming $H_{0}$ is true, substituting this standard error into the test statistic gives

$z = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_{0} (1 - p_{0})}{n}}}$

* This only uses sample information ($\hat{p}$ and $n$) and the hypothesized parameter ($p_{0}$)
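As a minimal sketch, the formula above can be wrapped in a small helper that also handles left-tailed and two-sided alternatives. The function name `one_sample_prop_ztest`, its `alternative` argument, and the example numbers (55% observed in a sample of 400 against a hypothesized 50%) are illustrative assumptions, not part of the course material.

```python
import numpy as np
from scipy.stats import norm


def one_sample_prop_ztest(p_hat, p_0, n, alternative="greater"):
    """Hypothetical helper: z-score and p-value for a one-sample proportion test."""
    # Standard error under H0 uses the hypothesized proportion, not p_hat
    std_error = np.sqrt(p_0 * (1 - p_0) / n)
    z_score = (p_hat - p_0) / std_error

    if alternative == "greater":        # right-tailed
        p_value = norm.sf(z_score)
    elif alternative == "less":         # left-tailed
        p_value = norm.cdf(z_score)
    elif alternative == "two-sided":    # both tails
        p_value = 2 * norm.sf(abs(z_score))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z_score, p_value


# Made-up numbers, for illustration only
z_score, p_value = one_sample_prop_ztest(p_hat=0.55, p_0=0.50, n=400)
print(z_score, p_value)
```

Here `norm.sf(z)` is the survival function, which is equivalent to `1 - norm.cdf(z)`.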


### Why z instead of t?

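For comparison, the test statistic for a difference in means is

$t = \frac{\bar{x}_{child} - \bar{x}_{adult}}{\sqrt{\frac{s^2_{child}}{n_{child}} + \frac{s^2_{adult}}{n_{adult}}}}$


* $s$ is calculated from $\bar{x}$
    * $\bar{x}$ estimates the population mean
    * $s$ estimates the population standard deviation
    * Estimating both from the same sample adds uncertainty to the test statistic
* The t-distribution has fatter tails than a normal distribution, which accounts for that extra uncertainty
* $\hat{p}$ only appears in the numerator (the standard error is computed from $p_{0}$, not from the sample), so z-scores are fine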
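To make the "fatter tails" point concrete, the sketch below compares the upper-tail probability beyond the same cutoff under a standard normal and under t-distributions. The cutoff of 2.0 and the degrees of freedom are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

# Upper-tail probability beyond the same cutoff under each distribution
cutoff = 2.0
normal_tail = norm.sf(cutoff)

# Degrees of freedom chosen arbitrarily for illustration
t_tails = {dof: t.sf(cutoff, df=dof) for dof in (5, 30, 1000)}

for dof, t_tail in t_tails.items():
    print(f"df={dof}: t tail = {t_tail:.4f}, normal tail = {normal_tail:.4f}")
```

The t tail is always the larger of the two, and it shrinks toward the normal tail as the degrees of freedom grow.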

In [5]:
# Hypothesize that the proportion of late shipments is 6%
p_0 = 0.06

# Calculate the sample proportion of late shipments
p_hat = (late_shipments['late'] == "Yes").mean()
print(p_hat)

# Calculate the sample size
n = len(late_shipments)

# Calculate the numerator and denominator of the test statistic
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)

# Calculate the test statistic
z_score = numerator / denominator
print(z_score)

# Calculate the p-value from the z-score (right-tailed test)
p_value = 1 - norm.cdf(z_score)

# Print the p-value
print(p_value)
print("\nWhile bootstrapping can be used to estimate the standard error of any statistic, it is computationally intensive. For proportions, using a simple equation of the hypothesized proportion and sample size is easier to compute.")

0.061
0.13315591032282698
0.44703503936503364

While bootstrapping can be used to estimate the standard error of any statistic, it is computationally intensive. For proportions, using a simple equation of the hypothesized proportion and sample size is easier to compute.
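The sketch below puts the two approaches side by side. It does not load `late_shipments`; instead it uses a synthetic stand-in with 61 late shipments out of 1000, where the sample size of 1000 is an assumption chosen so that the proportion matches the 0.061 printed above. The number of bootstrap replicates and the random seed are also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the "late" column: 61 late out of 1000 (assumed size)
is_late = np.array([True] * 61 + [False] * 939)
n = len(is_late)
p_0 = 0.06

# Bootstrap: resample with replacement and record each sample proportion
n_replicates = 5000
boot_props = np.array([
    rng.choice(is_late, size=n, replace=True).mean()
    for _ in range(n_replicates)
])
se_bootstrap = boot_props.std(ddof=1)

# Closed-form standard error under the null hypothesis
se_formula = np.sqrt(p_0 * (1 - p_0) / n)

print(se_bootstrap, se_formula)
```

The two values should be close but not identical: the bootstrap resamples around the observed proportion, while the formula plugs in the hypothesized $p_{0}$. The formula needs a single line of arithmetic, whereas the bootstrap needs thousands of resamples.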
