In [1]:
import pandas as pd
import numpy as np
import scipy.stats


# Estimating a Population Proportion with Confidence
Confidence Interval Basics

**Best Estimate ± Margin of Error**

**Best Estimate** = Unbiased Point Estimate

**Margin of Error** = “a Few” Estimated Standard Errors

**“A Few”** = Multiplier from appropriate distribution based on desired confidence level and sample design

95% Confidence Level ↔ 0.05 Significance Level
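As a rough illustration of the “a few” multiplier, the sketch below looks up the standard normal multiplier for several confidence levels. It assumes a large-sample (z-based) setting, and the helper name `z_multiplier` is hypothetical rather than part of this notebook.

```python
from scipy import stats

def z_multiplier(confidence):
    """Two-sided standard normal multiplier for a given confidence level."""
    alpha = 1 - confidence                 # significance level
    return stats.norm.ppf(1 - alpha / 2)   # upper alpha/2 quantile

for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))
```

Higher confidence levels give larger multipliers and therefore wider intervals.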

*For example:*

What proportion of parents report they use a car seat for all travel with toddlers?

**Population** - Parents with a toddler

**Parameter of interest** - Proportion

Construct a 95% Confidence Interval for the population proportion of parents reporting they use a car seat for all travel with their toddler.

A sample of 659 parents with a toddler was taken and asked if they used a car seat for all travel with their toddler. 540 parents responded ‘Yes’ to this question.

In [2]:
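def calculateZ(confidence, n):
    # Two-sided critical value. Despite the name, this is a t quantile with
    # n-1 degrees of freedom; for large n it is very close to the normal z*.
    Z = scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return Z

def calculateCI(phat, n, confidence):
    # Returns the margin of error (multiplier times the estimated standard
    # error), not the interval itself; the caller adds/subtracts it from phat.
    Z = calculateZ(confidence, n)
    ci = Z * (np.sqrt (phat*(1-phat)/n))
    return ci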

confidence = 0.95
n = 659 #Total sample
x = 540 #Total conversion

phat = x/n


ci = calculateCI(phat, n, confidence)
lcb = phat -ci 
ucb = phat +ci

print('Margin of Error:', ci)
print('Lower Confidence Bound:', lcb)
print('Upper Confidence Bound:', ucb)

Margin of Error: 0.02942320021927683
Lower Confidence Bound: 0.7900001685212391
Upper Confidence Bound: 0.8488465689597927
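
The cell above draws its multiplier from a t distribution with n-1 degrees of freedom. For comparison, here is a minimal sketch of the same interval with the standard normal multiplier, which is the usual large-sample choice for a proportion. The function name `proportion_ci` is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

def proportion_ci(x, n, confidence=0.95):
    """Large-sample (normal-approximation) CI for a population proportion."""
    phat = x / n
    z = stats.norm.ppf((1 + confidence) / 2)   # about 1.96 at 95%
    moe = z * np.sqrt(phat * (1 - phat) / n)   # margin of error
    return phat - moe, phat + moe, moe

lower, upper, moe = proportion_ci(540, 659)
print(round(lower, 4), round(upper, 4), round(moe, 4))
```

With n = 659 the two multipliers differ only in the third decimal place, so both versions round to roughly (0.79, 0.85).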


___


# Calculating Margin of Error and Sample Size
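Back to the car seat example, this time using the conservative standard error, which takes the worst case $\hat{p} = 0.5$ and approximates the 95% multiplier by 2:

$\hat{p} \pm 2 \cdot \frac{1}{2\sqrt{n}}$

$= \hat{p} \pm \frac{1}{\sqrt{n}}$

With $\hat{p} \approx 0.82$ and $n = 659$, this gives approximately (0.78, 0.86).

The conservative 95% Margin of Error depends only on the sample size, regardless of $\hat{p}$.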

**Sample Size Determination**

The conservative Margin of Error (MoE) depends only on:

1. Our confidence level (typically 95%), and
2. Our sample size.
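
To make the dependence on sample size concrete, the following sketch evaluates the conservative margin of error for a few sample sizes. It assumes the approximation used above (a multiplier of 2 and the worst case p̂ = 0.5); `conservative_moe` is a hypothetical helper.

```python
import math

def conservative_moe(n, multiplier=2.0):
    """Conservative margin of error: multiplier * 1 / (2 * sqrt(n))."""
    return multiplier / (2 * math.sqrt(n))

for n in (100, 400, 659, 1600):
    print(n, round(conservative_moe(n), 4))
```

Quadrupling the sample size halves the margin of error, which is why small margins of error require large samples.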

Having this, we could answer:

**What sample size would we need to have a 95% (conservative) confidence interval with a Margin of Error of only 3% (0.03)?**
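
One way to answer this is to solve the conservative margin-of-error formula for n. The derivation below is a sketch that uses $z^* \approx 1.96$ for 95% confidence:

$$
MoE = \frac{z^*}{2\sqrt{n}} \;\Longrightarrow\; n = \left(\frac{z^*}{2 \cdot MoE}\right)^2 = \left(\frac{1.96}{2 \times 0.03}\right)^2 \approx 1067.1
$$

Rounding up to the next whole respondent gives n = 1068.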

In [3]:
import math

def calculateZCriticalPoint(confidence):
    alpha = 1-confidence
    Zcrit = scipy.stats.norm.ppf(1-(alpha/2))    
    return Zcrit

def calculateMoE(n, confidence):
    zcrit = calculateZCriticalPoint(confidence)
    MoE = zcrit * (1/(2 * np.sqrt(n)))
    return MoE

def calculateSampleSize(confidence, MoE):
    zcrit = calculateZCriticalPoint(confidence)
    n = math.ceil((zcrit/(2*MoE))**2)  # use the computed critical value rather than a hardcoded 1.96
    return n


confidence = 0.95
MoE = 0.03
n = calculateSampleSize(confidence, MoE)
print('Sample size(n) needed with {0} confidence with {1} Margin of Error: {2}'.format(confidence, MoE, n))

Sample size(n) needed with 0.95 confidence with 0.03 Margin of Error: 1068


In [4]:
calculateMoE(n, confidence)

0.029986961979154566

___


## **Estimating a Difference in Population Proportions with Confidence**

**Research Question:**

What is the difference in population proportions of parents reporting that their children aged 6-18 have had some swimming lessons between white children and black children?

**Population**:

All parents of white children aged 6-18 and all parents of black children aged 6-18

**Parameter of Interest**

Difference in population proportions (p1-p2)

p1 = White children

p2 = Black children

**Survey Results**

- A sample of 247 parents of black children aged 6-18 was taken with 91 saying that their child has had some swimming lessons.
- A sample of 988 parents of white children aged 6-18 was taken with 543 saying that their child has had some swimming lessons.

Difference in Proportion Confidence Interval

**Best Estimate ± Margin of Error**

$\hat{p}_1 - \hat{p}_2 \pm$ Margin of Error

$\hat{p}_1 - \hat{p}_2 \pm$ Z Score $\times$ $SE(\hat{p}_1 - \hat{p}_2)$

$\hat{p}_1 - \hat{p}_2 \pm 1.96 \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

In [5]:
def calculateCIforDiffPopulation(phat1, phat2, n1, n2, confidence):
    # Returns the margin of error: z* times the estimated standard error of (phat1 - phat2)
    zcrit = calculateZCriticalPoint(confidence)
    margin_of_error = zcrit * np.sqrt(((phat1 * (1-phat1))/n1) + ((phat2 * (1-phat2))/n2))
    return margin_of_error

phat1 = 0.55 #Sample proportion for Group 1 (white children): 543/988, rounded
phat2 = 0.37 #Sample proportion for Group 2 (black children): 91/247, rounded

n1 = 988 #Sample size for Group 1 
n2 = 247 #Sample size for Group 2 

confidence = 0.95
calculateCIforDiffPopulation(phat1, phat2, n1, n2, confidence)

0.06773173792547972
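
The value returned above is the margin of error rather than the interval itself. As a sketch, the interval can be assembled from the raw survey counts (the variable names here are illustrative and use unrounded proportions):

```python
import numpy as np
from scipy import stats

x1, n1 = 543, 988   # parents of white children reporting swimming lessons
x2, n2 = 91, 247    # parents of black children reporting swimming lessons

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2                                          # best estimate of p1 - p2
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # estimated standard error
moe = stats.norm.ppf(0.975) * se                        # 95% margin of error

lower, upper = diff - moe, diff + moe
print(round(lower, 4), round(upper, 4))
```

Because the whole interval lies above 0, the data suggest a higher population proportion among parents of white children.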

In [18]:
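def calculateTCriticalPoint(confidence, n):
    DoF = n-1
    significance_level = 1-confidence
    
    # Lower-tail quantile, so the result is negative (about -2.06 for DoF = 24);
    # callers take abs() to get the positive multiplier t*.
    Tcrit = scipy.stats.t.ppf(significance_level/2, DoF) 
    return Tcrit

def calculateCIforPopulationMean(mean, std, n, confidence):
    # Returns the signed margin of error t * (std / sqrt(n)), not the interval;
    # `mean` is unused here, and the caller adds/subtracts the absolute value.
    T = calculateTCriticalPoint (confidence,n)
    CI = T * (std/np.sqrt(n))
    return CI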
    


mean = 82.48
std = 15.06
n = 25
confidence = 0.95
calculateCIforPopulationMean(mean, std, n, confidence)

-6.2164624676235976
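
The negative value comes from using the lower-tail t quantile. A minimal sketch that takes the upper-tail quantile instead and returns the interval directly (`mean_ci` is a hypothetical name, not a function defined in this notebook):

```python
import numpy as np
from scipy import stats

def mean_ci(mean, std, n, confidence=0.95):
    """t-based CI for a population mean from summary statistics."""
    t_star = stats.t.ppf((1 + confidence) / 2, n - 1)   # positive multiplier
    moe = t_star * std / np.sqrt(n)                     # margin of error
    return mean - moe, mean + moe

lower, upper = mean_ci(82.48, 15.06, 25)
print(round(lower, 2), round(upper, 2))
```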


___
## **Estimating a Mean Difference for Paired Data**

Paired Data: We want to treat the two sets of values simultaneously.

Variable: Difference of measurements within pairs

**Research Questions**

What is the average difference between the older twin’s and younger twin’s self-reported education?

Population - All identical twins

Parameter of Interest - Population mean difference of self-reported education level 
(difference = Older twin - younger twin)

Construct a 95% Confidence Interval for the mean difference of self-reported education for a set of identical twins.

**Difference Summary**

n = 340

Min = -3.5 years,

Max = 4 years

72.1% had a difference of 0 years (same education level).

Mean = 0.0838 years

Standard Deviation = 0.7627 years

In [7]:
mean = 0.0838
std = 0.7627
n = 340
confidence = 0.95
CI = abs(calculateCIforPopulationMean(mean, std, n, confidence))
lcb = mean-CI
ucb = mean+CI

print(lcb, ucb)


0.0024391160305014536 0.16516088396949855


___
**Interpreting the Confidence Interval**

With 95% confidence, the population mean difference in self-reported education (older twin less younger twin) is estimated to be between 0.0024 years and 0.1652 years.
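
When the raw pairs are available, the same interval is a one-sample t interval on the within-pair differences. The sketch below uses a small made-up data set; the education values are hypothetical and are not the twin study data.

```python
import numpy as np
from scipy import stats

# Hypothetical years of education for 8 twin pairs (illustration only)
older = np.array([12, 16, 14, 18, 12, 16, 13, 20])
younger = np.array([12, 15, 14, 16, 12, 17, 13, 18])

d = older - younger                    # within-pair differences
n = len(d)
mean_d = d.mean()
se_d = d.std(ddof=1) / np.sqrt(n)      # standard error of the mean difference
t_star = stats.t.ppf(0.975, n - 1)     # 95% multiplier with n-1 degrees of freedom

lower, upper = mean_d - t_star * se_d, mean_d + t_star * se_d
print(mean_d, round(lower, 3), round(upper, 3))
```

Working on the differences is what distinguishes the paired analysis from treating the two columns as independent samples.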

___
## **Estimating a Difference in Population Means with Confidence (for Independent Groups)**

**Research Question**

Considering Mexican American adults (ages 18-29) living in the United States, do males and females differ significantly in mean Body Mass Index (BMI)?

Population: Mexican American adults (ages 18-29) in the US

Parameter of Interest (μ1 - μ2): Difference in population mean BMI (males less females)

**BMI Variable Summary**

|  | Male | Female | Delta |
| --- | --- | --- | --- |
| Mean | 23.57 | 22.83 |  |
| Std Dev | 6.24 | 6.43 |  |
| n | 258 | 239 |  |

There are two approaches for calculating the Confidence Interval:

1. Pooled Approach → The variance of the two populations is assumed to be equal.
2. Unpooled Approach → The variances of the two populations are not assumed to be equal.

**Calculating the Unpooled Confidence Interval**

= **Best Estimate ± Margin of Error**

= The difference in sample means **±** “a Few” estimated standard errors

= $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

**Calculating the Pooled Confidence Interval**

= **Best Estimate ± Margin of Error**

= The difference in sample means **±** “a Few” estimated standard errors

= $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

In [8]:
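def calculateCIPooled(n1, s1, n2, s2, mean_1, mean_2, confidence):
    # NOTE: calculateTCriticalPoint subtracts 1 again, so the t quantile below
    # uses n1+n2-3 degrees of freedom rather than the usual n1+n2-2; with
    # samples this large the difference is negligible.
    n = n1+n2-2
    t = calculateTCriticalPoint (confidence, n)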
    ci = abs(t * np.sqrt((((n1-1)*s1**2) + ((n2-1)*s2**2))/(n1+n2-2)) * (np.sqrt(1/n1 +1/n2)))
    d = mean_1-mean_2
    lcb = d -ci
    ucb = d+ci
    print(lcb, ucb)
    return ci

n1 = 258
n2 = 239
mean_1 = 23.57
mean_2 = 22.83
std1 = 6.24
std2 = 6.43
confidence = 0.95

calculateCIPooled(n1, std1, n2, std2, mean_1, mean_2, confidence)

-0.37693576089468217 1.8569357608946861


1.1169357608946842

In [9]:
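def calculateCIUnpooled(n1, s1, n2, s2, mean_1, mean_2, confidence):
    # NOTE: `n` is not defined in this function, so the call below picks up the
    # notebook-level `n` left over from an earlier cell. Pass the intended
    # degrees of freedom explicitly when reusing this code.
#     n = n1+n2-2
    t = calculateTCriticalPoint (confidence, n)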
    
    ci = abs(t * np.sqrt((s1**2/n1) + (s2**2/n2)))
    
    d = mean_1-mean_2
    lcb = d -ci
    ucb = d+ci
    print(lcb, ucb)
    return ci

n1 = 258
n2 = 239
mean_1 = 23.57
mean_2 = 22.83
std1 = 6.24
std2 = 6.43
confidence = 0.95

calculateCIUnpooled(n1, std1, n2, std2, mean_1, mean_2, confidence)

-0.37947651130442317 1.8594765113044271


1.1194765113044252
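
The unpooled cell above reads its degrees of freedom from a variable `n` that is not defined inside the function. One common alternative, shown here as a sketch rather than as this notebook's method, is the Welch-Satterthwaite approximation for the degrees of freedom. The name `welch_ci` is hypothetical.

```python
import numpy as np
from scipy import stats

def welch_ci(n1, s1, n2, s2, mean_1, mean_2, confidence=0.95):
    """Unpooled CI for mu1 - mu2 with Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2                 # per-group variance of the mean
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    t_star = stats.t.ppf((1 + confidence) / 2, df)  # positive multiplier
    moe = t_star * np.sqrt(v1 + v2)                 # margin of error
    d = mean_1 - mean_2
    return d - moe, d + moe, df

lower, upper, df = welch_ci(258, 6.24, 239, 6.43, 23.57, 22.83)
print(round(lower, 2), round(upper, 2), round(df, 1))
```

For these data the approximate degrees of freedom are close to n1+n2-2, so the interval still rounds to about (-0.38, 1.86).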

___
**Interpreting the Confidence Interval**

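(-0.38, 1.86)

With 95% confidence, the difference in mean body mass index between males and females for all Mexican American adults (ages 18-29) in the U.S. is estimated to be between -0.38 and 1.86 kg/m². Because this interval contains 0, the data are consistent with no difference in mean BMI.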

**What does “with 95% confidence” mean:**

If this procedure were repeated over and over, each time producing a 95% CI estimate, we would expect 95% of those resulting intervals to contain the difference in population mean BMI.
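
The repeated-sampling interpretation can be checked with a small simulation. The sketch below uses a single mean rather than a difference, to keep it short, and every population value in it is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 23.0, 6.0, 50, 2000   # hypothetical population and design

samples = rng.normal(mu, sigma, size=(reps, n))   # one row per repeated study
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)    # estimated standard errors
t_star = stats.t.ppf(0.975, n - 1)

# An interval "covers" when the true mean lies between its bounds
covered = (means - t_star * ses <= mu) & (mu <= means + t_star * ses)
coverage = covered.mean()
print(coverage)
```

The printed share of intervals that contain the true mean should land near 0.95, which is what the confidence level describes.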

___

## **Setting Up a Test for a Population Proportion (Z-Test)**

**Why do we do Hypothesis Tests?**

Could the value of the parameter be ___

Use data to help support that claim.

**Research Question**

In the previous year, 52% of parents believed that electronics and social media were the cause of their teenager’s lack of sleep. **Do more parents today** believe that their teenager’s lack of sleep is caused by electronics and social media?

Hypotheses:

$H_0: p = 0.52$ (Null Hypothesis)

$H_1: p > 0.52$ (Alternative Hypothesis)

Where p is the population proportion of parents with a teenager who believe that electronics and social media are the cause of their teenager’s lack of sleep.

$\alpha$ = 0.05

**Survey Results**

A random sample of 1018 parents with a teenager was taken, and 56% said they believe electronics and social media were the cause of their teenager’s lack of sleep.

$\hat {p} = 0.56$

$p_0= 0.52$

$n=1018$

In [10]:
import scipy.stats

def calculateStandardError(p0, n):
    se = np.sqrt(p0 * (1-p0)/n)
    return se

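def calculatePValueFromZ(Z, tail =2):
    # abs(Z) makes the one-tailed p-value valid only when Z lies in the
    # direction of the alternative hypothesis
    if tail==1:
        pvalue = scipy.stats.norm.sf(abs(Z))
    elif tail==2:
        pvalue = scipy.stats.norm.sf(abs(Z))*2
    else:
        pvalue =None
    return pvalue

def calculateZTest(p0, phat, n ):
    # One-sided z-test; the standard error uses the null value p0
    se = calculateStandardError(p0,n)
    Z = (phat - p0)/se
    pvalue = calculatePValueFromZ(Z, tail=1)
    return pvalue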

p0=0.52
phat = 0.56
n=1018

confidence = 0.95
significance_level = 1-confidence

calculateZTest(p0, phat, n )

0.005316510991822442
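
The function above returns only the p-value. As a sketch, the intermediate quantities (the standard error under the null and the Z statistic) can be laid out step by step:

```python
import numpy as np
from scipy import stats

p0, phat, n = 0.52, 0.56, 1018

se_null = np.sqrt(p0 * (1 - p0) / n)   # standard error under H0
z = (phat - p0) / se_null              # test statistic
p_value = stats.norm.sf(z)             # upper-tail area for H1: p > p0

print(round(z, 2), round(p_value, 4))
```

The sample proportion sits roughly 2.55 null standard errors above 0.52, which is where the small p-value comes from.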

**Conclusions**

p-value = 0.0053 < $\alpha$ = 0.05

Because the p-value is less than the significance level we set at the start, we reject the null hypothesis ($H_0: p = 0.52$).

There is sufficient evidence to conclude that the population proportion of parents with a teenager who believe that electronics and social media are the cause of their teenager’s lack of sleep is greater than 52%.

____

# Setting Up a Test of Difference in Population Proportions
**Research Questions**

Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

**Populations**

All parents of black children aged 6-18 and all parents of Hispanic children aged 6-18

**Parameter of Interest**

$p_1 - p_2$ 

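$p_1$ is the population proportion of parents of black children who report that their child has had some swimming lessons, and $p_2$ is the same proportion for parents of Hispanic children.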

**Objective**

Test for significant difference in the population proportions of parents reporting that their child has had swimming lessons at the 10% significance level $(\alpha = 0.10)$.

**Hypotheses**

$H_0: p_1 - p_2 =0$

$H_a: p_1 - p_2 \not= 0$

 $\alpha = 0.10$
 
 **Survey Results**

- A sample of 247 parents of black children aged 6-18 was taken with 91 saying that their child has had some swimming lessons.
- A sample of 308 parents of Hispanic children aged 6-18 was taken with 120 saying that their child has had some swimming lessons.
 

In [11]:
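def calculatePhat(n1,p1, n2,p2):
    # Pooled sample proportion: total "yes" responses over total sample size
    phat = ((n1*p1) + (n2*p2))/ (n1+n2)
    return phat
    
def calculateStandardErrorTwoPopulation(phat, n1, n2):
    # Standard error of (p1 - p2) under H0, using the pooled proportion
    se =  np.sqrt(phat * (1-phat) * ((1/n1) +(1/n2)) )
    return se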

def calculateTestDifferentPopulation(n1,p1, n2,p2):
    phat = calculatePhat(n1,p1, n2,p2)
    se = calculateStandardErrorTwoPopulation(phat, n1, n2)
    Z = (p1 -p2)/se
    pvalue = calculatePValueFromZ(Z)
    return pvalue
    
    
n1 = 247
n2 = 308
x1 = 91
x2 =120
p1 = x1/n1
p2 = x2/n2

calculateTestDifferentPopulation(n1,p1, n2,p2)

0.6093128715165157

Notes:

- A one-tailed test looks for an **“increase”** or **“decrease”** in the parameter.
- A two-tailed test looks for a **“change”** (could be increase or decrease) in the parameter.

**Decision & Conclusion**

p-val =0.61 > $\alpha = 0.10$ → Fail to reject null hypothesis → Don’t have evidence against equal population proportions.

Formally, based on our sample and our p-value, we fail to reject the null hypothesis. We conclude that there is no significant difference between the population proportion of parents of black and Hispanic children who report their child has had swimming lessons.

**Alternative Approaches**

1. Can use Chi-Square Test.
2. Can use Fisher’s exact Test.

The chi-square test relies on a large-sample approximation, while Fisher's exact test is an exact procedure that is particularly suited to small samples.
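
Both alternatives are available in `scipy.stats`. The sketch below applies them to the 2×2 table built from the survey counts above; the table layout is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

# Rows: black, Hispanic; columns: had lessons, no lessons
table = np.array([[91, 247 - 91],
                  [120, 308 - 120]])

# correction=False turns off the continuity correction
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)
odds_ratio, p_fisher = stats.fisher_exact(table)   # two-sided by default

print(round(p_chi2, 4), round(p_fisher, 4))
```

Without the continuity correction, the chi-square p-value matches the two-sided z-test p-value, because the chi-square statistic for a 2×2 table equals Z².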

___

# **One Mean: Testing about a Population Mean with Confidence**

Research Question

Is the average Cartwheel distance (in inches) for adults more than 80 inches?

Population: All adults

Parameter of interest: Population mean cartwheel distance (μ)

Perform a one-sample test regarding the value for the mean cartwheel distance for the population of all such adults.

Defining Hypothesis

Null: Population mean CW distance $(μ)$ is 80 inches

$H_0:μ =80$

Alternative: Population mean is greater than 80 inches

$H_a:μ>80$

Significance Level = 5% $(\alpha = 0.05)$

**Survey Result**

n=25

Mean $(\bar{x})$ = 82.48

Std Dev = 15.06

In [15]:
import scipy.stats

def calculatePValueFromT(t, n, tail =2):
    DoF=n-1
    if tail==1:
        pvalue = scipy.stats.t.sf(abs(t), DoF)
    elif tail==2:
        pvalue = scipy.stats.t.sf(abs(t), DoF)*2
    else:
        pvalue =None
    return pvalue

def calculateTTest(xhat, x0, std, n):
    t = (xhat - x0)/ (std/np.sqrt(n))
    return t


def calculateTestPopulationMean(xhat, x0, std, n):
    t = calculateTTest(xhat, x0, std, n)
    pvalue = round(calculatePValueFromT(t, n, tail =1),2)
    return pvalue

xhat = 82.48
x0 = 80
std = 15.06
n=25

calculateTestPopulationMean(xhat, x0, std, n)

0.21

Our sample mean is only 0.82 (estimated) standard errors above the null value of 80 inches.

The one-sided p-value for t = 0.82 is 0.21.

If the population mean CW distance were really 80 inches, then observing a sample mean of 82.48 inches or larger would be quite likely.
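
The 0.82 quoted above is the t statistic, which the cell does not print. A short sketch of the intermediate steps from the summary statistics:

```python
import numpy as np
from scipy import stats

xbar, mu0, s, n = 82.48, 80, 15.06, 25

se = s / np.sqrt(n)                    # estimated standard error
t_stat = (xbar - mu0) / se             # standard errors above the null value
p_value = stats.t.sf(t_stat, n - 1)    # upper-tail area for Ha: mu > 80

print(round(t_stat, 2), round(p_value, 2))
```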

**Make a decision about the hypothesis.**

Since our p-value is much larger than 0.05 (our significance level), we have only weak evidence against the null, so we fail to reject the null hypothesis.

Based on the estimated mean (82.48 inches), we cannot support the claim that the population mean CW distance is greater than 80 inches.

**What if normality doesn’t hold?**

- If we are not convinced that CW Distance follows a normal distribution in population
    
    → We can use **non-parametric test** that does not assume normality.
    
- Non-parametric analog of the one sample t-test = **Wilcoxon Signed Rank Test**
    
    → Use **median** to examine location of distribution of measurements.
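
A minimal sketch of the non-parametric route using `scipy.stats.wilcoxon`. The test needs raw observations, which the summary above does not provide, so the distances below are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical cartwheel distances in inches (illustration only)
distances = np.array([79.5, 70.2, 85.1, 81.3, 63.8, 96.4,
                      82.9, 87.6, 75.7, 92.0, 101.3, 66.5])

# H0: distances are centered at 80 versus Ha: centered above 80
result = stats.wilcoxon(distances - 80, alternative="greater")
print(result.statistic, round(result.pvalue, 3))
```

The test ranks the absolute deviations from 80 and compares the rank sums of positive and negative deviations, so it does not rely on normality.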

___



## **Testing a Population Mean Difference**

Research Question:

Is there an **average difference** between the cabinet quotes from the suppliers?

**Populations** - All houses

**Parameter of Interest** - Population mean difference of cabinet quotes, $\mu_d$

$d = \text{Supplier A} - \text{Supplier B}$

Test for a significant mean difference in cabinet quotes at the 5% significance level.

**Hypotheses**

$H_0:\mu_d=0$

$H_a:\mu_d \not= 0$

**Data Result**

n = 20

Minimum Difference = -\$30

Maximum Difference = \$90

Median = \$13.50

Mean = \$17.30

Standard Deviation = \$28.49

In [30]:
def calculateTPopulationMeanDiff(xhat, std, n):
    t = (xhat - 0)/(std/np.sqrt(n))
    return t

def calculateTestPopulationMeanDiff(xhat, std, n):
    t = calculateTPopulationMeanDiff(xhat, std, n)
    
    pvalue = calculatePValueFromT(t, n, tail=2)
    return pvalue
    
xhat = 17.3
std = 28.49
n = 20
confidence =0.95


calculateTestPopulationMeanDiff(xhat, std, n)


0.013718818836080428

In [25]:
ci = abs(calculateCIforPopulationMean(xhat, std, n, confidence))

lcb = xhat -ci
ucb = xhat +ci

print('95% CI difference:')
print(lcb, ucb)

95% CI difference:
3.966269561096782 30.63373043890322


Our observed mean difference is 2.72 (estimated) standard errors above our null value of 0.

The two-sided p-value for t = 2.72 from the t-distribution is 0.014.

Formally, based on our sample and our p-value, we reject the null hypothesis. We conclude that the mean difference of cabinet quote prices for supplier A less B is significantly different from 0.
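
The test and the confidence interval tell the same story. As a sketch from the summary statistics, the two-sided test at α = 0.05 rejects exactly when the 95% interval excludes 0:

```python
import numpy as np
from scipy import stats

mean_d, sd_d, n = 17.30, 28.49, 20

se = sd_d / np.sqrt(n)                         # estimated standard error
t_stat = (mean_d - 0) / se                     # test statistic against mu_d = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - 1)   # two-sided p-value

t_star = stats.t.ppf(0.975, n - 1)             # 95% multiplier
lower, upper = mean_d - t_star * se, mean_d + t_star * se

# Both flags agree: reject H0 and the interval excludes 0
print(p_value < 0.05, not (lower <= 0 <= upper))
```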

## **Testing for a Difference in Population Means (for Independent Groups)**

**Research Question:**

Considering Mexican-American adults (ages 18-29) living in the United States, do males and females differ significantly in mean Body Mass Index?

**Population:** Mexican-American Adults ages 18-29 in the US

**Parameter of Interest** $(\mu_1 - \mu_2)$: Body Mass Index or BMI $(kg/m^2)$

**Task**: Perform an independent samples t-test regarding the value for the difference in mean BMI between males and females

**Hypotheses**

$H_0: \mu_1=\mu_2$

$H_a: \mu_1 \not= \mu_2$

with $\alpha=0.05$

**Summary of Statistics**

|  | Male | Female |
| --- | --- | --- |
| Mean | 23.57 | 22.83 |
| Std Dev | 6.24 | 6.43 |
| n | 258 | 239 |

**Calculate Test Statistic**

Best Estimate: $\bar{x}_1 - \bar{x}_2 = 23.57 - 22.83 = 0.74$

$t = \frac{\text{best estimate} - \text{null value}}{\text{estimated standard error}}$

**Two approaches that we can use:**

1. Pooled approach: The variances of the two populations are assumed to be equal
2. Unpooled approach: The variances of the two populations are not assumed to be equal

**Calculating Test Statistic using Pooled Approach:**

$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

**Calculating Test Statistic using Unpooled Approach:**

$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

Because the IQRs and standard deviations are similar, the pooled approach will be used.

In [83]:
def calculateStandardErrorDiffPopulationMeans(n1,s1, n2,s2):
    # Despite the name, this returns the pooled standard deviation S_p;
    # the standard error is S_p * sqrt(1/n1 + 1/n2), applied in the caller.
    sp = np.sqrt((((n1-1)*s1**2) + ((n2-1)*s2**2)) / (n1+n2-2))
    return sp
    
def calculateTDiffPopulationMeans(n1,s1, n2,s2, xhat1, xhat2):
    sp = calculateStandardErrorDiffPopulationMeans(n1,s1, n2,s2)
    
    t = (xhat1 -xhat2-0) / (sp * np.sqrt((1/n1+1/n2)))
    return t

def calculateTestDiffPopulationMeans(n1,s1, n2,s2, xhat1, xhat2):
    t = calculateTDiffPopulationMeans(n1,s1, n2,s2, xhat1, xhat2)
    # calculatePValueFromT uses DoF = n-1, so passing n1+n2-1 yields the pooled df n1+n2-2
    pvalue = calculatePValueFromT(t, (n1+n2-1), tail=2)
    
    return pvalue

In [84]:
n1 = 258
n2 = 239

s1 = 6.24
s2 = 6.43

xhat1 = 23.57
xhat2 = 22.83

pvalue= calculateTestDiffPopulationMeans(n1,s1, n2,s2, xhat1, xhat2)
pvalue

0.19361821996388956

In [91]:
ci = calculateCIPooled(n1, s1, n2, s2, xhat1, xhat2, confidence)

-0.37693576089468217 1.8569357608946861
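
As a cross-check on the hand-rolled functions, `scipy.stats.ttest_ind_from_stats` runs the same pooled test directly from summary statistics. This is a sketch for verification, not a replacement for the cells above.

```python
from scipy import stats

result = stats.ttest_ind_from_stats(
    mean1=23.57, std1=6.24, nobs1=258,   # males
    mean2=22.83, std2=6.43, nobs2=239,   # females
    equal_var=True,                      # pooled approach
)
print(round(result.statistic, 2), round(result.pvalue, 4))
```

The two-sided p-value is larger than α = 0.05, so we fail to reject the null hypothesis, which agrees with the 95% interval above containing 0.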
