In [2]:
import statsmodels.api as sm
import numpy as np
import pandas as pd

### One Population Proportion

#### Research Question 

In previous years, 52% of parents believed that electronics and social media were the cause of their teenager’s lack of sleep. Do more parents today believe that their teenager’s lack of sleep is caused by electronics and social media? 

**Population**: Parents with a teenager (age 13-18)  
**Parameter of Interest**: p  
**Null Hypothesis:** p = 0.52  
**Alternative Hypothesis:** p > 0.52 (note that this is a one-sided test)

1018 Parents

56% believe that their teenager’s lack of sleep is caused due to electronics and social media.


In [3]:
#Sample size
n = 1018
#Null hypothesis
pnull = 0.52
#Observed sample proportion
phat = 0.56

#Compute the test statistic and p-value (two-sided by default)
sm.stats.proportions_ztest(phat * n, n, pnull)

(2.571067795759113, 0.010138547731721065)

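The output is the test statistic followed by the p-value. Note that `proportions_ztest` performs a two-sided test by default, so the p-value shown is two-sided; because the test statistic is positive, the one-sided p-value for the alternative p > 0.52 is half of it, about 0.005 (passing `alternative='larger'` requests this directly). Either way, the p-value is smaller than $\alpha = 0.05$, so we have sufficient grounds to reject the null hypothesis.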
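To make the calculation explicit, here is a minimal hand-computed sketch of the one-sided test using only the standard library. It assumes the standard error is taken under the null proportion, which is the textbook form; statsmodels by default uses the sample proportion instead, so the statistic here comes out slightly smaller than 2.571. The names `se_null` and `p_one_sided` are illustrative and not part of the original notebook.

```python
import math

# Values from the research question above
n = 1018       # sample size
pnull = 0.52   # proportion under the null hypothesis
phat = 0.56    # observed sample proportion

# Standard error under the null hypothesis (assumption: textbook form)
se_null = math.sqrt(pnull * (1 - pnull) / n)

# z statistic: how many standard errors phat lies above pnull
z = (phat - pnull) / se_null

# One-sided p-value, P(Z > z), from the standard normal upper tail
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

print(z, p_one_sided)
```

The conclusion is unchanged: the one-sided p-value is well below 0.05.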

### Difference in Population Proportions

#### Research Question

Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

**Populations**: All parents of black children age 6-18 and all parents of Hispanic children age 6-18  
**Parameter of Interest**: p1 - p2, where p1 is the proportion for parents of black children and p2 is the proportion for parents of Hispanic children  
**Null Hypothesis:** p1 - p2 = 0  
**Alternative Hypothesis:** p1 - p2 $\neq$ 0  


91 out of 247 (36.8%) sampled parents of black children report that their child has had some swimming lessons.

120 out of 308 (38.9%) sampled parents of Hispanic children report that their child has had some swimming lessons.

In [14]:
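#Sample sizes
n1 = 247
n2 = 308
#Observed sample proportions (rounded)
phat1 = 0.37
phat2 = 0.39

#Simulate 0/1 responses from the observed proportions
#(no random seed is set, so the result varies from run to run)
population1 = np.random.binomial(1, phat1, n1)
population2 = np.random.binomial(1, phat2, n2)

#Compute the test statistic, p-value and degrees of freedom
sm.stats.ttest_ind(population1, population2)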

(-1.674029500534151, 0.09469025016810637, 553.0)

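The output is the test statistic, the p-value, and the degrees of freedom (n1 + n2 - 2 = 553), respectively. Since the p-value is larger than $\alpha = 0.05$, we don't have sufficient grounds to reject the null hypothesis. Keep in mind that the two samples here are simulated from the rounded proportions without a fixed seed, so the statistic and p-value change from run to run.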
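Because the actual counts are reported (91 of 247 and 120 of 308), the test can also be computed directly from them, which avoids the randomness of simulating responses. The following is a minimal sketch of a pooled two-proportion z-test using only the standard library; the names `x1`, `x2`, `p_pool` and `p_two_sided` are illustrative and not part of the original notebook. statsmodels' `proportions_ztest` also accepts arrays of counts and sample sizes and should give a comparable result, though that call is not run here.

```python
import math

# Observed counts and sample sizes from the text above
x1, n1 = 91, 247    # parents of black children
x2, n2 = 120, 308   # parents of Hispanic children

phat1 = x1 / n1
phat2 = x2 / n2

# Pooled proportion under the null hypothesis p1 = p2
p_pool = (x1 + x2) / (n1 + n2)

# Standard error of the difference under the null hypothesis
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# z statistic for the difference in sample proportions
z = (phat1 - phat2) / se

# Two-sided p-value, 2 * P(Z > |z|), from the standard normal tails
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(z, p_two_sided)
```

This version is deterministic, and it leads to the same decision: the p-value is far above 0.05, so the null hypothesis is not rejected.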

### One Population Mean

#### Research Question 

Is the average cartwheel distance (in inches) for adults 
more than 80 inches?

**Population**: All adults  
**Parameter of Interest**: $\mu$, population mean cartwheel distance.  
**Null Hypothesis:** $\mu$ = 80  
**Alternative Hypothesis:** $\mu$ > 80

25 Adults

$\bar{x} = 82.48$

$s = 15.06$

In [15]:
df = pd.read_csv('Cartwheeldata.csv')
df.head()

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4


In [16]:
#Sample size
n = len(df)
print(n)
#Null hypothesis
pnull = 80
#Observed sample mean
mu = df['CWDistance'].mean()
print(mu)
#Observed sample standard deviation
sigma = df['CWDistance'].std()
print(sigma)

25
82.48
15.058552387264855


In [19]:
#Test statistics
sm.stats.ztest(df['CWDistance'], value = pnull, alternative = "larger")

(0.8234523266982029, 0.20512540845395266)

The output is the test statistic followed by the p-value. Since the p-value is considerably larger than $\alpha = 0.05$, we don't have sufficient grounds to reject the null hypothesis.
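As a check on where these numbers come from, the statistic can be reproduced by hand from the summary values printed earlier. This is a minimal sketch using only the standard library; the names `xbar`, `s`, `mu0` and `se` are illustrative and not part of the original notebook.

```python
import math

# Summary statistics printed by the notebook above
n = 25                       # sample size
xbar = 82.48                 # sample mean of CWDistance
s = 15.058552387264855       # sample standard deviation of CWDistance
mu0 = 80                     # mean under the null hypothesis

# Estimated standard error of the sample mean
se = s / math.sqrt(n)

# z statistic: distance of the sample mean from mu0 in standard errors
z = (xbar - mu0) / se

# One-sided p-value, P(Z > z), matching alternative="larger"
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

print(z, p_one_sided)
```

The sample mean sits less than one standard error above 80, which is why the p-value is so large.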

### Difference in Population Means

#### Research Question 

Considering adults in the NHANES data, do males have a significantly higher mean Body Mass Index than females?

**Population**: Adults in the NHANES data.  
**Parameter of Interest**: $\mu_1 - \mu_2$, the difference in mean Body Mass Index, where $\mu_1$ is the mean for females and $\mu_2$ is the mean for males.  
**Null Hypothesis:** $\mu_1 = \mu_2$  
**Alternative Hypothesis:** $\mu_1 \neq \mu_2$

2976 Females (2944 with a recorded BMI)  
$\bar{x}_1 = 29.94$  
$s_1 = 7.75$  

2759 Male Adults (2718 with a recorded BMI)  
$\bar{x}_2 = 28.78$  
$s_2 = 6.25$  

$\bar{x}_1 - \bar{x}_2 = 1.16$

In [51]:
da = pd.read_csv('nhanes_2015_2016.csv')
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [64]:
#Isolate the variables of interest
dx = da[['RIAGENDR', 'BMXBMI']]

#Separate the two groups (RIAGENDR: 1 = male, 2 = female) and drop missing BMI values
females = dx[dx.RIAGENDR == 2].dropna()
males = dx[dx.RIAGENDR == 1].dropna()

In [66]:
#Verify values
n1 = len(females)
mu1 = females['BMXBMI'].mean()
sigma1 = females['BMXBMI'].std()
print((n1, mu1, sigma1))

print("\n")

n2 = len(males)
mu2 = males['BMXBMI'].mean()
sigma2 = males['BMXBMI'].std()
print((n2, mu2, sigma2))

(2944, 29.939945652173996, 7.75331880954568)


(2718, 28.778072111846985, 6.252567616801485)


In [69]:
sm.stats.ztest(females['BMXBMI'], males['BMXBMI'])


(6.1755933531383205, 6.591544431126401e-10)

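The output is the test statistic followed by the p-value. Since the p-value is far smaller than $\alpha = 0.05$, we have sufficient grounds to reject the null hypothesis of equal means. Note that the test is two-sided and females are the first sample, so the positive statistic reflects a higher sample mean BMI for females than for males.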
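The same statistic can be reproduced from the summary values printed above. This minimal sketch assumes a pooled variance estimate, which is the default behaviour of `sm.stats.ztest` for two samples; the names `var_pooled`, `se` and `p_two_sided` are illustrative and not part of the original notebook.

```python
import math

# Summary statistics printed by the notebook above
n1, mean1, sd1 = 2944, 29.939945652173996, 7.75331880954568    # females
n2, mean2, sd2 = 2718, 28.778072111846985, 6.252567616801485   # males

# Pooled variance estimate (assumes equal population variances)
var_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)

# Standard error of the difference in sample means
se = math.sqrt(var_pooled * (1 / n1 + 1 / n2))

# z statistic for the difference in means (females minus males)
z = (mean1 - mean2) / se

# Two-sided p-value, 2 * P(Z > |z|)
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(z, p_two_sided)
```

Since the two sample standard deviations differ noticeably (7.75 versus 6.25), an unpooled standard error would also be a reasonable choice; with samples this large the conclusion is unlikely to change.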