**Central Limit Theorem**

The sampling distributions of many statistics of interest look similarly symmetric and bell-shaped. An important mathematical result, the Central Limit Theorem, specifies the conditions under which a statistic's sampling distribution will approximately follow a normal distribution. The **Central Limit Theorem** (CLT) states that if random samples of size *n* are drawn from a large population and *n* is large enough, then the sampling distribution of the sample mean will be approximately normal. The CLT also applies to proportions since a proportion can be expressed as the mean of zeros and ones.
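
The CLT can be checked informally with a small simulation. The sketch below is only an illustration under assumed settings: the population is a hypothetical right-skewed exponential distribution with mean 1 and standard deviation 1, and the sample size `n = 50` and the number of repeated samples are arbitrary choices. It uses only the Python standard library so that it runs without any dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 50              # size of each random sample (assumed "large enough")
numSamples = 5000   # number of repeated samples drawn from the population

# Draw repeated samples from a hypothetical skewed population
# (exponential with mean 1 and standard deviation 1) and keep each sample mean
sampleMeans = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(numSamples)
]

center = statistics.fmean(sampleMeans)
spread = statistics.stdev(sampleMeans)

print('mean of sample means =', round(center, 3))
print('SD of sample means =', round(spread, 3))
print('population SD / sqrt(n) =', round(1 / n**0.5, 3))

# For a normal distribution, about 95% of values fall within 2 SDs of the mean
within2 = sum(abs(m - center) <= 2 * spread for m in sampleMeans) / numSamples
print('share of sample means within 2 SDs =', round(within2, 3))
```

Even though the population is skewed, the sample means should center near the population mean, with a spread close to the population standard deviation divided by the square root of *n*.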



**Hypothesis testing**

A hypothesis test is a method for evaluating a claim, or hypothesis, about a population parameter by examining the statistical evidence against the claim based on a sample. The conclusion of a hypothesis test is a decision that the observed data either indicate the claim is plausible or support an alternative explanation. The following steps outline the general process of conducting a hypothesis test.

1.  State null and alternative hypotheses about parameters. The **null hypothesis**, *H<sub>0</sub>*, is typically the by-chance or no-effect explanation, and the **alternative hypothesis**, *H<sub>a</sub>*, is typically the explanation of an effect or difference.
2.  Calculate a statistic of interest from the sample data that is used to evaluate the null hypothesis.
3.  Determine the p-value, or probability, of obtaining a statistic at least as extreme as the observed statistic when the null hypothesis is true.
4.  Draw a conclusion about the null hypothesis based on the statistical evidence provided by the p-value.
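
The four steps above can be sketched for a single proportion. This is a minimal illustration, not a replacement for a library function: the counts are hypothetical and not taken from any dataset, the null value of 0.07 and the significance level of 0.05 are assumptions made for the example, and the p-value uses the normal approximation from the standard library's `NormalDist`.

```python
from statistics import NormalDist

# Step 1: state the hypotheses. H0: p = 0.07, Ha: p != 0.07
nullProp = 0.07

# Hypothetical sample (illustrative counts only)
successes = 90
sampleSize = 1000

# Step 2: calculate the statistic, a z-score for the sample proportion
# using the standard error implied by the null hypothesis
sampleProp = successes / sampleSize
stdError = (nullProp * (1 - nullProp) / sampleSize) ** 0.5
zStat = (sampleProp - nullProp) / stdError

# Step 3: two-sided p-value from the standard normal distribution
pValue = 2 * (1 - NormalDist().cdf(abs(zStat)))

# Step 4: draw a conclusion at an assumed significance level of 0.05
alpha = 0.05
decision = 'reject H0' if pValue <= alpha else 'fail to reject H0'

print('z test statistic =', round(zStat, 3))
print('p-value =', round(pValue, 4))
print('decision:', decision)
```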


**Type I and type II errors**

The decision from a hypothesis test is to either reject the null hypothesis or fail to reject the null hypothesis. The **significance level**, **𝛂**, of a hypothesis test is how small the p-value must be to conclude the data provide enough statistical evidence to reject the null hypothesis. The decision is to reject the null hypothesis if the p-value is less than or equal to **𝛂**, and fail to reject the null hypothesis if the p-value is greater than **𝛂**.

In reality, either the null hypothesis is true or the alternative hypothesis is true. Thus, the conclusion from a hypothesis test is either correct or incorrect. Ex: Suppose the conclusion from a hypothesis test is that the data support a population mean commute time of 25 minutes. If the mean commute time of the population is actually about 25 minutes, then a correct decision is made. But, if the population mean commute time is actually 40 minutes, then an incorrect decision is made.

- A **type I** error is rejecting the null hypothesis in favor of the alternative when in reality the null hypothesis is true.
- A **type II** error is failing to reject the null hypothesis when in reality the alternative hypothesis is true.
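
Both error types can be estimated by simulation. The sketch below is a hedged illustration with assumed settings: a null proportion of 0.07, a hypothetical sample size of 500, a significance level of 0.05, and, for the type II error, a hypothetical true proportion of 0.10. The helper `rejectsNull()` is defined here for the example and is not part of any library.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is reproducible

nullProp = 0.07    # proportion stated in the null hypothesis
sampleSize = 500   # hypothetical sample size
alpha = 0.05       # significance level
numTests = 2000    # number of simulated hypothesis tests


def rejectsNull(trueProp):
    """Simulate one sample and return True if the z-test rejects H0."""
    successes = sum(random.random() < trueProp for _ in range(sampleSize))
    stdError = (nullProp * (1 - nullProp) / sampleSize) ** 0.5
    zStat = (successes / sampleSize - nullProp) / stdError
    pValue = 2 * (1 - NormalDist().cdf(abs(zStat)))
    return pValue <= alpha


# Null hypothesis true: every rejection is a type I error
typeIRate = sum(rejectsNull(nullProp) for _ in range(numTests)) / numTests

# Alternative true (hypothetical true proportion of 0.10):
# every failure to reject is a type II error
typeIIRate = sum(not rejectsNull(0.10) for _ in range(numTests)) / numTests

print('estimated type I error rate =', typeIRate)
print('estimated type II error rate =', typeIIRate)
```

When the null hypothesis is true, the type I error rate should land near the significance level. The type II error rate depends on how far the true proportion is from the null value and on the sample size.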


**Estimation**

Another inference method provides an estimate for the value of a population parameter. Ex: What is the mean moisture content for all compost produced in home compost bins? A **confidence interval** provides an interval of possible values for the parameter being estimated. A confidence interval is constructed using the general equation ***estimate ± margin of error***. The **estimate** is a statistic calculated from the sample data and gives an initial best guess for the parameter's value. The **margin of error** measures the precision of the estimate and includes:

- the standard error, or measure of sampling variability, which comes from the statistic's sampling distribution, and
- the confidence level, or measure of interval reliability.
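
The general equation can be sketched for a single proportion. The counts below are hypothetical, and the interval uses the normal approximation, so this is one possible way to build the interval rather than the only one.

```python
from statistics import NormalDist

# Hypothetical sample (illustrative counts only)
successes = 90
sampleSize = 1000
confLevel = 0.95

# Estimate: the sample proportion
estimate = successes / sampleSize

# Standard error: sampling variability of the sample proportion
stdError = (estimate * (1 - estimate) / sampleSize) ** 0.5

# Multiplier set by the confidence level (about 1.96 for 95%)
zMult = NormalDist().inv_cdf(1 - (1 - confLevel) / 2)

# Margin of error combines the multiplier and the standard error
marginOfError = zMult * stdError

lowerBound = estimate - marginOfError
upperBound = estimate + marginOfError

print('estimate =', estimate)
print('margin of error =', round(marginOfError, 4))
print('interval =', (round(lowerBound, 4), round(upperBound, 4)))
```

A higher confidence level gives a larger multiplier and a wider interval, while a larger sample gives a smaller standard error and a narrower interval.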


**Inference for Proportions in Python**

**Functions for inference about proportions.**
| Function	| Parameters	| Description |
| :--------  | :----------    | :----------- |
| proportions_ztest() |  **count**: number/array of successes<br>  **nobs**: number/array of observations<br>  **value**: value in the null hypothesis<br>  **alternative**: type of the alternative hypothesis<br>    **prop_var**=False: estimate variance based on sample proportions | Returns the test statistic and p-value for a hypothesis test based on a normal (z) test. **count** and **nobs** take a single value for a one proportion test and an array of values for a two proportion test. |
| proportion_confint()	| **count**: number of successes<br>  **nobs**: number of observations<br>  **alpha**: significance level<br>  **method='normal'**: use normal approximation to calculate interval	| Returns a (1-alpha)100% confidence interval for a population proportion. |



**Inference for proportions in Python**

The National Health and Nutrition Examination Survey (NHANES) is conducted every year to survey Americans about their health and nutrition. The dataset includes physical characteristics and behaviors, such as exercise and eating habits.

Another study reported that 7% of the US population had been diagnosed with diabetes in 2012. Determine whether the NHANES data provide statistical evidence for or against the claim that the proportion of the US population diagnosed with diabetes in 2012 is 0.07.

The code below uses the NHANES dataset to conduct a single-proportion hypothesis test using the additional study information given above, construct a confidence interval for a single proportion, and conduct a two-proportion hypothesis test comparing the proportion of the US population with diabetes for the 2009-10 and 2011-12 survey years.


In [None]:
# Import pandas package and functions from statsmodels
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportion_confint

In [None]:
# Load the dataset
nhanes = pd.read_csv('nhanes.csv')

# View dataset (first/last 5 rows and the first/last 10 columns)
nhanes

In [None]:
# Conduct the hypothesis test for testing whether or not the population
# proportion of U.S. adults diagnosed with diabetes in 2012 is 0.07

# Subset full nhanes dataset to only include the 5000 instances from the
# 2011_12 survey year

nhanes2012 = nhanes[nhanes['SurveyYr'] == "2011_12"]

# Find the total number in the 2012 sample diagnosed with
# and without diabetes

countDiabetes = nhanes2012['Diabetes'].value_counts()
print(countDiabetes)

# Find the total number of instances in 2012 for the Diabetes feature
totalInstances2012 = countDiabetes['No'] + countDiabetes['Yes']
print('2012 total:', totalInstances2012)

# Find the sample proportion
sampleProp2012 = countDiabetes['Yes'] / totalInstances2012
print('sample proportion with diabetes =', sampleProp2012)

# Find the z test statistic and p-value using proportions_ztest
proportions_ztest(
    count=countDiabetes['Yes'],
    nobs=totalInstances2012,
    value=0.07,
    alternative='two-sided',
    prop_var=0.07,
)

In [None]:
# The first value returned is the test statistic,
# the second value is the p-value

testStat, pvalue = proportions_ztest(
    count=countDiabetes['Yes'],
    nobs=totalInstances2012,
    value=0.07,
    alternative='two-sided',
    prop_var=0.07,
)

print('z test statistic =', round(testStat, 3))
print('p-value =', round(pvalue, 3))

In [None]:
# Find the 95% confidence interval for the proportion of all U.S. adults
# in 2012 with diabetes

proportion_confint(
    count=countDiabetes['Yes'], nobs=totalInstances2012, alpha=0.05, method='normal'
)

In [None]:
# Conduct the hypothesis test for testing whether or not the population
# proportions of U.S. adults diagnosed with diabetes are the same for
# the 2009_10 and 2011_12 survey years

# Find the total number in the sample diagnosed with diabetes for each
# survey year

countDiabetes2Yrs = nhanes[['SurveyYr', 'Diabetes']].value_counts()
print(countDiabetes2Yrs)

# Find the total number of instances in each survey year for the
# Diabetes feature

totalInstances2010 = (
    countDiabetes2Yrs['2009_10', 'No'] + countDiabetes2Yrs['2009_10', 'Yes']
)
totalInstances2012 = (
    countDiabetes2Yrs['2011_12', 'No'] + countDiabetes2Yrs['2011_12', 'Yes']
)

# Find the sample proportions and difference in sample proportions
sampleProp2010 = countDiabetes2Yrs['2009_10', 'Yes'] / totalInstances2010
sampleProp2012 = countDiabetes2Yrs['2011_12', 'Yes'] / totalInstances2012

sampleDiff = sampleProp2012 - sampleProp2010
print('2010 sample proportion with diabetes =', sampleProp2010)
print('2012 sample proportion with diabetes =', sampleProp2012)
print('2012 proportion - 2010 proportion =', sampleDiff)

# Find the overall proportion of diabetes for calculating
# the test statistic

overallSampleProp = (
    countDiabetes2Yrs['2009_10', 'Yes'] + countDiabetes2Yrs['2011_12', 'Yes']
) / (totalInstances2010 + totalInstances2012)

# Find the z test statistic and p-value using proportions_ztest
proportions_ztest(
    count=[countDiabetes2Yrs['2011_12', 'Yes'], countDiabetes2Yrs['2009_10', 'Yes']],
    nobs=[totalInstances2012, totalInstances2010],
    value=0,
    alternative='two-sided',
    prop_var=overallSampleProp,
)

In [None]:
# Known counts, sample sizes, and the overall proportion can be
# specified directly in the proportions_ztest() function
knownCounts = [373, 387]
knownNobs = [4936, 4922]
knownOverallSampleProp = (373 + 387) / (4936 + 4922)

proportions_ztest(
    count=knownCounts,
    nobs=knownNobs,
    value=0,
    alternative='two-sided',
    prop_var=knownOverallSampleProp,
)
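
For reference, the two-proportion z statistic can also be computed by hand from the same known counts. The sketch below assumes the common pooled standard error formula and a two-sided normal p-value. It is meant as a cross-check on the arithmetic, not as a description of how any library computes the result internally.

```python
from statistics import NormalDist

# Known counts and sample sizes from the cell above
knownCounts = [373, 387]
knownNobs = [4936, 4922]

# Sample proportions for each group
prop1 = knownCounts[0] / knownNobs[0]
prop2 = knownCounts[1] / knownNobs[1]

# Pooled (overall) sample proportion used under the null hypothesis
# of no difference between the two population proportions
pooledProp = sum(knownCounts) / sum(knownNobs)

# Standard error of the difference in sample proportions
stdError = (
    pooledProp * (1 - pooledProp) * (1 / knownNobs[0] + 1 / knownNobs[1])
) ** 0.5

# z statistic and two-sided p-value
zStat = (prop1 - prop2) / stdError
pValue = 2 * (1 - NormalDist().cdf(abs(zStat)))

print('z test statistic =', round(zStat, 3))
print('p-value =', round(pValue, 3))
```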

**Functions for inference about means.**
| Function	| Parameters	| Description |
| :-------  | :----------   | :---------- |
| ttest_1samp() | **a**: array of values<br>**popmean**: value in null hypothesis<br>**alternative**: type of alternative hypothesis | Returns the **t**-statistic and p-value from a one-sample **t**-test for the null hypothesis that the population mean of a sample, a, is equal to a specified value. |
| ttest_ind()	| **a**: array of values from sample 1<br>**b**: array of values from sample 2<br>**equal_var=False**: assumes non-equal variances<br>**alternative**: type of alternative hypothesis	| Returns the **t**-statistic and p-value from a two-sample **t**-test for the null hypothesis that two independent samples, a and b, have equal population means.|

**Inference for means in Python.**

Doctors recommend that adults sleep at least 7 hours per night. The SleepHrsNight feature in the NHANES dataset is the self-reported number of hours of sleep that participants aged 16 and older usually get per night. Determine whether the NHANES data provide statistical evidence that the population mean self-reported number of hours of sleep per night is 7 or whether the population mean is less than 7.

The code below uses the NHANES dataset to conduct a single mean hypothesis test using the additional information given above, construct a confidence interval for a single mean, and conduct a hypothesis test for two independent means comparing the population mean self-reported number of hours of sleep per night for the 2009-10 and 2011-12 survey years.


In [None]:
# Import pandas and numpy packages and functions from scipy.stats 
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind
from scipy.stats import t


In [None]:
# Load the dataset
nhanes = pd.read_csv('nhanes.csv')

# View dataset (first/last 5 rows and the first/last 10 columns)
nhanes

In [None]:
# Find descriptive statistics for SleepHrsNight feature
# A total of count=7755 instances have a value for the feature

nhanes['SleepHrsNight'].describe()

In [None]:
# Subset dataset to drop instances with missing values for the
# SleepHrsNight feature

nhanesSleep = nhanes.dropna(axis=0, subset=['SleepHrsNight'])
nhanesSleep

In [None]:
# Conduct the hypothesis test for testing whether the population mean
# self-reported number of hours of sleep per night is 7 or whether
# the population mean is less than 7

ttest_1samp(a=nhanesSleep['SleepHrsNight'], popmean=7, alternative='less')

In [None]:
# Construct a 95% confidence interval for the population mean

# Find sample mean, sample standard deviation, and sample size
sampleMean = nhanesSleep['SleepHrsNight'].mean()
sampleStDev = nhanesSleep['SleepHrsNight'].std()
sampleSize = nhanesSleep['SleepHrsNight'].count()

# Find multiplier using confidence Level and t-distribution
confLevel = 0.95
tMult = t.ppf(q=1 - ((1 - confLevel) / 2), df=sampleSize - 1)

# Construct interval using general equation:
# estimate +/- multiplier * standard error

lowerBound = sampleMean - tMult * (sampleStDev / sampleSize**0.5)
upperBound = sampleMean + tMult * (sampleStDev / sampleSize**0.5)

print(lowerBound, upperBound)

In [None]:
# Conduct the hypothesis test for testing whether or not the population
# mean self-reported number of hours of sleep per night is the same for
# the 2009_10 and 2011_12 survey years

# Find descriptive statistics for SleepHrsNight feature for each survey year
# Provides an initial comparison of the two samples, notice similar means

statsByYear = nhanes.groupby(['SurveyYr'])['SleepHrsNight'].describe()
print(statsByYear)

# Find statistic and p-value using ttest_ind()
ttest_ind(
    a=nhanes[nhanes['SurveyYr'] == '2009_10']['SleepHrsNight'],
    b=nhanes[nhanes['SurveyYr'] == '2011_12']['SleepHrsNight'],
    equal_var=False,
    nan_policy='omit',
    alternative='two-sided',
)
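
Setting `equal_var=False` requests a test that does not assume equal variances, commonly known as Welch's t-test. The sketch below shows the usual formulas for that statistic and its approximate degrees of freedom. The two samples are hypothetical values chosen for illustration, and `welch()` is a helper defined here, not a library function.

```python
from statistics import mean, variance

# Hypothetical hours of sleep for two small samples (illustrative values only)
sampleA = [6, 7, 8, 7, 6, 8]
sampleB = [5, 6, 7, 6, 5, 7, 6]


def welch(a, b):
    """Return the unequal-variance t statistic and approximate degrees of freedom."""
    # Each sample's contribution to the variance of the difference in means
    varTermA = variance(a) / len(a)
    varTermB = variance(b) / len(b)

    # t statistic: difference in sample means over its standard error
    tStat = (mean(a) - mean(b)) / (varTermA + varTermB) ** 0.5

    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (varTermA + varTermB) ** 2 / (
        varTermA**2 / (len(a) - 1) + varTermB**2 / (len(b) - 1)
    )
    return tStat, df


tStat, df = welch(sampleA, sampleB)
print('t statistic =', round(tStat, 3))
print('degrees of freedom =', round(df, 2))
```

The p-value would then come from a t-distribution with `df` degrees of freedom, which is the part that `ttest_ind()` handles.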

**Are flight delays more likely at JFK or LaGuardia?**

The Port Authority is concerned that a difference exists between the proportion of flights delayed at JFK Airport compared to LaGuardia Airport. Whether or not a flight is delayed is categorical with two possible outcomes: "delay" ('delay'=1) or "no delay" ('delay'=0). Since the proportion of delays is being compared for two airports, a hypothesis test for two independent proportions is most appropriate.



In [None]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import ttest_ind

# Load the flights dataset
flights = pd.read_csv('flights.csv').dropna()
flights

In [None]:
# Define dataframes for each origin airport
EWR = flights[flights['origin'] == 'EWR']
JFK = flights[flights['origin'] == 'JFK']
LGA = flights[flights['origin'] == 'LGA']

In [None]:
all_counts = flights['delay'].value_counts()
all_counts

In [None]:
# Plot whether flights from all NY airports were delayed or not
# value_counts() orders by frequency, so sort by index (0, 1) to match the labels
plt.bar(x=['No delay', 'Delay'], height=all_counts.sort_index())
plt.title('All New York City flights', fontsize=16)
plt.xlabel('Flight status', fontsize=14)
plt.ylabel('Count', fontsize=14)

In [None]:
JFK_counts = JFK['delay'].value_counts()
JFK_counts

In [None]:
# Plot whether flights from JFK were delayed or not
# value_counts() orders by frequency, so sort by index (0, 1) to match the labels
plt.bar(x=['No delay', 'Delay'], height=JFK_counts.sort_index())
plt.title('Flights from JFK', fontsize=16)
plt.xlabel('Flight status', fontsize=14)
plt.ylabel('Count', fontsize=14)

In [None]:
LGA_counts = LGA['delay'].value_counts()
LGA_counts

In [None]:
# Plot whether flights from LGA were delayed or not
# value_counts() orders by frequency, so sort by index (0, 1) to match the labels
plt.bar(x=['No delay', 'Delay'], height=LGA_counts.sort_index())
plt.title('Flights from LGA', fontsize=16)
plt.xlabel('Flight status', fontsize=14)
plt.ylabel('Count', fontsize=14)

Suppose the two closest airports are JFK and LGA. Which airport is best for avoiding delays?

In [None]:
# Find the overall sample proportion of delayed flights at JFK and LGA
sample_prop = (JFK_counts[1] + LGA_counts[1]) / (len(JFK) + len(LGA))
sample_prop

In [None]:
# Find the z test statistic and p-value using proportions_ztest
proportions_ztest(
    count=[JFK_counts[1], LGA_counts[1]],
    nobs=[len(JFK), len(LGA)],
    value=0,
    alternative='two-sided',
    prop_var=sample_prop,
)

**Is there a significant difference in the duration of a delay?**

The Port Authority found evidence of a difference in the proportion of delayed flights between JFK and LGA. But, whether or not a flight's departure is delayed is not the only consideration. A traveler might be willing to deal with a 5-minute departure delay, but a 30-minute departure delay is much more inconvenient. How does the average length of departure delay compare at JFK vs. LGA?

In [None]:
# Subset to only the flights from JFK and summarize
JFK = flights[flights['origin'] == 'JFK']
JFK.describe()

In [None]:
# Subset to only the delayed flights from JFK and summarize
JFK_delays = JFK[JFK['delay'] == 1]
JFK_delays.describe()

In [None]:
# Subset to only the flights from LGA and summarize
LGA = flights[flights['origin'] == 'LGA']
LGA.describe()

In [None]:
# Subset to only the delayed flights from LGA and summarize
LGA_delays = LGA[LGA['delay'] == 1]
LGA_delays.describe()

In [None]:
# Run a t-test on the length of departure delay between JFK and LGA
ttest_ind(
    a=JFK['dep_delay'],
    b=LGA['dep_delay'],
    equal_var=False,
    nan_policy='omit',
    alternative='two-sided',
)

In [None]:
# Plot the distribution of JFK's delays
plt.hist(JFK['dep_delay'])
plt.xlabel('Delay (minutes)', fontsize=14)
plt.ylabel('Counts', fontsize=14)
plt.title('Delays at JFK', fontsize=16)

In [None]:
# Plot the distribution of LGA's delays
plt.hist(LGA['dep_delay'])
plt.xlabel('Delay (minutes)', fontsize=14)
plt.ylabel('Counts', fontsize=14)
plt.title('Delays at LGA', fontsize=16)

**Practical significance**

A difference between two groups is practically significant if the difference is large enough to have a real-life consequence. Based on the previous analysis, the average departure delay at JFK was about 11 minutes, and the average departure delay at LGA was about 10 minutes. The hypothesis test found strong statistical evidence of a difference in the average length of departure delay. However, statistical evidence does not necessarily equate to practical significance.

Suppose the average flight delay from JFK was 1 minute, and the average flight delay from LGA was 2 minutes. An additional minute on a flight delay is not likely to cause someone to miss an important meeting or a connecting flight, so the difference is not practically significant.

Suppose the average flight delay from JFK was 1 minute, and the average flight delay from LGA was 31 minutes. An additional 30 minutes on a flight delay is more likely to cause someone to miss an important meeting or a connecting flight, so the difference is practically significant.
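
The gap between statistical and practical significance can be sketched numerically. The example below uses hypothetical summary values rather than the flights data: an assumed standard deviation of 40 minutes at both airports, equal sample sizes, and a normal approximation for the p-value. The helper `two_mean_z_test()` is defined here for illustration only.

```python
from statistics import NormalDist


def two_mean_z_test(mean1, mean2, sd, n):
    """Approximate two-sided z-test for two means with equal SD and equal n."""
    std_error = sd * (2 / n) ** 0.5
    z_stat = (mean1 - mean2) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value


# Same 1-minute difference in average delay, two hypothetical sample sizes
z_large, p_large = two_mean_z_test(mean1=2, mean2=1, sd=40, n=100_000)
z_small, p_small = two_mean_z_test(mean1=2, mean2=1, sd=40, n=100)

print('n = 100,000 per airport: z =', round(z_large, 2), 'p-value =', p_large)
print('n = 100 per airport: z =', round(z_small, 2), 'p-value =', round(p_small, 2))
```

With very large samples, even a 1-minute difference produces a tiny p-value, yet the difference stays just as small in practical terms. The size of the difference, not only the p-value, determines practical significance.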


In [None]:
# Create subset of only the flights from JFK and LGA
JFK = flights[flights['origin'] == 'JFK']
LGA = flights[flights['origin'] == 'LGA']

In [None]:
# Create overlapping histograms to compare the
# distribution of departure delays
plt.hist(JFK['dep_delay'], edgecolor='black', alpha=0.5, label='JFK')
plt.hist(LGA['dep_delay'], edgecolor='black', alpha=0.5, label='LGA')
plt.xlabel('Delay (minutes)', fontsize=14)
plt.ylabel('Counts', fontsize=14)
plt.legend(fontsize=12)