
# Objectives:

- Student can execute the steps of a 1-sample t-test
- Student can perform a 1-sample t-test with SciPy


# Introduction

A one-sample t-test is used to detect statistically significant differences between a sample mean and a known or hypothesized population value.

![Inferential Statistics](https://slideplayer.com/slide/5130463/16/images/2/Statistical+Inference.jpg)

### Let's use the Adult dataset to practice some 1-sample t-tests

In [None]:
import pandas as pd

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

column_headers = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                  'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                  'capital-gain', 'capital-loss', 'hours-per-week', 
                  'native-country', 'income']

df = pd.read_csv(data_url, names=column_headers, skipinitialspace=True)

print(df.shape)
df.head()

(32561, 15)


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


This dataset was sourced from US census responders in 1994, so it is not necessarily representative of the United States at large; it is representative only of census responders from that year.

In order to demonstrate how sample statistics can be used to estimate population parameters, we will treat this dataset as if it were the "population" that we are trying to estimate.

We'll randomly sample from this dataset and try to estimate its population parameters. Let's estimate the average number of years of education completed by census responders in 1994. 

As we select our random sample we will use a **`random_state` of 42** to make sure that we get the same random sample every time we sample the data. 
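
The seeded sampling described above might look like the following minimal sketch. To keep the cell runnable without a download, it uses a small synthetic stand-in for the `education-num` column (the values are invented, not taken from the Adult data); with the real data you would call `df['education-num'].sample(...)` instead.

```python
import pandas as pd

# Synthetic stand-in for df['education-num']; these values are invented
# for illustration and are NOT taken from the Adult dataset.
population = pd.Series([9, 10, 13, 9, 7, 14, 10, 9, 13, 11] * 10,
                       name='education-num')

# random_state fixes the seed, so the same rows are drawn every time.
sample_a = population.sample(n=20, random_state=42)
sample_b = population.sample(n=20, random_state=42)

print(sample_a.equals(sample_b))  # the two draws are identical
print(len(sample_a), sample_a.mean())
```

Changing `random_state` (or omitting it) would generally produce a different sample and therefore a different sample mean.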

## Steps of a One Sample T-test

A one-sample t-test helps us test whether our sample statistic (the sample mean) is consistent with our hypothesized value for the population (the population mean).

### 1) Null Hypothesis

The null hypothesis in a 1-sample t-test is that there is no difference between the true population mean and the value we hypothesize for it (the null hypothesis value).

Stated in mathematical terms:

$H_0: \mu = \mu_0$

Where $\mu$ is the population mean and $\mu_0$ is the hypothesized value. The sample mean, $\overline{x}$, is the evidence we use to evaluate this claim.

A sample mean that is different enough from the hypothesized value to be deemed "statistically significant" indicates that the two values are so far apart that the difference is unlikely to be due to the randomness of sampling alone (e.g. due to us just getting an unlucky sample). The technical term for this "randomness of sampling" is "sampling error."

In hypothesis testing we seek to "nullify" the null hypothesis (hence the name). However, we won't reject this hypothesis unless there is ample evidence against it.


### 2) Alternative Hypothesis

The alternative hypothesis is always the opposite of the null hypothesis. Since they are logical opposites, they can't both be true at the same time. Our hypothesis test will help us decide between the two statements.

$H_a: \mu \neq \mu_0$

Our alternative hypothesis is that the population mean is not equal to the hypothesized value. If we reject the null hypothesis, we are concluding that the data favor the alternative hypothesis.

### 3) Confidence Level

A confidence level is a threshold that we pick to determine how strong the evidence against the null hypothesis needs to be in order for us to reject it. In many areas of academic study and statistics, using a 95% confidence level as this threshold is a common convention.

If we use a 95% confidence level then we will not reject the null hypothesis unless a result at least as extreme as ours would occur less than 5% of the time when the null hypothesis is true.
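
As a rough illustration, a 95% confidence level corresponds to a significance level of 0.05, which in turn implies a cutoff (critical value) for the t-statistic. The sketch below assumes a two-sided test and a sample size of 20; the variable names are illustrative only.

```python
from scipy import stats

confidence_level = 0.95
alpha = 1 - confidence_level          # significance level

n = 20                                # assumed sample size
dof = n - 1                           # degrees of freedom for a 1-sample t-test

# Two-sided test: split alpha evenly between the two tails.
t_critical = stats.t.ppf(1 - alpha / 2, dof)

print(round(alpha, 2), round(t_critical, 3))
```

Under these assumptions, a t-statistic whose absolute value exceeds `t_critical` would lead us to reject the null hypothesis.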

### At this point we are ready to run our t-test

### 4) T-statistic and P-value

A t-statistic measures how far our sample mean is from the hypothesized population mean, expressed in standard errors, so it accounts for both the sample size and the spread of the sample. The t-statistic is translated into a p-value (a probability).
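
For reference, the one-sample t-statistic is usually written as $t = \frac{\overline{x} - \mu_0}{s / \sqrt{n}}$, where $\overline{x}$ is the sample mean, $\mu_0$ is the hypothesized population value, $s$ is the sample standard deviation, and $n$ is the sample size. A minimal by-hand sketch, using an invented sample rather than the Adult data:

```python
import math
import statistics

# Hypothetical sample of years of education (invented values, not drawn
# from the Adult dataset) and a hypothesized population mean.
sample = [9, 10, 13, 9, 7, 14, 10, 9, 13, 11]
hypothesized_mean = 10

n = len(sample)
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)        # sample std dev (n - 1 denominator)
standard_error = sample_std / math.sqrt(n)

# How many standard errors the sample mean lies from the hypothesized value.
t_statistic = (sample_mean - hypothesized_mean) / standard_error
print(round(t_statistic, 3))
```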

### **The P-value is the probability of seeing a sample at least as extreme as the one we collected, assuming that the null hypothesis is true.**

In our case, since the p-value is greater than .05, we fail to reject the null hypothesis. (Remember that at a 95% confidence level we only reject when the p-value is below .05.)
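
In practice, SciPy's `ttest_1samp` computes the t-statistic and the p-value in one call. The sketch below uses an invented sample (not the Adult data) and a hypothesized mean of 10, so its numbers will not match the results quoted in this lesson.

```python
from scipy import stats

# Hypothetical sample (invented values, not the Adult data).
sample = [9, 10, 13, 9, 7, 14, 10, 9, 13, 11]
hypothesized_mean = 10
alpha = 0.05  # corresponds to a 95% confidence level

# Two-sided one-sample t-test.
result = stats.ttest_1samp(sample, popmean=hypothesized_mean)
t_statistic, p_value = result.statistic, result.pvalue

print(f"t = {t_statistic:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```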


### 5) Conclusion

After each test we will provide a written conclusion reporting the results of the test. An appropriate conclusion might look something like:

> Due to a t-statistic of 1.128 and a p-value of .273, we fail to reject the null hypothesis of no difference between the sample mean and the hypothesized population mean.


Why use the phrase "fail to reject" rather than just "accept"? At the end of the day we're dealing with probabilities, and there's still a chance that we could be wrong. We just don't have enough evidence to nullify the null hypothesis; that doesn't necessarily mean that it's true.
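
One way to keep the wording of conclusions consistent is a small helper that picks "reject" or "fail to reject" from the p-value. The function below, `write_conclusion`, is a hypothetical helper written for this illustration; it is not part of SciPy or pandas.

```python
def write_conclusion(t_statistic, p_value, alpha=0.05):
    """Return a one-sentence conclusion for a hypothesis test.

    Hypothetical helper for illustration; not part of any library.
    """
    decision = "reject" if p_value < alpha else "fail to reject"
    return (f"Due to a t-statistic of {t_statistic:.3f} and a p-value of "
            f"{p_value:.3f}, we {decision} the null hypothesis.")

# Values reported in the example conclusion above.
print(write_conclusion(1.128, 0.273))
```

Note that the helper never says "accept": the only two outcomes are rejecting or failing to reject.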

## More Examples:

Imagine that someone makes the claim that the average number of years of education obtained by female census responders in the year 1994 was 10.

Let's generate a sample and test this hypothesis. Again, we'll use a sample size of 20. We will also continue to use the same `random_state` of 42.

1) Null Hypothesis:

Average number of years of education obtained by female census responders is equal to 10 years.

2) Alternative Hypothesis:

Average number of years of education obtained by female census responders is different from 10 years.

3) Confidence Level: 95%

4 & 5) Conclusion:

Due to a t-statistic of .753 and a p-value of .46, we **fail to reject** the null hypothesis that the average number of years of education obtained by female census responders is 10.

A p-value of .46 is well above our threshold: our p-value has to be < .05 in order for us to reject the null hypothesis.
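
The steps for this example could be sketched as a filter, a seeded sample, and a call to `ttest_1samp`. The helper name `run_female_education_test` is hypothetical, and the tiny frame below is synthetic (invented values), so the output will not reproduce the t-statistic and p-value reported above; with the real data you would pass `df` instead of `toy`.

```python
import pandas as pd
from scipy import stats

def run_female_education_test(df, n, popmean, random_state=42):
    """Sample n female rows and test whether mean education-num equals popmean.

    Hypothetical helper; assumes df has 'sex' and 'education-num' columns.
    """
    females = df[df['sex'] == 'Female']
    sample = females['education-num'].sample(n=n, random_state=random_state)
    return stats.ttest_1samp(sample, popmean=popmean)

# Tiny synthetic frame (invented values) so the cell runs on its own.
toy = pd.DataFrame({
    'sex': ['Female', 'Male'] * 30,
    'education-num': [9, 10, 13, 9, 7, 14, 10, 9, 13, 11, 12, 10] * 5,
})

result = run_female_education_test(toy, n=20, popmean=10)
print(result.statistic, result.pvalue)
```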

### Now let's increase the sample size.

This time we will use a sample size of 50, using the same `random_state` of 42.

1) Null Hypothesis:

Average number of years of education obtained by female census responders is equal to 10 years.

2) Alternative Hypothesis:

Average number of years of education obtained by female census responders is different from 10 years.

3) Confidence Level: 95%

4 & 5) Conclusion:

Based on a t-statistic of 2.15 and a p-value of .036, we **reject** the null hypothesis that the average number of years of education obtained by female census respondents is equal to 10, in favor of the alternative hypothesis that it is different from 10 years.
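
One reason a larger sample can tip the decision is that the standard error shrinks as the sample size grows. The sketch below holds an invented sample mean and standard deviation fixed and changes only `n`. In a real resample both would change as well, so treat this as an illustration of the $\sqrt{n}$ effect rather than a reproduction of the results above. `t_from_summary` is a hypothetical helper.

```python
import math

def t_from_summary(sample_mean, sample_std, n, popmean):
    """t-statistic computed from summary statistics (hypothetical helper)."""
    standard_error = sample_std / math.sqrt(n)
    return (sample_mean - popmean) / standard_error

# Invented summary statistics, held fixed while only n changes.
sample_mean, sample_std, popmean = 10.6, 2.5, 10

for n in (20, 50):
    t = t_from_summary(sample_mean, sample_std, n, popmean)
    print(n, round(t, 3))
```

With the same observed difference, the larger sample yields a larger t-statistic and therefore a smaller p-value.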

## More examples from the dataset as time allows.