In [None]:
import numpy as np
name_list = ['Sreelatha', 'Sara', 'Eva', 'Maaike', 'Victor', 'Zuzanna']
np.random.choice(name_list)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
from scipy.stats import ttest_1samp

from scipy.stats import t
import matplotlib.pyplot as plt

## Introduction

Many times in the real world we would like to resolve a question that requires a comparison of two quantities. For example, does leaving the light on cause people to take longer to fall asleep, or do books with more pages sell more copies? Using a few basic assumptions, we can use statistical inference to come to a conclusion and determine an answer to these questions. **Hypothesis tests allow us to compare two samples and, using certain assumptions, we can either reject or not reject our hypothesis (as true statisticians we never say that we accept a hypothesis, only that we reject or do not reject it).**



We will examine statistical hypothesis testing using our good old fashioned car data set. Let us first retrieve the data.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/loukjsmalbil/datasets_ws/master/vehicles.csv')

In [None]:
data.head()

## The Main Idea

Recall that in the previous lecture on confidence intervals, we computed an interval. This interval was a range for which we could say, with some amount of confidence, that it contains the true mean of the actual population. Let us construct such an interval for the cars data set. In this case, we are interested in a 95% confidence interval for the average fuel barrels per year for Audi's 2008 models. 

In [None]:
from scipy.stats import t, sem

In [None]:
sample_data = data[(data.Make == 'Audi') & (data.Year==2008)]['Fuel Barrels/Year']

In [None]:
# Variables
alpha = 0.05
conf_level = 1 - alpha       # 0.95     
deg_freedom = len(sample_data) - 1
mean = np.mean(sample_data)
sm = sem(sample_data)   # standard error of the mean: s / np.sqrt(n)

# Confidence interval
t.interval(conf_level, df=deg_freedom, loc=mean, scale=sm)

It seems that with 95% confidence, we can say that the true mean $\mu$ lies within this interval. There are, however, some other tools to infer certain information about the population from a sample: hypothesis testing. 

In hypothesis testing, we no longer look at the interval directly, but make a hypothesis about the distribution of the population and test that hypothesis. 
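The link between a confidence interval and a hypothesis test can be made concrete. The sketch below is a minimal illustration on a small made-up sample (the values and the helper names `rejects` and `outside_interval` are assumptions, not part of the vehicles data): a two-sided t-test at level $\alpha$ rejects a hypothesised mean exactly when that value falls outside the $(1-\alpha)$ confidence interval.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, invented for illustration only
sample = np.array([17.2, 18.9, 16.4, 19.1, 17.8, 18.3, 16.9, 18.0])
alpha = 0.05

n = len(sample)
mean = sample.mean()
se = stats.sem(sample)  # s / sqrt(n), using the sample standard deviation

# (1 - alpha) confidence interval for the population mean
low, high = stats.t.interval(1 - alpha, df=n - 1, loc=mean, scale=se)

def rejects(mu0):
    """Decision of a two-sided one-sample t-test at level alpha."""
    _, p = stats.ttest_1samp(sample, mu0)
    return bool(p < alpha)

def outside_interval(mu0):
    """True if mu0 is not covered by the confidence interval."""
    return not (low <= mu0 <= high)

# Both columns should agree for every hypothesised mean
for mu0 in (16.0, 17.5, 18.0, 19.5):
    print(mu0, rejects(mu0), outside_interval(mu0))
```

Both views use the same ingredients (sample mean, standard error and t-distribution); they only differ in which question is asked first.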

https://homepage.divms.uiowa.edu/~mbognar/applets/normal.html

## Key Concepts in Hypothesis Testing

Now that we have the data, let us have a look at some of the most important concepts in hypothesis testing. 

*   **Statistical Hypothesis** - A statistical hypothesis is an assumption about the population average. For example, we can assume the average income of NYC residents is $850,000. This assumption may or may not be true.
*   **Null Hypothesis** - Denoted with H0, a null hypothesis is an assumption that the population average is identical to a specific value.  The typical notation is μ = μ0, where μ refers to the population mean and μ0 refers to the hypothesized value.  


### Example of a Null Hypothesis

For example, a null hypothesis can be that the average NYC residents' income is $850,000. The null hypothesis is represented by:

*  **H0: μ = 850,000**


## Key Concepts in Hypothesis Testing (Continued)

*  **Alternative Hypothesis** - An alternative hypothesis is the *opposite* of the null hypothesis. We compare this hypothesis with the null hypothesis to decide whether or not *we reject the null hypothesis.* We denote the alternative hypothesis with **H1 or Ha**. 

For the previous example of the null hypothesis, the alternative hypothesis can be any of the following:

*   **H1: μ > 850,000**
*   **H1: μ < 850,000**
*   **H1: μ ≠ 850,000**



## Example: Fuel Efficiency in 2008 Audi Models

Let us now have a look at a Python implementation of the null and alternative hypotheses. Suppose that we have a car manufacturer (Audi) telling us that their 2008 model cars on average use only 16.6 barrels per year. We can test this claim (i.e. hypothesis) in Python with the scipy.stats module. In order to carry out a proper procedure we follow these steps:



1.   Specify the Null Hypothesis;
2.   Specify the Alternative Hypothesis;
3.   Set the significance level $\alpha$; 
4.   Conduct the hypothesis Test by computing the z or t-score;
5.   Calculate the p-value;
6.   Compare the p-value with $\alpha$;
7.   Reject the Null hypothesis if the p-value is less than or equal to $\alpha$.
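The steps above can be sketched as one small function. This is only an illustration under stated assumptions: the function name `one_sample_t_test`, its parameters and the sample values are hypothetical, and the t-distribution is used throughout (population $\sigma$ unknown).

```python
import numpy as np
from scipy import stats

def one_sample_t_test(sample, mu0, alpha=0.05, alternative="greater"):
    """Hypothetical helper: returns (t_stat, p_value, reject)."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    df = n - 1

    # Step 4: test statistic t = (x_bar - mu0) / (s / sqrt(n))
    t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

    # Step 5: p-value = tail area under H0, chosen by the alternative (steps 1-2)
    if alternative == "greater":
        p_value = stats.t.sf(t_stat, df)
    elif alternative == "less":
        p_value = stats.t.cdf(t_stat, df)
    elif alternative == "two-sided":
        p_value = 2 * stats.t.sf(abs(t_stat), df)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

    # Steps 6-7: compare with the significance level alpha (step 3)
    return t_stat, p_value, bool(p_value <= alpha)

# Made-up sample, for illustration only
sample = [17.2, 18.9, 16.4, 19.1, 17.8, 18.3, 16.9, 18.0]
t_stat, p_value, reject = one_sample_t_test(sample, mu0=16.6)
print('t statistic:', t_stat)
print('p-value:', p_value)
print('reject H0:', reject)
```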



#### 1. Specify the Null Hypothesis

We specify the null hypothesis for our problem as follows: 



*   H0: the average fuel barrels/year for Audi's 2008 models is 16.6. 



In [None]:
# Ho: the average fuel barrels/year is equal to 16.6

In [None]:
data[(data.Make == 'Audi') & (data.Year==2008)]['Fuel Barrels/Year'].mean()

#### 2. Specify the Alternative Hypothesis

Our Alternative hypothesis is the following:

*  H1: the average fuel barrels/year in 2008 > 16.6

In [None]:
# Ha: the average FB/Y is > 16.6.

#### 3. Set the Significance Level

We want to set a significance level, called $\alpha$, to make sure that the difference between the hypothesised population mean and our sample mean is most likely not due to chance. In other words, we use the significance level as a **kind of cut-off value** for the p-value. Here we will set it as follows:

*   $\alpha$ = 0.05.

Recall that we used the same value to compute the confidence interval. In that case, however, we used 1-$\alpha$ to obtain the 0.95 area under the curve probability value -- the centre of the curve. Here, on the other hand, we are interested in the tail(s). 
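To see the difference between the centre and the tail, the short sketch below compares the two cut-offs on the t scale. The degrees of freedom are an assumption chosen for illustration.

```python
from scipy.stats import t

alpha = 0.05
df = 47  # assumed degrees of freedom, used here only for illustration

# Confidence interval: alpha is split over both tails, 1 - alpha stays in the centre
ci_cutoff = t.ppf(1 - alpha / 2, df)

# One-sided test (H1: mu > mu0): all of alpha sits in the right tail
right_tail_cutoff = t.ppf(1 - alpha, df)

print('central 95% lies between +/-', round(ci_cutoff, 3))
print('right 5% tail starts at', round(right_tail_cutoff, 3))
```

Because the whole of $\alpha$ is placed in one tail, the one-sided cut-off is smaller than the cut-off used for the interval.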



#### 4. Conduct the hypothesis Test by Computing the z or t-score

In order to determine whether or not we can reject our null hypothesis, we need a test statistic. As seen in the previous lecture, we can use a $z$-statistic when the population standard deviation $\sigma$ is known. In most cases, however, $\sigma$ is unknown and has to be estimated from the sample. We then use a $t$-distribution, which accounts for this extra uncertainty through the degrees of freedom. We compute the $z$-score as follows:


$$z = \frac{\bar{X} - \mu}{\frac{\sigma}{n^{1/2}}},$$

where $\bar{X}$ is the mean of the sample, $\mu$ the hypothesised population mean, $\sigma$ the population standard deviation and $n$ the size of the sample, so that $n^{1/2} = \sqrt{n}$. 

The t test can be computed as follows:

$$t = \frac{\bar{X} - \mu}{\frac{s}{n^{1/2}}},$$

where $s$ denotes the standard deviation of the sample. Here we will use the t test.
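As a quick check of the formula, the sketch below computes $t$ by hand on a small made-up sample (the numbers are an assumption, not taken from the vehicles data). Note that $s$ is the sample standard deviation, so NumPy needs `ddof=1`; its default `ddof=0` would give the population formula instead.

```python
import numpy as np
from scipy.stats import sem, ttest_1samp

# Hypothetical sample, invented for illustration only
sample = np.array([17.2, 18.9, 16.4, 19.1, 17.8, 18.3, 16.9, 18.0])
mu0 = 16.6  # hypothesised population mean

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)            # sample standard deviation (divides by n - 1)
standard_error = s / np.sqrt(n)   # same quantity that scipy.stats.sem returns

t_by_hand = (x_bar - mu0) / standard_error
t_scipy, _ = ttest_1samp(sample, mu0)

print('standard error by hand:', standard_error, ' sem:', sem(sample))
print('t by hand:', t_by_hand, ' t from scipy:', t_scipy)
```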

In [None]:
# Sample
sample_data = data[(data.Make == 'Audi') & (data.Year==2008)]['Fuel Barrels/Year']

# Terms
mu = 16.6
x_bar = sample_data.mean()
se = sem(sample_data)       #  standard error 

In [None]:
# Use a separate name so the imported t-distribution object `t` is not overwritten
t_stat = (x_bar - mu) / se
print('t statistic:', t_stat)

In [None]:
#@title
#### Intermezzo: Degrees of Freedom

#Recall that the degrees of freedom denote the amount of variability, or rather: the number of values that are free to vary. For instance, suppose we have a sample set $S = [3, 1, 6, 1, x]$ and we know that the mean is 3.6. Then we know that $x$ must be 7. $x$ *depends* for its value on the other values and the mean, whereas the other 4 values are *free to vary*. As such, these values are *independent*. 

#The idea is that if we do not know the standard deviation, we need to take into account the variability in the data. As such, we use Student's t-distribution, which helps us to take into account the degrees of freedom. Fewer degrees of freedom indicate fatter tails and, therefore, more variability. More degrees of freedom - and hence, more data points - reduce the variability.  

Let's plug this value into our Student's t-distribution. Recall that the degrees of freedom is 47.

https://homepage.divms.uiowa.edu/~mbognar/applets/t.html
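The effect of the degrees of freedom on the tails can also be inspected numerically. The sketch below is illustrative only and the list of degrees of freedom is an arbitrary choice. It prints the value that cuts off the upper 2.5% of the t-distribution and compares it with the standard normal.

```python
from scipy.stats import norm, t

dfs = (2, 5, 10, 30, 47, 1000)  # arbitrary values chosen for illustration

# Upper 2.5% cut-off of the t-distribution for each number of degrees of freedom
cutoffs = [t.ppf(0.975, df) for df in dfs]
z_cutoff = norm.ppf(0.975)      # same cut-off for the standard normal

for df, cutoff in zip(dfs, cutoffs):
    print('df =', df, ' cut-off =', round(cutoff, 3))
print('normal cut-off =', round(z_cutoff, 3))
```

With few degrees of freedom the cut-off lies far out in the tail; as the degrees of freedom grow, it approaches the normal value.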

#### 5. Calculate the p-value

Computing the t test score is a bit of a tedious operation. Luckily, we can use a method from the scipy.stats module to compute both the t test as well as the p-value for us. 

In [None]:
# Ho: the average fuel barrels/year is equal to 16.6
# Ha: the average fuel barrels/year is > 16.6. 

sample_data = data[(data.Make == 'Audi') & (data.Year==2008)]['Fuel Barrels/Year']
st, p = ttest_1samp(sample_data, 16.6)

print('Statistic:', st)
print('The p-value is:', p/2)
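Why divide the p-value by two? `ttest_1samp` reports a two-sided p-value by default, that is, the area in both tails. The sketch below, on a made-up sample (an assumption, not the Audi data), computes the tail areas directly from the t-distribution and shows that halving only gives the right-tail area when the statistic is positive.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Hypothetical sample, invented for illustration only
sample = np.array([17.2, 18.9, 16.4, 19.1, 17.8, 18.3, 16.9, 18.0])
mu0 = 16.6

t_stat, p_two_sided = ttest_1samp(sample, mu0)
df = len(sample) - 1

p_right = t.sf(t_stat, df)   # area to the right of t_stat (H1: mu > mu0)
p_left = t.cdf(t_stat, df)   # area to the left of t_stat  (H1: mu < mu0)

print('t statistic:', t_stat)
print('two-sided p / 2:', p_two_sided / 2)
print('right-tail p:', p_right)
print('left-tail p:', p_left)
```

Here the statistic is positive, so `p / 2` matches the right-tail area. Had it been negative, the right-tail area would be `1 - p / 2`, which is why the decision rule also has to look at the sign of the statistic.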

#### 6. and 7. Compare the p-value to $\alpha$ and reject or fail to reject the Null Hypothesis. 

We can now compare the p-value to $\alpha$.



In [None]:
print('Statistic: ', st)
print('p-value: ', p/2)

print('\nOur Null hypothesis is: Ho: mu = 16.6')

print('\nNull hypothesis rejected') if p/2 <= alpha and st > 0 else print('Null hypothesis can\'t be rejected')

#### Steps 1-7 in overview.

We did the following:

In [None]:
# Ho: the average fuel barrels/year is equal to 16.6
# Ha: the average fuel barrels/year is > 16.6.

sample_data = data[(data.Make == 'Audi') & (data.Year==2008)]['Fuel Barrels/Year']

st, p = ttest_1samp(sample_data, 16.6)

print('Statistic: ', st)
print('\np-value: ', p/2)

# One-sided test (Ha: mean > 16.6): halve the two-sided p-value and check the sign
if p / 2 <= alpha and st > 0:
    print("\nNull hypothesis rejected")
    print("\nThe mean barrels per year in 2008 for Audi's was higher than 16.6")
else:
    print("\nNull hypothesis can't be rejected")

Now suppose that instead of saying that the mean barrels per year in 2008 for Audi was 16.6, our car manufacturer told us that the mean fuel barrels per year was 18.3.

In [None]:
# Ho: the average fuel barrels/year is equal to 18.3
# Ha: the average fuel barrels/year is > 18.3.

sample_data = data[(data.Make == 'Audi') & (data.Year==2008)]['Fuel Barrels/Year']

st, p = ttest_1samp(sample_data, 18.3)

print('Statistic: ', st)
print('\np-value: ', p/2)

print('\nNull hypothesis rejected') if p/2 <= alpha and st > 0 else print('Null hypothesis can\'t be rejected')

## Types of Alternative Hypotheses 

There are two types of alternative hypotheses: one-sided and two-sided. A one-sided alternative hypothesis (a.k.a. directional hypothesis) is used to determine whether the population average differs from the hypothesized value in a specific direction (larger but not smaller than, or vice versa). **In contrast, a two-sided alternative hypothesis (a.k.a. nondirectional hypothesis) is used to determine whether the population average differs from the hypothesized value in either direction, that is, whether it is either greater than or less than that value.**

Why do we differentiate between these two types? Because sometimes we are not interested in whether the population average is larger or smaller than the hypothesized value; we only care whether they are different. In this case, we use a two-sided instead of a one-sided hypothesis. If, however, we care about the direction of the difference, we use a one-sided hypothesis.

For instance, let $k$ be some value and our hypothesis that $\mu = k$. Thus,

* H0: μ = k	

Then a one-sided Alternative Hypothesis would look like this:

* H1: μ > k *OR*
* H1: μ < k	

On the other hand, a two-sided Alternative Hypothesis would look like this:

* H1: μ ≠ k

https://machinelearningmastery.com/statistical-hypothesis-tests/ 

The idea of a two-sided Alternative Hypothesis is that we no longer look at only one rejection region, but at both the left and the right tail of the distribution. 

![alt text](https://camo.githubusercontent.com/1a8495533cd8c50c73ef2b881d1aaab7d6c00fa2/68747470733a2f2f736c696465706c617965722e636f6d2f736c6964652f393332353539392f32382f696d616765732f362f54797065732b6f662b4879706f7468657369732b54657374732e6a7067)

We will discuss the two-sided Alternative Hypothesis in more detail in the next lecture. The idea, however, is quite simple. If we want to determine whether we can reject the null hypothesis, we can simply implement this as follows:

In [None]:
# Ho: the average fuel barrels/year is equal to 18.3
# Ha: the average fuel barrels/year is not equal to 18.3

sample_data = data[(data.Make == 'Audi') & (data.Year==2008)]['Fuel Barrels/Year']

st, p = ttest_1samp(sample_data, 18.3)

print('Statistic: ', st)
print('\np-value: ', p)

print('\nNull hypothesis rejected') if p <= alpha else print('\nNull hypothesis can\'t be rejected')

By default, the ttest_1samp function carries out a two-sided hypothesis test. 
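Instead of halving the two-sided p-value by hand, the direction can be passed to the function itself. The sketch below assumes SciPy 1.6 or newer, where `ttest_1samp` accepts an `alternative` argument, and uses a made-up sample rather than the Audi data.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample, invented for illustration only
sample = np.array([17.2, 18.9, 16.4, 19.1, 17.8, 18.3, 16.9, 18.0])
mu0 = 16.6

# Requires SciPy >= 1.6 (assumption): `alternative` selects the tail(s)
res_two_sided = ttest_1samp(sample, mu0)                       # H1: mu != mu0
res_greater = ttest_1samp(sample, mu0, alternative='greater')  # H1: mu > mu0
res_less = ttest_1samp(sample, mu0, alternative='less')        # H1: mu < mu0

print('two-sided p:', res_two_sided.pvalue)
print('greater p:  ', res_greater.pvalue)
print('less p:     ', res_less.pvalue)
```

The test statistic is identical in all three calls; only the tail area that counts as the p-value changes.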

## Test Assumptions

In order to conduct a hypothesis test we need to meet certain assumptions:

1. Our observations must be **independent** of each other. For example, if we have people who live in the same household participating in a medical trial, they might be exposed to the same environmental conditions or eat the same food. This can bias our results.
2.   Normality of data - We assume that the sample is drawn from a normally distributed population.
3.   Adequate sample size. As a rule of thumb, the sample size should be greater than 30 if we want to perform the test using the normal distribution rather than the t-distribution.
4.   In order to use the normal distribution for our hypothesis test, we must assume the population standard deviation is known. If the population standard deviation is not known, then we use the t-distribution for the hypothesis test.
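The normality assumption can be checked informally before running the test. The sketch below uses the Shapiro-Wilk test from scipy.stats on two synthetic samples; the data, the seed and the sample size are assumptions made for illustration. A small p-value suggests that the data deviate from a normal distribution.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic data, generated only to illustrate the check
rng = np.random.default_rng(0)
samples = {
    'normal': rng.normal(loc=18, scale=2, size=48),
    'skewed': rng.exponential(scale=2, size=48),
}

results = {}
for name, values in samples.items():
    stat, p = shapiro(values)   # H0: the sample comes from a normal distribution
    results[name] = (stat, p)
    print(name, ' W =', round(stat, 3), ' p-value =', round(p, 4))
```

Such a check complements, but does not replace, a look at a histogram or Q-Q plot of the sample.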

Please also be careful about sampling bias:

*   **Sampling Bias** - Sampling bias occurs when some members of the intended population are less likely to be included than others in the selected sample. Say we collected the data at Wall Street at 10AM on Monday. Significant bias can be introduced due to the flaws of the data collection because people who appear at Wall Street at that specific time are most likely working professionals in the financial industry. The sample data will miss a lot of people from other industries and other NYC neighborhoods. Consequently, the selected sample is not a good representation of NYC residents.



## Summary

In this lesson we learned how to perform a hypothesis test. To do this, we had to introduce a large number of terms: population, sample, null hypothesis, alternative hypothesis, test statistic, p-value, and confidence interval. We then showed how we can use our test to come up with a conclusion for a real-life scenario and learned how to do this with Python.