#Hypothesis Testing

**Hypothesis testing** is used to assess the plausibility of a hypothesis by using sample data. Such data may come from surveying a population or from generating it synthetically. Commonly, two statistical datasets are compared, or a sampled dataset is compared against a synthetic ideal dataset.

Two hypotheses are proposed regarding the relationship between the two datasets -
*   ```Null hypothesis``` ($H_0$) proposes that there is no relationship between these two datasets.
*   ```Alternative hypothesis``` ($H_a$) proposes that a statistical relationship between the two datasets exists.

This comparison is deemed statistically significant if the relationship between the datasets would be an unlikely realization of the ```null hypothesis``` according to a threshold probability, the ```significance level``` ($\alpha$). In other words, we will accept the ```alternative hypothesis``` ($H_a$) only after statistically testing and rejecting the ```null hypothesis``` ($H_0$).

Let us import the required libraries before we begin our experiment -

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
%matplotlib inline

---

##Z-test

Let us understand the idea behind hypothesis testing with an example.

> A principal at a school claims that the students in his school are of above average intelligence. A ```random sample``` of 30 students' IQ scores has a mean of 112.5. We have data from a survey showing that the mean ```population``` IQ is 100 with a standard deviation of 15. Is there sufficient evidence to support the principal's claim?

In the above example, the ```sample``` is from the students of the principal's school, whereas ```population``` might mean the students of the entire country.

Thus, we have the following information -

In [None]:
sample_mu = 112.5
sample_size = 30
population_mu = 100
population_sigma = 15

First, we have to formulate the ```null hypothesis``` and the ```alternative hypothesis``` -
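*   ```Null Hypothesis```: $H_0: \mu_{school} = \mu,\;$ i.e. the mean IQ of the school's students ($\mu_{school}$) equals the population mean ($\mu$), so the students of the school have average IQ
*   ```Alternative Hypothesis```: $H_a: \mu_{school} \gt \mu,\;$ i.e. the students of the school have above average IQ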

Then, we have to find the ```z-score``` using the formula below -
<br /><br />
$$z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{N}}}$$

where $\bar{x}$ is the ```sample mean```, $\mu$ is ```population mean```, $\sigma$ is ```population standard deviation```, and $N$ is ```sample size```.
<br /><br />
The formula to find the ```z-score``` is different in this experiment, compared to the one we saw in the ```Binomial and Gaussian Probability Distribution``` experiment. This is because we are trying to calculate the ```sample z-score``` in this experiment, while we calculated the ```population z-score``` in the previous experiment.

Calculate the ```z-score``` for our data using the new formula -

In [None]:
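# Distance of the sample mean from the population mean, measured in standard errors
z_score = (sample_mu - population_mu) / (population_sigma / np.sqrt(sample_size))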
print("Z-score for sample data: "+str(round(z_score,4)))
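
As a hand check, substituting the values from the example into the formula gives approximately:

$$z = \frac{112.5 - 100}{\frac{15}{\sqrt{30}}} \approx \frac{12.5}{2.7386} \approx 4.5644$$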

Now that we have calculated the z-score for our sample data, we are ready to either accept or reject the null hypothesis.

But before we can do that, we need to fix the significance level ($\alpha$).

Let us use the commonly used default value of $5\%\;(\alpha = 0.05)$, which gives us a confidence of $95\%\;(c = 1-\alpha)$. Since our ```alternative hypothesis``` is one-sided (it claims *above* average IQ), the corresponding critical z-value is approximately $1.645$.

In [None]:
critical_value = stats.norm.ppf(q=1-0.05)
print("The critical value for 95% confidence is: "+str(round(critical_value,4)))

If the ```z-score``` is greater than the ```critical value```, then reject the ```null hypothesis``` (and accept the ```alternative hypothesis```), else accept the ```null hypothesis``` (and reject the ```alternative hypothesis```).

In [None]:
# Keep H0 unless the z-score exceeds the critical value
if z_score <= critical_value:
  print("Null hypothesis accepted.")
else:
  print("Alternative hypothesis accepted.")
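
The same decision can also be phrased in terms of a ```p-value```: the probability, assuming $H_0$ is true, of seeing a z-score at least as large as ours. The sketch below is one way to do this with `stats.norm.sf` (the survival function, i.e. one minus the cdf). The variable names are illustrative, and the example values are re-declared so the cell runs on its own.

```python
import numpy as np
import scipy.stats as stats

# Values from the principal's IQ example (re-declared so this cell is self-contained)
sample_mu, sample_size = 112.5, 30
population_mu, population_sigma = 100, 15
alpha = 0.05

z = (sample_mu - population_mu) / (population_sigma / np.sqrt(sample_size))

# One-sided (right-tail) p-value: P(Z >= z) under the null hypothesis
p_value_z = stats.norm.sf(z)
print("p-value for the z-test: " + str(p_value_z))

# Rejecting when p < alpha is equivalent to rejecting when z > critical value
reject_null = p_value_z < alpha
print("Reject null hypothesis: " + str(reject_null))
```

A p-value far below $\alpha$ leads to the same conclusion as a z-score far above the critical value.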

Let us visualize all the data that we have -

In [None]:
# Under H0 the mean of a sample follows N(population_mu, standard_error), so we plot that curve
standard_error = population_sigma / np.sqrt(sample_size)
x = np.linspace(population_mu - 3*standard_error, population_mu + 3*standard_error, 1000)
norm = stats.norm(population_mu, standard_error)
# Critical z-value converted back to the IQ scale
rejection_region = critical_value * standard_error + population_mu
plt.plot(x, norm.pdf(x))
line_mean, = plt.plot([population_mu, population_mu],[0,norm.pdf(population_mu)], label='Mean')
line_rej, = plt.plot([rejection_region, rejection_region],[0,norm.pdf(rejection_region)], label='Critical Value')
plt.legend([line_mean, line_rej], ['Mean', 'Critical Value'])
plt.show()

The area on the left of the line corresponding to the ```critical value``` on the above graph encompasses $95\%$ of the total region under the curve.

Our ideal ```null hypothesis``` ($H_0$) is represented by the line labelled as ```mean```.

Even if $H_0$ is true, we may not observe data exactly equal to the ```mean``` value, i.e., we still accept $H_0$ even if our observed data deviates from the ```mean```. In fact, we will accept $H_0$ as long as the observed data is less than the ```critical value```.

The observed data, in our example, is so much towards the right of the ```critical value``` that it would not even fit in our graph. Hence, we can reject our ```null hypothesis```, and accept the ```alternative hypothesis``` that *the students of the school have above average IQ*.
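
To see what the $95\%$ region means in practice, one can simulate many samples for which $H_0$ really is true and count how often the test rejects it anyway. The sketch below assumes IQ scores are normally distributed and uses an arbitrary seed and number of trials; the rejection rate should come out close to $\alpha$.

```python
import numpy as np
import scipy.stats as stats

population_mu, population_sigma, sample_size = 100, 15, 30
alpha = 0.05
n_trials = 10000  # arbitrary number of simulated samples

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability

# Each row is one random sample of 30 students drawn from the population (H0 is true)
samples = rng.normal(population_mu, population_sigma, size=(n_trials, sample_size))
z_scores = (samples.mean(axis=1) - population_mu) / (population_sigma / np.sqrt(sample_size))

# Fraction of samples whose z-score lands beyond the critical value
critical = stats.norm.ppf(1 - alpha)
false_rejection_rate = np.mean(z_scores > critical)
print("Fraction of true-H0 samples rejected: " + str(false_rejection_rate))
```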

---
##Chi-Squared test

```Pearson's chi-square test``` is used to determine whether there is a statistically significant difference between the ```expected``` frequencies and the ```observed``` frequencies in one or more categories of a ```contingency table```.

There are two types of ```chi-square tests```. Both use the ```chi-square statistic and distribution``` for different purposes:

*   A ```chi-square goodness of fit test``` determines whether sample data matches a population.
*   A ```chi-square test for independence``` compares two variables in a contingency table to see if they are related.
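
The example that follows is a goodness of fit test. For comparison, a test for independence could look like the sketch below, which uses `scipy.stats.chi2_contingency` on a small contingency table. The counts and the meaning of the rows and columns are made up purely for illustration.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2x3 contingency table (made-up counts, for illustration only):
# rows = two groups of respondents, columns = three answer categories
table = np.array([[20, 30, 50],
                  [30, 30, 40]])

# Returns the statistic, the p-value, the degrees of freedom and the expected counts
chi2_stat, p_val, dof_table, expected_counts = stats.chi2_contingency(table)

print("Chi-squared statistic: " + str(chi2_stat))
print("p-value: " + str(p_val))
print("Degrees of freedom: " + str(dof_table))  # (rows - 1) * (columns - 1)
```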

**Let us consider the following observation as an example** -

> 256 visual artists were surveyed to find out their zodiac sign.
>
> The results were:

|Aries|Taurus|Gemini|Cancer|Leo|Virgo|Libra|Scorpio|Sagittarius|Capricorn|Aquarius|Pisces|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|29|24|22|19|21|18|19|20|23|18|20|23|

<br />
We will test the hypothesis -

*  $H_0$: Zodiac signs are evenly distributed across visual artists.
*  $H_a$: Zodiac signs are not evenly distributed across visual artists.

If the zodiac signs were evenly distributed, then we would expect each zodiac sign to have $256/12$ people.

In [None]:
observed = [29,24,22,19,21,18,19,20,23,18,20,23]
expected = [sum(observed)/len(observed) for i in range(len(observed))]

Calculate the ```Chi-squared statistic``` over $n$ categories -

$$\chi_{c}^{2} = \sum_{i=1}^{n}{\frac{(O_i - E_i)^2}{E_i}},$$

where $O_i$ and $E_i$ are the observed and expected frequencies of the $i^{th}$ category respectively.

In [None]:
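# Squared deviation from the expected count, scaled by the expected count, summed over categories
chi_squared_statistic = sum((o - e)**2 / e for o, e in zip(observed, expected))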
print("Chi-squared statistic is: "+str(chi_squared_statistic))

The ```degrees of freedom``` is one less than the total number of categories. (We subtract one because given all other data, we can calculate the value for the final category - so it is not ```free```.)

In [None]:
dof = len(observed)-1
p_value = 1-stats.chi2.cdf(chi_squared_statistic , dof)
print("The p-value is: "+str(p_value))
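
As a cross-check on the manual calculation, `scipy.stats.chisquare` computes the statistic and the p-value in one call. When no expected frequencies are passed, it assumes all categories are equally likely, which matches our $H_0$. The sketch re-declares ```observed``` so it runs on its own.

```python
import scipy.stats as stats

observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]

# With f_exp omitted, the expected frequencies default to the mean of the observed counts
result = stats.chisquare(observed)

print("Chi-squared statistic: " + str(result.statistic))
print("p-value: " + str(result.pvalue))
```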

Use the ```p-value``` to decide whether to reject the ```null hypothesis```. In general, a ```p-value``` below the chosen ```significance level``` (commonly $1\%$ or $5\%$) would cause you to reject the ```null hypothesis```.

The very large ```p-value``` in our example ($92.65\%$) means that the ```null hypothesis``` should not be rejected.

In [None]:
# Reject H0 only when the p-value falls below the 5% significance level
if p_value < 0.05:
  print("Reject null hypothesis.")
else:
  print("Accept null hypothesis.")
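
The chi-squared decision can also be made with a ```critical value```, mirroring what we did in the Z-test. This sketch assumes the same $\alpha = 0.05$ and uses `stats.chi2.ppf` to find the value that leaves $5\%$ of the distribution in the right tail; the variable names are illustrative.

```python
import scipy.stats as stats

observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]
expected_count = sum(observed) / len(observed)
statistic = sum((o - expected_count)**2 / expected_count for o in observed)

alpha = 0.05
dof = len(observed) - 1
chi2_critical = stats.chi2.ppf(1 - alpha, dof)

print("Chi-squared statistic: " + str(round(statistic, 4)))
print("Critical value at 95% confidence: " + str(round(chi2_critical, 4)))

# Reject H0 only if the statistic falls in the right tail beyond the critical value
reject_null = statistic > chi2_critical
print("Reject null hypothesis: " + str(reject_null))
```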

---
#Conclusion

In this experiment, you -
* Got an overview of hypothesis testing
* Implemented Z-test
* Implemented Chi-squared test

```Hypothesis testing```, also called ```Confirmatory Data Analysis```, is an integral part of ```Data Analysis```. It can give statistical support to any hypothesis that you make by analysing datasets.

In this experiment we simply implemented a few formulas to do basic hypothesis testing without really going into the depth of the ```why's``` and ```how's``` of these things. You should research more into these topics to learn the theories and mathematics behind everything that we did. You will find these topics useful.