# Testing distributions

* [Overview](#overview) 
* [Testing distributions](#sec1)
* [Summary](#sum)
* [References](#refs)

## Overview

The previous section introduced the $\chi^2$ test. In this section we use this statistic to test whether the data follow a particular distribution.

## Testing distributions

Frequently we want to know whether the data we are working with follow a specific distribution. We can use histograms to get an idea of how the empirical data are distributed, but histograms introduce binning bias: we may interpret the results differently depending on the number of bins we use. We can also use the hypothesis testing framework we saw a few sections ago to test the hypothesis that the data follow a specific distribution.
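The bin dependence of a histogram can be seen in a small sketch. The sample below is synthetic (drawn from a standard normal with a fixed seed), and the bin counts 4, 10 and 25 are arbitrary choices made for illustration only.

```python
import numpy as np

# Synthetic sample, assumed here purely for illustration
rng = np.random.default_rng(0)
sample = rng.normal(size=100)

# Bin the same data with different numbers of equal-width bins
histograms = {}
for n_bins in (4, 10, 25):
    counts, edges = np.histogram(sample, bins=n_bins)
    histograms[n_bins] = counts
    print(f"{n_bins:>2} bins: {counts}")
```

The same 100 observations produce visibly different shapes as the number of bins changes, which is why a formal test is preferable to visual inspection alone.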

More formally, consider the sample $(X_1, \dots, X_n)$ from a distribution $F$. For a given known distribution $F_0$, we want to test the following hypotheses:

$$H_0: F = F_0 ~~\text{vs}~~ H_a: F \neq F_0$$

How can we perform such a test? 

To conduct the test above, we take all possible values of the variable $X$ under $F_0$ and split them into $N$ bins $b_1, \dots, b_N$ [1]. In general we require that each bin has a sufficiently high expected count. A common rule of thumb asks for an expected count of at least 5 in each bin, and anywhere from 5 to 8 bins is usually enough [1].

----

**Remark**

The set of all possible values of $X$ under $F_0$ is called the support of $F_0$ [1]

----

The observed count for the $k$-th bin is the number of $X_i$ that fall into $b_k$. More formally [1],

$$Obs(k) = \# \{i = 1, \dots, n: X_i \in b_k\}$$

Now if the null hypothesis is true and all $X_i$ follow $F_0$, then $Obs(k)$, i.e. the number of “successes” in $n$ trials, has a Binomial distribution with parameters $n$ and $p_k$, where the latter is given by

$$p_k = F_0(b_k) = P\{X_i \in b_k \mid H_0\}.$$

The corresponding expected count is then the expected value of this Binomial distribution [1]:

$$Exp(k) = np_k = nF_0(b_k)$$
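The observed and expected counts defined above can be computed as in the following minimal sketch. It assumes, for illustration only, that $F_0$ is the standard normal distribution, that the sample is synthetic, and that the bin edges are chosen by hand; the helper `binned_counts` is a hypothetical name, not part of any library.

```python
import numpy as np
from scipy.stats import norm


def binned_counts(sample, inner_edges, dist):
    """Return (Obs, Exp) for bins cut at inner_edges.

    The bins are (-inf, e_1], (e_1, e_2], ..., (e_m, +inf), so together
    they cover the whole support of a distribution on the real line.
    """
    sample = np.asarray(sample)
    inner_edges = np.asarray(inner_edges)

    # Obs(k): count how many observations fall into each bin
    bin_index = np.searchsorted(inner_edges, sample, side="left")
    obs = np.bincount(bin_index, minlength=len(inner_edges) + 1)

    # p_k = F_0(b_k): probability of each bin under the null distribution
    all_edges = np.concatenate(([-np.inf], inner_edges, [np.inf]))
    p = np.diff(dist.cdf(all_edges))

    # Exp(k) = n * p_k
    exp = len(sample) * p
    return obs, exp


# Synthetic data and hand-picked edges, assumed for illustration
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
obs, exp = binned_counts(sample, [-1.0, -0.5, 0.0, 0.5, 1.0], norm)

print("Observed:", obs)
print("Expected:", np.round(exp, 2))
```

With these edges every bin has an expected count well above 5, so the counts could be passed on to a $\chi^2$ test.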

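Let's make the above more concrete with a numerical example. We use Example 10.1 from [1] (page 307), in which a die is rolled 90 times and we test whether it is fair. Under the null hypothesis each of the six faces has an expected count of $90/6 = 15$.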

```python
from scipy.stats import chisquare

# The observed counts for faces 1, 2, 3, 4, 5, 6
f_obs = [20, 15, 12, 17, 9, 17]

# The expected counts under the null hypothesis
f_exp = [15, 15, 15, 15, 15, 15]

# Compute the chi^2 statistic and its p-value
chisquare(f_obs, f_exp)
```

```
Power_divergenceResult(statistic=5.2, pvalue=0.39196289159963393)
```
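
As a check on the reported statistic, the arithmetic can be written out by hand. This assumes the usual Pearson form of the $\chi^2$ statistic, i.e. the sum over bins of the squared deviation divided by the expected count:

$$\chi^2 = \sum_{k=1}^{N} \frac{\left(Obs(k) - Exp(k)\right)^2}{Exp(k)} = \frac{5^2 + 0^2 + (-3)^2 + 2^2 + (-6)^2 + 2^2}{15} = \frac{78}{15} = 5.2$$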

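The $p$-value, computed from a $\chi^2$ distribution with $N - 1 = 5$ degrees of freedom, is

$$p = P(\chi^2 \geq 5.2) \approx 0.392$$

Since this is well above common significance levels such as 0.05, we do not reject $H_0$: there is no significant evidence that the die is biased.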
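
The same conclusion can be reached by comparing the statistic with a critical value. The sketch below assumes a significance level of 0.05, which is a conventional choice and is not taken from [1].

```python
from scipy.stats import chi2

statistic = 5.2   # chi^2 statistic from the die example
df = 6 - 1        # number of bins minus one (no parameters estimated)
alpha = 0.05      # assumed significance level

# Upper-tail probability P(chi^2 >= statistic)
p_value = chi2.sf(statistic, df)

# Threshold above which H0 would be rejected at level alpha
critical_value = chi2.ppf(1 - alpha, df)
reject_h0 = statistic > critical_value

print(f"p-value        = {p_value:.4f}")
print(f"critical value = {critical_value:.3f}")
print(f"reject H0      = {reject_h0}")
```

The statistic lies below the critical value, so the two ways of stating the decision agree.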

## Summary

In this section we reviewed one of the main use cases of the $\chi^2$ test, namely testing whether the data follow an assumed distribution $F_0$.

## References

1. Michael Baron, _Probability and statistics for computer scientists_, 2nd Edition, CRC Press.