# Hypothesis testing


## Analysing differences in graduation probability among olympiad winners

In this part of the analysis we estimate the graduation success probability of two groups of students: olympiad prize holders and students who did not take part in any mathematics or informatics olympiads. We show that there are statistically significant differences by utilising binomial testing.

### Motivation:
In the Russian higher education system, mathematics and informatics olympiads play a crucial role. They are a primary method of seeking out, identifying and developing young talent. Olympiads give determined schoolchildren of any background and socio-economic status an equal opportunity to access the best educational facilities.
University admission offices, in turn, compete with each other for olympiad winners and therefore grant them benefits in the admission process.

Here we seek to justify this approach with data and also motivate further analysis of university-level performance of olympiad winners.


### Analysis:
We want to test the null hypothesis that olympiad winners have the same probability of successfully graduating from university as regularly admitted students.

To this end we conduct an observational study in which the dependent variable is successful graduation and the independent variable is whether the student holds a prize in a mathematics or informatics olympiad. Our sample consists of students of one Russian university who either graduated or dropped out. This sampling strategy can be considered unbiased if we want to draw conclusions about the population of university students or, more narrowly, about the students of this particular university, which is the kind of analysis that is useful to an admissions office.

Our dataset includes **#TODO** datapoints of bachelor students who either graduated or dropped out.
We divide this subset of data into two groups: students who were admitted to the study program because they were olympiad prize holders (**olympiad winners, #TODO students**) and those who were admitted according to their state exam results (**regularly admitted students, #TODO students**).


### Model:
We model each group of students as a series of independent Bernoulli experiments with unknown success probability $\pi$ that we would like to estimate. To do this we use Bayesian inference.

Let $X_o$ be the subset of $n_o$ datapoints of olympiad winners, $m_o$ of whom graduated successfully, and let $X_r$ be the subset of $n_r$ regularly admitted students, $m_r$ of whom graduated successfully.
We would like to estimate the probabilities of graduation in both groups $\pi_o$ and $\pi_r$.


To give a first estimate we use the sampling ("Monte-Carlo") estimator:

$
\hat{\pi_o} = \frac{1}{n_o}\sum_{x_i \in X_o}{x_i} = \frac{m_o}{n_o} = 
$

$
\hat{\pi_r} = \frac{1}{n_r}\sum_{x_i \in X_r}{x_i} = \frac{m_r}{n_r} =
$

This leads us to conjecture that **#TODO** are more likely to graduate successfully. We proceed with a more rigorous analysis of this claim.

In a series of Bernoulli experiments with parameter $\pi$ the likelihood of seeing $m$ successes out of $n$ trials is given by:\
$
p(m,n ; \pi) = {n\choose m}(\pi^m (1-\pi)^{n-m})
$
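As a minimal sketch, the likelihood above can be evaluated directly with the standard library. The counts `n` and `m` below are hypothetical placeholders chosen for illustration, not values from the dataset, and the grid search is only a numerical check that the sample proportion $m/n$ maximises the likelihood.

```python
from math import comb


def binomial_likelihood(m: int, n: int, pi: float) -> float:
    """Probability of m successes in n independent Bernoulli(pi) trials."""
    return comb(n, m) * pi**m * (1.0 - pi) ** (n - m)


# Hypothetical counts, for illustration only (NOT taken from the dataset).
n, m = 40, 31

# Evaluate the likelihood on a grid of candidate values of pi
# and pick the grid point with the highest likelihood.
grid = [k / 1000 for k in range(1001)]
best_pi = max(grid, key=lambda pi: binomial_likelihood(m, n, pi))

print(f"sample proportion m/n            = {m / n:.3f}")
print(f"grid maximiser of the likelihood = {best_pi:.3f}")
```

With these placeholder counts the grid maximiser coincides with $m/n$, which is why the sample proportion serves as a natural first estimate of $\pi$.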

We consider a uniform beta prior $\mathcal{B}(\pi ; 1,1)$, which we find to be a reasonable assumption, as we don't have any knowledge of graduation probabilities before seeing the data.
The posterior distribution over $\pi$ is then the Beta distribution $\mathcal{B}(\pi ; \alpha, \beta)$ with $\alpha = m+1$ and $\beta = n-m+1$.

For $\alpha > 1$ and $\beta > 1$ (that is, $0 < m < n$) the mode of the posterior is given by $\frac{\alpha-1}{\alpha+\beta-2} = \frac{m}{n}$, which is the maximum a posteriori (MAP) estimator of $\pi$.

So:\
$\pi_{o-MAP} = \hat{\pi_o} = $\
$\pi_{r-MAP} = \hat{\pi_r} = $
\
Under our model we can say that with 95% probability $\pi_o$ lies in the interval $[;]$ and $\pi_r$ lies in the interval $[;]$.
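The posterior summary can be sketched numerically as follows. This is an illustration under stated assumptions: the counts `m` and `n` are hypothetical placeholders, the interval is taken to be the equal-tailed one (the text does not say which 95% interval is meant), $0 < m < n$ is assumed, and the quantiles are found by simple trapezoidal integration of the Beta density rather than by a library routine.

```python
from math import exp, lgamma, log


def log_beta_fn(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x; assumes a > 1 and b > 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0  # the density vanishes at the boundaries when a, b > 1
    return exp((a - 1) * log(x) + (b - 1) * log(1 - x) - log_beta_fn(a, b))


def credible_interval(m: int, n: int, mass: float = 0.95, steps: int = 20_000):
    """Equal-tailed credible interval for pi under a uniform prior."""
    a, b = m + 1, n - m + 1  # posterior parameters
    lo_q, hi_q = (1 - mass) / 2, 1 - (1 - mass) / 2
    h = 1.0 / steps
    cdf, lower, upper = 0.0, None, None
    prev = beta_pdf(0.0, a, b)
    for k in range(1, steps + 1):
        x = k * h
        cur = beta_pdf(x, a, b)
        cdf += 0.5 * (prev + cur) * h  # trapezoidal rule
        prev = cur
        if lower is None and cdf >= lo_q:
            lower = x
        if upper is None and cdf >= hi_q:
            upper = x
    return lower, upper


# Hypothetical counts, for illustration only (NOT taken from the dataset).
m, n = 31, 40
lower, upper = credible_interval(m, n)
print(f"MAP estimate: {m / n:.3f}")
print(f"95% equal-tailed credible interval: [{lower:.3f}, {upper:.3f}]")
```

The same function can be called once per group, with $(m_o, n_o)$ and $(m_r, n_r)$, to fill in the two intervals.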

Now we are ready to test the hypothesis $\mathcal{H_0}$: Olympiad winners have the same probability of successful graduation as students admitted under the regular procedure.
Note that:

1) Under the null hypothesis $\mathcal{H_0}$ we have $\pi_r = \pi_o = \pi$, and the likelihood $p(m_o | \pi)$ of observing $m_o$ successfully graduated olympiad winners for a given $\pi$ is binomial.

2) The Beta posterior over $\pi$ obtained earlier from the regularly admitted students now acts as a prior, so $p(\pi | m_r, n_r) = \mathcal{B}(\pi;m_r+1, (n_r - m_r) + 1)$.

3) To find the probability of observing $m_o$ successfully graduated olympiad winners under the null hypothesis we marginalise over $\pi$: $p(m_o | \mathcal{H}_0) = \int p(m_o | \pi)\, p(\pi | m_r, n_r)\,d\pi$.

This results in a Beta-binomial distribution, where $\mathcal{B}(\cdot, \cdot)$ with two arguments denotes the Beta function rather than the Beta density:



\begin{equation}
    p(m_o | \mathcal{H}_0)
    = p(m_o | n_o, m_r, n_r) 
    = {n_o \choose m_o}
    \frac{\mathcal{B}(m_r + m_o + 1, (n_r - m_r) + (n_o - m_o) +1)}
    {\mathcal{B}(m_r + 1, (n_r - m_r) + 1)}
\end{equation}


We compute the probability of seeing $m_o$ or more successfully graduated olympiad winners under the null hypothesis and obtain a p-value of **#TODO**, which leads us to: **#TODO**
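The upper-tail probability under the Beta-binomial distribution can be sketched as below. All four counts are hypothetical placeholders, not values from the dataset, and the function names are introduced here for illustration only. The computation is done in log space with `math.lgamma` so that large binomial coefficients do not overflow.

```python
from math import exp, lgamma


def log_beta_fn(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_comb(n: int, k: int) -> float:
    """Logarithm of the binomial coefficient n choose k."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_pmf(k: int, n_o: int, m_r: int, n_r: int) -> float:
    """P(k graduates among n_o olympiad winners | H0), marginalised over pi
    with the Beta(m_r + 1, n_r - m_r + 1) distribution from the regular group."""
    a, b = m_r + 1, n_r - m_r + 1
    return exp(log_comb(n_o, k) + log_beta_fn(k + a, n_o - k + b) - log_beta_fn(a, b))


def upper_tail_p_value(m_o: int, n_o: int, m_r: int, n_r: int) -> float:
    """P(M >= m_o | H0): probability of m_o or more graduates."""
    return sum(beta_binomial_pmf(k, n_o, m_r, n_r) for k in range(m_o, n_o + 1))


# Hypothetical counts, for illustration only (NOT taken from the dataset).
n_o, m_o = 40, 36    # olympiad winners: total, graduated
n_r, m_r = 400, 300  # regularly admitted students: total, graduated

p_value = upper_tail_p_value(m_o, n_o, m_r, n_r)
print(f"one-sided p-value under H0: {p_value:.4g}")
```

If SciPy is available, `scipy.stats.betabinom` provides the same distribution and could replace the hand-written probability mass function.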

### Results:


---

In [3]:
import os
os.chdir('..')
os.getcwd()

%load_ext autoreload
%autoreload 2

import pandas as pd



In [4]:
df = pd.read_csv('data/preprocessed_data.csv')

