[<img src="logo.png">](https://www.thedataincubator.com/)

# TDI Challenge
> ### Luis Castro
> [luis.castro@mg.thedataincubator.com](mailto:luis.castro@mg.thedataincubator.com)

# Index<a id='Index'></a>
- [0.- Challenge Description](#Challenge)
- [1.- Bayesian Inference](#Inference)
    - [1.1- Bayes' Rule](#Bayes)
        - [1.1.1- Example Test](#Test)
        - [1.1.2- Example Test II](#Test2)
        - [1.1.3- Underlying assumptions](#Assumptions)
        - [1.1.4- Conjugate priors](#Priors)
            - [1.1.4.1- Example](#Example3)
            - [1.1.4.2- Frequentist approach](#Frequentist)

---

## Challenge Description<a id='Challenge'></a>

We want to see how well you're able to explain topics in statistics and data science. Write a short Jupyter notebook covering these topics:

- Hypothesis Testing: Let's talk about t-tests, p-values. How are they related? What is it telling you? How does it relate to precision-recall? What are the underlying assumptions?
- Bayesian posterior inference: Explain Bayes' Rule. Write some code to actually perform posterior sampling. Work out an example using conjugate priors. How does this compare with hypothesis testing? What are the underlying assumptions?

Be prepared to give a mock "lecture" about these two topics with your prepared Jupyter notebook. We'll be looking for:

- How well you present: remember that this material should be approachable, applied, and not just a series of formulas
- How well you understand these topics in depth (the mathematics, the underlying assumptions)
- How well you understand the concepts and how you can apply them

[Back to index](#Index)

---

## Bayesian Inference<a id='Inference'></a>

In the statistical inference notebook, we saw the school of statistical inference that dominated the 20th century, the so-called frequentist inference. It draws conclusions from sample data by emphasizing the frequency or proportion of the data, and it is the framework on which the well-established methodologies of statistical hypothesis testing and confidence intervals are based. [[1]](https://en.wikipedia.org/wiki/Frequentist_inference)

The alternative is Bayesian inference, which we discuss next.

The difference between Bayesian inference and frequentist inference lies in the goal.
- **Bayesian Goal:** Quantify and analyze subjective degrees of belief.
- **Frequentist Goal:** Create procedures that have frequency guarantees.

![](vs2.png)

### Bayes' Rule<a id='Bayes'></a>

Also called Bayes' theorem or Bayes' law, it describes the probability of an event based on prior knowledge of conditions that might be related to the event.

One of the many applications of Bayes' rule is Bayesian inference, a form of statistical inference.

The theorem is stated mathematically by the following equation:

![](10.png)

where A and B are events and P(B) ≠ 0.

- P(A) and P(B) are the probabilities of observing A and B without regard to each other.
- P(A | B), a conditional probability, is the probability of observing event A given that B is true.
- P(B | A) is the probability of observing event B given that A is true. [[2]](https://en.wikipedia.org/wiki/Bayes%27_theorem)
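
As a minimal sketch, the rule can be written as a one-line function. The name `bayes_rule` and its argument names are illustrative and not part of the original notebook, and the numbers in the usage line are arbitrary.

```python
def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b


# Arbitrary illustrative numbers: P(B|A) = 0.5, P(A) = 0.2, P(B) = 0.4
posterior = bayes_rule(0.5, 0.2, 0.4)
print(f"P(A|B) = {posterior:.2f}")
```

The guard on `p_b` mirrors the condition P(B) ≠ 0 stated with the theorem.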

### Example test<a id='Test'></a>

It is customary to start explaining Bayes' Rule with a cancer testing scenario. The data provided is the following:
- The proportion of the population of women affected by cancer is 1%.
- Mammograms correctly identify cancer 80% of the time when the subject has cancer.
- Mammograms detect cancer when it isn't there 9.6% of the time.

From the previous data we conclude:
- 99% of the women population aren't affected by cancer (100% - 1%)
- 20% of the time the mammograms fail to identify cancer when it is present (100% - 80%)
- 90.4% of the time the mammogram result is negative when there is no cancer present (100% - 9.6%)

With this information we can create a matrix for easier visualization:

|Test|Cancer = 0.01|No Cancer = 0.99|
|:---:|:-----------:|:--------------:|
|Pos |0.8|0.096|
|Neg |0.2|0.904|

How do we turn this into a Bayes' problem? We begin by identifying each of the terms. We wish to infer a posterior probability as a consequence of two antecedents, a prior probability and a 'likelihood'. More formally:
- **P(A):** the prior probability, i.e. the probability of A before taking the data B into account.
- **P(B):** the marginal likelihood, i.e. the overall probability of observing B.
- **P(B|A):** the likelihood, which is the probability of observing B given A.
- **P(A|B):** the posterior probability, which is the probability of A given B.

So how does that translate to our problem?
- **P(A):** the probability of having cancer (without taking anything else into account) = 1%
- **P(B):** the probability of a positive mammogram, whether or not cancer is present. It is calculated as follows:
    - (0.8)(0.01)+(0.096)(0.99) = 10.3%
- **P(B|A):** the probability of a positive test given that you have cancer = 80%.
- **P(A|B):** the probability of having cancer given a positive test; this is the result we are looking for. [[3]](https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/)

We substitute into the Bayes' rule equation:

In [9]:
'Chance that a positive test means cancer is present: {0:.3f}%'.format(100*(0.8*0.01)/(0.8*0.01+0.096*0.99))

'Chance that a positive test means cancer is present: 7.764%'

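The table then looks like this, where the first two columns hold the joint probabilities P(test result and condition) and the last column holds the posterior probability of cancer given the test result:

|Test|Cancer|No Cancer|P(Cancer given Test)|
|:---:|:-----------:|:--------------:|:---:|
|Pos |0.008|0.09504|0.0776|
|Neg |0.002|0.89496|0.0022|

Looking at this from a frequentist point of view, the first 2 columns are the probabilities of the results any given patient will have when arriving at the clinic.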
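
The table can be reproduced with a few lines of plain Python. This is a sketch using only the numbers given in the example; the variable names are illustrative.

```python
p_cancer = 0.01               # prior P(cancer)
p_pos_given_cancer = 0.8      # P(positive test | cancer)
p_pos_given_healthy = 0.096   # P(positive test | no cancer)

# Joint probabilities P(test result AND condition)
joint = {
    ("pos", "cancer"): p_pos_given_cancer * p_cancer,
    ("pos", "healthy"): p_pos_given_healthy * (1 - p_cancer),
    ("neg", "cancer"): (1 - p_pos_given_cancer) * p_cancer,
    ("neg", "healthy"): (1 - p_pos_given_healthy) * (1 - p_cancer),
}

# Posterior P(cancer | test result): divide each joint by its row total P(result)
posterior_cancer = {}
for result in ("pos", "neg"):
    row_total = joint[(result, "cancer")] + joint[(result, "healthy")]
    posterior_cancer[result] = joint[(result, "cancer")] / row_total

for result in ("pos", "neg"):
    print(f"P(cancer | {result}) = {posterior_cancer[result]:.4f}")
```

The four joint probabilities sum to 1, and each row total is the marginal probability of that test result.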

### Example test II<a id='Test2'></a>

A desk lamp was found to be defective (D). There are three factories (A, B, C) where such desk lamps are manufactured. The Quality Control Manager (QCM) is responsible for investigating the source of the defects found. This is what the QCM knows about the company's desk lamp production and the possible source of defects:

|Factory|Share of production|Fraction of defective lamps (D)|Notation|
|:---:|:-----------:|:--------------:|:---:|
|A|0.35|0.015|P(D given A)|
|B|0.35|0.010|P(D given B)|
|C|0.3|0.020|P(D given C)|

Given a defective lamp, what is the probability that it came from each of the factories?
- P(A|D) = ?
- P(B|D) = ?
- P(C|D) = ?

P(D) = Probability of a defective lamp = (0.35)(0.015)+(0.35)(0.01)+(0.3)(0.02) = 0.01475 

Evaluating them:

In [15]:
print('Probability of Factory A given Defect: {0:.2f}%'.format(100*0.015*0.35/0.01475))
print('Probability of Factory B given Defect: {0:.2f}%'.format(100*0.010*0.35/0.01475))
print('Probability of Factory C given Defect: {0:.2f}%'.format(100*0.020*0.3/0.01475))

Probability of Factory A given Defect: 35.59%
Probability of Factory B given Defect: 23.73%
Probability of Factory C given Defect: 40.68%


With this result, it seems most likely that the defective lamp comes from Factory C. It should also be noted that P(D) is only needed to normalize the results, not to identify the most likely factory (which is the one with the largest product of production share and defect rate). [[4]](https://onlinecourses.science.psu.edu/stat414/node/43)
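
The same calculation generalizes to any number of discrete hypotheses. The sketch below uses the figures from the table; the dictionary names are illustrative. It also checks that the most likely factory is the same before and after normalizing by P(D).

```python
production = {"A": 0.35, "B": 0.35, "C": 0.30}       # P(factory)
defect_rate = {"A": 0.015, "B": 0.010, "C": 0.020}   # P(D | factory)

# Numerator of Bayes' rule for each factory: P(D | factory) * P(factory)
unnormalized = {f: production[f] * defect_rate[f] for f in production}

# P(D) by the law of total probability, then normalize
p_defect = sum(unnormalized.values())
posterior = {f: u / p_defect for f, u in unnormalized.items()}

# The ranking does not depend on the normalizing constant
most_likely_raw = max(unnormalized, key=unnormalized.get)
most_likely_post = max(posterior, key=posterior.get)

for f in sorted(posterior):
    print(f"P({f} | D) = {100 * posterior[f]:.2f}%")
```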

### Underlying Bayesian critiques and assumptions<a id='Assumptions'></a>

- Prior knowledge related to the event: The main assumption of Bayesian inference concerns the correct choice of the prior. This is because a subjective prior is, well, subjective. There is no single method for choosing a prior, so different people will produce different priors and may therefore arrive at different posteriors and conclusions. [[5]](https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf)
- Assigning probabilities to hypotheses: critics object that hypotheses do not constitute outcomes of repeatable experiments in which one can measure long-term frequency. Rather, a hypothesis is either true or false, regardless of whether one knows which is the case. A coin is either fair or unfair; treatment 1 is either better or worse than treatment 2; the sun will or will not come up tomorrow.
- P(B) ≠ 0, since Bayes' rule divides by P(B)
- P(A) ≠ 0, as P(B|A) is defined as P(A∩B)/P(A); if P(A) = 0, P(B|A) is not defined
- When several features (pieces of evidence) are combined by multiplying their likelihoods, they are assumed to be conditionally independent.
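
To make the first point concrete, here is a small sketch of how the posterior in the mammogram example moves with the prior. The 1% prior comes from the example; the 0.1% and 10% priors are hypothetical alternatives chosen only for illustration, as is the function name.

```python
def posterior_given_positive(prior, sensitivity=0.8, false_positive_rate=0.096):
    """P(cancer | positive test) for a given prior P(cancer)."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# 0.01 is the prior used in the example; 0.001 and 0.1 are hypothetical
priors = [0.001, 0.01, 0.1]
posteriors = [posterior_given_positive(p) for p in priors]

for p, post in zip(priors, posteriors):
    print(f"prior = {p:.3f} -> posterior = {post:.4f}")
```

With the same test and the same positive result, two analysts who start from different priors reach noticeably different posteriors.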

### Conjugate Priors<a id='Priors'></a>

Recall Bayes' rule:

P(A|B) = P(B|A)P(A)/P(B)

Writing A as the parameter ϴ and B as the data, suppose the likelihood function P(data|ϴ) is Normal (with known variance):
- P(data|ϴ) ~ Normal

and we choose a prior that is also Normal:
- P(ϴ) ~ Normal

Then our posterior turns out to be Normal as well:
- P(ϴ|data) ~ Normal

More generally, if we choose the prior so that its functional form matches that of the likelihood (a prior conjugate to the likelihood), then the posterior belongs to the same family of distributions as the prior.
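
For the Normal case this can be sketched in a few lines, assuming the observation variance is known. The posterior precision is the prior precision plus n divided by the observation variance, and the posterior mean is the precision-weighted combination of the prior mean and the data. The function name, the prior and the data values below are made up for illustration.

```python
def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior (mean, variance) of theta for a Normal prior on theta and
    Normal observations with known variance noise_var."""
    n = len(data)
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var


data = [4.8, 5.1, 5.3, 4.9]  # hypothetical observations
post_mean, post_var = normal_posterior(prior_mean=0.0, prior_var=1.0,
                                       data=data, noise_var=1.0)
print(f"posterior mean = {post_mean:.3f}, posterior variance = {post_var:.3f}")
```

The posterior mean falls between the prior mean and the sample mean, and the posterior variance is smaller than the prior variance.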

Conjugate priors offer some advantages when working with the distributions, although the choice of a conjugate prior may sometimes be unrealistic:
- The posterior has the form of the prior
- There are already tables for conjugate priors, which makes the analysis straightforward
- The posterior can be generated exactly

As with the Normal distribution, the same holds for other distributions, as shown in the following table:

![](priortable.png)

Proof: Beta conjugate to Binomial

- Prior: Beta(a,b) = **Θ<sup>a-1</sup> (1-Θ)<sup>b-1</sup> B(a,b)<sup>-1</sup>** 
    - B(a,b) = normalizing constant

- Likelihood: Binomial ∝ **Θ<sup>z</sup> (1-Θ)<sup>n-z</sup>** for z successes in n trials (the binomial coefficient does not depend on Θ and is dropped)

- Bayes' Rule: P(Θ|data) = P(data|Θ) P(Θ) P(data)<sup>-1</sup>
    - P(data) = normalizing constant
    
Substituting:
- P(Θ|data) ∝ Likelihood x Prior ∝ Θ<sup>z</sup> (1-Θ)<sup>n-z</sup> Θ<sup>a-1</sup> (1-Θ)<sup>b-1</sup> = **Θ<sup>a+z-1</sup> (1-Θ)<sup>n+b-z-1</sup>**
- normalizing constant = B(a+z,n+b-z)

Then:
**P(Θ|data) = Θ<sup>a+z-1</sup> (1-Θ)<sup>n+b-z-1</sup> B(a+z,n+b-z)<sup>-1</sup>** which is a Beta(a+z,n+b-z) distribution. [[6]](https://www.youtube.com/watch?v=hKYvZF9wXkk)
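
The result can also be checked numerically. The sketch below multiplies prior and likelihood on a grid, normalizes the product with a simple midpoint rule, and compares it with the closed-form Beta(a+z, n+b-z) density. The prior parameters, the data and the helper `beta_pdf` are hypothetical choices for illustration.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, using log-gamma for the constant B(a, b)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta) - log_norm)


a, b = 2.0, 3.0   # hypothetical prior parameters
z, n = 7, 10      # hypothetical data: z successes in n trials

# Midpoints of m equal sub-intervals of (0, 1)
m = 2000
grid = [(i + 0.5) / m for i in range(m)]

# Prior x likelihood on the grid, normalized numerically
unnorm = [beta_pdf(t, a, b) * t**z * (1 - t)**(n - z) for t in grid]
evidence = sum(unnorm) / m
grid_posterior = [u / evidence for u in unnorm]

# Closed-form conjugate posterior
closed_form = [beta_pdf(t, a + z, b + n - z) for t in grid]

max_abs_diff = max(abs(g - c) for g, c in zip(grid_posterior, closed_form))
print(f"largest difference between the two densities: {max_abs_diff:.2e}")
```

The two densities agree up to numerical integration error, which is what the derivation predicts.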

### Example <a id='Example3'></a>

Let's say we want to find out whether a coin is biased, using a Bayesian approach. For that purpose we run a trial and flip the coin 10 times.

The results are as follows: [HHHHTTHHHH]
- H = 8
- T = 2

The number of tails in n independent flips follows a binomial distribution, so we can choose a Beta distribution as a conjugate prior for θ, taken here as the probability of tails.

Next we select the parameters of the Beta prior. Since we expect a balanced (unbiased) coin, say we choose a0 = 3 and b0 = 3, a prior that is symmetric about θ = 0.5 and places most of its mass in the interval .25 ≤ θ ≤ .75.

With z = 2 tails observed in n = 10 flips, our posterior distribution is a Beta distribution with the following parameters:
- a = a0+z     = 3 + 2      = 5
- b = b0+(n-z) = 3 + (10-2) = 11

In [43]:
from bokeh.io            import push_notebook, show, output_notebook
from bokeh.layouts       import gridplot, row
from bokeh.plotting      import figure
from bokeh.palettes      import Viridis3
import matplotlib.pyplot as plt
%matplotlib inline
output_notebook()

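In [61]:
import numpy as np
from scipy.stats import beta

a0,b0  = 3,3            # Beta prior parameters
z,n    = 2,10           # z = 2 tails in n = 10 flips (θ is the probability of tails)
a,b    = a0+z,b0+(n-z)  # conjugate update: posterior is Beta(5,11)

x      = np.linspace(beta.ppf(0.01,a0,b0),beta.ppf(0.99,a0,b0),100)
prior  = beta.pdf(x,a0,b0)
post   = beta.pdf(x,a,b)
int95  = np.linspace(beta.ppf(.025,a,b),beta.ppf(.975,a,b))  # central 95% posterior interval

# height= and legend_label= replace plot_height= and legend=, which recent Bokeh releases no longer accept
p = figure(width=600,height=480,title="Bayes' Conjugate Prior")
p.line(x,prior,color=Viridis3[0],legend_label='Beta ('+str(a0)+','+str(b0)+')')
p.line(x,post, color=Viridis3[1],legend_label='Beta ('+str(a)+','+str(b)+')')
p.line(int95,[0.25]*len(int95), color=Viridis3[2],legend_label='95% probability interval')

p.xaxis.axis_label = 'Probability'
show(p)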

![](bokeh_plot0.png)

In [68]:
'Starting from the prior = Beta(3,3), the posterior probability that the coin is biased towards Heads is {0:.2f}%.'.format(100*beta.cdf(0.5,5,11))

'Starting from the prior = Beta(3,3), the posterior probability that the coin is biased towards Heads is 94.08%.'

How does the choice of prior affect our result?

Suppose we instead select a flat Beta distribution with a=1 and b=1 (an uninformative prior): [[7]](http://slideplayer.com/slide/4766698/)

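In [69]:
a0,b0  = 1,1            # flat (uniform) Beta prior
z,n    = 2,10           # z = 2 tails in n = 10 flips (θ is the probability of tails)
a,b    = a0+z,b0+(n-z)  # conjugate update: posterior is Beta(3,9)

x      = np.linspace(beta.ppf(0.01,a0,b0),beta.ppf(0.99,a0,b0),100)
prior  = beta.pdf(x,a0,b0)
post   = beta.pdf(x,a,b)
int95  = np.linspace(beta.ppf(.025,a,b),beta.ppf(.975,a,b))  # central 95% posterior interval

# height= and legend_label= replace plot_height= and legend=, which recent Bokeh releases no longer accept
p = figure(width=600,height=480,title="Bayes' Conjugate Prior")
p.line(x,prior,color=Viridis3[0],legend_label='Beta ('+str(a0)+','+str(b0)+')')
p.line(x,post, color=Viridis3[1],legend_label='Beta ('+str(a)+','+str(b)+')')
p.line(int95,[0.25]*len(int95), color=Viridis3[2],legend_label='95% probability interval')

p.xaxis.axis_label = 'Probability'
show(p)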

![](bokeh_plot1.png)

In [101]:
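'Starting from the prior = Beta(1,1), the posterior probability that the coin is biased towards Heads is {0:.2f}%.'.format(100*beta.cdf(0.5,3,9))

'Starting from the prior = Beta(1,1), the posterior probability that the coin is biased towards Heads is 96.73%.'

The flat prior gives 96.73%, against 94.08% with Beta(3,3): the Beta(3,3) prior pulls the posterior towards a fair coin, while with the flat prior the data alone drive the result.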
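
The probabilities above come from the closed-form Beta CDF. As a sketch of posterior sampling, the same quantity can be estimated by drawing samples from the Beta(5,11) posterior of the first example and counting how many fall below 0.5. This uses `random.betavariate` from the Python standard library; the seed and the sample size are arbitrary choices.

```python
import random

random.seed(0)            # fixed seed so the run is reproducible
a_post, b_post = 5, 11    # posterior obtained from the Beta(3,3) prior
n_samples = 100_000

# Draw from the posterior and sort so that quantiles can be read off directly
samples = sorted(random.betavariate(a_post, b_post) for _ in range(n_samples))

# Monte Carlo estimate of P(theta < 0.5), the quantity beta.cdf(0.5, 5, 11) gives exactly
p_below_half = sum(s < 0.5 for s in samples) / n_samples

# Central 95% interval taken from the empirical quantiles
lower = samples[int(0.025 * n_samples)]
upper = samples[int(0.975 * n_samples)]

print(f"P(theta < 0.5) is approximately {p_below_half:.4f}")
print(f"95% interval is approximately ({lower:.3f}, {upper:.3f})")
```

Sampling is unnecessary when a conjugate prior gives the posterior in closed form, but the same counting approach carries over to posteriors that can only be sampled.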

### Frequentist approach <a id='Frequentist'></a>

Let θ be the probability of heads. We have the null and alternative hypotheses:
- H<sub>0</sub>: θ=.5
- H<sub>A</sub>: θ>.5

We use the same results as before, [HHHHTTHHHH]:
- H = 8
- T = 2

The null distribution is binomial(10,0.5), so the one-sided p-value is the probability of 8 to 10 heads in 10 tosses or, equivalently, of 0 to 2 tails.

In [100]:
from scipy.stats import binom
'The p-value is {0:.3f}'.format(binom.cdf(2,10,0.5)) #Cumulative distribution function: P(tails <= 2), i.e. 0, 1 or 2 tails

'The p-value is 0.055'

At a significance level of 0.05, the p-value of 0.055 is above the threshold, so we fail to reject the null hypothesis that the coin is unbiased: there is not enough statistical evidence of a bias towards heads. [[8]](http://www.r-tutor.com/elementary-statistics/probability-distributions/binomial-distribution) [[9]](https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf)
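
The p-value can also be computed from first principles. The sketch below sums the binomial probabilities of 8, 9 and 10 heads under the null hypothesis using only `math.comb`; the variable names are illustrative.

```python
from math import comb

n, heads = 10, 8
alpha = 0.05

# Under H0 (theta = 0.5) every sequence of n flips has probability 1 / 2**n,
# so P(at least 8 heads) is the number of such sequences divided by 2**n
p_value = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0 at alpha = {alpha}: {reject_null}")
```

The sum counts 45 + 10 + 1 = 56 sequences out of 1024, which matches the value obtained from `binom.cdf`.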

![](vs1.png)

[Back to index](#Index)

---