
# Lambda School Data Science Module 123

## Introduction to Bayesian Inference




## Assignment - Code it up!

We used pure math to apply Bayes' theorem to drug tests. Now write Python code to reproduce the results! This is purposefully open-ended: you'll have to think about how you should represent probabilities and events. You can and should look things up.

Specific goals/targets:

### 1) Write a function 

`def prob_drunk_given_positive(prob_drunk_prior, false_positive_rate):` 

You should only truly need these two values in order to apply Bayes' theorem. In this example, imagine that individuals are taking a breathalyzer test with an 8% false positive rate and a 100% true positive rate, and that our prior belief about drunk driving in the population is 1/1000.
 - What is the probability that a person is drunk after one positive breathalyzer test?
 - What is the probability that a person is drunk after two positive breathalyzer tests?
 - How many positive breathalyzer tests are needed in order to have a probability that's greater than 95% that a person is drunk beyond the legal limit?

### 2) Explore `scipy.stats.bayes_mvs`  
Read its documentation, and experiment with it on data you've tested in other ways earlier this week.
 - Create a visualization comparing the results of a Bayesian approach to a traditional/frequentist approach. With a large sample size they should look close to identical; however, take this opportunity to practice visualizing confidence intervals in general. The following are some potential ways that you could visualize confidence intervals on your graph:
  - [Matplotlib Error Bars](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.errorbar.html)
  - [Seaborn barplot with error bars](https://seaborn.pydata.org/generated/seaborn.barplot.html)
  - [Vertical lines to show bounds of confidence interval](https://www.simplypsychology.org/confidence-interval.jpg)
  - [Confidence Intervals on Box Plots](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.axes.Axes.boxplot.html)

### 3) In your own words, summarize the difference between Bayesian and Frequentist statistics

If you're unsure where to start, check out [this blog post on Bayes' theorem with Python](https://dataconomy.com/2015/02/introduction-to-bayes-theorem-with-python/).



1) Write a function

Let's consider the following two events:

A: the person has the disease (in this assignment, the person is drunk)

B: the test is positive


The **prior $P(A)$** is the probability of having the disease.

The **false positive rate $P(B|non A)$** is the probability of testing positive while the person does not have the disease.

Let's write the Bayes equation: $$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

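This notebook takes the true positive rate to be **$P(B|A)$ = 1 - $P(B|non A)$ = 1 - FALSE POSITIVE**. This is a modeling assumption rather than an identity: in general $P(B|A)$ and $P(B|non A)$ are separate quantities, and the assignment itself states a 100% true positive rate, which would give $P(B|A) = 1$.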

Besides, by the law of total probability, $P(B)$ = $P(B|A)$ **x PRIOR** **+ (FALSE POSITIVE x** $P(nonA)$**)**.

Indeed, the probability that the test comes back positive is the sum of the probability that the person is ill and tests positive ($P(B|A)$ x $P(A)$) and the probability that the person is not ill and tests positive ($P(B|non A)$ x $P(nonA)$).

So **$P(B)$ = (1 - FALSE POSITIVE) X PRIOR + FALSE POSITIVE X (1 - PRIOR)**
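
As a quick check of the formula, here is the substitution for a single positive test with the assignment's numbers (PRIOR = 0.001, FALSE POSITIVE = 0.08), using this notebook's value $P(B|A) = 1 - 0.08 = 0.92$:

$$P(A|B) = \frac{0.92 \times 0.001}{0.92 \times 0.001 + 0.08 \times 0.999} = \frac{0.00092}{0.08084} \approx 0.0114$$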

In [0]:
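def prob_drunk_given_positive(prob_drunk_prior, false_positive_rate):
    # Bayes' rule, taking the true positive rate P(B|A) as 1 - false_positive_rate.
    drunk_and_positive = (1 - false_positive_rate) * prob_drunk_prior
    sober_and_positive = false_positive_rate * (1 - prob_drunk_prior)
    return drunk_and_positive / (drunk_and_positive + sober_and_positive)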

This is the probability that a person is drunk after one positive breathalyzer test:

In [17]:
prob_drunk_given_positive(1/1000, 8/100)

0.011380504700643244

The probability that a person is drunk after two positive breathalyzer tests is calculated the same way, but the **prior** is replaced by the **posterior** probability that the person is drunk given one positive test.

So in the equation, let's replace $P(A)$ with $P(A|B)$.
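
Written out, and assuming the two tests are conditionally independent given the person's true state (an assumption added here; $B_1$ and $B_2$ are labels introduced for the first and second positive results), the second update is:

$$P(A|B_1, B_2) = \frac{P(B_2|A)\,P(A|B_1)}{P(B_2|A)\,P(A|B_1) + P(B_2|non A)\,(1 - P(A|B_1))}$$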

In [0]:
def prob_drunk_given_2_positive(prob_drunk_prior, false_positive_rate):
    # The posterior after the first positive test becomes the prior for the second.
    posterior_1 = prob_drunk_given_positive(prob_drunk_prior, false_positive_rate)
    return prob_drunk_given_positive(posterior_1, false_positive_rate)

In [18]:
prob_drunk_given_2_positive(1/1000, 8/100)  # two tests

0.11690607734806628

In [21]:
prob_drunk_given_positive(0.011380504700643244, 8/100)  # two tests

0.11690607734806628

In [22]:
prob_drunk_given_positive(0.11690607734806628, 8/100)  # three tests

0.6035517634803313

In [23]:
prob_drunk_given_positive(0.6035517634803313, 8/100)  # four tests

0.9459680554381814

In [24]:
prob_drunk_given_positive(0.9459680554381814, 8/100)  # five tests

0.9950577515521439

So there is about an 11.7% probability that a person is drunk after two positive tests.

Four positive tests give about 94.6%, just short of the target, so we would need five positive tests (about 99.5%) to get a probability greater than 95% that a person is drunk.

## Resources

- [Worked example of Bayes rule calculation](https://en.wikipedia.org/wiki/Bayes'_theorem#Examples) (helpful as it fully breaks out the denominator)
- [Source code for mvsdist in scipy](https://github.com/scipy/scipy/blob/90534919e139d2a81c24bf08341734ff41a3db12/scipy/stats/morestats.py#L139)

## Stretch Goals:

- Go back and study the content from Modules 1 & 2 to make sure that you're really comfortable with them.
- Apply a Bayesian technique to a problem you previously worked on (in an assignment or project) from a frequentist (standard) perspective
- Check out [PyMC3](https://docs.pymc.io/) (note this goes beyond hypothesis tests into modeling) - read the guides and work through some examples
- Take PyMC3 further - see if you can build something with it!

In [0]:
from scipy import stats
import numpy as np


?np.random.seed
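
For goal 2, one possible starting point is to compute a Bayesian credible interval for the mean with `scipy.stats.bayes_mvs` and compare it with a frequentist t-based confidence interval. This is a minimal sketch rather than the assignment's intended solution: the synthetic normal sample, the seed, and the variable names are assumptions made for illustration, and real data from earlier in the week should be substituted.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (an assumption; replace with data analyzed earlier).
rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=50)

confidence = 0.95

# Bayesian: bayes_mvs returns results for the mean, variance, and standard
# deviation; each carries a point estimate and an interval (minmax).
bayes_mean, _, _ = stats.bayes_mvs(sample, alpha=confidence)
bayes_low, bayes_high = bayes_mean.minmax

# Frequentist: t-based confidence interval for the mean.
freq_low, freq_high = stats.t.interval(
    confidence,
    df=len(sample) - 1,
    loc=sample.mean(),
    scale=stats.sem(sample),
)

print(f"Bayesian credible interval:      ({bayes_low:.4f}, {bayes_high:.4f})")
print(f"Frequentist confidence interval: ({freq_low:.4f}, {freq_high:.4f})")
```

For a sample of this size the two intervals should agree numerically; the difference lies in how each is interpreted. The interval bounds can then be passed to any of the plotting options listed in the assignment.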