In [7]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

### True Positives and False Negatives

You have a test for a disease. 

*Tests correctly working*
- Disease is present and the test indicates the disease is present **True Positive**
- Disease is absent and the test indicates the disease is absent **True Negative**

*Tests failing*
- Disease is present and the test indicates the disease is absent **False Negative**
- Disease is absent and the test indicates the disease is present **False Positive**

||Predicted *Negative*|Predicted *Positive*|
|---|---|---|
|Actual *Negative*|True Negative|False Positive|
|Actual *Positive*|False Negative|True Positive|
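
The four outcomes in the table can be tallied mechanically. The sketch below is only an illustration: the function name `confusion_counts` and the sample data are made up for this note, with `True` standing for "disease present" (actual) or "test says present" (predicted).

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN from two equal-length sequences of booleans."""
    labels = {
        (True, True): "TP",    # disease present, test positive
        (False, False): "TN",  # disease absent, test negative
        (False, True): "FP",   # disease absent, test positive
        (True, False): "FN",   # disease present, test negative
    }
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for a, p in zip(actual, predicted):
        counts[labels[(bool(a), bool(p))]] += 1
    return dict(counts)

# Hypothetical results for six patients.
actual = [True, True, False, False, False, True]
predicted = [True, False, False, True, False, True]
counts = confusion_counts(actual, predicted)
print(counts)
```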


<img src="true_positive_false_negative.png" width="300" height="200">

## Bayes Theorem 

**Bayes' Theorem** describes the probability of an event, based on conditions that might be related to the event. Bayes' Theorem allows us to use previously known information to assess the likelihood of another related event.

Bayes’s theorem is named after Reverend Thomas Bayes (1701?–1761 - an English statistician, philosopher and Presbyterian minister), who first used conditional probability to provide an algorithm (his Proposition 9) that uses evidence to calculate limits on an unknown parameter, published as An Essay towards solving a Problem in the Doctrine of Chances (1763).


In *Conditional Probability* we see that $P(A|B) = \frac{P(A \cap B)}{P(B)}$, provided that $P(B) > 0$.

So $P(B|A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)}$, provided that $P(A) > 0$.

These two conditional probabilities share the same joint probability $P(A \cap B)$: the first gives $P(A \cap B) = P(A|B)P(B)$ and the second gives $P(A \cap B) = P(B|A)P(A)$.

Setting those two expressions equal and dividing by $P(B)$ gives Bayes' Theorem: $ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $

- The probability of A, given that B has occurred, is equal to the probability of B, given that A has occurred, multiplied by the probability of A, divided by the probability of B, provided that $P(B) > 0$.
- This joins together both "B if A has occurred" and "A if B has occurred"
- Bayes' Theorem is used to determine the probability of a *Parameter* given a certain event.

Usually Bayes Theorem is displayed in one of two ways:

$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $ or $ P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\text{not } A)P(\text{not } A)}$, given that $P(B) > 0$
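
The second form is the first with $P(B)$ expanded by the law of total probability. A short derivation, writing $\neg A$ for "not A" and using the fact that B happens either together with A or together with not A:

$$
\begin{aligned}
P(B) &= P(B \cap A) + P(B \cap \neg A) \\
     &= P(B|A)P(A) + P(B|\neg A)P(\neg A)
\end{aligned}
$$

Substituting this into the denominator of $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$ gives the expanded form.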


 

**Exercise 1:**  
- A manufacturing company finds that 1 out of every 500 products is defective (.002, or 0.2%).
- The company buys a diagnostic tool that correctly identifies a defective part 99% of the time (not 100%).
- If a part is diagnosed as defective, what's the probability that it's really defective (a true positive rather than a false positive)?



The Questions:

If a part is diagnosed as defective what is the probability that it really is defective?

- $ P(A)$ The probability of being defective
- $ P(B)$ The probability of testing as defective
- $ P(A|B)$ The probability of being defective if the tool indicated a defect (the quantity we want)
- $ P(B|A)$ The probability of the tool indicating a defect if the product is defective (the true positive rate)

Writing $-A$ for "not defective" and $-B$ for "not testing as defective":

- True Positive rate - $P(B|A)$
- True Negative rate - $P(-B|-A)$
- False Positive rate - $P(B|-A)$
- False Negative rate - $P(-B|A)$

- $ P(A|B)$ - ? we're solving for this
- $ P(B|A)$ The detection rate of the diagnostic tool (.99) or 99%
- $ P(A)$  The defect rate (.002) (1 / 500) or 0.2%
- $ P(B)$ - ? we need to calculate this

 

1. **First calculate $P(B)$**
- P(B) - Probability of testing positive = P(true positive) + P(false positive)

Calculate this with:  
**P(B) = $P(B|A) * P(A) + P(B|- A) * P(-A)$**
- The probability of a true positive is the probability of testing defective given that the part is actually defective ($P(B|A)$)
- Multiplied by the probability of an actual defect ($P(A)$)
- Add the probability that it tested defective but was not defective, a false positive ($P(B|- A)$)
- Multiplied by the probability that it was not defective ($P(-A)$)


- $ P(B|A)$ The detection rate of the diagnostic tool (.99) or 99%
- $ P(A)$  The defect rate (.002) (1 / 500) or 0.2%

Use the Complement Rule (the Probability of "Not A" is "1 minus the probability of A occurring")

*Calculate $P(B|-A)$:*
- The exercise gives only one accuracy figure, so assume the tool is also right 99% of the time on good parts, $P(-B|-A) = .99$. Then by the Complement Rule $P(B|-A) = 1 - P(-B|-A) = 1 - .99 =$ **.01**

*Calculate $P(-A)$*:
- $P(-A) = 1 - P(A) = 1 - .002 =$ **.998** - The probability that a part is not defective.

*Calculating $P(B)$*:
- $P(B) = P(B|A) * P(A) + P(B|-A) * P(-A) = .99 * .002 + .01 * .998 =$ **.01196**

- $ P(A|B)$ = ??
- $ P(B|A)$ = .99
- $ P(A)$ = .002
- $ P(B)$ = .01196
- $ P(-A)$ = .998
- $ P(B|-A)$ = .01 

*So, plugging in the numbers to calculate $P(A|B)$*:

numerator:  
In [3]: .99*.002  
Out[3]: 0.00198  
denominator:  
In [2]: .99 * .002 + .01 * .998  
Out[2]: 0.011960000000000002

The expanded formula:
- $P(A|B) = \frac{P(B|A) * P(A)} {P(B|A) * P(A) + P(B|-A) * P(-A)}$

with the real numbers:

- $P(A|B) = \frac{.99 * .002} {.99 * .002 + .01 * .998} = \frac{0.00198}{0.01196} = .1655\ldots$, which these notes truncate to .165 or **16.5%**

In [118]: 0.00198 / 0.011960000000000002  
Out[118]: 0.16555183946488292


So a part that tests positive has only about a **16.5% chance of actually being defective**
- $ P(A|B)$ = .165 or **16.5%**, the probability of a defect given a positive test (the positive predictive value)
- $ P(B|A)$ = .99 Probability of testing defective given an actual defect (true positive rate)
- $ P(A)$ = .002  Probability of Being Defective
- $ P(B)$ = .01196 Probability of Testing Defective
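
The arithmetic for Exercise 1 can be checked with a few lines of Python. This is a minimal sketch: the function name `posterior` and the variable names are illustrative, and the false positive rate of .01 is the exercise's own assumption that the tool is equally accurate on good parts.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """P(A|B) from the expanded form of Bayes' Theorem."""
    numerator = p_pos_given_true * prior                     # P(B|A) * P(A)
    evidence = numerator + p_pos_given_false * (1 - prior)   # P(B)
    return numerator / evidence

p_defective = 1 / 500    # P(A)
p_detect = 0.99          # P(B|A)
p_false_alarm = 0.01     # P(B|-A), assumed to be 1 - .99 as in the exercise

p_b = p_detect * p_defective + p_false_alarm * (1 - p_defective)
p_a_given_b = posterior(p_defective, p_detect, p_false_alarm)
print(round(p_b, 5), round(p_a_given_b, 4))
```

Because the defect is rare, false positives from the 99.8% of good parts outnumber true positives from the 0.2% of bad parts, which is why the result is so far below 99%.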


**Exercise 2:** What if a second test on the same part also returns positive (shows a defect)?

- Fill in the details from the first run through.
- The posterior from the first test (.165) becomes the new prior $P(A)$, so the probability goes up. This assumes the two test results are independent given the part's true state.

- $ P(A|B)$ = ??
- $ P(B|A)$ = .99
- $ P(A)$ = Changes from .002 to .165
- $ P(B)$ = Changes from .01196 to .1717 (recalculated with the new prior)
- $ P(-A)$ = Changes from .998 to .835
- $ P(B|-A)$ = .01 

So:
- $P(A|B) = \frac{P(B|A) * P(A)} {P(B|A) * P(A) + P(B|-A) * P(-A)}$

with the real numbers:

numerator:  
In [4]: .99*.165  
Out[4]: 0.16335

denominator:  
In [6]: .99 * .165 + .01 * .835  
Out[6]: 0.1717

total:  
In [7]: 0.16335 / 0.1717  
Out[7]: 0.9513686662783926

- $\require{enclose} P(A|B) = \frac{.99 *  \enclose{horizontalstrike}{.002} .165} {.99 * \enclose{horizontalstrike}{.002} .165 + .01 * \enclose{horizontalstrike}{.998} .835} = \frac{0.16335}{0.1717} = 0.9513$ or **95%**

So after a second positive test the probability is much higher and approaches certainty
- $P(A|B) = \frac{.99 * .165} {.99 * .165 + .01 * .835} = .951$ or **95.1% probability that the part is defective**



**Exercise 3: Try it a third time**
- $ P(A|B)$ = ??
- $ P(B|A)$ = .99
- $ P(A)$ = Changes from .165 to .951
- $ P(B)$ = Changes from .1717 to .94198
- $ P(-A)$ = Changes from .835 to .049
- $ P(B|-A)$ = .01 

numerator: 
In [8]: .99*.951
Out[8]: 0.9414899999999999

$ P(-A)$  
In [9]: 1 - .951
Out[9]: 0.049000000000000044

denominator:
In [10]: .99 * .951 + .01 * .049
Out[10]: 0.9419799999999999

Outcome:
In [11]: 0.9414899999999999 / 0.9419799999999999
Out[11]: 0.9994798191044396

- $ P(A|B)$ = **99.95%** chance that the part is actually defective
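
The three rounds above are the same update applied repeatedly, with each posterior fed back in as the next prior. A sketch of that loop follows; the function name `update` is illustrative, and the code assumes repeated test results are independent given the part's true state. It keeps full precision between rounds, so the second figure differs slightly from the hand calculation, which started from the truncated .165.

```python
def update(prior, p_pos_given_true=0.99, p_pos_given_false=0.01):
    """One Bayesian update after a positive test result."""
    numerator = p_pos_given_true * prior
    return numerator / (numerator + p_pos_given_false * (1 - prior))

belief = 1 / 500   # prior before any test
history = []
for test_number in range(1, 4):
    belief = update(belief)   # the posterior becomes the next prior
    history.append(belief)
    print(f"after positive test {test_number}: {belief:.4f}")
```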



## Other Bayesian notes:

- base rate neglect
- remember your priors when analyzing

**base rate fallacy**
The base rate fallacy, also called base rate neglect or base rate bias, is an error in probabilistic reasoning. If presented with related base rate information (i.e. generic, general information) and specific information (information pertaining only to a certain case), the mind tends to ignore the former and focus on the latter. Base rate neglect is a specific form of the more general extension neglect. 

Base-rate neglect refers to the phenomenon whereby people ignore or undervalue the base rate (the prior probability), typically in favor of less informative, but more intuitively appealing information about an individual case (Kahneman & Tversky, 1973).

also known as: neglecting base rates, base rate neglect; closely related: prosecutor's fallacy

**prior probability distribution**
In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable. 

Bayes' theorem calculates the renormalized pointwise product of the prior and the likelihood function, to produce the posterior probability distribution, which is the conditional distribution of the uncertain quantity given the data.

Similarly, the prior probability of a random event or an uncertain proposition is the unconditional probability that is assigned before any relevant evidence is taken into account. 



### Bayesian Odds

Bayes' Rule can be expressed in terms of odds: 
Posterior Odds = Prior Odds * Likelihood Ratio


Odds, a:b, and probability

Odds are commonly written as the ratio of two numbers separated by a colon. For example, if P(A) = 2/3, the odds would be 2, but this would most likely be written as 2:1.

The relation between odds, a:b, and probability, p is as follows: 

$ a:b=p:(1-p)$

$p=\frac{a}{a+b} $

Suppose you have a box that has a 5% chance of containing a diamond. You also have a diamond detector that beeps half of the time if there is a diamond, and one fourth of the time if there is not. You wave the diamond detector over the box and it beeps.

- The prior odds of the box containing a diamond are 1:19. (5% = 1 chance in 20 that it contains a diamond, against 19 chances in 20 that it's empty)
- The likelihood ratio of a beep is 1/2:1/4 = 2:1. 
- The posterior odds are 1:19 * 2:1 = 2:19. 
- This corresponds to about a probability of 2/21, which is about 0.095 or 9.5%. 

||Contains Diamond|Doesn't contain diamond|
|---|---|---|
|Prior Odds Ratio|1 |19 | 
|Likelihood Ratio|2|1|
||__|__|
|Posterior Odds Ratio|2|19|
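
The diamond example can be reproduced exactly with Python's `fractions` module. This is a sketch; the helper names `posterior_odds` and `odds_to_probability` are made up for this note. It applies Posterior Odds = Prior Odds * Likelihood Ratio and then converts odds a:b to a probability with $p=\frac{a}{a+b}$.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' Rule."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds a:b (stored as the fraction a/b) to a/(a+b)."""
    return odds / (1 + odds)

prior_odds = Fraction(5, 95)                          # 5% vs 95%, i.e. 1:19
likelihood_ratio = Fraction(1, 2) / Fraction(1, 4)    # beep rates, i.e. 2:1
odds = posterior_odds(prior_odds, likelihood_ratio)
probability = odds_to_probability(odds)
print(odds, probability, round(float(probability), 3))
```

The posterior odds come out as 2:19 and the probability as 2/21, so a beep roughly doubles a small prior without making a diamond likely.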


Bayesian Proportionality

 Bayesian Thinking allows us to keep account of priors and likelihood information to predict a posterior probability.

Imagine a management consultation firm hires only two types of employees: IT and business consultants. You meet an employee who is very shy, but don't know their job role. 

If your guess is IT using only shyness as an attribute, then you have fallen for an inherent cognitive bias: Base Rate Neglect. Base Rate Neglect occurs when we do not take into account the underlying proportion of a group in the population. To answer the question we need to find out the proportion of IT consultants to business consultants. In this case, for every 1 IT person the firm hires 10 business consultants, a ratio of 1 to 10.

Another assumption could be made about shyness as an attribute. It would be fair to assume shyness is more common among IT consultants than among business consultants (geeks vs people persons). Let’s assume 75% of IT professionals are in fact shy, compared with about 15% of business consultants.

Use the proportion of employees in the firm as the prior odds. Then use the shyness as an attribute as the Likelihood. 

The figure below demonstrates when we take a product of the two, we get posterior odds.

<img src="bayesian_proportions.png" width="400" height="200">

 
 
||IT Consultant|Business Consultant|
|---|---|---|
|Prior Odds Ratio|1 |10 | 
|Likelihood Ratio|75|15|
||__|__|
|Posterior Odds Ratio|1|2|
 
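The posterior odds are 1 to 2 for IT, or 2 to 1 in favor of the shy employee being a business consultant (a probability of 1/3 for IT).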
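
The same result can be reached by multiplying each group's prior weight by its likelihood and normalizing, which is the "proportionality" view of Bayes' Rule. A minimal sketch using the numbers assumed in the text (the dictionary names are illustrative):

```python
priors = {"IT": 1, "Business": 10}                    # relative head counts
likelihood_shy = {"IT": 0.75, "Business": 0.15}       # P(shy | role), as assumed above

# Posterior is proportional to prior * likelihood; normalize so it sums to 1.
unnormalized = {role: priors[role] * likelihood_shy[role] for role in priors}
total = sum(unnormalized.values())
posterior = {role: weight / total for role, weight in unnormalized.items()}
print(posterior)
```

Even though a shy person is five times more typical of IT, the ten-to-one head count keeps "business consultant" the better guess.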
 

#### Principles of Bayesian Thinking
**Rule 1 – Remember your priors!**

As we saw earlier, it is easy to fall into the base rate neglect trap. The underlying proportion in the population is often neglected, and we as human beings tend to focus on just the attribute. Think of priors as the underlying or background knowledge, a bit of information in addition to the likelihood. The product of the priors and the likelihood determines the posterior odds/probability.

**Rule 2 – Question your existing belief**

This is somewhat tricky and counter-intuitive to grasp, but question your priors. Ask yourself: what if your priors were irrelevant or even wrong? How would that affect your posterior probability? Would the new posterior probability be any different from the existing one?

**Rule 3 – Update incrementally**

We live in a dynamic world where evidence and attributes are constantly shifting. It is okay to believe in well-tested priors and likelihoods in the present moment, but always ask: do my priors and likelihoods still hold true today? In other words, update your beliefs incrementally as new information or evidence surfaces. A good example of this would be the shifting sentiments of the financial markets. What holds true today may not hold tomorrow. Hence, the priors and likelihoods must also be incrementally updated.


### Posterior probabilities of hypotheses and Bayes factors


**Prior Odds**  
$O[H_{1}:H_{2}] = \frac {P(H_{1})}{P(H_{2})}$

**Ratio of posterior probabilities and hypotheses**  

$PO[H_{1}:H_{2}] = \frac {P(H_{1}| data)}{P(H_{2}|data)}$
- $PO[H_{1}:H_{2}]$ The probability of $H_{1}$ given data divided by the probability of  $H_{2}$ given data


Posterior odds expanded:

$PO[H_{1}:H_{2}] = \frac {P(H_{1}| data)}{P(H_{2}|data)} = \frac{P(data|H_{1}) * P(H_{1})/P(data)}{P(data|H_{2}) * P(H_{2})/P(data)}$

The probability of data in the numerator and denominator cancel out, leaving us with:

$\frac{P(data|H_{1}) * P(H_{1})}{P(data|H_{2}) * P(H_{2})}$

- reorganize that as the ratio of the probability of the data given $H_{1}$ to the probability of the data given $H_{2}$, multiplied by the ratio of the prior probabilities of the two hypotheses.

$\frac{P(data|H_{1})}{P(data|H_{2})} * \frac{P(H_{1})}{P(H_{2})}$

The first half is called The Bayes Factor:

$\frac{P(data|H_{1})}{P(data|H_{2})}$

and the second half is the Prior Odds:

$\frac{P(H_{1})}{P(H_{2})}$

In other words, the Posterior Odds is the product of the Bayes Factor and the Prior Odds.

- *Bayes Factor* quantifies the evidence of data arising from  Hypothesis-1 vs Hypothesis-2
- in a *discrete case* this is just the ratio of the likelihoods of the observed data under the two hypotheses.
- in a *continuous case* it's the ratio of the *marginal likelihoods*, where each likelihood is averaged over that hypothesis's prior on $\theta$: $BF[H_{1}:H_{2}] = \frac{\int P(data | \theta, H_{1}) P(\theta | H_{1}) \, d\theta}{\int P(data | \theta, H_{2}) P(\theta | H_{2}) \, d\theta} $

HIV Testing With ELISA Example:  
*Hypotheses*:  
$H_{1}$ - patient does not have HIV  
$H_{2}$ - patient does have HIV

*Priors*:  
$P(H_{1})$ - 0.99852  
$P(H_{2})$ - 0.00148

Prior Odds:  
In [12]: 0.99852 / 0.00148  
Out[12]: 674.6756756756756


*Posteriors*:  
$P(H_{1}|+)$ = .8788551  
$P(H_{2}|+)$ = .1211449

Posterior Odds:  
In [13]: .8788551 / .1211449  
Out[13]: 7.254577782473715
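
Since Posterior Odds = Bayes Factor * Prior Odds, the Bayes Factor for this example can be recovered by dividing the two odds computed above. A short sketch with the figures from these notes (the variable names are illustrative):

```python
prior_h1, prior_h2 = 0.99852, 0.00148        # priors listed above
post_h1, post_h2 = 0.8788551, 0.1211449      # posteriors after a positive ELISA

prior_odds = prior_h1 / prior_h2             # O[H1:H2]
post_odds = post_h1 / post_h2                # PO[H1:H2]

bf_12 = post_odds / prior_odds               # BF[H1:H2], evidence for "no HIV"
bf_21 = 1 / bf_12                            # BF[H2:H1], evidence for "HIV"
print(f"BF[H1:H2] = {bf_12:.5f}, BF[H2:H1] = {bf_21:.1f}")
```

A positive result multiplies the odds of $H_{2}$ by roughly 93, yet $H_{1}$ remains more probable because the prior odds against HIV were so large.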


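Interpreting the Bayes Factor:  
Jeffreys (1961); the cutoffs below follow Kass and Raftery's (1995) revision of his scale
- if the Bayes Factor is between 1 and 3, the evidence against H2 is not worth more than a bare mention
- 3-20 the evidence is positive
- 20-150 the evidence is strong
- \>150 the evidence is very strong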
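
The scale above is easy to encode as a lookup. In this sketch the function name `interpret_bayes_factor` is made up, and assigning the boundary values 3, 20 and 150 to the lower band is an assumption, since the notes do not say which side they fall on. Factors below 1 favor the other hypothesis and should be inverted first.

```python
def interpret_bayes_factor(bf):
    """Map a Bayes Factor >= 1 to the evidence labels in the list above."""
    if bf < 1:
        raise ValueError("factor favors the other hypothesis; pass 1 / bf instead")
    if bf <= 3:
        return "not worth more than a bare mention"
    if bf <= 20:
        return "positive"
    if bf <= 150:
        return "strong"
    return "very strong"

for factor in (2, 10, 93, 200):
    print(factor, "->", interpret_bayes_factor(factor))
```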



## More Bayesian notes


**Bayesian inference** is a general approach to statistics which uses prior probabilities to answer questions like:
- Has this happened before?
- Is it likely, based on my knowledge of the situation, that it will happen?

Prior probability is a probability distribution that summarizes established beliefs about an event before (i.e. prior to) new evidence is considered. When the new evidence is added, the updated distribution is called the posterior probability. The probabilities, which you can think of as degrees of belief, are called Bayesian probabilities.



**Frequentists vs Bayesians** 

The main alternative to Bayesian statistics is frequentist statistics, the type of statistics you study in an elementary statistics class. In elementary statistics, probabilities are long-run frequencies and parameters are fixed unknowns. Bayesian probabilities are more flexible because they can also express degrees of belief and be updated as evidence arrives.

**Bayesians** treat probability as a subjective degree of belief and use a priori beliefs to define a prior probability distribution on the possible values of the unknown parameters.


**Frequentists** see probability as something that has to do with a limiting frequency based on an observed proportion.
- They define an event's probability as the limit of its relative frequency in many trials. 
- Probabilities can be found (in principle) by a repeatable objective process (and are thus ideally devoid of opinion).

**Example using lottery balls**  
If you're pulling balls from the 'air popper' lottery ball system, what is the probability that the ball will be even?  
We might say a priori "1 in 2", but you don't know if the balls are actually a fair sample of 1-100.

- A frequentist will say: "count them, then you'll know"
- A Bayesian works on the idea that you have some prior commitments about the way the world is. In this example:
"it's very unlikely that the government would permit a biased sample of balls in the lottery, therefore it's only going to be all the balls 1-100, each represented only once, and therefore it's a 1 in 2 chance."
- Bayesianism works well in a constructed system like a lottery where prior assumptions are more meaningful.
- In a less constructed case, such as the emergence of a new disease in a certain population, prior assumptions can get messy quickly. 
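
To make the contrast concrete, here is a small sketch of both estimates after watching a few draws. It goes beyond the notes in one respect: the Bayesian's belief in a fair machine is modeled as a Beta prior worth 50 "even" and 50 "odd" pseudo-draws, which is an assumption chosen for illustration, as are the function names and the observed counts.

```python
from fractions import Fraction

def frequentist_estimate(evens, draws):
    """Relative frequency of even balls in the observed draws."""
    return Fraction(evens, draws)

def bayesian_posterior_mean(evens, draws, prior_even=50, prior_odd=50):
    """Posterior mean under a Beta(prior_even, prior_odd) prior (assumed)."""
    return Fraction(prior_even + evens, prior_even + prior_odd + draws)

evens, draws = 7, 10    # hypothetical observation: 7 even balls in 10 draws
freq = frequentist_estimate(evens, draws)
bayes = bayesian_posterior_mean(evens, draws)
print(float(freq), round(float(bayes), 3))
```

With only ten draws the frequentist estimate jumps to 0.7, while the strong prior keeps the Bayesian estimate near 0.5; with many more draws the two would converge.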

*"If we aren't actually Bayesians in practice, we're pretty close to it".*

 A good way to model quickly and infer things in complicated scenarios (satisficing?) 
 
 "if we're Bayesians, we're unconscious Bayesians", because we're not perfectly rational.
 
 "Humans are not good at reasoning in a probabilistic fashion. Statistics were invented to overcome that"
 
 Is Bayesianism a "description of what we do" or a "prescription for what we should do"? 
 
 "With new observations you tend to refine your priors over time, so a bayesian will improve and refine over time"
 
 
 
 

<img src="frequentists_vs_bayesians.png" width="400" height="200">


**Bayesian Hypothesis Testing**

- Traditional testing (the type you probably came across in elementary stats or AP stats) is called Non-Bayesian. It is based on how often an outcome happens over repeated runs of the experiment. It’s an objective view of whether an experiment is repeatable.
- Bayesian hypothesis testing is a subjective view of the same thing. It takes into account how much faith you have in your results. In other words, would you wager money on the outcome of your experiment?

**Differences Between Traditional and Bayesian Hypothesis Testing**

- Traditional testing (Non Bayesian) reasons about what would happen if sampling were repeated over and over, while Bayesian testing does not. The main difference between the two is in the first step of testing: stating a probability model. 
- In Bayesian testing you add prior knowledge to this step. It also requires use of a posterior probability, which is the conditional probability given to a random event after all the evidence is considered.

  
**Arguments for Bayesian Testing**  
Many researchers think that it is a better alternative to traditional testing, because it:
- Includes prior knowledge about the data.
- Takes into account personal beliefs about the results.

**Arguments against**
- Including prior data or knowledge isn’t justifiable.
- It is difficult to calculate compared to non-Bayesian testing.



**Here are five reasons why frequentist methods may be preferred:**
- **Faster.** Given that Bayesian statistics often give nearly identical answers to frequentist answers (and when they don't, it's not 100% clear that Bayesian is always the way to go), the fact that frequentist statistics can be obtained often several orders of magnitude faster is a strong argument. Likewise, frequentist methods do not require as much memory to store the results. While these things may seem somewhat trivial, especially with smaller datasets, the fact that Bayesian and Frequentist typically agree in results (especially if you have lots of informative data) means that if you are going to care, you may start caring about the less important things. And of course, if you live in the big data world, these are not trivial at all.
- **Non-parametric statistics.** I recognize that Bayesian statistics does have non-parametric statistics, but I would argue that the frequentist side of the field has some truly undeniably practical tools, such as the Empirical Distribution Function. No method in the world will ever replace the EDF, nor the Kaplan Meier curves, etc. (although clearly that's not to say those methods are the end of an analysis).
- **Fewer diagnostics.** MCMC methods, the most common method for fitting Bayesian models, typically require more work by the user than their frequentist counterparts. Usually, the diagnostic for an MLE estimate is so simple that any good algorithm implementation will do it automatically (although that's not to say every available implementation is good...). As such, frequentist algorithmic diagnostics are typically "make sure there's no red text when fitting the model". Given that all statisticians have limited bandwidth, this frees up more time to ask questions like "is my data really approximately normal?" or "are these hazards really proportional?", etc.
- **Valid inference under model misspecification.** We've all heard that "All models are wrong but some are useful", but different areas of research take this more or less seriously. The Frequentist literature is full of methods for fixing up inference when the model is misspecified: bootstrap estimator, cross-validation, sandwich estimator (link also discusses general MLE inference under model misspecification), generalized estimation equations (GEE's), quasi-likelihood methods, etc. As far as I know, there is very little in the Bayesian literature about inference under model misspecification (although there's a lot of discussion of model checking, i.e., posterior predictive checks). I don't think this just by chance: evaluating how an estimator behaves over repeated trials does not require the estimator to be based on a "true" model, but using Bayes theorem does!
- **Freedom from the prior (this is probably the most common reason for why people don't use Bayesian methods for everything).** The strength of the Bayesian standpoint is often touted as the use of priors. However, in all of the applied fields I have worked in, the idea of an informative prior in the analysis is not considered. Reading literature on how to elicit priors from non-statistical experts gives good reasoning for this; I've read papers that say things like (cruel straw-man like paraphrasing my own) "Ask the researcher who hired you because they have trouble understanding statistics to give a range that they are 90% certain the effect size they have trouble imagining will be in. This range will typically be too narrow, so arbitrarily try to get them to widen it a little. Ask them if their belief looks like a gamma distribution. You will probably have to draw a gamma distribution for them, and show how it can have heavy tails if the shape parameter is small. This will also involve explaining what a PDF is to them."(note: I don't think even statisticians are really able to accurately say a priori whether they are 90% or 95% certain whether the effect size lies in a range, and this difference can have a substantial effect on the analysis!). Truth be told, I'm being quite unkind and there may be situations where eliciting a prior may be a little more straightforward. But you can see how this is a can of worms. Even if you switch to non-informative priors, it can still be a problem; when transforming parameters, what are easily mistaken for non-informative priors suddenly can be seen as very informative! Another example of this is that I've talked with several researchers who adamantly do not want to hear what another expert's interpretation of the data is because empirically, the other experts tend to be over confident. They'd rather just know what can be inferred from the other expert's data and then come to their own conclusion. 
I can't recall where I heard it, but somewhere I read the phrase "if you're a Bayesian, you want everyone to be a Frequentist". I interpret that to mean that theoretically, if you're a Bayesian and someone describes their analysis results, you should first try to remove the influence of their prior and then figure out what the impact would be if you had used your own. This little exercise would be simplified if they had given you a confidence interval rather than a credible interval!
- Of course, if you abandon informative priors, there is still utility in Bayesian analyses. Personally, this where I believe their highest utility lies; there are some problems that are extremely hard to get any answer from in using MLE methods but can be solved quite easily with MCMC. But my view on this being Bayesian's highest utility is due to strong priors on my part, so take it with a grain of salt.




**More Bayesian concepts** 

- Approximate Bayesian Computation (ABC): This set of techniques starts with a set of known summary statistics. A second set of the same statistics is calculated from a variety of potential models, and the candidates are placed in an acceptance/rejection loop. ABC favors those candidates that more closely match the known summary statistics (Medhi, 2014).
- Admissible decision rule: A decision rule is a guideline to help you support or reject a null hypothesis. Generally speaking, a decision rule is “admissible” if no other rule dominates it, i.e. no other rule does at least as well in every situation and strictly better in at least one. Admissible does not mean best overall; it only rules out choices that are never worth making.
- Bayesian efficiency: An efficient design requires you to input parameter values; In a Bayesian efficient model you have to take your “best guess” at what those parameters might be. Figuring out what is an efficient design (and what isn’t) by hand is only possible for very small designs, as it’s a computationally complex process (Hess & Daly, 2010).
- Bayes’ theorem: see the Bayes Theorem section above.
- Bayes factor: the Bayes factor is a measure of relative likelihood between two hypotheses, or what Cornfield (1976) calls the “relative betting odds.” The factor ranges between zero and infinity, where values close to zero are evidence against the null hypothesis and evidence for the alternate hypothesis (Spiegelhalter et al., 2004).
- Bayesian network: A directed acyclic graph that represents a set of variables and their associated dependencies.
- Bayesian linear regression: treats regression coefficients and errors as random variables, instead of fixed unknowns. This tends to make the model more intuitive and flexible. However, the results are similar to simple linear regression if priors are uninformative and N is much greater than P (i.e. when the number of observations is much greater than the number of predictors).
- Bayesian estimator: Also called a Bayes action, the Bayes estimator is defined as a minimizer of Bayes risk. In more general terms, it’s a single number that summarizes the posterior distribution of a particular parameter, for example the posterior mean under squared-error loss.
- Bayesian Information Criterion (also called the Schwarz criterion): given a set of models to choose from, you should choose the model with the lowest BIC.
- Bernstein–von Mises theorem: This is the Bayesian equivalent of the asymptotic normality results in the asymptotic theory of maximum likelihood estimation (Ghosh & Ramamoorthi, 2006, p.33).
- Conjugate prior: A prior is conjugate to a likelihood when the resulting posterior belongs to the same family of distributions as the prior. For example, if you’re studying people’s weights, which are normally distributed with known variance, a normal prior on the mean weight gives a normal posterior.
- Credible interval: a range of values where an unobserved parameter falls with a certain subjective probability. It is the Bayesian equivalent of a confidence interval in frequentist statistics.
- Cromwell’s rule: This simple rule states that you should not assign probabilities of 0 (an event will not happen) or 1 (an event will happen), except when you can demonstrate an event is logically true or false. For example, the event 5 + 5 will logically add up to 10, so you can apply a probability of 1 to it.
- Empirical Bayes method: a technique where the prior distribution is estimated from actual data. This is unlike the usual Bayesian methods, where the prior distribution is fixed at the beginning of an experiment.
- Hyperparameter: a parameter from the prior distribution that’s set before the experiment begins.
- Likelihood function: A measurement of how well particular parameter values explain the observed data.
- Maximum a posteriori estimation: An estimate of an unknown; It is equal to the mode of the posterior distribution.
- Maximum entropy principle: This principle states that if you are estimating a probability distribution, you should select the distribution which gives you the maximum uncertainty (entropy) among those consistent with what you already know.
- Posterior probability: Posterior probability is the probability an event will happen after all evidence or background information has been taken into account.
- Principle of indifference: states that if you have no reason to expect one event will happen over another, all events should be given the same probability.


## More Bayes examples


*Example 1:*  



$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $

- In a pain clinic 10% of patients are prescribed narcotic pain killers. 
- Overall 5% of the clinic's patients are addicted to narcotics (including pain killers and illegal substances)
- Out of all the people prescribed pain pills 8% are addicts

If a patient is an addict, what is the probability that they are prescribed pain pills?


Step 1: Figure out what your event “A” is from the question. Event A is being prescribed pain pills. That’s given as 10%.

Step 2: Figure out what your event “B” is from the question. Event B is being an addict. That’s given as 5%.

Step 3: Figure out the probability of event B (Step 2) given event A (Step 1). In other words, find what P(B|A) is. We want to know “Given that people are prescribed pain pills, what’s the probability they are an addict?” That is given in the question as 8%, or .08.

Step 4: Insert your answers from Steps 1, 2 and 3 into the formula and solve.
P(A|B) = P(B|A) * P(A) / P(B) = (0.08 * 0.1)/0.05 = 0.16


- $ A$  Being prescribed pain pills (10%)
- $ B$  Being an addict (5%)
- $ P(B|A)$ given that someone is prescribed a pain pill, what's the probability they are an addict? 8%

$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $

$ P(A|B) = \frac{.08 * .10 }{.05} = \frac{.008}{.05} = .16 $ or 16%

So the probability that an addict is prescribed pain pills is 16%.

*Example 2:*  

You might be interested in finding out a patient’s probability of having liver disease if they are an alcoholic. “Being an alcoholic” is the test (kind of like a litmus test) for liver disease.

- A could mean the event “Patient has liver disease.” Past data tells you that 10% of patients entering your clinic have liver disease. P(A) = 0.10.
- B could mean the litmus test that “Patient is an alcoholic.” Five percent of the clinic’s patients are alcoholics. P(B) = 0.05.
- You might also know that among those patients diagnosed with liver disease, 7% are alcoholics. This is your B|A: the probability that a patient is alcoholic, given that they have liver disease, is 7%.

P(A) = .10  
P(B) = .05  
P(B|A) = .07

$ P(A|B)= \frac{.07 * .10}{.05} = \frac{0.007}{.05} = .14 $ or a 14% chance
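
Examples 1 and 2 both use the simple form of the theorem, where $P(B)$ is given directly. A small sketch that checks both results (the function name `bayes` and the variable names are illustrative):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Simple form of Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be greater than 0")
    return p_b_given_a * p_a / p_b

# Example 1: A = prescribed pain pills, B = addict.
pills_given_addict = bayes(p_b_given_a=0.08, p_a=0.10, p_b=0.05)

# Example 2: A = liver disease, B = alcoholic.
liver_given_alcoholic = bayes(p_b_given_a=0.07, p_a=0.10, p_b=0.05)

print(round(pills_given_addict, 2), round(liver_given_alcoholic, 2))
```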


*Example 3*:  

A slightly more complicated example involves a medical test (in this case, a genetic test):

There are several forms of Bayes’ Theorem out there, and they are all equivalent (they are just written in slightly different ways). In this next equation, “X” is used in place of “B.” In addition, the denominator P(X) is expanded with the law of total probability into the positives that come from A and the positives that come from not-A, the same expansion used to calculate P(B) in Exercise 1. If you come across a question involving medical tests, you’ll likely be using this alternative formula to find the answer:

$Pr(A|X) = \frac{Pr(X|A) * Pr(A)}{(Pr(X|A) * Pr(A)) + (Pr(X|\sim A) * Pr(\sim A))}$


- 1% of people have a certain genetic defect.
- 90% of tests on people with the defect detect it (true positives).
- 9.6% of tests on people without the defect come back positive (false positives).

If a person gets a positive test result, what is the probability they actually have the genetic defect?

The first step in solving Bayes’ theorem problems is to assign letters to events:

- A = having the faulty gene. Its probability was given in the question as 1%. That also means the probability of not having the gene (~A) is 99%.
- X = A positive test result.

So:

- P(A|X) = Probability of having the gene given a positive test result.
- P(X|A) = Chance of a positive test result given that the person actually has the gene. That was given in the question as 90%.
- P(X|~A) = Chance of a positive test if the person doesn’t have the gene. That was given in the question as 9.6%.

Now we have all of the information we need to put into the equation:  
$ P(A|X) = \frac{.9 * .01}{.9 * .01 + .096 * .99} = 0.0865 $ (8.65%)

The probability of having the faulty gene given a positive test is 8.65%.
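
A useful cross-check is to redo the calculation with head counts instead of probabilities. The sketch below imagines a cohort of 10,000 people, a size picked only to make the counts readable, and asks what share of all positive tests come from people who really have the defect.

```python
population = 10_000                          # hypothetical cohort size
with_defect = population * 0.01              # 100 people have the defect
without_defect = population - with_defect    # 9,900 people do not

true_positives = with_defect * 0.90          # 90 carriers test positive
false_positives = without_defect * 0.096     # about 950 non-carriers test positive

p_defect_given_positive = true_positives / (true_positives + false_positives)
print(f"{p_defect_given_positive:.4f}")
```

Seen this way the low answer is less surprising: roughly 90 of the 1,040 or so positive tests belong to people with the defect.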



