# Module 2 Part 3: The Bayesian Framework and Random Variables

This module consists of 3 parts:

- **Part 1** - Introduction to Probability

- **Part 2** - Probability Distributions

- **Part 3** - The Bayesian Framework and Random Variables

Each part is provided in a separate notebook file. It is recommended that you follow the order of the notebooks.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<br>
<div class="toc">
<ul class="toc-item">
<li><span><a href="#Module-2-Part-3:-The-Bayesian-Framework-and-Random-Variables" data-toc-modified-id="Module-2-Part-3:-The-Bayesian-Framework-and-Random-Variables">Module 2 Part 3: The Bayesian Framework and Random Variables</a></span>
</li>
<li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents">Table of Contents</a></span>
</li>
<li><span><a href="#Introduction-to-Bayesian-inference" data-toc-modified-id="Introduction-to-Bayesian-inference">Introduction to Bayesian inference</a></span>
<ul class="toc-item">
<li><span><a href="#Bayesian-statistics" data-toc-modified-id="Bayesian-statistics">Bayesian statistics</a></span>
</li>
<li><span><a href="#Bayesian-analysis" data-toc-modified-id="Bayesian-analysis">Bayesian analysis</a></span>
<ul class="toc-item">
<li><span><a href="#What-types-of-questions-is-Bayesian-analysis-suited-to-answer?" data-toc-modified-id="What-types-of-questions-is-Bayesian-analysis-suited-to-answer?">What types of questions is Bayesian analysis suited to answer?</a></span>
</li>
</ul>
</li>
<li><span><a href="#A-review-of-conditional-probability" data-toc-modified-id="A-review-of-conditional-probability">A review of conditional probability</a></span>
<ul class="toc-item">
<li><span><a href="#Marginal-and-joint-probabilities" data-toc-modified-id="Marginal-and-joint-probabilities">Marginal and joint probabilities</a></span>
</li>
<li><span><a href="#Conditional-probability-and-independence" data-toc-modified-id="Conditional-probability-and-independence">Conditional probability and independence</a></span>
</li>
</ul>
</li>
</ul>
</li>
<li><span><a href="#Bayes'-Theorem" data-toc-modified-id="Bayes'-Theorem">Bayes' Theorem</a></span>
<ul class="toc-item">
<li><span><a href="#The-Bayesian-framework" data-toc-modified-id="The-Bayesian-framework">The Bayesian framework</a></span>
<ul class="toc-item">
<li><span><a href="#M&Ms-Example" data-toc-modified-id="M&Ms-Example">M&Ms Example</a></span>
</li>
</ul>
</li>
</ul>
</li>
<li><span><a href="#Properties-of-Random-Variables" data-toc-modified-id="Properties-of-Random-Variables">Properties of Random Variables</a></span>
<ul class="toc-item">
<li><span><a href="#Expectation-of-a-random-variable" data-toc-modified-id="Expectation-of-a-random-variable">Expectation of a random variable</a></span>
</li>
<li><span><a href="#Variability-in-expectation" data-toc-modified-id="Variability-in-expectation">Variability in expectation</a></span>
</li>
</ul>
</li>
<li><span><a href="#References" data-toc-modified-id="References">References</a></span>
</li>
</ul>
</div>

# Introduction to Bayesian inference

## Bayesian statistics

How do we measure uncertainty, and how do we make decisions in its presence? One way to quantify uncertainty is to think in terms of probabilities.

There are two major frameworks that statisticians use to think about probabilities:

1. In the **frequentist** framework, probabilities depend on the relative frequency of repeatable events. This approach works very well when we can define a hypothetical infinite sequence.<br><br>

2. In the **Bayesian** framework, probabilities represent our perspective, which takes into account what we know about a particular problem. The uncertainty of the relevant measurements is integral to the framework.

For example, when we flip a fair coin many times, the frequentist approach assumes that the statistics of the coin do not change (i.e. the coin remains fair). However, in the Bayesian framework, our perspective of the fairness of the coin may change as new information becomes available.

The Bayesian worldview interprets probability as a measure of believability in an event &mdash; that is, how confident we are in an event occurring. Frequentists, whose analysis is a more classical version of statistics, assume that probability is the long-run frequency of events. This makes sense for the probabilities of many events but becomes more difficult to understand when events have no long-term frequency of occurrence.

## Bayesian analysis

Bayesians follow an intuitive approach. We will use an example to demonstrate the frequentist versus Bayesian approach.

Consider the question: ***What is the probability of a die being fair?***

A frequentist would think like this: We can roll the die many times but that's not going to change whether or not it's a fair die.

The probability is either 0 or 1.

* The frequentist approach tries to be objective in how it defines probabilities.
* However, sometimes we get interpretations that are not particularly intuitive.

A Bayesian would think like this: We can roll the die many times, but if we have different information than somebody else, then our probabilities may be different.

Probabilities are updated as more data comes in.

* This is inherently a subjective approach to probability.
* The Bayesian framework rests on a mathematically rigorous foundation and follows all the rules of probability (e.g., $0 \le p_i \le 1$, $\sum{p_i} = 1$).
* Thinking in this way leads to much more intuitive results.

### What types of questions is Bayesian analysis suited to answer?

Certain questions, such as those about coin flips, dice rolls, and other situations in which probabilities are static, are easily answered using a frequentist approach. However, when the probabilities are not static, Bayesian analysis becomes a more intuitive way to investigate a problem.

For example, user preferences on topics such as movies and products in online stores are complex, rely on many factors, and can change rapidly. For these scenarios, a frequentist approach is less useful because it is extremely difficult to predict the underlying statistics, whereas a Bayesian approach allows us to update our user preference model as more data comes in.

Thus, Bayesian analysis is suited to the following kinds of questions:

* What is the probability of a coin being fair?
* What is the probability of getting a four when rolling a die?
* What is the probability of rain tomorrow?
* What is the probability that users prefer site A vs. site B?

![frequentists_vs_bayesians.png](attachment:frequentists_vs_bayesians.png)

*Cartoon example of the frequentist versus Bayesian approach.* **Image Source**: (xkcd comics, n.d.)

## A review of conditional probability

Bayesian statistical analysis requires the correct application of probability concepts. We will now review the basics.

If two events are related to each other, what is the probability of event A happening given that we know event B happened?

### Marginal and joint probabilities

Recall the examples from Part 1 and Part 2 of this module regarding the probability of whether a teen will go to college based on whether their parents did.

If a probability is based on a single variable, it is called a **marginal probability**.

For example, a probability based solely on the `teen` variable is called a marginal probability:

$$P (teen\ college) = \frac{445}{792} = 0.56$$

The probability of outcomes for two or more variables or processes is called a **joint probability**.

For example:

$$P (teen\ college\ and\ parents\ not) = \frac{214}{792} = 0.27$$

### Conditional probability and independence

The **conditional probability** of the outcome of interest $A$ given condition $B$ is computed as follows:

$$P(A\ |\ B) = \frac{P(A\cap B)}{P(B)}$$


Recall that $P(A\ |\ B)$ means the probability of $A$ **given** $B$, and $P(A\cap B)$ is the probability of $A$ **and** $B$.

Thus, in this example:

$$P (parents\ not\ given\ teen\ college) = \frac{P (teen\ college\ and\ parents\ not)}{P(teen\ college)} = \frac{\frac{214}{792}}{\frac{445}{792}} = \frac{214}{445} = 0.48 $$
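The three quantities above can be reproduced directly from the counts. The following is a minimal sketch that uses only the numbers quoted in the text; the variable names are illustrative and do not come from the earlier notebooks.

```python
from fractions import Fraction

# Counts quoted in the text above
n_total = 792                      # all teens in the data set
n_teen_college = 445               # teens who went to college
n_college_and_parents_not = 214    # teens who went to college whose parents did not

# Marginal probability: based on a single variable
p_teen_college = Fraction(n_teen_college, n_total)

# Joint probability: two variables considered together
p_college_and_parents_not = Fraction(n_college_and_parents_not, n_total)

# Conditional probability: P(parents not | teen college) = joint / marginal
p_parents_not_given_college = p_college_and_parents_not / p_teen_college

print(f"P(teen college)                 = {float(p_teen_college):.2f}")
print(f"P(teen college and parents not) = {float(p_college_and_parents_not):.2f}")
print(f"P(parents not | teen college)   = {float(p_parents_not_given_college):.2f}")
```

Using `Fraction` keeps the arithmetic exact, so the conditional probability simplifies to 214/445 just as in the hand calculation.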

Two events are called **independent** when:

$$P (A\ |\ B) = P (A)$$


Then:

$$P(A \cap B) = P(A) \cdot P(B)$$


In this example, if the events *teen college* and *parents not* were independent, then:

$$P (teen\ college\ given\ parents\ not) = P (teen\ college) = \frac{445}{792} = 0.56$$

**Example**

Suppose a box contains 100 t-shirts: 60 are blue and 40 are red. Suppose also that 50 of the t-shirts are size small, and 30 of those small t-shirts are red.

* Marginal probability of red $ = \frac{40}{100} = 0.4$
* Marginal probability of small $ = \frac{50}{100} = 0.5$
* Joint probability of red and small $ = \frac{30}{100} = 0.3$

**Question**: If a t-shirt were randomly chosen from the box, what is $P(Red\ |\ Small) $?

$$\frac{P(Red\ and\ Small)}{P(Small)} = \frac{0.3}{0.5} = \frac{3}{5}$$

Alternatively, we can see from the information given that out of the 50 size small shirts, 30 are red.

$$\frac{30}{50} = \frac{3}{5}$$

We arrived at the same answer.
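The same t-shirt numbers can also be used to check independence. The sketch below (variable names are illustrative) computes $P(Red\ |\ Small)$ and then tests whether $P(Red \cap Small) = P(Red) \cdot P(Small)$ holds.

```python
from fractions import Fraction

# Counts from the t-shirt example
n_total, n_red, n_small, n_red_and_small = 100, 40, 50, 30

p_red = Fraction(n_red, n_total)
p_small = Fraction(n_small, n_total)
p_red_and_small = Fraction(n_red_and_small, n_total)

# Conditional probability: P(Red | Small) = P(Red and Small) / P(Small)
p_red_given_small = p_red_and_small / p_small

# Independence would require P(Red and Small) == P(Red) * P(Small)
independent = p_red_and_small == p_red * p_small

print(p_red_given_small)   # 3/5
print(independent)         # False
```

Because $P(Red) \cdot P(Small) = 0.2$ while the joint probability is $0.3$, colour and size are not independent in this box: knowing that a shirt is small raises the probability that it is red from 0.4 to 0.6.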

# Bayes' Theorem

There are situations where we witness a particular event and we need to compute the probability of one of its possible causes.

In other words, we observe $P(A\ |\ B)$ and we want to know $P(B\ |\ A)$. This is where we apply Bayes' Theorem.

For illustrative purposes, suppose we want to calculate $P(A \ and \ B)$. We can use the conditional probability equation in two ways:


$$ P(A \ and \ B)=P(A\ |\ B) \cdot P(B) $$


or


$$ P(A \ and \ B)=P(B\ |\ A) \cdot P(A) $$


then, we can say:


$$ P(B\ |\ A) \cdot P(A)=P(A\ |\ B) \cdot P(B) $$


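which is only one step away from Bayes' Theorem: we only need to solve for one of the two conditional probabilities. Solving for $P(B\ |\ A)$ gives:

$$ P(B\ |\ A)=\frac{P(A\ |\ B) \cdot P(B)}{P(A)} $$

When $P(A)$ is not known directly, it can be expanded over $B$ and its complement. Thus, in the previous example: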

$$P(Red\ |\ Small) = \frac{P(Small\ |\ Red) \cdot P(Red)}{P(Small\ |\ Red) \cdot P(Red)+P(Small\ |\ not\ Red) \cdot P(not\ Red)}$$


and since:

$$P(Small\ |\ Red) = \frac{30}{40} = \frac{3}{4}$$


$$P(Small\ |\ not\ Red) = \frac{20}{60} = \frac{1}{3}$$


$$P(Red) = \frac{4}{10} = \frac{2}{5}$$


$$P(not\ Red) = \frac{3}{5}$$


therefore:

$$P(Red\ |\ Small) = \frac{\frac{3}{4} \cdot \frac{2}{5}}{\frac{3}{4} \cdot \frac{2}{5}+\frac{1}{3} \cdot \frac{3}{5}} = \frac{\frac{3}{10}}{\frac{3}{10}+\frac{2}{10}} = \frac{3}{5}$$


This is the same answer as before. It seems a rather roundabout way of getting it, but if you only have certain data available to you, it's important to know how to use this form.

We can now write the general form, in which the denominator is a sum over conditional probabilities. Suppose, as before, that $A_1$, ..., $A_k$ represent all the disjoint outcomes for a variable or process $A$, then:


$$ P(A_1\ |\ B)=\frac{P(B\ |\ A_1)\cdot P(A_1)}{P(B)}=\frac{P(B\ |\ A_1)\cdot P(A_1)}{P(B\ |\ A_1)\cdot P(A_1)+P(B\ |\ A_2)\cdot P(A_2)+\cdots+P(B\ |\ A_k)\cdot P(A_k)} $$


where


$$ P(B)= P(B\ |\ A_1)\cdot P(A_1)+P(B\ |\ A_2)\cdot P(A_2)+\cdots+P(B\ |\ A_k)\cdot P(A_k) $$


Bayes’ Theorem can be thought of as a way of "inverting" conditional probabilities. Sometimes we know $P(A\ |\ B)$ but we need to calculate $P(B\ |\ A)$.
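The general formula translates almost line for line into code. Below is a minimal sketch, assuming the outcomes $A_1, ..., A_k$ are supplied as dictionary keys; the function name `bayes_posterior` is hypothetical. It is applied to the t-shirt example, where $B$ is "Small" and the outcomes are "Red" and "not Red".

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Return P(A_i | B) for every disjoint outcome A_i.

    priors      -- dict mapping each outcome A_i to P(A_i)
    likelihoods -- dict mapping each outcome A_i to P(B | A_i)
    """
    # Denominator: P(B) = sum over i of P(B | A_i) * P(A_i)
    p_b = sum(likelihoods[a] * priors[a] for a in priors)
    # Numerator for each outcome, divided by the shared denominator
    return {a: likelihoods[a] * priors[a] / p_b for a in priors}

# T-shirt example: B is "Small"
priors = {"Red": Fraction(2, 5), "not Red": Fraction(3, 5)}
likelihoods = {"Red": Fraction(3, 4), "not Red": Fraction(1, 3)}  # P(Small | colour)

posterior = bayes_posterior(priors, likelihoods)
print(posterior["Red"])      # 3/5
print(posterior["not Red"])  # 2/5
```

The posteriors over all outcomes sum to 1, which is a quick sanity check when applying the formula by hand.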

**Example: The Cookie Problem**

- Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.

- Bowl 2 contains 20 vanilla cookies and 20 chocolate cookies.

Suppose you choose one of the bowls at random and select a cookie at random. The cookie is vanilla. What is the probability that it came from Bowl 1?

We want to find $P(Bowl\ 1\ |\ Vanilla)$. Using Bayes' Theorem:

$$P(Bowl\ 1\ |\ Vanilla) = \frac{P(Vanilla\ |\ Bowl\ 1) \cdot P(Bowl\ 1)}{P(Vanilla)}$$

$P(Bowl\ 1) = \frac{1}{2}$, the probability that we choose Bowl 1 (assuming this is random)

$P(Vanilla\ |\ Bowl\ 1)$ is the probability of getting a vanilla cookie from Bowl 1 ($\frac{30}{40}$ or $\frac{3}{4}$)

$P(Vanilla)$ is the probability of drawing a vanilla cookie from either bowl: $(\frac{1}{2})(\frac{30}{40}) + (\frac{1}{2})(\frac{20}{40}) = \frac{5}{8}$.

$P(Bowl\ 1\ |\ Vanilla) = \frac{(\frac{3}{4})(\frac{1}{2})}{(\frac{5}{8})} = \frac{3}{5} = 0.6$

This may seem obvious from the nature of the question (i.e. we know that there are 30 vanilla cookies in Bowl 1 and 50 vanilla cookies overall, so it seems clear that $P(Bowl\ 1\ |\ Vanilla) = 0.6$), but many problems of this type are not so clear cut. This method generalizes to more complex situations!
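One way to build confidence in the result is to simulate the experiment many times and count. The sketch below is an illustrative check rather than part of the original exercise; the function name, the number of trials, and the fixed seed are all assumptions chosen for reproducibility.

```python
import random

def simulate_cookie_problem(n_trials=100_000, seed=0):
    """Estimate P(Bowl 1 | Vanilla) by repeating the experiment n_trials times."""
    rng = random.Random(seed)
    bowls = {
        "Bowl 1": ["vanilla"] * 30 + ["chocolate"] * 10,
        "Bowl 2": ["vanilla"] * 20 + ["chocolate"] * 20,
    }
    vanilla_draws = 0
    vanilla_from_bowl_1 = 0
    for _ in range(n_trials):
        bowl = rng.choice(["Bowl 1", "Bowl 2"])   # choose a bowl at random
        cookie = rng.choice(bowls[bowl])          # choose a cookie at random
        if cookie == "vanilla":
            vanilla_draws += 1
            if bowl == "Bowl 1":
                vanilla_from_bowl_1 += 1
    # Fraction of vanilla draws that came from Bowl 1
    return vanilla_from_bowl_1 / vanilla_draws

estimate = simulate_cookie_problem()
print(f"Estimated P(Bowl 1 | Vanilla) = {estimate:.3f}")
```

With enough trials the estimate should settle close to the exact value of 0.6. Note that conditioning on the evidence corresponds to discarding every trial in which the cookie was not vanilla.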

**Exercise 1: Genetics and Medicine**

**A: Genetic testing**

Say that 1% of people have a certain genetic defect. Genetic testing is available for this particular defect, and 90% of tests for the gene detect the defect if it is there.

However, 9.6% of the tests are false positives (i.e. they detect the defect when it isn't really there).

If a person gets a positive test result, **what are the odds they actually have the genetic defect?**

In [1]:
# Your work here

In [2]:
defect = 0.01
normal = 0.99
positive = 0.9
false_positive = 0.096

# P(defect | positive)
# = P(positive | defect) x P(defect) / P(positive)
# = 

How would your answer to the above change if 5% of people had the defect?

In [2]:
# Your work here

**B: A test for cancer**

Given the following statistics, what is the probability that a woman over 50 has cancer if she has a positive mammogram result?

One percent of women over 50 have breast cancer.
Ninety percent of women who have breast cancer test positive on mammograms.
Eight percent of women will have false positives.

In [3]:
# Your work here

**Solution**

**A: Genetic testing**

In [4]:
P_Gene = 0.01

P_NOT_Gene = 0.99

P_Pos_given_Gene = 0.9

P_Pos_given_NOT_Gene = 0.096

P_Gene_given_Pos = (P_Pos_given_Gene*P_Gene) / (P_Pos_given_Gene*P_Gene + P_Pos_given_NOT_Gene*P_NOT_Gene)

In [5]:
P_Gene_given_Pos

0.0865051903114187

This is much smaller than we might expect for a test that supposedly detects a defective gene 90% of the time! It's important to realise how our intuitions can lead us astray in problems like these.

In [6]:
# The odds of actually having the genetic defect are P_Gene_given_Pos / (1 - P_Gene_given_Pos), which is very small

odd_to_have_Gene_defect = P_Gene_given_Pos/(1 - P_Gene_given_Pos)

odd_to_have_Gene_defect

0.09469696969696972

If 5% of people had the defect:

In [7]:
P_Gene = 0.05

# The probability of a patient not having the defect changes too: P_NOT_Gene = 1 - P_Gene
P_NOT_Gene = 0.95

P_Gene_given_Pos = (P_Pos_given_Gene*P_Gene) / (P_Pos_given_Gene*P_Gene + P_Pos_given_NOT_Gene*P_NOT_Gene)

In [8]:
P_Gene_given_Pos

0.3303964757709251

In [9]:
# When the defect is less rare (5% instead of 1%), false positives make up a smaller share of all positive tests,
# which leads to a much higher probability (and odds) of actually having the defect given a positive test

odd_to_have_Gene_defect_5 = P_Gene_given_Pos/(1 - P_Gene_given_Pos)
odd_to_have_Gene_defect_5

0.493421052631579

**B: A test for cancer**

Given the following statistics, what is the probability that a woman over 50 has cancer if she has a positive mammogram result?

One percent of women over 50 have breast cancer.
Ninety percent of women who have breast cancer test positive on mammograms.
Eight percent of women will have false positives.

In [10]:
P_Cancer = 0.01
P_NOT_Cancer = 0.99
P_Pos_given_Cancer = 0.9
P_Pos_given_NOT_Cancer = 0.08

P_Cancer_given_Pos = (P_Pos_given_Cancer*P_Cancer) / (P_Pos_given_Cancer*P_Cancer + P_Pos_given_NOT_Cancer*P_NOT_Cancer)

In [11]:
P_Cancer_given_Pos

0.10204081632653063

In [12]:
# The odds to actually have cancer given positive test:

odd_cancer_positive_test = P_Cancer_given_Pos / (1 - P_Cancer_given_Pos)

odd_cancer_positive_test

0.11363636363636366
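Both parts of the exercise use the same calculation with different inputs, so it can be wrapped in a function. The following is a sketch only: the function name and parameter names are hypothetical, and the prevalence values 0.20 and 0.50 are extra illustrative inputs that do not appear in the exercise.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' Theorem."""
    true_positives = sensitivity * prevalence                  # P(Pos | C) * P(C)
    false_positives = false_positive_rate * (1 - prevalence)   # P(Pos | not C) * P(not C)
    return true_positives / (true_positives + false_positives)

# Genetic test from part A, evaluated at several prevalence levels
for prevalence in (0.01, 0.05, 0.20, 0.50):
    p = posterior_given_positive(prevalence, sensitivity=0.9, false_positive_rate=0.096)
    print(f"prevalence = {prevalence:.2f} -> P(defect | positive) = {p:.3f}")
```

The loop makes the role of the prior visible: the test itself is unchanged, yet the probability of having the defect after a positive result rises sharply as the defect becomes more common.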

## The Bayesian framework

The Bayesian approach assumes that we always have a prior distribution even though it may be very vague, equiprobable, or even outright wrong.

- When we obtain new data, we update the prior distribution in light of the new data to get an updated probability distribution called the posterior distribution.


- The posterior distribution reflects our state of knowledge after collecting the data.

Bayesian inference is a big topic, so for now we will only cover a quick introduction (we will explore it further in a later module). Essentially, it gives us a way to update the probability of a hypothesis, $H$, in light of some body of data, $D$.

Rewriting Bayes' Theorem with $H$ and $D$ yields:


$$P(H\ |\ D)= \frac{P(D\ |\ H) \cdot P(H)}{P(D)}$$


where

- $P(H)$ is the probability of the hypothesis before we see the data, called the **prior** probability


- $P(H\ |\ D)$ is the probability of the hypothesis after we see the data, called the **posterior**


- $P(D\ |\ H)$ is the probability of the data under the hypothesis $H$, called the **likelihood**


- $P(D)$ is the probability of the data under any hypothesis, called the **evidence or normalizing constant**
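The prior-to-posterior update can be sketched as a small function in which today's posterior becomes tomorrow's prior. The example reuses the two bowls from the Cookie Problem; the second vanilla draw, and the assumption that the first cookie is put back before it, are hypothetical extensions added here for illustration. The function name `update` is likewise an assumption.

```python
from fractions import Fraction

def update(prior, likelihood):
    """One Bayesian update: posterior is proportional to prior * likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())   # P(D), the normalizing constant
    return {h: value / evidence for h, value in unnormalized.items()}

# Hypotheses, and the likelihood of drawing a vanilla cookie under each
prior = {"Bowl 1": Fraction(1, 2), "Bowl 2": Fraction(1, 2)}
p_vanilla = {"Bowl 1": Fraction(30, 40), "Bowl 2": Fraction(20, 40)}

after_one = update(prior, p_vanilla)       # first vanilla cookie observed
after_two = update(after_one, p_vanilla)   # second vanilla cookie (first one replaced)

print(after_one["Bowl 1"])   # 3/5
print(after_two["Bowl 1"])   # 9/13
```

Each additional vanilla cookie shifts more belief toward Bowl 1, which is the bowl under which vanilla is more likely.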

### M&Ms Example
![image.png](attachment:image.png)

*Image of M&Ms in all colours, including blue.* **Source**: Plain-M&Ms-Pile.jpg, (2010).

**Example** (Downey, 2012)

In 1995, blue M&Ms were introduced.

* Before 1995, the colour mix was 30% brown, 20% yellow, 20% red, 10% green, 10% orange, 10% tan.


* Afterwards, it was 24% blue, 20% green, 16% orange, 14% yellow, 13% red, 13% brown.

You have 2 bags. One is from 1994 and one from 1996 (but you don't know which is which). You draw one M&M from each bag. One is yellow and one is green. 

What is the probability that the yellow one came from the 1994 bag?

It is easy to calculate the probability of a colour being drawn from a specific bag. However, what we want to know is:

<br><center><b>If we draw a specific colour, what is the probability that it is coming from a specific bag?</b></center><br>

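Let Bag 1 be the bag the yellow M&M came from and Bag 2 the bag the green M&M came from. There are then two hypotheses:

- **Hypothesis A**: Bag 1 is from 1994, which implies that Bag 2 is from 1996.


- **Hypothesis B**: Bag 1 is from 1996, which implies that Bag 2 is from 1994.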

Before drawing, each bag is equally likely to be the 1994 bag or the 1996 bag, so each has probability $\frac{1}{2}$. The probabilities of drawing a green or a yellow M&M from each bag follow from the colour mixes:

$$P(bag_{1994})=\frac{1}{2}$$


$$P(bag_{1996})=\frac{1}{2}$$


$$P(green\ |\ bag_{1994})=\frac{10}{100}$$


$$P(yellow\ |\ bag_{1994})=\frac{20}{100}$$


$$P(green\ |\ bag_{1996})=\frac{20}{100}$$


$$P(yellow\ |\ bag_{1996})=\frac{14}{100}$$

| Hypothesis | Prior | Likelihood | Prior $\cdot$ Likelihood | Posterior |
| :-----------: | :-----------: | :--------------------: | :--------------------: | :--------------------: |
| H | P(H) | P(D\|H) | P(H) P(D\|H) | P(H\|D) |
| A | $\frac{1}{2}$ | $\frac{20}{100} \cdot \frac{20}{100}$ | 0.02 | 0.74 |
| B | $\frac{1}{2}$ | $\frac{14}{100} \cdot \frac{10}{100}$ | 0.007 | 0.26 |


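To calculate the **posterior** for Hypothesis A, we apply Bayes' formula, where the denominator $P(D)$ sums the prior $\cdot$ likelihood products of both hypotheses: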

$$
P(H|D)= \frac{P(D|H) \cdot P(H)}{P(D)}= \frac {\frac{20}{100} \cdot \frac{20}{100} \cdot \frac{1}{2}}{\frac{20}{100} \cdot \frac{20}{100} \cdot \frac{1}{2} + \frac{14}{100} \cdot \frac{10}{100} \cdot \frac{1}{2}}=0.74
$$
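The table can be rebuilt in a few lines of code. This is a minimal sketch using the colour mixes quoted above; the helper `p` and the dictionary names are illustrative choices, not taken from Downey's text.

```python
from fractions import Fraction

# Colour mixes in percent, as quoted in the example
mix_1994 = {"brown": 30, "yellow": 20, "red": 20, "green": 10, "orange": 10, "tan": 10}
mix_1996 = {"blue": 24, "green": 20, "orange": 16, "yellow": 14, "red": 13, "brown": 13}

def p(colour, mix):
    """Probability of drawing the given colour from a bag with the given mix."""
    return Fraction(mix[colour], 100)

priors = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
likelihoods = {
    # A: yellow came from the 1994 bag, green from the 1996 bag
    "A": p("yellow", mix_1994) * p("green", mix_1996),
    # B: yellow came from the 1996 bag, green from the 1994 bag
    "B": p("yellow", mix_1996) * p("green", mix_1994),
}

products = {h: priors[h] * likelihoods[h] for h in priors}   # prior * likelihood
evidence = sum(products.values())                            # P(D)
posteriors = {h: products[h] / evidence for h in products}

for h in priors:
    print(h, float(products[h]), round(float(posteriors[h]), 2))
```

In exact terms the posteriors are 20/27 and 7/27, which round to the 0.74 and 0.26 shown in the table.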

# Properties of Random Variables

## Expectation of a random variable

If a random variable $X$ has multiple possible outcomes, its expected value is the sum of each outcome's value multiplied by its probability. Let's illustrate this with an example.

A restaurant sells three things:

1. Pasta (10 dollars) – 50% of customers purchase it<br><br>

2. Pizza (8 dollars) – 40% of customers purchase it<br><br>

3. Salad (6 dollars) – 10% of customers purchase it
  
How much can we expect each customer to spend?

$E = $ expected value

$E = \$10\cdot 0.5 + \$8\cdot 0.4 + \$6\cdot 0.1$

$E = \$8.80$

So, if I serve 100 customers, I can expect to make \$880.

In general, if $X$ takes outcomes $x_1$, ..., $x_k$ with probabilities $P(X = x_1)$, ..., $P(X = x_k)$, the expected value of $X$ is the sum of each outcome multiplied by its corresponding probability:


$$ E(X)=x_1 \cdot P(X =x_1)+\cdots+x_k \cdot P(X =x_k) $$
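The general formula for $E(X)$ is a probability-weighted sum, which can be sketched as follows. The function name `expected_value` and the check that the probabilities sum to 1 are illustrative additions; the prices and purchase shares are those of the restaurant example.

```python
import math

def expected_value(outcomes, probabilities):
    """E(X): sum of each outcome multiplied by its probability."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))

prices = [10, 8, 6]        # pasta, pizza, salad (dollars)
shares = [0.5, 0.4, 0.1]   # fraction of customers buying each item

spend_per_customer = expected_value(prices, shares)
print(f"Expected spend per customer: ${spend_per_customer:.2f}")
print(f"Expected revenue from 100 customers: ${100 * spend_per_customer:.2f}")
```

Scaling to 100 customers works because expectation is linear: the expected total is simply 100 times the expected spend of one customer.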

## Variability in expectation

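The variance is the sum, over all outcomes, of the squared difference between each value and the mean, multiplied by that outcome's probability. The standard deviation (SD) is the square root of the variance. For the restaurant example: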

$E = \$10 \cdot 0.5 + \$8 \cdot 0.4 + \$6 \cdot 0.1 = \$8.8$

$V = (\$10-\$8.8)^2\cdot 0.5 + (\$8-\$8.8)^2\cdot0.4 + (\$6- \$8.8)^2\cdot 0.1$

$V = 0.72 + 0.256 + 0.784 = 1.76$

$SD = \$1.33$

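Assuming each customer's purchase is independent, the variances add: for 100 customers the total has variance $100 \cdot 1.76 = 176$ and standard deviation $\sqrt{176} \approx \$13.27$. The total revenue therefore typically falls within one standard deviation of \$880, that is, from about \$867 to \$893.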

In general, if $X$ takes outcomes $x_1$, ..., $x_k$ with probabilities $P(X = x_1)$, ..., $P(X = x_k)$ and expected value $\mu = E(X)$, then the variance of $X$, denoted by $\text{Var}(X)$ or the symbol $\sigma^2$, is:

$$
\sigma^2 =(x_1-\mu)^2 \cdot P(X=x_1)+\cdots+ (x_k-\mu)^2 \cdot P (X=x_k)
$$
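The variance formula can be sketched in code in the same way as the expectation. The function name `variance` is an illustrative choice; the inputs are the prices and purchase shares from the restaurant example.

```python
import math

def variance(outcomes, probabilities):
    """Var(X): probability-weighted sum of squared deviations from the mean."""
    mu = sum(x * p for x, p in zip(outcomes, probabilities))   # expected value
    return sum((x - mu) ** 2 * p for x, p in zip(outcomes, probabilities))

prices = [10, 8, 6]        # pasta, pizza, salad (dollars)
shares = [0.5, 0.4, 0.1]   # fraction of customers buying each item

var = variance(prices, shares)
sd = math.sqrt(var)        # standard deviation is the square root of the variance
print(f"Var(X) = {var:.2f}, SD = {sd:.2f}")   # Var(X) = 1.76, SD = 1.33
```

Note that the variance is in squared dollars, while the standard deviation is back in dollars and is therefore easier to compare with the expected value.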

**End of Module**

You have reached the end of this module.

If you have any questions, please reach out to your peers using the discussion boards. If you
and your peers are unable to come to a suitable conclusion, do not hesitate to reach out to
your instructor on the designated discussion board.

When you are comfortable with the content, and have practiced to your satisfaction, you may
proceed to any related assignments, and to the next module.

# References

Downey, A. (2012). Section 1.6: the M&M problem in *Think Bayes: Bayesian Statistics Made Simple, Version 1.0.9,* Green Tea Press. http://www.greenteapress.com/thinkbayes/html/index.html

Plain-M&Ms-Pile.jpg (2010). Retrieved Dec 5, 2018 from Wikimedia Commons. https://commons.wikimedia.org/wiki/File:Plain-M%26Ms-Pile.jpg Public Domain