# Lecture 3 - Conditional Probability, Total Probability

Summary from last class:

1. Defined a **probabilistic model**, a mathematical description to characterize *uncertainty*. A probabilistic model is the triple $(\Omega,\mathcal{F},P)$, where $\Omega$ is the sample space, $\mathcal{F}$ is the event class and $P$ is a real-valued function that maps each element of $\mathcal{F}$ to a number in $\mathbb{R}$.

2. Defined **axioms of probability**:
    * $\forall E\in\mathcal{F}, P(E)\geq 0$
    * $P(\Omega)=1$
    * $\forall E, F \in \mathcal{F}, P(E\cup F) = P(E) + P(F)$ if $E$ and $F$ are mutually exclusive, that is, $E\cap F =\emptyset$
    * If $A_1,A_2,\dots$ is a sequence of events such that $A_i\cap A_j = \emptyset, \forall i\neq j$, then

$$P\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k)$$

3. Derived **corollaries** from these axioms:
    * $P(A^c) = 1 - P(A)$
    * $P(A) \leq 1$
    * $P(\emptyset) = 0 $
    * $P(A\cup B) = P(A) + P(B) - P(A\cap B)$
    * If $A\subset B$, then $P(A)\leq P(B)$
    * If $A_1,A_2,\dots,A_n$ are pairwise mutually exclusive, then $P\left(\bigcup_{k=1}^n A_k\right) = \sum_{k=1}^n P(A_k)$. Proof is by induction.
    * $P\left(\bigcup_{k=1}^n A_k\right) = \sum_{j=1}^n P(A_j) - \sum_{j<k} P(A_j\cap A_k) + \dots + (-1)^{(n+1)}P(A_1\cap A_2 \cap \dots\cap A_n)$
    
4. Learned that an experiment is **fair** if every outcome is equally likely.

5. Defined probability as a measure of frequency of occurrence (**frequentist view**). When the sample space $\Omega$ has a finite number of equally likely outcomes, the probability of an event $E\subset \Omega$, is given by 
$$P(E) = \frac{|E|}{|\Omega|} = \frac{\text{number of elements in }E}{\text{number of elements in }\Omega}$$

___

# Discrete and Continuous Probabilistic Models

## Discrete Sequential Models

A **sequential model** is a type of experiment that has an inherent sequential character. For example:

* Flipping a coin 3 times
* Receiving eight successive digits at a communication receiver
* Observing the value of a stock on five successive days

Models of this type are also characterized by a probabilistic model that must obey the set of axioms and derived corollaries.

It is often useful to describe a sequential model experiment in a **tree-based sequential description**. 

**Let's use the virtual whiteboard to work through some examples to demonstrate the tree-based description of these types of experiments.**

**<font color=blue>Example 1:</font>** Consider the experiment where we flip a fair coin 2 times.

* What is the sample space? Remember that different elements of the sample space should be distinct and mutually exclusive.
* What is the probability for each possible outcome?
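As a cross-check of the tree for Example 1, the sample space can also be enumerated programmatically. This is only a minimal sketch: the names `outcomes` and `prob` are illustrative and not part of the lecture, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Each outcome is an ordered pair of flips, such as ('H', 'T').
outcomes = list(product("HT", repeat=2))

# The coin is fair, so every outcome is equally likely.
prob = {outcome: Fraction(1, len(outcomes)) for outcome in outcomes}

for outcome, p in prob.items():
    print("".join(outcome), p)
```

The outcomes are distinct and mutually exclusive, and their probabilities sum to 1, as the axioms require.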

**<font color=blue>Example 2:</font>** Consider the experiment where we roll a 6-sided fair die 2 times and the event $E\equiv$observing a 1 or 2 on either roll.

* What is the sample space?
* What are the outcomes of event $E$?
* What is the probability of event $E$?

In [1]:
import random
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

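In [None]:
num_sims = 100_000
event = 0
for _ in range(num_sims):
    roll1 = random.randint(1, 6)  # randint includes both endpoints
    roll2 = random.randint(1, 6)
    if roll1 <= 2 or roll2 <= 2:  # a 1 or 2 on either roll
        event += 1

print("Probability of getting a 1 or 2 on either roll \
when rolling a fair 6-sided die twice= ",
     event/num_sims)
print('True probability is ', 1 - (4/6)**2)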

In [None]:
# Alternatively



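In [None]:
# Vectorized version: one row per simulation, one column per roll
num_sims = 100_000
rolls = npr.randint(1, 7, size=(num_sims, 2))  # upper bound is exclusive
event = np.sum((rolls <= 2).any(axis=1))       # a 1 or 2 on either roll

print("Probability of getting a 1 or 2 on either roll \
when rolling a fair 6-sided die twice= ",
     event/num_sims)
print('True probability is ', 1 - (4/6)**2)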

**<font color=blue>Example 3:</font>** Consider the experiment where we roll a 6-sided fair die 2 times and the event $E\equiv$at least one roll is 4.

* What is the sample space?
* What are the outcomes of event $E$?
* What is the probability of event $E$?

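In [None]:
num_sims = 100_000
rolls = npr.randint(1, 7, size=(num_sims, 2))  # upper bound is exclusive
event = np.sum((rolls == 4).any(axis=1))       # at least one roll is 4

print("Probability of getting at least one 4 when rolling a fair 6-sided die twice= ",
     event/num_sims)
print('True probability is ', 11/36)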

**<font color=blue>Example 4:</font>** Consider the experiment where we flip a fair coin 3 times and the event $E\equiv$observing heads in the 2nd flip.

* What is the sample space?
* What are the outcomes of event $E$?
* What is the probability of event $E$?

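In [None]:
num_sims = 100_000
flips = npr.randint(0, 2, size=(num_sims, 3))  # 1 = heads, 0 = tails
event = np.sum(flips[:, 1] == 1)               # column 1 holds the 2nd flip

print("Probability of flipping a fair coin 3 times and observing heads in the 2nd flip= ",
     event/num_sims)
print('True probability is ', 1/2)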

**<font color=blue>Example 5:</font>** Consider the experiment where we flip a fair coin 3 times, the sub-experiment of counting how many times it came up heads, and the event $E\equiv$it came up heads 2 times.

* What is the sample space of the experiment?
* What is the sample space of the sub-experiment?
* What are the outcomes of this sub-experiment?
* What is the probability of event $E$?

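In [None]:
num_sims = 100_000
flips = npr.randint(0, 2, size=(num_sims, 3))  # 1 = heads, 0 = tails
event = np.sum(flips.sum(axis=1) == 2)         # exactly 2 heads in 3 flips

print("Probability of flipping a fair coin 3 times and observing 2 heads = ",
     event/num_sims)
print('True probability is ', 3/8)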

## Continuous Models

Continuous models differ from discrete ones in that the probabilities of single-element events may not be sufficient to characterize the probability law.

**For example:** a wheel of fortune is continuously calibrated from 0 to 1, so the possible outcomes of an experiment consisting of a single spin are the numbers in the interval $\Omega=[0,1]$. 
* Assuming a fair wheel, it is appropriate to consider all outcomes equally likely, but what is the probability of the event consisting of a single element? say 0.472927028..?
* If every single-element event had the same positive probability $p>0$, then, since distinct single-element events are mutually exclusive, the 3rd axiom of probability (also called the **Additivity axiom**) would imply that any event with more than $1/p$ elements has probability larger than 1.
* Therefore, the probability of any event that consists of a single element **must** be 0.
* It makes sense then to assign a probability of $b-a$ to any sub-interval $[a,b]$ of $[0,1]$, and to calculate the probability of a more complicated set by evaluating its "length". This satisfies all axioms of probability and qualifies as a legitimate probability model.
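
The wheel-of-fortune model can be explored numerically. The sketch below is an illustration under stated assumptions, not part of the lecture: the sub-interval endpoints `a` and `b`, the seed, and the sample size are arbitrary choices, and NumPy's uniform generator samples from $[0,1)$ rather than $[0,1]$, which makes no practical difference here.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is reproducible
spins = rng.random(1_000_000)    # uniform samples on [0, 1)

a, b = 0.2, 0.5                  # illustrative sub-interval [a, b]
in_interval = np.mean((spins >= a) & (spins <= b))

# Fraction of spins that land exactly on one particular number
exact_hits = np.mean(spins == 0.472927028)

print("Estimated P([a, b]) =", in_interval, "; interval length =", b - a)
print("Fraction of spins equal to 0.472927028:", exact_hits)
```

The relative frequency of landing in the sub-interval should be close to its length, while hits on any single pre-specified number are essentially never observed.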

___

## Introduction to Conditional Probability

Consider the following scenarios:

**<font color=blue>Example 6:</font> A magician has in her pocket a fair coin and a two-headed coin. She chooses one at random and flips it. What is the probability that the result is heads?**

* Is this experiment fair? <!-- No, because the probability of each outcome (H, T) are not equal. P(H)=3/4 and P(T)=1/4 -->

Let's compute this probability on the virtual whiteboard.

Let's build a simulation to answer this:
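
One possible simulation is sketched below. It assumes each coin can be represented as a list of its two faces and that a flip picks one face uniformly at random; the variable names and the fixed seed are illustrative choices.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

num_sims = 100_000
heads = 0
for _ in range(num_sims):
    # Pick one of the two coins at random: fair or two-headed
    coin = random.choice([["H", "T"], ["H", "H"]])
    # Flip the chosen coin once
    if random.choice(coin) == "H":
        heads += 1

p_heads = heads / num_sims
print("Estimated P(H) =", p_heads)
print("Analytical P(H) =", 0.5 * 0.5 + 0.5 * 1.0)
```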

**<font color=blue>Example 7:</font> Suppose that she chooses a coin at random. Using that coin, she flips it once and observes heads. What is the probability of observing heads in the second flip (using the same coin) if we observed heads in the first flip?**

* How can we visualize and compute the analytical probability of this event?

Let's compute this probability on the virtual whiteboard.

Let's build a simulation to answer this:
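
A sketch of one such simulation follows, using the same illustrative coin representation (a list of two faces). It keeps only the runs whose first flip is heads and reports the fraction of those runs whose second flip is also heads.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

num_sims = 100_000
first_heads = 0  # runs in which the first flip is heads
both_heads = 0   # runs in which both flips are heads
for _ in range(num_sims):
    coin = random.choice([["H", "T"], ["H", "H"]])
    flip1 = random.choice(coin)
    flip2 = random.choice(coin)  # the same coin is used for both flips
    if flip1 == "H":
        first_heads += 1
        if flip2 == "H":
            both_heads += 1

p_h2_given_h1 = both_heads / first_heads
print("Estimated probability of heads on flip 2 given heads on flip 1 =",
      p_h2_given_h1)
```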

This probability is called **conditional probability** as it provides us a way to reason about the outcome of an experiment, based on **partial information**.

For Example 7, let $H_i$ denote the event of heads on flip $i$. We are asking for the **probability of $H_2$ given that $H_1$ occurred**, that is,

$$P(H_2 | H_1) = \frac{5}{6}$$

Consider the Venn diagram:

In [None]:
from IPython.display import Image
Image("figures/condProb1.png", width=500)

If we **condition** on $B$ having occurred, then we can form the new Venn diagram:

In [None]:
Image("figures/condProb2.png", width=300)

This diagram suggests that if $A\cap B=\emptyset$ then if $B$ occurs, $A$ could not have occurred.

Similarly if $B\subset A$, then if $B$ occurs, the diagram suggests that $A$ must have occurred.

A definition of conditional probability that agrees with these and other observations is:

<div class="alert alert-info" role="alert">
  <strong>Conditional Probability</strong>
    
For $A\in\mathcal{F}$, $B\in\mathcal{F}$, the **conditional probability** of event $A$ *given* event $B$ occurred is
    
$$P(A|B) = \frac{P(A\cap B)}{P(B)},\text{ for }P(B)>0$$ 
</div>

**Claim: If $P(B)>0$, the conditional probability $P(\bullet|B)$ satisfies the axioms on the original sample space $(\Omega,\mathcal{F},P(\bullet|B))$.**

<div class="alert alert-info" role="alert">
For a fixed event $B$ with $P(B)>0$, the conditional probabilities $P(A|B)$ form a legitimate probability law that satisfies the three axioms!
</div>

<div class="alert alert-warning">
    <b>Relating Conditional and Unconditional Probabilities</b>
    
Which of the following statements is true?

1. $P(A|B) \geq P(A)$
2. $P(A|B) \leq P(A)$
3. Not necessarily 1 or 2
</div>

**<font color=blue>Example 8:</font> A computer lab contains**

* **two computers from manufacturer A, one of which is defective**
* **three computers from manufacturer B, two of which are defective**
    
**A user sits down at a computer at random. Let the properties of the computer she sits down at be denoted by a two letter code, where the first letter is the manufacturer and the second letter is D for a defective computer or N for a non-defective computer. (We add a subscript to differentiate computers with the same two-letter code.)**

* What is the sample space?

$$\Omega = \{AD, AN, BD_1, BD_2, BN\}$$

Let
* $E_A$ be the event that the selected computer is from manufacturer A
* $E_B$ be the event that the selected computer is from manufacturer B
* $E_D$ be the event that the selected computer is defective

Let's find

$$P(E_A) = $$

$$P(E_B) = $$

$$P(E_D) = $$

Now, suppose that I select a computer and tell you its manufacturer. Does that influence the probability that the computer is defective?

* For example, I tell you the computer is from manufacturer A. Then what is the probability that it is defective?


$$P(E_D|E_A) = $$

* Let's find:

$$P(E_D | E_B) = $$

$$P(E_A | E_D) = $$

$$P(E_B | E_D) = $$
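
Because all five computers are equally likely to be chosen, these quantities can be checked by counting. The sketch below is one way to do so; the helper names `prob` and `cond_prob` are hypothetical, and outcomes are written as plain strings such as `"BD1"` for $BD_1$.

```python
from fractions import Fraction

omega = ["AD", "AN", "BD1", "BD2", "BN"]  # equally likely outcomes

def prob(event):
    """P(event) for an event given as a set of outcomes."""
    return Fraction(len(event), len(omega))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)

E_A = {w for w in omega if w[0] == "A"}  # manufacturer A
E_B = {w for w in omega if w[0] == "B"}  # manufacturer B
E_D = {w for w in omega if w[1] == "D"}  # defective

print("P(E_A) =", prob(E_A), " P(E_B) =", prob(E_B), " P(E_D) =", prob(E_D))
print("P(E_D|E_A) =", cond_prob(E_D, E_A))
print("P(E_D|E_B) =", cond_prob(E_D, E_B))
print("P(E_A|E_D) =", cond_prob(E_A, E_D))
print("P(E_B|E_D) =", cond_prob(E_B, E_D))
```

Comparing `cond_prob(E_D, E_A)` and `cond_prob(E_D, E_B)` with `prob(E_D)` shows that conditioning can either lower or raise a probability.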

## Conditional Probability for Discrete Sample Spaces with Equal Probabilities

Consider again our simulation of the magician's coins.

There, we directly estimated the proportion of runs in which the second flip was heads among those runs where the first flip was heads. That is, we did not use the definition of conditional probability, which involves a ratio of probabilities.

How does that work out?

Let $H_i$ be the event that the outcome of the $i$th flip was heads. We were trying to estimate $P(H_2|H_1)$.

If we were to use the definition of conditional probability, then we would find this as

$$P(H_2|H_1) = \frac{P(H_2\cap H_1)}{P(H_1)}$$

If we didn't know how to solve these analytically, we could estimate them by their relative frequencies. Let:

* $N$ be the number of simulations,
* $N_1$ be the number of simulations in which the first flip is heads, and
* $N_{12}$ be the number of simulations in which both flips are heads.

Then

\begin{align*}
P(H_2|H_1) &\approx \frac{N_{12}/N}{N_1/N} \\
&= \frac{N_{12}}{N_1}
\end{align*}

**<font color=blue>Example 9:</font> XOR of two independent binary values.**

**Flip a fair coin with sides labeled '0' and '1' two times.**
* **Let $E_i$ denote a '1' on the top face on flip $i$.**
* **Let $F_i$ denote a '0' on the top face on flip $i$.**
* **Let $G$ denote the event that the XOR of the values observed on the top faces on the two flips is '1'.**

Compute:

$$P(E_1) = $$
$$P(E_2) = $$
$$P(G) = $$

$$P(E_1|E_2) = $$
$$P(E_2|E_1) = $$
$$P(G|E_1) = $$

and

$$P(G|E_1\cap E_2) = $$
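
The four outcomes of Example 9 are equally likely, so each quantity can be verified by enumeration. This is a minimal sketch with hypothetical helper names (`prob`, `cond_prob`); outcomes are stored as pairs `(flip 1, flip 2)`.

```python
from fractions import Fraction
from itertools import product

omega = list(product([0, 1], repeat=2))  # (flip 1, flip 2), equally likely

def prob(event):
    """P(event) for an event given as a set of outcomes."""
    return Fraction(len(event), len(omega))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)

E1 = {w for w in omega if w[0] == 1}        # '1' on flip 1
E2 = {w for w in omega if w[1] == 1}        # '1' on flip 2
G = {w for w in omega if w[0] ^ w[1] == 1}  # XOR of the two flips is '1'

print("P(E1) =", prob(E1), " P(E2) =", prob(E2), " P(G) =", prob(G))
print("P(E1|E2) =", cond_prob(E1, E2), " P(E2|E1) =", cond_prob(E2, E1))
print("P(G|E1) =", cond_prob(G, E1))
print("P(G|E1 and E2) =", cond_prob(G, E1 & E2))
```

Knowing one flip alone leaves the probability of $G$ unchanged, whereas knowing both flips determines $G$ completely.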

# Chain Rule - Using Conditional Probability to Decompose Events

Let's use the **virtual whiteboard** to depict how we can decompose conditional probability using a tree-based sequential representation.

In general, note that:

$$P(A|B) = \frac{P(A\cap B)}{P(B)}$$
$$\Rightarrow P(A\cap B) = P(A|B)P(B)$$

and

$$P(B|A) = \frac{P(A\cap B)}{P(A)}$$
$$\Rightarrow P(A\cap B) = P(B|A)P(A)$$

These equations $P(A\cap B) = P(A|B)P(B)$ and $P(A\cap B) = P(B|A)P(A)$ are known as **chain rules** for expanding the probability of the intersection of two events. 

* The chain rule can be easily generalized to more than two events:

\begin{align}
P(A\cap B\cap C) &= P(A)P(B|A)P(C|A\cap B) \\
&= P(A) \cdot\frac{P(A\cap B)}{P(A)} \cdot\frac{P(A\cap B\cap C)}{P(A\cap B)}
\end{align}

Similarly,

$$P(A\cap B\cap C) = P(A|B\cap C)P(B|C)P(C)$$

<div class="alert alert-info" role="alert">
  <strong>Multiplication Rule</strong>
    
Assuming that all of the conditioning events have positive probability, we have
    
$$P\left(\bigcap_{i=1}^n A_i\right) = P(A_1)P(A_2|A_1)P(A_3|A_1\cap A_2)\dots P\left(A_n| \cap_{i=1}^{n-1} A_i\right)$$
</div>
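
As an illustration (not worked in the original notes), the rule can be applied to the magician's coins. Let $F$ be the event that the fair coin is chosen and $H_i$ the event of heads on flip $i$, and assume that flips of the fair coin do not influence one another, so that $P(H_2|F\cap H_1)=\frac{1}{2}$:

\begin{align*}
P(F\cap H_1\cap H_2) &= P(F)\,P(H_1|F)\,P(H_2|F\cap H_1)\\
&= \frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{8}
\end{align*}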

# Total Probability Theorem (also known as the Law of Total Probability)

A collection of events $A_1, A_2, \dots$ **partitions** the sample space $\Omega$ *if and only if*

$$\Omega = \bigcup_i A_i$$

and $A_i\cap A_j = \emptyset, i\neq j$, i.e., they are disjoint events.

$\{A_i\}$ is also said to be a **partition** of $\Omega$.

1. Let's visualize this partition using the Venn diagram. (**virtual whiteboard**)

2. Let's also use a Venn diagram to express an arbitrary set using a partition of $\Omega$ (**virtual whiteboard**)

___

**<font color=blue>Example 10:</font> Suppose you have an urn containing 7 red and 3 blue balls. You draw three balls at random. On each draw, if the ball is red you set it aside and if the ball is blue you put it back in the urn. What is the probability that the second draw is blue? (If you get a blue ball it counts as a draw even though you put it back in the urn.)**

<!-- Let's first give names to possible events. Let $B_i$ be a blue ball in the $i^{th}$ draw. And, let $R_i$ be a red ball in the $i^{th}$ draw.

We know that the cardinality of blue balls is $|B|=3$ and the cardinality of red balls is $|R| = 7$.

Now, we want to compute $P(B_2)$. There are only 2 cases where a blue ball is drawn in the 2nd draw: $B_1B_2$ and $R_1B_2$.

So, the event $B_2 = (B_1\cap B_2) \cup (R_1\cap B_2)$. And the events $B_1\cap B_2$ and $R_1\cap B_2$ are mutually exclusive! 

Then we can compute:

\begin{align*}
P(B_2) &= P(B_1\cap B_2) + P(R_1\cap B_2) \\
&= P(B_2|B_1)P(B_1) + P(B_2|R_1)P(R_1)\text{, using the chain rule} \\
&= \frac{3}{10}\times\frac{3}{10} + \frac{3}{9}\times\frac{7}{10}\\
&\approx 0.323
\end{align*} -->
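
The urn experiment of Example 10 can also be simulated. The sketch below assumes each ball in the urn is equally likely to be drawn; the counters `red` and `blue`, the seed, and the number of simulations are illustrative choices.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

num_sims = 100_000
second_blue = 0
for _ in range(num_sims):
    red, blue = 7, 3  # urn contents at the start of each run
    for draw in range(1, 4):
        is_blue = random.random() < blue / (red + blue)
        if is_blue:
            # A blue ball is returned, so the urn contents are unchanged
            if draw == 2:
                second_blue += 1
        else:
            red -= 1  # a red ball is set aside

p_second_blue = second_blue / num_sims
print("Estimated P(second draw is blue) =", p_second_blue)
print("Conditioning on the first draw gives", 3/10 * 3/10 + 3/9 * 7/10)
```

The second printed value conditions on the colour of the first draw, which is the decomposition formalized in the theorem that follows.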

___

<div class="alert alert-info" role="alert">
  <strong>Total Probability Theorem</strong>
    
Also called **Total Probability Law**; if the set of events $\{A_i\}$ partitions $\Omega$, then

$$P(B) = \sum_i P(B|A_i)P(A_i)$$
</div>

* Total probability is often used in problems where there is a **hidden state**.

    * It is commonly used in Machine Learning when describing generative models.

* The problem with the magician with the two coins (fair and 2-headed) and computing the probability of heads is such a problem: what is the hidden state?


* When applying chain rule in such problems, we are often conditioning on the different possibilities of the hidden state.

___
**<font color=blue>Example 11:</font> A magician has two coins, one fair and one 2-headed. Consider the experiment where she picks one coin at random and flips it repeatedly. Let $H_i$ denote the event that the outcome of flip $i$ is heads. Using the Total Probability Law, find the following:**

1. $P(H_1)$

<!-- \begin{align*}
P(H_1) &= P(H_1|F)P(F) + P(H_1|\overline{F})P(\overline{F})\\
&= \frac{1}{2}\cdot\frac{1}{2} + 1\cdot\frac{1}{2}\\
&=\frac{3}{4}
\end{align*} -->

2. $P(H_1\cap H_2)$

<!-- \begin{align*}
P(H_1\cap H_2) &= P(H_1\cap H_2|F)P(F) + P(H_1\cap H_2|\overline{F})P(\overline{F})\\
&= \frac{1}{4}\cdot\frac{1}{2} + 1\cdot \frac{1}{2}\\
&= \frac{5}{8}
\end{align*} -->
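
The Total Probability Law for this example can be evaluated exactly by summing over the hidden state, that is, which coin was picked. This is a sketch under the assumption that successive flips are independent once the coin is fixed; the dictionary `coins` and its layout are illustrative.

```python
from fractions import Fraction

# Hidden state: which coin was picked.
# Each entry maps a coin to (P(coin), P(heads on one flip | coin)).
coins = {
    "fair": (Fraction(1, 2), Fraction(1, 2)),
    "two-headed": (Fraction(1, 2), Fraction(1)),
}

# Total probability: sum over the hidden state of P(event | coin) * P(coin).
p_h1 = sum(p_coin * p_heads for p_coin, p_heads in coins.values())
p_h1_h2 = sum(p_coin * p_heads**2 for p_coin, p_heads in coins.values())

print("P(H1) =", p_h1)
print("P(H1 and H2) =", p_h1_h2)
print("P(H2 | H1) =", p_h1_h2 / p_h1)
```

The last line combines the two results through the definition of conditional probability and reproduces the value $P(H_2|H_1)=\frac{5}{6}$ stated earlier.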