# Joint Probability (Discrete)

```python
# Import some helper functions (please ignore this!)
from utils import *
```

**Context:** So far, you've spent some time conducting a preliminary exploratory data analysis (EDA) of IHH's ER data. You noticed that considering variables separately can result in misleading information. As a result, you decided to use *conditional distributions* to model the *relationship between variables*. Using these conditional distributions, you were able to develop *predictive models* (e.g. predicting the probability of intoxication given the day of the week). These predictive models help the IHH administration make decisions.

However, you've noticed that your modeling toolkit is still limited. The conditional distributions we introduced can model how the probability of one variable changes given a *set* of variables. What if we wanted to describe how the probability of a *set* of variables (i.e. more than one) changes given a *set* of variables? For example, we may want to answer questions like: "how does the probability that a patient is hospitalized for an allergic reaction change given the day of the week?" In this question, we're inquiring about two variables---that the condition is an allergic reaction, *and* that the patient was hospitalized---given the day of the week.

**Challenge:** We need to expand our modeling toolkit to include yet another tool---joint probabilities. 

**Outline:**
1. Introduce and practice the concepts, terminology, and notation behind discrete joint probability distributions (leaving continuous distributions to a later time).
2. Introduce a graphical representation to describe joint distributions.
3. Translate this graphical representation directly into code in a probabilistic programming language (using `NumPyro`) that we can then use to fit the data.

## Terminology and Notation

We again introduce the statistical language (terminology and notation) needed to precisely specify to a computer how to model our data. We will then translate statements in this language directly into code in `NumPyro` that a computer can run.

**Concept:** The concept behind a joint probability is elegant; it allows us to build complicated distributions over many variables using simple conditional and non-conditional distributions (that we already covered). 

We can illustrate this using an example with just two variables. Suppose you have two RVs, $A$ and $B$. The probability that $A = a$ and $B = b$ are *both* satisfied is called their *joint probability*. It is denoted by $p_{A, B}(a, b)$. This joint distribution can be *factorized* into a product of conditional and non-conditional (or "marginal") distributions as follows:
\begin{align*}
p_{A, B}(a, b) &= p_{A | B}(a | b) \cdot p_B(b) \quad \text{(Option 1)} \\
\underbrace{\phantom{p_{A, B}(a, b)}}_{\text{joint}} &= \underbrace{p_{B | A}(b | a)}_{\text{conditional}} \cdot \underbrace{p_A(a)}_{\text{marginal}} \quad \text{(Option 2)}
\end{align*}
Notice that the joint is now described in terms of conditional and marginal distributions, which we already know how to work with! 
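
To make the factorization concrete, here is a minimal sketch that builds a joint table from a marginal and a conditional (Option 1), then checks that Option 2 yields the same table. The probabilities and the names `p_B`, `p_A_given_B`, and `joint` are made up for illustration, and exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen only for illustration.
# Marginal p_B(b) over B in {0, 1}.
p_B = {0: F(3, 4), 1: F(1, 4)}

# Conditional p_{A|B}(a | b), stored as p_A_given_B[b][a]; each row sums to 1.
p_A_given_B = {
    0: {0: F(2, 3), 1: F(1, 3)},
    1: {0: F(1, 5), 1: F(4, 5)},
}

# Option 1: p_{A,B}(a, b) = p_{A|B}(a | b) * p_B(b).
joint = {(a, b): p_A_given_B[b][a] * p_B[b] for b in p_B for a in p_A_given_B[b]}

# Recover the two pieces of Option 2 from the joint table.
p_A = {a: sum(joint[(a, b)] for b in p_B) for a in (0, 1)}
p_B_given_A = {a: {b: joint[(a, b)] / p_A[a] for b in p_B} for a in p_A}

# Option 2: p_{A,B}(a, b) = p_{B|A}(b | a) * p_A(a).
joint_option_2 = {(a, b): p_B_given_A[a][b] * p_A[a] for a in p_A for b in p_B}

print(sum(joint.values()) == 1)  # a valid joint sums to 1
print(joint == joint_option_2)   # both factorizations describe the same joint
```

Both checks should hold for any valid choice of `p_B` and `p_A_given_B`, not only for the numbers used here.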

**Intuition:** So what's the intuition behind this formula? Let's depict events $A$ and $B$ as follows:

<img align="center" width="500px" src="figs/joint-probability-venn.png" />

Using the above diagram, we can pictorially represent all distributions of interest. The marginal $p_B(b)$ is the ratio of the area of the blue circle to the area of the whole space (the gray square):

<img align="center" width="300px" src="figs/joint-probability-eq-marginal.png" />

The conditional $p_{A | B}(a | b)$ is the ratio of the purple intersection relative to the blue circle. This is because the blue circle represents us conditioning on $B = b$, and the intersection of the circles represents the observations for which we *also* have $A = a$.

<img align="center" width="300px" src="figs/joint-probability-eq-conditional.png" />

Finally, the joint $p_{A, B}(a, b)$ is the ratio between the purple intersection and the whole space (the gray square). This is because the intersection is the region where both $A = a$ and $B = b$ hold.

<img align="center" width="300px" src="figs/joint-probability-eq-joint.png" />

Now we can see that the joint is the product of the conditional and the marginal because the blue circles "cancel out":

<img align="center" width="500px" src="figs/joint-probability-eq-joint-expanded.png" />
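
The same cancellation can be checked by counting. The sketch below treats each ratio of areas as a ratio of counts over a small set of observations; the data and variable names are invented for illustration and are not taken from the IHH data.

```python
from fractions import Fraction

# Hypothetical observations of (value of A, value of B); made up for illustration.
observations = [
    ("yes", "yes"), ("yes", "yes"), ("no", "yes"), ("no", "yes"),
    ("yes", "no"), ("no", "no"), ("no", "no"), ("no", "no"),
    ("no", "no"), ("no", "no"),
]

a, b = "yes", "yes"  # the events A = a and B = b

n_total = len(observations)                                      # gray square
n_b = sum(1 for (_, y) in observations if y == b)                # blue circle
n_a_and_b = sum(1 for obs in observations if obs == (a, b))      # purple intersection

marginal_b = Fraction(n_b, n_total)            # p_B(b)
conditional_a_given_b = Fraction(n_a_and_b, n_b)  # p_{A|B}(a | b)
joint_ab = Fraction(n_a_and_b, n_total)        # p_{A,B}(a, b)

# The count n_b appears once in a numerator and once in a denominator,
# so it cancels in the product, just like the blue circles in the figure.
print(conditional_a_given_b * marginal_b == joint_ab)
```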

**Choice of Factorization:** Lastly, notice that we have a *choice* between two ways of factorizing the distribution. How do you know which one to use? Typically, we choose a factorization that is *intuitive to us* and whose terms we can actually compute.
> For example, suppose you want to model the joint distribution of the day of the week, $D$, and whether a patient arrives with intoxication, $I$. The joint distribution can be factorized in two ways:
> \begin{align*}
p_{D, I}(d, i) &= p_{I | D}(i | d) \cdot p_D(d) \quad \text{(Option 1)} \\
&= p_{D | I}(d | i) \cdot p_I(i) \quad \text{(Option 2)}
\end{align*}
> Which one makes more intuitive sense? Well, it's a little weird to try to predict the day of the week given whether a patient arrives with intoxication; we typically know what the day of the week is and we don't need to predict it. In contrast, given the day of the week, it makes a lot of sense to wonder about the probability of a patient arriving with intoxication. As such, Option 1 makes more sense here. 
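
One practical consequence of Option 1 is that it suggests an order for generating data: first draw a day from $p_D$, then draw the intoxication indicator from $p_{I | D}$ given that day. The sketch below uses plain Python rather than `NumPyro`, and all probabilities in it are hypothetical placeholders, not estimates from the IHH data.

```python
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical marginal p_D(d): every day equally likely.
p_D = {d: 1 / 7 for d in DAYS}

# Hypothetical conditional p_{I|D}(i = 1 | d), one value per day.
p_I_given_D = {"Mon": 0.02, "Tue": 0.02, "Wed": 0.03, "Thu": 0.04,
               "Fri": 0.10, "Sat": 0.15, "Sun": 0.08}

def joint_prob(d, i):
    """p_{D,I}(d, i) = p_{I|D}(i | d) * p_D(d)  (Option 1)."""
    p_i = p_I_given_D[d] if i == 1 else 1 - p_I_given_D[d]
    return p_i * p_D[d]

def sample_joint(rng):
    """Draw (d, i): first d from p_D, then i from p_{I|D}(. | d)."""
    d = rng.choices(DAYS, weights=[p_D[x] for x in DAYS])[0]
    i = 1 if rng.random() < p_I_given_D[d] else 0
    return d, i

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_joint(rng) for _ in range(20_000)]

# The fraction of samples equal to ("Sat", 1) should be close to the joint.
empirical = sum(1 for s in samples if s == ("Sat", 1)) / len(samples)
print(joint_prob("Sat", 1), empirical)
```

The two printed numbers should be close but not identical, since the second is an estimate from a finite number of samples.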

**Generalizing to More than Two RVs:** So now we have the tools to work with joint distributions with two RVs. What do we do if we have three or more? The same ideas apply. The joint distribution for random variables $A$, $B$, and $C$ can be factorized in a number of ways. For example, we can condition on two variables at a time:
\begin{align*}
p_{A, B, C}(a, b, c) &= p_{A | B, C}(a | b, c) \cdot p_{B, C}(b, c) \quad \text{(Option 1)} \\
&= p_{B | A, C}(b | a, c) \cdot p_{A, C}(a, c) \quad \text{(Option 2)} \\
&= p_{C | A, B}(c | a, b) \cdot p_{A, B}(a, b) \quad \text{(Option 3)}
\end{align*}
Here, we already know how to factorize $p_{B, C}(b, c)$, $p_{A, C}(a, c)$, and $p_{A, B}(a, b)$.
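
For instance, substituting the two-variable factorization $p_{B, C}(b, c) = p_{B | C}(b | c) \cdot p_C(c)$ into Option 1 writes the joint entirely in terms of conditionals and a single marginal:
\begin{align*}
p_{A, B, C}(a, b, c) &= p_{A | B, C}(a | b, c) \cdot p_{B, C}(b, c) \\
&= p_{A | B, C}(a | b, c) \cdot p_{B | C}(b | c) \cdot p_C(c)
\end{align*}
Choosing the other two-variable factorization, $p_{B, C}(b, c) = p_{C | B}(c | b) \cdot p_B(b)$, is equally valid and gives $p_{A | B, C}(a | b, c) \cdot p_{C | B}(c | b) \cdot p_B(b)$ instead.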

We can also condition on one variable at a time:
\begin{align*}
p_{A, B, C}(a, b, c) &= p_{A, B | C}(a, b | c) \cdot p_C(c) \quad \text{(Option 1)} \\
&= p_{A, C | B}(a, c | b) \cdot p_B(b) \quad \text{(Option 2)} \\
&= p_{B, C | A}(b, c | a) \cdot p_A(a) \quad \text{(Option 3)}
\end{align*}
And how do we further factorize distributions of the form $p_{A, B | C}(a, b | c)$? We apply the same factorization for a joint distribution with two variables, and simply add a "conditioned on $C$" to each one:
\begin{align*}
p_{A, B | C}(a, b | c) &= p_{A | B, C}(a | b, c) \cdot p_{B | C}(b | c) \quad \text{(Option 1)} \\
&= p_{B | A, C}(b | a, c) \cdot p_{A | C}(a | c) \quad \text{(Option 2)}
\end{align*}
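
As a sanity check, the sketch below verifies Option 1 of this conditional factorization numerically on a small three-variable joint. The joint table is arbitrary (the weights are invented for illustration), and the helper `marginal` is a hypothetical name introduced here, not part of any library.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint p_{A,B,C}(a, b, c) over binary variables, stored as
# joint[(a, b, c)]; the weights are arbitrary and normalized to sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {abc: F(w, total) for abc, w in zip(product((0, 1), repeat=3), weights)}

def marginal(keep):
    """Sum the joint over every variable whose index is not in `keep`."""
    out = {}
    for abc, p in joint.items():
        key = tuple(abc[k] for k in keep)
        out[key] = out.get(key, 0) + p
    return out

p_C = marginal([2])      # p_C(c), keyed by (c,)
p_BC = marginal([1, 2])  # p_{B,C}(b, c), keyed by (b, c)

checks = []
for (a, b, c), p_abc in joint.items():
    lhs = p_abc / p_C[(c,)]                 # p_{A,B|C}(a, b | c)
    p_a_given_bc = p_abc / p_BC[(b, c)]     # p_{A|B,C}(a | b, c)
    p_b_given_c = p_BC[(b, c)] / p_C[(c,)]  # p_{B|C}(b | c)
    checks.append(lhs == p_a_given_bc * p_b_given_c)

print(all(checks))  # the identity holds for every (a, b, c)
```

The check passes for any joint table with nonzero entries, because $p_{B, C}(b, c)$ cancels in the product in the same way the blue circles did in the two-variable picture.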

**Sampling from Joint Distributions:**

## Translating Math to Code with `NumPyro`

**What is `NumPyro`?**

**Distributions in `NumPyro`.**

**Random number generators in `NumPyro`.**

**Putting it all together.**