# Bayes' Theorem

## Motivating Examples - Limitations of Machine Learning and (frequentist) Statistics

(1) Consider a model for detection of fraud in financial transactions. We trained and validated such a model using (labelled) data from the past. The model is deployed in production and predicts a given transaction to be fraud. Knowing our model's True and False Positive Rates, and having an estimate of the frequency of fraud, what is the _probability_ that this transaction is actually fraudulent?

(2) Suppose we observe a few thousand heart rate measurements from a person's last running activity. What is the _probability_ (interval) that the observed mean and variance in this sample represent this person's true distribution of exercise intensity? How does this inference change if we have prior knowledge (e.g. population data)?

Common ML and (frequentist) statistics tools lack standard ways to:

- Express uncertainty about inference of parameters,
- Express uncertainty about predictions,
- Use **and make explicit** (subjective) prior knowledge for inference.


## The basics: Probability Mass/Density Functions

A **random variable** $X$ can be:

### Discrete:

<img src="https://momath.org/wp-content/uploads/2015/09/urn2_small.png" alt="Example: draw a colored ball from an urn (momath.org)" width=300/>


$X \in \{Red, Black\}$, with **probability mass function (PMF)**

<div style="font-size: 2em">
$$
p(x) = 
\begin{cases}
0.8,&x = Red\\
0.2,&x = Black\\
0,&otherwise
\end{cases}
$$
</div>

In [None]:
def pmf(x):
    return {
        'Red': 0.8,
        'Black': 0.2
    }.get(x, 0.0)

### Continuous:
<img src="https://img.staticbg.com/thumb/view/oaupload/banggood/images/6A/05/434b39fd-72b7-4ff5-93c0-eae680dfa5d7.jpg" alt="a person's heart rate" width=300/>

$X \in [30, 230]$, with probability **density** function **(PDF)**, assuming a normal distribution with:

* mean $\bbox[1pt,border:2px solid red]{\mu} = 120$,
* variance $\bbox[1pt,border:2px solid blue]{\sigma^2} = 400$

<div style="font-size: 2em">
$$p(x) = \frac{1}{\sqrt{2\pi\bbox[1pt,border:2px solid blue]{400}}}e^{-\frac{(x - \bbox[1pt,border:2px solid red]{120})^2}{2\cdot\bbox[1pt,border:2px solid blue]{400}}}$$
</div>

Be aware there are [many more](https://en.wikipedia.org/wiki/List_of_probability_distributions) types of probability distributions.

In [None]:
# No need to implement these PDFs yourselves, see scipy.stats
import numpy as np
from scipy.stats import norm

heart_rate_mean = 120
heart_rate_std = 20

norm.pdf(130, loc=heart_rate_mean, scale=heart_rate_std)

In [None]:
# For reuse of the same distribution params, a distribution can be _frozen_:
norm_hr = norm(loc=heart_rate_mean, scale=heart_rate_std)
norm_hr.pdf(130)

- What does this value mean?
- Is it a probability?

- What is the probability $Pr(X = 120)$, where $X$ is _exactly_ 120?

---

In [None]:
import plotly.graph_objs as go

x=np.linspace(30, 230, 100)

go.FigureWidget(
    data=[
        go.Scatter(x=x, y=norm_hr.pdf(x), mode='lines', line={'shape': 'spline', 'width': 4}, showlegend=False),
        go.Scatter(x=[130, 155], y=norm_hr.pdf([130, 155]), mode='markers', marker={'size': 8}, showlegend=False)
    ],
    layout={
        'width': 800,
        'title': 'p(X), for μ=120, σ=20',
        'xaxis': {'title': 'X'},
        'yaxis': {'title': 'p(X)'}
    }
)

A PDF gives the **relative likelihood** of $X$ having a given value:

In [None]:
norm_hr.pdf(130) / norm_hr.pdf(155)

meaning that with $\mu=120$ and $\sigma=20$, a heart rate of 130 is around 4 times more likely than a heart rate of 155.

You can derive probabilities from PDFs by integration:

$$Pr(100 \le X \le 120) = \int_{100}^{120}p(x)\,dx$$

And for a valid PDF:

$$ \int_{-\infty}^{\infty}p(x)\,dx = 1$$
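As a minimal sketch of these two integrals, assuming the same normal distribution as above ($\mu=120$, $\sigma=20$): an interval probability can be read off the CDF (`norm.cdf`) or obtained by numerically integrating the PDF with `scipy.integrate.quad`. The variable names are illustrative only.

```python
from scipy.integrate import quad
from scipy.stats import norm

# Same distribution as above: mean 120, standard deviation 20 (variance 400)
norm_hr = norm(loc=120, scale=20)

# Pr(100 <= X <= 120) as a difference of the cumulative distribution function
p_cdf = norm_hr.cdf(120) - norm_hr.cdf(100)

# The same probability by numerically integrating the PDF over [100, 120]
p_quad, _ = quad(norm_hr.pdf, 100, 120)

# Integrating over a range that holds practically all of the mass (+/- 10 sigma)
# should give a value very close to 1
total, _ = quad(norm_hr.pdf, 120 - 10 * 20, 120 + 10 * 20)

# An interval of zero width has zero probability, so Pr(X = 120) = 0,
# even though the density at 120 is positive
p_point = norm_hr.cdf(120) - norm_hr.cdf(120)

print(p_cdf, p_quad, total, p_point)
```

The density value itself is therefore not a probability; it only becomes one after integrating over an interval.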

## Adding 1 Dimension: Joint and Conditional Probability

The joint probability density of 2 random variables $A$ and $B$ is given as

<div style="font-size: 2em">
$$
p(A, B) = p(A|B)p(B)
$$
</div>

where $p(A|B)$ represents the density of $A$, **given** $B$, or **conditioned on** $B$.

How should $p(A, B)$, or $p(A|B)$ be interpreted?

Let's use maximum heart rates as an example. A person's maximum heart rate usually decreases with age. A commonly mentioned formula to estimate maximum heart rate is 220 - age (see [Wikipedia](https://en.wikipedia.org/wiki/Heart_rate#Haskell_&_Fox)).

In [None]:
def to_max_heart_rate(age):
    return 220 - age

This method is a gross oversimplification and shouldn't be used in practice, but serves well for this example.

In [None]:
ages = np.linspace(15, 85, 100)  # simulate 100 users of some fitness app, ages "uniformly" distributed between 15 and 85
max_heart_rate_means = to_max_heart_rate(ages)

In [None]:
y = np.linspace(130, 250, 100)
max_heart_rate_densities = np.array([norm.pdf(y, loc=age_hr, scale=10) for age_hr in max_heart_rate_means])

In [None]:
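go.FigureWidget(
    data=[
        # Surface expects z[i][j] at (y[i], x[j]), so transpose to put heart rate on rows and age on columns
        go.Surface(x=ages, y=y, z=max_heart_rate_densities.T)
    ],
    layout=go.Layout(
        width=600,
        height=600,
        # Axis titles of a 3-d plot live under `scene`, not directly on the layout
        scene=dict(
            xaxis=dict(title=dict(text='age')),
            yaxis=dict(title=dict(text='hr max')),
            zaxis=dict(title=dict(text='density'))
        )
    )
)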

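$p(H,A)$ represents the relative likelihood that age ($A$) and heart rate ($H$) have some values **simultaneously**. Strictly speaking, the surface above stacks the conditional densities $p(H|A=a)$ for each age; because the simulated ages are evenly spread, $p(A)$ is constant and $p(H,A) = p(H|A)p(A)$ has the same shape, only rescaled. For discrete random variables, the joint PMF is a 2-d lookup table.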

$p(H|A=a)$ represents the relative likelihood of a heart rate for a fixed age (imagine a slice cutting through the surface above at x=a). This results in a single-variable PDF.

Similar to 1-d PDFs, probabilities can be obtained by (double) integration.
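A rough sketch of such a double integration, under stated assumptions: ages are uniform on $[15, 85]$ and $H|A=a$ is normal with mean $220 - a$ and standard deviation 10, as in the simulation above. The helper names (`joint_density`, `p_hr_given_age`) and the chosen region (ages 40 to 50, heart rates 180 to 200) are hypothetical.

```python
from scipy.integrate import dblquad, quad
from scipy.stats import norm

AGE_MIN, AGE_MAX = 15, 85  # assumed uniform age range, as in the simulation above


def joint_density(h, a):
    """p(H=h, A=a) = p(H=h | A=a) * p(A=a), with H | A=a ~ Normal(220 - a, 10)."""
    p_age = 1.0 / (AGE_MAX - AGE_MIN) if AGE_MIN <= a <= AGE_MAX else 0.0
    return norm.pdf(h, loc=220 - a, scale=10) * p_age


# dblquad integrates func(y, x): the outer variable (age) runs from 40 to 50,
# the inner variable (heart rate) from 180 to 200.
p_joint, _ = dblquad(joint_density, 40, 50, lambda a: 180, lambda a: 200)


def p_hr_given_age(a):
    """Pr(180 <= H <= 200 | A=a) * p(A=a), used as a 1-d cross-check."""
    cond = norm(loc=220 - a, scale=10)
    return (cond.cdf(200) - cond.cdf(180)) / (AGE_MAX - AGE_MIN)


p_check, _ = quad(p_hr_given_age, 40, 50)

print(p_joint, p_check)  # the two values should agree closely
```

The cross-check works because integrating the joint density over $H$ first leaves a conditional probability weighted by $p(A=a)$.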

If $A$ and $B$ are **independent**, $p(A|B) = p(A)$, and therefore $p(A,B) = p(A) \times p(B)$. Most often, this is _not_ the case, so don't assume that $p(A,B)$ is as simple as $p(A) \times p(B)$.
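For discrete variables, independence can be checked directly on the joint lookup table. The sketch below uses made-up numbers for two binary variables, and `is_independent` is a hypothetical helper rather than a library function.

```python
import numpy as np

# Hypothetical joint PMF over two binary variables A (rows) and B (columns)
joint_dependent = np.array([[0.40, 0.10],
                            [0.10, 0.40]])


def is_independent(joint):
    """True if the joint PMF equals the outer product of its marginals."""
    p_a = joint.sum(axis=1)  # marginal of A: sum over B
    p_b = joint.sum(axis=0)  # marginal of B: sum over A
    return bool(np.allclose(joint, np.outer(p_a, p_b)))


# Conditional p(A | B=0): take column 0 and renormalise by p(B=0)
p_a_given_b0 = joint_dependent[:, 0] / joint_dependent[:, 0].sum()

# A joint built as an outer product of marginals is independent by construction
joint_independent = np.outer([0.5, 0.5], [0.5, 0.5])

print(p_a_given_b0)  # differs from the marginal p(A) = [0.5, 0.5]
print(is_independent(joint_dependent))
print(is_independent(joint_independent))
```

In the dependent table, knowing $B$ shifts the distribution of $A$ away from its marginal, which is exactly what $p(A|B) \ne p(A)$ means.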

## Since Variable Order Doesn't Matter...

$$
p(A,B) = p(B,A)\Leftrightarrow\\
p(A|B)p(B) = p(B|A)p(A)\Leftrightarrow
$$

<div style="font-size: 2em">
$$
p(A|B) = \frac{p(B|A)p(A)}{p(B)}
$$
</div>

a.k.a. **Bayes' Theorem**, or **Bayes' Rule**
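The identity can be checked numerically on any joint PMF. The following sketch uses a hypothetical 2x2 table (the numbers are arbitrary) and compares $p(A|B)$ computed from its definition with the value given by Bayes' Theorem.

```python
import numpy as np

# Hypothetical joint PMF p(A, B); rows index A, columns index B
joint = np.array([[0.30, 0.20],
                  [0.10, 0.40]])

p_a = joint.sum(axis=1)  # p(A)
p_b = joint.sum(axis=0)  # p(B)

# Conditionals straight from the definition p(A, B) = p(A|B) p(B) = p(B|A) p(A)
p_a_given_b = joint / p_b           # each column divided by its p(B=b)
p_b_given_a = joint / p_a[:, None]  # each row divided by its p(A=a)

# Bayes' Theorem: p(A|B) = p(B|A) p(A) / p(B)
p_a_given_b_bayes = p_b_given_a * p_a[:, None] / p_b

print(np.allclose(p_a_given_b, p_a_given_b_bayes))
```

Both routes give the same table, which is all the theorem states; its usefulness comes from cases where only one of the two conditionals can be measured.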

## That Fraud Detection Model

Let's try to plug our initial question about our fraud detection model into this formula. There are 2 (discrete) random variables involved:

- Transaction Fraud ($F \in \{Fraud, OK\}$)
- Model Alert ($A \in \{Alert, OK\}$)

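Having trained and validated our model using historical data, we obtained a confusion matrix, normalized per true class (each column sums to 1), so that the top row holds the True Positive Rate (0.95) and the False Positive Rate (0.0001):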


| predicted\true| fraud | ok     |
|---------------|-------|--------|
| predict_fraud | 0.95  | 0.0001 |
| predict_ok    | 0.05  | 0.9999 |


Further, we know from past experience that roughly 1 in a million transactions is fraudulent, i.e. $p(F=Fraud) = 0.000001$.

We are interested in the probability $p(F=Fraud|A=Alert)$

According to Bayes' Theorem, this is equal to:

$$
\frac{\bbox[1pt,border:2px solid red]{p(A=Alert|F=Fraud)}\times\bbox[1pt,border:2px solid yellow]{p(F=Fraud)}}{\bbox[1pt,border:2px solid blue]{p(A=Alert)}}
$$

with

$$
\begin{align}
p(A=Alert) & = \sum_{f \in \{Fraud, OK\}}p(A=Alert|F=f)\,p(F=f)\\
& = 0.95\times0.000001 + 0.0001\times0.999999\\
& \approx 0.0001
\end{align}
$$

which leads to

$$
p(F=Fraud|A=Alert) = \frac{\bbox[1pt,border:2px solid red]{0.95}\times\bbox[1pt,border:2px solid yellow]{0.000001}}{\bbox[1pt,border:2px solid blue]{0.0001}} \approx 0.01
$$

In [None]:
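def p_fraud_given_alert(p_fraud, true_positive_rate, false_positive_rate):
    # Bayes' Theorem; the denominator is p(Alert), summed over both true classes
    p_alert = true_positive_rate * p_fraud + false_positive_rate * (1 - p_fraud)
    return true_positive_rate * p_fraud / p_alert

# The posterior is very sensitive to the prior: try a few fraud frequencies
p_fraud_given_alert(0.001, 0.95, 0.0001)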

In [None]:
p_fraud_given_alert(0.0001, 0.95, 0.0001)

In [None]:
p_fraud_given_alert(0.00001, 0.95, 0.0001)

To conclude, a first application of Bayes' Rule is to discrete events where we can directly measure a conditional probability $p(B|A)$, the prior $p(A)$ and the marginal $p(B)$, but the probability of interest, $p(A|B)$, cannot be measured as directly.
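One way to build trust in such a result is to simulate it. The sketch below is an illustration under assumptions, not part of the original analysis: it draws synthetic transactions with a fraud frequency of 0.001 (one of the priors tried above, chosen so that a few million samples contain enough fraud cases), applies the model's True and False Positive Rates, and compares the fraction of alerts that are truly fraudulent with the value from Bayes' Theorem. The seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, chosen arbitrarily

n = 2_000_000
p_fraud, tpr, fpr = 0.001, 0.95, 0.0001

# Draw the true label, then the model's alert conditional on that label
is_fraud = rng.random(n) < p_fraud
alert_prob = np.where(is_fraud, tpr, fpr)
is_alert = rng.random(n) < alert_prob

# Empirical p(Fraud | Alert): fraction of alerts that are truly fraud
empirical = is_fraud[is_alert].mean()

# Bayes' Theorem for comparison
analytic = tpr * p_fraud / (tpr * p_fraud + fpr * (1 - p_fraud))

print(empirical, analytic)  # should agree up to sampling noise
```

With a prior as small as 1 in a million, far more samples would be needed before the empirical estimate becomes stable, which is one reason the analytic formula is preferable.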