## Probability

### What is probability?

We are all familiar with the phrase “the probability that a coin will land heads is 0.5”. But what does this mean?
There are actually at least two different interpretations of probability. One is called the **frequentist** interpretation.
In this view, probabilities represent long run frequencies of events. For example, the above statement means that,
if we flip the coin many times, we expect it to land heads about half the time.
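
The long-run reading can be checked with a small simulation. The following is a minimal sketch, assuming a fair coin modeled with Python's standard pseudo-random generator; the function name `heads_frequency` and the fixed seed are illustrative choices, not part of the text above.

```python
import random

def heads_frequency(n_flips, p_heads=0.5, seed=0):
    """Simulate n_flips coin tosses and return the fraction that landed heads."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips

# The observed frequency should drift toward p_heads as n_flips grows.
for n in (10, 1_000, 100_000):
    print(n, heads_frequency(n))
```

With only a few flips the fraction can be far from 0.5; with many flips it should settle close to it, which is what the frequentist statement claims.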

The other interpretation is called the **Bayesian** interpretation of probability. In this view, probability is used
to quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated
trials. In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails
on the next toss.

One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not
have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by
2020 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to
quantify our uncertainty about this event; based on how probable we think this event is, we will (hopefully!) take
appropriate actions. To give some more machine learning oriented examples, we might have received a specific email message,
and want to compute the probability it is spam. Or we might have observed a “blip” on our radar screen, and want to compute
the probability distribution over the location of the corresponding target (be it a bird, plane, or missile). In all these cases,
the idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed quite natural.

The basic rules of probability theory are the same, no matter which interpretation is adopted.

### Basic Probability
The expression $p(A)$ denotes the probability that the event $A$ is true. For example, $A$ might be the logical expression
“it will rain tomorrow”.

We have:

- $0 \leq p(A) \leq 1$
- $p(A) = 0$ means the event definitely will not happen
- $p(A) = 1$ means the event definitely will happen
- $p(\neg A)$ denotes the probability of the event not $A$, that is, the probability that $A$ will not occur
- $p(\neg A)=1-p(A)$
- We will often write $A=1$ to mean the event $A$ is true, and $A=0$ to mean the event $A$ is false

### Discrete random variables

A **discrete random variable** $X$ is a variable that takes one value from a finite or countably infinite set of possible outcomes. For example, $X$ might be the integer age of a
student in our class.

We can be fairly sure that $X \in \{0, 1, \ldots, 100\}$ ($X$ is *in* the set of integers from 0 to 100). We might take a sample from $X$, which gives the age of one member of the class. In terms of probability, we might consider the event $X=x$ and its probability $p(X=x)$, the probability that our sample is some number $x$. We can also write this simply as $p(x)$. Assuming that no two students share the same integer age, that every student is equally likely to be sampled, and that there are $n$ students in the class, we could say $p(x)=\frac{1}{n}$ for each age $x$ that occurs in the class. As with all probabilities, $0 \leq p(x) \leq 1$.

**ADVANCED NOTE** $p(x)$ is called a **probability mass function** or pmf.
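
As a rough illustration, a pmf can be estimated from a list of observed values by counting. The sketch below uses a made-up list of ages (an assumption for demonstration only) and a hypothetical helper named `empirical_pmf`; unlike the simplified case above, it allows several students to share an age.

```python
from collections import Counter
from fractions import Fraction

# Made-up ages for a class of eight students (illustrative data only).
ages = [19, 20, 20, 21, 21, 21, 22, 23]

def empirical_pmf(samples):
    """Map each observed value x to p(x) = (count of x) / (number of samples)."""
    counts = Counter(samples)
    n = len(samples)
    return {x: Fraction(c, n) for x, c in counts.items()}

pmf = empirical_pmf(ages)
for x in sorted(pmf):
    print(x, pmf[x])

# A pmf must sum to one over all possible values.
print(sum(pmf.values()))
```

If every age in the list were distinct, each entry would reduce to $\frac{1}{n}$.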

### Probability of Two Events Occurring

Given two events, $A$ and $B$, we define the probability of $A$ or $B$ as follows:

$$p(A \vee B) = p(A) + p(B) - p(A \wedge B)$$
$$p(A \vee B) = p(A) + p(B) \quad \text{if } A \text{ and } B \text{ are mutually exclusive}$$
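
Both forms of the rule can be verified by enumeration. The sketch below assumes a single fair six-sided die, which is an example chosen here rather than one from the text; the helper `prob` and the event names are hypothetical.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # faces of one fair six-sided die

def prob(event):
    """Probability of an event (a set of faces) when all faces are equally likely."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

even = {2, 4, 6}
greater_than_3 = {4, 5, 6}

# General rule: subtract the overlap so that it is not counted twice.
p_or = prob(even | greater_than_3)
p_sum = prob(even) + prob(greater_than_3) - prob(even & greater_than_3)
print(p_or, p_sum)

# Mutually exclusive events have an empty overlap, so the last term vanishes.
one, two = {1}, {2}
print(prob(one | two), prob(one) + prob(two))
```

Exact fractions are used instead of floats so that the two sides compare equal without rounding error.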

#### Joint Probability

Joint probability refers to two events co-occurring.

$$p(A,B) = p(A\wedge B) = p(A|B)p(B) = p(B|A)p(A)$$

This is sometimes called **the product rule**. Note that in both $p(A|B)p(B)$ and $p(B|A)p(A)$, *both events are occurring*. You should read $p(A|B)p(B)$ as "the probability of $A$ given $B$ times the probability of $B$".

#### Conditional Probability

$$p(A|B) = \frac{p(A,B)}{p(B)} \text{ if } p(B) > 0$$
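
To make the definition concrete, the sketch below computes a conditional probability from a small joint table and checks it against the product rule. The weather events and their probabilities are made up for illustration, and the helpers `marginal` and `conditional` are hypothetical names.

```python
from fractions import Fraction

# Made-up joint distribution p(A, B): A is rain/dry, B is cloudy/clear.
joint = {
    ("rain", "cloudy"): Fraction(3, 10),
    ("rain", "clear"): Fraction(1, 20),
    ("dry", "cloudy"): Fraction(1, 5),
    ("dry", "clear"): Fraction(9, 20),
}

def marginal(joint, index, value):
    """Sum the joint over the other variable: index 0 gives p(A), index 1 gives p(B)."""
    return sum(p for key, p in joint.items() if key[index] == value)

def conditional(joint, a, b):
    """p(A=a | B=b) = p(a, b) / p(b), defined only when p(b) > 0."""
    p_b = marginal(joint, 1, b)
    if p_b == 0:
        raise ValueError("p(B) must be positive")
    return joint.get((a, b), 0) / p_b

p_rain_given_cloudy = conditional(joint, "rain", "cloudy")
print(p_rain_given_cloudy)

# Product rule check: p(A|B) p(B) recovers the joint probability p(A, B).
print(p_rain_given_cloudy * marginal(joint, 1, "cloudy") == joint[("rain", "cloudy")])
```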

### Bayes Rule

$$p(A|B) = \frac{p(B|A)p(A)}{p(B)} \text{ if } p(B) > 0$$
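
Bayes rule follows in one step from the product rule given earlier. The short derivation below spells this out; its last line is the law of total probability, a standard identity that is not stated explicitly above but is often needed to compute the denominator $p(B)$.

\begin{align*}
p(A|B)p(B) &= p(B|A)p(A) && \text{(product rule)}\\
p(A|B) &= \frac{p(B|A)p(A)}{p(B)} && \text{(divide both sides by } p(B) > 0\text{)}\\
p(B) &= p(B|A)p(A) + p(B|\neg A)p(\neg A) && \text{(law of total probability)}
\end{align*}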

### An Example: A Cancer Detection Test

Suppose a medical institution has developed a test for assessing whether or not a patient
has cancer. The test has been around for a long time (meaning we can use frequentist statistics to measure its success), and we know that it is 99% successful in identifying cancer
when a patient has cancer and 97% successful in returning a negative when a patient does
not have cancer. We can rewrite each of these as a conditional probability:

\begin{align*}
p(\text{positive test}|\text{cancer}) &= 0.99\\
p(\text{negative test}|\text{no cancer}) &= 0.97\\
\end{align*}

We also know that cancer in the American population is extremely rare. Approximately 0.4% of people develop cancer. We can thus say $p(\text{cancer}) = 0.004$. *NOTE: these numbers are made up for demonstration*.

#### What is the probability that a patient has cancer?
What we wish to know is, given a positive test, what is the probability that the patient has cancer? What is $p(\text{cancer}|\text{positive test})$?

#### Bayes Rule 

We can find this probability by calculating

$$p(\text{cancer}|\text{positive test}) = \frac{p(\text{positive test}|\text{cancer})p(\text{cancer})}{p(\text{positive test})}$$

The calculation is straightforward, except for the denominator $p(\text{positive test})$. This term must include every way in which we can obtain a positive test, so it has to count the false positives as well as the true positives. The false positive rate is $p(\text{positive test}|\text{no cancer}) = 1 - p(\text{negative test}|\text{no cancer}) = 0.03$, and the probability of not having cancer is $p(\text{no cancer}) = 1 - p(\text{cancer}) = 0.996$.

\begin{align*}
p(\text{positive test})&=p(\text{positive test}|\text{cancer})p(\text{cancer}) + p(\text{positive test}|\text{no cancer})p(\text{no cancer})\\
&=0.99\cdot 0.004 + 0.03\cdot 0.996\\
&=0.03384
\end{align*}

Then,

\begin{align*}
p(\text{cancer}|\text{positive test}) &= \frac{p(\text{positive test}|\text{cancer})p(\text{cancer})}{p(\text{positive test})}\\
&= \frac{0.99\cdot 0.004}{0.03384}\\
&= 0.11702
\end{align*}
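
The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using the values from the conditional probabilities in the equations above; the function name `posterior` and the parameter names are assumptions made here (`sensitivity` for $p(\text{positive test}|\text{cancer})$ and `specificity` for $p(\text{negative test}|\text{no cancer})$).

```python
def posterior(prior, sensitivity, specificity):
    """Return p(cancer | positive test) using Bayes rule.

    prior        p(cancer)
    sensitivity  p(positive test | cancer)
    specificity  p(negative test | no cancer)
    """
    false_positive_rate = 1 - specificity
    # Total probability of a positive test: true positives plus false positives.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.004, sensitivity=0.99, specificity=0.97)
print(round(p, 5))
```

Changing `prior` shows how strongly the result depends on how rare the condition is: the rarer the condition, the larger the share of positive tests that are false positives.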



#### Why would the number be so small if our test is 99% accurate?




Below we visualize a population of 1000 patients to whom this test has been administered. Of these 1000 patients, we would expect 996 to be cancer-free. With a false positive rate of 3%, however, we would expect about 30 of those 996 patients to test positive anyway; these are **false positives**. In the same population, we would expect 4 patients to actually have cancer. Luckily, at 99% sensitivity we would expect the test to correctly identify all four of them. This gives a total of 34 positive tests, the vast majority of which are false positives.

![A population of 1000 patients](doc/img/population.png)

We might also look at these results using a **confusion matrix**

| | Actual Positive | Actual Negative |
|:-:|:-:|:-:|
| Predicted Positive | 4 | 30 |
| Predicted Negative | 0 | 966 |

For a cancer test, this result may be preferable to the alternative: lowering the sensitivity
of the test would reduce the false positive rate, but at the expense of missing true positives. One
can imagine a situation in which the opposite is true, so that we would want to lower
the false positive rate even at the expense of missing true positives. For example, we might
consider a test to see if a patient is a match for a certain kind of organ donation.
In this case, it is preferable that every positive match is a true positive, even at the
expense of possibly missing a match.
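
The expected counts in the confusion matrix, and the trade-off just described, can be sketched numerically. The first test below uses the conditional probabilities from the equations of the cancer example; the second, "strict" test is entirely hypothetical, with made-up numbers chosen only to show the opposite trade-off. The helper names `expected_counts` and `precision` are assumptions as well.

```python
def expected_counts(n, prior, sensitivity, specificity):
    """Expected confusion-matrix counts when n patients are tested."""
    sick = n * prior
    healthy = n - sick
    tp = sick * sensitivity            # sick patients flagged positive
    fn = sick - tp                     # sick patients the test misses
    fp = healthy * (1 - specificity)   # healthy patients flagged positive
    tn = healthy - fp                  # healthy patients correctly cleared
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn}

def precision(counts):
    """Fraction of positive tests that are true positives."""
    return counts["tp"] / (counts["tp"] + counts["fp"])

# Test from the example: p(pos|cancer) = 0.99, p(neg|no cancer) = 0.97, p(cancer) = 0.004.
screening = expected_counts(1000, 0.004, 0.99, 0.97)

# HYPOTHETICAL stricter test (made-up numbers): it misses more true cases
# but raises far fewer false alarms.
strict = expected_counts(1000, 0.004, 0.90, 0.999)

for name, counts in (("screening", screening), ("strict", strict)):
    rounded = {k: round(v, 2) for k, v in counts.items()}
    print(name, rounded, round(precision(counts), 3))
```

Under these assumptions the stricter test gives up some true positives in exchange for a much higher share of positive results that are correct, which is the behavior one would want in the organ donation setting.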