# Introducing Probability

Please ensure you have watched the Chapter 2 video(s).

## You will learn the following things in this Chapter

- Basic probability rules 
- Bayes Theorem
- How to use Python programming to estimate probabilities
- After completing this notebook you will be able to attempt CA 1 questions 1 and 4.

***

## Probably useful: probability revision

We are going to be using the concept of probability a lot in this course, so it is best
to first review the basic ideas behind probability theory, and get used to the notation. Unfortunately, there are many notations; this course will adopt one, but we will also
mention the others, so that you can recognise them when reading further.

- An experiment results in a set of outcomes, which we will call $\Omega$. This can be a discrete set of outcomes, such as in the classic coin toss, $\Omega = \{H, T\}$, or the roll of a die, $\Omega = \{1, 2, 3, 4, 5, 6\}$. However, in many real life experiments, $\Omega$, which is referred to as the *outcome space*, or *event space*, can have an infinite continuum of values. We will return to this idea later, but for the moment we will consider just discrete events to help outline the basic properties of probability.

- Returning to the coin toss, we can say that a fair coin will have a probability of heads of $P(H) = 0.5$, and a probability of tails of $P(T) = 0.5$. The outcomes of the experiment of tossing the coin, $\Omega = \{H, T\}$, are thus equally likely. Similarly, for a die roll, $\Omega = \{1, 2, 3, 4, 5, 6\}$, with $P(i) = 1/6$. When an experiment has $m$ equally likely outcomes, the probability of any outcome $x$ is then:

    $P(x) = \dfrac{N(x)}{m}$

    where $N(x)$ is the number of times that $x$ occurs. For example, consider the more complicated case where we toss three different coins together, a 50p, a 20p and a pound coin. The outcome space is then

    $\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}.$

    Assuming all outcomes were equally likely, then the probability of getting any one is 1/8.
    
- Not all outcomes are equally likely. For example, we could ask for the probability that our 3-coin toss comes up with $n$ heads, so the outcome space is then $\Omega = \{0, 1, 2, 3\}$. Each outcome $\omega \in \Omega$ in this new experiment is found by considering the outcomes in our previous 3-coin experiment, but ignoring which exact coin lands on heads or tails, i.e. <br><br>

    - $\omega = 0$ ($n = 0$) corresponds to TTT
    - $\omega = 1$ ($n = 1$) corresponds to HTT, THT, or TTH
    - $\omega = 2$ ($n = 2$) corresponds to HHT, HTH, or THH
    - $\omega = 3$ ($n = 3$) corresponds to HHH

    Thus $P(n =0) = P(n =3) = 1/8$ while $P(n=1) = P(n=2) = 3/8$.
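The counting argument above can be checked by brute force. The following is a minimal sketch that enumerates the eight equally likely outcomes with `itertools.product` and tallies the number of heads in each; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 equally likely outcomes of tossing three distinguishable coins
outcomes = [''.join(toss) for toss in product('HT', repeat=3)]

# Count how many outcomes give each number of heads
heads_counts = Counter(outcome.count('H') for outcome in outcomes)

# P(n) = (number of outcomes with n heads) / (total number of outcomes)
p_heads = {n: Fraction(count, len(outcomes))
           for n, count in sorted(heads_counts.items())}

for n, p in p_heads.items():
    print('P(n = {}) = {}'.format(n, p))
```

Using `Fraction` keeps the probabilities exact, so it is easy to confirm that they sum to 1.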
  
The above uses of $P$ hide an important aspect of probabilities: $P(x_i)$ is *normalised*, such that the probabilities of all possible outcomes sum to 1,

$P(X) = \sum_{i=1}^N P(x_i) = 1$,

where $X = (x_1, \ldots, x_N)$.

To put this another way:
- the probability of something which is certain to happen is 1,
- the probability of something which is impossible to happen is 0,
- the probability of something not happening is 1 minus the probability that it will happen.

Technically this is only true for either finite or countably infinite outcome spaces. If the outcome space is truly uncountably infinite, then the definition of probability has to be relaxed.

### Axioms 

Some definitions before we continue:

- **Disjoint**: Two events that cannot occur at the same time are called disjoint or mutually exclusive. 
- **Discrete** variables: this is a variable whose value can be obtained by counting, and has gaps between the values that the variable can take on. 
- **Continuous** variables:  this is a variable whose value cannot be obtained by counting, and can take on all values within the range.<br><br>

For a discrete variable, if I pick any two consecutive outcomes, I cannot get an outcome that lies in between. For example, if we consider 1 and 2 as outcomes of rolling a six-sided die, then I cannot have an outcome of 1.5.

The axioms of probability are:

Axiom 1 : $0 \le P(A) \le 1$

Axiom 2 : $P(\Omega) = 1$

Axiom 3 : $P(A_1 \cup A_2 \cup A_3 ...) = P(A_1)+P(A_2)+P(A_3)+...$

Here the symbol $\cup$ denotes **or**, i.e. $P$(event $A_1$ occurs or event $A_2$ occurs or both occur, etc.). Axiom 3 holds only for disjoint events, and can be used for any number of disjoint events.

The axioms permit us to work out the probability of event $A$ **not** occurring,

$P(A^c) = 1 - P(A)$,

where $^c$ is called the **complement** ie $P$(not A). <br><br>

This result can be generalised. For example, for any two events $C$, $D$, the probability of getting $C$ **or** $D$ is:

$P(C \cup D) = P(C) + P(D) - P(C \cap D)$,

where now the symbol $\cap$ denotes **and** (also written as $P(CD)$ or $P(C, D)$ in the literature). A simple way to think of this is that the probability of getting either $C$ or $D$ is just **the sum** of the chances of getting each ($P(C) + P(D)$) minus the chance of getting both at the same time, $P(C \cap D)$. 

The last part is important since we're asking for the probability of *either* $C$ **or** $D$! The best way to see this is by considering the diagram below.

<img src="https://github.com/haleygomez/Data-Science-2024/raw/master/blended_exercises/Chapter2/union_CD_diagram.png" width="400">

The orange bit in the middle is the bit we need to subtract.

Note that the last expression is valid *whether or not the events are mutually exclusive (disjoint)*. That is the purpose of the **and** term, since it subtracts the probability that both the events occur at the same time. 

How do we calculate the probability of $C$ **and** $D$, i.e. $P(C \cap D)$? Well, if the events are independent (i.e. one event occurring does not change the probability of the other), then

$P(C \cap D) = P(C) P(D)$,

i.e. just **the product** of the probability of the two events.
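Both rules can be checked on a single roll of a fair die. The sketch below is illustrative: the events `C`, `D` and `E` and the helper `prob` are assumptions chosen for this example, not part of the course material.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # outcome space for one roll of a fair die

def prob(event):
    """Probability of an event (a subset of omega) with equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

C = {2, 4, 6}   # the roll is even
D = {4, 5, 6}   # the roll is greater than 3
E = {1, 2}      # the roll is less than 3

# Addition rule: holds whether or not C and D are disjoint
lhs = prob(C | D)
rhs = prob(C) + prob(D) - prob(C & D)
print('P(C or D) =', lhs, '; P(C) + P(D) - P(C and D) =', rhs)

# Product rule: only holds when the two events are independent
print('C and E independent?', prob(C & E) == prob(C) * prob(E))
print('C and D independent?', prob(C & D) == prob(C) * prob(D))
```

Here `C` and `E` happen to be independent, while `C` and `D` are not, so the product rule cannot be used for $P(C \cap D)$.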

### <font color=#4290C4>Example</font>

Someone has a bag of M&Ms in which there are red, blue, green and orange colours.  M&Ms are picked out and replaced. Someone did this 1000 times and obtained the following results:

number of blue M&Ms: 300,
number of red M&Ms: 200,
number of green M&Ms: 450,
Number of orange M&Ms: 50.

1. What is the probability of picking a green M&M?
2. If there are 100 sweets in the bag, how many of them are likely to be green?

###  <font color=#c38241> Solution</font>

Click below to see the solution.

1. For every 1000 M&Ms picked out, 450 are green. Therefore $P$(green) = 450/1000 = 0.45.

2. The experiment suggests that 450 out of 1000 M&Ms are green. Therefore, out of 100, we would expect that 45 will be green (using ratios).
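The same calculation can be written in Python. This is a minimal sketch using the counts quoted in the example; the dictionary and variable names are illustrative.

```python
# Observed counts from 1000 draws (with replacement), taken from the example
counts = {'blue': 300, 'red': 200, 'green': 450, 'orange': 50}
n_draws = sum(counts.values())

# The relative frequency of each colour is our estimate of its probability
p_colour = {colour: n / n_draws for colour, n in counts.items()}

# Expected number of green sweets in a bag of 100
bag_size = 100
expected_green = p_colour['green'] * bag_size

print('P(green) = {:.2f}'.format(p_colour['green']))
print('Expected green M&Ms in a bag of {}: {:.0f}'.format(bag_size, expected_green))
```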

### <font color=#4290C4>Example</font>

All human blood can be typed as O, A, B or AB. The frequency of occurrence varies between groups. The probabilities for the blood types O, B and AB in the US are 

| Blood Type| Probability|
|---|---|
|O|0.44|
|A|?|
|B|0.1|
|AB|0.04|

A person in the United States is chosen at random. What is the probability of the person having blood type A?

###  <font color=#c38241> Solution</font>

Click below to see the solution.

Since the four blood types O, A, B, and AB should sum to 1 we simply need to sum O, B, and AB together and the probability of type A must be what is remaining.

In [1]:
p_O = 0.44
p_B = 0.1 
p_AB =0.04
p_A = 1. - (p_O+p_B+p_AB)

print('The probability of a person at random having blood type A is {:.2f}'.format(p_A))

The probability of a person at random having blood type A is 0.42


### <font color=#4290C4>Example</font>

What is the probability that a randomly chosen person does not have blood type O?              

###  <font color=#c38241> Solution</font>

Click below to see the solution.

- We need to find the probability that a randomly chosen person does not have blood type O, i.e. we need to calculate $P ({\rm not} O)$ or $P(O^c)$. 

In [2]:
p_not_O = 1-p_O
print('The probability of a person at random not having blood type O is {:.2f}'.format(p_not_O))

The probability of a person at random not having blood type O is 0.56


- As a follow-up, what is the probability that a randomly chosen person is a potential donor for someone with blood type A? Being a potential donor for a person with blood type A means having blood type A **or** O.

 We therefore need to find $P$(A or O). Since the events A and O are disjoint, we can use the addition rule for disjoint events to get:

 $P(A \cup O) = P(A) + P(O) $

In [3]:
p_A_or_O = p_A + p_O
print('The probability of a person at random being able to donate to someone with type A is {:.2f}'.format(p_A_or_O))

The probability of a person at random being able to donate to someone with type A is 0.86


### Conditional Probabilities

We now have enough knowledge of the probability basics to consider *conditional probabilities*. Technically speaking, all probabilities are conditional. For example, the probability that my coin will land either heads or tails is conditioned by the probability that the coin will land. Or indeed have a heads and a tails! In a less contrived example, the probability that an astronomer is observing a specific type of star, say an A2 star, is first conditioned on the probability that what they are observing is indeed a star, and not another astronomical phenomenon. 

So conditional probabilities are important. They are denoted in the following way: **the probability of event $A$ given condition (or event) $B$ is $P(A|B)$.**

Take a classic example: what is the probability of randomly drawing the Queen of Spades ($QoS$) from a well-shuffled, true pack of cards? 

There are 52 cards in total, so the probability of drawing any particular card ($C$) is simply $P(C) = 1/52$. The probability of drawing our desired $QoS$ is then $P(QoS) = 1/52$. Now consider that the dealer is truthful, and tells you that the card you have just drawn is a face ($F$) card. Now what is the probability that it is the $QoS$?

Well, first, the probability of drawing a face card that's also the $QoS$ is given by $P(QoS \cap F)$. But since we know that the card is a face card, our probabilities $P(QoS) = 1/52$ and $P(F) = 12/52$ (there are 12 face cards in the pack) are wrong in the sense that they were derived by dividing by all 52 cards. Instead we want to renormalise our probabilities to the region of outcome space where $F$ is true, so we need to divide by $P(F)$:

$P(QoS | F) = \dfrac{P(QoS \cap F)}{P(F)} = \dfrac{1/52}{12/52} = \dfrac{1}{12}$.

In general, $P(A | B) = \dfrac{P(A \cap B)}{P(B)}$.
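The renormalisation can be checked by enumerating the deck. The following is a minimal sketch; the way the deck is represented (tuples of rank and suit) is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['Spades', 'Hearts', 'Diamonds', 'Clubs']
deck = list(product(ranks, suits))  # 52 equally likely cards

face_cards = [card for card in deck if card[0] in ('J', 'Q', 'K')]
queen_of_spades = ('Q', 'Spades')

# P(F): fraction of the deck that is a face card
p_F = Fraction(len(face_cards), len(deck))

# P(QoS and F): the Queen of Spades is itself a face card
p_QoS_and_F = Fraction(face_cards.count(queen_of_spades), len(deck))

# Renormalise to the part of the outcome space where F is true
p_QoS_given_F = p_QoS_and_F / p_F

print('P(F) =', p_F)
print('P(QoS | F) =', p_QoS_given_F)
```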

***

## Bayes Theorem

Bayes’ theorem, named after 18th-century British mathematician Thomas Bayes, is a mathematical formula for determining conditional probabilities. This theorem has huge importance in the field of data science. It is used in finance to rate the risk of lending money to potential borrowers. It can be used to determine the accuracy of medical test results by accounting for both the probability of a person having the disease and the accuracy of the test itself.  The famous Bayes Theorem is derived from considering the idea of conditional probabilities introduced above. 

Imagine for instance that someone has a cough and we want to know if this means they have lung disease. Let’s say you know: 
- the probability of somebody having a cough *given* that they have lung disease $X$ ie **$P$(cough|lung disease)**, 
- the probability of somebody in general having lung disease $X$ ie **$P$(lung disease)**,
- the probability of somebody in general having a cough ie **$P$(cough)**. 

With these 3 pieces of information you should be able to calculate the probability of somebody having lung disease $X$ *given* that they have a cough ie **$P$(lung disease|cough)** - but how?

### Standard form of Bayes Theorem

Starting from the definition of conditional probability, $P(A \cap B) = P(A | B) P(B)$, and given that $P(A \cap B) = P(B \cap A)$, we can write

$P(A | B) P(B)  = P(B | A) P(A)$ 

such that,

$P(A | B)   = \dfrac{P(B | A) P(A)}{P(B)}$,

which is the standard form of Bayes Theorem. 

A simple way to visualise this result is to consider the areas in the figure below.

<img src="https://github.com/haleygomez/Data-Science-2024/raw/master/blended_exercises/Chapter2/ProbAgivenB_diagram.png" width="400">

We can think of the denominator as re-normalising the probability into a section of probability space in which B has occurred.

One can generalise the denominator by considering that,

$P(B) = P(B \cap A) + P(B \cap A^c) = P(B | A)P(A) + P(B|A^c)P(A^c)$

to give the result

$P(A | B)   = \dfrac{P(B | A) P(A)} {  P(B | A)P(A) + P(B|A^c)P(A^c) }$
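As a sketch of how the expanded form might be used, the function below applies it to the cough and lung disease question. All three input numbers are hypothetical values chosen purely for illustration (they are not real medical data), and the function name `bayes_posterior` is an assumption of this example.

```python
def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using the expanded form of Bayes Theorem."""
    p_not_a = 1.0 - p_a
    # Evidence: total probability of B, summed over A and not-A
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, for illustration only
p_disease = 0.01              # P(lung disease)
p_cough_given_disease = 0.9   # P(cough | lung disease)
p_cough_given_healthy = 0.1   # P(cough | no lung disease)

p_disease_given_cough = bayes_posterior(
    p_disease, p_cough_given_disease, p_cough_given_healthy)

print('P(lung disease | cough) = {:.3f}'.format(p_disease_given_cough))
```

With these made-up values the posterior is much smaller than $P$(cough|lung disease), because the disease is rare and most coughs come from people without it.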

### Bayes with Models and Datasets

Bayes Rule is normally used to determine the probability of a specific model, $\theta$, given some data D, such that

$P(\theta|D) = \dfrac{P(D|\theta) P(\theta)}{P(D)}$

where $P(D|\theta)$ is the *likelihood*, $P(\theta)$ is the *prior*, and $P(D)$ is the *evidence*. $P(\theta|D)$ is the *posterior*. 

The standard way to write Bayes Rule is then,

$P(\theta | D) = \dfrac{ P(D | \theta) P(\theta)} { P(D)}$

Let's define the terms:

- $P(\theta | D)$ the posterior - the probability of model parameter $\theta$ being true, given the data
- $ P(D | \theta) $ the likelihood - given model parameter $\theta$, what is the likelihood of obtaining the data
- $ P(\theta)$ the prior - the probability of the model parameter $\theta$ being ‘true’
- $P(D)$ the evidence - the probability of getting the data, given all possible model parameter values ($\theta$ and others!)

<img src="https://github.com/haleygomez/Data-Science-2024/raw/master/blended_exercises/Chapter2/bayes_image.png" width="500">

Why is Bayes so powerful? Here's an example: if cancer is related to age, then, using Bayes’ theorem, a person's age can be used to more accurately assess the probability that they have cancer. 


### Priors

People feel uncomfortable about ‘priors’, since often they are a ‘best guess’. Indeed, different analysts may have differing opinions about what the prior for a given experiment should be.  Although ‘frequentists’ disagree with the use of priors, note that technically they do assume one: they assume that all possibilities are equally likely (i.e. a ‘flat’ prior).

Clearly this assumption can also be wrong.  So priors are useful, but they must be clearly stated. They provide a formal means for the analyst to include previous information that is relevant to the experiment. They also allow you to test whether the model is good.   

Where we have new data in an experiment, we can reapply Bayes Rule to get a new posterior. But what to use for the prior? Well, the posterior from the previous analysis! Hence the statement:

Yesterday's posterior is tomorrow's prior

This feedback loop makes Bayes Theorem particularly useful in machine learning, in which decision making needs to adapt to new information as it comes in.

### <font color=#4290C4>Example</font>

Imagine that a box contains five coins, one of which is a joke (J) coin, with heads on both sides. A coin is selected at random from the box, and flipped 3 times. The result is 3 heads (3H). What is the probability that the coin is the joke coin?

###  <font color=#c38241> Solution</font>

Click below to see the Solution.

First, we should define what we are trying to work out. 

We are interested in $P(J | 3H)$. We will let the normal coin be denoted by $C$. So using Bayes Theorem we can write,

$P(J | 3H) = \dfrac{ P(3H | J) P(J)}  {P(3H) }$

To get the probability of $P(3H)$ we need to add up all possibilities of getting it.

$P(J | 3H) = \dfrac{ P(3H | J) P(J)}  {P(3H | J) P(J) + P(3H | C) P(C)}$

The probability of randomly selecting the joke coin is $P(J) = 1/5$. 

The probability of not selecting it, is $P(J^c) = 1 - 1/5 = 4/5 = P(C)$. 

The probability of getting 3 heads with the joke coin is 1, so 

$P(3H | J) = 1$

The probability of getting 3 heads with a standard coin is $(1/2) \times (1/2) \times (1/2)$ (remember these are independent events), so

$P(3H | C) = 1/8$

$P(J | 3H) = \dfrac{ 1 \times 1/5}  {1 \times 1/5 ~ + ~1/8 \times 4/5} = 2 / 3$

So there's roughly a 67% chance that the coin we are seeing flipped is the joke coin!
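The same answer can be reached one flip at a time, which illustrates the idea from the Priors section that the posterior from one analysis becomes the prior for the next. The sketch below is an illustration with assumed variable names; it uses `Fraction` to keep the arithmetic exact.

```python
from fractions import Fraction

p_heads_joke = Fraction(1)      # the joke coin always lands heads
p_heads_fair = Fraction(1, 2)   # a normal coin lands heads half the time

p_joke = Fraction(1, 5)         # prior: one joke coin among five
posteriors = []
for flip in range(3):
    # Evidence: probability of seeing one more head given the current belief
    p_head = p_heads_joke * p_joke + p_heads_fair * (1 - p_joke)
    # Bayes Theorem; this posterior becomes the prior for the next flip
    p_joke = p_heads_joke * p_joke / p_head
    posteriors.append(p_joke)

print('P(J) after each head:', [str(p) for p in posteriors])
```

Each additional head raises the probability that we are holding the joke coin, and after the third head the value agrees with the single-step calculation.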

### <font color=#4290C4>Example</font>

A couple has 2 children and the older child is a boy. If the probabilities of having a boy or a girl are both 50%, what's the probability that the couple has two boys?

We already know that the older child is a boy. The probability of two boys then is equivalent to the probability that the younger child is a boy, which is 50%.  Show this is indeed true using Bayes Theorem.

###  <font color=#c38241> Solution</font>

Click below to see the Solution.

Using Bayes Theorem, let's evaluate this formally: 

Define the events, $A$ and $B$ as follows:

- $A  = \mbox{both children are boys}$
- $B = \mbox{the older child is a boy}$

So we need to find out

$P(A | B) = \dfrac{P(B | A)P(A)}{P(B)}$ where

$P(B) = \dfrac{1}{2}$ (the probability of the older child being a boy)

$P(A) = \dfrac{1}{4}$ (probability of both children being boys - this is an **and** probability - $=0.5 \times 0.5 $).

$P(B|A) = 1$ (the probability of the older child being a boy given both children are boys).

In [4]:
p_A = 1./4
p_B = 1./2
p_B_given_A = 1.

p_A_given_B = (p_A*p_B_given_A)/p_B

print('Probability that both children are boys given older child is a boy is {:.2f}'.format(p_A_given_B))

Probability that both children are boys given older child is a boy is 0.50
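As a cross-check that does not rely on Bayes Theorem, the four equally likely families can simply be enumerated. This is a minimal sketch with illustrative names.

```python
from fractions import Fraction
from itertools import product

# Each family is (older child, younger child); all four are equally likely
families = list(product('BG', repeat=2))

# Event B: the older child is a boy
older_is_boy = [family for family in families if family[0] == 'B']

# Event A within B: both children are boys
both_boys = [family for family in older_is_boy if family == ('B', 'B')]

p_both_given_older = Fraction(len(both_boys), len(older_is_boy))
print('P(both boys | older child is a boy) =', p_both_given_older)
```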


### <font color=#4290C4>Example</font>

A woman's DNA matches that of a sample found at a crime scene. The chance of a DNA match is just one in two million, and the court interprets this as meaning that the chance the sample came from someone else is 1 in 2 million.  

The court therefore concludes that the probability that she is guilty is $1 - \dfrac{1}{2,000,000}$, which is basically 100%, and she gets sent to jail for a very long time.   

The woman comes from a city where approximately 400,000 women with ages greater than 18 live, approximately 300,000 of whom are from a similar ethnic group. Use a simple Bayes theorem approach to show why the justification above for sending her to jail is completely wrong (this is known as the prosecutor's fallacy).

###  <font color=#c38241> Solution</font>

Click below to see the Solution.

The answer above mistakes the one in two million for the probability of the woman's innocence. In order to assess the woman's guilt properly, we need to take the fact that she matched the sample as a given, and see *how much more likely this makes her to be guilty than she was before the DNA evidence came to light*.

But how likely is it that the DNA match profile also exists in the population at large? Who else could there be out there with that profile?  The match probability of one in two million tells how likely it is that a random person's DNA profile will match the crime sample not how guilty that person is. 

To take the match probability into account we need to calculate the likelihood ratio, i.e.

$\mbox{likelihood ratio} = 2,000,000 = \dfrac{\mbox{Probability of observing the DNA if the defendant is GUILTY}}{\mbox{Probability of observing the DNA if the defendant is INNOCENT}}$

The 2 million figure simply tells us how much more likely we are to observe the evidence if the woman is guilty than if she is innocent.  

Bayes theorem allows us to write

$\mbox{Posterior odds of guilt} = \mbox{likelihood ratio} \times \mbox{prior odds}$
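To see where this odds form comes from, write Bayes Theorem once for guilt and once for innocence and divide one by the other. The symbols $G$ (guilty), $I$ (innocent) and $E$ (the DNA evidence) are introduced here only for this sketch:

$P(G | E) = \dfrac{P(E | G) P(G)}{P(E)}$ and $P(I | E) = \dfrac{P(E | I) P(I)}{P(E)}$

$\dfrac{P(G | E)}{P(I | E)} = \dfrac{P(E | G)}{P(E | I)} \times \dfrac{P(G)}{P(I)}$

The evidence term $P(E)$ cancels, leaving the posterior odds on the left, and on the right the ratio of the two likelihoods multiplied by the prior odds.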

Assuming that this woman is no more likely to be guilty than any other woman in the local area we can estimate that there is a 1 in 300,000 chance that she is guilty - this is then our prior odds.

In [5]:
prior = 1./3e5
likelihood = 2e6

answer_odds = likelihood*prior # posterior odds in favour of guilt (likelihood ratio x prior odds)

print('Odds that the woman is guilty after seeing the DNA evidence are about {:.1f} to 1'.format(answer_odds))

answer = answer_odds/(answer_odds+1)*100 # convert the odds to a probability in per cent
print('Probability woman is guilty after seeing the DNA evidence is {:.0f}%'.format(answer))
print('So there is a high probability that she is guilty of the crime given DNA evidence but not 100% sure.')

Odds that the woman is guilty after seeing the DNA evidence are about 6.7 to 1
Probability woman is guilty after seeing the DNA evidence is 87%
So there is a high probability that she is guilty of the crime given DNA evidence but not 100% sure.


### Bayesian vs Frequentist:

Bayesian analysis has a somewhat formidable reputation for being extremely difficult… why is that?

- In general, the denominator can be difficult to evaluate
- Tricky integrals
- Often require numerical solutions
- Large (multivariate) parameter space
- In the 20th century, the development of Markov Chain Monte Carlo (MCMC) methods has made the evaluation of the integrals and the probabilities much easier. 

You will learn more about this later in the course!

**Bayesian statistician**
- Philosophy of science: we do not “rule out” models, just determine their probabilities 
- Argument: *“the prior probability is a logical necessity when assessing the probability of a model. It should be stated, and if it is unknown you can just use an uninformative (wide) prior”* 

**Frequentist statistician**
- Philosophy of science: we attempt to “rule out” or falsify models if $P$(data | model) is too small.
- Argument: *“setting the prior is subjective - two experimenters could use the same data to come to two different conclusions just by taking different priors”*

## To recap: how do we estimate probability?

In our examples of coin flipping, die rolling and card selecting, we introduced the notion that the probability of a particular outcome or event is simply the number of ways that event can occur, divided by the number of all possible (equally likely) outcomes.

But say you have a coin, and you want to know $P(H)$ -- how do you proceed? You could guess that the coin is fair and assign 0.5 to outcome heads/tails. But is the coin fair? One way you could test this is to perform lots of experiments (coin flips) and keep track of the outcome. If you do enough of these, eventually you will get an empirical measure,

$P(H) = \dfrac{n_H}{n_{\rm flips}}$

where $n_H$ is the number of heads that appeared in the experiment and $n_{\rm flips}$ is the number of times you flipped the coin (and counted the result). But when do you stop? Well, that depends on how accurately you want to know $P(H)$. But for now, we will simply note that this type of determination of $P$ is *frequentist*, in that the probability is defined by counting the instances of occurrence.
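The empirical measure can be explored with a simulated coin. The sketch below is an assumption-laden illustration: `estimate_p_heads`, its `p_true` and `seed` parameters, and the chosen numbers of flips are all hypothetical, and a real coin would of course replace the random number generator.

```python
import random

def estimate_p_heads(n_flips, p_true=0.5, seed=0):
    """Estimate P(H) as n_H / n_flips from simulated coin flips."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    # Each flip is a head with probability p_true
    n_heads = sum(rng.random() < p_true for _ in range(n_flips))
    return n_heads / n_flips

for n_flips in (10, 100, 1000, 10000, 100000):
    estimate = estimate_p_heads(n_flips)
    print('n_flips = {:6d}: P(H) ~ {:.4f}'.format(n_flips, estimate))
```

The scatter of the estimate around the true value shrinks roughly as $1/\sqrt{n_{\rm flips}}$, which gives one way of deciding when to stop flipping.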

However, what about the probability that it will rain tomorrow? You can see straight away that such a probability is more difficult to define. In fact, the use of Bayes Theorem, and in particular the prior, introduces the idea that $P$ represents the belief that something will occur.

***

Now you are ready to tackle the **Chapter 2 quiz** on Learning Central and the [Chapter 2 yourturn notebook](https://github.com/haleygomez/Data-Science-2024/blob/master/blended_exercises/Chapter2/Chapter2_yourturn.ipynb).