
```{index} axiomatic probability
```

# Axiomatic Probability


The methods of the previous two sections, which define probabilities using relative frequencies or properties of fair experiments, are helpful for developing some intuition about probability. However, these methods have limitations that restrict their usefulness for many real-world problems. Mathematicians working on probability recognized these limitations, which motivated them to develop an approach to probability that:
* is not based on a particular application or interpretation,
* agrees with models based on relative frequency and fair experiments,
* agrees with our intuition (where appropriate), and
* is useful for solving real-world problems.

The approach they developed is called *Axiomatic Probability*. Axiomatic means that there is a set of assumptions or rules (called axioms) for probability, but that the set of rules is made as small as possible. This approach may at first seem unnecessarily mathematical, but I believe that the reader will soon see that this approach will help them to develop a fundamentally sound understanding of probability.

```{index} probability space
```

## Probability Spaces



The first step in developing Axiomatic Probability is to define the core objects that the axioms apply to. We define a *Probability Space* as an ordered collection (tuple) of three objects, denoted by
$$
(S, \mathcal{F}, P)
$$

These objects are called the *sample space*, the *event class*, and the *probability measure*. Since there are three objects in a probability space, it is sometimes said that **probability is a triple** or **a probability space is a triple**.

**Sample Space**

We have already introduced the sample space in {doc}`outcomes-samplespaces-events`. It is a **set** containing all possible outcomes for an experiment.

**Event Class**

The second object, denoted by a calligraphic F ($\mathcal{F}$), is called the *event class*.

````{card}
DEFINITION
^^^
```{glossary}
event class
  For a sample space $S$ and a probability measure $P$, the event class, denoted by $\mathcal{F}$, is the collection of all subsets of $S$ to which we will assign probability (i.e., for which $P$ will be defined). The sets in $\mathcal{F}$ are called events.
```
````
 
 
We require that the event class be a $\sigma$-algebra (read “sigma algebra”) on $S$, which is a concise and mathematically precise way to say that combinations of events in $\mathcal{F}$ formed using (a finite or countably infinite number of) the usual set operations will still be events in $\mathcal{F}$.

For many readers of this book, the above explanation will be sufficient to understand what events are in $\mathcal{F}$. If you feel satisfied with this explanation, you may skip ahead to the heading **Event Class for Finite Sample Spaces**. If you want more mathematical depth and rigor, here are the properties that $\mathcal{F}$ must satisfy to be a $\sigma$-algebra on $S$:

1. $\mathcal{F}$ contains the sample space:<br>
$S \in \mathcal{F}$
1. $\mathcal{F}$ is **closed under complements**:<br>
If $A \in \mathcal{F}$, then $\overline{A} \in \mathcal{F}$.
1. $\mathcal{F}$ is **closed under countable unions**:<br>
If $A_1, A_2, \ldots$ are a finite or countably infinite number of sets in $\mathcal{F}$, then 
\begin{equation*}
\bigcup_i A_i \in \mathcal{F}
\end{equation*}

Note that these properties, combined with DeMorgan's Laws, immediately imply a few other properties:
* The null set $\emptyset$ is in $\mathcal{F}$ by combining properties 1 and 2: $S \in \mathcal{F}$, and so $\overline{S} =\emptyset \in \mathcal{F}$.
* $\mathcal{F}$ is **closed under countable intersections**. If $A_1, A_2, \ldots$ are a finite or countably infinite number of sets in $\mathcal{F}$, then by property 2, $\overline{A_1}, \overline{A_2}, \ldots$ are in $\mathcal{F}$. By property 3,
\begin{equation*}
\bigcup_i \overline{A_i} \in \mathcal{F}
\end{equation*}
If we apply DeMorgan's Laws to this expression, we have
\begin{equation*}
\overline{\bigcap_i A_i} \in \mathcal{F}
\end{equation*}
Then by applying property 2 again, we have that 
\begin{equation*}
{\bigcap_i A_i} \in \mathcal{F}
\end{equation*}
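
For a finite sample space, these properties can be checked mechanically. The following is a minimal sketch, not part of the formal development: the function name `is_sigma_algebra` and the example collections `F_good` and `F_bad` are made up for illustration, and the check relies on the sample space being finite, in which case closure under pairwise unions is enough to give closure under all unions.

```python
from itertools import combinations


def is_sigma_algebra(sample_space, event_class):
    """Check the three sigma-algebra properties for a FINITE sample space."""
    S = frozenset(sample_space)
    F = {frozenset(A) for A in event_class}

    # Property 1: the sample space itself is an event
    if S not in F:
        return False

    # Property 2: closed under complements
    if any(S - A not in F for A in F):
        return False

    # Property 3: closed under unions. For a finite collection, closure under
    # pairwise unions implies closure under all finite unions.
    return all(A | B in F for A, B in combinations(F, 2))


S = {1, 2, 3, 4}
F_good = [set(), {1, 2}, {3, 4}, S]
F_bad = [set(), {1}, S]  # the complement {2, 3, 4} is missing

print(is_sigma_algebra(S, F_good))  # True
print(is_sigma_algebra(S, F_bad))   # False
```

Note that `F_good` is a valid event class even though it is much smaller than the collection of all subsets of `S`.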

**Event Class for Finite Sample Spaces**

When $S$ is finite, we almost always use the same event class, which is to take $\mathcal{F}$ to be the *power set* of $S$:
````{card}
DEFINITION
^^^
```{glossary}
power set
  For a set $S$ with finite cardinality, $|S|=N < \infty$, the power set of $S$ is the set of all possible subsets of $S$. We will use the notation $2^S$ to denote the power set.
```
````



Note that the power set includes both the empty set ($\emptyset$) and $S$.

**Example**
Consider flipping a coin and observing the top face. Then $S=\{H,T\}$ and

$$
\mathcal{F} = 2^S = \bigl\{ \emptyset, \{H\}, \{T\}, \{H,T\} \bigr\}
$$

Note that $|S|=2$ and $|2^S| = 4 = 2^{|S|}$.
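
The power set of a small sample space can also be generated by computer. The sketch below uses Python's `itertools`; the helper name `power_set` is an illustrative choice, and events are stored as `frozenset` objects so that they can be members of another set.

```python
from itertools import chain, combinations


def power_set(sample_space):
    """Return the power set of a finite set as a set of frozensets."""
    items = list(sample_space)
    # All subsets of size 0, 1, ..., len(items)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}


S = {"H", "T"}
F = power_set(S)

print(len(F))                               # 4
print(len(F) == 2 ** len(S))                # True
print(frozenset() in F, frozenset(S) in F)  # True True
```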

**Exercise**

Consider rolling a standard six-sided die. Give the sample space, $S$,  and the power set of the sample space, $2^S$.  What is the cardinality of $2^S$?

When $|S|=\infty$, weird things can happen if we try to assign probabilities to every subset of $S$. For typical data science applications, we can assume that any event that we want to ask about will be in the event class, and we do not need to explicitly enumerate the event class.

**Probability Measure**

Until now, we have discussed the probabilities of outcomes. However, this is not the approach taken in probability spaces:

````{card}
DEFINITION
^^^
```{glossary}
probability measure
  The probability measure, $P$, is a real-valued set function that maps every element of the event class to the real line.
```
````

Note that in defining the probability measure, we do not specify the range of values for $P$, because at this point we are only defining the structure of the probability space through the types of elements that make it up.

Although $P$ assigns probabilities to events (as opposed to outcomes), the set containing any single outcome in $S$ is typically an event in the event class. Thus, $P$ is more general in its operation than we have considered in our previous examples. As explained in {doc}`outcomes-samplespaces-events`, an event occurs if the experiment's outcome is one of the outcomes in that event's set.
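
To make the distinction between outcomes and events concrete, here is one possible way to represent the probability space for the coin flip in Python. This is only a sketch: the dictionary-based representation is an illustrative choice, and it assumes that the coin is fair, which is not part of the definition of a probability space.

```python
from fractions import Fraction

# Sample space: the set of all outcomes
S = frozenset({"H", "T"})

# Event class: the power set of S, written out by hand
F = {frozenset(), frozenset({"H"}), frozenset({"T"}), S}

# Probability measure: a mapping from each EVENT (a set) to a real number.
# Assumption for illustration: the coin is fair.
P = {event: Fraction(len(event), len(S)) for event in F}

probability_space = (S, F, P)

print(P[frozenset({"H"})])  # 1/2
print(P[S])                 # 1
print("H" in P)             # False
```

The last line prints `False` because `P` is defined on the event `{"H"}` and not on the outcome `"H"` itself.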

```{index} axioms of probability
```

## Axioms of Probability

As previously mentioned, axioms are a minimal set of rules. There are three Axioms of Probability that are specified in terms of the probability measure:



**The Axioms of Probability**

**I.** For every event $E$ in the event class $\mathcal{F}$, $P(E) \ge 0$ 
*(the event probabilities are non-negative)*

**II.** $P(S) =1$   *(the probability that some outcome occurs is 1)*

**III.** For all pairs of events $E$ and $F$ in the event class that are disjoint ($E \cap F = \emptyset$), 
          $P( E \cup F) = P(E)+P(F)$ *(if two events are disjoint, then the probability that either one of the events occurs is equal to the sum of the event probabilities)*
          
When dealing with infinite sample spaces, an alternative version of Axiom III should be used:

**III'.** If $A_1, A_2, \ldots$ is a sequence of
          events that are pairwise disjoint ($A_i \cap A_j = \emptyset~ \forall i\ne j$),
          then 

$$
P \left( \bigcup_{k=1}^{\infty} A_k \right) = \sum_{k=1}^{\infty}
 P\left( A_k \right).
$$
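
For a finite sample space, Axioms I-III can be checked by brute force. The sketch below does this for a six-sided die with the power set as the event class. The names `satisfies_axioms`, `fair`, and `squared` are hypothetical helpers introduced for illustration; `squared` is a deliberately flawed set function that satisfies Axioms I and II but not Axiom III.

```python
from fractions import Fraction
from itertools import combinations


def satisfies_axioms(S, F, P):
    """Brute-force check of Axioms I-III on a FINITE probability space."""
    # Axiom I: every event has non-negative probability
    axiom_1 = all(P(E) >= 0 for E in F)
    # Axiom II: the sample space has probability 1
    axiom_2 = P(S) == 1
    # Axiom III: probabilities add for every pair of disjoint events
    axiom_3 = all(
        P(A | B) == P(A) + P(B)
        for A, B in combinations(F, 2)
        if not (A & B)
    )
    return axiom_1 and axiom_2 and axiom_3


S = frozenset(range(1, 7))
# Event class: the power set of S
F = {frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)}


def fair(E):
    """Fair-die measure: P(E) = |E| / |S|."""
    return Fraction(len(E), len(S))


def squared(E):
    """Flawed candidate: non-negative and equal to 1 on S, but not additive."""
    return Fraction(len(E) ** 2, len(S) ** 2)


print(satisfies_axioms(S, F, fair))     # True
print(satisfies_axioms(S, F, squared))  # False
```

Exact fractions are used instead of floating-point values so that the equality tests in Axioms II and III are not affected by rounding.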
            
<!-- *(Note that these sums and unions are over countably infinite sequences of events.)* -->

Many students of probability wonder why Axiom I does not specify that $0 \le P(E) \le 1$. The answer is that the second part of that inequality is not needed because it can be proven from the other axioms. Anything that is not required is removed to ensure that the axioms are a minimal set of rules. 
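
A brief sketch of that argument, using only the axioms and the fact that $\overline{E} \in \mathcal{F}$ whenever $E \in \mathcal{F}$: the events $E$ and $\overline{E}$ are disjoint, and $E \cup \overline{E} = S$, so
\begin{align*}
1 &= P(S) && \text{(Axiom II)} \\
&= P\left(E \cup \overline{E}\right) \\
&= P(E) + P\left(\overline{E}\right) && \text{(Axiom III)} \\
&\ge P(E) && \text{(Axiom I applied to } \overline{E} \text{)}.
\end{align*}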

Axiom III is a powerful tool for calculating probabilities. However, it must be used carefully. 

**Example**

A fair six-sided die is rolled twice. What is the probability that the top face on the first roll is less than 3? What is the probability that the top face on the second roll is less than 3?

First, let's define some notation for the events of interest:

Let $E_i$ denote the event that the top face on roll $i$ is less than 3.

Then 

$$
E_1=\{1_1, 2_1 \},
$$

where $k_l$ denotes the **outcome** that the top face is $k$ on roll $l$. Similarly,

$$E_2=\{1_2, 2_2 \}.$$
Note that we can rewrite

$$
E_i = \{1_i\} \cup \{2_i\},
$$ 

where $\cup$ is the union operator.  Because distinct outcomes are always disjoint, Axiom III can be applied to yield
\begin{align*}
P(E_i) &= P\left(\{1_i\} \cup \{2_i\} \right) \\
 &= P\left(\{1_i\} \right) + P \left( \{2_i\} \right) \\
 &= \frac{1}{6} + \frac{1}{6},
\end{align*}
where the last line comes from applying the probability of an outcome in a fair experiment. Thus, $P(E_i)=1/3$ for $i=1,2$. Most readers will intuitively have known this answer. 

**Example**

Consider the same experiment described in the previous example.  However, let's ask a slightly different question: what is the probability that either the value on the first die is less than 3 **or** the value on the second die is less than 3? (This also includes the case that both are less than 3.)  Mathematically, we write this as $P(E_1 \cup E_2)$ using the events already defined.

Since $E_1$ and $E_2$ correspond to events on completely different dice, it may be tempting to apply Axiom III as follows:
\begin{align*}
P(E_1 \cup E_2) &= P(E_1) + P(E_2) \\
&= \frac{1}{3} + \frac{1}{3}\\
&= \frac{2}{3}.
\end{align*}
However, it is easy to see that somehow this thinking is not correct. For example, if we defined events $G_i$ to be the event that the value on die $i$ is less than 5, this approach would imply that
\begin{align*}
P(G_1 \cup G_2) &= P(G_1) + P(G_2) \\
&= \frac{2}{3} + \frac{2}{3}\\
&= \frac{4}{3}.
\end{align*}
Hopefully the reader recognizes that this is not an allowed value for a probability! Let's delve in to see what went wrong. We can begin by estimating the true value of $P(E_1 \cup E_2)$ using simulation:

In [1]:
import numpy as np
import numpy.random as npr

num_sims = 100_000

# Generate the dice values for all simulations:
die1 = npr.randint(1, 7, size=num_sims)
die2 = npr.randint(1, 7, size=num_sims)

# Each comparison will generate an array of True/False values
E1occurred = die1 < 3
E2occurred = die2 < 3

# Use NumPy's element-wise OR operator (|) to return True where either array is True:
Eoccurred = E1occurred | E2occurred

# NumPy's count_nonzero function will count 1 for each True value and 0 for each False value
print("P(E1 or E2) =~", np.count_nonzero(Eoccurred) / num_sims)

P(E1 or E2) =~ 0.55788


The estimated probability is about 0.56, which is lower than the value of $2/3$ predicted by trying to apply Axiom III. The problem is that Axiom III does not apply in the way that it is used here because $E_1$ and $E_2$ are not disjoint: both can occur at the same time. Let's enumerate everything that could happen by writing the outcomes of die 1 and die 2 as a tuple, where $(j,k)$ means that die 1's outcome was $j$ and die 2's outcome was $k$.  

We will use colors to help denote when outcomes belong to a particular event. We start by printing all outcomes with the outcomes in event $E_1$ highlighted in blue:

In [2]:
# termcolor is a third-party package; install it first if it is not already available
from termcolor import colored

print("Outcomes in E1 are in blue:")
for j in range(1, 7):
    for k in range(1, 7):
        if j < 3:
            print(colored("(" + str(j) + ", " + str(k) + ")   ", "blue"), end="")
        else:
            print("(" + str(j) + ", " + str(k) + ")   ", end="")
    print()

Outcomes in E1 are in blue:
[34m(1, 1)   [0m[34m(1, 2)   [0m[34m(1, 3)   [0m[34m(1, 4)   [0m[34m(1, 5)   [0m[34m(1, 6)   [0m
[34m(2, 1)   [0m[34m(2, 2)   [0m[34m(2, 3)   [0m[34m(2, 4)   [0m[34m(2, 5)   [0m[34m(2, 6)   [0m
(3, 1)   (3, 2)   (3, 3)   (3, 4)   (3, 5)   (3, 6)   
(4, 1)   (4, 2)   (4, 3)   (4, 4)   (4, 5)   (4, 6)   
(5, 1)   (5, 2)   (5, 3)   (5, 4)   (5, 5)   (5, 6)   
(6, 1)   (6, 2)   (6, 3)   (6, 4)   (6, 5)   (6, 6)   


We can easily modify this to highlight the outcomes in $E_2$ in green:

In [3]:
print("Outcomes in E2 are in green:")
for j in range(1, 7):
    for k in range(1, 7):
        if k < 3:
            print(colored("(" + str(j) + ", " + str(k) + ")   ", "green"), end="")
        else:
            print("(" + str(j) + ", " + str(k) + ")   ", end="")
    print()

Outcomes in E2 are in green:
[32m(1, 1)   [0m[32m(1, 2)   [0m(1, 3)   (1, 4)   (1, 5)   (1, 6)   
[32m(2, 1)   [0m[32m(2, 2)   [0m(2, 3)   (2, 4)   (2, 5)   (2, 6)   
[32m(3, 1)   [0m[32m(3, 2)   [0m(3, 3)   (3, 4)   (3, 5)   (3, 6)   
[32m(4, 1)   [0m[32m(4, 2)   [0m(4, 3)   (4, 4)   (4, 5)   (4, 6)   
[32m(5, 1)   [0m[32m(5, 2)   [0m(5, 3)   (5, 4)   (5, 5)   (5, 6)   
[32m(6, 1)   [0m[32m(6, 2)   [0m(6, 3)   (6, 4)   (6, 5)   (6, 6)   


Note that we can already see that the set of outcomes in $E_1$ overlaps with the set of outcomes in $E_2$. To make that explicit, let's highlight the outcomes that are in both $E_1$ and $E_2$ in red: 

In [4]:
print("Outcomes in both E1 and E2 are in red:")
for j in range(1, 7):
    for k in range(1, 7):
        if j < 3 and k < 3:
            print(colored("(" + str(j) + ", " + str(k) + ")   ", "red"), end="")
        else:
            print("(" + str(j) + ", " + str(k) + ")   ", end="")
    print()

Outcomes in both E1 and E2 are in red:
[31m(1, 1)   [0m[31m(1, 2)   [0m(1, 3)   (1, 4)   (1, 5)   (1, 6)   
[31m(2, 1)   [0m[31m(2, 2)   [0m(2, 3)   (2, 4)   (2, 5)   (2, 6)   
(3, 1)   (3, 2)   (3, 3)   (3, 4)   (3, 5)   (3, 6)   
(4, 1)   (4, 2)   (4, 3)   (4, 4)   (4, 5)   (4, 6)   
(5, 1)   (5, 2)   (5, 3)   (5, 4)   (5, 5)   (5, 6)   
(6, 1)   (6, 2)   (6, 3)   (6, 4)   (6, 5)   (6, 6)   


So, does this mean that we cannot use Axiom III to solve this problem? No. We just have to be more careful. Let's highlight all the outcomes that belong to $E_1 \cup E_2$ with a yellow background. Let's also count these as we go:

In [5]:
print("Outcomes in  E1 OR E2 are on a yellow background:")
count = 0

for j in range(1, 7):
    for k in range(1, 7):
        if j < 3 or k < 3:
            print(
                colored("(" + str(j) + ", " + str(k) + ")   ", on_color="on_yellow"),
                end="",
            )
            count += 1
        else:
            print("(" + str(j) + ", " + str(k) + ")   ", end="")
    print()

print()
print("Number of outcomes in E1 OR E2 is", count)

Outcomes in  E1 OR E2 are on a yellow background:
[43m(1, 1)   [0m[43m(1, 2)   [0m[43m(1, 3)   [0m[43m(1, 4)   [0m[43m(1, 5)   [0m[43m(1, 6)   [0m
[43m(2, 1)   [0m[43m(2, 2)   [0m[43m(2, 3)   [0m[43m(2, 4)   [0m[43m(2, 5)   [0m[43m(2, 6)   [0m
[43m(3, 1)   [0m[43m(3, 2)   [0m(3, 3)   (3, 4)   (3, 5)   (3, 6)   
[43m(4, 1)   [0m[43m(4, 2)   [0m(4, 3)   (4, 4)   (4, 5)   (4, 6)   
[43m(5, 1)   [0m[43m(5, 2)   [0m(5, 3)   (5, 4)   (5, 5)   (5, 6)   
[43m(6, 1)   [0m[43m(6, 2)   [0m(6, 3)   (6, 4)   (6, 5)   (6, 6)   

Number of outcomes in E1 OR E2 is 20


If an event is written in terms of a set of $K$ **outcomes** $o_0, o_1, \ldots, o_{K-1}$, and the experiment is fair and has $N$ total outcomes, then Axiom III can be applied to calculate the probability as 
\begin{align*}
P(E) &= P \left(\left\{o_0, o_1, \ldots, o_{K-1} \right\} \right) \\
&= P \left( \{o_0\} \right) + P \left( \{o_1\} \right) +  \ldots + P \left( \{o_{K-1}\}  \right) \\
&= \frac{1}{N} + \frac{1}{N} + \ldots + \frac{1}{N} \quad \text{(total of } K \text{ terms)} \\
&= \frac{K}{N} 
\end{align*}
We believe that this experiment is fair and that any of the 36 total outcomes is equally likely to occur. The form above is general to any event for a fair experiment, and it is convenient to rewrite it in terms of set cardinalities as

$$
P(E) = \frac{|E|}{|S|}.
$$

Applying this to our example, we can easily calculate the probability we are looking for as
\begin{align*}
P\left( E_1 \cup E_2 \right) &= \frac{ \left \vert E_1 \cup E_2 \right \vert } { \left \vert S \right \vert} \\ 
&= \frac{20}{36} \\
&= \frac{5}{9}
\end{align*}
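
The same calculation can be carried out with Python sets, which makes the cardinalities explicit. This is a sketch under the assumption stated above that all 36 outcomes are equally likely; `prob_fair` is a helper name introduced here for illustration. The last line revisits the events $G_1$ and $G_2$ (value less than 5), for which the incorrect use of Axiom III gave $4/3$.

```python
from fractions import Fraction
from itertools import product

# Sample space of the composite experiment: all (die 1, die 2) pairs
S = set(product(range(1, 7), repeat=2))

# Events written as sets of outcomes
E1 = {(j, k) for (j, k) in S if j < 3}
E2 = {(j, k) for (j, k) in S if k < 3}
G1 = {(j, k) for (j, k) in S if j < 5}
G2 = {(j, k) for (j, k) in S if k < 5}


def prob_fair(event, sample_space):
    """P(E) = |E| / |S|, valid only when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


print(len(S), len(E1 | E2))   # 36 20
print(prob_fair(E1 | E2, S))  # 5/9
print(prob_fair(G1 | G2, S))  # 8/9
```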

Note that the calculated value matches our estimate from the simulation:

In [6]:
5 / 9

0.5555555555555556

The key to making this work was that we had to realize several things:
* $E_1$ and $E_2$ are not outcomes. They are events, and they can occur at the same time.
* The outcomes of the experiment are the combination of the outcomes from the individual rolls of the two dice.
* The composite experiment is still a fair experiment. It is easy to calculate probabilities using Axiom III and the properties of fair experiments once we can determine the number of outcomes in the event of interest.
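
One way to see that Axiom III still applies, provided the events really are disjoint, is to split $E_1 \cup E_2$ into two disjoint pieces: $E_1$, and the part of $E_2$ that is not in $E_1$. The sketch below checks this numerically; the names `E2_only` and `prob_fair` are introduced for illustration, and the fair-experiment assumption is used to compute each probability.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))
E1 = {(j, k) for (j, k) in S if j < 3}
E2 = {(j, k) for (j, k) in S if k < 3}


def prob_fair(event):
    """P(E) = |E| / |S| for the fair two-dice experiment."""
    return Fraction(len(event), len(S))


# E1 and E2 share outcomes, so Axiom III does not apply to them directly
print(len(E1 & E2))  # 4

# E1 and (E2 without E1) are disjoint and have the same union as E1 and E2
E2_only = E2 - E1
print(E1 & E2_only == set())       # True
print(E1 | E2_only == E1 | E2)     # True

# Axiom III applied to the disjoint pieces
print(prob_fair(E1), prob_fair(E2_only))   # 1/3 2/9
print(prob_fair(E1) + prob_fair(E2_only))  # 5/9
```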

However, we can see that the solution method is still lacking in some ways:
* It only works for fair experiments.
* It requires enumeration of the outcomes in the event -- this may be challenging to do without a computer and may not scale well.
<!-- * It will not work if the trials are not independent.-->

Some of the difficulties in solving this problem come from not having a larger toolbox; i.e., the axioms provide a very limited set of equations for working with probabilities. In the next section, we explore several corollaries to the axioms and show how these can be used to simplify some problems in probability.

## Terminology Review

Use the flashcards below to help you review the terminology introduced in this chapter.

In [7]:
from jupytercards import display_flashcards

github='https://raw.githubusercontent.com/jmshea/Foundations-of-Data-Science-with-Python/main/04-probability1/flashcards/'
display_flashcards(github+'axiomatic-prob.json')