# Philstats Week 1 Thursday

**Outline**

[Review of set-theoretic operations](#Review-of-set-theoretic-operations)

[Probability axioms](#Probability-axioms)

[Conditioning and conditionals](#Conditioning-and-conditionals)

[Interpretations of probability](#Interpretations-of-probability)


# Review of set-theoretic operations

Recall from last time:

**Set-theoretic operations**

We are usually just working with subsets of a given set $\Omega$, which is fixed by the topic at hand.

There are some operations that we need to remind ourselves of:

- Intersection of $A,B$, symbol $A\cap B$, definition: $A\cap B= \{x\in \Omega: x\in A \wedge x\in B\}$ 

- Union of $A,B$, symbol $A\cup B$, definition: $A\cup B= \{x\in \Omega: x\in A \vee x\in B\}$ 

- Difference of $A$ from $B$, symbol $A\mbox{-} B$, definition: $A\mbox{-} B = \{x\in \Omega: x\in A \wedge x\notin B\}$ 

- Relative complement of $A$, symbol $A^c$ or $\overline{A}$, definition: $A^c = \overline{A} =\{x\in \Omega: x\notin A\}$

The empty set $\emptyset$ is the subset of $\Omega$ that has no elements.
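As a quick illustration, these operations correspond directly to Python's built-in set operators. The sample space `omega` (the faces of a die) and the events `A`, `B` are hypothetical choices made only for this sketch.

```python
# Hypothetical sample space: the faces of a six-sided die.
omega = frozenset({1, 2, 3, 4, 5, 6})
A = frozenset({1, 2, 3})
B = frozenset({3, 4})

intersection = A & B      # A ∩ B
union = A | B             # A ∪ B
difference = A - B        # elements of A that are not in B
complement_A = omega - A  # A^c, taken relative to omega
empty = frozenset()       # the empty set

print(intersection, union, difference, complement_A, empty)
```

Note that the complement only makes sense once `omega` is fixed, which is why it is computed as `omega - A`.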

---

**Definition** (subset)

A set $A$ is a *subset* of $B$ if every element of $A$ is also an element of $B$.

We write $A\subseteq B$ for $A$ is a subset of $B$. 

For instance $\{0,1,2\}$ is a subset of $\{0,1,2,3,4\}$ since each number in the first set is in the second set.

But we do not have $\{0,1,2\}$ is a subset of $\{1,2,3,4\}$ since $0$ is in the first set but not in the second set.

---

**Definition** (disjointness)

Sets $A,B$ are *disjoint* if $A\cap B=\emptyset$.

For instance, $\{1,2,3\}$ and $\{8,9,10\}$ are disjoint.

Sets $A,B,C$ are *pairwise disjoint* if $A,B$ are disjoint and $A,C$ are disjoint and $B,C$ are disjoint.

For instance, $\{1,2\}, \{8,9\}, \{15,16\}$ are pairwise disjoint.

We define pairwise disjointness for longer sequences of sets similarly.
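The subset and disjointness definitions above can be sketched as small Python helpers. The function names `is_subset` and `pairwise_disjoint` are hypothetical, chosen for this example; Python's own `<=` and `isdisjoint` would do the same job for two sets.

```python
from itertools import combinations

def is_subset(a, b):
    """True if every element of a is also an element of b."""
    return all(x in b for x in a)

def pairwise_disjoint(*sets):
    """True if every pair of sets in the sequence has empty intersection."""
    return all(not (s & t) for s, t in combinations(sets, 2))

print(is_subset({0, 1, 2}, {0, 1, 2, 3, 4}))
print(is_subset({0, 1, 2}, {1, 2, 3, 4}))
print(pairwise_disjoint({1, 2}, {8, 9}, {15, 16}))
```

Because `pairwise_disjoint` checks every pair, it works unchanged for longer sequences of sets.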

---

**Recall** from last time:

|  | Philosophy | Probability | Logic | Math |
|:----------:|:----------:|:----------:|:----------:|:----------:|
| $\Omega$ | Set of worlds | Sample space | Set of models | Underlying set |
| $\{A: A\subseteq \Omega\}$ | Set of propositions | Event space | Sets determined by formulas | Powerset of underlying set |
| $X:\Omega\rightarrow \mathbb{R}$ | n/a | Random variable | n/a | Real-valued function |
| $\{\omega\in \Omega: X(\omega)\in R\}$ | n/a | the event $X\in R$ | n/a | the set $X^{-1}(R)$, called the inverse image |

In the last row, $R$ is a subset of the reals $\mathbb{R}$.
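To make the last two rows of the table concrete, here is a minimal sketch of a random variable and the event it determines. The sample space (two coin flips written as strings) and the names `X` and `event` are assumptions made for illustration.

```python
# Hypothetical sample space: two coin flips, written as strings of 0s and 1s.
omega = {"00", "01", "10", "11"}

def X(outcome):
    """A random variable: the number of 1s in the outcome."""
    return outcome.count("1")

def event(X, R, omega):
    """The event 'X in R', i.e. the inverse image of R under X."""
    return {w for w in omega if X(w) in R}

at_least_one = event(X, {1, 2}, omega)
print(sorted(at_least_one))
```

The event is just a subset of `omega`, so it can be fed to a probability measure like any other event.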

**In what follows, we use the probability terminology.**

# Probability axioms

**Definition** (probability axioms)

A *probability measure* $P$ is a function from subsets of the space $\Omega$ to real numbers  satisfying the following for all events $A,B\subseteq \Omega$:

- Non-negativity: $P(A)\geq 0$ 

- Finite additivity: $P(A\cup B)=P(A)+P(B)$ if $A\cap B=\emptyset$ 

- Value of entire space: $P(\Omega)=1$
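On a small finite space the three axioms can be checked by brute force over all events. The sketch below assumes a measure built from point weights; the names `powerset`, `make_measure`, `satisfies_axioms` and the particular weights are hypothetical.

```python
from fractions import Fraction
from itertools import chain, combinations

def powerset(omega):
    """All subsets of omega, as frozensets."""
    items = sorted(omega)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

def make_measure(weights):
    """Build P from point weights: P(A) is the sum of the weights in A."""
    def P(A):
        return sum((weights[w] for w in A), Fraction(0))
    return P

def satisfies_axioms(P, omega):
    """Check non-negativity, finite additivity, and P(omega) = 1."""
    events = powerset(omega)
    nonneg = all(P(A) >= 0 for A in events)
    additive = all(P(A | B) == P(A) + P(B)
                   for A in events for B in events if not (A & B))
    total = P(frozenset(omega)) == 1
    return nonneg and additive and total

omega = {"a", "b", "c"}
weights = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
P = make_measure(weights)
print(satisfies_axioms(P, omega))
```

Exact fractions are used instead of floats so that the additivity check is not affected by rounding.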

---

Again, if $\Omega$ is infinite, one will in general need to restrict attention to Borel events (or some other reasonably defined class of events). And again, we ignore this issue here, not because it is unimportant but because it is a slightly different subject (measure theory and/or descriptive set theory).

For the rest of this section, we fix $\Omega$ and only consider events (sets) which are subsets of $\Omega$.

**Proposition** (value of complements)

For all events $A$, we have: $P(A^c)=1-P(A)$

*Proof*: 

We always have $\Omega = (\Omega\mbox{-}A)\cup A$ and $\Omega\mbox{-}A, A$ are disjoint (Draw the picture)

Then by finite additivity one has:

$1 = P(\Omega) = P((\Omega\mbox{-}A)\cup A) = P(\Omega\mbox{-}A)+P(A)=P(A^c)+P(A)$

Then subtract $P(A)$ from both sides.

**Corollary** (value for emptyset)

We have $P(\emptyset)=0$

*Proof*:

Since $\emptyset^c =\Omega$ we have

$P(\emptyset) = 1-P(\emptyset^c) = 1-P(\Omega)=1-1=0$.

**Proposition** (monotonicity)

For all events $A,B$ if $A\subseteq B$ then $P(A)\leq P(B)$

*Proof*:

When $A\subseteq B$ we have $B=A\cup (B\mbox{-}A)$ and $A,B\mbox{-}A$ are disjoint. (Draw the picture)

We then appeal to finite additivity as follows:

$P(B) = P(A)+P(B\mbox{-}A)\geq P(A)$

For the last inequality, we appeal to non-negativity.

**Corollary** (values in the interval 0 to 1)

For all events $A$, we have $0\leq P(A)\leq 1$.

*Proof*:

We always have $A\subseteq \Omega$. 

Hence by monotonicity $P(A)\leq P(\Omega) =1$. The lower bound $0\leq P(A)$ is just non-negativity.

**Proposition** (finite additivity, redux)

For pairwise disjoint events $A,B,C$ one has

$P(A\cup B\cup C) = P(A)+P(B)+P(C)$

*Proof*:

If $A,B,C$ are pairwise disjoint, then $A, B\cup C$ are disjoint too (draw the picture). Then by two applications of finite additivity we have

$P(A\cup B \cup C) = P(A)+P(B\cup C)=P(A)+P(B)+P(C)$

*Note 1*:

In this proof, we are relying on rules of sets, such as associativity of union, which says that $(A\cup B)\cup C=A\cup (B\cup C)$, and so we can "drop parentheses" and just write it as $A\cup B\cup C$.

*Note 2*:

The same proposition holds for any finite sequence of pairwise disjoint events.

**Proposition** (inclusion-exclusion)

For all events $A,B$ we have 

$P(A\cup B) = P(A)+P(B)-P(A\cap B)$

*Proof* (less detailed):

$P(A\cup B) = P(A\mbox{-}B) + P(B\mbox{-}A) + P(A\cap B)$

$\hspace{20mm} = P(A)-P(A\cap B)+P(B)-P(A\cap B)+P(A\cap B)$

$\hspace{20mm} = P(A)+P(B)-P(A\cap B)$

*Proof* (more detailed):

We have that $A\cup B = (A\mbox{-}B)\cup (B\mbox{-}A) \cup (A\cap B)$ and $A\mbox{-}B, B\mbox{-}A, A\cap B$ are pairwise disjoint (draw the picture).

Hence by finite additivity we have

(1) $P(A\cup B) = P(A\mbox{-}B) + P(B\mbox{-}A) + P(A\cap B)$

But we also have $A=(A\mbox{-}B)\cup (A\cap B)$ and $B=(B\mbox{-}A)\cup (A\cap B)$ and the relevant sets are disjoint (draw the picture)

Hence by finite additivity and some subtraction we get:

(2) $P(A)=P(A\mbox{-}B)+ P(A\cap B)$

(2') $P(A\mbox{-}B)=P(A)-P(A\cap B)$

(3) $P(B)=P(B\mbox{-}A)+P(A\cap B)$

(3') $P(B\mbox{-}A)=P(B)-P(A\cap B)$

We then just chain together (1), (2'), (3') to get

$P(A\cup B) = P(A\mbox{-}B) + P(B\mbox{-}A) + P(A\cap B)$

$\hspace{20mm} = P(A)-P(A\cap B)+P(B)-P(A\cap B)+P(A\cap B)$

$\hspace{20mm} = P(A)+P(B)-P(A\cap B)$
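The propositions of this section can be spot-checked numerically. The sketch below assumes the uniform measure on a fair six-sided die; the events `A` and `B` are hypothetical, and a check on one example is of course not a proof.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # hypothetical example: a fair six-sided die

def P(A):
    """Uniform measure on omega."""
    return Fraction(len(A & omega), len(omega))

A = frozenset({1, 2, 3})
B = frozenset({2, 4, 6})

complement_ok = P(omega - A) == 1 - P(A)           # value of complements
empty_ok = P(frozenset()) == 0                     # value for the empty set
monotone_ok = P(A & B) <= P(A)                     # A ∩ B is a subset of A
incl_excl_ok = P(A | B) == P(A) + P(B) - P(A & B)  # inclusion-exclusion

print(complement_ok, empty_ok, monotone_ok, incl_excl_ok)
```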

# Conditioning and conditionals

**Definition** (Conditional probability)

Suppose that $P$ is a probability measure and $P(E)>0$. Then the *conditional probability* $P(H\mid E)$ of $H$ given $E$ is $$P(H\mid E)=\frac{P(H\cap E)}{P(E)}$$

**Alternate notation** (Subscript notation for conditional probability)

We also write: $P_E(H)=P(H|E)$.

**Convention** (Assume your interlocutor is not dividing by zero)

Whenever anyone writes $P(H|E)$ or $P_E(H)$, assume that they are restricting their statement to the case where $P(E)>0$. 

This saves us from needing to write out this hypothesis over and over again.
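A minimal sketch of the definition, again assuming the uniform measure on a fair die. The helper name `conditional` is hypothetical; it makes the convention explicit by raising an error when $P(E)=0$.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # hypothetical example: a fair six-sided die

def P(A):
    """Uniform measure on omega."""
    return Fraction(len(A & omega), len(omega))

def conditional(P, H, E):
    """P(H | E) = P(H ∩ E) / P(E), defined only when P(E) > 0."""
    if P(E) == 0:
        raise ValueError("P(E) must be positive")
    return P(H & E) / P(E)

even = frozenset({2, 4, 6})
at_least_four = frozenset({4, 5, 6})
p = conditional(P, even, at_least_four)
print(p)  # 2/3
```

Here two of the three outcomes in `at_least_four` are even, which is where the value $\frac{2}{3}$ comes from.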

**Proposition** (repeated conditioning)

$P_E(H\mid E') = P(H\mid E\cap E')$

*Proof*:

$$ P_E(H\mid E') =\frac{P_E(H\cap E')}{P_E(E')} = \frac{P(H\cap E'\cap E)/P(E)}{P(E'\cap E)/P(E)} = \frac{P(H\cap E'\cap E)}{P(E\cap E')} = P(H\mid E\cap E')$$

**Proposition** (conditioning induces a probability measure)

For all $E$, one has that $P_E$ is also a probability measure.

*Proof*: Non-negativity: $P_E(A)=\frac{P(A\cap E)}{P(E)}\geq 0$. 

Finite additivity: assuming $A,B$ are disjoint, we have 

$P_E(A)+P_E(B)=\frac{P(A\cap E)}{P(E)}+\frac{P(B\cap E)}{P(E)}=\frac{P((A\cap E)\cup (B\cap E))}{P(E)}=\frac{P((A\cup B)\cap E)}{P(E)}=P_E(A\cup B)$

The first identity is definition, the second is $A\cap E, B\cap E$ disjoint (draw picture); the third identity is distributivity; the last identity is definition.

Value of the whole space: $P_E(\Omega) = \frac{P(\Omega\cap E)}{P(E)} = \frac{P(E)}{P(E)}=1$ since $\Omega\cap E=E$.
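Since $P_E$ is itself a probability measure, it can be conditioned again, which is what the repeated conditioning proposition describes. The sketch below assumes the uniform measure on a fair die; the helper `condition` and the events `E`, `E2`, `H` are hypothetical.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # hypothetical example: a fair six-sided die

def P(A):
    """Uniform measure on omega."""
    return Fraction(len(A & omega), len(omega))

def condition(P, E):
    """Return the function P_E, where P_E(A) = P(A ∩ E) / P(E)."""
    p_e = P(E)
    if p_e == 0:
        raise ValueError("P(E) must be positive")
    return lambda A: P(A & E) / p_e

E = frozenset({1, 2, 3, 4})
E2 = frozenset({3, 4, 5})   # plays the role of E'
H = frozenset({4, 6})

P_E = condition(P, E)
lhs = condition(P_E, E2)(H)    # P_E(H | E')
rhs = condition(P, E & E2)(H)  # P(H | E ∩ E')
print(lhs, rhs, P_E(omega))
```

Returning a function from `condition` mirrors the subscript notation: `P_E` is used exactly like `P`.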

**Theorem** (Bayes' theorem / formula)

$$P(H\mid E) = \frac{P(E\mid H)\cdot P(H)}{P(E)}$$

*Proof*:

Bayes' formula can be verified by working from the right-hand side to the left-hand side, with $P(H)$ cancelling in the numerator and denominator:

$$ \frac{P(E\mid H)\cdot P(H)}{P(E)} = \frac{P(E\cap H)\cdot P(H)}{P(H)\cdot P(E)} = \frac{P(H\cap E)}{P(E)} = P(H\mid E)$$

**Definition** (Likelihood, Prior, Posterior)

In the context of Bayes' Theorem, one has names for certain quantities:

$$P(H\mid E) = \frac{P(E\mid H)\cdot P(H)}{P(E)}, \hspace{20mm} \mathrm{posterior\;probability}= \frac{\mathrm{likelihood} \times \mathrm{prior\;probability} }{\mathrm{probability\;of\;evidence} } $$

In English: the probability of a hypothesis conditional on evidence is equal to the likelihood of the evidence given the hypothesis times the prior probability associated to the hypothesis, divided by the probability associated to the evidence.

Bayes' theorem is useful in situations where one already has some sense of which hypotheses make the evidence at hand more or less probable.
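As a sanity check, the four named quantities can be computed separately and compared. This sketch assumes the uniform measure on a fair die, with hypothetical events `H` ("even") and `E` ("at least four").

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # hypothetical example: a fair six-sided die

def P(A):
    """Uniform measure on omega."""
    return Fraction(len(A & omega), len(omega))

def cond(H, E):
    """P(H | E), assuming P(E) > 0."""
    return P(H & E) / P(E)

H = frozenset({2, 4, 6})  # hypothesis: the roll is even
E = frozenset({4, 5, 6})  # evidence: the roll is at least four

posterior = cond(H, E)
likelihood = cond(E, H)
prior = P(H)
evidence = P(E)

print(posterior, likelihood * prior / evidence)
```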

**Proposition** (law of total probability)

$P(H) = P(H|E)\cdot P(E)+P(H|E^c)\cdot P(E^c)$

*Proof*:

We have $H=(H\cap E)\cup (H\cap E^c)$ and $H\cap E, H\cap E^c$ are disjoint. 

Then $P(H)=P(H\cap E) +P(H\cap E^c) = P(H|E)\cdot P(E)+P(H|E^c)\cdot P(E^c)$
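The proposition above is often used, with the roles of $H$ and $E$ swapped, to compute the denominator $P(E)$ in Bayes' theorem from $P(E\mid H)$, $P(E\mid H^c)$ and $P(H)$. The function name `posterior` and the numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_given_not):
    """P(H|E) from P(H), P(E|H) and P(E|H^c).

    The evidence term P(E) is expanded as
    P(E|H) * P(H) + P(E|H^c) * P(H^c).
    """
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: a rare hypothesis and fairly reliable evidence.
p = posterior(prior=Fraction(1, 100),
              likelihood=Fraction(9, 10),
              likelihood_given_not=Fraction(1, 20))
print(p)  # 2/13
```

With these numbers the posterior is about $0.15$: much higher than the prior of $0.01$, but still small, because the hypothesis was improbable to begin with.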


---

**Theorem** (Lewis-Stalnaker triviality)

Suppose that conditional probability were *factive*, in that there is a binary operation $\Rightarrow$ on subsets of $\Omega$ such that $P(H|E) = P(E\Rightarrow H)$ for all $P,H,E$ with $P(E)>0$.

Then conditional probability is *trivial* in that $P(H|E) = P(H)$ for all $E,H$ with $P(H\cap E), P(E\cap H^c)>0$.

*Proof*:

$P(H|E)=P(E\Rightarrow H) = P(E\Rightarrow H\mid H)\cdot P(H)+P(E\Rightarrow H\mid H^c)\cdot P(H^c)$

$\hspace{15mm} = P_H(E\Rightarrow H)\cdot P(H)+P_{H^c}(E\Rightarrow H)\cdot P(H^c)$

$\hspace{15mm} = P_H(H|E)\cdot P(H)+P_{H^c}(H|E)\cdot P(H^c)$

$\hspace{15mm} = P(H|E\cap H)\cdot P(H)+P(H|E\cap H^c)\cdot P(H^c)$ by repeated conditioning

$\hspace{15mm} = 1\cdot P(H)+0 \cdot P(H^c)$

$\hspace{15mm} = P(H)$

*Note*: next week we will recognize the 'triviality' as a kind of 'too much independence.'

# Interpretations of probability

**Laplace and the principle of indifference**

From the *Philosophical Essay on Probabilities* (p. 4):

>The theory of chances consists in reducing all events of the same kind to a certain number of equally possible cases, that is to say, to cases whose existence we are equally uncertain of, and in determining the number of cases favourable to the event whose probability is sought. The ratio of this number to that of all possible cases is the measure of this probability, which is thus only a fraction whose numerator is the number of favourable cases, and whose denominator is the number of all possible cases.

It seems he is thus interested in what we would call today the *uniform* measure on a finite space. 

If $\Omega$ has cardinality $n$ and an event $A$ has $m$ elements, then we define $P(A)=\frac{m}{n}$.

If we use $\left|\cdot\right|$ for cardinality (or size), then this can be written as $P(A)=\frac{\left|A\right|}{\left|\Omega\right|}$.
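Laplace's ratio of favourable cases to all cases is easy to compute directly. The example below (two dice, sum equal to seven) is a hypothetical illustration, not one taken from Laplace.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two six-sided dice.
omega = set(product(range(1, 7), repeat=2))

def P(A):
    """Uniform measure: favourable cases over all cases."""
    return Fraction(len(A), len(omega))

sum_is_seven = {w for w in omega if sum(w) == 7}
print(P(sum_is_seven))  # 1/6
```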

Virtues: connects our epistemic state to the probabilities, and gets simple cases of fair coin flips right. 

Vices: more often than not, we are not interested in the uniform measure.

**von Mises and frequentism**

Given a space $\Omega$, consider the "superspace" $\Omega^{\mathbb{N}}=\{f:\mathbb{N}\rightarrow \Omega\}$, i.e. the set of all functions from the natural numbers to $\Omega$. 

For instance if $\Omega =\{0,1\}$, then $\Omega^{\mathbb{N}}$ is *Cantor space*, the space of all infinite sequences of zeros and ones. This is a good initial representation of the space of all infinite sequences of coin flips. 

Then for such a function $f$, define $P_f(A) = \lim_n \frac{\left|\{m\leq n: f(m)\in A\}\right|}{n}$.

That is, it is the limiting relative frequency of $A$'s that appear in the sequence $f$.

**Virtues of frequentism**

One can derive the elementary laws of probability. 

For instance, if $A,B$ are disjoint and the relevant limits exist, then

$P_f(A\cup B) = \lim_n \frac{\left|\{m\leq n: f(m)\in A\cup B \}\right|}{n}$

$\hspace{15mm} = \lim_n \frac{\left|\{m\leq n: f(m)\in A \}\right|+\left|\{m\leq n: f(m)\in B \}\right|}{n}$

$\hspace{15mm} = \lim_n \frac{\left|\{m\leq n: f(m)\in A \}\right|}{n}+\lim_n \frac{\left|\{m\leq n: f(m)\in B \}\right|}{n}$

$\hspace{15mm} = P_f(A)+P_f(B)$
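A computer can only inspect finite prefixes, so the sketch below computes the relative frequency at a fixed $n$ rather than the limit. It assumes that $m$ ranges over $1,\ldots,n$; the sequence `f` (which cycles through `0, 1, 2`) and the name `rel_freq` are hypothetical.

```python
from fractions import Fraction

def rel_freq(f, A, n):
    """Relative frequency of A among f(1), ..., f(n)."""
    return Fraction(sum(1 for m in range(1, n + 1) if f(m) in A), n)

def f(m):
    """A hypothetical sequence in {0, 1, 2}: it cycles 1, 2, 0, 1, 2, 0, ..."""
    return m % 3

A, B = {0}, {1}  # disjoint events
n = 3000

print(rel_freq(f, A, n), rel_freq(f, B, n), rel_freq(f, A | B, n))
```

Additivity already holds at every finite $n$ for disjoint events, since each index is counted in at most one of the two sets. The limit step in the derivation only needs the two limits to exist.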

**Vices of frequentism**
    
One has to assume that the limits exist, and von Mises postulates this rather than explains it. Some sequences don't have limits. Imagine a sequence that has long stretches where the frequency is $\frac{1}{3}$ followed by long stretches where the frequency is $\frac{2}{3}$, repeated ad nauseam.
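Here is one way to build a sequence whose relative frequencies never settle down. It is a variation on the idea above rather than the exact construction: the sequence is all ones on the block $[2^k, 2^{k+1})$ when $k$ is even and all zeros when $k$ is odd, and the running frequency of ones swings between $\frac{1}{3}$ and roughly $\frac{2}{3}$. The names `f` and `rel_freq` are hypothetical, and $m$ is assumed to start at $1$.

```python
from fractions import Fraction

def f(m):
    """1 on blocks [2^k, 2^(k+1)) with k even, 0 on blocks with k odd (m >= 1)."""
    k = m.bit_length() - 1  # k = floor(log2(m))
    return 1 if k % 2 == 0 else 0

def rel_freq(f, A, n):
    """Relative frequency of A among f(1), ..., f(n)."""
    return Fraction(sum(1 for m in range(1, n + 1) if f(m) in A), n)

# Evaluate the frequency of ones at the end of each block.
checkpoints = [2 ** (k + 1) - 1 for k in range(1, 11)]
freqs = [rel_freq(f, {1}, n) for n in checkpoints]

for n, q in zip(checkpoints, freqs):
    print(n, float(q))
```

The printed values alternate between exactly $\frac{1}{3}$ (at the end of a block of zeros) and values approaching $\frac{2}{3}$ (at the end of a block of ones), so $\lim_n$ does not exist.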

**Vices of frequentism, cont'd**

Von Mises postulated that the limiting relative frequency is not changed by passing to subsequences. (See von Mises 1957 pp. 23-25, 28-29, 65).

Let $\Omega =\{0,1\}$, i.e. the possible outcomes of a single coin flip. Let $\Omega' = \{00, 01, 10, 11\}$, i.e. the possible outcomes of two coin flips. For a sequence $f$ in $\Omega^{\mathbb{N}}$, let $g(m)=f(2m)$. 

Then under the assumption that all the relevant limits exist one has:

$P_{g}(\{1\}) = \lim_n \frac{\left|\{m\leq n: f(2m)=1\}\right|}{n}$

$\hspace{15mm} = \lim_n \frac{\left|\{m\leq n: f(2m)=1, f(2m+1)=0\}\right|}{n}$

$\hspace{20mm} + \lim_n \frac{\left|\{m\leq n: f(2m)=1, f(2m+1)=1\}\right|}{n}$

$\hspace{15mm} = P'_{f'}(\{10\})+ P'_{f'}(\{11\})$

$\hspace{15mm} = \frac{1}{4}+\frac{1}{4} = \frac{1}{2}\hspace{10mm}$  ($\star$)

where $f'(m) = f(2m)f(2m+1)$. 

The reason for ($\star$) is philosophical: if $P_f$ succeeded in getting the fair measure on $\Omega$, then $P'_{f'}$ should succeed in being the fair measure on $\Omega'$.

See Van Lambalgen 1987 pp. 36-37.
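To see that invariance under subsequences and the step ($\star$) are substantive demands, consider a sequence that has the right limiting frequency but fails both. The alternating sequence below is a hypothetical example chosen for this sketch; `g` and `f_pair` play the roles of $g$ and $f'$, and $m$ is assumed to range over $1,\ldots,n$.

```python
from fractions import Fraction

def f(m):
    """A hypothetical, perfectly alternating sequence: f(m) is m mod 2."""
    return m % 2

def g(m):
    """The subsequence g(m) = f(2m)."""
    return f(2 * m)

def f_pair(m):
    """The sequence of pairs f'(m) = f(2m) f(2m+1), written as a string."""
    return f"{f(2 * m)}{f(2 * m + 1)}"

def rel_freq(seq, A, n):
    """Relative frequency of A among seq(1), ..., seq(n)."""
    return Fraction(sum(1 for m in range(1, n + 1) if seq(m) in A), n)

n = 1000
print(rel_freq(f, {1}, n))          # 1/2
print(rel_freq(g, {1}, n))          # 0
print(rel_freq(f_pair, {"10"}, n))  # 0
```

The frequency of `1` in `f` is $\frac{1}{2}$, yet the subsequence `g` is constantly `0` and every pair is `01`. So such a sequence looks fair flip by flip but is not a candidate for representing fair coin flips on von Mises' account.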

**Subjective Bayesianism**

De Finetti described his own view as *subjectivism*, but it is now usually called just *Bayesianism* or perhaps sometimes *subjective Bayesianism*. 

The basic idea is that probabilities are just degrees of confidence or degrees of belief: they are reflective of the subjective evaluations of agents.

Different subjective Bayesians will differ in terms of how they understand the "reflectiveness", and views run the gamut from psychologically real states of agents to the best way to understand, from an external point of view, the behavior of an agent.

**de Finetti vs. von Mises**

Both von Mises and de Finetti were primarily concerned to give a scientifically acceptable account of probability. 

Von Mises is often counted as a member of the Vienna circle.

Similarly, de Finetti speaks of his motivation as stemming from a view that "Every notion is only a word without meaning so long as it is not known how to verify practically any statement at all where this notion comes up ..." (De Finetti 1964 p. 148).

For von Mises, probability was appropriately scientific because "the relative frequency of the repetition is the 'measure' of probability, just as the length of a column of mercury is the 'measure' of temperature" (Von Mises 1957 p. vi).

For de Finetti, probability was operationalized in terms of human belief.

**Virtues of subjective Bayesianism**

There are arguments from betting behavior (Dutch books) that one's degrees of belief satisfy the probability axioms, if one acts in an appropriately rational manner.

Similarly, in the finite case, there are arguments from accuracy, to the effect that any violations of the laws of probability would make one's credences less close to the truth than they otherwise could be.

These are really subjects for another course. 

**Vices of subjective Bayesianism**

It is harder to see how one is going to get relative frequencies to emerge out of this (if one thought that was a good thing). But Bayesians have some things to say about this (exchangeability).

The Bayesian updating procedure has a hard time with the problem of "old evidence": namely the problem of how to account for how a theory's ability to explain an old already-updated-upon piece of evidence lends credence to the theory. 