## Basic Statistics

Karen Leighly Fall 2017

Resources for this material:
- Previous iteration of the Machine Learning class, specifically http://seminar.ouml.org/lectures/bayesian-statistics/
- Jupyter notebook structure - Gordon Richards PHYS_T480 class (https://github.com/gtrichards/PHYS_T480)
- Notes are taken from "Introduction to Probability: Theory and Applications" by R. L. Scheaffer and W. Mendenhall. This was the textbook for my undergraduate probability class; any undergraduate textbook should do.

I will also be referring to University of Washington ASTR 324 (https://github.com/uw-astr-324-s17/astr-324-s17)


## The Need for Probabilistic Thinking

In science, nothing can be known absolutely.  All measurements have uncertainty, and all models have simplifying assumptions.  It is critical to think probabilistically, and to understand and be aware of assumptions, both explicit and implicit.  

## Combinatorics Review

We will start with a review of probability and combinatorics, focusing on discrete data for today.  This lecture follows "Introduction to Probability: Theory and Applications" by Scheaffer & Mendenhall (hereafter abbreviated SM).

#### Set Notation

If the elements in the set are $a_1, a_2, a_3$, we generally write:  

$A=\{a_1, a_2, a_3\}.$




#### Simple Set Relationships

$A \cup B$ is the <b>union</b> of sets A and B.  The result contains all of the elements in either set.

$A \cap B$ is the <b>intersection</b> of sets A and B.  The result contains all of the elements that appear in both sets.

![Figure 3.1](http://www.astroml.org/_images/fig_prob_sum_1.png)

#### Distribution laws

$A\cap (B \cup C) = (A \cap B) \cup (A\cap C)$

$A\cup (B \cap C) = (A\cup B) \cap (A \cup C)$
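
As a quick sanity check, the two distribution laws can be tried out on Python's built-in `set` type. The sets `A`, `B`, and `C` below are arbitrary, hypothetical examples; one example does not prove the laws, it only illustrates them.

```python
# Hypothetical example sets; any finite sets would do.
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 6}

# For Python sets, | is union and & is intersection.
law_1 = (A & (B | C)) == ((A & B) | (A & C))
law_2 = (A | (B & C)) == ((A | B) & (A | C))

print(law_1, law_2)  # True True
```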

## The sample-point approach of computing probabilities

Say you wanted to determine the probability of a discrete event.  The sample-point approach is the most straightforward method of determining the probability of an event.  (The sample-point approach is fundamental in frequentist statistics.)

- List the simple possibilities of the experiment and check to see whether they can be further decomposed.
- Assign probabilities to the sample points, making sure that the probabilities sum to 1.
- Define the event of interest as a collection of the sample points (i.e., a subset of all possible points).
- Find P(A) by summing the probabilities of the sample points in A.

See SM Example 2.1 and 2.2.



### SM Example 2.1

Consider the problem of selecting two applicants for a job out of a group of five, and imagine that the applicants vary in competence, 1 being the best, 2 being the second best, and so on.  (These ratings are unknown to the employer.) Define two events, A and B:

 - A: The employer selects the best, and one of the two poorest applicants (i.e., applicants 1 and 4 or applicants 1 and 5)
 - B: The employer selects at least one of the two best. 
 
Find the probabilities for these two events.
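
A minimal sketch of the sample-point approach for this example, assuming the employer picks at random so that every unordered pair of applicants is equally likely. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

applicants = [1, 2, 3, 4, 5]                      # 1 = best, 5 = poorest
sample_space = list(combinations(applicants, 2))  # all unordered pairs
p_point = Fraction(1, len(sample_space))          # assumption: equally likely points

# Event A: the best applicant plus one of the two poorest.
event_A = [s for s in sample_space if 1 in s and (4 in s or 5 in s)]
# Event B: at least one of the two best applicants.
event_B = [s for s in sample_space if 1 in s or 2 in s]

# Sum the probabilities of the sample points in each event.
P_A = len(event_A) * p_point
P_B = len(event_B) * p_point

print(P_A, P_B)  # 1/5 7/10
```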

## Combinatorial Analysis

Here are some elementary but useful results from combinatorial analysis.

**$mn$ rule:**  With $m$ elements in one set, and $n$ elements in another, it is possible to form $mn$ pairs.

### SM Example 2.3

An experiment involves tossing a pair of dice and observing the results.  Find the number of points in $S$.
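
The $mn$ rule can be checked by listing the sample space directly. This sketch assumes two ordinary six-sided dice and treats the outcome as an ordered pair (die 1, die 2).

```python
from itertools import product

faces = range(1, 7)               # assumption: standard six-sided dice
S = list(product(faces, faces))   # ordered pairs (die 1, die 2)
n_points = len(S)

print(n_points)  # 36, i.e. m * n = 6 * 6
```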

### Permutations

$P^n_r = \frac{n!}{(n-r)!}=n(n-1)(n-2)...(n-r+1)$ is the number of ways of ordering $n$ distinct objects taken $r$ at a time. 

Recall that factorial is defined as $n!=n(n-1)....(2)(1)$ and $0!=1$.  Note that this is for **ordered selection.**

See SM Example 2.5, 2.6

### SM Example 2.5

Opening a combination lock requires the selection of the correct set of four different digits in sequence.  How many combinations are there, assuming no digit is used twice?
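
A short check of the permutation count, assuming the lock uses the ten digits 0 through 9 (the number of available digits is not stated above). `math.perm` requires Python 3.8 or later.

```python
import math

n_digits = 10                       # assumption: digits 0-9
n_codes = math.perm(n_digits, 4)    # ordered selection of 4 distinct digits
by_hand = 10 * 9 * 8 * 7            # n(n-1)(n-2)(n-3)

print(n_codes, by_hand)  # 5040 5040
```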

### Partitioning

The number of ways of partitioning $n$ distinct objects into $k$ distinct groups containing $n_1, n_2, ..., n_k$ objects, respectively is:

$N=\frac{n!}{n_1!n_2!....n_k!} = \binom{n}{n_1 n_2... n_k}$, where $\sum_{i=1}^k n_i = n$.

See SM Example 2.7.

### SM Example 2.7

A labor dispute has arisen concerning the alleged unequal distribution of 20 laborers to 4 different construction jobs.  The first job requires 6 laborers; the second, third, and fourth require four, five, and five, respectively.  The dispute arose over an alleged random distribution of the laborers that placed all 4 members of a particular ethnic group on job 1 (considered to be the worst job).  

If the assignment of the laborers to jobs was random, find the probability of the observed event.
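
One way to sketch this calculation uses the partitioning formula twice. The sizes of jobs 2 through 4 appear in both the numerator and the denominator and cancel in the ratio, so the sketch lumps them into a single group of 14. The helper `n_partitions` is a hypothetical name introduced here.

```python
from fractions import Fraction
from math import factorial

def n_partitions(*sizes):
    """Ways to split sum(sizes) distinct objects into distinct groups of these sizes."""
    total = factorial(sum(sizes))
    for s in sizes:
        total //= factorial(s)
    return total

# All ways to choose who goes to job 1 (6 laborers) versus the other jobs (14).
n_total = n_partitions(6, 14)
# Favorable: the 4 group members are fixed on job 1, so of the remaining
# 16 laborers, 2 go to job 1 and 14 go elsewhere.
n_favorable = n_partitions(2, 14)

p_observed = Fraction(n_favorable, n_total)
print(p_observed, float(p_observed))  # 1/323, roughly 0.0031
```

A probability this small suggests that the observed assignment would be unusual if the distribution were truly random.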

## Combinations

The number of combinations of $n$ objects taken $r$ at a time is the number of ways of forming a subset of size $r$ from the $n$ objects. 

$C^n_r = \binom{n}{r} = \frac{P^n_r}{r!} = \frac{n!}{r!(n-r)!}$

Note that this subset is not ordered, unlike the permutations above.

See SM Example 2.8, 2.9.

### SM Example 2.8

Find the number of ways of selecting two applicants out of 5, and hence the total number of sample points in $S$ for Example 2.1.
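
A brief check of the combination count, also showing its relation to the permutation count. `math.comb` and `math.perm` require Python 3.8 or later.

```python
import math

n_pairs = math.comb(5, 2)                          # unordered selection
from_perms = math.perm(5, 2) // math.factorial(2)  # P^n_r / r!

print(n_pairs, from_perms)  # 10 10
```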

## Conditional Probability

Conditional probability forms the heart of Bayesian analysis, since using Bayesian analysis allows us to take into account additional knowledge. For example, one might ask: what is the probability it will snow on a given day in Norman? That probability will change depending on the time of year, so the probability that it will snow given that it is December will be different than the probability it will snow given that it is August.

$P(A|B) = \frac{P(A\cap B)}{P(B)}$ is the conditional probability of event A, given that an event B has occurred, provided $P(B)>0$. 

We read this as "the probability of A given B (has occurred)".  

See SM Example 2.10

### SM Example 2.10
Consider the toss of a single die. What is the probability that the result is 1, given the information that an odd number was obtained?
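
The definition of conditional probability can be applied directly here. This sketch assumes a fair die; `prob` is a small illustrative helper, not a library function.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space of a fair die (assumption: equally likely faces)

def prob(event):
    """Probability of an event, given as a set of faces."""
    return Fraction(len(event & S), len(S))

one = {1}
odd = {1, 3, 5}

# P(A|B) = P(A and B) / P(B)
p_one_given_odd = prob(one & odd) / prob(odd)
print(p_one_given_odd)  # 1/3
```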


## Independent Events

If $P(A\cap B) = P(A)P(B)$, then the two events $A$ and $B$ are _independent_. Otherwise, they are dependent. 

Then, if $A$ and $B$ are independent, $P(A|B)=P(A)$ and $P(B|A)=P(B)$.  

I.e., the fact that $B$ has occurred has no influence on the conditional probability, because $A$ and $B$ are independent.

See SM Example 2.11, 2.12



### SM Example 2.11

Consider the toss of a single die, and 3 events:
 - A: Observe an odd number
 - B: Observe an even number
 - C: Observe a 1 or a 2
 
(a.) Are A and B independent events?

(b.) Are A and C independent events?
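
Both parts can be checked by testing $P(A\cap B) = P(A)P(B)$ directly. The sketch assumes a fair die; `prob` and `independent` are illustrative helper names.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # fair die (assumption: equally likely faces)

def prob(event):
    return Fraction(len(event & S), len(S))

def independent(X, Y):
    """True if P(X and Y) equals P(X) * P(Y)."""
    return prob(X & Y) == prob(X) * prob(Y)

A = {1, 3, 5}   # odd
B = {2, 4, 6}   # even
C = {1, 2}      # a 1 or a 2

print(independent(A, B))  # False: P(A and B) = 0, but P(A)P(B) = 1/4
print(independent(A, C))  # True:  P(A and C) = 1/6 = (1/2)(1/3)
```

Note that A and B are mutually exclusive, which is not the same thing as independent.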

## Multiplicative Law of Probability

The probability of the intersection of two events, $A$ and $B$, is

$$ P(A\cap B) = P(A)P(B|A) = P(B)P(A|B).$$

If $A$ and $B$ are independent, then

$$P(A \cap B) = P(A)P(B).$$

Note it can be extended to any number of events:

$$P(A\cap B\cap C) = P(A)P(B|A)P(C | A\cap B).$$



## Additive Law of Probability

The probability of the union of two events $A$ and $B$ is

$$P(A \cup B) = P(A)+P(B)-P(A\cap B).$$

If $A$ and $B$ are mutually exclusive events, $P (A \cap B) = 0$ and
$$P(A\cup B) = P(A)+P(B).$$

This can also be extended to any number of events:

$$P(A\cup B\cup C) = P(A)+P(B)+P(C) - P(B\cap C)$$
$$-P(A\cap B) - P(A\cap C)+P(A\cap B\cap C).$$
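
The three-event form of the additive law can be verified numerically on a small example. The events below are hypothetical choices on a fair die, picked so that the pairwise intersections are not empty.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # fair die (assumption: equally likely faces)

def prob(event):
    return Fraction(len(event & S), len(S))

# Hypothetical events for illustration.
A = {1, 3, 5}
B = {1, 2}
C = {2, 3, 4}

lhs = prob(A | B | C)
rhs = (prob(A) + prob(B) + prob(C)
       - prob(B & C) - prob(A & B) - prob(A & C)
       + prob(A & B & C))

print(lhs, rhs)  # 5/6 5/6
```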

## The Event-composition Method

A useful approach for more complicated problems is to recast the event in question as a _composition_ (i.e., union and/or intersection) of two or more other events, and then use the multiplicative law of probability and / or the additive law of probability to solve for the probability of the event of interest.

See SM examples 2.13-2.17

### SM Example 2.14

Let's revisit the problem involving the selection of two applicants out of five, but instead of using the sample-point approach, we'll use the event-composition method.  Find the probability of drawing exactly one of the two best applicants, event $A$.
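
A sketch of the event-composition calculation, assuming the two applicants are drawn one after the other at random without replacement. Event $A$ is the union of two mutually exclusive events (a top-two applicant first and then another applicant, or the reverse), and each of those is an intersection handled with the multiplicative law. The enumeration at the end is only a cross-check.

```python
from fractions import Fraction
from itertools import combinations

# P(first is top-two) * P(second is not top-two | first is top-two)
p_best_then_other = Fraction(2, 5) * Fraction(3, 4)
# P(first is not top-two) * P(second is top-two | first is not top-two)
p_other_then_best = Fraction(3, 5) * Fraction(2, 4)

# The two orderings are mutually exclusive, so the additive law gives a plain sum.
P_A = p_best_then_other + p_other_then_best

# Cross-check with the sample-point approach.
pairs = list(combinations([1, 2, 3, 4, 5], 2))
n_exactly_one = sum(1 for s in pairs if (1 in s) != (2 in s))
P_A_enumerated = Fraction(n_exactly_one, len(pairs))

print(P_A, P_A_enumerated)  # 3/5 3/5
```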

## Complementary Events

One very useful feature is that the probability that A is true plus the probability that A is not true must equal 1.  This can be useful if the probability of A not being true is easier to evaluate.  Mathematically:

$P(A)=1-P(\bar{A})$
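
As an illustration, event B from SM Example 2.1 (at least one of the two best applicants is selected) is easier to evaluate through its complement, "neither of the two best is selected". The sketch assumes random selection of two applicants out of five.

```python
from fractions import Fraction
from math import comb

# Complement: both selected applicants come from the three weakest.
p_neither = Fraction(comb(3, 2), comb(5, 2))
p_at_least_one = 1 - p_neither

print(p_neither, p_at_least_one)  # 3/10 7/10
```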

## Bayes' Rule

Armed with these relationships, we are ready to derive Bayes' rule.  From the probability standpoint, the motivation for using Bayes' rule is that it is sometimes easier to solve probability problems if the sample space $S$ is viewed as the union of mutually exclusive subsets,
i.e.,

$$S=B_1\cup B_2\cup....\cup B_k$$

where $B_i\cap B_j$ is the empty set for $i\neq j$. Then, any subset A of S can be written as:

$$A=A\cap S$$
$$=A\cap (B_1 \cup B_2 \cup...\cup B_k)$$
$$= (A\cap B_1) \cup (A\cap B_2) \cup ...\cup (A\cap B_k)$$

Then, using the additive rule of probability:

$$P(A) = P(A\cap B_1) + P(A\cap B_2)+...+P(A\cap B_k)$$

Then, using the multiplicative rule of probability:

$$ = P(B_1)P(A|B_1)+P(B_2)P(A|B_2)+....+P(B_k)P(A|B_k)$$
$$ = \sum_{i=1}^k P(B_i)P(A|B_i).$$

Then, the probability $P(B_j|A)$ for any $j$ is (from the definition of conditional probability and the multiplicative law):

$$P(B_j|A) = \frac{P(A\cap B_j)}{P(A)}$$

$$=\frac{P(B_j)P(A|B_j)}{\sum_{i=1}^k P(B_i)P(A|B_i)}.$$

The important part of the RHS is the numerator. The denominator is the same for every $B_j$ and only normalizes the probabilities so that they sum to 1, so it is often not evaluated explicitly.
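
The discrete form of Bayes' rule can be sketched as a short function. The name `posterior` and the numbers in the usage example are hypothetical. The function computes each numerator $P(B_i)P(A|B_i)$ and then divides by their sum.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(B_j|A) for each j, given P(B_j) and P(A|B_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # numerators
    evidence = sum(joint)                                   # P(A), the normalization
    return [j / evidence for j in joint]

# Hypothetical numbers: two mutually exclusive hypotheses B_1 and B_2.
priors = [Fraction(1, 3), Fraction(2, 3)]
likelihoods = [Fraction(1, 2), Fraction(1, 4)]

post = posterior(priors, likelihoods)
print(post)  # [Fraction(1, 2), Fraction(1, 2)]
```

In this made-up example the less probable hypothesis explains the data twice as well, so the two end up equally probable after the update.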


So for a single $B$ the equation becomes:

$$P(B|A)=\frac{P(A|B)P(B)}{P(A)} \propto P(A|B)P(B).$$

We will discuss Bayes' Rule in more detail the week after next.  But the terminology associated with it is important for understanding it, so we will mention that briefly here.

$$P(B|A)=\frac{P(A|B)P(B)}{P(A)} \propto P(A|B)P(B).$$

 - $P(A|B)$ is known as the likelihood, and can be thought of as the probability that the data that you have ($A$) matches your model ($B$).
 - $P(B)$ is the prior, and it can be thought of as additional knowledge that you already have about the model.
 - $P(A)$ is the evidence, and as mentioned above, it is sometimes ignored because the posterior can be normalized _a posteriori_.
 - $P(B|A)$ is the posterior probability. It can be thought of as the probability of the model given the data, and it is what you want.
 
We can write this in words as:
$${\rm Posterior \; Probability} = \frac{{\rm Likelihood}\times{\rm Prior}}{{\rm Evidence}},$$

where we interpret the posterior probability as the probability of the model (including the model parameters).


## Example: Lego

An example with Lego bricks (it's awesome):
[https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego](https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego)

Also see SM Example 2.18

## Example: Monty Hall Problem

You are playing a game show and are shown 2 doors.  One has a car behind it, the other a goat.  What are your chances of picking the door with the car?

OK, now there are 3 doors: one with a car, two with goats.  The game show host asks you to pick a door, but not to open it yet.  Then the host opens one of the other two doors (that you did not pick), making sure to select one with a goat.  The host offers you the opportunity to switch doors.  Do you?

![https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Monty_open_door.svg/180px-Monty_open_door.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Monty_open_door.svg/180px-Monty_open_door.svg.png)

Now you are back at the 2 door situation.  But what can you make of your prior information?

$p(1{\rm st \; choice}) = 1/3$

$p({\rm other}) = 2/3$
which doesn't change after the host opens a door without the prize.
So, switching doubles your chances.  But only because you had prior information.  If someone walked in after the "bad" door was opened, then their probability of winning is the expected $1/2$.

For $N$ choices, revealing $N-2$ "answers" doesn't change the probability of your choice.  It is still $\frac{1}{N}$.  But it *does* change the probability that the *other* remaining choice is correct: it increases by a factor of $N-1$, from $\frac{1}{N}$ to $\frac{N-1}{N}$.

This is an example of the use of *conditional* probability, where we have $p(A|B) \ne p(A)$.
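
The result can be checked with a small Monte Carlo simulation. This is a sketch under the rules described above (the host knows where the car is and never reveals it). The function `play`, the seed, and the number of trials are arbitrary choices, and the printed rates are approximate.

```python
import random

def play(switch, rng, n_doors=3):
    """Play one game; return True if the final choice wins the car."""
    car = rng.randrange(n_doors)
    pick = rng.randrange(n_doors)
    # The host opens every other door except one, never revealing the car.
    # If the first pick was wrong, the door left closed must hide the car;
    # otherwise it is a randomly chosen goat door.
    if pick == car:
        remaining = rng.choice([d for d in range(n_doors) if d != pick])
    else:
        remaining = car
    final = remaining if switch else pick
    return final == car

rng = random.Random(0)   # fixed seed so the run is repeatable
n_trials = 100_000
stay_rate = sum(play(False, rng) for _ in range(n_trials)) / n_trials
switch_rate = sum(play(True, rng) for _ in range(n_trials)) / n_trials

print(stay_rate, switch_rate)  # close to 1/3 and 2/3
```

Passing a larger `n_doors` illustrates the $N$-choice version, where staying wins about $\frac{1}{N}$ of the time.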

## Example: Contingency Table

We can also use Bayes' rule to learn something about false positives and false negatives.

Let's say that we have a test for a disease.  The test can be positive ($T=1$) or negative ($T=0$) and one can either have the disease ($D=1$) or not ($D=0$).  So, there are 4 possible combinations:
$$T=0; D=0 \;\;\;  {\rm true \; negative}$$
$$T=0; D=1 \;\;\; {\rm false \; negative}$$
$$T=1; D=0 \;\;\; {\rm false \; positive}$$
$$T=1; D=1 \;\;\; {\rm true \; positive}$$

All else being equal, you have a 50% chance of being misdiagnosed.  Not good!  But the probability of disease and the accuracy of the test presumably are not random.

If the rates of false positive and false negative are:
$$p(T=1|D=0) = \epsilon_{\rm FP}$$
$$p(T=0|D=1) = \epsilon_{\rm FN}$$

then the true positive and true negative rates are just:
$$p(T=0| D=0) = 1-\epsilon_{\rm FP}$$
$$p(T=1| D=1) = 1-\epsilon_{\rm FN}$$

In graphical form this is:
![http://www.astroml.org/_images/fig_contingency_table_1.png](http://www.astroml.org/_images/fig_contingency_table_1.png)

If we have a **prior** regarding how likely the disease is, we can take this into account.

$$p(D=1)=\epsilon_D$$

and then $p(D=0)=1-\epsilon_D$.

Bayes' rule then can be used to help us determine how likely it is that you have the disease if you tested positive:

$$p(D=1|T=1) = \frac{p(T=1|D=1)p(D=1)}{p(T=1)},$$

where $$p(T=1) = p(T=1|D=0)p(D=0) + p(T=1|D=1)p(D=1).$$

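So
$$p(D=1|T=1) = \frac{(1 - \epsilon_{\rm FN})\epsilon_D}{\epsilon_{\rm FP}(1-\epsilon_D) + (1-\epsilon_{\rm FN})\epsilon_D} \approx \frac{\epsilon_D}{\epsilon_D+\epsilon_{\rm FP}},$$

where the approximation holds when $\epsilon_{\rm FN}$ and $\epsilon_D$ are both small.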

Wondering why we can't just read $p(D=1|T=1)$ off the table?  That's because the table entry is the conditional probability of the *test* result given the *disease* state, $p(T=1|D=1)$, whereas what we want is the conditional probability of the *disease* given the *test* result.

The approximate expression means that to get a reliable diagnosis, we need $\epsilon_{\rm FP}$ to be quite small compared with $\epsilon_D$.  (Because you *want* the probability to be close to unity if you test positive, otherwise it is a *false* positive).

Take an example with a disease rate of 1% and a false positive rate of 2%.  

So we have
$$p(D=1|T=1) = \frac{0.01}{0.01+0.02} = 0.333$$

Then in a sample of 1000 people, 10 people will *actually* have the disease $(1000*0.01)$, but about another 20 $(1000*0.02)$ who do not have the disease will test positive!
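
The exact and approximate expressions can be compared numerically. The false negative rate is not specified in this example, so the sketch assumes $\epsilon_{\rm FN}=0$. The function name is illustrative.

```python
def p_disease_given_positive(eps_D, eps_FP, eps_FN):
    """Exact p(D=1|T=1) from Bayes' rule."""
    true_pos = (1 - eps_FN) * eps_D      # p(T=1|D=1) p(D=1)
    false_pos = eps_FP * (1 - eps_D)     # p(T=1|D=0) p(D=0)
    return true_pos / (true_pos + false_pos)

eps_D = 0.01    # disease rate
eps_FP = 0.02   # false positive rate
eps_FN = 0.0    # assumption: not given above

exact = p_disease_given_positive(eps_D, eps_FP, eps_FN)
approx = eps_D / (eps_D + eps_FP)

print(round(exact, 4), round(approx, 4))  # 0.3356 0.3333
```

Under this assumption the approximation agrees with the exact value to within about one percent. In either case, only about one in three positive tests corresponds to an actual case of the disease.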