# Lecture 6 
* Combinatorics
* Bayes' Theorem and Applications
* Exploratory Data Analysis
* Hypothesis Testing
* Bootstrap Sampling

These are some of the theorems and corollaries that we have learned so far:

* $\forall E\in\mathcal{F}, 0 \leq P(E)\leq 1$
* $P(\Omega)=1$ and $P(\emptyset) = 0 $
* $P(A^c) = P(\overline{A}) = 1 - P(A)$
* If $A\subset B$, then $P(A)\leq P(B)$
* **DeMorgan's Law 1**: $\overline{A\cap B} = \overline{A}\cup\overline{B}$  
* **DeMorgan's Law 2**: $\overline{A\cup B} = \overline{A}\cap\overline{B}$
* $P(A\cap B) = P(A) + P(B) - P(A\cup B)$
* If $A$ and $B$ are mutually exclusive (M.E.) then $A\cap B=\emptyset \Rightarrow P(A\cap B) = 0$
* **Conditional Probability**: $P(A|B) = \frac{P(A\cap B)}{P(B)}$, for $P(B)>0$
* **Chain Rules**: $P(A\cap B) = P(A|B)P(B)$ and $P(A\cap B) = P(B|A)P(A)$
* **Multiplication Rule**: $P(\bigcap_{i=1}^n A_i) = P(A_1)P(A_2|A_1)P(A_3|A_1\cap A_2)\dots P\left(A_n|A_1\cap\dots \cap A_{n-1}\right)$
* **Total Probability**: if a set of events $\{C_i\}_{i=1}^n$ partitions the sample space $\Omega$, then $P(A) = \sum_{i=1}^n P(A|C_i)P(C_i)$
* **Statistical Independence:** two events $A, B\in\mathcal{F}$ are statistically independent (s.i.) if and only if (iff) $P(A\cap B)=P(A)P(B)$
* If $A$ is statistically independent of $B$, then $B$ is statistically independent of $A$.
* If $A, B\in\mathcal{F}$ are s.i., then $A$ and $\bar{B}$ are s.i., $\bar{A}$ and $B$ are s.i., and $\bar{A}$ and $\bar{B}$ are s.i.
* **Conditional Independence:** Given an event $C$, the events $A$ and $B$ are said to be conditionally independent if $P(A\cap B|C) = P(A|C)P(B|C)$
* Conditionally independent events are not necessarily statistically independent.

In [None]:
import random
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')

import itertools

Library API for [itertools](https://docs.python.org/3/library/itertools.html).

___

# Combinatorics

<div class="alert alert-info">
    
A **combined experiment** is one in which the outcome is a tuple that takes one outcome from each of a sequence of subexperiments.
</div>

<div class="alert alert-info">
    <b>Cartesian Product</b>
    
The **Cartesian product** of two sets $A$ and $B$ is denoted $A \times B$ and is defined by 

$$ A \times B = \{ (a,b) | a \in A \mbox{ and } b \in B\}$$

That is, it is the set of all two-tuples with the first element from set $A$ and the second element from set $B$.
</div>

## 1. Sampling with Replacement and with Ordering

<div class="alert alert-info">
    <b>Sampling with replacement and with ordering</b>
    
Consider choosing $k$ values from a set $A$ of $n$ values. The result is a $k$-tuple: $(x_1, x_2, \ldots, x_k)$, 
where $x_i \in A, \forall i=1,2,\ldots, k$. 

Thus, this is a combined experiment with $|S_1|=|S_2|=\ldots=|S_k|=|A|\equiv n$.

Therefore the number of distinct ordered $k$-tuple outcomes is $n^k$.
</div>
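
As a quick check of the $n^k$ count, the sketch below enumerates every ordered $k$-tuple with `itertools.product`. The set `A` and the value `k = 3` are illustrative choices, not part of the lecture.

```python
import itertools

# Illustrative choices (assumptions): a 4-element set and k = 3 draws
A = ['a', 'b', 'c', 'd']
k = 3

# Every ordered k-tuple drawn from A with replacement
tuples_with_replacement = list(itertools.product(A, repeat=k))

n = len(A)
print(len(tuples_with_replacement), n**k)
```

Both printed numbers should agree, since each of the $k$ positions can take any of the $n$ values.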

## 2. Sampling without Replacement and with Ordering

<div class="alert alert-info">
    <b>Sampling without replacement and with ordering ($k$-permutations)</b>
    
In general, the number of ways to choose $k$ items from $n$ items **without replacement** and **with ordering** is
$$ n \times (n-1) \times \ldots \times (n-k+1) = \frac{n!}{(n-k)!}$$
</div>
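
The $k$-permutation count can be checked the same way. This is a minimal sketch with assumed values $n=5$ and $k=3$; `itertools.permutations` generates the ordered selections without replacement.

```python
import itertools
import math

# Illustrative choices (assumptions): n = 5 items, k = 3 draws
items = range(5)
k = 3

# Ordered selections without replacement (k-permutations)
k_perms = list(itertools.permutations(items, k))

n = len(items)
formula = math.factorial(n) // math.factorial(n - k)
print(len(k_perms), formula)
```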

**<font color=blue>Example 1:</font> Consider the combined experiment of flipping a fair coin 20 times and counting the number of heads. How many ways are there to observe a count of 2 heads in 20 coin flips?**

**PYTHON technique** To compute the factorial of an integer in Python, you can use the ```scipy``` library:
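
A minimal sketch, assuming `scipy` is installed. It uses `scipy.special.factorial` to count the **ordered** ways of picking the 2 flip positions (out of 20) that land heads; the variable name is an illustrative choice.

```python
from scipy.special import factorial

# exact=True returns an exact Python integer instead of a float approximation
ordered_pairs = factorial(20, exact=True) // factorial(18, exact=True)
print(ordered_pairs)  # 20 * 19 ordered ways to pick 2 of the 20 flips
```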

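Let $H_k$ denote the event that exactly $k$ of the 20 flips land heads. An outcome in $H_2$ is identified by the positions of its two heads, so in counting the number of ways that 2 heads can occur in 20 flips, (7,14) represents the same thing as (14,7). 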

So, if we determine the number of **ordered** ways to choose 2 unique values out of 20, we have **overcounted** by a factor of 2.

Thus, the correct number of outcomes in $H_2$ is
$$ \left|H_2 \right| = \frac{20 \cdot 19}{2} = 190 $$ 

Now, let's try to count $|H_3|$. We know the number of ways to choose 3 **ordered** values from 20 without replacement is
$$\frac{20!}{(20-3)!} = 20 \cdot 19 \cdot 18$$

But how many repeats are there if we want to know the number of unordered sets? Let's consider how many ways we can arrange (1,2,3):

(1,2,3)
(1,3,2)
(2,1,3)
(2,3,1)
(3,1,2)
(3,2,1)

So, there are 6.

Note that the number of ways to order 3 things is the same as the number of ordered ways to choose 3 items from a set of 3.

<div class="alert alert-info">
    <b>Permutations</b>
    
The number of *permutations* of $k$ objects is the number of orderings of those $k$ objects, and can be calculated as
$$ k \times (k-1) \times (k-2) \times \ldots \times 2 \times 1 = k! $$
</div>

## 3. Sampling without Replacement and without Ordering

Finally, we are ready to determine $|H_3|$, which is $20 \times 19 \times 18$ divided by the number of orderings of 3 items, which is $3!=6$, so

\begin{align*}
\left|H_3\right| &= \frac{20!}{(20-3)!}\frac{1}{3!} \\
&=\frac{20 \times 19 \times 18}{6} \\
&= 1140
\end{align*}

Moreover, the formula for general $|H_k|$ follows directly.

<div class="alert alert-info">
    <b>Sampling without Replacement and without Ordering (Combinations)</b>
    
The number of ways to choose $k$ items from a set of $n$ items **without replacement** and **without ordering** is
$$  \frac{n!}{(n-k)!k!} $$

The value of the equation can also be expressed as
$$ \binom{n}{k} = C^{n}_{k} $$
and is known as the **binomial coefficient**.
</div>

**PYTHON technique** To determine $\binom{n}{k}$ in Python, you can also use the ```scipy``` library:
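
A minimal sketch, assuming `scipy` is installed. `scipy.special.comb` evaluates the binomial coefficient; here it reproduces the two counts derived by hand above.

```python
from scipy.special import comb

# exact=True returns an exact Python integer
H2 = comb(20, 2, exact=True)
H3 = comb(20, 3, exact=True)
print(H2, H3)
```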

Thus, the probability of any event $H_k \subset \Omega$ is 

$$ P(H_k) = \frac{|H_k|}{|\Omega|} = \frac{\binom{20}{k}}{2^{20}}$$

Let's put it all together and compare with our simulation:

In [None]:
# Simulation parameters
num_sims=10_000
flips=20
threshold=6

# Conducting experiment



# Analytical Probability




In [None]:
fig = plt.figure(figsize=(20,5))

# Counting - Observed Relative Frequencies

plt.xlabel('Number of Heads')
plt.ylabel('Relative Frequency')


# Analytical probability

plt.xlabel('Number of Heads')
plt.ylabel('Analytical Probability')


# Relative Frequencies vs Analytical Probability

plt.xlabel('Number of Heads')
plt.ylabel('Relative Frequency vs Analytical Probability')
plt.show()

print("Flip | Relative Freq. | Analytic Probability")
#
#
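
One way the comparison between simulation and analysis might be carried out is sketched below. It is not the in-class solution: the seeded generator, the vectorized simulation, and the variable names `heads`, `rel_freq` and `analytic` are all assumptions made for illustration.

```python
import numpy as np
from math import comb

num_sims = 10_000
flips = 20

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)

# Each row is one experiment of 20 fair flips (1 = heads); the row sum counts heads
heads = rng.integers(0, 2, size=(num_sims, flips)).sum(axis=1)

# Relative frequency of each head count 0..20
rel_freq = np.bincount(heads, minlength=flips + 1) / num_sims

# Analytical probability P(H_k) = C(20, k) / 2^20
analytic = np.array([comb(flips, k) / 2**flips for k in range(flips + 1)])

print("Heads | Relative Freq. | Analytic Probability")
for k in range(flips + 1):
    print(f"{k:5d} | {rel_freq[k]:14.4f} | {analytic[k]:.4f}")
```

With 10,000 simulated experiments the two columns should agree to roughly two decimal places.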

## 4. Sampling with Replacement and without Ordering

Suppose that we want to sample from the set $A=\{a_1,a_2,\dots,a_n\}$ $k$ times such that repetition is allowed and ordering does not matter. For example, if $A=\{1,2,3,4,5,6\}$ is the sample space of rolling a 6-sided fair die and $k=2$, then there are 21 different ways of doing this:

\begin{equation*}
\{(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,2), (2,3), (2,4), (2,5), (2,6), (3,3), (3,4), (3,5), (3,6), (4,4), (4,5), (4,6), (5,5), (5,6), (6,6)\}
\end{equation*}

* How can we get the number 21 without actually listing all the possibilities? 

One way to think about this is to note that any of the pairs in the above list can be represented by the number of 1's, 2's, 3's, 4's, 5's and 6's it contains. That is, if $x_i$ is the number of times face $i$ appears, we can equivalently represent each pair by a vector $(x_1,x_2,x_3,x_4,x_5,x_6)$, for example,

\begin{align*}
(1,5) &\rightarrow (x_1,x_2,x_3,x_4,x_5,x_6) = (1,0,0,0,1,0)\\
(2,2) &\rightarrow (x_1,x_2,x_3,x_4,x_5,x_6) = (0,2,0,0,0,0)\\
(3,4) &\rightarrow (x_1,x_2,x_3,x_4,x_5,x_6) = (0,0,1,1,0,0)\\
(5,5) &\rightarrow (x_1,x_2,x_3,x_4,x_5,x_6) = (0,0,0,0,2,0)
\end{align*}

Note that here $x_i \geq 0$ are integers and $x_1+x_2+x_3+x_4+x_5+x_6 = k = 2$. Thus, we can claim that the number of ways we can sample two elements from the set $A=\{1,2,3,4,5,6\}$ such that ordering does not matter and repetition is allowed is the same as the number of solutions to the following equation

$$x_1+x_2+x_3+x_4+x_5+x_6 = 2\text{, where } x_i\in\{0,1,2\}$$

This is an interesting observation and in fact using the same argument we can make the following statement for general $k$ and $n$.

<div class="alert alert-info">
    <b>Sampling with Replacement and without Ordering (Partitions)</b>
    
The number of $k$-multisets of an $n$-set $A=\{a_1,a_2,\cdots,a_n\}$, that is, the number of ways to choose $k$ items **with replacement** and **without ordering**, is given by the binomial coefficient:

$$\binom{n + k - 1}{k} = \binom{k + n - 1}{n-1}$$
</div>
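
The dice example can be verified by enumeration. This sketch uses `itertools.combinations_with_replacement` to list the unordered pairs and compares the count with the formula for $n=6$, $k=2$.

```python
import itertools
from math import comb

faces = [1, 2, 3, 4, 5, 6]
k = 2

# Unordered selections of k faces with repetition allowed
multisets = list(itertools.combinations_with_replacement(faces, k))

n = len(faces)
print(len(multisets), comb(n + k - 1, k))
```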

**Bonus Material:** https://www.youtube.com/watch?v=UTCScjoPymA

**<font color=blue>Example 2:</font> What is the probability of a roll of 11 when rolling 2 fair 6-sided dice?**
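
One way to work this example is to enumerate the 36 equally likely **ordered** outcomes with `itertools.product`. Note that the 21 unordered pairs from the previous section are not equally likely, so the counting is done over ordered pairs. The variable names are illustrative.

```python
import itertools
from fractions import Fraction

faces = range(1, 7)

# All 36 ordered outcomes of rolling two fair dice
outcomes = list(itertools.product(faces, repeat=2))

# Outcomes whose faces sum to 11: (5, 6) and (6, 5)
favorable = [roll for roll in outcomes if sum(roll) == 11]

prob_11 = Fraction(len(favorable), len(outcomes))
print(favorable, prob_11)
```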

___

# Bayes' Theorem (sometimes called Bayes' Rule)

Consider two events $A$ and $B$. By the **chain rule** equations we know that: 

$$P(A\cap B) = P(A|B)P(B)$$
and
$$P(B\cap A) = P(B|A) P(A)$$

Note that 

\begin{align*}
P(A\cap B) &= P(B\cap A)\\
\iff P(A|B)P(B) &= P(B|A) P(A)\\
\iff P(A|B) &= \frac{P(B|A) P(A)}{P(B)}
\end{align*}

<div class="alert alert-info" role="alert">
  <strong>Bayes's Theorem</strong>
    
Suppose the set of events $\{A_i\}_{i=1}^n$ partitions the sample space $\Omega$, with $P(A_i)>0$ for all $i$. Then, for any event $B$ such that $P(B)>0$, we have

\begin{align*}
P(A_i|B) &= \frac{P(B|A_i)P(A_i)}{P(B)}
\end{align*}

where $P(B)$ can be computed using the Law of Total Probability,
  
\begin{align*}
P(B) &= P(B|A_1)P(A_1) + \cdots +P(B|A_n)P(A_n)
\end{align*}

</div>

* **Add that to the set of formulas!**

**<font color=blue>Example 3:</font> Consider the experiment where we select between a fair 6-sided die and a fair 12-sided die at random and roll it once. What is the probability that the die selected was the 12-sided die if the face on top was 5?**

<!-- 
Let $S$ be the event that the fair 6-sided die was selected, $T$ the event that the fair 12-sided die was selected, and $D_i$ the event that the face $i$ was rolled.

$$P(T|D_5) = \frac{P(T\cap D_5)}{P(D_5)} = \frac{P(D_5|T)P(T)}{P(D_5)}$$

and

$$P(D_5) = P(D_5|S)P(S) + P(D_5|T)P(T) = \frac{1}{6}\times\frac{1}{2} + \frac{1}{12}\times\frac{1}{2} = 0.125$$

Putting it together,

$$P(T|D_5) = \frac{P(D_5|T)P(T)}{P(D_5|S)P(S) + P(D_5|T)P(T)} = \frac{\frac{1}{12}\times\frac{1}{2}}{\frac{1}{6}\times\frac{1}{2} + \frac{1}{12}\times\frac{1}{2}}=\frac{1}{3}$$ -->

In [5]:
num_sims=100_000
dice = ['6-sided','12-sided']

## COMPLETE IN CLASS

print('Probability that die is 12-sided if observed result is 5 is ',
      )

Probability that die is 12-sided if observed result is 5 is 
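
A possible simulation for this example is sketched below. It is not the in-class solution; the vectorized approach, the seed, and the variable names are assumptions. Applying Bayes' theorem by hand gives $\frac{1}{3}$, which the estimate should approach.

```python
import numpy as np

num_sims = 100_000
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Pick a die uniformly at random for each trial: 6 or 12 sides
sides = rng.choice([6, 12], size=num_sims)

# Roll the chosen die once; the upper bound of integers() is exclusive
rolls = rng.integers(1, sides + 1)

# Condition on the event "face 5 was observed"
observed_5 = rolls == 5
prob_12_given_5 = np.mean(sides[observed_5] == 12)

print('Probability that die is 12-sided if observed result is 5 is',
      round(float(prob_12_given_5), 3))
```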


* Bayes' rule is an extremely useful theorem and is often used for **statistical inference**.

There are a number of *causes* that may result in a certain *effect*. We observe the effect, and we wish to infer the cause.

* The events $A_1, A_2,\dots,A_n$ can be characterized as a set of possible causes, and
* The event $B$ represents the effect

The probability $P(B|A_i)$ computes the probability that the effect $B$ will be observed when the cause $A_i$ is present. This amounts to a probabilistic model for a cause-effect relationship.

Given that the effect $B$ has occurred, we want to evaluate the probability $P(A_i|B)$ that the cause $A_i$ is present.

* We refer to $P(A_i|B)$ as the **<font color=green>posterior probability</font>** of event $A_i$ given the information

* We refer to $P(A_i)$ as the **<font color=orange>prior probability</font>**

* We refer to $P(B|A_i)$ as the **<font color=blue>likelihood</font>**

* We refer to $P(B)$ as the **<font color=brown>evidence/effect probability</font>**

**<font color=blue>Example 4:</font> A test for a certain rare disease is assumed to be correct 95% of the time: if a person has the disease, the test results are positive with probability 0.95, and if the person does not have the disease, the test results are negative with probability 0.95. A random person drawn from a certain *population* has probability 0.001 of having the disease. Given that the person just tested positive, what is the probability that the person has the disease?**

<!-- Let $A$ be the event that the person has the disease, $B$ the event that the test results are positive. We are given that $P(B|A) = 0.95$ and $P(A) = 0.001$. We want to compute $P(A|B)$.

\begin{align*}
P(A|B) &= \frac{P(B|A)P(A)}{P(B)}\\
&= \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\overline{A})P(\overline{A})}\\
&= \frac{0.95 \times 0.001}{0.95\times 0.001 + 0.05\times 0.999}\\
&\approx 0.0187
\end{align*} -->
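
The arithmetic for this example can be laid out step by step. The sketch below uses only the numbers stated in the problem; the variable names are illustrative.

```python
# Quantities given in the example
p_disease = 0.001            # prior P(A)
p_pos_given_disease = 0.95   # P(B | A)
p_neg_given_healthy = 0.95   # P(not B | not A)

# False-positive rate P(B | not A)
p_pos_given_healthy = 1 - p_neg_given_healthy

# Evidence P(B) via the Law of Total Probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A | B) via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))
```

Even with a test that is correct 95% of the time, the posterior stays below 2% because the prior probability of the disease is so small.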

# Bayesian Statistics vs Classic Statistics

Bayes' theorem is an *extremely* useful result, formulated by Thomas Bayes in the 18th century and later published by Richard Price.

This result gave rise to the terms **Bayesian statistics** or **Bayesian inference**, which interpret probability differently from **classic statistics** or **Frequentist inference**.

* **Frequentist statistics**: refers to the field of statistics that draws conclusions from data by computing relative frequency of events in the data.

* **Bayesian statistics**: refers to the field of statistics that draws conclusions from data by proposing hypotheses and computing their probability given the observed data.

In **inference**, Bayes' rule makes use of a **prior** which is an assumption made about some underlying phenomenon. Bayes' equation makes use of this prior probability to compute the probability of such cause given some observational data.

In classical inference, by contrast, no assumptions are made about the underlying nature of the system that generated the observational data. Its inference is based purely on the frequency of outcomes.

![ThomasBayes](https://upload.wikimedia.org/wikipedia/commons/d/d4/Thomas_Bayes.gif)
Thomas Bayes (1701-1761), [Wikipedia page](https://en.wikipedia.org/wiki/Thomas_Bayes)

**Which statistical approach should I use to draw conclusions from my data?** We will see that it *depends* on
1. the problem
2. the actual quantity (and quality) of the observational data that you have
3. whether or not you have prior beliefs
4. other factors

In [None]:
from IPython.display import Image
Image('https://imgs.xkcd.com/comics/frequentists_vs_bayesians_2x.png',width=500)

**<font color=blue>Example 5:</font> Suppose that I flipped a coin 5 times and observed the event $E=\{H,H,H,H,H\}$. Without telling you anything else, what is the probability of heads? What would your answer be?**

The **hidden state** of this problem is: what coin was used for this experiment?

* Frequentist statistics: $P(H) = \frac{|H|}{|E|} = \frac{5}{5} = 1$. 

It does not use any prior beliefs.

* Bayesian statistics: you start with a prior belief, e.g. that I am most likely flipping a fair coin, in which case $P(H|\text{fair})=\frac{1}{2}$. You then also compute a probability for that hypothesis given the data:

\begin{align*}
P(\text{fair}| E) &= \frac{P(E|\text{fair})P(\text{fair})}{P(E)}\\
&= \frac{P(E|\text{fair})P(\text{fair})}{P(E|\text{fair})P(\text{fair})+P(E|\text{unfair})P(\text{unfair})}\\
&= \frac{\left(\frac{1}{2}\right)^5\times\frac{1}{2}}{\left(\frac{1}{2}\right)^5\times\frac{1}{2}+(1)^5\times\frac{1}{2}}, \text{ assuming you believe it to be 50/50 between fair and 2-headed}\\
&\approx 0.0303
\end{align*}

If you instead believed I had 2 fair coins and 1 two-headed coin, then the prior probability of a fair coin is $\frac{2}{3}$ and the prior probability of a 2-headed coin is $\frac{1}{3}$. With this, the probability of the hypothesis/cause "coin is fair" is:

$$P(\text{fair}| E) = \frac{\left(\frac{1}{2}\right)^5\times\frac{2}{3}}{\left(\frac{1}{2}\right)^5\times\frac{2}{3}+(1)^5\times\frac{1}{3}} \approx 0.0588$$

Note that a **stronger (prior) belief** influenced the probability of your **hypothesis**.
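
Both posterior values can be reproduced with a small helper. This is a sketch: the function name `posterior_fair` is hypothetical, and exact fractions are used so the results are not affected by rounding.

```python
from fractions import Fraction

def posterior_fair(prior_fair, num_heads=5):
    """P(fair | num_heads heads in a row), with a 2-headed coin as the alternative."""
    like_fair = Fraction(1, 2) ** num_heads   # P(E | fair)
    like_two_headed = Fraction(1)             # P(E | 2-headed)
    evidence = like_fair * prior_fair + like_two_headed * (1 - prior_fair)
    return like_fair * prior_fair / evidence

print(float(posterior_fair(Fraction(1, 2))))  # prior 50/50
print(float(posterior_fair(Fraction(2, 3))))  # prior 2/3 fair
```

The exact values are $\frac{1}{33}$ and $\frac{1}{17}$, matching the approximations above.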

**This is where the (healthy) "rivalry" between Frequentist vs Bayesian emerges:**

* Frequentists say that we should never make assumptions (prior beliefs) because they will change the probability of the hypothesis. Frequentists hold that conclusions should be drawn from the observational data alone. The Frequentist approach to probability is **data-driven**.

* Bayesians say that in situations where we do not have enough data, it is prudent to make assumptions as the conclusions will become more "realistic".

There are strategies to adjust the prior belief (correct its value to a *better* value) as we continue to collect more observations. We will study this.

## Applications of Bayesian Inference

Applications of Bayesian inference are endless. These are some examples:

1. Decision theory, e.g. communication system (example next lecture)

2. Bioinformatics and healthcare, e.g. building a risk model from genetic profiles

3. Recommender systems, e.g. Netflix

4. Stock market prediction

5. Email spam filter

6. Financing, e.g. banks are using Bayesian inference to determine interest rates of a loan by using a risk model

7. many, many others...

___

# Exploratory Data Analysis

*A first look at the data*.

<div class="alert alert-success">
    <b>Exploratory Data Analysis</b>
    
**Exploratory data analysis** or **EDA** is a critical first step in analyzing the data from an experiment. Here are the main reasons we use EDA:
* detection of mistakes
* checking of assumptions
* preliminary selection of appropriate models
* determining relationships among the explanatory variables, and
* assessing the direction and rough size of relationships between explanatory and outcome variables.

Loosely speaking, any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis.
</div>

Exploratory data analysis is generally cross-classified in two ways. First, each method is either 

1. **non-graphical**, or 
2. **graphical**. 

And second, each method is either 
* **univariate**, or 
* **multivariate** (usually just bivariate).

<div class="alert alert-info">
    <b>Types of EDA</b>
    
The four types of EDA are:
* univariate non-graphical
* multivariate non-graphical
* univariate graphical
* multivariate graphical
</div>

Non-graphical methods generally involve calculation of **summary statistics**, while graphical methods obviously summarize the data in a diagrammatic or pictorial way. 

* Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. 
    * Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables. 
    * *It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA.*

## Univariate Data

The data that come from making a particular measurement on all of the subjects in a sample represent our observations for a single characteristic such as age, gender, speed at a task, or response to a stimulus. 

We should think of these measurements as representing a *sample distribution* of the variable, which in turn more or less represents the *population distribution* of the variable. 

The usual goal of univariate non-graphical EDA is to better appreciate the *sample distribution* and also to make some tentative conclusions about what population distribution(s) is/are compatible with the sample distribution. 

* Outlier detection is also a part of this analysis.

<div class="alert alert-info">
    <b>Population</b>
    
A **population** is a group of people, objects, events or observations that is being studied.
</div>

<div class="alert alert-info">
    <b>Parameters</b>
    
Often we are trying to assess some qualities or properties of that population. We call these **parameters**.
</div>

When the population is too large to directly measure the parameters of interest, then we try to draw inferences from a subset of the population.

<div class="alert alert-info">
    <b>Sample</b>
    
A **sample** from a population is a subset of the population that can be used to draw inferences about the parameters of interest.
</div>

* A sample is usually drawn randomly from the population.

* We usually require that each member of the sample is chosen independently from other members.

* Often, but not always, each member in the population is equally likely to be included in the sample.

<div class="alert alert-info">
    <b>Statistic</b>
    
A **statistic** is a measurement of a quality or property on a sample that is used to assess a parameter of the whole population.
</div>

When samples are small, the statistics often provide little or no information about the parameters.

* For example, consider the problem of determining whether a coin is fair or two-headed. The result of flipping a coin one time provides no useful information for determining that.

When samples are larger, they generally more accurately represent the population.

In practice, when dealing with data, there are generally two cases that we will encounter:

1. When designing an experiment, the statistician can choose the sample size to balance between being able to generate a useful statistic and the cost of taking more samples.

2. Sometimes the experiment has already been carried out or is not under the control of the statistician. For instance, the statistician wants to assess something based on an existing survey or compare effects of a change in laws on a set of states. In this case, the sample size is fixed.

### Example: Effect of 1994-2004 Federal Assault Weapon Ban

In 1994, the United States Congress passed a ban on a variety of semiautomatic rifles that are sometimes referred to as "assault weapons". The ban was in effect for 10 years, from 1994-2004. ([State Firearm Laws](https://www.statefirearmlaws.org/resources))

It might be guessed that the goal of any gun ban is to reduce gun violence. Thus it is natural to assess whether the "assault weapon" ban had any effect on gun violence.

Fortunately, the National Center for Health Statistics at the Centers for Disease Control and Prevention (CDC) tracks firearm mortality at the state level. Visualizations of firearm mortality by state, along with links to download the data, are available here:

https://www.cdc.gov/nchs/pressroom/sosmap/firearm_mortality/firearm.htm

Although this page does not have data prior to 2005, the data for 2005 should be similar to that before the ban because the ban was only on the **sale** of certain firearms. It would take many years for this ban to actually affect the availability of firearms.

Thus, we can use two sets of data on that page to measure the effect of the "assault weapons" ban:

* The 2005 data set represents firearm mortality after the ban had been in effect for a decade
* The 2014 data set represents firearm mortality after the ban had been expired for a decade

I have downloaded this data and saved it in the file called **"firearms-combined.csv"**.

**Make sure you have the CSV file wherever you are working on this notebook!**

Now let's read the data from the CSV file into a dataframe:

* Death rates are measured per 100,000 total population.

Let's access the sample values for columns "RATE-2005" and "RATE-2014":

In [None]:
# Note that I went directly to a numpy array here, instead of making a list first
# The reason for using a numpy array is that we want to apply numpy methods for 
# computing statistics further below!
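
Reading the file and extracting the two rate columns might look like the sketch below. The helper name `load_rates` and the returned variable names are hypothetical. Only the file name and the column names "RATE-2005" and "RATE-2014" come from the text above; any other columns in the file are not assumed.

```python
import pandas as pd

def load_rates(source):
    """Read the firearm CSV and return the two rate columns as numpy arrays."""
    df = pd.read_csv(source)
    rate_2005 = df['RATE-2005'].to_numpy()
    rate_2014 = df['RATE-2014'].to_numpy()
    return rate_2005, rate_2014

# Typical call, with the CSV in the notebook's working directory:
# rate_2005, rate_2014 = load_rates('firearms-combined.csv')
```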



Let's begin by plotting this data:

## Histograms

A common visualization is to look at a histogram of the data. Unlike the data in the histograms we previously generated, this data takes on **real values**, not just integers. Fortunately, ```matplotlib``` has functions to do the hard work of making histograms for us:

Some styling will help make this more legible:

Each bar of the histogram represents a "bin" of data values. In fact, the counts and bin edges are returned by the hist function. We can easily change the number of bins to provide more resolution:

Let's add some information to make this more useful:

* **What *inferences* might you make from this plot?**

However, it does not make sense to make the number of bins very large compared to the data size.

## Summary Statistics

Summary statistics are values calculated from sample data that measure some characteristic about the data.

* **What is the most common summary statistic?**

The **average** or **sample mean**. I **strongly** prefer the word average for the statistic computed from a set of data. 

We will use the word **mean** to refer to a type of average for random phenomena, when we do not have specific samples for those values. 

* What does the **average** or **sample mean** mean?

    1. The value around which most of the data "sits"
    
    2. The value that has minimum distance from every value
    
    3. Value most likely to occur
    
    4. Value that divides group into 2 sets of equal size 

Both ```pandas``` and ```numpy``` provide methods to calculate the average:

Other **summary statistics** are used to summarize a set of observations, the most common ones are:

1. **Average** - the value around which most of the data "sits"

2. **Size** - number of observations in the sample data

3. **Count** - number of non-empty observations in the sample data

4. **Median** - the "middle number" of the sorted sample data values

5. **Standard deviation** - a measure of dispersion; roughly, the typical distance between a single observation and the average value

6. **Quartiles** - the boundary values for the lowest, middle and upper quarters of the sample data

7. **Interquartile Range (IQR)** - the range spanned by the "middle fifty" percent of the data
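
The statistics in this list can be computed with `numpy` as in the sketch below. The array `x` holds made-up toy values for illustration, not the firearm data, and `ddof=1` is an assumed choice that gives the sample standard deviation.

```python
import numpy as np

# Toy sample (made-up values for illustration, not the firearm data)
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0])

average = np.mean(x)
size = x.size
median = np.median(x)
std = np.std(x, ddof=1)                      # sample standard deviation
q1, q2, q3 = np.percentile(x, [25, 50, 75])  # quartiles
iqr = q3 - q1                                # interquartile range

print(average, size, median, std, (q1, q2, q3), iqr)
```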

In ```pandas``` we can print a summary statistic table this way:

A good graphical descriptor that displays a few of these summary statistics is the **boxplot** or **whisker plot**:

![boxplot](https://www.simplypsychology.org/boxplot.jpg)

We can use ```matplotlib``` to display a boxplot:

Or we can use built-in ```pandas``` graphic visualizations directly on dataframes:

* **What *inferences* might you make from this plot?**

The sample mean of the 2014 data set is larger than that of the 2005 data set. This may indicate that the expiration of the assault weapon ban in 2004 is associated with an increase in firearm mortality.

However, the difference is relatively small, as are the sample sizes (50).

By performing EDA, we have gathered a lot of information and we may want to start answering some questions that require statistical hypothesis testing and modeling. 

* For example, for the firearm law example, we may *hypothesize* that the observed difference in averages is just due to random sampling from the underlying population, that is, that the ban did not have an effect on the firearm mortality rate.

___

# Binary Hypothesis Testing

* The *null hypothesis* is that there is no real difference between the two data sets, and any differences are just based on random sampling from the underlying population.

So, let's **assume that the two samples are from the same population**. 

* If the null hypothesis is true, combining the samples (called **pooling**) gives a larger sample from the original population. Moreover, this pooled sample represents the original population better than either of the individual samples.

* We can assess how plausible the null hypothesis is by checking how often resamples from the pooled data set have a difference in means at least as large as the one observed.

<div class="alert alert-info">
    <b>Pooling</b>
    
**Pooling** describes the practice of gathering together small sets of data that are assumed to have been *drawn* from the same underlying population and using the combined larger set (the *pool*) to obtain a more precise estimate of that population.
</div>

## Sampling

**The big question:** to sample **with replacement** or **without replacement**?

<div class="alert alert-info">
    <b>Bootstrapping</b>
    
**Sampling with replacement** from a pooled set is called **bootstrapping** and is the most popular resampling technique. It is meant to better emulate independent sampling from the original population.
</div>

<div class="alert alert-info">
    <b>Permutations</b>
    
**Sampling without replacement** from a pooled set better emulates **permutation** tests, where we check every possible reordering of the data into samples. This will be discussed more later.
</div>

* Generally, *sampling without replacement* is more conservative (produces a higher $p$-value) than bootstrapping. 
* Bootstrapping is **easy** and **most popular**, and we apply it here.

**The Bootstrap Idea:** The original sample approximates the population from which it was drawn. So *resamples* from this sample approximate what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on many resamples, approximates the sampling distribution of the statistic, based on many samples.

### Bootstrap Model 1

* How would we randomly choose from this data **with replacement**?

* And, if each resample is a new sample, which size should the resample have?

Recall that ```numpy.random``` has a similar method:
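
A minimal sketch using `numpy.random.choice`. The pooled values are made up for illustration, and the sketch assumes each resample has the size of one original sample (half of the pool).

```python
import numpy as np
import numpy.random as npr

# Toy pooled data (made-up values): two original samples of size 3 combined
pooled = np.array([3.1, 4.7, 5.0, 6.2, 7.8, 9.4])
sample_size = len(pooled) // 2

# Draw one resample with replacement; values may repeat
resample = npr.choice(pooled, size=sample_size, replace=True)
print(resample)
```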

For a significance level of $\alpha = 0.05$, let's build a Bootstrap simulation to compute the probability of observing a mean difference of 0.63 or larger:
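
One possible form of this simulation is sketched below. The function name `bootstrap_p_value`, its parameters, and the one-sided comparison (resampled difference at least as large as the observed one) are assumptions; a two-sided test would compare absolute differences instead.

```python
import numpy as np

def bootstrap_p_value(sample_a, sample_b, num_sims=10_000, seed=0):
    """Estimate P(mean difference >= observed) under the pooled null."""
    rng = np.random.default_rng(seed)
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)
    observed = sample_b.mean() - sample_a.mean()

    # Under the null hypothesis both samples come from the pooled data
    pooled = np.concatenate([sample_a, sample_b])

    count = 0
    for _ in range(num_sims):
        # Resample each group from the pool, with replacement
        new_a = rng.choice(pooled, size=sample_a.size, replace=True)
        new_b = rng.choice(pooled, size=sample_b.size, replace=True)
        if new_b.mean() - new_a.mean() >= observed:
            count += 1
    return count / num_sims

# Usage with the two rate arrays (array names assumed):
# p_value = bootstrap_p_value(rate_2005, rate_2014)
```

The returned value is compared with $\alpha$: a value below 0.05 would count as statistically significant at this level.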

* **What is the conclusion?**

    * **Is the result statistically significant?** <!--No, because the p-value is larger than $\alpha=0.05$.-->
    * **Can we reject the null hypothesis?** <!--No, "we cannot reject the null hypothesis". -->
    * **Conclusion:** <!--The data suggests that the ban did not have an effect on firearm mortality rate.-->

### Bootstrap Model 2

A more reasonable bootstrap approach would be to randomly assign values from 2005 or 2014 **for each state** and then assess the difference:

In [None]:
# Alternatively: Use the Pandas library



Now, we want to use a special kind of array indexing: **fancy indexing**.

For a significance level of $\alpha = 0.05$, let's build a Bootstrap simulation to compute the probability of observing a mean difference of 0.63 or larger:
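
The sketch below is one reading of "randomly assign values from 2005 or 2014 for each state", not necessarily the approach used in class. For every state it randomly decides which year's value goes into the first group and puts the other year's value into the second group, using fancy indexing with one (row, column) pair per state. Strictly speaking this swaps labels within a state, so it is closer to a paired permutation test than to resampling with replacement. The function and array names are hypothetical.

```python
import numpy as np

def paired_bootstrap_diffs(rate_a, rate_b, num_sims=10_000, seed=0):
    """Mean differences after randomly swapping the two years within each state."""
    rng = np.random.default_rng(seed)
    data = np.vstack([rate_a, rate_b]).astype(float)   # shape (2, n_states)
    n_states = data.shape[1]
    cols = np.arange(n_states)

    diffs = np.empty(num_sims)
    for i in range(num_sims):
        # For each state, pick row 0 or row 1 for the first group
        pick = rng.integers(0, 2, size=n_states)
        # Fancy indexing: one (row, column) pair per state
        group_1 = data[pick, cols]
        group_2 = data[1 - pick, cols]
        diffs[i] = group_2.mean() - group_1.mean()
    return diffs

# Usage (array names assumed):
# diffs = paired_bootstrap_diffs(rate_2005, rate_2014)
# p_value = np.mean(diffs >= 0.63)
```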

* **What is the conclusion?**

    * **Is the result statistically significant?** <!--Yes, because the p-value is smaller than $\alpha=0.05$.-->
    * **Can we reject the null hypothesis?** <!--Yes, we reject the null hypothesis-->
    * **Conclusion:** <!--Under this interpretation, the end of the restriction on assault weapons is associated with an increase in mean firearm mortality.-->
    
<!--It depends on how you interpret the data!-->

### Distribution of the bootstrap mean-difference

Every time we create a bootstrap value for the difference of means, we create a new random value. Let's see how the bootstrap means are distributed by looking at a histogram of those values:

A few observations:
    
1. The difference of means has a bell shape -- we saw that before. Why do you think that is?
2. Almost all of the values fall between -0.5 and +0.5. Thus, it is not surprising that getting a mean-difference as large as 0.63 is very rare.

**Topic for later:** The **Central Limit Theorem** (CLT) for sums says that if you keep drawing larger and larger samples and taking their sums, the sums have their own distribution (the sampling distribution), which approaches a normal distribution as the sample size increases. 

We can now consider the question: **which values of the mean-difference are consistent with the null hypothesis at the 95% confidence level, so that we would NOT reject it?**

So, the percentage of data lying below -0.5 is:

Similarly, the percentage lying above 0.5 is:
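
Both tail percentages can be computed from the array of bootstrap mean-differences with boolean masks. This is a sketch; the helper name `tail_percentages` and the array name `diffs` in the usage comment are hypothetical.

```python
import numpy as np

def tail_percentages(diffs, bound=0.5):
    """Percent of values below -bound and above +bound."""
    diffs = np.asarray(diffs, dtype=float)
    below = 100 * np.mean(diffs < -bound)
    above = 100 * np.mean(diffs > bound)
    return below, above

# Usage (array name assumed):
# below, above = tail_percentages(diffs, bound=0.5)
```

The percentage of values inside $[-0.5, 0.5]$ is then 100 minus the sum of the two tails.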

Another way to express this is that 99.80% of the data is between $[-0.5, 0.5]$.

This is an example of a **confidence interval**. 

* Confidence intervals offer an alternative to $p$-values that provide more information. 

* When we say an $x$% confidence interval, we usually mean the region such that $(100-x)/2$% of samples will fall below the confidence interval, and $(100-x)/2$% of samples will fall above the confidence interval. 

The confidence interval for a bootstrap statistic cannot be known exactly, but it can be estimated accurately given enough samples of the bootstrap statistic.

___

# Confidence Intervals

**Procedure for Estimating Confidence Interval for a Bootstrap Statistic**

1. Draw $N$ samples from the pooled data with replacement
2. For each sample, compute the desired statistic and store it
3. Sort all of the stored statistics
4. For an $x$% confidence interval:
    * the lower bound of the confidence interval is the element in position $N(1-x/100)/2$
    * the upper bound of the confidence interval is the element in position $N-N(1-x/100)/2= N(1+x/100)/2$
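
Steps 3 and 4 of the procedure can be sketched as follows, assuming the bootstrap statistics have already been collected in an array. The function name `percentile_interval` is hypothetical, positions are 0-based as in Python, and the upper position is taken as the mirror image of the lower one.

```python
import numpy as np

def percentile_interval(stats, confidence=95):
    """Central interval holding `confidence` percent of the bootstrap statistics."""
    sorted_stats = np.sort(np.asarray(stats, dtype=float))
    n = sorted_stats.size
    tail = (1 - confidence / 100) / 2        # fraction left in each tail
    lower_pos = int(round(n * tail))         # position of the lower bound
    upper_pos = n - lower_pos - 1            # mirrored position (0-based)
    return sorted_stats[lower_pos], sorted_stats[upper_pos]

# Usage (array name assumed):
# lower, upper = percentile_interval(diffs, confidence=95)
```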

**<font color=blue>Example 6:</font> Compute the 95% confidence interval for the example above.**

Find the **position** in the sorted sequence of the lower bound of the confidence interval:

Now find the **value** of the sorted data at that position. That is the lower end of our confidence region:

Finding the position of the upper bound of the confidence interval is most easily done using the position of the lower bound:

Thus, the 95% confidence interval is $[-0.31, 0.31]$.

**How can confidence intervals be used in place of $p$-values?** 
* Instead of conducting a binary hypothesis test with $\alpha=0.05$, we can compute the 95% confidence interval for the mean difference. Then we check whether the observed value lies within the 95% confidence interval.

The observed mean-difference value was 0.63. This falls outside the 95% confidence interval $[-0.31,0.31]$. The fact that the observed value is far outside the 95% confidence interval makes it likely that we could have used a stricter criterion (like a 99% confidence interval).