In [1]:
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

**Make sure you have the CSV file wherever you are working on this notebook!**

## Effect of 2004 Assault Weapons Ban -- Continued

In [2]:
df=pd.read_csv("firearms-combined.csv")


# Summary Statistics

Both Pandas and Numpy provide methods to calculate the average:

In [None]:
rate2005=np.array(df["RATE-2005"])
rate2014=np.array(df["RATE-2014"])

In [None]:
rate2005.mean()

In [None]:
np.mean(rate2014)

In [None]:
diff=rate2014.mean()-rate2005.mean()
diff

How could we conduct an experiment to determine whether this difference might be caused by sampling?

The null hypothesis is that there is no real difference between the two data sets, and any differences are just the result of random sampling from the underlying population.

So, let's assume that the two samples are from the same population. 

If the null hypothesis is true, combining the samples (called **pooling**) gives us a new, larger sample from the original population. Moreover, the pooled data better represents the original population than either of the samples alone.

We can test whether the null hypothesis is plausible by checking how often samples from the pooled data set have a difference in means as large as the one observed.

**The big question**

Should we sample **with replacement** or **without replacement**?

Sampling with replacement is called **bootstrapping** and is the most popular resampling technique. It is meant to better emulate independent sampling from the original population.

Sampling without replacement better emulates permutation tests, where we check every possible reordering of the data into samples. This will be discussed more later.

Generally, sampling without replacement is more conservative (produces a higher $p$-value) than bootstrapping. Bootstrapping is **easy** and **most popular**, and we apply it here:

### Bootstrap Model 1

In [None]:
pooled=np.concatenate((rate2005,rate2014))

In [None]:
pooled,pooled.size

In [None]:
npr.choice(pooled,size=50)
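The single draw above can be extended into a full resampling test. The following is a minimal sketch, assuming both samples are NumPy arrays; the function name `bootstrap_pooled_pvalue`, the parameter `n_boot`, and the demo arrays are hypothetical, and the demo values are synthetic placeholders rather than the firearms data.

```python
import numpy as np

def bootstrap_pooled_pvalue(sample_a, sample_b, n_boot=10_000, seed=0):
    """Estimate how often pooled resamples differ in mean as much as observed."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate((sample_a, sample_b))
    observed = abs(sample_b.mean() - sample_a.mean())
    count = 0
    for _ in range(n_boot):
        # Draw two new samples WITH replacement from the pooled data
        new_a = rng.choice(pooled, size=sample_a.size, replace=True)
        new_b = rng.choice(pooled, size=sample_b.size, replace=True)
        # Two-sided check: is this difference at least as large as the observed one?
        if abs(new_b.mean() - new_a.mean()) >= observed:
            count += 1
    return count / n_boot

# Synthetic placeholder data (NOT the firearms data), for illustration only
demo_rng = np.random.default_rng(1)
demo_a = demo_rng.normal(10.0, 4.0, size=50)
demo_b = demo_a + demo_rng.normal(0.5, 1.0, size=50)
p_model1 = bootstrap_pooled_pvalue(demo_a, demo_b)
```

With the real data, `rate2005` and `rate2014` would take the place of `demo_a` and `demo_b`. The returned fraction is an estimate of the $p$-value under this pooled model.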

What are some problems with this approach?

<!-- We treated all the states in the two samples as if they were independent -- but they are measurements on the same states! That violates our assumptions. 

The states have their own firearms mortality behavior based on other factors (rural/urban, education level, laws, etc.) -->



### Bootstrap Model 2

<!-- A more reasonable bootstrap approach would be to randomly assign values from 2005 or 2014 for each state and then assess the difference: -->
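One way to respect the fact that both measurements come from the same states is to keep each state's pair of values together and randomly decide, state by state, which value is labeled 2005 and which 2014. The sketch below is one possible reading of that idea, not a definitive implementation; `paired_swap_pvalue`, `n_boot`, and the demo arrays are hypothetical names, and the demo values are synthetic placeholders.

```python
import numpy as np

def paired_swap_pvalue(before, after, n_boot=10_000, seed=0):
    """Randomly swap the two year labels within each state and compare mean differences."""
    rng = np.random.default_rng(seed)
    diffs = after - before              # one difference per state
    observed = abs(diffs.mean())
    # Each row is one resample: +1 keeps a state's labels, -1 swaps them
    signs = rng.choice([-1.0, 1.0], size=(n_boot, diffs.size))
    resampled = (signs * diffs).mean(axis=1)
    # Fraction of resamples with a mean difference at least as large as observed
    return float(np.mean(np.abs(resampled) >= observed))

# Synthetic placeholder data (NOT the firearms data), for illustration only
demo_rng = np.random.default_rng(1)
demo_before = demo_rng.normal(10.0, 4.0, size=50)
demo_after = demo_before + demo_rng.normal(0.5, 1.0, size=50)
p_model2 = paired_swap_pvalue(demo_before, demo_after)
```

Swapping the labels for a state only flips the sign of that state's difference, which is why the code works with `diffs` directly instead of rebuilding two full samples.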

What is the conclusion? 

## Distribution of the bootstrap mean-difference

Every time we create a bootstrap value for the difference of means, we create a new random value. Let's see how the bootstrap means are distributed by looking at a histogram of those values:
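A minimal sketch of such a histogram follows, assuming the pooled resampling scheme. The arrays are synthetic placeholders (not the firearms data), and the names `boot_diffs` and `n_boot` are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic placeholder samples (NOT the firearms data)
demo_a = rng.normal(10.0, 4.0, size=50)
demo_b = rng.normal(10.5, 4.0, size=50)
pooled = np.concatenate((demo_a, demo_b))

# Collect one bootstrap mean-difference per iteration
n_boot = 5_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    new_a = rng.choice(pooled, size=demo_a.size)  # with replacement by default
    new_b = rng.choice(pooled, size=demo_b.size)
    boot_diffs[i] = new_b.mean() - new_a.mean()

# Histogram of the bootstrap differences, with the observed difference marked
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(boot_diffs, bins=40)
ax.axvline(demo_b.mean() - demo_a.mean(), color="red", linestyle="--")
ax.set_xlabel("Bootstrap difference of means")
ax.set_ylabel("Count")
```

The dashed line marks the observed difference, which makes it easy to judge by eye how far into the tail of the bootstrap distribution it falls.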

A few observations:

1.

2.


### Confidence Intervals

An alternative to specifying $p$-values is to specify **confidence intervals**. The $x$% (symmetric) confidence interval for a statistic is the region such that $(100-x)/2$% of samples will fall below the confidence interval, and $(100-x)/2$% of samples will fall above the confidence interval. For example, the 95% confidence interval leaves 2.5% of samples in each tail.

The confidence interval for a bootstrap statistic cannot be known exactly, but it can be estimated accurately given enough samples of the bootstrap statistic.

**Procedure for Estimating Confidence Interval for a Bootstrap Statistic**
1. Draw $N$ samples from the pooled data using replacement
2. For each sample, compute the desired statistic and store it
3. Sort all of the stored statistics
4. For the $x$% confidence interval, let $\alpha = 1 - x/100$ (for example, $\alpha = 0.05$ for 95%):
    * the lower bound of the confidence interval is the element in position $N\alpha/2$
    * the upper bound of the confidence interval is the element in position $N - N\alpha/2 - 1 = N(1-\alpha/2) - 1$
    
    (Assuming 0-based indexing)
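The sort-and-index procedure can be sketched as a small helper. Here `level` is the confidence level written as a fraction (0.95 for 95%) and $\alpha = 1 - $ `level` is the total fraction left in the two tails; `bootstrap_ci` and `demo_stats` are hypothetical names, and the demo values are placeholders chosen so the indices are easy to check.

```python
import numpy as np

def bootstrap_ci(boot_stats, level=0.95):
    """Return (lower, upper) bounds from sorted bootstrap statistics (0-based indexing)."""
    ordered = np.sort(np.asarray(boot_stats))
    n = ordered.size
    alpha = 1.0 - level                       # total fraction in both tails
    lower_idx = int(round(n * alpha / 2))     # position N*alpha/2, rounded to an integer index
    upper_idx = n - lower_idx - 1             # position N - N*alpha/2 - 1
    return ordered[lower_idx], ordered[upper_idx]

# Placeholder statistics 0, 1, ..., 999 so the selected positions are visible
demo_stats = np.arange(1000.0)
lo, hi = bootstrap_ci(demo_stats, level=0.95)
```

With these 1000 placeholder values, `lo` is the element in position 25 and `hi` is the element in position 974, leaving 25 values in each tail.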

**Compute the 95% confidence interval for the example above**

Thus, the 95% confidence interval is

Conclusion:

**Lecture 11 Assignment**
1. Compute the 99% confidence interval for this second way of carrying out the bootstrap on the 2005/2014 firearms mortality data sets. Resample a few times and see if the confidence interval varies. How many samples are needed to get a reliable estimate of the 99% confidence interval for this statistic?
2. Compute the 95% confidence interval for the first way of carrying out the bootstrap on the 2005/2014 firearms mortality data sets. How many samples are needed to get a reliable estimate of the 95% confidence interval for this statistic?
3. Compute the 99% confidence interval for the first way of carrying out the bootstrap on the 2005/2014 firearms mortality data sets. How many samples are needed to get a reliable estimate of the 99% confidence interval for this statistic?

## Effect of State Laws

The column "Total Laws 2014" shows the total number of gun laws in each state as of 2014. The data is from 

https://www.statefirearmlaws.org/resources

Now we have 2 data sources that are both in 2014, but they do not represent two samples from the same population. Instead, they represent two things that may depend on each other.

Again, the first thing to do is plot the data. When we have two sets of data that may be dependent, a scatter plot is usually the first tool to reach for:
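A minimal sketch of such a scatter plot follows. The column names "Total Laws 2014" and "RATE-2014" come from the data set described above, but `demo_df` is a hypothetical stand-in filled with synthetic values so that the cell runs without the CSV file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data frame (NOT the firearms data)
demo_df = pd.DataFrame({
    "Total Laws 2014": rng.integers(1, 101, size=50),
    "RATE-2014": rng.normal(11.0, 4.0, size=50),
})

# One point per state: number of laws on x, mortality rate on y
fig, ax = plt.subplots()
ax.scatter(demo_df["Total Laws 2014"], demo_df["RATE-2014"])
ax.set_xlabel("Total Laws 2014")
ax.set_ylabel("RATE-2014")
```

Because the placeholder values are independent random numbers, this plot shows no pattern; replacing `demo_df` with the `df` loaded from the CSV is what reveals the real relationship.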

What does this data suggest?

We don't know how to measure the dependence between two data sets like this -- **yet**

Let's see if we can turn this into two data sets:

The data seems to fall into two clusters, one with < 50 laws and one with > 50 laws. 

Finding clusterings of data is a job computers are good at.

Let's use a standard clustering algorithm to see what it finds.



There are several libraries that contain clustering algorithms. 

Clustering is a type of unsupervised machine learning; unsupervised means that you do not need labels for any of the data.

![scikit-learn logo](https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)
The ```scikit-learn``` module has many useful methods for machine learning, including clustering

We will start with one of the simplest clustering algorithms:

The $K$-Means Algorithm is a randomized, iterative algorithm to cluster data. We will need to put the data into columns of a matrix as follows:
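To make the idea concrete, here is a plain-NumPy sketch of the basic $K$-means iteration applied to a two-column matrix. It is an illustration of the algorithm under simple assumptions, not scikit-learn's implementation; in practice the `KMeans` class in `sklearn.cluster` would normally be used. The function name `kmeans` and all data values are hypothetical, synthetic placeholders.

```python
import numpy as np

def kmeans(X, k=2, n_iter=100, seed=0):
    """Basic K-means: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Randomized start: pick k distinct data points as initial centers
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has none)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

# Synthetic placeholder data (NOT the firearms data): two separated groups
rng = np.random.default_rng(1)
laws = np.concatenate((rng.integers(5, 31, size=25), rng.integers(70, 101, size=25)))
rates = np.concatenate((rng.normal(12.0, 3.0, size=25), rng.normal(8.0, 3.0, size=25)))

# Put the two variables into the columns of a matrix: one row per state
X = np.column_stack((laws, rates))
labels, centers = kmeans(X, k=2)
```

Each row of `X` is one observation and each column is one variable, which is the layout clustering routines generally expect. Note that the two columns are on different scales here, so the number of laws dominates the distance calculation.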

Now let's partition the data accordingly:

**Note that less50 is a pandas series object. It also has a mean method, as well as other useful summary statistics**
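A minimal sketch of this partition using a boolean mask is shown here. The name `less50` comes from the note above; `more50`, `demo_df`, and the data values are hypothetical, with synthetic placeholders standing in for the real data frame. Placing states with exactly 50 laws in `more50` is an assumption, since the clusters were described only as below and above 50.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data frame (NOT the firearms data)
demo_df = pd.DataFrame({
    "Total Laws 2014": rng.integers(1, 101, size=50),
    "RATE-2014": rng.normal(11.0, 4.0, size=50),
})

# Boolean mask: True for states with fewer than 50 laws
mask = demo_df["Total Laws 2014"] < 50

# Each partition is a pandas Series of mortality rates
less50 = demo_df.loc[mask, "RATE-2014"]
more50 = demo_df.loc[~mask, "RATE-2014"]   # assumption: exactly 50 laws lands here

# Series objects provide summary statistics directly
mean_less50 = less50.mean()
mean_more50 = more50.mean()
summary_less50 = less50.describe()
```

Calling `np.array(less50)` converts a Series to a NumPy array if the later resampling code expects arrays.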

Again, if we want to perform a binary hypothesis test, we need to pool the data and draw representative samples from it:

What is your conclusion?

What are some issues with this analysis?