# Statistical Thinking in Python (Part 2)

## Introduction

These are my notes for DataCamp's course [_Statistical Thinking in Python (Part 2)_](https://www.datacamp.com/courses/statistical-thinking-in-python-part-2).

This course is presented by Justin Bois, Lecturer at the California Institute of Technology. Collaborators are Yashas Roy and Hugo Bowne-Anderson.

Prerequisites:

- [_Statistical Thinking in Python (Part 1)_](../Statistical%20Thinking%20in%20Python%20Part%201/Statistical%20Thinking%20in%20Python%20Part%201.ipynb)

This course is no longer part of any skill or career track, but it's an excellent course.

For bootstrap analysis, see the new course "Sampling in Python".

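Be careful using `np.polynomial.Polynomial.fit()`: it fits in a scaled domain, so call `.convert()` on the result (or set `domain` and `window`) to get coefficients in the units of the data.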

## Resources
- Introduction to Probability Theory, STAT 414 at Pennsylvania State University: https://online.stat.psu.edu/stat414/
- The code that underlies least squares analysis in NumPy: https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq
- Least squares fitting: https://mmas.github.io/least-squares-fitting-numpy-scipy
- Fitting polynomials in NumPy: https://numpy.org/doc/stable/reference/routines.polynomials.classes#fitting
- scikit-learn linear regression versus NumPy polyfit (not very well-written but plenty of code): https://techflare.blog/scikit-learn-linearregression-vs-numpy-polyfit/
- YouTube video about bootstrapping: https://www.youtube.com/watch?v=N4ZQQqyIf6k

## Imports and Code Initialization

Imports are collected here for convenience and clarity. Initialize the default random number generator. Add the `ecdf()` function from part one of this course.

In [None]:
import pprint

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns

# Use dark mode for plotting.
plt.style.use("dark_background")

# Initialize the random number generator.
rng = np.random.default_rng()

# Add a function that takes as input a 1-D array of data and returns the
# x and y values of the ECDF.
def ecdf(data):
    """
    Compute the ECDF of a one-dimensional array of measurements.

    x, y = ecdf(data)
    """
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

# Add a function for calculating and returning the Pearson
# correlation coefficient.
def pearson_r(x, y):
    """
    Compute the Pearson correlation coefficient.
    
    r = pearson_r(x, y)
    """
    corr_mat = np.corrcoef(x, y)
    return corr_mat[0, 1]

## Data Sets

| Data Set | File |
| :--- | :--- |
| 2008 election all data | 2008_all_states.csv |
| 2008 election swing states | 2008_swing_states.csv |
| Anscombe data | anscombe.csv |
| Bee sperm counts | bee_sperm.csv |
| Female literacy and fertility | female_literacy_fertility.csv |
| Finch beaks (1975) | finch_beaks_1975.csv |
| Finch beaks (2012) | finch_beaks_2012.csv |
| Fortis beak depth heredity | fortis_beak_depth_heredity.csv |
| Frog tongue data | frog_tongue.csv |
| Major League Baseball no-hitters | mlb_nohitters.csv |
| Scandens beak depth heredity | scandens_beak_depth_heredity.csv |
| Sheffield Weather Station | sheffield_weather_station.csv |

Although it is simple to load each data set into a pandas DataFrame, this course extracts the data into NumPy ndarrays.

### MLB No-Hitter Times

Load the data into a NumPy ndarray named `nohitter_times`.

In [None]:
# Create a numpy ndarray containing the number of games between no-hitters
# from 1900 through 2015.

# Load the dates and other information of no-hit baseball games.
# The dates in the CSV file are encoded as "18760715", but pd.read_csv()
# parses them correctly.
nohitters = pd.read_csv("mlb_nohitters.csv", parse_dates=[0])
print(nohitters.info())
print()
print(nohitters.head())
print()

# Extract the number of games between no-hitters for the modern era
# (1900-2015).
# mengn stands for modern_era_nohitters_game_numbers; this is a pandas Series.
mengn = nohitters[nohitters["date"] > "1900-01-01"]["game_number"]

# Create a numpy ndarray containing the number of games between no-hitters.
# np.diff() gives the difference between consecutive game numbers.
nohitter_times = np.diff(mengn.to_numpy()) - 1
print(nohitter_times)

### Female Literacy and Fertility

Load the data into NumPy ndarrays named `illiteracy` and `fertility`.

In [None]:
# Load the data into a pandas DataFrame.
# flf: female literacy and fertility.
# Add the thousands="," argument to convert values in the population
# column to int64 values instead of object values.
flf = pd.read_csv("female_literacy_fertility.csv", thousands=",")
print(flf.info())
print()
print(flf.head())
print()

# Create numpy.ndarrays for illiteracy and fertility values.
illiteracy = 100 - flf["female literacy"].to_numpy()
# print(type(illiteracy))
# print(illiteracy)
fertility = flf["fertility"].to_numpy()
# print(type(fertility))
# print(fertility)

### Anscombe's Quartet

See https://en.wikipedia.org/wiki/Anscombe%27s_quartet and https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html.

The CSV file has two header lines.

Store the data in two lists, `anscombe_x` and `anscombe_y`, where each item in a list is a NumPy ndarray.

In [None]:
anscombe = pd.read_csv("anscombe.csv", header=[0, 1])
print(anscombe.info())
print()
print(anscombe)
print()
# Demonstrate how to extract data from the DataFrame.
# print(anscombe[("0", "x")])
# print(anscombe[("0", "y")])
# Create lists of ndarrays for use in the exercise.
anscombe_x = [anscombe[(group, "x")].to_numpy() for group in ["0", "1", "2", "3"]]
anscombe_y = [anscombe[(group, "y")].to_numpy() for group in ["0", "1", "2", "3"]]
pprint.pprint(anscombe_x)
print()
pprint.pprint(anscombe_y)

### Michelson Speed of Light

The chapter on bootstrapping made use of this data set from the "Statistical Thinking in Python (Part 1)" course; I copied the CSV file into the directory for this course. Create a numpy.ndarray containing the speed of light results.

In [None]:
# Load the Michelson speed of light data.
light = pd.read_csv("michelson_speed_of_light.csv", index_col=0)
print(light.info())
print()
print(light.head())
print()
michelson_speed_of_light = light["velocity of light in air (km/s)"].to_numpy()
print(michelson_speed_of_light)

### Rainfall at Sheffield Weather Station

The data is in a CSV file, but the fields are delimited by whitespace characters, and there are multiple header lines. Skip 8 rows, take the header from the 9th row, and start reading data at the 10th row.

The records are monthly; combine the rainfall values for a year into a total for the year to create the variable `rainfall`, a numpy.ndarray.
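
The monthly-to-annual aggregation can also be sketched with `groupby()` instead of a loop over the years. This is a minimal illustration on a small made-up DataFrame; the name `monthly` and all of the values are assumptions, and only the `yyyy`, `mm`, and `rain` column names follow the file.

```python
import numpy as np
import pandas as pd

# Made-up monthly records for illustration; this is not the Sheffield data.
monthly = pd.DataFrame(
    {
        "yyyy": [1883, 1883, 1883, 1884, 1884, 1884],
        "mm": [1, 2, 3, 1, 2, 3],
        "rain": [10.0, 20.5, np.nan, 5.0, 15.0, 30.0],
    }
)

# Sum the monthly values within each year; missing values are skipped.
annual = monthly.groupby("yyyy")["rain"].sum()
rainfall_demo = annual.to_numpy()
print(annual)
```

Like `Series.sum()` in the loop version, the grouped sum skips missing values, so a year with missing months gets a total that is too low rather than `NaN`.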

In [None]:
# Read the file, using a regular expression separator of "\s+".
# header=8 is zero-based: the column names are on the 9th line of the file.
# Missing values are coded as "---".
sheffield = pd.read_csv("sheffield_weather_station.csv", sep=r"\s+", header=8, na_values="---")
print(sheffield.info())
print()
print(sheffield.head())
print()

# We have rain in mm units by month; these need to be combined into
# annual rainfall for the years 1883 through 2015.
rainfall_list = []
for year in range(1883, 2016):
    annual_rain = sheffield[sheffield["yyyy"] == year]["rain"].sum()
    rainfall_list.append(annual_rain)
# print(rainfall_list)
# print()
rainfall = np.array(rainfall_list)
print(rainfall)
print()

# Extract data for rain_june and rain_november.
rain_june = sheffield[(sheffield["mm"] == 6) & (sheffield["yyyy"] <= 2015)]["rain"].to_numpy()
rain_november = sheffield[sheffield["mm"] == 11]["rain"].to_numpy()
print(rain_june)
print()
print(rain_november)

### 2008 Election (Swing States)

This data, from Statistical Thinking in Python (Part 1), is used in a demonstration.

In [None]:
# Load the 2008 election results for swing states into a pandas DataFrame.
# Read the data from the CSV file.
swing_states = pd.read_csv("2008_swing_states.csv")
print(swing_states.info())
print()
print(swing_states.head())

### 2008 Election (All States)

This data, from Statistical Thinking in Python (Part 1), is used for some extra work.

In [None]:
# Load the 2008 election results for all states into a pandas DataFrame.
# Read the data from the CSV file.
all_states = pd.read_csv("2008_all_states.csv")
print(all_states.info())
print()
print(all_states.head())

### Frog Tongue

In [None]:
# Read the frog tongue force data.
frog_tongue = pd.read_csv("frog_tongue.csv", header=14)
print(frog_tongue.info())
print()
print(frog_tongue.head())

### Bee Sperm

Straub, et al. (Proc. Roy. Soc. B, 2016) investigated the effects of neonicotinoids on the sperm of pollinating bees.

In [None]:
# Read the data from the CSV file.
# Multiply the values by 2 to obtain the values in the course's arrays.
bee_sperm = pd.read_csv("bee_sperm.csv", header=3)
print(bee_sperm.info())
print()
print(bee_sperm.head())
control = bee_sperm[bee_sperm["Treatment"] == "Control"]["Alive Sperm Millions"].to_numpy()
control = control * 2
print(control)
print(np.mean(control))
print()
treated = bee_sperm[bee_sperm["Treatment"] == "Pesticide"]["Alive Sperm Millions"].to_numpy()
treated = treated * 2
print(treated)
print(np.mean(treated))

### Finch Beaks

In [None]:
# Create a DataFrame named beak_depth with columns "year" and "beak_depth".
finch_beaks_1975 = pd.read_csv("finch_beaks_1975.csv", header=0)
scandens_beaks_1975 = finch_beaks_1975[finch_beaks_1975["species"] == "scandens"].copy()
scandens_beaks_1975["year"] = 1975
print(scandens_beaks_1975.info())
print()
print(scandens_beaks_1975.head())
print()

finch_beaks_2012 = pd.read_csv("finch_beaks_2012.csv", header=0)
scandens_beaks_2012 = finch_beaks_2012[finch_beaks_2012["species"] == "scandens"].copy()
scandens_beaks_2012["year"] = 2012
print(scandens_beaks_2012.info())
print()
print(scandens_beaks_2012.head())
print()

# Create the new DataFrame for beak depth data.
year_data = \
    np.concatenate(
        (
            scandens_beaks_1975["year"].to_numpy(),
            scandens_beaks_2012["year"].to_numpy()
        )
    )
beak_depth_data = \
    np.concatenate(
        (
            scandens_beaks_1975["Beak depth, mm"].to_numpy(),
            scandens_beaks_2012["bdepth"].to_numpy()
        )
    )
data_dict = {
    "beak_depth": beak_depth_data,
    "year": year_data
}
beak_depth = pd.DataFrame(data_dict)
print(beak_depth.info())
print()
print(beak_depth.head())
print()

# Create NumPy arrays for beak depth and beak length by year.
bl_1975 = scandens_beaks_1975["Beak length, mm"].to_numpy()
bd_1975 = scandens_beaks_1975["Beak depth, mm"].to_numpy()
bl_2012 = scandens_beaks_2012["blength"].to_numpy()
bd_2012 = scandens_beaks_2012["bdepth"].to_numpy()
print(bl_1975)
print()
print(bd_1975)
print()
print(bl_2012)
print()
print(bd_2012)

## Parameter Estimation by Optimization

### Optimal Parameters

scipy.stats and statsmodels are two good Python packages for statistical inference by optimization. This course, however, focuses on the use of hacker statistics, which is adaptable to a wide range of statistical problems.

#### CDFs of Expected and Observed Games between No-Hitters (Extra)

These are my thoughts before attempting the course's exercise:
- The waiting time between no-hitters could be modeled using the exponential distribution.
- What is the parameter of the exponential distribution?
- Estimate the parameter.
- Plot the ECDF.
- Using the estimated parameters and hacker statistics, get a random sample from the distribution and plot it against the ECDF.
- How do they compare?

Look at the section in the "Statistical Thinking in Python (Part 1)" course about the exponential distribution. We need to estimate the parameter for the exponential distribution by calculating the mean number of games between no-hitters.
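
Why the mean? For exponentially distributed data, the sample mean is the value of the parameter that maximizes the likelihood of the data. The sketch below checks this numerically on synthetic waiting times. The value 750 and the names `waits` and `exp_log_likelihood` are assumptions made for illustration, not part of the course.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic waiting times; the value 750 is an arbitrary assumption,
# not an estimate from the no-hitter data.
waits = rng.exponential(750.0, size=250)


def exp_log_likelihood(tau, data):
    """Log-likelihood of exponential data for a mean waiting time tau."""
    return -len(data) * np.log(tau) - data.sum() / tau


# Evaluate the log-likelihood on a grid of candidate tau values (step 0.1).
tau_grid = np.linspace(400.0, 1200.0, 8001)
log_like = exp_log_likelihood(tau_grid, waits)
tau_grid_best = tau_grid[np.argmax(log_like)]

tau_hat = waits.mean()
print("sample mean:", tau_hat)
print("grid maximum of the log-likelihood:", tau_grid_best)
```

The two printed values should agree to within the grid spacing.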

In [None]:
# Calculate the mean of nohitter_times as tau.
# In modern baseball, there are 30 teams that play 162 games each.
# Since each game involves 2 teams, there are 30 * 162 / 2 = 2,430 games
# per season. This means there are about 3 nohitters per season on average.
tau = nohitter_times.mean()
print("mean of nohitter_times:", tau)

# Draw a random sample from the exponential distribution with parameter tau.
inter_nohitter_time = rng.exponential(tau, size=100000)

# Plot the CDF of the random sample and the ECDF of the observed data.
# Use a line plot with the expected data.
# The two CDFs are very similar.
x_e, y_e = ecdf(inter_nohitter_time)
plt.plot(x_e, y_e, label="expected")
x_o, y_o = ecdf(nohitter_times)
plt.plot(x_o, y_o, marker=".", linestyle="none", label="observed")
plt.xlabel("Games between nohitters")
plt.ylabel("CDF")
plt.legend()
plt.show()

#### How Often Do We Get No-Hitters? (PDF Plot) (Exercise)

In [None]:
# Plot the PDF of the random sample and label axes.
_ = plt.hist(inter_nohitter_time, bins=50, density=True, histtype="step")
_ = plt.xlabel('Games between no-hitters')
_ = plt.ylabel('PDF')
plt.show()

#### Do the Data Follow Our Story? (ECDF Plot) (Exercise)

This exercise repeats the work I did above plotting the ECDF of the observed and expected data.

> It looks like no-hitters in the modern era of Major League Baseball are Exponentially distributed. Based on the story of the Exponential distribution, this suggests that they are a random process; when a no-hitter will happen is independent of when the last no-hitter was.

#### How Is this Parameter Optimal? (Exercise)

Plot the theoretical (expected) CDFs from tau, tau / 2, and tau * 2 along with the observed CDF. The best match is when using tau to represent the parameter of the exponential distribution.
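
The comparison can also be made numerically rather than by eye. The sketch below uses the analytic exponential CDF, 1 - exp(-x / tau), instead of random samples, and measures the largest vertical gap between it and the ECDF. The data are synthetic and the helper `max_cdf_gap` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic waiting times standing in for the observed data (an assumption).
observed = rng.exponential(750.0, size=250)
tau_demo = observed.mean()

# ECDF of the observed data.
x_obs = np.sort(observed)
y_obs = np.arange(1, len(x_obs) + 1) / len(x_obs)


def max_cdf_gap(tau):
    """Largest vertical gap between the ECDF points and the exponential CDF."""
    return np.max(np.abs(y_obs - (1 - np.exp(-x_obs / tau))))


gaps = {
    "tau / 2": max_cdf_gap(tau_demo / 2),
    "tau": max_cdf_gap(tau_demo),
    "tau * 2": max_cdf_gap(tau_demo * 2),
}
for label, gap in gaps.items():
    print("{}: {:.3f}".format(label, gap))
```

The gap for `tau` should be clearly the smallest of the three.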

In [None]:
# Take samples for tau / 2 and tau * 2.
samples_half = rng.exponential(tau / 2, size=100000)
samples_double = rng.exponential(tau * 2, size=100000)

# Generate CDFs from these samples
x_half, y_half = ecdf(samples_half)
x_double, y_double = ecdf(samples_double)

# Plot the theoretical CDFs and the observed CDF.
plt.plot(x_e, y_e, label="expected (tau)")
plt.plot(x_half, y_half, label="expected (tau / 2)")
plt.plot(x_double, y_double, label="expected (tau * 2)")
plt.plot(x_o, y_o, marker='.', linestyle='none', label="observed")
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')
plt.legend()
plt.show()

### Linear Regression by Least Squares

This course uses `np.polyfit()` to estimate the intercept and slope of a polynomial of degree 1 for fitting a line to the data. See https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html.

```Python
slope, intercept = np.polyfit(x, y, 1)
```

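NumPy recommends `np.polynomial.Polynomial.fit()` for new code; see https://numpy.org/doc/stable/reference/routines.polynomials.html. Note that the new method behaves differently: it returns a `Polynomial` object fitted in a scaled domain, and after `.convert()` the coefficients are ordered from lowest degree to highest, the reverse of `np.polyfit()`.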

```Python
intercept, slope = np.polynomial.Polynomial.fit(x, y, 1).convert().coef
```
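
As a quick check that the two calls agree, here is a minimal sketch on synthetic data; the line y = 2 + 0.05 x and the noise level are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(seed=0)

# Synthetic data scattered around the line y = 2 + 0.05 x (assumed values).
x_demo = rng.uniform(0, 100, size=50)
y_demo = 2.0 + 0.05 * x_demo + rng.normal(0, 0.5, size=50)

# np.polyfit() returns the highest-degree coefficient first.
slope_old, intercept_old = np.polyfit(x_demo, y_demo, 1)

# Polynomial.fit() returns a Polynomial object; after .convert(), .coef
# holds the coefficients in the original x units, lowest degree first.
intercept_new, slope_new = Polynomial.fit(x_demo, y_demo, 1).convert().coef

print("polyfit:        slope = {:.4f}, intercept = {:.4f}".format(slope_old, intercept_old))
print("Polynomial.fit: slope = {:.4f}, intercept = {:.4f}".format(slope_new, intercept_new))
```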

#### EDA of literacy/fertility Data (Exercise)

In [None]:
# For exploratory data analysis (EDA), plot the fertility rate as a function
# of the illiteracy rate.
plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
plt.xlabel('Percent illiterate')
plt.ylabel('Fertility')
plt.show()

# Show the Pearson correlation coefficient.
print("{:.3f}".format(pearson_r(illiteracy, fertility)))

#### Linear Regression of Fertility vs. Illiteracy (Exercise)

This code uses `np.polyfit()`.

In [None]:
# Plot the illiteracy rate versus fertility.
plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Perform a linear regression using np.polyfit().
slope, intercept = np.polyfit(illiteracy, fertility, 1)
print('intercept = {:.3f} children per woman'.format(intercept))
print('slope = {:.3f} children per woman / percent illiterate'.format(slope))

# Plot the regression line.
x = np.array([illiteracy.min(), illiteracy.max()])
y = x * slope + intercept
plt.plot(x, y)

plt.xlabel('percent illiterate')
plt.ylabel('fertility')
plt.margins(0.02)
plt.xticks(np.arange(0, 110, 10))
plt.yticks(np.arange(0, 9, 1))

plt.show()

#### Linear Regression using `Polynomial.fit()` (Extra)

This code uses `np.polynomial.Polynomial.fit().convert().coef` (which is ugly and hard to understand).
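
The reason `.convert()` is needed can be seen by inspecting the fitted object. `Polynomial.fit()` maps the x data (the `domain`) onto the `window` before fitting, so the raw coefficients apply to the scaled variable. The sketch below uses made-up data lying exactly on a line and undoes the scaling by hand with `mapparms()`.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up data lying exactly on the line y = 1 + 0.5 x.
x_demo = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
y_demo = 1.0 + 0.5 * x_demo

p = Polynomial.fit(x_demo, y_demo, 1)
print("domain:", p.domain)  # range of the x data
print("window:", p.window)  # range that the domain is mapped onto
print("scaled coefficients:", p.coef)

# mapparms() returns (off, scl) for the linear map u = off + scl * x.
off, scl = p.mapparms()
c0, c1 = p.coef

# Substitute u into y = c0 + c1 * u to recover the unscaled coefficients.
intercept_manual = c0 + c1 * off
slope_manual = c1 * scl

print("manual:   ", intercept_manual, slope_manual)
print("convert():", p.convert().coef)
```

The manual values and the output of `.convert()` should match, which shows that `.convert()` is only undoing the linear change of variable.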

In [None]:
# Plot the illiteracy rate versus fertility.
plt.plot(illiteracy, fertility, marker='.', linestyle='none')

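# Perform a linear regression using Polynomial.fit().
# The API is awkward: fit() maps the x data onto the window [-1, 1]
# before fitting, so the fitted coefficients apply to the scaled variable.
# .convert() maps them back to the original x units.
# The model is y = b0 + b1 * x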
b0, b1 = np.polynomial.Polynomial.fit(illiteracy, fertility, 1).convert().coef
print("intercept = {:.3f} children per woman".format(b0))
print("slope = {:.3f} children per woman / percent illiterate".format(b1))

# Plot the regression line.
x_reg = np.array([illiteracy.min(), illiteracy.max()])
y_reg = b0 + b1 * x_reg
plt.plot(x_reg, y_reg)

plt.xlabel('percent illiterate')
plt.ylabel('fertility')
plt.margins(0.02)
plt.xticks(np.arange(0, 110, 10))
plt.yticks(np.arange(0, 9, 1))

plt.show()

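This code sets the `domain` and `window` values when fitting, which prevents scaling of `domain` to `window`. This is a convenient way to use the polynomial fitting methods because no conversion is needed; the trade-off is that the fit loses the numerical conditioning that the scaling provides, which matters little for a straight line.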

In [None]:
# Plot the illiteracy rate versus fertility.
plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Perform a linear regression using Polynomial.fit().
# Set the domain and the window to use the domain of the observed data.
# The model is y = b0 + b1 * x
domain = [illiteracy.min(), illiteracy.max()]
b0_1, b1_1 = np.polynomial.Polynomial.fit(illiteracy, fertility, 1, domain=domain, window=domain).coef
print('intercept = {:.3f} children per woman'.format(b0_1))
print('slope = {:.3f} children per woman / percent illiterate'.format(b1_1))

# Plot the regression line.
x_reg = np.array([illiteracy.min(), illiteracy.max()])
y_reg = b0_1 + b1_1 * x_reg
plt.plot(x_reg, y_reg)

plt.xlabel('percent illiterate')
plt.ylabel('fertility')
plt.margins(0.02)
plt.xticks(np.arange(0, 110, 10))
plt.yticks(np.arange(0, 9, 1))

plt.show()

#### How Is the Slope Fit Optimal? (Exercise)

For different slopes, calculate the sum of squares of the residuals (RSS), and plot them. What is the slope for which the RSS is minimal?
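
The scan over slopes can also be written without a Python loop by broadcasting the candidate slopes against the data. This is a minimal sketch on synthetic data; the line y = 2 + 0.05 x, the noise level, and the names ending in `_demo`, `_fit`, and `_grid` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Synthetic data around y = 2 + 0.05 x (assumed values).
x_demo = rng.uniform(0, 100, size=80)
y_demo = 2.0 + 0.05 * x_demo + rng.normal(0, 0.8, size=80)

slope_fit, intercept_fit = np.polyfit(x_demo, y_demo, 1)

# Candidate slopes as a column, so each row of the residual matrix
# corresponds to one candidate slope. The intercept is held fixed.
slope_grid = np.linspace(0, 0.1, 10001)
residuals = y_demo - (slope_grid[:, np.newaxis] * x_demo + intercept_fit)
rss_grid = np.sum(residuals ** 2, axis=1)

# np.argmin() returns the index of the smallest RSS.
best_slope = slope_grid[np.argmin(rss_grid)]
print("slope at minimum RSS: {:.5f}".format(best_slope))
print("least squares slope:  {:.5f}".format(slope_fit))
```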

In [None]:
# Above, slope represents the slope of the regression line.
# Specify a range of slopes to consider: slope_vals.
# I have increased the number of slope_vals to consider in the range
# to get a better estimate of the slope_val that minimizes RSS.
slope_vals = np.linspace(0, 0.1, 200001)

# Initialize sum of square of residuals: rss.
# This is an np.ndarray with the same shape as slope_vals.
rss = np.empty_like(slope_vals)

# Compute sum of square of residuals for each value of slope_vals.
for i, slope_val in enumerate(slope_vals):
    rss[i] = np.sum((fertility - (slope_val * illiteracy + intercept)) ** 2)

# Plot the RSS as a function of slope_vals.
plt.plot(slope_vals, rss, '-')

# Plot the slope.
plt.plot([slope, slope], [100, 200])

# Customize the figure and show it.
plt.xlabel('Slope (children per woman / percent illiterate)')
plt.ylabel('Sum of square of residuals')
plt.show()

# Find the index of the minimum value in rss. Use the index to obtain the
# corresponding slope. Compare this to the slope obtained from the
# linear regression.
# np.argmin() avoids a manual loop that would overwrite the rss array.
min_rss_index = np.argmin(rss)
min_rss = rss[min_rss_index]
print("min_rss: {:.3f}".format(min_rss))
print("slope for min_rss: {:.3f}".format(slope_vals[min_rss_index]))
print("least squares slope: {:.3f}".format(slope))

#### How is the Intercept Fit Optimal? (Extra)

Show that the optimal intercept was calculated by testing different intercept values using RSS (residual sum of squares).

In [None]:
# Above, intercept represents the intercept of the regression line.
# Specify a range of intercepts to consider: intercept_vals.
intercept_vals = np.linspace(1.80, 2.00, 100001)

# Initialize sum of square of residuals: rss.
# This is an np.ndarray with the same shape as intercept_vals.
rss = np.empty_like(intercept_vals)

# Compute sum of square of residuals for each value of intercept_vals.
for i, intercept_val in enumerate(intercept_vals):
    rss[i] = np.sum((fertility - (slope * illiteracy + intercept_val)) ** 2)

# Plot the RSS as a function of intercept_vals.
plt.plot(intercept_vals, rss, '-')
plt.plot([intercept, intercept], [115, 117])
plt.xlabel('Intercept (children per woman)')
plt.ylabel('Sum of square of residuals')
plt.show()

# Find the index of the minimum value in rss. Use the index to obtain the
# corresponding intercept. Compare this to the intercept obtained from the
# linear regression.
# np.argmin() avoids a manual loop that would overwrite the rss array.
min_rss_index = np.argmin(rss)
min_rss = rss[min_rss_index]
print("min_rss: {:.3f}".format(min_rss))
print("intercept for min_rss: {:.3f}".format(intercept_vals[min_rss_index]))
print("least squares intercept: {:.3f}".format(intercept))

Note that the minimum RSS is the same in both scans (up to the grid resolution), because each scan holds the other parameter at its least squares value. We can think of the RSS as a surface over the slope-intercept plane, where we're looking for the joint parameters with the minimum RSS.
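
The idea of a joint minimum can be sketched by evaluating the RSS on a two-dimensional grid of slopes and intercepts and locating its lowest point. The data below are synthetic and the grid ranges are arbitrary assumptions; this is an illustration, not how `np.polyfit()` finds the solution.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Synthetic data around y = 2 + 0.05 x (assumed values).
x_demo = rng.uniform(0, 100, size=80)
y_demo = 2.0 + 0.05 * x_demo + rng.normal(0, 0.8, size=80)
slope_fit, intercept_fit = np.polyfit(x_demo, y_demo, 1)

# Grids centered on the least squares estimates.
slope_grid = np.linspace(slope_fit - 0.02, slope_fit + 0.02, 201)
intercept_grid = np.linspace(intercept_fit - 1.0, intercept_fit + 1.0, 201)

# rss_surface[i, j] is the RSS for slope_grid[i] and intercept_grid[j].
predicted = (
    slope_grid[:, np.newaxis, np.newaxis] * x_demo
    + intercept_grid[np.newaxis, :, np.newaxis]
)
rss_surface = np.sum((y_demo - predicted) ** 2, axis=2)

# Locate the single lowest point of the surface.
i_min, j_min = np.unravel_index(np.argmin(rss_surface), rss_surface.shape)
print("grid minimum:  slope = {:.4f}, intercept = {:.4f}".format(
    slope_grid[i_min], intercept_grid[j_min]))
print("least squares: slope = {:.4f}, intercept = {:.4f}".format(
    slope_fit, intercept_fit))
```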

### The Importance of EDA: Anscombe's Quartet

See https://en.wikipedia.org/wiki/Anscombe%27s_quartet and https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html.

#### The Importance of EDA (Exercise)

After importing and cleaning your data, the first step is exploratory data analysis, because:
- You can be protected from misinterpretation of the type demonstrated by Anscombe's quartet.
- EDA provides a good starting point for planning the rest of your analysis.
- EDA is not really any more difficult than any of the subsequent analysis, so there is no excuse for not exploring the data.

#### Linear Regression on Appropriate Anscombe Data (Exercise)

Using the data from the first member of the Anscombe Quartet, plot the data and the linear regression line.

In [None]:
# Calculate the slope and intercept using ordinary least squares
# linear regression.
x = anscombe_x[0]
y = anscombe_y[0]
anscombe_slope, anscombe_intercept = np.polyfit(x, y, 1)
print("slope: {:.4f}; intercept: {:.4f}".format(anscombe_slope, anscombe_intercept))

# Generate theoretical x and y data: x_theor, y_theor
# from the slope and intercept.
x_theor = np.array([min(x), max(x)])
y_theor = x_theor * anscombe_slope + anscombe_intercept

# Plot the Anscombe data and theoretical line
plt.plot(x, y, marker=".", linestyle="none")
plt.plot(x_theor, y_theor, "-")
plt.xlabel('x')
plt.ylabel('y')
plt.show()

#### Linear Regression on All Anscombe Data (Exercise)

In [None]:
# For each of the four Anscombe data sets, calculate and print the slope
# and intercept.
for x, y in zip(anscombe_x, anscombe_y):
    anscombe_slope2, anscombe_intercept2 = np.polyfit(x, y, 1)
    print('slope: {:.4f}; intercept: {:.4f}'.format(anscombe_slope2, anscombe_intercept2))

#### Plot Anscombe's Quartet (Extra)

Create a 2 x 2 plot containing Anscombe's Quartet. See the "Introduction to Data Analysis with Matplotlib" course for how to do this.

In [None]:
# This code uses loops to remove repetitive code.
# Initialize plotting parameters.
fig, ax = plt.subplots(2, 2)
fig.set_size_inches((12, 9))
xticks = np.arange(0, 21, 2)
yticks = np.arange(0, 15, 2)
marker = "."
linestyle = "none"
loc = "lower right"
label = "slope: {:.3f}, intercept: {:.3f}"

# row, column refer to the rows and columns in the figure.
# i is the index to the data in anscombe_x and anscombe_y.
for row in range(0, 2):
    for col in range(0, 2):
        # Fill each subplot.
        i = row * 2 + col
        
        # Calculate the slope and intercept.
        anscombe_slope3, anscombe_intercept3 = np.polyfit(anscombe_x[i], anscombe_y[i], 1)
        
        # Draw the regression lines.
        x_theor = np.array([anscombe_x[i].min(), anscombe_x[i].max()])
        y_theor = np.array(x_theor * anscombe_slope3 + anscombe_intercept3)
        ax[row, col].plot(x_theor, y_theor, label=label.format(anscombe_slope3, anscombe_intercept3))
        ax[row, col].legend(loc=loc)
        
        # Plot the data points.
        ax[row, col].plot(anscombe_x[i], anscombe_y[i], marker=marker, linestyle=linestyle)
        ax[row, col].set_xticks(xticks)
        ax[row, col].set_yticks(yticks)

plt.show()

## Bootstrap Confidence Intervals

### Generating Bootstrap Replicates

Bootstrapping is the use of resampled data to perform statistical inference. A bootstrap sample is created by sampling from the original data with replacement. A bootstrap replicate is a statistic computed from a bootstrap sample (a resampled array). This can be done many times to create a series of bootstrap replicates, which can be plotted as an ECDF.

#### Using `rng.choice()` for Bootstrap Sampling (Demonstration)

In [None]:
# Using rng.choice() to create a sample from an existing data set.
original_data = [1, 2, 3, 4, 5]
print(rng.choice(original_data, size=5))

#### Bootstrap Sampling from Michelson's Speed of Light Data (Demonstration)

In [None]:
# For comparison, print summary statistics of the original data.
print("empirical values")
print(np.mean(michelson_speed_of_light))
print(np.median(michelson_speed_of_light))
print(np.std(michelson_speed_of_light))
print()

# Create a bootstrap sample of Michelson's speed of light data.
light_bs_sample = rng.choice(michelson_speed_of_light, size=100)
# Compute bootstrap replicate statistics from the single bootstrap sample.
print("bootstrap values")
print(np.mean(light_bs_sample))
print(np.median(light_bs_sample))
print(np.std(light_bs_sample))

#### Getting the Terminology Down (Exercise)

If we have a data set with _n_ repeated measurements, a bootstrap sample is an array of length _n_ that was drawn from the original data with replacement. What is a bootstrap replicate?

It is a single value of a statistic computed from the bootstrap sample.

#### Bootstrapping by Hand (Exercise)

How many unique bootstrap samples can be created from a data set containing `[-1, 0, 1]`?

There are 3 * 3 * 3 choices, or 27 different bootstrap samples that can be obtained from the data. The largest mean would come from sample `[1, 1, 1]`, which has a mean of 1.
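
The count can be checked by enumeration. This short sketch lists every ordered way of drawing three values with replacement; the names `data_demo`, `all_samples`, and `unordered` are introduced here for illustration. The count of 27 treats samples that differ only in order as different; ignoring order leaves fewer distinct samples, which the last lines also count.

```python
import itertools

import numpy as np

data_demo = [-1, 0, 1]

# Every ordered way to draw len(data_demo) values with replacement.
all_samples = list(itertools.product(data_demo, repeat=len(data_demo)))
means = np.array([np.mean(sample) for sample in all_samples])

print("number of ordered samples:", len(all_samples))
print("largest mean:", means.max())

# Samples that differ only in the order of their values, counted once.
unordered = list(
    itertools.combinations_with_replacement(data_demo, len(data_demo))
)
print("number of samples ignoring order:", len(unordered))
```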

#### Visualizing Bootstrap Samples (Exercise)

"Notice how the bootstrap samples give an idea of how the distribution of rainfalls is spread."

In [None]:
# Plot the ECDFs of 50 bootstrap samples of the original data.
for i in range(50):
    # Generate a bootstrap sample.
    bs_sample = rng.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample.
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)

# Compute and plot ECDF from original data.
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')

# Make margins and label axes.
plt.margins(0.02)
_ = plt.xlabel('Yearly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()

### Bootstrap Confidence Intervals

#### Bootstrap Replicate Function (Demonstration)

In [None]:
# Create a bootstrap replicate from a 1-D array of data.
# func is the function to apply to the bootstrap sample that computes
# the statistic of interest (e.g., np.mean, np.median).
def bootstrap_replicate_1d(data, func):
    bs_sample = rng.choice(data, size=len(data))
    return func(bs_sample)

print(bootstrap_replicate_1d(michelson_speed_of_light, np.mean))

#### Create Bootstrap Replicates (Demonstration)

In [None]:
# Create bootstrap replicates.
iterations = 10000
light_bs_replicates = np.empty(iterations)
for i in range(iterations):
    light_bs_replicates[i] = bootstrap_replicate_1d(michelson_speed_of_light, np.mean)

#### Plot a Histogram of Bootstrap Replicates (Demonstration)

In [None]:
# Plot a histogram of the bootstrap replicates.
# Use 100 bins for 10000 samples.
# By setting density=True, the histogram approximates a probability
# density function.
bins=100
plt.hist(light_bs_replicates, bins=bins, density=True)
plt.xlabel("Mean speed of light (km/s)")
plt.ylabel("PDF")
plt.show()

#### Confidence Interval of a Statistic (Demonstration)

If we repeated the measurements over and over again, p% of the observed values of the statistic would lie within the p% confidence interval. We can use `np.percentile()` on the bootstrap replicates to calculate the boundaries of the 95% confidence interval.
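
As a sanity check on the percentile method, the sketch below compares a bootstrap 95% confidence interval of the mean with the normal-theory interval (mean plus or minus 1.96 standard errors). The data are synthetic, with an assumed location and spread, and all bootstrap samples are drawn in one call by passing a two-dimensional `size` to `rng.choice()`.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Synthetic measurements; the location and spread are assumptions.
data_demo = rng.normal(loc=299850.0, scale=80.0, size=100)

# Draw all bootstrap samples at once: one row per bootstrap sample.
n_reps = 10000
bs_samples = rng.choice(data_demo, size=(n_reps, len(data_demo)))
bs_means = bs_samples.mean(axis=1)

# Percentile bootstrap 95% confidence interval for the mean.
ci_low, ci_high = np.percentile(bs_means, [2.5, 97.5])

# Normal-theory interval for comparison: mean +/- 1.96 * SEM.
sem_demo = np.std(data_demo) / np.sqrt(len(data_demo))
theory_low = data_demo.mean() - 1.96 * sem_demo
theory_high = data_demo.mean() + 1.96 * sem_demo

print("bootstrap: {:.1f} - {:.1f}".format(ci_low, ci_high))
print("theory:    {:.1f} - {:.1f}".format(theory_low, theory_high))
```

For the mean of roughly normal data the two intervals should be close; the bootstrap version has the advantage that it works the same way for statistics with no simple formula.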

In [None]:
# Calculate the 95% confidence interval.
# The speed of light in a vacuum is 299,792.458 km/s.
# The speed of light in air is about 90 km/s (56 mi/s) slower than c.
conf_int = np.percentile(light_bs_replicates, [2.5, 97.5])
print(conf_int)

#### Generating Many Bootstrap Replicates (Exercise)

In [None]:
# Create a function for generating many bootstrap replicates.
def draw_bs_reps(data, func, size=1):
    bs_replicates = np.empty(size)
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)
    return bs_replicates

# Test the code: Draw the bootstrap replicates and plot the histogram as
# a PDF.
light_bs_replicates2 = draw_bs_reps(michelson_speed_of_light, np.mean, size=10000)
bins=100
plt.hist(light_bs_replicates2, bins=bins, density=True)
plt.xlabel("Mean speed of light (km/s)")
plt.ylabel("PDF")
plt.show()
conf_int = np.percentile(light_bs_replicates2, [2.5, 97.5])
print(conf_int)

#### Bootstrap Replicates of the Mean and SEM (Exercise)

This exercise demonstrates two ways to calculate the standard error of the mean, one from theory and one from bootstrap replicates. The two values are very close. This exercise also shows that the distribution of the bootstrap replicates (means) is well approximated by the normal distribution, as expected.

In [None]:
# Take 10,000 bootstrap replicates of the mean: rainfall_bs_replicates
rainfall_bs_replicates = draw_bs_reps(rainfall, np.mean, size=10000)

# Compute and print SEM from the standard deviation of rainfall.
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print("Computed SEM: {:.2f}".format(sem))

# Compute and print the standard deviation of bootstrap replicates.
# This is another estimate of the standard error of the mean!
bs_std = np.std(rainfall_bs_replicates)
print("Standard deviation of bootstrap replicate means: {:.2f}".format(bs_std))

# Make a histogram of the results
_ = plt.hist(rainfall_bs_replicates, bins=50, density=True)
_ = plt.xlabel('Mean annual rainfall (mm)')
_ = plt.ylabel('PDF')
plt.show()

#### Confidence Intervals of Rainfall Data (Exercise)

In [None]:
# Calculate the 95% confidence interval for mean annual rainfall.
percentile_low, percentile_high = np.percentile(rainfall_bs_replicates, [2.5, 97.5])
print("95% confidence interval: {:.1f} - {:.1f}".format(percentile_low, percentile_high))

#### Bootstrap Replicates of Other Statistics (Exercise)

We saw in the previous exercise that the bootstrap replicates of the mean are approximately normally distributed. This is not necessarily true of other statistics, but we can use hacker statistics to explore the distribution of other statistics using bootstrap replicates.

In this exercise, it appears that the distribution of the bootstrap replicates of the variance of the rainfall data is skewed to the right.

> This is not normally distributed, as it has a longer tail to the right. Note that you can also compute a confidence interval on the variance, or any other statistic, using `np.percentile()` with your bootstrap replicates.

In [None]:
# Plot a histogram of bootstrap replicates of the variance of rainfall
# in Sheffield. Convert units from square mm to square cm.
rainfall_bs_replicates2 = draw_bs_reps(rainfall, np.var, size=10000)
rainfall_bs_replicates2 = rainfall_bs_replicates2 / 100
plt.hist(rainfall_bs_replicates2, density=True, bins=100)
plt.xlabel("Variance of annual rainfall (square cm)")
plt.ylabel("PDF")
plt.show()

#### Confidence Interval on the Rate of No-Hitters (Exercise)

Generate 10,000 bootstrap replicates of the optimal parameter tau. Plot a histogram of your replicates and report a 95% confidence interval.

In [None]:
# Generate 10,000 bootstrap replicates of the optimal parameter tau. Plot a
# histogram of your replicates and report a 95% confidence interval.
nohitter_bs_replicates = draw_bs_reps(nohitter_times, np.mean, size=10000)
plt.hist(nohitter_bs_replicates, density=True, bins=100)
plt.xlabel(r"$\tau$" + " (mean games between no-hitters)")
plt.ylabel("PDF")
plt.show()
conf_interval = np.percentile(nohitter_bs_replicates, [2.5, 97.5])
print("95% confidence interval of tau: {:.1f} - {:.1f} games".format(conf_interval[0], conf_interval[1]))

### Pairs Bootstrap

Pairs bootstrap makes the fewest assumptions about the data. Each element of a bootstrap sample is the data pair for one county: the percentage of the vote for Obama and the total vote in the county in thousands of votes.

#### Linear Regression Model of 2008 Swing State Voting Data (Demonstration)

This code recreates the 2008 swing state voting plot used in the video.

In [None]:
# Plot the percent of votes for Obama as a function of total votes
# (thousands).

# Create the data inputs.
obama_percent = swing_states["dem_votes"] * 100 / swing_states["total_votes"]
total_votes_thousands = swing_states["total_votes"] / 1000

# Create a linear regression model.
slope_reg, intercept_reg = np.polyfit(total_votes_thousands, obama_percent, 1)

# Plot the regression line.
x = np.array([0, total_votes_thousands.max()])
y_reg = x * slope_reg + intercept_reg
plt.plot(x, y_reg)

# Plot the data points.
plt.plot(total_votes_thousands, obama_percent, marker='.', linestyle='none')

# Adjust the plot and show it.
plt.margins(0.02)
plt.xticks(np.arange(0, 1000, 100))
plt.yticks(np.arange(0, 110, 10))
plt.xlabel("Total votes (thousands)")
plt.ylabel("Percent of Votes for Obama")
plt.show()

# Print the parameters of the regression model.
print("slope: {:.3f}".format(slope_reg))
print("intercept: {:.3f}".format(intercept_reg))

When working with the voting data, we need to resample the data for the counties to get two values, the percentage of votes for Obama and the total votes in thousands. A bootstrap sample is obtained by drawing a random sample, with replacement, of the indices of the data and using those indices to look up both values, so that each pair stays together.

In [None]:
# Create bootstrap replicates of the linear models.
# Use obama_percent and total_votes_thousands as the sources for the bootstrap
# samples.
# I wrote all of this code before doing the exercises.
# I revised the code while watching the video.
size = 1000
indices = np.arange(len(obama_percent))
slope_reps = np.empty(size)
intercept_reps = np.empty(size)
for i in range(size):
    # Build the random sample.
    # This operation is vectorized by NumPy.
    bs_indices = rng.choice(indices, len(indices))
    # Use .iloc so that the indices are treated as positions, not labels.
    obama_percent_sample = obama_percent.iloc[bs_indices]
    total_votes_thousands_sample = total_votes_thousands.iloc[bs_indices]
    
    # Create the linear regression model and save the parameters.
    slope_rep, intercept_rep = np.polyfit(total_votes_thousands_sample, obama_percent_sample, 1)
    slope_reps[i] = slope_rep
    intercept_reps[i] = intercept_rep

# Plot the regression lines from the replicates.
for i in np.arange(len(slope_reps)):
    y_rep = x * slope_reps[i] + intercept_reps[i]
    plt.plot(x, y_rep, color="gray", alpha=0.1)

# Plot the original regression line over the top of the other lines.
# Otherwise, the original regression line is obscured.
plt.plot(x, y_reg)

# Plot the points over the regression lines.
plt.plot(total_votes_thousands, obama_percent, marker=".", linestyle="none")

# Adjust the plot and show it.
plt.margins(0.02)
plt.xticks(np.arange(0, 1000, 100))
plt.yticks(np.arange(0, 110, 10))
plt.xlabel("Total votes (thousands)")
plt.ylabel("Percent of Votes for Obama")
plt.show()
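
A small sketch of why the indices are resampled rather than the two arrays separately. The data here are hypothetical (`y_demo` is exactly twice `x_demo`), chosen so that it is easy to check whether the pairing survives the resampling.

```python
import numpy as np

rng_demo = np.random.default_rng(0)

# Hypothetical paired data in which y is exactly twice x.
x_demo = np.arange(10.0)
y_demo = 2.0 * x_demo

# Pairs bootstrap: draw the indices once and apply them to both arrays.
pair_indices = rng_demo.choice(len(x_demo), size=len(x_demo))
x_pairs = x_demo[pair_indices]
y_pairs = y_demo[pair_indices]

# Every resampled (x, y) pair is still a pair from the original data.
print(np.array_equal(y_pairs, 2.0 * x_pairs))  # True

# Resampling x and y independently does not preserve the pairing.
x_indep = rng_demo.choice(x_demo, size=len(x_demo))
y_indep = rng_demo.choice(y_demo, size=len(y_demo))
print(np.array_equal(y_indep, 2.0 * x_indep))
```

With independent resampling, the relationship between x and y is scrambled, so a regression fitted to such a sample would say nothing about the original data.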

#### Variation in the Intercept Replicates (Extra)

Zoom in on the intercept to visualize the variation in the intercept values for the bootstrap replicates.

In [None]:
# Plot the regression lines of the bootstrap replicates.
for i in np.arange(len(slope_reps)):
    y_rep = x * slope_reps[i] + intercept_reps[i]
    plt.plot(x, y_rep, color="gray", alpha=0.1)

# Plot the regression line over the replicate regression lines to make
# it more visible.
plt.plot(x, y_reg)

# Plot the points over the regression lines.
plt.plot(total_votes_thousands, obama_percent, marker=".", linestyle="none")

# Adjust the plot and show it.
plt.margins(0.02)
plt.xlim(0, 10)
plt.ylim(37, 42)
plt.xticks(np.arange(0, 11, 1))
plt.yticks(np.arange(36, 44, 1))
plt.xlabel("Total votes (thousands)")
plt.ylabel("Percent of Votes for Obama")
plt.show()

In [None]:
# Plot a histogram of the slope bootstrap replicates.
plt.hist(slope_reps, density=True, bins=30)
plt.plot([slope_reg, slope_reg], [0, 100])
plt.xlabel("Slope")
plt.ylabel("PDF")
plt.show()

# Print the 95% confidence interval for the slope.
print("regression slope: {:.3f}".format(slope_reg))
print(
    "95% confidence interval of the regression slope: {:.3f}-{:.3f}.".format(
        np.percentile(slope_reps, 2.5),
        np.percentile(slope_reps, 97.5)))

In [None]:
# Plot a histogram of the intercept bootstrap replicates.
plt.hist(intercept_reps, density=True, bins=30)
plt.plot([intercept_reg, intercept_reg], [0, 0.6])
plt.xlabel("Intercept")
plt.ylabel("PDF")
plt.show()
print("regression intercept: {:.3f}".format(intercept_reg))
print(
    "95% confidence interval of the intercept: {:.3f}-{:.3f}.".format(
        np.percentile(intercept_reps, 2.5),
        np.percentile(intercept_reps, 97.5)))

#### A Function to Do Pairs Bootstrap (Exercise)

Write a function for resampling pairs of data given the inputs `x`, `y`, and `size`.

In [None]:
def draw_bs_pairs_linreg(x, y, size=1):
    """
    Draw bootstrap linear regression pairs for b0 and b1 parameters.
    Linear regression model is y = b0 + b1 * x
    
    Arguments
    ---------
    x : array of float
        x axis data
    y : array of float
        y axis data
    size : int
        number of bootstrap replicates to generate
    
    Returns
    -------
    b0_replicates : array of float
        bootstrap replicates of linear regression b0 parameter (intercept)
    b1_replicates : array of float
        bootstrap replicates of linear regression b1 parameter (slope)
    
    Example
    -------
    # Generate the bootstrap replicates of the b0 and b1 parameters.
    b0_replicates, b1_replicates = draw_bs_pairs_linreg(x, y, size=1000)
    # Plot the linear regression lines of the bootstrap replicates.
    lin_reg_domain = np.array([x.min(), x.max()])
    for i in range(len(b0_replicates)):
        lin_reg_range = b0_replicates[i] + b1_replicates[i] * lin_reg_domain
        plt.plot(lin_reg_domain, lin_reg_range, color="gray", alpha=0.1)
    """
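    # Convert to NumPy arrays so that the indexing below is positional, even
    # if pandas Series are passed in.
    x = np.asarray(x)
    y = np.asarray(y)
    indices = np.arange(len(x))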
    domain = [x.min(), x.max()]
    b0_replicates = np.empty(size)
    b1_replicates = np.empty(size)
    for i in range(size):
        # Build the bootstrap samples from pairs of x and y values.
        bootstrap_indices = rng.choice(indices, size=len(indices))
        x_bootstrap_sample = x[bootstrap_indices]
        y_bootstrap_sample = y[bootstrap_indices]

        # Create the linear regression model and save the parameters.
        b0, b1 = \
            np.polynomial.Polynomial.fit(
                x_bootstrap_sample,
                y_bootstrap_sample,
                1,
                domain=domain,
                window=domain)
        b0_replicates[i] = b0
        b1_replicates[i] = b1
    return b0_replicates, b1_replicates
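
The `domain` and `window` arguments matter in the function above. The sketch below uses made-up data on an exact line to show what `Polynomial.fit()` returns without them. As I understand the NumPy documentation, the default fit is done on x values rescaled to the window [-1, 1], so the raw coefficients are not the intercept and slope in the original units. Calling `convert()` is one alternative to setting `domain` and `window`.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2 + 3x.
x_line = np.arange(10.0)
y_line = 2.0 + 3.0 * x_line

# Default call: the coefficients refer to x rescaled from [0, 9] to [-1, 1].
scaled_fit = np.polynomial.Polynomial.fit(x_line, y_line, 1)
print(scaled_fit.coef)

# convert() maps the polynomial back to the original x units.
# Coefficients are ordered lowest degree first: [intercept, slope].
intercept_line, slope_line = scaled_fit.convert().coef
print(intercept_line, slope_line)

# np.polyfit() works in the original units but orders the coefficients
# highest degree first: [slope, intercept].
slope_pf, intercept_pf = np.polyfit(x_line, y_line, 1)
print(slope_pf, intercept_pf)
```

The two orderings are easy to mix up, which is why the function's return values are named `b0` (intercept) and `b1` (slope) rather than relying on position alone.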

#### Test the Function to Do Pairs Bootstrap (Extra)

In [None]:
# Generate the bootstrap replicates.
b0_replicates, b1_replicates = \
    draw_bs_pairs_linreg(total_votes_thousands, obama_percent, size=1000)

# Plot the regression lines from the replicates.
# The lines are plotted using two points, from the minimum x value to
# the maximum x value stored in lin_reg_domain.
lin_reg_domain = np.array([total_votes_thousands.min(), total_votes_thousands.max()])
for i in np.arange(len(b0_replicates)):
    lin_reg_range = b0_replicates[i] + b1_replicates[i] * lin_reg_domain
    plt.plot(lin_reg_domain, lin_reg_range, color="gray", alpha=0.1)

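# Recreate the original regression line and plot it over the other lines.
# Otherwise, the original regression line is obscured.
# Pass lin_reg_domain as both domain and window so that the coefficients are
# in the original units of the data.
b0, b1 = np.polynomial.Polynomial.fit(
    total_votes_thousands, obama_percent, 1,
    domain=lin_reg_domain, window=lin_reg_domain)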
lin_reg_range = b0 + b1 * lin_reg_domain
plt.plot(lin_reg_domain, lin_reg_range)

# Plot the points over the regression lines.
plt.plot(total_votes_thousands, obama_percent, marker=".", linestyle="none")

# Adjust the plot and show it.
plt.margins(0.02)
plt.xticks(np.arange(0, 1000, 100))
plt.yticks(np.arange(0, 110, 10))
plt.xlabel("Total votes (thousands)")
plt.ylabel("Percent of Votes for Obama")
plt.show()

#### Pairs Bootstrap of Literacy/Fertility Data (Exercise)

In [None]:
# Generate replicates of slope and intercept using pairs bootstrap
bs_intercept_reps, bs_slope_reps = draw_bs_pairs_linreg(illiteracy, fertility, size=1000)

# Compute and print 95% CI for slope.
percentiles = np.percentile(bs_slope_reps, [2.5, 97.5])
print("95% confidence interval: {:.4f} - {:.4f}".format(percentiles[0], percentiles[1]))

# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50, density=True)
_ = plt.xlabel('slope')
_ = plt.ylabel('PDF')
plt.show()

#### Plotting Bootstrap Regressions (Exercise)

In [None]:
# Plot the first 100 bootstrap replicate regression lines.

# Create an array of two x values for drawing bootstrap regression lines.
x = np.array([illiteracy.min(), illiteracy.max()])

# Plot the bootstrap lines.
for i in range(100):
    plt.plot(
        x, 
        bs_intercept_reps[i] + bs_slope_reps[i] * x,
        linewidth=0.5,
        alpha=0.2,
        color='red')

# Plot the data points as a scatter plot.    
plt.plot(illiteracy, fertility, marker=".", linestyle="none")

# Customize the plot and show it.
plt.xlabel('Illiteracy')
plt.ylabel('Fertility')
plt.xticks(np.arange(0, 110, 20))
plt.yticks(np.arange(0, 9))
plt.show()

## Introduction to Hypothesis Testing

### Formulating and Simulating a Hypothesis

Ohio and Pennsylvania are similar states: They are located in the same region of the United States, they have liberal urban populations, and they have conservative rural populations. Justin's example hypothesis is that "county-level voting in these two states have identical probability distributions." In fact, when we plotted the ECDFs for these two states in part 1 of this course, they were very similar.

The hypothesis we are testing is the null hypothesis, which in this case is that there is no difference in county-level voting between Ohio and Pennsylvania.

Plotting the ECDFs and examining the summary statistics show that the two states are similar, but is there a significant difference? We can't tell yet.

These are my notes, not mentioned in the class at this point. No alternative hypothesis is proposed. One alternative hypothesis is that the share of votes for Democrats in Pennsylvania is greater than in Ohio; this is a one-tail test. Another alternative hypothesis is that the share of votes for Democrats in Pennsylvania is different (greater than or less than) from Ohio; this is a two-tail test.

#### Comparing County-Level Voting in Ohio and Pennsylvania (Demonstration)

Plotting the ECDFs of the Democratic share of the vote in individual counties shows that voting patterns in Ohio and Pennsylvania are very similar. Likewise, the summary statistics for the two states are very similar.

In [None]:
# Plot the ECDFs.
ohio_votes = swing_states[swing_states["state"] == "OH"]
x_oh, y_oh = ecdf(ohio_votes["dem_share"])
penn_votes = swing_states[swing_states["state"] == "PA"]
x_pa, y_pa = ecdf(penn_votes["dem_share"])
plt.plot(x_oh, y_oh, marker=".", linestyle="none", label="Ohio")
plt.plot(x_pa, y_pa, marker=".", linestyle="none", label="Pennsylvania")
plt.xlabel("Democratic share of votes in county (%)")
plt.ylabel("ECDF")
plt.xticks(np.arange(0, 110, 10))
plt.legend()
plt.show()

# Print summary statistics.
print("OH mean: {:.1f}%".format(np.mean(x_oh)))
print("PA mean: {:.1f}%".format(np.mean(x_pa)))
print("PA mean - OH mean: {:.1f}%".format(np.mean(x_pa) - np.mean(x_oh)))
print("OH median: {:.1f}%".format(np.median(x_oh)))
print("PA median: {:.1f}%".format(np.median(x_pa)))
print("PA median - OH median: {:.1f}%".format(np.median(x_pa) - np.median(x_oh)))
print("OH standard deviation: {:.1f}%".format(x_oh.std()))
print("PA standard deviation: {:.1f}%".format(x_pa.std()))
print("PA stddev - OH stddev: {:.1f}%".format(x_pa.std() - x_oh.std()))

#### Generating a Permutation Sample (Demonstration)

To test if the distributions are different, we combine the data sets (here for Ohio and Pennsylvania), permute the combined data set, and split it into two new data sets (that represent Ohio and Pennsylvania counties). We can then compute the means of these two data sets. Repeating this for many permutation samples shows how the means, and the difference between them, vary when the null hypothesis is true, and we compare that variation with the difference observed in the original data.

The code below shows how to create the permutation samples and plot their ECDFs.

In [None]:
# Generate a permutation sample.
dem_share_both = np.concatenate((x_pa, x_oh))
dem_share_perm = rng.permutation(dem_share_both)
perm_sample_pa = dem_share_perm[:len(x_pa)]
perm_sample_oh = dem_share_perm[len(x_pa):]

# Plot the permuted data.
x_pa_p, y_pa_p = ecdf(perm_sample_pa)
x_oh_p, y_oh_p = ecdf(perm_sample_oh)
plt.plot(x_oh_p, y_oh_p, marker=".", linestyle="none", label="Ohio")
plt.plot(x_pa_p, y_pa_p, marker=".", linestyle="none", label="Pennsylvania")
plt.xlabel("Permuted Democratic share of votes in county (%)")
plt.ylabel("ECDF")
plt.xticks(np.arange(0, 110, 10))
plt.legend()
plt.show()

#### Generating a Permutation Sample (Exercise)

Write a function that combines and permutes two data sets.

In [None]:
def permutation_sample(array1, array2):
    """Generate a permutation sample from two data sets."""
    # Combine the data, permute it, and return slices having sizes
    # of the original data.
    combined = np.concatenate((array1, array2))
    permuted = rng.permutation(combined)
    perm_sample_1 = permuted[:len(array1)]
    perm_sample_2 = permuted[len(array1):]
    return perm_sample_1, perm_sample_2
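
A quick check of what a permutation sample preserves. This sketch repeats the same combine, shuffle, and split steps on two small hypothetical arrays (`first` and `second`) so that it runs on its own.

```python
import numpy as np

rng_demo = np.random.default_rng(1)

# Hypothetical data sets of unequal size.
first = np.array([1.0, 2.0, 3.0])
second = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Same steps as permutation_sample(): combine, shuffle, split.
combined_demo = np.concatenate((first, second))
permuted_demo = rng_demo.permutation(combined_demo)
perm_first = permuted_demo[:len(first)]
perm_second = permuted_demo[len(first):]

# The sizes match the original data sets.
print(len(perm_first), len(perm_second))  # 3 5

# Together, the two samples contain exactly the original values.
recombined = np.sort(np.concatenate((perm_first, perm_second)))
print(np.array_equal(recombined, np.sort(combined_demo)))  # True
```

Unlike a bootstrap sample, no value is duplicated or dropped: only the assignment of values to the two groups changes.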

#### Visualize Permutation Sampling (Exercise)

The following exercises use the Sheffield rainfall data for June and November, which we will see are different.

> Notice that the permutation samples ECDFs overlap and give a purple haze. None of the ECDFs from the permutation samples overlap with the observed data, suggesting that the [null] hypothesis is not commensurate with the data. June and November rainfall are not identically distributed.

In [None]:
# Combine rain_june and rain_november from the Sheffield weather station and
# permute the samples. Plot the ECDFs of the original data and of 50 permuted
# samples.
# This uses the rain_june and rain_november variables, which were initialized
# in the Data Sets section near the top of this page.
for i in range(50):
    # Permute the samples and plot their ECDFs.
    rain_june_perm, rain_november_perm = permutation_sample(rain_june, rain_november)
    x_jun_p, y_jun_p = ecdf(rain_june_perm)
    x_nov_p, y_nov_p = ecdf(rain_november_perm)
    plt.plot(x_jun_p, y_jun_p, marker='.', linestyle='none',
                 color='red', alpha=0.1)
    plt.plot(x_nov_p, y_nov_p, marker='.', linestyle='none',
                 color='blue', alpha=0.1)

# Create and plot ECDFs from original data.
x_jun, y_jun = ecdf(rain_june)
x_nov, y_nov = ecdf(rain_november)
plt.plot(x_jun, y_jun, marker='.', linestyle='none', color='red', label="June")
plt.plot(x_nov, y_nov, marker='.', linestyle='none', color='blue', label="November")

# Customize the plot and show it.
plt.margins(0.02)
plt.xlabel('Monthly rainfall (mm)')
plt.ylabel('ECDF')
plt.legend()
plt.show()

### Test Statistics and p-Values

> A test statistic is a single number that can be computed from observed data and also from data you simulate under the null hypothesis. It serves as a basis of comparison between what the hypothesis predicts and what we actually observed. Importantly, you should choose your test statistic to be something that is pertinent to the question you are trying to answer with your hypothesis test, in this case, are the two states different? If they are identical, they should have the same mean vote share for Obama. So the difference in mean vote share should be zero. We will therefore choose the difference in means as our test statistic.

#### Test Statistic for Voting Data (Demonstration)

The difference in means for the permutation samples is called a permutation replicate (of the difference of the means).

In [None]:
# Compute the difference of the means of the permutation samples.
print("permuted OH mean: {:.1f}%".format(x_oh_p.mean()))
print("permuted PA mean: {:.1f}%".format(x_pa_p.mean()))
print("permuted PA mean - OH mean: {:.1f}%".format(x_pa_p.mean() - x_oh_p.mean()))

#### Mean Vote Difference under the Null Hypothesis (Demonstration)

Create 10,000 permutation replicates and plot them as a histogram. Compute the p-value, the probability that the difference in means is greater than or equal to the observed value given that the null hypothesis is true (that there is no difference in the means).

In [None]:
# Create the permutation replicates of the difference of the means for
# Pennsylvania and Ohio and plot them as a histogram.
size = 10000
diff_means_perm_pa_oh_array = np.empty(size)
for i in range(size):
    perm_pa_votes, perm_oh_votes = \
        permutation_sample(penn_votes["dem_share"], ohio_votes["dem_share"])
    diff_means_perm_pa_oh = perm_pa_votes.mean() - perm_oh_votes.mean()
    diff_means_perm_pa_oh_array[i] = diff_means_perm_pa_oh

# Plot the permutation replicates of the difference of means.
plt.hist(diff_means_perm_pa_oh_array, bins=100, density=True)
# Plot a vertical line showing the difference of means for the original data.
diff_obs_means_pa_oh = x_pa.mean() - x_oh.mean()
plt.plot([diff_obs_means_pa_oh, diff_obs_means_pa_oh], [0, 0.30], color="red")
plt.xlabel("PA - OH mean percent vote difference")
plt.ylabel("PDF")
plt.show()

# Print the p-value for a one-tail test.
p_value_pa_oh_1_tail = 1 - (diff_means_perm_pa_oh_array < diff_obs_means_pa_oh).mean()
print("p-value = {:.3f}".format(p_value_pa_oh_1_tail))

# Use scipy.stats.percentileofscore() to calculate the p-value.
p_value_pa_oh_1_tail_scipy = 1 - scipy.stats.percentileofscore(diff_means_perm_pa_oh_array, diff_obs_means_pa_oh) / 100
print("scipy p-value = {:.3f}".format(p_value_pa_oh_1_tail_scipy))

#### Compare Ohio with Florida (Extra)

The voting pattern in Florida appears to be different from Ohio; test this. The null hypothesis is that the means are not different. The alternative hypothesis is that the mean for Florida is less than the mean for Ohio. We reject the null hypothesis.

In [None]:
# Compare county-level voting in Ohio to Florida.
flor_votes = swing_states[swing_states["state"] == "FL"]
x_fl, y_fl = ecdf(flor_votes["dem_share"])
plt.plot(x_oh, y_oh, marker=".", linestyle="none", label="Ohio")
plt.plot(x_fl, y_fl, marker=".", linestyle="none", label="Florida")
plt.xlabel("Democratic share of votes in county (%)")
plt.ylabel("ECDF")
plt.xticks(np.arange(0, 110, 10))
plt.legend()
plt.show()

# Create the permutation replicates of the difference of the means for
# Florida and Ohio and plot them as a histogram.
size = 10000
diff_means_perm_oh_fl_array = np.empty(size)
for i in range(size):
    perm_oh_votes, perm_fl_votes = \
        permutation_sample(ohio_votes["dem_share"], flor_votes["dem_share"])
    diff_means_perm_oh_fl = perm_oh_votes.mean() - perm_fl_votes.mean()
    diff_means_perm_oh_fl_array[i] = diff_means_perm_oh_fl

# Plot the permutation replicates of the difference of means.
plt.hist(diff_means_perm_oh_fl_array, bins=100, density=True)
# Plot a vertical line showing the difference of means for the original data.
diff_obs_means_oh_fl = x_oh.mean() - x_fl.mean()
plt.plot([diff_obs_means_oh_fl, diff_obs_means_oh_fl], [0, 0.30], color="red")
plt.xlabel("OH - FL mean percent vote difference")
plt.ylabel("PDF")
plt.show()

# Print the p-value for a one-tail test.
p_value_oh_fl_1_tail = 1 - (diff_means_perm_oh_fl_array < diff_obs_means_oh_fl).mean()
print("p-value = {:.4f}".format(p_value_oh_fl_1_tail))

# Use scipy.stats.percentileofscore() to calculate the p-value.
p_value_oh_fl_1_tail_scipy = 1 - scipy.stats.percentileofscore(diff_means_perm_oh_fl_array, diff_obs_means_oh_fl) / 100
print("scipy p-value = {:.4f}".format(p_value_oh_fl_1_tail_scipy))

#### Two-Tail Tests of the Difference of the Means (Extra)

This is my work. For the two-tail test, we can test whether the absolute value of the difference of the means is different from 0. The null hypothesis is that the difference is 0; the alternative hypothesis is that the absolute value of the difference is greater than zero.

For Pennsylvania and Ohio, we do not reject the null hypothesis.

In [None]:
# For PA and OH.
abs_diff_means_perm_pa_oh_array = np.abs(diff_means_perm_pa_oh_array)

# Plot the permutation replicates of the absolute values of the difference
# of means.
plt.hist(abs_diff_means_perm_pa_oh_array, bins=100, density=True)
# Plot a vertical line showing the absolute difference of means for the
# original data.
abs_diff_obs_means_pa_oh = np.abs(x_pa.mean() - x_oh.mean())
plt.plot([abs_diff_obs_means_pa_oh, abs_diff_obs_means_pa_oh], [0, 0.60], color="red")
plt.xlabel("Absolute PA - OH mean percent vote difference")
plt.ylabel("PDF")
plt.show()

# Print the p-value for a two-tail test.
p_value_pa_oh_2_tail = 1 - (abs_diff_means_perm_pa_oh_array < abs_diff_obs_means_pa_oh).mean()
print("p-value = {:.4f}".format(p_value_pa_oh_2_tail))

# Use scipy.stats.percentileofscore() to calculate the p-value.
p_value_pa_oh_2_tail_scipy = 1 - (scipy.stats.percentileofscore(abs_diff_means_perm_pa_oh_array, abs_diff_obs_means_pa_oh) / 100)
print("scipy p-value = {:.4f}".format(p_value_pa_oh_2_tail_scipy))

For Ohio and Florida, for a two-tail test, we do not reject the null hypothesis.

In [None]:
# For OH and FL.
abs_diff_means_perm_oh_fl_array = np.abs(diff_means_perm_oh_fl_array)

# Plot the permutation replicates of the absolute values of the difference
# of means.
plt.hist(abs_diff_means_perm_oh_fl_array, bins=100, density=True)
# Plot a vertical line showing the absolute difference of means for the
# original data.
abs_obs_diff_means_oh_fl = np.abs(x_oh.mean() - x_fl.mean())
plt.plot([abs_obs_diff_means_oh_fl, abs_obs_diff_means_oh_fl], [0, 0.60], color="red")
plt.xlabel("Absolute OH - FL mean percent vote difference")
plt.ylabel("PDF")
plt.show()

# Print the p-value for a two-tail test.
p_value_oh_fl_2_tail = 1 - (abs_diff_means_perm_oh_fl_array < abs_obs_diff_means_oh_fl).mean()
print("p-value = {:.4f}".format(p_value_oh_fl_2_tail))

# Use scipy.stats.percentileofscore() to calculate the p-value.
p_value_oh_fl_2_tail_scipy = 1 - (scipy.stats.percentileofscore(abs_diff_means_perm_oh_fl_array, abs_obs_diff_means_oh_fl) / 100)
print("scipy p-value = {:.4f}".format(p_value_oh_fl_2_tail_scipy))

#### Test Statistics (Exercise)

When performing hypothesis tests, your choice of test statistic should be pertinent to the question you are seeking to answer in your hypothesis test.

#### What Is a p-Value? (Exercise)

The p-value is generally a measure of the probability of observing a test statistic equally or more extreme than the one you observed, given that the null hypothesis is true.
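
A minimal sketch of how "equally or more extreme" can be turned into code for a difference of means. The two groups are hypothetical synthetic data, and reading "extreme" as "at least as large" (one tail) or "at least as far from zero" (two tails) is my interpretation, not something stated in the exercise.

```python
import numpy as np

rng_demo = np.random.default_rng(2)

# Hypothetical samples; group_b is drawn with a larger mean.
group_a = rng_demo.normal(loc=0.0, scale=1.0, size=30)
group_b = rng_demo.normal(loc=0.5, scale=1.0, size=30)
observed_diff = group_b.mean() - group_a.mean()

# Permutation replicates of the difference of means under the null
# hypothesis that both groups share one distribution.
pooled = np.concatenate((group_a, group_b))
n_a = len(group_a)
diff_replicates = np.empty(5000)
for i in range(len(diff_replicates)):
    shuffled = rng_demo.permutation(pooled)
    diff_replicates[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# One tail: replicates at least as large as the observed difference.
p_one_tail = np.mean(diff_replicates >= observed_diff)

# Two tails: replicates at least as far from zero, in either direction.
p_two_tail = np.mean(np.abs(diff_replicates) >= np.abs(observed_diff))

print("one-tail p-value = {:.4f}".format(p_one_tail))
print("two-tail p-value = {:.4f}".format(p_two_tail))
```

When the observed difference is positive, the two-tail p-value counts every replicate that the one-tail p-value counts plus the large negative ones, so it can never be smaller.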

#### Generating Permutation Replicates (Exercise)

Create a function for generating permutation replicates.

In [None]:
def draw_perm_reps(array1, array2, func, size=1):
    """Generate multiple permutation replicates."""
    perm_replicates = np.empty(size)
    for i in range(size):
        # Generate permutation sample
        perm_sample1, perm_sample2 = permutation_sample(array1, array2)
        # Compute the test statistic
        perm_replicates[i] = func(perm_sample1, perm_sample2)
    return perm_replicates

#### Analyze Voting in Arbitrary Pairs of States (Extra)

Use the utility functions `permutation_sample()` and `draw_perm_reps()` to repeat the analysis of voting in Pennsylvania and Ohio and other pairs of states.

In [None]:
# Define a function that calculates the difference of means.
def difference_of_means(array1, array2):
    diff_means = array1.mean() - array2.mean()
    return diff_means

# Define a function that plots the ECDFs of the two states and
# creates and plots difference of means replicates from permuted
# samples.
def plot_votes(df, state1, state2, size):
    # Plot the ECDFs of the Democratic share of the vote for the two states.
    state1_dem_share = df[df["state"] == state1]["dem_share"].to_numpy()
    state2_dem_share = df[df["state"] == state2]["dem_share"].to_numpy()
    x_state1, y_state1 = ecdf(state1_dem_share)
    x_state2, y_state2 = ecdf(state2_dem_share)
    plt.plot(x_state1, y_state1, marker=".", linestyle="none", label=state1)
    plt.plot(x_state2, y_state2, marker=".", linestyle="none", label=state2)
    plt.xlabel("Democratic share of votes in county (%)")
    plt.ylabel("ECDF")
    plt.xticks(np.arange(0, 110, 10))
    plt.legend()
    plt.show()

    # Calculate the observed difference of means and the difference of means
    # of the permuted sample replicates.
    obs_diff_means = difference_of_means(state1_dem_share, state2_dem_share)
    diff_means_reps = draw_perm_reps(state1_dem_share, state2_dem_share, difference_of_means, size=size)
    abs_diff_means_reps = np.abs(diff_means_reps)

    # Plot a histogram.
    y, x, _ = plt.hist(diff_means_reps, density=True, bins=100)
    # Draw a vertical line for the observed difference of means.
    plt.vlines(obs_diff_means, 0, max(y), colors=["red"])
    # Customize and show the plot.
    plt.ylabel("PDF")
    plt.xlabel(state1 + " - " + state2 + " mean percent vote difference")
    plt.show()
    
    # Print the p-value for a two-tail test.
    p_value_2_tail = 1 - (abs_diff_means_reps < np.abs(obs_diff_means)).mean()
    print("p-value (2-tail) = {:.4f}".format(p_value_2_tail))

##### Pennsylvania and Ohio

In [None]:
plot_votes(all_states, "PA", "OH", 10000)

##### Ohio and Florida

In [None]:
plot_votes(all_states, "OH", "FL", 10000)

##### Washington and Oregon

In [None]:
plot_votes(all_states, "WA", "OR", 10000)

##### Massachusetts and Washington

In [None]:
plot_votes(all_states, "MA", "WA", 10000)

##### Massachusetts and Alabama

In [None]:
plot_votes(all_states, "MA", "AL", 10000)

##### Hawaii and Utah

These are the states with the most extreme voting patterns. It appears that Hawaii has only four counties; this distorts the PDF plot.

In [None]:
plot_votes(all_states, "HI", "UT", 10000) # The most extreme states.

#### EDA before Hypothesis Testing (Exercise)

> Kleinteich and Gorb (Sci. Rep., 4, 5225, 2014) performed an interesting experiment with South American horned frogs. They held a plate connected to a force transducer, along with a bait fly, in front of them. They then measured the impact force and adhesive force of the frog's tongue when it struck the target.

> Frog A is an adult and Frog B is a juvenile. The researchers measured the impact force of 20 strikes for each frog. In the next exercise, we will test the hypothesis that the two frogs have the same distribution of impact forces.

For the data used in the exercise, ID = "A" corresponds to ID = "II" in the original data file, and ID = "B" corresponds to ID = "IV". The impact_force values in the original file were divided by 1000 to convert the force units from mN to N.

> Eyeballing it, it does not look like they come from the same distribution. Frog A, the adult, has three or four very hard strikes, and Frog B, the juvenile, has a couple weak ones. However, it is possible that with only 20 samples it might be too difficult to tell if they have difference distributions, so we should proceed with the hypothesis test.

In [None]:
# Create a seaborn beeswarm plot of the original data.
_ = sns.swarmplot(x="ID", y="impact force (mN)", data=frog_tongue)
# Labels are provided using the column headings by default.
plt.xlabel("Frog")
plt.ylabel("Impact force (mN)")
plt.show()

#### Permutation Tests on Frog Data (Exercise)

> The average strike force of Frog A was 0.71 Newtons (N), and that of Frog B was 0.42 N for a difference of 0.29 N. It is possible the frogs strike with the same force and this observed difference was by chance. You will compute the probability of getting at least a 0.29 N difference in mean strike force under the hypothesis that _the distributions of strike forces for the two frogs are identical_. We use a permutation test with a test statistic of the difference of means to test this hypothesis.

> The p-value tells you that there is about a 0.6% chance that you would get the difference of means observed in the experiment if frogs were exactly the same. A p-value below 0.01 is typically said to be "statistically significant," but: warning! warning! warning! You have computed a p-value; it is a number. I encourage you not to distill it to a yes-or-no phrase. p = 0.006 and p = 0.000000006 are both said to be "statistically significant," but they are definitely not the same!

We reject the null hypothesis that the means are equal.

In [None]:
# Use sample permutations to test whether there is a significant difference
# in means. The null hypothesis is that there is no difference in means.
# The alternative hypothesis is that the impact force for frog "II" is
# greater than the impact force for frog "IV".

# Prepare the data.
force_a = frog_tongue[frog_tongue["ID"] == "II"]["impact force (mN)"].to_numpy() / 1000
force_b = frog_tongue[frog_tongue["ID"] == "IV"]["impact force (mN)"].to_numpy() / 1000

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    diff = np.mean(data_1) - np.mean(data_2)
    return diff

# Compute difference of mean impact force from experiment.
empirical_diff_means = diff_of_means(force_a, force_b)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b, diff_of_means, size=10000)
# Compute p-value: p
p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates)
print("p-value = {:.4f}".format(p))

### Bootstrap Hypothesis Tests

Pipeline for hypothesis testing:
- First, clearly state the null hypothesis. Stating the null hypothesis so that it is crystal clear is essential to be able to simulate it. 
- Next, define your test statistic.
- Then generate many sets of simulated data assuming the null hypothesis is true.
- Compute the test statistic for each simulated data set.
- The p-value is then the fraction of your simulated data sets for which the test statistic is at least as extreme as for the real data. 
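
The pipeline above can be sketched as one generic function. Everything in this cell is hypothetical: the name `simulated_p_value()`, its arguments, and the coin-flip example are made up for illustration, and "at least as extreme" is taken to mean "at least as large", which suits a one-tail test only.

```python
import numpy as np


def simulated_p_value(simulate_null, test_statistic, observed_data, size, rng):
    """Fraction of null simulations whose statistic is >= the observed one."""
    observed_stat = test_statistic(observed_data)
    replicates = np.empty(size)
    for i in range(size):
        # Simulate one data set under the null hypothesis and summarize it.
        replicates[i] = test_statistic(simulate_null(rng))
    return np.mean(replicates >= observed_stat)


# Null hypothesis: the coin is fair.
# Test statistic: the number of heads in 20 flips.
rng_demo = np.random.default_rng(3)
observed_flips = np.array([1] * 15 + [0] * 5)  # 15 heads, 5 tails


def flip_fair_coin(rng):
    """Simulate 20 flips of a fair coin (1 = heads, 0 = tails)."""
    return rng.integers(0, 2, size=20)


p_coin = simulated_p_value(flip_fair_coin, np.sum, observed_flips, 10000, rng_demo)
print("p-value = {:.4f}".format(p_coin))
```

Permutation tests and bootstrap tests fit the same shape: they differ only in how `simulate_null` generates data that obey the null hypothesis.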

#### Speed of Light (Demonstration)

Michelson measured the speed of light (see the earlier work in this course and in Part 1), and we have his individual measurements and their mean, 299,852.4 km/s. Newcomb also measured the speed of light, but we have only his mean value, 299,860 km/s. Are these significantly different? Our null hypothesis is that the true mean of Michelson's measurements equals Newcomb's value (no difference in the means). We can't do a difference of means test using permuted samples because we don't have Newcomb's individual measurements.

I think what you should do here is take bootstrap samples with replacement of Michelson's data and compute the means of the sample replicates. Then compute the p-value for the mean being Newcomb's value. The null hypothesis is that Newcomb's value could be obtained from Michelson's data.

We do not reject the null hypothesis.

In [None]:
# Reuse light_bs_replicates, which contains the 10,000 bootstrap replicate
# means from Michelson's data, as calculated above.
newcomb_mean = 299860
light_p_value = np.mean(light_bs_replicates >= newcomb_mean)
print("p-value = {:.4f}".format(light_p_value))

This is not what Justin does. He shifts Michelson's data by subtracting the Michelson mean and adding the Newcomb mean to each observation. Justin then uses bootstrapping to create bootstrap sample replicates of the mean from data where the mean is Newcomb's value for the speed of light. He uses the shifted data to simulate the null hypothesis, which is that the speed of light from Michelson's data is actually Newcomb's value. He calculates the p-value as the probability, given that the null hypothesis is true, of a difference of means at least as extreme as the observed one (that is, -7.6 km/s or less).

The p-value obtained this way is similar to the p-value I calculated above.
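
To make the shifting approach concrete without relying on helper functions defined elsewhere in this notebook, here is a minimal self-contained sketch. The function name `one_sample_shift_test` and the synthetic data are my own assumptions for illustration; they are not Michelson's measurements or the course's code.

```python
import numpy as np


def one_sample_shift_test(data, null_mean, n_reps=10_000, seed=None):
    """One-sided bootstrap test of the mean against a known value.

    Returns the fraction of bootstrap means, drawn from data shifted to
    have mean null_mean, that are at or below the observed mean.
    """
    rng_local = np.random.default_rng(seed)
    observed = np.mean(data)
    # Shift the data so that its mean equals the null value.
    shifted = data - observed + null_mean
    # Each row is one bootstrap sample drawn with replacement.
    samples = rng_local.choice(shifted, size=(n_reps, len(shifted)), replace=True)
    reps = samples.mean(axis=1)
    return np.mean(reps <= observed)


# Synthetic stand-in data (arbitrary location and spread, for illustration).
rng_synth = np.random.default_rng(0)
synth_speeds = rng_synth.normal(loc=299_852.4, scale=79.0, size=100)
p_synth = one_sample_shift_test(synth_speeds, null_mean=299_860, seed=1)
print("p-value = {:.4f}".format(p_synth))
```

Comparing the replicate means with the observed mean is equivalent to comparing differences from the null mean with the observed difference, because both sides have the same constant subtracted.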

In [None]:
# Shift the Michelson data.
michelson_mean = np.mean(michelson_speed_of_light)
michelson_shifted = michelson_speed_of_light - michelson_mean + newcomb_mean
print("newcomb_mean:", newcomb_mean)
print("michelson_mean:", michelson_mean)
print("michelson_shifted_mean:", np.mean(michelson_shifted))

In [None]:
# Plot the ECDFs of michelson_speed_of_light and michelson_shifted.
# Note the shifting is not as big as it is in the course's figure.
x_m, y_m = ecdf(michelson_speed_of_light)
x_ms, y_ms = ecdf(michelson_shifted)
plt.plot(x_m, y_m, marker=".", linestyle="none", label="Michelson")
plt.plot(x_ms, y_ms, marker=".", linestyle="none", label="Michelson shifted")
plt.xlabel("Speed of light (km/s)")
plt.ylabel("ECDF")
plt.legend()
plt.show()

In [None]:
# Create bootstrap replicates of the difference of the mean of the bootstrap
# sample obtained from the Michelson shifted data and Newcomb's mean.
def diff_from_newcomb(data, newcomb_mean=299860):
    return np.mean(data) - newcomb_mean

# The difference here should be -7.6 km/s.
diff_observed = diff_from_newcomb(michelson_speed_of_light)
print("diff_observed: {:.2f} km/s".format(diff_observed))

# Draw bootstrap replicates of the difference of the means from Newcomb's
# mean.
shifted_bs_replicates = draw_bs_reps(michelson_shifted, diff_from_newcomb, size=10000)
# Calculate the p-value for the difference being less than or equal to the
# observed difference of -7.6 km/s.
light_p_value2 = np.mean(shifted_bs_replicates <= diff_observed)
print("p-value: {:.4f}".format(light_p_value2))

#### A One-Sample Bootstrap Hypothesis Test (Exercise)

> Another juvenile frog was studied, Frog C, and you want to see if Frog B and Frog C have similar impact forces. Unfortunately, you do not have Frog C's impact forces available, but you know they have a mean of 0.55 N. Because you don't have the original data, you cannot do a permutation test, and you cannot assess the hypothesis that the forces from Frog B and Frog C come from the same distribution. You will therefore test another, less restrictive hypothesis: The mean strike force of Frog B is equal to that of Frog C.

> To set up the bootstrap hypothesis test, you will take the mean as our test statistic. Remember, your goal is to calculate the probability of getting a mean impact force less than or equal to what was observed for Frog B if the hypothesis that the true mean of Frog B's impact forces is equal to that of Frog C is true. You first translate all of the data of Frog B such that the mean is 0.55 N. This involves adding the mean force of Frog C and subtracting the mean force of Frog B from each measurement of Frog B. This leaves other properties of Frog B's distribution, such as the variance, unchanged.

We reject the null hypothesis.

> The low p-value suggests that the null hypothesis that Frog B and Frog C have the same mean impact force is false.

In [None]:
# Frog C in the exercise is frog "III" in the data.
mean_force_b = np.mean(force_b)
print("mean_force_b: {:.3f}".format(mean_force_b))
force_c = frog_tongue[frog_tongue["ID"] == "III"]["impact force (mN)"].to_numpy() / 1000
mean_force_c = np.mean(force_c)
print("mean_force_c: {:.3f}".format(mean_force_c))

# Translate the force_b data to have a mean of mean_force_c.
translated_force_b = force_b - np.mean(force_b) + mean_force_c
# Take bootstrap replicates of the mean.
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)
# Compute the fraction of replicates that are less than or equal to the
# observed Frog B mean force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / len(bs_replicates)
print("p = {:.4f}".format(p))

#### A Two-Sample Bootstrap Hypothesis Test for Difference of Means (Exercise)

> We now want to test the hypothesis that Frog A and Frog B have the same mean impact force, but not necessarily the same distribution, which is also impossible with a permutation test.

The permutation test (see above) assumes that the distributions are equal. It combines the data, creates permutations of the combined data to simulate the original data sets, and computes a replicate statistic.

> To do the two-sample bootstrap test, we shift both arrays to have the same mean, since we are simulating the hypothesis that their means are, in fact, equal. We then draw bootstrap samples out of the shifted arrays and compute the difference in means. This constitutes a bootstrap replicate, and we generate many of them. The p-value is the fraction of replicates with a difference in means greater than or equal to what was observed.

We reject the null hypothesis.

> You got a similar result as when you did the permutation test. Nonetheless, remember that it is important to carefully think about what question you want to ask. Are you only interested in the mean impact force, or in the distribution of impact forces?
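
The two-sample procedure described above can be sketched in a self-contained form. The function name `two_sample_bootstrap_test` and the synthetic force values are assumptions for illustration only; the notebook's own cell uses `draw_bs_reps()` and the real frog data.

```python
import numpy as np


def two_sample_bootstrap_test(a, b, n_reps=10_000, seed=None):
    """Bootstrap test of equal means without assuming equal distributions.

    Returns the observed difference of means (a minus b) and the fraction
    of bootstrap replicates with a difference at least that large.
    """
    rng_local = np.random.default_rng(seed)
    observed = np.mean(a) - np.mean(b)
    pooled_mean = np.mean(np.concatenate((a, b)))
    # Shift each sample to the pooled mean; each keeps its own spread.
    a_shifted = a - np.mean(a) + pooled_mean
    b_shifted = b - np.mean(b) + pooled_mean
    # Resample each shifted array separately, with replacement.
    reps_a = rng_local.choice(a_shifted, size=(n_reps, len(a))).mean(axis=1)
    reps_b = rng_local.choice(b_shifted, size=(n_reps, len(b))).mean(axis=1)
    return observed, np.mean(reps_a - reps_b >= observed)


# Synthetic stand-in data (not the frog measurements).
rng_forces = np.random.default_rng(5)
synth_a = rng_forces.normal(0.7, 0.25, size=20)
synth_b = rng_forces.normal(0.4, 0.20, size=20)
diff_synth, p_two = two_sample_bootstrap_test(synth_a, synth_b, seed=6)
print("difference = {:.3f}, p-value = {:.4f}".format(diff_synth, p_two))
```

The key difference from a permutation test is that the two arrays are never mixed, so each sample's own variance is preserved in its replicates.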

In [None]:
# Combine the force data. The + operator would add the arrays element-wise;
# use np.concatenate() to join them end to end.
forces_concat = np.concatenate((force_a, force_b))
mean_force = np.mean(forces_concat)

# Generate shifted arrays
force_a_shifted = force_a - np.mean(force_a) + mean_force
force_b_shifted = force_b - np.mean(force_b) + mean_force

# Compute 10,000 bootstrap replicates from shifted arrays
bs_replicates_a = draw_bs_reps(force_a_shifted, np.mean, size=10000)
bs_replicates_b = draw_bs_reps(force_b_shifted, np.mean, size=10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_replicates_a - bs_replicates_b

# Compute and print p-value: p
empirical_diff_means = np.mean(force_a) - np.mean(force_b)
print("empirical_diff_means = {:.3f}".format(empirical_diff_means))
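# The two calculations below are equivalent; np.mean() of a Boolean array
# gives the fraction of True values directly and is faster than the
# built-in sum().
p = sum(bs_replicates >= empirical_diff_means) / len(bs_replicates)
print("p-value = {:.4f}".format(p))
p = np.mean(bs_replicates >= empirical_diff_means)
print("p-value = {:.4f}".format(p))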

## Hypothesis Test Examples

### A/B Testing

#### Results of an A/B Test (Demonstration)

Page A had a click-through rate of 45/500; page B had a click-through rate of 67/500. What is the probability of seeing a difference at least this large by chance if both pages share the same underlying click-through rate? This is a hypothesis test, and one way to study this question is a permutation test, whose null hypothesis is that the difference in pages has no effect on the click-through rate.

Given the low p-value, we reject the null hypothesis that the two pages have the same click-through rate; a difference this large is unlikely to arise by chance alone.

Be warned that statistical significance does not mean practical significance.
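
As a cross-check that is not part of the course, the permutation p-value for this kind of 0/1 data can also be computed exactly. If the labels are shuffled, the number of clicks that land in page B's group of 500 follows a hypergeometric distribution. The sketch below assumes SciPy's `scipy.stats.hypergeom` and uses variable names of my own choosing.

```python
from scipy.stats import hypergeom

n_a, clicks_a = 500, 45
n_b, clicks_b = 500, 67
total_visits = n_a + n_b
total_clicks = clicks_a + clicks_b

# P(X >= clicks_b), where X is the number of clicks among the n_b visits
# assigned to page B when the total_clicks clicks are distributed at random.
# sf(k) is P(X > k), hence the "- 1".
p_exact = hypergeom.sf(clicks_b - 1, total_visits, total_clicks, n_b)
print("exact one-sided p-value = {:.4f}".format(p_exact))
```

Because the difference of means grows with the number of clicks assigned to page B, this tail probability corresponds to the one-sided permutation test on the difference of means, so the simulated p-value should land close to it.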

In [None]:
# Simulate the data. 45/500 hits for page A; 67/500 hits for page B.
clickthrough_a = np.array([1] * 45 + [0] * (500 - 45))
print("page A hit rate:", np.mean(clickthrough_a))
clickthrough_b = np.array([1] * 67 + [0] * (500 - 67))
print("page B hit rate:", np.mean(clickthrough_b))
clickthrough_combined = np.concatenate((clickthrough_a, clickthrough_b))
print("combined hit rate:", np.mean(clickthrough_combined))

# Define the test statistic: the difference of means of two arrays.
def diff_of_means(array1, array2):
    return np.mean(array1) - np.mean(array2)

# Draw the permutation samples and plot them against the empirical difference
# of means as a histogram.
obs_diff_of_means = diff_of_means(clickthrough_b, clickthrough_a)
print("obs_diff_of_means: {:.4f}".format(obs_diff_of_means))
diff_of_means_reps = draw_perm_reps(clickthrough_b, clickthrough_a, diff_of_means, size=10000)
# Decide the bin boundaries for the histogram.
# 56/500 - 56/500 = 0.000
# 57/500 - 55/500 = 0.004
# 58/500 - 54/500 = 0.008
# etc.
bins = np.arange(-.10, .10, .004)
y, x, _ = plt.hist(diff_of_means_reps, bins=bins, density=True)
plt.vlines(obs_diff_of_means, 0, max(y), colors=["red"])
plt.xlabel("Difference of means (click-through rate)")
plt.ylabel("PDF")
plt.xticks(np.arange(-0.10, 0.11, .02))
plt.show()

# Calculate the p-value.
p = np.mean(diff_of_means_reps >= obs_diff_of_means)
print("p-value: {:.4f}".format(p))

#### The Vote for the Civil Rights Act in 1964 (Exercise)

> The Civil Rights Act of 1964 was one of the most important pieces of legislation ever passed in the USA. Excluding "present" and "abstain" votes, 153 House Democrats and 136 Republicans voted yea. However, 91 Democrats and 35 Republicans voted nay. Did party affiliation make a difference in the vote?

> To answer this question, you will evaluate the hypothesis that the party of a House member has no bearing on his or her vote. You will use the fraction of Democrats voting in favor as your test statistic and evaluate the probability of observing a fraction of Democrats voting in favor at least as small as the observed fraction of 153/244. (That's right, at least as small as. In 1964, it was the Democrats who were less progressive on civil rights issues.) To do this, permute the party labels of the House voters and then arbitrarily divide them into "Democrats" and "Republicans" and compute the fraction of Democrats voting yea.

> This small p-value suggests that party identity had a lot to do with the voting. Importantly, the South had a higher fraction of Democrat representatives, and consequently also a more racist bias.
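
The label-permutation step described above can be sketched without the notebook's `draw_perm_reps()` helper. The function name `perm_reps_frac_first` and the seed are assumptions for illustration; the vote counts are the ones given in the exercise.

```python
import numpy as np


def perm_reps_frac_first(group1, group2, n_reps=10_000, seed=None):
    """Permutation replicates of the fraction of True values in group 1."""
    rng_local = np.random.default_rng(seed)
    combined = np.concatenate((group1, group2))
    n1 = len(group1)
    reps = np.empty(n_reps)
    for i in range(n_reps):
        # Shuffle all votes, then relabel the first n1 as "group 1".
        permuted = rng_local.permutation(combined)
        reps[i] = np.mean(permuted[:n1])
    return reps


dems_votes = np.array([True] * 153 + [False] * 91)
reps_votes = np.array([True] * 136 + [False] * 35)
frac_replicates = perm_reps_frac_first(dems_votes, reps_votes, seed=0)
p_dems = np.mean(frac_replicates <= 153 / 244)
print("p-value = {:.5f}".format(p_dems))
```

Under the null hypothesis the replicates center on the overall yea fraction, 289/415, which is why an observed fraction of 153/244 sits far out in the lower tail.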

In [None]:
# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)
print("prop_dems = {:.4f}".format(153 / (153 + 91)))
print("prop_reps = {:.4f}".format(136 / (136 + 35)))
print("prop_all  = {:.4f}".format((153 + 136) / (153 + 91 + 136 + 35)))

def frac_yea_dems(dems, reps):
    """Compute fraction of Democrat yea votes."""
    frac = np.sum(dems) / len(dems)
    return frac

# Acquire permutation replicates representing the proportion of Democrats
# who voted for the Civil Rights Act.
perm_replicates = draw_perm_reps(dems, reps, frac_yea_dems, size=10000)
y, x, _ = plt.hist(perm_replicates, bins=100, density=True)
plt.vlines(frac_yea_dems(dems, reps), 0, max(y), colors=["red"])
plt.xlabel("Proportion of Democrats voting yea")
plt.ylabel("PDF")
plt.show()
p = np.sum(perm_replicates <= (153 / (153 + 91))) / len(perm_replicates)
print('Democrats p-value = {:.5f}'.format(p))

# Do it again for the Republicans.
# The null hypothesis is no difference in voting proportion.
def frac_yea_reps(dems, reps):
    frac = np.sum(reps) / len(reps)
    return frac

perm_replicates2 = draw_perm_reps(dems, reps, frac_yea_reps, size=10000)
y2, x2, _ = plt.hist(perm_replicates2, bins=100, density=True)
plt.vlines(frac_yea_reps(dems, reps), 0, max(y2), colors=["red"])
plt.xlabel("Proportion of Republicans voting yea")
plt.ylabel("PDF")
plt.show()
p2 = np.sum(perm_replicates2 >= (136 / (136 + 35))) / len(perm_replicates2)
print("Republicans p-value = {:.5f}".format(p2))

#### A Time-on-Website Analog Using Games between No-Hitters (Exercise)

Instead of calculating time-on-website, we calculate the number of games between no-hitters in the dead ball era and the live ball era. The null hypothesis is that there is no difference -- that both data sets come from the same distribution.

This is unlikely given the difference between the mean of nht_dead and the mean of nht_live.

> We return to the no-hitter data set. In 1920, Major League Baseball implemented important rule changes that ended the so-called dead ball era. Importantly, the pitcher was no longer allowed to spit on or scuff the ball, an activity that greatly favors pitchers. In this problem you will perform an A/B test to determine if these rule changes resulted in a slower rate of no-hitters (i.e., longer average time between no-hitters) using the difference in mean inter-no-hitter time as your test statistic.

> Your p-value is 0.0001, which means that only one out of your 10,000 replicates had a result as extreme as the actual difference between the dead ball and live ball eras. This suggests strong statistical significance. Watch out, though, you could very well have gotten zero replicates that were as extreme as the observed value. This just means that the p-value is quite small, almost certainly smaller than 0.001.

In [None]:
# Create the NumPy arrays.
# Note that the first item in dead_ball in the course is -1; this must be
# an error.
dead_ball = nohitters[nohitters["date"] <= "1920-01-01"]["game_number"]
nht_dead = np.array([dead_ball.iloc[x] - dead_ball.iloc[x - 1] - 1 for x in range(1, len(dead_ball))])
print(nht_dead)
print()
live_ball = nohitters[nohitters["date"] > "1920-01-01"]["game_number"]
nht_live = np.array([live_ball.iloc[x] - live_ball.iloc[x - 1] - 1 for x in range(1, len(live_ball))])
print(nht_live)

# Compute the observed difference in mean inter-no-hitter times.
print("mean of nht_dead: {:.1f}".format(np.mean(nht_dead)))
print("mean of nht_live: {:.1f}".format(np.mean(nht_live)))
nht_diff_obs = diff_of_means(nht_dead, nht_live)
print("nht_diff_obs: {:.1f}".format(nht_diff_obs))
# Acquire 10,000 permutation replicates of difference in mean no-hitter time.
perm_replicates = draw_perm_reps(nht_dead, nht_live, diff_of_means, size=10000)
# Use <= so that replicates equal to the observed difference count as at
# least as extreme.
p = np.sum(perm_replicates <= nht_diff_obs) / len(perm_replicates)
print("p-val = {:.4f}".format(p))

#### EDA of Time between No-Hitters for Dead Ball and Live Ball Eras (Exercise)

As always, EDA should be performed before hypothesis tests.

In [None]:
# Plot the ECDFs of nht_dead and nht_live.
x_dead, y_dead = ecdf(nht_dead)
x_live, y_live = ecdf(nht_live)
plt.plot(x_dead, y_dead, marker=".", linestyle="none", label="dead")
plt.plot(x_live, y_live, marker=".", linestyle="none", label="live")
plt.xlabel("Games between no-hitters")
plt.ylabel("ECDF")
plt.legend()
plt.show()

### Test of Correlation

#### 2008 US Swing State Election Results (Demonstration)

In [None]:
# Find the Pearson correlation coefficient.
swing_states_pearson_r = np.corrcoef(swing_states[["total_votes", "dem_share"]], rowvar=False)[0, 1]
print("swing_states_pearson_r: {:.4f}".format(swing_states_pearson_r))

#### Hypothesis Test of Correlation (Demonstration)

How do we know if the correlation is real or if it happened by chance? We can carry out a hypothesis test on the correlation statistic. The null hypothesis is that there is no correlation between the number of votes in a county and the percentage of votes given to the Democratic candidate. We simulate the data assuming the null hypothesis is true. Use the Pearson correlation coefficient, rho, as the test statistic. The p-value is the fraction of replicates that have rho at least as large as observed.

Justin tried this and was unable to find even a single instance by chance where the correlation coefficient was at least 0.5362.

> This does not mean that the p-value is zero. It means that it is so low that we would have to generate an enormous number of replicates to have even one that has a test statistic sufficiently extreme. We conclude that the p-value is very very small and there is essentially no doubt that counties with higher vote count tended to vote for Obama. After all, that is how he won the election.

I confirmed that the p-value < 0.000001 by running the permutation one million times.
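
The permutation test of correlation can be sketched on synthetic data so that it runs on its own. The function name `perm_test_pearson` and the generated data are assumptions for illustration; this is not the swing-state data set, and it uses `np.corrcoef()` directly instead of the notebook's `pearson_r()` helper.

```python
import numpy as np


def perm_test_pearson(x, y, n_reps=10_000, seed=None):
    """Permutation test of the null hypothesis of no correlation.

    Returns the observed Pearson r and the fraction of replicates with
    r at least as large as observed.
    """
    rng_local = np.random.default_rng(seed)
    r_observed = np.corrcoef(x, y)[0, 1]
    reps = np.empty(n_reps)
    for i in range(n_reps):
        # Permuting x alone is enough to break the pairing with y.
        reps[i] = np.corrcoef(rng_local.permutation(x), y)[0, 1]
    return r_observed, np.mean(reps >= r_observed)


# Synthetic, positively correlated data.
rng_corr = np.random.default_rng(3)
x_synth = rng_corr.normal(size=200)
y_synth = 0.5 * x_synth + rng_corr.normal(size=200)
r_synth, p_corr = perm_test_pearson(x_synth, y_synth, seed=4)
print("r = {:.4f}, p-value = {:.4f}".format(r_synth, p_corr))
```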

In [None]:
# It should be sufficient to permute either dem_share or total_votes to break
# the association of the two variables.
dem_share_permuted = rng.permutation(swing_states["dem_share"])
total_votes_permuted = rng.permutation(swing_states["total_votes"])
rho1 = pearson_r(dem_share_permuted, swing_states["total_votes"])
print("rho1 = {:.4f}".format(rho1))
rho2 = pearson_r(swing_states["dem_share"], total_votes_permuted)
print("rho2 = {:.4f}".format(rho2))
rho3 = pearson_r(dem_share_permuted, total_votes_permuted)
print("rho3 = {:.4f}".format(rho3))

# Do this 10000 times.
size = 10000
rho_replicates = np.empty(size)
for i in range(size):
    dem_share_permuted = rng.permutation(swing_states["dem_share"])
    rho = pearson_r(dem_share_permuted, swing_states["total_votes"])
    rho_replicates[i] = rho

# Plot the results.
y, x, _ = plt.hist(rho_replicates, bins=100, density=True)
plt.vlines(swing_states_pearson_r, 0, np.max(y), colors=["red"])
plt.xlabel("Pearson r")
plt.ylabel("PDF")
plt.show()

# Calculate the p-value.
print("mean of rho replicates: {:.4f}".format(np.mean(rho_replicates)))
p = np.sum(rho_replicates >= swing_states_pearson_r) / len(rho_replicates)
print("p-value: {:.6f}".format(p))

#### Simulating a Null Hypothesis Concerning Correlation (Exercise)

> The observed correlation between female illiteracy and fertility in the data set of 162 countries may just be by chance; the fertility of a given country may actually be totally independent of its illiteracy. You will test this null hypothesis in the next exercise.

> To do the test, you need to simulate the data assuming the null hypothesis is true.

> Do a permutation test: Permute the illiteracy values but leave the fertility values fixed to generate a new set of (illiteracy, fertility) data.

> [T]his exactly simulates the null hypothesis and does so more efficiently than the last option. It is exact because it uses all data and eliminates any correlation because which illiteracy value pairs to which fertility value is shuffled.

#### Hypothesis Test on Pearson Correlation (Exercise)

> The observed correlation between female illiteracy and fertility may just be by chance; the fertility of a given country may actually be totally independent of its illiteracy. You will test this hypothesis. To do so, permute the illiteracy values but leave the fertility values fixed. This simulates the hypothesis that they are totally independent of each other. For each permutation, compute the Pearson correlation coefficient and assess how many of your permutation replicates have a Pearson correlation coefficient greater than the observed one.

Once again, the p-value is very low (<0.0001).

> You got a p-value of zero. In hacker statistics, this means that your p-value is very low, since you never got a single replicate in the 10,000 you took that had a Pearson correlation greater than the observed one. You could try increasing the number of replicates you take to continue to move the upper bound on your p-value lower and lower.
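
One way to put a number on that upper bound (this is my own addition, not from the course): if none of n independent replicates is as extreme as the observed value, the largest p-value still consistent with that outcome at a given confidence level solves (1 - p)^n = 1 - confidence. At 95% confidence this is close to 3/n. The function name `zero_hit_upper_bound` is hypothetical.

```python
def zero_hit_upper_bound(n_reps, confidence=0.95):
    """Upper bound on the p-value when 0 of n_reps replicates are extreme."""
    # Solve (1 - p) ** n_reps = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_reps)


for n_reps in (10_000, 1_000_000):
    bound = zero_hit_upper_bound(n_reps)
    print("n = {:>9,d}: p < {:.2e} (3/n = {:.2e})".format(n_reps, bound, 3 / n_reps))
```

This is why multiplying the number of replicates by 100 lowers the bound by roughly a factor of 100.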

In [None]:
# Use permutation samples to create 10,000 replicates of the Pearson r
# for the illiteracy and fertility data.
r_obs = pearson_r(illiteracy, fertility)
size = 10000
perm_replicates = np.empty(size)
for i in range(size):
    illiteracy_permuted = rng.permutation(illiteracy)
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)
if_p = sum(perm_replicates >= r_obs) / len(perm_replicates)
print("p-value: {:.4f}".format(if_p))

# Plot the histogram.
y_if, x_if, _ = plt.hist(perm_replicates, bins=100, density=True)
plt.vlines(r_obs, 0, np.max(y_if), colors=["red"])
plt.xlabel("Pearson r")
plt.ylabel("PDF")
plt.show()

#### Do Neonicotinoid Insecticides Have Unintended Consequences? (Exercise)

The ECDFs show a pretty clear difference between the treatment and control; treated bees have fewer alive sperm.

In [None]:
# EDA on bee sperm data.
x_c, y_c = ecdf(control)
x_t, y_t = ecdf(treated)
plt.plot(x_c, y_c, marker=".", linestyle="none", label="control")
plt.plot(x_t, y_t, marker=".", linestyle="none", label="treated")
plt.xlabel("Millions of alive sperm per mL")
plt.ylabel("ECDF")
plt.legend()
plt.show()

#### Bootstrap Hypothesis Test on Bee Sperm Counts (Exercise)

> [T]est the following hypothesis: On average, male bees treated with neonicotinoid insecticide have the same number of active sperm per milliliter of semen than do untreated male bees. You will use the difference of means as your test statistic.

The p-value is very low. We reject the null hypothesis that the treated and control bees have the same number of active sperm per milliliter of semen.

> The p-value is small, most likely less than 0.0001, since you never saw a bootstrap replicated with a difference of means at least as extreme as what was observed. In fact, when I did the calculation with 10 million replicates, I got a p-value of 2e-05.

In [None]:
# Assume treated and control have the same mean (but not necessarily the
# same distribution). Shift the values to support this assumption.
# I find this approach counter-intuitive.
size = 1000000
diff_means = diff_of_means(control, treated)

# Shift the control and treated measurements so their means are identical.
mean_count = np.mean(np.concatenate((control, treated)))
control_shifted = control - np.mean(control) + mean_count
treated_shifted = treated - np.mean(treated) + mean_count

# Generate bootstrap replicates of the means, assuming the means of the
# control and treated samples are identical.
bs_reps_control = draw_bs_reps(control_shifted, np.mean, size=size)
bs_reps_treated = draw_bs_reps(treated_shifted, np.mean, size=size)

# Get replicates of difference of means.
# Find the p-value.
bs_replicates = bs_reps_control - bs_reps_treated
p = np.sum(bs_replicates >= diff_means) / len(bs_replicates)
print('p-value = {:.6f}'.format(p))

## A Case Study

### Finch Beaks and the Need for Statistics

Since 1973, Peter and Rosemary Grant have been studying ground finches on the island of Daphne Major in the Galápagos Islands. In 2014 they published the book _40 Years of Evolution: Darwin's Finches on Daphne Major Island_, Princeton University Press. They made their data available at the Dryad Data Repository, http://dx.doi.org/10.5061/dryad.g6g3h.

We will start with an investigation of how beak depth of _Geospiza scandens_ has changed over time, looking at beak depth data from 1975 and 2012. We will create parameter estimates of mean beak depth for those years. Finally, we will do a hypothesis test of whether beak depth has changed from 1975 to 2012.

#### EDA of Beak Depths of Darwin's Finches (Exercise)

> For your first foray into the Darwin finch data, you will study how the beak depth (the distance, top to bottom, of a closed beak) of the finch species _Geospiza scandens_ has changed over time. The Grants have noticed some changes of beak geometry depending on the types of seeds available on the island, and they also noticed that there was some interbreeding with another major species on Daphne Major, _Geospiza fortis_. These effects can lead to changes in the species over time.

> In the next few problems, you will look at the beak depth of _G. scandens_ on Daphne Major in 1975 and in 2012. To start with, let's plot all of the beak depth measurements in 1975 and 2012 in a bee swarm plot.

> It is kind of hard to see if there is a clear difference between the 1975 and 2012 data set. Eyeballing it, it appears as though the mean of the 2012 data set might be slightly higher, and it might have a bigger variance.

In [None]:
# Create a bee swarm plot.
_ = sns.swarmplot(x="year", y="beak_depth", data=beak_depth)
_ = plt.xlabel("Year")
_ = plt.ylabel("Beak depth (mm)")
plt.show()

#### ECDFs of Beak Depths (Exercise)

> The differences are much clearer in the ECDF. The mean is larger in the 2012 data, and the variance does appear larger as well.

In [None]:
# Plot the ECDFs of the beak depth data.
x_1975, y_1975 = ecdf(bd_1975)
x_2012, y_2012 = ecdf(bd_2012)
plt.plot(x_1975, y_1975, marker=".", linestyle="none", label="1975")
plt.plot(x_2012, y_2012, marker=".", linestyle="none", label="2012")
plt.xlabel("Beak depth (mm)")
plt.ylabel("ECDF")
plt.legend()
plt.show()

#### Parameter Estimates of Beak Depths (Exercise)

Since the 95% confidence interval for the difference of means does not include 0, we can reject the null hypothesis that the means are equal.

In [None]:
# Compute the difference of the sample means.
obs_mean_diff = np.mean(bd_2012) - np.mean(bd_1975)

# Get bootstrap replicates of means.
bs_replicates_1975 = draw_bs_reps(bd_1975, np.mean, size=10000)
bs_replicates_2012 = draw_bs_reps(bd_2012, np.mean, size=10000)

# Compute samples of difference of means.
bs_diff_replicates = bs_replicates_2012 - bs_replicates_1975

# Compute 95% confidence interval: conf_int
conf_int = np.percentile(bs_diff_replicates, [2.5, 97.5])
print('difference of means = {:.3f} mm'.format(obs_mean_diff))
print('95% confidence interval = {:.3f}-{:.3f} mm'.format(conf_int[0], conf_int[1]))

#### Hypothesis Test: Are Beaks Deeper in 2012? (Exercise)

> Your plot of the ECDF and determination of the confidence interval make it pretty clear that the beaks of _G. scandens_ on Daphne Major have gotten deeper. But is it possible that this effect is just due to random chance? In other words, what is the probability that we would get the observed difference in mean beak depth if the means were the same?

> Be careful! The hypothesis we are testing is not that the beak depths come from the same distribution. For that we could use a permutation test. The hypothesis is that the means are equal. To perform this hypothesis test, we need to shift the two data sets so that they have the same mean and then use bootstrap sampling to compute the difference of means.

> We get a p-value of 0.0034, which suggests that there is a statistically significant difference. But remember: it is very important to know how different they are! In the previous exercise, you got a difference of 0.2 mm between the means. You should combine this with the statistical significance. Changing by 0.2 mm in 37 years is substantial by evolutionary standards. If it kept changing at that rate, the beak depth would double in only 400 years.

In [None]:
# Compute mean of combined data set.
combined_mean = np.mean(np.concatenate((bd_1975, bd_2012)))
print("combined_mean: {:.4f}".format(combined_mean))

# Shift the samples so they have the same means.
bd_1975_shifted = bd_1975 - np.mean(bd_1975) + combined_mean
print("np.mean(bd_1975_shifted): {:.4f}".format(np.mean(bd_1975_shifted)))
bd_2012_shifted = bd_2012 - np.mean(bd_2012) + combined_mean
print("np.mean(bd_2012_shifted): {:.4f}".format(np.mean(bd_2012_shifted)))
print()

# Get bootstrap replicates of shifted data sets.
# Compute replicates of the difference of means.
# Compute the p-value and print it.
bs_replicates_1975 = draw_bs_reps(bd_1975_shifted, np.mean, size=10000)
bs_replicates_2012 = draw_bs_reps(bd_2012_shifted, np.mean, size=10000)
bs_diff_replicates = bs_replicates_2012 - bs_replicates_1975
p = np.sum(bs_diff_replicates >= obs_mean_diff) / len(bs_diff_replicates)
print('p = {:.4f}'.format(p))

### Variation in Beak Shapes

Are beak depth and beak length changing together? We can use linear regression to investigate this question. We will make use of the `draw_bs_pairs_linreg()` function we wrote earlier.
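
As a reminder of what a pairs bootstrap for linear regression does, here is a self-contained sketch. It is not the notebook's `draw_bs_pairs_linreg()`; the function name `pairs_bootstrap_linreg`, the use of `np.polyfit()`, and the synthetic beak values are assumptions for illustration.

```python
import numpy as np


def pairs_bootstrap_linreg(x, y, n_reps=1_000, seed=None):
    """Bootstrap slopes and intercepts by resampling (x, y) pairs together."""
    rng_local = np.random.default_rng(seed)
    indices = np.arange(len(x))
    slopes = np.empty(n_reps)
    intercepts = np.empty(n_reps)
    for i in range(n_reps):
        # Resample indices so each x stays paired with its own y.
        bs_idx = rng_local.choice(indices, size=len(indices))
        slopes[i], intercepts[i] = np.polyfit(x[bs_idx], y[bs_idx], 1)
    return slopes, intercepts


# Synthetic stand-in for beak length (x) and beak depth (y), in mm.
rng_beak = np.random.default_rng(11)
length_synth = rng_beak.normal(14.0, 0.7, size=80)
depth_synth = 0.5 * length_synth + 2.0 + rng_beak.normal(0.0, 0.3, size=80)
slopes_synth, intercepts_synth = pairs_bootstrap_linreg(length_synth, depth_synth, seed=12)
print("slope 95% CI:", np.percentile(slopes_synth, [2.5, 97.5]))
```

Comparing the slope confidence intervals for two years is one way to judge whether the relationship between depth and length has changed.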

#### EDA of Beak Length and Depth (Exercise)

> In looking at the plot, we see that beaks got deeper (the red points are higher up in the y-direction), but not really longer. If anything, they got a bit shorter, since the red dots are to the left of the blue dots. So, it does not look like the beaks kept the same shape; they became shorter and deeper.

In [None]:
# Make scatter plot of 1975 data
_ = plt.plot(bl_1975, bd_1975, marker='.',
             linestyle='None', color="blue", alpha=0.5, label="1975")

# Make scatter plot of 2012 data
_ = plt.plot(bl_2012, bd_2012, marker='.',
             linestyle='None', color="red", alpha=0.5, label="2012")

# Label axes and make legend
_ = plt.xlabel('Beak length (mm)')
_ = plt.ylabel('Beak depth (mm)')
_ = plt.legend(loc='upper left')
plt.show()

### Calculation of Heritability

### Final Thoughts