# Statistical Thinking 2 - applications
- estimate parameter values
- perform linear regressions
- compute confidence intervals
- perform hypothesis tests

# 1. Parameter Estimation by optimization

Optimal parameters
- parameter values that bring the model in closest agreement with data
- note: if your model is wrong, then optimal parameters are not meaningful

Packages for statistical inference
- scipy.stats
- statsmodels
- ...or use hacker stats with numpy

In [None]:
# example: How often do we get no-hitters?
# If no-hitters are a Poisson process, then the time between no-hitters
# is exponentially distributed

# the Exponential distribution has a single parameter, which we will 
# call τ, the typical interval time. The value of the parameter τ that 
# makes the exponential distribution best match the data is the 
# mean interval time (where time is in units of number of games) 
# between no-hitters.

# NumPy, pandas, matplotlib.pyplot, and seaborn imported
# as np, pd, plt, sns

# Seed random number generator
np.random.seed(42)

# Compute mean no-hitter time: tau
tau = np.mean(nohitter_times)

# Draw out of an exponential distribution with parameter tau: inter_nohitter_time
inter_nohitter_time = np.random.exponential(tau, 100000)

# Plot the PDF and label axes
_ = plt.hist(inter_nohitter_time,
             bins=50, density=True, histtype='step')
_ = plt.xlabel('Games between no-hitters')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

In [None]:
# Create ECDF of the real data. Overlay theoretical CDF with the 
# ECDF of the data and verify the Exponential distribution
# write a function to compute the ECDF
##################
# recall
##################
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    return x, y

# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')
# Label the axes
_ = plt.xlabel('versicolor petal length')
_ = plt.ylabel('ECDF')
# Display the plot
plt.show()
##########################
# Create an ECDF from real data: x, y
x, y = ecdf(nohitter_times)

# Create a CDF from theoretical samples: x_theor, y_theor
x_theor, y_theor = ecdf(inter_nohitter_time)

# Overlay the plots CDF and ECDF
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')

# Margins and axis labels
plt.margins(.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')

# Show the plot
plt.show()


How is this parameter optimal?
Now sample out of an exponential distribution with τ being twice as large as the optimal τ. Do it again for τ half as large. Make CDFs of these samples and overlay them with your data. You can see that they do not reproduce the data as well. Thus, the τ you computed from the mean inter-no-hitter times is optimal in that it best reproduces the data.
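
One way to make "best reproduces the data" quantitative is to compare the log-likelihood of the data under each candidate τ. The sketch below is only an illustration: it uses synthetic exponential data as a stand-in for `nohitter_times` (an assumption, since the real array is not defined here), and `exp_log_likelihood` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for nohitter_times (assumption: not the real data)
fake_times = rng.exponential(scale=750.0, size=250)

def exp_log_likelihood(data, tau):
    """Log-likelihood of data under an Exponential distribution with mean tau."""
    # The PDF is (1/tau) * exp(-t/tau), so the log-likelihood is
    # -n*log(tau) - sum(t)/tau
    return -len(data) * np.log(tau) - np.sum(data) / tau

# The mean of the data is the candidate "optimal" tau
tau_hat = np.mean(fake_times)

# Compare against half and double that value
ll_half = exp_log_likelihood(fake_times, tau_hat / 2)
ll_opt = exp_log_likelihood(fake_times, tau_hat)
ll_double = exp_log_likelihood(fake_times, tau_hat * 2)

print(ll_half, ll_opt, ll_double)
```

The log-likelihood at the mean should be the largest of the three, which matches what the overlaid CDFs show graphically.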

Note: In this and all subsequent exercises, the random number generator is pre-seeded for you to save you some typing.

In [None]:
# Compare half tau and double tau 
# Plot the theoretical CDFs
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')

# Take samples with half tau: samples_half
samples_half = np.random.exponential(tau/2,10000)

# Take samples with double tau: samples_double
samples_double = np.random.exponential(tau*2,10000)

# Generate CDFs from these samples
x_half, y_half = ecdf(samples_half)
x_double, y_double = ecdf(samples_double)

# Plot these CDFs as lines
_ = plt.plot(x_half, y_half)
_ = plt.plot(x_double, y_double)

# Show the plot
plt.show()

## Linear regression by least squares
- sometimes 2 variables are related
- parameters of linear regression: slope and intercept
- residual: distance b/n data point and regression line
- least squares: the process of finding parameters for which the sum of the squares of the residuals is minimal

Least squares with np.polyfit(x, y, degree of polynomial)
- this is the best fit line
- note: degree of polynomial = 1 for linear regression
- slope, intercept = np.polyfit(total_votes, dem_share, 1)
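
As a rough check of what `np.polyfit()` returns for degree 1, the sketch below fits synthetic data (an assumption; the election arrays are not defined here) and compares the result with the closed-form least squares solution, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): true slope 2.0, true intercept 1.0, plus noise
x_data = np.linspace(0, 10, 50)
y_data = 2.0 * x_data + 1.0 + rng.normal(0, 0.5, size=x_data.size)

# Least squares fit with a degree-1 polynomial
slope, intercept = np.polyfit(x_data, y_data, 1)

# Closed-form least squares solution for comparison
slope_cf = np.cov(x_data, y_data, bias=True)[0, 1] / np.var(x_data)
intercept_cf = np.mean(y_data) - slope_cf * np.mean(x_data)

print('polyfit:    ', slope, intercept)
print('closed form:', slope_cf, intercept_cf)
```

The two pairs of numbers should agree to within floating-point error.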

In [None]:
# Step 1: EDA of female illiteracy/fertility data
##### recall #######
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x,y)

    # Return entry [0,1]
    return corr_mat[0,1]
#####################
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Set the margins and label axes
plt.margins(.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Show the plot
plt.show()

# Show the Pearson correlation coefficient
print(pearson_r(illiteracy, fertility))

In [None]:
# Step 2: Linear Regression
# Assume that fertility is a linear function of the female illiteracy rate.
# That is, f=ai+b, where a is the slope and b is the intercept. We can 
# think of the intercept as the minimal fertility rate, probably somewhere
# between one and two. The slope tells us how the fertility rate varies 
# with illiteracy. We can find the best fit line using np.polyfit().

# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy,fertility,1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot: "Best Fit Line"
x = np.array([0,100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()

How is it optimal?
The function np.polyfit() that you used to get your regression parameters finds the optimal slope and intercept. It does so by minimizing the sum of the squares of the residuals, also known as RSS (for residual sum of squares). In this exercise, you will plot the function that is being minimized, the RSS, versus the slope parameter a. To do this, fix the intercept to be what you found in the optimization. Then, plot the RSS vs. the slope. Where is it minimal?
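
The same idea can be checked numerically without a plot. This is a minimal sketch on synthetic data standing in for `illiteracy` and `fertility` (an assumption); it computes the RSS on a grid of slopes with the intercept held at its fitted value and locates the minimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the illiteracy/fertility data (assumption)
x_data = rng.uniform(0, 100, size=150)
y_data = 0.05 * x_data + 1.9 + rng.normal(0, 0.5, size=x_data.size)

# Best fit slope and intercept from least squares
a_opt, b_opt = np.polyfit(x_data, y_data, 1)

# Candidate slopes, with the intercept fixed at b_opt
a_vals = np.linspace(0, 0.1, 200)

# Residuals for every (data point, slope) pair via broadcasting: shape (150, 200)
residuals = y_data[:, np.newaxis] - a_vals * x_data[:, np.newaxis] - b_opt

# RSS for each candidate slope
rss = np.sum(residuals**2, axis=0)

# Slope on the grid with the smallest RSS
a_best = a_vals[np.argmin(rss)]
print(a_opt, a_best)
```

The grid slope with the smallest RSS should land within one grid step of the slope returned by `np.polyfit()`.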

In [None]:
# Plot RSS vs slope parameter a
# Use np.linspace() to get 200 points in the range between 0 and 0.1. 
# For example, to get 100 points in the range between 0 and 0.5, 
# you could use np.linspace() like so: np.linspace(0, 0.5, 100).

# Specify slopes to consider: a_vals. Get 200 points in range 0-0.1.
a_vals = np.linspace(0, 0.1, 200)

# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)

# Compute sum of square of residuals for each value of a_vals
# Hint: the RSS is given by np.sum((y_data - a * x_data - b)**2). 
# The variable b you computed in the last exercise.
# fertility is the y_data and illiteracy the x_data.
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a*illiteracy - b)**2)
    
# Plot the RSS (y axis) vs slope (x axis)
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')

plt.show()

Importance of EDA - Anscombe's quartet
- 4 fictitious data sets
- avg x all the same
- avg y all the same
- linear regression all the same
- RSS all the same
- conclusion: Do graphical EDA first
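
To see the matching summary statistics directly, here is a short sketch using the quartet's values as they are commonly published (Anscombe, 1973). The variable names are illustrative and are not the pre-loaded `anscombe_x` and `anscombe_y` lists used in the exercises.

```python
import numpy as np

# Anscombe's quartet, values as commonly published
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

quartet_x = [x123, x123, x123, x4]
quartet_y = [y1, y2, y3, y4]

# Collect (mean x, mean y, slope, intercept) for each data set
summaries = []
for x_set, y_set in zip(quartet_x, quartet_y):
    slope, intercept = np.polyfit(x_set, y_set, 1)
    summaries.append((np.mean(x_set), np.mean(y_set), slope, intercept))

for row in summaries:
    print('mean x: %.2f, mean y: %.2f, slope: %.3f, intercept: %.2f' % row)
```

All four rows come out nearly identical, even though scatter plots of the four sets look very different.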

In [None]:
# Linear regression on appropriate Anscombe data
# Perform linear regression: a, b
a, b = np.polyfit(x, y, 1)

# Print the slope and intercept
print(a, b)

# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = a * x_theor + b

# Plot the Anscombe data and theoretical line
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.plot(x_theor, y_theor)

# Label the axes
plt.xlabel('x')
plt.ylabel('y')

# Show the plot
plt.show()

In [None]:
# Linear regression on all Anscombe data
# The data are stored in lists; anscombe_x = [x1, x2, x3, x4] and 
# anscombe_y = [y1, y2, y3, y4]

# Iterate through x,y pairs
for x, y in zip(anscombe_x, anscombe_y):
    # Compute the slope and intercept: a, b
    a, b = np.polyfit(x, y, 1)

    # Print the result
    print('slope:', a, 'intercept:', b)

# 2. Bootstrap Confidence Intervals (CI)

- Resampling with replacement
- Bootstrapping = use of resampled data to perform statistical inference
- bootstrap replicate = value of the summary statistic computed from the bootstrap sample

In [None]:
# For resampling engine use: np.random.choice()
import numpy as np
# size arg sets the number of samples drawn
np.random.choice([1,2,3,4,5], size=5)
# example output: array([5, 3, 5, 5, 2])

# compute bootstrap replicate
bs_sample = np.random.choice(michelson_speed_of_light, size=100)
# compute summary stats
np.mean(bs_sample)
np.median(bs_sample)
np.std(bs_sample)

### Bootstrapping by hand
To help you gain intuition about how bootstrapping works, imagine you have a data set that has only three points, [-1, 0, 1]. How many unique bootstrap samples can be drawn (e.g., [-1, 0, 1] and [1, 0, -1] are unique), and what is the maximum mean you can get from a bootstrap sample? It might be useful to jot down the samples on a piece of paper.

(These are too few data to get meaningful results from bootstrap procedures, but this example is useful for intuition.)

Answer: there are 27 bootstrap samples (3**3), and the maximum mean is 1, from the sample [1, 1, 1].
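
For a data set this small you can also enumerate every bootstrap sample by brute force. The sketch below uses `itertools.product` to list all ordered draws with replacement; this only works for tiny data sets and is meant for intuition.

```python
import itertools
import numpy as np

data = [-1, 0, 1]

# Every ordered sample of size 3 drawn with replacement
all_samples = list(itertools.product(data, repeat=len(data)))

# Mean of each possible bootstrap sample
means = [np.mean(sample) for sample in all_samples]

n_samples = len(all_samples)
max_mean = max(means)
print(n_samples, max_mean)  # prints: 27 1.0
```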

### Visualizing bootstrap samples
In this exercise, you will generate bootstrap samples from the set of annual rainfall data measured at the Sheffield Weather Station in the UK from 1883 to 2015. The data are stored in the NumPy array rainfall in units of millimeters (mm). By graphically displaying the bootstrap samples with an ECDF, you can get a feel for how bootstrap sampling allows probabilistic descriptions of data.

In [None]:
for _ in range(50):
    # Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)

# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')

# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')

# Show the plot
plt.show()


### Bootstrap confidence intervals

In [None]:
# Bootstrap replicate function
def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data."""
    # func is any statistic like mean, median, etc
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)

# example function call for bootstrap replicate
bootstrap_replicate_1d(michelson_speed_of_light, np.mean)

In [None]:
# many bootstrap replicates
bs_replicates = np.empty(10000)
for i in range(10000):
    bs_replicates[i] = bootstrap_replicate_1d(michelson_speed_of_light, np.mean)
    
# plot histogram of bootstrap replicates
# density arg so that total bar areas sum to 1, aka normalization
_ = plt.hist(bs_replicates, bins=30, density=True)
_ = plt.xlabel('mean speed of light (km/s)')
_ = plt.ylabel('PDF')
plt.show()

In [None]:
# Bootstrap confidence intervals
conf_int = np.percentile(bs_replicates, [2.5, 97.5])
# output: array([ 299837., 299868.])

Exercises

In [None]:
# Generate many bootstrap replicates with function
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

Bootstrap replicates of the mean and the SEM

In this exercise, you will compute a bootstrap estimate of the probability density function of the mean annual rainfall at the Sheffield Weather Station. Remember, we are estimating the mean annual rainfall we would get if the Sheffield Weather Station could repeat all of the measurements from 1883 to 2015 over and over again. This is a probabilistic estimate of the mean. You will plot the PDF as a histogram, and you will see that it is approximately Normal.

In fact, it can be shown theoretically that under not-too-restrictive conditions, the value of the mean will always be Normally distributed. (This does not hold in general, just for the mean and a few other statistics.) The standard deviation of this distribution, called the standard error of the mean, or SEM, is given by the standard deviation of the data divided by the square root of the number of data points. I.e., for a data set, sem = np.std(data) / np.sqrt(len(data)). Using hacker statistics, you get this same result without the need to derive it, but you will verify this result from your bootstrap replicates.
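
The comparison between the SEM formula and the bootstrap can be sketched on its own. The example below uses synthetic Normal data as a stand-in for `rainfall` (an assumption) and draws the bootstrap samples inline rather than through a helper function.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the rainfall array (assumption: not the real data)
fake_rainfall = rng.normal(800, 100, size=133)

# SEM from the formula: std of the data over sqrt of the number of points
sem = np.std(fake_rainfall) / np.sqrt(len(fake_rainfall))

# Bootstrap replicates of the mean
n_reps = 10000
bs_means = np.empty(n_reps)
for i in range(n_reps):
    # Resample with replacement (the default for rng.choice)
    bs_sample = rng.choice(fake_rainfall, size=len(fake_rainfall))
    bs_means[i] = np.mean(bs_sample)

# Standard deviation of the replicates, to compare with the SEM
bs_std = np.std(bs_means)
print(sem, bs_std)
```

The two printed values should agree closely, differing only by the sampling noise of the bootstrap itself.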

The dataset has been pre-loaded for you into an array called rainfall.

In [None]:
# Bootstrap replicates of the mean and the SEM

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall,np.mean,size=10000)

# Compute and print SEM of rainfall data
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)

# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

# note: the computed SEM and the std of the bootstrap replicates are
# nearly the same, and the distribution of the bootstrap replicates
# of the mean is approximately Normal.

In [None]:
# Bootstrap replicates of other statistics
# This exercise generates bootstrap replicates for the variance and plots

# Generate 10,000 bootstrap replicates of the variance: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.var, size=10000)

# Put the variance in units of square centimeters
bs_replicates = bs_replicates/100

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('variance of annual rainfall (sq. cm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

In [None]:
# Confidence interval on the rate of no-hitters
# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times,np.mean,size=10000)

# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates,[2.5,97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')

# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

# answer: estimate of typical time b/n no-hitters CI 660-870 games

### Pairs bootstrap - for linear regression
- resample pairs of data
- compute slope and intercept from resampled data
- each slope and intercept is a bootstrap replicate
- compute confidence intervals from percentiles of bootstrap replicates

In [None]:
# generate a pairs bootstrap sample
np.arange(7)  # array([0, 1, 2, 3, 4, 5, 6])
inds = np.arange(len(total_votes))
# sample the indices with replacement
bs_inds = np.random.choice(inds, len(inds))
bs_total_votes = total_votes[bs_inds]
bs_dem_share = dem_share[bs_inds]
# bootstrap replicate
bs_slope, bs_intercept = np.polyfit(bs_total_votes, bs_dem_share, 1)

In [None]:
# function to do pairs bootstrap
def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""

    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        # resampled indices
        bs_inds = np.random.choice(inds, size=len(inds))
        # new x and y sliced with bs_inds
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    # return pairs bootstrap replicates of slope and intercept
    return bs_slope_reps, bs_intercept_reps

In [None]:
# Pairs bootstrap of illiteracy/fertility data

# Using function above, perform pairs bootstrap to plot a histogram 
# describing the estimate of the slope from the illiteracy/fertility data.
# Also report the 95% confidence interval of the slope. 

# Generate replicates of slope and intercept using pairs bootstrap
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, size=1000)

# Compute and print 95% CI for slope
print(np.percentile(bs_slope_reps, [2.5,97.5]))

# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50, density=True)
_ = plt.xlabel('slope')
_ = plt.ylabel('PDF')
plt.show()

In [None]:
# Plotting bootstrap regressions

# A nice way to visualize the variability we might expect in a 
# linear regression is to plot the line you would get from each 
# bootstrap replicate of the slope and intercept.

# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.',linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()

# 3. Hypothesis Testing
Permutation sampling, test statistics, p-value, bootstrap hypothesis tests
- Null hypothesis - one you are testing
- Options: Look at 2 ECDFs, summary stats
- Simulate the hypothesis
- Permutation: random reordering of entries in an array
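
Putting these pieces together, a permutation test might look like the sketch below. It is only an illustration on synthetic data (an assumption): the test statistic is the difference of means, the null hypothesis is that both groups share one distribution, and the p-value is the fraction of permutation replicates at least as extreme as the observed statistic. The names `group_a`, `group_b`, and `diff_of_means` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic samples (assumption): group_b is shifted up by 1.0
group_a = rng.normal(0.0, 1.0, size=50)
group_b = rng.normal(1.0, 1.0, size=50)

def diff_of_means(a, b):
    """Test statistic: difference of the means of two arrays."""
    return np.mean(a) - np.mean(b)

# Observed value of the test statistic
observed = diff_of_means(group_b, group_a)

# Simulate the null hypothesis by scrambling the pooled data
pooled = np.concatenate((group_a, group_b))
n_reps = 10000
perm_reps = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm_a = shuffled[:len(group_a)]
    perm_b = shuffled[len(group_a):]
    perm_reps[i] = diff_of_means(perm_b, perm_a)

# p-value: fraction of replicates at least as large as the observed value
p_value = np.mean(perm_reps >= observed)
print(observed, p_value)
```

A small p-value here says that a difference this large would rarely arise if the two groups really came from the same distribution.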

In [None]:
# Generate a permutation sample
import numpy as np
# concatenate the two arrays (np.concatenate takes a tuple of arrays)
dem_share_both = np.concatenate((dem_share_PA, dem_share_OH))
# scramble the array
dem_share_perm = np.random.permutation(dem_share_both)
# reassign to 2 arrays to create "Permutation Samples"
perm_sample_PA = dem_share_perm[:len(dem_share_PA)]
perm_sample_OH = dem_share_perm[len(dem_share_PA):]

## permutation sample: generate


## permutation sample: visualize


## Test statistics


## p-value


## generate permutation replicates


## EDA before hypothesis testing


## permutation test on frog data


## Bootstrap hypothesis tests


## one-sample bootstrap hypothesis test


## bootstrap test for identical distributions


## 2-sample bootstrap hypothesis test for difference of means


# 4. Hypothesis testing examples

## A/B testing

# 5. Case study
