# Day 3, Morning Session 2, MD/MC workshop
### Dr. Michael Shirts, CU Boulder

In [None]:
# load up some modules.

import numpy as np
import matplotlib.pyplot as plt
import scipy
import scipy.optimize  # used later for the zero-intercept fit
import scipy.stats as stats
import pandas as pd
%matplotlib inline

## Bootstrapping for complicated uncertainties.


In statistics, there is a difference between a **population** and a **sample**. The population is all the possible observations out there. For instance, if I were an epidemiologist, this might be all the people in the U.S. or all the children in Massachusetts.


<img width="304" height="232" alt="Image result for population versus sample" src="http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_BiostatisticsBasics/Sampling3.jpg">


But this also applies to measurements in the lab, such as the possible voltage readings from a battery. If you ran an infinite number of experiments, you would get an infinite number of measurements. In either case, the battery or the whole U.S. population, accessing the actual population is virtually impossible.

<img width="304" height="223" alt="Image result for thermometer gif" src="https://media.giphy.com/media/26FL3uMhARSAvIZZS/giphy.gif">

As a scientist, you only have access to a sample. Part of designing an experiment is choosing how big your sample should be.

But a key problem is: if you change your sample, your sample mean and sample variance could change significantly. So the question is: how can you understand the variation in the **population** by only looking at your **sample**?

Let me repeat: the question the bootstrap is trying to answer is how to understand the variation in the **population** by understanding the variation in your **sample**.
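To make the sample-to-sample variation concrete, here is a minimal sketch. The "population" is a hypothetical set of one million synthetic voltage readings (the mean of 1.5 and spread of 0.1 are assumed values, not data from this workshop). Each sample of 20 readings gives a slightly different sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population": one million synthetic voltage readings (assumed values).
population = rng.normal(loc=1.5, scale=0.1, size=1_000_000)

# Draw several independent samples of 20 readings and record each sample mean.
sample_means = np.array([
    rng.choice(population, size=20, replace=False).mean()
    for _ in range(5)
])

print("population mean:", population.mean())
print("sample means:   ", sample_means)
```

In a real experiment you only ever see one of those sample means, which is why we need some way to estimate how far it might be from the population value.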


Let's say we were in the lab, and we were making a calibration curve, say absorbance against concentration using UV-vis measurements. We generate the following plot:

In [None]:
def generate_data(xlim=[0,5],ylim=[0,100],npoints=20,noise=1.0,seed=None,noise_model="even"):
    '''
    Generate some random data that might look like UV-vis output.
    Inputs: range of data in the x and y directions, the number of points, and the level of noise.
    A random number seed can be passed for repeatability.

    There are two noise models: "even" (uniform noise scaled by `noise`),
    and any other value, which selects noise that grows exponentially with x.
    '''
    np.random.seed(seed) # so we can control the random noise
    x = np.linspace(xlim[0], xlim[1], npoints)
    y_raw = np.linspace(ylim[0], ylim[1], npoints)
    if noise_model == "even":
        y = y_raw + noise*np.random.rand(npoints)
    else:
        y = y_raw + np.exp(x*np.random.rand(npoints)) - 1
    return x, y 

In [None]:
xmin = 0
xmax = 5

ymin = 0
ymax = 100

npoints = 50

# in the absence of noise, the slope is 20 and the intercept is 0,
# but because of the heteroscedasticity, the linear fit will be somewhat biased.

x, y = generate_data(xlim=[xmin,xmax],ylim=[ymin,ymax],noise=1.0,npoints=npoints,noise_model="exponential", seed=1)


results = stats.linregress(x,y)
slope = results.slope
intercept = results.intercept
y_fit = intercept + np.array([xmin,xmax])*slope
original_data = pd.DataFrame({'x': x, 'y':y})
original_data.head()

Now we plot the data and the fit.

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(original_data['x'], original_data['y'], alpha=0.5, label='raw data')
ax.plot(np.array([0,5]), y_fit, linewidth=4, label='linear fit')
ax.set_xlim([0, 5])
ax.set_ylim([0, 200])
ax.legend(loc="lower right")
plt.show()

But uh oh. We can see that some of those points at high values of the inputs have error that is getting a little large (to use a million dollar word, the data has heteroskedasticity: different amounts of statistical error for different inputs). You want to get some indication of how your calibration curve would change if you repeated the experiment.

Instead of actually repeating the experiment, you decide to use bootstrapping, and *estimate the variation within your sample as a replacement for the variation within your population.*

To do this, you will generate "new" samples by picking from the data **with replacement**, pretending the data you collected is **entirely representative** of the actual population, and perform a regression on each resampling of your sample.

**With replacement**, meaning you can draw the same point multiple times, is an important statement. We are assuming our population is much, much bigger than our sample, but has the exact same distribution as our sample. So when we withdraw the value that has x=1.456, there are still infinitely many points with x=1.456 remaining. So the next point you pick is just as likely to be x=1.456 as it was the first time you drew it.
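As a small illustration of the difference, the sketch below uses NumPy's `Generator.choice` on a hypothetical five-point dataset (the values are made up for illustration). Without replacement you only get a reshuffle of the data. With replacement each draw is independent, so repeats and omissions are possible.

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # hypothetical measurements

# Without replacement: a reshuffle, every point appears exactly once.
shuffled = rng.choice(data, size=data.size, replace=False)

# With replacement: each draw is independent, so repeats and omissions are allowed.
resampled = rng.choice(data, size=data.size, replace=True)

print("without replacement:", np.sort(shuffled))
print("with replacement:   ", np.sort(resampled))
```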

You will write some code to do this. First, though, we will see what happens if we resample from the original population; then we will see what bootstrapping can do.

So first, we draw from the _population_: we regenerate the entire dataset, with new noise each time.

In [None]:
def draw_from_population(data,npoints=20,ndraws=10):
    '''
    Draw from the population using the generate_data function,
    and determine the slope and intercept for each draw.
    '''
    fig, ax = plt.subplots(figsize=(8,6))
    ax.scatter(data['x'], data['y'], alpha=0.5, label='raw data')

    xmin = data['x'].min()
    xmax = data['x'].max()
    
    # initializing outputs
    slope = np.zeros(ndraws)
    intercept = np.zeros(ndraws)

    for i in range(ndraws):
        # vvvv - these are the lines to change for bootstrap
        #x, y = generate_data(xlim=[xmin,xmax],ylim=[ymin,ymax],npoints=npoints,noise=10.0) # don't set the seed, so we get new random data.
        x, y = generate_data(xlim=[xmin,xmax],ylim=[ymin,ymax],npoints=npoints,noise=1.0,noise_model="exponential") # don't set the seed, so we get new random data.
        sample = pd.DataFrame({'x': x, 'y':y})
        # ^^^^ 
        results = stats.linregress(sample['x'].values,sample['y'])
        slope[i] = results.slope
        intercept[i] = results.intercept
        y_fit = intercept[i] + np.array([xmin,xmax])*slope[i]
        ax.plot(np.array([xmin,xmax]), y_fit, linewidth=2, color='b', alpha=0.2)
    return slope, intercept

In [None]:
draw_from_population(original_data,npoints=20,ndraws=10)

We can repeat this, like, 5000 times, generate a distribution of slopes and intercepts, and use the distributions to calculate confidence intervals on the slope and intercept:

In [None]:
slopes, intercepts = draw_from_population(original_data,npoints=50,ndraws=5000)

In [None]:
plt.figure()
plt.hist(slopes,bins=30)
plt.title('slope values')
plt.figure()
plt.hist(intercepts,bins=30)
plt.title('intercept values')
ci = 95

print('mean slope: {:.2f} +/- {:.2f} ({:2d}% CI: {:.2f} - {:.2f})'.format(np.mean(slopes), 
                                                       np.std(slopes,ddof=1), ci,
                                                       np.percentile(slopes, ((100-ci)/2)),
                                                       np.percentile(slopes, ((100+ci)/2))))

print('mean intercept: {:.2f} +/- {:.2f} ({:2d}% CI: {:.2f} - {:.2f})'.format(np.mean(intercepts), 
                                                       np.std(intercepts,ddof=1), ci,
                                                       np.percentile(intercepts, ((100-ci)/2)),
                                                       np.percentile(intercepts, ((100+ci)/2))))


Let's see how these values compare to the ones estimated by the standard linear estimator.

In [None]:
results = stats.linregress(original_data['x'],original_data['y'])
print(f"slope + standard error of slope = {results.slope:.2f} +/- {results.stderr:.2f}")
print(f"intercept + standard error of intercept = {results.intercept:.2f} +/- {results.intercept_stderr:.2f}")

They seem statistically consistent! Both the intercept and the slope are within the error bars; or equivalently, the values are well within the distributions.

### Now, to the bootstrap

To pull a sample with replacement, we will take advantage of `pandas` sampling capabilities:

In [None]:
original_data.head()
subsample = original_data.sample(n=original_data.x.count(), replace=True)
print(subsample)

Note that some points occur more than once, and some don't appear at all! That is what "with replacement" means!
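How many points go missing? The chance that a given point is never picked in $n$ draws is $(1-1/n)^n$, which approaches $e^{-1} \approx 0.37$ for large $n$, so a bootstrap sample typically contains roughly 63% of the distinct original points. The sketch below checks this by simulation on row indices only; the number of trials is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50            # same size as the calibration dataset above
ntrials = 2000    # number of resamples to average over (arbitrary choice)

# For each trial, resample the indices 0..n-1 with replacement
# and record what fraction of them are distinct.
fractions = np.array([
    np.unique(rng.integers(0, n, size=n)).size / n
    for _ in range(ntrials)
])

expected = 1 - (1 - 1/n)**n   # exact expectation of the distinct fraction
print(f"simulated: {fractions.mean():.3f}, expected: {expected:.3f}")
```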

**Hacking Time**: That code draws *one* bootstrap sample. Copy the `draw_from_population` function from above to create a new function `draw_from_bootstrap` that plots the lines from _multiple bootstrap samples_ of the original dataset instead of multiple draws from the population.

In [None]:
def draw_from_bootstrap(data,ndraws=500):
    '''
    Draw from the bootstrap and determine the slope and intercept for each draw.
    '''
    
    xmin = data['x'].min()
    xmax = data['x'].max()
    
    fig, ax = plt.subplots(figsize=(8,6))
    ax.scatter(data['x'], data['y'], alpha=0.5, label='raw data')

    # initializing outputs
    slope = np.zeros(ndraws)
    intercept = np.zeros(ndraws)

    for i in range(0, ndraws):
        # vvvv - these are the lines to change for bootstrap
        subsample = data.sample(n=data.x.count(), replace=True)
        # ^^^^ 
        results = stats.linregress(subsample['x'],subsample['y'])
        slope[i] = results.slope
        intercept[i] = results.intercept
        y_fit = intercept[i] + np.array([xmin,xmax])*slope[i]
        ax.plot(np.array([xmin,xmax]), y_fit, linewidth=2, color='b', alpha=0.2)
    return slope, intercept

Note that the bootstrap means will not be exactly the same as the population means, because each draw from the population is different. As we use more bootstrap samples, the bootstrap means converge to the values for that one sample, which should lie well WITHIN the population distribution of means. The bootstrap distribution should have roughly the same width as the population distribution, but it is centered on the value from our sample.

Once you have the code working, then run the line below - you should hopefully get something similar to the draw from population!

In [None]:
slopes_boot, intercepts_boot = draw_from_bootstrap(original_data,ndraws=5000)

**Hacking time, Part 2**: plot the histograms of the distribution of slopes and intercepts of the population and the bootstrap samples and compare

In [None]:
plt.figure()
plt.hist(slopes,bins=30,alpha=0.4,density=True,label='population')
plt.hist(slopes_boot,bins=30,alpha=0.4,color='orange',density=True,label='bootstrap')
plt.legend()
plt.title('slope values')
ci = 95
print('mean slope (bootstrap): {:.2f} +/- {:.2f} ({:2d}% CI: {:.2f} - {:.2f})'.format(np.mean(slopes_boot), 
                                                       np.std(slopes_boot,ddof=1),ci,
                                                       np.percentile(slopes_boot, ((100-ci)/2)),
                                                       np.percentile(slopes_boot, ((100+ci)/2))))
plt.show()

plt.figure()
plt.hist(intercepts,bins=30,alpha=0.4,density=True,label='population')
plt.hist(intercepts_boot,bins=30,alpha=0.4,color='orange',density=True,label='bootstrap')
plt.title('intercept values')
plt.legend()
ci = 95
print('mean intercept (bootstrap): {:.2f} +/- {:.2f} ({:2d}% CI: {:.2f} - {:.2f})'.format(np.mean(intercepts_boot), 
                                                       np.std(intercepts_boot,ddof=1),ci,
                                                       np.percentile(intercepts_boot, ((100-ci)/2)),
                                                       np.percentile(intercepts_boot, ((100+ci)/2))))
plt.show()

### Bootstrapping for complicated function distributions

Bootstrapping is generalizable to any statistical quantities you may be interested in. One useful example is error propagation. If you know the error in raw datasets, and you want to know how that error will impact downstream calculations, you can use bootstrapping to do it **without having to do the calculus** involved in standard error propagation, or if the error propagation is really impossible to do. 
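As a sanity check on this idea, here is a minimal sketch for a case where the calculus is still easy: the uncertainty in the log of a mean. Standard error propagation gives $\sigma_{\ln m} \approx \sigma_m / m$, and the bootstrap should roughly reproduce it. The sample is synthetic (a hypothetical set of 200 measurements with assumed mean and spread).

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(loc=10.0, scale=2.0, size=200)  # hypothetical measurements

# Standard error propagation: SE(log m) ~ SE(m) / m, with SE(m) = s / sqrt(n).
mean = sample.mean()
se_mean = sample.std(ddof=1) / np.sqrt(sample.size)
delta_se = se_mean / mean

# Bootstrap: resample with replacement, recompute log(mean) each time.
nboot = 5000
idx = rng.integers(0, sample.size, size=(nboot, sample.size))
boot_logs = np.log(sample[idx].mean(axis=1))
bootstrap_se = boot_logs.std(ddof=1)

print(f"error propagation: {delta_se:.5f}")
print(f"bootstrap:         {bootstrap_se:.5f}")
```

The bootstrap version never needed the derivative of the log; swapping in a messier function only changes the one line that computes `boot_logs`.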

For example, try obtaining the distribution of the sine of the log of the absolute value of (slope/intercept) from the linear fit above. What is the average and the distribution of this quantity? (Why this quantity? No reason, really, except that it would be REALLY HARD to get the distribution of errors using any standard error propagation.) Compare what one gets from the population draws to what one gets using bootstrapping.

Similarly, bootstrapping can be used to estimate error in ANY fitting procedure extending beyond linear regression.
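Here is a hedged sketch of what that could look like for a nonlinear fit. The model (a single exponential decay), the synthetic data, and the starting guess are all assumptions chosen for illustration; the fit uses `scipy.optimize.curve_fit`, and the bootstrap resamples (t, y) pairs just as we resampled rows of the calibration data.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, amplitude, rate):
    """Hypothetical model: single exponential decay."""
    return amplitude * np.exp(-rate * t)

rng = np.random.default_rng(4)
t = np.linspace(0, 5, 40)
y = decay(t, 2.0, 0.8) + rng.normal(scale=0.05, size=t.size)  # synthetic data

nboot = 500
rates = np.zeros(nboot)
for i in range(nboot):
    # Resample (t, y) pairs with replacement, then refit the model.
    idx = rng.integers(0, t.size, size=t.size)
    params, _ = curve_fit(decay, t[idx], y[idx], p0=[1.0, 1.0])
    rates[i] = params[1]

print(f"rate = {rates.mean():.3f} +/- {rates.std(ddof=1):.3f}")
```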

In [None]:
plt.figure()
vals = np.sin(np.log(np.fabs(slopes/intercepts)))
vals_boot = np.sin(np.log(np.fabs(slopes_boot/intercepts_boot)))
plt.hist(vals,bins=30,alpha=0.4, density=True,label='population')
plt.hist(vals_boot,bins=30,alpha=0.4,density=True,color='orange',label='bootstrap')
plt.title('f(slope/intercepts) values')
plt.legend()
ci = 95

print('mean sin(log(abs(slope/intercept))): {:.2f} ({:2d}% CI: {:.2f} - {:.2f})'.format(np.mean(vals), ci,
                                                       np.percentile(vals, ((100-ci)/2)),
                                                       np.percentile(vals, ((100+ci)/2))))
print('mean sin(log(abs(slope/intercept))) (bootstrap): {:.2f} ({:2d}% CI: {:.2f} - {:.2f})'.format(np.mean(vals_boot), ci,
                                                       np.percentile(vals_boot, ((100-ci)/2)),
                                                       np.percentile(vals_boot, ((100+ci)/2))))

### Bootstrapping for computing self-diffusion

In 3D, one can estimate the self-diffusion coefficient with the following formula:

$\langle (x(t)-x(t+\tau))^2\rangle = 6D\tau$ 

This is for 3D; the 6 is 2 times the number of dimensions.

The algorithm is then:

1. Calculate the square displacement of each particle as a function of time.
2. Average the square displacements over all particles to get the mean square displacement (MSD).
3. Fit the resulting function to a line with intercept = 0 to find the slope.
4. Divide the slope by 6.
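The four steps above can be tried out on a synthetic random walk where the answer is known in advance. This is only a sketch: the value of `D_true`, the number of particles, and the number of frames are assumptions, and displacements are measured from the starting point only.

```python
import numpy as np

rng = np.random.default_rng(5)
D_true = 0.005      # nm^2/ps, assumed value for the synthetic walk
dt = 0.5            # ps between frames
nparticles, nsteps = 500, 200

# Brownian steps: each Cartesian component has variance 2*D*dt per frame.
steps = rng.normal(scale=np.sqrt(2 * D_true * dt), size=(nparticles, nsteps, 3))
positions = np.concatenate(
    [np.zeros((nparticles, 1, 3)), np.cumsum(steps, axis=1)], axis=1
)

# Step 1: square displacement from the starting point, per particle and frame.
sq_disp = np.sum(positions**2, axis=2)
# Step 2: average over particles.
msd = sq_disp.mean(axis=0)
# Step 3: zero-intercept least-squares slope, a = sum(t*y) / sum(t*t).
times = dt * np.arange(nsteps + 1)
slope = np.sum(times * msd) / np.sum(times**2)
# Step 4: divide by 6 (2 times 3 dimensions).
D_est = slope / 6

print(f"true D = {D_true}, estimated D = {D_est:.4f}")
```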

Load in some data. Each entry in the array is a square displacement of the oxygen atom of a water molecule. Note that the interval between frames is 0.5 ps, so the total trajectory is 1 ns long.

In [None]:
msd_data = np.load("manytraj.npy")

In [None]:
nparticles, length = np.shape(msd_data)

In [None]:
for i in range(nparticles):
    plt.plot(msd_data[i,:],'b',alpha=0.02)

In [None]:
# we can average over all particles
avemsd = np.mean(msd_data,axis=0)

In [None]:
plt.plot(avemsd)
plt.xlabel("frames")
plt.ylabel(r"$\langle MSD \rangle$")
plt.show()

We want to minimize $\sum_i (a\tau_i-y(\tau_i))^2$: a linear fit, but with no intercept.
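Because the model is linear in $a$, this particular minimization also has a closed-form solution, which can serve as a check on a numerical fit. Setting the derivative with respect to $a$ to zero:

$$
\frac{d}{da}\sum_i \left(a\tau_i - y(\tau_i)\right)^2 = 2\sum_i \tau_i\left(a\tau_i - y(\tau_i)\right) = 0
\quad\Longrightarrow\quad
a = \frac{\sum_i \tau_i\, y(\tau_i)}{\sum_i \tau_i^2}
$$

A general-purpose least-squares routine should converge to the same value of $a$.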

In [None]:
# times run from 0 to 1000 ps in 2001 frames, 0.5 ps apart
# this is the residual function whose sum of squares we want to minimize:
def func(a,mymsd):
    return a*np.linspace(0,1000,2001)-mymsd

In [None]:
slope = scipy.optimize.leastsq(func, 0.1, args=(avemsd,))[0][0]
D = slope/6  # was `result/6`, but `result` is never defined
print(D) # this is in nm^2 / ps. 

In [None]:
# to get it into cm^2 / s: 1 cm = 10^7 nm and 1 s = 10^12 ps
finalD = (D / (10**7 * 10**7))*10**12
print(finalD)
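The same conversion can be wrapped in a small helper so the factors are named. The function name and constants below are hypothetical, introduced only for illustration; the net factor is $10^{12}/10^{14} = 10^{-2}$.

```python
NM_PER_CM = 1e7     # nanometres in one centimetre
PS_PER_S = 1e12     # picoseconds in one second

def nm2_per_ps_to_cm2_per_s(d_nm2_ps):
    """Convert a diffusion coefficient from nm^2/ps to cm^2/s."""
    return d_nm2_ps / NM_PER_CM**2 * PS_PER_S

print(nm2_per_ps_to_cm2_per_s(1.0))
```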

## Exercise

Perform a bootstrap error analysis by constructing bootstrap samples over _particles_, that are then used to compute the average MSD.

1. Plot 500 bootstrapped MSDs
2. Show the 95% confidence interval region of MSD trajectories.
3. Plot the distribution of the resulting diffusion coefficients $D$. Is it Gaussian? If so, what is the standard error in the calculation?

In [None]:
nbootstraps = 5000
newmsd_data = np.zeros(np.shape(msd_data)) # create a matrix the same size
Ds = np.zeros(nbootstraps)
for n in range(nbootstraps):
    for i in range(nparticles):
        # generate bootstrapped data in newmsd_data[i,:]
        newi = np.random.randint(0,nparticles)
        newmsd_data[i,:] = msd_data[newi,:]
    new_avemsd = np.mean(newmsd_data,axis=0)
    plt.plot(new_avemsd,'b',alpha=0.01)  # plot the bootstrapped curve, not the original average
    slope = scipy.optimize.leastsq(func,0.1,new_avemsd)[0][0]
    Ds[n] = slope/6  * 0.01 # nm^2/ps -> cm^2/s: 10^12 / (10^7 * 10^7)
plt.show()
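For item 2 of the exercise, one possible approach is a pointwise percentile band over the bootstrapped average MSD curves. The sketch below uses a synthetic stand-in array (`fake_msd`, a hypothetical name with made-up values) so that it runs on its own; with the real data you would use `msd_data` instead.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic stand-in for msd_data (assumed shape: particles x frames).
nparticles, nframes = 100, 50
fake_msd = np.cumsum(rng.exponential(scale=0.03, size=(nparticles, nframes)), axis=1)

nbootstraps = 500
boot_curves = np.zeros((nbootstraps, nframes))
for n in range(nbootstraps):
    # Pick nparticles row indices with replacement, then average those rows.
    idx = rng.integers(0, nparticles, size=nparticles)
    boot_curves[n] = fake_msd[idx].mean(axis=0)

# Pointwise 95% band: 2.5th and 97.5th percentiles at every frame.
lower, upper = np.percentile(boot_curves, [2.5, 97.5], axis=0)
mean_curve = fake_msd.mean(axis=0)
print(lower[-1], mean_curve[-1], upper[-1])
```

The band could then be drawn with `plt.fill_between(np.arange(nframes), lower, upper, alpha=0.3)` on top of the average curve. Indexing with an array of row indices also replaces the inner loop over particles.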

In [None]:
plt.hist(Ds,bins=50)

Seems pretty Gaussian! But we don't have to do any error propagation.

In [None]:
stderr_boots = np.std(Ds)
print(stderr_boots)

So our estimate would be:

In [None]:
print(f"D={finalD:.4g} +/- {stderr_boots:.4g} cm^2/s")

So this is essentially (6.0 +/- 0.1) $\times 10^{-5}$ cm$^2$/s, which is close to the known diffusion coefficient of TIP3P water at 300 K of (5.8-6.0) $\times 10^{-5}$ cm$^2$/s; there are some conflicts depending on how you measure it. $D$ is known to be affected by the size of the periodic box, so one should really run the simulation with different system sizes and extrapolate to large system sizes.