# Day 2, Session 1, MD workshop
### Dr. Michael Shirts, CU Boulder

## How do we understand the information we get out of simulations?

In [None]:
# load up some modules.

import numpy as np
import matplotlib.pyplot as plt
import scipy
import scipy.stats as stats
import pandas as pd
%matplotlib inline

Let's load up some typical data from a simulation:

In [None]:
potential = np.loadtxt('potential.dat')

In [None]:
plt.plot(potential)
plt.xlabel("Frame of Simulation")
plt.ylabel("Potential Energy")
plt.show()

**Question:** What should I report as the average potential energy of this simulation?

**Question:** What is the uncertainty in this average, i.e. the standard error of the mean? Does this seem reasonable?

#### Reporting statistics of non-normal distributions

In [None]:
betaf = scipy.stats.beta(0.8,4)
x = np.linspace(0,2)
plt.plot(x,betaf.pdf(x))
plt.show()

In [None]:
num = 10000
samples = betaf.rvs(num)
plt.hist(samples,bins=50,density=True)
plt.show()

### Pause and do some calculations

* What would you report as the mean and standard deviation of an observation from this distribution?
* What might be a better way to report the behavior of this distribution?

**Potential Answers:**

In [None]:
mean = np.mean(samples)
print(f"mean = {mean:.4f}")

In [None]:
std = np.std(samples)
print(f"std = {std:.4f}")

In [None]:
# what is the 95% confidence interval of an observation from this distribution using the formula?
print(mean - 2*std)
print(mean + 2*std)

In [None]:
# better description
xlow = np.percentile(samples,2.5)
xhigh = np.percentile(samples,97.5)
print(f" 2.5% percentile is : {xlow:.4f}")
print(f"97.5% percentile is : {xhigh:.4f}")
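
One way to see why the `mean ± 2*std` rule describes this distribution poorly is to compare it with the exact quantiles. The following is a minimal sketch, assuming the same `beta(0.8, 4)` distribution used above; the names `beta_dist`, `formula_low`, `formula_high`, `exact_low` and `exact_high` are illustrative and not part of the original notebook.

```python
import scipy.stats as stats

# Same distribution as above, under a new (hypothetical) name so `betaf` is not overwritten.
beta_dist = stats.beta(0.8, 4)

# Normal-theory "95% interval" built from the exact mean and standard deviation.
formula_low = beta_dist.mean() - 2 * beta_dist.std()
formula_high = beta_dist.mean() + 2 * beta_dist.std()

# Exact 2.5% and 97.5% quantiles from the inverse CDF (percent point function).
exact_low, exact_high = beta_dist.ppf([0.025, 0.975])

print(f"formula interval: [{formula_low:.4f}, {formula_high:.4f}]")
print(f"exact interval  : [{exact_low:.4f}, {exact_high:.4f}]")
```

The lower bound from the formula is negative even though this distribution only produces values between 0 and 1, which is why percentiles are a better description for skewed data.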

### What about the error in the mean of this distribution?

In [None]:
def distribution_of_means(num,nrepeats):
    repeats = np.zeros(nrepeats)
    for i in range(nrepeats):
        repeats[i] = np.mean(betaf.rvs(num))
    return repeats

In [None]:
plt.hist(distribution_of_means(2,1000))
plt.show()

In [None]:
plt.hist(distribution_of_means(5,1000))
plt.show()

In [None]:
plt.hist(distribution_of_means(100,10000),bins=60)
plt.show()

In [None]:
# the formula for the standard error of the mean (std/sqrt(n)) now works.
print(mean - 2*std/np.sqrt(100))
print(mean + 2*std/np.sqrt(100))
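
The claim that the spread of the means follows `std/sqrt(n)` can be checked numerically. This is a rough sketch, assuming the same `beta(0.8, 4)` distribution and drawing the samples directly with NumPy's random generator; `n_per_mean`, `n_repeats`, `spread_of_means` and `predicted_sem` are names introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_per_mean, n_repeats = 100, 5000

# Each row is one "experiment" of n_per_mean draws from beta(0.8, 4).
draws = rng.beta(0.8, 4, size=(n_repeats, n_per_mean))

# Observed spread of the means across the repeated experiments.
spread_of_means = np.std(draws.mean(axis=1))

# Prediction from the standard-error formula, using the spread of single observations.
predicted_sem = np.std(draws) / np.sqrt(n_per_mean)

print(f"observed spread of means = {spread_of_means:.5f}")
print(f"std / sqrt(n)            = {predicted_sem:.5f}")
```

The two numbers should agree closely, even though individual observations are far from normally distributed.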

## Determining independent points: the autocorrelation time

Important: All of this below is only valid for a _stationary_ timeseries.

Pandas implements an autocorrelation function as a function of _lag_ ($\tau$)

In [None]:
start = 500
panda_pot = pd.Series(potential[start:]-np.mean(potential[start:]))
panda_pot.autocorr(3)

* We can use the `autocorr` function to calculate the autocorrelation function as a function of lag time $\tau$.
 * Can you use your estimate of the autocorrelation function to find the lag time $\tau$ at which the system becomes uncorrelated (i.e. the autocorrelation function decays to essentially zero), so that points spaced more than $\tau$ apart can be treated as independent?
 * **Note:** We also assume that the distribution is stationary, so we should only do this analysis on data collected after equilibration has occurred.
 * **Also note:** the formulas for autocorrelation assume that the average of the timeseries is zero, so you should subtract off the average first.
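
To make the definition concrete, the normalized autocorrelation function can also be written out by hand. The sketch below is an illustration, not the pandas implementation: `Series.autocorr` computes a Pearson correlation between the series and a shifted copy, which gives very similar but not identical numbers. The helper `autocorrelation` and the synthetic series `ar1` are assumptions made for this example. The test series is constructed so that its true autocorrelation is $0.9^\tau$.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Normalized autocorrelation C(lag) = <x_t x_{t+lag}> / <x_t^2> of a mean-subtracted series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()              # the formula assumes a zero-mean series
    variance = np.mean(x * x)
    acf_values = np.empty(max_lag)
    for lag in range(max_lag):
        # average product of points separated by `lag` frames
        acf_values[lag] = np.mean(x[: len(x) - lag] * x[lag:]) / variance
    return acf_values

# Synthetic correlated data: each point remembers 90% of the previous one.
rng = np.random.default_rng(0)
phi = 0.9
noise = rng.normal(size=20000)
ar1 = np.zeros(20000)
for t in range(1, len(ar1)):
    ar1[t] = phi * ar1[t - 1] + noise[t]

acf_demo = autocorrelation(ar1, 20)
print(acf_demo[:5])
```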
 

In [None]:
start = 500
stationary = pd.Series(potential[start:])-np.mean(potential[start:])
n_autocorr = len(stationary)  # number of frames in the stationary region
nlim = n_autocorr // 2  # only compute lags up to half the series length
acf = np.zeros(nlim)
for i in range(nlim):
    acf[i] = stationary.autocorr(i)

In [None]:
plt.plot(acf)
plt.show()

Or use the `pandas` utilities:

In [None]:
gh = pd.plotting.autocorrelation_plot(panda_pot)

**Question:** At what point does the ACF become essentially zero?

### Fitting ACF to an exponential to estimate correlation time.

In [None]:
from scipy.optimize import curve_fit

In [None]:
def exponential(x, r): #function f(x, r) = e^(r*x)
    #return np.e ** (r * x)
    return np.exp(r*x) 

In [None]:
#This function calculates tau by using an exponential fit.
#To do this it uses scipy.optimize curve_fit
#curve_fit will optimize any function to fit given data

def tau_calc(ac_data, function = exponential): #takes in data and a python function
    x_data = np.arange(len(ac_data))
    pars, cov = curve_fit(f=function, xdata=x_data, ydata=ac_data, p0=[0], bounds=(-np.inf, np.inf))
    #curve_fit returns an np.array of optimally fit parameters (pars) and their covariance (cov)
    #pars in this case holds the optimized decay rate r that fits the dataset
    return -1/pars #tau = -1/r
# curve_fit tutorial: https://towardsdatascience.com/basic-curve-fitting-of-scientific-data-with-python-9592244a2509

In [None]:
tau = tau_calc(acf)[0]
print(tau)

This $\tau$ is the lag at which the fitted autocorrelation function has decayed from 1 to $1/e$. To get effectively independent samples we want to go out to about $2\tau$ (there's some theory showing that this is far enough!)
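
As a sanity check on the fitting approach, one can fit a synthetic autocorrelation function whose decay time is known in advance. This is a minimal sketch under stated assumptions: `decay`, `tau_true`, `acf_synth` and `tau_fit` are hypothetical names, and the small Gaussian noise is added only to mimic a noisy estimate.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, rate):
    """Single exponential exp(rate * t); rate is negative for a decaying ACF."""
    return np.exp(rate * t)

tau_true = 8.0                      # known answer for this synthetic test
lags = np.arange(100)
rng = np.random.default_rng(2)
acf_synth = np.exp(-lags / tau_true) + rng.normal(scale=0.01, size=lags.size)

# Fit the decay rate, then convert it to a correlation time tau = -1/rate.
params, _ = curve_fit(decay, lags, acf_synth, p0=[-0.1])
tau_fit = -1.0 / params[0]

print(f"tau_true = {tau_true}, tau_fit = {tau_fit:.3f}")
print(f"fitted ACF at lag tau_fit = {decay(tau_fit, params[0]):.4f} (1/e = {np.exp(-1):.4f})")
```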

In [None]:
x_data = np.arange(100)
plt.plot(x_data,acf[0:100], label='autocorrelation') 
plt.plot(x_data, exponential(x_data, -1/tau), label = 'tau_calc estimation')
plt.legend()
plt.show()

*Key point*: If the autocorrelation data fits an exponential well, we can treat samples as uncorrelated if they are $2\tau$ apart.

In [None]:
start = 500
mean = np.mean(potential[start:])
print(mean)

In [None]:
nsamples = (len(potential)-start)/(2*tau)
print("Nsamples =", nsamples)
stderr_mean = np.std(potential[start:])/np.sqrt(nsamples)

What should we use instead of `nsamples` above?

In [None]:
print(f"Mean = {mean}, stderr_mean = {stderr_mean}")
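
The effect of dividing by the number of independent samples rather than the number of frames can be checked on synthetic data where the answer is known. The sketch below is an illustration under assumptions: it uses many independent copies of a correlated series (each point keeps a fraction `phi` of the previous one), so the true uncertainty of the mean can be measured directly as the spread of the per-series means. All names are introduced here and none come from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n_frames, n_series = 0.9, 5000, 200
tau_exact = -1.0 / np.log(phi)      # the ACF of this process is phi**lag = exp(-lag/tau_exact)

# Build n_series independent correlated timeseries, one per row.
noise = rng.normal(size=(n_series, n_frames))
series = np.zeros((n_series, n_frames))
for t in range(1, n_frames):
    series[:, t] = phi * series[:, t - 1] + noise[:, t]

# "True" uncertainty of the mean: spread of the means over independent repeats.
observed_sem = np.std(series.mean(axis=1))

sigma = np.std(series)
naive_sem = sigma / np.sqrt(n_frames)                        # treats every frame as independent
corrected_sem = sigma / np.sqrt(n_frames / (2 * tau_exact))  # one independent sample per 2*tau

print(f"observed  = {observed_sem:.4f}")
print(f"naive     = {naive_sem:.4f}")
print(f"corrected = {corrected_sem:.4f}")
```

The naive estimate should come out several times too small, while the $2\tau$ correction should land close to the observed spread.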

### Calculating thermophysical observables

What we are probably interested in is the average potential energy per molecule, not the total potential energy. 

We can estimate the heat of vaporization $H_{vap}$ by:
    
\begin{eqnarray}
H_{vap} &=& H_{gas}-H_{liquid} \\
        &=& U_{gas} + PV_{gas} - (U_{liquid} + PV_{liquid}) \\ 
        &=& \left(U_{gas} - U_{liquid}\right) + P \left(V_{gas}-V_{liquid}\right)
\end{eqnarray}

Since the kinetic energy of the liquid and the vapor is the same at a given temperature, the kinetic contributions cancel and this becomes

\begin{eqnarray}
U_{pot,gas} - U_{pot,liquid} + P(V_{gas}- V_{liquid})
\end{eqnarray}

If we assume the gas is ideal ($PV=nRT$) and has zero potential energy, $U_{pot,gas}=0$ (which is valid for rigid water models like TIP3P or SPC/E, since they have no intramolecular energy terms), then we get:

\begin{eqnarray}
       &=& - U_{pot,liquid} + P\left( \frac{nRT}{P} - V_{liquid}\right) \\
        &=& - U_{pot,liquid} + nRT - PV_{liquid} \\
\mathrm{H_{Molar}} &=& -\frac{U_{pot,liquid}}{N} + RT - \frac{PV_{liquid}}{N}
\end{eqnarray}

The data set potential.dat we have been playing with was from a simulation of TIP3P water with 900 molecules.

We note that the $\frac{PV_{liquid}}{N}$ term is almost zero away from the critical point, so we actually can ignore it in most cases.

In [None]:
(0.101)*(18.02)/(1000)  # P*V per mole of molecules, in kJ/mol: 1 atm x 18.02 g/mol x 1 L / 1000 g x 0.101 kJ / (L*atm)

So we are left with: $\mathrm{H_{Molar}} = -\frac{U_{pot,liquid}}{N}+RT$, where $U_{pot,liquid}$ is the *average* potential energy of the liquid state $\langle U \rangle$ we have been calculating above. 
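
The final formula and its uncertainty can be wrapped in a small helper. This is a sketch under stated assumptions: the potential energy is taken to be in kJ/mol for the whole box, $RT$ is treated as exact so the uncertainty is just the standard error of $\langle U \rangle$ divided by $N$, and the $PV_{liquid}/N$ term is neglected. The function name `molar_heat_of_vaporization` and the input numbers are hypothetical placeholders, not results from `potential.dat`.

```python
R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)

def molar_heat_of_vaporization(mean_potential, stderr_potential, n_molecules, temperature):
    """Return (H_vap, uncertainty) in kJ/mol from H = -<U>/N + RT.

    mean_potential and stderr_potential are for the whole simulation box (assumed kJ/mol).
    """
    h_vap = -mean_potential / n_molecules + R_KJ_PER_MOL_K * temperature
    # RT has no statistical error, so only the <U>/N term contributes.
    h_vap_err = stderr_potential / n_molecules
    return h_vap, h_vap_err

# Made-up inputs purely to show the call; replace with the mean and stderr computed earlier.
example_hvap, example_err = molar_heat_of_vaporization(
    mean_potential=-36000.0, stderr_potential=90.0, n_molecules=900, temperature=300.0
)
print(f"H_vap = {example_hvap:.3f} +/- {example_err:.3f} kJ/mol (placeholder inputs)")
```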

### Exercises

What is the $H_{vap}$ of water at 300 K predicted by this simulation? What is the uncertainty in the estimate? How does it compare to the experimental $H_{vap}$ of water, which is about 44.0 kJ/mol at 298 K (the often-quoted 40.7 kJ/mol is the value at the normal boiling point, 373 K)?

## Automated tool for equilibration detection and correlation

We can use scipy statistical tests to see if the means of two parts of the timeseries are within the uncertainties of each other.  

**Note**: We have to use uncorrelated samples, or it will erroneously say that they are NOT within uncertainties of each other.

A higher T-statistic means more difference between the data sets.

A lower p-value means that a difference in means this large would be unlikely to arise by chance if both segments shared the same underlying mean.

A p-value < 0.05 suggests that the two segments come from different distributions!
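
How the T-statistic and p-value behave can be seen on synthetic, uncorrelated data where the answer is known. This is a minimal sketch; the arrays `reference`, `same_mean` and `shifted_mean` are made up for illustration and have nothing to do with `potential.dat`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)
reference = rng.normal(loc=0.0, scale=1.0, size=200)
same_mean = rng.normal(loc=0.0, scale=1.0, size=200)      # same underlying distribution
shifted_mean = rng.normal(loc=1.0, scale=1.0, size=200)   # mean shifted by one standard deviation

# Two-sample t-test assuming equal variances, as in the exercise below.
t_same, p_same = stats.ttest_ind(reference, same_mean, equal_var=True)
t_shift, p_shift = stats.ttest_ind(reference, shifted_mean, equal_var=True)

print(f"same mean   : t = {t_same:.2f}, p = {p_same:.3g}")
print(f"shifted mean: t = {t_shift:.2f}, p = {p_shift:.3g}")
```

The shifted pair should give a much larger |t| and a tiny p-value, while the pair drawn from the same distribution will usually (though not always) give a p-value well above 0.05.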

### Exercise: Use this code to estimate, better than by "eyeballing", what fraction is equilibrated

In [None]:
start = 0
cut = 500
stop = 1000
per_ind_sample = 1  # the number of indices between independent samples
# compare the means of the two segments with a two-sample t-test
scipy.stats.ttest_ind(potential[start:cut:per_ind_sample],potential[cut:stop:per_ind_sample],equal_var=True)


Definitely not the same distribution! At what point might the two segments become statistically indistinguishable?

There are some additional tools, for example the equilibration detection in `pymbar.timeseries`, that can automate this more.

In [None]:
from pymbar import timeseries

In [None]:
[t0, g, Nindep] = timeseries.detect_equilibration(potential)
# t0 is the first frame detected as belonging to the stationary region.
# g is the estimate of the statistical inefficiency (g = 1 + 2*tau, so approximately 2*tau)
# Nindep is an estimate of the _effective_ number of uncorrelated samples after t0.

In [None]:
print('Equilibration time =',t0)
print('Statistical inefficiency = ',g)
print('Number of independent samples =', Nindep)
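
Once an equilibration time and a statistical inefficiency are known, one simple way to obtain approximately uncorrelated frames is to discard everything before `t0` and keep one frame every `g` frames. The sketch below does this in plain NumPy; the helper `subsample_indices` and the example numbers are hypothetical and are not taken from pymbar or from this dataset.

```python
import numpy as np

def subsample_indices(n_frames, t0, g):
    """Indices of approximately uncorrelated frames: start at t0, step by ceil(g)."""
    stride = int(np.ceil(g))  # round up so the spacing is never shorter than g
    return np.arange(t0, n_frames, stride)

# Made-up values, only to show the call.
example_indices = subsample_indices(n_frames=1000, t0=200, g=7.3)
print(len(example_indices), example_indices[:5])
```

With the real data, something like `potential[subsample_indices(len(potential), t0, g)]` would give a subsampled series suitable for the t-test above.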