# Resampling Statistics Examples

Although the resampling statistics module is officially part of OpenPathSampling, it is designed to be much more generally useful. It accepts any function that returns `pandas.DataFrame` objects, applies that function to resampled inputs, and performs statistical analyses on the results.

This example creates simple random data from a Gaussian distribution and turns it into pandas results.

In [1]:
import openpathsampling as paths
import numpy as np
import pandas as pd

In [2]:
# 1000 samples, 4 numbers each, from random Gaussians
samples = np.random.normal(size=(1000, 4))
samples

array([[-1.37007399, -1.26796115, -0.46762326, -1.00347473],
       [-0.8272092 , -1.60356643,  0.38683253, -0.2099861 ],
       [ 1.35893117,  0.53878751,  1.70979444,  0.6944104 ],
       ..., 
       [ 0.46566366,  0.68657575, -1.45255225,  0.08287279],
       [ 0.29573195, -0.40666717,  0.11274542, -0.12782826],
       [ 0.36022961,  0.13793928, -0.23933477, -0.78715603]])

In [3]:
def make_df(sample_list):
    """Create a dataframe from these samples.
    
    The dataframe is made up of the column means of the samples, scaled by
    `sigmas` and shifted by `mus`, arranged as a 2x2 table.
    
    Parameters
    ----------
    sample_list: np.array, shape (n,4)
        the samples for elements of the DataFrame; the n entries will be averaged over
    
    Returns
    -------
    pd.DataFrame
        dataframe with the transformed values
    """
    sigmas = np.array([1.0, 2.0, 3.0, 4.0])
    mus = np.array([-1.0, 0.0, 4.0, 4.5])
    sample = sample_list.mean(axis=0)
    results = np.multiply(sigmas, sample) + mus
    results = results.reshape(2, 2)
    return pd.DataFrame(results, index=['A', 'B'], columns=['C', 'D'])

First, we must create the resampler. In this case, we'll sample by blocks. Since each sample gives an independent estimate of the resulting dataframe, we can set `n_per_block` as low as 1.

In [4]:
blocks = paths.numerics.BlockResampling(all_samples=samples, n_per_block=1)

In [5]:
full_stats = paths.numerics.ResamplingStatistics(
    function=make_df,
    inputs=blocks.blocks
)
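
Conceptually, the two cells above split the samples into blocks, apply the function to each block, and summarize the resulting DataFrames element by element. The following is a minimal sketch of that idea in plain NumPy and pandas. It is not the OpenPathSampling implementation: `split_into_blocks`, `resampled_stats`, and `block_mean_df` are hypothetical names introduced for illustration, and details such as how leftover samples are handled or which `ddof` is used for the standard deviation are assumptions.

```python
import numpy as np
import pandas as pd


def split_into_blocks(all_samples, n_per_block):
    """Split samples into consecutive, non-overlapping blocks (hypothetical helper)."""
    n_blocks = len(all_samples) // n_per_block
    # Assumption: trailing samples that do not fill a complete block are dropped.
    usable = all_samples[:n_blocks * n_per_block]
    return [usable[i * n_per_block:(i + 1) * n_per_block]
            for i in range(n_blocks)]


def resampled_stats(function, inputs, percentile=95):
    """Apply `function` to every input and summarize the results element-wise."""
    frames = [function(block) for block in inputs]
    # Stack the DataFrames into an array of shape (n_inputs, n_rows, n_cols).
    stacked = np.array([df.values for df in frames])
    index, columns = frames[0].index, frames[0].columns

    def wrap(arr):
        return pd.DataFrame(arr, index=index, columns=columns)

    return {
        'mean': wrap(stacked.mean(axis=0)),
        'std': wrap(stacked.std(axis=0)),  # assumption: population std (ddof=0)
        'percentile': wrap(np.percentile(stacked, percentile, axis=0)),
    }


def block_mean_df(block):
    # Toy function: column means of an (n, 4) block arranged as a 2x2 DataFrame.
    return pd.DataFrame(block.mean(axis=0).reshape(2, 2),
                        index=['A', 'B'], columns=['C', 'D'])


rng = np.random.default_rng(0)
toy_samples = rng.normal(size=(1000, 4))
toy_blocks = split_into_blocks(toy_samples, n_per_block=1)
toy_stats = resampled_stats(block_mean_df, toy_blocks)
print(toy_stats['mean'])
print(toy_stats['std'])
```

Each statistic comes back as a DataFrame with the same index and columns as the function's output, which mirrors how the `mean`, `std`, and `percentile` results appear in this example.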

In [6]:
full_stats.mean

|   | C | D |
|---|---|---|
| A | -0.951078 | -0.114389 |
| B | 3.975997 | 4.516062 |


In [7]:
full_stats.std

|   | C | D |
|---|---|---|
| A | 0.947827 | 2.008178 |
| B | 3.132182 | 4.038275 |


In [8]:
full_stats.percentile(95)

|   | C | D |
|---|---|---|
| A | 0.691092 | 3.32293 |
| B | 9.12938 | 11.0369 |


In [9]:
full_stats.percentile(50)

|   | C | D |
|---|---|---|
| A | -0.961511 | -0.163845 |
| B | 3.9444 | 4.53635 |


In many other examples, such as calculating rates with transition interface sampling (TIS), we only get a reasonable result for the dataframe after many samples have been collected. Here, we take the same data as before but pre-average it with `n_blocks=100` (100 different blocks, each of 10 samples). Note that the mean stays the same, but the other numbers change: the standard deviation drops by roughly a factor of 3 (about the square root of the 10 samples per block), and the more extreme percentiles move much closer to the mean.

In [10]:
blocks = paths.numerics.BlockResampling(all_samples=samples, n_blocks=100)

In [11]:
full_stats = paths.numerics.ResamplingStatistics(
    function=make_df,
    inputs=blocks.blocks
)

In [12]:
full_stats.mean

|   | C | D |
|---|---|---|
| A | -0.951078 | -0.114389 |
| B | 3.975997 | 4.516062 |


In [13]:
full_stats.std

|   | C | D |
|---|---|---|
| A | 0.298514 | 0.661671 |
| B | 0.923524 | 1.069796 |


In [14]:
full_stats.percentile(95)

|   | C | D |
|---|---|---|
| A | -0.401616 | 0.991962 |
| B | 5.51743 | 6.35847 |


In [15]:
full_stats.percentile(50)

|   | C | D |
|---|---|---|
| A | -0.957254 | -0.158376 |
| B | 4.01118 | 4.47487 |
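
The change in standard deviation between the two runs follows from the usual behavior of averages: for independent samples with standard deviation sigma, the mean of a block of n samples has standard deviation of about sigma divided by the square root of n, while the overall mean is unchanged when all blocks have the same size. The sketch below checks this with plain NumPy on synthetic data. It does not use the OpenPathSampling API; the sample size, seed, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0
data = rng.normal(loc=0.0, scale=sigma, size=100_000)

observed_stds = {}
block_mean_averages = {}
for n_per_block in (1, 10, 100):
    n_blocks = len(data) // n_per_block
    # Reshape into (n_blocks, n_per_block) and average within each block.
    block_means = data[:n_blocks * n_per_block].reshape(
        n_blocks, n_per_block).mean(axis=1)
    observed_stds[n_per_block] = block_means.std()
    block_mean_averages[n_per_block] = block_means.mean()
    expected = sigma / np.sqrt(n_per_block)
    print(f"n_per_block={n_per_block:4d}  "
          f"std of block means={observed_stds[n_per_block]:.3f}  "
          f"sigma/sqrt(n)={expected:.3f}")
```

The same reasoning explains why the 95th percentile moves toward the mean when the data are pre-averaged: the distribution of block estimates narrows, so its tails sit closer to the center.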
