In [1]:
%matplotlib notebook

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

## Rationale

Some loss functions, such as the negative evidence lower bound (NELBO) in variational inference, are generally analytically intractable and thus unavailable in closed form. In such cases we may need to resort to stochastic estimates of the loss function. It is then important to study and understand the robustness of these estimates, particularly in terms of bias and variance. When proposing a new estimator, we may be interested in evaluating the loss at a fine-grained level: not only per batch, but perhaps even per data-point.

This notebook explores storing the recorded losses in Pandas DataFrames. The recorded losses are 3-dimensional, with dimensions corresponding to epochs, batches, and data-points; specifically, they have shape `(n_epochs, n_batches, batch_size)`. Instead of using the deprecated Panel functionality from Pandas, we explore the preferred MultiIndex DataFrame.
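To make the shape concrete, here is a minimal sketch of how such an array might be filled during training. The function `per_example_loss`, the toy `data` array, and the small sizes are hypothetical stand-ins and not part of this notebook; a real training script would record whatever per-data-point estimate it computes.

```python
import numpy as np

n_epochs, n_batches, batch_size = 3, 4, 5
rng = np.random.default_rng(0)

def per_example_loss(batch):
    # hypothetical stand-in for a stochastic per-data-point loss estimate
    return batch ** 2 + rng.standard_normal(batch.shape)

# toy dataset, already split into batches
data = rng.standard_normal((n_batches, batch_size))

# one slot per (epoch, batch, data-point)
recorded = np.empty((n_epochs, n_batches, batch_size))
for epoch in range(n_epochs):
    for batch_idx in range(n_batches):
        recorded[epoch, batch_idx] = per_example_loss(data[batch_idx])

print(recorded.shape)
```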

Lastly, we play around with some of the data serialization formats that Pandas supports out of the box. This is useful when training is GPU-intensive: the script runs and records the loss remotely on a supercomputer, so we must write the results to file, download them, and finally analyze them locally. This is usually trivial, but it is unclear how it behaves for more complex MultiIndex DataFrames. We restrict our attention to two formats. CSV is human-friendly but very slow and inefficient. HDF5 is diametrically opposed: it is essentially inscrutable to a human reader, but it is very fast and takes up less space.

### Synthetic Data

In [3]:
# create some noise
a = np.random.randn(50, 600, 100)
a.shape

(50, 600, 100)

In [4]:
# create some noise with higher variance and add bias.
b = 2. * np.random.randn(*a.shape) + 1.
b.shape

(50, 600, 100)

In [5]:
# manufacture some loss function
# there are n_epochs * n_batches * batch_size
# recorded values of the loss
loss = 10 / np.linspace(1, 100, a.size)
loss.shape

(3000000,)

### MultiIndex Dataframe

In [6]:
# we will create the indices from the
# product of these iterables
list(map(range, a.shape))

[range(0, 50), range(0, 600), range(0, 100)]

In [7]:
# create the MultiIndex
index = pd.MultiIndex.from_product(
    list(map(range, a.shape)), 
    names=['epoch', 'batch', 'datapoint']
)

In [8]:
# create the dataframe that records the two losses
df = pd.DataFrame(
    dict(loss1=loss+np.ravel(a), 
         loss2=loss+np.ravel(b)), 
    index=index
)
df

                           loss1      loss2
epoch batch datapoint
0     0     0          10.837250  10.228649
            1           9.383650   9.601012
            2           9.102928  12.792865
            3           9.149701  11.307185
            4           9.181607   9.905578
            5           8.984361  11.646015
            6           8.935352  10.793933
            7           9.273609   9.421425
            8          10.846009   9.916008
            9          10.288851   7.250876
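Once the losses live in a MultiIndex DataFrame, we can pull out a single batch or a single data-point by label, and we can also get back to the original 3d array. The sketch below uses a small stand-in frame named `small` (an assumption, so that the example runs on its own) rather than the full `df`. The reshape at the end relies on the rows still being in the row-major order produced by `from_product`.

```python
import numpy as np
import pandas as pd

# small stand-in for the notebook's (50, 600, 100) array
shape = (3, 4, 5)  # (n_epochs, n_batches, batch_size)
rng = np.random.default_rng(0)
values = rng.standard_normal(shape)

index = pd.MultiIndex.from_product(
    [range(n) for n in shape],
    names=['epoch', 'batch', 'datapoint']
)
small = pd.DataFrame({'loss1': values.ravel()}, index=index)

# all data-points of batch 1 in epoch 0 (partial key on the first two levels)
one_batch = small.loc[(0, 1), :]

# data-point 2 of every batch in every epoch (cross-section on an inner level)
one_datapoint = small.xs(2, level='datapoint')

# recover the 3d array; assumes the rows are still in row-major order
restored = small['loss1'].to_numpy().reshape(shape)
```

Here `one_batch` is indexed by `datapoint` only, and `one_datapoint` is indexed by `(epoch, batch)`.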


### Visualization

In this contrived scenario, `loss2` is more biased and has higher variance.

In [9]:
# some basic plotting
fig, ax = plt.subplots()

df.groupby(['epoch', 'batch']).mean().plot(ax=ax)

plt.show()

[Figure: `loss1` and `loss2`, each averaged over the data-points of a batch, plotted against the `(epoch, batch)` index]
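The claim that `loss2` is more biased and has higher variance can also be checked numerically rather than by eye. Because the data is synthetic, the noiseless curve is known, so we can subtract it and summarize the error per epoch. The sketch below rebuilds a smaller version of the data under the names `small` and `true_loss` (assumptions, chosen so the example is self-contained); with real losses the true value is usually unknown and only the variance could be estimated this way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
shape = (5, 20, 50)  # (n_epochs, n_batches, batch_size)
size = int(np.prod(shape))

index = pd.MultiIndex.from_product(
    [range(n) for n in shape],
    names=['epoch', 'batch', 'datapoint']
)

# noiseless loss curve, as in the synthetic data section
true_loss = 10 / np.linspace(1, 100, size)

small = pd.DataFrame(
    {
        'loss1': true_loss + rng.standard_normal(size),
        'loss2': true_loss + 2.0 * rng.standard_normal(size) + 1.0,
    },
    index=index,
)

# error of each estimate relative to the noiseless curve
error = small.sub(true_loss, axis=0)

# empirical bias (mean error) and variance of the error, per epoch
per_epoch = error.groupby(level='epoch').agg(['mean', 'var'])
print(per_epoch)
```

The `mean` columns estimate the bias (about 0 for `loss1`, about 1 for `loss2`), and the `var` columns estimate the noise variance (about 1 and about 4 respectively).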

### CSV Read/Write

In [10]:
%%time

df.to_csv('losses.csv')

CPU times: user 9.56 s, sys: 184 ms, total: 9.74 s
Wall time: 13.3 s


In [11]:
!ls -lh losses.csv

-rwxrwxrwx 1 tiao tiao 138M Nov  8 03:14 losses.csv


In [12]:
%%time

df_from_csv = pd.read_csv('losses.csv', index_col=['epoch', 'batch', 'datapoint'], float_precision='high')


CPU times: user 1.47 s, sys: 108 ms, total: 1.58 s
Wall time: 3.73 s


In [13]:
# does not recover exactly due to insufficient floating point precision
df_from_csv.equals(df)

False

In [14]:
# but it has recovered it up to some tiny epsilon
((df-df_from_csv)**2 < 1e-25).all()

loss1    True
loss2    True
dtype: bool
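If an exact CSV round trip matters, one option worth trying is the `float_precision='round_trip'` parser that `pd.read_csv` documents alongside `'high'`. The sketch below is an untimed experiment on a small stand-in frame named `small` (an assumption); we have not measured how much slower this parser is on the full 3,000,000-row file, and whether the result is bit-exact should be checked rather than assumed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
shape = (2, 3, 4)  # (n_epochs, n_batches, batch_size)
names = ['epoch', 'batch', 'datapoint']

index = pd.MultiIndex.from_product([range(n) for n in shape], names=names)
small = pd.DataFrame({'loss1': rng.standard_normal(len(index))}, index=index)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'losses_small.csv')
    small.to_csv(path)
    # 'round_trip' asks pandas for its round-trip float converter
    restored = pd.read_csv(path, index_col=names, float_precision='round_trip')

# largest absolute discrepancy after the round trip
max_abs_diff = (small - restored).abs().to_numpy().max()
print(max_abs_diff, restored.equals(small))
```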

### HDF5 Read/Write

HDF5 writing is much faster: well under a second of wall time, compared with over 13 seconds for CSV.

In [15]:
%%time

df.to_hdf('store.h5', key='losses')

CPU times: user 44 ms, sys: 72 ms, total: 116 ms
Wall time: 720 ms


Furthermore, the file is less than half the size of the CSV (58M versus 138M).

In [16]:
!ls -lh store.h5

-rwxrwxrwx 1 tiao tiao 58M Nov  8 03:15 store.h5


In [17]:
%%time

df_from_hdf = pd.read_hdf('store.h5', key='losses')

CPU times: user 28 ms, sys: 28 ms, total: 56 ms
Wall time: 105 ms


Lastly, the round trip is numerically exact, since HDF5 stores the floating-point values in binary rather than as text.

In [18]:
df.equals(df_from_hdf)

True