## Why do we need this? 


We are proposing a new file format to solve several problems. Specifically:

* Saving each conformation as an individual file produces too many files.
* Pickle-based approaches (joblib) make the format Python-specific and not backwards compatible; text formats are clumsy.
* It would be nice to save metadata, such as the starting conformation, forces, or initial parameters.


Are there alternatives? I considered the MDTraj-compatible binary formats. They are PDB-centered and usually store one file per trajectory, so using them here looked hacky.

### One file vs. many files vs. several files

Saving each conformation as an individual file is undesirable because it produces too many files: a filesystem check or backup of 30,000,000 files takes hours or days.

Saving the whole trajectory as a single file is undesirable because (1) backup software will back up a new copy of the file every day as it grows, and (2) if the last write fails, the file will end up in a corrupted state and will need to be recovered.

The solution is to save groups of conformations as individual files: e.g., save conformations 1-50 as one file, conformations 51-100 as a second file, etc.

This way, we do not risk losing anything if the power goes out at the end, we do not interfere with backup solutions, and we also have partial trajectories that can be analyzed.
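The grouping can be sketched in a few lines of plain Python. This is only an illustration, not the polychrom implementation: the helper names `block_file_name` and `group_blocks` are hypothetical, and the `blocks_X-Y.h5` naming pattern and the zero-based numbering are taken from the example listing later in this notebook.

```python
def block_file_name(first_block, last_block):
    """Name of the file holding blocks first_block..last_block (inclusive)."""
    return f"blocks_{first_block}-{last_block}.h5"


def group_blocks(n_blocks, blocks_per_file, start=0):
    """Split block indices into consecutive groups, one group per file.

    Returns a list of (file_name, [block indices]) pairs. The last group
    may be shorter if n_blocks is not a multiple of blocks_per_file.
    """
    groups = []
    stop = start + n_blocks
    for first in range(start, stop, blocks_per_file):
        last = min(first + blocks_per_file, stop) - 1
        groups.append((block_file_name(first, last), list(range(first, last + 1))))
    return groups


# 19 blocks, 5 blocks per file: three full files and one partial file
for name, blocks in group_blocks(19, 5):
    print(name, blocks)
```

A larger group size means fewer files but more conformations at risk if a run dies before the current group is written.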


### Storage format - what did I choose? 

I chose HDF5-based storage that roughly mimics the MDTraj HDF5 format. It does not include the MDTraj topology because that seemed a little too complicated and is not compatible with nglview anyway. Maybe one day we will write something that fully converts it to MDTraj if necessary.


### Overall design of the format


I decided to separate two entities: a simulation object and a reporter. When a simulation object is initialized, a reporter (actually, a list of reporters, in case you want to use several) is passed to it. The simulation object attempts to save several things: `__init__` arguments, starting conformation, energy minimization results (TODO), serialized forces, and blocks of conformations together with time, Ek, and Ep.

Each time a simulation object wants to save something, it calls `reporter.report(...)` for each of the reporters. It passes a string indicating what is being reported and a dictionary to save. The reporter has to interpret this and save the data. The reporter also keeps the appropriate counts. NOTE: generic Python objects are not supported. Values have to be HDF5-compatible, meaning an array of numbers/strings, or a single number/string.

The HDF5 reporter used here saves everything into HDF5 files. Anything other than a conformation is saved immediately into a single HDF5 file: numpy-array-compatible structures are saved as datasets, and regular types (strings, numbers) are saved as attributes. Conformations are held until a certain number of them has been received. The reporter then saves them all at once into one HDF5 file, under groups /1, /2, /3, ..., /50 for blocks 1, 2, 3, ..., 50 respectively, and names that file `blocks_1-50.h5`.
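The reporting protocol described above can be sketched with a toy, in-memory reporter. This is a rough illustration under stated assumptions, not the real `HDF5Reporter`: the class `ToyReporter`, the record name `"data"` for conformations, and the `saved`/`files` attributes are all hypothetical, and nothing is written to disk.

```python
class ToyReporter:
    """In-memory stand-in for a reporter (hypothetical, writes no files)."""

    def __init__(self, max_data_length=50):
        self.max_data_length = max_data_length
        self.block_counter = 0   # running count of conformations received
        self.buffer = []         # conformations waiting to be written
        self.saved = []          # (name, values) records written immediately
        self.files = {}          # block "file name" -> {group name: record}

    def report(self, name, values):
        """Receive a record name and a dictionary of values to save."""
        if name != "data":       # "data" is an assumed name for conformations
            # anything other than a conformation is saved right away
            self.saved.append((name, dict(values)))
            return
        # conformations are buffered and written in groups
        self.buffer.append(dict(values, block=self.block_counter))
        self.block_counter += 1
        if len(self.buffer) >= self.max_data_length:
            self.dump_data()

    def dump_data(self):
        """Write out whatever conformations are still in the buffer."""
        if not self.buffer:
            return
        first, last = self.buffer[0]["block"], self.buffer[-1]["block"]
        self.files[f"blocks_{first}-{last}.h5"] = {
            str(record["block"]): record for record in self.buffer
        }
        self.buffer = []


reporter = ToyReporter(max_data_length=5)
reporter.report("initArgs", {"N": 1000})
for _ in range(19):
    reporter.report("data", {"pos": [[0.0, 0.0, 0.0]]})
reporter.dump_data()             # flush the final, partial group
print(sorted(reporter.files))
```

The explicit `dump_data()` call at the end matters: without it, the last partial group of conformations would stay in the buffer and never be written.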


### Multi-stage simulations or loop extrusion

We frequently have simulations in which a simulation object changes. One example would be changing forces or parameters throughout the simulation. Another example would be loop extrusion simulations. 

In this design, a reporter object can be reused and passed to a new simulation. It keeps its counter of conformations, and also saves the applied forces etc. again. The reporter creates a file `applied_forces_0.h5` the first time it receives forces, and `applied_forces_1.h5` the second time it receives forces from a simulation.
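The numbering of repeated records can be illustrated with one counter per record name that outlives any single simulation. This is a minimal sketch of the idea only: `FileNamer` and `next_name` are hypothetical names, and the real reporter may keep its counts differently.

```python
from collections import defaultdict


class FileNamer:
    """Hands out numbered file names, one counter per record name."""

    def __init__(self):
        self.counts = defaultdict(int)

    def next_name(self, record_name):
        index = self.counts[record_name]
        self.counts[record_name] += 1
        return f"{record_name}_{index}.h5"


namer = FileNamer()  # plays the role of the reused reporter

# first simulation stage reports its arguments and forces
first_stage = [namer.next_name(n) for n in ("initArgs", "applied_forces")]
# second stage reuses the same object, so the numbering continues
second_stage = [namer.next_name(n) for n in ("initArgs", "applied_forces")]

print(first_stage)
print(second_stage)
```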


### URIs to identify individual conformations

Because we save several conformations into one file, we designed a URI format to quickly fetch a conformation by a unique identifier.

URIs look like this: `/path/to/the/trajectory/blocks_1-50.h5::42`

This URI fetches block #42 from the file `blocks_1-50.h5`, which contains blocks 1 through 50 inclusive.
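As a rough sketch of how such a URI can be taken apart (this is not the code used by `load_URI`; the helpers `parse_uri` and `file_block_range` are hypothetical), the block number follows the last `::` and the inclusive block range can be read from the file name:

```python
import os
import re


def parse_uri(uri):
    """Split 'path/blocks_X-Y.h5::N' into (file path, block number N)."""
    path, separator, block = uri.rpartition("::")
    if not separator:
        raise ValueError(f"not a block URI: {uri!r}")
    return path, int(block)


def file_block_range(path):
    """Read the inclusive block range (X, Y) from a 'blocks_X-Y.h5' name."""
    match = re.fullmatch(r"blocks_(\d+)-(\d+)\.h5", os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected file name: {path!r}")
    return int(match.group(1)), int(match.group(2))


path, block = parse_uri("/path/to/the/trajectory/blocks_1-50.h5::42")
first, last = file_block_range(path)
print(path, block, first <= block <= last)
```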

`polymerutils.load` is compatible with URIs.

Also, to make it easy to load both old-style filenames and new-style URIs, there is a function `polychrom.polymerutils.fetch_block`.

`fetch_block` automatically determines the type of a trajectory folder,

so it will fetch both `/path/to/the/trajectory/block42.dat` and `/path/to/the/trajectory/blocks_x-y.h5::42` automatically.
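One way such auto-detection could work is sketched below. It is an assumption about the general approach, not the actual `fetch_block` code: the helper `resolve_block` is hypothetical and only resolves a block ID to an old-style path or a new-style URI, without loading any data.

```python
import os
import re


def resolve_block(folder, block):
    """Return the old-style path or the new-style URI for `block` in `folder`."""
    names = os.listdir(folder)
    old_style = f"block{block}.dat"
    if old_style in names:
        return os.path.join(folder, old_style)
    for name in names:
        match = re.fullmatch(r"blocks_(\d+)-(\d+)\.h5", name)
        # the range in the file name is inclusive on both ends
        if match and int(match.group(1)) <= block <= int(match.group(2)):
            return os.path.join(folder, name) + f"::{block}"
    raise FileNotFoundError(f"block {block} not found in {folder}")
```

The result can then be passed to a loader that understands both forms, which is the role `polymerutils.load` plays in this notebook.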


In [1]:
import polychrom
import numpy as np 
import warnings
import h5py 
import glob
from polychrom.simulation import Simulation
import polychrom.starting_conformations
import polychrom.forces, polychrom.forcekits
import simtk.openmm 
import os 
import shutil
import polychrom.polymerutils

### Loading the reporter and utils from the hdf5_format module


In [2]:
from polychrom.hdf5_format import HDF5Reporter, list_URIs, load_URI, load_hdf5_file, save_hdf5_file

### Making a simulation and passing a reporter 

In [3]:
%rm  test/*
data = polychrom.starting_conformations.grow_cubic(1000,30)

"""
Here we create an HDF5Reporter attached to the folder "test", and we save 5 blocks per file
(you should probably use 50 or 100 here; 5 is just for a showcase)
"""
reporter = HDF5Reporter(folder="test", max_data_length=5)


"""
Passing a reporter to the simulation object - many reporters are possible, and more will be added in the future
"""
sim = Simulation(N=1000, error_tol=0.001, collision_rate=0.01, integrator ="variableLangevin", platform="CPU", 
                reporters=[reporter])
sim.set_data(data)
sim.add_force(polychrom.forcekits.polymer_chains(sim))
sim._apply_forces()
sim.add_force(polychrom.forces.spherical_confinement(sim, density=0.1))


for i in range(19):        
    """
    Here we pass two extra records: a string and an array-like object.
    The first becomes an attr, and the second becomes an HDF5 dataset
    """
    sim.do_block(10, save_extras={"eggs": "I don't eat green eggs and ham!!!", "spam":[1,2,3]})

"""
Here we are not forgetting to dump the last set of blocks that the reporter has. 
We have to do it at the end of every simulation. 

I tried adding it to the destructor to make it automatic,
but some weird interactions with garbage collection made it not very useable. 
"""
reporter.dump_data()




INFO:root:adding force harmonic_bonds 0
INFO:root:adding force angle 1
INFO:root:adding force polynomial_repulsive 2


Exclude neighbouring chain particles from polynomial_repulsive
Number of exceptions: 999


INFO:root:Particles loaded. Potential energy is 0.050027
INFO:root:block    0 pos[1]=[13.9 13.9 14.2] dr=0.21 t=0.9ps kin=7.76 pot=3.37 Rg=5.569 dt=25.1fs dx=15.63pm 
INFO:root:block    1 pos[1]=[13.7 13.9 14.2] dr=0.16 t=1.2ps kin=3.28 pot=8.40 Rg=5.568 dt=20.9fs dx=8.45pm 
INFO:root:block    2 pos[1]=[13.7 13.9 14.2] dr=0.09 t=1.4ps kin=7.76 pot=3.95 Rg=5.570 dt=24.7fs dx=15.38pm 
INFO:root:block    3 pos[1]=[13.8 13.9 14.3] dr=0.14 t=1.6ps kin=5.44 pot=6.29 Rg=5.576 dt=22.4fs dx=11.68pm 
INFO:root:block    4 pos[1]=[13.7 13.9 14.3] dr=0.11 t=1.8ps kin=6.41 pot=5.29 Rg=5.583 dt=22.2fs dx=12.55pm 
INFO:root:block    5 pos[1]=[13.6 14.0 14.3] dr=0.12 t=2.1ps kin=6.33 pot=5.34 Rg=5.589 dt=22.2fs dx=12.47pm 
INFO:root:block    6 pos[1]=[13.6 14.0 14.2] dr=0.12 t=2.3ps kin=6.32 pot=5.32 Rg=5.597 dt=22.2fs dx=12.47pm 
INFO:root:block    7 pos[1]=[13.7 14.1 14.2] dr=0.12 t=2.5ps kin=6.55 pot=5.06 Rg=5.606 dt=22.2fs dx=12.69pm 
INFO:root:block    8 pos[1]=[13.8 14.1 14.2] dr=0.12 t=2.7ps kin

### This is a list of files created in the trajectory folder

In [4]:
!ls -la test

total 820
drwxrwxr-x 2 magus magus   4096 Oct  1 14:23 .
drwxrwxr-x 5 magus magus   4096 Oct  1 14:22 ..
-rw-rw-r-- 1 magus magus 184664 Oct  1 14:23 applied_forces_0.h5
-rw-rw-r-- 1 magus magus 153099 Oct  1 14:23 blocks_0-4.h5
-rw-rw-r-- 1 magus magus 153237 Oct  1 14:23 blocks_10-14.h5
-rw-rw-r-- 1 magus magus 124070 Oct  1 14:23 blocks_15-18.h5
-rw-rw-r-- 1 magus magus 153268 Oct  1 14:23 blocks_5-9.h5
-rw-rw-r-- 1 magus magus  13829 Oct  1 14:23 forcekit_polymer_chains_0.h5
-rw-rw-r-- 1 magus magus   6144 Oct  1 14:23 initArgs_0.h5
-rw-rw-r-- 1 magus magus  21411 Oct  1 14:23 starting_conformation_0.h5


In [5]:
files = list_URIs("test")
files   #  these are the URIs for individual blocks

['test/blocks_0-4.h5::0',
 'test/blocks_0-4.h5::1',
 'test/blocks_0-4.h5::2',
 'test/blocks_0-4.h5::3',
 'test/blocks_0-4.h5::4',
 'test/blocks_5-9.h5::5',
 'test/blocks_5-9.h5::6',
 'test/blocks_5-9.h5::7',
 'test/blocks_5-9.h5::8',
 'test/blocks_5-9.h5::9',
 'test/blocks_10-14.h5::10',
 'test/blocks_10-14.h5::11',
 'test/blocks_10-14.h5::12',
 'test/blocks_10-14.h5::13',
 'test/blocks_10-14.h5::14',
 'test/blocks_15-18.h5::15',
 'test/blocks_15-18.h5::16',
 'test/blocks_15-18.h5::17',
 'test/blocks_15-18.h5::18']

In [6]:
"""
Loading the entire block by URI, with positions and other information.
For that, use polychrom.hdf5_format.load_URI

Note how our custom-added eggs and spam appear below. 

"""
load_URI(files[0])

{'pos': array([[13.89941519, 13.87363803, 14.18118241],
        [12.84842772, 13.98745955, 13.87544873],
        [12.74081008, 14.91348138, 14.10118727],
        ...,
        [15.23206933, 14.054661  , 14.05126287],
        [15.22749451, 14.10276508, 13.14834981],
        [14.14917856, 14.30562453, 13.11945902]]),
 'spam': array([1, 2, 3]),
 'block': 0,
 'eggs': "I don't eat green eggs and ham!!!",
 'kineticEnergy': 7.76455312043136,
 'potentialEnergy': 3.36757032285351,
 'time': 0.8944524415343728}

In [7]:
"""
It is backwards compatible with polymerutils.load as well, and it gives you just the XYZ
"""
polychrom.polymerutils.load(files[0])

array([[13.89941519, 13.87363803, 14.18118241],
       [12.84842772, 13.98745955, 13.87544873],
       [12.74081008, 14.91348138, 14.10118727],
       ...,
       [15.23206933, 14.054661  , 14.05126287],
       [15.22749451, 14.10276508, 13.14834981],
       [14.14917856, 14.30562453, 13.11945902]])

In [8]:
"""
There is also a universal function "fetch_block"
It can fetch both old-style filenames and new-style URIs just by block ID
"""

polychrom.polymerutils.fetch_block("test",2)

array([[13.71706458, 13.85991115, 14.24420638],
       [12.90186561, 13.9731532 , 13.85780053],
       [12.65247079, 14.94789949, 14.08068827],
       ...,
       [15.20566692, 14.05158885, 14.1800073 ],
       [15.25055965, 14.13950146, 13.08552713],
       [14.3905365 , 14.39328047, 13.25072922]])

In [9]:
"""
By default it fetches XYZ only, but can do full output
(of course in the old-style filenames there is no full output so default is False)
"""

polychrom.polymerutils.fetch_block("test",2, full_output=True)

{'pos': array([[13.71706458, 13.85991115, 14.24420638],
        [12.90186561, 13.9731532 , 13.85780053],
        [12.65247079, 14.94789949, 14.08068827],
        ...,
        [15.20566692, 14.05158885, 14.1800073 ],
        [15.25055965, 14.13950146, 13.08552713],
        [14.3905365 , 14.39328047, 13.25072922]]),
 'spam': array([1, 2, 3]),
 'block': 2,
 'eggs': "I don't eat green eggs and ham!!!",
 'kineticEnergy': 7.7636257031509635,
 'potentialEnergy': 3.949779532588296,
 'time': 1.384133336189678}

In [10]:
"""
Finally, loading the saved file with the initialization arguments.
"""
load_hdf5_file("test/initArgs_0.h5")

{'GPU': '0',
 'N': 1000,
 'PBCbox': False,
 'collision_rate': 0.01,
 'error_tol': 0.001,
 'integrator': 'variableLangevin',
 'length_scale': 1.0,
 'mass': 100,
 'max_Ek': 10,
 'platform': 'CPU',
 'precision': 'mixed',
 'temperature': 300,
 'verbose': False}

In [11]:
"""
And how it actually looks in HDF5
"""
import h5py 

print("simple things are saved as attrs")

for i in h5py.File("test/initArgs_0.h5", "r").attrs.items():  # open read-only
    print(i)
    
    
    
myfile = h5py.File("test/blocks_15-18.h5",'r') 

print("\n groups of the data files are: ")
print(list(myfile.items()))


print("\n looking at block 15 datasets")
print(list(myfile["15"].items()))


print("\n looking at block 15 attrs")
print(list(myfile["15"].attrs.items()))

print("Note that blocks in simulation and in a reporter are synchronized for a simple simulation "
      "when you're saving every block starting right away")


simple things are saved as attrs
('GPU', '0')
('N', 1000)
('PBCbox', False)
('collision_rate', 0.01)
('error_tol', 0.001)
('integrator', 'variableLangevin')
('length_scale', 1.0)
('mass', 100)
('max_Ek', 10)
('platform', 'CPU')
('precision', 'mixed')
('temperature', 300)
('verbose', False)

 groups of the data files are: 
[('15', <HDF5 group "/15" (2 members)>), ('16', <HDF5 group "/16" (2 members)>), ('17', <HDF5 group "/17" (2 members)>), ('18', <HDF5 group "/18" (2 members)>)]

 looking at block 15 datasets
[('pos', <HDF5 dataset "pos": shape (1000, 3), type "<f8">), ('spam', <HDF5 dataset "spam": shape (3,), type "<i8">)]

 looking at block 15 attrs
[('block', 15), ('eggs', "I don't eat green eggs and ham!!!"), ('kineticEnergy', 6.504475441339525), ('potentialEnergy', 4.915549541588101), ('time', 4.289104819997002)]
Note that blocks in simulation and in a reporter are synchronized for a simple simulation when you're saving every block starting right away


In [12]:
load_hdf5_file("test/forcekit_polymer_chains_0.h5")

{'angles': array([[  0,   1,   2],
        [  1,   2,   3],
        [  2,   3,   4],
        ...,
        [995, 996, 997],
        [996, 997, 998],
        [997, 998, 999]]), 'bonds': array([[  0,   1],
        [  1,   2],
        [  2,   3],
        ...,
        [996, 997],
        [997, 998],
        [998, 999]]), 'chains': array([[   0, 1000,    0]])}

In [13]:
# Now we simply save a dictionary to an HDF5 file with the save_hdf5_file function
save_hdf5_file("testfile.h5",{"a":"eggs", "b":"spam", "bacon":[1,2,3,4,5]}, mode="w")
load_hdf5_file("testfile.h5")


{'bacon': array([1, 2, 3, 4, 5]), 'a': 'eggs', 'b': 'spam'}

## And this is how you would continue a simulation

In [16]:

rep = HDF5Reporter(folder="test",  overwrite=False, check_exists=False)
ind, data = rep.continue_sim()
ind, data  # look at what is returned

(14, {'pos': array([[13.66437623, 14.17163004, 13.84933038],
         [12.93418967, 14.27746393, 14.26268868],
         [12.67254693, 15.14264372, 14.0996308 ],
         ...,
         [15.71252204, 13.82821797, 13.94893347],
         [15.18090901, 14.30120532, 13.39670931],
         [14.80572444, 14.66241544, 13.97123064]]),
  'spam': array([1, 2, 3]),
  'block': 14,
  'eggs': "I don't eat green eggs and ham!!!",
  'kineticEnergy': 6.791202086917854,
  'potentialEnergy': 4.655816870043763,
  'time': 4.06709036259301})

Now you would run something like the following, and block numbering will stay consistent:

`sim = Simulation(..., reporters=[rep])`

`sim.set_data(data["pos"])`