## Why do we need this? 


We are trying to solve several problems by proposing a new file format. Specifically: 

* Saving each conformation as an individual file produces too many files.
* Pickle-based approaches (joblib) make the format Python-specific and not backwards compatible; text is clumsy.
* It would be nice to save metadata, such as the starting conformation, forces, or initial parameters.


Are there alternatives? I considered MDTraj-compatible binary formats. They are PDB-centered and are usually one-file-per-trajectory, which looked hacky.

### One file vs. many files vs. several files

Saving each conformation as an individual file is undesirable because it will produce too many files: a filesystem check or backup of 30,000,000 files takes hours or days.

Saving the whole trajectory as a single file is undesirable because (1) backup software will back up a new copy of the file every day as it grows, and (2) if the last write fails, the file will end up in a corrupted state and will need to be recovered.

The solution is to save groups of conformations as individual files, e.g. conformations 1-50 as one file, conformations 51-100 as a second file, etc.

This way, we do not risk losing anything if the power goes out at the end, we do not interfere with backup solutions, and we also have partial trajectories that can be analyzed.
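
The grouping itself can be sketched in a few lines. The helpers below are hypothetical (`block_file_ranges` and `block_file_names` are not part of polychrom); they assume the `blocks_<first>-<last>.h5` naming used in this document and only illustrate how a run of blocks maps onto files.

```python
def block_file_ranges(n_blocks, blocks_per_file, first_block=0):
    """Return (first, last) block indices for each file, inclusive on both ends."""
    ranges = []
    start = first_block
    end_total = first_block + n_blocks
    while start < end_total:
        # The last file may hold fewer blocks than blocks_per_file.
        last = min(start + blocks_per_file, end_total) - 1
        ranges.append((start, last))
        start = last + 1
    return ranges


def block_file_names(n_blocks, blocks_per_file, first_block=0):
    """Hypothetical naming helper following the blocks_<first>-<last>.h5 pattern."""
    return [
        f"blocks_{first}-{last}.h5"
        for first, last in block_file_ranges(n_blocks, blocks_per_file, first_block)
    ]


# 19 blocks, 5 per file: the last file is shorter than the others.
print(block_file_names(19, 5))
```

With 19 blocks and 5 blocks per file, the last file covers only blocks 15 to 18, so at most one partial group is ever pending.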


### Storage format - what did I choose? 

I chose HDF5-based storage that roughly mimics the MDTraj HDF5 format. It does not include the MDTraj topology because that seemed a little too complicated, and it is not compatible with nglview anyway. Maybe one day we will write something that fully converts it to MDTraj if necessary.


### Overall design of the format


I decided to separate two entities: a simulation object and a reporter. When a simulation object is initialized, a reporter (actually, a list of reporters in case you want to use several) is passed to the simulation object. The simulation object attempts to save several things: `__init__` arguments, the starting conformation, energy minimization results (TODO), serialized forces, and blocks of conformations together with time, Ek, and Ep.

Each time a simulation object wants to save something, it calls `reporter.report(...)` for each of the reporters. It passes a string indicating what is being reported, and a dictionary to save. The reporter has to interpret this and save the data. The reporter also keeps the appropriate counts. NOTE: generic Python objects are not supported. Every value has to be HDF5-compatible, meaning an array of numbers/strings, or a number/string.
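
As a rough illustration of this protocol, here is a minimal sketch of a reporter and of the broadcast a simulation object might perform. Everything in it is an assumption made for illustration: `RecordingReporter`, `report_to_all`, and `is_hdf5_friendly` are hypothetical names, the record names are only examples, and the type check is far cruder than what a real HDF5 writer accepts (it ignores numpy arrays, for instance).

```python
import numbers


def is_hdf5_friendly(value):
    """Crude check: a number, a string, or a non-empty (nested) list of those."""
    if isinstance(value, (str, bytes, numbers.Number)):
        return True
    if isinstance(value, (list, tuple)):
        return len(value) > 0 and all(is_hdf5_friendly(v) for v in value)
    return False


class RecordingReporter:
    """Hypothetical reporter that only remembers what it was asked to save."""

    def __init__(self):
        self.received = []

    def report(self, name, values):
        # Reject generic Python objects up front, as the note above requires.
        bad = [key for key, value in values.items() if not is_hdf5_friendly(value)]
        if bad:
            raise TypeError(f"cannot store keys {bad} for report {name!r}")
        self.received.append((name, dict(values)))


def report_to_all(reporters, name, values):
    """What a simulation object might do each time it has something to save."""
    for reporter in reporters:
        reporter.report(name, values)


first, second = RecordingReporter(), RecordingReporter()
report_to_all([first, second], "initArgs", {"N": 10000, "platform": "CPU"})
report_to_all([first, second], "data", {"pos": [[0.0, 0.0, 0.0]], "time": 0.9})
print([name for name, _ in first.received])
```

Keeping the simulation side down to a single `report(name, values)` call is what allows several reporters with different storage backends to be attached at once.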

The HDF5 reporter used here saves everything into HDF5 files. Anything except a conformation is saved immediately into its own HDF5 file: numpy-array-compatible structures are saved as datasets, and regular types (strings, numbers) are saved as attributes. Conformations are buffered until a certain number of them has been received. They are then saved all at once into a single HDF5 file, e.g. `blocks_1-50.h5`, under groups /1, /2, /3, ... /50 for blocks 1, 2, 3, ... 50 respectively.
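
The buffering logic can be sketched without touching HDF5 at all. The class below is a hypothetical stand-in, not the actual `HDF5Reporter`: it stores the "files" in a plain dictionary, and it assumes that lists become datasets while everything else becomes an attribute. Only the names `max_data_length` and `dump_data` are taken from the reporter used later in this notebook.

```python
def split_record(record):
    """Sequences become datasets; scalars and strings become attributes."""
    attrs, datasets = {}, {}
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            datasets[key] = value
        else:
            attrs[key] = value
    return attrs, datasets


class BufferingSketch:
    """Hypothetical stand-in for the conformation buffer of a reporter."""

    def __init__(self, max_data_length):
        self.max_data_length = max_data_length
        self.buffer = []      # (block number, record) pairs waiting to be written
        self.next_block = 0
        self.files = {}       # file name -> {group name: (attrs, datasets)}

    def add_conformation(self, record):
        self.buffer.append((self.next_block, record))
        self.next_block += 1
        if len(self.buffer) >= self.max_data_length:
            self.dump_data()

    def dump_data(self):
        """Write all buffered conformations into one 'file' and clear the buffer."""
        if not self.buffer:
            return
        first, last = self.buffer[0][0], self.buffer[-1][0]
        groups = {str(number): split_record(rec) for number, rec in self.buffer}
        self.files[f"blocks_{first}-{last}.h5"] = groups
        self.buffer = []


sketch = BufferingSketch(max_data_length=5)
for i in range(7):
    sketch.add_conformation({"pos": [[0.0, 0.0, float(i)]], "time": 0.1 * i})
sketch.dump_data()  # flush the two leftover conformations
print(list(sketch.files))
```

The explicit final `dump_data()` call matters: without it, the conformations still sitting in the buffer would never reach a file.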


### Multi-stage simulations or loop extrusion

We frequently have simulations in which a simulation object changes. One example would be changing forces or parameters throughout the simulation. Another example would be loop extrusion simulations. 

In this design, a reporter object can be reused and passed to a new simulation. This keeps the conformation counter running, and the applied forces etc. are saved again. The reporter creates a file "applied_forces_0.h5" the first time it receives forces, and "applied_forces_1.h5" the second time it receives forces from a simulation.
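
A minimal sketch of that counting behaviour is shown below. `CountingSketch` and `next_filename` are hypothetical names chosen for illustration; the sketch only assumes the `<name>_<count>.h5` pattern seen in the file names above.

```python
from collections import Counter


class CountingSketch:
    """Hypothetical: tracks how many times each kind of record was reported."""

    def __init__(self):
        self.counts = Counter()

    def next_filename(self, name):
        # Counter returns 0 for names it has not seen yet.
        filename = f"{name}_{self.counts[name]}.h5"
        self.counts[name] += 1
        return filename


counting = CountingSketch()

# First simulation.
print(counting.next_filename("applied_forces"))
print(counting.next_filename("initArgs"))

# Second simulation reusing the same reporter object.
print(counting.next_filename("applied_forces"))
```

Because the counters live on the reporter rather than on the simulation, a new simulation that reuses the reporter continues the numbering instead of overwriting earlier files.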


### URIs to identify individual conformations

Because we're saving several conformations into one file, we designed a URI format to quickly fetch a conformation by a unique identifier.

A URI looks like this: `/path/to/the/trajectory/blocks_1-50.h5::42`

This URI will fetch block #42 from the file blocks_1-50.h5, which contains blocks 1 through 50, inclusive.
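
To make the URI structure concrete, here is a small parsing sketch. `parse_block_uri` is a hypothetical helper, not a polychrom function; it assumes the file name always follows the `blocks_<first>-<last>.h5` pattern and that the block number comes after the final `::`.

```python
import os
import re


def parse_block_uri(uri):
    """Split 'path/blocks_A-B.h5::N' into (path, N) and check that A <= N <= B."""
    path, sep, block = uri.rpartition("::")
    if not sep:
        raise ValueError(f"not a block URI: {uri!r}")
    block = int(block)
    match = re.fullmatch(r"blocks_(\d+)-(\d+)\.h5", os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected file name in {uri!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if not first <= block <= last:
        raise ValueError(f"block {block} is outside {first}-{last}")
    return path, block


print(parse_block_uri("/path/to/the/trajectory/blocks_1-50.h5::42"))
```

Checking the block number against the range encoded in the file name catches a mistyped URI before any file is opened.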

`polymerutils.load` is compatible with URIs.

Also, to make it easy to load both old-style filenames and new-style URIs, there is a function `polychrom.polymerutils.fetch_block`.

`fetch_block` automatically determines the type of a trajectory folder.

So it will fetch both `/path/to/the/trajectory/block42.dat` and `/path/to/the/trajectory/blocks_x-y.h5::42` automatically.
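
One way such auto-detection could work is sketched below. This is not how `fetch_block` is necessarily implemented; `resolve_block` is a hypothetical helper that takes a list of file names (rather than reading a real folder) and returns either the old-style file name or the new-style URI for a given block ID.

```python
import re


def resolve_block(filenames, block):
    """Return the old-style file name or new-style URI that holds `block`."""
    old_style = f"block{block}.dat"
    if old_style in filenames:
        return old_style
    for name in filenames:
        match = re.fullmatch(r"blocks_(\d+)-(\d+)\.h5", name)
        if match and int(match.group(1)) <= block <= int(match.group(2)):
            return f"{name}::{block}"
    raise KeyError(f"block {block} not found")


# Old-style folder: one file per conformation.
print(resolve_block(["block41.dat", "block42.dat"], 42))

# New-style folder: several conformations per HDF5 file.
print(resolve_block(["blocks_0-49.h5", "blocks_50-99.h5"], 42))
```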


In [1]:
import polychrom
import numpy as np 
import warnings
import h5py 
import glob
from polychrom.simulation import Simulation
import polychrom.starting_conformations
import polychrom.forces, polychrom.forcekits
import simtk.openmm 
import os 
import shutil
import polychrom.polymerutils

### Loading the reporter and utils from the hdf5_format module


In [2]:
from polychrom.hdf5_format import HDF5Reporter, list_URIs, load_URI, load_hdf5_file

### Making a simulation and passing a reporter 

In [3]:
%rm  test/*
data = polychrom.starting_conformations.grow_cubic(10000,30)

"""
Here we create an HDF5Reporter attached to the folder "test", and we are saving 5 blocks per file
(you should probably use 50 or 100 here; 5 is just for a showcase)
"""
reporter = HDF5Reporter(folder="test", max_data_length=5)


"""
Passing a reporter to the simulation object - many reporters are possible, and more will be added in the future
"""
sim = Simulation(N=10000, error_tol=0.001, collision_rate=0.01, integrator="variableLangevin", platform="CPU",
                 reporters=[reporter])
sim.set_data(data)
sim.add_force(polychrom.forcekits.polymerChains(sim))
sim._apply_forces()
sim.add_force(polychrom.forces.sphericalConfinement(sim, density=0.1))


for i in range(19):        
    """
    Here we pass two extra records: a string and an array-like object.
    First becomes an attr, and second becomes an HDF5 dataset
    """
    sim.do_block(10, save_extras={"eggs": "I don't eat green eggs and ham!!!", "spam":[1,2,3]})

"""
Here we are not forgetting to dump the last set of blocks that the reporter has. 
We have to do it at the end of every simulation. 

I tried adding it to the destructor to make it automatic,
but some weird interactions with garbage collection made it not very useable. 
"""
reporter.dump_data()




INFO:root:adding force HarmonicBonds 0
INFO:root:adding force Angle 1
INFO:root:adding force PolynomialRepulsive 2


Exclude neighbouring chain particles from PolynomialRepulsive
Number of exceptions: 9999


INFO:root:Particles loaded. Potential energy is 0.050502
INFO:root:block    0 pos[1]=[13.8 13.9 13.9] dr=0.21 t=0.9ps kin=8.19 pot=3.99 Rg=11.059 dt=24.1fs dx=15.40pm 
INFO:root:block    1 pos[1]=[13.8 13.8 13.9] dr=0.17 t=1.2ps kin=3.83 pot=8.88 Rg=11.059 dt=20.6fs dx=8.99pm 
INFO:root:block    2 pos[1]=[13.8 13.8 13.9] dr=0.09 t=1.4ps kin=7.70 pot=5.06 Rg=11.060 dt=24.5fs dx=15.19pm 
INFO:root:block    3 pos[1]=[13.7 13.7 13.9] dr=0.14 t=1.6ps kin=6.31 pot=6.46 Rg=11.064 dt=22.3fs dx=12.50pm 
INFO:root:block    4 pos[1]=[13.7 13.7 13.9] dr=0.11 t=1.8ps kin=6.69 pot=6.06 Rg=11.069 dt=22.0fs dx=12.71pm 
INFO:root:block    5 pos[1]=[13.7 13.7 13.8] dr=0.12 t=2.0ps kin=6.88 pot=5.85 Rg=11.074 dt=22.0fs dx=12.89pm 
INFO:root:block    6 pos[1]=[13.7 13.6 13.8] dr=0.12 t=2.3ps kin=6.77 pot=5.93 Rg=11.078 dt=22.0fs dx=12.78pm 
INFO:root:block    7 pos[1]=[13.7 13.5 13.8] dr=0.12 t=2.5ps kin=7.27 pot=5.41 Rg=11.084 dt=22.0fs dx=13.24pm 
INFO:root:block    8 pos[1]=[13.7 13.5 13.8] dr=0.12 t=2

### This is a list of files created in the trajectory folder

In [4]:
!ls -la test

total 6116
drwxrwxr-x 2 magus magus    4096 Sep  2 18:19 .
drwxrwxr-x 5 magus magus    4096 Sep  2 18:19 ..
-rw-rw-r-- 1 magus magus 1615688 Sep  2 18:19 applied_forces_0.h5
-rw-rw-r-- 1 magus magus 1167760 Sep  2 18:19 blocks_0-4.h5
-rw-rw-r-- 1 magus magus 1169873 Sep  2 18:19 blocks_10-14.h5
-rw-rw-r-- 1 magus magus  937585 Sep  2 18:19 blocks_15-18.h5
-rw-rw-r-- 1 magus magus 1169785 Sep  2 18:19 blocks_5-9.h5
-rw-rw-r-- 1 magus magus    6144 Sep  2 18:19 initArgs_0.h5
-rw-rw-r-- 1 magus magus  174847 Sep  2 18:19 starting_conformation_0.h5


In [5]:
files = list_URIs("test")
files   #  these are the URIs for individual blocks

['test/blocks_0-4.h5::0',
 'test/blocks_0-4.h5::1',
 'test/blocks_0-4.h5::2',
 'test/blocks_0-4.h5::3',
 'test/blocks_0-4.h5::4',
 'test/blocks_5-9.h5::5',
 'test/blocks_5-9.h5::6',
 'test/blocks_5-9.h5::7',
 'test/blocks_5-9.h5::8',
 'test/blocks_5-9.h5::9',
 'test/blocks_10-14.h5::10',
 'test/blocks_10-14.h5::11',
 'test/blocks_10-14.h5::12',
 'test/blocks_10-14.h5::13',
 'test/blocks_10-14.h5::14',
 'test/blocks_15-18.h5::15',
 'test/blocks_15-18.h5::16',
 'test/blocks_15-18.h5::17',
 'test/blocks_15-18.h5::18']

In [6]:
"""
Loading the entire block by URI, with positions and other information.
For that, use polychrom.hdf5_format.load_URI

Note how our custom-added eggs and spam appear below. 

"""
load_URI(files[0])

{'pos': array([[13.80736589, 13.86134737, 13.91210916],
        [14.19402612, 12.96560652, 14.12108859],
        [13.87467106, 13.15338213, 14.94710676],
        ...,
        [13.85326369, 12.93188278, 12.06708795],
        [13.9344306 , 14.05768812, 12.03921379],
        [14.01382148, 14.29979772, 13.06234489]]),
 'spam': array([1, 2, 3]),
 'block': 0,
 'eggs': "I don't eat green eggs and ham!!!",
 'kineticEnergy': 8.18619630961378,
 'potentialEnergy': 3.9903573396415153,
 'time': 0.8845543379323831}

In [7]:
"""
It is backwards compatible with polymerutils.load as well, and it gives you just the XYZ
"""
polychrom.polymerutils.load(files[0])

array([[13.80736589, 13.86134737, 13.91210916],
       [14.19402612, 12.96560652, 14.12108859],
       [13.87467106, 13.15338213, 14.94710676],
       ...,
       [13.85326369, 12.93188278, 12.06708795],
       [13.9344306 , 14.05768812, 12.03921379],
       [14.01382148, 14.29979772, 13.06234489]])

In [8]:
"""
There is also a universal function "fetch_block"
It can fetch both old-style filenames and new-style URIs just by block ID
"""

polychrom.polymerutils.fetch_block("test",2)

array([[13.76909954, 13.75941642, 13.88492131],
       [14.32039342, 12.95625228, 14.04644927],
       [13.91778782, 13.32153794, 15.00979725],
       ...,
       [13.84389403, 13.11281116, 11.80066787],
       [13.90639802, 13.77199267, 12.2819515 ],
       [14.01490711, 14.42432237, 12.95433807]])

In [9]:
"""
By default it fetches XYZ only, but can do full output
(of course in the old-style filenames there is no full output so default is False)
"""

polychrom.polymerutils.fetch_block("test",2, full_output=True)

{'pos': array([[13.76909954, 13.75941642, 13.88492131],
        [14.32039342, 12.95625228, 14.04644927],
        [13.91778782, 13.32153794, 15.00979725],
        ...,
        [13.84389403, 13.11281116, 11.80066787],
        [13.90639802, 13.77199267, 12.2819515 ],
        [14.01490711, 14.42432237, 12.95433807]]),
 'spam': array([1, 2, 3]),
 'block': 2,
 'eggs': "I don't eat green eggs and ham!!!",
 'kineticEnergy': 7.695460831782844,
 'potentialEnergy': 5.062703028966515,
 'time': 1.358945156125511}

In [10]:
"""
Finally, loading the saved file with the initialization arguments of the simulation.
"""
load_hdf5_file("test/initArgs_0.h5")

{'GPU': '0',
 'N': 10000,
 'PBCbox': False,
 'collision_rate': 0.01,
 'error_tol': 0.001,
 'integrator': 'variableLangevin',
 'length_scale': 1.0,
 'mass': 100,
 'max_Ek': 10,
 'platform': 'CPU',
 'precision': 'mixed',
 'temperature': 300,
 'verbose': False}

In [11]:
"""
And how it actually looks in HDF5
"""
import h5py 

print("simple things are saved as attrs")

for i in h5py.File("test/initArgs_0.h5", "r").attrs.items():  # open read-only explicitly
    print(i)
    
    
    
myfile = h5py.File("test/blocks_15-18.h5",'r') 

print("\n groups of the data files are: ")
print(list(myfile.items()))


print("\n looking at block 15 datasets")
print(list(myfile["15"].items()))


print("\n looking at block 15 attrs")
print(list(myfile["15"].attrs.items()))

print("Note that blocks in simulation and in a reporter are synchronized for a simple simulation "
      "when you're saving every block starting right away")


simple things are saved as attrs
('GPU', '0')
('N', 10000)
('PBCbox', False)
('collision_rate', 0.01)
('error_tol', 0.001)
('integrator', 'variableLangevin')
('length_scale', 1.0)
('mass', 100)
('max_Ek', 10)
('platform', 'CPU')
('precision', 'mixed')
('temperature', 300)
('verbose', False)

 groups of the data files are: 
[('15', <HDF5 group "/15" (2 members)>), ('16', <HDF5 group "/16" (2 members)>), ('17', <HDF5 group "/17" (2 members)>), ('18', <HDF5 group "/18" (2 members)>)]

 looking at block 15 datasets
[('pos', <HDF5 dataset "pos": shape (10000, 3), type "<f8">), ('spam', <HDF5 dataset "spam": shape (3,), type "<i8">)]

 looking at block 15 attrs
[('block', 15), ('eggs', "I don't eat green eggs and ham!!!"), ('kineticEnergy', 6.885725679584632), ('potentialEnergy', 5.6061969414261705), ('time', 4.237162721676979)]
Note that blocks in simulation and in a reporter are synchronized for a simple simulation when you're saving every block starting right away
