## Why do we need this? 


We are trying to solve several problems by proposing a new file format. Specifically: 

* Saving each conformation as an individual file produces too many files.
* Pickle-based approaches (joblib) make the format Python-specific and not backwards compatible; text formats are clumsy.
* It would be nice to save metadata, such as the starting conformation, forces, or initial parameters.


Are there alternatives? I considered MDTraj-compatible binary formats. They are PDB-centered and usually use one file per trajectory. That looked hacky.

### One file vs. many files vs. several files

Saving each conformation as an individual file is undesirable because it produces too many files: a filesystem check or backup on 30,000,000 files takes hours or days.

Saving the whole trajectory as a single file is undesirable because (1) backup software will back up a new copy of the file every day as it grows, and (2) if the last write fails, the file will end up in a corrupted state and will need to be recovered.

The solution is to save groups of conformations as individual files, e.g. conformations 1-50 as one file, conformations 51-100 as a second file, and so on.

This way, we do not risk losing anything if the power goes out at the end, we do not interfere with backup solutions, and we also have partial trajectories that can be analyzed.
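
The grouping rule can be sketched in a few lines. This is only an illustration, not polychrom's implementation: the helper `block_file_names` is hypothetical, and the zero-based `blocks_<first>-<last>.h5` naming is taken from the directory listing shown later in this document.

```python
def block_file_names(n_blocks, blocks_per_file):
    """Hypothetical helper: names of the files that would hold n_blocks conformations."""
    names = []
    for first in range(0, n_blocks, blocks_per_file):
        # The last file may hold fewer conformations than the others.
        last = min(first + blocks_per_file, n_blocks) - 1
        names.append(f"blocks_{first}-{last}.h5")
    return names


# 19 conformations, 5 per file, as in the example run in this document
print(block_file_names(19, 5))
```

With 30,000,000 conformations and 50 per file, the same rule would give 600,000 files instead of 30,000,000.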


### Storage format - what did I choose? 

I chose HDF5-based storage that roughly mimics the MDTraj HDF5 format. It does not have the MDTraj topology because that seemed a little too complicated, and it is not compatible with nglview anyway. Maybe one day we will write something that fully converts it to MDTraj if necessary.


### Overall design of the format


I decided to separate two entities: a simulation object and a reporter. When a simulation object is initialized, a reporter (actually, a list of reporters, in case you want to use several) is passed to it. The simulation object attempts to save several things: `__init__` arguments, the starting conformation, energy minimization results (TODO), serialized forces, and blocks of conformations together with time, Ek, and Ep.

Each time a simulation object wants to save something, it calls reporter.report(...) for each of the reporters. It passes a string indicating what is being reported, and a dictionary to save. The reporter has to interpret this and save the data. The reporter also keeps the appropriate counts. NOTE: generic Python objects are not supported. Every value has to be HDF5-compatible, meaning an array of numbers/strings, or a single number/string.
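
To make the contract concrete, here is a minimal sketch of a reporter that only keeps records in memory. It is not part of polychrom: the class `InMemoryReporter`, the `report(name, values)` argument names, and the record names `"initArgs"` and `"data"` are assumptions made for illustration.

```python
import numbers


class InMemoryReporter:
    """Hypothetical reporter that stores records in memory instead of writing HDF5."""

    def __init__(self):
        self.records = []  # (name, values) pairs in arrival order
        self.counts = {}   # how many times each kind of record was reported

    def report(self, name, values):
        # Reject generic Python objects up front, mirroring the HDF5 restriction.
        for key, value in values.items():
            if not self._is_storable(value):
                raise TypeError(f"{key!r} is not a number, a string, or an array of them")
        self.counts[name] = self.counts.get(name, 0) + 1
        self.records.append((name, dict(values)))

    @staticmethod
    def _is_storable(value):
        if isinstance(value, (numbers.Number, str)):
            return True
        if isinstance(value, (list, tuple)):
            return all(InMemoryReporter._is_storable(v) for v in value)
        return False


reporter = InMemoryReporter()
reporter.report("initArgs", {"N": 10000, "platform": "CPU"})
reporter.report("data", {"pos": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], "time": 0.9})
print(reporter.counts)
```

Because the simulation only ever calls `report`, any object with that method could in principle be swapped in.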

The HDF5 reporter used here saves everything into HDF5 files. Anything except a conformation is immediately saved into a single HDF5 file: numpy-array-compatible structures are saved as datasets, and regular types (strings, numbers) are saved as attributes. For conformations, it waits until a certain number of conformations has been received. It then saves them all at once into one file, e.g. "blocks_1-50.h5", under groups /1, /2, /3, ..., /50 for blocks 1, 2, 3, ..., 50 respectively.
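
The buffering behaviour described above can be sketched without HDF5 at all. The class `BufferedBlocks` below is hypothetical and only records which block indices would go into each file; the names `max_data_length` and `dump_data` are borrowed from the example later in this document, while everything else is an assumption.

```python
class BufferedBlocks:
    """Hypothetical stand-in for the buffering part of an HDF5 reporter."""

    def __init__(self, max_data_length):
        self.max_data_length = max_data_length
        self.buffer = []     # conformations waiting to be written
        self.next_block = 0  # running block counter
        self.flushed = []    # one list of block indices per simulated file write

    def report(self, conformation):
        self.buffer.append(conformation)
        # Write only when a full group has been collected.
        if len(self.buffer) == self.max_data_length:
            self.dump_data()

    def dump_data(self):
        if not self.buffer:
            return
        indices = list(range(self.next_block, self.next_block + len(self.buffer)))
        self.flushed.append(indices)  # a real reporter would write one file here
        self.next_block += len(self.buffer)
        self.buffer = []


writer = BufferedBlocks(max_data_length=5)
for step in range(19):
    writer.report([step, step, step])  # placeholder for an (N, 3) coordinate array

print(len(writer.flushed), len(writer.buffer))  # full groups written, conformations still waiting
writer.dump_data()                              # write the incomplete last group
print(writer.flushed[-1])
```

The sketch shows why the final group has to be written explicitly: it never reaches the full size that triggers an automatic write.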


### Multi-stage simulations or loop extrusion

We frequently have simulations in which a simulation object changes. One example would be changing forces or parameters throughout the simulation. Another example would be loop extrusion simulations. 

In this design, a reporter object can be reused and passed to a new simulation. It keeps the conformation counter running, and saves the applied forces etc. again. The reporter creates a file "applied_forces_0.hdf5" the first time it receives forces, and "applied_forces_1.hdf5" the second time it receives forces from a simulation.
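
One simple way to get this numbering is a per-record counter that lives as long as the reporter does. The sketch below is an assumption about how such counting could work, not polychrom's code; the class `NameCounter` is hypothetical, and the `.hdf5` extension follows the directory listing shown later in this document.

```python
from collections import Counter


class NameCounter:
    """Hypothetical helper: numbers repeated records so a reused reporter never overwrites a file."""

    def __init__(self):
        self.counts = Counter()

    def next_name(self, record):
        name = f"{record}_{self.counts[record]}.hdf5"
        self.counts[record] += 1
        return name


names = NameCounter()
print(names.next_name("applied_forces"))  # first simulation
print(names.next_name("initArgs"))        # other records are counted separately
print(names.next_name("applied_forces"))  # second simulation reusing the same reporter
```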





In [1]:
import polychrom
import numpy as np 
import warnings
import h5py 
import glob
from polychrom.simulation import Simulation
import polychrom.starting_conformations
import polychrom.forces, polychrom.forcekits
import simtk.openmm 
import os 
import shutil
import polychrom.polymerutils

### Loading the reporter and utils from the hdf5_format module 


In [2]:
from polychrom.hdf5_format import hdf5Reporter, list_filenames, load_block, load_hdf5_file

### Making a simulation and passing a reporter 

In [9]:
%rm  test/*
data = polychrom.starting_conformations.grow_cubic(10000,30)

"""
Here we create an hdf5Reporter attached to the folder "test", and we save 5 blocks per file 
(you should probably use 50 or 100 here; 5 is just for a showcase)
"""
reporter = hdf5Reporter(folder="test", max_data_length=5)


"""
Passing a reporter to the simulation object - many reporters are possible, and more will be added in the future
"""
sim = Simulation(N=10000, error_tol=0.001, collision_rate=0.01, integrator ="variableLangevin", platform="CPU", 
                reporters=[reporter])
sim.setData(data)
sim.addForce(polychrom.forcekits.polymerChains(sim))
sim._applyForces()
sim.addForce(polychrom.forces.sphericalConfinement(sim, density=0.1))


for i in range(19):        
    """
    Here we pass two extra records: a string and an array-like object.
    First becomes an attr, and second becomes an HDF5 dataset
    """
    sim.doBlock(10, saveExtras={"eggs": "I don't eat green eggs and ham!!!", "spam":[1,2,3]})

"""
Here we are not forgetting to dump the last set of blocks that the reporter has. 
We have to do it at the end of every simulation. 

I tried adding it to the destructor to make it automatic,
but some weird interactions with garbage collection made it not very useable. 
"""
reporter.dump_data()




INFO:root:adding force HarmonicBonds 0
INFO:root:adding force Angle 1
INFO:root:adding force PolynomialRepulsive 2


Exclude neighbouring chain particles from PolynomialRepulsive
Number of exceptions: 9999


INFO:root:Particles loaded. Potential energy is 0.050625
INFO:root:block    1 pos[1]=[14.2 14.0 14.0] dr=0.21 t=0.9ps kin=8.29 pot=4.02 Rg=11.082 dt=24.0fs dx=15.42pm 
INFO:root:block    2 pos[1]=[14.2 14.0 14.0] dr=0.17 t=1.2ps kin=3.89 pot=8.97 Rg=11.083 dt=20.5fs dx=9.02pm 
INFO:root:block    3 pos[1]=[14.3 14.0 14.1] dr=0.09 t=1.4ps kin=7.88 pot=5.02 Rg=11.084 dt=24.6fs dx=15.40pm 
INFO:root:block    4 pos[1]=[14.4 13.9 14.1] dr=0.14 t=1.6ps kin=6.29 pot=6.63 Rg=11.089 dt=22.0fs dx=12.34pm 
INFO:root:block    5 pos[1]=[14.5 13.9 14.1] dr=0.11 t=1.8ps kin=6.93 pot=5.97 Rg=11.094 dt=21.8fs dx=12.80pm 
INFO:root:block    6 pos[1]=[14.5 13.9 14.1] dr=0.12 t=2.0ps kin=6.82 pot=6.06 Rg=11.098 dt=21.8fs dx=12.70pm 
INFO:root:block    7 pos[1]=[14.5 13.8 14.1] dr=0.12 t=2.3ps kin=6.95 pot=5.89 Rg=11.103 dt=21.8fs dx=12.83pm 
INFO:root:block    8 pos[1]=[14.5 13.8 14.1] dr=0.12 t=2.5ps kin=7.28 pot=5.55 Rg=11.108 dt=21.8fs dx=13.12pm 
INFO:root:block    9 pos[1]=[14.6 13.8 14.0] dr=0.12 t=2
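
The final `reporter.dump_data()` call in the cell above is easy to forget, and it is skipped entirely if the loop raises an exception. One possible way to make it harder to miss is a small context manager. This is a sketch under assumptions: `flushing` and `FakeReporter` are hypothetical names that do not exist in polychrom, and the only thing relied on is that the reporter has a `dump_data()` method, as in the cell above.

```python
from contextlib import contextmanager


@contextmanager
def flushing(reporter):
    """Hypothetical helper: call reporter.dump_data() when the block exits, even on error."""
    try:
        yield reporter
    finally:
        reporter.dump_data()


class FakeReporter:
    """Stand-in used only to make this sketch runnable without polychrom."""

    def __init__(self):
        self.dumped = False

    def dump_data(self):
        self.dumped = True


fake = FakeReporter()
with flushing(fake):
    pass  # the sim.doBlock(...) loop would go here
print(fake.dumped)
```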

### This is a list of files created in the trajectory folder

In [10]:
!ls -la test

total 6116
drwxrwxr-x 2 magus magus    4096 Aug 24 17:50 .
drwxrwxr-x 4 magus magus    4096 Aug 24 17:49 ..
-rw-rw-r-- 1 magus magus 1615688 Aug 24 17:50 applied_forces_0.hdf5
-rw-rw-r-- 1 magus magus 1167792 Aug 24 17:50 blocks_0-4.h5
-rw-rw-r-- 1 magus magus 1170145 Aug 24 17:50 blocks_10-14.h5
-rw-rw-r-- 1 magus magus  937745 Aug 24 17:50 blocks_15-18.h5
-rw-rw-r-- 1 magus magus 1169806 Aug 24 17:50 blocks_5-9.h5
-rw-rw-r-- 1 magus magus    6144 Aug 24 17:50 initArgs_0.hdf5
-rw-rw-r-- 1 magus magus  174714 Aug 24 17:50 starting_conformation_0.hdf5


In [11]:
files = list_filenames("test")
files   #  these are the paths for individual blocks

['test/blocks_0-4.h5::0',
 'test/blocks_0-4.h5::1',
 'test/blocks_0-4.h5::2',
 'test/blocks_0-4.h5::3',
 'test/blocks_0-4.h5::4',
 'test/blocks_5-9.h5::5',
 'test/blocks_5-9.h5::6',
 'test/blocks_5-9.h5::7',
 'test/blocks_5-9.h5::8',
 'test/blocks_5-9.h5::9',
 'test/blocks_10-14.h5::10',
 'test/blocks_10-14.h5::11',
 'test/blocks_10-14.h5::12',
 'test/blocks_10-14.h5::13',
 'test/blocks_10-14.h5::14',
 'test/blocks_15-18.h5::15',
 'test/blocks_15-18.h5::16',
 'test/blocks_15-18.h5::17',
 'test/blocks_15-18.h5::18']
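
Each entry above appears to combine a file path and the name of a group inside that file, joined by `::`. If you ever need the two parts separately, they can be split with plain string handling. The helper `split_block_path` is hypothetical and not part of polychrom.

```python
def split_block_path(block_path):
    """Hypothetical helper: split 'file.h5::7' into ('file.h5', '7')."""
    # rpartition splits on the last '::', so earlier colons in the path are left alone.
    file_name, sep, group = block_path.rpartition("::")
    if not sep:
        raise ValueError(f"no '::' separator in {block_path!r}")
    return file_name, group


print(split_block_path("test/blocks_5-9.h5::7"))
```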

In [14]:
"""
Loading the entire file, with position and other information
for that, use polychrom.hdf5_format.load_block

Note how our custom-added eggs and spam appear below. 

"""
load_block(files[0])

{'pos': array([[14.20045833, 13.99957801, 14.01093603],
        [13.16833939, 13.95647912, 14.02223415],
        [12.87320312, 13.78728574, 13.0657144 ],
        ...,
        [14.96041728, 14.89754   , 12.91137594],
        [15.08821625, 14.97620249, 13.82370166],
        [13.95099467, 14.94128795, 14.0282529 ]]),
 'spam': array([1, 2, 3]),
 'block': 1,
 'eggs': "I don't eat green eggs and ham!!!",
 'kineticEnergy': 8.293918776505691,
 'potentialEnergy': 4.016401203546876,
 'time': 0.8851873628976735}

In [15]:
"""
It is backwards compatible with polymerutils.load as well, and it gives you just the XYZ
"""
polychrom.polymerutils.load(files[0])

array([[14.20045833, 13.99957801, 14.01093603],
       [13.16833939, 13.95647912, 14.02223415],
       [12.87320312, 13.78728574, 13.0657144 ],
       ...,
       [14.96041728, 14.89754   , 12.91137594],
       [15.08821625, 14.97620249, 13.82370166],
       [13.95099467, 14.94128795, 14.0282529 ]])

In [20]:
"""
Finally, loading the saved file with the initial arguments of the simulation. 
"""
load_hdf5_file("test/initArgs_0.hdf5")

{'GPU': '0',
 'N': 10000,
 'PBC': False,
 'collision_rate': 0.01,
 'error_tol': 0.001,
 'integrator': 'variableLangevin',
 'length_scale': 1.0,
 'mass': 100,
 'maxEk': 10,
 'name': 'sim',
 'platform': 'CPU',
 'precision': 'mixed',
 'temperature': 300,
 'verbose': False}

In [31]:
"""
And how it actually looks in HDF5
"""
import h5py 

print("simple things are saved as attrs")

for i in h5py.File("test/initArgs_0.hdf5", "r").attrs.items():  # open read-only
    print(i)
    
    
    
myfile = h5py.File("test/blocks_15-18.h5",'r') 

print("\n groups are: ")
print(list(myfile.items()))


print("\n looking at block 15 datasets")
print(list(myfile["15"].items()))


print("\n looking at block 15 attrs")
print(list(myfile["15"].attrs.items()))


print("Note: the 'block' attribute (16) is the block number inside of the simulation object, not the group number (15 here)")

simple things are saved as attrs
('GPU', '0')
('N', 10000)
('PBC', False)
('collision_rate', 0.01)
('error_tol', 0.001)
('integrator', 'variableLangevin')
('length_scale', 1.0)
('mass', 100)
('maxEk', 10)
('name', 'sim')
('platform', 'CPU')
('precision', 'mixed')
('temperature', 300)
('verbose', False)

 groups are: 
[('15', <HDF5 group "/15" (2 members)>), ('16', <HDF5 group "/16" (2 members)>), ('17', <HDF5 group "/17" (2 members)>), ('18', <HDF5 group "/18" (2 members)>)]

 looking at block 15 datasets
[('pos', <HDF5 dataset "pos": shape (10000, 3), type "<f8">), ('spam', <HDF5 dataset "spam": shape (3,), type "<i8">)]

 looking at block 15 attrs
[('block', 16), ('eggs', "I don't eat green eggs and ham!!!"), ('kineticEnergy', 7.011314701711484), ('potentialEnergy', 5.615686917628971), ('time', 4.21041506346761)]
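
The split visible in this output (arrays as datasets, single numbers and strings as attributes) can be mimicked with a few type checks. The function `split_record` below is a hypothetical illustration of that rule in plain Python; the actual reporter may decide differently, for example by using numpy type checks.

```python
def split_record(record):
    """Hypothetical helper: sort entries into would-be HDF5 attributes and datasets."""
    attrs, datasets = {}, {}
    for key, value in record.items():
        if isinstance(value, (str, int, float, bool)):
            attrs[key] = value      # single values become attributes
        elif isinstance(value, (list, tuple)):
            datasets[key] = value   # array-like values become datasets
        else:
            raise TypeError(f"{key!r} has unsupported type {type(value).__name__}")
    return attrs, datasets


attrs, datasets = split_record(
    {"eggs": "I don't eat green eggs and ham!!!", "spam": [1, 2, 3], "block": 16}
)
print(sorted(attrs), sorted(datasets))
```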
