*Forenote*: although the approach presented here can be useful to extract data directly without reloading the whole LMGC90 database, a lot of information is missing since the geometries of the bodies are not stored in the file. Reading the documentation on [a posteriori management](https://lmgc90.pages-git-xen.lmgc.univ-montp2.fr/lmgc90_dev/restart_index.html#a-posteriori-visualization) of data with LMGC90 therefore remains, in the authors' opinion, the best approach since it relies only on the LMGC90 API.


# Post with Pandas


It is possible to extract the data stored in the HDF5 file and to store them in a [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) dataframe.
The benefit is easy and efficient access to the data, without having to wonder how they were saved in the file.

To make things easier, most of the job has been hidden in a `get_data_frame` function in the *utils* module next to this notebook.

Furthermore, for efficiency's sake, some internal data of LMGC90 are represented with an integer parameter instead of a string. For example, to describe that a body is rigid in 2D, instead of using the string `RBDY2` (which is the historical keyword for this), the code uses just `1`.

So, the raw data extracted from the binary file is not straightforwardly usable. The first step is to get the mapping between each integer and its associated string. This mapping is stored inside the file and is read with the `get_parameters` function of the *utils* module.

To understand how these functions were written, the interested reader can have a look into:
* the *HDF5_basis.ipynb* notebook, which is in the *Tutorials/post/by_hand* directory and explains how to read the content of the file
* the *HDF5_coordination.ipynb* notebook, which is in the *Tutorials/post/by_hand* directory and shows a simple example of direct information extraction.

### Imports

So let us start by importing everything needed in the notebook:

In [None]:
import h5py
import pandas as pd

from utils import get_parameters, get_data_frame

### Parameters

The first thing to get is the set of parameter mappings, using the `get_parameters` function mentioned above. It is then possible to explore the content to get a rough understanding of what is stored in it.

In [None]:
parameters = get_parameters('../lmgc90.h5')

In [None]:
print( parameters.keys() )
print( parameters['bdyty'] )

It is important to remember this point if there is a need to look directly into the HDF5 file, using either the `h5dump` utility (which dumps all the data as text) or some third-party graphical tool able to explore the content of the file.

For example, in the *VlocRloc* section of a record, the first column of the integer data of an interaction describes which type of interaction it is. Being able to map this integer back to a classical LMGC90 interaction type is more convenient:

In [None]:
parameters['inter_id'][15]
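
The remapping idea can be tried without the HDF5 file. The sketch below uses a toy mapping: only the `1` to `RBDY2` pair comes from the text above, the other pairs and the column content are hypothetical and are not read from a real file. The actual container returned by `get_parameters` may also be organised differently.

```python
import pandas as pd

# Toy mapping from integer parameter to keyword.
# Only 1 -> 'RBDY2' is taken from the text; the other pairs are made up.
toy_bdyty = {1: 'RBDY2', 2: 'RBDY3', 3: 'MAILx'}

# Raw integer column, as it could come out of an idata array (hypothetical values).
raw = pd.Series([1, 1, 3, 2], name='bdyty')

# Replace each integer by its keyword.
readable = raw.map(toy_bdyty)
print(readable.tolist())
```

An integer missing from the mapping would give `NaN` in the result, which is a convenient way to spot an incomplete mapping.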

### Rough hierarchy

As the HDF name states (**H**ierarchical **D**ata **F**ormat), the file has a logical structure. Without explaining everything, what is needed to extract data is to know that there are three groups at the root of the file:
* *Simulation* which contains data that stay fixed during the simulation (number of time steps, dimension, integrator...)
* *Evolution* which contains subgroups named with the pattern *ID_x*, where *x* is the record number, an increasing integer starting at 1.
* *Help* which contains metadata on the content of each field and the parameter mappings.

There is also some data stored directly in the root group, which allows checking the version of LMGC90 with which the file has been generated.

Then in an *ID_x* group there may be several subgroups describing:
* *RBDY2*
* *RBDY3*
* *MAILx* which in itself may contain:
  * *mecax*
  * *therx*
  * *porox*
* *VlocRloc*

Generally speaking, each of these subgroups has two associated datasets: *idata* for integer data and *rdata* for real data.
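
To get a feeling for this hierarchy, the sketch below builds a small in-memory file that mimics the layout described above and lists the datasets it contains. The file name, the array shapes and the zero-filled contents are placeholders chosen for illustration; a real LMGC90 file contains more groups and fields.

```python
import h5py
import numpy as np

# Build a small in-memory file mimicking the layout described above.
# Dataset shapes and contents are placeholders, not real LMGC90 data.
with h5py.File('toy_layout.h5', 'w', driver='core', backing_store=False) as hf:
    hf.create_dataset('Simulation/nb_record', data=1)
    for sub in ('RBDY2', 'VlocRloc'):
        hf.create_dataset(f'Evolution/ID_1/{sub}/idata', data=np.zeros((3, 2), dtype=int))
        hf.create_dataset(f'Evolution/ID_1/{sub}/rdata', data=np.zeros((3, 4)))
    hf.create_group('Help')

    # Collect the path of every dataset found under the root group.
    datasets = []

    def collect(name, obj):
        if isinstance(obj, h5py.Dataset):
            datasets.append(name)

    hf.visititems(collect)

for name in sorted(datasets):
    print(name)
```

The same `visititems` call can be used on a real file opened in read mode to discover which subgroups a given record holds.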

### Extracting a record

First, the user must open the file to check how many records are stored and decide which one is to be read.

**Warning**: when opening the HDF5 file for reading, it is really important to close it once done with it. Otherwise, even if Python is closed, the file may still be considered open and be reported as *unavailable* or *already opened* when trying to access it later.

In [None]:
hfile = '../lmgc90.h5'
with h5py.File( hfile, 'r' ) as hf:
    nb_record = int( hf['Simulation/nb_record'][()] )
print(f"number of records saved: {nb_record}")

In [None]:
id_record = 1
assert 0 < id_record <= nb_record, "[ERROR] wrong record number"

Now it is possible to use the `get_data_frame` function to extract the data from the file, using the `parameters` dictionary to remap the integer data to intelligible strings.

The attentive reader will notice that in the following block, a `compo` function is provided to the reading function. It will be explained a little later.

The first step is now to get a pandas dataframe for all the interactions of a given record:

In [None]:
basegroup = "Evolution/ID_"+str(id_record)


def idata_compo(name, comp):
    return comp.strip() + ' ' + name if name else comp.strip()

# get idata of VlocRloc
hgroup = 'VlocRloc/idata'
iinter = get_data_frame(hfile, basegroup, hgroup, mapper=parameters, compo=idata_compo)

# get rdata of VlocRloc
hgroup = 'VlocRloc/rdata'
rinter = get_data_frame(hfile, basegroup, hgroup, compo=lambda n, c: n+"_"+c)

# concatenate to get a single dataframe
interactions = pd.concat([iinter, rinter], axis=1)

In this dataframe each *row* is an interaction and all the relevant data is stored in the different *columns*. Try the following to check the content:

In [None]:
interactions.loc[0]

In [None]:
interactions.loc[50:60,('inter_id','gapTT')]
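
The column-wise concatenation used to build `interactions`, and the label-based selection with `.loc`, can be tried on toy data. In the sketch below the values are hypothetical; only the column names are borrowed from the ones appearing in this notebook.

```python
import pandas as pd

# Hypothetical integer and real parts for three interactions.
toy_i = pd.DataFrame({'inter_id': ['DKJCx', 'DKDKx', 'DKJCx'], 'icdan': [1, 1, 2]})
toy_r = pd.DataFrame({'gapTT': [0.0, 1e-3, 2e-3], 'rl_n': [5.0, 0.0, 3.0]})

# axis=1 puts the columns side by side; rows are matched on the index.
toy = pd.concat([toy_i, toy_r], axis=1)

# Label-based slicing with .loc includes both ends of the range.
subset = toy.loc[1:2, ['inter_id', 'gapTT']]
print(subset)
```

Since the rows are matched on the index, both parts must describe the interactions in the same order, which is assumed to be the case when they are read from the same record.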

Looking carefully at the different *columns* of this dataframe shows that some vector values are stored component by component (for example `rl_t`, `rl_n`). This is where the postprocessing presented here with pandas differs from the postprocessing presented with numpy (in *Tutorials/post/with_numpy/*), in which a column may be a contiguous vector. The point of the `compo` function mentioned earlier is just to specify how to build the name of each component from what is stored in the *Help* group. Here it has been decided that the `rl` field with the components `t` and `n` must be built with an `_` between them in this order, whereas for the `ibdyty` field, the `cd` and `an` components are put first, followed by a space and the field name.

It is mainly cosmetic. But sometimes putting a little effort into the cosmetics makes the data a little easier to use, hence the possibility to change the `compo` parameter depending on the user's preferences.
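
The effect of the two naming conventions can be checked in isolation. In the sketch below, `rdata_compo` is just a named version of the lambda used earlier, and the field names and component labels are typed by hand, whereas in the notebook they come from the *Help* group of the file.

```python
def idata_compo(name, comp):
    # Same function as in the cell above: component first, then field name.
    return comp.strip() + ' ' + name if name else comp.strip()


def rdata_compo(name, comp):
    # Named version of the lambda used above: field name, underscore, component.
    return name + "_" + comp


# Field names and component labels typed by hand for illustration.
idata_columns = [idata_compo('ibdyty', c) for c in ('cd', 'an')]
rdata_columns = [rdata_compo('rl', c) for c in ('t', 'n')]
print(idata_columns)
print(rdata_columns)
```

Any other function taking a field name and a component label and returning a string could be passed instead, for example one producing `rl[t]` style names.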

The following shows how to extract data in a similar way for `RBDY2` bodies:

In [None]:
# get idata of RBDY2
hgroup = 'RBDY2/idata'
ibody  = get_data_frame( hfile, basegroup, hgroup, mapper=parameters, compo=idata_compo )

# get rdata of RBDY2
hgroup = 'RBDY2/rdata'
rbody  = get_data_frame( hfile, basegroup, hgroup, compo=lambda n,c:n+"_"+c)

# concatenate to get one dataframe
bodies = pd.concat( [ibody,rbody], axis=1 )


## get idata of MAILx
# hgroup = 'MAILx/mecax/idata'
# imeca  = get_data_frame( hfile, basegroup, hgroup, mapper=parameters, compo=idata_compo )

## get rdata of MAILx
# hgroup = 'MAILx/mecax/rdata'
# rmeca  = get_data_frame( hfile, basegroup, hgroup, compo=lambda n,c:n+"_"+c)

## concatenate to get only one dataframe
# bodies = pd.concat( [imeca,rmeca], axis=1 )

## get fields of MAILx
# hgroup = 'MAILx/mecax/flux'
# fmeca  = get_data_frame( hfile, basegroup, hgroup, compo=lambda n,c:n+"_"+c)

In [None]:
# list interactions columns
print( interactions.columns )

Finally, it is up to the end user, depending on their knowledge of pandas, to quickly extract information. For example, to get some general information for a single type of interaction:

In [None]:
# getting description on 'DKJCx' interactions...
dkjcx = interactions[ interactions.loc[:,'inter_id'] == 'DKJCx' ]
print( dkjcx.loc[:,('rl_t','rl_n')].describe() )

# save data to a csv file
#interactions.to_csv('inters.csv')

In [None]:
# counting each type of interactions:
inter_by_type = interactions.groupby('inter_id')
inter_by_type['icdan'].count()

In [None]:
# adjacency table of the first 10 candidates
list_cd = interactions.groupby('cd ibdyty')
# print( type(list_cd) )

for count, (cd, list_an) in enumerate(list_cd.groups.items()):
    # stop once 10 candidates have been printed
    if count >= 10:
        break
    #print( cd, len(list_an) )
    an_id = interactions.loc[list_an,'an ibdyty']
    print( f"candidate {cd} has {len(list_an)} antagonist(s):" )
    print( an_id.to_string(index=False) )
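
For readers who want to experiment with these queries without the HDF5 file, here is a minimal sketch on a toy dataframe. The values are hypothetical; only the column names follow the ones used above. It also shows one possible alternative to the explicit loop, gathering the antagonists of each candidate with `agg(list)`.

```python
import pandas as pd

# Hypothetical interactions; column names follow the ones used above.
toy = pd.DataFrame({
    'inter_id':  ['DKJCx', 'DKDKx', 'DKDKx', 'DKJCx'],
    'cd ibdyty': [1, 1, 2, 3],
    'an ibdyty': [10, 2, 3, 10],
    'rl_n':      [5.0, 1.0, 2.0, 3.0],
})

# Boolean mask: keep a single interaction type.
dkjcx = toy[toy['inter_id'] == 'DKJCx']
mean_rl_n = dkjcx['rl_n'].mean()

# Number of interactions of each type.
counts = toy.groupby('inter_id').size()

# Antagonists of each candidate, gathered in one list per candidate.
neighbours = toy.groupby('cd ibdyty')['an ibdyty'].agg(list)

print(mean_rl_n)
print(counts.to_dict())
print(neighbours.to_dict())
```

The resulting `neighbours` series is indexed by candidate body, so `neighbours.head(10)` plays the same role as the loop limited to the first ten candidates.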