# VASP data aggregation example
This is a tutorial on how to aggregate and collate information from VASP runs into a pandas dataframe using the crystaltools package.

### Setup:
Ugur Aydin's high-throughput calculations of solution enthalpy have been preprocessed and stored as hdf5 files in a directory.

In [8]:
import numpy as np
import pandas as pd
import os
import re

import crystaltools.vasprun_parser as vp
import crystaltools.fetch_tools as ft

### Define aggregate function

Here, we build the dataframe from two ingredients: a list of file paths and an aggregating function. Both are passed to get_aggregate_data in fetch_tools, while vasprun_parser supplies the class used to read each file.

- list of file paths: We build this list with the get_path_list function in fetch_tools.
- aggregating function: We define this function below. Since people generally want different sets of information from a VASP calculation, this function is written by hand. Once defined, it can be applied to all the hdf5 files (13299 files in this case). It should take a file path as its argument and return a pandas series.
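
To make the division of labour concrete, here is a minimal sketch of the pattern described above: apply a function to every path, collect the returned series as rows, and remember the paths that failed. The name `aggregate_paths` and the toy aggregating function are assumptions made for illustration; this is not the implementation of `ft.get_aggregate_data`.

```python
import pandas as pd


def aggregate_paths(aggregate_func, path_list, verbose=False):
    """Hypothetical stand-in for ft.get_aggregate_data (not the library's code).

    Calls aggregate_func on every path, stacks the returned Series into a
    DataFrame (one row per file), and collects the paths that raised.
    """
    rows, bad_paths = [], []
    for path in path_list:
        try:
            rows.append(aggregate_func(path))
        except Exception as exc:  # keep going if a single file is unreadable
            bad_paths.append(path)
            if verbose:
                print('skipped {}: {}'.format(path, exc))
    return pd.DataFrame(rows).reset_index(drop=True), bad_paths


def toy_aggregate(path):
    """Toy aggregating function: one path in, one pd.Series out."""
    if not path.endswith('.h5'):
        raise ValueError('not an h5 file')
    return pd.Series({'path': path, 'name_length': len(path)})


df_demo, bad_demo = aggregate_paths(toy_aggregate, ['a.h5', 'b.txt', 'cc.h5'])
```

Because every call returns a series with the same index labels, the rows line up into columns without extra bookkeeping.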

For the aggregating function, one could pull the information out of the original vasprun XML manually using cElementTree. Here, however, extracting information from the hdf5 files is facilitated by the VasprunHDFParser class in vasprun_parser. This class has convenient methods that return pieces of the VASP calculation. See the docstring for details (help(vp.VasprunHDFParser)).

The aggregating function *aggregate_vasp_data* collects:

- el1, el2 - Primary and secondary elements
- n1, n2 - number of each element above
- isif - ISIF parameter (to determine if ions are allowed to move)
- encut - ENCUT parameter
- e_0_energy, e_fr_energy, e_wo_entrp - final energies of the calculation
- volume - volume of the calculation cell
- kx, ky, kz - kpoints
- cubic - True if the calculation cell is cubic, else False
- conv - True if the calculation has converged, else False
- path - path of the h5 file

In [9]:
def aggregate_vasp_data(file_path):
    """
        Aggregates data from h5 vasprun file defined by file_path.
        Aggregated information are as follows:
            el1, el2 - Primary and secondary elements
            n1, n2 - number of each element above
            isif - ISIF parameter (to determine if ions are allowed to move)
            encut - ENCUT parameter
            e_0_energy, e_fr_energy, e_wo_entrp - final energies of the calculation
            volume - volume of the calculation cell
            kx, ky, kz - kpoints
            cubic - True if the calculation cell is cubic, else False
            conv - True if the calculation has converged, else False
            path - path of the h5 file
            
        :params:
            file_path - file path of h5 file
 
        :return:
            pd.Series
    """
    
    with vp.VasprunHDFParser(directory='', filename=file_path) as vasprun:
        # get element names and their number
        atomtypes = vasprun.get_atomtypes()

        el1 = atomtypes.index[0]
        n1 = atomtypes.iloc[0].atomspertype  # .ix was removed from pandas; .iloc is positional

        if len(atomtypes) >= 2:
            el2 = atomtypes.index[1]
            n2 = atomtypes.iloc[1].atomspertype
            if len(atomtypes) > 2:
                # only the first two elements are recorded
                print('WARNING! MORE THAN 2 ELEMENTS FOUND!')
        else:
            el2 = np.nan
            n2 = np.nan

        elements = pd.Series([el1, el2, n1, n2], index=['el1', 'el2', 'n1', 'n2'])

        # get ISIF parameter
        incar = vasprun.get_incar()
        if 'ISIF' in incar.keys():
            isif = incar.ISIF
        else:
            isif = 2
        isif = pd.Series(isif, index=['isif'])

        # get ENCUT parameter
        if 'ENCUT' in incar.keys():
            encut = incar.ENCUT
        else:
            encut = np.nan
        encut = pd.Series(encut, index=['encut'])

        # get energies
        energies = vasprun.get_energies()
        # keep the last row, i.e. the final energies (as_matrix was removed from pandas)
        energies = energies.iloc[-1]

        # get volume
        volume = vasprun.get_volume()

        # get kpoints
        kpts = vasprun.get_kpt_division()

        # get cell vectors
        cv = vasprun.get_cell_vectors()
        v1 = cv[0,:]
        v2 = cv[1,:]
        v3 = cv[2,:]
        cellvec = pd.Series([v1, v2, v3], index=['v1', 'v2', 'v3'])

        # check if cubic: equal lengths and mutually orthogonal vectors,
        # compared with a tolerance because exact float equality is fragile
        lengths = [np.linalg.norm(v) for v in (v1, v2, v3)]
        dots = [np.dot(v1, v2), np.dot(v1, v3), np.dot(v2, v3)]
        is_cubic = np.allclose(lengths, lengths[0]) and np.allclose(dots, 0.0, atol=1e-8)
        cubic = pd.Series(bool(is_cubic), index=['cubic'])

        # check if calculation converged
        if 'finalpos' in vasprun.root._v_children:
            conv = pd.Series(True, index=['conv'])
        else:
            conv = pd.Series(False, index=['conv'])

        # store series path
        path = pd.Series(file_path, index=['path'])
    
        return pd.concat([elements, isif, energies, volume, kpts, encut,
                          cubic, cellvec, conv, path])
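
Cell vectors are floating-point numbers, so a cubic test is more robust when lengths and dot products are compared within a tolerance. The sketch below isolates that test in a standalone helper; `is_cubic` and its `rtol` parameter are hypothetical names introduced here, not part of crystaltools.

```python
import numpy as np


def is_cubic(cell_vectors, rtol=1e-6):
    """Return True if the three row vectors have equal length and are
    mutually orthogonal, within a relative tolerance (hypothetical helper)."""
    cell = np.asarray(cell_vectors, dtype=float)
    # Gram matrix: squared lengths on the diagonal, dot products off it.
    gram = cell @ cell.T
    scale = gram[0, 0]
    # A cubic cell has gram == scale * identity; tolerance scales with the cell.
    return bool(scale > 0 and
                np.allclose(gram, scale * np.eye(3), rtol=0.0, atol=rtol * scale))


simple_cubic = 4.05 * np.eye(3)
tetragonal = np.diag([4.0, 4.0, 5.0])
fcc_primitive = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
```

Note that this tests the shape of the calculation cell only: a primitive fcc cell has equal-length vectors but is not orthogonal, so it is reported as non-cubic.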

### Get list of file paths
The ft.get_path_list function takes the root path and a pattern as arguments. It looks for all files matching the pattern that can be reached from the root path. In this case, we want all files whose names start with vasprun and end with h5.

In [12]:
root_path = '/Volume/Elements/python/aydin'  # forward slashes; in '\aydin' the '\a' would be read as an escape
pattern = re.compile(r'vasprun\w+\.h5', re.IGNORECASE)  # raw string, literal dot
path_list = ft.get_path_list(root_path, pattern)
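
The internals of `ft.get_path_list` are not shown in this tutorial. As a rough sketch of how such a search could work, assuming a plain `os.walk` over the directory tree and matching on the file name only, the hypothetical helper `find_paths` below returns every matching file under a root directory. The temporary directory and file names exist only for the demonstration.

```python
import os
import re
import tempfile


def find_paths(root_path, pattern):
    """Hypothetical stand-in for ft.get_path_list: walk root_path and return
    the full path of every file whose name matches the compiled pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            if pattern.match(name):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Raw string with an escaped dot: "vasprun", one or more word characters, ".h5".
demo_pattern = re.compile(r'vasprun\w+\.h5', re.IGNORECASE)

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'run1'))
    demo_files = [('vasprun_a.h5',), ('run1', 'VASPRUN_b.h5'), ('run1', 'vasprun_c.xml')]
    for parts in demo_files:
        open(os.path.join(root, *parts), 'w').close()
    found = [os.path.relpath(p, root) for p in find_paths(root, demo_pattern)]
```

If the root path does not exist, `os.walk` yields nothing and the list comes back empty, so checking `len(path_list)` before aggregating is a cheap sanity test.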


### Apply aggregate_vasp_data to all files in path_list and construct the dataframe

In [13]:
%%time
df, bad_path = ft.get_aggregate_data(aggregate_vasp_data, path_list, verbose=True)

CPU times: user 731 µs, sys: 68 µs, total: 799 µs
Wall time: 768 µs


### Inspect dataframe

Processing and aggregating data from 13299 hdf5 files took 14min 13s on my humble MacBook Air, which is not bad. It would probably be faster if we weren't checking whether the cell is cubic. Now we can inspect and interrogate the data with the full power of pandas.
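
As an illustration of the kind of queries this enables, the sketch below filters and groups a small placeholder frame. Only the column names follow the aggregating function above; the rows in `toy` are invented for the example and are not results from the data set.

```python
import numpy as np
import pandas as pd

# Placeholder rows that only mimic the column layout of the aggregated frame;
# the values are invented for illustration.
toy = pd.DataFrame({
    'el1': ['Al', 'Al', 'Fe', 'Fe'],
    'el2': ['H ', np.nan, 'H ', 'C '],
    'isif': [2, 3, 2, 2],
    'conv': [True, True, False, True],
    'e_0_energy': [-10.0, -12.0, -20.0, -21.0],
})

# Converged runs that contain a second element.
converged_binary = toy[toy.conv & toy.el2.notnull()]

# Number of runs per primary element, and the lowest energy found for each.
summary = toy.groupby('el1').e_0_energy.agg(['count', 'min'])
```

The same boolean-mask and groupby idioms apply unchanged to the full dataframe.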

In [5]:
df.head()

In [6]:
print(len(df))

0


In [7]:
df[df.el2=='H '].head()

AttributeError: 'DataFrame' object has no attribute 'el2'

### Store the dataframe in a serialized format for further analysis

From here, we can easily store the summarized data in a serialized format like json or xml. I choose json in this instance. The resulting json file is under 4MB and can easily be transferred out of, say, the compute cluster and worked on locally on a PC.

In [None]:
df.to_json('aydin_data_summary.json')
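
To pick the analysis up again on another machine, the file can be read back with `pd.read_json`. The round trip below is a minimal sketch on a small placeholder frame written to a temporary directory; the frame contents and the file name `summary_demo.json` are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Small placeholder frame; the values are invented for illustration.
summary_demo = pd.DataFrame({'el1': ['Al', 'Fe'],
                             'n1': [32, 54],
                             'conv': [True, False]})

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, 'summary_demo.json')
    summary_demo.to_json(json_path)      # write the frame as json
    restored = pd.read_json(json_path)   # load it back, e.g. on a local machine
    size_bytes = os.path.getsize(json_path)
```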

### Docstrings of classes and functions used

In [None]:
help(vp.VasprunHDFParser)

In [None]:
help(ft.get_path_list)

In [None]:
help(ft.get_aggregate_data)