# VASP data aggregation example
This is a tutorial on how to aggregate and collate information from VASP runs into a pandas dataframe using the crystaltools package.

### Setup:
Ugur Aydin's high throughput calculation of solution enthalpy has been preprocessed and stored as hdf5 files in a directory.

In [1]:
import numpy as np
import pandas as pd
import os
import re

import crystaltools.vasprun_parser as vp
import crystaltools.fetch_tools as ft

### Define aggregate function

Here, we will use the get_aggregate_data function in fetch_tools, which takes an aggregating function and a list of file paths as arguments and builds the dataframe.

- list of file paths: We build this list with the get_path_list function in fetch_tools.
- aggregating function: We define this function below.  Since people generally want different sets of information from a VASP calculation, this function is written by hand.  Once defined, however, it can be applied to all the hdf5 files (13299 files in this case). This function should take a file path as its argument and return a pandas series.

For the aggregating function, one can pull the information out manually using cElementTree.  However, pulling information out of the hdf5 file is facilitated here by the VasprunHDFParser class in vasprun_parser.  This class has convenient methods that return pieces of the VASP calculation. See the docstring for details (help(vp.VasprunHDFParser)).

The aggregating function *aggregate_vasp_data* will collect:

- el1, el2 - Primary and secondary elements
- n1, n2 - number of each element above
- isif - ISIF parameter (to determine if ions are allowed to move)
- encut - ENCUT parameter
- e_0_energy, e_fr_energy, e_wo_entrp - final energies of the calculation
- volume - volume of the calculation cell
- kx, ky, kz - kpoints
- cubic - True if the calculation cell is cubic, else False
- conv - True if the calculation has converged, else False
- path - path of the h5 file
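
The cubic flag in the list above depends on comparing floating-point cell vectors, so a tolerance is usually safer than exact equality. Below is a minimal sketch of such a check. The helper name `is_cubic` and its tolerances are assumptions made for illustration and are not part of crystaltools. Like the aggregating function, it treats a cell as cubic when the three cell vectors have equal length and are mutually orthogonal.

```python
import numpy as np


def is_cubic(cell, rtol=1e-5, atol=1e-8):
    """Return True if the rows of `cell` (3x3) have equal length and are
    mutually orthogonal, within a floating-point tolerance.

    Hypothetical helper for illustration only.
    """
    cell = np.asarray(cell, dtype=float)
    lengths = np.linalg.norm(cell, axis=1)
    equal_lengths = np.allclose(lengths, lengths[0], rtol=rtol, atol=atol)
    # Off-diagonal entries of the Gram matrix are the pairwise dot products.
    gram = np.dot(cell, cell.T)
    off_diagonal = gram[~np.eye(3, dtype=bool)]
    orthogonal = np.allclose(off_diagonal, 0.0, atol=atol)
    return bool(equal_lengths and orthogonal)


cubic_cell = np.diag([8.6, 8.6, 8.6])
tetragonal_cell = np.diag([8.6, 8.6, 9.0])

# The same cubic cell rotated by 30 degrees about z: the dot products are
# zero only up to rounding, which is where exact comparison can fail.
angle = np.radians(30.0)
rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle), np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
rotated_cell = 8.6 * rotation

print(is_cubic(cubic_cell), is_cubic(tetragonal_cell), is_cubic(rotated_cell))
```

Whether the default tolerances are appropriate depends on how the cell vectors were written out; they are a starting point, not a recommendation.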

In [2]:
def aggregate_vasp_data(file_path):
    """
        Aggregates data from h5 vasprun file defined by file_path.
        Aggregated information are as follows:
            el1, el2 - Primary and secondary elements
            n1, n2 - number of each element above
            isif - ISIF parameter (to determine if ions are allowed to move)
            encut - ENCUT parameter
            e_0_energy, e_fr_energy, e_wo_entrp - final energies of the calculation
            volume - volume of the calculation cell
            kx, ky, kz - kpoints
            cubic - True if the calculation cell is cubic, else False
            conv - True if the calculation has converged, else False
            path - path of the h5 file
            
        :params:
            file_path - file path of h5 file
 
        :return:
            pd.Series
    """
    
    with vp.VasprunHDFParser(directory='', filename=file_path) as vasprun:
        # get element names and their number
        atomtypes = vasprun.get_atomtypes()

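        el1 = atomtypes.index[0]
        # .iloc replaces the removed .ix indexer for positional access
        n1 = atomtypes.iloc[0].atomspertype

        if len(atomtypes) == 1:
            # pure element: no secondary species
            el2 = np.nan
            n2 = np.nan
        elif len(atomtypes) == 2:
            el2 = atomtypes.index[1]
            n2 = atomtypes.iloc[1].atomspertype
        else:
            # el2 and n2 would be undefined below, so fail explicitly
            raise ValueError('More than 2 elements found in ' + file_path)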

        elements = pd.Series([el1, el2, n1, n2], index=['el1', 'el2', 'n1', 'n2'])

        # get ISIF parameter
        incar = vasprun.get_incar()
        if 'ISIF' in incar.keys():
            isif = incar.ISIF
        else:
            isif = 2
        isif = pd.Series(isif, index=['isif'])

        # get ENCUT parameter
        if 'ENCUT' in incar.keys():
            encut = incar.ENCUT
        else:
            encut = np.nan
        encut = pd.Series(encut, index=['encut'])

        # get energies
        energies = vasprun.get_energies()
        # keep only the last row (final energies); .values replaces the removed .as_matrix()
        energies = pd.Series(energies.values[-1, :], index=energies.columns)

        # get volume
        volume = vasprun.get_volume()

        # get kpoints
        kpts = vasprun.get_kpt_division()

        # get cell vectors
        cv = vasprun.get_cell_vectors()
        v1 = cv[0,:]
        v2 = cv[1,:]
        v3 = cv[2,:]
        cellvec = pd.Series([v1, v2, v3], index=['v1', 'v2', 'v3'])

        # check if cubic: equal-length, mutually orthogonal cell vectors
        # (np.isclose avoids exact floating-point comparison)
        len1, len2, len3 = (np.linalg.norm(v) for v in (v1, v2, v3))
        is_cubic = (np.isclose(len1, len2) and np.isclose(len1, len3)
                    and np.isclose(np.dot(v1, v2), 0)
                    and np.isclose(np.dot(v1, v3), 0)
                    and np.isclose(np.dot(v2, v3), 0))
        cubic = pd.Series(bool(is_cubic), index=['cubic'])

        # check if calculation converged
        if 'finalpos' in vasprun.root._v_children:
            conv = pd.Series(True, index=['conv'])
        else:
            conv = pd.Series(False, index=['conv'])

        # store series path
        path = pd.Series(file_path, index=['path'])
    
        return pd.concat([elements, isif, energies, volume, kpts, encut,
                          cubic, cellvec, conv, path])

### Get list of file paths
The ft.get_path_list function takes the root path and a pattern as arguments.  It looks for all files matching the pattern that can be seen from the root path.  In this case, we want all files whose names start with vasprun and end with h5.

In [3]:
root_path = '/Volumes/Elements/python/aydin'
# raw string so \w is a regex class, and \. so the dot matches only a literal dot
pattern = re.compile(r'vasprun\w+\.h5', re.IGNORECASE)
path_list = ft.get_path_list(root_path, pattern)
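
For readers without crystaltools at hand, the search described above can be approximated with the standard library. This is only a sketch of the idea, not the implementation of get_path_list. The name `find_matching_files` is hypothetical, and matching the pattern against the file name with `pattern.match` is an assumption based on the docstring.

```python
import os
import re
import tempfile


def find_matching_files(root_path, pattern):
    """Walk root_path recursively and return the sorted paths of all files
    whose name matches the compiled regex `pattern`.

    Hypothetical stand-in for illustration only.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            if pattern.match(name):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Build a small throwaway directory tree to search.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'run_a'))
    for rel in ('vasprun_001.h5',
                os.path.join('run_a', 'VASPRUN_002.H5'),
                'notes.txt',
                'vasprun_Xh5'):  # no literal dot, so it should be rejected
        open(os.path.join(root, rel), 'w').close()

    demo_pattern = re.compile(r'vasprun\w+\.h5', re.IGNORECASE)
    found = [os.path.basename(p)
             for p in find_matching_files(root, demo_pattern)]

print(found)
```

The file `vasprun_Xh5` is rejected only because the dot is escaped; with an unescaped `.` the pattern would accept any character in that position.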

### Apply aggregate_vasp_data to all files in path_list and construct the dataframe

In [4]:
%%time
df, bad_path = ft.get_aggregate_data(aggregate_vasp_data, path_list, verbose=True)

Wall time: 20min 12s
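
The call above returns both a dataframe and a list of paths that failed. A minimal sketch of that pattern is shown below, under the assumption that a failure means the aggregating function raised an exception. The names `collect` and `toy_aggregate` are hypothetical, and this is not the code of get_aggregate_data.

```python
import pandas as pd


def collect(aggregate_fxn, path_list, verbose=False):
    """Apply aggregate_fxn to every path, stacking the returned Series as
    rows of a DataFrame and recording the paths that raised an exception.

    Hypothetical stand-in for illustration only.
    """
    rows, bad_paths = [], []
    for path in path_list:
        try:
            rows.append(aggregate_fxn(path))
        except Exception as err:
            bad_paths.append(path)
            if verbose:
                print('failed: {} ({})'.format(path, err))
    return pd.DataFrame(rows).reset_index(drop=True), bad_paths


def toy_aggregate(path):
    """Toy aggregating function: takes a path, returns a pandas Series."""
    if path.endswith('broken.h5'):
        raise IOError('cannot read file')
    return pd.Series({'path': path, 'n_chars': len(path)})


df_toy, bad = collect(toy_aggregate, ['a.h5', 'broken.h5', 'run/b.h5'])
print(df_toy)
print(bad)
```

Keeping the failed paths separate means one unreadable file does not abort a run over thousands of files, and the failures can be inspected afterwards.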


### Inspect dataframe

Now we can inspect and interrogate the data with the full power of pandas.

In [5]:
df.head()

| | conv | cubic | e_0_energy | e_fr_energy | e_wo_entrp | el1 | el2 | encut | isif | kx | ky | kz | n1 | n2 | path | v1 | v2 | v3 | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.005804 | -264.335934 | -264.337385 | Zr | | 400 | 0 | 8 | 8 | 8 | 32 | | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [8.6, 0.0, 0.0] | [0.0, 8.6, 0.0] | [0.0, 0.0, 8.6] | 636.056 |
| 1 | 1 | 1 | -0.006999 | -268.786227 | -268.784477 | Zr | | 400 | 0 | 8 | 8 | 8 | 32 | | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [8.84, 0.0, 0.0] | [0.0, 8.84, 0.0] | [0.0, 0.0, 8.84] | 690.807104 |
| 2 | 1 | 1 | 0.008 | -270.028159 | -270.030159 | Zr | | 400 | 0 | 8 | 8 | 8 | 32 | | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [9.06, 0.0, 0.0] | [0.0, 9.06, 0.0] | [0.0, 0.0, 9.06] | 743.677416 |
| 3 | 1 | 1 | 0.017018 | -268.93564 | -268.939894 | Zr | | 400 | 0 | 8 | 8 | 8 | 32 | | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [9.28, 0.0, 0.0] | [0.0, 9.28, 0.0] | [0.0, 0.0, 9.28] | 799.178752 |
| 4 | 1 | 1 | -0.003066 | -265.523327 | -265.522561 | Zr | | 400 | 0 | 8 | 8 | 8 | 32 | | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [9.52, 0.0, 0.0] | [0.0, 9.52, 0.0] | [0.0, 0.0, 9.52] | 862.801408 |


In [6]:
print(len(df))

13299


In [7]:
df[df.el2=='H '].head()

| | conv | cubic | e_0_energy | e_fr_energy | e_wo_entrp | el1 | el2 | encut | isif | kx | ky | kz | n1 | n2 | path | v1 | v2 | v3 | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 215 | 1 | 1 | -0.011415 | -216.035977 | -216.033123 | Co | H | 400 | 0 | 8 | 8 | 8 | 32 | 1 | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [6.56, 0.0, 0.0] | [0.0, 6.56, 0.0] | [0.0, 0.0, 6.56] | 282.300416 |
| 216 | 1 | 1 | -0.011415 | -221.813478 | -221.810624 | Co | H | 400 | 0 | 8 | 8 | 8 | 32 | 1 | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [6.72, 0.0, 0.0] | [0.0, 6.72, 0.0] | [0.0, 0.0, 6.72] | 303.464448 |
| 217 | 1 | 1 | -0.011048 | -224.164499 | -224.161737 | Co | H | 400 | 0 | 8 | 8 | 8 | 32 | 1 | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [6.9, 0.0, 0.0] | [0.0, 6.9, 0.0] | [0.0, 0.0, 6.9] | 328.509 |
| 218 | 1 | 1 | -0.009262 | -223.190963 | -223.188647 | Co | H | 400 | 0 | 8 | 8 | 8 | 32 | 1 | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [7.08, 0.0, 0.0] | [0.0, 7.08, 0.0] | [0.0, 0.0, 7.08] | 354.894912 |
| 219 | 1 | 1 | -0.005724 | -220.23536 | -220.233929 | Co | H | 400 | 0 | 8 | 8 | 8 | 32 | 1 | /Volumes/Elements/python/aydin/vasprun_dump/h5... | [7.24, 0.0, 0.0] | [0.0, 7.24, 0.0] | [0.0, 0.0, 7.24] | 379.503424 |
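
The filter above compares against 'H ' with a trailing space, which suggests the element labels are stored padded. One way to make such queries less error-prone is to strip the labels once and then filter and group as usual. The sketch below uses a small made-up frame with the same column names; the values (including the conv flags) are illustrative only and are not results from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame with padded element labels; values are illustrative only.
toy = pd.DataFrame({
    'el1': ['Zr', 'Co', 'Co', 'Co'],
    'el2': [np.nan, 'H ', 'H ', 'H '],
    'conv': [True, True, True, False],
    'volume': [636.056, 282.300416, 303.464448, 328.509],
    'e_wo_entrp': [-264.337385, -216.033123, -221.810624, -224.161737],
})

# Strip the padding once; missing values (NaN) are left untouched.
toy['el2'] = toy['el2'].str.strip()

# Keep converged runs that contain hydrogen as the second element.
hydrides = toy[(toy['el2'] == 'H') & toy['conv']]

# For each host element, pick the row with the lowest energy.
lowest = hydrides.loc[hydrides.groupby('el1')['e_wo_entrp'].idxmin()]
print(lowest[['el1', 'el2', 'volume', 'e_wo_entrp']])
```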


### Store the dataframe in a serialized format for further analysis

From here, we can easily store the summarized data in a serialized format like json or xml.  I choose json in this instance.  The resulting json file is under 4MB and can easily be transferred off, say, the compute cluster and worked on locally on a PC.

In [8]:
df.to_json('aydin_data_summary.json')
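
On the local machine, the summary can be loaded back with pd.read_json. The round trip below is a small sketch on a made-up frame written to a temporary directory; the file name `summary.json` and the values are placeholders for illustration.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the summary frame, including a list-valued column.
summary = pd.DataFrame({
    'el1': ['Zr', 'Co'],
    'n1': [32, 32],
    'v1': [[8.6, 0.0, 0.0], [6.56, 0.0, 0.0]],
})

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, 'summary.json')
    summary.to_json(json_path)          # write, as in the cell above
    restored = pd.read_json(json_path)  # read back into a dataframe

print(restored)
```

List-valued cells such as the cell vectors come back as plain Python lists, so they may need converting with np.array before numerical work.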

### Docstrings of classes and functions used

In [9]:
help(vp.VasprunHDFParser)

Help on class VasprunHDFParser in module crystaltools.vasprun_parser:

class VasprunHDFParser(__builtin__.object)
 |  Class for querying vasprun hdf5 files generated
 |  from vasprun.xml
 |  
 |  Methods defined here:
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, type, value, traceback)
 |  
 |  __init__(self, directory='.', filename='vasprun.h5', root='/', mode='r')
 |      :params:
 |          directory - Directory where h5 file is located. (default: 'w')
 |          filename - Fielname of h5 file. (default: 'vasprun.h5')
 |          root - Root location of the vasprun data in the h5 file.
 |              (default: '/')
 |          mode - Writing mode (default: 'r')
 |  
 |  get_atoms(self)
 |      Returns pandas dataframe of ./atomsinfo/atoms
 |      
 |      :return:
 |          pd.DataFrame
 |  
 |  get_atomtypes(self)
 |      Returns pandas dataframe of ./atomsinfo/atomtypes
 |      
 |      :return:
 |          pd.DataFrame
 |  
 |  get_cell_vectors(self)
 |      Returns a 3x

In [10]:
help(ft.get_path_list)

Help on function get_path_list in module crystaltools.fetch_tools:

get_path_list(root_path, pattern)
    Get a list of all file paths that can be seen from rooth_path
    that matches the pattern.
    
    :params:
        root_path - the root folder to be searched
        pattern -  regex pattern to be matched with filenames
    
    :return:
        path_list - list of paths that matches pattern



In [11]:
help(ft.get_aggregate_data)

Help on function get_aggregate_data in module crystaltools.fetch_tools:

get_aggregate_data(aggregate_fxn, path_list, verbose=False)
    Applies aggregate_fxn to all files defined in path_list.
    
    :params:
        aggregate_fxn - function that aggregates the data in the
                        files defined by path_list.  Must take
                        argument (filename) from path_list and return
                        a pandas data series (data_series)
        path_list - a list of paths where aggregate_fxn will be applied
    
    :return:
        df_aggregate - a pandas dataframe with the collected (data_series)
        bad_paths - list of file paths where aggregate_fxn failed

