# Parser Interface
This notebook illustrates how to create MDF-ready metadata with MDF-MatIO.

In [1]:
from mdf_matio.adapters import noop_parsers
from mdf_matio import get_mdf_parsers, generate_search_index
from materials_io.utils import interface as matio
from tarfile import TarFile
import pandas as pd
import os

## Get the Available Parsers
Not all of the parsers available in MaterialsIO are compatible with the MDF.

In [2]:
all_parsers = matio.get_available_parsers()
print(f'Found {len(all_parsers)} parsers:', set(all_parsers.keys()))

  warn('The libmagic library is not installed. '


Found 7 parsers: {'em', 'dft', 'generic', 'image', 'crystal', 'ase', 'noop'}


One job of the MDF-MatIO library is to define which parsers already produce data in an MDF-compatible format and, for the others, to provide a method for transforming their outputs into a format compatible with the MDF's search index.

In [3]:
print(f'Found {len(noop_parsers)} parsers that require zero alteration:', noop_parsers)

Found 2 parsers that require zero alteration: ['image', 'em']


In [4]:
mdf_parsers = get_mdf_parsers()
print(f'Found {len(mdf_parsers)} compatible parsers:', mdf_parsers)

Found 4 compatible parsers: ['image', 'em', 'dft', 'generic']


Some of these parsers require an "adapter" to transform the data into the MDF format.
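As a small illustration, the parsers that need an adapter can be read off by comparing the two lists printed above. The values below are copied from those outputs rather than fetched from the library, so this sketch runs without MDF-MatIO installed.

```python
# Values copied from the notebook outputs above (not queried from the library)
mdf_parsers = ['image', 'em', 'dft', 'generic']
noop_parsers = ['image', 'em']

# Compatible parsers that are not "noop" parsers must go through an adapter
needs_adapter = [name for name in mdf_parsers if name not in noop_parsers]
print(needs_adapter)  # ['dft', 'generic']
```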

## Demonstrate Adapters
A good example of a parser that generates data in a non-MDF format is the "generic file parser."

In [5]:
test_file = os.path.join('example-files', 'dog2.jpeg')

In [6]:
generic_parser = matio.get_parser('generic')

The generic parser computes hashes for the data file and, if libmagic is installed, autodetects the file format.

In [7]:
file_info = generic_parser.parse([test_file])
file_info

{'filename': 'dog2.jpeg',
 'length': 269360,
 'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90'}
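To make these fields concrete, here is a minimal standard-library sketch that computes the same three values for a file. It is only an illustration of the idea, not MaterialsIO's implementation, and the helper name `describe_file` is hypothetical.

```python
import hashlib
import os


def describe_file(path, chunk_size=65536):
    """Return the filename, size in bytes, and SHA-512 hex digest of a file.

    Hypothetical helper for illustration; not part of MaterialsIO.
    """
    sha = hashlib.sha512()
    with open(path, 'rb') as fp:
        # Read in chunks so that large files need not fit in memory
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            sha.update(chunk)
    return {
        'filename': os.path.basename(path),
        'length': os.path.getsize(path),
        'sha512': sha.hexdigest(),
    }
```

Calling `describe_file(test_file)` should yield a dictionary with the same keys as the parser output shown above, assuming the file exists at that path.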

The MDF search index stores this information under the "files" block.
Our adapter for the "generic" parser performs this operation.

In [8]:
generic_adapter = matio.get_adapter('generic')

In [9]:
generic_adapter.transform(file_info)

{'files': [{'data_type': 'Unknown',
   'filename': 'dog2.jpeg',
   'length': 269360,
   'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90'}]}

The advantage of this adapter is that the MDF need not implement the hashing or file-type detection framework.
The only work needed to use this MaterialsIO parser is some data reshaping, which is a much easier task.
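The reshaping step can be sketched in a few lines. This is a guess at the shape of the transformation based on the output above, not the adapter's actual code: the function name `to_mdf_files_block` is hypothetical, and defaulting `data_type` to `'Unknown'` is an assumption drawn from that output.

```python
def to_mdf_files_block(file_info, default_type='Unknown'):
    """Wrap a flat file-metadata dict in an MDF-style ``files`` list.

    Hypothetical helper; the real adapter may set ``data_type`` differently.
    """
    entry = dict(file_info)  # Copy so the parser output is left untouched
    entry.setdefault('data_type', default_type)
    return {'files': [entry]}


# Shortened stand-in for the parser output shown earlier
example_info = {'filename': 'dog2.jpeg', 'length': 269360}
print(to_mdf_files_block(example_info))
```

Copying the input keeps the parser's output reusable by other adapters.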

## Parsing MDF-Compliant Data
The `generate_search_index` function uses these capabilities to automatically generate compliant data from a directory of files. It determines which parsers are available, runs them on all data in a directory, applies the adapters, and then merges the metadata of file records that describe the same record (e.g., a single experiment or calculation).
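The merging step is the least obvious part, so here is a rough sketch of one way it could work: records that share a group identifier are combined, with list values (such as `files`) concatenated. The function names and the merge rule are assumptions made for illustration; MDF-MatIO's actual logic may differ.

```python
from collections import defaultdict


def merge_records(records):
    """Merge metadata dicts: concatenate list values, later non-list values win."""
    merged = {}
    for record in records:
        for key, value in record.items():
            if isinstance(value, list):
                merged.setdefault(key, []).extend(value)
            else:
                merged[key] = value
    return merged


def merge_by_group(tagged_records):
    """Combine (group_id, record) pairs into one merged record per group."""
    groups = defaultdict(list)
    for group_id, record in tagged_records:
        groups[group_id].append(record)
    return {gid: merge_records(recs) for gid, recs in groups.items()}


# Toy records: two file entries and one DFT block from the same calculation
tagged = [
    ('calc', {'files': [{'filename': 'INCAR'}]}),
    ('calc', {'files': [{'filename': 'OUTCAR'}]}),
    ('calc', {'dft': {'converged': True}}),
]
merged = merge_by_group(tagged)
print(merged['calc'])
```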

First, unpack the VASP data. (The uncompressed files are large enough that we do not want to commit them to GitHub.)

In [10]:
with TarFile.open(os.path.join('example-files', 'calc', 'AlNi_static_LDA.tar.gz')) as t:
    t.extractall(os.path.join('example-files', 'calc'))

Deploying the search tool

In [11]:
record_gen = generate_search_index(os.path.join('example-files', 'calc'), False)
record_gen

<generator object generate_search_index at 0x0000020FDAE3DAF0>

MaterialsIO uses generators to avoid holding the entire dataset in memory at once.
Each metadata record is generated incrementally, on demand.
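The pattern itself is plain Python. The sketch below is a stand-in for the idea, not MaterialsIO's code: a hypothetical `iter_file_records` generator walks a directory and yields one small record per file, so nothing is computed until the caller asks for the next item.

```python
import os


def iter_file_records(root):
    """Lazily yield one small metadata dict per file found under ``root``."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield {'filename': name, 'length': os.path.getsize(path)}
```

Wrapping the generator in `list(...)`, as the next cell does with `record_gen`, forces every record to be produced at once. That is convenient for a small example but gives up the memory savings.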

In [12]:
records = list(record_gen)

In [13]:
records[0]

{'files': [{'data_type': 'Unknown',
   'filename': 'AlNi_static_LDA.tar.gz',
   'length': 234350,
   'sha512': '23ffce2c6c4c3f2903fea92813c28a36a801a100f65d4e38f4af6c9750d73e015b59979cacf4eb6f8513702bbd84b4500746f4157e390cd6607f0e6c505488bf'}]}

Note that each entry contains the metadata in the form needed by the MDF.
For example, the data from the generic file parser is stored as a list under the key `files`.

In [14]:
dft_records = [x for x in records if 'dft' in x]

In [15]:
dft_records

[{'dft': {'converged': True,
   'cutoff_energy': 650.0,
   'exchange_correlation_functional': 'PAW'},
  'files': [{'data_type': 'Unknown',
    'filename': 'KPOINTS',
    'length': 32,
    'sha512': '2a70cea721b10dd65c9d915ff18d906749380dead08d332dc15a2c6196c636b1bda4540377b6850a8cb4e689eca92e0889f9ee5f812d98b4f0078a591b9b19eb'},
   {'data_type': 'Unknown',
    'filename': 'OSZICAR',
    'length': 663,
    'sha512': 'daedbf1bc47695fb5fc4a65f0a8ae8b60a74d84fde9c569e3d957a5ae87ebf21e5bdd482ae69e9760e7b1125270ded62eb1d01af6882b83e4f42d12683b03a36'},
   {'data_type': 'Unknown',
    'filename': 'INCAR',
    'length': 460,
    'sha512': 'd1ece02baeedec07efa2cdaecee033c78a8e43886473e05fc1d3a3b8ebb0a32336214fa6713efbce9f4a331064b62b10dc9188c4d961a090d7e38529e6b04117'},
   {'data_type': 'Unknown',
    'filename': 'OUTCAR',
    'length': 266416,
    'sha512': 'cfcf0afc831204cb31354d7c03a8bc26cf4ac1c445afcc70dc336a19ed91338ab3d5f0556265dd50eb2bd18c7d4d69b1a882ee539de165ada6a3cfcae070a828'},
   {'d

Note how this record contains more than one file as well as additional metadata for the DFT calculation.
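A quick way to inspect such a record is to summarize it rather than print it in full. The helper below is hypothetical, and the sample record contains only the first four file entries visible in the truncated output above, with the hashes omitted for brevity.

```python
def summarize_record(record):
    """Return the file count, total size in bytes, and non-file metadata keys."""
    files = record.get('files', [])
    return {
        'n_files': len(files),
        'total_bytes': sum(f.get('length', 0) for f in files),
        'metadata_blocks': sorted(k for k in record if k != 'files'),
    }


# Partial copy of the DFT record shown above (first four files only)
sample = {
    'dft': {'converged': True, 'cutoff_energy': 650.0},
    'files': [
        {'filename': 'KPOINTS', 'length': 32},
        {'filename': 'OSZICAR', 'length': 663},
        {'filename': 'INCAR', 'length': 460},
        {'filename': 'OUTCAR', 'length': 266416},
    ],
}
summary = summarize_record(sample)
print(summary)
```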