# Parser Interface
This notebook illustrates how to create MDF-ready metadata with MDF-MatIO.

In [17]:
from mdf_matio.adapters import noop_parsers
from mdf_matio import get_mdf_parsers, generate_search_index
from materials_io.utils import interface as matio
from tarfile import TarFile
import pandas as pd
import os

## Get the Available Parsers
The MDF only uses a limited subset of the data available via each parser.
Consequently, the MDF interface to MaterialsIO only uses parsers for which we have defined this desired subset.

In [18]:
all_parsers = matio.get_available_parsers()
print(f'Found {len(all_parsers)} parsers:', set(all_parsers.keys()))

Found 7 parsers: {'crystal', 'image', 'ase', 'generic', 'noop', 'em', 'dft'}


One job of the MDF IO library is to define which parsers already produce data in an MDF-compatible format and, for the others, to provide a method for transforming their output into a format compatible with the MDF's search index.

In [19]:
print(f'Found {len(noop_parsers)} parsers that require zero alteration:', noop_parsers)

Found 2 parsers that require zero alteration: ['image', 'em']


In [20]:
mdf_parsers = get_mdf_parsers()
print(f'Found {len(mdf_parsers)} compatible parsers:', mdf_parsers)

Found 4 compatible parsers: {'em', 'generic', 'image', 'dft'}


Some of these parsers require an "adapter" to transform the data into the MDF format.

## Demonstrate Adapters
A good example of a parser that generates data in a non-MDF format is the "generic file parser."

In [21]:
test_file = os.path.join('example-files', 'dog2.jpeg')

In [22]:
generic_parser = matio.get_parser('generic')

The generic parser produces the hashes for the data file and, if installed, autodetects the file format.

In [23]:
file_info = generic_parser.parse([test_file])
file_info

{'length': 269360,
 'filename': 'dog2.jpeg',
 'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90'}
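
To make the fields above concrete, here is a minimal sketch of how a length, file name, and SHA-512 digest could be computed with the standard library alone. The helper name `describe_file` and its `chunk_size` parameter are hypothetical and chosen for illustration; this is not the MaterialsIO implementation, which may differ.

```python
import hashlib
import os


def describe_file(path, chunk_size=1 << 20):
    """Return the length, base name, and SHA-512 digest of a file.

    Hypothetical helper for illustration; not part of MaterialsIO.
    """
    digest = hashlib.sha512()
    with open(path, 'rb') as fp:
        # Read fixed-size chunks so a large file never sits fully in memory
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            digest.update(chunk)
    return {
        'length': os.path.getsize(path),
        'filename': os.path.basename(path),
        'sha512': digest.hexdigest(),
    }
```

Calling `describe_file(test_file)` would return a dictionary with the same three keys as the parser output shown above.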

The MDF search index stores this information under the "files" block.
Our adapter to the "generic" parser performs this operation.

In [24]:
generic_adapter = matio.get_adapter('generic')

In [25]:
generic_adapter.transform(file_info)

{'files': [{'length': 269360,
   'filename': 'dog2.jpeg',
   'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90',
   'data_type': 'Unknown'}]}

The advantage of this adapter is that the MDF need not implement the hashing or file-type detection framework.
The only tool needed for using this MaterialsIO parser is some data reshaping, a much easier task.
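
The reshaping step can be sketched in a few lines. The function below is a hypothetical stand-in, not the adapter shipped with MDF-MatIO: it assumes the parser output is a flat dictionary and that a missing `data_type` should default to `'Unknown'`, which matches the output shown above but may not reflect how the real adapter decides the value.

```python
def to_mdf_files_block(file_info, default_type='Unknown'):
    """Wrap one file description in a 'files' list (illustrative only)."""
    entry = dict(file_info)  # copy so the parser output is left untouched
    # Assumption: fall back to a placeholder when no type was detected
    entry.setdefault('data_type', default_type)
    return {'files': [entry]}


example = {'length': 269360, 'filename': 'dog2.jpeg'}
reshaped = to_mdf_files_block(example)
```

Because the function copies its input, the original parser output stays unchanged and can be passed to other adapters.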

## Parsing MDF-Compliant Data
The `generate_search_index` function uses these capabilities to automatically generate compliant data from a directory of files. It determines which parsers are available, runs them on all data in a directory, applies the adapters, and then merges the metadata of files that describe the same record (e.g., a single experiment or calculation).
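
The merge step can be pictured with a small sketch. The `merge_records` function below is hypothetical and only illustrates the idea; it assumes that every block other than `files` is a dictionary and that later values overwrite earlier ones, neither of which is confirmed for the real `generate_search_index`.

```python
def merge_records(records):
    """Combine per-parser metadata dictionaries into one record (illustrative)."""
    merged = {}
    for record in records:
        for key, value in record.items():
            if key == 'files':
                # File descriptions accumulate in a single list
                merged.setdefault('files', []).extend(value)
            else:
                # Assumption: other blocks are dicts; later values win
                merged.setdefault(key, {}).update(value)
    return merged


parts = [
    {'files': [{'filename': 'INCAR'}]},
    {'files': [{'filename': 'CONTCAR'}]},
    {'dft': {'converged': True}},
]
combined = merge_records(parts)
```

Here `combined` holds one `dft` block and a `files` list with both file entries, which mirrors the shape of the merged records shown later in this notebook.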

First, unpack the VASP data. (The uncompressed files are large enough that we do not want to commit them to GitHub.)

In [26]:
with TarFile.open(os.path.join('example-files', 'calc', 'AlNi_static_LDA.tar.gz')) as t:
    t.extractall(os.path.join('example-files', 'calc'))

Next, run the search-index generator on the example directory.

In [27]:
record_gen = generate_search_index(os.path.join('example-files'), False)
record_gen

<generator object generate_search_index at 0x000001B0677870C0>

MaterialsIO uses generators to avoid holding the entire dataset in memory at once.
Each metadata record is generated incrementally, on demand.

In [28]:
records = list(record_gen)
print(f'Generated {len(records)} records')



Generated 2 records
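
Calling `list` on a generator materializes every record at once, which is fine for two records. For larger datasets, the lazy behavior can be kept by pulling only what is needed. The sketch below uses a stand-in generator named `fake_records` (hypothetical, not part of MaterialsIO) so that it runs without any example files.

```python
from itertools import islice

built = []  # tracks which records were actually constructed


def fake_records(n):
    """Stand-in generator that yields placeholder records one at a time."""
    for i in range(n):
        built.append(i)
        yield {'files': [{'filename': f'file-{i}.dat'}]}


gen = fake_records(1000)
first_two = list(islice(gen, 2))  # consumes only the first two records
```

After this runs, `built` contains only two entries, showing that the remaining records were never constructed.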


In [33]:
records[0]

{'image': {'width': 1910,
  'height': 1000,
  'format': 'JPEG',
  'megapixels': 1.91},
 'files': [{'length': 269360,
   'filename': 'dog2.jpeg',
   'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90',
   'data_type': 'Unknown'}]}

Note that each entry stores its metadata in the form required by the MDF.
For example, the data from the generic file parser is stored as a list under the key `files`.

In [31]:
dft_records = [x for x in records if 'dft' in x]

In [32]:
dft_records

[{'material': {'elemental_proportions': {'Al': 1}},
  'dft': {'converged': True,
   'exchange_correlation_functional': 'PAW',
   'cutoff_energy': 650.0},
  'origin': {'type': 'computation', 'name': 'VASP', 'version': '5.3.2'},
  'files': [{'length': 523,
    'filename': 'CONTCAR',
    'sha512': 'b29f1e6ced5bb681bf0ab19e264e0d182581fd8a84e47d20eb9c323791f0fe3e87a460c6c964dbcce971556c30860a4a141795c671ced3eb8c4d95528d128383',
    'data_type': 'Unknown'},
   {'length': 18297,
    'filename': 'DOSCAR',
    'sha512': 'defdb049f871123ae2fe93adc3e7a1834f236ac771f10e82b75bdc74f80eaffb7dfe395caeac2fc145df961bffddbabc1dd6e7c57399c7c03d94fb8e078ae2e7',
    'data_type': 'Unknown'},
   {'length': 460,
    'filename': 'INCAR',
    'sha512': 'd1ece02baeedec07efa2cdaecee033c78a8e43886473e05fc1d3a3b8ebb0a32336214fa6713efbce9f4a331064b62b10dc9188c4d961a090d7e38529e6b04117',
    'data_type': 'Unknown'},
   {'length': 32,
    'filename': 'KPOINTS',
    'sha512': '2a70cea721b10dd65c9d915ff18d906749380dead0

Note how this record contains more than one file, along with additional metadata for the DFT calculation.
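
Records with many files are easier to scan in a condensed form. The helper below is a hypothetical convenience function, not part of MDF-MatIO; it assumes only the record layout visible in the outputs above, namely a `files` list of dictionaries with a `filename` key alongside other metadata blocks. The `sample` record is hand-written for illustration.

```python
def summarize(record):
    """List the metadata blocks and file names in one record (illustrative)."""
    return {
        'blocks': sorted(key for key in record if key != 'files'),
        'filenames': [f['filename'] for f in record.get('files', [])],
    }


# Hand-written sample that mimics the layout of a DFT record
sample = {
    'dft': {'converged': True},
    'origin': {'name': 'VASP'},
    'files': [{'filename': 'CONTCAR'}, {'filename': 'INCAR'}],
}
summary = summarize(sample)
```

Applying `summarize` to each entry of `dft_records` would give a compact view of which blocks and files each record contains.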