# Submit JARVIS to the MDF Index
Downloads the latest copies of the JARVIS data and submits them to the MDF search index. The notebook first modifies the JSON files stored with each JARVIS calculation into a form compatible with the MDF index, which requires standardizing data types, adding some new fields, and mapping fields in the JARVIS data schema onto ones defined in the MDF schema. Once that is complete, the notebook submits the data to MDF Connect.

This notebook is designed to be run on Linux with Globus Connect Personal installed.

In [1]:
from mdf_connect_client import MDFConnectClient
from monty.json import MontyEncoder, MontyDecoder
from monty.serialization import loadfn, dumpfn
from tqdm import tqdm_notebook as tqdm
from glob import glob
import requests
import shutil
import json
import math
import cgi
import sys
import os

Proceeding without mdf_connect_client


Settings to change

In [2]:
with open(os.path.expanduser(os.path.join('~', '.globusonline', 'lta', 'client-id.txt'))) as fp:
    SOURCE_UUID=fp.readline().strip()
print('My Endpoint Name:', SOURCE_UUID)
metadata_path = os.path.join(os.getcwd(), 'feedstock')

My Endpoint Name: 757daaa4-e697-11e8-8c9a-0a1d4c5c824a


URLs for the different data objects

In [3]:
data_urls = [
    'https://ndownloader.figshare.com/files/12468083',
    'https://ndownloader.figshare.com/files/12394556'
]

## Download the Data
Get the data from Figshare, if it hasn't been downloaded already

In [4]:
os.makedirs('data', exist_ok=True)

In [5]:
filenames = []
for url in data_urls:
    # Get file info from Figshare (stream=True defers downloading the body)
    req = requests.get(url, stream=True)
    req.raise_for_status()
    
    # Get the file name
    filename = cgi.parse_header(req.headers['Content-Disposition'])[1]['filename']
    data_path = os.path.join('data', filename)
    filenames.append(data_path)
    
    # Check if the file is already downloaded, and skip it if so
    if os.path.isfile(data_path):
        continue
        
    # If not, download the file
    with open(data_path, 'wb') as fp:
        for chunk in req.iter_content(chunk_size=1024):
            fp.write(chunk)
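
The `cgi` module used above to read the file name was removed from the standard library in Python 3.13. A minimal sketch of one alternative, assuming the server sends a plain `filename` parameter in its `Content-Disposition` header, uses `email.message.Message` instead. The helper name `filename_from_content_disposition` and the example header value are made up for illustration and are not part of the original notebook.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition header, or None."""
    msg = Message()
    msg['Content-Disposition'] = header_value
    return msg.get_filename()


# Illustrative header value, not an actual Figshare response
example_header = 'attachment; filename="jdft_3d-7-7-2018.json"'
print(filename_from_content_disposition(example_header))
```

With this helper, the lookup in the download loop would become `filename_from_content_disposition(req.headers['Content-Disposition'])`.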

## Extract Additional Information Out of Records
Records need some additional tuning to be the most useful for the MDF index

### Display an Example Record
Get an idea of what we are looking at

In [6]:
with open(filenames[0], 'r') as fp:
    ex_record = json.load(fp)[0]
    print(json.dumps(ex_record, indent=2))

{
  "gv": 71.22,
  "mpid": "mp-1006883",
  "encut": 600,
  "icsd": "[187983]",
  "form_enp": 0.161,
  "final_str": {
    "lattice": {
      "a": 2.8018838472201355,
      "c": 2.8018838472201355,
      "b": 2.8018838472201355,
      "matrix": [
        [
          2.8018838472201355,
          -0.0,
          0.0
        ],
        [
          -0.0,
          2.8018838472201355,
          0.0
        ],
        [
          -0.0,
          -0.0,
          2.8018838472201355
        ]
      ],
      "volume": 21.996337903898066,
      "beta": 90.0,
      "gamma": 90.0,
      "alpha": 90.0
    },
    "sites": [
      {
        "properties": {
          "velocities": [
            0.0,
            0.0,
            0.0
          ]
        },
        "abc": [
          0.5,
          0.5,
          0.5
        ],
        "xyz": [
          1.4009419236100678,
          1.4009419236100678,
          1.4009419236100678
        ],
        "label": "Co",
        "species": [
          {
        

### Build Mapping and Modify Records
The MDF can leverage some metadata that is not explicitly defined in the JARVIS metadata but can easily be derived from it. We also need to make the data types in some fields consistent so that they work well with Globus Search. Finally, we need to establish a mapping between the JARVIS and MDF schemas.

In [7]:
def standardize_field(d, field, convert):
    """Convert a field in a dictionary to a certain type. If it doesn't convert, remove it
    
    Also makes sure the result is finite (JSON does not support NaNs or infs)
    
    Args:
        d (dict): Dictionary to convert
        field ([string]): Path to field to convert
        convert (function): Conversion function
    """
    
    # If the first field isn't there, skip
    if field[0] not in d:
        return
    
    if len(field) == 1:
        # Attempt to convert the field
        try:
            new_value = convert(d[field[0]])
            
            # Check if the value is finite
            if isinstance(new_value, (float, int)) and not math.isfinite(new_value):
                del d[field[0]]
            else:
                d[field[0]] = new_value
        except Exception:
            # Conversion failed, so remove the field
            del d[field[0]]
    else:
        # Step a level down
        standardize_field(d[field[0]], field[1:], convert)


# Quick check: a nested string value is converted to an int
d = {'a': {'b': '1'}}
standardize_field(d, ['a', 'b'], int)
assert isinstance(d['a']['b'], int)
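
The finiteness check matters because of how Python's `json` module treats non-finite numbers. The sketch below uses a made-up dictionary (the key names are borrowed from the JARVIS records, the NaN and infinity values are invented) to show that the default encoder emits the non-standard tokens `NaN` and `Infinity`, and that `allow_nan=False` rejects them instead.

```python
import json
import math

# Illustrative values only; the NaN and inf entries are invented for this example
values = {'gv': 71.22, 'kv': float('nan'), 'mbj_gap': float('inf')}

# The default encoder writes NaN and Infinity, which are not valid JSON
print(json.dumps(values))

# With allow_nan=False the encoder raises instead of writing them
try:
    json.dumps(values, allow_nan=False)
except ValueError as err:
    print('Rejected:', err)

# Dropping non-finite numbers first gives strictly valid JSON
finite_only = {k: v for k, v in values.items() if math.isfinite(v)}
strict_json = json.dumps(finite_only, allow_nan=False)
print(strict_json)
```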

Make a list of mappings for existing JARVIS fields. `__custom` fields will be mapped to the name "jarvis". 

In [8]:
mapping = {
    '__custom.id': 'jid',
    '__custom.crossreference.materials_project': 'mpid',
    'crystal_structure.volume': 'final_str.volume',
    'dft.cutoff_energy': 'incar.ENCUT',
    '__custom.bandgap.mbj': 'mbj_gap',
    '__custom.bandgap.optb88vdw': 'op_gap',
    '__custom.formation_enthalpy': 'form_enp',
    '__custom.elastic_moduli.bulk': 'kv',
    '__custom.elastic_moduli.shear': 'gv',
    '__custom.dimensionality': 'dimensionality',
    '__custom.total_energy': 'fin_en',
    '__custom.landing_page': 'landing_page'
}
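
Each value in the mapping is a dotted path into a JARVIS record, so `'incar.ENCUT'` refers to the `ENCUT` entry of the nested `incar` dictionary. As a rough illustration, the hypothetical helper `get_by_path` below (not part of the notebook) follows such a path through a toy record.

```python
def get_by_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = record
    for key in dotted_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Toy record containing only the fields needed for the illustration
toy_record = {'jid': 'JVASP-60593', 'incar': {'ENCUT': 750.0}}

print(get_by_path(toy_record, 'jid'))
print(get_by_path(toy_record, 'incar.ENCUT'))
print(get_by_path(toy_record, 'final_str.volume'))
```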

In [9]:
def construct_mdf_record(d, is_2d):
    """Given a JSON object from JARVIS, extract some extra data and ensure mapped fields are all the same type
    
    The record is modified in place. Schema is designed to be similar to that of the OQMD.
    
    Args:
        d (dict): Record to be converted
        is_2d (bool): Whether we are converting the 2D dataset
    """

    # Records that are used multiple times
    comp = d['initial_str'].composition.get_integer_formula_and_factor()[0]
    
    # Make sure all the mappings are consistent
    for m in mapping.values():
        if m not in ['jid', 'mpid', 'final_str.volume']:  # Do not require retouching
            standardize_field(d, m.split("."), float)
            
    # Add in the dimensionality
    d['dimensionality'] = '2d' if is_2d else '3d'
    d['landing_page'] = "https://www.ctcms.nist.gov/~knc6/jsmol/%s.html"%d['jid']
    
    
    # Add additional metadata for the MDF
    d['mdf'] = {
        "dft": {
            "converged": True,
        },
        "material": {
            "composition": comp,
            "elements": [e.symbol for e in d['initial_str'].composition.element_composition.keys()]
        },
        "crystal_structure": {
            "space_group_number": d['final_str'].get_space_group_info()[1],
            "number_of_atoms": len(d['final_str']),
        },
        "origin": {
            "name": "VASP",
            "creator": "University of Vienna",
            "type": "computation"
        },
    }

Process all of the data

In [10]:
if os.path.isdir('feedstock'):
    shutil.rmtree('feedstock')
os.mkdir('feedstock')

In [11]:
jarvis_keys = set()
for f in filenames:
    # Get the data
    records = loadfn(f, cls=MontyDecoder)
    
    # Convert each record in place
    for record in tqdm(records, leave=True, desc=f):
        # Convert JARVIS json into MDF-friendly JSON
        construct_mdf_record(record, '2d' in f)

    # Save the converted records under the same file name
    dumpfn(records, os.path.join('feedstock', os.path.basename(f)))

HBox(children=(IntProgress(value=0, description='data/jdft_3d-7-7-2018.json', max=25923, style=ProgressStyle(d…


HBox(children=(IntProgress(value=0, description='data/jdft_2d-7-7-2018.json', max=636, style=ProgressStyle(des…




Display an example record

In [12]:
json.loads(MontyEncoder().encode(record))

{'magmom': {'magmom_out': 0.9391109, 'magmom_osz': 0.9391},
 'fin_en': -13.562224,
 'op_gap': 1.4097,
 'final_str': {'@module': 'pymatgen.core.structure',
  '@class': 'Structure',
  'charge': None,
  'lattice': {'matrix': [[-2.811747146991647, -0.0004086778517834, -0.170584],
    [1.4000491144005003, 2.4380958557037666, 0.175588],
    [0.0, 0.0, -21.938889]],
   'a': 2.816916947069837,
   'b': 2.8169629159925957,
   'c': 21.938889,
   'alpha': 93.57369895805945,
   'beta': 86.5282161513322,
   'gamma': 119.99964743671218,
   'volume': 150.38531191412963},
  'sites': [{'species': [{'element': 'Co', 'occu': 1}],
    'abc': [0.003714832584933, 0.5032219446808739, 0.0430329307216297],
    'xyz': [0.6940902680751181, 1.2269018196558281, -0.8563686466255668],
    'label': 'Co',
    'properties': {'velocities': [0.0, 0.0, 0.0]}},
   {'species': [{'element': 'O', 'occu': 1}],
    'abc': [0.6834181623962591, 0.8223516869688096, 0.0003353489187117],
    'xyz': [-0.7702663172535162, 2.00469294206

### Create the Mapping
Add mapping entries for the fields we added to each record, all of which are under the `mdf` key.

Add the keys that map directly to the MDF search fields

In [13]:
mapping

{'__custom.id': 'jid',
 '__custom.crossreference.materials_project': 'mpid',
 'crystal_structure.volume': 'final_str.volume',
 'dft.cutoff_energy': 'incar.ENCUT',
 '__custom.bandgap.mbj': 'mbj_gap',
 '__custom.bandgap.optb88vdw': 'op_gap',
 '__custom.formation_enthalpy': 'form_enp',
 '__custom.elastic_moduli.bulk': 'kv',
 '__custom.elastic_moduli.shear': 'gv',
 '__custom.dimensionality': 'dimensionality',
 '__custom.total_energy': 'fin_en',
 '__custom.landing_page': 'landing_page'}

In [14]:
for k in record['mdf'].keys():
    for k2 in record['mdf'][k].keys():
        x = '.'.join([k, k2])
        mapping[x] = 'mdf.' + x
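
The same flattening can be written as a dictionary comprehension. The sketch below runs on a toy `mdf` block rather than a real record, and the names `toy_mdf` and `toy_mapping` are placeholders for illustration.

```python
# Toy stand-in for record['mdf'], with only two sub-blocks
toy_mdf = {
    'dft': {'converged': True},
    'material': {'composition': 'CoO2', 'elements': ['Co', 'O']},
}

# Start from one existing entry, then add one entry per nested field
toy_mapping = {'__custom.id': 'jid'}
toy_mapping.update({
    f'{block}.{field}': f'mdf.{block}.{field}'
    for block, fields in toy_mdf.items()
    for field in fields
})

for out_field, in_field in toy_mapping.items():
    print(out_field, '<-', in_field)
```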

In [15]:
mapping

{'__custom.id': 'jid',
 '__custom.crossreference.materials_project': 'mpid',
 'crystal_structure.volume': 'final_str.volume',
 'dft.cutoff_energy': 'incar.ENCUT',
 '__custom.bandgap.mbj': 'mbj_gap',
 '__custom.bandgap.optb88vdw': 'op_gap',
 '__custom.formation_enthalpy': 'form_enp',
 '__custom.elastic_moduli.bulk': 'kv',
 '__custom.elastic_moduli.shear': 'gv',
 '__custom.dimensionality': 'dimensionality',
 '__custom.total_energy': 'fin_en',
 '__custom.landing_page': 'landing_page',
 'dft.converged': 'mdf.dft.converged',
 'material.composition': 'mdf.material.composition',
 'material.elements': 'mdf.material.elements',
 'crystal_structure.space_group_number': 'mdf.crystal_structure.space_group_number',
 'crystal_structure.number_of_atoms': 'mdf.crystal_structure.number_of_atoms',
 'origin.name': 'mdf.origin.name',
 'origin.creator': 'mdf.origin.creator',
 'origin.type': 'mdf.origin.type'}

Make the descriptions for the JARVIS-specific fields

In [16]:
custom_desc = {
    '__custom.bandgap': 'Band gap energies (eV)',
    '__custom.crossreference': 'Cross-references to other DFT databases',
    '__custom.dimensionality': 'Dimensionality of the structure',
    '__custom.elastic_moduli': 'Elastic moduli (GPa)',
    '__custom.formation_enthalpy': 'Formation enthalpy (eV/atom)',
    '__custom.id': 'ID number in the JARVIS database',
    '__custom.landing_page': 'URL of the landing page on the JARVIS website',
    '__custom.total_energy': 'Total energy of the structure (eV/atom)'
}
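
Because the description keys are typed by hand, it can be worth checking that each one corresponds to a field (or a group of fields) in the mapping. The following is a minimal sketch of such a check on toy dictionaries; the helper `unmatched_descriptions` and the key `'__custom.not_in_mapping'` are hypothetical and only serve the illustration.

```python
def unmatched_descriptions(descriptions, mapping):
    """Return description keys that match no mapping key, exactly or as a dotted prefix."""
    unmatched = []
    for desc_key in descriptions:
        has_match = any(
            out_field == desc_key or out_field.startswith(desc_key + '.')
            for out_field in mapping
        )
        if not has_match:
            unmatched.append(desc_key)
    return unmatched


# Toy subsets, with one deliberately wrong description key
toy_mapping = {'__custom.id': 'jid', '__custom.elastic_moduli.bulk': 'kv'}
toy_desc = {
    '__custom.id': 'ID number',
    '__custom.elastic_moduli': 'Elastic moduli (GPa)',
    '__custom.not_in_mapping': 'Hypothetical key with no mapped field',
}

print(unmatched_descriptions(toy_desc, toy_mapping))
```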

### Test the Mapping
Check it out to make sure it works

In [17]:
record = json.loads(MontyEncoder().encode(record))

In [18]:
def apply_mapping(data, mapping):
    """Given a dictionary, return the MDF-formatted mapping"""
    
    output = {}
    for out_field, in_field in mapping.items():
        # Get the value from the input data
        fields = in_field.split(".")
        current_rec = data
        for field in fields[:-1]:
            current_rec = current_rec.get(field, {})
        val = current_rec.get(fields[-1])
        if val is None:
            continue
        
        # Add it to the output structure
        fields = out_field.split(".")
        current_rec = output
        for field in fields[:-1]:
            if field not in current_rec:
                current_rec[field] = {}
            current_rec = current_rec[field]
        current_rec[fields[-1]] = val
    return output

In [19]:
print(json.dumps(apply_mapping(record, mapping), indent=2))

{
  "__custom": {
    "id": "JVASP-60593",
    "crossreference": {
      "materials_project": "mp-782689"
    },
    "bandgap": {
      "optb88vdw": 1.4097
    },
    "formation_enthalpy": -0.925,
    "dimensionality": "2d",
    "total_energy": -13.562224,
    "landing_page": "https://www.ctcms.nist.gov/~knc6/jsmol/JVASP-60593.html"
  },
  "dft": {
    "cutoff_energy": 750.0,
    "converged": true
  },
  "material": {
    "composition": "CoO2",
    "elements": [
      "Co",
      "O"
    ]
  },
  "crystal_structure": {
    "space_group_number": 12,
    "number_of_atoms": 3
  },
  "origin": {
    "name": "VASP",
    "creator": "University of Vienna",
    "type": "computation"
  }
}


## Submit the Data to the MDF
We will use the MDF Connect Client to describe the dataset (e.g., who made it and where it is located) and send it to MDF Connect with Globus.

In [20]:
client = MDFConnectClient()

In [21]:
client.set_source_name('jarvis')

In [22]:
client.create_dc_block(
    title="JARVIS - Joint Automated Repository for Various Integrated Simulations",
    authors=["Choudhary, Kamal", "Kalish, Irena", "Beams, Ryan", "Tavazza, Francesca"],
    affiliations="National Institute of Standards and Technology",
    publisher='Figshare',
    publication_year=2017,
    related_dois=['10.1038/s41598-017-05402-0'],
    description="JARVIS (Joint Automated Repository for Various Integrated Simulations) is a repository designed to automate materials discovery using classical force-field, density functional theory, machine learning calculations and experiments. The Force-field section of JARVIS (JARVIS-FF) consists of thousands of automated LAMMPS based force-field calculations on DFT geometries. Some of the properties included in JARVIS-FF are energetics, elastic constants, surface energies, defect formations energies and phonon frequencies of materials. The Density functional theory section of JARVIS (JARVIS-DFT) consists of thousands of VASP based calculations for 3D-bulk, single layer (2D), nanowire (1D) and molecular (0D) systems. Most of the calculations are carried out with optB88vDW functional. JARVIS-DFT includes materials data such as: energetics, diffraction pattern, radial distribution function, band-structure, density of states, carrier effective mass, temperature and carrier concentration dependent thermoelectric properties, elastic constants and gamma-point phonons. The Machine-learning section of JARVIS (JARVIS-ML) consists of machine learning prediction tools, trained on JARVIS-DFT data. Some of the ML-predictions focus on energetics, heat of formation, GGA/METAGGA bandgaps, bulk and shear modulus."
)

In [23]:
client.add_index('json', mapping)

In [24]:
client.set_custom_descriptions(custom_desc)

In [25]:
client.add_data('globus://{}{}/'.format(SOURCE_UUID, os.path.abspath('feedstock')))
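
The data location passed to `add_data` is the endpoint UUID followed directly by the absolute path of the feedstock directory. As a small illustration of the resulting string, the hypothetical helper `globus_url` below builds it from the endpoint ID printed earlier and a made-up path (`/home/user/feedstock` is an assumption, not the path used for the real submission).

```python
def globus_url(endpoint_uuid, absolute_path):
    """Build a globus:// URL for a directory on an endpoint (POSIX-style absolute path)."""
    if not absolute_path.startswith('/'):
        raise ValueError('Expected an absolute POSIX path, got: {}'.format(absolute_path))
    # Normalize to exactly one trailing slash to mark a directory
    return 'globus://{}{}/'.format(endpoint_uuid, absolute_path.rstrip('/'))


# Endpoint ID as printed earlier in the notebook; the path is invented
url = globus_url('757daaa4-e697-11e8-8c9a-0a1d4c5c824a', '/home/user/feedstock')
print(url)
```

The absolute path starts with `/`, which is why no separator is needed between the UUID and the path in the format string.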

Print out the submission, for record keeping

In [26]:
client.get_submission()

{'dc': {'titles': [{'title': 'JARVIS - Joint Automated Repository for Various Integrated Simulations'}],
  'creators': [{'creatorName': 'Choudhary, Kamal',
    'familyName': 'Choudhary',
    'givenName': 'Kamal',
    'affiliations': ['National Institute of Standards and Technology']},
   {'creatorName': 'Kalish, Irena',
    'familyName': 'Kalish',
    'givenName': 'Irena',
    'affiliations': ['National Institute of Standards and Technology']},
   {'creatorName': 'Beams, Ryan',
    'familyName': 'Beams',
    'givenName': 'Ryan',
    'affiliations': ['National Institute of Standards and Technology']},
   {'creatorName': 'Tavazza, Francesca',
    'familyName': 'Tavazza',
    'givenName': 'Francesca',
    'affiliations': ['National Institute of Standards and Technology']}],
  'publisher': 'Figshare',
  'publicationYear': '2017',
  'resourceType': {'resourceTypeGeneral': 'Dataset',
   'resourceType': 'Dataset'},
  'descriptions': [{'description': 'JARVIS (Joint Automated Repository for Var

Send it in to MDF

In [27]:
client.submit_dataset()

{'source_id': 'jarvis_v1-1', 'success': True, 'error': None}

In [29]:
client.check_status()


Status of convert submission jarvis_v1-1 (JARVIS - Joint Automated Repository for Various Integrated Simulations)
Submitted by Logan Ward at 2018-11-28T16:16:37.234359Z

Conversion initialization was successful.
Conversion data download was successful: 2 files will be processed (0 archives extracted).
Data conversion was successful: 26559 records parsed out of 2 groups.
Ingestion preparation was successful.
Ingestion initialization was successful.
Ingestion data download was successful.
Integration data download was not requested or required.
Globus Search ingestion was successful.
Globus Publish publication was not requested or required.
Citrine upload was not requested or required.
Materials Resource Registration was not requested or required.
Post-processing cleanup was successful.

