
More advanced MPDS API usage: unusual materials phases from machine learning
==========

- Complexity level: green karate belt

Here we search MPDS for the "unusual" materials phases, _i.e._ those that have extreme values of more than one physical property. _Extreme_ in this context means close to either of the prediction boundaries, minimum or maximum.

For instance, a crystal with very low Debye temperature, very low enthalpy of formation, very high linear thermal expansion coefficient _etc._ would match. Such cases certainly deserve attention, so let's list them.


In [None]:
!pip install "mpds_client>=0.0.17"

In [None]:
from __future__ import division
import time
import random
import threading

from mpds_client import MPDSDataRetrieval, MPDSDataTypes

ml_data_bounds = {
    'isothermal bulk modulus': [5, 265],
    'enthalpy of formation': [-325, 0],
    'heat capacity at constant pressure': [11, 28],
    'Seebeck coefficient': [-150, 225],
    'values of electronic band gap': [0.5, 10], # NB stands for both direct + indirect gaps
    'temperature for congruent melting': [300, 2700],
    'Debye temperature': [175, 1100],
    'linear thermal expansion coefficient': [1.0E-06, 9.5E-05]
}
bound_tolerance_factor = 12


What's the `bound_tolerance_factor`? For each machine-learning property we divide the entire range of values (_e.g._ from 0 to 1000 THz) into this number of equal segments. Then we take the first and the last segment. Entries with property values in these two segments are considered **extreme** and kept for later matching with each other.
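
As a rough illustration of this rule, the sketch below computes the two cut-off values for a single property. The helper name `extreme_thresholds` is hypothetical and is not part of `mpds_client`; the bounds are the Debye temperature values from the cell above.

```python
def extreme_thresholds(lower, upper, factor):
    """Return (low_cut, high_cut): values below low_cut or above high_cut
    fall into the first or the last of `factor` equal segments."""
    margin = (upper - lower) / float(factor)  # width of one segment
    return lower + margin, upper - margin

# Debye temperature bounds [175, 1100] and the factor of 12 used above
low_cut, high_cut = extreme_thresholds(175, 1100, 12)
print("extreme if below %.2f or above %.2f" % (low_cut, high_cut))
```

A larger factor gives narrower first and last segments, so fewer entries qualify as extreme.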



Copy and paste your [MPDS API key](https://mpds.io/open-data-api) in the next cell, then execute. Note, if the key isn't valid, the API returns an HTTP error `403`.

Please, make sure not to expose your MPDS key publicly.
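
One common way to keep a key out of a shared notebook is to read it from an environment variable instead of typing it into a cell. The sketch below is only an assumption about how this could be done: the helper `load_api_key` is hypothetical, and the variable name `MPDS_KEY` is chosen here for illustration.

```python
import os

def load_api_key(env=os.environ, var_name='MPDS_KEY'):
    """Return the key stored in the environment variable `var_name`,
    or raise a readable error if it is missing or empty."""
    key = env.get(var_name, '').strip()
    if not key:
        raise RuntimeError("Please set the %s environment variable" % var_name)
    return key

# Usage with a stand-in mapping, so that no real key appears in the notebook
print(load_api_key(env={'MPDS_KEY': 'dummy-key-for-illustration'}))
```

With such a helper, the cell defining `API_KEY` could read `API_KEY = load_api_key()`.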


In [None]:
API_KEY = 'YOUR_MPDS_API_KEY_GOES_HERE'

In [None]:
extremes, extremes_intersects = {}, {}

def mpds_download_worker(prop, min_bound, max_bound):
    '''
    A parallel download worker
    '''
    print(" Starting with %s" % prop)

    MPDSDataRetrieval.chilouttime = random.randint(3, 4) # please, do not use values < 2
    client = MPDSDataRetrieval(dtype=MPDSDataTypes.MACHINE_LEARNING, api_key=API_KEY)

    min_entries, max_entries = [], []

    for item in client.get_data({"props": prop}, fields={'P':[
        'sample.material.entry',
        'sample.material.phase_id',
        'sample.material.chemical_formula',
        'sample.measurement[0].property.scalar'
    ]}):
        if item[3] < min_bound:
            min_entries.append(item)

        elif item[3] > max_bound:
            max_entries.append(item)

    for item in list(min_entries) + list(max_entries):

        keep_info = [prop, item[0]] + item[2:]

        if item[1] in extremes:
            extremes_intersects.setdefault(item[1], []).append(keep_info)

        else:
            extremes[item[1]] = keep_info

Below is the most time-consuming step. We need to scan all the machine-learning data. Fetching all the entries for one property takes about 15 minutes, so processing the eight properties one after another would take about 2 hours. Fortunately, we can parallelize the data extraction by running one **mpds_download_worker** per property, so the total running time will be about half an hour. A reader may grab a cup of tea or coffee (a timely hydration is important).
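
The thread-per-property pattern can be tried out without touching the network. The sketch below is a minimal illustration under stated assumptions: `fake_download_worker` and `FAKE_DATA` are hypothetical stand-ins for the real download, and the `threading.Lock` around the shared dictionaries is an extra precaution that the worker in this notebook does not take.

```python
import threading

results, intersects = {}, {}
results_lock = threading.Lock()  # guards both shared dictionaries

# Hypothetical stand-in data: property -> list of (phase_id, value)
FAKE_DATA = {
    'property A': [(1, 0.1), (2, 0.9)],
    'property B': [(2, 5.0), (3, 7.0)],
}

def fake_download_worker(prop):
    """Pretend to download the entries for `prop` and file them by phase."""
    for phase_id, value in FAKE_DATA[prop]:
        with results_lock:  # make the check-then-insert step atomic
            if phase_id in results:
                intersects.setdefault(phase_id, []).append((prop, value))
            else:
                results[phase_id] = (prop, value)

threads = [threading.Thread(target=fake_download_worker, args=(prop,))
           for prop in FAKE_DATA]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait until every worker has finished

print(sorted(intersects))  # phase ids seen for more than one property
```

Whichever worker runs first, phase `2` is the only one reported by both properties, so it is the only key in `intersects`.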

In [None]:
start_time = time.time()
threads = []
for key in ml_data_bounds:

    # adjust bounds to match extreme entries
    margin = (ml_data_bounds[key][1] - ml_data_bounds[key][0]) / bound_tolerance_factor
    ml_data_bounds[key] = [ml_data_bounds[key][0] + margin, ml_data_bounds[key][1] - margin]

    # run in parallel, although avoiding too frequent requests
    thread = threading.Thread(target=mpds_download_worker, args=[key] + ml_data_bounds[key])
    time.sleep(random.randint(1, 5))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

for phase_id in extremes_intersects:
    extremes_intersects[phase_id].append(extremes[phase_id])

for phase_id in sorted(extremes_intersects.keys()):

    print("*" * 30 + " Distinct phase https://mpds.io/#phase_id/%s " % phase_id + "*" * 30)

    for card in extremes_intersects[phase_id]:
        # card is [property name, entry, chemical formula, predicted value]
        print("%s (%s) %s = %s" % (card[1], card[0], card[2], card[3]))

print("Done in %1.2f sec" % (time.time() - start_time))


Were you able to follow everything? Please, try to answer:
- How is the value of `bound_tolerance_factor` connected with the total number of results?
- How could one obtain the particular crystalline structures for these results?
- How could one in principle verify these results?
