## Prediction of Density of States (DOS) using Partial Radial Distribution Function (PRDF) 

We want to study the accuracy and runtime of the featurizations used in the [Schütt et al. paper](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.89.205118). In part 1, we load inorganic crystalline compounds with at most 6 atoms per unit cell and compute their features.

### Importing packages 

In [1]:
import numpy as np
import pandas as pd
import pymatgen as pmg
from tqdm import tqdm_notebook as tqdm

from pymatgen.core.molecular_orbitals import MolecularOrbitals
from pymatgen import MPRester
from pymatgen.ext.matproj import MPRestError

from matminer.data_retrieval.retrieve_MP import MPDataRetrieval
from matminer.utils.conversions import dict_to_object, str_to_composition
from matminer.featurizers.composition import AtomicOrbitals
from matminer.featurizers.structure import PartialRadialDistributionFunction 

from matminer.utils.data_files.deml_elementdata import atom_num

numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88



### Loading dataset 

In [2]:
mp = MPDataRetrieval(api_key='T6QzrvW8J07u4L2O')

Retrieve all entries with at most 6 atoms (sites) per primitive unit cell

In [3]:
%%time
data = mp.get_dataframe(criteria={"nsites": {"$lte": 6}},
                        properties=["pretty_formula", "structure"])
print ("Shape of retrieved data: ", data.shape)

Shape of retrieved data:  (16590, 2)
CPU times: user 1.46 s, sys: 86 ms, total: 1.55 s
Wall time: 18.4 s


In [4]:
data.head(1)

| material_id | pretty_formula | structure |
| --- | --- | --- |
| mp-85 | In | {'@module': 'pymatgen.core.structure', '@class... |


In [5]:
data.reset_index(inplace=True)

Convert structure to pymatgen structure object

In [6]:
data['structure_obj'] = dict_to_object(data['structure'])

Convert formula to pymatgen composition object

In [7]:
data['composition_obj'] = str_to_composition(data['pretty_formula'])

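Find the largest atomic number in each compound and use it to label the occupied orbitals (`sp` or `spd`). Compounds containing f-orbital elements are set to NaN.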

In [8]:
data['max_atom_num'] = data['composition_obj'].apply(lambda x: max(atom_num[str(i)] for i in x))

In [9]:
def orbital_partition(x):
    if (x <= 20):
        return 'sp'
    elif (x > 20 and x < 70):
        return 'spd'
    else:
        return np.nan
    
data['max_orbital'] = data['max_atom_num'].apply(orbital_partition)

Drop compounds containing f-orbital elements

In [10]:
data.dropna(subset=['max_orbital'], inplace=True)

In [11]:
print ("Shape of data: ", data.shape)

Shape of data:  (10191, 7)
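
The atomic-number thresholds used above can be checked on a few familiar compounds. This is a minimal, self-contained sketch: the `ATOMIC_NUMBER` dictionary and the `max_orbital` helper are hypothetical stand-ins for matminer's `atom_num` table and the `apply` calls in the notebook.

```python
import math

# Hypothetical lookup holding only the elements needed for this illustration
ATOMIC_NUMBER = {"Na": 11, "Cl": 17, "O": 8, "Fe": 26, "U": 92}

def orbital_partition(z):
    """Same thresholds as the notebook cell: Z <= 20 -> 'sp', 20 < Z < 70 -> 'spd', else NaN."""
    if z <= 20:
        return "sp"
    elif z < 70:
        return "spd"
    return float("nan")

def max_orbital(elements):
    """Classify a compound by the heaviest element it contains."""
    return orbital_partition(max(ATOMIC_NUMBER[el] for el in elements))

examples = {"NaCl": ["Na", "Cl"], "Fe2O3": ["Fe", "O"], "UO2": ["U", "O"]}
labels = {name: max_orbital(els) for name, els in examples.items()}
print(labels)
```

Because the label is NaN for the heaviest elements, a later `dropna` on that column removes those compounds.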


Get DOS data of materials using MPRester

In [12]:
%%time
mprester = MPRester(api_key='T6QzrvW8J07u4L2O')
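
def get_dos(material_id, max_tries=5):
    """Fetch the DOS for one material, retrying on errors; return NaN if unavailable."""
    for _ in range(max_tries):
        try:
            return mprester.get_dos_by_material_id(material_id)
        except MPRestError as e:
            # No DOS is stored for this material, so retrying is pointless
            if str(e).startswith('dos not available'):
                return np.nan
    # Every attempt failed
    return np.nan
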
data['dos_obj'] = [get_dos(x) for x in tqdm(data['material_id'])]
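
The retry pattern used for the DOS download can be exercised without network access. The sketch below is an illustration only: `TransientError`, `fetch_with_retry`, `flaky_fetch`, and `always_fail` are hypothetical names, not part of pymatgen or matminer.

```python
import math

class TransientError(Exception):
    """Hypothetical stand-in for a recoverable API error."""

def fetch_with_retry(fetch, key, max_tries=5):
    """Call fetch(key) up to max_tries times; return NaN if every attempt fails."""
    for _ in range(max_tries):
        try:
            return fetch(key)
        except TransientError:
            continue  # try again on the next loop iteration
    return float("nan")

# Fake fetcher that fails twice before succeeding
calls = {"n": 0}

def flaky_fetch(key):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary failure")
    return "dos-for-" + key

def always_fail(key):
    raise TransientError("service down")

result = fetch_with_retry(flaky_fetch, "mp-85")
failed = fetch_with_retry(always_fail, "mp-85")
print(result, failed)
```

Returning NaN instead of raising keeps the list comprehension running over all materials, and the missing entries can be dropped afterwards.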

Drop entries without a DOS object

In [13]:
data = data.dropna(subset=['dos_obj']).reset_index(drop=True)
print ("Shape of data: ", data.shape)

Shape of data:  (6174, 8)


Compute the DOS at the Fermi level, using a small helper function in place of matminer's DOSFeaturizer

In [14]:
def compute_dos(dos):
    """Return the total DOS at the Fermi level in states/eV/unit cell, or NaN on failure."""
    try:
        total_density = sum(dos.densities.values())  # sum over both spins, if present
        min_index = np.argmin(abs(dos.energies - dos.efermi))  # grid point closest to the Fermi level
        return total_density[min_index]
    except Exception:
        return np.nan

In [15]:
data['dos'] = data['dos_obj'].apply(compute_dos)

`compute_dos` returns the DOS in states/eV/unit cell. Here, we divide it by the volume of the structure to get states/eV/Å$^3$.

In [16]:
data['volume'] = data['structure_obj'].apply(lambda x: x.volume)
data['dos'] = np.true_divide(data['dos'], data['volume'])
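
To make the two steps concrete (pick the energy grid point nearest the Fermi level, then normalize by volume), here is a small sketch on synthetic numbers. The energies, densities, Fermi level, and volume are all made up for illustration and do not come from the Materials Project.

```python
import numpy as np

# Synthetic data (assumed): energies in eV, one density array per spin channel
energies = np.array([-2.0, -1.0, 0.1, 1.0, 2.0])
densities = {
    "up": np.array([0.5, 1.0, 1.5, 1.0, 0.5]),
    "down": np.array([0.4, 0.9, 1.3, 0.8, 0.3]),
}
efermi = 0.0
volume = 20.0  # cell volume in Å^3

total_density = sum(densities.values())              # sum over both spin channels
fermi_index = np.argmin(np.abs(energies - efermi))   # grid point closest to the Fermi level
dos_per_cell = total_density[fermi_index]            # states/eV/unit cell
dos_per_volume = dos_per_cell / volume               # states/eV/Å^3
print(fermi_index, dos_per_cell, dos_per_volume)
```

The grid point at 0.1 eV is the closest to the Fermi level, so the two spin densities at that index are added and the sum is divided by the volume.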

### Compute representation 

In [17]:
cutoff, bin_size = 10.0, 2.0
prdf = PartialRadialDistributionFunction(cutoff=cutoff, bin_size=bin_size)

In [18]:
prdf.fit(data['structure_obj'].tolist())


No electronegativity for He. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for Ne. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for Ar. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.



PartialRadialDistributionFunction(bin_size=2.0, cutoff=10.0, exclude_elems=[],
                 include_elems=[])
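
To give a feel for what `cutoff` and `bin_size` control, the sketch below bins pair distances by element pair for a toy, non-periodic set of atoms. It is not matminer's implementation: a real PRDF also accounts for periodic images and normalization, both omitted here, and the `pair_histogram` helper and coordinates are assumptions made for illustration.

```python
import numpy as np

cutoff, bin_size = 10.0, 2.0
bin_edges = np.arange(0.0, cutoff + bin_size, bin_size)  # edges at 0, 2, 4, 6, 8, 10 Å
n_bins = len(bin_edges) - 1

# Toy, non-periodic "structure": species labels and Cartesian coordinates in Å
species = np.array(["Na", "Cl", "Cl"])
coords = np.array([[0.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0]])

def pair_histogram(a, b):
    """Count b-type neighbours around a-type atoms in each radial bin."""
    idx_a = np.where(species == a)[0]
    idx_b = np.where(species == b)[0]
    dists = [np.linalg.norm(coords[i] - coords[j])
             for i in idx_a for j in idx_b if i != j]
    counts, _ = np.histogram(dists, bins=bin_edges)
    return counts

na_cl = pair_histogram("Na", "Cl")
cl_cl = pair_histogram("Cl", "Cl")
print(n_bins, na_cl, cl_cl)
```

With a 10 Å cutoff and 2 Å bins there are five radial bins per element pair, so a coarser `bin_size` trades radial resolution for a shorter feature vector.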

In [19]:
%%time
data = prdf.featurize_dataframe(data, col_id='structure_obj', ignore_errors=True)



CPU times: user 47.9 s, sys: 28.7 s, total: 1min 16s
Wall time: 2min 2s


#### Save featurized data as pickle file

In [20]:
data.to_pickle('./schutt_cutoff%s_binsize%s.pkl'%(int(cutoff), int(bin_size*10)))
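
The file name encodes the cutoff and ten times the bin size, so the settings above give `schutt_cutoff10_binsize20.pkl`. As a hedged sketch of how the file could be read back later, the example below round-trips a toy DataFrame through a temporary directory. The toy data is made up, and `pd.read_pickle` is simply the standard pandas counterpart to `to_pickle`.

```python
import os
import tempfile

import pandas as pd

cutoff, bin_size = 10.0, 2.0
filename = 'schutt_cutoff%s_binsize%s.pkl' % (int(cutoff), int(bin_size * 10))

# Toy frame standing in for the featurized data
toy = pd.DataFrame({"material_id": ["mp-85"], "dos": [0.01]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, filename)
    toy.to_pickle(path)              # write, as in the notebook
    restored = pd.read_pickle(path)  # read back

print(filename, restored.equals(toy))
```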