## Predicting total energies and formation enthalpies of metal-nonmetal compounds by linear regression 

This notebook featurizes and saves the dataset used in the [Deml et al. paper](https://journals.aps.org/prb/pdf/10.1103/PhysRevB.93.085142). 
If an up-to-date featurized pickle file (`deml_featurized_data.pkl`) already exists, skip this notebook and run the second Jupyter notebook (`Deml_02_prediction.ipynb`) directly.
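
The check for an existing pickle can be made explicit before running the cells below. The following is a minimal sketch, not part of the original workflow: the helper name `needs_featurization` is hypothetical, and it only tests whether the file exists, not whether it is up to date.

```python
import os
import tempfile

def needs_featurization(pickle_path):
    """Return True when the featurized pickle is absent and must be rebuilt."""
    return not os.path.isfile(pickle_path)

# Demonstration in a throwaway directory, so no real data file is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "deml_featurized_data.pkl")
    missing_before = needs_featurization(target)
    open(target, "wb").close()  # empty stand-in for the real pickle
    missing_after = needs_featurization(target)

print(missing_before, missing_after)
```

In the notebook itself, the same helper could be called with `'./deml_featurized_data.pkl'` to decide whether to continue here or move on to `Deml_02_prediction.ipynb`.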


In [93]:
%matplotlib inline
import numpy as np
import pandas as pd
import os
import pymatgen as pmg

from matminer.utils.conversions import str_to_composition, composition_to_oxidcomposition
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf

from pymatgen import MPRester

### Loading the dataset (`oqmd_deml_subset.pkl`)

In [117]:
path = os.path.join(os.getcwd(), "oqmd_deml_subset.pkl")
df = pd.read_pickle(path)

In [118]:
print("Shape of data: ", df.shape)

Shape of data:  (2508, 11)


Drop unnecessary columns

In [97]:
# Pass axis by keyword; the positional form is not accepted by recent pandas versions.
df = df.drop(['band_gap', 'magnetic_moment', 'path', 'stability', 'structure', 'volume_pa', 'is_ICSD', 'structure_obj'], axis=1)

### Part 1: Compute data representation

Compute pymatgen composition from compound formula

In [119]:
df['composition_obj'] = str_to_composition(df['composition'])

Compute ionic states

In [120]:
df['oxidation_states'] = composition_to_oxidcomposition(df['composition_obj'])

Remove compounds that cannot be featurized (the cause has not been identified)

In [121]:
df = df.reset_index(drop=True)

In [122]:
# Drop the problematic rows by index label in one call, then renumber the index.
bad_rows = [952, 1214, 1217, 1311, 1315, 1710, 1963]
df = df.drop(index=bad_rows).reset_index(drop=True)
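
To illustrate what dropping by index label followed by `reset_index(drop=True)` does, here is a small sketch on a toy frame. The formulas and the row labels in `bad_rows` are made up for illustration and are unrelated to the seven rows removed above.

```python
import pandas as pd

# Toy stand-in for the real dataset; the compositions are illustrative only.
toy = pd.DataFrame({"composition": ["LaF3", "NaCl", "MgO", "ZnS", "TiO2"]})

# Labels refer to the current index, so they must be chosen before any reset.
bad_rows = [1, 3]
cleaned = toy.drop(index=bad_rows).reset_index(drop=True)

print(cleaned)
```

After the reset the remaining rows are numbered consecutively from zero, which is why the label list has to refer to the index as it was before the drop.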

Add the finite list of quantitative descriptors from [Deml et al. 2016](https://journals.aps.org/prb/pdf/10.1103/PhysRevB.93.085142)

In [123]:
%%time
ft = MultipleFeaturizer([cf.ElementProperty.from_preset('deml'), 
                         cf.TMetalFraction(),
                         cf.ValenceOrbital()])
df = ft.featurize_dataframe(df, col_id='composition_obj')

CPU times: user 1.13 s, sys: 260 ms, total: 1.39 s
Wall time: 4.19 s


In [124]:
%%time
ft = MultipleFeaturizer([cf.CationProperty.from_preset('deml'),
                         cf.OxidationStates.from_preset('deml'),
                         cf.ElectronAffinity(),
                         cf.ElectronegativityDiff()
                        ])
df = ft.featurize_dataframe(df, col_id='oxidation_states')

CPU times: user 921 ms, sys: 334 ms, total: 1.25 s
Wall time: 2.71 s


Drop the statistics for f-orbital valence electrons

In [137]:
df = df.drop(['frac f valence electrons', 'avg f valence electrons'], axis=1)

Fill in NaN values with zeros

In [138]:
df.fillna(value=0, inplace=True)

Calculate number of atoms in a formula unit

In [139]:
df['num_atoms'] = df['composition_obj'].apply(lambda x: x.num_atoms)

Compute the square root and the inverse of each mean-valued feature.

In [140]:
def inv(x):
    # Return 1/x, mapping division by zero to 0.0 instead of raising.
    try:
        return 1.0 / x
    except ZeroDivisionError:
        return 0.0

In [141]:
col = df.columns
mean_col = []

In [142]:
for i in col:
    if "mean" in i:
        mean_col.append(i)
        df["inverse %s" % i] = df[i].apply(inv)
        df["sqrt %s" % i] = np.sqrt(df[i])
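
The effect of the two transforms can be checked on a toy column. This is a sketch under assumptions: the column name `mean X` and its values are invented, and `safe_inverse` is a hypothetical variant that tests for zero explicitly rather than relying on an exception.

```python
import numpy as np
import pandas as pd

def safe_inverse(x):
    """Return 1/x, or 0.0 when x is zero (explicit check, no exception needed)."""
    return 0.0 if x == 0 else 1.0 / x

# Toy column standing in for one of the "mean ..." features.
toy = pd.DataFrame({"mean X": [4.0, 0.0, 0.25]})
toy["inverse mean X"] = toy["mean X"].apply(safe_inverse)
toy["sqrt mean X"] = np.sqrt(toy["mean X"])

print(toy)
```

Note that the square root of a negative mean value is NaN, so any NaN introduced at this step is not covered by the earlier `fillna` call.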

Compute the products of the primary features (those listed without an asterisk in Deml et al.) and the stoichiometry-weighted mean values. 

In [143]:
primary = ['num_atoms', 'transition metal fraction', 'avg anion electron affinity',
           'avg s valence electrons', 'avg p valence electrons', 
           'avg d valence electrons', 'frac s valence electrons', 
           'frac p valence electrons','frac d valence electrons', ]

In [144]:
for i in primary:
    for j in mean_col:
        df['%s&%s' % (i, j)] = df[i].multiply(df[j])
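
The nested loop adds one new column per (primary, mean) pair, named with the `primary&mean` pattern. A minimal sketch on a toy frame, with invented values and only one primary feature, shows the resulting names and products:

```python
import pandas as pd

# Toy frame: one primary feature and two "mean" features (values are made up).
toy = pd.DataFrame({
    "num_atoms": [4.0, 2.0],
    "mean electric_pol": [1.5, 3.0],
    "mean GGAU_Etot": [-2.0, -1.0],
})

primary_toy = ["num_atoms"]
# Capture the mean columns before adding new ones, as the notebook does.
mean_cols_toy = [c for c in toy.columns if "mean" in c]

for p in primary_toy:
    for m in mean_cols_toy:
        toy["%s&%s" % (p, m)] = toy[p] * toy[m]

print(toy.columns.tolist())
```

With one primary feature and two mean features the frame gains 1 x 2 = 2 columns; in general the loop adds `len(primary) * len(mean_col)` columns.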

Final shape of data

In [145]:
print("Shape of featurized data: ", df.shape)
df.head(1)

Shape of featurized data:  (2501, 378)


| Unnamed: 0 | band_gap | delta_e | magnetic_moment | path | stability | structure | total_energy | volume_pa | is_ICSD | structure_obj | ... | frac d valence electrons&mean electric_pol | frac d valence electrons&mean GGAU_Etot | frac d valence electrons&mean mus_fere | frac d valence electrons&mean FERE correction | frac d valence electrons&mean total_ioniz of cations | frac d valence electrons&mean xtal_field_split of cations | frac d valence electrons&mean magn_moment of cations | frac d valence electrons&mean so_coupling of cations | frac d valence electrons&mean sat_magn of cations | frac d valence electrons&mean EN difference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | -4.479234 | -8.4e-05 | /home/oqmd/libraries/icsd/27089/static | -3.195302 | La F\n 1.0\n7.075548 -0.000022 0.000000\n-3.53... | -6.805786 | 13.0929 | True | [[ 2.00018782e+00 -6.21918359e-06 5.41090000e... | ... | 0.341365 | -0.098333 | -0.091412 | 0.006921 | 143975.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.12 |


Save the featurized data to a pickle file

In [90]:
df.to_pickle('./deml_featurized_data.pkl')
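
Before moving on to the prediction notebook, it can be useful to confirm that a pickled frame reloads unchanged. The sketch below does this on a toy frame in a temporary directory; the file name `toy_featurized.pkl` and the values are illustrative, not the real dataset.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the featurized data.
toy = pd.DataFrame({"delta_e": [-4.479234], "num_atoms": [4.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_featurized.pkl")
    toy.to_pickle(path)              # same call as used for the real data
    restored = pd.read_pickle(path)  # reload while the file still exists

# DataFrame.equals compares shape, dtypes and values.
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```

The same comparison could be applied to `df` and `pd.read_pickle('./deml_featurized_data.pkl')` right after saving.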