## Predicting total energies and enthalpies of formation of metal-nonmetal compounds by linear regression 

This notebook builds and saves the featurized dataset used in the [Deml et al. paper](https://journals.aps.org/prb/pdf/10.1103/PhysRevB.93.085142).


In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import os
import pymatgen as pmg

from mdf_forge.forge import Forge

from matminer.data_retrieval.retrieve_MDF import MDFDataRetrieval
from matminer.utils.conversions import str_to_composition, composition_to_oxidcomposition
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf

from pymatgen import MPRester

### Loading deml_dataset.csv 

In [2]:
csv_path = os.path.join(os.getcwd(), "deml_dataset.csv")
df = pd.read_csv(csv_path, comment='#')

### Part 1: Compute data representation

Compute a pymatgen composition object from each compound formula

In [17]:
df['composition_obj'] = str_to_composition(df['composition'])

Compute ionic states (compositions decorated with oxidation states)

In [18]:
df['oxidation_states'] = composition_to_oxidcomposition(df['composition_obj'])

Compute total energy of a composition

In [5]:
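def get_total_energy(comp, api_key):
    # Query the Materials Project for the energy per atom of a composition.
    mp = MPRester(api_key)
    ls = mp.get_data(comp, prop='energy_per_atom')
    # No matching entry: return NaN so the row can be dropped later.
    if not ls:
        return np.nan
    return ls[0]['energy_per_atom']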
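
The lookup above takes the first returned entry and falls back to NaN when nothing is found. The following is a minimal offline sketch of that fallback logic only. `FakeRester` and `first_energy_or_nan` are hypothetical names introduced for illustration, and the stub merely imitates the list-of-dicts shape this notebook expects from `get_data`; it is not the real client.

```python
import math


class FakeRester:
    """Hypothetical offline stand-in for MPRester (illustration only)."""

    def __init__(self, entries):
        # Maps a formula string to a list of result dicts.
        self._entries = entries

    def get_data(self, formula, prop=""):
        # Imitate the shape used in this notebook: a list, empty if no match.
        return self._entries.get(formula, [])


def first_energy_or_nan(client, formula):
    """Return the first entry's energy per atom, or NaN if there is no entry."""
    results = client.get_data(formula, prop="energy_per_atom")
    if not results:
        return math.nan
    return results[0]["energy_per_atom"]


# The As1Y1 value is the total_energy shown for that row later in this notebook.
client = FakeRester({"As1Y1": [{"energy_per_atom": -6.627177}]})
found = first_energy_or_nan(client, "As1Y1")
missing = first_energy_or_nan(client, "Xx1Yy1")  # made-up formula with no entry
print(found, missing)
```

Returning NaN rather than raising keeps the `apply` call running over the whole column; the missing rows can then be removed in one step.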

In [None]:
%%time
df['total_energy'] = df['composition'].apply(lambda x: get_total_energy(x, 'T6QzrvW8J07u4L2O'))

Drop rows with NaN total energy

In [None]:
original_count = len(df)
df = df.dropna(subset=['total_energy']).reset_index(drop=True)
print('Removed %d/%d entries'%(original_count - len(df), original_count))

Adding a finite list of quantitative descriptors ([Deml et al 2016](https://journals.aps.org/prb/pdf/10.1103/PhysRevB.93.085142))

In [19]:
%%time
ft = MultipleFeaturizer([cf.ElementProperty.from_preset('deml'), 
                         cf.TMetalFraction(),
                         cf.ValenceOrbital()])
df = ft.featurize_dataframe(df, col_id='composition_obj')

CPU times: user 951 ms, sys: 137 ms, total: 1.09 s
Wall time: 3.49 s
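
To make the element-property statistics concrete, here is a small hand calculation for the `atom_num` columns of `As1Y1`. It is a sketch, assuming the mean is weighted by atomic fraction, as the "stoichiometric weighted mean" wording in this notebook suggests; it does not reproduce matminer's internals, and `std_dev` is left out because its weighting is not stated here.

```python
# Atomic numbers: As = 33, Y = 39.
atomic_number = {"As": 33, "Y": 39}
# Formula unit As1Y1: one atom of each element.
amounts = {"As": 1, "Y": 1}

total = sum(amounts.values())
fractions = {el: n / total for el, n in amounts.items()}
values = [atomic_number[el] for el in amounts]

stats = {
    "minimum atom_num": min(values),
    "maximum atom_num": max(values),
    "range atom_num": max(values) - min(values),
    # Mean weighted by atomic fraction (assumption, see the note above).
    "mean atom_num": sum(fractions[el] * atomic_number[el] for el in amounts),
}
print(stats)
```

These four numbers agree with the `minimum`, `maximum`, `range` and `mean atom_num` values in the first row of the final table.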


In [20]:
%%time
ft = MultipleFeaturizer([cf.CationProperty.from_preset('deml'),
                         cf.OxidationStates.from_preset('deml'),
                         cf.ElectronAffinity(),
                         cf.ElectronegativityDiff()
                        ])
df = ft.featurize_dataframe(df, col_id='oxidation_states')

CPU times: user 684 ms, sys: 170 ms, total: 854 ms
Wall time: 2.05 s


Drop the statistics for f-orbital valence electrons

In [21]:
df = df.drop(['frac f valence electrons', 'avg f valence electrons'], axis=1)

Fill in NaN values with zeros

In [22]:
df.fillna(value=0, inplace=True)

Calculate number of atoms in a formula unit

In [23]:
df['num_atoms'] = df['composition_obj'].apply(lambda x: x.num_atoms)

Compute the square root and inverse of each mean-valued feature.

In [24]:
def inv(x):
    # Return 1/x, falling back to 0.0 when x is zero.
    try:
        output = 1.0/x
    except ZeroDivisionError:
        output = 0.0
    return output

In [25]:
col = df.columns
mean_col = []

In [26]:
for i in col:
    if "mean" in i:
        mean_col.append(i)
        df["inverse %s"%i] = df[i].apply(lambda x: inv(x))
        df["sqrt %s"%i] = df[i].apply(lambda x: np.sqrt(x))
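
The expansion can be checked on a toy frame. This is a sketch with made-up values; `safe_inverse` is a hypothetical variant that tests for zero explicitly instead of catching an exception, so it behaves the same whether it receives a Python float or a NumPy float.

```python
import numpy as np
import pandas as pd


def safe_inverse(x):
    # Explicit zero check: zeros map to 0.0 (assumed to be the intent of inv()).
    return 0.0 if x == 0 else 1.0 / x


# Toy values, not taken from the dataset.
toy = pd.DataFrame({"mean atom_num": [36.0, 4.0, 0.0], "num_atoms": [2, 5, 3]})

# Collect the names first so the newly added columns are not expanded again.
mean_cols = [c for c in toy.columns if "mean" in c]
for c in mean_cols:
    toy["inverse %s" % c] = toy[c].apply(safe_inverse)
    toy["sqrt %s" % c] = np.sqrt(toy[c])

print(toy)
```

Only columns whose name contains "mean" are expanded, so `num_atoms` gets no inverse or square-root column.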

Compute the products of the primary features (those listed without an asterisk in Deml et al.) and the stoichiometry-weighted mean values.

In [27]:
primary = ['num_atoms', 'transition metal fraction', 'avg anion electron affinity',
           'avg s valence electrons', 'avg p valence electrons', 
           'avg d valence electrons', 'frac s valence electrons', 
           'frac p valence electrons','frac d valence electrons', ]

In [28]:
for i in primary:
    for j in mean_col:
        df['%s&%s' % (i, j)] = df[i].multiply(df[j])
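
The nested loop adds one column per (primary, mean) pair, named `primary&mean`. A minimal sketch on a toy frame with made-up values shows the naming and the number of columns added; the column subsets below were picked for illustration and are not the notebook's full lists.

```python
import pandas as pd

# Toy values, not taken from the dataset.
toy = pd.DataFrame({
    "num_atoms": [2, 5],
    "transition metal fraction": [0.5, 0.4],
    "mean atom_num": [36.0, 10.0],
    "mean EN difference": [1.0, 2.0],
})

primary = ["num_atoms", "transition metal fraction"]
mean_cols = ["mean atom_num", "mean EN difference"]

before = toy.shape[1]
for p in primary:
    for m in mean_cols:
        # Element-wise product of the two columns.
        toy["%s&%s" % (p, m)] = toy[p] * toy[m]
added = toy.shape[1] - before

print(added)
print(list(toy.columns[before:]))
```

The number of new columns is `len(primary) * len(mean_cols)`, which is why the feature count grows quickly at this step.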

Final shape of data

In [29]:
print("Shape of featurized data: ", df.shape)
df.head(1)

Shape of featurized data:  (2220, 370)


Unnamed: 0,composition,delta_e,total_energy,composition_obj,oxidation_states,minimum atom_num,maximum atom_num,range atom_num,mean atom_num,std_dev atom_num,...,frac d valence electrons&mean electric_pol,frac d valence electrons&mean GGAU_Etot,frac d valence electrons&mean mus_fere,frac d valence electrons&mean FERE correction,frac d valence electrons&mean total_ioniz of cations,frac d valence electrons&mean xtal_field_split of cations,frac d valence electrons&mean magn_moment of cations,frac d valence electrons&mean so_coupling of cations,frac d valence electrons&mean sat_magn of cations,frac d valence electrons&mean EN difference
0,As1Y1,-1.555732,-6.627177,"(As, Y)","(As3-, Y3+)",33,39,6,36.0,4.242641,...,8.253056,-3.095278,-3.016634,0.078643,2297778.0,0.0,0.0,0.0,0.0,0.586667


Saving featurized data to pickle file

In [30]:
df.to_pickle('./deml_featurized_data.pkl')