## Predicting total energies and formation enthalpies of metal-nonmetal compounds by linear regression 

This notebook builds and saves the featurized dataset used in the [Deml et al. paper](https://journals.aps.org/prb/pdf/10.1103/PhysRevB.93.085142).
If an up-to-date featurized pickle file (`deml_featurized_data.pkl`) already exists, skip this notebook and run the second Jupyter notebook (`Deml_02_prediction.ipynb`) directly.
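
The check for an existing featurized file can be sketched as follows. This is a minimal sketch: the helper name `featurized_data_exists` is hypothetical, and it only tests whether the file is present, not whether it is up to date.

```python
import os

FEATURIZED_PATH = "deml_featurized_data.pkl"  # file name used in this notebook

def featurized_data_exists(path=FEATURIZED_PATH):
    """Hypothetical helper: True if the featurized pickle is already on disk."""
    return os.path.isfile(path)

if featurized_data_exists():
    print("Found featurized data; continue with Deml_02_prediction.ipynb.")
else:
    print("No featurized data found; run the cells below to create it.")
```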


In [1]:
%matplotlib inline
import os

import numpy as np
import pandas as pd

from matminer.utils.conversions import str_to_composition, composition_to_oxidcomposition
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf

from sklearn.preprocessing import PolynomialFeatures

### Loading the dataset (`oqmd_deml_subset.pkl`)

In [2]:
path = os.path.join(os.getcwd(), "oqmd_deml_subset.pkl")
df = pd.read_pickle(path)

In [3]:
print ("Shape of data: ", df.shape)

Shape of data:  (2508, 11)


Drop unnecessary columns

In [4]:
df = df.drop(['band_gap', 'magnetic_moment', 'path', 'stability', 'structure', 'volume_pa', 'is_ICSD', 'structure_obj'], axis=1)

### Compute data representation

Compute pymatgen composition from compound formula

In [5]:
df['composition_obj'] = str_to_composition(df['composition'])

Compute ionic states

In [6]:
df['oxidation_states'] = composition_to_oxidcomposition(df['composition_obj'])

Remove compounds that cannot be featurized (the reason is unclear). The index is reset first so that the row labels used below match row positions.

In [7]:
df = df.reset_index(drop=True)

In [8]:
# Row labels of the compounds that fail featurization
bad_rows = [952, 1214, 1217, 1311, 1315, 1710, 1963]
df = df.drop(bad_rows).reset_index(drop=True)

Add the finite list of quantitative descriptors from [Deml et al. 2016](https://journals.aps.org/prb/pdf/10.1103/PhysRevB.93.085142)

In [9]:
%%time
ft = MultipleFeaturizer([cf.ElementProperty.from_preset('deml'), 
                         cf.TMetalFraction(),
                         cf.ValenceOrbital(),
                         cf.CationProperty.from_preset('deml'),
                         cf.OxidationStates.from_preset('deml'),
                         cf.ElectronAffinity(),
                         cf.ElectronegativityDiff()])
df = ft.featurize_dataframe(df, col_id='oxidation_states')

CPU times: user 2.23 s, sys: 416 ms, total: 2.65 s
Wall time: 9.78 s


Drop the statistics of f-orbital valence electrons

In [10]:
df = df.drop(['frac f valence electrons', 'avg f valence electrons'], axis=1)

Calculate number of atoms in a formula unit

In [11]:
df['num_atoms'] = df['composition_obj'].apply(lambda x: x.num_atoms)

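At this point, we should have 124 main descriptors (as in the Deml et al. paper). The column count printed below also includes the two target columns, `delta_e` and `total_energy`, so it is not a pure descriptor count.

In [12]:
print ("Shape of data: ", df.drop(['composition', 'composition_obj', 'oxidation_states'], axis=1).shape)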

Shape of data:  (2501, 125)


Fill in NaN values with zeros

In [13]:
df.fillna(value=0, inplace=True)

Add the square root and the inverse of each term. The inverse of zero is set to zero.

In [14]:
def inv(x):
    """Return 1/x, or 0.0 when x is zero."""
    if x == 0:
        return 0.0
    return 1.0 / x

In [15]:
col = df.drop(['composition', 'composition_obj', 'oxidation_states'], axis=1).columns
mean_col = []

In [16]:
for i in col:
    df["inverse %s" % i] = df[i].apply(inv)  # 1/x, with zero mapped to zero
    df["sqrt %s" % i] = np.sqrt(df[i])       # NaN for negative values
    if "mean" in i:
        mean_col.append(i)  # keep track of the mean columns for the products below
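
The edge cases of these two transforms are worth spelling out. The following is a minimal plain-Python sketch; the helper names `safe_inverse` and `sqrt_or_nan` are hypothetical and only mimic the behaviour of the cell above. Because `fillna` runs before this step, any NaN produced by the square root of a negative value (for example, the negative `delta_e` of the first row) stays in the dataframe.

```python
import math

def safe_inverse(x):
    """Return 1/x, mapping x == 0 to 0.0 (the convention used by inv)."""
    return 1.0 / x if x != 0 else 0.0

def sqrt_or_nan(x):
    """Return sqrt(x), or NaN for negative x (how np.sqrt treats negative reals)."""
    return math.sqrt(x) if x >= 0 else float("nan")

values = [4.0, 0.0, -1.0]
inverses = [safe_inverse(v) for v in values]
roots = [sqrt_or_nan(v) for v in values]
print(inverses)  # zero stays zero instead of raising or becoming inf
print(roots)     # the last entry is NaN
```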

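Each input column gains an inverse and a square-root column. With 124 main descriptors this adds 248 terms, for a total of 372 descriptors. Note that `col` also contains the targets `delta_e` and `total_energy`, so their inverse and square root are created as well; these columns are derived from the targets and must not be used as descriptors.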

In [17]:
print ("Shape of data: ", df.shape)

Shape of data:  (2501, 378)
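
The printed column count can be reconciled with a little bookkeeping. This sketch only uses numbers that appear in the outputs above: 125 numeric columns from the earlier shape check, plus the three non-numeric columns excluded from that check.

```python
n_numeric = 125      # columns reported by the earlier shape check (targets included)
n_non_numeric = 3    # composition, composition_obj, oxidation_states
n_before = n_numeric + n_non_numeric

n_added = 2 * n_numeric          # one inverse and one sqrt column per numeric column
n_after = n_before + n_added

print(n_before, n_added, n_after)
```

Four of the added columns come from the two target columns rather than from descriptors.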


Next, form the products of the primary descriptors (those without an asterisk, presumably in the descriptor table of Deml et al.) and the stoichiometry-weighted mean values.

In [18]:
primary = ['num_atoms', 'transition metal fraction', 'avg anion electron affinity',
           'avg s valence electrons', 'avg p valence electrons', 
           'avg d valence electrons', 'frac s valence electrons', 
           'frac p valence electrons','frac d valence electrons']

In [19]:
product = df[mean_col + primary]
col = product.columns

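Use `PolynomialFeatures` with degree 2 from the scikit-learn package

In [20]:
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
product = pd.DataFrame(poly.fit_transform(product))
# Newer scikit-learn releases replace get_feature_names with get_feature_names_out
product.columns = poly.get_feature_names(col)
# Keep only the degree-2 terms: drop the bias column "1" and the degree-1 inputs
product = product.drop(["1"] + mean_col + primary, axis=1)
print(product.shape)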

(2501, 496)
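
The count of 496 can be checked by hand. For n input columns, the degree-2 terms (squares plus pairwise products) number n(n+1)/2, and 496 = 31 · 32 / 2, which implies 31 inputs: the 9 primary columns plus 22 mean columns (the 22 is inferred from the output, not stated in the notebook). The sketch below is plain Python and does not call scikit-learn; the helper name `degree2_names` is hypothetical and follows the naming seen in the output columns (`a^2` for squares, `a b` for products).

```python
from itertools import combinations_with_replacement

def degree2_names(columns):
    """Names of all degree-2 terms: squares ("a^2") and pairwise products ("a b")."""
    names = []
    for a, b in combinations_with_replacement(columns, 2):
        names.append(f"{a}^2" if a == b else f"{a} {b}")
    return names

toy = ["x", "y", "z"]
toy_terms = degree2_names(toy)
print(toy_terms)  # 3 squares and 3 pairwise products

n_inputs = 31  # inferred: 9 primary columns + 22 mean columns
n_terms = len(degree2_names([f"c{i}" for i in range(n_inputs)]))
print(n_terms)
```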


Merge into original dataframe

In [21]:
df[product.columns] = product

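Final shape of the data. The full descriptor set should have 4692 descriptors, but this notebook builds far fewer product terms: the 378 existing columns plus the 496 products give the 874 columns printed below.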

In [22]:
print ("Shape of featurized data: ", df.shape)
df.head(1)

Shape of featurized data:  (2501, 874)


| | delta_e | total_energy | composition | composition_obj | oxidation_states | minimum atom_num | maximum atom_num | range atom_num | mean atom_num | std_dev atom_num | ... | avg d valence electrons^2 | avg d valence electrons frac s valence electrons | avg d valence electrons frac p valence electrons | avg d valence electrons frac d valence electrons | frac s valence electrons^2 | frac s valence electrons frac p valence electrons | frac s valence electrons frac d valence electrons | frac p valence electrons^2 | frac p valence electrons frac d valence electrons | frac d valence electrons^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.479234 | -6.805786 | LaF3 | (La, F) | (La3+, F-) | 9 | 57 | 48 | 21.0 | 33.941125 | ... | 0.0625 | 0.083333 | 0.15625 | 0.010417 | 0.111111 | 0.208333 | 0.013889 | 0.390625 | 0.026042 | 0.001736 |


Saving featurized data to pickle file

In [23]:
df.to_pickle('./deml_featurized_data.pkl')