## Compute and save Ward 2016 features
The data is featurized with the composition-based representation from [Ward et al. (2016)](https://www.nature.com/articles/npjcompumats201628).

Note: Run this notebook before `04.ipynb`!

In [1]:
import pandas as pd
import numpy as np
import os

from matminer.utils.conversions import str_to_composition
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf

Load data

In [2]:
path = os.path.join(os.getcwd(), "oqmd_icsd_subset.pkl")
data = pd.read_pickle(path)

Remove entries without a formation enthalpy (`delta_e`) value

In [3]:
data.dropna(subset=['delta_e'], inplace=True)
data.reset_index(inplace=True)
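
As a side note on the cell above: `reset_index` without `drop=True` keeps the old row labels in a new column named `index`, which then counts towards the column total of the frame. The sketch below shows the difference on a hypothetical toy frame; only the column name `delta_e` comes from this notebook, and whether the extra column is wanted downstream is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "composition": ["NaCl", "Fe2O3", "MgO"],
    "delta_e": [-2.1, np.nan, -3.0],
})

# Drop rows whose target value is missing.
kept = toy.dropna(subset=["delta_e"])

# Default behaviour: the old labels (0 and 2) move into an `index` column.
with_old_index = kept.reset_index()

# With drop=True the old labels are discarded instead.
without_old_index = kept.reset_index(drop=True)

print(list(with_old_index.columns))
print(list(without_old_index.columns))
```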

Compute pymatgen composition objects from the composition strings

In [4]:
data['composition_obj'] = str_to_composition(data['composition'])
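
To illustrate what a composition object carries, here is a simplified stand-in written in plain Python. It is not pymatgen's parser: the helper names `parse_formula` and `to_fractions` are hypothetical, and the sketch assumes flat formulas without parentheses or charges.

```python
import re
from collections import defaultdict

# An element symbol followed by an optional (possibly fractional) amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")

def parse_formula(formula):
    """Map element symbol -> amount for a flat formula such as 'Fe2O3'."""
    amounts = defaultdict(float)
    for symbol, count in _TOKEN.findall(formula):
        # A missing amount means one atom of that element.
        amounts[symbol] += float(count) if count else 1.0
    return dict(amounts)

def to_fractions(amounts):
    """Normalize amounts so that they sum to one."""
    total = sum(amounts.values())
    return {element: amount / total for element, amount in amounts.items()}

amounts = parse_formula("Fe2O3")
fractions = to_fractions(amounts)
print(amounts, fractions)
```

The composition-based features computed in the next step depend only on which elements are present and on fractions like these.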

Compute the features

In [5]:
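# Featurizer groups used for the Ward 2016 representation
ft = MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"),
                         cf.ValenceOrbital(props=['avg']), cf.IonProperty(fast=True)])
# With ignore_errors=True, entries that fail to featurize get NaN features instead of raising
data = ft.featurize_dataframe(data, col_id='composition_obj', ignore_errors=True)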


No electronegativity for Ar. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.
No electronegativity for He. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.
No electronegativity for Ne. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.
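
As a rough illustration of the simplest of the four feature groups, the stoichiometric attributes are norms of the element fractions. The sketch below is a plain-Python approximation, not matminer's implementation: the function name `stoichiometry_norms` is hypothetical, and the set of `p` values is an assumption based on the attributes described by Ward et al.

```python
def stoichiometry_norms(fractions, p_list=(0, 2, 3, 5, 7, 10)):
    """Return {p: p-norm of the element fractions}; p = 0 counts the elements."""
    values = [f for f in fractions.values() if f > 0]
    norms = {}
    for p in p_list:
        if p == 0:
            # The "0-norm" is the number of elements present.
            norms[p] = float(len(values))
        else:
            norms[p] = sum(f ** p for f in values) ** (1.0 / p)
    return norms

# Element fractions of Fe2O3, written out by hand.
norms = stoichiometry_norms({"Fe": 0.4, "O": 0.6})
print(norms)
```

These values depend only on the fractions, not on which elements are present, so two compounds with the same stoichiometric ratios share them.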



Remove entries with NaN features. Note that `dropna` removes only NaN values; infinite values, if any, are kept.

In [6]:
original_count = len(data)
data = data.dropna()
print('Removed %d/%d entries'%(original_count - len(data), original_count))

Removed 10436/31163 entries
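
`DataFrame.dropna` treats only NaN (and `None`) as missing, so infinite feature values would survive the cell above. The following is a minimal sketch, on a hypothetical toy frame, of one way to drop infinities as well; it is not what the notebook ran, and applying it to the real data could change the counts reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: one NaN row and one infinite row.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.inf, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# dropna alone removes only the NaN row.
only_nan_dropped = toy.dropna()

# Converting +/-inf to NaN first removes the infinite row too.
nan_and_inf_dropped = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(len(only_nan_dropped), len(nan_and_inf_dropped))
```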


In [7]:
print("Shape of featurized data: ", data.shape)

Shape of featurized data:  (20727, 158)


Save featurized data

In [8]:
data.to_pickle('./ward2016_featurized_data.pkl')
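
A later notebook can read the file back with `pd.read_pickle`. The sketch below checks the round trip on a hypothetical toy frame in a temporary directory, so it does not touch the real output file. Because pickle stores objects by class reference, loading a frame that contains pymatgen objects requires pymatgen to be importable in the loading environment.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the featurized data.
toy = pd.DataFrame({"composition": ["NaCl", "MgO"], "delta_e": [-2.1, -3.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_featurized.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame has the same values, columns, and dtypes.
print(restored.equals(toy))
```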