# Computation of crystal structure representations

We aim to reproduce the performance comparison of several crystal structure representations, namely the Coulomb Matrix (CM), the Partial Radial Distribution Function (PRDF), and the Voronoi tessellation features, as reported in [Ward et al.'s paper](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.96.024104). We also compare their performance with two machine learning algorithms: Kernel Ridge Regression (KRR) and Random Forest Regression (RF).

In this notebook, the data is featurized and saved to pickle files. <br>NOTE: Featurization takes roughly 2-3 CPU hours to run.

In [2]:
import numpy as np
import pandas as pd
import os
import pickle

from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.composition import ElementProperty, Stoichiometry, ValenceOrbital, IonProperty
from matminer.featurizers.structure import SiteStatsFingerprint, StructuralHeterogeneity, ChemicalOrdering, StructureComposition, CoulombMatrix, PartialRadialDistributionFunction 
from matminer.featurizers.structure import MaximumPackingEfficiency

Load data

In [3]:
%%time
data = pd.read_pickle("./oqmd_icsd_subset.pkl")

CPU times: user 9.45 s, sys: 914 ms, total: 10.4 s
Wall time: 10.5 s


Drop entries that have no formation enthalpy (`delta_e`) value

In [4]:
data.dropna(subset=['delta_e'], inplace=True)
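To make the effect of `dropna(subset=...)` concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are hypothetical and are not taken from the OQMD data; only the column names mirror the real dataset. Rows are removed only when `delta_e` is missing, so missing values in other columns are left in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "composition": ["A2B3", "AB", "AB2"],
    "delta_e": [-3.58, np.nan, -0.42],
    "band_gap": [3.879, 0.0, np.nan],
})

# Drop rows where delta_e is NaN; NaNs in other columns do not trigger a drop.
cleaned = toy.dropna(subset=["delta_e"])
n_dropped = len(toy) - len(cleaned)
print(f"dropped {n_dropped} of {len(toy)} rows")
```

Comparing the row count before and after the drop is a cheap check that the filter removed only what was expected.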

In [5]:
print ("Shape of data: ", data.shape)
data.reset_index(inplace=True)
data.head(1)

Shape of data:  (31163, 11)


| | index | band_gap | delta_e | magnetic_moment | path | stability | structure | total_energy | volume_pa | structure_obj | composition | is_ICSD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 234975 | 3.879 | -3.579764 | -3.2e-05 | /home/oqmd/libraries/icsd/31750/static | -1.0848 | Ac O\n 1.0\n4.067812 -0.000030 0.000026\n-2.03... | -7.936143 | 17.988 | [[5.0000001e-05 2.3486100e+00 1.5314600e+00] A... | Ac2O3 | True |


## Create featurizer
Here we featurize the data with the Coulomb Matrix (CM), the Partial Radial Distribution Function (PRDF), and the Voronoi tessellation features used in Ward et al. (2017).

### 1) Voronoi tessellation features (Ward et al. 2017)

In [11]:
ward = MultipleFeaturizer([
    SiteStatsFingerprint.from_preset("CoordinationNumber_ward-prb-2017"),
    StructuralHeterogeneity(),
    ChemicalOrdering(),
    MaximumPackingEfficiency(),
    SiteStatsFingerprint.from_preset("LocalPropertyDifference_ward-prb-2017"),
    StructureComposition(Stoichiometry()),
    StructureComposition(ElementProperty.from_preset("magpie")),
    StructureComposition(ValenceOrbital(props=['frac'])),
    StructureComposition(IonProperty(fast=True))
])

In [12]:
print ("Total number of Ward features:", len(ward.featurize(data['structure_obj'][0])))

Total number of Ward features: 273


In [13]:
%%time
X_ward = ward.featurize_many(data['structure_obj'], ignore_errors=True)

CPU times: user 36.9 ms, sys: 73.7 ms, total: 111 ms
Wall time: 2.85 s


Save Voronoi tessellation featurized data

In [None]:
with open ("X_ward.pkl", "wb") as handle:
    pickle.dump(X_ward, handle, protocol=pickle.HIGHEST_PROTOCOL)
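Since featurization is expensive, it can be worth confirming that a saved file loads back unchanged before moving on. The sketch below shows the round trip on a small stand-in list written to a temporary directory; `X_demo` and the file name `X_demo.pkl` are hypothetical and are not part of the notebook's outputs.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a featurized matrix: one feature list per structure.
X_demo = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_demo.pkl")
    # Write with the same protocol setting used in the notebook.
    with open(path, "wb") as handle:
        pickle.dump(X_demo, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read the file back in binary mode.
    with open(path, "rb") as handle:
        X_loaded = pickle.load(handle)

round_trip_ok = X_loaded == X_demo
print("round trip ok:", round_trip_ok)
```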

### 2) Coulomb Matrix features

In [None]:
%%time
cm = CoulombMatrix()
X_cm = cm.featurize_dataframe(data, col_id='structure_obj')
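For intuition about what a Coulomb matrix encodes, the sketch below builds one by hand with NumPy using the commonly cited definition: diagonal entries $0.5\,Z_i^{2.4}$ and off-diagonal entries $Z_i Z_j / |R_i - R_j|$. It treats the atoms as a finite cluster and ignores units and periodic images, so it is an illustration under those assumptions and not a reproduction of matminer's `CoulombMatrix`. The helper name `coulomb_matrix` and the two-atom input are hypothetical.

```python
import numpy as np

def coulomb_matrix(atomic_numbers, positions):
    """Coulomb matrix for a finite set of atoms (no periodic images)."""
    z = np.asarray(atomic_numbers, dtype=float)
    r = np.asarray(positions, dtype=float)
    # Pairwise distances |R_i - R_j| via broadcasting.
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Put ones on the diagonal so the division below never divides by zero.
    np.fill_diagonal(dist, 1.0)
    m = np.outer(z, z) / dist
    # Overwrite the diagonal with 0.5 * Z_i ** 2.4.
    np.fill_diagonal(m, 0.5 * z ** 2.4)
    return m

# Hypothetical two-atom example: Z = 1 and Z = 8, separated by 2.0 length units.
cm_demo = coulomb_matrix([1, 8], [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
print(cm_demo)
```

The matrix is symmetric and its size grows with the number of atoms, which is why structures with different site counts need padding or another fixed-length encoding before they can be fed to a regressor.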

Save Coulomb Matrix featurized data

In [None]:
with open ("X_cm.pkl", "wb") as handle:
    pickle.dump(X_cm, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 3) PRDF features

In [None]:
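%%time
prdf = PartialRadialDistributionFunction(cutoff=16.0, bin_size=3.0)
# fit() collects the elements present in the dataset, which defines the element pairs of the PRDF
prdf.fit(data['structure_obj'])
# The keyword is ignore_errors (plural), as in the Ward featurization above
X_prdf = prdf.featurize_many(data['structure_obj'], ignore_errors=True)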
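The roles of `cutoff` and `bin_size` can be illustrated with a stripped-down pair-distance histogram. The sketch below only counts distances between two chosen species in fixed-width bins. A real PRDF also accounts for periodic images and normalizes by shell volume and density, so this is a simplified illustration and not matminer's implementation. The function `pair_distance_histogram` and the three-atom input are hypothetical, and the sketch assumes `cutoff` is a multiple of `bin_size`.

```python
import numpy as np

def pair_distance_histogram(species, positions, pair, cutoff, bin_size):
    """Count distances between two species in fixed-width bins.

    No periodic images and no normalization; assumes cutoff is a
    multiple of bin_size so the last bin edge lands on the cutoff.
    """
    r = np.asarray(positions, dtype=float)
    edges = np.arange(0.0, cutoff + 0.5 * bin_size, bin_size)
    a, b = pair
    # Ordered pairs (i, j) with i != j, species[i] == a and species[j] == b.
    dists = [
        np.linalg.norm(r[i] - r[j])
        for i, si in enumerate(species)
        for j, sj in enumerate(species)
        if i != j and si == a and sj == b
    ]
    counts, _ = np.histogram(dists, bins=edges)
    return edges, counts

# Hypothetical cluster: one Na and two Cl atoms.
species = ["Na", "Cl", "Cl"]
positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
edges, counts = pair_distance_histogram(
    species, positions, pair=("Na", "Cl"), cutoff=6.0, bin_size=2.0
)
print(edges, counts)
```

A larger `bin_size` gives fewer, coarser features per element pair, while a larger `cutoff` extends the histogram to longer distances; the total feature count scales with the number of element pairs times the number of bins.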

Save PRDF featurized data

In [None]:
with open ("X_prdf.pkl", "wb") as handle:
    pickle.dump(X_prdf, handle, protocol=pickle.HIGHEST_PROTOCOL)