# Computation of crystal structure representations

We try to reproduce the performance comparison of several crystal structure representations, namely the Coulomb Matrix (CM), the Partial Radial Distribution Function (PRDF), and Voronoi tessellation features, as shown in [Ward et al.'s paper](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.96.024104). We also compare their performance with two machine learning algorithms, Kernel Ridge Regression (KRR) and Random Forest Regression (RF).

In this notebook, the data is featurized and saved to pickle files.

NOTE: Featurization takes ~2-3 CPU hours to run.

In [1]:
import numpy as np
import pandas as pd
import os
import pickle

from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.composition import ElementProperty, Stoichiometry, ValenceOrbital, IonProperty
from matminer.featurizers.structure import SiteStatsFingerprint, StructuralHeterogeneity, ChemicalOrdering, StructureComposition, CoulombMatrix, PartialRadialDistributionFunction 
from matminer.featurizers.structure import MaximumPackingEfficiency

numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88

Load data

In [2]:
%%time
data = pd.read_pickle("./oqmd_icsd_subset.pkl")

CPU times: user 10.8 s, sys: 2.12 s, total: 12.9 s
Wall time: 13.2 s


Drop entries without a formation enthalpy value (`delta_e`)

In [3]:
data.dropna(subset=['delta_e'], inplace=True)

In [4]:
print("Shape of data: ", data.shape)
data.reset_index(inplace=True)
data.head(1)

Shape of data:  (31163, 11)


|   | index | band_gap | delta_e | magnetic_moment | path | stability | structure | total_energy | volume_pa | structure_obj | composition | is_ICSD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 234975 | 3.879 | -3.579764 | -3.2e-05 | /home/oqmd/libraries/icsd/31750/static | -1.0848 | Ac O\n 1.0\n4.067812 -0.000030 0.000026\n-2.03... | -7.936143 | 17.988 | [[5.0000001e-05 2.3486100e+00 1.5314600e+00] A... | Ac2O3 | True |


## Create featurizer
Here we featurize the data with the Coulomb Matrix (CM), the Partial Radial Distribution Function (PRDF), and the Voronoi tessellation features used in Ward et al. (2017).

### 1) Voronoi tessellation features (Ward et al 2017) 

In [5]:
ward = MultipleFeaturizer([
    SiteStatsFingerprint.from_preset("CoordinationNumber_ward-prb-2017"),
    StructuralHeterogeneity(),
    ChemicalOrdering(),
    MaximumPackingEfficiency(),
    SiteStatsFingerprint.from_preset("LocalPropertyDifference_ward-prb-2017"),
    StructureComposition(Stoichiometry()),
    StructureComposition(ElementProperty.from_preset("magpie")),
    StructureComposition(ValenceOrbital(props=['frac'])),
    StructureComposition(IonProperty(fast=True))
])

In [6]:
print("Total number of Ward features:", len(ward.featurize(data['structure_obj'][0])))

Total number of Ward features: 273


In [7]:
%%time
X_ward = ward.featurize_many(data['structure_obj'], ignore_errors=True)


No electronegativity for Ar. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for He. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for Ne. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.



CPU times: user 21.9 s, sys: 27.4 s, total: 49.3 s
Wall time: 1h 31min 54s


Convert the features to an array and replace NaN values (for example, from structures that failed to featurize) with zeros

In [8]:
X_ward = np.array(X_ward)
X_ward = np.nan_to_num(X_ward, copy=True)
print("Voronoi tessellation input data shape:", X_ward.shape)

Voronoi tessellation input data shape: (31163, 273)
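As a small illustration of what this cleaning step does, the sketch below applies `np.nan_to_num` to a toy array. The values are made up and do not come from the dataset. Note that a NaN replaced by 0.0 can no longer be told apart from a genuine zero-valued feature downstream.

```python
import numpy as np

# Toy feature matrix with one missing (NaN) entry; values are made up.
X_toy = np.array([[1.5, np.nan],
                  [0.0, 2.0]])

# copy=True returns a new array and leaves X_toy untouched; NaN becomes 0.0.
X_clean = np.nan_to_num(X_toy, copy=True)

print(X_clean)
```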


Save Voronoi tessellation featurized data

In [9]:
with open("X_ward.pkl", "wb") as f:  # the context manager closes the file after writing
    pickle.dump(X_ward, f, protocol=pickle.HIGHEST_PROTOCOL)

### 2) Coulomb Matrix features

In [10]:
%%time
cm = CoulombMatrix()
X_cm = cm.featurize_dataframe(data, col_id='structure_obj')

CPU times: user 13.5 s, sys: 1.84 s, total: 15.3 s
Wall time: 2min 14s


Convert each Coulomb matrix into a vector descriptor made of its sorted eigenvalues, then zero-pad the vectors so that they all have the same length

In [11]:
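# The 'coulomb matrix' column was added to `data` by the previous cell
X_cm = data['coulomb matrix']

# Represent each matrix by its sorted eigenvalues
X_cm = pd.Series([np.sort(np.linalg.eigvals(s)) for s in X_cm],
                 index=X_cm.index)
nt = max(X_cm.apply(len))  # length of the longest eigenvalue vector

# Zero-pad every vector to length nt
XLIST = []
for x in X_cm:
    XLIST.append(np.append(x, np.zeros(nt - x.shape[0])))
X_cm = np.array(XLIST)
print("CM input data shape:", X_cm.shape)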

CM input data shape: (31163, 272)
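The eigenvalue-and-padding step can be sketched on toy inputs as follows. This is a minimal illustration under stated assumptions, not the matminer implementation: `coulomb_matrix` and `eigenvalue_descriptor` are hypothetical helpers, the matrix uses the common molecular definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/|R_i − R_j|) without periodic images, and `np.linalg.eigvalsh` is swapped in for `eigvals` because the toy matrices are symmetric.

```python
import numpy as np

def coulomb_matrix(numbers, coords):
    """Hypothetical helper: Coulomb matrix for isolated atoms (no periodic images)."""
    numbers = np.asarray(numbers, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(numbers)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i, j] = 0.5 * numbers[i] ** 2.4
            else:
                dist = np.linalg.norm(coords[i] - coords[j])
                m[i, j] = numbers[i] * numbers[j] / dist
    return m

def eigenvalue_descriptor(matrix, length):
    """Hypothetical helper: sorted eigenvalues, zero-padded on the right to `length`."""
    eigs = np.sort(np.linalg.eigvalsh(matrix))  # symmetric input, real eigenvalues
    return np.append(eigs, np.zeros(length - eigs.shape[0]))

# Two made-up "structures" with 2 and 3 atoms (coordinates in arbitrary units).
small = coulomb_matrix([1, 1], [[0, 0, 0], [0, 0, 1.0]])
large = coulomb_matrix([8, 1, 1], [[0, 0, 0], [0, 1.0, 0], [1.0, 0, 0]])

# Pad both descriptors to the size of the largest structure.
nt = max(small.shape[0], large.shape[0])
X_toy = np.array([eigenvalue_descriptor(m, nt) for m in (small, large)])
print(X_toy.shape)  # (2, 3)
```

Because the padding zeros are appended after sorting, a padded zero is not distinguishable from a true zero eigenvalue; this mirrors the behavior of the cell above.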


Save Coulomb Matrix featurized data

In [12]:
with open("X_cm.pkl", "wb") as f:  # the context manager closes the file after writing
    pickle.dump(X_cm, f, protocol=pickle.HIGHEST_PROTOCOL)

### 3) PRDF features

In [13]:
%%time
prdf = PartialRadialDistributionFunction(cutoff=16.0, bin_size=3.0)
prdf.fit(data['structure_obj'])
X_prdf = prdf.featurize_many(data['structure_obj'], ignore_errors=True)


No electronegativity for He. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for Ne. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for Ar. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.



CPU times: user 19.4 s, sys: 37.9 s, total: 57.2 s
Wall time: 16min 49s


Convert the features to an array and replace NaN values with zeros

In [14]:
X_prdf = np.array(X_prdf)
X_prdf = np.nan_to_num(X_prdf, copy=True)
print("PRDF input data shape:", X_prdf.shape)

PRDF input data shape: (31163, 24030)
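One possible reading of the 24030 columns, offered as an assumption rather than something the notebook states: each column is one radial bin for one unordered element pair. If the bin edges run from 0 in steps of `bin_size` until they reach or pass `cutoff`, the arithmetic works out to 6 bins and 4005 pairs, which corresponds to 89 distinct elements. The sketch below only checks this arithmetic.

```python
import math

cutoff, bin_size = 16.0, 3.0

# Assumption: edges at 0, 3, ..., 18, so the last bin closes at the first
# multiple of bin_size at or beyond the cutoff.
n_bins = math.ceil(cutoff / bin_size)

n_columns = 24030                  # second dimension of the printed PRDF shape
n_pairs = n_columns // n_bins      # unordered element pairs, including same-element pairs

# Solve n * (n + 1) / 2 == n_pairs for the number of distinct elements n.
n_elements = (math.isqrt(8 * n_pairs + 1) - 1) // 2

print(n_bins, n_pairs, n_elements)  # 6 4005 89
```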


Save PRDF featurized data

In [15]:
with open("X_prdf.pkl", "wb") as f:  # the context manager closes the file after writing
    pickle.dump(X_prdf, f, protocol=pickle.HIGHEST_PROTOCOL)

### 4) Formation enthalpy data 

Save the formation enthalpy values (`delta_e`) as the target (`y`) data.

In [16]:
with open("y.pkl", "wb") as f:  # the context manager closes the file after writing
    pickle.dump(data['delta_e'], f, protocol=pickle.HIGHEST_PROTOCOL)
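A later notebook would presumably read these files back before training. The following is a minimal sketch of that round trip; `load_pickle` is a hypothetical helper, and the example writes a stand-in object to a temporary directory so that it runs without the real `X_ward.pkl` or `y.pkl` files.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Hypothetical helper: read one pickled object and close the file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip on a stand-in object; the values are made up.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_demo.pkl")
    y_demo = [-1.0, -2.5]
    with open(path, "wb") as f:
        pickle.dump(y_demo, f, protocol=pickle.HIGHEST_PROTOCOL)
    y_loaded = load_pickle(path)

print(y_loaded == y_demo)  # True
```

With the real files, the same helper would be called as `load_pickle("X_ward.pkl")` and `load_pickle("y.pkl")`. Files written with `pickle.HIGHEST_PROTOCOL` may not load under an older Python version than the one that wrote them.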