# Computation of crystal structure representations

We try to recreate the performance comparison of several crystal structure representations, namely the Coulomb Matrix (CM), the Partial Radial Distribution Function (PRDF), and Voronoi tessellation features, as reported in [Ward et al.'s paper](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.96.024104). We also compare their performance with two machine learning algorithms: Kernel Ridge Regression (KRR) and Random Forest Regression (RF).

In this notebook, the data is featurized and saved to pickle files. <br>NOTE: Featurization takes roughly 2-3 CPU hours to run.

In [None]:
import numpy as np
import pandas as pd
import os
import pickle

from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.composition import ElementProperty, Stoichiometry, ValenceOrbital, IonProperty
from matminer.featurizers.structure import SiteStatsFingerprint, StructuralHeterogeneity, ChemicalOrdering, StructureComposition, CoulombMatrix, PartialRadialDistributionFunction 
from matminer.featurizers.structure import MaximumPackingEfficiency

Load data

In [None]:
%%time
data = pd.read_pickle("./oqmd_icsd_subset.pkl")

Drop entries that have no formation enthalpy value (`delta_e`)

In [None]:
data.dropna(subset=['delta_e'], inplace=True)

In [None]:
print("Shape of data: ", data.shape)
data.reset_index(inplace=True)
data.head(1)
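
The filtering and re-indexing above can be tried on a small frame first. The sketch below uses a made-up three-row table (the `name` labels and numbers are placeholders, not OQMD data). It also shows that `reset_index()` without `drop=True` keeps the old row labels as an extra column, by default named `index`, so the real `data` gains one column after that call.

```python
import numpy as np
import pandas as pd

# Placeholder table; only the column name 'delta_e' mirrors the notebook.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "delta_e": [-1.0, np.nan, -2.0],
})

# Drop rows whose target value is missing.
toy = toy.dropna(subset=["delta_e"])

# Without drop=True the old row labels survive as a new 'index' column.
kept = toy.reset_index()
dropped = toy.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

If the original row labels are not needed later, `drop=True` avoids carrying the extra column along.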

## Create featurizer
Here we featurize the data with the Coulomb Matrix (CM), the Partial Radial Distribution Function (PRDF), and the Voronoi tessellation features used in Ward et al. (2017).

### 1) Voronoi tessellation features (Ward et al. 2017)

In [None]:
ward = MultipleFeaturizer([
    SiteStatsFingerprint.from_preset("CoordinationNumber_ward-prb-2017"),
    StructuralHeterogeneity(),
    ChemicalOrdering(),
    MaximumPackingEfficiency(),
    SiteStatsFingerprint.from_preset("LocalPropertyDifference_ward-prb-2017"),
    StructureComposition(Stoichiometry()),
    StructureComposition(ElementProperty.from_preset("magpie")),
    StructureComposition(ValenceOrbital(props=['frac'])),
    StructureComposition(IonProperty(fast=True))
])

In [None]:
print("Total number of Ward features:", len(ward.featurize(data['structure_obj'][0])))

In [None]:
%%time
X_ward = ward.featurize_many(data['structure_obj'], ignore_errors=True)

Convert the features to a NumPy array and replace NaN values with zeros

In [None]:
X_ward = np.array(X_ward)
X_ward = np.nan_to_num(X_ward, copy=True)
print("Voronoi tessellation input data shape:", X_ward.shape)
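
Note that `np.nan_to_num` does not drop anything: it substitutes values in place of NaN and infinities. A small sketch on a hand-made array (the values are arbitrary) shows what the substitution does, and how rows that are entirely NaN could be counted before cleaning. To my understanding such rows are what `ignore_errors=True` produces for structures that fail to featurize, but that is an assumption worth checking against the installed matminer version.

```python
import numpy as np

# Arbitrary toy array: one ordinary row, one all-NaN row, one row with infinities.
toy = np.array([
    [1.0, 2.0],
    [np.nan, np.nan],
    [np.inf, -np.inf],
])

# Count rows where every feature is NaN before they are overwritten.
n_all_nan_rows = int(np.isnan(toy).all(axis=1).sum())

# NaN becomes 0.0; +/-inf become the largest/smallest finite float64 values.
cleaned = np.nan_to_num(toy, copy=True)

print("all-NaN rows:", n_all_nan_rows)
print(cleaned)
```

After cleaning, a failed structure is indistinguishable from one whose features are genuinely zero, so recording the count (or a mask) beforehand may be useful.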

Save Voronoi tessellation featurized data

In [None]:
with open("X_ward.pkl", "wb") as handle:
    pickle.dump(X_ward, handle, protocol=pickle.HIGHEST_PROTOCOL)
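
Since the note at the top puts featurization at 2-3 CPU hours, it may be worth skipping the computation when a pickle already exists. A minimal sketch of that idea follows. `load_or_compute` is a hypothetical helper, not part of matminer or of this notebook, and the demo uses a temporary directory with a trivial stand-in computation.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object pickled at `path`; if absent, compute and save it."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result


calls = []  # records how often the stand-in computation actually runs


def expensive():
    calls.append(1)
    return [1.0, 2.0, 3.0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.pkl")
    first = load_or_compute(path, expensive)   # computes and writes the file
    second = load_or_compute(path, expensive)  # reads the file back

print(first == second, len(calls))
```

In this notebook, the `compute` callable would wrap the featurization and cleanup cells above.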

### 2) Coulomb Matrix features

In [None]:
%%time
cm = CoulombMatrix()
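# featurize_dataframe returns the dataframe with the new 'coulomb matrix' column;
# reassigning to `data` works whether or not the call modifies it in place
data = cm.featurize_dataframe(data, col_id='structure_obj')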

Convert each Coulomb matrix into a vector of its sorted eigenvalues, then zero-pad the vectors so that they all have the same length

In [None]:
X_cm = data['coulomb matrix']

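# Sorted eigenvalue spectrum of each Coulomb matrix
X_cm = pd.Series([np.sort(np.linalg.eigvals(s)) for s in X_cm], X_cm.index)
nt = max(X_cm.apply(len))  # longest spectrum, i.e. the largest number of sites

# Zero-pad every spectrum to length nt so that all rows have the same size
X_cm = np.array([np.append(x, np.zeros(nt - x.shape[0])) for x in X_cm])
print("CM input data shape:", X_cm.shape)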
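
The eigenvalue-and-padding step can be checked on two small symmetric matrices standing in for Coulomb matrices of a 2-site and a 3-site structure. The matrix entries below are arbitrary, and the sketch swaps in `np.linalg.eigvalsh` (which assumes a symmetric input and returns real eigenvalues in ascending order) and `np.pad` in place of the calls used in the cell above.

```python
import numpy as np

# Arbitrary symmetric stand-ins for Coulomb matrices of different sizes.
m2 = np.array([[2.0, 1.0],
               [1.0, 2.0]])
m3 = np.diag([1.0, 2.0, 3.0])

# eigvalsh returns real eigenvalues in ascending order for symmetric input.
spectra = [np.linalg.eigvalsh(m) for m in (m2, m3)]
n_max = max(len(s) for s in spectra)

# Append zeros on the right until every spectrum has n_max entries.
padded = np.array([np.pad(s, (0, n_max - len(s))) for s in spectra])
print(padded)
```

Because the eigenvalues are sorted in ascending order and the zeros are appended afterwards, the padding sits after the largest eigenvalue of each shorter spectrum.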

Save Coulomb Matrix featurized data

In [None]:
with open("X_cm.pkl", "wb") as handle:
    pickle.dump(X_cm, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 3) PRDF features

In [None]:
%%time
prdf = PartialRadialDistributionFunction(cutoff=16.0, bin_size=3.0)
prdf.fit(data['structure_obj'])
X_prdf = prdf.featurize_many(data['structure_obj'], ignore_errors=True)
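
To make the two parameters concrete: `cutoff` bounds the pair distances that are considered, and `bin_size` sets the width of the distance shells, so coarser bins give fewer features per element pair. The sketch below is a simplified, non-periodic illustration written for this note. `toy_prdf`, its normalization by shell volume, and the three-site cluster are all assumptions; matminer's implementation works on periodic structures and may bin and normalize differently.

```python
import math
from collections import defaultdict
from itertools import combinations


def toy_prdf(sites, cutoff, bin_size):
    """Pair-distance histogram per element pair for a finite cluster.

    sites: list of (element, (x, y, z)) tuples, with no periodic images.
    Returns {(elem_a, elem_b): [pairs per unit shell volume, ...]}.
    """
    n_bins = math.ceil(cutoff / bin_size)
    counts = defaultdict(lambda: [0] * n_bins)
    for (ea, ra), (eb, rb) in combinations(sites, 2):
        d = math.dist(ra, rb)
        if d < cutoff:
            pair = tuple(sorted((ea, eb)))
            shell = min(int(d // bin_size), n_bins - 1)
            counts[pair][shell] += 1

    def shell_volume(i):
        inner, outer = i * bin_size, (i + 1) * bin_size
        return 4.0 / 3.0 * math.pi * (outer ** 3 - inner ** 3)

    # Divide each count by the volume of the spherical shell it falls in.
    return {pair: [c / shell_volume(i) for i, c in enumerate(hist)]
            for pair, hist in counts.items()}


# Arbitrary three-site chain, using the same cutoff and bin_size as the cell above.
sites = [("Na", (0.0, 0.0, 0.0)), ("Cl", (2.8, 0.0, 0.0)), ("Na", (5.6, 0.0, 0.0))]
result = toy_prdf(sites, cutoff=16.0, bin_size=3.0)
for pair, values in sorted(result.items()):
    print(pair, [round(v, 5) for v in values])
```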

Convert the features to a NumPy array and replace NaN values with zeros

In [None]:
X_prdf = np.array(X_prdf)
X_prdf = np.nan_to_num(X_prdf, copy=True)
print("PRDF input data shape:", X_prdf.shape)

Save PRDF featurized data

In [None]:
with open("X_prdf.pkl", "wb") as handle:
    pickle.dump(X_prdf, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 4) Formation enthalpy data 

Save the formation enthalpy (`delta_e`) as the target vector `y`.

In [None]:
with open("y.pkl", "wb") as handle:
    pickle.dump(data['delta_e'], handle, protocol=pickle.HIGHEST_PROTOCOL)
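
Before the saved files are used for model fitting, it may be worth confirming that each feature file has one row per target value. The following is a sketch of such a check; `check_row_counts` is a hypothetical helper, and the demo writes tiny stand-in lists to a temporary directory rather than reading the real arrays.

```python
import os
import pickle
import tempfile


def check_row_counts(directory, feature_files, target_file):
    """Return (number of targets, feature files whose row count differs)."""
    with open(os.path.join(directory, target_file), "rb") as handle:
        n_targets = len(pickle.load(handle))
    mismatched = []
    for name in feature_files:
        with open(os.path.join(directory, name), "rb") as handle:
            if len(pickle.load(handle)) != n_targets:
                mismatched.append(name)
    return n_targets, mismatched


# Stand-in contents: X_cm.pkl is deliberately one row short.
stand_ins = {
    "X_ward.pkl": [[0.0], [1.0]],
    "X_cm.pkl": [[0.0]],
    "y.pkl": [0.5, 0.7],
}
with tempfile.TemporaryDirectory() as tmp:
    for name, obj in stand_ins.items():
        with open(os.path.join(tmp, name), "wb") as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    n_rows, bad = check_row_counts(tmp, ["X_ward.pkl", "X_cm.pkl"], "y.pkl")

print(n_rows, bad)
```

Since `len` works on lists, NumPy arrays, and pandas Series alike, the same helper should apply to the files written above.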