# Example: using the featurization tools

The `oximachine_featurizer` package contains tools for featurizing crystal structures and for parsing chemical databases. This notebook shows how to use the featurization tools.

In [None]:
import os
from glob import glob
from pathlib import Path

import pandas as pd
# recent pymatgen releases no longer expose Structure at the package root
from pymatgen.core import Structure

# this package
from oximachine_featurizer.featurize import (FeatureCollector, GetFeatures,
                                             featurize)

In [2]:
example_structures = glob("structures/*.cif")

In [3]:
example_structures

['structures/KAJZIH_freeONLY.cif',
 'structures/SnO_mp-2097_computed.cif',
 'structures/BaO_mp-1342_computed.cif',
 'structures/UiO66_GC1.cif',
 'structures/ACODAA.cif',
 'structures/BaO2_mp-1105_computed.cif',
 'structures/SnO2_mp-856_computed.cif']

With the default settings, a single call to the `featurize` function is enough. This function takes a `pymatgen.Structure` object and, optionally, a list of feature scopes.

In the example below, we loop over our structures and store the features in a dictionary, using the stem of each filename as the key.

In [18]:
features_dict = {}

for structure_file in example_structures:
    print(f"Featurizing {structure_file}")
    structure = Structure.from_file(structure_file)
    stem = Path(structure_file).stem
    features_dict[stem] = featurize(structure)

Featurizing structures/KAJZIH_freeONLY.cif
Featurizing structures/SnO_mp-2097_computed.cif
Featurizing structures/BaO_mp-1342_computed.cif
Featurizing structures/UiO66_GC1.cif
Featurizing structures/ACODAA.cif
Featurizing structures/BaO2_mp-1105_computed.cif
Featurizing structures/SnO2_mp-856_computed.cif
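
Because the dictionary is keyed by the filename stem, two files that share a stem (for example, in different folders) would silently overwrite each other. The sketch below uses only the standard library to show what `Path.stem` returns for some of the filenames above and how such clashes could be detected. The helper `stems_with_collisions` is hypothetical and not part of `oximachine_featurizer`.

```python
from collections import Counter
from pathlib import Path

# File names copied from the listing above.
example_structures = [
    "structures/KAJZIH_freeONLY.cif",
    "structures/SnO_mp-2097_computed.cif",
    "structures/BaO_mp-1342_computed.cif",
]


def stems_with_collisions(paths):
    """Return the stems that occur more than once (hypothetical helper)."""
    counts = Counter(Path(p).stem for p in paths)
    return sorted(stem for stem, n in counts.items() if n > 1)


# The stem is the file name without its directory and final suffix.
stems = [Path(p).stem for p in example_structures]
collisions = stems_with_collisions(example_structures)
print(stems)
print(collisions)
```

If `collisions` is not empty, one option would be to key the dictionary by the full path instead of the stem.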


In [21]:
features_dict


           4.        ,  0.        ,  0.        ,  4.        ,  0.        ,
           2.        ,  0.        ,  0.        ,  2.        ,  0.        ,
          78.        , 14.        , -4.        ,  2.55      ,  0.        ,
           4.        ,  0.        ,  0.        ,  4.        ,  0.        ,
           2.        ,  0.        ,  0.        ,  2.        ,  0.        ,
          78.        , 14.        , -4.        ,  2.55      ,  0.        ,
           4.        ,  0.        ,  0.        ,  4.        ,  0.        ,
           2.        ,  0.        ,  0.        ,  2.        ,  0.        ,
           2.        ,  6.        ,  2.        ,  2.        ,  0.        ,
           0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
           0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
           0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
           0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
           0.        ,  

## Using the private API (for developers)

Define the features we are interested in.

In [22]:
METAL_CENTER_FEATURES = [
    "column",
    "row",
    "valenceelectrons",
    "diffto18electrons",
    "sunfilled",
    "punfilled",
    "dunfilled",
]
GEOMETRY_FEATURES = ["crystal_nn_fingerprint", "behler_parinello"]
CHEMISTRY_FEATURES = ["local_property_stats"]

In [23]:
# Names of the structures that already have features in the output folder
already_featurized = [Path(s).stem for s in glob("features/*.pkl")]

# Iterate over all structures
for s in example_structures:
    name = Path(s).stem
    # To skip structures that are already in the output folder, guard the
    # two lines below with a check such as:
    # if name not in already_featurized:
    # Here the check is disabled, so every structure is featurized again.
    # The features are written as pickle files to the 'features' folder.
    gf = GetFeatures.from_file(s, "features")
    gf._run_featurization()
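
The commented-out check in the cell above could be turned into a small filter that returns only the structures still missing from the output folder. The following is a minimal sketch under the assumption that each output file is named `<stem>.pkl`, as the `glob("features/*.pkl")` pattern suggests. The helper `needs_featurization` is hypothetical, and the sketch uses empty placeholder files in a temporary folder instead of running the real featurization.

```python
import tempfile
from pathlib import Path


def needs_featurization(structure_paths, feature_dir):
    """Return the structure paths that have no matching .pkl in feature_dir.

    Hypothetical helper; assumes each output file is named <stem>.pkl.
    """
    done = {p.stem for p in Path(feature_dir).glob("*.pkl")}
    return [s for s in structure_paths if Path(s).stem not in done]


with tempfile.TemporaryDirectory() as tmp:
    feature_dir = Path(tmp) / "features"
    feature_dir.mkdir()
    # Pretend that one structure has already been featurized.
    (feature_dir / "ACODAA.pkl").touch()

    structures = ["structures/ACODAA.cif", "structures/UiO66_GC1.cif"]
    todo = needs_featurization(structures, feature_dir)
    print(todo)
    # In the notebook, one would then featurize only the entries in `todo`:
    # for s in todo:
    #     GetFeatures.from_file(s, "features")._run_featurization()
```

Using a set for the finished stems keeps the membership test cheap when the output folder contains many files.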

Now, we can collect the features from this folder into a matrix.

In [24]:
import numpy as np

features_dict = {}

# get all output files with features
feature_files = glob("features/*.pkl")

for feature_file in feature_files:
    try:
        # one dictionary per metal site (metal, coordinates, feature vector, name)
        rl = FeatureCollector.create_dict_for_feature_table(feature_file)
        print(rl)
        # stack the per-site feature vectors into a (sites x features) matrix
        feature_matrix = np.vstack([d["feature"] for d in rl])
        feature_matrix = FeatureCollector._select_features(
            CHEMISTRY_FEATURES + METAL_CENTER_FEATURES + ["crystal_nn_no_steinhardt"],
            feature_matrix,
        )
        # Note that this is a simplification for this example.
        features_dict[Path(feature_file).stem] = feature_matrix
    except Exception as e:
        print(f"Could not collect features from {feature_file}: {e}")

03272, 0.0, 0.0, 1.1677208799228704, 1.0647853045505706, 23.0, 5.0, 0.0, 1.54, 1.0, 4.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 3.0, 4.496, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, -10.0, 0.0, -7.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 17.435763635419583, 5.374581696800158, 0.4061483581552421, 0.0028417785966257983, 51.2414469709615, 15.522830035691467, 31.68691754629936, 2.496070854261599, 29, 4, 11, 11.0, 7.0, 1.0, 0.0, 0.0, 11], 'name': 'KAJZIH_freeONLY'}, {'metal': 'Cu', 'coordinate_x': 15, 'coordinate_y': 3, 'coordinate_z': 3, 'feature': [0, 0, 0.8526361170230569, 1.0017993606954692e-08, 4.3774938980631365e-05, 0.007498845592509273, 0.6111369862632474, 0.5807478372602987, 0, 0, 0, 0, 0.14736388297694308, 0.054042050882633336, 0.042474899118055215, 0.05968076080103379, 0.07867041227033911, 0.07921223858604357, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13.595129825674427, 2.9966507397818676, 1.3881238732349128, 0.858
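
The collection step stacks the per-site feature vectors of one structure into a single matrix. The sketch below illustrates this with NumPy only. The record layout (`metal`, `feature`, `name`) mirrors the keys visible in the printed output, but the numbers and the column indices are made up for illustration and do not come from the package.

```python
import numpy as np

# Stand-in records: one dictionary per metal site, mirroring the keys
# seen in the printed output. The numbers are made up for illustration.
records = [
    {"metal": "Cu", "feature": [0.0, 0.85, 29.0, 4.0], "name": "toy"},
    {"metal": "Cu", "feature": [0.0, 0.61, 29.0, 4.0], "name": "toy"},
    {"metal": "Zn", "feature": [1.0, 0.15, 30.0, 4.0], "name": "toy"},
]

# Stack the per-site vectors: rows are metal sites, columns are features.
matrix = np.vstack([r["feature"] for r in records])

# Keep a subset of columns (hypothetical indices, standing in for the
# name-based selection done in the cell above).
selected = matrix[:, [1, 2]]

print(matrix.shape)
print(selected.shape)
```

Each row of `selected` still corresponds to one metal site, so the row order can be matched against the `metal` entries of the records.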

Let's look at the output. The keys of the dictionary are the names of the structures.

In [25]:
features_dict.keys()

dict_keys(['UiO66_GC1', 'ACODAA', 'BaO2_mp-1105_computed', 'SnO2_mp-856_computed', 'KAJZIH_freeONLY', 'SnO_mp-2097_computed', 'BaO_mp-1342_computed'])

The values hold the selected features, with one row per metal site; the length of a row is the number of selected features.

In [26]:
len(features_dict["BaO2_mp-1105_computed"][0])

116