# How to submit to the leaderboard


In this tutorial, we will build a graph-convolutional neural network for the PBE bandgap task.
We will use the [Crystal Graph Convolutional Neural Network](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.120.145301) implemented in the [Deepchem library](https://deepchem.io/).

We choose this model because it requires a more involved training loop and some customization, which you might also need in a real-world scenario.


To install Deepchem, follow the installation instructions on [the Deepchem landing page](https://deepchem.io/). You will need to run something along the following lines. The CGCNN model imported below lives in `deepchem.models.torch_models`, so PyTorch and dgl must be installed as well:

> conda install -c conda-forge rdkit deepchem==2.6.1
>
> pip install tensorflow-gpu~=2.4

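Before moving on, it can help to confirm that the packages are importable in the active environment. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` and the list of package names are assumptions based on the imports used in this tutorial.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Package names assumed from the imports used in this tutorial.
required = ["deepchem", "torch", "dgl", "pymatgen", "mofdscribe"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

The check only tests whether a package can be located; it does not verify versions.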

## Imports


In [69]:
from deepchem.data import Dataset, DiskDataset
from deepchem.data.data_loader import InMemoryLoader, DataLoader
from deepchem.feat import MaterialStructureFeaturizer
from deepchem.feat.graph_data import GraphData
from deepchem.models.torch_models.cgcnn import CGCNNModel
from deepchem.molnet.load_function.molnet_loader import TransformerGenerator, _MolnetLoader
from deepchem.utils.data_utils import download_url, get_data_dir
from deepchem.utils.typing import PymatgenStructure
from mofdscribe.bench import PBEBandGapBench
from pymatgen.analysis.local_env import CrystalNN, CutOffDictNN, JmolNN
from pymatgen.core import Structure
from typing import List, Tuple, Union, Iterable, Sequence, Optional, Any, Iterator

import deepchem as dc
import json
import logging
import pandas as pd
import numpy as np
import concurrent.futures
import time
import os


logger = logging.getLogger(__name__)

ATOM_INIT_JSON_URL = "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/atom_init.json"
VESTA_NN = CutOffDictNN.from_preset("vesta_2019")


## Plumbing to make Deepchem work with custom datasets


In [121]:
model = CGCNNModel(mode="regression", in_edge_dim=16)  # we use default settings for demonstration purposes.


Below is a custom data loader that also supports multiprocessing. It is commented out here because this notebook uses the default `InMemoryLoader`. We use the custom version in the Python script, which runs on a larger dataset.


In [80]:
# class InMemoryLoader(DataLoader):
#     """Facilitate Featurization of In-memory objects.

#     When featurizing a dataset, it's often the case that the initial set of
#     data (pre-featurization) fits handily within memory. (For example, perhaps
#     it fits within a column of a pandas DataFrame.) In this case, it would be
#     convenient to directly be able to featurize this column of data. However,
#     the process of featurization often generates large arrays which quickly eat
#     up available memory. This class provides convenient capabilities to process
#     such in-memory data by checkpointing generated features periodically to
#     disk.

#     Example
#     -------
#     Here's an example with only datapoints and no labels or weights.

#     >>> import deepchem as dc
#     >>> smiles = ["C", "CC", "CCC", "CCCC"]
#     >>> featurizer = dc.feat.CircularFingerprint()
#     >>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
#     >>> dataset = loader.create_dataset(smiles, shard_size=2)
#     >>> len(dataset)
#     4

#     Here's an example with both datapoints and labels

#     >>> import deepchem as dc
#     >>> smiles = ["C", "CC", "CCC", "CCCC"]
#     >>> labels = [1, 0, 1, 0]
#     >>> featurizer = dc.feat.CircularFingerprint()
#     >>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
#     >>> dataset = loader.create_dataset(zip(smiles, labels), shard_size=2)
#     >>> len(dataset)
#     4

#     """

#     def create_dataset(
#         self,
#         inputs: Sequence[Any],
#         data_dir: Optional[str] = None,
#         shard_size: Optional[int] = 8192,
#         n_workers: Optional[int] = 8,
#     ) -> DiskDataset:
#         """Create and return a `Dataset` object by featurizing provided files.

#         Reads in `inputs` and uses `self.featurizer` to featurize the
#         data in these input files.  For large files, automatically shards
#         into smaller chunks of `shard_size` datapoints for convenience.
#         Returns a `Dataset` object that contains the featurized dataset.

#         This implementation assumes that the helper methods `_get_shards`
#         and `_featurize_shard` are implemented and that each shard
#         returned by `_get_shards` is a pandas dataframe.  You may choose
#         to reuse or override this method in your subclass implementations.

#         Parameters
#         ----------
#         inputs : Sequence[Any]
#             List of inputs to process. Entries can be arbitrary objects so long as
#             they are understood by `self.featurizer`
#         data_dir : str, optional (default None)
#             Directory to store featurized dataset.
#         shard_size: int, optional (default 8192)
#             Number of examples stored in each shard.

#         Returns
#         -------
#         DiskDataset
#           A `DiskDataset` object containing a featurized representation of data
#           from `inputs`.
#         """
#         logger.info("Loading raw samples now.")
#         logger.info("shard_size: %s" % str(shard_size))

#         if not isinstance(inputs, list):
#             try:
#                 inputs = list(inputs)
#             except TypeError:
#                 inputs = [inputs]

#         def _shard_generator():
#             global_index = 0  # noqa: F841
#             all_shard = [s for s in self._get_shards(inputs, shard_size)]
#             entry = 0
#             global_entry = [0]
#             for s in all_shard:
#                 entry += len(s)
#                 global_entry.append(entry)

#             with concurrent.futures.ProcessPoolExecutor(max_workers=n_workers) as executor:
#                 time1 = time.time()
#                 exe_results = executor.map(self._featurize_shard, all_shard, global_entry)
#                 time2 = time.time()
#                 logger.info("TIMING: featurizing shard took %0.3f s" % (time2 - time1))
#             shard_results = [r for r in exe_results]

#             for sr in shard_results:
#                 X, y, w, ids = sr[0], sr[1], sr[2], sr[3]
#                 yield X, y, w, ids

#         return DiskDataset.create_dataset(_shard_generator(), data_dir, self.tasks)

#     def _get_shards(
#         self, inputs: List, shard_size: Optional[int]
#     ) -> Iterator[pd.DataFrame]:  # noqa: DAR301
#         """Break up input into shards.

#         Parameters
#         ----------
#         inputs: List
#           Each entry in this list must be of the form `(featurization_input,
#           label, weight, id)` or `(featurization_input, label, weight)` or
#           `(featurization_input, label)` or `featurization_input` for one
#           datapoint, where `featurization_input` is any input that is recognized
#           by `self.featurizer`.
#         shard_size: int, optional
#           The size of shard to generate.

#         Yields
#         -------
#         Iterator[pd.DataFrame]
#           Iterator which iterates over shards of data.
#         """
#         current_shard: List = []
#         for i, datapoint in enumerate(inputs):
#             if i != 0 and shard_size is not None and i % shard_size == 0:
#                 shard_data = current_shard
#                 current_shard = []
#                 yield shard_data
#             current_shard.append(datapoint)
#         yield current_shard  # noqa: DAR301

#     # FIXME: Signature of "_featurize_shard" incompatible with supertype "DataLoader"
#     def _featurize_shard(  # type: ignore[override]
#         self, shard: List, global_index: List
#     ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:  # noqa: DAR401
#         """Featurizes a shard of an input data.

#         Parameters
#         ----------
#         shard: List
#           List each entry of which must be of the form `(featurization_input,
#           label, weight, id)` or `(featurization_input, label, weight)` or
#           `(featurization_input, label)` or `featurization_input` for one
#           datapoint, where `featurization_input` is any input that is recognized
#           by `self.featurizer`.
#         global_index: int
#           The starting index for this shard in the full set of provided inputs

#         Returns
#         ------
#         Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
#           The tuple is `(X, y, w, ids)`. All values are numpy arrays.

#         Raises
#         ------
#         ValueError :
#           if entry has more than 4 elements.
#         """
#         features = []
#         labels = []
#         weights = []
#         ids = []
#         n_tasks = len(self.tasks)
#         for _i, entry in enumerate(shard):
#             if not isinstance(entry, tuple):
#                 entry = (entry,)
#             if len(entry) > 4:
#                 raise ValueError(
#                     "Entry is malformed and must be of length 1-4 containing featurization_input"
#                     "and optionally label, weight, and id."
#                 )
#             if len(entry) == 4:
#                 featurization_input, label, weight, entry_id = entry
#             elif len(entry) == 3:
#                 featurization_input, label, weight = entry
#                 entry_id = global_index
#             elif len(entry) == 2:
#                 featurization_input, label = entry
#                 weight = np.ones((n_tasks), np.float32)
#                 entry_id = global_index
#             elif len(entry) == 1:
#                 featurization_input = entry
#                 label = np.zeros((n_tasks), np.float32)
#                 weight = np.zeros((n_tasks), np.float32)
#                 entry_id = global_index
#             feature = self.featurizer(featurization_input)
#             features.append(feature)
#             weights.append(weight)
#             labels.append(label)
#             ids.append(entry_id)
#         X = np.concatenate(features, axis=0)
#         return X, np.array(labels), np.array(weights), np.array(ids)
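
The sharding logic of the commented-out loader can be tried in isolation. The sketch below mirrors `_get_shards` and the computation of the shard start offsets with plain Python lists; `get_shards` and `shard_offsets` are illustrative names, not part of Deepchem.

```python
from typing import Any, Iterator, List, Optional, Sequence


def get_shards(inputs: Sequence[Any], shard_size: Optional[int]) -> Iterator[List[Any]]:
    """Yield consecutive chunks of at most `shard_size` items."""
    current: List[Any] = []
    for i, item in enumerate(inputs):
        if i != 0 and shard_size is not None and i % shard_size == 0:
            yield current
            current = []
        current.append(item)
    yield current


def shard_offsets(shards: Sequence[Sequence[Any]]) -> List[int]:
    """Return the index of the first item of every shard in the full input."""
    offsets, start = [], 0
    for shard in shards:
        offsets.append(start)
        start += len(shard)
    return offsets


shards = list(get_shards(list(range(10)), shard_size=4))
print(shards)                 # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(shard_offsets(shards))  # [0, 4, 8]
```

In the loader above, each shard and its start offset are passed together to `_featurize_shard` through `executor.map`.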


We use a simpler featurizer that determines neighbors with bond heuristics. You can also use Deepchem's default `CGCNNFeaturizer` instead.

In [81]:
class CrystalBondFeaturizer(MaterialStructureFeaturizer):
    """
    Calculate structure graph features for crystals.

    Based on the implementation in Crystal Graph Convolutional
    Neural Networks (CGCNN). The method constructs a crystal graph
    representation including atom features and bond features (neighbor
    distances). Neighbors are determined using bond heuristics.
    Optionally, a Gaussian filter is applied to neighbor distances.
    All units are in Angstrom.
    This featurizer requires the optional dependency pymatgen. It may
    be useful when 3D coordinates are available and when using graph
    network models and crystal graph convolutional networks.
    See [1]_ for more details.
    References
    ----------
    .. [1] T. Xie and J. C. Grossman, "Crystal graph convolutional
       neural networks for an accurate and interpretable prediction
       of material properties", Phys. Rev. Lett. 120, 2018,
       https://arxiv.org/abs/1710.10324
    Examples
    --------
    >>> import pymatgen as mg
    >>> featurizer = CrystalBondFeaturizer()
    >>> lattice = mg.core.Lattice.cubic(4.2)
    >>> structure = mg.core.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
    >>> features = featurizer.featurize([structure])
    >>> feature = features[0]
    >>> print(type(feature))
    <class 'deepchem.feat.graph_data.GraphData'>
    Note
    ----
    This class requires Pymatgen to be installed.
    """

    def __init__(self, heuristic: str = "vesta", step: float = 0.2, radius: float = 3.0):
        """Initialize CrystalBondFeaturizer.

        Parameters
        ----------
        heuristic : str
            The heuristic to use for determining neighbors.
        step : float
            Spacing between the centers of the Gaussian filter that is used to build
            the edge features. If None, use only the bond length. Default is 0.2.
        radius : float
            Largest center of the Gaussian filter. Neighbors are determined by `heuristic`,
            so this value only sets the range of the filter. Default is 3.0.

        Raises
        ------
        ValueError
            If `heuristic` is not one of the following: ["vesta", "jmol", "crystal"].
        """
        heuristic = heuristic.lower()
        if heuristic == "vesta":
            self.nn = VESTA_NN
        elif heuristic == "jmol":
            self.nn = JmolNN()
        elif heuristic == "crystal":
            self.nn = CrystalNN()
        else:
            raise ValueError("Unknown heuristic: {}".format(heuristic))
        self.step = step
        self.radius = radius
        # load atom_init.json
        data_dir = get_data_dir()
        download_url(ATOM_INIT_JSON_URL, data_dir)
        atom_init_json_path = os.path.join(data_dir, "atom_init.json")
        with open(atom_init_json_path, "r") as f:
            atom_init_json = json.load(f)

        self.atom_features = {
            int(key): np.array(value, dtype=np.float32) for key, value in atom_init_json.items()
        }
        self.valid_atom_number = set(self.atom_features.keys())

    def _featurize(self, datapoint: PymatgenStructure, **kwargs) -> GraphData:
        """Calculate crystal graph features from pymatgen structure.

        Parameters
        ----------
        datapoint: pymatgen.core.Structure
                A periodic crystal composed of a lattice and a sequence of atomic
                sites with 3D coordinates and elements.

        Returns
        -------
        graph: GraphData
                A crystal graph with CGCNN style features.
        """
        if type(datapoint) is not Structure:
            logger.warning(
                f"CrystalBondFeaturizer requires pymatgen.core.Structure, got {type(datapoint)}"
            )
            raise ValueError(
                f"CrystalBondFeaturizer requires pymatgen.core.Structure, got {type(datapoint)}"
            )
        node_features = self._get_node_features(datapoint)
        edge_index, edge_features = self._get_edge_features_and_index(datapoint)
        graph = GraphData(node_features, edge_index, edge_features)
        return graph

    def _get_node_features(self, struct: PymatgenStructure) -> np.ndarray:
        """Get the node feature from `atom_init.json`.

        The `atom_init.json` was collected
        from `data/sample-regression/atom_init.json` in the CGCNN repository.

        Parameters
        ----------
        struct: pymatgen.core.Structure
            A periodic crystal composed of a lattice and a sequence of atomic
            sites with 3D coordinates and elements.

        Returns
        -------
        node_features: np.ndarray
            A numpy array of shape `(num_nodes, 92)`.
        """
        node_features = []
        for site in struct:
            # check whether the atom feature exists or not
            if site.specie.number not in self.valid_atom_number:
                raise RuntimeError(f"No atom features for atomic number {site.specie.number}")
            node_features.append(self.atom_features[site.specie.number])
        return np.vstack(node_features).astype(float)

    def _get_edge_features_and_index(
        self, struct: PymatgenStructure
    ) -> Tuple[np.ndarray, np.ndarray]:
        """Calculate the edge feature and edge index from pymatgen structure.

        Parameters
        ----------
        struct: pymatgen.core.Structure
            A periodic crystal composed of a lattice and a sequence of atomic
            sites with 3D coordinates and elements.
        Returns
        -------
        edge_idx: np.ndarray, dtype int
            A numpy array of shape `(2, num_edges)`.
        edge_features: np.ndarray
            A numpy array of shape `(num_edges, filter_length)`, where `filter_length` is
            `(self.radius / self.step) + 1`. The edge features are built by applying a Gaussian
            filter to the distances between nodes. If `self.step` is None, the raw distances
            are returned with shape `(num_edges,)`.
        """
        neighbors_ = self.nn.get_all_nn_info(struct)

        neighbors = []
        for n in neighbors_:
            sites = [s["site"] for s in n]
            n = sorted(sites, key=lambda x: x[1])
            neighbors.append(n)

        # construct bi-directed graph
        src_idx, dest_idx = [], []
        edge_distances = []
        for node_idx, neighbor in enumerate(neighbors):
            src_idx.extend([node_idx] * len(neighbor))
            dest_idx.extend([site[2] for site in neighbor])
            edge_distances.extend([site[1] for site in neighbor])

        edge_idx = np.array([src_idx, dest_idx], dtype=int)

        if self.step is None:
            edge_features = np.array(edge_distances)
        else:
            edge_features = self._gaussian_filter(np.array(edge_distances, dtype=float))
        return edge_idx, edge_features

    def _gaussian_filter(self, distances: np.ndarray) -> np.ndarray:
        """Apply Gaussian filter to an array of interatomic distances.

        Parameters
        ----------
        distances : np.ndarray
                A numpy array of the shape `(num_edges, )`.
        Returns
        -------
        expanded_distances: np.ndarray
                Expanded distance tensor after Gaussian filtering.
                The shape is `(num_edges, filter_length)`. The `filter_length` is
                (self.radius / self.step) + 1.
        """
        filt = np.arange(0, self.radius + self.step, self.step)

        # Increase dimension of distance tensor and apply filter
        expanded_distances = np.exp(-((distances[..., np.newaxis] - filt) ** 2) / self.step ** 2)

        return expanded_distances
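
To see what `_gaussian_filter` produces, the expansion can be reproduced with NumPy alone. The sketch below uses the featurizer defaults (`radius=3.0`, `step=0.2`); the function name `gaussian_expand` and the example distances are made up for illustration.

```python
import numpy as np


def gaussian_expand(distances: np.ndarray, radius: float = 3.0, step: float = 0.2) -> np.ndarray:
    """Expand each distance on a grid of Gaussians centered at 0, step, ..., radius."""
    centers = np.arange(0, radius + step, step)
    return np.exp(-((distances[..., np.newaxis] - centers) ** 2) / step**2)


distances = np.array([1.0, 1.45, 2.3])  # made-up bond lengths in Angstrom
features = gaussian_expand(distances)
print(features.shape)           # one row per edge, one column per Gaussian center
print(features.argmax(axis=1))  # index of the center closest to each distance
```

With the defaults, the second dimension is 16, which is why the model above is created with `in_edge_dim=16`. Changing `radius` or `step` in the featurizer requires changing `in_edge_dim` accordingly.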


We now implement a loader that takes the structures and labels. One note:

- We convert the inputs to lists because the splitters return generators by default.


In [129]:
class StructureDataLoader(_MolnetLoader):
    """Load in-memory structures and labels into a Deepchem dataset.

    This loader does not read files from disk. It takes the structures
    and labels directly and featurizes the structures with the
    `CrystalBondFeaturizer`.

    Parameters
    ----------
    structures : Iterable[Structure]
            the pymatgen structures to featurize.
    labels : Iterable[float]
            the target values, one per structure.
    idx : Iterable[int]
            the indices of the structures in the original dataset.
    splitter : Union[dc.splits.Splitter, str], optional
            the splitter to use for splitting the data into training, validation, and
            test sets.  Alternatively you can pass one of the names from
            dc.molnet.splitters as a shortcut.  If this is None, all the data
            will be included in a single dataset.
    transformer_generators : List[Union[TransformerGenerator, str]]
            the Transformers to apply to the data.  Each one is specified by a
            TransformerGenerator or, as a shortcut, one of the names from
            dc.molnet.transformers.
    tasks : List[str]
            the names of the tasks in the dataset
    data_dir : Optional[str]
            a directory to save the raw data in
    save_dir : Optional[str]
            a directory to save the dataset in
    number_files : int
            stored on the instance, but not used by `create_dataset`.
    shard_size : int
            the number of structures per shard.
    n_workers : int
            the number of worker processes. It is only relevant for the custom
            multiprocessing loader and is not passed to the default `InMemoryLoader`.
    """

    def __init__(
        self,
        structures: Iterable[Structure],
        labels: Iterable[float],
        idx: Iterable[int],
        splitter: Optional[Union[str, dc.splits.Splitter]] = None,  # or a name such as "random"
        transformer_generators: List[Union[str, TransformerGenerator]] = ["normalization"],
        tasks: List[str] = ["qmof"],
        data_dir: Optional[str] = None,
        save_dir: Optional[str] = None,
        number_files: float = np.inf,
        shard_size: int = 8,
        n_workers: int = 1,
    ):
        # The dataset returns immutable IStructure objects, but the featurizer expects Structure objects.
        self.structures = [[Structure.from_sites(s.sites)] for s in structures]
        self.labels = np.array(list(labels)).reshape(-1,1)
        self.idx = np.array(list(idx))

        self.number_files = number_files

        self.shard_size = shard_size
        self.n_workers = n_workers

        super().__init__(
            featurizer=CrystalBondFeaturizer(),
            splitter=splitter,
            transformer_generators=transformer_generators,
            tasks=tasks,
            data_dir=data_dir,
            save_dir=save_dir,
        )

    def create_dataset(self) -> Dataset:
        """Featurize the structures and return the dataset."""
        loader = InMemoryLoader(
            tasks=self.tasks,
            featurizer=self.featurizer,
        )
        
        return loader.create_dataset(
            list(zip(self.structures, self.labels)),
            shard_size=self.shard_size,
            #     n_workers=self.n_workers,
        )
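
Two details of `__init__` are easy to miss: a generator can only be iterated once, and the labels are reshaped to `(n_samples, 1)`, that is, one column for the single task. A small sketch with made-up values shows both; nothing here depends on Deepchem, and `label_generator` is a hypothetical stand-in.

```python
import numpy as np


def label_generator():
    """Stand-in for a splitter output that yields labels lazily."""
    yield from (0.5, 1.2, 3.4)


labels = label_generator()
first_pass = list(labels)   # consumes the generator
second_pass = list(labels)  # nothing left to iterate over
print(first_pass, second_pass)  # [0.5, 1.2, 3.4] []

# Materialize once, then reshape to one column for the single task.
y = np.array(first_pass).reshape(-1, 1)
print(y.shape)  # (3, 1)
```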


Let's check that our plumbing works. First, we initialize a bench class to access the dataset.


In [130]:
bench = PBEBandGapBench(None, "test", debug=True)


2022-08-02 11:53:28.297 | DEBUG    | mofdscribe.datasets.qmof_dataset:__init__:120 - Dropped 0 duplicate basenames. New length 15844
2022-08-02 11:53:28.313 | DEBUG    | mofdscribe.datasets.qmof_dataset:__init__:126 - Dropped 153 duplicate graphs. New length 15691
2022-08-02 11:53:28.899 | DEBUG    | mofdscribe.datasets.qmof_dataset:__init__:120 - Dropped 0 duplicate basenames. New length 15844
2022-08-02 11:53:28.915 | DEBUG    | mofdscribe.datasets.qmof_dataset:__init__:126 - Dropped 153 duplicate graphs. New length 15691
2022-08-02 11:53:28.965 | DEBUG    | mofdscribe.splitters.splitters:__init__:106 - Splitter settings | shuffle True, random state None, sample frac 0.01, q (0, 0.25, 0.5, 0.75, 1)


In [131]:
# get some random indices (without replacement, so that no structure is drawn twice)
indices = np.random.choice(np.arange(len(bench._ds)), 10, replace=False)

structures = bench._ds.get_structures(indices)
labels = bench._ds._df[bench._targets].iloc[indices].values


In [132]:
loader = StructureDataLoader(structures, labels, indices, data_dir="test-data", save_dir='test')


In [133]:
ds = loader.load_dataset("test", True)


In [134]:
ds

(['qmof'],
 (<DiskDataset X.shape: (10,), y.shape: (10, 1), w.shape: (10, 1), ids: [0 0 0 0 0 0 0 0 8 8], task_names: ['qmof']>,),
 [<deepchem.trans.transformers.NormalizationTransformer at 0x2b2878eb0>])

In [122]:
model.fit(ds[1][0])




0.050350230932235715

## Implementing a model class for `Bench`


Now that we know that the plumbing works, we only have to wrap it in a class.


In [None]:
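class CGCNNModel:  # note: this name shadows the Deepchem `CGCNNModel` imported above
    """Wrap a Deepchem model so that it can be fitted and evaluated by `Bench`."""

    def __init__(self, model):
        self.model = model
        self.scaler = None
        self._fitted = False

    def _create_ds(self):
        ...

    def fit(self, idx, structures, y):
        ...

    def predict(self, idx, structures):
        if not self._fitted:
            raise RuntimeError("Model not fitted")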
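
The elided methods depend on the Deepchem objects defined above, but the overall contract can be illustrated without them. The following is a hypothetical, self-contained sketch: `MeanRegressor` stands in for the Deepchem model, `SketchModel` is an illustrative wrapper name, and the label standardization is one possible use of the `scaler` attribute, not necessarily what the final implementation does. The method signatures follow the skeleton above.

```python
import numpy as np


class MeanRegressor:
    """Hypothetical stand-in for the wrapped model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))

    def predict(self, X):
        return np.full(len(X), self.mean_)


class SketchModel:
    """Illustrative wrapper exposing `fit(idx, structures, y)` and `predict(idx, structures)`."""

    def __init__(self, model):
        self.model = model
        self.scaler = None  # (mean, std) of the training labels
        self._fitted = False

    def fit(self, idx, structures, y):
        y = np.asarray(y, dtype=float).ravel()
        std = y.std()
        self.scaler = (y.mean(), std if std > 0 else 1.0)
        self.model.fit(structures, (y - self.scaler[0]) / self.scaler[1])
        self._fitted = True

    def predict(self, idx, structures):
        if not self._fitted:
            raise RuntimeError("Model not fitted")
        scaled = self.model.predict(structures)
        # Undo the label scaling so that predictions are in the original units.
        return scaled * self.scaler[1] + self.scaler[0]


wrapper = SketchModel(MeanRegressor())
wrapper.fit([0, 1, 2], ["s0", "s1", "s2"], [1.0, 2.0, 3.0])
print(wrapper.predict([3], ["s3"]))
```

Keeping the scaling inside the wrapper means that the caller only ever sees labels and predictions in the original units.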
    