# How to submit to the leaderboard


In this tutorial, we will build a graph-convolutional neural network for the PBE bandgap task.
We will use the [Crystal Graph Convolutional Neural Network](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.120.145301) implemented in the [DeepChem library](https://deepchem.io/).

We chose this model because it requires a more involved training loop and more customization, both of which you might also need in a real-world scenario.


To install DeepChem, follow the installation instructions on [the DeepChem landing page](https://deepchem.io/). You'll need to run something along the following lines (and also install PyTorch and DGL):

> conda install -c conda-forge rdkit deepchem==2.6.1
>
> pip install tensorflow-gpu~=2.4
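
Once the installation has finished, a quick sanity check can confirm that the packages mentioned above are importable. The following is a minimal sketch that uses only the standard library. It assumes the usual import names `deepchem`, `torch`, and `dgl` for DeepChem, PyTorch, and DGL; the list `required` is introduced here purely for illustration.

```python
from importlib.util import find_spec

# Import names assumed for DeepChem, PyTorch, and DGL.
required = ["deepchem", "torch", "dgl"]

# find_spec returns None for a top-level package that cannot be found.
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If a package is reported as missing, install it before running the cells below.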


## Imports


In [17]:
from deepchem.data import Dataset
from deepchem.data import DiskDataset
from deepchem.data.data_loader import DataLoader
from deepchem.feat import CGCNNFeaturizer
from deepchem.models.torch_models.cgcnn import CGCNNModel
from deepchem.molnet.load_function.molnet_loader import TransformerGenerator, _MolnetLoader
from mofdscribe.bench import PBEBandGapBench
from os import PathLike
from os.path import join
from pymatgen.core import Structure
from typing import Any, Iterator, List, Optional, Sequence, Tuple, Union, Iterable
import concurrent.futures
import deepchem as dc
import json
import logging
import numpy as np
import pandas as pd
import time

logger = logging.getLogger(__name__)


In [10]:
model = CGCNNModel(mode="regression")  # we use default settings for demonstration purposes.


DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


Below, we define a custom data loader that also supports multiprocessing. You could, of course, use DeepChem's default `InMemoryLoader` instead.


In [14]:
class InMemoryLoader(DataLoader):
    """Facilitate Featurization of In-memory objects.

    When featurizing a dataset, it's often the case that the initial set of
    data (pre-featurization) fits handily within memory. (For example, perhaps
    it fits within a column of a pandas DataFrame.) In this case, it would be
    convenient to directly be able to featurize this column of data. However,
    the process of featurization often generates large arrays which quickly eat
    up available memory. This class provides convenient capabilities to process
    such in-memory data by checkpointing generated features periodically to
    disk.

    Example
    -------
    Here's an example with only datapoints and no labels or weights.

    >>> import deepchem as dc
    >>> smiles = ["C", "CC", "CCC", "CCCC"]
    >>> featurizer = dc.feat.CircularFingerprint()
    >>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
    >>> dataset = loader.create_dataset(smiles, shard_size=2)
    >>> len(dataset)
    4

    Here's an example with both datapoints and labels

    >>> import deepchem as dc
    >>> smiles = ["C", "CC", "CCC", "CCCC"]
    >>> labels = [1, 0, 1, 0]
    >>> featurizer = dc.feat.CircularFingerprint()
    >>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
    >>> dataset = loader.create_dataset(zip(smiles, labels), shard_size=2)
    >>> len(dataset)
    4

    """

    def create_dataset(
        self,
        inputs: Sequence[Any],
        data_dir: Optional[str] = None,
        shard_size: Optional[int] = 8192,
        n_workers: Optional[int] = 8,
    ) -> DiskDataset:
        """Create and return a `Dataset` object by featurizing the provided inputs.

        Splits `inputs` into shards of `shard_size` datapoints, featurizes the
        shards in parallel with `self.featurizer`, and returns a `Dataset`
        object that contains the featurized dataset.

        This implementation assumes that the helper methods `_get_shards`
        and `_featurize_shard` are implemented and that each shard
        returned by `_get_shards` is a list of datapoints.  You may choose
        to reuse or override this method in your subclass implementations.

        Parameters
        ----------
        inputs : Sequence[Any]
            List of inputs to process. Entries can be arbitrary objects so long as
            they are understood by `self.featurizer`
        data_dir : str, optional (default None)
            Directory to store featurized dataset.
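        shard_size: int, optional (default 8192)
            Number of examples stored in each shard.
        n_workers: int, optional (default 8)
            Number of worker processes used to featurize the shards in parallel.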

        Returns
        -------
        DiskDataset
          A `DiskDataset` object containing a featurized representation of data
          from `inputs`.
        """
        logger.info("Loading raw samples now.")
        logger.info("shard_size: %s" % str(shard_size))

        if not isinstance(inputs, list):
            try:
                inputs = list(inputs)
            except TypeError:
                inputs = [inputs]

        def _shard_generator():
            all_shard = list(self._get_shards(inputs, shard_size))
            # global_entry[k] is the index of the first datapoint of shard k
            # within the full list of inputs.
            entry = 0
            global_entry = [0]
            for s in all_shard:
                entry += len(s)
                global_entry.append(entry)

            with concurrent.futures.ProcessPoolExecutor(max_workers=n_workers) as executor:
                time1 = time.time()
                # executor.map returns a lazy iterator; collect the results here
                # so that the timing covers the featurization itself.
                shard_results = list(
                    executor.map(self._featurize_shard, all_shard, global_entry)
                )
                time2 = time.time()
                logger.info("TIMING: featurizing all shards took %0.3f s" % (time2 - time1))

            for sr in shard_results:
                X, y, w, ids = sr[0], sr[1], sr[2], sr[3]
                yield X, y, w, ids

        return DiskDataset.create_dataset(_shard_generator(), data_dir, self.tasks)

    def _get_shards(
        self, inputs: List, shard_size: Optional[int]
    ) -> Iterator[List]:  # noqa: DAR301
        """Break up input into shards.

        Parameters
        ----------
        inputs: List
          Each entry in this list must be of the form `(featurization_input,
          label, weight, id)` or `(featurization_input, label, weight)` or
          `(featurization_input, label)` or `featurization_input` for one
          datapoint, where `featurization_input` is any input that is recognized
          by `self.featurizer`.
        shard_size: int, optional
          The size of shard to generate.

        Yields
        ------
        Iterator[List]
          Iterator which iterates over shards of data.
        """
        current_shard: List = []
        for i, datapoint in enumerate(inputs):
            if i != 0 and shard_size is not None and i % shard_size == 0:
                shard_data = current_shard
                current_shard = []
                yield shard_data
            current_shard.append(datapoint)
        yield current_shard  # noqa: DAR301

    # FIXME: Signature of "_featurize_shard" incompatible with supertype "DataLoader"
    def _featurize_shard(  # type: ignore[override]
        self, shard: List, global_index: int
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:  # noqa: DAR401
        """Featurizes a shard of an input data.

        Parameters
        ----------
        shard: List
          List each entry of which must be of the form `(featurization_input,
          label, weight, id)` or `(featurization_input, label, weight)` or
          `(featurization_input, label)` or `featurization_input` for one
          datapoint, where `featurization_input` is any input that is recognized
          by `self.featurizer`.
        global_index: int
          The starting index for this shard in the full set of provided inputs

        Returns
        -------
        Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
          The tuple is `(X, y, w, ids)`. All values are numpy arrays.

        Raises
        ------
        ValueError :
          if entry has more than 4 elements.
        """
        features = []
        labels = []
        weights = []
        ids = []
        n_tasks = len(self.tasks)
        for i, entry in enumerate(shard):
            if not isinstance(entry, tuple):
                entry = (entry,)
            if len(entry) > 4:
                raise ValueError(
                    "Entry is malformed and must be of length 1-4 containing featurization_input "
                    "and optionally label, weight, and id."
                )
            if len(entry) == 4:
                featurization_input, label, weight, entry_id = entry
            elif len(entry) == 3:
                featurization_input, label, weight = entry
                # Fall back to the position of this entry in the full input list.
                entry_id = global_index + i
            elif len(entry) == 2:
                featurization_input, label = entry
                weight = np.ones((n_tasks), np.float32)
                entry_id = global_index + i
            elif len(entry) == 1:
                featurization_input = entry
                label = np.zeros((n_tasks), np.float32)
                weight = np.zeros((n_tasks), np.float32)
                entry_id = global_index + i
            feature = self.featurizer(featurization_input)
            features.append(feature)
            weights.append(weight)
            labels.append(label)
            ids.append(entry_id)
        X = np.concatenate(features, axis=0)
        return X, np.array(labels), np.array(weights), np.array(ids)
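
The interplay between sharding, per-shard start offsets, and the parallel `map` call can be easier to follow in isolation. The following is a minimal sketch, not the loader itself: `get_shards` and `process_shard` are hypothetical stand-ins, the "featurization" is just upper-casing strings, and a thread pool is used in place of the process pool so that the snippet runs in any environment (both executors share the same `map` interface).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate
from typing import Iterator, List, Optional, Sequence, Tuple


def get_shards(inputs: Sequence, shard_size: Optional[int]) -> Iterator[List]:
    """Yield consecutive chunks of at most `shard_size` items."""
    if shard_size is None:
        yield list(inputs)
        return
    for start in range(0, len(inputs), shard_size):
        yield list(inputs[start : start + shard_size])


def process_shard(shard: List[str], start_index: int) -> Tuple[List[str], List[int]]:
    """Stand-in for featurization: upper-case each item and assign global ids."""
    features = [item.upper() for item in shard]
    ids = [start_index + i for i in range(len(shard))]
    return features, ids


inputs = ["a", "b", "c", "d", "e"]
shards = list(get_shards(inputs, shard_size=2))

# Index of the first item of each shard within the full input list.
offsets = [0] + list(accumulate(len(s) for s in shards))[:-1]

with ThreadPoolExecutor(max_workers=2) as executor:
    # map() yields results in input order; list() waits for all of them.
    results = list(executor.map(process_shard, shards, offsets))

all_ids = [i for _, ids in results for i in ids]
print(shards)   # [['a', 'b'], ['c', 'd'], ['e']]
print(all_ids)  # [0, 1, 2, 3, 4]
```

Because `map` preserves the order of its inputs, the shards come back in the order in which they were submitted, regardless of which worker finishes first.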


We now implement a loader that takes the structures and labels and builds a featurized dataset from them.


In [None]:
class StructureDataLoader(_MolnetLoader):
    """Loader that featurizes in-memory structures.

    This data loader takes pymatgen structures together with their labels
    and identifiers, featurizes the structures with `CGCNNFeaturizer`, and
    builds the dataset with the `InMemoryLoader` defined above.

    Note that `structures`, `labels`, and `idx` must be aligned: the i-th
    label and the i-th identifier belong to the i-th structure.

    Parameters
    ----------
    structures : Iterable[Structure]
            the structures to featurize.
    labels : Iterable[float]
            the label of each structure.
    idx : Iterable[int]
            the identifier of each structure.
    splitter : Union[dc.splits.Splitter, str], optional
            the splitter to use for splitting the data into training, validation, and
            test sets.  Alternatively you can pass one of the names from
            dc.molnet.splitters as a shortcut.  If this is None, all the data
            will be included in a single dataset.
    transformer_generators : List[Union[TransformerGenerator, str]]
            the Transformers to apply to the data.  Each one is specified by a
            TransformerGenerator or, as a shortcut, one of the names from
            dc.molnet.transformers.
    tasks : List[str]
            the names of the tasks in the dataset
    data_dir : Optional[str]
            a directory to save the raw data in
    save_dir : Optional[str]
            a directory to save the dataset in
    number_files : int or float
            stored as an attribute; not used by `create_dataset` below.
    shard_size : int
            the number of structures per shard. Defaults to 8.
    n_workers : int
            the number of worker processes used for featurization. Defaults to 8.
    """

    def __init__(
        self,
        structures: Iterable[Structure],
        labels: Iterable[float],
        idx: Iterable[int],
        splitter: Union[str, dc.splits.Splitter] = "random",
        transformer_generators: Optional[List[Union[str, TransformerGenerator]]] = None,
        tasks: Optional[List[str]] = None,
        data_dir: Optional[str] = None,
        save_dir: Optional[str] = None,
        number_files: float = np.inf,  # np.infty was removed in NumPy 2.0
        shard_size: int = 8,
        n_workers: int = 8,
    ):
        # Materialize the iterables so that we can call len() on them
        # and iterate over them more than once.
        self.structures = list(structures)
        self.labels = list(labels)
        self.idx = list(idx)

        self.number_files = number_files

        self.shard_size = shard_size
        self.n_workers = n_workers

        super().__init__(
            featurizer=CGCNNFeaturizer(),
            splitter=splitter,
            transformer_generators=transformer_generators,
            tasks=tasks,
            data_dir=data_dir,
            save_dir=save_dir,
        )

    def create_dataset(self) -> Dataset:
        """Featurize the structures and return the resulting dataset."""
        loader = InMemoryLoader(
            tasks=self.tasks,
            featurizer=self.featurizer,
        )

        weights = [np.ones((len(loader.tasks)), np.float32)] * len(self.structures)

        return loader.create_dataset(
            list(zip(self.structures, self.labels, weights, self.idx)),
            shard_size=self.shard_size,
            n_workers=self.n_workers,
        )
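
One detail worth keeping in mind is that `zip` silently truncates to the shortest input, so mismatched structures, labels, and identifiers would be dropped without an error. The following is a minimal sketch of a length check that could be run before building the entries. The helper `build_entries` is hypothetical and not part of DeepChem or mofdscribe; it uses plain strings and lists as stand-ins for structures and NumPy weight arrays.

```python
from typing import Any, List, Sequence, Tuple


def build_entries(
    structures: Sequence[Any],
    labels: Sequence[float],
    idx: Sequence[int],
    n_tasks: int = 1,
) -> List[Tuple[Any, float, List[float], int]]:
    """Pair each structure with its label, a unit weight per task, and its id."""
    if not (len(structures) == len(labels) == len(idx)):
        raise ValueError(
            f"Length mismatch: {len(structures)} structures, "
            f"{len(labels)} labels, {len(idx)} ids."
        )
    # One independent weight list per entry (the loader above uses NumPy arrays).
    weights = [[1.0] * n_tasks for _ in structures]
    return list(zip(structures, labels, weights, idx))


entries = build_entries(["s0", "s1"], [0.5, 1.5], [7, 8])
print(entries)  # [('s0', 0.5, [1.0], 7), ('s1', 1.5, [1.0], 8)]
```

Each resulting tuple has the `(featurization_input, label, weight, id)` form that `_featurize_shard` expects.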


Let's check that our plumbing works.