# Creating a dataset for protein binders using a 3D representation

This tutorial shows how to create a dataset for protein binders. We want to represent the 3D features of each molecule, but a given molecule has many different conformers. So, for each molecule we extract a set of conformers and keep track of their Boltzmann weights.
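As a rough illustration of what a Boltzmann weight is: for conformer energies $E_i$, the weight is $w_i = e^{-E_i / k_B T} / \sum_j e^{-E_j / k_B T}$. The sketch below computes this in plain Python. The function name `boltzmann_weights`, the energy unit (kcal/mol), and the default temperature are assumptions made for this example; the weights stored in the dataset come from the database and may be computed differently.

```python
import math

# Boltzmann constant in kcal/(mol*K); the energy unit is an assumption of this sketch.
K_B_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies, temperature=298.15):
    """Return normalized Boltzmann weights for conformer energies given in kcal/mol."""
    kt = K_B_KCAL_PER_MOL_K * temperature
    # Shift by the lowest energy so the exponentials stay well-behaved.
    e_min = min(energies)
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


# Hypothetical relative energies for three conformers of one species.
weights = boltzmann_weights([0.0, 0.5, 2.0])
```

The weights sum to one, and lower-energy conformers receive larger weights.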

## Set up Django

In [1]:
import os
import django

# Point Django at the project settings before calling django.setup()
os.environ["DJANGO_SETTINGS_MODULE"] = "djangochem.settings.orgel"

django.setup()

# Shell Plus Model Imports
from features.models import AtomDescriptor, BondDescriptor, ConnectivityMatrix, DistanceMatrix, Fingerprint, ProximityMatrix, SpeciesDescriptor, TrainingSet, Transformation
from guardian.models import GroupObjectPermission, UserObjectPermission
from django.contrib.contenttypes.models import ContentType
from neuralnet.models import ActiveLearningLoop, NetArchitecture, NetCommunity, NetFamily, NeuralNetwork, NnPotential, NnPotentialStats
from jobs.models import Job, JobConfig, WorkBatch
from django.contrib.admin.models import LogEntry
from django.contrib.auth.models import Group, Permission, User
from django.contrib.sessions.models import Session
from pgmols.models import (AtomBasis, BasisSet, Batch, Calc, Cluster,
                           Geom, Hessian, Jacobian, MDFrame, Mechanism, Method, Mol, MolGroupObjectPermission,
                           MolSet, MolUserObjectPermission, PathImage, ProductLink, ReactantLink, Reaction,
                           ReactionPath, ReactionType, SinglePoint, Species, Stoichiometry, Trajectory)
# Shell Plus Django Imports
from django.core.cache import cache
from django.db import transaction
from django.utils import timezone
from django.contrib.auth import get_user_model
from django.urls import reverse
from django.conf import settings
from django.db.models import Avg, Case, Count, F, Max, Min, Prefetch, Q, Sum, When, Exists, OuterRef, Subquery



In [2]:
import sys
sys.path.insert(0, "/home/saxelrod/htvs/djangochem")

## MMFF94

Get a dataset of potential COVID binders whose conformers are generated with MMFF94:

In [5]:
from neuralnet.utils.nff import create_bind_dataset

group_name = 'covid'
method_name = 'molecular_mechanics_mmff94'
method_descrip = 'MMFF conformer.'
molsets = ['run']
nbrlist_cutoff = 5.0
batch_size = 10
num_workers = 2
# maximum conformers per species
geoms_per_spec = 100

dataset, loader = create_bind_dataset(group_name=group_name,
                    method_name=method_name,
                    method_descrip=method_descrip,
                    geoms_per_spec=geoms_per_spec,
                    nbrlist_cutoff=nbrlist_cutoff,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    molsets=molsets)

dataset.save('covid_mmff94.pth.tar')



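## CREST

The same workflow, but with conformers generated by CREST at the GFN2-xTB level (much better than MMFF94!). Only the method name and description change; the other parameters are reused from the MMFF94 cell.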

In [5]:
from neuralnet.utils.nff import create_bind_dataset

method_name = 'gfn2-xtb'
method_descrip = 'Crest GFN2-xTB'

dataset, loader = create_bind_dataset(group_name=group_name,
                    method_name=method_name,
                    method_descrip=method_descrip,
                    geoms_per_spec=geoms_per_spec,
                    nbrlist_cutoff=nbrlist_cutoff,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    molsets=molsets)

dataset.save('covid_crest.pth.tar')



## The dataset

Let's take a look at the dataset itself:

In [7]:
print(dataset[0])

{'num_atoms': tensor(4300), 'bind': tensor(0), 'smiles': 'c1ccc(CC(c2ccccc2)N2CCCCC2)cc1', 'geom_id': tensor([85203096., 85203096., 85203096., 85203096., 85203096., 85203096.,
        85203096., 85203104., 85203104., 85203104., 85203104., 85203104.,
        85203104., 85203104., 85203104., 85203104., 85203112., 85203112.,
        85203112., 85203112., 85203112., 85203112., 85203112., 85203120.,
        85203120., 85203120., 85203120., 85203120., 85203120., 85203120.,
        85203120., 85203128., 85203128., 85203128., 85203128., 85203128.,
        85203128., 85203128., 85203136., 85203136., 85203136., 85203136.,
        85203136., 85203136., 85203136., 85203136., 85203136., 85203144.,
        85203144., 85203144., 85203144., 85203144., 85203144., 85203144.,
        85203152., 85203152., 85203152., 85203152., 85203152., 85203152.,
        85203152., 85203152., 85203152., 85203160., 85203160., 85203160.,
        85203160., 85203160., 85203160., 85203160., 85203168., 85203168.,
        85

Each item in the dataset corresponds to one species. It has the total number of atoms across all conformers (`num_atoms`, 4300 in this example), whether the species is a binder (`bind`), its SMILES string (`smiles`), the database IDs of its conformer geoms (`geom_id`), and the database ID of the species. It also has `mol_size`, the actual number of atoms in one molecule (43 here), which lets us separate the big nxyz (a stacked tensor consisting of all conformer nxyz's) into its conformers when needed. `weights` is a list of Boltzmann weights, one for each conformer. `nbr_list` gives the neighbors of each atom; it takes into account the fact that every 43 atoms you are actually in a different molecule, so atoms in different conformers are never neighbors.
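To make the role of `mol_size` concrete, here is a minimal sketch of splitting a stacked nxyz back into per-conformer blocks. It uses plain Python lists so that it runs without the database; `split_conformers` is a hypothetical helper written for this example, and the coordinates are made up. For a real PyTorch tensor, `torch.split(nxyz, mol_size)` performs the same chunking along the first dimension.

```python
def split_conformers(nxyz, mol_size):
    """Split a stacked nxyz (rows of [Z, x, y, z]) into one block per conformer."""
    if mol_size <= 0 or len(nxyz) % mol_size != 0:
        raise ValueError("len(nxyz) must be a positive multiple of mol_size")
    return [nxyz[i:i + mol_size] for i in range(0, len(nxyz), mol_size)]


# Toy example: two conformers of a three-atom molecule (made-up coordinates).
stacked = [
    [8, 0.00, 0.00, 0.00],
    [1, 0.96, 0.00, 0.00],
    [1, -0.24, 0.93, 0.00],
    [8, 0.00, 0.00, 0.00],
    [1, 0.96, 0.00, 0.00],
    [1, -0.30, 0.91, 0.10],
]
conformers = split_conformers(stacked, mol_size=3)
```

In the example above, the same call with `mol_size=43` would split the 4300 stacked rows into 100 conformers.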

Let's next look at batching:

In [8]:
print(next(iter(loader))[0])

NameError: name 'loader' is not defined

This looks exactly the same as a regular batch would if each SMILES string really had one giant xyz.
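One way to picture why the batch looks like a regular one: if each species is treated as a single large structure, batching only needs to concatenate the stacked coordinates and shift each species' neighbor indices by the number of atoms that came before it. The sketch below shows that idea with plain Python lists. `collate_species` is a hypothetical function written for this example, and the assumption that batching works by index offsetting is ours; it is not taken from the loader's actual collate code.

```python
def collate_species(items):
    """Concatenate per-species entries into one batch, shifting neighbor indices."""
    batch = {"nxyz": [], "nbr_list": [], "num_atoms": []}
    offset = 0  # number of atoms already placed in the batch
    for item in items:
        n_atoms = len(item["nxyz"])
        batch["nxyz"].extend(item["nxyz"])
        # Neighbor pairs are local to each species, so shift them into batch indices.
        batch["nbr_list"].extend([i + offset, j + offset] for i, j in item["nbr_list"])
        batch["num_atoms"].append(n_atoms)
        offset += n_atoms
    return batch


# Two toy "species" with made-up coordinates and neighbor pairs.
species_a = {
    "nxyz": [[1, 0.0, 0.0, 0.0], [1, 0.0, 0.0, 0.74]],
    "nbr_list": [[0, 1], [1, 0]],
}
species_b = {
    "nxyz": [[8, 0.0, 0.0, 0.0], [1, 0.96, 0.0, 0.0], [1, -0.24, 0.93, 0.0]],
    "nbr_list": [[0, 1], [0, 2], [1, 0], [2, 0]],
}
batch = collate_species([species_a, species_b])
```

After collation, the neighbor pairs of the second species refer to atoms 2 to 4 of the batch rather than atoms 0 to 2 of the species.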