# Creating a dataset for protein binders using a 3D representation

This tutorial shows how to create a dataset for protein binders. We want to represent the 3D features of each molecule, but a given molecule has many different conformers. So we extract a set of conformers for each molecule and keep track of their Boltzmann weights.

## Set up Django

In [1]:
import os
import django

import sys

# Make sure htvs/djangochem is in your path!
sys.path.insert(0, "/home/saxelrod/htvs")
sys.path.insert(0, "/home/saxelrod/htvs/djangochem")

os.environ["DJANGO_SETTINGS_MODULE"]="djangochem.settings.orgel"


django.setup()

# Shell Plus Model Imports
from features.models import AtomDescriptor, BondDescriptor, ConnectivityMatrix, DistanceMatrix, Fingerprint, ProximityMatrix, SpeciesDescriptor, TrainingSet, Transformation
from guardian.models import GroupObjectPermission, UserObjectPermission
from django.contrib.contenttypes.models import ContentType
from neuralnet.models import ActiveLearningLoop, NetArchitecture, NetCommunity, NetFamily, NeuralNetwork, NnPotential, NnPotentialStats
from jobs.models import Job, JobConfig, WorkBatch
from django.contrib.admin.models import LogEntry
from django.contrib.auth.models import Group, Permission, User
from django.contrib.sessions.models import Session
from pgmols.models import (AtomBasis, BasisSet, Batch, Calc, Cluster,
                           Geom, Hessian, Jacobian, MDFrame, Mechanism, Method, Mol, MolGroupObjectPermission,
                           MolSet, MolUserObjectPermission, PathImage, ProductLink, ReactantLink, Reaction,
                           ReactionPath, ReactionType, SinglePoint, Species, Stoichiometry, Trajectory)
# Shell Plus Django Imports
from django.core.cache import cache
from django.db import transaction
from django.utils import timezone
from django.contrib.auth import get_user_model
from django.urls import reverse
from django.conf import settings
from django.db.models import Avg, Case, Count, F, Max, Min, Prefetch, Q, Sum, When, Exists, OuterRef, Subquery



## MMFF94

Get a dataset of potential COVID binders whose conformers are generated with MMFF94.

In [2]:
from neuralnet.utils.nff import create_bind_dataset

group_name = 'covid'
method_name = 'molecular_mechanics_mmff94'
method_descrip = 'MMFF conformer.'
molsets = ['run']
nbrlist_cutoff = 5.0
batch_size = 10
num_workers = 2
# maximum conformers per species
geoms_per_spec = 10
# geoms_per_spec = 1


dataset, loader = create_bind_dataset(group_name=group_name,
                    method_name=method_name,
                    method_descrip=method_descrip,
                    geoms_per_spec=geoms_per_spec,
                    nbrlist_cutoff=nbrlist_cutoff,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    molsets=molsets)

dataset.save('covid_mmff94.pth.tar')
# dataset.save('covid_mmff94_1_geom.pth.tar')






In [8]:
from nff.data import Dataset
# dataset = Dataset.from_file('covid_mmff94_1_geom.pth.tar')
dataset = Dataset.from_file('covid_mmff94.pth.tar')
dataset.props['bind'].sum()

tensor(394)

## Crest
The same, but with Crest. Crest is a program that combines advanced sampling methods like metadynamics with geometry optimizations to obtain different metastable conformers. It also analyzes symmetry to give the degeneracy of each conformer, and combines this with the Boltzmann factor of the conformer to give its total population. Instead of a classical force field it uses xTB (semi-empirical tight-binding DFT), which is significantly more accurate than MMFF94.
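
To make the population idea concrete, here is a minimal sketch of how a degeneracy and a Boltzmann factor combine into a normalized population. It is not Crest's own code: the function name `boltzmann_populations`, the energies, the degeneracies, and the choice of kcal/mol units are all assumptions made for illustration.

```python
import math

# Boltzmann constant in kcal/(mol K); energies are assumed to be in kcal/mol.
K_B = 0.0019872041


def boltzmann_populations(energies, degeneracies, temperature=298.15):
    """Return populations p_i proportional to g_i * exp(-(E_i - E_min) / kT)."""
    kt = K_B * temperature
    e_min = min(energies)  # shift by the minimum energy for numerical stability
    unnormalized = [
        g * math.exp(-(e - e_min) / kt)
        for e, g in zip(energies, degeneracies)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical relative energies (kcal/mol) and degeneracies of three conformers
energies = [0.0, 0.5, 1.2]
degeneracies = [1, 2, 1]
populations = boltzmann_populations(energies, degeneracies)
print(populations)
```

The populations sum to one by construction, and a conformer with a higher degeneracy gets a proportionally larger share than its energy alone would give it.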

In [2]:
from neuralnet.utils.nff import create_bind_dataset
import pdb

group_name = 'covid'
method_name = 'gfn2-xtb'
method_descrip = 'Crest GFN2-xTB'
molsets = ['run']
nbrlist_cutoff = 5.0
batch_size = 10
num_workers = 2
# maximum conformers per species
geoms_per_spec = 10
# geoms_per_spec = 1


dataset, loader = create_bind_dataset(group_name=group_name,
                    method_name=method_name,
                    method_descrip=method_descrip,
                    geoms_per_spec=geoms_per_spec,
                    nbrlist_cutoff=nbrlist_cutoff,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    molsets=molsets)

dataset.save('covid_crest.pth.tar')


## The dataset

Let's take a look at the dataset itself.

Number of positive binders:

In [3]:
dataset.props['bind'].sum()

tensor(139)

Length of dataset:

In [5]:
print(len(dataset))

2592


First species in the dataset: 

In [5]:
print(dataset[0])

{'nxyz': tensor([[ 6.0000,  4.3809,  1.0873,  0.1777],
        [ 6.0000,  3.8217,  0.4284, -0.9035],
        [ 6.0000,  2.4735,  0.5657, -1.1796],
        ...,
        [ 1.0000, -1.0070, -2.4155,  1.4414],
        [ 1.0000, -3.3377, -2.3545,  2.2281],
        [ 1.0000, -5.0852, -1.3165,  0.8359]]), 'bind': tensor(0), 'weights': tensor([0.1814, 0.1683, 0.1215, 0.0636, 0.0582, 0.1142, 0.0527, 0.0517, 0.1492,
        0.0393]), 'spec_id': tensor(6631940), 'num_atoms': tensor(430), 'mol_size': tensor(43), 'smiles': 'c1ccc(CC(c2ccccc2)N2CCCCC2)cc1', 'nbr_list': tensor([[  0,   1],
        [  0,   2],
        [  0,   3],
        ...,
        [429, 394],
        [429, 427],
        [429, 428]])}


Each item in the dataset corresponds to one species. It has the total number of atoms summed over all conformers (`num_atoms`), whether it's a binder (`bind`), its SMILES string (`smiles`), the database IDs of its conformer geoms, and the database ID of the species (`spec_id`). It also has `mol_size`, the actual number of atoms in one molecule, which will allow us to separate the big `nxyz` (a stacked tensor consisting of all conformer nxyz's) into its conformers when needed. `weights` is a list of Boltzmann weights for each conformer. `nbr_list` tells you the neighbors of each atom, and it takes into account the fact that every 43 atoms you're actually in a different conformer of the molecule.
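
The neighbor-list point can be sketched as follows. This is an illustration, not the code used by `create_bind_dataset`: the function `stacked_nbr_list` is hypothetical, it uses plain Python lists instead of tensors, and it assumes that the neighbor list holds directed pairs `(i, j)` of atoms that are closer than the cutoff and belong to the same conformer.

```python
import math


def stacked_nbr_list(nxyz, mol_size, cutoff):
    """Directed pairs (i, j) within `cutoff`, restricted to one conformer each."""
    pairs = []
    for i, (_, xi, yi, zi) in enumerate(nxyz):
        # Atoms [start, stop) belong to the same conformer as atom i
        start = (i // mol_size) * mol_size
        stop = min(start + mol_size, len(nxyz))
        for j in range(start, stop):
            if i == j:
                continue
            _, xj, yj, zj = nxyz[j]
            if math.dist((xi, yi, zi), (xj, yj, zj)) < cutoff:
                pairs.append((i, j))
    return pairs


# Two hypothetical conformers of a two-atom molecule, stacked into one nxyz.
# Atoms 0 and 2 sit at the same point in space but are never paired.
nxyz = [
    [1.0, 0.0, 0.0, 0.00],
    [1.0, 0.0, 0.0, 0.74],
    [1.0, 0.0, 0.0, 0.00],
    [1.0, 0.0, 0.0, 0.80],
]
print(stacked_nbr_list(nxyz, mol_size=2, cutoff=5.0))
# [(0, 1), (1, 0), (2, 3), (3, 2)]
```

Without the conformer restriction, atoms from different conformers would be treated as neighbors simply because their coordinates overlap.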

Let's next look at batching:

In [6]:
print(next(iter(loader)))

{'nxyz': tensor([[ 6.0000,  4.3809,  1.0873,  0.1777],
        [ 6.0000,  3.8217,  0.4284, -0.9035],
        [ 6.0000,  2.4735,  0.5657, -1.1796],
        ...,
        [ 1.0000,  0.0063,  1.5642,  1.4408],
        [ 1.0000,  2.4189,  0.5911,  1.5163],
        [ 1.0000,  2.0578,  0.8583, -1.2571]]), 'bind': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'weights': tensor([1.8140e-01, 1.6827e-01, 1.2152e-01, 6.3598e-02, 5.8217e-02, 1.1422e-01,
        5.2662e-02, 5.1665e-02, 1.4919e-01, 3.9260e-02, 8.2947e-01, 1.6617e-01,
        1.4203e-03, 1.4003e-03, 6.4012e-04, 5.2009e-04, 5.0009e-05, 2.0004e-05,
        2.4004e-04, 7.0013e-05, 3.5503e-01, 2.4643e-01, 1.2268e-01, 8.3726e-02,
        5.5241e-02, 5.5166e-02, 2.6927e-02, 2.2071e-02, 1.7791e-02, 1.4941e-02,
        4.1169e-01, 5.3856e-01, 1.8871e-02, 1.9101e-03, 2.7221e-02, 2.1001e-04,
        2.4001e-04, 2.3001e-04, 1.0200e-03, 5.0002e-05, 1.7496e-01, 9.2255e-02,
        9.3527e-02, 4.0222e-02, 2.5229e-01, 6.6510e-02, 1.2409e-01, 2.7637e-02,
 

This looks exactly the same as a regular batch, as if each SMILES string really had one giant xyz. The xyz's of the individual conformers can be recovered by splitting the batch into species through `num_atoms`, and each species into conformers through `mol_size`.
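
The two-level split can be sketched like this. The helper `split_batch` is hypothetical and works on plain Python lists so that it runs without any dependencies; with real batches the same idea should carry over to tensors, for example via `torch.split`, assuming `num_atoms` and `mol_size` hold one entry per species.

```python
def split_batch(nxyz, num_atoms, mol_size):
    """Split stacked rows into species, then each species into conformers."""
    species = []
    offset = 0
    for n_total, n_mol in zip(num_atoms, mol_size):
        # All rows of this species (every conformer stacked together)
        block = nxyz[offset:offset + n_total]
        # Cut the block into conformers of n_mol atoms each
        conformers = [block[k:k + n_mol] for k in range(0, n_total, n_mol)]
        species.append(conformers)
        offset += n_total
    return species


# Hypothetical batch: species A has 2 conformers of 2 atoms,
# species B has 2 conformers of 3 atoms, so 10 rows in total.
nxyz = [[1.0, float(i), 0.0, 0.0] for i in range(10)]
species = split_batch(nxyz, num_atoms=[4, 6], mol_size=[2, 3])
print([[len(conf) for conf in spec] for spec in species])
# [[2, 2], [3, 3]]
```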

## Geometries and Boltzmann factors

Now let's check that we're really getting the right geometries and Boltzmann weights.
- Here's the geometry of the conformer with the highest Boltzmann weight for the first species:

In [16]:
mol_size_0 = dataset.props['mol_size'][0]
print(dataset.props['nxyz'][0][:mol_size_0])

tensor([[ 1.7000e+01,  2.8329e+00,  1.2598e+00, -2.0776e-01],
        [ 6.0000e+00,  2.5839e+00, -1.5335e-02,  8.8423e-01],
        [ 6.0000e+00,  2.7282e+00, -1.3414e+00,  5.2334e-01],
        [ 6.0000e+00,  3.0365e+00, -1.8108e+00, -7.1847e-01],
        [ 7.0000e+00, -2.1606e+00,  1.2852e+00,  3.2123e-01],
        [ 6.0000e+00, -2.8204e+00,  1.8081e-01,  1.0101e+00],
        [ 7.0000e+00, -2.2042e+00, -1.1064e+00,  7.0226e-01],
        [ 6.0000e+00, -2.2666e+00, -1.3062e+00, -7.4260e-01],
        [ 7.0000e+00, -1.5938e+00, -2.3789e-01, -1.4735e+00],
        [ 6.0000e+00, -2.2245e+00,  1.0284e+00, -1.1142e+00],
        [ 1.0000e+00, -1.7155e+00,  1.8397e+00, -1.6419e+00],
        [ 1.0000e+00, -3.2734e+00,  1.0017e+00, -1.4222e+00],
        [ 6.0000e+00, -2.0413e-01, -1.9140e-01, -1.0350e+00],
        [ 7.0000e+00, -9.5094e-02,  3.4757e-02,  4.0284e-01],
        [ 6.0000e+00, -7.5750e-01,  1.2957e+00,  7.1914e-01],
        [ 1.0000e+00, -6.9661e-01,  1.4687e+00,  1.7977e+00],
        

In [17]:
print("Boltzmann weights: {}".format(dataset.props['weights'][0]))
weight_0 = dataset.props['weights'][0][0]
weight_1 = dataset.props['weights'][0][1]

rel_weight = weight_1 / weight_0

print("Weight of second most relevant conformer relative to first: {}".format(rel_weight))



Boltzmann weights: tensor([0.3742, 0.1315, 0.1294, 0.1242, 0.1064, 0.0414, 0.0352, 0.0327, 0.0251])
Weight of second most relevant conformer relative to first: 0.3513377010822296


Check that the xyz's agree:

In [18]:
spec = Species.objects.get(id=dataset.props['spec_id'][0])
geoms = spec.geom_set.filter(calcs__method__name=method_name,
                           calcs__method__description=method_descrip,
                           calcs__props__boltzmannweight__isnull=False
                           ).order_by("-calcs__props__boltzmannweight").all()
geom_0 = geoms[0]
geom_1 = geoms[1]

calc_0 = geom_0.calcs.filter(method__name=method_name,
                        method__description=method_descrip).first()
calc_1 = geom_1.calcs.filter(method__name=method_name,
                        method__description=method_descrip).first()

geom_0.xyz


[[17.0, 2.8329098101, 1.25977548, -0.207756332],
 [6.0, 2.5839438483, -0.0153345928, 0.8842313446],
 [6.0, 2.7282107936, -1.3413828894, 0.5233412086],
 [6.0, 3.0365184967, -1.8108289526, -0.7184703834],
 [7.0, -2.1605736971, 1.2852225087, 0.3212320791],
 [6.0, -2.8204120798, 0.1808104026, 1.0100835068],
 [7.0, -2.2041538319, -1.1063585857, 0.7022629139],
 [6.0, -2.2665868945, -1.3061639571, -0.7426005197],
 [7.0, -1.5938346841, -0.2378916911, -1.4734985127],
 [6.0, -2.2244888716, 1.0283553587, -1.1142249338],
 [1.0, -1.7155102191, 1.8396771585, -1.6419159547],
 [1.0, -3.2734228639, 1.0016867065, -1.4222229556],
 [6.0, -0.2041306662, -0.1913957616, -1.0350406235],
 [7.0, -0.0950937656, 0.0347572271, 0.4028420165],
 [6.0, -0.7574968951, 1.2957213008, 0.7191432874],
 [1.0, -0.6966142045, 1.4686992449, 1.7977002319],
 [1.0, -0.244463933, 2.1068296296, 0.195005402],
 [6.0, -0.7996232901, -1.041322662, 1.0900393652],
 [1.0, -0.3174208404, -1.9915033605, 0.8437347219],
 [1.0, -0.7374003673, -
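
Comparing the two printouts by eye works for a few rows, but a numeric check is easy to sketch. The snippet below copies the first two rows from the outputs shown in this section and compares them with a tolerance. The helper `rows_agree` and the tolerance of `1e-4` are assumptions, chosen because the dataset values are printed to only four or five significant digits.

```python
import math

# First two rows as printed from the dataset (rounded for display)
dataset_rows = [
    [17.0, 2.8329, 1.2598, -0.20776],
    [6.0, 2.5839, -0.015335, 0.88423],
]
# The same two rows as stored in the database (from geom_0.xyz)
database_rows = [
    [17.0, 2.8329098101, 1.25977548, -0.207756332],
    [6.0, 2.5839438483, -0.0153345928, 0.8842313446],
]


def rows_agree(a, b, abs_tol=1e-4):
    """True if both nxyz lists have the same shape and match within abs_tol."""
    if len(a) != len(b):
        return False
    return all(
        len(row_a) == len(row_b)
        and all(math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(a, b)
    )


print(rows_agree(dataset_rows, database_rows))
# True
```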

Check that the relative weights agree. The absolute ones do not, since we limited each species to a maximum of 10 conformers and the weights in the dataset sum to one over the conformers that were kept:

In [19]:

weight_0 = calc_0.props['boltzmannweight']
weight_1 = calc_1.props['boltzmannweight']

print(weight_0)
print(weight_1)
print("\n")

rel_weight = weight_1 / weight_0
print(rel_weight)



0.27697
0.09731


0.35133769000252735
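
Why the ratios survive while the absolute values change can be shown with a small sketch. It assumes that the dataset keeps the highest-weight conformers and rescales their weights to sum to one. The function `truncate_and_renormalize` and the weights below are hypothetical and are not taken from the database.

```python
import math


def truncate_and_renormalize(weights, max_conformers):
    """Keep the `max_conformers` largest weights and rescale them to sum to 1."""
    kept = sorted(weights, reverse=True)[:max_conformers]
    total = sum(kept)
    return [w / total for w in kept]


# Hypothetical weights over the full conformer ensemble (they sum to 1)
full = [0.4, 0.2, 0.15, 0.1, 0.08, 0.07]
kept = truncate_and_renormalize(full, max_conformers=3)

print(kept)
# The absolute weights grow after truncation, but their ratio is unchanged
print(math.isclose(kept[1] / kept[0], full[1] / full[0]))
# True
```

Every kept weight is divided by the same total, so the factor cancels in any ratio of two weights.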
