# Accelerated Chemical Mapping with [Graphistry](graphistry.com)

This notebook visualizes a chemical dataset describing Blood Brain Barrier Permeability (BBBP) from [MoleculeNet](http://moleculenet.org/datasets-1) and [ECFPs](https://pubs.acs.org/doi/10.1021/ci100050t).

The molecules are encoded as SMILES strings, a linear text notation for molecular structure, so we can take advantage of string-based computational algorithms. These string representations look like the following:


*   `OCC#Cc1cc(Cl)c(C(=O)Nc2ccnc(NC(=O)C3CC3)c2)c(Cl)c1`
*   `CCNc1ncnc2c1nc(NC3CCCC3)n2[C@@H]4O[C@H](CO)[C@@H](O)[C@H]4O`
*   `CCC(C1=C(O)C2=C(CCCCCC2)OC1=O)c3cccc(NS(=O)(=O)c4ccc(Cl)cc4)c3`
*   `CC1CCN(CC1)c2nc(ccc2CNC(=O)Nc3ccc(CNS(=O)(=O)C)c(F)c3)C(F)(F)F`
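
To make "string-based" concrete, here is a minimal sketch that treats a SMILES string as plain text and breaks it into overlapping character n-grams. The helper `char_ngrams` is hypothetical and written only for illustration; it is not part of Graphistry or RDKit, and it is not necessarily the featurization used later in this notebook.

```python
from typing import Set


def char_ngrams(smiles: str, n: int = 3) -> Set[str]:
    """Return the set of overlapping character n-grams in a SMILES string.

    Hypothetical helper for illustration only.
    """
    if len(smiles) < n:
        # Strings shorter than n yield themselves as a single token (or nothing if empty)
        return {smiles} if smiles else set()
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


# One of the SMILES strings from the sampled dataset shown in this notebook
example = "Cc1nc(cs1)C#Cc2cc(Cl)cc(c2)C#N"
grams = char_ngrams(example)
print(len(grams), "distinct 3-grams")
```

Sets of tokens like these can be compared with set-based similarity measures, which is one reason a linear text encoding is convenient.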


Expressing the structure in linear form helps us immensely: we can parse these complex molecules and reduce them to 2 dimensions using conventional statistical tools, namely UMAP. Ultimately, we demonstrate how such an open-source analysis can be sped up and scaled up massively with the [graphistry](graphistry.com) environment and toolkit:


* Speedup: From minutes to seconds, roughly 3 minutes (174.6 s) on CPU down to about 10 seconds (9.3 s) on a small T4 GPU
* Visual insight: Add interactivity, similarity edges, and visual scale to a traditional static scatterplot to better investigate pairwise correlations and overall clusters

# Import accelerator libraries

In [None]:
!pip install -q --extra-index-url=https://pypi.nvidia.com cuml-cu12
import cuml,cudf
print(cuml.__version__)

!pip -q install graphistry[ai]
# !pip install -U -q --force git+https://github.com/graphistry/pygraphistry.git#@dev/depman_gpufeat
# !pip install cu_cat

In [4]:

import graphistry
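# g_user and g_pass are assumed to be defined beforehand (your Graphistry Hub username and password)
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username=g_user, password=g_pass) ## key id, secret key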

print(graphistry.__version__)

# import cu_cat
# print(cu_cat.__file__)

import os
from collections import Counter
import cProfile
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pstats import Stats
import cuml,cudf
from time import time
import warnings
warnings.filterwarnings('ignore')
from typing import List
import seaborn as sns
pd.set_option('display.max_colwidth', 200)

0.33.9


In [5]:
!nvidia-smi --query-gpu=gpu_name --format=csv,noheader

Tesla T4


# Import Basics

In [None]:
!pip install -q rdkit
!pip install --pre -q deepchem

from rdkit import Chem, DataStructs
from rdkit.Chem.rdchem import Mol
from rdkit.Chem.MolStandardize.rdMolStandardize import LargestFragmentChooser

from rdkit import RDLogger
lg = RDLogger.logger()
lg.setLevel(RDLogger.CRITICAL)

# Embed BBBP in Global Chemical Space Approximation (Dataset-Agnostic Embedding)

### Read in and process ChEMBL data

In [6]:
# Read in data from MoleculeNet
chembl = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/chembl_sparse.csv.gz", compression='gzip')

# Sample a random 20k
chembl = chembl.sample(n=20000)

In [None]:
chemblA = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/chembl_sparse.csv.gz", compression='gzip')

chem_data = chembl["smiles"][chembl.smiles.str.len() < 500]  # let's simplify and just look at "short molecules" for this exercise


In [None]:
chem_data = chem_data.dropna()  # dropna() returns a new Series, so keep the result
chem_data

201332                OCC#Cc1cc(Cl)c(C(=O)Nc2ccnc(NC(=O)C3CC3)c2)c(Cl)c1
151276       CCNc1ncnc2c1nc(NC3CCCC3)n2[C@@H]4O[C@H](CO)[C@@H](O)[C@H]4O
172750    CCC(C1=C(O)C2=C(CCCCCC2)OC1=O)c3cccc(NS(=O)(=O)c4ccc(Cl)cc4)c3
155015    CC1CCN(CC1)c2nc(ccc2CNC(=O)Nc3ccc(CNS(=O)(=O)C)c(F)c3)C(F)(F)F
231881                                    Cc1nc(cs1)C#Cc2cc(Cl)cc(c2)C#N
                                       ...                              
197652         CN(C)C(=O)c1cc2cc(Nc3nccc(n3)c4cn(cn4)C5CC5)cc(Cl)c2[nH]1
63558                    COc1cc(OC)cc(\C=C\2/CCC\C(=C/c3ccccc3F)\C2=O)c1
23052                            CCN1CCC(=C(C1)C(=O)OCCc2ccccn2)c3ccccc3
154256                          CN[C@@H]1CCN(C1)c2nc(N)nc3c2CCCc4ccccc34
72859                COc1ccc(cc1)C2(N=C(N)c3nc(C)sc23)c4cccc(c4)c5cncnc5
Name: smiles, Length: 19959, dtype: object

## with CPU

In [None]:
g2 = graphistry.nodes(chem_data)

t=time()
g4=g2.umap(engine='umap_learn',metric = "jaccard",
                      n_neighbors = 25,
                      n_components = 2,
                      dbscan=True,
                      min_dist = 0.001)
j = time() - t  # elapsed wall-clock seconds for the CPU run
print('\n Total ', np.round(j, 1), 'seconds passed')





 Total  174.6 seconds passed
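
The `umap` call above passes `metric = "jaccard"`. As a rough illustration of what that metric measures, the sketch below computes the Jaccard distance between two sets of character bigrams. The helpers `bigrams` and `jaccard_distance` are hypothetical names introduced here, and this is an assumption-laden toy: it does not claim to reproduce how Graphistry featurizes the `smiles` column before applying the metric.

```python
from typing import Set


def bigrams(s: str) -> Set[str]:
    """Set of overlapping character bigrams (hypothetical helper)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


ethanol = bigrams("CCO")     # {"CC", "CO"}
ethylamine = bigrams("CCN")  # {"CC", "CN"}

# One shared bigram out of three distinct ones, so the distance is 1 - 1/3
print(jaccard_distance(ethanol, ethylamine))
```

A distance of 0 means the two token sets are identical, and 1 means they share nothing.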


## and GPU

In [None]:
g2 = graphistry.nodes(chem_data)


t=time()
g4=g2.umap(engine='cuml',metric = "jaccard",
                      n_neighbors = 25,
                      n_components = 2,
                      dbscan=True,
                      min_dist = 0.001)
j = time() - t  # elapsed wall-clock seconds for the GPU run
print('\n Total ', np.round(j, 1), 'seconds passed')





 Total  9.3 seconds passed
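
Taking the two timings printed in this notebook at face value (a single run each, so the figure is only indicative), the speedup can be worked out as follows. The variable names are introduced here for illustration.

```python
# Wall-clock totals reported by the CPU and GPU cells in this notebook
cpu_seconds = 174.6  # engine='umap_learn'
gpu_seconds = 9.3    # engine='cuml' on a Tesla T4

speedup = cpu_seconds / gpu_seconds
print(f"GPU run is about {speedup:.1f}x faster")  # about 18.8x
```

Repeating each run several times and comparing medians would give a more robust estimate than a single measurement.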


In [None]:
g4.plot()

# Embed BBBP with UMAP

### Read in and process small data

In [8]:
# Read in data from MoleculeNet
bbbp = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/BBBP.csv")

# Clean up column names so they are easier to interpret
bbbp = bbbp[["smiles", "p_np", "name"]].reset_index(drop=True).rename({"p_np": "permeable"}, axis=1)

In [9]:
BBBP=bbbp[~bbbp.name.duplicated(keep='first')]
BBBP[['name','permeable']][BBBP.smiles.str.len()>3]#.reset_index(drop=True)

| | name | permeable |
|---|---|---|
| 0 | Propanolol | 1 |
| 1 | Terbutylchlorambucil | 1 |
| 2 | 40730 | 1 |
| 3 | 24 | 1 |
| 4 | cloxacillin | 1 |
| ... | ... | ... |
| 2045 | licostinel | 1 |
| 2046 | ademetionine(adenosyl-methionine) | 1 |
| 2047 | mesocarb | 1 |
| 2048 | tofisoline | 1 |
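
The mask `~bbbp.name.duplicated(keep='first')` used above keeps only the first row seen for each `name`. The plain-Python sketch below mirrors that behavior on a few made-up rows; `keep_first_by_name` is a hypothetical helper and the toy records are not taken from the BBBP dataset.

```python
from typing import Dict, Iterable, List


def keep_first_by_name(rows: Iterable[Dict], key: str = "name") -> List[Dict]:
    """Keep the first row for each distinct value of `key`, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] in seen:
            continue  # a later duplicate: drop it
        seen.add(row[key])
        kept.append(row)
    return kept


# Made-up toy rows for illustration only
rows = [
    {"name": "mol_a", "permeable": 1},
    {"name": "mol_b", "permeable": 0},
    {"name": "mol_a", "permeable": 0},  # duplicate name, dropped
]
deduped = keep_first_by_name(rows)
print([r["name"] for r in deduped])
```

Note that when duplicate names carry different labels, as in the toy rows, only the label of the first occurrence survives.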


### ... and with graphistry

In [10]:
BBBP=bbbp[~bbbp.name.duplicated(keep='first')]

g = graphistry.nodes(cudf.from_pandas(BBBP[['smiles','permeable']][BBBP.smiles.str.len()>3]))
t=time()
# g2=g.featurize(feature_engine='cu_cat',memoize=True)
g3=g.umap(engine='cuml',metric = "jaccard",
                      n_neighbors = 25,
                      n_components = 2,
                      low_memory = False,
                      min_dist = 0.001)
print('\n Total ', np.round(time() - t,1), 'seconds passed')




 Total  43.0 seconds passed


In [None]:
g3.encode_point_color('permeable',palette=["hotpink", "dodgerblue"],as_continuous=True).plot()
