# Chemical structure prediction

In this notebook, the packages ChemProp (https://github.com/chemprop/chemprop), Chemistry Drawer (https://github.com/dylanwal/chemistry_drawer, imported as `chemdraw`), and RDKit were used to train and test models, make and interpret predictions of pharmacokinetic parameters, and draw drug structures for the report titled "Chemical structure and machine learning: leveraging the power of deep neural networks to aid the development of long-acting drugs for HIV and HIV coinfections". For this project, ChemProp was run from the command line to train binary classifiers for volume of distribution, clearance, and half-life on a drug dataset. The trained models were then applied to an HIV/TB/HepB dataset to identify the molecules most suitable for long-acting formulation.
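As a rough illustration of what a binary classification target for a pharmacokinetic parameter can look like, the sketch below converts a continuous column into 0/1 labels with pandas. The helper name `binarize_pk`, the cutoff of 1.0, and the example rows are all hypothetical placeholders, not the thresholds or data used in the report.

```python
import pandas as pd


def binarize_pk(df, value_col, threshold, label_col="label"):
    """Return a copy of df with a 0/1 label column (1 if value >= threshold).

    Rows with a missing value in value_col are dropped first so that they are
    not silently labeled 0.
    """
    out = df.dropna(subset=[value_col]).copy()
    out[label_col] = (out[value_col] >= threshold).astype(int)
    return out


# Placeholder rows for illustration only; these are not values from the report.
example = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "volume-of-distribution (L/KG)": [0.5, 2.0, 10.0],
})

# The threshold of 1.0 is an arbitrary assumption chosen for the example.
labeled = binarize_pk(example, "volume-of-distribution (L/KG)", threshold=1.0)
print(labeled)
```

A table with one SMILES column and one 0/1 column per task is the general layout that classification training in ChemProp works from; the exact arguments used for this project are recorded in conda_hx.txt.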

For the command-line arguments used, see the conda_hx.txt file in this repo. The report will likely be deposited on bioRxiv, and this page will be updated when it becomes available.

All figures were put together in Adobe Illustrator.

# Visualization of Datasets - Figures 1-2

In [None]:
#import necessary packages (for all analyses in this script)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import chemdraw
from rdkit import Chem #the module name is case-sensitive: Chem, not chem

In [None]:
#create a histogram of the pharmacokinetic parameter of interest (in this case volume of distribution)
Vd = pd.read_csv("C:/classification_chemprop/Vd/Vd_class_master.csv")
#log_scale=True puts the x-axis on a log scale; palette is omitted because it has no effect without a hue variable
sns.histplot(data=Vd, x='volume-of-distribution (L/KG)', bins=20, log_scale=True)

In [None]:
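#create the dumbbell plot in the report
#`improved` must be loaded beforehand: a DataFrame with the columns 'PK_param', 'alone', and 'combined'
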
my_range=range(1, len(improved.index)+1)

plt.hlines(y=my_range, xmin=improved['combined'], xmax=improved['alone'], color='grey', alpha=0.4)
plt.scatter(improved['alone'], my_range, color='skyblue', alpha=1, label='alone')
plt.scatter(improved['combined'], my_range, color='green', alpha=0.4, label= 'combined')
plt.legend()

plt.yticks(my_range, improved['PK_param'])
plt.title("Comparison of the value 1 and the value 2", loc='left')
plt.xlabel('Value of the variables')
plt.ylabel('Group')


plt.show()

# Visualization of Drug Molecules - Figure 4

In [None]:
#use the chemdraw package (Chemistry Drawer) to create the molecule visualizations in the report from SMILES / SMARTS

mol = "CN1C(=NNC1=O)CN2C=CC(=C(C2=O)OC3=CC(=CC(=C3)C#N)Cl)C(F)(F)F" #or any molecule of interest

molecule = chemdraw.Molecule(mol)

drawer = chemdraw.Drawer(molecule, title=mol)
fig = drawer.draw()
fig.show()

# Use of the RDKit Chem Package to Identify Substructures in Drugs

In [None]:
#import the compounds_smarts dataset, which lists substructures and their SMARTS patterns
compounds_smarts = pd.read_csv("C:/SMARTS_functional_groups.csv")

#convert the SMARTS column to a list
SMARTS_list = compounds_smarts['SMARTS'].to_list()

#parse a substructure from the interpret csv, i.e. one that ChemProp found relevant to its classification decision
mol = Chem.MolFromSmiles('Nc1nc(N[CH3:1])[cH:1][cH:1]n1') #or whatever substructure was returned from ChemProp interpret

#compile each SMARTS string into an RDKit query molecule (None if the SMARTS is invalid)
functional_group = [Chem.MolFromSmarts(x) for x in SMARTS_list]

#print the number of matches of each functional group found in the molecule of interest
for smarts, y in zip(SMARTS_list, functional_group):
    if y is None:
        print(smarts, "invalid SMARTS")
        continue
    match = mol.GetSubstructMatches(y)
    print(smarts, len(match))
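
The counting step can also be wrapped in a small function that returns the counts keyed by substructure name, which is easier to tabulate than printed numbers. This is a minimal sketch: the function name `count_substructures`, the pattern names, and the three example SMARTS strings are illustrative assumptions and are not taken from SMARTS_functional_groups.csv.

```python
from rdkit import Chem


def count_substructures(smiles, smarts_by_name):
    """Count matches of each named SMARTS pattern in one molecule.

    Returns a dict mapping pattern name to the number of matches, or to None
    when the SMARTS string could not be parsed.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    counts = {}
    for name, smarts in smarts_by_name.items():
        pattern = Chem.MolFromSmarts(smarts)
        if pattern is None:
            counts[name] = None  # invalid SMARTS; keep the key so the gap is visible
            continue
        counts[name] = len(mol.GetSubstructMatches(pattern))
    return counts


# Hypothetical patterns for illustration only.
patterns = {
    "hydroxyl": "[OX2H]",
    "primary amine": "[NX3;H2]",
    "nitrile": "[CX2]#[NX1]",
}

# Ethanolamine (OCCN) has one hydroxyl, one primary amine, and no nitrile.
counts = count_substructures("OCCN", patterns)
print(counts)
```

Calling the function once per substructure returned by ChemProp interpret and collecting the dictionaries with `pd.DataFrame` would give one row per substructure and one column per functional group.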