# PSPG 245B - Molecular Comparison Notebook

This notebook explores why the first script made a given prediction. It takes a file of drugs, a target ID (e.g., “CHEMBL...”), and the two ChEMBL reference files mentioned before. Using RDKit, it calculates the molecular similarity between each drug you supplied and every known ligand of the target you specified, then shows the most similar ligands.
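
To make "molecular similarity" concrete, here is a minimal pure-Python sketch of the Tanimoto coefficient, a common way to compare two fingerprints. It is an illustration only: the bit sets below are made-up toy values, not real RDKit fingerprints, and the `tanimoto` helper is a hypothetical name that does not appear elsewhere in this notebook.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' fingerprint bits.

    Defined as |A intersect B| / |A union B|; returns 0.0 if both are empty.
    """
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy fingerprints (assumed values, for illustration only).
drug = {1, 4, 7, 9}
ligand_x = {1, 4, 7, 8}   # shares three bits with the drug
ligand_y = {2, 3}         # shares no bits with the drug

print(tanimoto(drug, ligand_x))  # 3 shared bits / 5 distinct bits = 0.6
print(tanimoto(drug, ligand_y))  # 0.0
```

A score of 1.0 means the two fingerprints have identical bits set, and 0.0 means they share none.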

There is less explanation and more fill-in-the-blank in this notebook, as many of the patterns should match the notebooks you've seen before. Each function's docstring describes what the function does, which may help you fill in any missing parts. And as always, if you have questions, please ask.

### Import modules

In [9]:
# Jupyter Display
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container {width:85% !important;} </style>"))

# Shell I/O tools
import os,sys
import gzip
import csv

# Custom functions
from utils import map_target_identifiers, flatten_list

# Data handling modules
import numpy as np
import pandas as pd

# Chemical Handling Modules
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.SimilarityPickers import TopNOverallPicker
from rdkit.Chem import Draw

# Visualization Modules
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns

## Directions
Fix the functions so that our handler function at the end of the notebook will work.

**Note_1:** Not all of the functions need to be changed. Those that already work have a comment above them saying that the function does not need to be altered.

**Note_2:** A Jupyter notebook keeps every variable it creates in memory unless the variable is explicitly deleted. So if you define a variable and later rename it in the same cell, the original variable will STILL exist. This can cause problems if you forget to update every use of the old name later in your script. The easiest way to avoid this is to restart the kernel, which clears the memory. You will then have to rerun every cell.

**Note_3:** Ask questions! We are here to help :) 
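
The stale-variable problem from Note_2 can be sketched in a few lines. This is only an illustration: the variable names and compound IDs are arbitrary examples, and a single script stands in for a cell that was edited and rerun in a live kernel.

```python
# First version of a cell: a variable is created.
ligands = ["CHEMBL25", "CHEMBL112"]

# The cell is later edited and the variable renamed. In a live kernel the old
# name is still defined, which we mimic here by simply defining both names.
target_ligands = ["CHEMBL25"]

# Code that still uses the old name runs without error, on stale data.
stale_count = len(ligands)
print(stale_count)  # 2, not the 1 you might expect after the rename

# Deleting the old name turns the silent bug into a visible NameError.
del ligands
try:
    len(ligands)
    old_name_alive = True
except NameError:
    old_name_alive = False

print(old_name_alive)  # False
```

Restarting the kernel has the same effect as deleting every name at once.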

### Default Values

In [11]:
# Default Directories
BASE_DIR = os.getcwd()

# Default Files
EX_TARGET = 'CHEMBL3018'
QUERY_CPDS_F = os.path.join(BASE_DIR, 'data', 'candidate_compounds.sample.csv')
CHEMBL_MOLS_F  = os.path.join(BASE_DIR, 'data', 'chembl_21_binding_molecules.csv.gz')
CHEMBL_TARGS_F = os.path.join(BASE_DIR, 'data', 'chembl_21_binding_targets.csv.gz')

### Primary Functions

In [12]:
####  THIS FUNCTION DOES NOT NEED TO BE ALTERED  ####
def viz_most_similar_ligands_to_candidate(topSim, candidate_id, candidate_mol, ligand_to_mol, target, mols_per_row=6):
    """Generate .png file of 2D representation of candidate compound and N-most similar target ligands."""
    print("Generating .png of top {} most similar ligands of target {}, "
          "to query compound {}.".format(mols_per_row-1, target, candidate_id))
    ofn = candidate_id.lower() + '_{}_top{}_simCpds.png'.format(target.lower(), mols_per_row-1)
    sim_mols = [ligand_to_mol[cpid] for cpid in topSim]
    sim_mols.insert(0, candidate_mol)
    img=Draw.MolsToGridImage(sim_mols, molsPerRow=mols_per_row, subImgSize=(200,200), legends=[x.GetProp("_cpid") for x in sim_mols], returnPNG=False)
    img.save(ofn)
    return

In [13]:
####  THIS FUNCTION DOES NOT NEED TO BE ALTERED  ####
def get_nMost_similar_ligands_to_candiates(target, ligand_to_mol, candidate_to_mol, nSim=5):
    """Return target's N-most similar ligands for each candidate compound of interest."""
    ligand_fps = []
    
    # Generate fingerprints from ligand mol-objects, assign id to fp, and group
    for mol in ligand_to_mol.values():
        fp = Chem.RDKFingerprint(mol)
        fp._id = mol.GetProp('_cpid')
        ligand_fps.append(fp)
    
    # Iterate through candidate compounds, use "picker" to identify N-most similar ligands
    for candidate, mol in candidate_to_mol.items():
        candidate_fp = Chem.RDKFingerprint(mol)
        candidate_fp._id = mol.GetProp('_cpid')
        picker = TopNOverallPicker(numToPick=nSim, probeFps=[candidate_fp], dataSet=ligand_fps)
        topSim = [fp._id for fp,score in picker]
        # Draw inside the loop so that every candidate gets its own image, not only the last one
        viz_most_similar_ligands_to_candidate(topSim, candidate, mol, ligand_to_mol, target, mols_per_row=nSim+1)
    return
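
The "pick the N most similar" step can be sketched without RDKit. The version below is a stand-in, not the `TopNOverallPicker` implementation: it assumes a Tanimoto-style score on toy bit sets, and the names `tanimoto`, `top_n_similar`, and the `lig_*` IDs are hypothetical.

```python
import heapq


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of 'on' bits (0.0 if both are empty)."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def top_n_similar(probe_bits, library, n=5):
    """Return the n (ligand_id, score) pairs most similar to the probe, best first."""
    scored = ((lig_id, tanimoto(probe_bits, bits)) for lig_id, bits in library.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy ligand library (assumed values, for illustration only).
library = {
    "lig_a": {1, 2, 3, 4},      # identical to the probe: score 1.0
    "lig_b": {1, 2, 3},         # score 3/4 = 0.75
    "lig_c": {7, 8},            # no overlap: score 0.0
    "lig_d": {1, 2, 3, 4, 5},   # score 4/5 = 0.8
}
probe = {1, 2, 3, 4}

top = top_n_similar(probe, library, n=2)
print([lig_id for lig_id, score in top])  # ['lig_a', 'lig_d']
```

Like the function above, the sketch scores every ligand against one probe and keeps only the IDs of the best matches.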

In [None]:
## FILL IN THE BLANKS ##
def gen_mol_from_smile(cpid, smile): # Much of this should look familiar.
    """Generate rdkit mol object from smile, and set cpid as mol property."""
    ? = Chem.MolFromSmiles(?)
    if mol is None:
        return None
    mol.SetProp('_cpid', ?)
    return mol

In [14]:
## FILL IN THE BLANKS ##
def map_candidate_compounds_to_mol(candidate_cpds_f):
    """Build a dictionary mapping candidate compound-IDs to rdkit mol-objects from the provided csv file."""
    candidate_to_mol = {}
    print('Mapping candidate compound-IDs to rdkit mol-objects')
    fi = open(candidate_cpds_f, 'rt' )
    reader = csv.reader(fi)
    next(reader)
    for cpid, smile in reader:
        mol = gen_mol_from_smile(?, ?) # Hint: What did you just implement? What were the arguments?
        if mol is None:
            continue
        candidate_to_mol[?] = mol
    print('\tMapped {} candidate compound-IDs to mol-objects\n'.format(len(candidate_to_mol)))
    return candidate_to_mol

In [16]:
## FILL IN THE BLANKS ##
def map_target_ligands_to_mol(targ_ligands, chembl_mol_f):
    """Similar to the previous function: creates a dictionary of ligand-IDs mapped to their
    corresponding rdkit mol-objects, with the caveat that all ligands not in targ_ligands are ignored."""
    ligand_to_mol = {}
    targ_ligands = set(targ_ligands)
    print('Mapping ligand-IDs to rdkit mol-objects')
    fi = gzip.open(chembl_mol_f, 'rt')
    reader = csv.reader(fi)
    next(reader)
    for ?, ?, ? in reader: # Hint: look at the structure of chembl_21_binding_molecules.csv.gz
        if cpid in targ_ligands:
            mol = gen_mol_from_smile(?, ?)
            if mol is None:
                continue
            ligand_to_mol[?] = mol
    print('\tMapped {} ligands to mol-objects'.format(len(ligand_to_mol)))
    return ligand_to_mol
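
One way to follow the hint about file structure is to peek at the header and first rows of a gzipped CSV before writing the loop. The sketch below is hedged: `peek_csv_gz` is a hypothetical helper, and the in-memory file uses placeholder column names (`col_a`, `col_b`, `col_c`) rather than the real ChEMBL columns.

```python
import csv
import gzip
import io


def peek_csv_gz(path_or_fileobj, n_rows=2):
    """Return (header, first n_rows rows) of a gzipped CSV without reading the whole file."""
    with gzip.open(path_or_fileobj, "rt", newline="") as fi:
        reader = csv.reader(fi)
        header = next(reader)
        # zip stops after n_rows, so the rest of the file is never read
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Build a tiny gzipped CSV in memory (placeholder content, for illustration only).
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as fo:
    fo.write("col_a,col_b,col_c\nr1a,r1b,r1c\nr2a,r2b,r2c\n")
buf.seek(0)

header, rows = peek_csv_gz(buf, n_rows=1)
print(header)  # ['col_a', 'col_b', 'col_c']
print(rows)    # [['r1a', 'r1b', 'r1c']]
```

In this notebook the same helper could be pointed at a path such as `CHEMBL_MOLS_F` to see how many columns each row has and what they contain.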

In [None]:
## FILL IN THE BLANKS ##
def identify_target_ligands(target, chembl_targ_f):
    """Return list of ligands associated with the requested protein target."""
    ligands = []
    print('Identifying ligands of target: {}'.format(target))
    fi = gzip.open(chembl_targ_f, 'rt' )
    reader = csv.reader(fi)
    next(reader)
    for chid, unid, assoc_ligands, tdesc in reader:
        if chid != ?:
            continue
        for ligand in assoc_ligands.split(':'):
            ligands.append(?)
    ligands = set(ligands)
    print('\tFound {} ligands\n'.format(len(ligands)))
    return ligands
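
The colon-delimited ligand field can be explored on its own before the function is complete. This is a small sketch under assumptions: `split_ligands` is a hypothetical helper, and the input string is a made-up example of the field format implied by `assoc_ligands.split(':')`.

```python
def split_ligands(field, sep=":"):
    """Split a delimited ligand field into a set of unique, non-empty IDs."""
    return {token for token in field.split(sep) if token}


# Made-up field with a repeated ID and a trailing separator.
example_field = "CHEMBL1:CHEMBL2:CHEMBL1:"
unique_ligands = split_ligands(example_field)

print(sorted(unique_ligands))  # ['CHEMBL1', 'CHEMBL2']
```

Collecting the IDs in a set removes duplicates, which is the same effect as the `set(ligands)` call in the function above.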

In [18]:
def compare_compound_to_target_ligands(candidate_cpds_f, target, chembl_mol_f, chembl_target_f, nSim=5):
    """Identifies ligands of a particular target, and retrieves the N-most similar ligands to each candidate compound"""
    targ_ligands = identify_target_ligands(target, chembl_target_f)
    ligand_to_mol = map_target_ligands_to_mol(targ_ligands, chembl_mol_f)
    candidate_to_mol = map_candidate_compounds_to_mol(candidate_cpds_f)
    get_nMost_similar_ligands_to_candiates(target, ligand_to_mol, candidate_to_mol, nSim=nSim)
    return 

In [None]:
compare_compound_to_target_ligands(QUERY_CPDS_F, EX_TARGET, CHEMBL_MOLS_F, CHEMBL_TARGS_F, nSim=5)

### *Question: Do these ligands share common patterns, functional groups, “warheads”, etc with your compound? Which ones?*