# Test using sample data

In this notebook, we take a sample of data from ChEMBL and, given an unseen molecule, suggest putative protein targets for it.

For this notebook to run correctly, it must use the _pyspark_ kernel.  This is done by starting Jupyter Notebook with the following command:

```bash
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master yarn --verbose
```

- `--master yarn` instructs pyspark to run in YARN mode.  Leaving this out runs Spark in local mode (on a single node).
- `--verbose` prints more detail in the pyspark console.  This is useful for debugging and for understanding what is going on.

Monitor cluster tasks from [Personal Hadoop Dashboard](http://hadoop1:8088/cluster).

Information about _pyspark packages_ is found at [Apache Spark pyspark homepage](http://spark.apache.org/docs/latest/api/python/pyspark.html "pyspark packages").

Python API docs are available from [here](https://spark.apache.org/docs/1.6.2/api/python/index.html).

In [1]:
# set Spark Context logging level to info - this is useful for debugging purposes
sc.setLogLevel("INFO")

The dependencies in the cell below are used by Spark workers, so they need to be available on all cluster nodes.  This is done using the Spark Context's `addPyFile()`.

In [2]:
sc.addPyFile("moleculehelper.py") # 300 - Ligand framework
sc.addPyFile("pythonhelper.py")   # 001 - Python Helper.ipynb

In [3]:
from pyspark.sql.types import *
from pyspark.sql.functions import col, desc

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
from rdkit.Chem import MACCSkeys

from moleculehelper import *
from chemblhelper import ChEMBLHelper
from pythonhelper import *
from hdfshelper import HDFSHelper

import os.path

## Get data from ChEMBL

### All Data (sample)

In this setup, the ChEMBL database is downloaded and attached to a MySQL server running at address 192.168.151.11.  All ChEMBL-related methods are encapsulated in a Python helper class named `ChEMBLHelper`.  This class is defined in [`000 - ChEMBL Helper.ipynb`](000%20-%20ChEMBL%20Helper.ipynb).  It connects to the MySQL server, gets the required data and returns it in the required formats.  See the `ChEMBLHelper` Jupyter notebook for more detail, especially the `__doc__` documentation describing the methods and how they work.  See [PEP-0257](https://www.python.org/dev/peps/pep-0257/) for the semantics and conventions associated with Python docstrings.

__NOTE__: The MySQL connection string is hardcoded in the helper class mentioned above.

In [4]:
# get small dataset from ChEMBL
chemblhelper = ChEMBLHelper()

hdfsServer = "http://hadoop1:50070"                          # hdfs server address
datasetCount = 100000                                        # dataset count of bindings from ChEMBL
datasetTSVFilename = "sample" + str(datasetCount) + ".tsv"   # file with sample data
hdfsDatasetFilename = os.path.join("/user/hduser", datasetTSVFilename)

# check if the dataset file exists in hdfs and if it does not, then load data from ChEMBL database
if not HDFSHelper.fileExists(hdfsServer, hdfsDatasetFilename):
    # get data from ChEMBL and store the file on the local file system
    chemblhelper.saveBindingsToTSV(datasetTSVFilename, datasetCount)
    # upload data to hdfs so that it is accessible from all cluster worker nodes
    HDFSHelper.putFile(hdfsServer, datasetTSVFilename, hdfsDatasetFilename)

In [5]:
# create an RDD with all data
dataRDD = sc.textFile(hdfsDatasetFilename) \
            .map(lambda line: line.split("\t")) 

# convert each line (currently a list) to a tuple.  This makes it easier to manipulate
# the data, especially when converting to DataFrames
dataRDD = dataRDD.map(lambda l: tuple(l))

dataRDD.count()

100000
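The parsing step above can be tried without a cluster. The sketch below is a plain-Python analogue of the `split("\t")` and `tuple` mapping. The column names are taken from the schemas used later in this notebook, and their order is an assumption inferred from the sample rows. The example line is based on the first record, with the numeric values and the protein sequence shortened for readability.

```python
# Column names follow the schemas used later in this notebook; the order is
# inferred from the sample rows and is an assumption of this sketch.
COLUMNS = ("row_id", "assay_id", "molregno", "std_relation", "std_value",
           "std_units", "std_type", "pchembl_value", "component_id",
           "accession", "sequence", "canonical_smiles")


def parse_line(line):
    """Split one tab-separated line into a tuple of string fields."""
    return tuple(line.rstrip("\n").split("\t"))


# Illustrative line based on the first sample row (values and sequence shortened).
line = "\t".join(["1.0", "1459233", "222065", "=", "0.9", "nM", "Ki", "9.05",
                  "1", "O09028", "MSYSLYLAFV",
                  "Cc1ccc2OC(=CC(=O)c2c1)c3cc(Br)ccc3O"])

record = parse_line(line)
named = dict(zip(COLUMNS, record))   # name each field for easier inspection
print(named["molregno"], named["canonical_smiles"])
```

Every field is still a string at this point, which is why the later cells convert `t[2]`, `t[8]` and so on before building DataFrames.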

In [6]:
dataRDD.take(2)

[(u'1.0',
  u'1459233',
  u'222065',
  u'=',
  u'0.900000000000000000000000000000',
  u'nM',
  u'Ki',
  u'9.050000000000000000000000000000',
  u'1',
  u'O09028',
  u'MSYSLYLAFVCLNLLAQRMCIQGNQFNVEVSRSDKLSLPGFENLTAGYNKFLRPNFGGDPVRIALTLDIASISSISESNMDYTATIYLRQRWTDPRLVFEGNKSFTLDARLVEFLWVPDTYIVESKKSFLHEVTVGNRLIRLFSNGTVLYALRITTTVTCNMDLSKYPMDTQTCKLQLESWGYDGNDVEFSWLRGNDSVRGLENLRLAQYTIQQYFTLVTVSQQETGNYTRLVLQFELRRNVLYFILETYVPSTFLVVLSWVSFWISLESVPARTCIGVTTVLSMTTLMIGSRTSLPNTNCFIKAIDVYLGICFSFVFGALLEYAVAHYSSLQQMAVKDRGPAKDSEEVNITNIINSSISSFKRKISFASIEISGDNVNYSDLTMKASDKFKFVFREKIGRIIDYFTIQNPSNVDRYSKLLFPLIFMLANVFYWAYYMYF',
  u'Cc1ccc2OC(=CC(=O)c2c1)c3cc(Br)ccc3O'),
 (u'2.0',
  u'1459233',
  u'86147',
  u'=',
  u'1.500000000000000000000000000000',
  u'nM',
  u'Ki',
  u'8.820000000000000000000000000000',
  u'1',
  u'O09028',
  u'MSYSLYLAFVCLNLLAQRMCIQGNQFNVEVSRSDKLSLPGFENLTAGYNKFLRPNFGGDPVRIALTLDIASISSISESNMDYTATIYLRQRWTDPRLVFEGNKSFTLDARLVEFLWVPDTYIVESKKSFLHEVTVGNRLIRLFSNGTVLYALRITTTVTCNMDLSKYPMDTQTCKLQLESWGY

In [12]:
dataRDD.filter(lambda t: t[2] == '86147').take(1)[0][1]

u'1459233'

### Take a sample for testing
Take a random 1% of the molecules in the dataset.  These are later used to test the solution.

In [7]:
moleculesSample = dataRDD.map(lambda t: long(t[2])).distinct().sample(withReplacement=False, seed=1, fraction=0.01)

In [8]:
# broadcast value as a list of long numbers so that mapping is faster
moleculesSampleBV = sc.broadcast(moleculesSample.collect())

In [9]:
# test data is reserved as unseen knowledge
testData = dataRDD.filter(lambda t: long(t[2]) in moleculesSampleBV.value)

In [10]:
# sample data is the known dataset
sampleData = dataRDD.subtract(testData)

In [11]:
# cross check data
dataRDDCount = dataRDD.count()
sampleDataCount = sampleData.count()
testDataCount = testData.count()

print "dataRDD count = " + str(dataRDDCount)
print "sampleData count (" + str(sampleDataCount) + \
        ") + testData count (" + str(testDataCount) + \
        ") = " + str(sampleDataCount + testDataCount)
print "test sample ratio = " + str((1.0 * testDataCount) / dataRDDCount)

dataRDD count = 100000
sampleData count (99010) + testData count (990) = 100000
test sample ratio = 0.0099
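The hold-out split by molecule can be sketched in plain Python as follows. This is only an analogue of the Spark cells above, not the notebook's implementation: `split_by_molecule` and the toy rows are hypothetical. It assumes, as Spark's sampling without replacement does, that each molecule id is kept independently with probability `fraction`, so the held-out share is only approximately the requested one (compare the 0.0099 ratio above).

```python
import random


def split_by_molecule(rows, fraction=0.01, seed=1):
    """Hold out every row whose molecule id (field 2) was sampled."""
    rng = random.Random(seed)
    molecule_ids = sorted({int(r[2]) for r in rows})
    # Bernoulli sampling: each id is kept independently with probability
    # `fraction`, so the held-out share is only approximately `fraction`.
    held_out = {m for m in molecule_ids if rng.random() < fraction}
    test = [r for r in rows if int(r[2]) in held_out]
    train = [r for r in rows if int(r[2]) not in held_out]
    return train, test


# Hypothetical toy rows: (row_id, assay_id, molregno)
rows = [(str(i), "1", str(i % 500)) for i in range(2000)]
train, test = split_by_molecule(rows, fraction=0.1)
print(len(train) + len(test) == len(rows))
```

Because the split is made on molecule ids rather than on rows, no molecule appears on both sides. The sketch keeps the sampled ids in a `set`, which gives constant-time membership tests; wrapping the collected ids in `set(...)` before broadcasting would give the same benefit in the notebook.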


### Molecules

In [13]:
# each line needs to be converted to a tuple so that it can later be converted into a DF
moleculesRDD = sampleData.map(lambda t: (long(t[2]),str(t[11]))).distinct()

# molecules schema
moleculesSchema = StructType([StructField("molregno", IntegerType(), False),
                              StructField("canonical_smiles", StringType(), False)])

# convert RDD to DataFrame - faster and more memory efficient
molecules = sqlContext.createDataFrame(moleculesRDD, moleculesSchema)

molecules.count()

58817

In [14]:
molecules.show(10, truncate=False)

+--------+------------------------------------------------------------+
|molregno|canonical_smiles                                            |
+--------+------------------------------------------------------------+
|1611015 |COc1ccc(cc1OC)c2cnc3[nH]cc(c4ccc(OC)c(OC)c4)c3c2            |
|979451  |CN(C)c1ncc2N=C(CCc3ccccc3)C(=O)N(Cc4ccc(F)cc4)c2n1          |
|1615431 |Cc1c(O)cccc1c2nc3c(OCC4CCCCC4)nc(N)nc3[nH]2                 |
|846502  |C[C@H]1[C@@H]2CC[C@]3(C)[C@@H]([C@H]2OC1=O)[C@](C)(O)C=CC3=O|
|1005961 |COc1cccc(c1)c2noc(N)c2C#N                                   |
|244372  |NCCc1ccc(cc1)C(=O)NCC(=O)N2CCN(CC(=O)O)C(=O)C2              |
|1000277 |Clc1cccc(NC(=O)Nc2ccc3nccnc3c2)c1                           |
|602426  |Nc1c(oc2ccc(Br)cc12)C(=O)c3ccccc3                           |
|940541  |Cc1cccc(C)c1OCc2onc(c2)C(=O)N3CCN(CC3)c4ccc(F)cc4           |
|861294  |Clc1ccc(cc1)C(=O)Nc2cc3OCCOc3cc2C(=O)c4ccc(Br)cc4           |
+--------+------------------------------------------------------------+

### Bindings

In [15]:
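# each line needs to be converted to a tuple so that it can later be converted into a DF
# row_id is stored as text such as "1.0", so it is parsed through float();
# rstrip(".0") would also remove significant trailing zeros ("100.0" -> "1")
bindingsRDD = sampleData.map(lambda t: (long(float(t[0])),
                                        long(t[1]),
                                        long(t[2]),
                                        str(t[3]),
                                        PythonHelper.getDecimal(t[4]),
                                        str(t[5]),
                                        str(t[6]),
                                        PythonHelper.getDecimal(t[7]),
                                        long(t[8])))

# bindings schema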
bindingsSchema = StructType([StructField("row_id", LongType(), False),
                             StructField("assay_id", LongType(), False),                            
                             StructField("molregno", LongType(), False),
                             StructField("std_relation", StringType(), True),
                             StructField("std_value", DecimalType(), True),
                             StructField("std_units", StringType(), True),
                             StructField("std_type", StringType(), True),
                             StructField("pchembl_value", DecimalType(), True),
                             StructField("component_id", LongType(), False)])

# convert RDD to DataFrame - faster and more memory efficient
bindings = sqlContext.createDataFrame(bindingsRDD, bindingsSchema)
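
The `row_id` field arrives as text such as `1.0`. The small standalone sketch below shows why it is safer to parse it through `float` than to strip the suffix with `rstrip`: `str.rstrip` treats its argument as a set of characters to remove, not as a suffix. `parse_row_id` is a hypothetical helper name, and `int` stands in for Python 2's `long`.

```python
def parse_row_id(text):
    """Convert a row id stored as text such as '100.0' to an integer."""
    return int(float(text))


# rstrip(".0") removes every trailing '.' and '0', not just the ".0" suffix
print("100.0".rstrip(".0"))      # prints 1
print(parse_row_id("100.0"))     # prints 100
print(parse_row_id("56293.0"))   # prints 56293
```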

In [16]:
bindings.count()

99010

In [17]:
bindings.show(5)

+------+--------+--------+------------+---------+---------+----------+-------------+------------+
|row_id|assay_id|molregno|std_relation|std_value|std_units|  std_type|pchembl_value|component_id|
+------+--------+--------+------------+---------+---------+----------+-------------+------------+
| 75305|  688258| 1697408|           =|     3548|       nM|   Potency|            5|          40|
| 11557|  688546|  899243|           =|      501|       nM|   Potency|            6|           3|
| 14823|  688546| 1021201|           =|    12589|       nM|   Potency|            5|           3|
| 14284|  688546|  757723|           =|    31623|       nM|   Potency|            5|           3|
| 72527|  424976|  369315|           =|       20|        %|Inhibition|         null|          37|
+------+--------+--------+------------+---------+---------+----------+-------------+------------+
only showing top 5 rows
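
The `std_value` and `pchembl_value` columns print as whole numbers although the TSV holds values such as `6.33`. A likely explanation, assuming `PythonHelper.getDecimal` keeps the fractional part, is that `DecimalType()` without arguments defaults to precision 10 and scale 0, so the fraction is dropped when the DataFrame is built. The sketch below imitates that effect with Python's `decimal` module. The half-up rounding mode is an assumption about Spark's behaviour, and `to_scale` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_scale(value, scale):
    """Round a decimal string to `scale` fractional digits (half-up)."""
    quantum = Decimal(10) ** -scale   # scale=0 -> 1, scale=2 -> 0.01
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


raw = "6.330000000000000000000000000000"  # pchembl_value as stored in the TSV
print(to_scale(raw, 0))   # prints 6
print(to_scale(raw, 2))   # prints 6.33
```

If the fractional part matters for later analysis, declaring the scale explicitly in the schema, for example `DecimalType(20, 10)`, should retain it.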



### Proteins

In [18]:
# each line needs to be converted to a tuple so that it can later be converted into a DF
proteinsRDD = sampleData.map(lambda t: (long(t[8]),str(t[9]),str(t[10]))).distinct()

# proteins schema
proteinsSchema = StructType([StructField("component_id", LongType(), False),
                             StructField("accession", StringType(), True),
                             StructField("sequence", StringType(), False)])

# convert RDD to DataFrame - faster and more memory efficient
proteins = sqlContext.createDataFrame(proteinsRDD, proteinsSchema)

proteins.count()

46

In [19]:
proteins.show(10)

+------------+---------+--------------------+
|component_id|accession|            sequence|
+------------+---------+--------------------+
|          25|   P15823|MNPDLDTGHNTSAPAHW...|
|          46|   P43681|MELGGPGAPRLLPPLLL...|
|          41|   P30191|MLLLLPWLFSLLWIENA...|
|          42|   P30926|MRRAPSLVLFFLVALCG...|
|           3|   P04637|MEEPQSDPSVEPPLSQE...|
|          16|   P11230|MTPGALLMLLGALGAPL...|
|          29|   P19969|MDNGMLSRFIMTKTLLV...|
|           2|   P02708|MEPWPLLLLFSLCSAGL...|
|          30|   P20236|MITTQMWHFYVTRVGLL...|
|           4|   P04757|MGVVLLPPPLSMLMLVL...|
+------------+---------+--------------------+
only showing top 10 rows



## Run PySpark Jobs

In the next section, we run a number of Spark jobs to compute molecule similarities.

In [61]:
# this must run on the driver, since it uses sc and sqlContext
def findSimilarMolecules(querySmiles, knownMolecules, molHelper = MoleculeHelper, similarityThreshold = 0.85):
    """ Returns a DataFrame (molregno, similarity) with the known molecules whose
        similarity to the query molecule is at least similarityThreshold.
    """

    # step 1 - create a molecule helper instance for each molecule, this will take
    #          more memory but will increase computation efficiency
    queryRDD = sc.parallelize([(0, querySmiles)]).map(lambda (k, v):(k, molHelper(v)))
    mols = knownMolecules.rdd.map(lambda (k, v):(k, molHelper(v)))

    # step 2 - pair every known molecule with the query molecule and keep
    #          only the pairs that reach the similarity threshold
    sm = mols.cartesian(queryRDD) \
             .map(lambda ((k1,v1),(k2,v2)): (k1, float(v1.similarity(v2)))) \
             .filter(lambda (k1, v): v >= similarityThreshold)

    simSchema = StructType([StructField("molregno", LongType(), False),
                            #StructField("queryMol", LongType(), False),
                            StructField("similarity", FloatType(), False)])

    sim = sqlContext.createDataFrame(sm, simSchema)

    return sim
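
The pairing and filtering performed by `findSimilarMolecules` can be illustrated without Spark or RDKit. The sketch below is a minimal plain-Python analogue that assumes fingerprints are represented as sets of on-bit positions and that similarity is the Tanimoto coefficient. The internals of `MoleculeHelper` are not shown in this notebook, so the names `tanimoto` and `find_similar`, and the toy fingerprints, are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0


def find_similar(query_bits, known, threshold=0.85):
    """Return (id, similarity) pairs at or above the threshold, best first."""
    scored = ((mol_id, tanimoto(bits, query_bits))
              for mol_id, bits in known.items())
    return sorted((pair for pair in scored if pair[1] >= threshold),
                  key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: ids and bit positions are made up for illustration.
known = {101: {1, 2, 3, 4}, 102: {1, 2, 3, 9}, 103: {7, 8}}
query = {1, 2, 3, 4}
print(find_similar(query, known, threshold=0.5))   # [(101, 1.0), (102, 0.6)]
```

Molecule 103 shares no bits with the query, so it scores 0 and is filtered out, just as the `.filter(...)` step drops pairs below `similarityThreshold`.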

In [62]:
def getBindings(similarMolecules):
    return bindings.join(similarMolecules, bindings.molregno == similarMolecules.molregno)

In [63]:
def experiment(experimentsData, moleculeIndex, knownMolecules, molHelper = MoleculeHelper, similarityThreshold = 0.85):
    """ Picks the molecule at moleculeIndex from experimentsData, prints its known
        targets and returns the bindings of similar known molecules.
    """
    selectedMolecules = experimentsData.map(lambda t: (long(t[2]), t[11])).distinct().sortBy(lambda x: x[0])
    # collect once and reuse the result instead of triggering two Spark jobs
    selectedMoleculeMolRegNo, selectedMoleculeSMILES = selectedMolecules.collect()[moleculeIndex]

    print "MolRegNo = " + str(selectedMoleculeMolRegNo)
    print "SMILES = " + selectedMoleculeSMILES

    # component ids of the targets this molecule is known to bind to
    expData = experimentsData.filter(lambda t: long(t[2]) == selectedMoleculeMolRegNo).map(lambda m: m[8]).distinct().collect()

    print "Target Componet IDs:"
    for compId in expData:
        print "  " + str(compId)

    sm = findSimilarMolecules(selectedMoleculeSMILES, knownMolecules, molHelper, similarityThreshold)
    return getBindings(sm)

In [64]:
# find similar molecules using Tanimoto (default) and Morgan fingerprint
test = experiment(testData, 30, molecules, similarityThreshold = 0.5)

MolRegNo = 309565
SMILES = OC(=O)CC(NC(=O)C1CCN(CC1)C(=O)CCCNC2=NCCN2)c3cnc4ccccc4c3
Target Componet IDs:
  5
  8


In [65]:
# display data
test.orderBy(desc("similarity")).show()

+------+--------+--------+------------+---------+---------+--------+-------------+------------+--------+----------+
|row_id|assay_id|molregno|std_relation|std_value|std_units|std_type|pchembl_value|component_id|molregno|similarity|
+------+--------+--------+------------+---------+---------+--------+-------------+------------+--------+----------+
| 51363|  305859|  311499|           =|      170|       nM|    IC50|            7|           5|  311499|       1.0|
| 49079|  305552|  311499|           =|        4|       nM|    IC50|            8|           5|  311499|       1.0|
| 56293|  305553|  311499|           =|      470|       nM|    IC50|            6|           8|  311499|       1.0|
| 55289|  305552|  311499|           =|        4|       nM|    IC50|            8|           8|  311499|       1.0|
| 49047|  305375|  309829|           =|       19|       nM|    IC50|            8|           5|  309829|0.74666667|
| 55258|  305375|  309829|           =|       46|       nM|    IC50|    

In [60]:
sampleData.filter(lambda x: x[2] == '311499').collect()

[(u'56293.0',
  u'305553',
  u'311499',
  u'=',
  u'470.000000000000000000000000000000',
  u'nM',
  u'IC50',
  u'6.330000000000000000000000000000',
  u'8',
  u'P06756',
  u'MAFPPRRRLRLGPRGLPLLLSGLLLPLCRAFNLDVDSPAEYSGPEGSYFGFAVDFFVPSASSRMFLLVGAPKANTTQPGIVEGGQVLKCDWSSTRRCQPIEFDATGNRDYAKDDPLEFKSHQWFGASVRSKQDKILACAPLYHWRTEMKQEREPVGTCFLQDGTKTVEYAPCRSQDIDADGQGFCQGGFSIDFTKADRVLLGGPGSFYWQGQLISDQVAEIVSKYDPNVYSIKYNNQLATRTAQAIFDDSYLGYSVAVGDFNGDGIDDFVSGVPRAARTLGMVYIYDGKNMSSLYNFTGEQMAAYFGFSVAATDINGDDYADVFIGAPLFMDRGSDGKLQEVGQVSVSLQRASGDFQTTKLNGFEVFARFGSAIAPLGDLDQDGFNDIAIAAPYGGEDKKGIVYIFNGRSTGLNAVPSQILEGQWAARSMPPSFGYSMKGATDIDKNGYPDLIVGAFGVDRAILYRARPVITVNAGLEVYPSILNQDNKTCSLPGTALKVSCFNVRFCLKADGKGVLPRKLNFQVELLLDKLKQKGAIRRALFLYSRSPSHSKNMTISRGGLMQCEELIAYLRDESEFRDKLTPITIFMEYRLDYRTAADTTGLQPILNQFTPANISRQAHILLDCGEDNVCKPKLEVSVDSDQKKIYIGDDNPLTLIVKAQNQGEGAYEAELIVSIPLQADFIGVVRNNEALARLSCAFKTENQTRQVVCDLGNPMKAGTQLLAGLRFSVHQQSEMDTSVKFDLQIQSSNLFDKVSPVVSHKVDLAVLAAVEIRGVSSPDHVFLPIPNWEHKENPETEEDVGPVVQHIYELRNNGPSSFSKAMLHL

In [66]:
# using MACCS fingerprint and Tanimoto
test = experiment(testData, 30, molecules, molHelper= MoleculeMACCSHelper, similarityThreshold = 0.5)

MolRegNo = 309565
SMILES = OC(=O)CC(NC(=O)C1CCN(CC1)C(=O)CCCNC2=NCCN2)c3cnc4ccccc4c3
Target Componet IDs:
  5
  8


In [67]:
# display data
test.orderBy(desc("similarity")).show()

+------+--------+--------+------------+---------+---------+--------+-------------+------------+--------+----------+
|row_id|assay_id|molregno|std_relation|std_value|std_units|std_type|pchembl_value|component_id|molregno|similarity|
+------+--------+--------+------------+---------+---------+--------+-------------+------------+--------+----------+
| 55289|  305552|  311499|           =|        4|       nM|    IC50|            8|           8|  311499|       1.0|
| 49079|  305552|  311499|           =|        4|       nM|    IC50|            8|           5|  311499|       1.0|
| 51363|  305859|  311499|           =|      170|       nM|    IC50|            7|           5|  311499|       1.0|
| 56293|  305553|  311499|           =|      470|       nM|    IC50|            6|           8|  311499|       1.0|
| 51373|  305995|  309829|           =|     1800|       nM|    IC50|            6|           5|  309829| 0.9516129|
| 55257|  305375|  309829|           =|       19|       nM|    IC50|    

The next cells demonstrate how the `MoleculeHelper` class can be subclassed to provide new fingerprint and similarity algorithms.  The approach is unusual: the source code in the next cell needs to be stored as a Python script file and then distributed to all other YARN nodes via the Spark Context.

In [90]:
# test class

from moleculehelper import MoleculeHelper
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
from rdkit.Chem import MACCSkeys

class CustomMolHelper(MoleculeHelper):    
    
    def fingerprint(self, molecule):
        return MACCSkeys.GenMACCSKeys(molecule)  
    
    def similarityAlgorithm(self, otherFingerprint, metric=None):        
        return DataStructs.DiceSimilarity(self.getFingerprint(), otherFingerprint)
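
The custom helper swaps the similarity metric for Dice. As a rough illustration of how that changes the scores, the sketch below compares the two coefficients on fingerprints represented as sets of on-bits. The set representation and the toy bit positions are assumptions made for this example; RDKit works on bit vectors instead.

```python
def tanimoto(a, b):
    """Shared bits divided by the bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0


def dice(a, b):
    """Twice the shared bits divided by the total number of set bits."""
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0


# Hypothetical fingerprints with three shared bits
a, b = {1, 2, 3, 4}, {1, 2, 3, 9}
t, d = tanimoto(a, b), dice(a, b)
print(t, d)   # 0.6 0.75
```

The two are related by `dice = 2 * tanimoto / (1 + tanimoto)`, so Dice is never lower than Tanimoto for the same pair. A threshold of 0.5 on Dice corresponds to roughly 0.33 on Tanimoto, which means the same `similarityThreshold` admits more molecules with the custom helper.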

In [91]:
# write the source of the previous cell to a script file and distribute it to the workers
# MAKE SURE TO RUN THE PREVIOUS CELL BEFORE CALLING THE CODE IN THIS ONE
pyFileName = "customMolHelper.py"
with open(pyFileName, "w") as pyFile:
    pyFile.write(In[len(In)-2])
    
sc.addPyFile(pyFileName)

import customMolHelper as CMH
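
The write-then-import pattern used above can be tried locally without Spark. In the sketch below, the module source, the file name `custom_demo_helper.py` and the function `greet` are hypothetical stand-ins for the cell source taken from `In`, and putting the directory on `sys.path` plays the role that `sc.addPyFile()` plays for the workers. It assumes Python 3 for `importlib.invalidate_caches()`.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module source; in the notebook this text comes from the previous cell.
source = "def greet(name):\n    return 'hello ' + name\n"

# Write the source to a script file in a fresh temporary directory.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "custom_demo_helper.py"), "w") as py_file:
    py_file.write(source)

# Make the directory importable, then import the freshly written module.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
demo = importlib.import_module("custom_demo_helper")
print(demo.greet("spark"))   # hello spark
```

Once the code lives in a real module, it can be imported by name on any node that has the file, which is what the notebook relies on when it passes `CMH.CustomMolHelper` to `experiment`.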

In [None]:
# using MACCS fingerprint and Dice similarity (see CustomMolHelper)
test = experiment(testData, 30, molecules, molHelper=CMH.CustomMolHelper, similarityThreshold = 0.5)
test.orderBy(desc("similarity")).show()

MolRegNo = 309565
SMILES = OC(=O)CC(NC(=O)C1CCN(CC1)C(=O)CCCNC2=NCCN2)c3cnc4ccccc4c3
Target Componet IDs:
  5
  8
