# Drug Discovery Project

## Datasets
(a) Carbonic Anhydrase II (CHEMBL205), a protein lyase,  
(b) Cyclin-dependent kinase 2 (CHEMBL301), a protein kinase,  
(c) Ether-a-go-go-related gene potassium channel 1 (hERG) (CHEMBL240), a voltage-gated ion channel,  
(d) Dopamine D4 receptor (CHEMBL219), a monoamine GPCR,  
(e) Coagulation factor X (CHEMBL244), a serine protease,  
(f) Cannabinoid CB1 receptor (CHEMBL218), a lipid-like GPCR and  
(g) Cytochrome P450 19A1 (CHEMBL1978), a cytochrome P450.  
The activity classes were selected based on data availability and as representatives of therapeutically important target classes or as anti-targets.
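
For later cells it can help to keep the target list in code as well. The following is a minimal sketch, assuming a hypothetical `TARGETS` dictionary and `dataset_name` helper (neither is part of the original notebook) that map each ChEMBL ID above to its target name and class. The `_cl` suffix follows the file names used in the dataset folder.

```python
# Hypothetical lookup table built from the list above; not part of the original notebook.
TARGETS = {
    "CHEMBL205": ("Carbonic Anhydrase II", "protein lyase"),
    "CHEMBL301": ("Cyclin-dependent kinase 2", "protein kinase"),
    "CHEMBL240": ("hERG potassium channel", "voltage-gated ion channel"),
    "CHEMBL219": ("Dopamine D4 receptor", "monoamine GPCR"),
    "CHEMBL244": ("Coagulation factor X", "serine protease"),
    "CHEMBL218": ("Cannabinoid CB1 receptor", "lipid-like GPCR"),
    "CHEMBL1978": ("Cytochrome P450 19A1", "cytochrome P450"),
}


def dataset_name(chembl_id):
    """Return the base file name (without extension) used for a target."""
    return f"{chembl_id}_cl"


for chembl_id, (name, target_class) in TARGETS.items():
    print(f"{dataset_name(chembl_id)}: {name} ({target_class})")
```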

In [1]:
!nvidia-smi

Tue Oct  5 12:56:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   34C    P8    14W / 240W |    727MiB /  8116MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [2]:
# Import
import pandas as pd
import numpy as np
from pathlib import Path

In [3]:
from rdkit import Chem
from rdkit.Chem import AllChem



In [4]:
path = Path('../dataset/13321_2017_226_MOESM1_ESM/')
#df = pd.read_csv('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205_cl.csv', index_col=0)

In [5]:
#df.head()
list(path.iterdir())

[PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL244_cl_ecfp_512.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL218_cl.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/RdkitDescriptors.py'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205_cl_ecfp_512.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL301_cl.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205_cl_ecfp_1024.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL1978_cl_ecfp_512.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL240_cl.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL301_cl_ecfp_512.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL218_cl_ecfp_1024.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205_cl.csv'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/.ipynb_checkpoints'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL218_cl_ecfp_512.csv'),
 PosixPath('../dataset/13321_201

# Create fingerprints for all datasets

In [6]:
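# Return the Morgan (ECFP-style) fingerprint bit vector for a single SMILES string.
# Note: `diam` is passed to RDKit as the Morgan radius; a radius of 2 corresponds to ECFP4.

def fp(smile, diam = 2, bits = 1024):

    mol = Chem.MolFromSmiles(smile)
    if mol is None:
        # MolFromSmiles returns None when RDKit cannot parse the input
        raise ValueError(f"Could not parse SMILES: {smile!r}")
    # MolFromSmiles sanitizes by default, so a separate SanitizeMol call is not needed
    bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, diam, nBits = bits)
    return bitvect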

In [7]:
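# ECFP4
# Generate circular fingerprints hashed into bit vectors of length `bits`,
# expand them into one column per bit, and write the result to `ofile` inside `path`.

def ECFP(ifile, ofile, diam, bits):
    
    print(f"Making fingerprints for file: {ifile}")
    df = pd.read_csv(ifile)
    
    # Pass diam and bits through; otherwise fp() silently falls back to its defaults
    df.insert(2, "ECFP4_", df.SMILES.apply(fp, diam = diam, bits = bits))
    
    # Create one integer column per bit, placed right after the SMILES column
    for i in range(len(df.ECFP4_[0])):
        df.insert(i + 3, f"ECFP4_{i + 1}", 0)
    
    # Fill the bit columns from the fingerprint vectors
    df[[f"ECFP4_{i+1}" for i in range(len(df.ECFP4_[0]))]] = df.ECFP4_.to_list()
    
    df.drop("ECFP4_", axis = 1, inplace = True)
    
    # `path` is the dataset directory defined in an earlier cell
    df.to_csv(path/ofile, index = None)
    return df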
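
The comment above describes hashing circular substructure identifiers into a fixed-length bit vector. As a rough illustration of that folding step only, here is a toy sketch in plain Python. The `fold_to_bits` helper and the example identifiers are hypothetical, and RDKit's actual hashing scheme is not reproduced here. The point is just that each identifier is mapped to a position modulo the vector length, so different identifiers can collide on the same bit.

```python
def fold_to_bits(identifiers, n_bits=1024):
    """Fold integer substructure identifiers into a fixed-length 0/1 list."""
    bits = [0] * n_bits
    for ident in identifiers:
        # The bit position is the identifier modulo the vector length
        bits[ident % n_bits] = 1
    return bits


# Made-up identifiers standing in for hashed atom environments.
# The last one differs from the first by exactly n_bits, so both hit the same bit.
example_ids = [98513984, 2245384272, 3217380708, 98513984 + 1024]

vec = fold_to_bits(example_ids, n_bits=1024)
print(len(vec), sum(vec))  # vector length and number of bits set
```

Shorter vectors make such collisions more likely, which is the usual trade-off between file size and information content when choosing between lengths such as 512 and 1024.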

# Run the functions on each dataset file and store the results

In [8]:
datasets = ['CHEMBL205_cl', 'CHEMBL301_cl', 'CHEMBL218_cl', 
            'CHEMBL240_cl', 'CHEMBL219_cl', 
            'CHEMBL244_cl', 'CHEMBL1978_cl']

In [9]:
def create_fingerprints(dataset, bits):
    ECFP(path/f'{dataset}.csv', f'./{dataset}_ecfp_{bits}.csv', 2, bits)

In [10]:
for dataset in datasets: 
    create_fingerprints(dataset, 1024)

Making fingerprints for file: ../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205_cl.csv
Making fingerprints for file: ../dataset/13321_2017_226_MOESM1_ESM/CHEMBL301_cl.csv
Making fingerprints for file: ../dataset/13321_2017_226_MOESM1_ESM/CHEMBL218_cl.csv
Making fingerprints for file: ../dataset/13321_2017_226_MOESM1_ESM/CHEMBL240_cl.csv
Making fingerprints for file: ../dataset/13321_2017_226_MOESM1_ESM/CHEMBL219_cl.csv
Making fingerprints for file: ../dataset/13321_2017_226_MOESM1_ESM/CHEMBL244_cl.csv
Making fingerprints for file: ../dataset/13321_2017_226_MOESM1_ESM/CHEMBL1978_cl.csv
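
The directory listing earlier also shows 512-bit files, so the same loop can presumably be repeated for other lengths. Below is a small sketch of how the output names are composed, using only `pathlib`. The `output_path` helper is hypothetical, and `PurePosixPath` is used as an assumption so that the example does not touch the file system. A leading `./` in the file name is collapsed when the parts are joined.

```python
from pathlib import PurePosixPath

# Same directory and dataset names as in the cells above
path = PurePosixPath('../dataset/13321_2017_226_MOESM1_ESM/')
datasets = ['CHEMBL205_cl', 'CHEMBL301_cl', 'CHEMBL218_cl',
            'CHEMBL240_cl', 'CHEMBL219_cl',
            'CHEMBL244_cl', 'CHEMBL1978_cl']


def output_path(dataset, bits):
    """Hypothetical helper: build the output CSV path for one dataset and bit length."""
    return path / f'./{dataset}_ecfp_{bits}.csv'


for bits in (512, 1024):
    for dataset in datasets:
        print(output_path(dataset, bits))
```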


In [11]:
for dataset in datasets:
    df = pd.read_csv(path/f'{dataset}_ecfp_1024.csv')
    df.info()
    print()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17941 entries, 0 to 17940
Columns: 1027 entries, CID to Activity
dtypes: int64(1025), object(2)
memory usage: 140.6+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7755 entries, 0 to 7754
Columns: 1027 entries, CID to Activity
dtypes: int64(1025), object(2)
memory usage: 60.8+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20924 entries, 0 to 20923
Columns: 1027 entries, CID to Activity
dtypes: int64(1025), object(2)
memory usage: 163.9+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7700 entries, 0 to 7699
Columns: 1027 entries, CID to Activity
dtypes: int64(1025), object(2)
memory usage: 60.3+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5566 entries, 0 to 5565
Columns: 1027 entries, CID to Activity
dtypes: int64(1025), object(2)
memory usage: 43.6+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12584 entries, 0 to 12583
Columns: 1027 entries, CID to Activity
dtypes: int64(1025), object(2)
memory u
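
The memory figures reported by `df.info()` follow from the column types: each of the 1027 columns takes 8 bytes per row, with the two object columns counted as 8-byte pointers only (hence the trailing `+`). The following back-of-the-envelope check in plain Python uses a hypothetical `estimate_mib` helper. The `uint8` figure is an estimate of what one byte per 0/1 column would take, assuming `Activity` also fits in one byte; it is not a measured result.

```python
def estimate_mib(rows, n_numeric, n_object, numeric_bytes=8, pointer_bytes=8):
    """Estimate shallow DataFrame memory in MiB from row and column counts."""
    total = rows * (n_numeric * numeric_bytes + n_object * pointer_bytes)
    return total / 2**20


rows = 17941  # first dataset in the output above

as_int64 = estimate_mib(rows, 1025, 2)                    # layout as loaded
as_uint8 = estimate_mib(rows, 1025, 2, numeric_bytes=1)   # hypothetical 1-byte bit columns

print(f"int64: {as_int64:.1f} MiB")
print(f"uint8: {as_uint8:.1f} MiB")
```

If the smaller footprint matters, `pd.read_csv` accepts a `dtype` argument that can map the bit columns to `uint8` at load time.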

In [12]:
df.head()

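| | CID | SMILES | ECFP4_1 | ECFP4_2 | ECFP4_3 | ECFP4_4 | ECFP4_5 | ECFP4_6 | ECFP4_7 | ECFP4_8 | ... | ECFP4_1016 | ECFP4_1017 | ECFP4_1018 | ECFP4_1019 | ECFP4_1020 | ECFP4_1021 | ECFP4_1022 | ECFP4_1023 | ECFP4_1024 | Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL1454842 | s1nc(nc1-c1ccncc1)-c1ccncc1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | CHEMBL1939366 | s1nc(nc1-c1cccnc1)-c1cccnc1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | CHEMBL192155 | s1cncc1\C=C\1/CCc2cc(OC)ccc/12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | CHEMBL517816 | s1cccc1CN(n1ncnc1)Cc1ccc(cc1)C(C)(C)C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | CHEMBL523973 | s1cccc1CN(n1ccnc1)Cc1ccc(cc1)C(C)(C)C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |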
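
With this layout, the bit columns can be separated from the label by column name. The following is a minimal sketch on a tiny made-up frame. The rows, the 4-bit width, and the names `toy`, `X`, and `y` are illustrative assumptions, not data from the files.

```python
import pandas as pd

# Made-up stand-in with the same column layout: CID, SMILES, bit columns, Activity
toy = pd.DataFrame({
    "CID": ["MOL_A", "MOL_B", "MOL_C"],
    "SMILES": ["CCO", "c1ccccc1", "CC(=O)O"],
    "ECFP4_1": [0, 1, 0],
    "ECFP4_2": [1, 0, 0],
    "ECFP4_3": [0, 0, 1],
    "ECFP4_4": [0, 1, 1],
    "Activity": [1, 0, 1],
})

# Select only the fingerprint bit columns as features
X = toy.filter(regex=r"^ECFP4_\d+$")
# Use the Activity column as the label
y = toy["Activity"]

print(X.shape, y.tolist())
```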
