#### Aim of this script
Understand the ChEMBL database and how to extract data from it,
i.e. (compound, activity data) pairs for a target of interest.
The extracted data is used for cheminformatics tasks such as similarity search, clustering, and machine learning.

### ChEMBL database
#### (Current Release: ChEMBL 29)
“ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.” (ChEMBL website)

Current data content (as of 09.2020, ChEMBL 27)

* >1.9 million distinct compounds
* >16 million activity values
* Assays are mapped to ~13,000 targets

Data sources include scientific literature, PubChem bioassays, the Drugs for Neglected Diseases initiative (DNDi), the BindingDB database, ...

###### Access ChEMBL data: via a web-interface, the EBI-RDF platform and the ChEMBL webresource client

##### ChEMBL webresource client
* Python client library for accessing ChEMBL data
* Handles interaction with the HTTPS protocol
* Lazy evaluation of results -> reduced number of network requests

##### Compound activity measures
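* IC50: half maximal inhibitory concentration.
Indicates how much of a particular drug or other substance is needed to inhibit a given biological process by half
* Ki: inhibition constant, the equilibrium dissociation constant of the inhibitor-target complex
* EC50: half maximal effective concentration, the concentration that produces half of the maximal response
* etc.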

### Connect to ChEMBL database
Load Python libraries

In [2]:
import math
from pathlib import Path
from zipfile import ZipFile

import numpy as np
import pandas as pd
from rdkit.Chem import PandasTools
from chembl_webresource_client.new_client import new_client
from tqdm import tqdm

Set directory path

In [272]:
path = '/Users/hek/Research/Cheminformatics/Project_1_NPS/Stimulant vs. Hallucinogens/Dataset/Bioassay data/ChEMBL data/'

In [271]:
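def convert_to_logp(value):
    # Despite its name, this returns a p-value (pKi, pIC50, ...), not a logP:
    # -log10 of the molar concentration, for an input value given in nM
    logp_value = 9 - math.log10(value)
    return logp_value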

Create resource objects for API access

In [60]:
targets_api = new_client.target
compounds_api = new_client.molecule
bioactivities_api = new_client.activity

In [61]:
type(targets_api)

chembl_webresource_client.query_set.QuerySet

#### Get target data
* Get the UniProt ID of the target of interest (e.g. EGFR kinase: P00533) from the UniProt website (https://www.uniprot.org/)
* Use the UniProt ID to get target information
* Select a different UniProt ID if you are interested in another target

##### UniProt ID
* NET: P23975
* DAT: Q01959
* SERT: P31645
* CB1: P21554
* CB2: P34972
* 5-HT2A: P28223
* 5-HT2C: P28335
* mu-opioid: P35372

In [2061]:
Name = "GABAgamma1"
uniprot_id="P23574"

##### Fetch target data from ChEMBL

In [2062]:
# Get target information from ChEMBL but restrict it to specified values only
targets = targets_api.get(target_components__accession=uniprot_id).only(
    "target_chembl_id", "organism", "pref_name", "target_type"
)
print(f'The type of the targets is "{type(targets)}"')

The type of the targets is "<class 'chembl_webresource_client.query_set.QuerySet'>"


In [2063]:
targets = pd.DataFrame.from_records(targets)
targets

Unnamed: 0,organism,pref_name,target_chembl_id,target_type
0,Rattus norvegicus,GABA receptor gamma-1 subunit,CHEMBL296,SINGLE PROTEIN
1,Rattus norvegicus,GABA receptor gamma-1 subunit,CHEMBL296,SINGLE PROTEIN
2,Rattus norvegicus,GABA-A receptor; anion channel,CHEMBL1907607,PROTEIN COMPLEX GROUP
3,Rattus norvegicus,Benzodiazepine receptors; peripheral & central,CHEMBL2096683,SELECTIVITY GROUP


##### Download target data from ChEMBL
* The results of the query are stored in targets, a QuerySet.
* The results are not fetched from ChEMBL until we ask for them.
* We request the results by calling pandas.DataFrame.from_records.
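
To make the idea of lazy evaluation concrete, here is a toy sketch. It is not the client's actual implementation: `LazyQuery` and `fake_fetch` are hypothetical names that stand in for a QuerySet and a network request. The point is only that building the query costs nothing, and that the data is fetched once, on first use.

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated query (hypothetical, not the ChEMBL client)."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that performs the expensive request
        self._cache = None       # results are stored here after the first fetch
        self.n_requests = 0      # counts how often the backend was contacted

    def __iter__(self):
        if self._cache is None:  # fetch only on first use
            self.n_requests += 1
            self._cache = list(self._fetch())
        return iter(self._cache)


def fake_fetch():
    """Pretend network call that returns two target records."""
    return [
        {"target_chembl_id": "CHEMBL296", "target_type": "SINGLE PROTEIN"},
        {"target_chembl_id": "CHEMBL1907607", "target_type": "PROTEIN COMPLEX GROUP"},
    ]


query = LazyQuery(fake_fetch)
print(query.n_requests)      # no request has been made yet
records = list(query)        # iterating triggers the single fetch
records_again = list(query)  # served from the cache, no new request
print(query.n_requests, len(records))
```

In the notebook, passing the QuerySet to pandas.DataFrame.from_records plays the role of `list(query)` here.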

##### Select ChEMBL ID (select target)

#### Get bioactivity data
Query bioactivity data for the target of interest
##### Fetch bioactivity data for the target from ChEMBL
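Fetch the bioactivity data and filter it to only consider
* the selected target (the target ChEMBL ID determines the organism; the query applies no separate organism filter)
* bioactivity type: IC50, Ki, or another type
* exact measurements (relation '='), and
* binding data (assay type 'B')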

In [2064]:
target = targets.iloc[0]
chembl_id = target.target_chembl_id
print("Target organism", target.organism)
print("Target name", target.pref_name)
print("Target type",target.target_type)
print("The target ChEMBL ID is", chembl_id)

Target organism Rattus norvegicus
Target name GABA receptor gamma-1 subunit
Target type SINGLE PROTEIN
The target ChEMBL ID is CHEMBL296


In [2049]:
bioassay_type = 'Ki'
#bioassay_type = 'IC50' 
#bioassay_type = 'EC50' 

In [2065]:
bioactivities = bioactivities_api.filter(
    target_chembl_id=chembl_id, type=bioassay_type, relation="=", assay_type="B"
).only(
    "activity_id",
    "assay_chembl_id",
    "assay_description",
    "assay_type",
    "molecule_chembl_id",
    "type",
    "standard_units",
    "relation",
    "standard_value",
    "target_chembl_id",
    "target_organism",
)

print(f"Length and type of bioactivities object: {len(bioactivities)}, {type(bioactivities)}")

Length and type of bioactivities object: 0, <class 'chembl_webresource_client.query_set.QuerySet'>


In [887]:
#bioactivities[0]

##### Download bioactivity data from ChEMBL

In [2007]:
bioactivities_df = pd.DataFrame.from_records(bioactivities)
print(f"DataFrame shape: {bioactivities_df.shape}")
bioactivities_df.head()

DataFrame shape: (415, 13)


Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,molecule_chembl_id,relation,standard_units,standard_value,target_chembl_id,target_organism,type,units,value
0,816955,CHEMBL680331,Displacement of [3H]-Ro-15-1788 from human GAB...,B,CHEMBL45198,=,nM,33.3,CHEMBL5112,Homo sapiens,Ki,nM,33.3
1,816955,CHEMBL680331,Displacement of [3H]-Ro-15-1788 from human GAB...,B,CHEMBL45198,=,nM,33.3,CHEMBL5112,Homo sapiens,Ki,nM,33.3
2,2401385,CHEMBL942244,Displacement of [3H]Ro-151788 from benzodiazep...,B,CHEMBL6597,=,nM,3.1,CHEMBL5112,Homo sapiens,Ki,nM,3.1
3,2461400,CHEMBL991887,Binding affinity to GABAA alpha-5-beta-2-gamma...,B,CHEMBL471115,=,nM,369.0,CHEMBL5112,Homo sapiens,Ki,nM,369.0
4,2461401,CHEMBL991887,Binding affinity to GABAA alpha-5-beta-2-gamma...,B,CHEMBL458321,=,nM,182.0,CHEMBL5112,Homo sapiens,Ki,nM,182.0


The DataFrame contains both "standard_units"/"units" and "standard_value"/"value" columns. Only the standardized columns are used below; the two non-standardized columns ("units" and "value") can be dropped.

##### Preprocess and filter bioactivity data
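1. Convert the datatype of standard_value from object to float
2. Delete entries with missing values
3. Keep only entries with standard_units == nM
4. Average standard_value over repeated measurements of the same molecule and keep one row per molecule
5. Reset the DataFrame index
6. Store the per-molecule mean in a new mean_standard_value column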
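
The steps above can be sketched on a small toy table. The molecule IDs and values are made up for illustration; only the column names follow the ChEMBL query. Like `preprocess_bio_df`, the sketch averages repeated measurements per molecule rather than keeping only the first value.

```python
import pandas as pd

# Toy records (made-up values) with the same column names as the ChEMBL query
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B", "CHEMBL_C", "CHEMBL_D"],
        "standard_value": ["10.0", "30.0", "5.0", None, "7.0"],
        "standard_units": ["nM", "nM", "nM", "nM", "ug.mL-1"],
    }
)

# 1. Convert standard_value from object to float
toy["standard_value"] = pd.to_numeric(toy["standard_value"])
# 2. Delete entries with missing values (drops CHEMBL_C)
toy = toy.dropna(axis=0, how="any")
# 3. Keep only entries measured in nM (drops CHEMBL_D)
toy = toy[toy["standard_units"] == "nM"]
# 4. Average repeated measurements, leaving one row per molecule
mean_values = (
    toy.groupby("molecule_chembl_id", as_index=False)["standard_value"]
    .mean()
    .rename(columns={"standard_value": "mean_standard_value"})
)
# 5. Reset the index
mean_values = mean_values.reset_index(drop=True)
print(mean_values)
```

Averaging in nM gives an arithmetic mean; whether a mean on the log scale would be more appropriate is a modeling choice this sketch does not address.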

In [409]:
def preprocess_bio_df(df):
    df_tmp1 = df.copy()
    df_tmp1 = df_tmp1.astype({"standard_value": "float64"})
    df_tmp1.dropna(axis=0, how="any", inplace=True)
    print(
    f"Number of non-nM entries:\
    {df_tmp1[df_tmp1['standard_units'] != 'nM'].shape[0]}"
)
    df_tmp1 = df_tmp1[df_tmp1["standard_units"] == "nM"]
    print(f"Units after filtering: {df_tmp1['standard_units'].unique()}")
    
    df_group = df_tmp1[["molecule_chembl_id","standard_value"]].groupby("molecule_chembl_id").mean()
    #df_group.reset_index(drop=True, inplace=True)
    
    #df_unique = df_tmp1[["molecule_chembl_id","standard_value","standard_units","type"]]
    df_tmp1.drop_duplicates("molecule_chembl_id", keep="first", inplace=True)
    
    df_tmp1.reset_index(drop=True, inplace=True)
    
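    # Merge the de-duplicated rows (df_tmp1) with the per-molecule means (df_group);
    # the shared "standard_value" column gets the suffixes _x (first value) and _y (mean)
    output_df = pd.merge(
        df_tmp1,
        df_group,
        on="molecule_chembl_id",
    )

    # Store the per-molecule mean under a descriptive name, then reset row indices
    output_df["mean_standard_value"] = output_df["standard_value_y"]
    output_df.reset_index(drop=True, inplace=True)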
    return output_df  

In [2008]:
bio_df_preprocessed = preprocess_bio_df(bioactivities_df)

print(bio_df_preprocessed.shape[0])
bio_df_preprocessed.head(2)

Number of non-nM entries:    0
Units after filtering: ['nM']
409


Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,molecule_chembl_id,relation,standard_units,standard_value_x,target_chembl_id,target_organism,type,units,value,standard_value_y,mean_standard_value
0,816955,CHEMBL680331,Displacement of [3H]-Ro-15-1788 from human GAB...,B,CHEMBL45198,=,nM,33.3,CHEMBL5112,Homo sapiens,Ki,nM,33.3,33.3,33.3
1,2401385,CHEMBL942244,Displacement of [3H]Ro-151788 from benzodiazep...,B,CHEMBL6597,=,nM,3.1,CHEMBL5112,Homo sapiens,Ki,nM,3.1,3.1,3.1


##### Fetch compound data from ChEMBL

Example of a molecule entry

In [None]:
# Get molecule by ChEMBL ID
#molecule = new_client.molecule.get('CHEMBL4084262')
#molecule

For the compounds which we have defined bioactivity data for, fetch the compound ChEMBL IDs, structures, and molecular properties

In [2009]:
compounds_provider = compounds_api.filter(
    molecule_chembl_id__in=list(bio_df_preprocessed["molecule_chembl_id"])
).only("molecule_chembl_id", "molecule_structures", "molecule_properties")

compounds = list(tqdm(compounds_provider))

compounds_df = pd.DataFrame.from_records(
    compounds,
)
print(f"DataFrame shape: {compounds_df.shape}")

# Remove entries with a missing molecule structure
compounds_df.dropna(axis=0, how="any", inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")

# Delete duplicate molecule
compounds_df.drop_duplicates("molecule_chembl_id", keep="first", inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")

100%|██████████| 409/409 [00:21<00:00, 19.33it/s]

DataFrame shape: (409, 3)
DataFrame shape: (409, 3)
DataFrame shape: (409, 3)





Each molecule comes with multiple structure representations. Only keep 'canonical_smiles'.

In [None]:
#compounds_df.iloc[0].molecule_structures.keys()

Each molecule comes with multiple molecule_properties. Only keep 'alogp' and 'full_mwt' (molecular weight).

In [None]:
#compounds_df.iloc[0].molecule_properties.keys()

In [2010]:
compounds_preprocessed = compounds_df.copy()

canonical_smiles = []
alogp = []
MW = []

for _, row in compounds_preprocessed.iterrows():
    # dict.get() returns None for a missing key, so the three lists always
    # stay the same length (the original except branch skipped MW)
    structures = row["molecule_structures"]
    properties = row["molecule_properties"]
    canonical_smiles.append(structures.get("canonical_smiles"))
    alogp.append(properties.get("alogp"))
    MW.append(properties.get("full_mwt"))

compounds_preprocessed["canonical_smiles"] =canonical_smiles
compounds_preprocessed["alogp"] =alogp
compounds_preprocessed["MW"] =MW
compounds_preprocessed = compounds_preprocessed.astype({"MW": "float64"})

compounds_preprocessed.drop(["molecule_structures","molecule_properties"], axis=1, inplace=True)
print(f"DataFrame shape: {compounds_preprocessed.shape}")

compounds_preprocessed.dropna(axis=0, how="any", inplace=True)
print(f"DataFrame shape: {compounds_preprocessed.shape}")

DataFrame shape: (409, 4)
DataFrame shape: (409, 4)


#### Output (bioactivity-compound) data

##### Summary of compound and bioactivity data

In [1507]:
print(f"Bioactivities filtered: {bio_df_preprocessed.shape[0]}")
bio_df_preprocessed.columns

Bioactivities filtered: 80


Index(['activity_id', 'assay_chembl_id', 'assay_description', 'assay_type',
       'molecule_chembl_id', 'relation', 'standard_units', 'standard_value_x',
       'target_chembl_id', 'target_organism', 'type', 'units', 'value',
       'standard_value_y', 'mean_standard_value'],
      dtype='object')

In [1508]:
print(f"Compounds filtered: {compounds_preprocessed.shape[0]}")
compounds_preprocessed.columns

Compounds filtered: 80


Index(['molecule_chembl_id', 'canonical_smiles', 'alogp', 'MW'], dtype='object')

##### Merge both datasets
Merge values of interest from 'bio_df_preprocessed' and 'compounds_preprocessed' into 'output_df' based on the compounds' ChEMBL IDs (molecule_chembl_id), keeping the following columns
* ChEMBL IDs: ('molecule_chembl_id')
* SMILES: ('canonical_smiles')
* Units: ('standard_units')
* Bioactivity data: mean IC50 or Ki ('mean_standard_value') and its type ('type')
* Molecular properties: ('alogp', 'MW')
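
A minimal sketch of this merge on toy tables (made-up IDs and values; column names follow the notebook). pd.merge performs an inner join by default, so a compound that appears in only one table is silently dropped. The `indicator=True` audit shown here is an optional check and is not part of the original workflow.

```python
import pandas as pd

# Toy tables (made-up values); column names follow the notebook
bio = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
        "mean_standard_value": [20.0, 5.0, 150.0],
        "standard_units": ["nM", "nM", "nM"],
    }
)
cmpds = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B"],
        "canonical_smiles": ["CCO", "c1ccccc1"],
    }
)

# The default inner join keeps only IDs present in both tables
merged = pd.merge(bio, cmpds, on="molecule_chembl_id")

# A left join with indicator=True adds a _merge column that shows
# which bioactivity rows have no matching compound record
audit = pd.merge(bio, cmpds, on="molecule_chembl_id", how="left", indicator=True)
dropped = audit.loc[audit["_merge"] == "left_only", "molecule_chembl_id"].tolist()
print(merged.shape[0], dropped)
```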

In [2011]:
# Merge DataFrames
output_df = pd.merge(
    bio_df_preprocessed[["molecule_chembl_id", "mean_standard_value", "standard_units","type"]],
    compounds_preprocessed,
    on="molecule_chembl_id",
)

# Keep positive activity values and compounds with MW <= 900, then reset row indices
output_df = output_df[output_df['mean_standard_value']>0]
output_df = output_df[output_df['MW']<=900]
output_df.reset_index(drop=True, inplace=True)

print(f"Dataset with {output_df.shape[0]} entries.")

Dataset with 409 entries.


##### Add pKi or pIC50 values
Raw IC50 and Ki values span several orders of magnitude and are hard to compare, so we convert them to a negative log scale: pKi = -log10(Ki in M), which for values given in nM equals 9 - log10(Ki in nM).
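
As a worked example of this conversion, the sketch below applies the same arithmetic as `convert_to_logp` to a few concentrations. The function name `nm_to_p` is hypothetical and chosen only for this illustration; the guard against non-positive values is an addition, since log10 is undefined there.

```python
import math


def nm_to_p(value_nm):
    """Convert a concentration in nM to -log10 of the molar concentration."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9 - math.log10(value_nm)


for ki_nm in (10000, 1000, 100, 3.1):
    print(f"{ki_nm} nM -> p-value {nm_to_p(ki_nm):.2f}")
```

A tenfold drop in Ki raises the p-value by exactly one unit, which is why the cut-offs later in the notebook are whole numbers.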

##### Save dataset

In [2012]:
# Apply conversion to each row of the DataFrame
output_df["p"+bioassay_type] = output_df["mean_standard_value"].apply(convert_to_logp)

output_df.to_csv(path+Name+"_"+uniprot_id+"_"+bioassay_type+".csv",index=False)

### Collect data for decoys

##### DUD-E (http://dude.docking.org/)
Mysinger MM, Carchia M, Irwin JJ, Shoichet BK. J. Med. Chem., 2012, Jul 5. doi: 10.1021/jm300687e.
* Convert to canonical SMILES using RDKit's Chem.MolToSmiles

In [3]:
from rdkit import Chem
#from rdkit.Chem import Draw

In [None]:
#mol = [Chem.MolFromSmiles(smi) for smi in smis]
#Draw.MolsToGridImage(mol)

In [None]:
def create_decoy_df(input_file):
    # Each non-empty line of the .ism file starts with the SMILES string;
    # any remaining whitespace-separated fields are ignored
    with open(input_file, "r") as f:
        smiles = [line.split()[0] for line in f if line.strip()]

    return pd.DataFrame({"SMILES": smiles})

In [None]:
DUDE_path = "/Users/hek/Research/Cheminformatics/DUDE database/"
Decoy_path = DUDE_path + "Decoys/"

target_list = pd.read_csv(DUDE_path+"DUDE Target list.csv")
targets = target_list["Target"]

* Convert each target's decoys_final.ism file to CSV

In [None]:
for target in targets:
    #print(DUDE_path+target+"/decoys_final.ism")
    input_file = DUDE_path+target+"/decoys_final.ism"
    
    decoy_df = create_decoy_df(input_file)
    #decoy_df["canonical smiles"] = [Chem.MolToSmiles(Chem.MolFromSmiles(smi),True) for smi in decoy_df["SMILES"]]
    decoy_df.to_csv(Decoy_path+target+".csv",index=False)

In [None]:
for target in targets[103:104]:
    print(target)
    #print(Decoy_path+target+".csv")
    input_file = Decoy_path+target+".csv"
    df = pd.read_csv(input_file)
    #print(df.shape[0])
    Can_smi = []
    for i in range(df.shape[0]):
        print(i)
        #print(df.loc[i,"SMILES"])
        smi = df.loc[i,"SMILES"]
        Can_smi.append(Chem.MolToSmiles(Chem.MolFromSmiles(smi),True))
    df["Canonical smiles"] = Can_smi
    
    df.to_csv(input_file,index=False)

* Merged decoy dataset

In [None]:
df_list = []
for target in targets:
    input_file = Decoy_path+target+".csv"
    df_list.append(pd.read_csv(input_file))
    
# Concatenate once, after all per-target files have been loaded
df_merge = pd.concat(df_list, axis=0)

In [None]:
df_merge.shape

In [None]:
df_merge.drop_duplicates("Canonical smiles", keep="first", inplace=True)
df_merge.shape

In [None]:
df_merge.to_csv(Decoy_path+"Merged DUD Decoys with canonical smiles.csv",index=False)

### Actives Acquisition
* Canonical SMILES using RDKit
* Physico-chemical properties: MW, xlogP, H-bond donors, H-bond acceptors, number of rotatable bonds (nrotb), net molecular charge
* Molecular fingerprint: Morgan (ECFP4-like, radius 2, 1024 bit)

In [4]:
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    PandasTools,
    Draw,
    Descriptors,
    MACCSkeys,
    rdFingerprintGenerator,
    AllChem
)

##### Check whether SMILES are canonicalized

In [277]:
def check_canonicalization(df):
    for i in range(df.shape[0]):
        smi = df.loc[i,"canonical_smiles"]
        rdksmi = Chem.MolToSmiles(Chem.MolFromSmiles(smi),True)
        assert smi == rdksmi

In [2013]:
#filename = "CB1_P47746_Ki.csv"
filename = Name+"_"+uniprot_id+"_"+bioassay_type+".csv"
print(filename)
df = pd.read_csv(path+filename)

check_canonicalization(df)

GABAalpha5_P31644_Ki.csv


##### Active labeling using pKi or pIC50
Cut-off values used:
* 5, equivalent to 10000 nM (or 10 μM)
* 6, equivalent to 1000 nM (or 1 μM)
* 7, equivalent to 100 nM (or 0.1 μM)
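
The correspondence between cut-offs and concentrations can be checked with a few lines of arithmetic. This is a sketch: `p_to_nm` and `label_active` are hypothetical helper names, not functions from the notebook. The example value of 33.3 nM is taken from the first row of the bioactivity table shown earlier.

```python
import math


def p_to_nm(p_value):
    """Concentration in nM that corresponds to a pKi/pIC50 value."""
    return 10 ** (9 - p_value)


def label_active(p_value, threshold):
    """Return 1.0 if the compound reaches the cut-off, 0.0 otherwise."""
    return 1.0 if p_value >= threshold else 0.0


for cutoff in (5, 6, 7):
    print(f"p >= {cutoff} corresponds to a value <= {p_to_nm(cutoff):g} nM")

# A compound with Ki = 33.3 nM passes the cut-off of 7 but not a cut-off of 8
p_ki = 9 - math.log10(33.3)
print(round(p_ki, 2), label_active(p_ki, 7), label_active(p_ki, 8))
```

Note the direction of the inequality: a higher p-value means a lower concentration, so a stricter cut-off keeps fewer, more potent compounds.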

In [1791]:
def get_actives(df,active_type,thres):
    # Add a binary activity label: 1.0 if the p-value is >= thres, 0.0 otherwise
    df["active"] = np.zeros(len(df))
    print(df.shape[0])
    df.loc[df[active_type] >= thres, "active"] = 1.0
    
    # Check output
    print("Number of active compounds:", int(df.active.sum()))
    #print("Number of inactive compounds:", len(df) - int(df.active.sum()))
    
    # Subset only the actives (the original row index is kept)
    df_active = df[df["active"] == 1.0]
    #print(df_active.shape[0])
    
    # Save the active subset
    df_active.to_csv(path+"Actives >= "+str(thres)+" "+filename,index=False)

In [2014]:
print(bioassay_type)
get_actives(df,"p"+bioassay_type,5.0)
get_actives(df,"p"+bioassay_type,6.0)
get_actives(df,"p"+bioassay_type,7.0)

Ki
409
Number of active compounds: 409
409
Number of active compounds: 405
409
Number of active compounds: 400


##### Check all active files are created

In [2067]:
df_list = pd.read_excel(path+"Target - Assays - Model list.xlsx")

In [2073]:
targets = df_list.Target
UniProtKB = df_list.UniProtKB
Act_type = df_list.Activity_type

In [2094]:
def check_active_files(df):
    count = 0
    for i in range(df.shape[0]):
        target = df.loc[i,"Target"]
        UniProtKB = df.loc[i,'UniProtKB']
        Act_type = df.loc[i,'Activity_type']
        filename = "Actives >= 7.0 "+target+'_'+UniProtKB+'_'+Act_type+'.csv'
        print(filename)
        df_tmp = pd.read_csv(path+filename)
        #print("nunique compounds", df_tmp.shape[0])
        #df_tmp.shape[0] == df.loc[i,'N_actives_p5']
        if df_tmp.shape[0] >= 50:
            count += 1
    print(count)

In [2095]:
check_active_files(df_list)

Actives >= 7.0 CB1_P21554_Ki.csv
Actives >= 7.0 CB1_P21554_EC50.csv
Actives >= 7.0 CB1_P21554_IC50.csv
Actives >= 7.0 CB1_P20272_Ki.csv
Actives >= 7.0 CB1_P20272_IC50.csv
Actives >= 7.0 CB1_P47746_Ki.csv
Actives >= 7.0 CB1_P47746_IC50.csv
Actives >= 7.0 CB2_P34972_Ki.csv
Actives >= 7.0 CB2_P34972_EC50.csv
Actives >= 7.0 CB2_P34972_IC50.csv
Actives >= 7.0 CB2_Q9QZN9_Ki.csv
Actives >= 7.0 CB2_P47936_Ki.csv
Actives >= 7.0 5HT1A_P08908_Ki.csv
Actives >= 7.0 5HT1A_P08908_EC50.csv
Actives >= 7.0 5HT1A_P08908_IC50.csv
Actives >= 7.0 5HT1A_P19327_Ki.csv
Actives >= 7.0 5HT1A_P19327_IC50.csv
Actives >= 7.0 5HT1B_P28222_Ki.csv
Actives >= 7.0 5HT1B_P28222_IC50.csv
Actives >= 7.0 5HT1B_P28564_IC50.csv
Actives >= 7.0 5HT1B_P28564_Ki.csv
Actives >= 7.0 5HT1D_P28221_Ki.csv
Actives >= 7.0 5HT1D_P28221_IC50.csv
Actives >= 7.0 5HT1F_P30939_Ki.csv
Actives >= 7.0 5HT2A_P28223_Ki.csv
Actives >= 7.0 5HT2A_P28223_IC50.csv
Actives >= 7.0 5HT2A_P28223_EC50.csv
Actives >= 7.0 5HT2C_P28335_Ki.csv
Actives >= 7.0 5

### Calculate molecular properties for the rule of five (Ro5)
* Molecular weight (MW) <= 500 Da
* Number of hydrogen bond acceptors (HBAs) <= 10
* Number of hydrogen bond donors (HBDs) <= 5
* Calculated LogP (octanol-water partition coefficient) <= 5
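
The four criteria can be turned into a small check. This is a sketch with hypothetical function names; the tolerance of at most one violation is a common convention and an assumption here, since the notebook only lists the criteria. The two example compounds use the property values from the first two rows of the table printed below.

```python
def ro5_violations(mw, n_hba, n_hbd, logp):
    """Count how many of the four rule-of-five criteria are violated."""
    checks = [mw <= 500, n_hba <= 10, n_hbd <= 5, logp <= 5]
    return checks.count(False)


def fulfills_ro5(mw, n_hba, n_hbd, logp, max_violations=1):
    """Assumed convention: tolerate at most max_violations violations."""
    return ro5_violations(mw, n_hba, n_hbd, logp) <= max_violations


# (MW, n_hba, n_hbd, alogp) of the first two compounds in the table
first = (456.93, 4, 2, 4.2)
second = (520.96, 4, 1, 5.13)
print(ro5_violations(*first), fulfills_ro5(*first))
print(ro5_violations(*second), fulfills_ro5(*second))
```

The second compound exceeds both the MW and the LogP limit, so it fails under the one-violation convention.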

In [286]:
filename = "Actives CB1_P21554_Ki.csv"
df = pd.read_csv(path+filename)

PandasTools.AddMoleculeColumnToFrame(df, smilesCol="canonical_smiles")
df["n_hba"] = df["ROMol"].apply(Descriptors.NumHAcceptors)
df["n_hbd"] = df["ROMol"].apply(Descriptors.NumHDonors)
df[["MW", "n_hba", "n_hbd", "alogp"]].head(2)

Unnamed: 0,MW,n_hba,n_hbd,alogp
0,456.93,4,2,4.2
1,520.96,4,1,5.13


In [None]:
#df.head(2)

In [287]:
df_Ro5_properties = df[["MW", "n_hba", "n_hbd", "alogp"]]
df_Ro5_properties.describe()

Unnamed: 0,MW,n_hba,n_hbd,alogp
count,2670.0,2670.0,2670.0,2670.0
mean,444.386674,4.101873,0.989513,5.337165
std,78.468886,1.57803,0.774671,1.364656
min,247.38,0.0,0.0,0.8
25%,387.52,3.0,0.0,4.41
50%,436.855,4.0,1.0,5.38
75%,488.015,5.0,1.0,6.27
max,858.48,11.0,5.0,11.45


##### Drop ROMol column and save dataset

In [None]:
#output_df = df.drop(["ROMol","maccs","morgan"], axis=1)
#print("DataFrame shape", output_df.shape)
#output_df.to_csv(path+filename, index=False)

### Inactives Acquisition
* Randomly sample 4 decoys for each active
* Do not accept a decoy whose Tanimoto similarity to the active is 0.9 or higher (checked on both MACCS and Morgan fingerprints)
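
The sampling rule can be sketched without RDKit by representing each fingerprint as a Python set of on-bit indices, which is an assumption made here to keep the example dependency-free. Unlike the `sample_decoys` function below, which uses `assert` and therefore stops on the first decoy that is too similar, this hypothetical `sample_dissimilar` helper skips such decoys and keeps drawing.

```python
import random


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def sample_dissimilar(active_fp, decoy_fps, n_decoys, max_sim=0.9, seed=0):
    """Pick up to n_decoys random decoy indices with similarity below max_sim."""
    rng = random.Random(seed)
    candidates = list(range(len(decoy_fps)))
    rng.shuffle(candidates)  # random order, without replacement
    chosen = []
    for idx in candidates:
        if tanimoto(active_fp, decoy_fps[idx]) < max_sim:
            chosen.append(idx)
        if len(chosen) == n_decoys:
            break
    return chosen


# Made-up fingerprints: decoy 0 is identical to the active and must be rejected
active = {1, 5, 9, 12}
decoys = [{1, 5, 9, 12}, {2, 3}, {1, 7}, {5, 20, 30}, {40}, {1, 5, 9, 12, 13}]
picked = sample_dissimilar(active, decoys, n_decoys=4)
print(picked)
```

With RDKit bit vectors, DataStructs.TanimotoSimilarity would take the place of the `tanimoto` helper.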

In [77]:
from random import sample

In [78]:
path = '/Users/hek/Research/Cheminformatics/Project_1_NPS/Stimulant vs. Hallucinogens/Dataset/Bioassay data/ChEMBL data/'
output_path = '/Users/hek/Research/Cheminformatics/Project_1_NPS/Stimulant vs. Hallucinogens/ChEMBL Dataset ML results/'

In [79]:
DUDE_path = "/Users/hek/Research/Cheminformatics/DUDE database/"
Decoy_path = DUDE_path + "Decoys/"
df_decoy = pd.read_csv(Decoy_path+"Merged DUD Decoys with canonical smiles.csv")
decoy_index = df_decoy.shape[0]

In [102]:
df_list = pd.read_excel(path+"Target - Assays - Model list.xlsx")

In [81]:
def sample_decoys(df_active):
    print(df_active.shape[0])
    decoy_smi_list = []
    df_inactive = pd.DataFrame()
    for active_smi in df_active['canonical_smiles']:
        m_active = Chem.MolFromSmiles(active_smi)
        fp_maccs_active = MACCSkeys.GenMACCSKeys(m_active)
        fp_morgan_active = AllChem.GetMorganFingerprintAsBitVect(m_active, 2, nBits=1024)
        # Random sample 4 decoys without replacement
        random_4 = sample(range(decoy_index),4)
        for i in random_4:
            decoy_smi = df_decoy.iloc[i, 1]  # column 1 holds the canonical SMILES
            #print(decoy_smi)
            m_decoy = Chem.MolFromSmiles(decoy_smi)
            fp_maccs_decoy = MACCSkeys.GenMACCSKeys(m_decoy)
            fp_morgan_decoy = AllChem.GetMorganFingerprintAsBitVect(m_decoy, 2, nBits=1024)
            
            assert DataStructs.TanimotoSimilarity(fp_maccs_active, fp_maccs_decoy)< 0.9
            assert DataStructs.TanimotoSimilarity(fp_morgan_active, fp_morgan_decoy)< 0.9
            
            decoy_smi_list.append(decoy_smi)
    print(len(decoy_smi_list))
    df_inactive["canonical_smiles"] = decoy_smi_list
    
    return df_inactive

In [82]:
def merge_final_dataset(df_active, df_inactive):
    df_inactive["active"] = np.zeros(len(df_inactive))
    df_active=df_active[["canonical_smiles","active"]]
    print(df_active.shape)
    df_inactive=df_inactive[["canonical_smiles","active"]]
    print(df_inactive.shape)
    df_merge = pd.concat([df_active, df_inactive], axis=0)
    print(df_merge.shape)
    
    return df_merge

In [91]:
count = 1 
for i in range(df_list.shape[0]):
#for i in range(2):
    print("Assembling dataset", count)
    target = df_list.loc[i,"Target"]
    UniProtKB = df_list.loc[i,'UniProtKB']
    Act_type = df_list.loc[i,'Activity_type']
    filename = "Actives >= 5.0 "+target+'_'+UniProtKB+'_'+Act_type+'.csv'
    #print(filename)
    df_active = pd.read_csv(path+filename)
    df_list.loc[i,"N_actives_p5"] = df_active.shape[0]
    if df_active.shape[0] >= 50:
        df_inactive = sample_decoys(df_active)
        df_final = merge_final_dataset(df_active, df_inactive)
        
        #Calculate bit vector maccsfp and morganfp fingerprints 
        PandasTools.AddMoleculeColumnToFrame(df_final, smilesCol="canonical_smiles")
        df_final["maccs"] = df_final.ROMol.apply(MACCSkeys.GenMACCSKeys)
        df_final["maccsfp"] = df_final["maccs"].apply(lambda x: x.ToBitString()[1:])

        df_final["morgan"] = df_final.ROMol.apply(lambda x: AllChem.GetMorganFingerprintAsBitVect(x,2, nBits=1024))
        df_final["morganfp"] = df_final["morgan"].apply(lambda x: x.ToBitString())
        
        #save final dataset
        df_final_output = df_final.drop(["ROMol","maccs","morgan"], axis=1)
        df_final_output.to_csv(path+"Final dataset "+filename, index=False)
        count += 1
# count started at 1, so count - 1 datasets were actually written
n_datasets = count - 1
print(n_datasets)
print("A total of", n_datasets, "final model training datasets are prepared!")

Assembling dataset 1
2670
10680
(2670, 2)
(10680, 2)
(13350, 2)
Assembling dataset 2
631


KeyboardInterrupt: 

In [89]:
df_list.to_excel(path+"Target - Assays - Model list.xlsx",index=False)

##### Inactives for NPS set

In [2171]:
filename = "Drugs Raman or SERS in literature _ Paper 1.csv"
output_filename = "Decoys Drugs Raman or SERS in literature _ Paper 1.csv"
df_active = pd.read_csv(output_path+filename)
print("Number of NPS compounds", df_active.shape[0])
df_active.head(2)

Number of NPS compounds 189


Unnamed: 0,Name,Other name,Formula,MW,CAS,PubChem CID,RotBondCount,Conformers,Canonical SMILES,Pharm class,Pharm target,Pharm class label,Chem core,InChI Key,StdInChI,canonical_smiles,maccsfp,morganfp
0,Heroin,,C21H23NO5,369.4,561-27-3,5462328,4,9.0,CC(=O)OC1C=CC2C3CC4=C5C2(C1OC5=C(C=C4)OC(=O)C)...,Sedatives,μ-opioid receptor,1,Alkaloid,GVGLGOZIDCSQPN-PVHGPHFFSA-N,InChI=1S/C21H23NO5/c1-11(23)25-16-6-4-13-10-15...,CC(=O)Oc1ccc2c3c1OC1C(OC(C)=O)C=CC4C(C2)N(C)CC...,0000000000000000000000000000000000000000000000...,0000000000010001000000000000000001001000000000...
1,Morphine,,C17H19NO3,285.34,57-27-2,5288826,0,1.0,CN1CCC23C4C1CC5=C2C(=C(C=C5)O)OC3C(C=C4)O,Sedatives,μ-opioid receptor,1,Alkaloid,BQJCRHHNABKAKU-KBQPJGBKSA-N,InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)1...,CN1CCC23c4c5ccc(O)c4OC2C(O)C=CC3C1C5,0000000000000000000000000000000000000000000000...,0000000000000001000000000000000001001000000000...


In [2148]:
# Convert "Canonical SMILES" to standardized "canonical_smiles" using RDKit
df_active['canonical_smiles'] = [Chem.MolToSmiles(Chem.MolFromSmiles(smi),True) for smi in df_active["Canonical SMILES"]]

In [2149]:
# Also random sample 4 decoys for each NPS compound
df_inactive = sample_decoys(df_active)
print(df_inactive.shape[0])

189
756
756


In [2153]:
# Calculate bit vector maccs and morgan fingerprint for both NPS and decoys
PandasTools.AddMoleculeColumnToFrame(df_active, smilesCol="canonical_smiles")
df_active["maccs"] = df_active.ROMol.apply(MACCSkeys.GenMACCSKeys)
df_active["maccsfp"] = df_active["maccs"].apply(lambda x: x.ToBitString()[1:])

df_active["morgan"] = df_active.ROMol.apply(lambda x: AllChem.GetMorganFingerprintAsBitVect(x,2, nBits=1024))
df_active["morganfp"] = df_active["morgan"].apply(lambda x: x.ToBitString())

In [2167]:
df_active_output = df_active.drop(["ROMol","maccs","morgan"], axis=1)

df_active_output.to_csv(output_path+filename,index=False)

In [2166]:
df_active_output.head(2)

Unnamed: 0,Name,Other name,Formula,MW,CAS,PubChem CID,RotBondCount,Conformers,Canonical SMILES,Pharm class,Pharm target,Pharm class label,Chem core,InChI Key,StdInChI,canonical_smiles,maccsfp,morganfp
0,Heroin,,C21H23NO5,369.4,561-27-3,5462328,4,9.0,CC(=O)OC1C=CC2C3CC4=C5C2(C1OC5=C(C=C4)OC(=O)C)...,Sedatives,μ-opioid receptor,1,Alkaloid,GVGLGOZIDCSQPN-PVHGPHFFSA-N,InChI=1S/C21H23NO5/c1-11(23)25-16-6-4-13-10-15...,CC(=O)Oc1ccc2c3c1OC1C(OC(C)=O)C=CC4C(C2)N(C)CC...,0000000000000000000000000000000000000000000000...,0000000000010001000000000000000001001000000000...
1,Morphine,,C17H19NO3,285.34,57-27-2,5288826,0,1.0,CN1CCC23C4C1CC5=C2C(=C(C=C5)O)OC3C(C=C4)O,Sedatives,μ-opioid receptor,1,Alkaloid,BQJCRHHNABKAKU-KBQPJGBKSA-N,InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)1...,CN1CCC23c4c5ccc(O)c4OC2C(O)C=CC3C1C5,0000000000000000000000000000000000000000000000...,0000000000000001000000000000000001001000000000...


In [2168]:
PandasTools.AddMoleculeColumnToFrame(df_inactive, smilesCol="canonical_smiles")
df_inactive["maccs"] = df_inactive.ROMol.apply(MACCSkeys.GenMACCSKeys)
df_inactive["maccsfp"] = df_inactive["maccs"].apply(lambda x: x.ToBitString()[1:])

df_inactive["morgan"] = df_inactive.ROMol.apply(lambda x: AllChem.GetMorganFingerprintAsBitVect(x,2, nBits=1024))
df_inactive["morganfp"] = df_inactive["morgan"].apply(lambda x: x.ToBitString())

In [2172]:
df_inactive_output = df_inactive.drop(["ROMol","maccs","morgan"], axis=1)
df_inactive_output.to_csv(output_path+output_filename, index=False)

In [2173]:
df_inactive_output.head(2)

Unnamed: 0,canonical_smiles,maccsfp,morganfp
0,CN1C(=O)N[C@H](c2ccc([N+](=O)[O-])cc2)[C@@H](C...,0000000000000000000000010000000000011000010000...,0000000000000010000000000000000001001000000000...
1,C[C@H](NC(=O)Nc1ccc(NC(=O)c2ccco2)cc1C(F)(F)F)...,0000000000000000000000000000000000001000011000...,0100000000000000010000000000000001000000001000...
