# Descriptor Calculation for Reference Sets

This notebook precalculates and saves a set of descriptors for the three reference sets listed below.

All data sets were standardized and deduplicated (by the InChIKeys of the parent structures) using the `stand_struct.py` script from the `python_scripts/` folder.  
The actual external data sets are not included in this repository.
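
As a rough illustration of the deduplication step, the sketch below keeps one record per InChIKey. It is not the code from `stand_struct.py`: it assumes the InChIKey of each parent structure has already been computed and stored in a hypothetical `InChIKey` field, and that the first record seen for each key is the one to keep. The `Compound_Id` field and the key values are placeholders.

```python
from typing import Dict, Iterable, List


def dedup_by_inchikey(
    records: Iterable[Dict[str, str]], key: str = "InChIKey"
) -> List[Dict[str, str]]:
    """Keep the first record seen for each InChIKey, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        ik = rec[key]
        if ik in seen:
            continue  # a record with the same parent structure was already kept
        seen.add(ik)
        unique.append(rec)
    return unique


# Hypothetical records; the InChIKey values are placeholders, not real keys.
records = [
    {"Compound_Id": "A", "InChIKey": "KEY-1"},
    {"Compound_Id": "B", "InChIKey": "KEY-2"},
    {"Compound_Id": "C", "InChIKey": "KEY-1"},  # same parent structure as A
]
unique_records = dedup_by_inchikey(records)
print([r["Compound_Id"] for r in unique_records])
```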

### Enamine Advanced Screening

The Enamine Advanced Screening collection was downloaded from the Enamine home page on 04-Dec-2020 ([link](https://enamine.net/compound-collections/screening-collection/advanced-collection)).  
After standardization and deduplication, it contained 526897 compounds.  

`$ stand_struct enamine_adv.sdf full` 

From the Enamine data set, a random subset of 50000 structures was drawn with a fixed random seed, after sorting by `NumHA` (the number of heavy atoms).
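
One plausible reason to sort before sampling is that a seeded draw selects rows by position, so fixing the row order first makes the subset independent of the order in the input file. The sketch below illustrates this with the standard library rather than pandas. The `Compound_Id` tie-breaker and the toy records are assumptions made for the example; the notebook itself sorts by `NumHA` only.

```python
import random
from typing import Dict, List


def sample_sorted(records: List[Dict], k: int, seed: int) -> List[Dict]:
    """Sort by heavy-atom count (ties broken by ID), then draw k records with a fixed seed."""
    ordered = sorted(records, key=lambda r: (-r["NumHA"], r["Compound_Id"]))
    rng = random.Random(seed)
    return rng.sample(ordered, k)


# Toy data: 100 hypothetical compounds with made-up heavy-atom counts.
records = [{"Compound_Id": f"C{i:03d}", "NumHA": 10 + i % 7} for i in range(100)]

# The same records in a different input order.
shuffled = records[:]
random.Random(1).shuffle(shuffled)

subset_a = sample_sorted(records, k=10, seed=0xC0FFEE)
subset_b = sample_sorted(shuffled, k=10, seed=0xC0FFEE)
print(subset_a == subset_b)  # same subset regardless of input order
```

Without the tie-breaker, records with equal `NumHA` would keep their input order after sorting, and the draw could then depend on the file order again.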

### DrugBank Approved and Investigational

The subsets of approved and investigational drugs were downloaded from DrugBank (v5.1.8, [link](https://go.drugbank.com/releases)).  
The two files were standardized and deduplicated into one data set:

`$ stand_struct drugbank_5_1_8_appr.sdf,drugbank_5_1_8_inv.sdf full`

The resulting data set contained 4866 entries.

### ChEMBL Natural Products (ChEMBL NPs)

The SDF version of ChEMBL v30 ([link](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_30.sdf.gz)) was downloaded and standardized (with deduplication, without canonicalization):

`$ stand_struct chembl_30.sdf.gz full --nocanon`

The generated file `chembl_v30_full_nocanon.tsv` contained 2027851 entries.

The list of ChEMBL_IDs for the ChEMBL Natural Products set was extracted by applying the `extract_nps_from_sqlite.py` script to the downloaded SQLite version of ChEMBL v30 ([link](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_30_sqlite.tar.gz)). The script then merged in the SMILES from the `chembl_v30_full_nocanon.tsv` file and applied a deglycosylation.

The final ChEMBL NP data set contained 45679 entries.
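
The merge of the extracted ID list with the standardized structures can be pictured as a lookup by ID. The following is a minimal sketch of that idea only, not the code of `extract_nps_from_sqlite.py`. The column names `Chembl_Id` and `Smiles`, the IDs, and the inline TSV content are hypothetical placeholders, and the deglycosylation step is not shown.

```python
import csv
import io

# Hypothetical, tiny stand-ins for the two inputs (not real ChEMBL records).
np_ids_tsv = "Chembl_Id\nCHEMBL_A\nCHEMBL_C\nCHEMBL_X\n"
structures_tsv = (
    "Chembl_Id\tSmiles\n"
    "CHEMBL_A\tCCO\n"
    "CHEMBL_B\tc1ccccc1\n"
    "CHEMBL_C\tCC(=O)O\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Build an ID -> SMILES lookup from the standardized structure table.
smiles_by_id = {row["Chembl_Id"]: row["Smiles"] for row in read_tsv(structures_tsv)}

# Keep only natural-product IDs for which a standardized structure exists.
merged = [
    {"Chembl_Id": row["Chembl_Id"], "Smiles": smiles_by_id[row["Chembl_Id"]]}
    for row in read_tsv(np_ids_tsv)
    if row["Chembl_Id"] in smiles_by_id
]
print(merged)
```

In this sketch, IDs without a matching structure (here `CHEMBL_X`) are dropped silently; whether the actual script does the same is not stated above.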

In [None]:
%reload_ext autoreload
%autoreload 2
def warn(*args, **kwargs):
    pass  # to silence scikit-learn warnings

import warnings
warnings.filterwarnings('ignore')
warnings.warn = warn

# Type hints
from typing import Iterable, List, Set, Dict, Union, Optional

# Global Imports
# from collections import Counter
# import glob

import pandas as pd
# import numpy as np
# from scipy.stats import median_absolute_deviation as mad

from rdkit import DataStructs
from rdkit.Chem import AllChem as Chem, QED
from rdkit.Chem import Descriptors as Desc
from rdkit.Chem import rdMolDescriptors as rdMolDesc
from rdkit.Chem import Fragments
# from rdkit.Chem import Draw
# from rdkit.Chem.Draw import IPythonConsole

from Contrib.NP_Score import npscorer

# from mol_frame import mol_frame as mf
# from cellpainting3 import processing as cpp, tools as cpt

# import matplotlib.pyplot as plt
# import seaborn as sns
from tqdm.notebook import tqdm
tqdm.pandas()

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Project-local Imports
from jupy_tools import plt_style
from jupy_tools import utils as u, mol_view as mv
from jupy_tools.utils import info

## Configuration

In [None]:
DATA_EN = "enamine_adv_full"
DATA_DB = "drugbank_5.1.8_appr_inv_full"
DATA_NP = "chembl_30_np_full_nocanon_deglyco"

fscore = npscorer.readNPModel()
def score_np(mol):
    return npscorer.scoreMol(mol, fscore)

## Load Data Sets

In [None]:
# Enamine Screening Library
# After calculating NumHA, the subset is generated
df_en = u.read_tsv(f"/home/pahl/comas/notebooks/sdf/enamine/{DATA_EN}.tsv")

# DrugBank Approved and Investigational Drugs
df_db = u.read_tsv(f"/home/pahl/comas/notebooks/sdf/drugbank/{DATA_DB}.tsv")

# NPs from ChEMBL 30
df_np = u.read_tsv(f"/home/pahl/comas/notebooks/sdf/chembl30/{DATA_NP}.tsv")

## Add Descriptor Data

In [None]:
descriptors = {
    "NP_Like": lambda x: round(score_np(x), 2), 
    "QED": lambda x: round(QED.default(x), 3),
    "NumHA": lambda x: x.GetNumAtoms(),
    "MW": lambda x: round(Desc.ExactMolWt(x), 2),  # exact (monoisotopic) mass, not the average molecular weight
    "NumRings": rdMolDesc.CalcNumRings,
    "NumRingsArom": rdMolDesc.CalcNumAromaticRings,
    "NumRingsAli": rdMolDesc.CalcNumAliphaticRings,
    "NumHDon": rdMolDesc.CalcNumLipinskiHBD,
    "NumHAcc": rdMolDesc.CalcNumLipinskiHBA,
    "LogP": lambda x: round(Desc.MolLogP(x), 2),
    "TPSA": lambda x: round(rdMolDesc.CalcTPSA(x), 2),
    "NumRotBd": rdMolDesc.CalcNumRotatableBonds,
    "NumAtOx": lambda x: len(
        [a for a in x.GetAtoms() if a.GetAtomicNum() == 8]
    ),
    "NumAtN": lambda x: len(
        [a for a in x.GetAtoms() if a.GetAtomicNum() == 7]
    ),
    "NumAtHal": Fragments.fr_halogen,
    "NumAtBridgehead": rdMolDesc.CalcNumBridgeheadAtoms,
    "FCsp3": lambda x: round(rdMolDesc.CalcFractionCSP3(x), 3), 
}

In [None]:
# Enamine
desc = "NumHA"
df_en = u.calc_from_smiles(df_en, desc, descriptors[desc])
df_en = df_en.sort_values(by=desc, ascending=False)
df_en = df_en.sample(n=50000, random_state=0xc0ffee)

for desc in descriptors:
    if desc == "NumHA":
        continue  # was already calculated above
    print(f"{desc:15}: ")
    df_en = u.calc_from_smiles(df_en, desc, descriptors[desc])

In [None]:
# DrugBank
for desc in descriptors:
    print(f"{desc:15}: ")
    df_db = u.calc_from_smiles(df_db, desc, descriptors[desc])

In [None]:
# ChEMBL NPs
for desc in descriptors:
    print(f"{desc:15}: ")
    df_np = u.calc_from_smiles(df_np, desc, descriptors[desc])

The following files are not included in the repository:

In [None]:
u.write_tsv(df_en, f"input/{DATA_EN}_sample_50k_desc.tsv")  # 50000 entries
u.write_tsv(df_db, f"input/{DATA_DB}_desc.tsv")  # 4866 entries
u.write_tsv(df_np, f"input/{DATA_NP}_desc.tsv")  # 45679 entries