# Assembling a Few-Shot Learning Dataset of Molecules from ChEMBL

This notebook describes the procedure used to extract the final dataset, which consists of four key steps:

1. Query ChEMBL to obtain the initial raw data.
2. Clean the data to ensure good quality, and threshold it to derive binary classification labels.
3. Select which assays are used for pretraining and which are held out as few-shot test and validation tasks.
4. Featurize the data to prepare suitable input for a range of models.

## Setup

Extracting the dataset requires access to a MySQL database server holding the ChEMBL dataset. You can download the data and find setup instructions at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
You will then need to update `fs_mol/preprocessing/utils/config.ini` with the connection information for your MySQL server.
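As a rough illustration of what such a configuration might look like, the sketch below parses an INI file with Python's `configparser`. The section name `mysql` is the one that `read_db_config` reports as missing in the traceback in section 1; the key names (`host`, `database`, `user`, `password`) and all values are assumptions made for this example, so check `fs_mol/preprocessing/utils/db_utils.py` for the keys actually expected.

```python
import configparser

# Hypothetical contents of fs_mol/preprocessing/utils/config.ini.
# Only the [mysql] section name comes from this notebook (see the traceback
# in section 1); the key names and values below are placeholders.
EXAMPLE_CONFIG = """
[mysql]
host = localhost
database = chembl_27
user = chembl_reader
password = change-me
"""


def read_mysql_section(text: str) -> dict:
    """Parse INI text and return the [mysql] section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("mysql"):
        raise KeyError("mysql section not found in the configuration")
    return dict(parser.items("mysql"))


db_config = read_mysql_section(EXAMPLE_CONFIG)
print(sorted(db_config))
```

If the `[mysql]` section is missing, this sketch fails early, which mirrors the kind of error raised by the real script.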

Finally, we need a small amount of setup to run this notebook successfully:

In [5]:
import os
import sys

# This should be the location of the checkout of the FS-Mol repository:
FS_MOL_CHECKOUT_PATH = os.path.join(os.environ['HOME'], "Projects", "FS-Mol")
# This should be where the result of the data extraction will be stored, requiring roughly TODO of space
FS_MOL_RESULT_PATH = "/tmp/fs_mol"

os.chdir(FS_MOL_CHECKOUT_PATH)
sys.path.insert(0, FS_MOL_CHECKOUT_PATH)
os.makedirs(FS_MOL_RESULT_PATH, exist_ok=True)  # do not fail if the directory already exists

## 1. Querying ChEMBL

We query a SQL instance of the full ChEMBL database to obtain the raw data.
This is implemented by the script `fs_mol/preprocessing/query.py`, which takes a list of candidate assays to consider (the list used for the released dataset is stored in `TODO`) and creates one `.csv` file for each assay, using a range of fields detailed in `fs_mol/preprocessing/utils/queries.py`.

The query falls back through several options, because not all entries in ChEMBL have complete protein target information. When no protein target information is available, we query for other information that may be suitable for characterizing the assay, such as the target cell type or tissue.

In [7]:
! python fs_mol/preprocessing/query.py --save-dir fs_mol/raw_data --assay-list-file TODO

Traceback (most recent call last):
  File "fs_mol/preprocessing/query.py", line 258, in <module>
    run(args)
  File "fs_mol/preprocessing/query.py", line 148, in run
    db_config = read_db_config()
  File "/home/mabrocks/Projects/FS-Mol/fs_mol/preprocessing/utils/db_utils.py", line 30, in read_db_config
    raise Exception(f"{section} not found in the {filename} file")
Exception: mysql not found in the config.ini file


As a result of this raw data extraction, we obtain 36,093 separate raw assay files in `.csv` format:

In [8]:
! find fs_mol/raw_data -name "*.csv" | wc -l



### Initial List of Assays
Our initial query of ChEMBL (release CHEMBL27) selects only those assays that contain more than 32 measurements. We record the assay ids and confidence scores, where the confidence score reflects the level of information about the target protein in the assay: '9' is a known single protein target, while '0' is completely unknown and could, for instance, be as broad as an entire tissue.
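The size criterion can be sketched in a few lines. This is only an illustration on toy data, not the SQL used by `initial_query.py`; the function name and the toy assay ids are made up for the example.

```python
from collections import Counter

MIN_MEASUREMENTS = 32  # assays need strictly more than this many datapoints


def select_assays(measurement_assay_ids, min_measurements=MIN_MEASUREMENTS):
    """Return the ids of assays with more than `min_measurements` datapoints.

    `measurement_assay_ids` holds one assay id per measurement.
    """
    counts = Counter(measurement_assay_ids)
    return sorted(a for a, n in counts.items() if n > min_measurements)


# Toy data: three assays with 40, 32 and 33 measurements respectively.
toy_measurements = ["ASSAY_A"] * 40 + ["ASSAY_B"] * 32 + ["ASSAY_C"] * 33
selected = select_assays(toy_measurements)
print(selected)
```

Note that an assay with exactly 32 measurements is dropped under this reading of "more than 32".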

To regenerate this list (after changing criteria, for example), you can run `fs_mol/preprocessing/initial_query.py`:

In [10]:
! TODO

/bin/bash: TODO: command not found



## 2. Cleaning

The cleaning procedure takes place in three key stages, detailed in `fs_mol/preprocessing/clean.py`:

1. Assays are selected to proceed to the next stage only if they reflect activity or inhibition measurements with units of "%", "uM" or "nM".
2. SMILES are standardized, and XC50 (IC50 or EC50) measurements are converted to -log10([C]/NM) prior to thresholding. This step also de-duplicates measurements where applicable.
3. A final (optional) thresholding step is applied.

### Standardization

The standardization procedure for SMILES is as follows: 

- Remove salts
- Disconnect any metallo-organic complexes
- Make certain the correct ion is present
- Choose the largest fragment if the SMILES string represents disconnected components
- Remove excess charges
- Choose the canonical tautomer

After this procedure, molecules are rejected if they have a molecular weight > 900 Da, and exact SMILES-value duplicate pairs are dropped within an assay. 

**De-duplication** of SMILES then accepts a degree of variation in the measured value for the same SMILES -- if a SMILES value is repeated in a dataframe, we accept measurements where the standard value measured is within the same order of magnitude, to fairly capture measurement noise. We reject all measurements for that SMILES if that is not the case. While this may reject stereoisomers with profoundly different chemical behaviors, we wish to remove erroneous measurements of other molecules. 
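A minimal sketch of this de-duplication rule is given below. It makes two assumptions that are not stated above: "within the same order of magnitude" is read as the largest and smallest replicate differing by less than a factor of 10, and accepted replicates are collapsed to their mean. The function and parameter names are hypothetical, and the actual logic in `clean.py` may differ.

```python
import math
from collections import defaultdict


def deduplicate(measurements, max_log_spread=1.0):
    """Collapse repeated SMILES within one assay.

    `measurements` is an iterable of (smiles, standard_value) pairs with
    positive values. A SMILES is kept only if all of its values span less
    than `max_log_spread` orders of magnitude (assumption); kept replicates
    are averaged (assumption). Otherwise all its measurements are rejected.
    """
    grouped = defaultdict(list)
    for smiles, value in measurements:
        grouped[smiles].append(value)

    kept = {}
    for smiles, values in grouped.items():
        spread = math.log10(max(values)) - math.log10(min(values))
        if spread < max_log_spread:
            kept[smiles] = sum(values) / len(values)
    return kept


toy = [
    ("CCO", 100.0), ("CCO", 300.0),            # consistent replicates: kept
    ("c1ccccc1", 10.0), ("c1ccccc1", 5000.0),  # inconsistent: rejected
    ("CC", 42.0),                              # single measurement: kept
]
deduplicated = deduplicate(toy)
print(deduplicated)
```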

### Thresholding

As part of cleaning the data, we automatically derive active/inactive labels from the activity data. Our thresholding proceeds via an automated procedure that adapts flexibly to each assay, so that we do not discard measurements because of overly rigid thresholding rules.

We take the median value of an assay's activity measurements and use it as the threshold, provided it is in the range 5 $\le$ median(pXC) $\le$ 7 for enzymes, or 4 $\le$ median(pXC) $\le$ 6 for all other assays. If the median is outside this range, we use pXC = 5.0 as a fixed threshold.

With this threshold we are able to derive a binary activity label.
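The thresholding rule can be sketched as follows. The ranges and the fallback value are taken from the description above; treating values at or above the threshold as active is an assumption of this example, and the function names are hypothetical rather than those used in `clean.py`.

```python
from statistics import median


def activity_threshold(pxc_values, is_enzyme, fallback=5.0):
    """Return the median pXC if it lies in the allowed range, else `fallback`.

    Allowed range: [5, 7] for enzymes, [4, 6] for all other assays.
    """
    low, high = (5.0, 7.0) if is_enzyme else (4.0, 6.0)
    m = median(pxc_values)
    return m if low <= m <= high else fallback


def binarize(pxc_values, is_enzyme):
    """Label each measurement as active (True) or inactive (False).

    Assumption: a measurement counts as active if pXC >= threshold.
    """
    threshold = activity_threshold(pxc_values, is_enzyme)
    return [value >= threshold for value in pxc_values]


# Median inside the enzyme range: the median itself is the threshold.
print(activity_threshold([5.5, 6.0, 6.5], is_enzyme=True))
# Median above the enzyme range: fall back to the fixed threshold.
print(activity_threshold([7.5, 8.0, 8.5], is_enzyme=True))
print(binarize([5.5, 6.0, 6.5], is_enzyme=True))
```

Using the median where possible keeps the two classes roughly balanced within an assay, while the fixed fallback avoids thresholds at extreme potencies.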

Overall, the cleaning can be applied to the extracted data as follows:

In [9]:
! clean.py command line and output

/bin/bash: clean.py: command not found



## 3. Assay Selection for train-valid-test split

Assay selection proceeds by examining the final sizes of the assays and their associated protein information. We begin with a list of 27004 assays for which cleaning did not remove all data. Not all of these assays have protein information available.

In [78]:
import os
import pandas as pd

In [79]:
df = pd.read_csv(os.path.join(os.getcwd(), "targets/target_info.csv"))

print(f"We have {df.cleaned_size.sum()} measurements from our first pass of cleaning (cleaning_failed == False)")

We have 5104074 measurements from our first pass of cleaning (cleaning_failed == False)


In [80]:
# Normalize known target ids to strings (e.g. 93.0 -> "93"); rows without a target id keep NaN
df = pd.concat(
    [
        df.loc[df['target_id'].notna()].astype({"target_id": int}).astype({"target_id": str}),
        df.loc[df['target_id'].isna()]
    ],
    ignore_index=True
)

# first drop assays that are very small (fewer than 32 cleaned measurements)
df = df[df.cleaned_size>=32]
print(f"We have {len(df[df.target_id.notna()].target_id.unique())} unique known targets")

We have 2584 unique known targets


TODO: we need a brief description here of how the EC numbers were assigned (from Nadine).

To select test tasks, we require that they have well-known target ids. Since we also wish to categorise by EC number, we select only those for which a good EC number can be obtained.

We first exclude everything that cannot be included as a few-shot test task, namely assays with:

- no good EC number (NaN, or an EC number considered unreliable)
- no single target ID available (e.g. non-single-protein measurements)

In [97]:
possible_test = df[df.target_id.notna()]
possible_test = possible_test[possible_test.reliable_target_EC_super.notna()]
possible_test = possible_test[possible_test.reliable_target_EC_super == True]

print(f"Prior to filtering we have: {len(possible_test)} assays with well known EC super classes")
print(f"This consists of {len(possible_test.target_id.unique())} targets with {possible_test.cleaned_size.sum()} individual measurement points.")

Prior to filtering we have: 8498 assays with well known EC super classes
This consists of 1516 targets with 2039793 individual measurement points.


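We place further stringent requirements on the test tasks. They must contain at most 5000 datapoints to avoid high-throughput screens, as these are generally considered noisy and not in keeping with the QSAR data considered here. The filter below additionally requires at least 128 datapoints, a confidence score of at least 8, and a `percentage_pos` value between 30 and 70.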

In [98]:
best = possible_test.loc[
    (possible_test["cleaned_size"] >= 128) &
    (possible_test["confidence"] >= 8) &
    (possible_test["percentage_pos"] <= 70) &
    (possible_test["percentage_pos"] >= 30) &
    (possible_test["cleaned_size"] <= 5000)
]

print(f"We have {len(set(possible_test.target_id.unique()))} possible test targets")

We have 1516 possible test targets


We would like 200 final few-shot tasks, but we may not be able to achieve this without impoverishing the training set. 

How many of each EC class would this represent, to maintain the current proportions in the data? 

In [99]:
best.EC_super_class.value_counts() * 200/ best.EC_super_class.value_counts().sum()

2    167.582418
3     19.505495
1      7.142857
4      2.197802
6      2.197802
7      1.098901
5      0.274725
Name: EC_super_class, dtype: float64

In [11]:
from collections import defaultdict

# Number of test targets to draw from each EC super class ("1" to "7")
required_target_numbers = {"EC_super_class": [str(x) for x in range(1, 8)], "target_count": [10, 150, 30, 3, 1, 3, 2]}
ids = defaultdict(set)
test_ids = set()
for c, target_count in zip(required_target_numbers["EC_super_class"], required_target_numbers["target_count"]):
    # value_counts() sorts by descending assay count, so tail() picks the targets with the fewest assays in `best`
    ids[c] = set(best[best.EC_super_class == c].target_id.value_counts().tail(target_count).index)
    test_ids = test_ids.union(ids[c])

NameError: name 'best' is not defined

### Training set

We can assemble the training set from all that remains now that we have selected our target protein IDs. It should be composed of all 'good' protein measurements with known targets and well known EC classes, as well as everything else where these values may be uncertain (for instance, the cases of non-enzymatic proteins where there is no such thing as an EC number).

We then also remove non-protein target measurements (no target id), and suggest these are only to be used as part of an extended training set.

We supply final tables of the train, valid and test set assays.

We note that test assays are required to have a confidence score of 8 or 9, where this reflects a single protein target.
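The partitioning described in this section can be sketched as below. This is an illustration under assumptions rather than the code used to produce the released tables: it assumes every assay on a selected test target is held out of training, it ignores the separate validation split, and the record layout, function name and toy ids are hypothetical.

```python
def split_assays(assays, test_target_ids):
    """Partition assays by target id.

    `assays` is a list of dicts with keys "chembl_id" and "target_id"
    (None when there is no protein target). Returns a dict with three lists:
    - "held_out": assays on targets selected for few-shot testing
    - "train": remaining assays with a known protein target
    - "extended_train": assays without a protein target id
    """
    split = {"held_out": [], "train": [], "extended_train": []}
    for assay in assays:
        target_id = assay["target_id"]
        if target_id is None:
            split["extended_train"].append(assay["chembl_id"])
        elif target_id in test_target_ids:
            split["held_out"].append(assay["chembl_id"])
        else:
            split["train"].append(assay["chembl_id"])
    return split


# Toy records; the ids are placeholders, not real ChEMBL entries.
toy_assays = [
    {"chembl_id": "TOY_ASSAY_1", "target_id": "93"},
    {"chembl_id": "TOY_ASSAY_2", "target_id": "104881"},
    {"chembl_id": "TOY_ASSAY_3", "target_id": None},
]
toy_split = split_assays(toy_assays, test_target_ids={"93"})
print(toy_split)
```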

In [105]:
pd.read_csv(os.path.join(os.getcwd(), "targets/train_proteins.csv"))

Unnamed: 0,chembl_id,target_id,confidence,component_synonym,type_synonym,protein_class_desc,pref_name,EC_super_class,EC_super_class_name,protein_family,protein_super_family,EC_name,reliable_target_EC,reliable_target_protein_desc,reliable_target_EC_super,reliable_target_protein_super
0,CHEMBL1614128,104881,4,,,,,,,,,,,,,
1,CHEMBL1614304,104881,4,,,,,,,,,,,,,
2,CHEMBL1738583,104881,4,,,,,,,,,,,,,
3,CHEMBL1738683,104881,4,,,,,,,,,,,,,
4,CHEMBL1769604,104950,4,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4933,CHEMBL997737,12239,9,1.4.3.4,EC_NUMBER,enzyme reductase,Oxidoreductase,1,oxidoreductase,reductase,reductase,Monoamine oxidase,True,True,True,True
4934,CHEMBL997740,12453,9,1.4.3.4,EC_NUMBER,enzyme reductase,Oxidoreductase,1,oxidoreductase,reductase,reductase,Monoamine oxidase,True,True,True,True
4935,CHEMBL998472,11560,9,,,,,,,,,,,,,
4936,CHEMBL998585,11839,9,,,,,,,,,,,,,


In [109]:
pd.read_csv(os.path.join(os.getcwd(), "targets/test_proteins.csv"))

Unnamed: 0,chembl_id,target_id,confidence,component_synonym,type_synonym,protein_class_desc,pref_name,EC_super_class,EC_super_class_name,protein_family,protein_super_family,EC_name,reliable_target_EC,reliable_target_protein_desc,reliable_target_EC_super,reliable_target_protein_super
0,CHEMBL1119333,93,8,3.1.1.7,EC_NUMBER,enzyme hydrolase,Hydrolase,3,hydrolase,hydrolase,hydrolase,Acetylcholinesterase,True,True,True,True
1,CHEMBL1243967,11177,8,2.7.1.153,EC_NUMBER,enzyme transferase,Transferase,2,transferase,transferase,transferase,"Phosphatidylinositol-4,5-bisphosphate 3-kinase",True,True,True,True
2,CHEMBL1243970,10056,8,2.7.11.1,EC_NUMBER,enzyme kinase protein kinase atypical pikk,Atypical protein kinase PIKK family,2,transferase,kinase_atypical,kinase,Non-specific serine/threonine protein kinase,True,True,True,True
3,CHEMBL1614292,101284,8,2.7.1.11,EC_NUMBER,enzyme,Enzyme,2,transferase,undefined enzyme/protein,undefined enzyme/protein,6-phosphofructokinase,True,True,True,True
4,CHEMBL1614433,103718,8,3.1.21.2,EC_NUMBER,enzyme hydrolase,Hydrolase,3,hydrolase,hydrolase,hydrolase,Deoxyribonuclease IV,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,CHEMBL3707738,38,9,3.3.2.9,EC_NUMBER,enzyme protease serine sc s33,Serine protease S33 family,3,hydrolase,protease_serine,protease,Microsomal epoxide hydrolase,True,True,True,True
153,CHEMBL4005586,100447,9,2.3.1.286,EC_NUMBER,epigenetic regulator eraser hdac hdac class...,HDAC class III,2,transferase,epigenetic_regulator,epigenetic,Protein acetyllysine N-acetyltransferase,True,True,True,True
154,CHEMBL4133035,12054,9,"('1.13.11.-', '1.13.11.12', '1.13.11.31', '1.1...",EC_NUMBER,enzyme reductase,Oxidoreductase,1,oxidoreductase,reductase,reductase,"('unknown EC number', 'Linoleate 13S-lipoxygen...",False,True,True,True
155,CHEMBL641707,11203,9,2.3.1.26,EC_NUMBER,enzyme transferase,Transferase,2,transferase,transferase,transferase,Sterol O-acyltransferase,True,True,True,True


In [108]:
pd.read_csv(os.path.join(os.getcwd(), "targets/valid_proteins.csv"))

Unnamed: 0,chembl_id,target_id,confidence,component_synonym,type_synonym,protein_class_desc,pref_name,EC_super_class,EC_super_class_name,protein_family,protein_super_family,EC_name,reliable_target_EC,reliable_target_protein_desc,reliable_target_EC_super,reliable_target_protein_super
0,CHEMBL1243966,12576,8,2.7.1.153,EC_NUMBER,enzyme transferase,Transferase,2,transferase,transferase,transferase,"Phosphatidylinositol-4,5-bisphosphate 3-kinase",True,True,True,True
1,CHEMBL1963790,12947,8,2.7.11.1,EC_NUMBER,enzyme kinase protein kinase cmgc cdk cdk5,CMGC protein kinase CDK5 subfamily,2,transferase,kinase_cmgc,kinase,Non-specific serine/threonine protein kinase,True,True,True,True
2,CHEMBL1963930,11523,8,3.1.3.48,EC_NUMBER,enzyme phosphatase protein phosphatase tyr,Tyrosine protein phosphatase,3,hydrolase,phosphatase_tyr,phosphatase,Protein-tyrosine-phosphatase,True,True,True,True
3,CHEMBL1964107,12090,8,2.7.12.1,EC_NUMBER,enzyme kinase protein kinase cmgc dyrk dyrk1,CMGC protein kinase Dyrk1 subfamily,2,transferase,kinase_cmgc,kinase,Dual-specificity kinase,True,True,True,True
4,CHEMBL2354206,105691,8,3.6.4.-,EC_NUMBER,epigenetic regulator reader brd,Bromodomain,3,hydrolase,epigenetic_regulator,epigenetic,unknown EC number,True,True,True,True
5,CHEMBL3705467,102844,8,3.3.2.9,EC_NUMBER,enzyme protease serine sc s33,Serine protease S33 family,3,hydrolase,protease_serine,protease,Microsomal epoxide hydrolase,True,True,True,True
6,CHEMBL3705869,10899,8,2.7.11.2,EC_NUMBER,enzyme kinase protein kinase atypical pdhk,Atypical protein kinase PDHK subfamily,2,transferase,kinase_atypical,kinase,[Pyruvate dehydrogenase (acetyl-transferring)]...,True,True,True,True
7,CHEMBL3706064,12004,8,3.1.4.-,EC_NUMBER,enzyme phosphodiesterase pde_4 pde_4b,Phosphodiesterase 4B,3,hydrolase,phosphodiesterase_pde_4,phosphodiesterase,unknown EC number,True,True,True,True
8,CHEMBL3888867,10857,8,"('1.14.11.-', '1.14.11.29')",EC_NUMBER,enzyme reductase,Oxidoreductase,1,oxidoreductase,reductase,reductase,"('unknown EC number', 'Hypoxia-inducible facto...",False,True,True,True
9,CHEMBL763161,126,8,1.14.99.1,EC_NUMBER,enzyme reductase,Oxidoreductase,1,oxidoreductase,reductase,reductase,Prostaglandin-endoperoxide synthase,True,True,True,True


In practice, the validation set uses only those tasks for which the EC super class is 2, as we want to reduce the time taken by validation steps and the majority of test tasks are associated with kinases.

## 4. Featurization

In featurization we take the SMILES string (here termed 'canonical' following the careful cleaning) and use it to create RDKit mol objects, from which further featurization can proceed. This takes place in `featurize.py`.

The final featurized files include the SMILES string, as well as the ECFP fingerprints, standard physico-chemical descriptors from RDKit, and a graph featurization. The graph featurization relies on `metadata.pkl.gz`, as it is created by a set of featurizers with fixed vocabularies to maintain consistent featurization across all assays.

In [None]:
! command line to perform this