# Assemble a Few-Shot Learning Dataset of Molecules from ChEMBL

Here we describe the procedure used to extract the final dataset. The final dataset was obtained through implementation of four key steps: 

1. Query ChEMBL to obtain initial raw data
2. Clean the data to ensure good quality, and threshold to derive binary classification labels
3. Selection of assays for use in the pretraining, vs. those selected as few-shot testing tasks and for validation.
4. Featurization of the data to prepare suitable input to a range of models

## 1. Querying ChEMBL

Our initial query of ChEMBL selected only those assays that contain more than 32 datapoints. We accessed CHEMBL27 and seelcted all assays with more than 32 measurements. We record the assay ids and confidence scores, where confidence reflects the level of information about the target protein in the assay: '9' is a known single protein target, '0' is completely unknown, for instance it could be as broad as an entire tissue. 

The resulting list of assays (or indeed list of ChEMBL assay ids supplied following an alternative query fitting the user's needs) can be passed to the script `query.py`. 

Here we extract a range of fields detailed in `preprocessing/utils/queries.py`. We take a multiple option approach, as we recognise that not all entries in ChEMBL have complete protein target information. When no protein target information is available, the query is carried out for any other information that may be suitable for characterizing the assay such as the target cell type or tissue. 

As a result of this initial query, we obtained 36,093 separate raw assay files as csvs. The cleaning process we followed considerably reduces this count. 

## 2. Cleaning

The cleaning procedure takes place in three keys stages, detailed in `preprocessing/clean.py`:

1. Assays are selected to proceed to the next stage only if they reflect activity or inhibition measurements with units of "%", "uM" or "nM".
2. SMILES are standardized, and XC50 (IC50 ir EC50) measurements are converted to -log10([C]/NM) prior to thresholding.
3. A final (optional) thresholding step is applied. 

The standardization procedure for SMILES is as follows: 

- Remove salts
- Disconnect any metallo-organic complexes
- Make certain the correct ion is present
- Choose the largest fragment if the SMILES string represents disconnected components
- Remove excess charges
- Choose the canonical tautomer

Following this procedure, molecules are rejected with a molecular weight > 900 Da, and exact SMILES-value duplicate pairs are dropped within an assay. 

**De-duplication** of SMILES then accepts a degree of variation in the measured value for the same SMILES -- if a SMILES value is repeated in a dataframe, we accept measurements where the standard value measured is within the same order of magnitude, to fairly capture measurement noise. We reject all measurements for that SMILES if that is not the case. While this may reject stereoisomers with profoundly different chemical behaviors, we wish to remove erroneous measurements of other molecules. 

###  Thresholding

Our thresholding proceeds via a automated procedure that attempts to adapt flexibly to each assay to ensure that we do not discount a number of measurements due to overly rigid thresholding rules. 

We take the median value of an assay's activity measurements, and use this as a threshold provided it is in the range 5 $\le$ median(pXC) $\le$ 7 for enzymes, or 4 $\le$ median(pXC) $\le$ 6 for all other assays. If the median is outside this range, we select PKX = 5.0 as our fixed threshold. 

With this threshold we are able to apply a binary activity label.

## Assay Selection for train-valid-test split

Our assay selection proceeds via examining the final sizes of the assays and their associated protein information. We begin with a list of 27004 assays for which cleaning did not result in removal of all data. Not all assays have available protein information. 

In [2]:
import os
import pandas as pd

In [14]:
mntpath = "/mnt/genchemdata/preprocessed-data/metamol/metamol/"
df = pd.read_csv(os.path.join(mntpath, "all_data_prep.csv"))

In [15]:
df.cleaned_size.sum()

5104074

In [17]:
# df.standard_units.value_counts()

In [30]:
df =pd.concat([df.loc[df['target_id'].notna()].astype({"target_id": int}).astype({"target_id": str}), 
          df.loc[df['target_id'].isna()]],
          ignore_index=True)[df.cleaning_failed==False]

In [31]:
df.to_csv("/home/megstanley/")

Unnamed: 0,chembl_id,target_id,assay_type,assay_organism,raw_size,cleaned_size,cleaning_failed,cleaning_size_delta,num_pos,percentage_pos,...,pref_name,EC_super_class,EC_super_class_name,protein_family,protein_super_family,EC_name,reliable_target_EC,reliable_target_protein_desc,reliable_target_EC_super,reliable_target_protein_super
0,CHEMBL1614128,104881,B,,367,367,False,0.0,181.0,49.318801,...,,,,,,,,,,
1,CHEMBL1614304,104881,B,,367,367,False,0.0,133.0,36.239782,...,,,,,,,,,,
2,CHEMBL1738583,104881,F,,91,91,False,0.0,45.0,49.450549,...,,,,,,,,,,
3,CHEMBL1738683,104881,F,,82,82,False,0.0,41.0,50.000000,...,,,,,,,,,,
4,CHEMBL1769604,104950,B,,79,79,False,0.0,39.0,49.367089,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27132,CHEMBL871770,,F,Mus musculus,30,30,False,0.0,15.0,50.000000,...,,,,,,,,,,
27133,CHEMBL891544,,F,,24,24,False,0.0,12.0,50.000000,...,,,,,,,,,,
27134,CHEMBL914633,,B,Escherichia coli,16,16,False,0.0,0.0,0.000000,...,,,,,,,,,,
27135,CHEMBL942712,,F,Homo sapiens,24,24,False,0.0,1.0,4.166667,...,,,,,,,,,,


In [23]:
tdf = pd.read_csv(os.path.join(mntpath, "valid_proteins.csv"))

In [24]:
tdf

Unnamed: 0,chembl_id,target_id,confidence,component_synonym,type_synonym,protein_class_desc,pref_name,EC_super_class,EC_super_class_name,protein_family,protein_super_family,EC_name,reliable_target_EC,reliable_target_protein_desc,reliable_target_EC_super,reliable_target_protein_super
0,CHEMBL1243966,12576,8,2.7.1.153,EC_NUMBER,enzyme transferase,Transferase,2,transferase,transferase,transferase,"Phosphatidylinositol-4,5-bisphosphate 3-kinase",True,True,True,True
1,CHEMBL1963790,12947,8,2.7.11.1,EC_NUMBER,enzyme kinase protein kinase cmgc cdk cdk5,CMGC protein kinase CDK5 subfamily,2,transferase,kinase_cmgc,kinase,Non-specific serine/threonine protein kinase,True,True,True,True
2,CHEMBL1963930,11523,8,3.1.3.48,EC_NUMBER,enzyme phosphatase protein phosphatase tyr,Tyrosine protein phosphatase,3,hydrolase,phosphatase_tyr,phosphatase,Protein-tyrosine-phosphatase,True,True,True,True
3,CHEMBL1964107,12090,8,2.7.12.1,EC_NUMBER,enzyme kinase protein kinase cmgc dyrk dyrk1,CMGC protein kinase Dyrk1 subfamily,2,transferase,kinase_cmgc,kinase,Dual-specificity kinase,True,True,True,True
4,CHEMBL2354206,105691,8,3.6.4.-,EC_NUMBER,epigenetic regulator reader brd,Bromodomain,3,hydrolase,epigenetic_regulator,epigenetic,unknown EC number,True,True,True,True
5,CHEMBL3705467,102844,8,3.3.2.9,EC_NUMBER,enzyme protease serine sc s33,Serine protease S33 family,3,hydrolase,protease_serine,protease,Microsomal epoxide hydrolase,True,True,True,True
6,CHEMBL3705869,10899,8,2.7.11.2,EC_NUMBER,enzyme kinase protein kinase atypical pdhk,Atypical protein kinase PDHK subfamily,2,transferase,kinase_atypical,kinase,[Pyruvate dehydrogenase (acetyl-transferring)]...,True,True,True,True
7,CHEMBL3706064,12004,8,3.1.4.-,EC_NUMBER,enzyme phosphodiesterase pde_4 pde_4b,Phosphodiesterase 4B,3,hydrolase,phosphodiesterase_pde_4,phosphodiesterase,unknown EC number,True,True,True,True
8,CHEMBL3888867,10857,8,"('1.14.11.-', '1.14.11.29')",EC_NUMBER,enzyme reductase,Oxidoreductase,1,oxidoreductase,reductase,reductase,"('unknown EC number', 'Hypoxia-inducible facto...",False,True,True,True
9,CHEMBL763161,126,8,1.14.99.1,EC_NUMBER,enzyme reductase,Oxidoreductase,1,oxidoreductase,reductase,reductase,Prostaglandin-endoperoxide synthase,True,True,True,True
