# Assemble a Few-Shot Learning Dataset of Molecules from ChEMBL

Here we describe the procedure used to extract the final dataset, which was obtained in four key steps:

1. Query ChEMBL to obtain initial raw data
2. Clean the data to ensure good quality, and threshold to derive binary classification labels
3. Select assays for use in pretraining, for validation, and as few-shot testing tasks
4. Featurize the data to prepare suitable input for a range of models

## 1. Querying ChEMBL

Our initial query accessed ChEMBL27 and selected all assays with more than 32 measurements. We record the assay ids and confidence scores, where the confidence score reflects how much is known about the target protein in the assay: '9' denotes a known single protein target, while '0' denotes a completely unknown target, which could be as broad as an entire tissue.
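As an illustration of this selection, the sketch below runs the "more than 32 measurements" filter against a toy in-memory SQLite database. The `assays` and `activities` tables and their columns are assumed to mirror the public ChEMBL relational schema, the assay identifiers are hypothetical placeholders, and the query is not necessarily the one used to build the dataset.

```python
import sqlite3

# Toy stand-in for a ChEMBL database (assumption: table and column names
# mirror the ChEMBL schema; the identifiers below are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assays (
        assay_id INTEGER PRIMARY KEY,
        chembl_id TEXT,
        confidence_score INTEGER
    );
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY,
        assay_id INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO assays VALUES (?, ?, ?)",
    [(1, "CHEMBL_A", 9), (2, "CHEMBL_B", 0), (3, "CHEMBL_C", 9)],
)
# Assay 1 has 40 measurements, assay 2 has exactly 32, assay 3 has 33.
measurements = [(1,)] * 40 + [(2,)] * 32 + [(3,)] * 33
conn.executemany("INSERT INTO activities (assay_id) VALUES (?)", measurements)

MIN_MEASUREMENTS = 32  # assays must have strictly more than this

QUERY = """
SELECT a.chembl_id, a.confidence_score, COUNT(act.activity_id) AS n_measurements
FROM assays AS a
JOIN activities AS act ON act.assay_id = a.assay_id
GROUP BY a.assay_id, a.chembl_id, a.confidence_score
HAVING COUNT(act.activity_id) > ?
ORDER BY a.assay_id
"""

# Each row is (assay id, confidence score, number of measurements).
selected = conn.execute(QUERY, (MIN_MEASUREMENTS,)).fetchall()
print(selected)
```

The assay with exactly 32 measurements is excluded because the comparison is strict.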

The resulting list of assays (or indeed list of ChEMBL assay ids supplied following an alternative query fitting the user's needs) can be passed to the script `query.py`. 

This script extracts a range of fields detailed in `preprocessing/utils/queries.py`. Because not all entries in ChEMBL have complete protein target information, the query falls back through multiple options: when no protein target information is available, it retrieves any other information that may be suitable for characterizing the assay, such as the target cell type or tissue.

As a result of this initial query, we obtained 36,093 separate raw assay files in CSV format. The cleaning process described next considerably reduces this count.

## 2. Cleaning

The cleaning procedure takes place in three key stages, detailed in `preprocessing/clean.py`:

1. Assays are selected to proceed to the next stage only if they reflect activity or inhibition measurements with units of "%", "uM" or "nM".
2. SMILES are standardized, and XC50 (IC50 or EC50) measurements are converted to -log10([C]/nM) prior to thresholding.
3. A final (optional) thresholding step is applied. 

The standardization procedure for SMILES is as follows: 

- Remove salts
- Disconnect any metallo-organic complexes
- Make certain the correct ion is present
- Choose the largest fragment if the SMILES string represents disconnected components
- Remove excess charges
- Choose the canonical tautomer

Following this procedure, molecules with a molecular weight > 900 Da are rejected, and exact duplicates of a (SMILES, value) pair within an assay are dropped.

**De-duplication** of SMILES then accepts a degree of variation in the measured value for the same SMILES: if a SMILES value is repeated in a dataframe, we accept its measurements only when the measured standard values are within the same order of magnitude, which allows for measurement noise. Otherwise, we reject all measurements for that SMILES. While this may reject stereoisomers with profoundly different chemical behaviors, we wish to remove erroneous measurements of other molecules.
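The de-duplication rule can be sketched as follows. This is a minimal illustration rather than the code in `preprocessing/clean.py`: the helper name `drop_inconsistent_duplicates` is hypothetical, "within the same order of magnitude" is interpreted here as a spread of less than one log10 unit between the largest and smallest value, and accepted duplicates are kept as they are because the text does not specify how they are merged.

```python
import math
from collections import defaultdict


def drop_inconsistent_duplicates(records, max_log_spread=1.0):
    """Filter (smiles, standard_value) pairs from a single assay.

    Assumption: a repeated SMILES is kept only if log10(max) - log10(min)
    of its values is below `max_log_spread`; otherwise every measurement
    for that SMILES is rejected. Values must be strictly positive.
    """
    values_by_smiles = defaultdict(list)
    for smiles, value in records:
        values_by_smiles[smiles].append(value)

    kept = []
    for smiles, value in records:
        values = values_by_smiles[smiles]
        spread = math.log10(max(values)) - math.log10(min(values))
        if spread < max_log_spread:
            kept.append((smiles, value))
    return kept


records = [
    ("CCO", 100.0),
    ("CCO", 250.0),        # same order of magnitude as 100.0: kept
    ("c1ccccc1", 5.0),
    ("c1ccccc1", 5000.0),  # three orders of magnitude apart: both rejected
    ("CCN", 42.0),         # unique SMILES: always kept
]
kept = drop_inconsistent_duplicates(records)
print(kept)
```

Unique SMILES have a spread of zero, so they always pass the check.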

### Thresholding

Our thresholding proceeds via an automated procedure that adapts flexibly to each assay, so that we do not discard measurements because of overly rigid thresholding rules.

We take the median value of an assay's activity measurements and use it as the threshold, provided it is in the range 5 $\le$ median(pXC) $\le$ 7 for enzymes, or 4 $\le$ median(pXC) $\le$ 6 for all other assays. If the median is outside this range, we select pXC = 5.0 as a fixed threshold.

With this threshold we are able to apply a binary activity label.
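The thresholding rule described above can be sketched in a few lines. The function names `choose_threshold` and `binarize` are hypothetical, and labelling values greater than or equal to the threshold as active is an assumption, since the text does not state how ties are handled.

```python
from statistics import median

FALLBACK_THRESHOLD = 5.0  # fixed pXC threshold used when the median is out of range


def choose_threshold(pxc_values, is_enzyme):
    """Return the assay median if it lies in the allowed range, else 5.0."""
    lower, upper = (5.0, 7.0) if is_enzyme else (4.0, 6.0)
    assay_median = median(pxc_values)
    if lower <= assay_median <= upper:
        return assay_median
    return FALLBACK_THRESHOLD


def binarize(pxc_values, threshold):
    """Label each measurement (assumption: pXC >= threshold means active)."""
    return [value >= threshold for value in pxc_values]


# Enzyme assay whose median (6.1) falls inside [5, 7]: the median is used.
enzyme_pxc = [5.2, 6.1, 6.8, 7.4, 5.9]
enzyme_threshold = choose_threshold(enzyme_pxc, is_enzyme=True)
enzyme_labels = binarize(enzyme_pxc, enzyme_threshold)

# Non-enzyme assay whose median (7.0) falls outside [4, 6]: fall back to 5.0.
other_pxc = [6.5, 7.0, 7.2]
other_threshold = choose_threshold(other_pxc, is_enzyme=False)

print(enzyme_threshold, enzyme_labels, other_threshold)
```

Using the median as the threshold tends to give roughly balanced classes, while the fallback keeps the label chemically meaningful for assays whose measurements are skewed towards very high or very low potency.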

## 3. Assay Selection for train-valid-test split

Our assay selection proceeds by examining the final sizes of the assays and their associated protein information. We begin with a list of 27004 assays for which cleaning did not remove all data. Not all assays have available protein information.

In [1]:
import os
import pandas as pd

In [28]:
df = pd.read_csv(os.path.join(os.getcwd(), "target_info.csv"))

In [36]:
print(f"We have {df.cleaned_size.sum()} measurements from our first pass of cleaning (cleaning_failed == False)")

We have 5047500 measurements from our first pass of cleaning (cleaning_failed == False)


In [30]:
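# cast known target ids to integer-valued strings; rows without a target id are left as-is
known = df.loc[df["target_id"].notna()].astype({"target_id": int}).astype({"target_id": str})
unknown = df.loc[df["target_id"].isna()]
df = pd.concat([known, unknown], ignore_index=True)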

In [35]:
# first remove assays that are very small (keep those with at least 32 cleaned measurements)
df = df[df.cleaned_size >= 32]
print(f"We have {df.target_id.nunique()} unique known targets")

We have 2584 unique known targets


TODO: we need a brief description here of how the EC numbers were assigned (from Nadine).

To select test tasks, we require that they only have well known target ids, and since we also wish to categorise by EC number, we will select those for which a good EC number can be obtained. 

We first extract everything that cannot be included as a few-shot test task, namely assays with:
- no good EC number (NaN, or an EC number considered unreliable)
- no single target ID available (e.g. non-single-protein measurements)

In [None]:
# of those that have a target id, select the data that has a reliable EC class

target_filtered = df[df.target_id.notna()]
good_ecs = target_filtered[target_filtered.reliable_target_EC_super.notna()]

In [39]:
# combine the boolean masks element-wise with `&` (Python's `and` is ambiguous for a Series)
possible_test = df.loc[df.target_id.notna() & df.reliable_target_EC_super.notna()]

In [40]:
df.target_id.notna()

0         True
1         True
2         True
3         True
4         True
         ...  
26993    False
26994    False
26995    False
26996    False
27003    False
Name: target_id, Length: 24083, dtype: bool