# TDC 105: Metric Functions and Oracles 

[Tianfan Fu, Wenhao Gao]

Welcome to the TDC community! 

### What is molecular generation? 

Molecular generation is one of the most fundamental challenges in drug discovery: identifying one or more molecules with a specific set of properties of interest, such as binding affinity and drug-likeness. High-throughput (virtual) screening experimentally (computationally) evaluates every candidate in a library; generative algorithms, by contrast, explore chemical space selectively and avoid enumerating it in full. Generative algorithms can also reach beyond the limited starting pool and propose novel chemical structures with favorable intellectual property positions, whereas molecules in a screening library are often pre-existing. Current generative algorithms can be divided into two classes: distribution learning and goal-directed generation. Distribution learning models are meant to interpolate within the chemical space spanned by a training set of molecules and to generate new molecules with similar properties (e.g., variational autoencoders, generative adversarial networks). Goal-directed generation methods try to generate new molecules that maximize an oracle function (e.g., genetic algorithms, reinforcement learning).


### What is a metric function?

A metric function evaluates the output of distribution learning. In distribution learning tasks, we approximate an unknown distribution $p(x)$ with a model distribution $q(x)$, based on a set of molecules sampled from $p(x)$. A well-trained model generates diverse and unique molecules that can be used to build a tailored virtual library. Metric functions therefore evaluate both the quality of a batch of generated molecules and the deviation between $p(x)$ and $q(x)$. Note that our current metric functions are adopted from the MOSES benchmark.
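To make "quality of a batch" concrete, here is a minimal sketch of two batch-level metrics, uniqueness and novelty. The function names and the toy SMILES are assumptions for illustration, not the TDC or MOSES implementation. The sketch compares raw strings only, whereas real metrics canonicalize molecules first.

```python
# Toy batch-level metrics on SMILES strings.
# Assumption: plain string comparison stands in for molecule identity.

def uniqueness(generated):
    """Fraction of generated SMILES strings that are distinct."""
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings absent from the training set."""
    distinct = set(generated)
    return len(distinct - set(training)) / len(distinct)


training = ["CCO", "CCN", "c1ccccc1"]        # hypothetical training set
generated = ["CCO", "CCC", "CCC", "CCCl"]    # hypothetical model samples

print(uniqueness(generated))         # 3 distinct strings out of 4
print(novelty(generated, training))  # 2 of the 3 distinct strings are new
```

Both values lie in [0, 1], and higher is better for a distribution learning model.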


### What is an oracle?

An oracle is a black box that returns the ground-truth score for the input molecule(s). In goal-directed molecular generation tasks, a computational scoring function is usually used as a surrogate oracle to assess how well the molecules fulfill the requirements, such as inhibiting a target (e.g., Glycogen Synthase Kinase 3 Beta, GSK3B) or drug-likeness (e.g., QED). The objective of goal-directed molecular generation is to propose molecules that maximize such scores.
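The idea of "proposing molecules that maximize an oracle" can be sketched without any chemistry library. The `toy_oracle` below is a hypothetical stand-in rather than a TDC oracle, and its score (the fraction of carbon characters in the string) has no chemical meaning. It only shows the black-box interface that a goal-directed method optimizes against.

```python
# A toy black-box oracle and the simplest possible "goal-directed" step:
# pick the candidate with the highest score.

def toy_oracle(smiles):
    """Hypothetical score: fraction of characters that are 'C' or 'c'."""
    return sum(ch in "Cc" for ch in smiles) / len(smiles)


candidates = ["CCO", "CCCC", "NCCN"]       # hypothetical proposals
best = max(candidates, key=toy_oracle)     # candidate maximizing the oracle

print(best, toy_oracle(best))
```

A real generative method would propose new candidates iteratively, using the oracle scores as feedback, instead of selecting from a fixed list.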

In this tutorial, we cover the basics of TDC oracles. After this tutorial, you will be able to use most of the oracles that TDC supports!

We assume you have familiarized yourself with the installation and the data loaders. If not, please visit [TDC 101 Data Loaders](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_101_Data_Loader.ipynb) first!

We will first introduce metric functions, then introduce oracles and some advanced usage of oracles.


## Import the oracle

In [1]:
from tdc import Oracle

## Molecular Property Oracle

It includes:
    
* `QED` is an indicator of drug-likeness, ranging from 0 to 1. 

* `Penalized LogP` is a logP score that accounts for ring size and synthetic accessibility

* `DRD2` measures a molecule's biological activity against a biological target named the dopamine type 2 receptor (DRD2)

* `GSK3` measures a molecule's biological activity against GSK3β. 

* `JNK3` measures a molecule's biological activity against JNK3. 

* `SA` measures a molecule's synthetic accessibility. 

We also include the oracles from [GuacaMol](https://github.com/BenevolentAI/guacamol), which can be divided into several categories:

### similarity

* `aripiprazole_similarity` measures a molecule's Tanimoto similarity with Aripiprazole. 

* `albuterol_similarity` measures a molecule's Tanimoto similarity with Albuterol. 

* `mestranol_similarity` measures a molecule's Tanimoto similarity with Mestranol. 
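For reference, Tanimoto similarity between two fingerprints is the size of their intersection divided by the size of their union. The sketch below represents each fingerprint as a set of "on" bit indices. The helper name and the toy bit sets are assumptions for illustration, and real oracles derive the fingerprints from the molecules themselves.

```python
# Tanimoto similarity on fingerprints represented as sets of "on" bit indices.

def tanimoto(bits_a, bits_b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both fingerprints are empty."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


fp_query = {1, 4, 7, 9}        # hypothetical fingerprint of a candidate
fp_target = {1, 4, 8, 9, 12}   # hypothetical fingerprint of the target drug

print(tanimoto(fp_query, fp_target))  # 3 shared bits / 6 bits in total = 0.5
```

The score is 1.0 only when the two fingerprints are identical, which is the situation the rediscovery oracles check for.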

### rediscovery

* `celecoxib_rediscovery` measures a molecule's Tanimoto similarity with celecoxib's SMILES to check whether it could be rediscovered. 

* `troglitazone_rediscovery` measures a molecule's Tanimoto similarity with troglitazone's SMILES to check whether it could be rediscovered. 

* `thiothixene_rediscovery` measures a molecule's Tanimoto similarity with thiothixene's SMILES to check whether it could be rediscovered. 


### isomer

* `isomers_c7h8n2o2` measures a molecule's similarity to C7H8N2O2 in terms of atom counts. 


* `isomers_c9h10n2o2pf2cl` measures a molecule's similarity to C9H10N2O2PF2Cl in terms of atom counts. 
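An isomer score compares atom counts, so the first step is turning a molecular formula into per-element counts. The following is a minimal sketch of that parsing step only. The `parse_formula` helper is hypothetical and not part of TDC or GuacaMol, and it assumes a flat formula without parentheses or charges.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Map a flat molecular formula such as 'C7H8N2O2' to element counts."""
    counts = Counter()
    # An element is an uppercase letter plus an optional lowercase letter,
    # followed by an optional count (a missing count means 1).
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


print(parse_formula("C7H8N2O2"))
print(parse_formula("C9H10N2O2PF2Cl"))
```

A full isomer score would then compare these target counts against the atom counts of each candidate molecule.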



### MPO (multiproperty objective)

* `osimertinib_mpo` measures the geometric mean of several scores, including the molecule's Tanimoto similarity to osimertinib, TPSA score and LogP score. 

* `fexofenadine_mpo` measures the geometric mean of several scores, including the molecule's Tanimoto similarity to fexofenadine, TPSA score and LogP score. 

* `ranolazine_mpo` measures the geometric mean of several scores, including the molecule's Tanimoto similarity to ranolazine, TPSA score, LogP score and number of fluorine atoms. 

* `perindopril_mpo` measures the geometric mean of several scores, including the molecule's Tanimoto similarity to perindopril and number of aromatic rings. 

* `amlodipine_mpo` measures the geometric mean of several scores, including the molecule's Tanimoto similarity to amlodipine and number of rings. 

* `sitagliptin_mpo` measures the geometric mean of several scores, including the molecule's Tanimoto similarity to sitagliptin, TPSA score, LogP score and isomer score with C16H15F6N5O. 

* `zaleplon_mpo` measures the geometric mean of several scores, including the molecule's Tanimoto similarity to zaleplon and isomer score with C19H17N3O2. 
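The choice of a geometric mean matters: a molecule that scores zero on any single component gets an overall score of zero, whereas an arithmetic mean would still reward it. A small sketch with made-up component scores (the numbers are hypothetical, not the output of any oracle):

```python
import math


def geometric_mean(scores):
    """n-th root of the product of n non-negative scores."""
    return math.prod(scores) ** (1 / len(scores))


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


balanced = [0.5, 0.5, 0.5]   # moderate on every component
lopsided = [1.0, 1.0, 0.0]   # perfect on two components, fails the third

print(geometric_mean(balanced), arithmetic_mean(balanced))
print(geometric_mean(lopsided), arithmetic_mean(lopsided))
```

Under the geometric mean, the balanced molecule outranks the lopsided one, while the arithmetic mean ranks them the other way round.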


### median molecule

* `median1` measures the average of the molecule's Tanimoto similarities to camphor and menthol. 

* `median2` measures the average of the molecule's Tanimoto similarities to tadalafil and sildenafil. 


### others 

* `valsartan_smarts` measures the arithmetic mean of several scores, including (1) a binary score for whether the molecule contains a certain SMARTS structure, (2) LogP, (3) TPSA and (4) Bertz score.  

* `deco_hop` measures the arithmetic mean of several scores, including (1) a binary score for whether the molecule contains certain SMARTS structures and (2) the molecule's Tanimoto similarity to PHCO. 

* `scaffold_hop` measures the arithmetic mean of several scores, including (1) a binary score for whether the molecule contains certain SMARTS structures and (2) the molecule's Tanimoto similarity to PHCO. 



TDC's oracles accept either a list of SMILES strings or a single SMILES string. As an example, we define a list of SMILES strings and a single SMILES string below:

In [2]:
smiles_lst = ['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1',
              'C[C@@H]1CCc2c(sc(NC(=O)c3ccco3)c2C(N)=O)C1',
              'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1',
              'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O']


osimertinib_smiles = 'COc1cc(N(C)CCN(C)C)c(NC(=O)C=C)cc1Nc2nccc(n2)c3cn(C)c4ccccc34'

For some oracles, we need to load a pretrained machine learning model to evaluate the score, e.g., 

* `DRD2`

* `GSK3`

* `JNK3`

The model is downloaded the first time the oracle is called and saved as a local copy. Subsequent calls load the local copy instead of downloading it again. 

In [3]:
gsk3 = Oracle(name = 'GSK3')

Found local copy...
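The download-once, reuse-afterwards behaviour can be sketched as follows. This is only an illustration of the caching pattern, not TDC's actual loading code: `load_with_cache`, `fake_download`, the file name, and the "Downloading..." message are all hypothetical, and a temporary directory stands in for the real cache location.

```python
import os
import tempfile


def load_with_cache(path, download):
    """Return the file's bytes, calling `download` only if no local copy exists."""
    if os.path.exists(path):
        print("Found local copy...")
    else:
        print("Downloading...")
        with open(path, "wb") as fh:
            fh.write(download())
    with open(path, "rb") as fh:
        return fh.read()


calls = []  # records how many times the (fake) download actually runs


def fake_download():
    calls.append(1)
    return b"model-weights"  # placeholder for a serialized model


path = os.path.join(tempfile.mkdtemp(), "toy_model.bin")
first = load_with_cache(path, fake_download)   # no local copy yet: downloads
second = load_with_cache(path, fake_download)  # local copy exists: reused
print(len(calls))
```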


For other oracles, the score can be evaluated using some deterministic rules, e.g.,

* `QED`

* `Penalized LogP`

All the GuacaMol-related oracles belong to this category. 


In [4]:
qed = Oracle(name = 'QED')

Oracles can be called to evaluate a single SMILES. 

In [5]:
gsk3(osimertinib_smiles)

0.54

In [6]:
qed(osimertinib_smiles)

0.3105348061023151

Oracles can also be called on a list of SMILES strings. The output is a list of floats. 

In [7]:
gsk3(smiles_lst)

[0.03, 0.02, 0.0, 0.0]
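Because the returned scores follow the order of the input list, they can be paired with their SMILES strings and ranked. A small sketch that reuses the list defined earlier and hard-codes the GSK3 scores copied from the output above, so it runs without TDC installed:

```python
smiles_lst = ["CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1",
              "C[C@@H]1CCc2c(sc(NC(=O)c3ccco3)c2C(N)=O)C1",
              "CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1",
              "C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O"]
scores = [0.03, 0.02, 0.0, 0.0]  # copied from the gsk3(smiles_lst) output above

# Pair each molecule with its score and sort from highest to lowest score.
ranked = sorted(zip(smiles_lst, scores), key=lambda pair: pair[1], reverse=True)
top_smiles, top_score = ranked[0]

print(top_smiles, top_score)
```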

## Advanced Usage of Oracle

We also provide a more flexible interface to some oracles for advanced users, so that they can build their own oracle functions:

* `isomer_meta` measures a molecule's similarity to a target molecular formula (e.g., C7H8N2O2). 

* `rediscovery_meta` measures a molecule's Tanimoto similarity with the target SMILES to check whether it could be rediscovered. 

* `similarity_meta` measures a molecule's Tanimoto similarity with the target SMILES.

* `median_meta` measures the average score of a molecule's similarity to two target SMILES. 



(I) "Isomer scoring for C7H8N2O2" using TDC built-in function.

In [8]:
isomer_c7h8n2o2 = Oracle(name = 'Isomers_C7H8N2O2')
print(isomer_c7h8n2o2(osimertinib_smiles))

3.758742514056854e-48


(II) Specify the same "Isomer scoring for C7H8N2O2" using the meta oracle. Note that, despite its name, the `target_smiles` argument takes a chemical formula here. 

In [9]:
isomers_c7h8n2o2_1 = Oracle(name = 'Isomer_Meta', target_smiles = 'C7H8N2O2')
print(isomers_c7h8n2o2_1(osimertinib_smiles))

3.758742514056854e-48


Both calls return the same result, which confirms that the meta oracle behaves like the built-in one. As another example, the meta median oracle lets us specify two target molecules that a generated molecule should be similar to at the same time. You can also specify the fingerprint methods, the score modifier functions, and the mean used to aggregate the two scores.

In [10]:
median2 = Oracle(name = 'median2')
print(median2(osimertinib_smiles))

0.1407807656815755


In [11]:
tadalafil_smiles = 'O=C1N(CC(N2C1CC3=C(C2C4=CC5=C(OCO5)C=C4)NC6=C3C=CC=C6)=O)C'
sildenafil_smiles = 'CCCC1=NN(C2=C1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C'
median2_2 = Oracle(name = 'Median_Meta', 
                   target_smiles = (tadalafil_smiles, sildenafil_smiles), 
                   fp1 = 'ECFP6', 
                   fp2 = 'ECFP6', 
                   modifier_func1 = None, 
                   modifier_func2 = None, 
                   means = 'geometric')


print(median2_2(osimertinib_smiles))

0.1407807656815755


Again, the meta oracle returns the same result as the built-in function, confirming its correctness.

That's it for this tutorial! We hope you can now leverage TDC for molecular generation!

If you haven't, do check out all the previous tutorials:

* [TDC 101 Data Loader](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_101_Data_Loader.ipynb)

* [TDC 102 Data Functions](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_102_Data_Functions.ipynb)

* [TDC 103 Part 1: Datasets - Small Molecules](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_103.1_Datasets_Small_Molecules.ipynb)

* [TDC 103 Part 2: Datasets - Biologics](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_103.2_Datasets_Biologics.ipynb)

* [TDC 104 ML Model Examples with DeepPurpose](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_104_ML_Model_DeepPurpose.ipynb)