Translate the SMILES string into another representation form. In this form, we can find the optimal active molecule through various intelligent algorithms, and express the molecules through reverse translation.

leelasd/ChemMORT

ChemMORT

ChemMORT (Molecular Represent & Translate) consists of three modules: the SMILES Encoder, the Embedding Decoder, and the Molecular Optimizer.

Introduction

SMILES Encoder

The ChemMRAT SMILES Encoder embeds a SMILES string into a 512-dimensional vector that can be used to build a QSAR model. For DNNs in particular, the encoded descriptors follow the basic idea of representation learning: a DNN should learn a suitable representation of the data from a simple but complete featurization rather than rely on sophisticated human-engineered representations. In addition, DNNs often require large amounts of training data, while available QSAR datasets are often small. Enumerating multiple SMILES strings for each molecule expands the dataset to several times its original size. Users can submit a chemical for evaluation in three ways: drawing it in the included chemical sketcher window, uploading a structure text file, or entering the SMILES of the chemical structure.

Embedding Decoder

The ChemMRAT Embedding Decoder translates the embedding descriptors produced by the ChemMRAT SMILES Encoder back into a SMILES string. The Decoder supports molecular property optimization: the user can adjust the embedding descriptors to reach a target property and then use the Decoder to obtain the SMILES of each resulting molecule. Users upload a .csv file, and a .smi file is returned after a few seconds to minutes.
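The .csv to .smi round trip can be sketched as follows. This is an illustration only: the CSV layout (one molecule per row, a name followed by the embedding values, no header) and the `decode` callable are assumptions, not the project's actual interface. In practice `decode` would wrap the trained Embedding Decoder.

```python
import csv
from typing import Callable, Sequence


def csv_to_smi(csv_path: str, smi_path: str,
               decode: Callable[[Sequence[float]], str]) -> int:
    """Decode each embedding row of a CSV file and write a .smi file.

    Assumed layout (hypothetical): column 0 is a molecule name and the
    remaining columns are the embedding values. No header row.
    Returns the number of molecules written.
    """
    written = 0
    with open(csv_path, newline="") as src, open(smi_path, "w") as dst:
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            name, values = row[0], [float(v) for v in row[1:]]
            smiles = decode(values)
            # .smi convention: SMILES first, then whitespace and a name
            dst.write(f"{smiles} {name}\n")
            written += 1
    return written
```

Passing the decoder in as a callable keeps the file handling independent of how the model is loaded.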

Molecular Optimizer

The ChemMRAT Molecular Optimizer combines the Encoder, the Decoder, and the Particle Swarm Optimization (PSO) method. It is designed to optimize molecules with respect to a single objective, under constraints on chemical substructures, and with respect to a multi-objective value function. The proposed method not only shows competitive or better performance in finding optimal solutions compared with the baseline method, it also achieves a significant reduction in computational time. After the user enters the SMILES of a chemical structure and selects the property to optimize, the results list several of the best solutions.
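To make the optimization step concrete, here is a minimal global-best particle swarm optimization sketch in pure Python. It is not the ChemMRAT implementation: `toy_score` is a hypothetical stand-in for a property model evaluated on an embedding, the inertia and acceleration coefficients are common textbook values chosen for illustration, and the example uses 8 dimensions instead of 512 to keep it fast.

```python
import random
from typing import Callable, List, Tuple


def pso_maximize(score: Callable[[List[float]], float], dim: int,
                 n_particles: int = 20, n_steps: int = 50,
                 w: float = 0.7, c1: float = 1.5, c2: float = 1.5,
                 seed: int = 0) -> Tuple[List[float], float]:
    """Maximize `score` over R^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [score(p) for p in pos]          # personal best scores
    g = max(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm best

    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = score(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val > g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def toy_score(x: List[float]) -> float:
    """Hypothetical objective with its maximum (0.0) at the all-0.5 vector."""
    return -sum((xi - 0.5) ** 2 for xi in x)


best, value = pso_maximize(toy_score, dim=8)
```

In a real setting each particle position would be an embedding, and `score` would apply a property model to it (optionally after decoding to SMILES) to obtain the value being optimized.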

Endpoint of Optimizer

| Endpoint | Description | Performance | Type | Method | Dataset |
| --- | --- | --- | --- | --- | --- |
| logD7.4 | Log of the octanol/water distribution coefficient at pH 7.4. Optimal: 1~3 | Test set: RMSE 0.555±0.010, MAE 0.426±0.007, R2 0.840±0.004. 5-fold CV: RMSE 0.562±0.009, MAE 0.428±0.13, R2 0.834±0.005 | Basic property | XGBoost | |
| AMES | Probability of a positive Ames test. The smaller the AMES score, the less likely the compound is Ames positive. | Test set: ACC 0.813±0.007, SEN 0.835±0.013, SPE 0.787±0.013, AUC 0.888±0.004. 5-fold CV: ACC 0.810±0.016, SEN 0.838±0.014, SPE 0.777±0.031, AUC 0.889±0.013 | Toxicity | XGBoost | |
| Caco-2 | Papp (Caco-2 permeability). Optimal: higher than -5.15 log unit or -4.70 or -4.80 | Test set: RMSE 0.332±0.007, MAE 0.244±0.004, R2 0.718±0.019. 5-fold CV: RMSE 0.328±0.004, MAE 0.245±0.005, R2 0.728±0.011 | Absorption | XGBoost & Data Augment | |
| MDCK | Papp (MDCK permeability) | Test set: RMSE 0.323±0.022, MAE 0.232±0.011, R2 0.650±0.041. 5-fold CV: RMSE 0.322±0.034, MAE 0.235±0.021, R2 0.644±0.057 | Absorption | XGBoost & Data Augment | |
| PPB | Plasma protein binding. Significant for drugs that are highly protein-bound and have a low therapeutic index. | Test set: RMSE 0.152±0.003, MAE 0.104±0.002, R2 0.691±0.016. 5-fold CV: RMSE 0.154±0.010, MAE 0.106±0.007, R2 0.691±0.025 | Distribution | DNN | |
| QED | Quantitative estimate of drug-likeness | n/a | Drug-likeness score | Molecular Function | |
| SlogP | Log of the octanol/water partition coefficient, based on an atomic contribution model [Crippen 1999]. Optimal: 0 < logP < 3. logP < 0: poor lipid bilayer permeability. logP > 3: poor aqueous solubility. | Fitted on an extensive training set of 9920 molecules, with R2 = 0.918 and σ = 0.677 | Basic property | Molecular Function | |
| logS | Log of solubility. Optimal: higher than -4 log mol/L. <10 μg/mL: low solubility. 10–60 μg/mL: moderate solubility. >60 μg/mL: high solubility. | Test set: RMSE 0.823±0.026, MAE 0.572±0.009, R2 0.862±0.011. 5-fold CV: RMSE 0.842±0.084, MAE 0.592±0.056, R2 0.839±0.029 | Basic property | XGBoost | |
| hERG | Probability of being a hERG blocker. The higher the hERG score, the more likely the compound is a hERG blocker. | Test set: ACC 0.814±0.026, SEN 0.841±0.042, SPE 0.760±0.065, AUC 0.854±0.032. 5-fold CV: ACC 0.800±0.036, SEN 0.820±0.068, SPE 0.754±0.147, AUC 0.857±0.053 | Toxicity | XGBoost | |
| Hepatoxicity | Probability of liver toxicity. The smaller the hepatoxicity score, the less likely the compound is liver toxic. | Test set: ACC 0.729±0.016, SEN 0.732±0.019, SPE 0.724±0.044, AUC 0.794±0.015. 5-fold CV: ACC 0.700±0.026, SEN 0.701±0.030, SPE 0.691±0.075, AUC 0.764±0.030 | Toxicity | XGBoost | |
| LD50 | LD50 of acute toxicity. High toxicity: 1~50 mg/kg. Toxicity: 51~500 mg/kg. Low toxicity: 501~5000 mg/kg. | Test set: ACC 0.765±0.007, SEN 0.764±0.015, SPE 0.765±0.014, AUC 0.848±0.007. 5-fold CV: ACC 0.741±0.045, SEN 0.742±0.128, SPE 0.740±0.111, AUC 0.833±0.033 | Toxicity | XGBoost | |
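
The optimal ranges listed for the endpoints can be combined into a single score for multi-objective optimization. The sketch below shows one simple way to do that and is an assumption, not ChemMRAT's actual value function: each property is mapped to a desirability in [0, 1] with a linear ramp outside its optimal range, and the results are averaged with user-chosen weights. The ranges for logD7.4, logS and SlogP come from the endpoint descriptions; the ramp width of 1.0 and all function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (low, high) optimal bounds; None means unbounded on that side.
OPTIMAL: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "logD7.4": (1.0, 3.0),  # endpoint description: optimal 1~3
    "logS": (-4.0, None),   # endpoint description: higher than -4 log mol/L
    "SlogP": (0.0, 3.0),    # endpoint description: 0 < logP < 3
}


def desirability(value: float, low: Optional[float],
                 high: Optional[float], ramp: float = 1.0) -> float:
    """Return 1.0 inside [low, high], falling linearly to 0.0 over `ramp` units outside."""
    if low is not None and value < low:
        return max(0.0, 1.0 - (low - value) / ramp)
    if high is not None and value > high:
        return max(0.0, 1.0 - (value - high) / ramp)
    return 1.0


def combined_score(props: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of the per-property desirabilities."""
    weights = weights or {name: 1.0 for name in props}
    total = sum(weights[name] for name in props)
    return sum(weights[name] * desirability(props[name], *OPTIMAL[name])
               for name in props) / total


# logD7.4 is inside its range (1.0); logS misses its threshold by 0.5 (0.5).
example = combined_score({"logD7.4": 2.0, "logS": -4.5})  # 0.75
```

A score of this kind could serve as the objective handed to a swarm optimizer, with the property values supplied by the prediction models for each endpoint.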

Downloading Pretrained Model

A pretrained model as described in ref. 1 is available on Google Drive. Download and unzip it by executing the bash script "download_default_model.sh":

./download_default_model.sh

The default_model.zip file can also be downloaded manually from https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h
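
If the archive was downloaded manually, it can be unpacked with Python's standard library instead of the shell script. This sketch assumes the file was saved as `default_model.zip` in the current directory. The target directory is an arbitrary choice here, so check `download_default_model.sh` if the code expects the model in a specific location.

```python
import zipfile
from pathlib import Path
from typing import List


def unpack_model(archive: str = "default_model.zip", target: str = ".") -> List[str]:
    """Extract the model archive into `target` and return the member names."""
    archive_path = Path(archive)
    if not archive_path.is_file():
        raise FileNotFoundError(f"{archive} not found; download it first")
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
        return zf.namelist()
```

Calling `unpack_model()` with no arguments extracts into the current directory.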

Dev Environment

tensorflow==1.14.0
scikit-learn==0.23.2
rdkit==2019.03.1

Base

cddd
mso
