ChemMORT (Molecular Represent & Translate) consists of three modules: the SMILES Encoder, the Embedding Decoder, and the Molecular Optimizer.
The ChemMRAT SMILES Encoder embeds a SMILES string into a 512-dimensional vector that can be used as descriptors for building a QSAR model. For deep neural networks (DNNs) in particular, these descriptors follow the basic idea of representation learning: a DNN should learn a suitable representation of the data from a simple but complete featurization rather than rely on sophisticated human-engineered representations. In addition, DNNs often require large amounts of training data, while the available QSAR datasets are often small. Enumerating the different SMILES strings of each molecule expands the dataset to several times its original size. Users can submit a chemical for evaluation in three ways: drawing it in the built-in chemical sketcher window, providing a structure text file, or entering the SMILES of the chemical structure.
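The enumeration step described above can be sketched as follows. This is a minimal illustration, not the project's own code: `augment_with_enumeration` and `reversed_chain` are hypothetical names, the labels are placeholder values, and the toy enumerator only handles unbranched chains of single-letter atoms. In practice a cheminformatics toolkit would supply the variants (for example, recent RDKit versions can emit randomized SMILES via `Chem.MolToSmiles(mol, doRandom=True)`).

```python
def augment_with_enumeration(records, enumerate_smiles):
    """Expand (smiles, label) pairs with alternative SMILES of the same molecule.

    `enumerate_smiles` is any callable returning equivalent SMILES strings;
    every variant inherits the label of the molecule it came from.
    """
    augmented = []
    for smiles, label in records:
        # A set drops duplicates; sorting keeps the output deterministic.
        variants = {smiles, *enumerate_smiles(smiles)}
        augmented.extend((variant, label) for variant in sorted(variants))
    return augmented


def reversed_chain(smiles):
    # Toy stand-in for a real enumerator. Valid only for unbranched chains of
    # single-letter atoms, where reading the chain backwards describes the
    # same molecule (e.g. "CCO" and "OCC" are both ethanol).
    return [smiles[::-1]]


# Placeholder labels, not measured values.
data = [("CCO", 1.0), ("CCN", 2.0)]
augmented = augment_with_enumeration(data, reversed_chain)
print(augmented)
```

When augmenting this way, all variants of one molecule should stay in the same train/test split, otherwise the test set leaks into training.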
The ChemMRAT Embedding Decoder translates embedding descriptors produced by the ChemMRAT SMILES Encoder back into SMILES strings. The Decoder supports molecular property optimization: the user can adjust the embedding descriptors toward a target property and then use the Decoder to obtain the SMILES of each resulting molecule. Users submit a .csv file, and a .smi file is returned within a few seconds to a few minutes.
The ChemMRAT Molecular Optimizer combines the Encoder, the Decoder, and the Particle Swarm Optimization (PSO) method. It was designed to optimize molecules with respect to a single objective, under chemical substructure constraints, and with respect to a multi-objective value function. The proposed method not only performs competitively with or better than the baseline method in finding optimal solutions, it also significantly reduces computation time. After the user enters the SMILES of a chemical structure and selects the property to optimize, the results list several of the best solutions.
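To make the PSO idea concrete, here is a minimal sketch of a textbook global-best particle swarm in pure Python. It is not the Optimizer's implementation: `pso_minimize`, its default hyperparameters, and `toy_objective` are assumptions for illustration. In the real workflow the objective would presumably decode or score an embedding vector with a property model, whereas here a simple squared distance to a made-up target vector stands in for it.

```python
import random


def pso_minimize(objective, dim, n_particles=30, n_iters=200,
                 bounds=(-1.0, 1.0), w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO. Returns (best_position, best_score)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random starting positions inside the box, zero starting velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            score = objective(pos[i])
            if score < pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score < gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


# Hypothetical target point in a 4-dimensional stand-in "embedding space".
TARGET = [0.3, -0.2, 0.5, 0.0]


def toy_objective(z):
    # Stand-in for "distance of the predicted property from the desired value".
    return sum((a - b) ** 2 for a, b in zip(z, TARGET))


best_z, best_score = pso_minimize(toy_objective, dim=len(TARGET))
print(best_z, best_score)
```

A multi-objective run could reuse the same loop by passing an objective that combines several property scores into one value, for instance a weighted sum.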
Endpoint | Description | Performance | Type | Method | Dataset |
---|---|---|---|---|---|
logD7.4 | Log of the octanol/water distribution coefficient at pH 7.4. Optimal: 1–3. | Test set: RMSE 0.555±0.010, MAE 0.426±0.007, R² 0.840±0.004. 5-fold CV: RMSE 0.562±0.009, MAE 0.428±0.13, R² 0.834±0.005 | Basic property | XGBoost | |
AMES | The probability to be positive in Ames test. The smaller AMES score, the less likely to be AMES positive. | Test set: ACC 0.813±0.007, SEN 0.835±0.013, SPE 0.787±0.013, AUC 0.888±0.004. 5-fold CV: ACC 0.810±0.016, SEN 0.838±0.014, SPE 0.777±0.031, AUC 0.889±0.013 | Toxicity | XGBoost | |
Caco-2 | Papp (Caco-2 Permeability). Optimal: higher than -5.15 Log unit or -4.70 or -4.80. | Test set: RMSE 0.332±0.007, MAE 0.244±0.004, R² 0.718±0.019. 5-fold CV: RMSE 0.328±0.004, MAE 0.245±0.005, R² 0.728±0.011 | Absorption | XGBoost & Data Augment | |
MDCK | Papp (MDCK Permeability) | Test set: RMSE 0.323±0.022, MAE 0.232±0.011, R² 0.650±0.041. 5-fold CV: RMSE 0.322±0.034, MAE 0.235±0.021, R² 0.644±0.057 | Absorption | XGBoost & Data Augment | |
PPB | Plasma Protein Binding. Significant with drugs that are highly protein-bound and have a low therapeutic index. | Test set: RMSE 0.152±0.003, MAE 0.104±0.002, R² 0.691±0.016. 5-fold CV: RMSE 0.154±0.010, MAE 0.106±0.007, R² 0.691±0.025 | Distribution | DNN | |
QED | Quantitative estimate of drug-likeness | n/a | Drug-likeness score | Molecular Function | |
SlogP | Log of the octanol/water partition coefficient, based on an atomic contribution model [Crippen 1999]. Optimal: 0 < logP < 3. logP < 0: poor lipid bilayer permeability. logP > 3: poor aqueous solubility. | Fitted on an extensive training set of 9920 molecules, with R² = 0.918 and σ = 0.677 | Basic property | Molecular Function | |
logS | Log of Solubility. Optimal: higher than -4 log mol/L. <10 μg/mL: low solubility. 10–60 μg/mL: moderate solubility. >60 μg/mL: high solubility. | Test set: RMSE 0.823±0.026, MAE 0.572±0.009, R² 0.862±0.011. 5-fold CV: RMSE 0.842±0.084, MAE 0.592±0.056, R² 0.839±0.029 | Basic property | XGBoost | |
hERG | The probability to be hERG Blocker. The higher hERG score, the more likely to be hERG Blocker. | Test set: ACC 0.814±0.026, SEN 0.841±0.042, SPE 0.760±0.065, AUC 0.854±0.032. 5-fold CV: ACC 0.800±0.036, SEN 0.820±0.068, SPE 0.754±0.147, AUC 0.857±0.053 | Toxicity | XGBoost | |
Hepatoxicity | The probability of owning liver toxicity. The smaller hepatoxicity score, the less likely to be liver toxic. | Test set: ACC 0.729±0.016, SEN 0.732±0.019, SPE 0.724±0.044, AUC 0.794±0.015. 5-fold CV: ACC 0.700±0.026, SEN 0.701±0.030, SPE 0.691±0.075, AUC 0.764±0.030 | Toxicity | XGBoost | |
LD50 | LD50 of acute toxicity. High-toxicity: 1–50 mg/kg. Toxicity: 51–500 mg/kg. Low-toxicity: 501–5000 mg/kg. | Test set: ACC 0.765±0.007, SEN 0.764±0.015, SPE 0.765±0.014, AUC 0.848±0.007. 5-fold CV: ACC 0.741±0.045, SEN 0.742±0.128, SPE 0.740±0.111, AUC 0.833±0.033 | Toxicity | XGBoost | |
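
For readers unfamiliar with the abbreviations in the Performance column, the sketch below computes them from their standard textbook definitions. The function names are hypothetical, and the exact evaluation code behind the table is not given here. In particular, treating R² as the coefficient of determination (rather than a squared correlation) is an assumption. AUC is omitted because it needs ranked probability scores rather than hard labels.

```python
import math


def regression_metrics(y_true, y_pred):
    """RMSE, MAE and R^2 (coefficient of determination) for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(pairs),
        "SEN": tp / (tp + fn),   # true positive rate
        "SPE": tn / (tn + fp),   # true negative rate
    }


# Made-up numbers, chosen only so the arithmetic is easy to follow.
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
cls = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(reg, cls)
```
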
A pretrained model, as described in ref. 1, is available on Google Drive. Download and unzip it by executing the bash script "download_default_model.sh":
./download_default_model.sh
The default_model.zip file can also be downloaded manually from https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h
tensorflow=='1.14.0'
scikit-learn=='0.23.2'
rdkit=='2019.03.1'