Integrating (Deep) Machine Learning and Cheminformatics for Predicting Human Intestinal Absorption of Small Molecules
A set of 2648 compounds was collected from earlier and recent studies and curated to build a robust dataset. Five machine learning (ML) algorithms were trained on molecular descriptors of these compounds, selected through rigorous feature engineering. Additionally, a graph convolutional neural network (GCNN) based model and a graph attention network (GAT) based model were developed using the same set of compounds to assess the predictive performance achievable with automatically extracted features.
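To make the idea of "automatically extracted features" concrete, here is a minimal, dependency-free sketch of a single graph-convolution step on a toy molecular graph. It is an illustration only and not the repository's GCNN or GAT model: the function names, the one-hot atom features, the identity weight matrix, and the three-atom example (the heavy atoms of ethanol, C-C-O) are all assumptions chosen for readability.

```python
# Illustrative sketch only: one round of mean-aggregation message passing
# on a toy molecular graph. This is NOT the repository's model.

def neighbors_with_self_loops(num_atoms, bonds):
    """Map each atom index to a list containing itself and its bonded neighbors."""
    neighbors = {i: [i] for i in range(num_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def graph_conv(features, neighbors, weight):
    """Average each atom's neighborhood features, apply a linear map, then ReLU."""
    out = []
    for i in range(len(features)):
        nbrs = neighbors[i]
        dim = len(features[i])
        mean = [sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(dim)]
        projected = [
            sum(mean[k] * weight[k][o] for k in range(dim))
            for o in range(len(weight[0]))
        ]
        out.append([max(0.0, v) for v in projected])
    return out


def readout(atom_features):
    """Sum-pool the per-atom vectors into one molecule-level vector."""
    return [sum(col) for col in zip(*atom_features)]


# Hypothetical input: one-hot atom types [is_carbon, is_oxygen] for C-C-O.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity; a trained model would learn this

neighbors = neighbors_with_self_loops(3, bonds)
hidden = graph_conv(features, neighbors, weight)
molecule_vector = readout(hidden)
```

In a trained network the weight matrix is learned, several such layers are stacked, and the pooled molecule vector is passed to a classifier head. A graph attention layer would replace the plain neighborhood average with learned, per-neighbor weights.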
The numerical analyses show that, of the five ML models, Random Forest and LightGBM achieved accuracies of 87.71% and 86.04% on the test set and 81% and 77% on the external validation set, respectively. The GCNN and GAT based models achieved final accuracies of 77.69% and 78.58% on the test set and 79.29% and 79.42% on the external validation set, respectively.
Package            Version
interpret          0.6.2
ipykernel          6.29.3
ipython            8.25.0
jupyter            1.0.0
lightgbm           4.4.0
matplotlib         3.9.0
mordred            1.2.0
notebook           7.2.0
numba              0.60.0
numpy              1.26.4
openpyxl           3.1.5
pandas             2.2.2
pip                24.0
plotly             5.23.0
rdkit              2023.9.6
scikit-learn       1.5.1
scipy              1.14.0
seaborn            0.13.2
shap               0.46.0
sklearn-rvm        0.1.1
torch              2.3.1
torch_geometric    2.5.3
torchaudio         2.3.1
torchvision        0.18.1
wheel              0.43.0
widgetsnbextension 4.0.11
xgboost            2.1.0
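One way to confirm that a local environment matches these versions is a small check built on the standard-library `importlib.metadata` module. This is a sketch, not part of the project: the `PINNED` dictionary and the `check_versions` helper are hypothetical names, and only a handful of the listed packages are included for brevity.

```python
from importlib import metadata

# A few pins copied from the package list; extend as needed.
PINNED = {
    "lightgbm": "4.4.0",
    "rdkit": "2023.9.6",
    "scikit-learn": "1.5.1",
    "torch": "2.3.1",
    "torch_geometric": "2.5.3",
    "xgboost": "2.1.0",
}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

The comparison is an exact string match, so a build that reports a local suffix (for example a CUDA-specific `torch` wheel) will be flagged even when the base version agrees.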