cpLogD v2.0

This is a re-take of the cpLogD model previously published in A confidence predictor for logD using conformal regression and a support-vector machine. The main updates are that it is based on a newer version of ChEMBL (v33, May 2023) and that it is built using a later version of CPSign, which is now open source for non-commercial use. Note: the old cpLogD was based on the computed property acd_logd in ChEMBL, which is no longer supplied; it has been replaced by the CX LogD 7.4 property, which is used here.

The service is now publicly available at https://cplogd.serve.scilifelab.se/ and you can view the OpenAPI specification at https://cplogd.serve.scilifelab.se/api/openapi.json.
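
As a quick check that the service is up, the OpenAPI specification can be fetched programmatically. A minimal sketch using Python and the requests library (only the URL above is taken from this repo; everything else is illustrative):

```python
import requests

# Fetch the OpenAPI specification of the deployed cpLogD service
resp = requests.get("https://cplogd.serve.scilifelab.se/api/openapi.json", timeout=30)
resp.raise_for_status()
spec = resp.json()

# List the documented endpoints and their HTTP methods
for path, methods in spec.get("paths", {}).items():
    print(path, sorted(methods.keys()))
```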

Steps for generation

1. Downloading data from ChEMBL

The latest version of ChEMBL at the time (version 33) was downloaded, and data was extracted following the procedure outlined in download data. The extracted dataset can be found in the compressed file cx_logd.csv.gz.
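
To take a first look at the extracted dataset, it can be read directly from the gzipped CSV, for example with pandas. A minimal sketch; the exact column names are not assumed here, so inspect the header yourself:

```python
import pandas as pd

# pandas detects gzip compression from the .gz extension
df = pd.read_csv("cx_logd.csv.gz")

print(df.shape)    # number of compounds and columns
print(df.head())   # e.g. a SMILES column and the CX LogD 7.4 value
```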

2. Model development and evaluation

How the modeling was performed is detailed in train and evaluate model. Model evaluation was performed in the same way as for the initial model, using a withheld dataset of 100,000 test compounds.
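
For illustration only, withholding 100,000 test compounds could be done as in the sketch below. This is not the CPSign workflow used for the actual model; the scikit-learn split and the random seed are our own assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cx_logd.csv.gz")

# Withhold 100,000 compounds for evaluation, train on the remainder
train_df, test_df = train_test_split(df, test_size=100_000, random_state=42)

print(len(train_df), "training compounds")
print(len(test_df), "withheld test compounds")
```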

3. Docker image generation

To serve the model as a Java web server with an OpenAPI-documented REST interface, we copy the trained-model.jar generated in the previous step into the generate_service directory and use the Dockerfile to build a local Docker image. This image is based on the base containers from the cpsign_predict_services repository. Follow the guide in that repo to publish your own service, or download our image from the Packages tab on GitHub if you wish to run it yourself.
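
Once a service is running, predictions are requested over its REST interface. The sketch below is hypothetical: the endpoint path and parameter names are illustrative assumptions, so take the actual ones from the service's OpenAPI specification:

```python
import requests

# Public service, or e.g. http://localhost:8080 for a locally running container
BASE_URL = "https://cplogd.serve.scilifelab.se"

# Hypothetical endpoint and parameters; consult /api/openapi.json for the real interface
resp = requests.get(
    f"{BASE_URL}/api/v2/predict",
    params={"molecule": "CCO", "confidence": 0.8},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expected to contain the predicted logD interval
```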

Model performance

Here we show the model performance for the new cpLogD model and compare it to the old model.

Model calibration

(Figure: calibration curve, observed error rate plotted against significance level.)

The observed error rate matches the significance level from 0.6 and above, and is even slightly lower than the significance level below 0.6. In short, this shows that the model is indeed well calibrated and its predictions can be trusted.
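
To make the calibration claim concrete: the observed error rate at a given significance level is the fraction of test compounds whose true logD falls outside the predicted interval. A minimal sketch with made-up numbers (not the CPSign evaluation code):

```python
import numpy as np

# Illustrative data: true logD values and predicted intervals at one significance level
y_true = np.array([1.2, -0.5, 3.1, 0.8])
lower  = np.array([0.9, -1.0, 2.0, 1.0])
upper  = np.array([1.5,  0.0, 4.0, 1.4])

# A prediction counts as an error when the true value falls outside its interval
errors = (y_true < lower) | (y_true > upper)
error_rate = errors.mean()

# For a well-calibrated conformal predictor, the observed error rate should not
# exceed the chosen significance level (up to statistical fluctuation)
print(f"observed error rate: {error_rate:.2f}")
```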

Model efficiency

The original work compared different hyper-parameter settings and presented the Median Prediction Interval (MPI) for a set of confidence levels:

(Table from the original publication: MPI at different confidence levels for the evaluated hyper-parameter settings.)

Here are the MPI values for the new (v2) model:

| Confidence | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 95% | 99% |
|------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| MPI        | 0.043 | 0.085 | 0.127 | 0.169 | 0.213 | 0.262 | 0.325 | 0.418 | 0.606 | 0.849 | 1.77 |

The new model (v2) thus beats the old models (bold-faced in the table from the original publication) at all confidence levels except 10%, where it differs only in the last digit. As stated in the original paper, confidence levels of 70-99% are the most interesting; there, the new model almost halves the MPI for 70-95% confidence and reduces it by about 40% at 99% confidence. All results can be found in the validation_stats file. For convenience, we also plot the MPI against the significance level:
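
The MPI reported above is simply the median width of the prediction intervals at a given confidence level. A minimal sketch with made-up intervals (not the validation_stats pipeline):

```python
import numpy as np

# Illustrative prediction intervals at one confidence level
lower = np.array([0.9, -1.0, 2.0, 1.0])
upper = np.array([1.5,  0.0, 4.0, 1.4])

# Median Prediction Interval (MPI): the median of the interval widths
mpi = np.median(upper - lower)
print(f"MPI: {mpi:.3f}")
```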

(Figure: MPI plotted against significance level.)

Accuracy of midpoint prediction

The original paper also presented the accuracy of the underlying SVM model, so we report the same metrics here (Q$^2$ = squared correlation coefficient, RMSEP = root mean square error of prediction):

| Model    | Q$^2$ | RMSEP |
|----------|-------|-------|
| v1 (old) | 0.973 | 0.41  |
| v2 (new) | 0.984 | 0.315 |

Our new model thus also improves the midpoint of the predictions, which is expected given the increased size of the training data.
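
For reference, both metrics can be computed from observed and predicted logD values as in the sketch below (numpy-based; Q$^2$ is taken as the squared Pearson correlation, as defined above, and the numbers are illustrative):

```python
import numpy as np

def q2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Squared Pearson correlation coefficient between observed and predicted values."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(r ** 2)

def rmsep(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error of prediction."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values
y_true = np.array([1.2, -0.5, 3.1, 0.8])
y_pred = np.array([1.1, -0.3, 3.4, 1.0])
print(f"Q2: {q2(y_true, y_pred):.3f}, RMSEP: {rmsep(y_true, y_pred):.3f}")
```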
