# Introduction to Jaqpot and Machine Learning in Chemistry

In this short tutorial, we will explore how to use **Jaqpotpy**, a Python client for the **Jaqpot** platform. Jaqpot is a cloud-based platform that allows scientists to easily train, validate, and deploy machine learning models, especially in fields like **chemistry**, **toxicology**, and **risk assessment**.

Our goal is to show how a predictive model can be created from chemical data without advanced programming or infrastructure setup.

We'll focus on the following:
- Using **chemical descriptors** from SMILES (a textual representation of molecules)
- Training a model via the **Jaqpot API**
- Making predictions with the trained model

## Install and import the required Python modules

In [None]:
!pip install jaqpotpy
!pip install rdkit==2023.9.6
!pip install onnx==1.17.0
!pip install numpy==1.26.4
!pip install pyTDC
import os
# Kill the kernel so the session restarts with the newly installed rdkit and numpy versions
os.kill(os.getpid(), 9)

Collecting jaqpotpy
  Downloading jaqpotpy-6.24.0-py3-none-any.whl.metadata (4.0 kB)
Collecting jaqpot-api-client>=6.43.0 (from jaqpotpy)
  Downloading jaqpot_api_client-6.43.4-py3-none-any.whl.metadata (1.7 kB)
Collecting jaqpot-python-sdk>=6.0.2 (from jaqpotpy)
  Downloading jaqpot_python_sdk-6.0.5-py3-none-any.whl.metadata (2.0 kB)
Collecting onnx>=1.17.0 (from jaqpotpy)
  Downloading onnx-1.18.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Collecting onnxmltools>=1.12.0 (from jaqpotpy)
  Downloading onnxmltools-1.13.0-py2.py3-none-any.whl.metadata (8.2 kB)
Collecting onnxruntime>=1.19.0 (from jaqpotpy)
  Downloading onnxruntime-1.22.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting polling2>=0.5.0 (from jaqpotpy)
  Downloading polling2-0.5.0-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting python-keycloak>=4.3.0 (from jaqpotpy)
  Downloading python_keycloak-5.5.0-py3-none-any.whl.metadata (6.0 kB)
Collecting 

In [None]:
from tdc.single_pred import Tox
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from jaqpotpy.datasets import JaqpotTabularDataset
from sklearn.feature_selection import VarianceThreshold
from jaqpotpy.models import SklearnModel
from sklearn.linear_model import LogisticRegression
from jaqpotpy.descriptors.molecular import RDKitDescriptors, MACCSKeysFingerprint
from jaqpotpy.doa import Leverage, BoundingBox
from jaqpotpy import Jaqpot

## Import chemical data from TDCommons

DILI (Drug Induced Liver Injury)

Source: https://tdcommons.ai/single_pred_tasks/tox/#dili-drug-induced-liver-injury

Dataset Description: Drug-induced liver injury (DILI) is a fatal liver disease caused by drugs, and it has been the single most frequent cause of safety-related drug marketing withdrawals for the past 50 years (e.g. iproniazid, ticrynafen, benoxaprofen). This dataset is aggregated from the U.S. FDA’s National Center for Toxicological Research.

Task Description: Binary classification. Given a drug SMILES string, predict whether it can cause liver injury (1) or not (0).

Dataset Statistics: 475 drugs.

In [None]:
data = Tox(name = 'DILI')

Downloading...
100%|██████████| 26.7k/26.7k [00:00<00:00, 459kiB/s]
Loading...
Done!


## Dataset Description

In this notebook, we work with a dataset of **chemical compounds** represented by their **SMILES strings**, together with a binary toxicity label. **Molecular descriptors** are computed from the SMILES strings in a later step.

- The **SMILES** notation is a way to encode a molecule's structure in a simple string, which is widely used in cheminformatics. For example, the SMILES string for **Aspirin** is `CC(=O)OC1=CC=CC=C1C(=O)O`.
- The **descriptors** (generated using RDKit) include information such as molecular weight, number of hydrogen-bond donors/acceptors, surface area, and more. These numerical values are used by the model to learn relationships between molecular structure and the target property.

In [None]:
# Load the dataset once and reuse the dataframe
df = data.get_data()
smiles = df["Drug"]
y = df["Y"]
all_data = pd.DataFrame(data={"smiles": smiles, "y": y})

In [None]:
print(smiles)

0                                   CC(=O)OCC[N+](C)(C)C
1                                  C[N+](C)(C)CC(=O)[O-]
2           O=C(NC(CO)C(O)c1ccc([N+](=O)[O-])cc1)C(Cl)Cl
3                                        O=C(O)c1ccccc1O
4                         CC(NC(C)(C)C)C(=O)c1cccc(Cl)c1
                             ...                        
470             CCCC(CCC)C(=O)O.CCCC(CCC)C(=O)[O-].[Na+]
471    CCCCC(CC)COC(=O)CC(C(=O)OCC(CC)CCCC)S(=O)(=O)[...
472    C=C1c2cccc(O)c2C(O)=C2C(=O)C3(O)C(O)=C(C(N)=O)...
473                               O=C1OC(C(O)CO)C(O)=C1O
474    CN(C)C1C(=O)C(C(N)=O)=C(O)C2(O)C(=O)C3=C(O)c4c...
Name: Drug, Length: 475, dtype: object


In [None]:
print(y)

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
470    1.0
471    0.0
472    1.0
473    0.0
474    1.0
Name: Y, Length: 475, dtype: float64


### Split Data into Train-Test
Splitting the dataset into training and testing sets is a crucial step in machine learning. It allows us to evaluate the performance of our model on unseen data, ensuring that it generalizes well and does not simply memorize the training data. By training the model on one portion of the data (training set) and testing it on another (testing set), we can assess its predictive accuracy and avoid overfitting.

In [None]:
train_smiles, test_smiles, train_y, test_y = train_test_split(smiles, y, test_size=0.2, stratify=y, random_state=42)
train_data = pd.DataFrame(data={"smiles": train_smiles, "y": train_y})
test_data = pd.DataFrame(data={"smiles": test_smiles, "y": test_y})
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")

Train data shape: (380, 2)
Test data shape: (95, 2)
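The `stratify=y` argument used above keeps the class proportions similar in the training and test sets. The following is a minimal plain-Python sketch of that idea, not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then take the same fraction from each class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 negatives and 40 positives (hypothetical, not the DILI data).
toy_labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(toy_labels)
test_positive_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_rate)
```

Because the sampling is done per class, the positive rate in the test set matches the positive rate of the full toy dataset.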


### Create Datasets with Molecular Descriptors
The `JaqpotTabularDataset` is a class provided by the `jaqpotpy` library to handle tabular datasets, particularly for machine learning tasks in cheminformatics. It simplifies the process of converting raw chemical data into a format suitable for model training and prediction.

Key input variables for `JaqpotTabularDataset`:
- `df`: The input dataframe containing the data.
- `smiles_cols`: A list of column names in the dataframe that contain SMILES strings, which represent the molecular structures.
- `featurizer`: A list of featurizers (e.g., `RDKitDescriptors`, `MACCSKeysFingerprint`) used to compute molecular descriptors or fingerprints from the SMILES strings.
- `y_cols`: A list of column names in the dataframe that contain the target variable(s).
- `task`: The type of machine learning task, such as `"binary_classification"`, `"regression"`, or `"multi_class_classification"`.

This class automatically computes the molecular descriptors and reports the columns and rows that contain non-finite values. In the cell below, descriptor columns with missing values are then dropped explicitly.

In [None]:
task = "binary_classification"
smiles_col = ["smiles"]
y_cols = ["y"]

train_dataset = JaqpotTabularDataset(
    df=train_data,
    smiles_cols=smiles_col,
    featurizer=[RDKitDescriptors(use_fragment=True)],
    y_cols=y_cols,
    task=task,
)
train_dataset.X = train_dataset.X.dropna(axis=1)

test_dataset = JaqpotTabularDataset(
    df=test_data,
    smiles_cols=smiles_col,
    featurizer=[RDKitDescriptors(use_fragment=True)],
    y_cols=y_cols,
    task=task)
# Drop descriptor columns that contain missing values in the test set
test_dataset.X = test_dataset.X.dropna(axis=1)

Columns with non-finite values: Index(['BCUT2D_MWHI', 'BCUT2D_MWLOW', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO',
       'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW'],
      dtype='object')
Rows with non-finite values with row_index: Index([125, 185, 257, 263, 319, 336], dtype='int64')
The corresponding SMILES are: 125             CCCC(CCC)C(=O)O.CCCC(CCC)C(=O)[O-].[Na+]
185                                                [Li+]
257     [C-]#N.[C-]#N.[C-]#N.[C-]#N.[C-]#N.[Fe+4].[N-]=O
263    CCCCC(CC)COC(=O)CC(C(=O)OCC(CC)CCCC)S(=O)(=O)[...
319                 NC1CCCCC1N.O=C([O-])C(=O)[O-].[Pt+2]
336    COCCNC(=O)CN(CCN(CCN(CC(=O)[O-])CC(=O)NCCOC)CC...
Columns with non-finite values: Index(['BCUT2D_MWHI', 'BCUT2D_MWLOW', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO',
       'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW'],
      dtype='object')
Rows with non-finite values with row_index: Index([3, 62], dtype='int64')
The corresponding SMILES are: 3                              

### Setting up SklearnModel with Jaqpotpy and Cross-Validation

The `SklearnModel` class in `jaqpotpy` allows us to integrate scikit-learn models into the Jaqpot platform. This enables seamless training, evaluation, and deployment of machine learning models for cheminformatics tasks.

#### Key Steps to Set Up `SklearnModel`:
1. **Dataset**: Provide a `JaqpotTabularDataset` object containing the preprocessed data (`train_dataset` in this case).
2. **Model**: Specify the scikit-learn model to be used (e.g., `LogisticRegression`).
3. **Preprocessing**: Optionally, include preprocessing steps like scaling (e.g., `MinMaxScaler`).
4. **Domain of Applicability (DOA)**: Optionally, define DOA methods (e.g., `Leverage`, `BoundingBox`) to assess the reliability of predictions.

#### Cross-Validation:
Cross-validation is a technique used to evaluate the performance of a machine learning model. It involves splitting the dataset into multiple folds, training the model on some folds, and testing it on the remaining fold(s). This process is repeated multiple times, and the results are averaged to provide a robust estimate of the model's performance.

In `jaqpotpy`, the `cross_validate` method performs cross-validation on the provided dataset. Key benefits include:
- **Detecting overfitting**: A large gap between training scores and cross-validation scores indicates that the model does not generalize well to unseen data.
- **Performance metrics**: Provides metrics such as accuracy, precision, and recall, averaged across folds.
- **Model selection**: Helps in comparing models or hyperparameter settings.

Combining `SklearnModel` with cross-validation gives a more realistic estimate of model performance before deployment, which supports reliable and reproducible machine learning models in cheminformatics.
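The fold mechanics described above can be sketched in plain Python. This is only an illustration of k-fold splitting and score averaging: `k_fold_indices` and `majority_class_accuracy` are hypothetical helpers, the "model" is a trivial majority-class predictor, and `jaqpotpy`'s `cross_validate` may shuffle or stratify the folds differently.

```python
from statistics import mean


def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def majority_class_accuracy(labels, train_idx, val_idx):
    """Hypothetical stand-in for a model: predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return mean([1.0 if labels[i] == majority else 0.0 for i in val_idx])


toy_labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # hypothetical labels
fold_scores = [
    majority_class_accuracy(toy_labels, train_idx, val_idx)
    for train_idx, val_idx in k_fold_indices(len(toy_labels), n_splits=5)
]
cv_accuracy = mean(fold_scores)
print(fold_scores, cv_accuracy)
```

The averaged value is what a cross-validation report summarizes; the per-fold scores show how much the estimate varies between splits.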

In [None]:
sklearn_model = LogisticRegression(random_state=42)
jaqpot_model = SklearnModel(
    dataset=train_dataset, model=sklearn_model, preprocess_x = MinMaxScaler()
)
jaqpot_model.fit()
cv = jaqpot_model.cross_validate(train_dataset, n_splits=5)
print("Average scores on Cross-Validation:", cv)

Goodness-of-fit metrics on training set:
{'accuracy': 0.8578947368421053, 'balancedAccuracy': 0.8576691875121194, 'precision': array([0.90052356, 0.81481481]), 'recall': array([0.83091787, 0.89017341]), 'f1Score': array([0.86432161, 0.85082873]), 'jaccard': array([0.76106195, 0.74038462]), 'matthewsCorrCoef': 0.718209069779487, 'confusionMatrix': array([[172,  19],
       [ 35, 154]])}
Average scores on Cross-Validation: {'accuracy': 0.8131578947368421, 'balancedAccuracy': 0.8132031717702034, 'precision': array([0.82234455, 0.8040618 ]), 'recall': array([0.81008321, 0.81898256]), 'f1Score': array([0.8141094 , 0.80979359]), 'jaccard': array([0.68717393, 0.68082706]), 'matthewsCorrCoef': 0.6277205685336912, 'confusionMatrix': array([[31.4,  6.8],
       [ 7.4, 30.4]])}
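Several of the reported training metrics can be recomputed by hand from the confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes (the scikit-learn convention); accuracy and the Matthews correlation coefficient do not depend on that choice, while balanced accuracy does.

```python
from math import sqrt

# Training-set confusion matrix copied from the output above.
# Assumption: rows are true classes and columns are predicted classes.
confusion = [[172, 19],
             [35, 154]]
(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

# Fraction of all compounds that were classified correctly.
accuracy = (tn + tp) / total
# Balanced accuracy: mean of the per-row fractions of correct classifications.
balanced_accuracy = (tn / (tn + fp) + tp / (fn + tp)) / 2
# Matthews correlation coefficient for the binary case.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"accuracy={accuracy:.4f} balanced={balanced_accuracy:.4f} mcc={mcc:.4f}")
```

The three values agree with the `accuracy`, `balancedAccuracy`, and `matthewsCorrCoef` entries printed for the training set.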


### Feature Selection for Model Optimization

Feature selection is a crucial step in machine learning to improve model performance and reduce overfitting. The next code cell defines a function `feature_selection` that performs the following steps:

1. **Correlation Analysis**:
    - Computes the Pearson correlation matrix for the features.
    - Removes features that are highly correlated (correlation > 0.9) to avoid redundancy.

2. **Variance Thresholding**:
    - Scales the features using `MinMaxScaler`.
    - Removes features with very low variance (below 0.001 after scaling), as they provide little to no information for the model.

3. **Mutual Information (MI) Analysis**:
    - Calculates the mutual information scores between features and the target variable.
    - Retains features with MI scores above a specified threshold (≥ 0.01), ensuring that only the most informative features are selected.

The selected features are then returned as a list, which will be used to refine the training and testing datasets in subsequent steps. This process ensures that the model is trained on the most relevant and non-redundant features, leading to better generalization and performance.
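To make the correlation step concrete, here is a plain-Python sketch on a hypothetical three-feature table. It is a simplified variant of the idea: a feature is kept only if its absolute Pearson correlation with every feature kept so far is at most the threshold. The helper names and the toy data are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| <= threshold against every feature kept so far."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table: "weight_x2" is an exact multiple of "weight".
toy_features = {
    "weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight_x2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "rings": [1.0, 0.0, 2.0, 0.0, 1.0],
}
kept_features = drop_correlated(toy_features)
print(kept_features)
```

The redundant `weight_x2` column is removed because it carries the same information as `weight`, while the uncorrelated `rings` column is retained.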

In [None]:
import pandas as pd
import numpy as np

def feature_selection(X, y):
    # Correlation: keep a feature only if its absolute Pearson correlation
    # with every earlier feature is at most 0.9
    corr_matrix = X.corr(method='pearson')
    corr_matrix = corr_matrix.abs()
    upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_keep = []
    for column in upper_tri.columns:
        if not (upper_tri[column] > 0.9).any():
            to_keep.append(column)

    X = X[to_keep]

    # Var Threshold
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    selector = VarianceThreshold(threshold=0.001)
    selector.fit(X_scaled)
    selected_features = X.columns[selector.get_support(indices=True)]
    X = X[selected_features]

    # Mutual information between each remaining feature and the target
    mi_scores = mutual_info_classif(X, y)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    selected_features = mi_scores[mi_scores >= 0.01].index.tolist()

    return selected_features

selected_features = feature_selection(train_dataset.X, train_dataset.y)
print("Number of selected features = ", len(selected_features))
print("List of selected features:", selected_features)

Number of selected features =  130
List of selected features: ['SlogP_VSA6', 'MinAbsPartialCharge', 'PEOE_VSA9', 'SMR_VSA7', 'PEOE_VSA3', 'SMR_VSA10', 'HallKierAlpha', 'MaxPartialCharge', 'MinEStateIndex', 'FractionCSP3', 'EState_VSA8', 'SMR_VSA3', 'EState_VSA1', 'SMR_VSA1', 'EState_VSA2', 'VSA_EState2', 'NumHeteroatoms', 'NumAromaticHeterocycles', 'SMR_VSA6', 'NOCount', 'SMR_VSA4', 'TPSA', 'PEOE_VSA12', 'PEOE_VSA1', 'VSA_EState9', 'fr_aniline', 'PEOE_VSA7', 'SlogP_VSA10', 'SMR_VSA5', 'FpDensityMorgan1', 'SlogP_VSA2', 'VSA_EState8', 'PEOE_VSA6', 'EState_VSA3', 'SlogP_VSA5', 'SlogP_VSA1', 'fr_sulfonamd', 'Kappa1', 'Chi0v', 'EState_VSA10', 'NumAromaticRings', 'NHOHCount', 'MaxAbsPartialCharge', 'VSA_EState6', 'FpDensityMorgan3', 'SlogP_VSA12', 'EState_VSA9', 'fr_Ar_N', 'EState_VSA6', 'fr_Nhpyrrole', 'PEOE_VSA8', 'MinPartialCharge', 'SlogP_VSA8', 'AvgIpc', 'HeavyAtomMolWt', 'Ipc', 'PEOE_VSA2', 'fr_COO2', 'SMR_VSA9', 'SlogP_VSA4', 'VSA_EState1', 'Chi4v', 'BertzCT', 'qed', 'NumValenceElectr

Keep the ```selected_features``` in the dataset and drop the rest.

In [None]:
train_dataset.select_features(SelectColumns=selected_features)
test_dataset.select_features(SelectColumns=selected_features)

### Domain of Applicability (DOA) in QSARs

The Domain of Applicability (DOA) is a critical concept in Quantitative Structure-Activity Relationship (QSAR) modeling. It defines the chemical space within which the model's predictions are considered reliable. DOA ensures that the model is applied only to compounds similar to those in the training dataset, reducing the risk of inaccurate predictions for out-of-scope molecules. By incorporating DOA, QSAR models become more robust, interpretable, and trustworthy for decision-making in drug discovery and toxicology.
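As a conceptual illustration, a bounding-box domain check can be sketched in a few lines of plain Python: a query compound is considered in-domain when each of its descriptor values lies within the range seen during training. This is not `jaqpotpy`'s `BoundingBox` implementation; the function names and the descriptor values are hypothetical.

```python
def fit_bounding_box(rows):
    """Record the per-feature minimum and maximum seen in the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def in_bounding_box(row, bounds):
    """True when every feature value lies within the training range."""
    return all(low <= value <= high for value, (low, high) in zip(row, bounds))


# Hypothetical two-descriptor training set (for example molecular weight and logP).
training_rows = [(180.2, 1.2), (151.2, 0.5), (206.3, 3.5), (194.2, -0.1)]
bounds = fit_bounding_box(training_rows)

inside = in_bounding_box((190.0, 2.0), bounds)   # both values within range
outside = in_bounding_box((450.0, 2.0), bounds)  # first value far above range
print(inside, outside)
```

A prediction for the second query would be flagged as outside the domain, signalling that the model is extrapolating beyond the chemical space it was trained on.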

In [None]:
doa = [Leverage(), BoundingBox()]
jaqpot_model = SklearnModel(
    dataset=train_dataset, model=sklearn_model, preprocess_x = MinMaxScaler(), doa = doa
)
jaqpot_model.fit()
cv = jaqpot_model.cross_validate(train_dataset, n_splits=5)
print("Average scores on Cross-Validation:", cv)

Goodness-of-fit metrics on training set:
{'accuracy': 0.8526315789473684, 'balancedAccuracy': 0.8524889886146431, 'precision': array([0.87958115, 0.82539683]), 'recall': array([0.8358209 , 0.87150838]), 'f1Score': array([0.85714286, 0.84782609]), 'jaccard': array([0.75      , 0.73584906]), 'matthewsCorrCoef': 0.7061526476719189, 'confusionMatrix': array([[168,  23],
       [ 33, 156]])}
Average scores on Cross-Validation: {'accuracy': 0.8078947368421053, 'balancedAccuracy': 0.8078444057719023, 'precision': array([0.82605755, 0.78963127]), 'recall': array([0.79779794, 0.82520377]), 'f1Score': array([0.80846346, 0.80423316]), 'jaccard': array([0.68114206, 0.67299332]), 'matthewsCorrCoef': 0.6193154307735212, 'confusionMatrix': array([[31.6,  6.6],
       [ 8. , 29.8]])}


### Uploading the Model to Jaqpot

Once the model is trained and validated, the next step is to upload it to the Jaqpot platform. Jaqpot provides a seamless way to deploy machine learning models, making them accessible for further use and integration. By uploading the model, it becomes possible to share, test, and utilize it in various applications while ensuring controlled access and visibility settings.

In [None]:
jaqpot = Jaqpot()
jaqpot.login()

jaqpot_model.deploy_on_jaqpot(
    jaqpot=jaqpot,
    name="DILI Classification model",
    description="This is my first attempt to train and upload a Jaqpot model",
    visibility="PRIVATE",
)

Open this URL in your browser and log in:
https://login.jaqpot.org/realms/jaqpot/protocol/openid-connect/auth?client_id=jaqpot-client&response_type=code&redirect_uri=urn:ietf:wg:oauth:2.0:oob&scope=openid email profile&state=random_state_value&nonce=
Enter the authorization code you received: 05bf6069-c541-4355-b7de-187483113868.c1192fdc-e9c4-4142-95ab-e3d026368bcf.40e0db1a-58ce-461a-8fbb-6a4451d8587a


[1m [32m 2025-05-13 17:55:14,069 - INFO - Model has been successfully uploaded. The url of the model is https://app.jaqpot.org/dashboard/models/2129[0m
