<a href="https://colab.research.google.com/github/ntua-unit-of-control-and-informatics/jaqpot-google-collab-examples/blob/main/Scikit-learn-models/feature-preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Preprocessing

## Using Multiple Featurizers

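This guide shows how to combine multiple molecular featurizers, select features, and preprocess features and targets before training a scikit-learn model with `jaqpotpy`.

First, we install `jaqpotpy` and import the necessary libraries.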

In [1]:
# Install `jaqpotpy`
!pip install jaqpotpy

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from jaqpotpy.models import SklearnModel
from jaqpotpy.datasets import JaqpotTabularDataset
from jaqpotpy.descriptors import RDKitDescriptors, MACCSKeysFingerprint

Load a dataframe containing SMILES strings, a categorical variable, temperature, and activity values.

In [2]:
data = pd.read_csv("https://github.com/ntua-unit-of-control-and-informatics/jaqpot-google-colab-examples/raw/doc/JAQPOT-425/Sklearn_jupyter_examples/datasets/regression_smiles_categorical.csv")

Define a list of desired featurizers.

In [3]:
from jaqpotpy.descriptors import RDKitDescriptors, MACCSKeysFingerprint
featurizers = [RDKitDescriptors(), MACCSKeysFingerprint()]

We then pass this list of featurizers to the `JaqpotTabularDataset` object when creating the training dataset:

In [4]:
train_dataset = JaqpotTabularDataset(
    df=data,
    x_cols=["cat_col", "temperature"],
    y_cols=["activity"],
    smiles_cols=["smiles"],
    task="REGRESSION",
    featurizers=featurizers,
)

Given a list of featurizers, the dataset generates both RDKit descriptors and MACCS keys fingerprints from the SMILES column, so each molecule is described by both feature sets.
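
The idea of combining featurizers can be sketched without `jaqpotpy`. The example below assumes the feature blocks are joined column-wise, which the guide does not state explicitly. The two featurizer functions `toy_counts` and `toy_flags` are hypothetical stand-ins for illustration only, not RDKit descriptors or MACCS keys.

```python
import numpy as np


def toy_counts(smiles):
    """Hypothetical featurizer: string length and counts of 'C' and 'O' characters."""
    return [len(smiles), smiles.count("C"), smiles.count("O")]


def toy_flags(smiles):
    """Hypothetical featurizer: 1/0 flags for a '=' character and a ring-closure digit."""
    return [int("=" in smiles), int(any(ch.isdigit() for ch in smiles))]


smiles_list = ["CCO", "C=O", "C1CCCCC1"]

# Each featurizer yields one block with a row per molecule.
blocks = [
    np.array([featurize(s) for s in smiles_list])
    for featurize in (toy_counts, toy_flags)
]

# Assumption: blocks are concatenated side by side, one row per molecule.
features = np.hstack(blocks)
print(features.shape)  # (3, 5)
```

Each molecule ends up with three columns from the first block and two from the second, which is why adding featurizers widens the feature matrix rather than adding rows.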

## Feature Selection

Next, we demonstrate feature selection. After creating the `JaqpotTabularDataset` object, we apply a feature selection technique using the `select_features()` method:

In [5]:
# Use VarianceThreshold to keep only features whose variance exceeds 0.1
feature_selector = VarianceThreshold(threshold=0.1)
train_dataset.select_features(
    feature_selector,
    ExcludeColumns=["cat_col"],  # Explicitly exclude the categorical variable
)

This applies the `VarianceThreshold` selector to the dataset while leaving "cat_col" out of the selection: it is a categorical feature, so a variance threshold cannot be applied to it.
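
To see what the variance threshold does, here is a minimal sketch using scikit-learn and pandas directly on a toy dataframe. The column `near_constant` and the way the excluded column is set aside and reattached are assumptions made for illustration; they are not taken from the `jaqpotpy` implementation.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame(
    {
        "cat_col": ["A", "B", "A", "B"],
        "temperature": [20.0, 25.0, 30.0, 35.0],
        "near_constant": [1.0, 1.0, 1.0, 1.1],  # hypothetical low-variance feature
    }
)

# Set the categorical column aside so the selector only sees numeric data.
exclude = ["cat_col"]
numeric = toy.drop(columns=exclude)

selector = VarianceThreshold(threshold=0.1)
selector.fit(numeric)

# get_support() returns a boolean mask over the numeric columns.
kept = list(numeric.columns[selector.get_support()])

# Assumption: excluded columns are kept unchanged next to the selected ones.
selected = toy[exclude + kept]
print(kept)  # ['temperature']
```

`VarianceThreshold` removes features whose variance does not exceed the threshold, so `near_constant` is dropped while `temperature` survives.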

Alternatively, you can directly select specific columns by name using the `SelectColumns` argument:

In [6]:
selected_columns = [
    "temperature",
    "cat_col",
    "MaxAbsEStateIndex",
    "MaxEStateIndex",
    "MinAbsEStateIndex",
    "MinEStateIndex",
    "SPS",
    "MolWt",
    "HeavyAtomMolWt",
]
train_dataset.select_features(SelectColumns=selected_columns)

This method allows you to manually choose the features you want to include in the model, which can be useful if you have domain knowledge about the most relevant variables.

## Feature Preprocessing

Finally, we define preprocessing steps for the feature columns and the target column:

In [7]:
# Preprocessing for the feature columns
double_preprocessing = [
    ColumnTransformer(
        transformers=[
            ("OneHotEncoder", OneHotEncoder(), ["cat_col"]),
        ],
        remainder="passthrough",
        force_int_remainder_cols=False,
    ),
    StandardScaler(),  # Standard scaling of every column, including the one-hot encoded ones
]

# Preprocessing for the target column
single_preprocessing = MinMaxScaler()

The `double_preprocessing` list first applies `OneHotEncoder` to the categorical "cat_col" feature and passes the remaining columns through unchanged, then applies `StandardScaler` to every resulting column, including the one-hot encoded ones.
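
The effect of these two steps can be checked on a toy dataframe with scikit-learn alone. This sketch assumes the list is applied in order, which is what a scikit-learn `Pipeline` does. The `sparse_threshold=0` argument is an addition here that forces dense output so `StandardScaler` can center the data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame(
    {
        "cat_col": ["A", "B", "A", "B"],
        "temperature": [20.0, 25.0, 30.0, 35.0],
    }
)

encode = ColumnTransformer(
    transformers=[("OneHotEncoder", OneHotEncoder(), ["cat_col"])],
    remainder="passthrough",  # temperature is passed through unchanged
    sparse_threshold=0,  # always return a dense array
)

# Assumption: the steps run sequentially, like a scikit-learn Pipeline.
pipeline = make_pipeline(encode, StandardScaler())
X = pipeline.fit_transform(toy)

# Columns: cat_col == "A", cat_col == "B", temperature (all standardized).
print(X.shape)  # (4, 3)
```

Note that the one-hot columns no longer contain 0 and 1 after scaling; each column has zero mean and unit variance.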

The `single_preprocessing` transformer applies `MinMaxScaler` to the target variable "activity".
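
Scaling the target means the model is trained on scaled values, so predictions have to be mapped back to the original units. The sketch below shows the round trip with scikit-learn alone on made-up target values; how `jaqpotpy` handles the inverse step internally is not shown in this guide.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up target values, shaped as a single column.
y = np.array([[2.0], [4.0], [10.0]])

scaler = MinMaxScaler()
y_scaled = scaler.fit_transform(y)  # maps the range [2, 10] onto [0, 1]

# inverse_transform recovers values in the original units.
y_back = scaler.inverse_transform(y_scaled)

print(y_scaled.ravel())  # [0.   0.25 1.  ]
```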

We then pass these preprocessing pipelines to the `SklearnModel` object:

In [8]:
jaqpot_model = SklearnModel(
    dataset=train_dataset,
    model=RandomForestRegressor(random_state=42),
    preprocess_x=double_preprocessing,
    preprocess_y=single_preprocessing,
)
jaqpot_model.fit()

Goodness-of-fit metrics on training set:
{'r2': 0.9376826862159472, 'mae': 0.9549999999999983, 'rmse': 1.3120060975468062}


The model applies these transformations to the feature and target variables before training. Note that the metrics above are computed on the training set.

Combining multiple featurizers, feature selection, and feature preprocessing gives you control over which molecular features reach the model and how they are scaled when building models with JaqpotPy.