# Model customization in MLflow

In MLflow, model customization allows us to define custom models with specific behavior beyond what is provided by the standard machine learning libraries. This is particularly useful when we need to integrate special preprocessing steps, post-processing, or custom prediction logic within a model.

When customizing a model in MLflow, we typically create a class that inherits from `mlflow.pyfunc.PythonModel`. This class allows us to define how the model should be loaded, how predictions should be made, and any other custom logic we want to include.

#### Key concepts in model customization

1. **Custom Python model**: A Python class that extends `mlflow.pyfunc.PythonModel`. This class serves as a wrapper around our model and defines custom methods for loading the model and making predictions.

2. **Core methods**:
   - **`load_context(self, context)`**: This optional method loads the model and any other necessary resources once, when the model is loaded. The `context` parameter provides access to the artifacts logged with the model through `context.artifacts`.
   - **`predict(self, context, model_input)`**: This is the only required method and defines the custom prediction logic. It receives `context` (which contains artifacts and other runtime info) and `model_input` (the input data for making predictions), and returns the predictions.
   - We can also add other methods to the custom model class as needed. For example, a `preprocess` method can handle data preprocessing steps before making predictions. This is not required but is useful for keeping complex preprocessing or feature engineering workflows inside the model class.
3. **Logging the custom model**: Once we define our custom model class, we can log it to MLflow using `mlflow.pyfunc.log_model`, and then load it using `mlflow.pyfunc.load_model`.

In [1]:
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

import warnings
warnings.filterwarnings('ignore')

import logging
logging.getLogger('mlflow').setLevel(logging.ERROR)

### Preparing the data

In [2]:
# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Setting up the experiment

We will start by setting up a new MLflow experiment where all runs will be logged. If the experiment does not exist, it will be created.

In [3]:
# Set up the experiment
mlflow.set_experiment("Iris Classification Experiment")

<Experiment: artifact_location='file:///C:/Users/israe/Documents/Codes/Notebooks/mlruns/191308692135956385', creation_time=1724749168776, experiment_id='191308692135956385', last_update_time=1724749168776, lifecycle_stage='active', name='Iris Classification Experiment', tags={}>

### Defining a custom Python model

Let's define a custom model that wraps several scikit-learn models and adds custom preprocessing before making predictions. We will create a model that applies polynomial feature transformation and scaling, uses an ensemble of classifiers, and applies a threshold to the predicted probabilities.

In [4]:
# Define a custom MLflow Python model
class CustomModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Load the pre-trained models and scaler artifacts from MLflow
        self.poly_features = mlflow.sklearn.load_model(context.artifacts["poly_features"])
        self.scaler = mlflow.sklearn.load_model(context.artifacts["scaler"])
        self.logistic_regression = mlflow.sklearn.load_model(context.artifacts["logistic_regression"])
        self.random_forest = mlflow.sklearn.load_model(context.artifacts["random_forest"])
        self.gradient_boosting = mlflow.sklearn.load_model(context.artifacts["gradient_boosting"])
        
    def preprocess(self, model_input):
        # Apply polynomial features and scaling to the input data
        poly_input = self.poly_features.transform(model_input)
        scaled_input = self.scaler.transform(poly_input)
        return scaled_input
        
    def predict(self, context, model_input):
        # Preprocess the input data
        preprocessed_input = self.preprocess(model_input)

        # Make predictions using the ensemble of pre-trained models
        rf_pred = self.random_forest.predict_proba(preprocessed_input)
        gb_pred = self.gradient_boosting.predict_proba(preprocessed_input)
        lr_pred = self.logistic_regression.predict_proba(preprocessed_input)

        # Combine predictions (simple average of the three models)
        combined_pred = (rf_pred + gb_pred + lr_pred) / 3

        # Apply a threshold to the class-1 probability (example: threshold = 0.5).
        # The result is a binary 0/1 label, so class 2 is never predicted.
        threshold = 0.5
        pred = (combined_pred[:, 1] > threshold).astype(int)
        
        return pred

**Custom model definition**:
- **`CustomModel` Class**: This class inherits from `mlflow.pyfunc.PythonModel`.
- **`load_context`**: This method loads the pre-trained models, polynomial features transformer and scaler from MLflow artifacts. These artifacts are logged as part of the model when we log it with MLflow.
- **`preprocess`**: The input data is first expanded with polynomial features and then scaled using the `StandardScaler` before prediction.
- **`predict`**: This method preprocesses the input using the polynomial features transformer and scaler. It then obtains class probabilities from the pre-trained random forest, gradient boosting, and logistic regression models. The probabilities are combined (using a simple average in this case), and a threshold is applied to the class-1 probability to produce a binary 0/1 label. This example illustrates how to encapsulate preprocessing, prediction, and post-processing logic within a custom model.
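
The combining and thresholding steps can be sketched on their own with NumPy. The probability arrays below are made-up values for three hypothetical classifiers, not outputs of the models in this notebook. The sketch also shows `argmax` as one possible alternative when a full multiclass label is wanted instead of a binary one.

```python
import numpy as np

# Toy class probabilities from three hypothetical classifiers
# (rows = samples, columns = classes 0, 1, 2)
p_a = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]])
p_b = np.array([[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.1, 0.2, 0.7]])
p_c = np.array([[0.7, 0.3, 0.0], [0.3, 0.5, 0.2], [0.2, 0.1, 0.7]])

# Simple average: divide by the number of models so each row still sums to 1
mean_proba = (p_a + p_b + p_c) / 3

# Thresholding the class-1 column gives a binary "is class 1?" label
is_class_1 = (mean_proba[:, 1] > 0.5).astype(int)

# Taking the argmax instead gives a full multiclass label
multiclass_label = mean_proba.argmax(axis=1)

print(is_class_1.tolist(), multiclass_label.tolist())
```

In this toy data the third sample belongs to class 2, which the thresholded output reports as 0 while `argmax` reports it as 2.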


### Training feature transformers and models
We will train the feature transformers (polynomial features and scaling) and several classification models. We train these components before integrating them into the custom model because training each one separately lets us tune and validate each part of the workflow on its own. This modular approach also makes each component easier to debug and test independently, and allows components to be reused across different models or experiments. The custom model class then focuses on integrating and using the pre-trained components.

In [5]:
# Initialize and fit the feature transformer, scaler, and models
poly_features = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()

random_forest = RandomForestClassifier(n_estimators=50, random_state=42)
gradient_boosting = GradientBoostingClassifier(n_estimators=50, random_state=42)
logistic_regression = LogisticRegression(max_iter=200)

# Create a pipeline for polynomial features and scaling
pipeline = make_pipeline(poly_features, scaler)

# Fit the pipeline and models on the training data
X_train_poly_scaled = pipeline.fit_transform(X_train)
random_forest.fit(X_train_poly_scaled, y_train)
gradient_boosting.fit(X_train_poly_scaled, y_train)
logistic_regression.fit(X_train_poly_scaled, y_train)
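
The cell above relies on `make_pipeline` fitting the transformer objects it is given, so `poly_features` and `scaler` can later be logged as fitted components. The sketch below illustrates that behavior on random toy data, assuming the pipeline's default setting with no `memory` caching. The names `X_toy`, `poly`, and `X_manual` are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 4))  # toy stand-in for the four Iris features

poly = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()

# With the default memory=None, the pipeline fits these same objects in place
X_out = make_pipeline(poly, scaler).fit_transform(X_toy)

# 4 linear terms + 4 squares + 6 pairwise products = 14 columns
n_columns = X_out.shape[1]

# The standalone objects are now fitted and can be applied step by step
X_manual = scaler.transform(poly.transform(X_toy))
same_result = np.allclose(X_out, X_manual)

print(n_columns, same_result)
```

Applying the two fitted transformers one after the other, as the custom model's `preprocess` method does, gives the same result as the pipeline.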

### Logging the custom model to MLflow
Now, we will log the trained models and feature transformer as artifacts, and wrap them with our custom Python model.

In [6]:
# Start a new MLflow run
with mlflow.start_run(run_name="Custom Model with Scaler Example") as run:
    # Log the polynomial features transformer and retrieve its URI
    poly_features_uri = mlflow.sklearn.log_model(poly_features, artifact_path="poly_features").model_uri

    # Log the scaler and retrieve its URI
    scaler_uri = mlflow.sklearn.log_model(scaler, artifact_path="scaler").model_uri

    # Log the random forest model and retrieve its URI
    rf_model_uri = mlflow.sklearn.log_model(random_forest, artifact_path="random_forest").model_uri
    
    # Log the gradient boosting model and retrieve its URI
    gb_model_uri = mlflow.sklearn.log_model(gradient_boosting, artifact_path="gradient_boosting").model_uri
   
    # Log the logistic regression model and retrieve its URI
    lr_model_uri = mlflow.sklearn.log_model(logistic_regression, artifact_path="logistic_regression").model_uri
    
    # Instantiate the custom model; the artifact URIs are passed to log_model below
    custom_model = CustomModel()
    
    # Log the custom model
    mlflow.pyfunc.log_model(
        artifact_path="custom_model",
        python_model=custom_model,
        artifacts={
            "poly_features": poly_features_uri,
            "scaler": scaler_uri,
            "random_forest": rf_model_uri,
            "gradient_boosting": gb_model_uri,
            "logistic_regression": lr_model_uri,
        },
        conda_env=mlflow.pyfunc.get_default_conda_env()
    )

    # Get the run_id of the current run for later use
    run_id = run.info.run_id
    # Print the run ID for later use
    print(f"Run ID: {run_id}")


Run ID: b0516e1c2373403e8b7142ef84ab318c


- **Logging artifacts separately**: First, we log each component on its own before logging the custom model. Each component is then tracked individually in the MLflow run, which gives more granular control over versioning, comparison, and tracking.
    - **`mlflow.sklearn.log_model`**: Logs the trained component and returns a model info object whose `model_uri` we store (for example, in `scaler_uri`). The component is stored in MLflow's artifact repository, complete with its own metadata and a pickle file holding the fitted object.
- **Logging the custom model**: Then, we log the custom Python model with **`mlflow.pyfunc.log_model`**. In addition to the custom model's own metadata and pickle file, the logged model includes the metadata and pickle files of the associated artifacts (feature transformers and classification models).
    - **Linking artifacts**: The `artifacts` parameter is a dictionary that maps logical names (e.g., "poly_features", "scaler", etc.) to the URIs of the components logged earlier (`poly_features_uri`, `scaler_uri`, and so on). Because we reference the components by their model URIs, they must already be stored in MLflow's artifact store, which is why they are logged first. The same logical names are the keys of `context.artifacts` inside `load_context`.
    - `mlflow.pyfunc.log_model` creates a storage directory which contains a pickle file for the custom model itself and metadata including the logic to load the feature transformer and classification model artifacts. Inside this directory, subdirectories hold the previously logged feature transformers and classification models, each with its own pickle file and metadata files.
- **Conda environment**: The `conda_env` parameter specifies the environment needed to run the model, ensuring that the correct dependencies are installed when the model is deployed or reused.


### Loading and using the custom model
Now that the model is logged, we can load it and use it for predictions.

In [7]:
# Load the custom model
loaded_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/custom_model")

# Make predictions using the custom model on unscaled test data
predictions = loaded_model.predict(X_test)

# Evaluate the performance
accuracy = accuracy_score(y_test, predictions)
print(f"Custom model accuracy: {accuracy}")

Custom model accuracy: 0.68


After logging the custom model, we load it with `mlflow.pyfunc.load_model` and make predictions on the test set.
- **`mlflow.pyfunc.load_model`**: Loads the custom model from the specified URI, here `runs:/<run_id>/custom_model`.
- **`predict`**: The `predict` method in the custom model first preprocesses the input data using the polynomial features transformer and scaler. It then combines the class probabilities from the ensemble of pre-trained models (RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression). The probabilities are averaged, and a threshold on the class-1 probability yields a binary label. Because Iris has three classes and the thresholded output is only ever 0 or 1, samples of class 2 are always counted as errors, which limits the accuracy this example can reach.
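
A confusion matrix makes the effect of a binary output on three-class labels easy to see. The sketch below uses made-up labels and predictions, not the notebook's test set. It compares the accuracy with the best value reachable when class 2 can never be predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy three-class labels and binary predictions from a "class 1 or not" rule
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
y_pred = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Best possible accuracy when class 2 can never be predicted
ceiling = np.mean(y_true != 2)

print(acc, ceiling)
print(cm)
```

In this toy case every class-0 and class-1 sample is correct, so the accuracy equals the ceiling, and the last column of the matrix is empty because class 2 is never predicted.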