# 1 - Prerequisites

In [15]:
# Ensure you have the dependencies for this notebook
%pip install -r logging_and_customizing_models.txt

Note: you may need to restart the kernel to use updated packages.


In [16]:
# Setting Up Experiment

import mlflow
mlflow.set_experiment("heart-disease-classifier")

<Experiment: artifact_location='', creation_time=1715347798718, experiment_id='684cdabe-c571-4073-829c-5e45196eede9', last_update_time=None, lifecycle_stage='active', name='heart-disease-classifier', tags={}>

In [17]:
# Reading the data

import pandas as pd

file_url = "https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv"
df = pd.read_csv(file_url)
df["thal"] = df["thal"].astype("category").cat.codes

In [18]:
# Split data into train and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.3
)

# 2 - Logging models using `autolog()`

In [19]:
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

with mlflow.start_run():

    mlflow.xgboost.autolog()

    model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)



# 3 - Logging models supported by MLflow

If you need to log a model in a particular way, you can use the `log_model` method instead of relying on autolog.

**Usually, you will log the model in this way when:**

* You need to indicate `pip` packages or dependencies different from the ones that are automatically detected.
* You need to indicate a `conda` environment different from the default one.
* Your model uses a signature different from the one inferred. This is especially important when the inputs are tensors, where the signature needs specific shapes.
* You want to include input examples.
* You want to include specific artifacts into the package that will be needed.
* The default behaviour of autolog doesn't fit your purpose.

To log a model, use the `log_model` method of the flavor you are working with:

In [20]:
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature

with mlflow.start_run():

    mlflow.xgboost.autolog(log_models=False)

    model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)

    # Infer the signature from the model inputs and the model's actual predictions
    signature = infer_signature(X_test, y_pred)
    mlflow.xgboost.log_model(model, "classifier", signature=signature)



In [21]:
# If you need to indicate a custom environment with packages, you can use:

import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature

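# Note: `_mlflow_conda_env` is a private MLflow helper (leading underscore), so it may change between releases
from mlflow.utils.environment import _mlflow_conda_env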

# Define custom packages for MLflow conda environment
custom_packages = _mlflow_conda_env(
    additional_conda_deps=None,
    additional_pip_deps=["xgboost==1.5.2"],
    additional_conda_channels=None,
)

with mlflow.start_run():
    mlflow.xgboost.autolog(log_models=False)

    model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)

    # Infer the signature from the model inputs and the model's actual predictions
    signature = infer_signature(X_test, y_pred)
    mlflow.xgboost.log_model(model, "classifier", signature=signature, conda_env=custom_packages)



# 4 - Logging custom models (Theory)

**MLflow works with many tools** like FastAI, TensorFlow, and Scikit-Learn, but sometimes you might need to use something different. In those cases, you can build a custom recipe for MLflow to understand your unique model.

**Here's when you might need a custom recipe:**

* **Your model uses special tools:** If your model relies on things that MLflow doesn't normally recognize, you can create a custom recipe to tell MLflow how to handle them.
* **Your model does more than predictions:** Maybe your model not only makes predictions but also performs other tasks like forecasting. A custom recipe can help MLflow understand these extra features.

For example, if you use Scikit-Learn to make forecasts (which it doesn't do by default), you'd create a custom recipe to explain this forecasting ability to MLflow.


**a) Logging custom models that are serializable**

**b) Logging custom models that are not serializable**
- Option 1: Use artifacts with the PythonModel object
- Option 2: Use a loader module

# 5 - Logging custom models that are serializable

**Saving and Loading Python Objects**

- **Making objects last:** In Python, you can save objects (like data or settings) to files so you can use them later. This is called serialization. The saved object is like a snapshot that can be brought back to life when needed.
- **Bringing objects back:** When you load a saved object from a file, it's like restoring a picture from a snapshot. You get all the original values, properties, and methods that the object had when it was saved.
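
The round trip described above can be sketched with Python's built-in `pickle` module. The `Threshold` class is a hypothetical toy object standing in for a trained model; it is not part of this notebook.

```python
import pickle
import tempfile
from pathlib import Path


class Threshold:
    """Hypothetical toy object standing in for a trained model."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, values):
        # Label each value 1 if it reaches the cutoff, else 0.
        return [int(v >= self.cutoff) for v in values]


original = Threshold(cutoff=0.5)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"

    # Serialize: write the object's state to a file.
    with open(path, "wb") as f:
        pickle.dump(original, f)

    # Deserialize: rebuild an equivalent object from that file.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.cutoff)
print(restored.predict([0.2, 0.5, 0.9]))
```

The restored object is a separate instance with the same attribute values and the same methods as the original.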

**Using the `pyfunc` Flavor**

- **MLflow's special recipe for any object:** MLflow has a feature called `pyfunc` that lets you save any kind of object as a model, as long as it meets two requirements:
    1. **Inherits from `mlflow.pyfunc.PythonModel`:** This is like making sure your object speaks the same language as `pyfunc`.
    2. **Has a `predict` method:** This method is like the object's main job, telling you what it can do (like making predictions).

**Example: Saving a Scikit-Learn Model**

- **Scikit-Learn models already speak the language:** If your model uses Scikit-Learn, you don't need `pyfunc`. Scikit-Learn has its own way of saving models that MLflow understands.
  - **Think of it like using a built-in translator:** You don't need a separate recipe because Scikit-Learn models already speak the language MLflow expects.

In [22]:
# Custom wrapper around an XGBoost model that returns class probabilities instead of class labels:

from mlflow.pyfunc import PythonModel, PythonModelContext


class ModelWrapper(PythonModel):
    def __init__(self, model):
        self._model = model

    def predict(self, context: PythonModelContext, data):
        # `predict` doesn't have to keep its usual semantic meaning: you can call model.recommend(), model.forecast(), etc. here
        return self._model.predict_proba(data)

    # You can even add extra functions if you need to. Since the model is serialized,
    # all of them will be available when you load your model back.
    def predict_batch(self, data):
        pass

In [23]:
# Log and run custom model

import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature

with mlflow.start_run():
    mlflow.xgboost.autolog(log_models=False)

    model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    y_probs = model.predict_proba(X_test)

    accuracy = accuracy_score(y_test, y_probs.argmax(axis=1))
    mlflow.log_metric("accuracy", accuracy)

    signature = infer_signature(X_test, y_probs)
    mlflow.pyfunc.log_model("classifier", python_model=ModelWrapper(model), signature=signature)



# 6 - Logging custom models that are not serializable (Theory)

- Some machine learning models can't be saved with Python's standard serialization, so you need a different way to store them for later use
- MLflow can help with this: it takes all the pieces your model needs and bundles them together

- A model is not serializable when it cannot be saved as a Pickle file
- This includes models that hold references to code that can't be serialized, models that do not support serialization, and models that provide a more efficient way to be persisted on disk
- In these cases, you must use a different method to persist the artifacts that your model needs to run
- MLflow then snapshots these artifacts and packages them for you. There are two ways to do this, depending on your preferences:

- **Option 1: Use artifacts with the `PythonModel` object**
- **Option 2: Use a loader module**
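
To illustrate why a separate persistence method is sometimes needed, here is a minimal sketch using only the standard library. `LockedScaler` is a hypothetical class that holds a thread lock, which `pickle` cannot serialize, so the state is written to a JSON file instead. That file plays the role of an artifact.

```python
import json
import pickle
import tempfile
import threading
from pathlib import Path


class LockedScaler:
    """Hypothetical model-like object that holds an unpicklable resource."""

    def __init__(self, factor):
        self.factor = factor
        self._lock = threading.Lock()  # thread locks cannot be pickled

    def predict(self, values):
        with self._lock:
            return [v * self.factor for v in values]

    def save(self, path):
        # Persist only the state needed to rebuild the object.
        Path(path).write_text(json.dumps({"factor": self.factor}))

    @classmethod
    def load(cls, path):
        state = json.loads(Path(path).read_text())
        return cls(state["factor"])


model = LockedScaler(factor=2.0)

# Pickling the whole object fails because of the lock attribute.
try:
    pickle.dumps(model)
    pickled_ok = True
except (TypeError, pickle.PicklingError, AttributeError) as err:
    pickled_ok = False
    print(f"pickle failed: {err}")

# Saving and reloading through a custom file format works.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "scaler.json"
    model.save(artifact)
    restored = LockedScaler.load(artifact)

print(restored.predict([1, 2, 3]))
```

Both options below follow the same idea: the model writes its own file, and the loading code rebuilds the object from that file.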

### Option 1: Use artifacts with the PythonModel object

- Use this if you want to retain the state of your model's properties
- For instance, in a recommender system you might want to store the number of elements to recommend to any user as a parameter
- Here, you implement a model wrapper as in the previous section, but you use `artifacts` to tell MLflow which extra files to include so that the model state can be loaded

In [24]:
from mlflow.pyfunc import PythonModel, PythonModelContext


class ModelWrapper(PythonModel):
    def load_context(self, context: PythonModelContext):
        from xgboost import XGBClassifier

        self._model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
        self._model.load_model(context.artifacts["model"])

    def predict(self, context: PythonModelContext, data):
        return self._model.predict_proba(data)

In [25]:
# Log and run custom model

import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature

with mlflow.start_run():
    mlflow.xgboost.autolog(log_models=False)

    model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    y_probs = model.predict_proba(X_test)

    accuracy = accuracy_score(y_test, y_probs.argmax(axis=1))
    mlflow.log_metric("accuracy", accuracy)

    model_path = "xgb.model"
    model.save_model(model_path)

    signature = infer_signature(X_test, y_probs)
    mlflow.pyfunc.log_model(
        "classifier",
        python_model=ModelWrapper(),
        artifacts={"model": model_path},
        signature=signature,
    )



### Option 2: Use a loader module

- Sometimes your model logic is complex and there are several source code files being used to make your model work
- This would be the case when you have a Python library for your model for instance
- In this scenario, you want to package the library along with your model so both can move from one place to another as a single piece

In [26]:
%%writefile loader_module.py

class MyModel():
    def __init__(self, model):
        self._model = model

    def predict(self, data):
        return self._model.predict_proba(data)

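def _load_pyfunc(data_path: str):
    import os

    # This module is imported on its own, so it needs its own XGBoost import
    from xgboost import XGBClassifier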

    model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
    model.load_model(os.path.abspath(data_path))

    return MyModel(model)

Overwriting loader_module.py


**How to Use This Module:**

1. **Save Your Trained Model:** Save your trained XGBoost model to a file (for example, `xgb.model`) with `model.save_model`.
2. **Import the Module:** To test it directly in your main script or notebook, import it with `from loader_module import _load_pyfunc`.
3. **Load the Model:** Call `_load_pyfunc` with the path to your saved model file. It returns a `MyModel` instance. When the model is logged with `loader_module="loader_module"`, MLflow makes this call for you at load time.
4. **Make Predictions:** Use the `predict` method of the returned `MyModel` object on new data, for example `predictions = my_model.predict(new_data)`.
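
The steps above can be sketched without XGBoost to show the loader contract on its own: a module is resolved by name and its `_load_pyfunc` function receives a data path. Everything here (`toy_loader`, `offset.txt`, the toy `MyModel`) is a hypothetical stand-in, and the snippet only approximates what happens when a model is loaded through a loader module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for loader_module.py: it reads a plain text file
# instead of an XGBoost model so the sketch has no extra dependencies.
MODULE_SOURCE = '''
class MyModel:
    def __init__(self, offset):
        self._offset = offset

    def predict(self, data):
        return [x + self._offset for x in data]


def _load_pyfunc(data_path):
    with open(data_path) as f:
        return MyModel(float(f.read()))
'''

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "toy_loader.py").write_text(MODULE_SOURCE)
    (tmp_path / "offset.txt").write_text("10")

    # Make the module importable, then resolve it from its name as a string.
    sys.path.insert(0, str(tmp_path))
    importlib.invalidate_caches()
    try:
        loader = importlib.import_module("toy_loader")
        my_model = loader._load_pyfunc(str(tmp_path / "offset.txt"))
    finally:
        sys.path.remove(str(tmp_path))

predictions = my_model.predict([1, 2, 3])
print(predictions)
```

Because the loader only needs a module name and a path, the model's code and data can travel together as ordinary files.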


In [27]:
# Log and run custom model

import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature

with mlflow.start_run():
    mlflow.xgboost.autolog(log_models=False)

    model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    y_probs = model.predict_proba(X_test)

    accuracy = accuracy_score(y_test, y_probs.argmax(axis=1))
    mlflow.log_metric("accuracy", accuracy)

    model_path = "xgb.model"
    model.save_model(model_path)

    signature = infer_signature(X_test, y_probs)
    mlflow.pyfunc.log_model(
        "classifier",
        data_path=model_path,
        code_path=["loader_module.py"],
        loader_module="loader_module",
        signature=signature,
    )

