[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/tabular-classification/xgboost/xgboost.ipynb)


# <a id="top">Tabular classification using XGBoost</a>

This notebook illustrates how XGBoost models can be uploaded to the Openlayer platform.

**Important considerations:**
- **Categorical features.** Starting with `xgboost>=1.5`, XGBoost offers experimental support for [categorical data, available for public testing](https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html). We recommend encoding categorical features as illustrated in this notebook and **not** using the experimental `enable_categorical=True` option for models uploaded to Openlayer. The XGBoost package has shown flaky behavior when this option is enabled, which is why it is discouraged for now. If this is critical to you, feel free to [reach out](mailto:support@openlayer.com)!
- **Feature dtypes.** XGBoost models are very sensitive to input data types. Some of the explainability techniques used by Openlayer rely on synthetic data generated by perturbing the original data samples. In that process, `int` values might be cast to `float`, and if your XGBoost model expects an `int`, it will throw an error. To ensure that your model works well on the platform, **perform the casting inside the `predict_proba` function**, before creating the `xgb.DMatrix` and making predictions with the model.
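
The dtype consideration above can be sketched as a small helper that would be called at the start of `predict_proba`. This is only an illustration: the helper name `cast_integer_features`, the column names, and the sample values are hypothetical, and it assumes the integer columns contain no missing values. The mushroom dataset used in this notebook has only categorical features, so the notebook itself does not need it.

```python
import pandas as pd


def cast_integer_features(df: pd.DataFrame, integer_columns: list) -> pd.DataFrame:
    """Return a copy of `df` with the listed columns cast back to int64."""
    df = df.copy()
    for column in integer_columns:
        # Perturbed samples may arrive as floats such as 31.0; round first so
        # that a value like 30.9999 is not truncated to 30 by the cast.
        df[column] = df[column].round().astype("int64")
    return df


# Hypothetical perturbed input in which an integer feature became a float
perturbed = pd.DataFrame({"age": [31.0, 45.0], "income": [1200.5, 830.0]})
restored = cast_integer_features(perturbed, integer_columns=["age"])
print(restored.dtypes)
```

Inside a prediction interface, the returned DataFrame would then be passed to `xgb.DMatrix` in place of the raw input.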

## <a id="toc">Table of contents</a>

1. [**Getting the data and training the model**](#1)
    - [Downloading the dataset](#download)
    - [Preparing the data](#prepare)
    - [Training the model](#train)
    

2. [**Using Openlayer's Python API**](#2)
    - [Instantiating the client](#client)
    - [Creating a project](#project)
    - [Uploading datasets](#dataset)
    - [Uploading models](#model)
    - [Committing and pushing to the platform](#commit)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/tabular-classification/xgboost/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="1"> 1. Getting the data and training the model </a>

[Back to top](#top)

In this first part, we will get the dataset, pre-process it, split it into training and validation sets, and train a model. Feel free to skim through this section if you are already comfortable with how these steps look for an XGBoost model.   

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb

from sklearn.model_selection import train_test_split

### <a id="download">Downloading the dataset </a>

We have stored the dataset in the following S3 bucket. If, for some reason, you get an error reading the CSV directly from it, feel free to paste the URL into your browser and download the CSV file. Alternatively, you can find the dataset in [this Kaggle dataset](https://www.kaggle.com/datasets/uciml/mushroom-classification).

In [None]:
DATASET_URL = "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/mushrooms.csv"

In [None]:
df = pd.read_csv(DATASET_URL)
df.head()
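
If the direct read fails and you download the file manually, one way to keep the rest of the notebook unchanged is a small loader that prefers a local copy. This is a minimal sketch: the function name `load_dataset` and the default file name `mushrooms.csv` are assumptions, not part of the Openlayer API.

```python
from pathlib import Path

import pandas as pd


def load_dataset(url: str, local_path: str = "mushrooms.csv") -> pd.DataFrame:
    """Read the CSV from `local_path` if it exists, otherwise from `url`."""
    if Path(local_path).exists():
        return pd.read_csv(local_path)
    return pd.read_csv(url)


# Usage in this notebook (assumes DATASET_URL is defined as above):
# df = load_dataset(DATASET_URL)
```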

### <a id="prepare">Preparing the data</a>

In [None]:
def data_encode_one_hot(df, encoders):
    """ Encodes categorical features using one-hot encoding. """
    df = df.copy(True)
    df.reset_index(drop=True, inplace=True) # Causes NaNs otherwise
    for feature, enc in encoders.items():
        print(f"encoding {feature}")
        enc_df = pd.DataFrame(enc.transform(df[[feature]]).toarray(), columns=enc.get_feature_names([feature]))
        df = df.join(enc_df)
        df = df.drop(columns=feature)
    return df

In [None]:
def create_encoder_dict(df, categorical_feature_names):
    """ Creates encoders for each of the categorical features. 
        The predict function will need these encoders. 
    """
    from sklearn.preprocessing import OneHotEncoder
    encoders = {}
    for feature in categorical_feature_names:
        enc = OneHotEncoder(handle_unknown='ignore')
        enc.fit(df[[feature]])
        encoders[feature] = enc
    return encoders
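
The encoders are created with `handle_unknown='ignore'`, which matters once the model sees data it was not fitted on. As a small illustration on made-up values (the category `"z"` is hypothetical and the variable names are not used elsewhere in this notebook), a category unseen during fitting is encoded as a row of zeros instead of raising an error:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the encoder only sees the categories "b" and "x" while fitting
example_train = pd.DataFrame({"cap-shape": ["b", "x", "b"]})
example_new = pd.DataFrame({"cap-shape": ["x", "z"]})

example_encoder = OneHotEncoder(handle_unknown="ignore")
example_encoder.fit(example_train[["cap-shape"]])

# "x" maps to [0, 1]; the unseen "z" maps to [0, 0]
encoded = example_encoder.transform(example_new[["cap-shape"]]).toarray()
print(encoded)
```

This keeps the encoded column layout fixed, which is what the trained booster expects at prediction time.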

In [None]:
# replacing class names with 0 and 1
class_map = {"e": 0, "p": 1}

X, y = df.loc[:, df.columns != "class"], df[["class"]].replace(class_map)

In [None]:
encoders = create_encoder_dict(X, list(X.columns))

X_enc_one_hot = data_encode_one_hot(X, encoders)
X_enc_one_hot

In [None]:
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 0)
x_train_one_hot = data_encode_one_hot(x_train, encoders)
x_val_one_hot = data_encode_one_hot(x_val, encoders)

### <a id="train">Training the model</a>

In [None]:
# Using XGBoost data format
dtrain = xgb.DMatrix(x_train_one_hot, label=y_train)
dval = xgb.DMatrix(x_val_one_hot, label=y_val)

In [None]:
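# Shallow trees and very few boosting rounds keep training fast for this example
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
num_round = 2  # number of boosting rounds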

xgboost_model = xgb.train(param, dtrain, num_round)

In [None]:
preds = xgboost_model.predict(dval)
labels = dval.get_label()

In [None]:
print(
    "error rate=%f"
    % (
        sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i])
        / float(len(preds))
    )
)
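
The same error rate can also be computed without an explicit loop. The sketch below uses made-up probabilities and labels (the `example_*` names are hypothetical) to show the thresholding at 0.5 and the comparison against the ground truth:

```python
import numpy as np

# Hypothetical predicted probabilities for class 1 and ground truth labels
example_preds = np.array([0.1, 0.8, 0.6, 0.3])
example_labels = np.array([0.0, 1.0, 0.0, 0.0])

# Threshold at 0.5 to get class predictions, then average the mismatches
example_classes = (example_preds > 0.5).astype(int)
example_error_rate = float(np.mean(example_classes != example_labels))
print(f"error rate={example_error_rate:f}")
```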

## <a id="2"> 2. Using Openlayer's Python API</a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

In [None]:
!pip install openlayer

### <a id="client">Instantiating the client</a>

In [None]:
import openlayer

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")

### <a id="project">Creating a project on the platform</a>

In [None]:
from openlayer.tasks import TaskType

project = client.create_or_load_project(name="XGBoost project", 
                                        task_type=TaskType.TabularClassification,
                                        description="Evaluation of ML approaches")

### <a id="dataset">Uploading datasets</a>

In [None]:
# Add the ground truths to the original (un-encoded) datasets for Openlayer
x_val['class'] = y_val.values
x_train['class'] = y_train.values

In [None]:
# some important parameters
class_names = ["e", "p"]  # the classes on the dataset
feature_names = list(X.columns)  # feature names in the un-processed dataset
categorical_feature_names = feature_names # all features are categorical in this dataset

In [None]:
from openlayer.datasets import DatasetType

# Validation set
project.add_dataframe(
    df=x_val,
    dataset_type=DatasetType.Validation,
    class_names=class_names,
    label_column_name='class',
    feature_names=feature_names,
    categorical_feature_names=categorical_feature_names,
)

# Training set
project.add_dataframe(
    df=x_train,
    dataset_type=DatasetType.Training,
    class_names=class_names,
    label_column_name='class',
    feature_names=feature_names,
    categorical_feature_names=categorical_feature_names,
)

We can check that both datasets are now staged using the `project.status()` method. 

In [None]:
project.status()

### <a id="model">Uploading models</a>

To upload a model to Openlayer, you will need to create a model package, which is nothing more than a folder with all the necessary information to run inference with the model. The package should include the following:
1. A `requirements.txt` file listing the dependencies for the model.
2. Serialized model files, such as model weights, encoders, etc., in a format specific to the framework used for training (e.g. `.json` for XGBoost, `.pkl` for sklearn, `.pb` for TensorFlow, and so on.)
3. A `prediction_interface.py` file that acts as a wrapper for the model and implements the `predict_proba` function. 
4. A `model_config.yaml` file that provides information about the model to the Openlayer platform, such as the framework used, feature names, and categorical feature names.

Let's prepare the model package one piece at a time.
 

In [None]:
# Creating the model package folder (we'll call it `model_package`)
!mkdir -p model_package

**1. Adding the `requirements.txt` to the model package**

In [None]:
!cp requirements.txt model_package

**2. Serializing the model and other objects needed**

In [None]:
import pickle 

# Trained model
xgboost_model.save_model('model_package/model.json')

# Encoder for the categorical features
with open('model_package/encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)

**3. Writing the `prediction_interface.py` file**

In [None]:
%%writefile model_package/prediction_interface.py

import pickle
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import xgboost as xgb

PACKAGE_PATH = Path(__file__).parent


class XgboostModel:
    def __init__(self):
        """This is where the serialized objects needed should
        be loaded as class attributes."""
        self.model = xgb.Booster()
        self.model.load_model(PACKAGE_PATH / "model.json")
        
        with open(PACKAGE_PATH / "encoders.pkl", "rb") as encoders_file:
            self.encoders = pickle.load(encoders_file)

    def _data_encode_one_hot(self, df: pd.DataFrame) -> pd.DataFrame:
        """Pre-processing needed for our particular use case."""

        df = df.copy(True)
        df.reset_index(drop=True, inplace=True)  # Causes NaNs otherwise
        for feature, enc in self.encoders.items():
            enc_df = pd.DataFrame(
                enc.transform(df[[feature]]).toarray(),
                columns=enc.get_feature_names([feature]),
            )
            df = df.join(enc_df)
            df = df.drop(columns=feature)
        return df

    def predict_proba(self, input_data_df: pd.DataFrame):
        """Makes predictions with the model. Returns the class probabilities."""

        encoded_df = self._data_encode_one_hot(input_data_df)
        
        # Converting the data to the XGBoost data format
        data_xgb = xgb.DMatrix(encoded_df)
    
        # Making the predictions with the model
        preds = self.model.predict(data_xgb)
    
        # Post-processing the predictions to the format Openlayer expects
        preds_proba = [[1 - p, p] for p in preds]
        
        return preds_proba


def load_model():
    """Function that returns the wrapped model object."""
    return XgboostModel()
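
The post-processing step in `predict_proba` exists because a `binary:logistic` booster returns a single probability per row (the probability of class 1), while the returned value has one column per class. A minimal sketch on made-up values (the `example_*` names are hypothetical):

```python
import numpy as np

# Hypothetical raw booster outputs: P(class 1) for three rows
example_raw_preds = np.array([0.05, 0.92, 0.40])

# One row per sample, one column per class, in the same order as class_names
example_preds_proba = np.column_stack([1 - example_raw_preds, example_raw_preds])
print(example_preds_proba)
```

Calling `.tolist()` on the result gives the same list-of-lists structure that the list comprehension in `predict_proba` produces.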

**4. Creating the `model_config.yaml`**

In [None]:
import yaml 

model_config = {
    "name": "My XGBoost model",
    "model_type": "xgboost",
    "class_names": class_names,
    "categorical_feature_names": categorical_feature_names,
    "feature_names":feature_names
}

with open('model_package/model_config.yaml', 'w') as model_config_file:
    yaml.dump(model_config, model_config_file, default_flow_style=False)
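
With all four pieces written, a plain file-existence check can catch a missing file before running any validation. This is a rough sketch: the helper `missing_package_files` is hypothetical, and the file names are the ones used in this notebook rather than names required by Openlayer.

```python
from pathlib import Path

# File names written to the package in the steps above
EXPECTED_FILES = [
    "requirements.txt",
    "model.json",
    "encoders.pkl",
    "prediction_interface.py",
    "model_config.yaml",
]


def missing_package_files(package_dir: str) -> list:
    """Return the expected file names that are absent from `package_dir`."""
    package_path = Path(package_dir)
    return [name for name in EXPECTED_FILES if not (package_path / name).exists()]


# Usage in this notebook: an empty list means every file is in place
# print(missing_package_files("model_package"))
```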

Let's check that the model package contains everything needed:

In [None]:
test_ = x_val.loc[:, x_val.columns != 'class']

In [None]:
from openlayer.validators import ModelValidator

model_validator = ModelValidator(
    model_package_dir="model_package", 
    sample_data = test_.iloc[:10, :]
)
model_validator.validate()

Now, we are ready to add the model:

In [None]:
project.add_model(
    model_package_dir="model_package",
    sample_data=test_.iloc[:10, :]
)

We can check that both datasets and the model are staged using the `project.status()` method.

In [None]:
project.status()

### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the first project version to the platform. 

In [None]:
project.commit("Initial commit!")

In [None]:
project.status()

In [None]:
project.push()