[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/monitoring/quickstart/monitoring-quickstart.ipynb)


# <a id="top">Monitoring quickstart</a>

This notebook illustrates a typical monitoring flow using Openlayer.


## <a id="toc">Table of contents</a>

1. [**Creating a project and an inference pipeline**](#inference-pipeline)

2. [**Uploading a reference dataset**](#reference-dataset)

3. [**Publishing batches of production data**](#publish-batches)

4. [**Publishing ground truths**](#ground-truths)

## <a id="inference-pipeline"> 1. Creating a project and an inference pipeline </a>

[Back to top](#top)

In [None]:
!pip install openlayer

In [None]:
import openlayer
from openlayer.tasks import TaskType

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")
project = client.create_or_load_project(
    name="Churn Prediction",
    task_type=TaskType.TabularClassification,
)

Now that you are authenticated and have a project on the platform, the next step is to create an inference pipeline. The inference pipeline is what enables monitoring for a project.

In [None]:
inference_pipeline = project.create_inference_pipeline()

# Or, to load an existing inference pipeline by name:
# inference_pipeline = project.load_inference_pipeline(name="Production")

## <a id="reference-dataset"> 2. Uploading a reference dataset </a>

[Back to top](#top)

A reference dataset is optional, but it enables drift monitoring. Ideally, it is a representative sample of the data used to train the deployed model. In this section, we load the dataset and then upload it to Openlayer with the `upload_reference_dataframe` method.
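
One way to obtain a representative sample is to sample the same fraction of rows from every class, so the class proportions of the reference dataset match those of the training set. The sketch below uses a toy DataFrame; the helper name `stratified_sample`, the 50% fraction, and the fixed seed are illustrative assumptions and not part of the Openlayer API.

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, label_col: str, frac: float, seed: int = 0) -> pd.DataFrame:
    """Sample the same fraction of rows from every class in `label_col`."""
    return df.groupby(label_col).sample(frac=frac, random_state=seed)


# Toy stand-in for a training set: 8 retained (0) and 4 exited (1) customers.
toy_training_set = pd.DataFrame(
    {
        "CreditScore": range(600, 612),
        "Exited": [0] * 8 + [1] * 4,
    }
)

reference_sample = stratified_sample(toy_training_set, label_col="Exited", frac=0.5)
print(reference_sample["Exited"].value_counts().to_dict())
```

The 2:1 class ratio of the toy training set is preserved in the sample.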

### <a id="download-reference"> Downloading the data </a>

In [None]:
%%bash

if [ ! -e "churn_train.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/churn_train.csv" --output "churn_train.csv"
fi

In [None]:
import pandas as pd

training_set = pd.read_csv("./churn_train.csv")

### <a id="upload-reference"> Uploading the dataset to Openlayer </a>

In [None]:
dataset_config = {
    "categoricalFeatureNames": ["Gender", "Geography"],
    "classNames": ["Retained", "Exited"],
    "featureNames": [
        "CreditScore", 
        "Geography",
        "Gender",
        "Age", 
        "Tenure",
        "Balance",
        "NumOfProducts",
        "HasCrCard",
        "IsActiveMember",
        "EstimatedSalary",
        "AggregateRate",
        "Year"
    ],
    "labelColumnName": "Exited",
    "label": "training"
}
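
Before uploading, it can help to confirm that every column named in the config actually exists in the DataFrame. The following is a minimal sketch with toy data; the helper `missing_config_columns` is hypothetical, and it only inspects the config keys that appear in this notebook.

```python
import pandas as pd


def missing_config_columns(df: pd.DataFrame, config: dict) -> list:
    """Return the column names referenced by `config` that `df` lacks."""
    referenced = list(config.get("featureNames", []))
    referenced += list(config.get("categoricalFeatureNames", []))
    # Keys whose value is a single column name.
    for key in ("labelColumnName", "timestampColumnName", "inferenceIdColumnName"):
        if key in config:
            referenced.append(config[key])
    # Preserve order while dropping duplicates.
    return [name for name in dict.fromkeys(referenced) if name not in df.columns]


toy_df = pd.DataFrame({"Age": [42], "Gender": ["Female"], "Exited": [1]})
toy_config = {
    "categoricalFeatureNames": ["Gender", "Geography"],
    "featureNames": ["Age", "Gender", "Geography"],
    "labelColumnName": "Exited",
}
print(missing_config_columns(toy_df, toy_config))
```

An empty list means every referenced column is present; here the toy DataFrame lacks `Geography`.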

In [None]:
inference_pipeline.upload_reference_dataframe(
    dataset_df=training_set,
    dataset_config=dataset_config
)

## <a id="publish-batches"> 3. Publishing batches of data </a>

[Back to top](#top)

In production, as the model makes predictions, the data can be published to Openlayer. This is done with the `publish_batch_data` method. 

The data published to Openlayer can include a column with **inference ids** and another with **timestamps** (UNIX sec format). Both are optional and receive default values if not provided. The inference id is particularly important if you wish to publish ground truths at a later time.
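
If your production data does not already carry these columns, they can be added before publishing. The sketch below is one possible approach on a toy DataFrame: the helper `add_tracking_columns` is hypothetical, the column names match the ones used in the batch config later in this notebook, and the timestamp is written as whole seconds since the UNIX epoch (rescale it if your setup expects a different unit).

```python
import time
import uuid

import pandas as pd


def add_tracking_columns(df: pd.DataFrame, timestamp=None) -> pd.DataFrame:
    """Return a copy of `df` with `inference_id` and `timestamp` columns."""
    out = df.copy()
    # One unique id per row, so ground truths can be matched to it later.
    out["inference_id"] = [str(uuid.uuid4()) for _ in range(len(out))]
    # Whole seconds since the UNIX epoch; defaults to the current time.
    out["timestamp"] = int(time.time()) if timestamp is None else int(timestamp)
    return out


toy_batch = pd.DataFrame({"Age": [42, 37, 51]})
tracked = add_tracking_columns(toy_batch, timestamp=1_700_000_000)
print(tracked.columns.tolist())
```

Generating the ids yourself means you can store them alongside your own records and reuse them when the ground truths become available.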

### <a id="download-batches"> Download the data </a>

In [None]:
%%bash

if [ ! -e "prod_data_no_ground_truths.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/prod_data_no_ground_truths.csv" --output "prod_data_no_ground_truths.csv"
fi

In [None]:
production_data = pd.read_csv("prod_data_no_ground_truths.csv")

In [None]:
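# Use position-based slicing: `.loc` includes both endpoints, which would
# place rows 342 and 684 in two batches each.
batch_1 = production_data.iloc[:342]
batch_2 = production_data.iloc[342:684]
batch_3 = production_data.iloc[684:]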

In [None]:
batch_1.head()
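
When splitting a DataFrame into consecutive batches, keep in mind that label-based `.loc` slices include the stop label, while position-based `.iloc` slices exclude the stop position. The toy example below (not the production data) shows how the two differ in the total number of rows they produce.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# Label-based slicing includes the stop label, so row 3 lands in both pieces.
loc_parts = [toy.loc[:3], toy.loc[3:]]
# Position-based slicing excludes the stop position, so the pieces do not overlap.
iloc_parts = [toy.iloc[:3], toy.iloc[3:]]

loc_total = sum(len(part) for part in loc_parts)
iloc_total = sum(len(part) for part in iloc_parts)
print(loc_total, iloc_total)
```

Only the `.iloc` pieces add up to the original 10 rows; the `.loc` pieces count row 3 twice.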

### <a id="publish-batches"> Publish to Openlayer </a>

Here, we simulate three calls to `publish_batch_data`. In practice, this call lives in your inference pipeline and runs after the model makes its predictions.
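
As a rough illustration of where such a call might sit in a serving path, the sketch below wraps prediction and publishing in one function. Everything here is a stand-in: `predict_and_publish`, `fake_predict`, `fake_publish`, and the `predictions` column name are hypothetical, and the publisher is injected so the example runs without a model or an Openlayer account.

```python
import pandas as pd


def predict_and_publish(features_df, predict_fn, publish_fn, batch_config):
    """Run the model on a batch, then hand the batch to a publisher."""
    batch_df = features_df.copy()
    # Hypothetical column name for the model output.
    batch_df["predictions"] = predict_fn(features_df)
    publish_fn(batch_df=batch_df, batch_config=batch_config)
    return batch_df


# Stand-ins so the sketch is self-contained.
published = []


def fake_predict(df):
    return [int(age > 40) for age in df["Age"]]


def fake_publish(batch_df, batch_config):
    published.append((batch_df, batch_config))


toy_features = pd.DataFrame({"Age": [42, 37, 51]})
result = predict_and_publish(
    toy_features, fake_predict, fake_publish, {"featureNames": ["Age"]}
)
print(len(published), result["predictions"].tolist())
```

In a real service, `publish_fn` could be `inference_pipeline.publish_batch_data`, which is called with the same `batch_df` and `batch_config` keyword arguments in this notebook.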

In [None]:
batch_config = {
    "categoricalFeatureNames": ["Gender", "Geography"],
    "classNames": ["Retained", "Exited"],
    "featureNames": [
        "CreditScore", 
        "Geography",
        "Gender",
        "Age", 
        "Tenure",
        "Balance",
        "NumOfProducts",
        "HasCrCard",
        "IsActiveMember",
        "EstimatedSalary",
        "AggregateRate",
        "Year"
    ],
    "timestampColumnName": "timestamp",
    "inferenceIdColumnName": "inference_id"
}


In [None]:
inference_pipeline.publish_batch_data(
    batch_df=batch_1,
    batch_config=batch_config
)

In [None]:
inference_pipeline.publish_batch_data(
    batch_df=batch_2,
    batch_config=batch_config
)

In [None]:
inference_pipeline.publish_batch_data(
    batch_df=batch_3,
    batch_config=batch_config
)

## <a id="ground-truths"> 4. Publishing ground truths for past batches </a>

[Back to top](#top)

The `publish_ground_truths` method updates the ground truths for data already published to the Openlayer platform. The inference id is used to match each ground truth with its corresponding row.
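
Conceptually, matching on the inference id resembles a join on that column. The pandas sketch below illustrates the idea with toy data; it is only an analogy and makes no claim about how Openlayer performs the merge internally.

```python
import pandas as pd

# Rows that were published earlier, each with its inference id.
published_rows = pd.DataFrame(
    {"inference_id": ["a", "b", "c"], "Age": [42, 37, 51]}
)
# Ground truths that arrive later, in any order and possibly incomplete.
toy_ground_truths = pd.DataFrame(
    {"inference_id": ["c", "a"], "Exited": [1, 0]}
)

# Left join: every published row is kept; rows without a ground truth get NaN.
merged = published_rows.merge(toy_ground_truths, on="inference_id", how="left")
print(merged)
```

Row `b` has no ground truth yet, so its `Exited` value stays missing until one is provided.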

### <a id="download-truth"> Download the data </a>

In [None]:
%%bash

if [ ! -e "prod_ground_truths.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/monitoring/prod_ground_truths.csv" --output "prod_ground_truths.csv"
fi

In [None]:
ground_truths = pd.read_csv("prod_ground_truths.csv")

### <a id="publish-truth"> Publish ground truths </a>

In [None]:
inference_pipeline.publish_ground_truths(
    df=ground_truths,
    ground_truth_column_name="Exited",
    inference_id_column_name="inference_id",
)