(batch-predict-tutorial)=
# Batch Prediction and Drift Detection

In this tutorial, we will use a function from the [MLRun Function Marketplace](https://www.mlrun.org/marketplace/) to perform [batch prediction](https://github.com/mlrun/functions/tree/development/batch_predict) using a logged model and a new prediction dataset. The function also calculates data drift by comparing the new prediction dataset with the original training set.

Make sure you have reviewed the basics in the MLRun [**Quick Start Tutorial**](../01-mlrun-basics.html).

Tutorial steps:
- [**Create an MLRun project**](#setup-project)
- [**Log a model with a given framework and training set**](#log-model-with-training-data)
- [**Import an MLRun function from the marketplace**](#import-batch-predict-function)
- [**Perform batch prediction and calculate data drift**](#run-batch-predict)

## MLRun Installation and Configuration

Before running this notebook, make sure `mlrun` is installed and that you have configured access to the MLRun service.

In [None]:
# Install MLRun if it is not already installed. Run this only once, then restart the notebook.
%pip install mlrun

## Setup Project

First, we will import the dependencies and create an [MLRun project](https://docs.mlrun.org/en/latest/projects/project.html). This will contain all of our models, functions, datasets, etc:

In [1]:
import mlrun
import os
import pandas as pd

> 2022-09-20 12:46:53,580 [info] Server and client versions are not the same: {'parsed_server_version': VersionInfo(major=1, minor=0, patch=4, prerelease=None, build=None), 'parsed_client_version': VersionInfo(major=1, minor=1, patch=0, prerelease=None, build=None)}


In [2]:
project = mlrun.get_or_create_project(name="batch-predict", context="./")

> 2022-09-20 12:47:09,013 [info] Created and saved project batch-predict: {'from_template': None, 'overwrite': False, 'context': './', 'save': True}
> 2022-09-20 12:47:09,014 [info] created project batch-predict and saved in MLRun DB


This tutorial will not focus on training a model, but rather will start from the point of already having a trained model with a corresponding training and prediction dataset.

We will be using the following model files and datasets to perform the batch prediction. The model is a [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from sklearn and the datasets are in `parquet` format.

In [3]:
model_path = mlrun.get_sample_path('models/batch-predict/model.pkl')
training_set_path = mlrun.get_sample_path('data/batch-predict/training_set.parquet')
prediction_set_path = mlrun.get_sample_path('data/batch-predict/prediction_set.parquet')

## View Data

The training data has 20 numerical features and a binary (0,1) label:

In [4]:
pd.read_parquet(training_set_path).head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,label
0,0.572754,0.171079,0.40308,0.955429,0.272039,0.360277,-0.995429,0.437239,0.991556,0.010004,...,0.112194,-0.319256,-0.392631,-0.290766,1.265054,1.037082,-1.200076,0.820992,0.834868,0
1,0.623733,-0.149823,-1.410537,-0.729388,-1.996337,-1.213348,1.461307,1.187854,-1.790926,-0.9816,...,0.428653,-0.50382,-0.798035,2.038105,-3.080463,0.408561,1.647116,-0.838553,0.680983,1
2,0.814168,-0.221412,0.020822,1.066718,-0.573164,0.067838,0.923045,0.338146,0.981413,1.481757,...,-1.052559,-0.241873,-1.232272,-0.010758,0.8068,0.661162,0.589018,0.522137,-0.924624,0
3,1.062279,-0.966309,0.341471,-0.737059,1.460671,0.367851,-0.435336,0.445308,-0.655663,-0.19622,...,0.641017,0.099059,1.902592,-1.024929,0.030703,-0.198751,-0.342009,-1.286865,-1.118373,1
4,0.195755,0.576332,-0.260496,0.841489,0.398269,-0.717972,0.81055,-1.058326,0.36861,0.606007,...,0.195267,0.876144,0.151615,0.094867,0.627353,-0.389023,0.662846,-0.857,1.091218,1


The prediction data has the same 20 numerical features, but no label. The label is what we will be predicting:

In [5]:
pd.read_parquet(prediction_set_path).head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19
0,-2.059506,-1.314291,2.721516,-2.132869,-0.693963,0.376643,3.01779,3.876329,-1.294736,0.030773,0.401491,2.775699,2.36158,0.173441,0.87951,1.141007,4.60828,-0.518388,0.12969,2.794967
1,-1.190382,0.891571,3.72607,0.67387,-0.252565,-0.729156,2.646563,4.782729,0.318952,-0.781567,1.473632,1.101721,3.7234,-0.466867,-0.056224,3.344701,0.194332,0.463992,0.292268,4.665876
2,-0.996384,-0.099537,3.421476,0.162771,-1.143458,-1.026791,2.114702,2.517553,-0.15462,-0.465423,-1.723025,1.729386,2.82034,-1.041428,-0.331871,2.909172,2.138613,-0.046252,-0.732631,4.716266
3,-0.289976,-1.680019,3.126478,-0.704451,-1.149112,1.174962,2.860341,3.753661,-0.326119,2.128411,-0.508,2.328688,3.397321,-0.93206,-1.44237,2.058517,3.881936,2.090635,-0.045832,4.197315
4,-0.294866,1.044919,2.924139,0.814049,-1.455054,-0.270432,3.380195,2.339669,1.029101,-1.171018,-1.459395,1.283565,0.677006,-2.147444,-0.49415,3.222041,6.219348,-1.91411,0.317786,4.143443
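
Because the prediction set is expected to carry the same feature columns as the training set, a quick schema check before launching the job can catch mismatches early. The sketch below is a hypothetical helper (`compare_feature_columns` is not part of MLRun) that works on plain lists of column names and assumes the label column is called `label`, as in the training set shown above:

```python
from typing import Iterable, List, Tuple


def compare_feature_columns(
    training_columns: Iterable[str],
    prediction_columns: Iterable[str],
    label_column: str = "label",  # assumption: label name used in this tutorial
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) feature columns in the prediction set."""
    # Features the model was trained on: every training column except the label.
    expected = [c for c in training_columns if c != label_column]
    actual = list(prediction_columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Column names mirroring the layout above: feature_0 .. feature_19 plus a label.
features = [f"feature_{i}" for i in range(20)]
missing, unexpected = compare_feature_columns(features + ["label"], features)
print("missing:", missing, "unexpected:", unexpected)
```

With the DataFrames from this tutorial, you could pass `pd.read_parquet(training_set_path).columns` and `pd.read_parquet(prediction_set_path).columns` as the two arguments.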


## Log Model with Training Data

Next, we will log the model using MLRun experiment tracking. This is usually done in a training pipeline, but you can also bring in your pre-trained models from other sources. See [Working with data and model artifacts](https://docs.mlrun.org/en/latest/training/working-with-data-and-model-artifacts.html) and [Automated experiment tracking](https://docs.mlrun.org/en/latest/concepts/auto-logging-mlops.html) for more information.

In this example, we log a training set with the model for future comparison. However, you can also pass your training set directly to the batch prediction function.

In [6]:
model_artifact = project.log_model(
    key="model",
    model_file=model_path,
    framework="sklearn",
    training_set=pd.read_parquet(training_set_path)
)

In [7]:
model_artifact.uri

'store://models/batch-predict/model#0:latest'

## Import Batch Predict Function

Then, we will import the [batch prediction](https://github.com/mlrun/functions/tree/development/batch_predict) function from the [MLRun Function Marketplace](https://www.mlrun.org/marketplace/):

In [8]:
fn = mlrun.import_function("hub://batch_predict:development").apply(mlrun.auto_mount())

## Run Batch Predict

Finally, we will perform our batch prediction by passing in our model and datasets. See the corresponding [batch predict example notebook](https://github.com/mlrun/functions/blob/development/batch_predict/batch_predict.ipynb) for the full list of supported parameters:

In [9]:
run = fn.run(
    handler="predict",
    inputs={
        "dataset": prediction_set_path,
        # If you do not log a dataset with your model, you can pass it in here:
        # "sample_set": training_set_path,
    },
    params={
        "model": model_artifact.uri,
        "perform_drift_analysis": True,
        "label_columns": "label",
    },
)

> 2022-09-20 12:47:57,786 [error] error getting build status: 404 Client Error: Not Found for url: http://mlrun-api:8080/api/v1/build/status?name=batch-predict&project=batch-predict&tag=&logs=no&offset=0&last_log_timestamp=0&verbose=no: details: {'reason': "MLRunNotFoundError('Function tag not found batch-predict/batch-predict')"}
> 2022-09-20 12:47:57,787 [info] Function is not deployed and auto_build flag is set, starting deploy...
> 2022-09-20 12:47:57,945 [info] Started building image: .mlrun/func-batch-predict-batch-predict:latest

> 2022-09-20 12:49:39,885 [info] starting run batch-predict-predict uid=44abc84aa0d740ba8942d2333cbaf33e DB=http://mlrun-api:8080
> 2022-09-20 12:49:40,174 [info] Job is running in the background, pod: batch-predict-predict-bhjsp
> 2022-09-20 12:49:48,378 [info] Server and client versions are not the same: {'parsed_server_version': VersionInfo(major=1, minor=0, patch=4, prerelease=None, build=None), 'parsed_client_version': VersionInfo(major=1, minor=1,

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
batch-predict,...3cbaf33e,0,Sep 20 12:49:48,completed,batch-predict-predict,v3io_user=nick; kind=job; owner=nick; mlrun/client_version=1.1.0; host=batch-predict-predict-bhjsp,dataset,model=store://models/batch-predict/model#0:latest; perform_drift_analysis=True; label_columns=label,drift_status=False; drift_metric=0.29934242566253266,prediction; drift_table_plot; features_drift_results





> 2022-09-20 12:49:52,662 [info] run executed, status=completed


## View Batch Predictions and Drift Status

These are the model's batch predictions on the prediction set:

In [10]:
run.artifact("prediction").as_df().head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,label
0,-2.059506,-1.314291,2.721516,-2.132869,-0.693963,0.376643,3.01779,3.876329,-1.294736,0.030773,...,2.775699,2.36158,0.173441,0.87951,1.141007,4.60828,-0.518388,0.12969,2.794967,0
1,-1.190382,0.891571,3.72607,0.67387,-0.252565,-0.729156,2.646563,4.782729,0.318952,-0.781567,...,1.101721,3.7234,-0.466867,-0.056224,3.344701,0.194332,0.463992,0.292268,4.665876,0
2,-0.996384,-0.099537,3.421476,0.162771,-1.143458,-1.026791,2.114702,2.517553,-0.15462,-0.465423,...,1.729386,2.82034,-1.041428,-0.331871,2.909172,2.138613,-0.046252,-0.732631,4.716266,0
3,-0.289976,-1.680019,3.126478,-0.704451,-1.149112,1.174962,2.860341,3.753661,-0.326119,2.128411,...,2.328688,3.397321,-0.93206,-1.44237,2.058517,3.881936,2.090635,-0.045832,4.197315,1
4,-0.294866,1.044919,2.924139,0.814049,-1.455054,-0.270432,3.380195,2.339669,1.029101,-1.171018,...,1.283565,0.677006,-2.147444,-0.49415,3.222041,6.219348,-1.91411,0.317786,4.143443,1


There is also a drift table plot that compares the drift between the training data and prediction data per feature:

![drift_table_plot](../../_static/images/tutorial/drift_table_plot.png)

In [None]:
run.artifact("drift_table_plot").show()

Finally, you also get a numerical drift metric and a Boolean flag indicating whether data drift was detected:

In [12]:
run.status.results

{'drift_status': False, 'drift_metric': 0.29934242566253266}
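
To give a feel for what a number like `drift_metric` can represent, here is a minimal, standard-library sketch of one common approach: bin a feature's training and prediction values into a shared histogram, then average the total variation distance and the Hellinger distance between the two distributions. Treat this as an illustration only. That the marketplace function uses this exact formula, binning, or averaging is an assumption, and all function names below are hypothetical.

```python
import math
from typing import List, Sequence


def normalized_histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Fraction of `values` falling in each of `bins` equal-width bins over [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so that v == hi lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions (0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete distributions (0 to 1)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def feature_drift(train: Sequence[float], new: Sequence[float], bins: int = 10) -> float:
    """Illustrative drift score: mean of total variation and Hellinger distances."""
    lo = min(min(train), min(new))
    hi = max(max(train), max(new))
    if lo == hi:  # both samples are the same constant, so there is no drift
        return 0.0
    p = normalized_histogram(train, lo, hi, bins)
    q = normalized_histogram(new, lo, hi, bins)
    return (total_variation(p, q) + hellinger(p, q)) / 2


same = feature_drift([0, 1, 2, 3], [0, 1, 2, 3])
disjoint = feature_drift([0, 1, 2], [8, 9, 10])
print(same, disjoint)
```

Identical samples score 0 and samples with no overlapping bins score close to 1, so a value near 0.3 would sit toward the low end of this scale. A per-dataset metric could then be taken as, for example, the mean of the per-feature scores.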

## Next Steps

In a production setting, you probably want to incorporate this as part of a larger pipeline or application.

For example, if you use this function for its prediction capabilities, you can pass the `prediction` output as the input to another pipeline step, store it in an external location such as S3, or send it to an application or user.

If you use this function for its drift detection capabilities, you can use the `drift_status` and `drift_metric` outputs to automate further pipeline steps, send a notification, or kick off a re-training pipeline.
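
As a rough sketch of that kind of automation, the hypothetical helper below maps a results dictionary shaped like `run.status.results` to a follow-up action. The action names and the `warn_threshold` value are assumptions chosen for illustration and are not MLRun defaults:

```python
from typing import Any, Mapping


def choose_next_step(results: Mapping[str, Any], warn_threshold: float = 0.25) -> str:
    """Pick a follow-up action from batch prediction results.

    `warn_threshold` is a hypothetical cutoff for sending an early warning.
    """
    if results.get("drift_status"):
        # Drift was flagged: kick off re-training.
        return "retrain"
    metric = results.get("drift_metric")
    if metric is not None and metric >= warn_threshold:
        # No drift flagged yet, but the metric is high enough to alert someone.
        return "notify"
    return "continue"


# Results dictionary copied from the run output shown earlier in this tutorial.
results = {"drift_status": False, "drift_metric": 0.29934242566253266}
action = choose_next_step(results)
print(action)
```

In a real pipeline, each returned action would be wired to an actual step, such as a notification hook or a training job.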