diff --git a/modules/src/histogram_data_drift/histogram_data_drift.ipynb b/modules/src/histogram_data_drift/histogram_data_drift.ipynb index 54a15016..eceb28ca 100644 --- a/modules/src/histogram_data_drift/histogram_data_drift.ipynb +++ b/modules/src/histogram_data_drift/histogram_data_drift.ipynb @@ -1,29 +1,307 @@ { "cells": [ { + "cell_type": "markdown", + "id": "283b6000-4acd-4eb3-bf51-25ee79e9e5dc", + "metadata": {}, + "source": [ + "# Histogram Data Drift Demo\n", + "The Histogram Data Drift monitoring app is MLRun’s default data drift application for model monitoring. It’s considered a built-in app within the model monitoring flow and is deployed by default when model monitoring is enabled for a project. For more information, see the [MLRun documentation](https://docs.mlrun.org/en/latest/model-monitoring/index.html#model-monitoring-applications).\n", + "\n", + "This notebook walks through a simple example of using this app from the hub to monitor data drift between a baseline dataset and a new dataset, using the `evaluate()` method." 
+ ] + }, + { + "cell_type": "markdown", + "id": "da432405-e8bb-400c-b1e0-45e31b0571f1", + "metadata": {}, + "source": [ + "## Set up a project and prepare the data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "62fcc7a4-4df5-4f2e-bd97-6aa831bbf958", + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "project = mlrun.get_or_create_project(\"histogram-data-drift-demo\",'./histogram-data-drift-demo')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d7ec1628-0303-4bbb-ba34-5cd96eaef304", + "metadata": {}, + "outputs": [], + "source": [ + "sample_data = mlrun.get_sample_path(\"data/batch-predict/training_set.parquet\")\n", + "reference_data = mlrun.get_sample_path(\"data/batch-predict/prediction_set.parquet\")" + ] + }, + { + "cell_type": "markdown", + "id": "072f1411-33a2-444e-88bf-76d9394d7877", + "metadata": {}, + "source": [ + "## Get the module from the hub and edit its defaults" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5c04dec9-ea6e-410e-a36d-42a71a223caa", + "metadata": {}, + "outputs": [], + "source": [ + "hub_mod = mlrun.get_hub_module(\"hub://histogram_data_drift\", download_files=True)\n", + "src_file_path = hub_mod.get_module_file_path()" + ] + }, + { + "cell_type": "markdown", + "id": "ce26e487-bfe5-442c-9d5a-04a8d75407a6", + "metadata": {}, + "source": [ + "Since the histogram data drift application doesn’t produce artifacts by default, we need to modify the class defaults. This can be done in one of two ways: either by editing the downloaded source file directly and then evaluating with the standard class, or - as we’ll do now - by adding an inheriting class to the same file and evaluating using that new class." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "055a31d8-00fd-4f55-b07c-1169db6af919", + "metadata": {}, + "outputs": [], + "source": [ + "# Append an inheriting class to the downloaded source file to change the default parameters\n", + "wrapper_code = \"\"\"\n", + "class HistogramDataDriftApplicationWithArtifacts(HistogramDataDriftApplication):\n", + "    # The same histogram application but with artifacts\n", + "\n", + "    def __init__(self) -> None:\n", + "        super().__init__(produce_json_artifact=True, produce_plotly_artifact=True)\n", + "\"\"\"\n", + "with open(src_file_path, \"a\") as f:\n", + "    f.write(wrapper_code)" + ] + }, + { + "cell_type": "markdown", + "id": "c17b176b-f838-472f-aaeb-7cedaeb66b56", + "metadata": {}, + "source": [ + "Now we can import the file as a module, using the `module()` method." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "6f57d3c9-9e7e-4fde-b78b-2daf799893e1", + "metadata": {}, + "outputs": [], + "source": [ + "app_module = hub_mod.module()\n", + "hist_app = app_module.HistogramDataDriftApplicationWithArtifacts # or the standard class if you chose to modify its code" + ] + }, + { + "cell_type": "markdown", + "id": "a017bc5a-4935-456b-8648-57c11e11df27", + "metadata": {}, + "source": [ + "We are now ready to call `evaluate()`. Note that the run is linked to the current (active) project." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "c20fc990-d0e6-4aab-a576-29cea322bfb5", "metadata": {}, + "outputs": [], + "source": [ + "run_result = hist_app.evaluate(\n", + "    func_path=hub_mod.get_module_file_path(),\n", + "    sample_data=sample_data,\n", + "    reference_data=reference_data,\n", + "    run_local=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "661cdf4d-ee2a-4156-8a71-59f2a1e3b9eb", + "metadata": {}, + "source": [ + "## Examine the results" + ] + }, + { "cell_type": "markdown", - "source": "# Histogram Data Drift Demo", - "id": "2517d91b275da01d" + "id": 
"e715b6aa-75c0-4352-b98f-bd5a790e1d06", + "metadata": {}, + "source": [ + "First, we'll print the mean metrics and the general drift result:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "3688d6a0-6cae-4141-8851-dfd12842c484", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "hellinger_mean : 0.34211088243167637\n", + "kld_mean : 2.2839485090490426\n", + "tvd_mean : 0.30536\n", + "general_drift : 0.3237354412158382\n" + ] + } + ], + "source": [ + "for i in range(3):\n", + "    metric = run_result.status.results[\"return\"][i]\n", + "    print(metric[\"metric_name\"], \": \", metric[\"metric_value\"])\n", + "result = run_result.status.results[\"return\"][3]\n", + "print(result[\"result_name\"], \": \", result[\"result_value\"])" + ] + }, + { + "cell_type": "markdown", + "id": "0422ca13-661b-4574-ad51-d1665be6acdb", + "metadata": {}, + "source": [ + "We can also examine these metrics per feature, along with other metrics, using the artifacts the app generated for us.\n", + "\n", + "The rightmost column indicates whether the feature has drifted. The drift decision is based on the per-feature mean of the Total Variation Distance (TVD) and Hellinger distance scores. In the histogram-data-drift application, the \"Drift detected\" threshold is 0.7 and the \"Drift suspected\" threshold is 0.5." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "d9e7e688-6a71-4b9b-8b99-b2d7f42077e0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "\n", + "