Commit
[Docs] Add/update model monitoring (#5401)
jillnogold committed May 28, 2024
1 parent cc385be commit b9dbe97
Showing 27 changed files with 2,664 additions and 695 deletions.
Binary file added docs/_static/image_sources/model-monitoring.pptx
Binary file not shown.
Binary file added docs/_static/images/model-monitoring.png
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/concepts/monitoring.md
@@ -1,4 +1,4 @@
(model-monitoring)=
(model-monitoring-overview)=
# Model monitoring

By definition, ML models in production make inferences on constantly changing data. Even models that have been trained on massive data sets, with the most meticulously labelled data, start to degrade over time, due to concept drift. Changes in the live environment due to changing behavioral patterns, seasonal shifts, new regulatory environments, market volatility, etc., can have a big impact on a trained model’s ability to make accurate predictions.
8 changes: 7 additions & 1 deletion docs/conf.py
@@ -169,7 +169,13 @@ def current_version():
"onnx",
]

redirects = {"functions-architecture": "functions.html"}
redirects = {
    "runtimes/functions-architecture": "runtimes/functions.html",
    "monitoring/initial-setup-configuration": "monitoring/model-monitoring-deployment.html",
    "tutorials/05-batch-infer.ipynb": "tutorials/06-batch-infer.ipynb",
    "tutorials/06-model-monitoring.ipynb": "tutorials/05-model-monitoring.ipynb",
}

smartquotes = False

# -- Autosummary -------------------------------------------------------------
2 changes: 1 addition & 1 deletion docs/data-prep/ingest-data-fs.md
@@ -12,7 +12,7 @@ the ingestion process runs the graph transformations, infers metadata and stats,

When targets are not specified, data is stored in the configured default targets (i.e. NoSQL for real-time and Parquet for offline).

### Ingestion engines
## Ingestion engines

MLRun supports several ingestion engines:
- `storey` engine (default) is designed for real-time data (e.g. individual records) that will be transformed using Python functions and classes
16 changes: 8 additions & 8 deletions docs/index.md
@@ -62,7 +62,7 @@ Project access can be restricted to a set of users and roles.

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Projects and automation <./projects/project.html>`
{bdg-link-info}`Projects and automation <./projects/ci-cd-automate.html>`
{bdg-link-info}`CI/CD integration <./projects/ci-integration.html>`
<br> {octicon}`code-square` **Tutorials:**
{bdg-link-primary}`Quick start <./tutorials/01-mlrun-basics.html>`
@@ -79,7 +79,7 @@ In addition, the MLRun [**Feature store**](./feature-store/feature-store.html) a

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Ingest and process data <ingesting-process-data>`
{bdg-link-info}`Ingest and process data <./data-prep/index.html>`
{bdg-link-info}`Feature store <./feature-store/feature-store.html>`
{bdg-link-info}`Data and artifacts <./concepts/data.html>`
<br> {octicon}`code-square` **Tutorials:**
@@ -94,7 +94,7 @@ MLRun allows you to easily build ML pipelines that take data from various source

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Develop and train models <development>`
{bdg-link-info}`Develop and train models <./development/index.html>`
{bdg-link-info}`Model training and tracking <./development/model-training-tracking.html>`
{bdg-link-info}`Batch runs and workflows <./concepts/runs-workflows.html>`
<br> {octicon}`code-square` **Tutorials:**
@@ -111,7 +111,7 @@ MLRun rapidly deploys and manages production-grade real-time or batch applicatio

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Deploy models and applications <deployment>`
{bdg-link-info}`Deploy models and applications <./deployment/index.html>`
{bdg-link-info}`Realtime pipelines <./serving/serving-graph.html>`
{bdg-link-info}`Batch inference <./deployment/batch_inference.html>`
<br> {octicon}`code-square` **Tutorials:**
@@ -129,10 +129,10 @@ Observability is built into the different MLRun objects (data, functions, jobs,

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Monitor and alert <monitoring>`
{bdg-link-info}`Model monitoring overview <./monitoring/model-monitoring-deployment.html>`
{bdg-link-info}`Monitor and alert <./monitoring/model-monitoring.html>`
{bdg-link-info}`Model monitoring overview <./monitoring/index.html>`
<br> {octicon}`code-square` **Tutorials:**
{bdg-link-primary}`Model monitoring and drift detection <./tutorials/05-model-monitoring.html>`
{bdg-link-primary}`Realtime monitoring and drift detection <./tutorials/05-model-monitoring.html>`
`````

<a id="core-components"></a>
@@ -199,7 +199,7 @@ MLRun includes the following major components:

**{ref}`Real-time serving pipeline <serving-graph>`:** Rapid deployment of scalable data and ML pipelines using real-time serverless technology, including API handling, data preparation/enrichment, model serving, ensembles, driving and measuring actions, etc.

**{ref}`Real-time monitoring <monitoring>`:** Monitors data, models, resources, and production components and provides a feedback loop for exploring production data, identifying drift, alerting on anomalies or data quality issues, triggering retraining jobs, measuring business impact, etc.
**{ref}`Real-time monitoring <monitoring-overview>`:** Monitors data, models, resources, and production components and provides a feedback loop for exploring production data, identifying drift, alerting on anomalies or data quality issues, triggering retraining jobs, measuring business impact, etc.



72 changes: 49 additions & 23 deletions docs/monitoring/index.md
@@ -1,39 +1,65 @@
(monitoring)=
(monitoring-overview)=

# Monitor and alert
# Model monitoring

```{note}
Monitoring is supported by Iguazio's streaming technology, and open-source integration with Kafka.
```

```{note}
This is currently a beta feature.
```
In v1.6.0, MLRun introduces a {ref}`new paradigm of model monitoring <model-monitoring>`.
The {ref}`legacy mode <legacy-model-monitoring>` is currently supported only for the CE version of MLRun.

The MLRun's model monitoring service includes built-in model monitoring and reporting capability. With monitoring you get
MLRun's model monitoring service includes built-in model monitoring and reporting capabilities. With monitoring you get
out-of-the-box analysis of:

- **Model performance**: machine learning models train on data. It is important you know how well they perform in production.
- **Continuous Assessment**: Model monitoring involves the continuous assessment of deployed machine learning models in real-time.
It's a proactive approach to ensure that models remain accurate and reliable as they interact with live data.
- **Model performance**: Machine learning models train on data. It is important to know how well they perform in production.
When you analyze model performance, it is important to monitor not just the overall model performance, but also the
feature-level performance. This gives you better insights for the reasons behind a particular result
- **Data drift**: the change in model input data that potentially leads to model performance degradation. There are various
statistical metrics and drift metrics that you can use in order to identify data drift.
- **Concept drift**: applies to the target. Sometimes the statistical properties of the target variable, which the model is
trying to predict, change over time in unforeseen ways.
- **Operational performance**: applies to the overall health of the system. This applies to data (e.g., whether all the
expected data arrives to the model) as well as the model (e.g., response time, and throughput).
feature-level performance. This gives you better insights for the reasons behind a particular result.
- **Data drift**: The change in model input data that potentially leads to model performance degradation. There are various
statistical metrics and drift metrics that you can use to identify data drift.
- **Concept drift**: The statistical properties of the target variable (what the model is predicting) change over time.
In other words, the live data no longer matches the data the model was trained on, so for this new data the accuracy of the model's predictions is low. Drift analysis statistics are computed once an hour. See more details in <a href="https://www.iguazio.com/glossary/concept-drift/" target="_blank">Concept Drift</a>.
- **Operational performance**: The overall health of the system. This applies to data (e.g., whether all the
expected data arrives at the model) as well as the model (e.g., response time and throughput).

You can set up notifications on various channels once an issue is detected. For example, you can send a notification
to your IT department via email and Slack when operational performance metrics pass a threshold. You can also set up automated actions, for example,
call a CI/CD pipeline when data drift is detected and allow a data scientist to review the model with the revised data.
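
As a rough illustration of the notification mechanism, here is a hedged sketch of attaching a Slack notification to a retraining job. It assumes the general-purpose `mlrun.model.Notification` class and the `notifications` parameter of `run()`; the webhook URL and hub function are placeholders, and the exact wiring for monitoring-driven alerts may differ by version:

```python
import mlrun

# A sketch only: ping IT on Slack when a retraining job completes or fails.
# The webhook URL is a placeholder; verify the Notification fields against
# your MLRun version.
slack_note = mlrun.model.Notification(
    kind="slack",
    when=["completed", "error"],
    name="ops-alert",
    message="Drift-triggered retraining job finished",
    severity="info",
    params={"webhook": "https://hooks.slack.com/services/..."},  # placeholder
)

retrain_fn = mlrun.import_function("hub://auto_trainer")  # example hub function
retrain_fn.run(name="retrain-on-drift", notifications=[slack_note])
```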

## Architecture

<img src="../_static/images/model-monitoring.png" width="1100" >

The model monitoring process flow starts with collecting operational data from a function in the model serving pod. The model
monitoring stream pod forwards the data to a Parquet database.
The controller periodically checks the Parquet DB for new data and forwards it to the relevant application.
Each monitoring application is a separate Nuclio real-time function. Each one listens to a stream that is filled by
the monitoring controller on each `base_period` interval.
The stream function examines the log entry and processes it into statistics, which are then written to the statistics databases (Parquet file, time-series database, and key-value database).
The monitoring stream function writes the Parquet files using a basic storey ParquetTarget. Additionally, there is a monitoring feature set that refers
to the same target. You can use `get_offline_features` to read the data from that feature set, as sketched below.
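
For example, a minimal sketch of reading that data back through the feature store; the feature-set name `monitoring-feature-set` is hypothetical and stands in for the feature set your deployment actually creates:

```python
import mlrun.feature_store as fstore

# Build a feature vector over the monitoring feature set; the feature-set
# name below is hypothetical -- use the one created by your deployment.
vector = fstore.FeatureVector(
    "monitoring-data", features=["monitoring-feature-set.*"]
)

# get_offline_features reads the Parquet target behind the feature set
resp = fstore.get_offline_features(vector)
df = resp.to_dataframe()
print(df.head())
```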

In parallel, an MLRun job runs, reading the Parquet files and performing drift analysis. The drift analysis data is stored so
that the user can retrieve it in the Iguazio UI or in a Grafana dashboard.

You have the option to set up notifications on various channels once an issue is detection. For example, you can set-up notification
to your IT via email and slack when operational performance metrics pass a threshold. You can also set-up automated actions, for example,
call a CI/CD pipeline when data drift is detected and allow a data scientist to review the model with the revised data.
When you enable model monitoring, you effectively deploy three components (sketched in the snippet after this list):
- application controller function: handles the monitoring processing and triggers the apps, which in turn trigger the writer. The controller is a scheduled batch job whose frequency is determined by `base_period`.
- stream function: monitors the log of the data stream. It is triggered when a new log entry is detected. The monitored data is used to create real-time dashboards, detect drift, and analyze performance.
- writer function: writes to the database and outputs alerts.
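
A minimal sketch of enabling these components on a project, assuming the `enable_model_monitoring` project method and the `base_period` parameter described above (the exact signature may vary across MLRun versions):

```python
import mlrun

project = mlrun.get_or_create_project("my-project", context="./")

# Deploys the controller, stream, and writer functions described above.
# base_period is the controller's scheduling interval, in minutes.
project.enable_model_monitoring(base_period=10)
```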

Refer to the [**model monitoring & drift detection tutorial**](../tutorials/05-model-monitoring.html) for an end-to-end example.

## Common terminology
The following terms are used in all the model monitoring pages (a short numeric sketch of the three distance metrics follows the list):
* **Total Variation Distance** (TVD) &mdash; The statistical difference between the actual predictions and the model's trained predictions.
* **Hellinger Distance** &mdash; A type of f-divergence that quantifies the similarity between the actual predictions and the model's trained predictions.
* **Kullback–Leibler Divergence** (KLD) &mdash; A measure of how the probability distribution of the actual predictions differs from the model's trained reference probability distribution.
* **Model Endpoint** &mdash; A combination of a model and a runtime function that can be a deployed Nuclio function or a job runtime. One function can run multiple endpoints; however, statistics are saved per endpoint.
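
For intuition, here is a minimal NumPy sketch of the three distance metrics above, computed over two discrete (histogram) distributions. This is illustrative only, not MLRun's internal implementation:

```python
import numpy as np

def tvd(p, q):
    # Total Variation Distance: half the L1 distance between distributions
    return 0.5 * np.sum(np.abs(p - q))

def hellinger(p, q):
    # Hellinger distance: an f-divergence, bounded in [0, 1]
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def kld(p, q, eps=1e-12):
    # Kullback-Leibler divergence of p from reference q (asymmetric, unbounded)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return np.sum(p * np.log(p / q))

trained = np.array([0.1, 0.2, 0.4, 0.3])     # reference (training) histogram
actual = np.array([0.15, 0.25, 0.35, 0.25])  # live-predictions histogram
print(tvd(actual, trained), hellinger(actual, trained), kld(actual, trained))
```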

**In this section**

```{toctree}
:maxdepth: 1
model-monitoring
monitoring-models
model-monitoring-deployment
initial-setup-configuration
legacy-model-monitoring
```
205 changes: 0 additions & 205 deletions docs/monitoring/initial-setup-configuration.ipynb

This file was deleted.
