Commit
[Docs] Add/update model monitoring (#5401)
jillnogold committed May 28, 2024
1 parent cc385be commit b9dbe97
Showing 27 changed files with 2,664 additions and 695 deletions.
Binary file added docs/_static/image_sources/model-monitoring.pptx
Binary file not shown.
Binary file added docs/_static/images/model-monitoring.png
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/concepts/monitoring.md
@@ -1,4 +1,4 @@
(model-monitoring)=
(model-monitoring-overview)=
# Model monitoring

By definition, ML models in production make inferences on constantly changing data. Even models that have been trained on massive data sets, with the most meticulously labelled data, start to degrade over time, due to concept drift. Changes in the live environment due to changing behavioral patterns, seasonal shifts, new regulatory environments, market volatility, etc., can have a big impact on a trained model’s ability to make accurate predictions.
8 changes: 7 additions & 1 deletion docs/conf.py
@@ -169,7 +169,13 @@ def current_version():
"onnx",
]

redirects = {"functions-architecture": "functions.html"}
redirects = {
    "runtimes/functions-architecture": "runtimes/functions.html",
    "monitoring/initial-setup-configuration": "monitoring/model-monitoring-deployment.html",
    "tutorials/05-batch-infer.ipynb": "tutorials/06-batch-infer.ipynb",
    "tutorials/06-model-monitoring.ipynb": "tutorials/05-model-monitoring.ipynb",
}

smartquotes = False

# -- Autosummary -------------------------------------------------------------
2 changes: 1 addition & 1 deletion docs/data-prep/ingest-data-fs.md
@@ -12,7 +12,7 @@ the ingestion process runs the graph transformations, infers metadata and stats,

When targets are not specified, data is stored in the configured default targets (i.e. NoSQL for real-time and Parquet for offline).

### Ingestion engines
## Ingestion engines

MLRun supports several ingestion engines:
- `storey` engine (default) is designed for real-time data (e.g. individual records) that will be transformed using Python functions and classes
16 changes: 8 additions & 8 deletions docs/index.md
@@ -62,7 +62,7 @@ Project access can be restricted to a set of users and roles.

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Projects and automation <./projects/project.html>`
{bdg-link-info}`Projects and automation <./projects/ci-cd-automate.html>`
{bdg-link-info}`CI/CD integration <./projects/ci-integration.html>`
<br> {octicon}`code-square` **Tutorials:**
{bdg-link-primary}`Quick start <./tutorials/01-mlrun-basics.html>`
@@ -79,7 +79,7 @@ In addition, the MLRun [**Feature store**](./feature-store/feature-store.html) a

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Ingest and process data <ingesting-process-data>`
{bdg-link-info}`Ingest and process data <./data-prep/index.html>`
{bdg-link-info}`Feature store <./feature-store/feature-store.html>`
{bdg-link-info}`Data and artifacts <./concepts/data.html>`
<br> {octicon}`code-square` **Tutorials:**
@@ -94,7 +94,7 @@ MLRun allows you to easily build ML pipelines that take data from various source

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Develop and train models <development>`
{bdg-link-info}`Develop and train models <./development/index.html>`
{bdg-link-info}`Model training and tracking <./development/model-training-tracking.html>`
{bdg-link-info}`Batch runs and workflows <./concepts/runs-workflows.html>`
<br> {octicon}`code-square` **Tutorials:**
@@ -111,7 +111,7 @@ MLRun rapidly deploys and manages production-grade real-time or batch applicatio

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Deploy models and applications <deployment>`
{bdg-link-info}`Deploy models and applications <./deployment/index.html>`
{bdg-link-info}`Realtime pipelines <./serving/serving-graph.html>`
{bdg-link-info}`Batch inference <./deployment/batch_inference.html>`
<br> {octicon}`code-square` **Tutorials:**
@@ -129,10 +129,10 @@ Observability is built into the different MLRun objects (data, functions, jobs,

`````{div} full-width
{octicon}`mortar-board` **Docs:**
{bdg-link-info}`Monitor and alert <monitoring>`
{bdg-link-info}`Model monitoring overview <./monitoring/model-monitoring-deployment.html>`
{bdg-link-info}`Monitor and alert <./monitoring/model-monitoring.html>`
{bdg-link-info}`Model monitoring overview <./monitoring/index.html>`
<br> {octicon}`code-square` **Tutorials:**
{bdg-link-primary}`Model monitoring and drift detection <./tutorials/05-model-monitoring.html>`
{bdg-link-primary}`Realtime monitoring and drift detection <./tutorials/05-model-monitoring.html>`
`````

<a id="core-components"></a>
@@ -199,7 +199,7 @@ MLRun includes the following major components:

**{ref}`Real-time serving pipeline <serving-graph>`:** Rapid deployment of scalable data and ML pipelines using real-time serverless technology, including API handling, data preparation/enrichment, model serving, ensembles, driving and measuring actions, etc.

**{ref}`Real-time monitoring <monitoring>`:** Monitors data, models, resources, and production components and provides a feedback loop for exploring production data, identifying drift, alerting on anomalies or data quality issues, triggering retraining jobs, measuring business impact, etc.
**{ref}`Real-time monitoring <monitoring-overview>`:** Monitors data, models, resources, and production components and provides a feedback loop for exploring production data, identifying drift, alerting on anomalies or data quality issues, triggering retraining jobs, measuring business impact, etc.



72 changes: 49 additions & 23 deletions docs/monitoring/index.md
@@ -1,39 +1,65 @@
(monitoring)=
(monitoring-overview)=

# Monitor and alert
# Model monitoring

```{note}
Monitoring is supported by Iguazio's streaming technology, and open-source integration with Kafka.
```

```{note}
This is currently a beta feature.
```
In v1.6.0, MLRun introduces a {ref}`new paradigm of model monitoring <model-monitoring>`.
The {ref}`legacy mode <legacy-model-monitoring>` is currently supported only for the CE version of MLRun.

The MLRun's model monitoring service includes built-in model monitoring and reporting capability. With monitoring you get
MLRun's model monitoring service includes built-in model monitoring and reporting capabilities. With monitoring you get
out-of-the-box analysis of:

- **Model performance**: machine learning models train on data. It is important you know how well they perform in production.
- **Continuous Assessment**: Model monitoring involves the continuous assessment of deployed machine learning models in real-time.
It's a proactive approach to ensure that models remain accurate and reliable as they interact with live data.
- **Model performance**: Machine learning models train on data. It is important to know how well they perform in production.
When you analyze model performance, it is important to monitor not just the overall model performance, but also the
feature-level performance. This gives you better insights for the reasons behind a particular result
- **Data drift**: the change in model input data that potentially leads to model performance degradation. There are various
statistical metrics and drift metrics that you can use in order to identify data drift.
- **Concept drift**: applies to the target. Sometimes the statistical properties of the target variable, which the model is
trying to predict, change over time in unforeseen ways.
- **Operational performance**: applies to the overall health of the system. This applies to data (e.g., whether all the
expected data arrives to the model) as well as the model (e.g., response time, and throughput).
feature-level performance. This gives you better insights for the reasons behind a particular result.
- **Data drift**: The change in model input data that potentially leads to model performance degradation. There are various
statistical metrics and drift metrics that you can use to identify data drift.
- **Concept drift**: The statistical properties of the target variable (what the model is predicting) change over time.
In other words, the live data no longer matches the data the model was trained on, so for this new data the accuracy of the model's predictions is low. Drift analysis statistics are computed once an hour. See more details in <a href="https://www.iguazio.com/glossary/concept-drift/" target="_blank">Concept Drift</a>.
- **Operational performance**: The overall health of the system. This applies to data (e.g., whether all the
expected data arrives at the model) as well as the model (e.g., response time and throughput).

You can set up notifications on various channels once an issue is detected. For example, you can send a notification
to your IT department via email and Slack when operational performance metrics pass a threshold. You can also set up automated actions, for example,
call a CI/CD pipeline when data drift is detected and allow a data scientist to review the model with the revised data.
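
As a rough illustration of the notification mechanism, here is a hedged sketch of attaching a Slack notification to a retraining job. It assumes the general-purpose `mlrun.model.Notification` class and the `notifications` parameter of `run()`; the webhook URL and hub function are placeholders, and the exact wiring for monitoring-driven alerts may differ by version:

```python
import mlrun

# A sketch only: ping IT on Slack when a retraining job completes or fails.
# The webhook URL is a placeholder; verify the Notification fields against
# your MLRun version.
slack_note = mlrun.model.Notification(
    kind="slack",
    when=["completed", "error"],
    name="ops-alert",
    message="Drift-triggered retraining job finished",
    severity="info",
    params={"webhook": "https://hooks.slack.com/services/..."},  # placeholder
)

retrain_fn = mlrun.import_function("hub://auto_trainer")  # example hub function
retrain_fn.run(name="retrain-on-drift", notifications=[slack_note])
```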

## Architecture

<img src="../_static/images/model-monitoring.png" width="1100" >

The model monitoring process flow starts with collecting operational data from a function in the model serving pod. The model
monitoring stream pod forwards the data to a Parquet database.
The controller periodically checks the Parquet DB for new data and forwards it to the relevant application.
Each monitoring application is a separate Nuclio real-time function. Each one listens to a stream that is filled by
the monitoring controller on each `base_period` interval.
The stream function examines the log entry and processes it into statistics, which are then written to the statistics databases (Parquet file, time-series database, and key-value database).
The monitoring stream function writes the Parquet files using a basic storey ParquetTarget. Additionally, there is a monitoring feature set that refers
to the same target. You can use `get_offline_features` to read the data from that feature set, as sketched below.
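
For example, a minimal sketch of reading that data back through the feature store; the feature-set name `monitoring-feature-set` is hypothetical and stands in for the feature set your deployment actually creates:

```python
import mlrun.feature_store as fstore

# Build a feature vector over the monitoring feature set; the feature-set
# name below is hypothetical -- use the one created by your deployment.
vector = fstore.FeatureVector(
    "monitoring-data", features=["monitoring-feature-set.*"]
)

# get_offline_features reads the Parquet target behind the feature set
resp = fstore.get_offline_features(vector)
df = resp.to_dataframe()
print(df.head())
```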

In parallel, an MLRun job runs, reading the Parquet files and performing drift analysis. The drift analysis data is stored so
that the user can retrieve it in the Iguazio UI or in a Grafana dashboard.

You have the option to set up notifications on various channels once an issue is detection. For example, you can set-up notification
to your IT via email and slack when operational performance metrics pass a threshold. You can also set-up automated actions, for example,
call a CI/CD pipeline when data drift is detected and allow a data scientist to review the model with the revised data.
When you enable model monitoring, you effectively deploy three components (sketched in the snippet after this list):
- application controller function: handles the monitoring processing and triggers the apps, which in turn trigger the writer. The controller is a scheduled batch job whose frequency is determined by `base_period`.
- stream function: monitors the log of the data stream. It is triggered when a new log entry is detected. The monitored data is used to create real-time dashboards, detect drift, and analyze performance.
- writer function: writes to the database and outputs alerts.
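
A minimal sketch of enabling these components on a project, assuming the `enable_model_monitoring` project method and the `base_period` parameter described above (the exact signature may vary across MLRun versions):

```python
import mlrun

project = mlrun.get_or_create_project("my-project", context="./")

# Deploys the controller, stream, and writer functions described above.
# base_period is the controller's scheduling interval, in minutes.
project.enable_model_monitoring(base_period=10)
```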

Refer to the [**model monitoring & drift detection tutorial**](../tutorials/05-model-monitoring.html) for an end-to-end example.

## Common terminology
The following terms are used in all the model monitoring pages (a short numeric sketch of the three distance metrics follows the list):
* **Total Variation Distance** (TVD) &mdash; The statistical difference between the actual predictions and the model's trained predictions.
* **Hellinger Distance** &mdash; A type of f-divergence that quantifies the similarity between the actual predictions and the model's trained predictions.
* **Kullback–Leibler Divergence** (KLD) &mdash; A measure of how the probability distribution of the actual predictions differs from the model's trained reference probability distribution.
* **Model Endpoint** &mdash; A combination of a model and a runtime function that can be a deployed Nuclio function or a job runtime. One function can run multiple endpoints; however, statistics are saved per endpoint.
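
For intuition, here is a minimal NumPy sketch of the three distance metrics above, computed over two discrete (histogram) distributions. This is illustrative only, not MLRun's internal implementation:

```python
import numpy as np

def tvd(p, q):
    # Total Variation Distance: half the L1 distance between distributions
    return 0.5 * np.sum(np.abs(p - q))

def hellinger(p, q):
    # Hellinger distance: an f-divergence, bounded in [0, 1]
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def kld(p, q, eps=1e-12):
    # Kullback-Leibler divergence of p from reference q (asymmetric, unbounded)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return np.sum(p * np.log(p / q))

trained = np.array([0.1, 0.2, 0.4, 0.3])     # reference (training) histogram
actual = np.array([0.15, 0.25, 0.35, 0.25])  # live-predictions histogram
print(tvd(actual, trained), hellinger(actual, trained), kld(actual, trained))
```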

**In this section**

```{toctree}
:maxdepth: 1
model-monitoring
monitoring-models
model-monitoring-deployment
initial-setup-configuration
legacy-model-monitoring
```
205 changes: 0 additions & 205 deletions docs/monitoring/initial-setup-configuration.ipynb

This file was deleted.
