# Lesson 3.1: Train-Test Skew Detection with Evidently

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zenml-io/zenbytes/blob/main/3-1_Data_Skew.ipynb)

***Key Concepts:*** *Data-Centric ML, Data Skew, Train-Test Skew, Training-Serving Skew, Evidently*

In academia and research, the focus of ML is usually to build the best possible models for a given dataset. However, in practical applications, the overall performance of our application is often determined primarily by data quality, not by the model. That is why many ML practitioners advocate for **Data-Centric** ML approaches, where we focus on improving the data while keeping the ML model (mostly) fixed. See [this great article](https://neptune.ai/blog/data-centric-vs-model-centric-machine-learning) by neptune.ai for more details on model-centric vs. data-centric ML.

One of the most critical parts of data-centric ML is to monitor data quality. Throughout this chapter, we will learn about many potential data issues, such as train-test skew, training-serving skew, data drift, and more. Being aware of these issues, and having respective safety mechanisms in place, is essential when serving ML models to real users.

In this lesson, we will start by automatically checking for **Data Skew** within our ML pipelines. Since the performance of ML models on unseen data can be unpredictable, we should always try to design our training data to match the actual environment where our model will later be deployed. A mismatch between the distribution of the training data and the distribution of the data seen in deployment is called **Training-Serving Skew**. Similarly, differences in distribution between our training and testing datasets are called **Train-Test Skew**.
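To make the idea of a distribution difference concrete, here is a minimal sketch that compares the class balance of two label lists using the total variation distance. The toy labels and the helper names `label_distribution` and `total_variation_distance` are assumptions made for illustration; this is not how Evidently computes skew, and the check later in this lesson operates on feature columns rather than labels.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its relative frequency."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(labels_a, labels_b):
    """Half the L1 distance between two label distributions.

    0 means identical class balance, 1 means no shared labels at all.
    """
    dist_a = label_distribution(labels_a)
    dist_b = label_distribution(labels_b)
    all_labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(
        abs(dist_a.get(label, 0.0) - dist_b.get(label, 0.0))
        for label in all_labels
    )


# Toy labels (hypothetical data, not the digits dataset used in this lesson).
y_train = [0] * 50 + [1] * 50       # balanced training labels
y_test_ok = [0] * 10 + [1] * 10     # same class balance as training
y_test_skewed = [0] * 18 + [1] * 2  # class 1 is under-represented

print("matching split:", total_variation_distance(y_train, y_test_ok))
print("skewed split:  ", total_variation_distance(y_train, y_test_skewed))
```

The matching split yields a distance of zero, while the skewed split yields a clearly larger value. A train-test skew check boils down to computing some measure of this kind and deciding whether it is too large.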

In the following, we will use the open-source data monitoring tool [Evidently](https://evidentlyai.com/) to measure distribution differences between our datasets. See this little [blog post](https://blog.zenml.io/zenml-loves-evidently/) of ours that explains the Evidently integration in more detail.

If you haven't done so, install ZenML and the required integrations (including Evidently) by running the following cell. This will also restart your notebook kernel:

In [None]:
%pip install "zenml[server]"
!zenml integration install sklearn mlflow evidently -y
!rm -rf .zen
!zenml init
%pip install pyparsing==2.4.2  # required for Colab

import IPython

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

In [None]:
from zenml.environment import Environment

if Environment.in_google_colab():  # Colab only setup

    # clone zenbytes repo to get source code of previous lessons
    !git clone https://github.com/zenml-io/zenbytes.git  # noqa
    !mv zenbytes/steps .
    !mv zenbytes/pipelines .

## Detect Train-Test Skew

First, we will use Evidently to check for skew between our training and test datasets. To do so, we will define a new pipeline with an Evidently step, into which we will then pass our training and test datasets. 

At its core, Evidently's distribution comparison functions take a reference dataset and compare it against a separate comparison dataset. Both are passed in as [pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), though CSV inputs are also possible. ZenML wraps this functionality in several standardized steps and also provides an easy way to use Evidently's visualization tools, called 'Dashboards'.
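The reference-versus-comparison idea can be sketched in plain Python as a column-by-column comparison. The snippet below uses a hand-rolled two-sample Kolmogorov-Smirnov statistic as the distance measure; the function names, the dictionary-of-lists table format, and the toy pixel columns are assumptions for illustration only, and Evidently selects and runs its own statistical tests internally.

```python
from bisect import bisect_right


def ks_statistic(reference, comparison):
    """Largest gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    comp = sorted(comparison)
    max_gap = 0.0
    # The largest gap is always attained at one of the observed values.
    for value in ref + comp:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_comp = bisect_right(comp, value) / len(comp)
        max_gap = max(max_gap, abs(cdf_ref - cdf_comp))
    return max_gap


def compare_columns(reference, comparison):
    """Score every column that appears in both tables.

    Tables are plain dicts mapping a column name to a list of values.
    """
    return {
        name: ks_statistic(reference[name], comparison[name])
        for name in reference
        if name in comparison
    }


# Toy tables (hypothetical values): "pixel_0" matches, "pixel_1" does not.
reference = {"pixel_0": [0, 1, 2, 3, 4, 5], "pixel_1": [0, 0, 1, 1, 2, 2]}
comparison = {"pixel_0": [0, 1, 2, 3, 4, 5], "pixel_1": [8, 8, 9, 9, 9, 9]}

scores = compare_columns(reference, comparison)
print(scores)  # {'pixel_0': 0.0, 'pixel_1': 1.0}
```

A score of 0 means the two columns have identical empirical distributions, and a score of 1 means their value ranges do not overlap at all.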

Since our datasets were initially in [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) format, we also need to add another simple step that converts from numpy to pandas. The overall pipeline will then look like this:

![Pipeline2](https://github.com/zenml-io/zenbytes/blob/main/_assets/3-1/second_pipeline.png?raw=1)

Let's define this pipeline in code and import the other steps (which we have already built during previous lessons):

In [None]:
from steps.evaluator import evaluator
from steps.importer import importer
from steps.sklearn_trainer import svc_trainer

In [None]:
from zenml.pipelines import pipeline


@pipeline(enable_cache=False)
def digits_pipeline_with_train_test_checks(
    importer,
    trainer,
    evaluator,
    get_reference_data,
    skew_detector,
):
    """Digits pipeline with train-test check."""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
    reference, comparison = get_reference_data(X_train, X_test)
    skew_detector(reference, comparison)

Next, let's define the two new steps. For data distribution comparison, we can simply use the predefined step of ZenML's Evidently integration:

In [None]:
from zenml.integrations.evidently.steps import (
    EvidentlyProfileParameters,
    evidently_profile_step,
)

# configure the Evidently step
evidently_profile_params = EvidentlyProfileParameters(
    profile_sections=["datadrift"]
)

The step for converting numpy to pandas is also fairly easy to implement:

In [None]:
import numpy as np
import pandas as pd
from zenml.steps import step, Output


@step
def get_reference_data(
    X_train: np.ndarray,
    X_test: np.ndarray,
) -> Output(reference=pd.DataFrame, comparison=pd.DataFrame):
    """Convert numpy data to pandas for distribution difference calculation."""
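    # Use the stringified column indices ("0", "1", ...) as feature names.
    columns = [str(i) for i in range(X_train.shape[1])]
    # The training data is the reference; the test data is compared against it.
    reference = pd.DataFrame(X_train, columns=columns)
    comparison = pd.DataFrame(X_test, columns=columns)
    return reference, comparison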

Lastly, before we can run the pipeline, we still need to add Evidently into our ZenML MLOps stack as a data validator:

In [None]:
!zenml data-validator register evidently_validator --flavor=evidently

!zenml stack update default -dv evidently_validator

And that's it. Let's initialize and run our pipeline to try it out:

In [None]:
evidently_pipeline = digits_pipeline_with_train_test_checks(
    importer=importer(),
    trainer=svc_trainer(),
    evaluator=evaluator(),
    get_reference_data=get_reference_data(),
    skew_detector=evidently_profile_step(
        step_name="evidently_skew_detector",
        params=evidently_profile_params,
    ),
)
evidently_pipeline.run(unlisted=True)

Now we can use ZenML's `EvidentlyVisualizer` to see the distribution comparison right in our notebook, where we can visually compare each feature's distributions.

In [None]:
from zenml.integrations.evidently.visualizers import EvidentlyVisualizer
from zenml.post_execution import get_unlisted_runs

last_run = get_unlisted_runs()[-1]

skew_detection_step = last_run.get_step(step="skew_detector")
evidently_outputs = skew_detection_step

EvidentlyVisualizer().visualize(evidently_outputs)

As we can see, there is no skew between our training and test sets. That's great!
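Reading the dashboard is a manual check. As a rough sketch of how such a conclusion could be turned into a yes/no decision, the snippet below flags skew when too large a share of features exceeds a score threshold. The input dictionary, the function name `skew_detected`, and both thresholds are hypothetical; they do not reflect the format of Evidently's profile output or its default settings.

```python
def skew_detected(feature_scores, score_threshold=0.1, max_drifted_share=0.5):
    """Return True if too many features score above the threshold.

    feature_scores maps a feature name to a distance-like score, where
    larger values mean a larger difference between the two datasets.
    """
    if not feature_scores:
        return False  # nothing to compare, so nothing to flag
    drifted = [
        name
        for name, score in feature_scores.items()
        if score > score_threshold
    ]
    return len(drifted) / len(feature_scores) > max_drifted_share


# Hypothetical per-feature scores for two different runs.
scores_clean = {"0": 0.02, "1": 0.04, "2": 0.01}
scores_skewed = {"0": 0.45, "1": 0.60, "2": 0.03}

print("clean run flagged: ", skew_detected(scores_clean))
print("skewed run flagged:", skew_detected(scores_skewed))
```

A check like this could run as its own pipeline step and raise an error or send a notification when it returns `True`. The right thresholds depend on the dataset and on the statistic behind the scores.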

In the following lessons, we will add training-serving skew and data drift detection mechanisms into our inference pipeline. We will also set up triggers for automated alerts whenever any data issues are detected. Those lessons are still a work in progress, so stay tuned!