## Homework

The goal of this homework is to get familiar with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.



## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2024 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

* 72044
* 78537 
* 57457
* 54396

In [2]:
# 1. !conda create -n 05-monitoring-2023 python=3.11
# 2. !conda activate 05-monitoring-2023
# 3. Move to project folder 'cd 2024_mlops_homework/05-monitoring/2024'
# 4. !pip install -r requirements.txt
# 5. Move to data folder 'cd 2024_mlops_homework/05-monitoring/2024/data'
# 6. !wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-03.parquet

import pandas as pd
mar_df = pd.read_parquet('data/green_tripdata_2024-03.parquet')
mar_df.shape[0]

57457

## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

In [4]:
# 1. Open https://docs.evidentlyai.com/reference/all-metrics
# 2. Go to https://docs.evidentlyai.com/reference/all-metrics#data-quality
# 3. Pick any metric
# --> DatasetCorrelationsMetric()

## Q3. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2024). 

What is the maximum value of the `quantile = 0.5` metric on the `"fare_amount"` column during March 2024 (calculated daily)?

* 10
* 12.5
* 14.2
* 14.8

In [19]:
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ColumnQuantileMetric

data = pd.read_parquet('data/green_tripdata_2024-03.parquet')

# create target
data["duration_min"] = data.lpep_dropoff_datetime - data.lpep_pickup_datetime
data.duration_min = data.duration_min.apply(lambda td : float(td.total_seconds())/60)

# filter out outliers
data = data[(data.duration_min >= 0) & (data.duration_min <= 60)]
data = data[(data.passenger_count > 0) & (data.passenger_count <= 8)]
# extract the calendar date from the pickup timestamp
data['date'] = data['lpep_pickup_datetime'].dt.date

# data labeling
num_features = ["passenger_count", "trip_distance", "fare_amount", "total_amount"]
cat_features = ["PULocationID", "DOLocationID"]


column_mapping = ColumnMapping(
    target=None,
    prediction=None,
    numerical_features=num_features,
    categorical_features=cat_features
)

report = Report(metrics=[
    ColumnQuantileMetric(column_name='fare_amount', quantile=0.5)
]
)

# Initialize a list to store daily quantiles
daily_quantiles = []

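# March has 31 days; use timestamps for the window bounds so the last day is included
for day in pd.date_range('2024-03-01', '2024-03-31', freq='D'):
    daily_data = data.loc[data.lpep_pickup_datetime.between(day, day + pd.Timedelta(days=1), inclusive="left")]
    report.run(reference_data=None, current_data=daily_data, column_mapping=column_mapping)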
    result = report.as_dict()
    quantile = result['metrics'][0]['result']['current']['value']
    daily_quantiles.append(quantile)

# Find the maximum daily quantile
max_daily_quantile = max(daily_quantiles)
print(f"Maximum daily quantile: {max_daily_quantile}")

Maximum daily quantile: 14.2
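
As a sanity check on what the loop above measures: the 0.5 quantile is the median, so the answer is the largest of the per-day medians of `fare_amount`. The following is a minimal, dependency-free sketch of that calculation. The `trips` records and the `max_daily_median` helper are made up for illustration and are not part of the homework code; it also assumes the median of an even-sized group is the mean of the two middle values, as `statistics.median` does.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Toy (pickup time, fare_amount) records; the values are invented for illustration.
trips = [
    (datetime(2024, 3, 1, 8, 15), 10.0),
    (datetime(2024, 3, 1, 9, 40), 14.0),
    (datetime(2024, 3, 1, 18, 5), 12.0),
    (datetime(2024, 3, 2, 7, 30), 13.5),
    (datetime(2024, 3, 2, 22, 10), 15.5),
    (datetime(2024, 3, 31, 23, 59), 9.0),
]


def max_daily_median(records):
    """Group fares by calendar day, take each day's median, return the largest."""
    fares_by_day = defaultdict(list)
    for pickup, fare in records:
        fares_by_day[pickup.date()].append(fare)
    daily_medians = {day: median(fares) for day, fares in fares_by_day.items()}
    return max(daily_medians.values()), daily_medians


best, per_day = max_daily_median(trips)
print(best)  # 14.5 (the median of the two fares on 2024-03-02)
```

On the real DataFrame, the same check can be written with pandas as `data.groupby('date')['fare_amount'].quantile(0.5).max()`.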


In [None]:
### As the next question is already about Grafana, I wanted to use only the Evidently UI here
 
from evidently.metric_preset import DataQualityPreset

from evidently.ui.workspace import Workspace
from evidently.ui.dashboards import DashboardPanelCounter, DashboardPanelPlot, CounterAgg, PanelValue, PlotType, ReportFilter
from evidently.renderers.html_widgets import WidgetSize

import datetime
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report


val_data = pd.read_parquet('data/green_tripdata_2022-01.parquet')

# create target
val_data["duration_min"] = val_data.lpep_dropoff_datetime - val_data.lpep_pickup_datetime
val_data.duration_min = val_data.duration_min.apply(lambda td : float(td.total_seconds())/60)

# filter out outliers
val_data = val_data[(val_data.duration_min >= 0) & (val_data.duration_min <= 60)]
val_data = val_data[(val_data.passenger_count > 0) & (val_data.passenger_count <= 8)]

# data labeling
target = "duration_min"
num_features = ["passenger_count", "trip_distance", "fare_amount", "total_amount"]
cat_features = ["PULocationID", "DOLocationID"]

ws = Workspace("workspace")

project = ws.create_project("NYC Taxi Data Quality Project")
project.description = "My project description"
project.save()

column_mapping = ColumnMapping(
    target=None,
    prediction=None,
    numerical_features=num_features,
    categorical_features=cat_features
)

regular_report = Report(
    metrics=[
        DataQualityPreset()
    ],
    timestamp=datetime.datetime(2022,1,28)
)

regular_report.run(reference_data=None,
                  current_data=val_data.loc[val_data.lpep_pickup_datetime.between('2022-01-28', '2022-01-29', inclusive="left")],
                  column_mapping=column_mapping)

ws.add_report(project.id, regular_report)

# configure the dashboard
project.dashboard.add_panel(
    DashboardPanelCounter(
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        agg=CounterAgg.NONE,
        title="NYC taxi data dashboard"
    )
)

project.dashboard.add_panel(
    DashboardPanelPlot(
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        title="Inference Count",
        values=[
            PanelValue(
                metric_id="DatasetSummaryMetric",
                field_path="current.number_of_rows",
                legend="count"
            ),
        ],
        plot_type=PlotType.BAR,
        size=WidgetSize.HALF,
    ),
)

project.dashboard.add_panel(
    DashboardPanelPlot(
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        title="Number of Missing Values",
        values=[
            PanelValue(
                metric_id="DatasetSummaryMetric",
                field_path="current.number_of_missing_values",
                legend="count"
            ),
        ],
        plot_type=PlotType.LINE,
        size=WidgetSize.HALF,
    ),
)

project.save()

regular_report = Report(
    metrics=[
        DataQualityPreset()
    ],
    timestamp=datetime.datetime(2022,1,29)
)

regular_report.run(reference_data=None,
                  current_data=val_data.loc[val_data.lpep_pickup_datetime.between('2022-01-29', '2022-01-30', inclusive="left")],
                  column_mapping=column_mapping)

ws.add_report(project.id, regular_report)

# 1. Start UI with 'evidently ui'

## Q4. Dashboard


Finally, let’s add panels with the newly added metrics to the dashboard. After we customize the dashboard, let's save the dashboard config so that we can access it later. Hint: click on “Save dashboard” to access the JSON configuration of the dashboard. This configuration should be saved locally.

Where should the dashboard config file be placed?

* `project_folder` (05-monitoring)
* `project_folder/config`  (05-monitoring/config)
* `project_folder/dashboards`  (05-monitoring/dashboards)
* `project_folder/data`  (05-monitoring/data)

--> project_folder/dashboards (05-monitoring/dashboards)
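
Saving the exported JSON into that folder could look like the sketch below. It is only an illustration: `save_dashboard_config`, the file name `data_quality`, and the contents of `example_config` are hypothetical placeholders, and the real configuration is whatever JSON the “Save dashboard” dialog produces. A temporary directory stands in for the project folder so the example does not touch real files.

```python
import json
import tempfile
from pathlib import Path


def save_dashboard_config(config: dict, project_folder: str, name: str) -> Path:
    """Write a dashboard config as JSON to <project_folder>/dashboards/<name>.json."""
    dashboards_dir = Path(project_folder) / "dashboards"
    dashboards_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = dashboards_dir / f"{name}.json"
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target


# Placeholder content; a real export contains the full dashboard definition.
example_config = {"title": "NYC taxi data dashboard", "panels": []}

with tempfile.TemporaryDirectory() as project_folder:
    saved = save_dashboard_config(example_config, project_folder, "data_quality")
    print(saved.relative_to(project_folder))
    # Read the file back to confirm it round-trips as valid JSON.
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
```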

#### FOR CAPSTONE PROJECT
- The problem with the Evidently UI is that data is stored as snapshots (JSON files)
- That is why PostgreSQL + Grafana are used in the lecture
- What to do in the project?
    - Schedule a monitoring batch job and send regular emails using Prefect?
    - Use only Evidently code and not PostgreSQL & Grafana?
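
For the first option, the alerting step of such a batch job might be sketched as follows. This is an assumption-laden illustration, not a worked-out design: `build_alert`, the threshold of `12.0`, and the recipient address are hypothetical, the message is only constructed and never sent, and the scheduling itself (for example with Prefect) is left out.

```python
from email.message import EmailMessage
from typing import Optional


def build_alert(metric_name: str, value: float, threshold: float,
                recipient: str) -> Optional[EmailMessage]:
    """Return an email message if the metric exceeds the threshold, else None."""
    if value <= threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[monitoring] {metric_name} above threshold"
    msg["To"] = recipient
    msg.set_content(
        f"{metric_name} = {value:.2f} exceeded the threshold {threshold:.2f}."
    )
    return msg


# 14.2 is the daily maximum found in Q3; the threshold is an arbitrary example.
alert = build_alert("fare_amount median", 14.2, 12.0, "team@example.com")
print(alert["Subject"] if alert else "no alert")
```

A scheduled job would compute the metric for the latest batch, call a helper like this, and hand any resulting message to whatever mail transport the project uses.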