## Homework

The goal of this homework is to get familiar with monitoring ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.

## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2024 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

* 72044
* 78537 
* 57457
* 54396


In [1]:

import requests
import datetime
import pandas as pd
from tqdm import tqdm

In [7]:
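files = [('green_tripdata_2024-03.parquet', './data')]

print("Download files:")
for file, path in files:
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
    resp = requests.get(url, stream=True)
    resp.raise_for_status()  # stop early on a failed download
    save_path = f"{path}/{file}"  # the target directory must already exist
    with open(save_path, "wb") as handle:
        # iter_content() yields 1-byte chunks by default, so the progress
        # total equals the Content-Length in bytes
        for data in tqdm(resp.iter_content(),
                        desc=f"{file}",
                        postfix=f"save to {save_path}",
                        total=int(resp.headers["Content-Length"])):
            handle.write(data)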

Download files:


green_tripdata_2024-03.parquet: 100%|██████████| 1372372/1372372 [00:08<00:00, 162297.33it/s, save to ./data/green_tripdata_2024-03.parquet]


In [None]:
mar_data = pd.read_parquet('data/green_tripdata_2024-03.parquet')
mar_data.shape

(57457, 20)
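
As an alternative to the download cell above, here is a minimal sketch that streams the file in larger chunks using only the standard library and creates the target directory first. The helper name `download_file` and the 1 MiB default chunk size are assumptions made for illustration; they are not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path, chunk_size: int = 1 << 20) -> int:
    """Stream `url` to `dest` in chunks and return the size of the saved file in bytes.

    Hypothetical helper for illustration; the notebook itself uses requests + tqdm.
    """
    # Create the target directory (e.g. ./data) if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as handle:
        # Copy the response body to disk chunk_size bytes at a time.
        shutil.copyfileobj(resp, handle, length=chunk_size)
    return dest.stat().st_size
```

Calling `download_file("https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-03.parquet", Path("data/green_tripdata_2024-03.parquet"))` should leave the same file on disk as the cell above, without a progress bar.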

## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

The mean of `fare_amount`.
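
To make the two metrics concrete: the 0.5 quantile is the median, which is far less sensitive to a single extreme fare than the mean. The sketch below uses plain Python and made-up fare values (an assumption for illustration only); it does not show how Evidently computes these metrics internally.

```python
import statistics

# Made-up fares for illustration only; the last value is a deliberate outlier.
fares = [7.0, 12.5, 13.5, 14.0, 70.0]

median_fare = statistics.median(fares)  # the 0.5 quantile
mean_fare = statistics.fmean(fares)     # arithmetic mean, pulled up by the outlier

print(f"median={median_fare}, mean={mean_fare}")
# prints "median=13.5, mean=23.4"
```

Monitoring both values side by side can help tell a shift in typical fares apart from a handful of unusual trips.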

## Q3. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2024). 

What is the maximum value of metric `quantile = 0.5` on the `"fare_amount"` column during March 2024 (calculated daily)?

* 10
* 12.5
* 14.2
* 14.8

In [9]:
from evidently.metrics import ColumnQuantileMetric

In [3]:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, DatasetDriftMetric, DatasetMissingValuesMetric

from joblib import load, dump
from tqdm import tqdm

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

In [6]:
mar_data['date'] = mar_data['lpep_pickup_datetime'].dt.date

In [24]:
mar_data = mar_data[(mar_data['lpep_pickup_datetime'] >= '2024-03-01') & (mar_data['lpep_pickup_datetime'] < '2024-04-01')]


In [25]:
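medians = []
# Run one Evidently report per calendar day and collect the daily median
for day, group in mar_data.groupby('date'):
    report = Report(metrics=[ColumnQuantileMetric(column_name='fare_amount', quantile=0.5)])
    # No reference dataset is passed; only the current batch is evaluated
    report.run(reference_data=None, current_data=group)
    result = report.as_dict()
    # The first (and only) metric holds the quantile of the current batch
    median = result['metrics'][0]['result']['current']['value']
    medians.append((day, median))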

In [26]:
medians

[(datetime.date(2024, 3, 1), np.float64(13.5)),
 (datetime.date(2024, 3, 2), np.float64(13.5)),
 (datetime.date(2024, 3, 3), np.float64(14.2)),
 (datetime.date(2024, 3, 4), np.float64(12.8)),
 (datetime.date(2024, 3, 5), np.float64(13.5)),
 (datetime.date(2024, 3, 6), np.float64(12.8)),
 (datetime.date(2024, 3, 7), np.float64(13.5)),
 (datetime.date(2024, 3, 8), np.float64(13.5)),
 (datetime.date(2024, 3, 9), np.float64(13.5)),
 (datetime.date(2024, 3, 10), np.float64(14.2)),
 (datetime.date(2024, 3, 11), np.float64(12.8)),
 (datetime.date(2024, 3, 12), np.float64(13.5)),
 (datetime.date(2024, 3, 13), np.float64(13.5)),
 (datetime.date(2024, 3, 14), np.float64(14.2)),
 (datetime.date(2024, 3, 15), np.float64(13.5)),
 (datetime.date(2024, 3, 16), np.float64(14.2)),
 (datetime.date(2024, 3, 17), np.float64(13.5)),
 (datetime.date(2024, 3, 18), np.float64(13.5)),
 (datetime.date(2024, 3, 19), np.float64(13.5)),
 (datetime.date(2024, 3, 20), np.float64(12.8)),
 (datetime.date(2024, 3, 21),

In [27]:

# Pick the (day, median) pair with the largest median
max_day, max_median = max(medians, key=lambda x: x[1])
print(f"Maximum daily median fare_amount in March 2024: {max_median} on {max_day}")

Maximum daily median fare_amount in March 2024: 14.2 on 2024-03-03
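
As a cross-check of the Evidently loop above, the same daily median can be computed with pandas alone. This is a minimal sketch on a small made-up DataFrame (the trip values are assumptions for illustration); on the real data, `trips` would be the DataFrame loaded from the parquet file.

```python
import pandas as pd

# Made-up trips for illustration only; real data comes from the parquet file.
trips = pd.DataFrame(
    {
        "lpep_pickup_datetime": pd.to_datetime(
            [
                "2024-02-29 23:59:00",  # outside March, dropped by the filter
                "2024-03-01 08:00:00",
                "2024-03-01 09:30:00",
                "2024-03-01 18:45:00",
                "2024-03-02 07:15:00",
                "2024-03-02 12:00:00",
                "2024-04-01 00:00:00",  # outside March, dropped by the filter
            ]
        ),
        "fare_amount": [99.0, 10.0, 13.5, 20.0, 12.5, 15.5, 99.0],
    }
)

# Keep only pickups in the half-open interval [2024-03-01, 2024-04-01).
in_march = (trips["lpep_pickup_datetime"] >= "2024-03-01") & (
    trips["lpep_pickup_datetime"] < "2024-04-01"
)
march = trips.loc[in_march]

# Median (quantile 0.5) of fare_amount per calendar day.
daily_median = march.groupby(march["lpep_pickup_datetime"].dt.date)["fare_amount"].median()

# Day with the highest daily median and the median itself.
max_day = daily_median.idxmax()
max_median = daily_median.max()
print(f"Maximum daily median fare_amount: {max_median} on {max_day}")
```

If the pandas result and the Evidently result disagree on the real data, the date filter or the grouping column is the first place to look.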
