# Week6 Assignments (part 1)

**Please do all the assignments for this week using the "mlops_eng3" Conda environment.** (You can create it by following `week6_tutorial.ipynb` in the tutorial directory.)

In this week's assignments, you'll gain hands-on experience with monitoring your model using Prometheus and Evidently. The assignments consist of three parts. The first part (this notebook) is about configuring Prometheus alerting rules for your inference service. In the [second part](./week6_assignments_part2.ipynb), you'll use Evidently to monitor your model's performance, working through the phases this requires, such as collecting the ground truth and calculating the model performance metrics with Evidently. In the [final part](./week6_assignments_part3.ipynb), you'll build a KFP pipeline to unify the phases required to monitor model performance.

**Guidelines for submitting assignments**:
- In the first part, you'll need to write some configurations in a YAML file (`manifests/prometheus-config-patch.yaml`), so please include the YAML file in your submission.  
- For every remaining assignment, a code skeleton is provided. Please put your solutions between the `### START CODE HERE` and `### END CODE HERE` code comments. Please **do not change any code other than the code between the `### START CODE HERE` and `### END CODE HERE` comments**.
- Some assignments also require you to capture screenshots. Please put your screenshots in a PDF. 
- Your submission should include the following files:
    - `prometheus-config-patch.yaml` (You should find it in the "manifests" directory)
    - `week6_assignments_part2.ipynb` and `week6_assignments_part3.ipynb` (You don't need to return `week6_assignments_part1.ipynb`)
    - `pipeline.yaml` (this file will be generated when you complete the third part of the assignments)
    - The PDF containing your screenshots

***Important!*** When submitting the files, please **do not** change the file names or put any of them in any sub-folder.


In [1]:
import lightgbm
import warnings
import pandas as pd
from pathlib import Path

from utils.config import FEATURE_STORE_DIR_NAME
from utils.utils import get_model_info, train, send_requests

warnings.filterwarnings("ignore")

assert lightgbm.__version__ == "3.3.5", "Incorrect version of lightgbm"

## Preparation before starting the assignments
Let's begin by training a model and deploying it to KServe. The model is trained on the house price data that we used in the Week2 assignments. Its task is to predict the price of a house given information about the house, such as its building year and living area.

The raw training data can be found in the "raw_data/reference/train" directory. The training data has been feature-engineered using the `etl` function from the second week and is split into a feature file (0_0_X.parquet) and a target file (0_0_y.csv) in the "feature_store_quarterly" directory. (The directory name reflects the fact that we'll save features and targets in it on a quarterly basis.)

In [2]:
WORKING_DIR = Path.cwd()
# Prepare training data
train_x = pd.read_parquet(WORKING_DIR / FEATURE_STORE_DIR_NAME / "0_0_X.parquet")
train_y = pd.read_csv(WORKING_DIR / FEATURE_STORE_DIR_NAME / "0_0_y.csv")

# Model hyperparameters (hyperparameter optimization was performed)
params = {
    "colsample_bytree": 0.7,
    "learning_rate": 0.075,
    "max_depth": 50,
    "min_child_samples": 5,
    "min_split_gain": 20.0,
    "n_estimators": 1000,
    "num_leaves": 100,
    "reg_lambda": 50.0,
    "subsample": 0.1,
    "random_state": 42,
}
model_info = get_model_info()
if model_info is None:
    print("There is no model with a 'stage-Production' tag, start training a model")
    model_info = train(train_x, train_y, params)
model_version = model_info.model_version
model_s3_uri = model_info.model_s3_uri
print(f"The model version is: {model_version}, its S3 URI is: {model_s3_uri}")

There is no model with a 'stage-Production' tag, start training a model


Registered model 'Week6LgbmHousePrice' already exists. Creating a new version of this model...
2024/12/10 15:54:04 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: Week6LgbmHousePrice, version 2
Created version '2' of model 'Week6LgbmHousePrice'.


The model version is: 2, its S3 URI is: s3://mlflow/11/66d2b291052147e08ad5b2feb32dd4f2/artifacts/lgbm-house


After running the code cell above, you should see at [http://mlflow-server.local](http://mlflow-server.local) that the model you just trained has a "stage" tag with the value "Production".

<img src="./images/mlflow-prod-model.png" width="800" />

You'll also see the S3 URI of your uploaded model printed. Let's then deploy the model to KServe. Before running the next cell, replace the `storageUri` in [manifests/house-price.yaml](./manifests/house-price.yaml) with your own S3 URI.
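
If you'd rather not edit the manifest by hand, the replacement can also be scripted. The following is only a minimal sketch: the helper `set_storage_uri` and the sample manifest text are hypothetical (the actual layout of `manifests/house-price.yaml` may differ), and the sketch works on a string instead of touching your file.

```python
import re

def set_storage_uri(manifest_text: str, new_uri: str) -> str:
    """Replace the value of the first `storageUri:` key, keeping its indentation."""
    pattern = re.compile(r"^([ \t]*storageUri:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids any special handling of backslashes in new_uri
    updated, count = pattern.subn(lambda m: m.group(1) + new_uri, manifest_text, count=1)
    if count == 0:
        raise ValueError("no storageUri key found in the manifest")
    return updated

# Hypothetical manifest layout, used for illustration only
sample_manifest = """\
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: house-price
  namespace: kserve-inference
spec:
  predictor:
    model:
      modelFormat:
        name: lightgbm
      storageUri: s3://REPLACE/ME
"""

# In the notebook you would pass `model_s3_uri` from the training cell instead
updated = set_storage_uri(
    sample_manifest,
    "s3://mlflow/11/66d2b291052147e08ad5b2feb32dd4f2/artifacts/lgbm-house",
)
print(updated)
```

To apply this to the real file, you could read it with `Path("manifests/house-price.yaml").read_text()` and write the result back with `write_text()`, after checking that the printed manifest looks right.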

In [12]:
# Deploy an inference service named "house-price"
!kubectl apply -f manifests/house-price.yaml

inferenceservice.serving.kserve.io/house-price configured


Expected output:
```text
inferenceservice.serving.kserve.io/house-price created
```

In [16]:
# Check if the "house-price" inference service is ready
!kubectl -n kserve-inference get isvc house-price

NAME          URL                                               READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION           AGE
house-price   http://house-price.kserve-inference.example.com   True           100                              house-price-predictor-00003   3d2h


Expected output:
```text
NAME          URL                                               READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                   AGE
house-price   http://house-price.kserve-inference.example.com   True           100                              house-price-predictor-default-00001   55s
```

In [17]:
# Then make sure there's a running pod for the "house-price" inference service
!kubectl -n kserve-inference get pods -l serving.kserve.io/inferenceservice=house-price

NAME                                                      READY   STATUS        RESTARTS   AGE
house-price-predictor-00002-deployment-54df65f6d7-fpclm   2/2     Terminating   0          3d2h
house-price-predictor-00003-deployment-64b978ff9c-lfmvm   2/2     Running       0          48s


Expected output:
```text
NAME                                                              READY   STATUS    RESTARTS   AGE
house-price-predictor-default-00001-deployment-748bc8bc67-r7p46   2/2     Running   0          68s
```
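
During a rollout, the pod of an old revision may still be listed in the `Terminating` state next to the new one, so it helps to look at the STATUS column and not just count the rows. As a rough sketch, the hypothetical helper `running_pods` below parses the plain-text output of `kubectl get pods` and keeps only the pods that are `Running` with all containers ready. It assumes the default column layout (NAME, READY, STATUS, ...).

```python
from typing import List

def running_pods(kubectl_output: str) -> List[str]:
    """Return the names of pods that are Running with all containers ready."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    names = []
    for line in lines[1:]:  # skip the header row
        name, ready_col, status = line.split()[:3]
        ready, total = ready_col.split("/")
        if status == "Running" and ready == total:
            names.append(name)
    return names

# Sample output showing an old revision that is still terminating
sample_output = """\
NAME                                                      READY   STATUS        RESTARTS   AGE
house-price-predictor-00002-deployment-54df65f6d7-fpclm   2/2     Terminating   0          3d2h
house-price-predictor-00003-deployment-64b978ff9c-lfmvm   2/2     Running       0          48s
"""

print(running_pods(sample_output))
```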

## Assignment 1: Monitoring 4xx responses of your inference service (2 points)
In this assignment, your task is to add a Prometheus alerting rule so that Prometheus will trigger an alert when your inference service gives too many client error responses (i.e., responses whose HTTP status code is 4xx). 

(If HTTP status codes are new to you, more information can be found [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).)

You need to add an alerting rule to [manifests/prometheus-config-patch.yaml](./manifests/prometheus-config-patch.yaml) that **immediately** triggers an alert when your "house-price" inference service running in the "kserve-inference" namespace returns more than **ten** 4xx error responses within the **past minute**.

Please add your alerting rule between the comments "### START ALERTING RULE" and "### END ALERTING RULE". Please **do not change any text other than the text between the "### START ALERTING RULE" and "### END ALERTING RULE" comments**. Please **include the file "prometheus-config-patch.yaml" in your submission.**

Hints:
- An essential part of configuring an alerting rule is to decide which PromQL query to use. It may be easier if you first test your query in the Prometheus UI [http://prometheus-server.local](http://prometheus-server.local) before writing your alerting rule to the configuration file. You can use the following code cell to send some invalid requests to your inference service and then go to the Prometheus UI to test if your query can retrieve a reasonable value. 
- You may find the `revision_app_request_count` metric useful (introduced in the tutorial). You can then use the labels `namespace_name`, `isvc_name`, `response_code_class` to only include the responses you want to monitor. 
- You may also find the PromQL function [increase()](https://prometheus.io/docs/prometheus/latest/querying/functions/#increase) useful. 
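
To get a feeling for what `increase()` measures, the sketch below computes the growth of a counter over a time window from raw `(timestamp, value)` samples. It is a simplified illustration, not Prometheus's actual implementation: the function name `simple_increase` and the sample data are made up, and the real `increase()` additionally extrapolates to the window boundaries, so it can return non-integer values.

```python
from typing import List, Tuple

def simple_increase(samples: List[Tuple[float, float]], now: float, window: float) -> float:
    """Growth of a counter within (now - window, now], treating any drop as a reset to zero."""
    in_window = [value for ts, value in samples if now - window < ts <= now]
    total = 0.0
    for prev, cur in zip(in_window, in_window[1:]):
        # A counter only goes up; if it went down, assume it restarted from zero
        total += cur - prev if cur >= prev else cur
    return total

# Made-up counter samples scraped every 15 seconds; the counter resets between t=30 and t=45
samples = [(0, 3), (15, 5), (30, 9), (45, 2), (60, 6)]

# Samples at t=15..60 fall in the window: (9 - 5) + 2 + (6 - 2)
print(simple_increase(samples, now=60, window=60))  # 10.0
```

The reset handling is the reason to use a function like `increase()` on a counter instead of subtracting two raw values yourself: a pod restart would otherwise make the difference negative.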

In [18]:
# Let's pretend your downstream application is somehow broken and starts sending invalid requests to the inference service.
# The input data in each request is in the wrong format, so the inference service will return responses with
# the 422 (Unprocessable Entity) HTTP status code
send_requests(model_name="house-price", input=[None for _ in range(16)], count=20)

1 requests have been sent
2 requests have been sent
3 requests have been sent
4 requests have been sent
5 requests have been sent
6 requests have been sent
7 requests have been sent
8 requests have been sent
9 requests have been sent
10 requests have been sent
11 requests have been sent
12 requests have been sent
13 requests have been sent
14 requests have been sent
15 requests have been sent
16 requests have been sent
17 requests have been sent
18 requests have been sent
19 requests have been sent
20 requests have been sent


In [19]:
# Update the Prometheus configuration by patching your alerting rule to the ConfigMap consumed by the Prometheus pod
!kubectl -n monitoring patch configmap prometheus-server-conf --patch-file manifests/prometheus-config-patch.yaml

configmap/prometheus-server-conf patched (no change)


In [20]:
# Delete the old Prometheus pod so a new one that consumes the updated ConfigMap will be recreated automatically
!kubectl -n monitoring delete pod -l app=prometheus-server

pod "prometheus-deployment-7df47656d7-4qtws" deleted


In [21]:
# Check if the new Prometheus pod is ready
!kubectl -n monitoring get pod -l app=prometheus-server

NAME                                     READY   STATUS    RESTARTS   AGE
prometheus-deployment-7df47656d7-w5gnm   1/1     Running   0          4s


Expected output:
```text
NAME                                     READY   STATUS    RESTARTS   AGE
prometheus-deployment-7b898cb9d8-g8wd2   1/1     Running   0          6s
```
The "AGE" should be relatively small since this pod should be created after you deleted the old one. 

In [22]:
# Send invalid requests again; the alert for too many 4xx responses should then be triggered
send_requests(model_name="house-price", input=[None for _ in range(16)], count=40)

1 requests have been sent
2 requests have been sent
3 requests have been sent
4 requests have been sent
5 requests have been sent
6 requests have been sent
7 requests have been sent
8 requests have been sent
9 requests have been sent
10 requests have been sent
11 requests have been sent
12 requests have been sent
13 requests have been sent
14 requests have been sent
15 requests have been sent
16 requests have been sent
17 requests have been sent
18 requests have been sent
19 requests have been sent
20 requests have been sent
21 requests have been sent
22 requests have been sent
23 requests have been sent
24 requests have been sent
25 requests have been sent
26 requests have been sent
27 requests have been sent
28 requests have been sent
29 requests have been sent
30 requests have been sent
31 requests have been sent
32 requests have been sent
33 requests have been sent
34 requests have been sent
35 requests have been sent
36 requests have been sent
37 requests have been sent
38 request

Now you can go to [http://prometheus-server.local/alerts](http://prometheus-server.local/alerts) and check whether the alert for too many 4xx responses has been triggered.
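
Besides the web UI, Prometheus also exposes its alerts through the HTTP API endpoint `/api/v1/alerts`. The sketch below shows one way to list the alerts that are currently firing. It assumes the API is reachable under the same host as the UI, and the helper names and the sample payload (including the alert names in it) are made up for illustration.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen

def firing_alerts(payload: Dict[str, Any]) -> List[str]:
    """Extract the names of alerts in the 'firing' state from an /api/v1/alerts response."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [a["labels"]["alertname"] for a in alerts if a.get("state") == "firing"]

def fetch_alerts(base_url: str = "http://prometheus-server.local") -> Dict[str, Any]:
    """Fetch active alerts; assumes the Prometheus API is served at base_url."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=10) as resp:
        return json.load(resp)

# Made-up response in the shape returned by the endpoint (alert names are placeholders)
sample_payload = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ExampleFiringAlert"}, "state": "firing"},
            {"labels": {"alertname": "ExamplePendingAlert"}, "state": "pending"},
        ]
    },
}

print(firing_alerts(sample_payload))
```

In the notebook you could call `firing_alerts(fetch_alerts())` after sending the invalid requests. You still need the screenshot from the web UI for your submission.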

### Screenshots for Assignment 1
Please take a screenshot of the triggered alert and put it in your PDF file. Expand the alert entry so that your PromQL query is shown in the screenshot. Note that the example below shows a different alert; it only illustrates what your screenshot should include.

<details>
    <summary>Example</summary>
    <img src="./images/alert-example.png" width="1000" />
</details>

You can now move on to the [second part](./week6_assignments_part2.ipynb) of the assignments.