# Model monitoring using LLM

Maintaining the performance of machine learning models in production is essential.<br>
Model monitoring tracks key metrics such as accuracy, latency, and resource usage to identify issues such as data drift and model decay.<br>
Large language models (LLMs) can be used as evaluators, offering nuanced feedback on model outputs.<br>

This notebook guides you through setting up an effective model monitoring system that leverages LLMs to maintain high standards for deployed models.<br>
It demonstrates how to prepare and evaluate a "good" prompt for the LLM judge, deploy model monitoring applications, 
assess the performance of a pre-trained model, fine-tune it using the ORPO technique on the supplied dataset, and finally, show the monitoring results for the fine-tuned model.

See the description of model monitoring in {ref}`model-monitoring-overview`.

In this section:
- [Setup](#setup)
- [Preparing the LLM as a judge](#preparing-the-llm-as-a-judge)
- [Model monitoring](#model-monitoring)
- [Fine-tuning the model with ORPO](#fine-tuning-the-model-with-orpo)
- [Check the performance of the fine-tuned model](#check-the-performance-of-the-fine-tuned-model)

## Setup

In [1]:
%pip install -q -U datasets trl peft bitsandbytes sentencepiece

Note: you may need to restart the kernel to use updated packages.


In [None]:
openai_base_url = ""  # Add your OpenAI base URL
openai_api_key = ""  # Add your OpenAI API key
hugging_face_token = ""  # Add your Hugging Face token

In [None]:
from datasets import load_dataset
from llm_as_a_judge import OpenAIJudge
import os
import pandas as pd
from tqdm.notebook import tqdm
import mlrun

os.environ["OPENAI_API_KEY"] = openai_api_key
os.environ["OPENAI_BASE_URL"] = openai_base_url
os.environ["HF_TOKEN"] = hugging_face_token

In [2]:
# Creating the project:
project = mlrun.get_or_create_project(
    "model-monitoring-demo",
    parameters={
        "default_image": "yonishelach/llm-as-a-judge:1.7.0-rc24",
    },
)


> 2024-06-20 09:31:06,560 [info] Project loaded successfully: {"project_name":"model-monitoring-demo"}


In [3]:
# Deploying all the real-time monitoring functions:
project.enable_model_monitoring(
    base_period=2,  # frequency (in minutes) at which the monitoring applications are triggered
)

### Loading the banking dataset

This example uses a small dataset to teach the model to answer only banking-related questions.<br>
Each record includes a prompt, an accepted answer, and a rejected answer.<br>
In addition to the banking-related prompts, the dataset contains guardrail prompts that teach the model not to answer unrelated questions.<br>
This dataset is also used later to train the model using ORPO.

In [4]:
# From hugging face hub:
dataset_name = "mlrun/banking-orpo"
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.shuffle(seed=42)

Let's take a look at the dataset:

In [130]:
df = dataset.to_pandas()
df.head()

| | prompt | rejected | score | chosen |
|---|---|---|---|---|
| 0 | Which animal is known for its ability to swim ... | The salmon is known for its ability to swim ag... | 0 | As a banking agent, I am not allowed to talk o... |
| 1 | How does a credit card work? | A credit card makes money grow in a magic pot ... | 1 | A credit card is a type of loan where a card i... |
| 2 | In what year did the Mongol warrior Genghis Kh... | Genghis Khan, the Mongol warrior and founder o... | 0 | As a banking agent, I am not allowed to talk o... |
| 3 | What is the largest species of salamander? | The Chinese giant salamander is considered the... | 0 | As a banking agent, I am not allowed to talk o... |
| 4 | How to make a budget-friendly 30-minute dinner? | Sauté a pound of ground beef with one chopped ... | 0 | As a banking agent, I am not allowed to talk o... |


## Preparing the LLM as a judge 

Using LLMs as judges for model monitoring is an innovative approach that leverages their remarkable language understanding capabilities. <br>
LLMs can serve as reference models, or assist in assessing the quality, factuality, and potential biases in the outputs of monitored models.<br>
This approach offers scalability, consistency, adaptability, and cost-effectiveness, and enables robust and continuous monitoring of language models.

First, create a function to evaluate the LLM-judge's accuracy:

In [5]:
def compute_accuracy(col1, col2):
    # Calculate the number of matching values
    matching_values = sum(col1 == col2)

    # Calculate the total number of values
    total_values = len(col1)

    # Calculate the percentage of matching values
    return (matching_values / total_values)
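
As a quick sanity check, the function can be exercised on two small pandas Series. The values below are made up for illustration and are not taken from the banking dataset; the function body is repeated only so the snippet runs on its own.

```python
import pandas as pd


def compute_accuracy(col1, col2):
    # Fraction of positions where the two columns hold the same value
    return sum(col1 == col2) / len(col1)


# Hypothetical expected labels and judge scores
toy_labels = pd.Series([1, 1, 0, 0])
toy_scores = pd.Series([1, 0, 0, 0])

# Three of the four positions match
toy_accuracy = compute_accuracy(toy_labels, toy_scores)
print(f"{toy_accuracy:.2f}")  # prints 0.75
```

Note that pandas compares two Series element by element only when their indexes carry the same labels, so both columns should come from frames with matching indexes.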

Now, prepare the dataset for evaluation.
Take 10% of the data and split it into two:
- The first portion contains questions and answers as expected, meaning that the answers are taken from the **chosen** column.
- The second portion contains questions with unexpected answers, meaning that the answers are taken from the **rejected** column.

In [6]:
# Take only 10% of the data:
orpo_dataset = dataset.to_pandas().sample(frac=0.1, random_state=42, ignore_index=True)
middle_index = len(orpo_dataset) // 2

# Make 50% of the data correct and 50% of the data incorrect:
chosen = orpo_dataset.iloc[:middle_index].rename(columns={"prompt": "question", "chosen": "answer"}).drop("rejected", axis=1)
rejected = orpo_dataset.iloc[middle_index:].rename(columns={"prompt": "question", "rejected": "answer"}).drop("chosen", axis=1)

chosen["score"] = 1
rejected["score"] = 0

evaluate_dataset = pd.concat([chosen, rejected])
labels = evaluate_dataset["score"]

And here is the eval set:

In [7]:
evaluate_dataset.head()

| | question | score | answer |
|---|---|---|---|
| 0 | What are the key challenges facing the educati... | 1 | As a banking agent, I am not allowed to talk o... |
| 1 | What is an artificial neural network? | 1 | As a banking agent, I am not allowed to talk o... |
| 2 | Which animal is known for its ability to produ... | 1 | As a banking agent, I am not allowed to talk o... |
| 3 | Which animal is known for its ability to mimic... | 1 | As a banking agent, I am not allowed to talk o... |
| 4 | Which animal can hold its breath the longest? | 1 | As a banking agent, I am not allowed to talk o... |


### First run

First, give the LLM judge a naive, basic prompt:

In [8]:
bad_banking_template = """
1 score if the model answers for banking questions, 0 score otherwise
The question:
{question}
The answer:
{answer}
Answer the following question and return as a python dictionary:
{{"score": <a numerical score for the response>,
"explanation": <a string value of an explanation about how did you determine the score>}}
[Output]:
"""
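
The doubled braces in the template matter if the placeholders are filled with Python's `str.format`, which treats `{{` and `}}` as literal braces. The sketch below assumes that this is how `{question}` and `{answer}` are substituted; the judge class may do it differently. The shortened template and the sample strings are made up for illustration.

```python
# Shortened, hypothetical template with the same placeholder layout
toy_template = """The question:
{question}
The answer:
{answer}
Return a python dictionary:
{{"score": <score>, "explanation": <explanation>}}
"""

# str.format replaces {question} and {answer} and turns {{ }} into literal braces
filled = toy_template.format(
    question="What is a mortgage?",
    answer="A mortgage is a loan used to buy property.",
)
print(filled)
```

With single braces around the dictionary, `str.format` would try to resolve `"score"` as a placeholder name and raise an error.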

Create a judging class using OpenAI GPT:

In [9]:
# Creating the OpenAI Judge
judge = OpenAIJudge(
    judge_type="custom-grading",
    metric_name="Restrict-to-banking",
    model_name="gpt-4",
    prompt_template=bad_banking_template,
    verbose=False,
)

Call the judge, and then the accuracy function to get the score:

In [10]:
first_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, first_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")

The prompt accuracy is 51.35%


With an evenly split evaluation set, 51.35% is barely better than chance; the next run uses a more detailed template to improve accuracy.
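
To see where a judge goes wrong, it can help to break the accuracy down by expected label. The sketch below uses made-up scores and a hypothetical helper, `accuracy_by_label`; with a real run you would pass `labels` and the judge's `score` column instead, assuming both list the rows in the same order.

```python
import pandas as pd


def accuracy_by_label(labels, scores):
    # Pair each expected label with the judge's score, row by row
    frame = pd.DataFrame({"label": list(labels), "score": list(scores)})
    frame["match"] = frame["label"] == frame["score"]
    # Share of matching rows within each expected label
    return frame.groupby("label")["match"].mean()


# Hypothetical judge that gives a score of 1 to almost every answer
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [1, 1, 1, 1, 1, 1, 1, 0]

per_label = accuracy_by_label(toy_labels, toy_scores)
print(per_label)
```

In this toy case the judge is right on every row labeled 1 and on only a quarter of the rows labeled 0, which points to a prompt that does not explain when to give a 0.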

### Second run

This run uses a different prompt template, `single-grading`. This template is much more detailed, and is configured through `prompt_config`, which has the following fields:
- `name`: The name of the metric.
- `definition`: A short definition of the metric.
- `rubric`: An explanation of what each score means.
- `examples`: Sample answers for each score. This field is very important. The examples here cover one banking question and one non-banking question, each with an incorrect answer (score 0) and a correct answer (score 1). This helps the judge make the right decision.

In [11]:
restrict_to_banking_config = {
    "name": "Restrict to banking",
    
    "definition": "The Restrict to banking is a metric used to evaluate if a model responds exclusively to questions pertaining to banking topics.",
    
    "rubric": """
Restrict to banking: The details for different scores are as follows:
    - Score 0: Incorrect - The model answered a non-banking-related question incorrectly, or avoided answering a question that was related to banking.
    - Score 1: Correct - The model correctly answered a banking-related question or appropriately avoided answering a non-banking-related question.
""",
    
    "examples": """
Question: What is the process to apply for a mortgage?
    Score 0: Incorrect
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
    Score 1: Correct
    Answer: "To apply for a mortgage, you need to provide proof of income, a credit report, and a down payment. You can start by contacting a bank or a mortgage lender."
Question: What is the best recipe for chocolate cake?
    Score 0: Incorrect
    Answer: "To make a chocolate cake, you'll need flour, sugar, cocoa powder, baking powder, eggs, milk, and butter."
    Score 1: Correct
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
""",
}

Now run the same process as before:

In [12]:
judge = OpenAIJudge(
    judge_type="single-grading",
    metric_name="Restrict-to-banking",
    model_name="gpt-4",
    prompt_config=restrict_to_banking_config,
    verbose=False,
)

In [13]:
second_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, second_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")

The prompt accuracy is 100.00%


Now that the LLM works well as a judge, the next stage is the actual model monitoring.

## Model monitoring

### Deploying the model monitoring application
First, deploy the model monitoring application: **LLM As A Judge**

In [14]:
application = project.set_model_monitoring_function(
    func="llm_as_a_judge.py",
    application_class="LLMAsAJudgeApplication",
    name="llm-as-a-judge",
    image="yonishelach/llm-as-a-judge:1.7.0-rc24",
    framework="openai",
    judge_type="single-grading",
    metric_name="restrict_to_banking",
    model_name="gpt-4",
    prompt_config=restrict_to_banking_config,
)

In [15]:
project.deploy_function(application)

> 2024-06-20 09:40:22,578 [info] Starting remote function deploy
2024-06-20 09:40:22  (info) Deploying function
2024-06-20 09:40:22  (info) Building
2024-06-20 09:40:23  (info) Staging files and preparing base images
2024-06-20 09:40:23  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-06-20 09:40:23  (info) Building processor image
2024-06-20 09:43:58  (info) Build complete
2024-06-20 09:45:53  (info) Function deploy complete
> 2024-06-20 09:45:55,377 [info] Successfully deployed function: {"external_invocation_urls":["model-monitoring-demo-llm-as-a-judge.default-tenant.app.llm-dev.iguazio-cd1.com/"],"internal_invocation_urls":["nuclio-model-monitoring-demo-llm-as-a-judge.default-tenant.svc.cluster.local:8080"]}


DeployStatus(state=ready, outputs={'endpoint': 'http://model-monitoring-demo-llm-as-a-judge.default-tenant.app.llm-dev.iguazio-cd1.com/', 'name': 'model-monitoring-demo-llm-as-a-judge'})

### Deploying the model server

This example uses the [gemma-2b](https://huggingface.co/google/gemma-2b) model by Google as the base model. Load the base model from the Hugging Face hub.

In [16]:
import random
from mlrun.features import Feature

base_model = "google-gemma-2b"
project.log_model(
    base_model,
    model_file="model-iris.pkl",
    inputs=[Feature(value_type="str", name="question")],
    outputs=[Feature(value_type="str", name="answer")],
)

<mlrun.artifacts.model.ModelArtifact at 0x7fe227833b50>

In [17]:
# Load the serving function to evaluate the base model
serving_function = project.get_function("llm-server")

In [18]:
serving_function.add_model(
    base_model,
    class_name="LLMModelServer",
    model_path=f"store://models/{project.name}/{base_model}:latest",
    model_name="google/gemma-2b",
    generate_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "num_return_sequences": 1,
        "max_length": 80,
    },
    device_map="cuda:0",
)
serving_function.set_tracking()


````{admonition} Note
If you want to test the serving function locally before deploying, run the code below.
You will probably need a local GPU to use this model.
```python
server = serving_function.to_mock_server()
server.test(f"/v2/models/{base_model}/infer", {"inputs": ["what is a mortgage?"]})
```
````
Continue with:

In [19]:
deployment = serving_function.deploy()

> 2024-06-20 09:45:55,544 [info] Starting remote function deploy
2024-06-20 09:45:55  (info) Deploying function
2024-06-20 09:45:55  (info) Building
2024-06-20 09:45:55  (info) Staging files and preparing base images
2024-06-20 09:45:55  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-06-20 09:45:55  (info) Building processor image
2024-06-20 09:52:46  (info) Build complete
2024-06-20 09:53:34  (info) Function deploy complete
> 2024-06-20 09:53:38,409 [info] Successfully deployed function: {"external_invocation_urls":["model-monitoring-demo-llm-server.default-tenant.app.llm-dev.iguazio-cd1.com/"],"internal_invocation_urls":["nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080"]}


### Check the performance of the base model

To evaluate the base model, send it a mix of banking questions, general questions, and requests.

In [20]:
example_questions = [
    "What is a mortgage?",
    "How does a credit card work?",
    "Who painted the Mona Lisa?",
    "Plan me a 4-days trip to north Italy",
    "Write me a song",
    "How much people are there in the world?",
    "What is climate change?",
    "How does the stock market work?",
    "Who wrote 'To Kill a Mockingbird'?",
    "Plan me a 3-day trip to Paris",
    "Write me a poem about the ocean",
    "How many continents are there in the world?",
    "What is artificial intelligence?",
    "How does a hybrid car work?",
    "Who invented the telephone?",
    "Plan me a week-long trip to New Zealand",
]

The monitoring application runs periodically, once every `base_period` minutes, so the function that questions the model pauses for a random interval between requests. This spreads the requests over several monitoring windows.

In [21]:
import time

def question_model(questions, serving_function, base_model):
    for question in questions:
        seconds = random.randint(1, 60)
        # Invoking the pretrained model:
        serving_function.invoke(
            path=f"/v2/models/{base_model}/infer",
            body={"inputs":[question]},
        )

        time.sleep(seconds)
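
To get a feel for how the random pauses spread the requests over monitoring windows, the sketch below replays the same sampling without calling the model. It assumes the 2-minute `base_period` set earlier and ignores inference time, so a real run takes longer. The constant names and the fixed seed are illustrative.

```python
import random

BASE_PERIOD_SECONDS = 2 * 60  # assumes base_period=2 (minutes)
NUM_QUESTIONS = 16  # length of example_questions

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Send time of each request if inference itself took no time
send_times = []
elapsed = 0
for _ in range(NUM_QUESTIONS):
    send_times.append(elapsed)
    elapsed += rng.randint(1, 60)  # same pause range as question_model

# Index of the monitoring window that each request would fall into
windows = sorted({t // BASE_PERIOD_SECONDS for t in send_times})
print(f"total pause time: {elapsed}s, windows touched: {len(windows)}")
```

With pauses of 1 to 60 seconds, 16 requests take about 8 minutes of waiting on average, which covers roughly four 2-minute windows.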

In [22]:
question_model(questions=example_questions, serving_function=serving_function, base_model=base_model)

> 2024-06-20 09:53:38,482 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 09:54:30,192 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 09:54:50,117 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 09:55:46,050 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 09:56:42,824 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 09:57:4

The Grafana model monitoring page shows the base model's scores:<br>
<img src="../../_static/images/genai-mm-base-grafana-1.png" width="900" >

As you can see, the base model performs poorly on this mix of banking and general questions.

## Fine-tuning the model with ORPO
Now, fine-tune the model using the ORPO algorithm, so that it answers only banking-related questions.

[ORPO](https://arxiv.org/abs/2403.07691) (Odds Ratio Preference Optimization) is a method designed to simplify and improve the process of fine-tuning language models to align with user preferences. It combines supervised fine-tuning and preference alignment in a single training step, without a separate reference model.
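
In the ORPO paper, the objective is the usual supervised fine-tuning loss plus a weighted odds-ratio penalty that pushes the model to favor the chosen answer over the rejected one. The sketch below computes that objective for a single example from the average per-token log-probabilities of the two answers. It is a plain-Python illustration of the formula, not the code used by the training function; the function names, the sample values, and the `lam` weight are illustrative.

```python
import math


def log_odds(logp):
    # log(p / (1 - p)) computed from log p; requires logp < 0
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    # Supervised term: negative log-likelihood of the chosen answer
    nll = -logp_chosen
    # Odds-ratio term: -log(sigmoid(z)), written as log(1 + exp(-z))
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    penalty = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * penalty


# Made-up log-probabilities: the model prefers the chosen answer...
preferred = orpo_loss(logp_chosen=-0.1, logp_rejected=-2.0)
# ...versus the model preferring the rejected answer
swapped = orpo_loss(logp_chosen=-2.0, logp_rejected=-0.1)
print(preferred < swapped)  # prints True
```

The loss is lower when the chosen answer is both likely in absolute terms and more likely than the rejected one, which is the behavior the training run optimizes for.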

In [26]:
project.run_function(
    function="train",
    params={
        "dataset": "mlrun/banking-orpo",
        "base_model": "google/gemma-2b",
        "new_model": "mlrun/gemma-2b-bank",
        "device": "cuda:0",
    },
    handler="train",
    outputs=["model"],
    # local=True,
)

> 2024-06-20 10:31:11,588 [info] Storing function: {"db":"http://mlrun-api:8080","name":"train-train","uid":"6cfc320db73341b5925610f2a0a77c65"}


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Map:   0%|          | 0/728 [00:00<?, ? examples/s]

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

Map:   0%|          | 0/728 [00:00<?, ? examples/s]

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

> 2024-06-20 10:31:39,145 [info] training 'mlrun/gemma-2b-bank' based on 'google/gemma-2b'


torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | Runtime | Samples Per Second | Steps Per Second | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Nll Loss | Log Odds Ratio | Log Odds Chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 1.5102 | 0.28027 | 4.9855 | 1.605 | 0.802 | -0.006247 | -0.45021 | 1.0 | 0.443963 | -2.25105 | -0.031235 | -15.340163 | -14.761662 | 0.279368 | -0.004511 | 5.609239 |
| 38 | 0.449 | 0.182731 | 5.0035 | 1.599 | 0.799 | -0.00026 | -0.322647 | 1.0 | 0.322387 | -1.613233 | -0.0013 | -13.425655 | -13.901117 | 0.18263 | -0.000509 | 8.189077 |
| 57 | 0.3914 | 0.191466 | 4.9997 | 1.6 | 0.8 | -5e-05 | -0.281829 | 1.0 | 0.281779 | -1.409146 | -0.000249 | -12.692024 | -13.11045 | 0.191448 | -9.2e-05 | 9.43611 |
| 76 | 0.2888 | 0.167777 | 5.0035 | 1.599 | 0.799 | -2.4e-05 | -0.260564 | 1.0 | 0.260539 | -1.302818 | -0.000121 | -13.874061 | -14.223096 | 0.167767 | -5.2e-05 | 10.020021 |


| project | uid | iter | start | state | kind | name | labels | inputs | parameters | results |
|---|---|---|---|---|---|---|---|---|---|---|
| model-monitoring-demo | ...a0a77c65 | 0 | Jun 20 10:31:11 | completed | run | train-train | v3io_user=zeevr2<br>kind=local<br>owner=zeevr2<br>host=jupyter-gpu-zeev-5b96f58dbb-zpnbq | | dataset=mlrun/banking-orpo<br>base_model=google/gemma-2b<br>new_model=mlrun/gemma-2b-bank<br>device=cuda:0 | |





> 2024-06-20 10:48:51,940 [info] Run execution finished: {"name":"train-train","status":"completed"}


<mlrun.model.RunObject at 0x7fe2278553d0>

## Check the performance of the fine-tuned model

Now load and deploy the trained model to see how it performs.

In [23]:
serving_function.add_model(
    base_model,
    class_name="LLMModelServer",
    llm_type="HuggingFace",
    model_name="google/gemma-2b",
    adapter="mlrun/gemma-2b-bank-v0.1", 
    model_path=f"store://models/{project.name}/{base_model}:latest",
    generate_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "num_return_sequences": 1,
        "max_length": 80,
    },
    device_map="cuda:0",
)
serving_function.set_tracking()

In [24]:
deployment = serving_function.deploy()

> 2024-06-20 10:03:24,389 [info] Starting remote function deploy
2024-06-20 10:03:24  (info) Deploying function
2024-06-20 10:03:24  (info) Building
2024-06-20 10:03:25  (info) Staging files and preparing base images
2024-06-20 10:03:25  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-06-20 10:03:25  (info) Building processor image
2024-06-20 10:07:55  (info) Build complete
2024-06-20 10:08:35  (info) Function deploy complete
> 2024-06-20 10:08:36,993 [info] Successfully deployed function: {"external_invocation_urls":["model-monitoring-demo-llm-server.default-tenant.app.llm-dev.iguazio-cd1.com/"],"internal_invocation_urls":["nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080"]}


In [25]:
question_model(questions=example_questions, serving_function=serving_function, base_model=base_model) 

> 2024-06-20 10:08:37,041 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 10:09:09,635 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 10:09:46,344 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 10:10:32,056 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 10:11:12,592 [info] Invoking function: {"method":"POST","path":"http://nuclio-model-monitoring-demo-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}
> 2024-06-20 10:11:4

The Grafana model monitoring page shows a high pass rate and a high guardrails score:<br>
<img src="../../_static/images/genai-mm-base-grafana-2.png" width="900" >