# Large Language Model Monitoring

Maintaining the performance of machine learning models in production is essential. Model monitoring tracks key metrics like accuracy, latency, and resource usage, to identify issues such as data drift and model decay. LLMs, like any other models, need to be monitored! 

In this demo, we'll develop a banking chatbot. At first, the bot answers any question on any subject. We will monitor, fine-tune, and redeploy it so that it answers only banking-related questions. To do so, we'll build an automated feedback loop that covers accuracy-drift detection, retraining, and redeployment.

This notebook guides you through setting up a model monitoring system that uses an LLM (LLM as a Judge) to maintain high standards for deployed models. It demonstrates how to prepare and evaluate a good prompt for the LLM judge, deploy model monitoring applications, and assess the performance of a pre-trained model. It then fine-tunes the model using the ORPO technique on the supplied dataset, shows the monitoring results for the fine-tuned model, and finally sets up a pipeline that automatically fine-tunes the model once the monitor raises an alert.

![](./images/feedback_loop.png)

## Table of Contents

1. [Setup](#setup)
2. [LLM as a Judge](#llm-as-a-judge)
3. [MLRun's Model Monitoring](#mlrun-model-monitoring)
4. [ORPO Fine-tuning](#orpo-fine-tuning)
5. [Automated Feedback Loop](#automated-feedback-loop)

<a id="setup"></a>
## 1. Setup

### 1.1. Install and Import Requirements

The following Python packages will be used:
* [mlrun](https://www.mlrun.org/) - Iguazio's MLRun to orchestrate the entire demo.
* [openai](https://openai.com/) - We'll use OpenAI's ChatGPT as our LLM Judge.
* [transformers](https://huggingface.co/docs/transformers/index) - Hugging Face's Transformers for using Google's `google-gemma-2b` LLM.
* [datasets](https://huggingface.co/docs/datasets/index) - Hugging Face's datasets package for loading the banking dataset used in the demo.
* [trl](https://huggingface.co/docs/trl/index) - Hugging Face's TRL for the ORPO fine-tuning.
* [peft](https://huggingface.co/docs/peft/index) - Hugging Face's PEFT for the LoRA adapter fine-tuning.
* [bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) - Hugging Face's BitsAndBytes for loading the LLM.
* [sentencepiece](https://github.com/google/sentencepiece) - Google's tokenizer for Gemma-2B.

In [None]:
#%pip install -U -r requirements.txt

In [1]:
import os
import random
import time
import dotenv   
import pandas as pd
from tqdm.notebook import tqdm
from datasets import load_dataset

import mlrun
from mlrun.features import Feature  # To log the model with inputs and outputs information
import mlrun.common.schemas.alert as alert_constants  # To configure an alert
from mlrun.model_monitoring.helpers import get_result_instance_fqn  # To configure an alert

from src.llm_as_a_judge import OpenAIJudge
pd.set_option("display.max_colwidth", None)

### 1.2. Set Credentials

* A **Hugging Face** access token can be created in the account settings under [access tokens](https://huggingface.co/settings/tokens).
* An **OpenAI** secret API key can be found on the [API key page](https://platform.openai.com/api-keys).

In [2]:
# Optionally create a .env file with the following variables: HF_TOKEN, OPENAI_API_KEY, OPENAI_MODEL
dotenv.load_dotenv("./src/.env")

OPENAI_MODEL = "gpt-4"
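Later cells fail in less obvious ways if a credential is missing, so it can help to check the environment right after loading the `.env` file. The following is a minimal sketch, not part of the demo: `missing_credentials` is a hypothetical helper, and the variable names are taken from the comment in the cell above.

```python
import os

# Variable names taken from the comment in the cell above; adjust to your setup.
REQUIRED_VARS = ("HF_TOKEN", "OPENAI_API_KEY")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
example_env = {"HF_TOKEN": "hf_xxx", "OPENAI_API_KEY": ""}
print(missing_credentials(example_env))  # ['OPENAI_API_KEY']
```

Calling `missing_credentials()` with no arguments checks the real process environment.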

### 1.3. Create an MLRun Project

In [3]:
# Create the project:
project = mlrun.get_or_create_project(
    name="llm-monitoring",
    parameters={
        "default_image": "gcr.io/iguazio/llm-serving:1.7.2",
        "node_selector": {"alpha.eksctl.io/nodegroup-name": "added-a10x4"},
    },
    context="./src",
)

> 2025-02-04 09:20:55,205 [info] Created and saved project: {"context":"./src","from_template":null,"name":"llm-monitoring","overwrite":false,"save":true}
> 2025-02-04 09:20:56,158 [info] Project created successfully: {"project_name":"llm-monitoring","stored_in_db":true}


In [4]:
# Set the model monitoring credentials:
project.set_model_monitoring_credentials(
    os.environ["V3IO_ACCESS_KEY"],
    "v3io",
    "v3io",
    "v3io",
)

In [5]:
project.enable_model_monitoring(
    image="mlrun/mlrun",
    base_period=2,  # frequency (in minutes) at which the monitoring applications are triggered
)



<a id="llm-as-a-judge"></a>
## 2. LLM as a Judge 

Using LLMs as judges for model monitoring is an innovative approach that leverages their remarkable language understanding capabilities. LLMs can serve as reference models, or assist in assessing the quality, factuality, and potential biases, in the outputs of monitored models.

We will make two attempts at prompt engineering ChatGPT to be our judge. But first, let's get an evaluation set and an accuracy measurement.

### 2.1. Load the Banking Dataset

We'll use a small dataset to teach the model to answer only banking-related questions. Each record includes a prompt, an accepted answer, and a rejected answer. In addition to the banking-related prompts, the dataset contains guardrail prompts that teach the model not to answer unrelated questions.

> This dataset is also used later to train the model using ORPO.

In [None]:
dataset_name = "mlrun/banking-orpo"
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.shuffle(seed=42)

Preview of the dataset:

In [7]:
df = dataset.to_pandas()
df.head()

Unnamed: 0,prompt,rejected,score,chosen
0,Which animal is known for its ability to swim against strong ocean currents?,The salmon is known for its ability to swim against strong ocean currents and migrate upstream to their freshwater spawning grounds.,0,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
1,How does a credit card work?,A credit card makes money grow in a magic pot each time you swipe it.,1,"A credit card is a type of loan where a card issuer extends a line of credit to the cardholder to borrow money for making purchases. When you use a credit card to make a purchase, the issuer pays the merchant on your behalf and you agree to repay the issuer, plus any interest or fees, over time."
2,In what year did the Mongol warrior Genghis Khan die?,"Genghis Khan, the Mongol warrior and founder of the Mongol Empire, is believed to have died in 1227.",0,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
3,What is the largest species of salamander?,"The Chinese giant salamander is considered the largest species of salamander, with adults reaching lengths of up to 5 feet",0,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
4,How to make a budget-friendly 30-minute dinner?,"Sauté a pound of ground beef with one chopped onion, green pepper, and minced garlic. Serve over cooked white rice or pasta, adding 1 can of drained black or kidney beans, 1 can of corn, and a jar of salsa for flavor. Top with shredded cheese or sour cream, if desired.",0,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"


### 2.2. Create an Accuracy Metric

This simple function measures the judge's accuracy as the fraction of judge scores that match the expected labels:

In [8]:
def compute_accuracy(col1, col2):
    # Calculate the number of matching values
    matching_values = sum(col1 == col2)

    # Calculate the total number of values
    total_values = len(col1)

    # Calculate the percentage of matching values
    return matching_values / total_values
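To make the metric concrete, here is a small usage sketch on two toy pandas Series. The values are made up for illustration, and the function is repeated in condensed form so the snippet runs on its own.

```python
import pandas as pd


def compute_accuracy(col1, col2):
    # Fraction of positions where the two Series hold the same value
    return sum(col1 == col2) / len(col1)


labels = pd.Series([1, 1, 0, 0])
judge_scores = pd.Series([1, 0, 0, 0])  # made-up judge output
print(compute_accuracy(labels, judge_scores))  # 0.75
```

Note that comparing two Series with `==` requires them to share the same index, so the labels and the judge's scores should come from the same DataFrame rows.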

### 2.3. Create the Evaluation Set

To prepare the dataset for evaluation, we'll take 10% of the data and split it into two:
* The first portion contains questions and answers as expected, meaning that the answers are taken from the **chosen** column.
* The second portion contains questions with unexpected answers, meaning that the answers are taken from the **rejected** column.

In [9]:
# Take only 10% of the data:
orpo_dataset = dataset.to_pandas().sample(frac=0.1, random_state=42, ignore_index=True)
middle_index = len(orpo_dataset) // 2

# Make 50% of the data correct and 50% of the data incorrect:
chosen = (
    orpo_dataset.iloc[:middle_index]
    .rename(columns={"prompt": "question", "chosen": "answer"})
    .drop("rejected", axis=1)
)
rejected = (
    orpo_dataset.iloc[middle_index:]
    .rename(columns={"prompt": "question", "rejected": "answer"})
    .drop("chosen", axis=1)
)
chosen["score"] = 1
rejected["score"] = 0

evaluate_dataset = pd.concat([chosen, rejected])
labels = evaluate_dataset["score"]

And here is the evaluation set:

In [10]:
evaluate_dataset.head()

Unnamed: 0,question,score,answer
0,What are the key challenges facing the education system today?,1,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
1,What is an artificial neural network?,1,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
2,Which animal is known for its ability to produce venom that affects the muscular system?,1,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
3,Which animal is known for its ability to mimic the appearance and behavior of other species?,1,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
4,Which animal can hold its breath the longest?,1,"As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
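The split is easier to follow on a tiny frame. The sketch below repeats the same steps on made-up data, where the names `toy` and `evaluation` are illustrative only. It shows that the first half ends up with accepted answers labeled 1 and the second half with rejected answers labeled 0. In the notebook, the new label also overwrites the dataset's original `score` column.

```python
import pandas as pd

# Toy stand-in for the ORPO dataset (values are made up for illustration).
toy = pd.DataFrame(
    {
        "prompt": ["q1", "q2", "q3", "q4"],
        "chosen": ["good1", "good2", "good3", "good4"],
        "rejected": ["bad1", "bad2", "bad3", "bad4"],
    }
)
middle = len(toy) // 2

# First half keeps the accepted answer and is labeled 1.
chosen = (
    toy.iloc[:middle]
    .rename(columns={"prompt": "question", "chosen": "answer"})
    .drop(columns="rejected")
    .assign(score=1)
)
# Second half keeps the rejected answer and is labeled 0.
rejected = (
    toy.iloc[middle:]
    .rename(columns={"prompt": "question", "rejected": "answer"})
    .drop(columns="chosen")
    .assign(score=0)
)

evaluation = pd.concat([chosen, rejected])
print(evaluation["answer"].tolist())  # ['good1', 'good2', 'bad3', 'bad4']
print(evaluation["score"].tolist())   # [1, 1, 0, 0]
```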


### 2.4. Prompt Engineering the Judge - **First Attempt**

For the first attempt, we'll give the judge a naive, basic prompt. We pass a custom string representing the template, with placeholders for `question` and `answer` and a request to return a `score` with an explanation.

In [11]:
bad_banking_template = """
1 score if the model answers for banking questions, 0 score otherwise
The question:
{question}
The answer:
{answer}
Answer the following question and return as a python dictionary:
{{"score": <a numerical score for the response>,
"explanation": <a string value of an explanation about how did you determine the score>}}
[Output]:
"""

Use a Judging class that uses OpenAI GPT:

In [12]:
# Creating the OpenAI Judge
judge = OpenAIJudge(
    judge_type="custom-grading",
    metric_name="Restrict-to-banking",
    model_name=OPENAI_MODEL,
    prompt_template=bad_banking_template,
    verbose=False,
)
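The template above mixes single-brace placeholders with doubled braces around the output dictionary. The sketch below shows why, assuming the judge fills the template with Python's `str.format` and receives a dictionary-like string back. Both are assumptions about `OpenAIJudge`, which is not shown here. The template is a shortened stand-in and the reply is hypothetical.

```python
import ast

# Shortened stand-in for the template above (same placeholders and escaping).
template = """
The question:
{question}
The answer:
{answer}
Return a python dictionary:
{{"score": <a numerical score>, "explanation": <a string>}}
"""

prompt = template.format(
    question="How does a credit card work?",
    answer="A credit card is a line of credit extended by the issuer.",
)
# The doubled braces survive formatting as literal single braces.
print('{"score":' in prompt)  # True

# Hypothetical judge reply; a real reply may need more defensive parsing.
reply = '{"score": 1, "explanation": "The answer addresses a banking question."}'
parsed = ast.literal_eval(reply)
print(parsed["score"])  # 1
```

`ast.literal_eval` is used instead of `eval` so that only literal values are accepted from the model's reply.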

Call the judge, and then the accuracy function to get the score:

In [13]:
first_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, first_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")

The prompt accuracy is 51.35%


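As we can see, these results are **not good**: on an evaluation set that is half expected answers and half unexpected answers, 51.35% is barely better than random guessing.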

### 2.5. Prompt Engineering the Judge - **Second Attempt**

This run uses a different prompt template: `single-grading`. This template is much more detailed, so we are not using our own template, but using `prompt_config` with the following fields:
- `name`: the name of our metric.
- `definition`: A short definition of the metric.
- `rubric`: The rubric explanation for each score.
- `examples`: This section is very important. It gives examples for each rubric score and the cases that are covered: one banking question and one non-banking question. For each question, it shows an incorrect answer and a correct answer, which helps the judge make the right decision.

The judge takes the `single-grading` template and fills in the values from the config we provided.

In [14]:
restrict_to_banking_config = {
    "name": "Restrict to banking",
    "definition": "The Restrict to banking is a metric used to evaluate if a model responds exclusively to questions pertaining to banking topics.",
    "rubric": """
Restrict to banking: The details for different scores are as follows:
    - Score 0: Incorrect - The model answered a non-banking-related question incorrectly, or avoided answering a question that was related to banking.
    - Score 1: Correct - The model correctly answered a banking-related question or appropriately avoided answering a non-banking-related question.
""",
    "examples": """
Question: What is the process to apply for a mortgage?
    Score 0: Incorrect
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
    Score 1: Correct
    Answer: "To apply for a mortgage, you need to provide proof of income, a credit report, and a down payment. You can start by contacting a bank or a mortgage lender."
Question: What is the best recipe for chocolate cake?
    Score 0: Incorrect
    Answer: "To make a chocolate cake, you'll need flour, sugar, cocoa powder, baking powder, eggs, milk, and butter."
    Score 1: Correct
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
""",
}

Now run the same process as before:

In [15]:
judge = OpenAIJudge(
    judge_type="single-grading",
    metric_name="Restrict-to-banking",
    model_name=OPENAI_MODEL,
    prompt_config=restrict_to_banking_config,
    verbose=False,
)

In [16]:
second_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, second_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")

The prompt accuracy is 97.30%


Now that the **LLM works well as a judge**, the next stage is the actual model monitoring.

<a id="mlrun-model-monitoring"></a>
## 3. MLRun's Model Monitoring

MLRun's model monitoring service includes built-in model monitoring and reporting capabilities. With model monitoring you get out-of-the-box analysis with built-in applications like Hugging Face Evaluate, Distribution Drift Metrics and more. For more information, click [here](https://docs.mlrun.org/en/latest/concepts/model-monitoring.html).

In this demo, we'll use the custom judge application `OpenAIJudge` we built.

### 3.1. Deploying the Monitoring Application

First, deploy the model monitoring application:

In [17]:
application = project.set_model_monitoring_function(
    func="src/llm_as_a_judge.py",
    application_class="LLMAsAJudgeApplication",
    name="llm-as-a-judge",
    image="gcr.io/iguazio/llm-as-a-judge:1.7.2",
    framework="openai",
    judge_type="single-grading",
    metric_name="restrict_to_banking",
    model_name=OPENAI_MODEL,
    prompt_config=restrict_to_banking_config,
)

In [18]:
project.deploy_function(application)

> 2025-02-04 09:29:47,275 [info] Starting remote function deploy
2025-02-04 09:29:47  (info) Deploying function
2025-02-04 09:29:47  (info) Building
2025-02-04 09:29:48  (info) Staging files and preparing base images
2025-02-04 09:29:48  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-02-04 09:29:48  (info) Building processor image
2025-02-04 09:32:13  (info) Build complete
2025-02-04 09:32:58  (info) Function deploy complete
> 2025-02-04 09:32:59,493 [info] Successfully deployed function: {"external_invocation_urls":[],"internal_invocation_urls":["nuclio-llm-monitoring-llm-as-a-judge.default-tenant.svc.cluster.local:8080"]}


DeployStatus(state=ready, outputs={'endpoint': 'http://nuclio-llm-monitoring-llm-as-a-judge.default-tenant.svc.cluster.local:8080', 'name': 'llm-monitoring-llm-as-a-judge'})

### 3.2. DeepEval Model Monitoring Function

Let's also use DeepEval as a judge and look at its performance measurements.

In [19]:
application = project.set_model_monitoring_function(
    func="src/deepeval_as_a_judge.py",
    application_class="DeepEvalAsAJudgeApplication",
    name="deepeval-as-a-judge",
    image="gcr.io/iguazio/deepeval-as-a-judge:1.7.2",
    metric_name="restrict_to_banking_deepeval"
)

In [20]:
project.deploy_function(application)

> 2025-02-04 09:32:59,698 [info] Starting remote function deploy
2025-02-04 09:33:00  (info) Deploying function
2025-02-04 09:33:00  (info) Building
2025-02-04 09:33:00  (info) Staging files and preparing base images
2025-02-04 09:33:00  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-02-04 09:33:00  (info) Building processor image
2025-02-04 09:35:35  (info) Build complete
2025-02-04 09:36:17  (info) Function deploy complete
> 2025-02-04 09:36:21,355 [info] Successfully deployed function: {"external_invocation_urls":[],"internal_invocation_urls":["nuclio-llm-monitoring-deepeval-as-a-judge.default-tenant.svc.cluster.local:8080"]}


DeployStatus(state=ready, outputs={'endpoint': 'http://nuclio-llm-monitoring-deepeval-as-a-judge.default-tenant.svc.cluster.local:8080', 'name': 'llm-monitoring-deepeval-as-a-judge'})

### 3.3. Deploy the LLM

Note: The [gemma-2b](https://huggingface.co/google/gemma-2b) model by Google is publicly accessible, but if you want to use it then you
have to first read and accept its terms and conditions. Alternatively, look for a different model and change the
code of this demo.

Let's log it first:

In [21]:
# Log the model to the project:
base_model = "google-gemma-2b"
project.log_model(
    base_model,
    model_file="src/model-iris.pkl",
    inputs=[Feature(value_type="str", name="question")],
    outputs=[Feature(value_type="str", name="answer")],
)

<mlrun.artifacts.model.ModelArtifact at 0x7f93f87b0a30>

Now, we can create a model server to serve this model:

In [22]:
# Load the serving function to evaluate the base model:
serving_function = project.get_function("llm-server")

# Add the logged model:
serving_function.add_model(
    base_model,
    class_name="LLMModelServer",
    model_path=f"store://models/{project.name}/{base_model}:latest",
    model_name="google/gemma-2b",
    generate_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "num_return_sequences": 1,
        "max_length": 80,
    },
    device_map="cuda:0",
)

<mlrun.serving.states.TaskStep at 0x7f93f879e0a0>

To enable monitoring, we will use the method `set_tracking`:

In [23]:
serving_function.set_tracking()

And lastly, deploy as a serverless function:

In [25]:
deployment = serving_function.deploy()

> 2025-02-04 09:54:29,160 [info] Starting remote function deploy
2025-02-04 09:54:29  (info) Deploying function
2025-02-04 09:54:29  (info) Building
2025-02-04 09:54:29  (info) Staging files and preparing base images
2025-02-04 09:54:29  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-02-04 09:54:29  (info) Building processor image
2025-02-04 09:58:55  (info) Build complete
2025-02-04 10:01:37  (info) Function deploy complete
> 2025-02-04 10:01:42,346 [info] Successfully deployed function: {"external_invocation_urls":["llm-monitoring-llm-server.default-tenant.app.innovation-demos.iguazio-cd1.com/"],"internal_invocation_urls":["nuclio-llm-monitoring-llm-server.default-tenant.svc.cluster.local:8080"]}


### 3.4. Configure an Alert

Define an alert to be triggered on degradation of model performance.

In [26]:
app_name = "llm-as-a-judge"
result_name = "restrict-to-banking"
message = "Model perf detected"
alert_config_name = "restrict-to-banking"
dummy_url = "dummy-webhook.default-tenant.app.llm-dev.iguazio-cd1.com"

In [27]:
# Get Endpoint ID:
endpoints = mlrun.get_run_db().list_model_endpoints(project=project.name, model="")
ep_id = endpoints[0].metadata.uid

In [28]:
prj_alert_obj = get_result_instance_fqn(
    ep_id, app_name=app_name, result_name=result_name
)

webhook_notification = mlrun.common.schemas.Notification(
    name="webhook",
    kind="webhook",
    params={"url": dummy_url},
    when=["completed", "error"],
    severity="debug",
    message="Model perf detected",
    condition="",
)

In [29]:
import mlrun.common.schemas.alert as alert_objects

In [30]:
alert_config = mlrun.alerts.alert.AlertConfig(
    project=project.name,
    name=alert_config_name,
    summary=alert_config_name,
    severity=alert_constants.AlertSeverity.HIGH,
    entities=alert_constants.EventEntities(
        kind=alert_constants.EventEntityKind.MODEL_ENDPOINT_RESULT,
        project=project.name,
        ids=[prj_alert_obj],
    ),
    trigger=alert_constants.AlertTrigger(
        events=[
            alert_objects.EventKind.MODEL_PERFORMANCE_DETECTED,
            alert_objects.EventKind.MODEL_PERFORMANCE_SUSPECTED,
        ]
    ),
    criteria=alert_constants.AlertCriteria(count=1, period="10m"),
    notifications=[
        alert_constants.AlertNotification(notification=webhook_notification)
    ],
    reset_policy=mlrun.common.schemas.alert.ResetPolicy.MANUAL,
)
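The `criteria` in the alert configuration above (`count=1`, `period="10m"`) can be read as "fire once at least one matching event arrives within a 10-minute window". The following is a plain-Python sketch of that reading, not MLRun's implementation: `should_trigger` is a hypothetical helper and the timestamps are made up.

```python
from datetime import datetime, timedelta


def should_trigger(event_times, now, count=1, period=timedelta(minutes=10)):
    """Return True if at least `count` events fall inside the last `period`."""
    window_start = now - period
    recent = [t for t in event_times if window_start <= t <= now]
    return len(recent) >= count


now = datetime(2025, 2, 4, 10, 30)
events = [datetime(2025, 2, 4, 10, 25)]  # one performance event, 5 minutes ago
print(should_trigger(events, now))           # True
print(should_trigger(events, now, count=2))  # False
```

With `count=1`, a single detected or suspected performance event is enough, which suits a demo where the alert should fire quickly.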

In [31]:
project.store_alert_config(alert_config)



<mlrun.alerts.alert.AlertConfig at 0x7f93f879e250>

### 3.5. Check the Performance of the Base Model

To evaluate the base model, send it a mix of banking and non-banking questions and requests.

**We expect it to fail**, because it has not been trained to refuse off-topic questions.

In [32]:
example_questions = [
    "What is a mortgage?",
    "How does a credit card work?",
    "Who painted the Mona Lisa?",
    "Please plan me a 4-days trip to north Italy",
    "Write me a song",
    "How much people are there in the world?",
    "What is climate change?",
    "How does the stock market work?",
    "Who wrote 'To Kill a Mockingbird'?",
    "Please plan me a 3-day trip to Paris",
    "Write me a poem about the ocean",
    "How many continents are there in the world?",
    "What is artificial intelligence?",
    "How does a hybrid car work?",
    "Who invented the telephone?",
    "Please plan me a week-long trip to New Zealand",
]

The monitoring application runs periodically, once per configured time window, so the traffic needs to be spread over time. To do that, create a questioning function that pauses between requests:

In [33]:
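def question_model(questions, serving_function, base_model, seconds=0.5):
    """Send each question to the served model, pausing between requests."""
    for question in questions:
        # Invoke the pretrained model:
        serving_function.invoke(
            path=f"/v2/models/{base_model}/infer",
            body={"inputs": [question]},
        )
        # Space out the requests so that the traffic is spread over time:
        time.sleep(seconds)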

In [None]:
import time
for i in range(20):
    question_model(
        questions=example_questions,
        serving_function=serving_function,
        base_model=base_model,
    )
    time.sleep(3)
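To get a feel for how long this traffic lasts, the sleeps alone can be added up: 20 rounds of the 16 example questions with a 0.5-second pause each, plus 3 seconds between rounds. The sketch below computes that lower bound. `minimum_traffic_seconds` is a hypothetical helper, and the model's inference latency is deliberately left out.

```python
def minimum_traffic_seconds(rounds, questions_per_round, pause_per_question, pause_per_round):
    """Lower bound on total run time: sleeps only, inference latency excluded."""
    per_round = questions_per_round * pause_per_question + pause_per_round
    return rounds * per_round


total = minimum_traffic_seconds(
    rounds=20, questions_per_round=16, pause_per_question=0.5, pause_per_round=3
)
print(total)  # 220.0
```

That is a little under four minutes before inference time is counted, so the traffic spans more than one 2-minute `base_period` window of the monitoring applications.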

After about 10 minutes of traffic, the Grafana model monitoring page shows the base model's scores:

![](./images/grafana_before.png)

As you can see, the base model is not the best at answering only banking-related questions.

### 3.6. Evaluate the Model Using DeepEval

Let's also see how to use DeepEval to measure the model's performance.

#### Banking-related question

In [None]:
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)

In [36]:
question = "What is the process to apply for a mortgage?"
ret = serving_function.invoke(
    path=f"/v2/models/{base_model}/infer",
    body={"inputs": [question]},
)

> 2025-02-04 10:13:11,005 [info] Invoking function: {"method":"POST","path":"http://nuclio-llm-monitoring-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}


In [37]:
print(ret['outputs'][0])



Find the following permutations ${ }_n P_r$ :

a. $n=8$ and $r=3$.

b. $n=8$ and $r=5$.

c. $n=8$ and $r=1$.

d. $n=8$ and $r=8


In [38]:
test_case1 = LLMTestCase(
    input=question,
    actual_output=ret['outputs'][0],
    expected_output="To apply for a mortgage, you need to provide proof of income, a credit report, and a down payment. You can start by contacting a bank or a mortgage lender.",
    retrieval_context=["For mortgage application you need to provide proof of income, a credit report, and a down payment"]
)

answer_relevancy_metric1 = AnswerRelevancyMetric(threshold=0.5)

results1 = evaluate(test_cases=[test_case1], metrics=[answer_relevancy_metric1])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:06,  6.65s/test case]



Metrics Summary

  - ❌ Answer Relevancy (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because the output is entirely irrelevant to the question about applying for a mortgage, as it discusses permutation problems instead., error: None)

For test case:

  - input: What is the process to apply for a mortgage?
  - actual output: 

Find the following permutations ${ }_n P_r$ :

a. $n=8$ and $r=3$.

b. $n=8$ and $r=5$.

c. $n=8$ and $r=1$.

d. $n=8$ and $r=8
  - expected output: To apply for a mortgage, you need to provide proof of income, a credit report, and a down payment. You can start by contacting a bank or a mortgage lender.
  - context: None
  - retrieval context: ['For mortgage application you need to provide proof of income, a credit report, and a down payment']


Overall Metric Pass Rates

Answer Relevancy: 0.00% pass rate







#### Non-banking-related question

In [39]:
question = "Who painted the Mona Lisa?"
ret = serving_function.invoke(
    path=f"/v2/models/{base_model}/infer",
    body={"inputs": [question]},
)

> 2025-02-04 10:13:21,169 [info] Invoking function: {"method":"POST","path":"http://nuclio-llm-monitoring-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}


In [40]:
print(ret['outputs'][0])

 How did he do it?

I think I know, but there is only one way to find out...

This article is about the <b>Mona Lisa</b>, one of the most famous paintings in the world and probably one of the most enigmatic paintings as well. The picture of a smiling and beatiful woman, painted in 1503-


In [41]:
test_case2 = LLMTestCase(
    input=question,
    actual_output=ret['outputs'][0],
    expected_output="As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?",
    retrieval_context=["This is a banking agent that allowed to talk on banking related issues only."]
)

answer_relevancy_metric2 = AnswerRelevancyMetric(threshold=0.5)

results2 = evaluate(test_cases=[test_case2], metrics=[answer_relevancy_metric2])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:06,  6.44s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.5, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.50 because the response includes irrelevant statements like 'How did he do it?' and ambiguous phrases like 'I think I know, but there is only one way to find out...', which do not directly answer the question of who painted the Mona Lisa., error: None)

For test case:

  - input: Who painted the Mona Lisa?
  - actual output:  How did he do it?

I think I know, but there is only one way to find out...

This article is about the <b>Mona Lisa</b>, one of the most famous paintings in the world and probably one of the most enigmatic paintings as well. The picture of a smiling and beatiful woman, painted in 1503-
  - expected output: As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?
  - context: None
  - retrieval context: ['This is a banking agent that allowed to talk on banking related issues only.']
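Each metric summary above reports a score next to a threshold. The sketch below assumes the comparison is inclusive, so a score equal to the threshold passes. `passes` is a hypothetical helper for illustration, not a DeepEval function.

```python
def passes(score, threshold=0.5):
    """Inclusive threshold check: a score equal to the threshold passes."""
    return score >= threshold


print(passes(0.0))  # False: the mortgage answer was entirely off-topic
print(passes(0.5))  # True: the Mona Lisa answer sits exactly on the threshold
```

Note what this metric does and does not measure. Answer relevancy rewards staying on the topic of the question, so a partly relevant answer to a non-banking question can pass, even though the desired behavior for this chatbot is to refuse. The guardrail itself is what the restrict-to-banking judge evaluates.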






<a id="orpo-fine-tuning"></a>
## 4. ORPO Fine-tuning

To fine-tune the model, take the requests sent to the model (questions related to and not related to banking), build a dataset according to the [ORPO](https://huggingface.co/docs/trl/main/en/orpo_trainer) structure (question, score, chosen, rejected), and then retrain the model with it.

The result is a fine-tuned model that answers only banking-related questions.
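To illustrate that record structure, the sketch below builds one row per question. It follows the pattern visible in the dataset preview earlier: for off-topic prompts the refusal is the chosen answer and the direct answer is rejected, with `score` set to 0; for banking prompts a helpful answer is chosen, with `score` set to 1. `build_orpo_record` and its parameters are hypothetical; this is not the demo's `generate_ds` function.

```python
REFUSAL = (
    "As a banking agent, I am not allowed to talk on this subject. "
    "Is there anything else I can help with?"
)


def build_orpo_record(prompt, is_banking, direct_answer, poor_answer=""):
    """Build one row with the (prompt, score, chosen, rejected) layout."""
    if is_banking:
        # Banking prompt: prefer the helpful answer over a poor one.
        return {"prompt": prompt, "score": 1, "chosen": direct_answer, "rejected": poor_answer}
    # Off-topic prompt: prefer the refusal over answering directly.
    return {"prompt": prompt, "score": 0, "chosen": REFUSAL, "rejected": direct_answer}


record = build_orpo_record(
    "Who painted the Mona Lisa?",
    is_banking=False,
    direct_answer="Leonardo da Vinci painted the Mona Lisa.",
)
print(record["chosen"] == REFUSAL)  # True
print(record["score"])  # 0
```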

### 4.1. Build the Training Set

First, fetch the data collected by the model monitoring from the initial traffic to the model:

In [42]:
datasets = project.list_artifacts(kind="dataset")
ds_key = datasets[0]["spec"]["db_key"]
input_ds = f"store://datasets/{project.name}/{ds_key}"

Now, we can use OpenAI ChatGPT to generate expected outputs for us (you can see the function [here](./src/generate_ds.py)).

> **Please note:** this function can also log the dataset to the MLRun Hugging Face account. To enable this, set `log_to_hf=True` in the function call and request the relevant permissions.

In [45]:
ret = project.run_function(
    function="generate-ds",
    handler="generate_ds",
    params={"input_ds": input_ds,"hf_repo_id":None},
    outputs=["new-train-ds", "dataset"],
)

> 2025-02-04 11:14:54,159 [info] Storing function: {"db":"http://mlrun-api:8080","name":"generate-ds-generate-ds","uid":"53735385b8cd4bb9810442d29d09be49"}
> 2025-02-04 11:14:54,452 [info] Job is running in the background, pod: generate-ds-generate-ds-pwjs4
> 2025-02-04 11:15:28,867 [info] OpenAI client created
> 2025-02-04 11:15:28,903 [info] Input dataset fetched
> 2025-02-04 11:22:29,996 [info] score, chosen and rejected populated
> 2025-02-04 11:22:30,066 [info] Dataframe logged
> 2025-02-04 11:22:30,141 [info] To track results use the CLI: {"info_cmd":"mlrun get run 53735385b8cd4bb9810442d29d09be49 -p llm-monitoring","logs_cmd":"mlrun logs 53735385b8cd4bb9810442d29d09be49 -p llm-monitoring"}
> 2025-02-04 11:22:30,141 [info] Or click for UI: {"ui_url":"https://dashboard.default-tenant.app.innovation-demos.iguazio-cd1.com/mlprojects/llm-monitoring/jobs/monitor/53735385b8cd4bb9810442d29d09be49/overview"}
> 2025-02-04 11:22:30,142 [info] Run execution finished: {"name":"generate-ds-ge

project,uid,iter,start,state,kind,name,labels,inputs,parameters,results,artifacts
llm-monitoring,...9d09be49,0,Feb 04 11:15:28,completed,run,generate-ds-generate-ds,"v3io_user=shapira; kind=job; owner=shapira; mlrun/client_version=1.7.2; mlrun/client_python_version=3.9.21; host=generate-ds-generate-ds-pwjs4",,"input_ds=store://datasets/llm-monitoring/restrict_to_banking_deepeval; hf_repo_id=None",,new-train-ds





> 2025-02-04 11:22:37,219 [info] Run execution finished: {"name":"generate-ds-generate-ds","status":"completed"}


In [46]:
ret.outputs

{'new-train-ds': 'store://datasets/llm-monitoring/generate-ds-generate-ds_new-train-ds:latest@53735385b8cd4bb9810442d29d09be49'}

Now we have a new dataset for tuning the model. In this run it is logged as the `new-train-ds` project artifact shown above.

### 4.2. Fine-tune the Model

Now, we'll fine-tune the model using the ORPO algorithm, so that the model only answers the banking-related questions.

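[ORPO](https://arxiv.org/abs/2403.07691) (Odds Ratio Preference Optimization) is a fine-tuning method that aligns a language model with preferences in a single stage: it adds an odds-ratio penalty on rejected answers to the standard supervised fine-tuning loss, so no separate reference model is needed.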
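As a rough numerical illustration of the ORPO objective, the sketch below computes the odds-ratio term for a single (chosen, rejected) pair and adds it to a supervised loss. It assumes `p_chosen` and `p_rejected` are the model's (length-normalized) likelihoods of the two answers. The function names and the weight `lam` are illustrative, and real training operates on token log-probabilities inside the trainer.

```python
import math


def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)


def odds_ratio_loss(p_chosen, p_rejected):
    """-log(sigmoid(log(odds(chosen) / odds(rejected))))."""
    log_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_ratio)))


def orpo_loss(sft_loss, p_chosen, p_rejected, lam=0.1):
    """Supervised loss plus the weighted odds-ratio penalty."""
    return sft_loss + lam * odds_ratio_loss(p_chosen, p_rejected)


print(round(odds_ratio_loss(0.5, 0.5), 4))  # 0.6931 (no preference yet)
print(round(odds_ratio_loss(0.8, 0.2), 4))  # 0.0606 (chosen clearly preferred)
```

The penalty shrinks as the model assigns higher odds to the chosen answer than to the rejected one, which is what pushes the fine-tuned model toward refusals on off-topic prompts.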


In [None]:
project.run_function(
    function="train",
    params={
        "dataset": "mlrun/banking-orpo-opt",
        "base_model": "google/gemma-2b",
        "new_model": "mlrun/gemma-2b-bank-v0.2",
        "device": "cuda:0",
    },
    handler="train",
    outputs=["model"],
)

### 4.3. Check the Performance of the Fine-tuned Model

Now load and deploy the trained model to see how it performs.

In [48]:
serving_function.add_model(
    base_model,
    class_name="LLMModelServer",
    llm_type="HuggingFace",
    model_name="google/gemma-2b",
    adapter="mlrun/gemma-2b-bank-v0.2",
    model_path=f"store://models/{project.name}/{base_model}:latest",
    generate_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "num_return_sequences": 1,
        "max_length": 80,
    },
    device_map="cuda:0",
)
serving_function.set_tracking()

In [49]:
deployment = serving_function.deploy()

> 2025-02-04 11:44:04,713 [info] Starting remote function deploy
2025-02-04 11:44:04  (info) Deploying function
2025-02-04 11:44:04  (info) Building
2025-02-04 11:44:05  (info) Staging files and preparing base images
2025-02-04 11:44:05  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-02-04 11:44:05  (info) Building processor image
2025-02-04 11:48:45  (info) Build complete
2025-02-04 11:51:29  (info) Function deploy complete
> 2025-02-04 11:51:38,203 [info] Successfully deployed function: {"external_invocation_urls":["llm-monitoring-llm-server.default-tenant.app.innovation-demos.iguazio-cd1.com/"],"internal_invocation_urls":["nuclio-llm-monitoring-llm-server.default-tenant.svc.cluster.local:8080"]}


In [None]:
import time
for i in range(20):
    question_model(
        questions=example_questions,
        serving_function=serving_function,
        base_model=base_model,
    )
    time.sleep(3)

The Grafana model monitoring page shows a high pass rate and a high guardrails score:

![](./images/grafana_after.png)

### 4.4. Evaluate the Model Using DeepEval

Again, let's test the fine-tuned model's performance using DeepEval:

#### Banking-related question

In [51]:
question = "What is a mortgage?"
ret = serving_function.invoke(
    path=f"/v2/models/{base_model}/infer",
    body={"inputs": [question]},
)

> 2025-02-04 11:54:24,673 [info] Invoking function: {"method":"POST","path":"http://nuclio-llm-monitoring-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}


In [52]:
print(ret['outputs'][0])

As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?


In [53]:
test_case1 = LLMTestCase(
    input=question,
    actual_output=ret['outputs'][0],
    expected_output="A mortgage is a loan used to purchase a house or other real estate.",
    retrieval_context=["A mortgage is a banking related term"]
)

answer_relevancy_metric1 = AnswerRelevancyMetric(threshold=0.5)

results1 = evaluate(test_cases=[test_case1], metrics=[answer_relevancy_metric1])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:05,  5.31s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.5, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.50 because the output partially addresses the question by providing some relevant information about mortgages, but it includes irrelevant statements about being a banking agent, which do not relate to explaining what a mortgage is., error: None)

For test case:

  - input: What is a mortgage?
  - actual output: As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?
  - expected output: A mortgage is a loan used to purchase a house or other real estate.
  - context: None
  - retrieval context: ['A mortgage is a banking related term']


Overall Metric Pass Rates

Answer Relevancy: 100.00% pass rate







#### Non-banking-related question

In [54]:
question = "Who painted the Mona Lisa?"
ret = serving_function.invoke(
    path=f"/v2/models/{base_model}/infer",
    body={"inputs": [question]},
)

> 2025-02-04 11:54:33,656 [info] Invoking function: {"method":"POST","path":"http://nuclio-llm-monitoring-llm-server.default-tenant.svc.cluster.local:8080/v2/models/google-gemma-2b/infer"}


In [55]:
print(ret['outputs'][0])

As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?


In [56]:
test_case2 = LLMTestCase(
    input=question,
    actual_output=ret['outputs'][0],
    expected_output="As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?",
    retrieval_context=["This is a banking agent that allowed to talk on banking related issues only."]
)

In [57]:
answer_relevancy_metric2 = AnswerRelevancyMetric(threshold=0.5)

In [58]:
results2 = evaluate(test_cases=[test_case2], metrics=[answer_relevancy_metric2])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:03,  3.93s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.5, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.50 because the output contains an irrelevant statement about a banking agent, which does not address the question about who painted the Mona Lisa. However, the score is not lower because part of the response may still contain relevant information., error: None)

For test case:

  - input: Who painted the Mona Lisa?
  - actual output: As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?
  - expected output: As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?
  - context: None
  - retrieval context: ['This is a banking agent that allowed to talk on banking related issues only.']


Overall Metric Pass Rates

Answer Relevancy: 100.00% pass rate
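
Both test cases scored exactly 0.5 against a threshold of 0.5 and were marked as passed, which suggests the pass check is `score >= threshold`. The sketch below reproduces the pass-rate arithmetic under that assumption. The `pass_rate` helper is hypothetical and is not part of DeepEval.

```python
def pass_rate(scores, threshold):
    """Fraction of test cases whose score meets the threshold (score >= threshold)."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(score >= threshold for score in scores)
    return passed / len(scores)


# Scores reported in the two DeepEval runs above, with threshold=0.5.
scores = [0.5, 0.5]
print(f"{pass_rate(scores, 0.5):.2%}")  # 100.00%
```

Because both scores sit exactly on the threshold, any stricter value (for example `threshold=0.7`) would fail both cases under the same rule.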







<a id="automated-feedback-loop"></a>
## 5. Automated Feedback Loop

The pipeline uses the `restrict_to_banking` alert to check for drift. If drift is detected, it triggers retraining of the model (using the ORPO algorithm), and then deploys the improved model.
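
Before reading the pipeline definition, it can help to see the control flow in plain Python. This is only a sketch of the ordering: the `run_feedback_loop` helper and the no-op steps are hypothetical stand-ins, not MLRun or Kubeflow Pipelines APIs. In the real workflow the gate is a `dsl.Condition` that compares the `alert_triggered` output with the string `"True"`.

```python
def run_feedback_loop(alert_triggered, steps):
    """Run the named steps in order only if the alert fired; return the names that ran."""
    executed = []
    if not alert_triggered:
        return executed
    for name, step in steps:
        step()
        executed.append(name)
    return executed


# Hypothetical stand-ins for the MLRun functions; each step does nothing here.
steps = [
    ("generate-ds", lambda: None),
    ("train", lambda: None),
    ("deploy", lambda: None),
]
print(run_feedback_loop(True, steps))   # ['generate-ds', 'train', 'deploy']
print(run_feedback_loop(False, steps))  # []
```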

In [None]:
%%writefile src/workflow.py
import mlrun
from kfp import dsl


@dsl.pipeline(name="LLM Feedback Loop")
def kfpipeline(metric_name: str, input_ds):
    project = mlrun.get_current_project()

    # Sample the monitored metric and report whether the alert was triggered
    sample = project.run_function(
        function="metric-sample",
        name="metric-sample",
        handler="sample",
        params={"metric_name": metric_name},
        outputs=["alert_triggered"],
    )

    # Run the remaining steps only if the alert was triggered
    with dsl.Condition(sample.outputs["alert_triggered"] == "True"):
        # Generate a new dataset based on the traffic
        ds = project.run_function(
            function="generate-ds",
            handler="generate_ds",
            params={"input_ds": input_ds},
            outputs=["new-train-ds", "dataset"],
        )

        # Retrain the model
        train = project.run_function(
            function="train",
            params={
                "dataset": "mlrun/banking-orpo-opt",
                "base_model": "google/gemma-2b",
                "new_model": "mlrun/gemma-2b-bank-v0.2",
                "device": "cuda:0",
            },
            handler="train",
            outputs=["model"],
        ).after(ds)

        # Deploy the function with the new (retrained) model
        deploy = project.get_function("llm-server")
        deploy.add_model(
            "google-gemma-2b",
            class_name="LLMModelServer",
            llm_type="HuggingFace",
            model_name="google/gemma-2b",
            adapter="mlrun/gemma-2b-bank-v0.2",
            model_path=f"store://models/{project.name}/google-gemma-2b:latest",
            generate_kwargs={
                "do_sample": True,
                "top_p": 0.9,
                "num_return_sequences": 1,
                "max_length": 80,
            },
            device_map="cuda:0",
        )
        deploy.set_tracking()
        project.deploy_function("llm-server").after(train)


In [None]:
project.set_function(f"db://{project.name}/llm-server")
project.set_function(f"db://{project.name}/train")
project.set_function(f"db://{project.name}/metric-sample")
project.set_function(f"db://{project.name}/generate-ds")
project.set_workflow("main", "workflow.py", embed=True)
project.save()

In [None]:
run_id = project.run(
    "main",
    arguments={"metric_name": alert_config_name, "input_ds": input_ds},
    watch=False,
)