# LLM as a Judge

This function implements the approach described in the LLM-as-a-judge paper (https://arxiv.org/abs/2306.05685).

In this notebook we will go over the function's docs and outputs and see an end-to-end example of running it.

1. [Single grading metrics](#chapter1)
2. [Pairwise grading metrics](#chapter2)
3. [Pairwise with reference grading metrics](#chapter3)

<a id="chapter1"></a>
## 1. Single grading metrics


Single grading uses a user-defined metric and returns a score based on that metric's rubric and examples.

### Define the metrics

In [24]:
prompt_config = {
        "name": "accuracy",
        "definition": "The accuracy of the provided answer.",
        "rubric": """Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:
            - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental
              misunderstanding of the topic or question.
            - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key
              elements of the question are addressed incorrectly.
            - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the
              question but lacks depth or precision.
            - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally
              accurate response to the question.
            - Score 5: The answer is completely correct and thorough. It demonstrates a deep and accurate understanding of
              the topic, addressing all elements of the question effectively.""",
        "examples": """
            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Score 2: Significantly Inaccurate
            Answer: "The capital of France is Lyon."
            Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.
            Score 3: Partially Correct
            Answer: "I think the capital of France is either Paris or Marseille."
            Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.
            Score 4: Mostly Correct
            Answer: "The capital of France is Paris, the largest city in the country."
            Explanation: This answer is mostly correct and identifies Paris as the capital. The addition of "the largest city in the country" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question's focus.
            Score 5: Completely Correct and Thorough
            Answer: "The capital of France is Paris, which is not only the country's largest city but also its cultural and political center, hosting major institutions like the President's residence, the Elysée Palace."
            Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris's role as the cultural and political center of France, directly addressing the question with depth and precision.
                     """,
    }

### HF model as Judge

In [29]:
import mlrun
project = mlrun.get_or_create_project(
    name="llm-judge",
    context="./",
    user_project=True
)
llm_judge_fn = project.set_function(name="llm-judge", func="function.yaml")

> 2024-02-16 00:32:49,597 [info] Project loaded successfully: {'project_name': 'llm-judge'}


In [27]:
# Configuration of the Hugging Face model used as the judge
JUDGE_MODEL = "TheBloke/Mistral-7B-OpenOrca-GPTQ"
JUDGE_CONFIG = {
    "device_map": "auto",
    "revision": "gptq-8bit-128g-actorder_True",
    "trust_remote_code": False,
}
JUDGE_INFER_CONFIG = {
    "max_length": 1500,
}
TOKENIZER_JUDGE_CONFIG = {"use_fast": True}
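
The run in the next cell passes `SINGLE_GRADE_PROMPT`, which this notebook never defines. The run parameters logged in the output further down include the template text, so the sketch below reconstructs it from that log and fills it for one data point. How the function fills the template internally is not shown here: plain `str.format` substitution is an assumption, and `fill_prompt` and `demo_config` are hypothetical names used only for illustration.

```python
# Template text copied from the `prompt_template` run parameter in the logged output.
SINGLE_GRADE_PROMPT = """
Task:
Please act as an impartial judge and evaluate the quality of the response provided by an
AI assistant to the user question displayed below. You will be given the definition of {name}, grading rubric, context information.
Your task is to determine a numerical score of {name} for the response. You must use the grading rubric to determine your score. You must also give a explanation about how did you determine the score step-by-step. Please use chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:
{examples}
[User Question]:
{question}
[Response]:
{answer}
[Definition of {name}]:
{definition}
[Grading Rubric]:
{rubric}
You must return the following fields in your output:
- score: a numerical score of {name} for the response
- explanation: a explanation about how did you determine the score step-by-step
[Output]:
"""


def fill_prompt(template: str, config: dict, question: str, answer: str) -> str:
    """Hypothetical helper: substitute the metric config and one question/answer pair."""
    return template.format(question=question, answer=answer, **config)


# A shortened config so the example stays self-contained.
demo_config = {
    "name": "accuracy",
    "definition": "The accuracy of the provided answer.",
    "rubric": "Score 1 (completely incorrect) to Score 5 (completely correct).",
    "examples": "Question: What is the capital of France? ...",
}

prompt = fill_prompt(
    SINGLE_GRADE_PROMPT,
    demo_config,
    question="What is the capital of China?",
    answer="The capital of China is Kongfu",
)
print(prompt)
```

With `str.format`, any literal braces inside the rubric or examples would need to be doubled (`{{` and `}}`); the `prompt_config` defined above contains none.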

In [22]:
single_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/qa.csv",
        "metric_type": "LLMJudgeSingleGrading",
        "name": "accuracy_metrics",
        "model_judge": JUDGE_MODEL,
        "model_judge_config": JUDGE_CONFIG,
        "model_judge_infer_config": JUDGE_INFER_CONFIG,
        "prompt_template": SINGLE_GRADE_PROMPT,
        "prompt_config": prompt_config,
        "tokenizer_judge_config": TOKENIZER_JUDGE_CONFIG,
    },
    returns=[
        "single_result: dataset",
    ],
    local=True,
    artifact_path="./",
)

> 2024-02-15 23:47:18,766 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': 'ebd659d1e9e048309ef8471561185615', 'db': None}
> 2024-02-15 23:47:19,002 [info] Preparing the judge model TheBloke/Mistral-7B-OpenOrca-GPTQ


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


> 2024-02-15 23:48:19,224 [info] Computing the metrics over all data
> 2024-02-15 23:48:19,227 [info] Computing the metrics over one data point with What is the capital of China? and The capital of China is Kongfu
> 2024-02-15 23:48:19,228 [info] Filling the prompt template with the prompt config


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.


> 2024-02-15 23:48:57,315 [info] Extracting the score and explanation from 
Task:
Please act as an impartial judge and evaluate the quality of the response provided by an
AI assistant to the user question displayed below. You will be given the definition of accuracy, grading rubric, context information.
Your task is to determine a numerical score of accuracy for the response. You must use the grading rubric to determine your score. You must also give a explanation about how did you determine the score step-by-step. Please use chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Sco

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


> 2024-02-15 23:50:03,647 [info] Extracting the score and explanation from 
Task:
Please act as an impartial judge and evaluate the quality of the response provided by an
AI assistant to the user question displayed below. You will be given the definition of accuracy, grading rubric, context information.
Your task is to determine a numerical score of accuracy for the response. You must use the grading rubric to determine your score. You must also give a explanation about how did you determine the score step-by-step. Please use chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Sco

| project | uid | iter | start | state | name | labels | inputs | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llm-judge-pengwei | ...61185615 | 0 | Feb 15 23:47:18 | completed | llm-judge-llm-judge | v3io_user=pengwei, kind=local, owner=pengwei, host=jupyter-pengwei-gpu-7777658756-pmddr | | | result |

parameters:

- `input_path`: `data/qa.csv`
- `metric_type`: `LLMJudgeSingleGrading`
- `name`: `accuracy_metrics`
- `model_judge`: `TheBloke/Mistral-7B-OpenOrca-GPTQ`
- `model_judge_config`: `{'device_map': 'auto', 'revision': 'gptq-8bit-128g-actorder_True', 'trust_remote_code': False}`
- `model_judge_infer_config`: `{'max_length': 1500}`
- `prompt_template`: `\nTask:\nPlease act as an impartial judge and evaluate the quality of the response provided by an\nAI assistant to the user question displayed below. You will be given the definition of {name}, grading rubric, context information.\nYour task is to determine a numerical score of {name} for the response. You must use the grading rubric to determine your score. You must also give a explanation about how did you determine the score step-by-step. Please use chain of thinking.\nExamples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.\n[Examples]:\n{examples}\n[User Question]:\n{question}\n[Response]:\n{answer}\n[Definition of {name}]:\n{definition}\n[Grading Rubric]:\n{rubric}\nYou must return the following fields in your output:\n- score: a numerical score of {name} for the response\n- explanation: a explanation about how did you determine the score step-by-step\n[Output]:\n`
- `prompt_config`: the `prompt_config` dictionary defined in the first code cell (name, definition, rubric, examples)
- `tokenizer_judge_config`: `{'use_fast': True}`





> 2024-02-15 23:50:03,870 [info] Run execution finished: {'status': 'completed', 'name': 'llm-judge-llm-judge'}
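
The log line "Extracting the score and explanation from ..." indicates that the function parses the judge's free-text generation, and the logged text begins with the prompt itself. The function's actual parsing code is not shown in this notebook. The sketch below is one possible approach, assuming the judge follows the `- score:` / `- explanation:` fields requested by the prompt template and that everything before the last `[Output]:` marker is the echoed prompt. `extract_score_explanation` and the sample generation are hypothetical.

```python
import re
from typing import Optional, Tuple


def extract_score_explanation(
    generated: str, marker: str = "[Output]:"
) -> Tuple[Optional[int], str]:
    """Return (score, explanation) parsed from the text after the last marker."""
    # Drop the echoed prompt so example scores and explanations are ignored.
    reply = generated.rsplit(marker, 1)[-1]
    score_match = re.search(r"score\s*:\s*(\d+)", reply, flags=re.IGNORECASE)
    expl_match = re.search(
        r"explanation\s*:\s*(.+)", reply, flags=re.IGNORECASE | re.DOTALL
    )
    score = int(score_match.group(1)) if score_match else None
    explanation = expl_match.group(1).strip() if expl_match else ""
    return score, explanation


# Made-up generation (echoed prompt fragment plus reply), for illustration only.
sample_generation = (
    "[Examples]:\n"
    "Explanation: Berlin is the capital of Germany.\n"
    "[Output]:\n"
    "- score: 1\n"
    "- explanation: The answer names a place that is not the capital of China."
)

score, explanation = extract_score_explanation(sample_generation)
print(score, explanation)
```

Returning `None` for a missing score, rather than raising, lets a caller decide how to handle generations that ignore the requested format.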


### OpenAI model as Judge

In [32]:
import os
api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_API_BASE")
OPENAI_MODEL = "gpt-3.5-turbo"
OPENAI_JUDGE_CONFIG = {
    "api_key": api_key,
    "base_url": base_url,
}
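
`os.getenv` returns `None` when a variable is unset, and the parameters logged for the run below show `'api_key': None, 'base_url': None`. A small guard can surface a missing key before the run starts. The sketch below uses a hypothetical `build_openai_judge_config` helper; treating only the API key as mandatory and the base URL as optional is an assumption.

```python
import os
from typing import Mapping


def build_openai_judge_config(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: read the OpenAI settings and fail early without a key."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    # The base URL is passed through as-is; it may be None (assumed optional).
    return {"api_key": api_key, "base_url": env.get("OPENAI_API_BASE")}


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"OPENAI_API_KEY": "sk-placeholder"}
demo_judge_config = build_openai_judge_config(demo_env)
print(demo_judge_config)
```

In the notebook, calling `build_openai_judge_config()` with no argument would read the real environment and could replace the hand-built `OPENAI_JUDGE_CONFIG`.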

In [33]:
openai_single_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/qa.csv",
        "metric_type": "OPENAIJudgeSingleGrading",
        "name": "accuracy_metrics",
        "model_judge": OPENAI_MODEL,
        "model_judge_config": OPENAI_JUDGE_CONFIG,
        "prompt_config": prompt_config,
    },
    returns=[
        "openai_single_result: dataset",
    ],
    local=True,
    artifact_path="./",
)

> 2024-02-16 00:35:53,612 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': 'b902e2b291ac408dadf9ccfcc8206b57', 'db': None}
> 2024-02-16 00:35:53,831 [error] execution error, Traceback (most recent call last):
  File "/User/.conda/envs/llm_judge/lib/python3.9/site-packages/mlrun/runtimes/local.py", line 441, in exec_from_params
    val = mlrun.handler(
  File "/User/.conda/envs/llm_judge/lib/python3.9/site-packages/mlrun/package/__init__.py", line 140, in wrapper
    func_outputs = func(*args, **kwargs)
  File "/tmp/tmphymr1bdq.py", line 1112, in llm_judge
    metric = _get_metrics(**kwargs)
  File "/tmp/tmphymr1bdq.py", line 1097, in _get_metrics
    return MetricsType_dic[metric_type](**kwargs)
TypeError: __init__() missing 1 required positional argument: 'prompt_template'

> 2024-02-16 00:35:53,884 [error] exec error - __init__() missing 1 required positional argument: 'prompt_template'


__init__() missing 1 required positional argument: 'prompt_template'


| project | uid | iter | start | state | name | labels | inputs | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llm-judge-pengwei | ...c8206b57 | 0 | Feb 16 00:35:53 | error | llm-judge-llm-judge | v3io_user=pengwei, kind=local, owner=pengwei, host=jupyter-pengwei-gpu-7777658756-pmddr | | | |

parameters:

- `input_path`: `data/qa.csv`
- `metric_type`: `OPENAIJudgeSingleGrading`
- `name`: `accuracy_metrics`
- `model_judge`: `gpt-3.5-turbo`
- `model_judge_config`: `{'api_key': None, 'base_url': None}`
- `prompt_config`: the `prompt_config` dictionary defined in the first code cell (name, definition, rubric, examples)





> 2024-02-16 00:35:53,935 [info] Run execution finished: {'status': 'error', 'name': 'llm-judge-llm-judge'}


RunError: __init__() missing 1 required positional argument: 'prompt_template'
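
The traceback shows that the metric class selected by `metric_type` is constructed directly from the run parameters (`MetricsType_dic[metric_type](**kwargs)`) and that its `__init__` requires `prompt_template`, which the OpenAI run above does not pass. Adding that key, as the Hugging Face run does, should get past this particular error; whether the OpenAI judge expects the same template text is an assumption. The sketch below builds the parameter dictionary with the missing key and applies a hypothetical pre-flight check, `missing_params`, before the run is launched. The `demo_*` values are placeholders for objects defined in earlier cells.

```python
def missing_params(params: dict, required: tuple) -> list:
    """Hypothetical pre-flight check: names of required parameters that are absent."""
    return [name for name in required if name not in params]


# Placeholders; in the notebook use SINGLE_GRADE_PROMPT, prompt_config
# and OPENAI_JUDGE_CONFIG instead.
demo_template = "... {name} ... {question} ... {answer} ..."
demo_prompt_config = {"name": "accuracy"}
demo_judge_config = {"api_key": "<your key>", "base_url": None}

openai_params = {
    "input_path": "data/qa.csv",
    "metric_type": "OPENAIJudgeSingleGrading",
    "name": "accuracy_metrics",
    "model_judge": "gpt-3.5-turbo",
    "model_judge_config": demo_judge_config,
    "prompt_template": demo_template,  # the argument the failed run omitted
    "prompt_config": demo_prompt_config,
}

# Assumed set of required keys, based on the traceback and the Hugging Face run.
required = ("prompt_template", "prompt_config", "model_judge")
print(missing_params(openai_params, required))  # → []
```

When the check returns an empty list, `openai_params` can be passed as `params=` to `llm_judge_fn.run(...)`. The logged parameters also show `api_key` and `base_url` as `None`, so the environment variables need to be set before a rerun.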