# LLM as a Judge

We implemented this function based on the LLM-as-a-judge paper, https://arxiv.org/abs/2306.05685.

In this notebook we will go over the function's docs and outputs and see an end-to-end example of running it.

1. [Single grading metrics](#chapter1)
2. [Pairwise grading metrics](#chapter2)
3. [Pairwise with reference grading metrics](#chapter3)

<a id="chapter1"></a>
## 1. Single grading metrics


Single grading uses a user-defined metric and returns a score for each answer, based on the rubric and examples supplied with that metric.

### Define the metrics

In [5]:
prompt_config = {
        "name": "accuracy",
        "definition": "The accuracy of the provided answer.",
        "rubric": """Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:
            - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental
              misunderstanding of the topic or question.
            - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key
              elements of the question are addressed incorrectly.
            - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the
              question but lacks depth or precision.
            - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally
              accurate response to the question.
            - Score 5: The answer is completely correct and thorough. It demonstrates a deep and accurate understanding of
              the topic, addressing all elements of the question effectively.""",
        "examples": """
            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Score 2: Significantly Inaccurate
            Answer: "The capital of France is Lyon."
            Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.
            Score 3: Partially Correct
            Answer: "I think the capital of France is either Paris or Marseille."
            Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.
            Score 4: Mostly Correct
            Answer: "The capital of France is Paris, the largest city in the country."
            Explanation: This answer is mostly correct and identifies Paris as the capital. The addition of "the largest city in the country" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question's focus.
            Score 5: Completely Correct and Thorough
            Answer: "The capital of France is Paris, which is not only the country's largest city but also its cultural and political center, hosting major institutions like the President's residence, the Elysée Palace."
            Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris's role as the cultural and political center of France, directly addressing the question with depth and precision.
                     """,
    }
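The run logs in this notebook report a step called "Filling the prompt template with the prompt config". The sketch below illustrates what such a step could look like. The template text, `REQUIRED_KEYS`, and the helper `fill_prompt` are hypothetical and much shorter than whatever the function really uses; only the four config keys come from the cell above.

```python
# Hypothetical, simplified template. The real template used by the function is longer.
SINGLE_GRADE_TEMPLATE = """Task: give a numerical score of {name} for the answer.
[Definition]: {definition}
[Rubric]: {rubric}
[Examples]: {examples}
[Question]: {question}
[Answer]: {answer}
"""

# Keys that the prompt config in this notebook defines.
REQUIRED_KEYS = ("name", "definition", "rubric", "examples")


def fill_prompt(config: dict, question: str, answer: str) -> str:
    """Check that the config has every required key, then fill the template."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"prompt config is missing keys: {missing}")
    return SINGLE_GRADE_TEMPLATE.format(question=question, answer=answer, **config)


# Abbreviated stand-in for the full prompt_config defined above.
demo_config = {
    "name": "accuracy",
    "definition": "The accuracy of the provided answer.",
    "rubric": "Score 1: completely incorrect. Score 5: completely correct.",
    "examples": "Question: What is the capital of France?",
}
prompt = fill_prompt(
    demo_config,
    question="What is the capital of China?",
    answer="The capital of China is Kongfu",
)
print(prompt)
```

Validating the keys up front gives a clear error before any model is loaded, which matters when loading the judge takes about a minute.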

### HF model as Judge

In [6]:
import mlrun
project = mlrun.get_or_create_project(
    name="llm-judge",
    context="./",
    user_project=True,
)
llm_judge_fn = project.set_function(name="llm-judge", func="function.yaml")

> 2024-02-21 00:07:20,105 [info] Project loaded successfully: {'project_name': 'llm-judge'}


In [7]:
# Config of the Hugging Face model used as the judge
JUDGE_MODEL = "TheBloke/Mistral-7B-OpenOrca-GPTQ"
JUDGE_CONFIG = {
    "device_map": "auto",
    "revision": "gptq-8bit-128g-actorder_True",
    "trust_remote_code": False,
}
JUDGE_INFER_CONFIG = {
    "max_length": 1500,
}
TOKENIZER_JUDGE_CONFIG = {"use_fast": True}

In [8]:
single_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/qa.csv",
        "metric_type": "LLMJudgeSingleGrading",
        "name": "accuracy_metrics",
        "model_judge": JUDGE_MODEL,
        "model_judge_config": JUDGE_CONFIG,
        "model_judge_infer_config": JUDGE_INFER_CONFIG,
        "prompt_config": prompt_config,
        "tokenizer_judge_config": TOKENIZER_JUDGE_CONFIG,
    },
    # Log the handler's return value as a dataset artifact named "single_result"
    returns=[
        "single_result: dataset",
    ],
    # Run in the current process instead of submitting a remote job
    local=True,
    artifact_path="./",
)

> 2024-02-21 00:07:20,180 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': 'caa85cbd0c884d23a97ff39d13aae9f4', 'db': None}
> 2024-02-21 00:07:28,093 [info] logging run results to: http://mlrun-api:8080
> 2024-02-21 00:07:28,522 [info] Preparing the judge model TheBloke/Mistral-7B-OpenOrca-GPTQ


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
CUDA extension not installed.
CUDA extension not installed.
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


> 2024-02-21 00:08:29,870 [info] Computing the metrics over all data
> 2024-02-21 00:08:29,877 [info] Computing the metrics over one data point with What is the capital of China? and The capital of China is Kongfu
> 2024-02-21 00:08:29,877 [info] Filling the prompt template with the prompt config


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.


> 2024-02-21 00:09:08,888 [info] Extracting the score and explanation from 
Task:
Please act as an impartial judge and evaluate the quality of the response provided by an
AI assistant to the user question displayed below. You will be given the definition of accuracy, grading rubric, context information.
Your task is to determine a numerical score of accuracy for the response. You must use the grading rubric to determine your score. You must also give a explanation about how did you determine the score step-by-step. Please use chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Sco

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


> 2024-02-21 00:10:15,169 [info] Extracting the score and explanation from 
Task:
Please act as an impartial judge and evaluate the quality of the response provided by an
AI assistant to the user question displayed below. You will be given the definition of accuracy, grading rubric, context information.
Your task is to determine a numerical score of accuracy for the response. You must use the grading rubric to determine your score. You must also give a explanation about how did you determine the score step-by-step. Please use chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Sco

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
llm-judge-pengwei,...13aae9f4,0,Feb 21 00:07:20,completed,llm-judge-llm-judge,v3io_user=pengweikind=localowner=pengweihost=jupyter-pengwei-gpu-7777658756-jcs45,,"input_path=data/qa.csvmetric_type=LLMJudgeSingleGradingname=accuracy_metricsmodel_judge=TheBloke/Mistral-7B-OpenOrca-GPTQmodel_judge_config={'device_map': 'auto', 'revision': 'gptq-8bit-128g-actorder_True', 'trust_remote_code': False}model_judge_infer_config={'max_length': 1500}prompt_config={'name': 'accuracy', 'definition': 'The accuracy of the provided answer.', 'rubric': 'Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:\n - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental\n misunderstanding of the topic or question.\n - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key\n elements of the question are addressed incorrectly.\n - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the\n question but lacks depth or precision.\n - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally\n accurate response to the question.\n - Score 5: The answer is completely correct and thorough. 
It demonstrates a deep and accurate understanding of\n the topic, addressing all elements of the question effectively.', 'examples': '\n Question: What is the capital of France?\n Score 1: Completely Incorrect\n Answer: ""The capital of France is Berlin.""\n Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.\n Score 2: Significantly Inaccurate\n Answer: ""The capital of France is Lyon.""\n Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.\n Score 3: Partially Correct\n Answer: ""I think the capital of France is either Paris or Marseille.""\n Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.\n Score 4: Mostly Correct\n Answer: ""The capital of France is Paris, the largest city in the country.""\n Explanation: This answer is mostly correct and identifies Paris as the capital. The addition of ""the largest city in the country"" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question\'s focus.\n Score 5: Completely Correct and Thorough\n Answer: ""The capital of France is Paris, which is not only the country\'s largest city but also its cultural and political center, hosting major institutions like the President\'s residence, the Elysée Palace.""\n Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris\'s role as the cultural and political center of France, directly addressing the question with depth and precision.\n '}tokenizer_judge_config={'use_fast': True}",,single_result





> 2024-02-21 00:10:15,418 [info] Run execution finished: {'status': 'completed', 'name': 'llm-judge-llm-judge'}
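The logs above contain Transformers warnings about a missing attention mask, an unset `pad_token_id`, and `input_ids` sitting on the CPU while the model is on CUDA. The function's own generation code is not shown in this notebook, so the following is only a sketch of how those warnings are commonly avoided when you call `generate` yourself. The helper name `generate_text` is hypothetical, and using the EOS token as the padding token is an assumption that mirrors the fallback reported in the warning.

```python
def generate_text(model, tokenizer, prompt: str, max_length: int = 1500) -> str:
    """Generate text with an explicit attention mask and inputs on the model's device.

    `model` and `tokenizer` are expected to be a Hugging Face causal LM and its
    tokenizer, already loaded by the caller.
    """
    # Tokenizing with return_tensors="pt" yields both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Move every tensor to the device that holds the model (for example cuda:0).
    inputs = inputs.to(model.device)
    output_ids = model.generate(
        **inputs,  # forwards input_ids and attention_mask together
        max_length=max_length,
        # Assumption: reuse the EOS token for padding, as the warning's fallback does.
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode the first (and only) generated sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Passing `max_length` to `generate` rather than storing it in the model's load config is also the approach the deprecation warning in the Transformers logs points to.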


### OpenAI model as Judge

In [9]:
OPENAI_MODEL = "gpt-3.5-turbo"

In [10]:
openai_single_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/qa.csv",
        "metric_type": "OPENAIJudgeSingleGrading",
        "name": "accuracy_metrics",
        "model_judge": OPENAI_MODEL,
        "prompt_config": prompt_config,
    },
    returns=[
        "openai_single_result: dataset",
    ],
    local=True,
    artifact_path="./",
)

> 2024-02-21 00:10:15,440 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': '63442e4c9912481db5c898393089c973', 'db': None}
> 2024-02-21 00:10:16,446 [info] Prepare the openAI model as judge
> 2024-02-21 00:10:16,469 [info] Computing the metrics over all data
> 2024-02-21 00:10:16,470 [info] Compute the metrics over one data point using openAI's model
> 2024-02-21 00:10:16,470 [info] Filling the prompt template with the prompt config
> 2024-02-21 00:10:18,021 [info] Extracting the score and explanation from - score: 1
- explanation: The response provided is completely incorrect as the capital of China is not Kongfu. The answer demonstrates a fundamental misunderstanding of the topic, therefore warranting a score of 1 according to the grading rubric.
> 2024-02-21 00:10:18,025 [info] Compute the metrics over one data point using openAI's model
> 2024-02-21 00:10:18,025 [info] Filling the prompt template with the prompt config
> 2024-02-21 00:10:19,413 [info] Extracting the 

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
llm-judge-pengwei,...3089c973,0,Feb 21 00:10:15,completed,llm-judge-llm-judge,v3io_user=pengweikind=localowner=pengweihost=jupyter-pengwei-gpu-7777658756-jcs45,,"input_path=data/qa.csvmetric_type=OPENAIJudgeSingleGradingname=accuracy_metricsmodel_judge=gpt-3.5-turboprompt_config={'name': 'accuracy', 'definition': 'The accuracy of the provided answer.', 'rubric': 'Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:\n - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental\n misunderstanding of the topic or question.\n - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key\n elements of the question are addressed incorrectly.\n - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the\n question but lacks depth or precision.\n - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally\n accurate response to the question.\n - Score 5: The answer is completely correct and thorough. 
It demonstrates a deep and accurate understanding of\n the topic, addressing all elements of the question effectively.', 'examples': '\n Question: What is the capital of France?\n Score 1: Completely Incorrect\n Answer: ""The capital of France is Berlin.""\n Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.\n Score 2: Significantly Inaccurate\n Answer: ""The capital of France is Lyon.""\n Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.\n Score 3: Partially Correct\n Answer: ""I think the capital of France is either Paris or Marseille.""\n Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.\n Score 4: Mostly Correct\n Answer: ""The capital of France is Paris, the largest city in the country.""\n Explanation: This answer is mostly correct and identifies Paris as the capital. The addition of ""the largest city in the country"" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question\'s focus.\n Score 5: Completely Correct and Thorough\n Answer: ""The capital of France is Paris, which is not only the country\'s largest city but also its cultural and political center, hosting major institutions like the President\'s residence, the Elysée Palace.""\n Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris\'s role as the cultural and political center of France, directly addressing the question with depth and precision.\n '}",,openai_single_result





> 2024-02-21 00:10:19,584 [info] Run execution finished: {'status': 'completed', 'name': 'llm-judge-llm-judge'}
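In the log above, the OpenAI judge replies with a `- score: N` line followed by a `- explanation: ...` line, and the function then extracts both. A minimal sketch of such an extraction step is shown here; the regular expressions and the helper `extract_score_explanation` are illustrative assumptions, not the function's actual parser.

```python
import re
from typing import Optional, Tuple

# Match "score: <digits>" and everything after "explanation:", ignoring case.
SCORE_RE = re.compile(r"score\s*:\s*(\d+)", re.IGNORECASE)
EXPLANATION_RE = re.compile(r"explanation\s*:\s*(.+)", re.IGNORECASE | re.DOTALL)


def extract_score_explanation(response: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (score, explanation), using None for any part that is not found."""
    score_match = SCORE_RE.search(response)
    explanation_match = EXPLANATION_RE.search(response)
    score = int(score_match.group(1)) if score_match else None
    explanation = explanation_match.group(1).strip() if explanation_match else None
    return score, explanation


# Reply shape taken from the log above (explanation shortened to one sentence).
response = (
    "- score: 1\n"
    "- explanation: The response provided is completely incorrect as the capital "
    "of China is not Kongfu."
)
score, explanation = extract_score_explanation(response)
print(score)
print(explanation)
```

Returning `None` instead of raising lets a batch run continue when a judge reply does not follow the expected format, which is worth considering for smaller judge models whose output is less predictable.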


<a id="chapter2"></a>
## 2. Pairwise grading metrics


Pairwise grading uses a smaller model as the benchmark model. The judge is asked to give two scores, one for the customized model's answer and one for the benchmark model's answer, so you can see how well the customized model performs compared with the benchmark.
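Once the judge has produced the two scores, one simple way to read them is as a win, tie, or loss for the customized model. The sketch below is an assumption about how you might summarize the results yourself; `compare_scores` is a hypothetical helper and not part of the function. The two score pairs are the ones the OpenAI judge reports in the pairwise logs of this notebook.

```python
from collections import Counter


def compare_scores(model_score: int, benchmark_score: int) -> str:
    """Classify one pairwise result from the customized model's point of view."""
    if model_score > benchmark_score:
        return "win"
    if model_score < benchmark_score:
        return "loss"
    return "tie"


# (customized model score, benchmark model score) for the two questions.
pairs = [(1, 4), (5, 2)]
outcomes = Counter(compare_scores(model, bench) for model, bench in pairs)
print(dict(outcomes))
```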

### HF model as Judge

In [11]:
BENCHMARK_MODEL = "microsoft/phi-2"
BENCHMARK_CONFIG = {
    "max_length": 1500,
    "device_map": "auto",
    "revision": "main",
    "trust_remote_code": True,
    "torch_dtype": "auto",
}
TOKENIZER_BENCHMARK_CONFIG = {"trust_remote_code": True}
BENCHMARK_INFER_CONFIG = {"max_length": 1500}

In [12]:
pairwise_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/qa.csv",
        "metric_type": "LLMJudgePairwiseGrading",
        "name": "accuracy_metrics",
        "model_judge": JUDGE_MODEL,
        "model_judge_config": JUDGE_CONFIG,
        "model_judge_infer_config": JUDGE_INFER_CONFIG,
        "model_bench_mark": BENCHMARK_MODEL,
        "model_bench_mark_config": BENCHMARK_CONFIG,
        "model_bench_mark_infer_config": BENCHMARK_INFER_CONFIG,
        "tokenizer_bench_mark_config": TOKENIZER_BENCHMARK_CONFIG,
        "prompt_config": prompt_config,
        "tokenizer_judge_config": TOKENIZER_JUDGE_CONFIG,
    },
    returns=[
        "pairwise_result: dataset",
    ],
    local=True,
    artifact_path="./",
)

> 2024-02-21 00:10:19,603 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': '8e8d2c31987542878218f68a40068efc', 'db': None}
> 2024-02-21 00:10:19,824 [info] Preparing the judge model TheBloke/Mistral-7B-OpenOrca-GPTQ


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


> 2024-02-21 00:11:08,867 [info] Preparing the bench mark model microsoft/phi-2


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

> 2024-02-21 00:11:51,373 [info] Computing the metrics over What is the capital of China? and The capital of China is Kongfu
> 2024-02-21 00:11:51,374 [info] Computing the bench mark response for What is the capital of China?


You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.


> 2024-02-21 00:11:52,703 [info] Response of the bench mark model is What is the capital of China?
A) Beijing
B) Shanghai
C) Hong Kong
D) Tokyo

Answer: A) Beijing

> 2024-02-21 00:11:52,704 [info] Filling the prompt template with the prompt config


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


> 2024-02-21 00:13:32,039 [info] Response of the judge model is 
Task:
Your task is to determine two numerical score of accuracy for the responses from two AI assistants. You must use the grading rubric to determine your scores. You must also give a explanation about how did you determine the scores step-by-step. Please using chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Score 2: Significantly Inaccurate
            Answer: "The capital of France is Lyon."
            Explanation: This answer demonstrates some understanding that the question is about a city in France, but it

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> 2024-02-21 00:13:33,056 [info] Response of the bench mark model is What is the capital of France?
A) London
B) Paris
C) Rome
D) Berlin
Answer: B) Paris

> 2024-02-21 00:13:33,057 [info] Filling the prompt template with the prompt config


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


> 2024-02-21 00:15:51,290 [info] Response of the judge model is 
Task:
Your task is to determine two numerical score of accuracy for the responses from two AI assistants. You must use the grading rubric to determine your scores. You must also give a explanation about how did you determine the scores step-by-step. Please using chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Score 2: Significantly Inaccurate
            Answer: "The capital of France is Lyon."
            Explanation: This answer demonstrates some understanding that the question is about a city in France, but it

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
llm-judge-pengwei,...40068efc,0,Feb 21 00:10:19,completed,llm-judge-llm-judge,v3io_user=pengweikind=localowner=pengweihost=jupyter-pengwei-gpu-7777658756-jcs45,,"input_path=data/qa.csvmetric_type=LLMJudgePairwiseGradingname=accuracy_metricsmodel_judge=TheBloke/Mistral-7B-OpenOrca-GPTQmodel_judge_config={'device_map': 'auto', 'revision': 'gptq-8bit-128g-actorder_True', 'trust_remote_code': False}model_judge_infer_config={'max_length': 1500}model_bench_mark=microsoft/phi-2model_bench_mark_config={'max_length': 1500, 'device_map': 'auto', 'revision': 'main', 'trust_remote_code': True, 'torch_dtype': 'auto'}model_bench_mark_infer_config={'max_length': 1500}tokenizer_bench_mark_config={'trust_remote_code': True}prompt_config={'name': 'accuracy', 'definition': 'The accuracy of the provided answer.', 'rubric': 'Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:\n - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental\n misunderstanding of the topic or question.\n - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key\n elements of the question are addressed incorrectly.\n - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the\n question but lacks depth or precision.\n - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally\n accurate response to the question.\n - Score 5: The answer is completely correct and thorough. 
It demonstrates a deep and accurate understanding of\n the topic, addressing all elements of the question effectively.', 'examples': '\n Question: What is the capital of France?\n Score 1: Completely Incorrect\n Answer: ""The capital of France is Berlin.""\n Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.\n Score 2: Significantly Inaccurate\n Answer: ""The capital of France is Lyon.""\n Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.\n Score 3: Partially Correct\n Answer: ""I think the capital of France is either Paris or Marseille.""\n Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.\n Score 4: Mostly Correct\n Answer: ""The capital of France is Paris, the largest city in the country.""\n Explanation: This answer is mostly correct and identifies Paris as the capital. The addition of ""the largest city in the country"" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question\'s focus.\n Score 5: Completely Correct and Thorough\n Answer: ""The capital of France is Paris, which is not only the country\'s largest city but also its cultural and political center, hosting major institutions like the President\'s residence, the Elysée Palace.""\n Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris\'s role as the cultural and political center of France, directly addressing the question with depth and precision.\n '}tokenizer_judge_config={'use_fast': True}",,pairwise_result





> 2024-02-21 00:15:51,502 [info] Run execution finished: {'status': 'completed', 'name': 'llm-judge-llm-judge'}


### OpenAI model as Judge

In [13]:
openai_pairwise_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/qa.csv",
        "metric_type": "OPENAIJudgePairwiseGrading",
        "name": "accuracy_metrics",
        "model_judge": OPENAI_MODEL,
        "prompt_config": prompt_config,
        "model_bench_mark": BENCHMARK_MODEL,
        "model_bench_mark_config": BENCHMARK_CONFIG,
        "model_bench_mark_infer_config": BENCHMARK_INFER_CONFIG,
        "tokenizer_bench_mark_config": TOKENIZER_BENCHMARK_CONFIG,
    },
    returns=[
        "openai_pairwise_result: dataset",
    ],
    local=True,
    artifact_path="./",
)

> 2024-02-21 00:15:51,514 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': '294038f7ddfb4a0b93b913ef44a82bc1', 'db': None}
> 2024-02-21 00:15:51,757 [info] Prepare the openAI model as judge
> 2024-02-21 00:15:51,782 [info] Preparing the bench mark model microsoft/phi-2


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

> 2024-02-21 00:15:54,867 [info] Computing the metrics over What is the capital of China? and The capital of China is Kongfu
> 2024-02-21 00:15:54,868 [info] Computing the bench mark response for What is the capital of China?


You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.


> 2024-02-21 00:15:55,972 [info] Response of the bench mark model is What is the capital of China?
A) Beijing
B) Shanghai
C) Hong Kong
D) Tokyo

Answer: A) Beijing

> 2024-02-21 00:15:55,972 [info] Filling the prompt template with the prompt config
> 2024-02-21 00:15:58,090 [info] Extract the score and the explanation from the - Score of Assistant A: 1
- Explanation of Assistant A: The response "The capital of China is Kongfu" is completely incorrect and irrelevant to the question. It demonstrates a fundamental misunderstanding of the topic, as Kongfu is not the capital of China, leading to a score of 1.

- Score of Assistant B: 4
- Explanation of Assistant B: The response "A) Beijing" is mostly correct with only minor inaccuracies. It correctly identifies Beijing as the capital of China, which is the accurate answer. While other options are provided, the correct answer is included, leading to a score of 4.
> 2024-02-21 00:15:58,092 [info] Computing the metrics over What is the capital

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> 2024-02-21 00:15:59,111 [info] Response of the bench mark model is What is the capital of France?
A) London
B) Paris
C) Rome
D) Berlin
Answer: B) Paris

> 2024-02-21 00:15:59,111 [info] Filling the prompt template with the prompt config
> 2024-02-21 00:16:02,294 [info] Extract the score and the explanation from the - Score of assistant A: 5
- Explanation of assistant A: The response from assistant A, "The capital of France is Paris," is completely correct and directly answers the question without any inaccuracies or omissions. This demonstrates a deep and accurate understanding of the topic and earns a score of 5 according to the grading rubric.

- Score of assistant B: 2
- Explanation of assistant B: The response from assistant B, "B) Paris," is somewhat accurate as it correctly identifies Paris as the capital of France. However, the answer is presented in a multiple-choice format which may imply uncertainty or lack of depth in understanding. Additionally, the other options mentione

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
llm-judge-pengwei,...44a82bc1,0,Feb 21 00:15:51,completed,llm-judge-llm-judge,v3io_user=pengweikind=localowner=pengweihost=jupyter-pengwei-gpu-7777658756-jcs45,,"input_path=data/qa.csvmetric_type=OPENAIJudgePairwiseGradingname=accuracy_metricsmodel_judge=gpt-3.5-turboprompt_config={'name': 'accuracy', 'definition': 'The accuracy of the provided answer.', 'rubric': 'Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:\n - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental\n misunderstanding of the topic or question.\n - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key\n elements of the question are addressed incorrectly.\n - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the\n question but lacks depth or precision.\n - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally\n accurate response to the question.\n - Score 5: The answer is completely correct and thorough. 
It demonstrates a deep and accurate understanding of\n the topic, addressing all elements of the question effectively.', 'examples': '\n Question: What is the capital of France?\n Score 1: Completely Incorrect\n Answer: ""The capital of France is Berlin.""\n Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.\n Score 2: Significantly Inaccurate\n Answer: ""The capital of France is Lyon.""\n Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.\n Score 3: Partially Correct\n Answer: ""I think the capital of France is either Paris or Marseille.""\n Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.\n Score 4: Mostly Correct\n Answer: ""The capital of France is Paris, the largest city in the country.""\n Explanation: This answer is mostly correct and identifies Paris as the capital. 
The addition of ""the largest city in the country"" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question\'s focus.\n Score 5: Completely Correct and Thorough\n Answer: ""The capital of France is Paris, which is not only the country\'s largest city but also its cultural and political center, hosting major institutions like the President\'s residence, the Elysée Palace.""\n Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris\'s role as the cultural and political center of France, directly addressing the question with depth and precision.\n '}model_bench_mark=microsoft/phi-2model_bench_mark_config={'max_length': 1500, 'device_map': 'auto', 'revision': 'main', 'trust_remote_code': True, 'torch_dtype': 'auto'}model_bench_mark_infer_config={'max_length': 1500}tokenizer_bench_mark_config={'trust_remote_code': True}",,openai_pairwise_result





> 2024-02-21 00:16:02,488 [info] Run execution finished: {'status': 'completed', 'name': 'llm-judge-llm-judge'}


<a id="chapter3"></a>
## 3. Pairwise grading metrics with reference


This type of metric uses a benchmark model together with the ground-truth answer to each question. The judge scores both the candidate answer and the benchmark model's answer against that ground truth.
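To make the input concrete, here is a minimal sketch that builds a small reference-grading file with Python's standard `csv` module. The column names `question`, `answer`, and `reference` are assumptions for illustration only, and the reference values are illustrative too; check the function's documentation for the layout it actually expects. The questions and candidate answers mirror the ones that appear in the run logs of this section.

```python
import csv
import io

# Hypothetical column names; the real function may expect different ones.
FIELDNAMES = ["question", "answer", "reference"]

# Candidate answers are deliberately wrong so the judge has something to penalize.
ROWS = [
    {
        "question": "What is the capital of China?",
        "answer": "The capital of China is Kongfu",
        "reference": "Beijing",
    },
    {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Seattle",
        "reference": "Paris",
    },
]


def build_reference_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_reference_csv(ROWS)
print(csv_text)
```

Writing `csv_text` to a file such as `data/ref.csv` would give an input of the same general shape as the one passed as `input_path` in the cells of this section.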

### HF model as Judge

In [14]:
ref_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/ref.csv",
        "metric_type": "LLMJudgeReferenceGrading",
        "name": "accuracy_metrics",
        "model_judge": JUDGE_MODEL,
        "model_judge_config": JUDGE_CONFIG,
        "model_judge_infer_config": JUDGE_INFER_CONFIG,
        "model_bench_mark": BENCHMARK_MODEL,
        "model_bench_mark_config": BENCHMARK_CONFIG,
        "model_bench_mark_infer_config": BENCHMARK_INFER_CONFIG,
        "tokenizer_bench_mark_config": TOKENIZER_BENCHMARK_CONFIG,
        "prompt_config": prompt_config,
        "tokenizer_judge_config": TOKENIZER_JUDGE_CONFIG,
    },
    returns=[
        "reference_result: dataset",
    ],
    local=True,
    artifact_path="./"
)

> 2024-02-21 00:16:02,501 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': 'c4aee4056c6d45e5ab2dfc7362ffdb31', 'db': None}
> 2024-02-21 00:16:02,717 [info] Preparing the judge model TheBloke/Mistral-7B-OpenOrca-GPTQ


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


> 2024-02-21 00:17:03,088 [info] Preparing the bench mark model microsoft/phi-2


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

> 2024-02-21 00:17:45,346 [info] Computing the metrics over What is the capital of China? and The capital of China is Kongfu
> 2024-02-21 00:17:45,347 [info] Computing the bench mark response for What is the capital of China?


You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.


> 2024-02-21 00:17:46,529 [info] Response of the bench mark model is What is the capital of China?
A) Beijing
B) Shanghai
C) Hong Kong
D) Tokyo

Answer: A) Beijing

> 2024-02-21 00:17:46,530 [info] Filling the prompt template with the prompt config


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


> 2024-02-21 00:19:26,011 [info] Response of the judge model is 
Task:
Your task is to determine two numerical score of accuracy for the responses from two AI assistants with the ground truth of the response. You must use the grading rubric to determine your scores. You must use the ground truth of the response. You need to give a explanation about how did you compare with the ground truth of the response to determine the scores step-by-step. Please using chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Score 2: Significantly Inaccurate
            Answer: "The capital of Franc

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> 2024-02-21 00:19:27,032 [info] Response of the bench mark model is What is the capital of France?
A) London
B) Paris
C) Rome
D) Berlin
Answer: B) Paris

> 2024-02-21 00:19:27,032 [info] Filling the prompt template with the prompt config


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


> 2024-02-21 00:21:06,516 [info] Response of the judge model is 
Task:
Your task is to determine two numerical score of accuracy for the responses from two AI assistants with the ground truth of the response. You must use the grading rubric to determine your scores. You must use the ground truth of the response. You need to give a explanation about how did you compare with the ground truth of the response to determine the scores step-by-step. Please using chain of thinking.
Examples could be included beblow for your reference. Make sure you understand the grading rubric and use the examples before completing the task.
[Examples]:

            Question: What is the capital of France?
            Score 1: Completely Incorrect
            Answer: "The capital of France is Berlin."
            Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.
            Score 2: Significantly Inaccurate
            Answer: "The capital of Franc

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
llm-judge-pengwei,...62ffdb31,0,Feb 21 00:16:02,completed,llm-judge-llm-judge,v3io_user=pengweikind=localowner=pengweihost=jupyter-pengwei-gpu-7777658756-jcs45,,"input_path=data/ref.csvmetric_type=LLMJudgeReferenceGradingname=accuracy_metricsmodel_judge=TheBloke/Mistral-7B-OpenOrca-GPTQmodel_judge_config={'device_map': 'auto', 'revision': 'gptq-8bit-128g-actorder_True', 'trust_remote_code': False}model_judge_infer_config={'max_length': 1500}model_bench_mark=microsoft/phi-2model_bench_mark_config={'max_length': 1500, 'device_map': 'auto', 'revision': 'main', 'trust_remote_code': True, 'torch_dtype': 'auto'}model_bench_mark_infer_config={'max_length': 1500}tokenizer_bench_mark_config={'trust_remote_code': True}prompt_config={'name': 'accuracy', 'definition': 'The accuracy of the provided answer.', 'rubric': 'Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:\n - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental\n misunderstanding of the topic or question.\n - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key\n elements of the question are addressed incorrectly.\n - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the\n question but lacks depth or precision.\n - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally\n accurate response to the question.\n - Score 5: The answer is completely correct and thorough. 
It demonstrates a deep and accurate understanding of\n the topic, addressing all elements of the question effectively.', 'examples': '\n Question: What is the capital of France?\n Score 1: Completely Incorrect\n Answer: ""The capital of France is Berlin.""\n Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.\n Score 2: Significantly Inaccurate\n Answer: ""The capital of France is Lyon.""\n Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.\n Score 3: Partially Correct\n Answer: ""I think the capital of France is either Paris or Marseille.""\n Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.\n Score 4: Mostly Correct\n Answer: ""The capital of France is Paris, the largest city in the country.""\n Explanation: This answer is mostly correct and identifies Paris as the capital. The addition of ""the largest city in the country"" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question\'s focus.\n Score 5: Completely Correct and Thorough\n Answer: ""The capital of France is Paris, which is not only the country\'s largest city but also its cultural and political center, hosting major institutions like the President\'s residence, the Elysée Palace.""\n Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris\'s role as the cultural and political center of France, directly addressing the question with depth and precision.\n '}tokenizer_judge_config={'use_fast': True}",,reference_result





> 2024-02-21 00:21:06,753 [info] Run execution finished: {'status': 'completed', 'name': 'llm-judge-llm-judge'}
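In the log above, the judge response begins with the task prompt itself. This is common with decoder-only Hugging Face models, whose decoded `generate` output contains the prompt followed by the continuation. The sketch below shows one way to keep only the continuation before parsing scores. The helper `strip_prompt_echo` is hypothetical and is not part of the function shown in this notebook, which may already handle this internally.

```python
def strip_prompt_echo(prompt: str, generated: str) -> str:
    """Return only the text produced after the prompt, if the prompt is echoed."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    # Fall back to the full text when the echo does not match exactly,
    # for example because decoding changed the whitespace.
    return generated.strip()


# Shortened stand-ins for the real prompt and model output.
prompt = "Task:\nYour task is to determine two numerical scores.\n"
generated = prompt + "- score of assistant a: 1\n- score of assistant b: 5\n"

continuation = strip_prompt_echo(prompt, generated)
print(continuation)
```

The exact-prefix check is deliberately conservative: if the echoed prompt differs from the original in any character, the full text is returned unchanged rather than cut at a wrong position.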


### OpenAI model as Judge

In [15]:
openai_ref_grading_run = llm_judge_fn.run(
    handler="llm_judge",
    params={
        "input_path": "data/ref.csv",
        "metric_type": "OPENAIJudgeReferenceGrading",
        "name": "accuracy_metrics",
        "model_judge": OPENAI_MODEL,
        "prompt_config": prompt_config,
        "model_bench_mark": BENCHMARK_MODEL,
        "model_bench_mark_config": BENCHMARK_CONFIG,
        "model_bench_mark_infer_config": BENCHMARK_INFER_CONFIG,
        "tokenizer_bench_mark_config": TOKENIZER_BENCHMARK_CONFIG,
    },
    returns=[
        "openai_reference_result: dataset",
    ],
    local=True,
    artifact_path="./"
)

> 2024-02-21 00:21:06,765 [info] Storing function: {'name': 'llm-judge-llm-judge', 'uid': '1f506415dfa3473484e270faeacdc498', 'db': None}
> 2024-02-21 00:21:06,977 [info] Prepare the openAI model as judge
> 2024-02-21 00:21:07,003 [info] Preparing the bench mark model microsoft/phi-2


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

> 2024-02-21 00:21:10,182 [info] Computing the metrics over What is the capital of China? and The capital of China is Kongfu
> 2024-02-21 00:21:10,182 [info] Computing the bench mark response for What is the capital of China?


You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.


> 2024-02-21 00:21:11,299 [info] Response of the bench mark model is What is the capital of China?
A) Beijing
B) Shanghai
C) Hong Kong
D) Tokyo

Answer: A) Beijing

> 2024-02-21 00:21:11,300 [info] Filling the prompt template with the prompt config
> 2024-02-21 00:21:13,910 [info] Extract the score and the explanation from the - score of assistant a: 1
- explanation of assistant a: The response from assistant A is completely incorrect as the capital of China is Beijing, not Kongfu. This answer demonstrates a fundamental misunderstanding of the question.
- score of assistant b: 5
- explanation of assistant b: The response from assistant B is completely correct as it identifies Beijing as the capital of China, which aligns perfectly with the ground truth. This answer shows a deep and accurate understanding of the topic.
> 2024-02-21 00:21:13,912 [info] Computing the metrics over What is the capital of France? and The capital of France is Seattle
> 2024-02-21 00:21:13,912 [info] Computing

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> 2024-02-21 00:21:14,948 [info] Response of the bench mark model is What is the capital of France?
A) London
B) Paris
C) Rome
D) Berlin
Answer: B) Paris

> 2024-02-21 00:21:14,948 [info] Filling the prompt template with the prompt config
> 2024-02-21 00:21:17,446 [info] Extract the score and the explanation from the - score of assistant a: 1
- explanation of assistant a: The response from assistant A incorrectly states that the capital of France is Seattle, which is completely irrelevant and incorrect. This answer demonstrates a fundamental misunderstanding of the topic, resulting in a score of 1.

- score of assistant b: 4
- explanation of assistant b: The response from assistant B correctly identifies Paris as the capital of France, which aligns with the ground truth. The multiple-choice options provided help confirm that option B (Paris) is the correct answer. While there are other cities listed, the response correctly selects Paris, resulting in a score of 4 based on the grading r

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
llm-judge-pengwei,...eacdc498,0,Feb 21 00:21:06,completed,llm-judge-llm-judge,v3io_user=pengweikind=localowner=pengweihost=jupyter-pengwei-gpu-7777658756-jcs45,,"input_path=data/ref.csvmetric_type=OPENAIJudgeReferenceGradingname=accuracy_metricsmodel_judge=gpt-3.5-turboprompt_config={'name': 'accuracy', 'definition': 'The accuracy of the provided answer.', 'rubric': 'Accuracy: This rubric assesses the accuracy of the provided answer. The details for different scores are as follows:\n - Score 1: The answer is completely incorrect or irrelevant to the question. It demonstrates a fundamental\n misunderstanding of the topic or question.\n - Score 2: The answer contains significant inaccuracies, though it shows some understanding of the topic. Key\n elements of the question are addressed incorrectly.\n - Score 3: The answer is partially correct but has noticeable inaccuracies or omissions. It addresses the\n question but lacks depth or precision.\n - Score 4: The answer is mostly correct, with only minor inaccuracies or omissions. It provides a generally\n accurate response to the question.\n - Score 5: The answer is completely correct and thorough. 
It demonstrates a deep and accurate understanding of\n the topic, addressing all elements of the question effectively.', 'examples': '\n Question: What is the capital of France?\n Score 1: Completely Incorrect\n Answer: ""The capital of France is Berlin.""\n Explanation: This answer is entirely incorrect and irrelevant, as Berlin is the capital of Germany, not France.\n Score 2: Significantly Inaccurate\n Answer: ""The capital of France is Lyon.""\n Explanation: This answer demonstrates some understanding that the question is about a city in France, but it incorrectly identifies Lyon as the capital instead of Paris.\n Score 3: Partially Correct\n Answer: ""I think the capital of France is either Paris or Marseille.""\n Explanation: This answer shows partial knowledge but includes a significant inaccuracy by suggesting Marseille might be the capital. Paris is correct, but the inclusion of Marseille indicates a lack of certainty or complete understanding.\n Score 4: Mostly Correct\n Answer: ""The capital of France is Paris, the largest city in the country.""\n Explanation: This answer is mostly correct and identifies Paris as the capital. 
The addition of ""the largest city in the country"" is accurate but not directly relevant to the capital status, introducing a slight deviation from the question\'s focus.\n Score 5: Completely Correct and Thorough\n Answer: ""The capital of France is Paris, which is not only the country\'s largest city but also its cultural and political center, hosting major institutions like the President\'s residence, the Elysée Palace.""\n Explanation: This answer is completely correct, providing a thorough explanation that adds relevant context about Paris\'s role as the cultural and political center of France, directly addressing the question with depth and precision.\n '}model_bench_mark=microsoft/phi-2model_bench_mark_config={'max_length': 1500, 'device_map': 'auto', 'revision': 'main', 'trust_remote_code': True, 'torch_dtype': 'auto'}model_bench_mark_infer_config={'max_length': 1500}tokenizer_bench_mark_config={'trust_remote_code': True}",,openai_reference_result





> 2024-02-21 00:21:17,650 [info] Run execution finished: {'status': 'completed', 'name': 'llm-judge-llm-judge'}
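The OpenAI judge replies in a line-oriented format (`- score of assistant a: ...`, `- explanation of assistant a: ...`), as the logs above show. The following is a minimal sketch of how such a reply could be turned into structured data with regular expressions. It is an illustration under that assumed format, not the function's actual extraction logic, and `parse_judge_response` is a hypothetical helper name.

```python
import re

# Matches lines such as "- score of assistant a: 1" (case-insensitive).
SCORE_RE = re.compile(r"-\s*score of assistant ([ab]):\s*(\d+)", re.IGNORECASE)

# An explanation runs until the next "- score"/"- explanation" marker or the end of the text.
EXPLANATION_RE = re.compile(
    r"-\s*explanation of assistant ([ab]):\s*(.*?)"
    r"(?=\n\s*-\s*(?:score|explanation) of assistant|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def parse_judge_response(text):
    """Return {'a': {'score': int, 'explanation': str}, 'b': {...}} for the fields found."""
    result = {}
    for name, score in SCORE_RE.findall(text):
        result.setdefault(name.lower(), {})["score"] = int(score)
    for name, explanation in EXPLANATION_RE.findall(text):
        result.setdefault(name.lower(), {})["explanation"] = explanation.strip()
    return result


# Shortened version of a judge reply from the logs above.
sample = """- score of assistant a: 1
- explanation of assistant a: The response from assistant A is completely incorrect as the capital of China is Beijing, not Kongfu.
- score of assistant b: 5
- explanation of assistant b: The response from assistant B is completely correct as it identifies Beijing as the capital of China.
"""

parsed = parse_judge_response(sample)
print(parsed["a"]["score"], parsed["b"]["score"])
```

Matching case-insensitively lets the same patterns handle the capitalized variant (`- Score of assistant A: 5`) seen in the pairwise run. Fields the judge omits are simply absent from the result, so callers should check for missing keys.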
