# LLM-as-a-Judge

While human evaluation is the gold standard for assessing preferences, it's slow and costly. 

Using LLMs as judges can reduce the need for human intervention and allow for faster iterations.

### Advantages of LLM-as-a-Judge

- Can evaluate answer quality based on user-defined criteria
- Adaptable to various LLM use cases

### Common Evaluation Criteria

- Conciseness: Is the answer brief and to the point?
- Relevance: Does the answer relate to the question?
- Correctness: Is the answer accurate?
- Coherence: Is the answer consistent?
- Harmfulness: Does the answer contain harmful content?
- Maliciousness: Is the answer malicious or detrimental?
- Helpfulness: Is the answer useful?
- Controversiality: Is the answer likely to spark debate?
- Misogyny: Does the answer demean women?
- Criminality: Does the answer promote illegal activities?
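
The criteria above can be kept in a small lookup table, so that a misspelled criterion name fails fast instead of silently reaching the judge. The sketch below is illustrative only: the `CRITERIA` dictionary and the `criterion_question` helper are hypothetical names, not part of any library used in this notebook.

```python
# Hypothetical lookup table mirroring the criteria listed above.
CRITERIA = {
    "conciseness": "Is the answer brief and to the point?",
    "relevance": "Does the answer relate to the question?",
    "correctness": "Is the answer accurate?",
    "coherence": "Is the answer consistent?",
    "harmfulness": "Does the answer contain harmful content?",
    "maliciousness": "Is the answer malicious or detrimental?",
    "helpfulness": "Is the answer useful?",
    "controversiality": "Is the answer likely to spark debate?",
    "misogyny": "Does the answer demean women?",
    "criminality": "Does the answer promote illegal activities?",
}


def criterion_question(name):
    """Return the guiding question for a criterion (case-insensitive)."""
    key = name.strip().lower()
    if key not in CRITERIA:
        raise ValueError(
            f"Unknown criterion {name!r}; expected one of {sorted(CRITERIA)}"
        )
    return CRITERIA[key]


print(criterion_question("Relevance"))
```

Validating names up front is cheap compared with spending a judge call on a criterion that does not exist.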

### Single Answer Grading

This approach involves assigning scores to individual answers.

In [None]:
from libs.custom_llm_as_a_judge import Custom_LLM_Judge

llm_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
region = "us-west-2"
evaluator = Custom_LLM_Judge(llm_id, region)

In [None]:
from time import sleep

# question & response
question = "Why is the sky blue?"
response = "The sky is blue because of Rayleigh scattering."

# evaluation criteria
criteria_list = ['conciseness', 'relevance', 'coherence', 'helpfulness', 'controversiality', 'misogyny', 'criminality']

for criterion in criteria_list:
    basic_result = evaluator.evaluate("basic", question, response, criterion)
    print(f"{criterion}: {basic_result}")
    sleep(3) # preventing API throttling
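
A fixed `sleep(3)` slows every call, even when the API is not throttling. One common alternative is to retry only on failure, with exponential backoff. The sketch below is a minimal version under stated assumptions: `call_with_backoff` is a hypothetical helper, and it retries on any `Exception` by default because this notebook does not show which error the judge raises when throttled.

```python
import random
import time


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Delay doubles on each attempt: 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel callers do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
```

In the loop above, the call would become `call_with_backoff(evaluator.evaluate, "basic", question, response, criterion)`. In practice, narrow `retry_on` to the throttling exception your client actually raises, so that genuine bugs are not retried.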

### Reference-Guided Grading

This method provides a reference solution to guide the LLM's evaluation.

For assessing correctness, providing a pre-defined ground truth (label) or context can be effective.

_In the example below, we provide 'ground_truth' as a reference for evaluation._

In [None]:
# question & response & ground_truth
question = "Why is the sky blue?"
response = "The sky is blue because of Rayleigh scattering."
ground_truth = "The sky appears blue due to the scattering of sunlight by air molecules, a phenomenon known as Rayleigh scattering."

# evaluation criteria
criteria_list = ['correctness']

for criterion in criteria_list:
    # "labeled" mode compares the response against the ground-truth reference
    labeled_result = evaluator.evaluate("labeled", question, response, criterion, ground_truth=ground_truth)
    print(f"{criterion}: {labeled_result}")

_In the example below, we provide 'context' as a reference for evaluation._

In [None]:
# question & response & context
question = "Why is the sky blue?"
response = "The sky is blue because of Rayleigh scattering."
context = "The color of the sky is determined by the way sunlight interacts with the Earth's atmosphere. This interaction is influenced by various factors including the composition of the atmosphere and the wavelengths of light."

# evaluation criteria
criteria_list = ['conciseness', 'relevance', 'correctness', 'coherence']

for criterion in criteria_list:
    # "context-based" mode judges the response against the supplied context
    context_result = evaluator.evaluate("context-based", question, response, criterion, context=context)
    print(f"{criterion}: {context_result}")
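
To make the difference between the three grading modes concrete, here is a sketch of how a judge prompt could be assembled in each case. It is an assumption for illustration only: `build_judge_prompt` and its prompt wording are hypothetical, and the internals of `Custom_LLM_Judge` are not shown in this notebook and may differ.

```python
def build_judge_prompt(mode, question, response, criterion,
                       ground_truth=None, context=None):
    """Assemble a judge prompt (hypothetical format) for one grading mode."""
    parts = [
        f"Evaluate the response below for {criterion}.",
        f"Question: {question}",
        f"Response: {response}",
    ]
    if mode == "labeled":
        # Reference-guided grading against a known-good answer.
        if ground_truth is None:
            raise ValueError("'labeled' mode needs ground_truth")
        parts.append(f"Reference answer: {ground_truth}")
    elif mode == "context-based":
        # Reference-guided grading against supporting context.
        if context is None:
            raise ValueError("'context-based' mode needs context")
        parts.append(f"Context: {context}")
    elif mode != "basic":
        raise ValueError(f"Unknown mode: {mode!r}")
    parts.append("Explain your reasoning briefly, then give a verdict.")
    return "\n".join(parts)


prompt = build_judge_prompt(
    "labeled",
    "Why is the sky blue?",
    "The sky is blue because of Rayleigh scattering.",
    "correctness",
    ground_truth=(
        "The sky appears blue due to the scattering of sunlight by air "
        "molecules, a phenomenon known as Rayleigh scattering."
    ),
)
print(prompt)
```

The only difference between the modes is the extra reference line, which is what lets the judge check correctness against something other than its own knowledge.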

## RAG Evaluation based on LLM-as-a-Judge

In this section, we use the LLM-as-a-Judge approach to evaluate the quality of responses generated by a RAG pipeline.

In [None]:
import json
from datasets import Dataset

input_file = "data/sample_processed_qa_dataset.jsonl"
def read_jsonl(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line.strip())

dataset = Dataset.from_list(list(read_jsonl(input_file)))
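
Before spending judge calls on the dataset, it can help to check that every record has the fields the evaluation loop reads (`question`, `answer`, `ground_truth`). The following is a minimal sketch; `find_incomplete_records` is a hypothetical helper and the sample records are made up for illustration.

```python
REQUIRED_FIELDS = ("question", "answer", "ground_truth")


def find_incomplete_records(records, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) pairs for records lacking required data."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing if it is absent or empty (e.g. "" or None).
        missing = [field for field in required if not record.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


# Made-up records for illustration only.
sample = [
    {"question": "Why is the sky blue?",
     "answer": "Rayleigh scattering.",
     "ground_truth": "Rayleigh scattering."},
    {"question": "What is RAG?", "answer": ""},
]
print(find_incomplete_records(sample))  # [(1, ['answer', 'ground_truth'])]
```

With the loader above, the check could be run as `find_incomplete_records(list(read_jsonl(input_file)))`.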

In [None]:
from time import sleep

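# Judge model. The Claude 3.5 Sonnet ID is kept as a commented-out alternative;
# assigning both in sequence would silently discard the first value.
# llm_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
llm_id = "anthropic.claude-3-sonnet-20240229-v1:0"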

region = "us-west-2"
evaluator = Custom_LLM_Judge(llm_id, region)

criteria_list = ['conciseness', 'relevance', 'correctness', 'coherence', 'helpfulness', 'controversiality', 'misogyny', 'criminality']
results = []

for i in range(min(5, len(dataset))):
    item = dataset[i]
    question = item['question']
    response = item['answer']
    ground_truth = item['ground_truth']
    # contexts = item['contexts']  # not used by the "labeled" mode below

    row_results = {'question': question, 'answer': response}
    print(f"Evaluating question {i+1}: {question}")
    for criterion in criteria_list:
        result = evaluator.evaluate("labeled", question, response, criterion, ground_truth=ground_truth)
        row_results[criterion] = result
        print(f"  {criterion}: {result}")

    results.append(row_results)
    sleep(3) # Preventing API throttling

In [None]:
import pandas as pd

results_df = pd.DataFrame(results)
print(results_df)

json_filename = 'data/sample_llm_judge_results.json'

# Convert the DataFrame back to a list of dicts (one per question) for JSON export
results_list = results_df.to_dict(orient='records')
with open(json_filename, 'w', encoding='utf-8') as f:
    json.dump(results_list, f, ensure_ascii=False, indent=4)

print(f"\nResults saved to {json_filename}")
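
Once results are collected, a per-criterion average gives a quick overview. The sketch below assumes each criterion value is a plain number, which depends on what `evaluator.evaluate` returns; this notebook does not show that format, so a score-extraction step may be needed first. `mean_scores` is a hypothetical helper and the rows are made up, not real evaluation results.

```python
from statistics import mean


def _is_number(value):
    """True for ints and floats, but not booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def mean_scores(rows, criteria):
    """Average each criterion over the rows that hold a numeric value for it."""
    summary = {}
    for criterion in criteria:
        values = [row[criterion] for row in rows if _is_number(row.get(criterion))]
        # None marks a criterion with no numeric scores at all.
        summary[criterion] = mean(values) if values else None
    return summary


# Made-up rows for illustration only; these are not real evaluation results.
example_rows = [
    {"question": "q1", "answer": "a1", "relevance": 8, "coherence": 6},
    {"question": "q2", "answer": "a2", "relevance": 10, "coherence": "n/a"},
]
print(mean_scores(example_rows, ["relevance", "coherence", "helpfulness"]))
```

Non-numeric entries are skipped rather than treated as zero, so a failed or unparsable judgment does not drag the average down.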