## LLM Evaluation
In this notebook, we evaluate the feedback suggestions with an LLM-as-a-Judge. The goal is to compare the LLM's evaluation with the expert evaluation (step 4) and to analyze the differences.


### Setup
We need to configure the LLM and the evaluation metrics. The LLM evaluates the feedback suggestions, while the metrics define how the quality of the feedback is assessed. We also load the feedback suggestions (step 3) that are to be evaluated.

#### LLM-as-a-Judge Configuration
Make sure to set the following environment variables in your `.env` file:
- `LLM_EVALUATION_MODEL`: The name of the model to use for evaluation. Any `azure_openai_` prefix is removed to obtain the Azure deployment name.
- `AZURE_OPENAI_API_KEY`: The API key for Azure OpenAI.
- `AZURE_OPENAI_ENDPOINT`: The endpoint for Azure OpenAI.
- `OPENAI_API_VERSION`: The API version to use for Azure OpenAI.

In [None]:
import os
from langchain_openai import AzureChatOpenAI
from dotenv import load_dotenv

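# Load variables from .env, overriding values already set in the environment
load_dotenv(override=True)

model_name = os.getenv("LLM_EVALUATION_MODEL")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("OPENAI_API_VERSION")

# Fail early with a clear message instead of an AttributeError on None below
if not model_name:
    raise ValueError("LLM_EVALUATION_MODEL is not set; check your .env file")

model = AzureChatOpenAI(
    # The deployment name is the model name without the "azure_openai_" prefix
    azure_deployment=model_name.replace("azure_openai_", ""),
    api_key=api_key,
    azure_endpoint=api_base,
    api_version=api_version,
    temperature=0,  # reduce sampling randomness for more repeatable judgments
)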

#### Define Metrics
The metrics define how we assess the quality of the feedback. You can use the predefined metrics from the `metrics` file or reuse the metrics from the expert evaluation. If you want to compare the LLM's evaluation with the expert evaluation, make sure to use the same metrics as in step 4.
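
As a rough illustration of that consistency check, the sketch below compares the titles of two metric lists. The helper `diff_metric_titles` and the example titles are hypothetical and not part of the project; in this notebook the titles would come from `metric.title` for each loaded metric.

```python
def diff_metric_titles(llm_titles, expert_titles):
    """Return titles that appear in only one of the two metric sets."""
    llm_set, expert_set = set(llm_titles), set(expert_titles)
    return {
        "only_in_llm": sorted(llm_set - expert_set),
        "only_in_expert": sorted(expert_set - llm_set),
    }


# Hypothetical titles, for illustration only
mismatch = diff_metric_titles(
    ["Completeness", "Correctness", "Tone"],
    ["Completeness", "Correctness", "Actionability"],
)
print(mismatch)
```

Two empty lists in the result would indicate that both evaluations use the same set of metric titles.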

##### Define New Metrics

In [None]:
from prompts.metrics import (
    completeness,
    correctness,
    actionability,
    tone,
)

metrics = [completeness, correctness, actionability, tone]
print(f"Loaded metrics: {[metric.title for metric in metrics]}")

##### Reuse Metrics from Expert Evaluation

In [None]:
import json
from model.evaluation_model import Metric

# Load Metrics from common evaluation config (created in 4_expert_evaluation.ipynb)
config_path = "data/4_expert_evaluation/output_depseudonymized/common_evaluation_config.json"

metrics = []
with open(config_path, "r") as config_file:
    common_evaluation_config = json.load(config_file)
    metrics_config = common_evaluation_config.get("metrics", [])
    metrics = [Metric(**metric) for metric in metrics_config]

print(f"Loaded metrics: {[metric.title for metric in metrics]}")

#### Load Feedback Suggestions
The feedback suggestions are stored in a CSV file (step 3). We will load the feedback suggestions and prepare them for evaluation.

In [None]:
import pandas as pd

data = pd.read_csv("data/3_feedback_suggestions/feedback_suggestions.csv")

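# Use single quotes inside the f-strings: reusing double quotes is a
# SyntaxError on Python versions before 3.12
print(f"Feedback Types: {data['feedback_type'].unique()}")
print(
    f"Exercises: {data['exercise_id'].nunique()}, Submissions: {data['submission_id'].nunique()}"
)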
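
The summary above relies on three columns being present in the CSV file. A minimal sketch of an early check, assuming these three columns are the ones that matter here (the helper `missing_columns` and the extra column name `text` are hypothetical):

```python
REQUIRED_COLUMNS = ("feedback_type", "exercise_id", "submission_id")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in required if name not in present]


# With the DataFrame loaded above, the call would be: missing_columns(data.columns)
example = missing_columns(["submission_id", "feedback_type", "text"])
print(example)
```

An empty list means all expected columns are available; anything else points to a CSV file that does not match the step 3 output.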

#### Generate Prompts
The prompts are generated from the feedback suggestions and the metrics. They are then used to evaluate the feedback suggestions with the LLM.

In [None]:
from service.llm_as_a_judge_service import generate_evaluation_requests

# You can choose to evaluate only a specific feedback type by setting the filter (e.g. "Tutor")
requests = generate_evaluation_requests(data, metrics, feedback_type_filter=None)

print(f"Number of requests: {len(requests)}")

#### Sample Prompts for Testing
<mark>Optionally, you can sample a few prompts for testing. This is useful to check that the prompts are generated correctly and to test the evaluation process without incurring high costs. Skip this cell for the full evaluation, because it overwrites `requests` with the sample.</mark>

In [None]:
import random

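# Cap the sample size so random.sample does not raise when fewer than 10 requests exist
requests = random.sample(requests, min(10, len(requests)))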
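
If the test sample should be the same on every run, a seeded generator can be used. The following is a minimal sketch, not part of the project: the helper `sample_requests`, its default seed, and the placeholder items are assumptions for illustration.

```python
import random


def sample_requests(requests, k=10, seed=42):
    """Return up to k requests, chosen reproducibly with a fixed seed."""
    # A local generator leaves the global random state untouched
    rng = random.Random(seed)
    items = list(requests)
    return rng.sample(items, min(k, len(items)))


# Placeholder items stand in for the real evaluation requests
demo = sample_requests([f"request-{i}" for i in range(25)], k=5)
print(len(demo))
```

Returning a new list instead of overwriting `requests` also keeps the full set available for the later complete run.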

#### Evaluate Feedback with LLM
This step evaluates the feedback suggestions with the LLM and saves the evaluations to a JSON file, similar to the evaluation progress files from the experts.

The evaluation takes approximately one minute per 100 requests.
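
That rule of thumb can be turned into a quick estimate before starting a run. This is a small sketch under the stated rate of one minute per 100 requests; the helper `estimate_runtime_minutes` and the request count of 2500 are hypothetical.

```python
def estimate_runtime_minutes(num_requests, minutes_per_100=1.0):
    """Estimate wall-clock minutes from a rate given in minutes per 100 requests."""
    if num_requests < 0:
        raise ValueError("num_requests must be non-negative")
    return num_requests / 100 * minutes_per_100


# In the notebook, len(requests) would replace the hypothetical count below
print(f"Estimated runtime: {estimate_runtime_minutes(2500):.1f} minutes")
```

The actual duration depends on the deployment and its rate limits, so the result is only a planning figure.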

<mark>**Note**: Evaluating with the LLM incurs costs. Monitor your usage and costs, test with a small sample of prompts before running the full evaluation, and try to run the full evaluation only once.</mark>

In [None]:
from service.llm_as_a_judge_service import process_feedback_evaluations

output_path = "data/5_llm_evaluation/"

process_feedback_evaluations(requests, output_path, model, metrics)