## LLM-as-a-Judge evaluation 

In this notebook, we test several approaches to evaluating the responses Kai generates for different migration scenarios, using an LLM as a judge.

The responses are generated by running Kai manually in VS Code with different models. The diff files of the responses are stored as artifacts for simplicity.

#### Goal

The goal of this exercise is to:
- verify & finalize the evaluation metrics
- finalize the evaluation prompts

### Process

Run Tools  -->  Run Kai (manual, responses collected)  -->  Run Tools  -->  Evaluate

#### Running tools

In this notebook, we run the following tools before and after applying fixes from Kai:
- `mvn compile`
- `mvn test` (only if tests are present for a given test case)
- `analyze` (kantra used for analysis)

#### Evaluate

The evaluation calculates three metrics:

1. Completeness (C): Measures whether the issue is completely resolved
2. Functional Parity (F): Measures whether existing functionality is maintained
3. Knock-on Effort (E): Measures how much effort is needed to address new issues caused by the fix

Each metric is normalized, and the three are combined into a final weighted score between 0 and 10:

```
Final Score = 10 * (0.5 * C + 0.3 * F + 0.2 * E)
```
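
As a worked example of the formula above, the following minimal sketch computes the final score in Python. It assumes each metric has already been normalized to the range [0, 1], which is what a 0-10 final score implies. The function name `final_score` and the sample values are hypothetical and not part of the notebook's code.

```python
def final_score(completeness: float, functional_parity: float, knock_on_effort: float) -> float:
    """Combine the three normalized metrics into a weighted score between 0 and 10."""
    metrics = {
        "completeness": completeness,
        "functional_parity": functional_parity,
        "knock_on_effort": knock_on_effort,
    }
    # Assumption: every metric is normalized to [0, 1] before weighting.
    for name, value in metrics.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 10 * (0.5 * completeness + 0.3 * functional_parity + 0.2 * knock_on_effort)


# Hypothetical sample values: issue fully resolved, half marks on the other two metrics.
print(f"{final_score(1.0, 0.5, 0.5):.2f}")
```

Because Completeness carries half of the weight, a fix that does not resolve the issue cannot score above 5 regardless of the other two metrics.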

#### Pre-requisites

- Create a virtualenv for Jupyter to use when running the cells in this notebook. Use [requirements.txt](../requirements.txt) to install the required dependencies.
- Copy the `.env.sample` file to `.env`. Select the model you want to use for evaluation; only Bedrock and ChatOpenAI are supported. Add the LLM key for that model. Once set up, run the following cell to load the `.env` file.
- Make sure you have _java_ and _mvn_ installed.
- Download the latest _kantra_ binary.
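
A quick way to confirm the command-line pre-requisites is to look them up on `PATH`. This is a minimal sketch: the helper name `find_missing_tools` is hypothetical, and it assumes the kantra binary is reachable on `PATH` under the name `kantra`, which may not match where you downloaded it.

```python
import shutil


def find_missing_tools(tools: list[str]) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumption: kantra was placed on PATH under the name "kantra".
missing = find_missing_tools(["java", "mvn", "kantra"])
if missing:
    print(f"Missing tools: {', '.join(missing)}")
else:
    print("All required tools were found on PATH")
```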


In [1]:
%load_ext dotenv
%dotenv

Before we begin, the following cell defines some common code that we will need later on. Run this cell before proceeding.

In [2]:
# some common functions we will need later on for the evaluation
import os
import sys
import yaml
import subprocess
from git import Repo
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain_aws import ChatBedrockConverse

test_data_path = Path("test-data").absolute()
apps_repo_path = (test_data_path / "apps").absolute()
artifacts_path = (test_data_path / "artifacts").absolute()
apps_repo_path.mkdir(parents=True, exist_ok=True)
artifacts_path.mkdir(parents=True, exist_ok=True)

def clone_repo(url: str, branch: str, path: Path):
    try:
        Repo.clone_from(url, depth=1, single_branch=True, branch=branch, to_path=path)
    except Exception as e:
        # An existing checkout is reused, so re-running the notebook is safe.
        if "already exists" not in str(e):
            print(f"fatal error cloning repo: {e}")
            sys.exit(1)

def get_model():
    provider = os.getenv("model_provider")
    model_id = os.getenv("model_id")
    if not model_id or not provider:
        raise ValueError("model_id and/or model_provider are not set")
    match provider:
        case "chatbedrock":
            key_id = os.getenv("aws_access_key_id")
            access_key = os.getenv("aws_secret_access_key")
            region = os.getenv("region")
            if not region or not access_key or not key_id:
                raise ValueError("region and/or aws_secret_access_key and/or aws_access_key_id is not set")
            return ChatBedrockConverse(
                model_id=model_id,
                aws_access_key_id=key_id,
                aws_secret_access_key=access_key,
                region_name=region,
            )
        case "chatopenai":
            api_key = os.getenv("OPENAI_API_KEY") or os.getenv("api_key")
            if not api_key:
                raise ValueError("OPENAI_API_KEY or api_key is not set")
            return ChatOpenAI(model=model_id, api_key=api_key)
        case _:
            raise ValueError(f"Invalid model provider: {provider}")

def parse_yaml(path: Path):
    with open(path, "r") as f:
        return yaml.safe_load(f)

def clone_app(app_path: Path) -> Path:
    parsed = parse_yaml(app_path)
    repo_url = parsed["source_code"]["git"]["url"]
    branch = parsed["source_code"]["git"]["branch"]
    path = Path("test-data") / "apps" / parsed["name"]
    clone_repo(repo_url, branch, path)
    return path.absolute()

def get_test_selectors_from_tc(tc_path: Path):
    parsed = parse_yaml(tc_path)
    return parsed["testSelectors"]

def run_command(cmd: list[str], stdout_path: Path, stderr_path: Path, cwd: Path):
    # Run the command in `cwd` without changing the notebook's own working
    # directory, so a failure to launch cannot leave us in the wrong directory.
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, cwd=cwd)
    with open(stdout_path, "w") as f:
        f.write(result.stdout)
    with open(stderr_path, "w") as f:
        f.write(result.stderr)


def run_mvn(tc_name: str, app_path: Path, sub_dir: str | Path = ""):
    # Logs go to <artifacts>/<test case>/<sub_dir>/mvn/
    output_dir = (artifacts_path / tc_name / sub_dir / "mvn").absolute()
    output_dir.mkdir(parents=True, exist_ok=True)
    stdout_path = output_dir / "mvn_compile.log"
    stderr_path = output_dir / "mvn_compile.err"
    run_command(["mvn", "compile"], stdout_path, stderr_path, app_path)

## Scenario 1: Ehcache 2 to 3 upgrade

| App         | Complexity |
|-------------|------------|
| Petclinic   |    High    |

See the description of the issue [here](../apps/petclinic/test_cases/ehcache-2-to-3/tc.yaml).

See the notes on the expected fix [here](../apps/petclinic/test_cases/ehcache-2-to-3/notes.md).


In [3]:
tc_1_name = "ehcache-2-to-3"
tc_1_path = Path("../apps/petclinic/test_cases/ehcache-2-to-3/tc.yaml").absolute()
tc_1_selectors = get_test_selectors_from_tc(tc_1_path)
repo_path = clone_app(Path("../apps/petclinic/app.yaml").absolute())
run_mvn(tc_1_name, repo_path, "before")
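
After `run_mvn` finishes, the compile output is in `mvn_compile.log` under the test case's artifacts directory. One simple way to tell whether compilation passed, sketched here under the assumption that Maven's usual `BUILD SUCCESS` summary line appears in the log, is to search the log for that marker. The helper name `mvn_build_succeeded` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def mvn_build_succeeded(log_path: Path) -> bool:
    """Return True if the Maven log contains the BUILD SUCCESS summary line."""
    if not log_path.exists():
        # No log means the tool was never run (or failed before writing output).
        return False
    text = log_path.read_text(errors="replace")
    return "BUILD SUCCESS" in text


# Same layout run_mvn uses: <artifacts>/<test case>/<sub_dir>/mvn/mvn_compile.log
before_log = Path("test-data") / "artifacts" / "ehcache-2-to-3" / "before" / "mvn" / "mvn_compile.log"
print(f"Compiled before the fix: {mvn_build_succeeded(before_log)}")
```

Running the same check against the log captured after applying Kai's fix gives a simple before/after comparison.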