## LLM-as-a-Judge

In [1]:
import os
import sys

sys.path.insert(0, os.path.join(os.path.abspath('..'), 'src'))

import re
import ast
import json
import random
import pickle
import pandas as pd
from collections import OrderedDict
from collections import Counter

from annotation_tools import LabelAnalysis

label_analysis = LabelAnalysis()

We first run the script for generating LLM judgments.

**`Round 1`**
```
llm_judge.sh -m openai_gpt-4o-2024-05-13,anthropic_claude-3-5-sonnet-20240620 -d ../data/llm_judge -t ../data/prompts/llm_pairwise_judge_v1.txt

```

**`Round 2`**
```
llm_judge.sh -m openai_gpt-4o-2024-08-06,anthropic_claude-3-5-sonnet-20241022 -d ../data/llm_judge -t ../data/prompts/llm_pairwise_judge_v1.txt

```

To run this script, first place the CSV file of the batch you want to process in the `../data/llm_judge` folder. Ensure that it is the only CSV file in the folder (`TODO`: remove this requirement). The script uses that file as input and runs `llm_annotation.py` on it six times per model: the first three runs use the default order of `response_a` and `response_b`, and the next three use the reversed order.
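The run schedule described above can be sketched as follows. This is only an illustration of the ordering (three default-order runs, then three reversed-order runs, per model); the function name `plan_runs` and the tuple layout are hypothetical and are not taken from `llm_judge.sh`.

```python
def plan_runs(models, runs_per_order=3):
    """List (model, order, run_number) tuples in the order described in the text.

    'ab' stands for the default response order and 'ba' for the reversed one.
    Hypothetical helper: the real scheduling lives in llm_judge.sh.
    """
    plan = []
    for model in models:
        for order in ("ab", "ba"):  # default order first, then reversed
            for run_number in range(1, runs_per_order + 1):
                plan.append((model, order, run_number))
    return plan


schedule = plan_runs([
    "openai_gpt-4o-2024-08-06",
    "anthropic_claude-3-5-sonnet-20241022",
])
for model, order, run_number in schedule:
    print(f"{model}: order={order} run={run_number}")
```

With two models this yields twelve runs in total, six per model.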

The results of these runs are saved in multiple CSV files (one file after each run, for checkpointing), with columns named `pred_[model_name]_ab|ba[run_number]`, where `ab` marks the default response order, `ba` marks the reversed order, and `run_number` is one of `[1, 2, 3]`. For example, `pred_openai_gpt-4o-2024-08-06_ab1` indicates the first of three runs using `openai_gpt-4o-2024-08-06` as the LLM judge, with responses in their default order.
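As a minimal sketch of the naming scheme, the snippet below builds and parses prediction column names. The regular expression and the helper names `pred_column` and `parse_pred_column` are assumptions made for illustration; they are not part of `llm_annotation.py`.

```python
import re

# Assumed pattern: pred_<model_name>_<ab|ba><run_number>, with run_number in 1..3.
PRED_COLUMN = re.compile(r"^pred_(?P<model>.+)_(?P<order>ab|ba)(?P<run>[123])$")


def pred_column(model, order, run_number):
    """Build a column name such as 'pred_openai_gpt-4o-2024-08-06_ab1'."""
    return f"pred_{model}_{order}{run_number}"


def parse_pred_column(name):
    """Return (model, order, run_number), or None if the name does not match."""
    match = PRED_COLUMN.match(name)
    if match is None:
        return None
    return match.group("model"), match.group("order"), int(match.group("run"))


print(parse_pred_column("pred_openai_gpt-4o-2024-08-06_ab1"))
```

Parsing from the right-hand side of the name (order and run number anchored at the end) keeps model names that themselves contain underscores intact.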

The final CSV file for each batch, which includes all six columns per model, will serve as the input for the analysis in the next step. You can repeat this process for a new round, but make sure to remove or move the CSV files generated in the current round beforehand.
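To make the role of the six columns concrete, here is a rough sketch of one way repeated runs could be compared. It assumes, without confirmation from `llm_judge.py`, that a row counts as inconsistent when the three runs of the same order do not all return the same vote. Plain dictionaries stand in for CSV rows, votes are treated as opaque strings, and `count_inconsistent_rows` is a hypothetical name.

```python
def count_inconsistent_rows(rows, model, order, runs=(1, 2, 3)):
    """Count rows whose repeated runs (same model, same order) disagree.

    Assumption: 'inconsistent' means the runs did not all give the same vote.
    The actual definition used by the analysis script may differ.
    """
    columns = [f"pred_{model}_{order}{run}" for run in runs]
    return sum(1 for row in rows if len({row[col] for col in columns}) > 1)


# Toy rows with made-up votes, for illustration only.
toy_rows = [
    {"pred_m_ab1": "A", "pred_m_ab2": "A", "pred_m_ab3": "A"},
    {"pred_m_ab1": "A", "pred_m_ab2": "B", "pred_m_ab3": "A"},
    {"pred_m_ab1": "B", "pred_m_ab2": "B", "pred_m_ab3": "B"},
]
print(count_inconsistent_rows(toy_rows, model="m", order="ab"))
```

Comparing `ab` against `ba` runs would additionally require mapping the reversed-order votes back to the original response labels, which depends on the vote format and is not attempted here.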

## Testing robustness

After running the `llm_judge.sh` script, navigate to the `../data/llm_judge` folder and ensure that all annotated batch files follow this naming pattern: `batch[batch_number]-llm.csv`. For example: `../data/llm_judge/batch1-llm.csv`. Once all batch file names have been updated to match this pattern, run the following script. (`TODO`: implement auto-renaming as part of running `llm_judge.sh`.)
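Until auto-renaming exists, a quick check like the one below can flag files that do not follow the expected pattern. This is a sketch: the regular expression encodes the `batch[batch_number]-llm.csv` convention described above, and `split_batch_names` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import Path

# Assumed convention: batch<number>-llm.csv, e.g. batch1-llm.csv
BATCH_FILE = re.compile(r"^batch(\d+)-llm\.csv$")


def split_batch_names(names):
    """Split file names into (matching, non_matching) lists."""
    matching, non_matching = [], []
    for name in names:
        (matching if BATCH_FILE.match(name) else non_matching).append(name)
    return matching, non_matching


# Check the CSV files in the data folder (an empty result if the folder is absent).
csv_names = sorted(path.name for path in Path("../data/llm_judge").glob("*.csv"))
ok, needs_renaming = split_batch_names(csv_names)
print("matching:", ok)
print("needs renaming:", needs_renaming)
```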

### Round 1

In [2]:
!python ../src/llm_judge.py --llms openai_gpt-4o-2024-05-13 anthropic_claude-3-5-sonnet-20240620

Vote inconsistency in [ab] runs by models
-----------------------------------------
{'anthropic_claude-3-5-sonnet-20240620': 48, 'openai_gpt-4o-2024-05-13': 174}
total: 222
Vote inconsistency in [ba] runs by models
-----------------------------------------
{'anthropic_claude-3-5-sonnet-20240620': 69, 'openai_gpt-4o-2024-05-13': 199}
total: 268

Disagreements between [ab] and [ba] runs by model
-------------------------------------------------
* model: anthropic_claude-3-5-sonnet-20240620
OrderedDict([('efficiency', {'count': 197}),
             ('bias', {'count': 132}),
             ('harmfulness', {'count': 95}),
             ('reasoning', {'count': 74}),
             ('correctness', {'count': 73}),
             ('helpfulness', {'count': 52})])
total: 623
* model: openai_gpt-4o-2024-05-13
OrderedDict([('efficiency', {'count': 94}),
             ('correctness', {'count': 81}),
             ('reasoning', {'count': 79}),
             ('helpfulness', {'count': 72}),

### Round 2

In [3]:
!python ../src/llm_judge.py --llms openai_gpt-4o-2024-08-06 anthropic_claude-3-5-sonnet-20241022

Failed to parse vote for anthropic_claude-3-5-sonnet-20241022: expected string or bytes-like object. Using neutral vote.
Vote inconsistency in [ab] runs by models
-----------------------------------------
{'anthropic_claude-3-5-sonnet-20241022': 53, 'openai_gpt-4o-2024-08-06': 138}
total: 191
Vote inconsistency in [ba] runs by models
-----------------------------------------
{'anthropic_claude-3-5-sonnet-20241022': 69, 'openai_gpt-4o-2024-08-06': 144}
total: 213

Disagreements between [ab] and [ba] runs by model
-------------------------------------------------
* model: anthropic_claude-3-5-sonnet-20241022
OrderedDict([('efficiency', {'count': 181}),
             ('correctness', {'count': 76}),
             ('helpfulness', {'count': 63}),
             ('reasoning', {'count': 52}),
             ('harmfulness', {'count': 38}),
             ('bias', {'count': 7})])
total: 417
* model: openai_gpt-4o-2024-08-06
OrderedDict([('efficiency', {'count': 103}),
             ('reasoning', {'count'