## LLM-as-a-Judge

In [1]:
import os
import sys

# Make the project's src/ directory importable from this notebook.
sys.path.insert(0, os.path.join(os.path.abspath('..'), 'src'))

import re
import ast
import json
import random
import pickle
import pandas as pd
from collections import Counter, OrderedDict

from annotation_tools import LabelAnalysis

label_analysis = LabelAnalysis()

We first run the script that generates the LLM judgments.

```
./scripts/llm_judge.sh -m openai_gpt-4o-2024-05-13,anthropic_claude-3-5-sonnet-20240620 -d ../data/llm_judge -t ../data/prompts/llm_pairwise_judge_v1.txt
```

Before running the script, place the CSV file for the batch to be judged in the `../data/llm_judge` folder. The script takes that file as input and runs `llm_annotation.py` on it six times: the first three runs use the default order of `response_a` and `response_b`, and the next three use the reversed order. The result of each run is saved to a new CSV file in a column named `pred_[model_name]_ab|ba[run_number]`, where `run_number` is one of `[1, 2, 3]`. The final CSV file for each batch, which includes all six columns, is the input to the analysis in the next step.
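The column naming scheme described above can be sketched as follows. This is a minimal illustration, not the code in `llm_annotation.py`: the helper names `prediction_columns` and `missing_columns` are hypothetical, and the sketch assumes the pattern expands to names such as `pred_<model_name>_ab1`, which is one reading of `pred_[model_name]_ab|ba[run_number]`.

```python
from itertools import product

# Model names taken from the -m argument of the command above.
MODELS = [
    "openai_gpt-4o-2024-05-13",
    "anthropic_claude-3-5-sonnet-20240620",
]
ORDERS = ["ab", "ba"]  # default order, then reversed order
RUNS = [1, 2, 3]       # three runs per order


def prediction_columns(model_name):
    """Return the six expected prediction column names for one model.

    Assumption: the pattern expands to e.g. 'pred_<model>_ab1'.
    """
    return [
        f"pred_{model_name}_{order}{run}"
        for order, run in product(ORDERS, RUNS)
    ]


def missing_columns(columns_present, models=MODELS):
    """List expected prediction columns that are absent from a CSV header."""
    present = set(columns_present)
    expected = [col for model in models for col in prediction_columns(model)]
    return [col for col in expected if col not in present]


if __name__ == "__main__":
    for name in prediction_columns(MODELS[0]):
        print(name)
```

Passing the header of the final CSV file (for example `list(df.columns)` after loading it with pandas) to `missing_columns` would give a quick check that every run completed before starting the analysis.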

### Testing robustness

In [2]:
!python ../src/llm_judge.py

Vote inconsistency in [ab] runs by models
-----------------------------------------
{'anthropic_claude-3-5-sonnet-20240620': 48, 'openai_gpt-4o-2024-05-13': 174}
total: 222
Vote inconsistency in [ba] runs by models
-----------------------------------------
{'anthropic_claude-3-5-sonnet-20240620': 69, 'openai_gpt-4o-2024-05-13': 199}
total: 268

Disagreements between [ab] and [ba] runs by model
-------------------------------------------------
* model: anthropic_claude-3-5-sonnet-20240620
OrderedDict([('efficiency', {'count': 197}),
             ('bias', {'count': 132}),
             ('harmfulness', {'count': 95}),
             ('reasoning', {'count': 74}),
             ('correctness', {'count': 73}),
             ('helpfulness', {'count': 52})])
total: 623
* model: openai_gpt-4o-2024-05-13
OrderedDict([('efficiency', {'count': 94}),
             ('correctness', {'count': 81}),
             ('reasoning', {'count': 79}),
             ('helpfulness', {'count': 72}),