SubEval

A general-purpose subjective evaluation tool for LLMs using LLM-as-a-Judge. It is already configured to support the software design evaluation tasks of DevBench.

The evaluation is set to judge whether a response generated by a given model is better than that of a reference model under our predefined criteria. See evaluating_guidance for the detailed metrics used for the different software design files.

More details on the terminology and instructions used in the SubEval context can be found here.

Installation

# Inside the subeval dir
pip install -e .

OpenAI API

You should have a JSON file that stores your keys (for example, you can name it keys.json) with the following content:

{
    "openai-keys": [
        "",
        ""
    ]
}

Before running the evaluation scripts, point the KEYS environment variable to this file: export KEYS=/path/to/your/keys.json. Also, since all the provided scripts are written to be run from the top-level directory, set export PYTHONPATH=path/to/SubEval (or export PYTHONPATH=$PWD if you are already at the top-level directory).
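For a quick sanity check of this setup, a minimal sketch like the one below can verify that the keys file is readable. It is purely illustrative and not part of SubEval's API; it only assumes that KEYS points to a JSON file containing an openai-keys list, as described above.

import json
import os

# Illustrative only: mimic how the keys file referenced by the KEYS
# environment variable could be loaded and its "openai-keys" list read.
keys_path = os.environ["KEYS"]            # e.g. /path/to/your/keys.json
with open(keys_path) as f:
    openai_keys = json.load(f)["openai-keys"]

if not any(openai_keys):
    raise ValueError("keys.json contains no non-empty OpenAI keys")
print(f"Found {len(openai_keys)} OpenAI key(s)")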

Quick Start

Example data

We provide the example data from DevBench in examples/DevBench_projects_example.xlsx (here) to run the Subjective Evaluation Tool.

The data includes responses from the following models (4 GPT models and 6 open-source models):

  • gpt-3.5-turbo-1106
  • gpt-4-0613
  • gpt-4-1106-preview
  • gpt-4-0125-preview
  • codellama-7b-instruct
  • codellama-13b-instruct
  • codellama-34b-instruct
  • deepseek-coder-1.3b-instruct
  • deepseek-coder-6.7b-instruct
  • deepseek-coder-33b-instruct

The currently available GPT judges are listed here.

Feel free to add your own models and judges.

Run the script

To evaluate gpt-4-0125-preview's response using gpt-3.5-turbo-1106 as the reference model and gpt-4-1106-preview as the judge, run the following command:

python3 subeval/subjective/sub_eval.py --data examples/DevBench_projects_example.xlsx --model gpt-4-0125-preview --refm gpt-3.5-turbo-1106 --judge gpt-4-1106-preview --eval-nopt 2 --eval-proc 1 --mode dual

or run this script:

chmod +x ./scripts/run_example.sh
./scripts/run_example.sh

To evaluate all models' responses using gpt-3.5-turbo-1106 as both the reference model and the judge, run the following command:

python3 subeval/subjective/sub_eval.py --data examples/DevBench_projects_example.xlsx --model gpt-3.5-turbo-1106 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview codellama-7b-instruct codellama-13b-instruct codellama-34b-instruct deepseek-coder-1.3b-instruct deepseek-coder-6.7b-instruct deepseek-coder-33b-instruct --refm gpt-3.5-turbo-1106 --judge gpt-3.5-turbo-1106 --eval-nopt 2 --eval-proc 1 --mode dual

or run this script:

chmod +x ./scripts/run_all_examples.sh
./scripts/run_all_examples.sh

Script arguments

Here is a brief overview of the necessary arguments. See subeval/subjective/sub_eval.py (here) for more detail.

  • --data: The formatted Excel file. See here for the format (and the inspection sketch after this list).
  • --model: The models whose responses will be evaluated.
  • --refm: The reference/baseline model to compare against.
  • --judge: The judge model that evaluates the responses.
  • --eval-nopt: The number of options (explained here).
  • --eval-proc: The number of processes for multi-process evaluation.
  • --mode: The order in which responses are presented to the judge. Choose between dual and random. We recommend, and used for DevBench, the dual mode for consistent evaluations.
  • --fill-contents: Optional; load the actual contents from the paths in a given column (explained here).
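If you want to check your input sheet before launching an evaluation, a small sketch like the following can help. It only assumes the file is a regular Excel sheet readable by pandas (with openpyxl installed) and makes no assumptions about the exact column names, which are documented in the format reference above.

# Inspect the formatted Excel file passed via --data (illustrative only).
import pandas as pd

df = pd.read_excel("examples/DevBench_projects_example.xlsx")
print(df.columns.tolist())  # the columns expected by sub_eval.py
print(df.head())            # a few sample rows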

Output (original win rate)

SubEval was developed on top of another codebase of ours, which we adapted for the DevBench evaluation. As a result, this part contains some legacy code. We further processed the original outputs to obtain the results presented in the paper. We apologize for any inconvenience this may cause.

The results of the subjective evaluation will be stored in the directory output/{df_name}_infer_input_{seed}_record_{judge}_{nopt}. Among all the output files, log.txt integrates all evaluation results, and record_{judge}_{nopt}.tsv is the detailed output of the evaluation for each pair of responses.
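To inspect the per-pair records programmatically, a sketch along these lines may be handy; it assumes only the directory and file naming pattern described above and loads the TSV with pandas.

# Illustrative: load the detailed per-pair records from the output directory.
# Adjust the glob pattern if your df_name, seed, judge, or nopt differ.
import glob
import pandas as pd

for tsv_path in glob.glob("output/*_infer_input_*_record_*/record_*.tsv"):
    records = pd.read_csv(tsv_path, sep="\t")
    print(tsv_path, "-", len(records), "judged response pairs")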

We include example evaluation results here (the log.txt files generated by running the two scripts above).

The old version of the win rate is calculated only over consistent pairs of evaluations (see here for the definition of consistent and inconsistent in our context). We also provide a new version of the win rate calculation, specified below, which treats an evaluation that is inconsistent across the swapped response orders in the prompt as a "tie" for both models.
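As a rough illustration of the difference between the two schemes (this is not the exact code in calculate_winrate_new.py, and the weight given to a tie here is an assumption):

# Illustrative win rate arithmetic only; see calculate_winrate_new.py for the
# actual implementation. Each pair is judged twice with the response order
# swapped; a pair is "consistent" when both judgments agree.
def old_win_rate(wins, losses):
    # old version: computed over consistent pairs only
    return wins / (wins + losses)

def new_win_rate(wins, losses, inconsistent):
    # new version: an inconsistent pair counts as a tie for both models
    # (a tie is assumed here to be worth half a win)
    return (wins + 0.5 * inconsistent) / (wins + losses + inconsistent)

print(old_win_rate(6, 3))      # e.g. 6 wins, 3 losses among consistent pairs
print(new_win_rate(6, 3, 1))   # the same counts plus 1 inconsistent pair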

For a more detailed explanation and interpretation, see subeval.md.

Processed win rate (used in the paper)

In our DevBench paper, we applied this new version of the win rate calculation. It regards inconsistent pairs of evaluations as a "tie" for both models.

To use the new version calculation, run the following script from the top-level directory:

python ./subeval/subjective/calculate_winrate_new.py

If you specify a directory to save the calculation results, two CSV files will be saved:

  • win_rate_with_tie.csv
  • win_rate_without_tie.csv

Of the two examples we provide, run_example.sh produced only consistent evaluations, so there is no difference between the old and new version win rate calculations and no need to recalculate using the new version.

However, run_all_examples.sh resulted in some inconsistency, so we calculated the new version win rate both with and without ties considered.

You can choose whether to run the new version win rate calculation based on your needs and experiment results. We highly recommend running it for eval-nopt 2 cases.

Now you are ready to go! Feel free to customize the Subjective Evaluation Tool to fit your needs!