Author: naholav
A comprehensive evaluation pipeline for LoRA fine-tuned code generation models using the LiveCodeBench benchmark.
Note: All models in this project were evaluated on AtCoder Easy problems (41 questions). We recommend using the same evaluation setup for consistent and comparable results.
This evaluation pipeline tests LoRA fine-tuned models on LiveCodeBench, a benchmark containing competitive programming problems from platforms like AtCoder, LeetCode, and Codeforces.
```
test_github/
├── README.md                # This file
├── requirements.txt         # Python dependencies
├── livecodebench_eval.py    # Main evaluation script
├── run_all_evaluations.py   # Batch evaluation runner
└── common/
    ├── __init__.py
    ├── model_loader.py      # Model loading utilities
    ├── code_executor.py     # Safe code execution
    └── code_postprocess.py  # Output post-processing
```
- Python 3.9+
- CUDA-compatible GPU with at least 8GB VRAM
- Linux (recommended) or macOS
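To quickly verify the GPU requirement above, a minimal PyTorch check can be used. This is only a sketch and assumes PyTorch is already installed in your environment:

```python
# Sketch: confirm a CUDA GPU is visible and report its VRAM.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-compatible GPU detected.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name} ({vram_gb:.1f} GB VRAM)")
if vram_gb < 8:
    print("Warning: less than 8 GB VRAM; evaluation may hit out-of-memory errors.")
```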
- Clone this repository and navigate to the evaluation directory:

  ```bash
  cd test_github
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Install Flash Attention 2 for faster inference:

  ```bash
  pip install flash-attn --no-build-isolation
  ```

Your trained model checkpoints must follow this directory structure:
```
models/
├── deep_think/
│   └── checkpoints/
│       ├── checkpoint-step-300-epoch-2/
│       ├── checkpoint-step-400-epoch-2/
│       ├── checkpoint-step-500-epoch-2/
│       └── ...
├── deep_instruction/
│   └── checkpoints/
│       └── ...
├── diverse_think/
│   └── checkpoints/
│       └── ...
└── diverse_instruction/
    └── checkpoints/
        └── ...
```
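To check that your local checkpoints actually match this layout before launching a long run, a small sketch like the following can help. The glob pattern below simply mirrors the tree above; it is not part of the evaluation scripts:

```python
# Sketch: list checkpoints under {models_dir}/{model_type}/checkpoints/,
# mirroring the expected checkpoint-step-{step}-epoch-{epoch} naming above.
from pathlib import Path

models_dir = Path("./models")
model_types = ("deep_think", "deep_instruction", "diverse_think", "diverse_instruction")

for model_type in model_types:
    ckpt_root = models_dir / model_type / "checkpoints"
    checkpoints = sorted(ckpt_root.glob("checkpoint-step-*-epoch-*"))
    print(f"{model_type}: {len(checkpoints)} checkpoint(s)")
    for ckpt in checkpoints:
        print(f"  {ckpt.name}")
```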
Important: If your model types have different names, update the `model_types` tuple in the configuration section of the scripts.
If your training setup uses different model type names, edit the `Config` class in `livecodebench_eval.py`:
```python
# Default configuration
model_types: tuple = (
    "deep_think",
    "deep_instruction",
    "diverse_think",
    "diverse_instruction"
)

# Example: Custom model types
model_types: tuple = (
    "my_model_type_1",
    "my_model_type_2",
)
```

If your checkpoints are saved at different steps:
```python
# Default
checkpoint_steps: tuple = (300, 400, 500, 600, 700, 800)

# Custom steps
checkpoint_steps: tuple = (100, 200, 300, 400, 500)
```

The system prompts must match what was used during training. Edit them in `Config.__post_init__()`:
```python
self.system_prompts = {
    "think": "Your think-style system prompt here",
    "instruction": "Your instruction-style system prompt here"
}
```

Evaluate a specific model type and checkpoint:
```bash
python livecodebench_eval.py \
    --model_type deep_think \
    --steps 600 \
    --difficulty easy \
    --checkpoint_dir ./models \
    --output_dir ./results
```

Evaluate all model types and include the base model:

```bash
python livecodebench_eval.py \
    --difficulty easy \
    --include_base \
    --checkpoint_dir ./models
```

Evaluate only on AtCoder problems:
```bash
python livecodebench_eval.py \
    --platform atcoder \
    --difficulty easy \
    --include_base \
    --checkpoint_dir ./models
```

We recommend using AtCoder Easy problems (41 questions) for evaluation because:
- Full LiveCodeBench evaluation takes many hours to complete
- The 1.5B parameter model cannot solve hard problems even with fine-tuning
- AtCoder Easy provides a good balance of speed and meaningful comparison
The recommended evaluation command:

```bash
python livecodebench_eval.py \
    --platform atcoder \
    --difficulty easy \
    --include_base \
    --checkpoint_dir ./models
```

Run a comprehensive evaluation across all models and difficulties:
```bash
python run_all_evaluations.py \
    --models_dir ./models \
    --output_dir ./results
```

Quick test mode (single model, single difficulty):

```bash
python run_all_evaluations.py --quick
```

Resume from a previous run:
```bash
python run_all_evaluations.py --resume
```

Arguments for `livecodebench_eval.py`:

| Argument | Description | Default |
|---|---|---|
| `--model_type` | Model type to evaluate (or "all") | all |
| `--steps` | Checkpoint steps to evaluate | all |
| `--include_base` | Include base model in evaluation | False |
| `--difficulty` | Difficulty filter (easy/medium/hard/all) | all |
| `--platform` | Platform filter (atcoder/leetcode/codeforces/all) | all |
| `--date_start` | Start date in YYMM format | 2408 |
| `--date_end` | End date in YYMM format | 2502 |
| `--checkpoint_dir` | Directory containing model checkpoints | ./models |
| `--output_dir` | Output directory for results | ./results/livecodebench |
| `--version` | LiveCodeBench version | release_v5 |
Arguments for `run_all_evaluations.py`:

| Argument | Description | Default |
|---|---|---|
| `--quick` | Quick test mode (1 model, 1 difficulty) | False |
| `--model_types` | Specific model types to evaluate | all |
| `--steps` | Specific steps to evaluate | all |
| `--difficulties` | Specific difficulties to evaluate | easy, medium, hard |
| `--resume` | Skip existing results | False |
| `--include_base` | Include base model | False |
| `--models_dir` | Checkpoint directory | ./models |
| `--output_dir` | Output directory | ./results/livecodebench |
After evaluation, results are saved in the following structure:
```
results/livecodebench/
├── summary.json                                # Overall summary
├── detailed/
│   └── {model_name}_{difficulty}.jsonl         # Detailed per-problem logs
├── generations/
│   └── {model_name}_{difficulty}.json          # Generated code
└── evaluations/
    └── {model_name}_{difficulty}_results.json  # Evaluation results
```
The summary JSON contains:
```json
{
  "model_name": "deep_think_checkpoint-step-600-epoch-3",
  "pass_at_1": 0.3171,
  "stats": {
    "total": 41,
    "passed": 13,
    "failed": 28,
    "error": 0,
    "no_tests": 0
  }
}
```
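To compare checkpoints without opening each file by hand, these records can be collected with a few lines of standard-library Python. The following is only a sketch, not part of the pipeline: it assumes the directory layout and field names shown above (`model_name`, `pass_at_1`, `stats`), and that a result file holds either a single record or a list of them.

```python
# Sketch: collect pass@1 numbers from the results directory described above.
# Adjust the path if you changed --output_dir.
import json
from pathlib import Path

results_dir = Path("./results/livecodebench")
candidates = [results_dir / "summary.json"]
candidates += sorted((results_dir / "evaluations").glob("*_results.json"))

for path in candidates:
    if not path.is_file():
        continue
    data = json.loads(path.read_text())
    records = data if isinstance(data, list) else [data]
    for record in records:
        if "pass_at_1" not in record:
            continue
        stats = record.get("stats", {})
        print(f"{record.get('model_name', path.stem)}: "
              f"pass@1={record['pass_at_1']:.4f} "
              f"({stats.get('passed', '?')}/{stats.get('total', '?')} passed)")
```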
If you encounter OOM errors, try:
- Use a smaller batch size (already set to 1 by default)
- Reduce `max_new_tokens` in the config
- Use a GPU with more VRAM
Ensure your checkpoint directory structure matches the expected format:

```
{checkpoint_dir}/{model_type}/checkpoints/checkpoint-step-{step}-epoch-{epoch}/
```
If you used different model type names during training, update the `model_types` tuple in both `livecodebench_eval.py` and `run_all_evaluations.py`.
If Flash Attention 2 fails to load, the scripts will still work but may be slower. To explicitly disable it:

```python
use_flash_attention_2=False
```

MIT License