---
aliases:
- /eval/
categories:
- LLM Evaluation
date: '2025-05-01'
colab: <a href="https://colab.research.google.com/drive/12t6aLEcGJrDWCHcBv-JBpV2Z6tZkHke5?usp=sharing"><img src="images/colab.png" alt="Open In Colab"></a>
image: /images/eval/thumbnail.jpg
title: "LLM Evaluation Framework"
subtitle: "Replicate the Hugging Face Open LLM Leaderboard Locally"
---

<center>
    <img src="images/eval/thumbnail.jpg" alt="Image">
</center>


The discontinuation of Hugging Face’s Open LLM Leaderboard has left the community without a standardized way to evaluate large language models (LLMs). To address this, I developed the LLM Evaluation Framework, a modular tool for reproducible and extensible benchmarking of LLMs across a range of tasks.

## 🧩 Why This Framework Matters

The Open LLM Leaderboard was instrumental in providing a centralized platform for evaluating and comparing LLMs. Its retirement has underscored the need for tools that allow researchers and developers to conduct their own evaluations with transparency and consistency. The LLM Evaluation Framework aims to fill this void by offering:

- **Modular Design**: Inspired by microservice architecture, enabling easy integration and customization.
- **Multiple Model Backends**: Support for Hugging Face (`hf`) and vLLM backends, allowing flexibility in model loading and inference.
- **Quantization Support**: Evaluate quantized models (e.g., 4-bit and 8-bit with `hf`, AWQ with vLLM) to assess performance under resource constraints.
- **Comprehensive Benchmarks**: Includes support for standard benchmarks like MMLU, GSM8K, BBH, and more.
- **Leaderboard Replication**: Easily run evaluations mimicking the Open LLM Leaderboard setup with standardized few-shot settings.
- **Flexible Configuration**: Customize evaluations via CLI arguments or programmatic usage.
- **Detailed Reporting**: Generates JSON results and Markdown reports for easy analysis.
- **Parallelism**: Leverages vLLM for efficient inference, including tensor parallelism across multiple GPUs.

## 🚀 Getting Started

**Installation**

1. Clone the repository:

```python
!git clone https://github.com/mattdepaolis/llm-evaluation.git
# Use %cd so the directory change persists; `!cd` only affects its own subshell
%cd llm-evaluation
```

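2. Set up a virtual environment:

```python
# Each `!` command runs in its own subshell, so activation does not persist
# across notebook cells; in a terminal, run these without the leading `!`.
!python -m venv .venv
!source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
```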

3. Install dependencies:

```python
!pip install -e lm-evaluation-harness
!pip install torch numpy tqdm transformers accelerate bitsandbytes sentencepiece
!pip install vllm  # If you plan to use the vLLM backend
```
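
After installing, it can help to confirm that the packages are importable before launching a long evaluation. The following is a minimal sketch using only the standard library. The module names are the usual import names for the packages installed above (`lm_eval` is the module provided by lm-evaluation-harness), and the helper `check_packages` is hypothetical, not part of the framework.

```python
from importlib.util import find_spec

# Import names for the packages installed above. `lm_eval` is the module
# provided by lm-evaluation-harness; `vllm` is only needed for that backend.
REQUIRED = ["lm_eval", "torch", "numpy", "tqdm", "transformers",
            "accelerate", "bitsandbytes", "sentencepiece"]
OPTIONAL = ["vllm"]


def check_packages(names):
    """Return {name: bool} telling whether each top-level module can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED + OPTIONAL).items():
        label = "ok" if found else "MISSING"
        print(f"{name:15s} {label}")
```

A missing `vllm` entry is only a problem if you intend to use the vLLM backend.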

## 🧪 Example Usage

**Using the Command-Line Interface (CLI)**

Evaluate a model on the HellaSwag benchmark:

```python
!python llm_eval_cli.py \
  --model hf \
  --model_name google/gemma-2b \
  --tasks hellaswag \
  --num_fewshot 0 \
  --device cuda  # Use 'cpu' if you don't have a GPU
```

This command downloads the `google/gemma-2b` model (if it is not already cached), runs it on the HellaSwag benchmark with 0 few-shot examples, and saves the results in the `results/` and `reports/` directories.
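
To mimic the leaderboard's few-shot settings by hand, you can generate one such command per benchmark. The sketch below uses only the flags shown in the command above, together with the few-shot counts of the original (v1) Open LLM Leaderboard. The task names follow lm-evaluation-harness, and it is an assumption that `llm_eval_cli.py` accepts them unchanged. The framework may also provide its own shortcut for this, and `build_commands` is a hypothetical helper.

```python
import shlex

# Few-shot counts used by the original (v1) Open LLM Leaderboard.
# Task names follow lm-evaluation-harness; that llm_eval_cli.py accepts
# them unchanged is an assumption.
LEADERBOARD_FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}


def build_commands(model_name, backend="hf", device="cuda"):
    """Build one CLI invocation per task, using only the flags shown above."""
    commands = []
    for task, shots in LEADERBOARD_FEWSHOT.items():
        commands.append([
            "python", "llm_eval_cli.py",
            "--model", backend,
            "--model_name", model_name,
            "--tasks", task,
            "--num_fewshot", str(shots),
            "--device", device,
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    for cmd in build_commands("google/gemma-2b"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch side-effect free; each printed line can be pasted into a terminal or passed to `subprocess.run`.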


**Using as a Python Library**

Integrate the evaluation logic directly into your Python scripts:

```python
from llm_eval import evaluate_model
import os

# Define evaluation parameters
eval_config = {
    "model_type": "hf",
    "model_name": "google/gemma-2b-it",
    "tasks": ["mmlu", "gsm8k"],
    "num_fewshot": 0,
    "device": "cuda",
    "quantize": True,
    "quantization_method": "4bit",
    "batch_size": "auto",
    "output_dir": "./custom_results"  # Optional: Specify output location
}

# Run the evaluation
try:
    results_summary, results_file_path = evaluate_model(**eval_config)

    print("Evaluation completed successfully!")
    print(f"Results summary: {results_summary}")
    print(f"Detailed JSON results saved to: {results_file_path}")

    # Guess the report path: same base name, with 'results' swapped
    # for 'reports' in the directory part of the results path
    base_name = os.path.splitext(os.path.basename(results_file_path))[0]
    report_dir = os.path.dirname(results_file_path).replace('results', 'reports')
    report_file_path = os.path.join(report_dir, f"{base_name}_report.md")

    if os.path.exists(report_file_path):
        print(f"Markdown report saved to: {report_file_path}")
    else:
        print("Markdown report not found at expected location.")

except Exception as e:
    print(f"An error occurred during evaluation: {e}")
```
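
The same call can be repeated over several models. The sketch below assumes only what the example above shows: that the evaluation function takes keyword arguments and returns a `(summary, results_file_path)` pair. The helper `run_sweep` is hypothetical and takes the evaluation function as a parameter.

```python
def run_sweep(evaluate_fn, base_config, model_names):
    """Call evaluate_fn once per model name; return (successes, failures)."""
    successes, failures = {}, {}
    for name in model_names:
        # Copy the shared settings and override only the model name.
        config = {**base_config, "model_name": name}
        try:
            summary, results_path = evaluate_fn(**config)
        except Exception as exc:
            # Record the error and continue so one failure does not end the sweep.
            failures[name] = str(exc)
        else:
            successes[name] = {"summary": summary, "results_file": results_path}
    return successes, failures
```

With the configuration above, this could be called as `run_sweep(evaluate_model, eval_config, ["google/gemma-2b", "google/gemma-2b-it"])`.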

## 📊 Reporting and Results

The framework generates:
- **JSON Results**: Detailed results for each task, including individual sample predictions (if applicable), metrics, and configuration details, saved in the `results/` directory.
- **Markdown Reports**: A summary report aggregating scores across tasks, generated in the `reports/` directory.
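
For quick inspection outside the generated report, the JSON output can be loaded and summarized directly. The sketch below assumes the file has a top-level `"results"` key mapping each task to a dictionary of metrics, which is the layout lm-evaluation-harness uses; the framework's actual schema may differ. Both helper functions are hypothetical.

```python
import json
from pathlib import Path


def latest_results_file(results_dir="results"):
    """Return the most recently modified .json file in results_dir, or None."""
    directory = Path(results_dir)
    if not directory.is_dir():
        return None
    files = sorted(directory.glob("*.json"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def to_markdown_table(results):
    """Render {task: {metric: value}} as a Markdown table (numeric metrics only)."""
    lines = ["| Task | Metric | Value |", "| --- | --- | --- |"]
    for task in sorted(results):
        for metric in sorted(results[task]):
            value = results[task][metric]
            # Skip non-numeric entries such as aliases or nested settings.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                lines.append(f"| {task} | {metric} | {value:.4f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    path = latest_results_file()
    if path is None:
        print("No JSON results found.")
    else:
        data = json.loads(path.read_text())
        # Assumed layout: {"results": {task: {metric: value}}}
        print(to_markdown_table(data.get("results", {})))
```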

## 🔧 Extending the Framework

The modular design makes it straightforward to add new functionality:

1. Adding new tasks or benchmarks:
    - Define the task configuration in `llm_eval/tasks/task_registry.py` or a similar configuration file.
    - Ensure the task is compatible with the lm-evaluation-harness structure, or adapt it.
2. Supporting new model backends:
    - Create a new model handler class in `llm_eval/models/`, inheriting from a base model class (if applicable).
    - Implement the required methods for loading, inference, etc.
    - Register the new backend type.
3. Customizing reporting:
    - Modify the report generation logic in `llm_eval/reporting/` to change the format or content of the Markdown/JSON outputs.
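
The backend steps above (subclass a base class, implement its methods, register the type) follow a common registry pattern, sketched below. Every name here (`BaseModelBackend`, `register_backend`, `create_backend`, `EchoBackend`) is hypothetical and chosen for illustration; the framework's real classes and registration mechanism may look different.

```python
from abc import ABC, abstractmethod

# All names below are hypothetical; they illustrate the pattern only.
BACKENDS = {}


def register_backend(name):
    """Class decorator that records a backend class under a short name."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator


class BaseModelBackend(ABC):
    """Interface every backend is expected to implement."""

    def __init__(self, model_name):
        self.model_name = model_name

    @abstractmethod
    def load(self):
        """Load weights or open a connection, then return self."""

    @abstractmethod
    def generate(self, prompt):
        """Return the model's completion for a prompt."""


@register_backend("echo")
class EchoBackend(BaseModelBackend):
    """Toy backend that echoes the prompt; stands in for a real model."""

    def load(self):
        return self

    def generate(self, prompt):
        return f"[{self.model_name}] {prompt}"


def create_backend(backend_type, model_name):
    """Look up a registered backend and return a loaded instance."""
    try:
        cls = BACKENDS[backend_type]
    except KeyError:
        raise ValueError(f"Unknown backend: {backend_type!r}") from None
    return cls(model_name).load()
```

With a registry like this, adding a backend means writing one decorated class; the code that selects a backend by name does not change.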

## 🤝 Contributing

Contributions are welcome! Please follow standard practices:

1. Fork the repository.
2. Create a new branch for your feature or bug fix (`git checkout -b feature/my-new-feature`).
3. Make your changes and commit them (`git commit -am 'Add some feature'`).
4. Push to the branch (`git push origin feature/my-new-feature`).
5. Create a new Pull Request.