<a href="https://colab.research.google.com/github/mattdepaolis/llm-tutorials/blob/main/LLM_Evaluation_Framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🚀 Getting Started

**Installation**

1. Clone the repository:


In [None]:
!git clone https://github.com/mattdepaolis/llm-evaluation.git
# `!cd` runs in a throwaway subshell; `%cd` changes the notebook's working directory
%cd llm-evaluation

2. Set up a virtual environment:

In [None]:
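# Each `!` command runs in its own subshell, so this activation does not persist in a notebook.
# Run both commands in a terminal (without `!`) when working locally.
!python -m venv .venv
!source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`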

3. Install dependencies:

In [None]:
!pip install -e lm-evaluation-harness
!pip install torch numpy tqdm transformers accelerate bitsandbytes sentencepiece
!pip install vllm  # If you plan to use the vLLM backend
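
After installing, a quick check that the packages resolve can save a failed run later. The sketch below uses only the standard library. The helper name `missing_modules` is illustrative and not part of the framework, and the module list simply mirrors the `pip install` lines above (`lm_eval` is the import name of lm-evaluation-harness; `vllm` is left out because it is optional).

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Import names for the packages installed above (illustrative list)
required = ["lm_eval", "torch", "numpy", "tqdm", "transformers",
            "accelerate", "bitsandbytes", "sentencepiece"]

missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, re-run the install cell in the same environment that the notebook kernel uses.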

## 🧪 Example: Evaluating Your Model on the LEADERBOARD Benchmark

**Using the Command-Line Interface (CLI)**

Let's illustrate how the LLM Evaluation Framework simplifies benchmarking by replicating the popular Hugging Face Open LLM Leaderboard setup, which is particularly useful given the leaderboard's recent discontinuation. The following CLI example runs the complete leaderboard evaluation:

In [None]:
!python llm_eval_cli.py \
  --model hf \
  --model_name meta-llama/Llama-2-13b-chat-hf \
  --leaderboard \
  --device cuda \
  --gpu_memory_utilization 0.9  # Adjust based on your GPU availability

With this single command, the framework evaluates your model across several key benchmarks, including BBH, GPQA, MMLU-Pro, MuSR, IFEval, and MATH Lvl 5, and automatically configures the appropriate number of few-shot examples for each benchmark.
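
For orientation, the few-shot counts documented for the Open LLM Leaderboard v2 can be written as a small lookup table. This is only a sketch: the dictionary `LEADERBOARD_FEWSHOT` and the helper `fewshot_for` are hypothetical names, not the framework's internals, and the framework's own defaults take precedence if they differ.

```python
# Few-shot counts as documented for the Open LLM Leaderboard v2.
# The dictionary and helper are illustrative, not the framework's internals.
LEADERBOARD_FEWSHOT = {
    "BBH": 3,
    "GPQA": 0,
    "MMLU-Pro": 5,
    "MuSR": 0,
    "IFEval": 0,
    "MATH Lvl 5": 4,
}


def fewshot_for(benchmark: str) -> int:
    """Look up the few-shot count for a benchmark, ignoring case."""
    by_lower = {name.lower(): shots for name, shots in LEADERBOARD_FEWSHOT.items()}
    try:
        return by_lower[benchmark.lower()]
    except KeyError:
        raise ValueError(f"Unknown leaderboard benchmark: {benchmark!r}") from None


for name, shots in LEADERBOARD_FEWSHOT.items():
    print(f"{name}: {shots}-shot")
```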


**Using as a Python Library**

Integrate the evaluation logic directly into your Python scripts:

In [None]:
import os

from llm_eval import evaluate_model
from llm_eval.reporting.report_generator import get_reports_dir

# Run the evaluation
results, output_path = evaluate_model(
    model_type="hf",
    model_name="mistralai/Ministral-8B-Instruct-2410",
    tasks=["leaderboard"],
    num_samples=1,
    device="cuda",
    quantize=True,
    quantization_method="4bit",
    preserve_default_fewshot=True  # This ensures the correct few-shot settings for each benchmark task
)

# Print the path to the saved results
print(f"Results saved to: {output_path}")

# The report path is derived from the output path:
# take the base filename without its extension
basename = os.path.splitext(os.path.basename(output_path))[0]

# Construct the report path
reports_dir = get_reports_dir()
report_path = os.path.join(reports_dir, f"{basename}_report.md")

if os.path.exists(report_path):
    print(f"Report generated at: {report_path}")
else:
    print("Report was not generated. Check if there were any errors during evaluation.")
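
The structure of `results` is not described here. If it follows the lm-evaluation-harness convention of a nested dictionary (task name mapped to metric names and values), a helper along these lines could flatten it for printing. The function name `flatten_metrics` and the sample data are made up for illustration and only show the assumed shape.

```python
def flatten_metrics(results: dict) -> list[tuple[str, str, float]]:
    """Flatten {task: {metric: value}} into sorted (task, metric, value) rows.

    Non-numeric entries (for example an alias string) are skipped.
    """
    rows = []
    for task, metrics in results.items():
        if not isinstance(metrics, dict):
            continue
        for metric, value in metrics.items():
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, metric, float(value)))
    return sorted(rows)


# Made-up sample, only to show the assumed shape (not real scores)
sample = {
    "task_a": {"acc": 0.5, "alias": "task_a"},
    "task_b": {"exact_match": 0.25},
}
for task, metric, value in flatten_metrics(sample):
    print(f"{task:10s} {metric:12s} {value:.3f}")
```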

## 🔧 Extending the Framework

The modular design makes it easier to add new functionalities:

1. Adding New Tasks/Benchmarks:
   - Define the task configuration in `llm_eval/tasks/task_registry.py` or a similar configuration file.
   - Ensure the task is compatible with the lm-evaluation-harness structure or adapt it.
2. Supporting New Model Backends:
   - Create a new model handler class in `llm_eval/models/` inheriting from a base model class (if applicable).
   - Implement the required methods for loading, inference, etc.
   - Register the new backend type.
3. Customizing Reporting:
   - Modify the report generation logic in `llm_eval/reporting/` to change the format or content of the Markdown/JSON outputs.
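
The backend steps above (subclass a base class, implement loading and inference, register the type) follow a common registry pattern. The sketch below is self-contained and does not use the framework's actual classes: `BaseModelHandler`, `register_backend`, `create_handler`, and `EchoHandler` are hypothetical names chosen for illustration, so check `llm_eval/models/` for the real interface.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a backend name to its handler class
BACKENDS: dict[str, type] = {}


def register_backend(name: str):
    """Class decorator that records a handler class under a backend name."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator


class BaseModelHandler(ABC):
    """Hypothetical base class: subclasses implement load() and generate()."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


@register_backend("echo")
class EchoHandler(BaseModelHandler):
    """Toy backend that returns the prompt unchanged; stands in for a real model."""

    def load(self) -> None:
        self.loaded = True

    def generate(self, prompt: str) -> str:
        return prompt


def create_handler(backend: str, model_name: str) -> BaseModelHandler:
    """Instantiate and load the handler registered under `backend`."""
    try:
        handler_cls = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"Unknown backend: {backend!r}") from None
    handler = handler_cls(model_name)
    handler.load()
    return handler


handler = create_handler("echo", "dummy-model")
print(handler.generate("hello"))
```

Keeping registration in a decorator means a new backend only needs its module to be imported; the lookup code does not change.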

## 🤝 Contributing

Contributions are welcome! Please follow standard practices:

1. Fork the repository.
2. Create a new branch for your feature or bug fix (`git checkout -b feature/my-new-feature`).
3. Make your changes and commit them (`git commit -am 'Add some feature'`).
4. Push to the branch (`git push origin feature/my-new-feature`).
5. Create a new Pull Request.