RoPE Long Context Evaluation Suite

A comprehensive evaluation framework for RoPE (Rotary Position Embedding) extension methods with support for state-of-the-art benchmarks and hyperparameter optimization.

Overview

This suite provides a unified framework for evaluating and comparing different RoPE scaling methods on long context tasks. It supports both synthetic benchmarks (NIAH, RULER) and real-world datasets (LongBench), with built-in hyperparameter sweeping and comprehensive result analysis.

Key Features

6 RoPE Extension Methods: Linear Interpolation, NTK-Aware, YaRN, LongRoPE, Dynamic NTK, Llama3 scaling
4 Benchmark Types: Traditional retrieval, NIAH (Needle-in-Haystack), RULER synthetic tasks, LongBench real-world tasks
Advanced Evaluation: Hyperparameter sweeping with grid/random search and parallel execution
Production Ready: Offline dataset support, NFS integration, GPU cluster deployment
CUDA OOM Handling: Automatic detection and graceful handling of out-of-memory errors with context length skipping
Offline Data Support: Load LongBench data from local JSONL files for air-gapped environments
Rich Analytics: Detailed metrics, visualizations, comparative analysis across methods
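
The OOM handling with context length skipping described above can be pictured with the following minimal sketch. It is not the suite's actual implementation: `is_oom_error`, `evaluate_with_oom_skipping`, and `fake_run` are hypothetical names. The sketch relies on PyTorch reporting CUDA out-of-memory conditions as a `RuntimeError` (or subclass) whose message contains "out of memory", and assumes that once one context length fails, longer ones will fail too.

```python
from typing import Callable, Dict, Iterable, Optional


def is_oom_error(exc: BaseException) -> bool:
    """Heuristic check: a RuntimeError whose message mentions 'out of memory'."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def evaluate_with_oom_skipping(
    run_one: Callable[[int], float],
    context_lengths: Iterable[int],
) -> Dict[int, Optional[float]]:
    """Evaluate context lengths in ascending order.

    After the first OOM, every longer context is recorded as skipped (None),
    on the assumption that memory use grows with sequence length.
    """
    results: Dict[int, Optional[float]] = {}
    oom_hit = False
    for length in sorted(context_lengths):
        if oom_hit:
            results[length] = None  # skipped, not attempted
            continue
        try:
            results[length] = run_one(length)
        except RuntimeError as exc:
            if not is_oom_error(exc):
                raise  # unrelated failures should still surface
            oom_hit = True
            results[length] = None
    return results


def fake_run(length: int) -> float:
    # Stand-in for a real evaluation; pretends anything above 8000 tokens does not fit.
    if length > 8000:
        raise RuntimeError("CUDA out of memory (simulated)")
    return 0.5


results = evaluate_with_oom_skipping(fake_run, [4000, 8000, 16000, 32000])
print(results)  # {4000: 0.5, 8000: 0.5, 16000: None, 32000: None}
```

In a real run, calling `torch.cuda.empty_cache()` after catching the error is a common way to release cached blocks before moving on to the next configuration.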

Quick Start

Prerequisites

Python ≥3.9
CUDA-compatible GPU with ≥8GB VRAM
PyTorch ≥2.0.0

Installation

git clone https://github.com/manncodes/rope_long_context_evaluation_suite.git
cd rope_long_context_evaluation_suite

# Option 1: Direct installation (recommended)
./install_direct.sh

# Option 2: UV-based setup (for advanced users)  
./setup_comprehensive_evaluation.sh

Configuration

Edit comprehensive_config.yaml for your setup:

model:
  name: "llama-3.2-1b"
  path: "unsloth/Llama-3.2-1B"  # HuggingFace model or local path
  device: "cuda"
  torch_dtype: "bfloat16"

datasets:
  longbench:
    path: "data/longbench"  # Update for your dataset location
    tasks: ["narrativeqa", "qasper", "multifieldqa_en"]

rope_methods:
  - name: "yarn"
    config:
      scaling_factor: 2.0
      attention_factor: 0.1
  - name: "ntk_aware" 
    config:
      scaling_factor: 2.0
      alpha: 8.0

Usage

Single Model Evaluation

# Quick validation
python scripts/validation/validate_imports.py

# Single evaluation with specific parameters
python run_evaluation.py --model "meta-llama/Llama-2-7b-hf" --benchmarks niah ruler --rope-method yarn

Comprehensive Parameter Sweep

For systematic evaluation across all RoPE methods, context lengths, and benchmarks:

# Quick test sweep (3 RoPE methods × 2 context lengths × 1 benchmark = 6 runs)
python run_comprehensive_sweep.py --config sweep_configs/quick_test_sweep.yaml

# Full comprehensive sweep (14 RoPE methods × 4 context lengths × 3 benchmarks = 168 runs)  
python run_comprehensive_sweep.py --config sweep_configs/full_sweep.yaml

# Custom filtered sweeps
python run_comprehensive_sweep.py --rope-methods none linear ntk_aware
python run_comprehensive_sweep.py --context-lengths 4000 8000
python run_comprehensive_sweep.py --benchmarks niah ruler
python run_comprehensive_sweep.py --max-runs 10  # Limit for testing
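
The run counts quoted in the comments above are the size of a Cartesian product over methods, context lengths, and benchmarks. A minimal sketch of that enumeration follows; `build_sweep` and the dictionary keys are hypothetical and are not taken from `run_comprehensive_sweep.py`.

```python
from itertools import islice, product
from typing import Dict, List, Optional, Sequence


def build_sweep(
    rope_methods: Sequence[str],
    context_lengths: Sequence[int],
    benchmarks: Sequence[str],
    max_runs: Optional[int] = None,
) -> List[Dict[str, object]]:
    """Enumerate every (method, context length, benchmark) combination."""
    grid = product(rope_methods, context_lengths, benchmarks)
    if max_runs is not None:
        grid = islice(grid, max_runs)  # mirrors the idea behind --max-runs
    return [
        {"rope_method": m, "context_length": n, "benchmark": b}
        for m, n, b in grid
    ]


quick = build_sweep(["none", "linear", "ntk_aware"], [4000, 8000], ["niah"])
print(len(quick))  # 6 = 3 methods x 2 context lengths x 1 benchmark
```

Because the product grows multiplicatively, adding a single context length to a 14-method, 3-benchmark sweep adds 42 runs, which is why a run cap is useful while testing.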

Sweep Results Analysis

The comprehensive sweep generates:

Individual run results: Detailed results for each parameter combination
Performance analysis: Statistical comparison across all dimensions
Best configurations: Top-performing RoPE method + context length combinations
Method comparison: Rankings and statistical significance tests
Context scaling: Performance trends across different sequence lengths

Results saved in: comprehensive_results/comprehensive_sweep_YYYYMMDD_HHMMSS/
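
The `YYYYMMDD_HHMMSS` suffix is fixed-width and zero-padded, so directory names sort chronologically as plain strings. The sketch below shows how such a name can be produced and how the most recent sweep can be located; both helper functions are hypothetical and only the directory naming pattern comes from this README.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def sweep_output_dir(
    root: str = "comprehensive_results", now: Optional[datetime] = None
) -> Path:
    """Build a directory path of the form <root>/comprehensive_sweep_YYYYMMDD_HHMMSS."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"comprehensive_sweep_{stamp}"


def latest_sweep_dir(root: str = "comprehensive_results") -> Optional[Path]:
    """Return the newest sweep directory, or None if there is none yet."""
    candidates = sorted(
        (p for p in Path(root).glob("comprehensive_sweep_*") if p.is_dir()),
        key=lambda p: p.name,  # lexicographic order equals chronological order here
    )
    return candidates[-1] if candidates else None


example = sweep_output_dir(now=datetime(2025, 8, 20, 14, 5, 9))
print(example.name)  # comprehensive_sweep_20250820_140509
```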

Supported Methods

RoPE Extensions

Linear Interpolation: Simple position scaling
NTK-Aware: Frequency-dependent scaling with alpha parameter
YaRN: Adaptive interpolation with attention factor and beta parameters
LongRoPE: Evolutionary search with short/long factors
Dynamic NTK: Runtime adaptation based on sequence length
Llama3: Official Llama 3 scaling method
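
To make the difference between the first two methods concrete, here is a minimal sketch of the commonly published formulas for linear interpolation and NTK-aware scaling, written with the standard library only. It is not this suite's code, and how the suite's `scaling_factor` and `alpha` settings map onto the `scale` argument is an assumption left open here.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def linear_interpolation(head_dim: int, scale: float, base: float = 10000.0) -> List[float]:
    """Dividing positions by `scale` is equivalent to dividing every frequency by `scale`."""
    return [f / scale for f in rope_inv_freq(head_dim, base)]


def ntk_aware(head_dim: int, scale: float, base: float = 10000.0) -> List[float]:
    """Enlarge the base so the lowest frequency shrinks by `scale`
    while the highest frequency is left untouched."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, new_base)


original = rope_inv_freq(64)
linear = linear_interpolation(64, scale=2.0)
ntk = ntk_aware(64, scale=2.0)

# Linear interpolation compresses every frequency by the same factor.
assert all(math.isclose(l, o / 2.0) for l, o in zip(linear, original))
# NTK-aware scaling keeps the highest frequency and compresses the lowest by `scale`.
assert ntk[0] == original[0] == 1.0
assert math.isclose(ntk[-1], original[-1] / 2.0)
```

The two assertions at the end capture the trade-off: linear interpolation blurs fine-grained local position information uniformly, whereas NTK-aware scaling spreads the compression unevenly so that high-frequency components are mostly preserved.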

Benchmarks

Traditional Retrieval: Synthetic passkey retrieval tasks
NIAH: Needle In A Haystack with multi-needle and NoLiMa variants
RULER: Synthetic benchmark with retrieval, multi-hop, aggregation, QA
LongBench: Real-world long context tasks (12 core tasks)
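
For readers unfamiliar with passkey and needle-style tasks, the following sketch shows the basic idea: a short fact is placed at a chosen relative depth inside repetitive filler text, and the model's answer is checked for that fact. The function names, the sentence-level depth, and the substring scoring rule are illustrative assumptions; the suite's benchmarks may measure depth in tokens and score differently.

```python
def build_niah_prompt(filler_sentence: str, needle: str, num_fillers: int, depth: float) -> str:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler_sentence] * num_fillers
    sentences.insert(round(depth * num_fillers), needle)
    return " ".join(sentences)


def score_retrieval(model_output: str, expected: str) -> float:
    """Exact-substring scoring: 1.0 if the expected answer appears, else 0.0."""
    return 1.0 if expected in model_output else 0.0


prompt = build_niah_prompt(
    filler_sentence="The grass is green.",
    needle="The passkey is 71432.",
    num_fillers=10,
    depth=0.5,
)
question = prompt + " What is the passkey?"
print(score_retrieval("The passkey is 71432.", "71432"))  # 1.0
```

Sweeping `depth` and the amount of filler gives the familiar depth-by-length grid used to visualize where retrieval starts to fail.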

Results & Analysis

The suite generates comprehensive results with multiple output formats:

Generated Outputs

JSON Results: Detailed metrics with full configuration traces
CSV Summaries: Tabular data for easy analysis and plotting
Visualizations: Performance heatmaps, method comparisons, context length scaling
Statistical Reports: Detailed performance statistics and confidence intervals
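
As an illustration of the kind of statistic such a report can contain, the sketch below computes a mean with a normal-approximation confidence interval over per-sample scores. How the suite actually derives its intervals is not stated here, so `mean_with_ci` and the choice of z = 1.96 (roughly 95%) are assumptions.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using the standard error of the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * sem, mean + z * sem


# Hypothetical per-sample accuracies for one method at one context length.
mean, low, high = mean_with_ci([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])
print(f"{mean:.3f} [{low:.3f}, {high:.3f}]")
```

With small sample counts the interval is wide, which is worth keeping in mind when two methods differ by only a few points.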

Sample Results Structure

comprehensive_results/
├── llama32_comprehensive_results_20250820.json    # Detailed JSON metrics
├── comprehensive_analysis.png                     # Overview visualization
├── method_comparison_detailed.png                 # Side-by-side comparison
├── perplexity_heatmap.png                         # Context length analysis
└── detailed_statistics.txt                        # Statistical summary

Performance Insights

Based on extensive evaluations with TinyLlama 1.1B and Llama 3.2:

YaRN consistently outperforms other methods across context lengths
NTK-Aware provides good balance between performance and simplicity
Linear Interpolation works well for moderate context extensions
Context length scaling varies significantly by method and model size

Configuration

The framework uses YAML configuration for flexibility:

# Hardware optimization
hardware:
  num_gpus: 1
  gpu_memory_fraction: 0.9
  mixed_precision: true

# Evaluation settings
evaluation:
  batch_size: 1
  max_context_length: 32768
  gradient_checkpointing: true
  use_cache: false

# Dataset configuration
datasets:
  retrieval:
    context_lengths: [4000, 8000, 16000, 32000]
    num_samples: 50
  
  longbench:
    path: "/nfs/datasets/longbench"
    tasks: ["narrativeqa", "qasper", "multifieldqa_en"]

Offline Deployment

For GPU clusters with restricted internet access:

Set offline environment variables:

export HF_OFFLINE=1
export HF_HUB_OFFLINE=1  # Hugging Face Hub's own offline switch
export LONGBENCH_DATA_PATH=/nfs/datasets/longbench
export MODEL_CACHE_PATH=/nfs/models

Use pre-downloaded datasets and models with NFS paths in configuration
The framework automatically handles offline mode with local dataset loading
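
Local dataset loading in offline mode can be as simple as reading one JSON object per line. The sketch below assumes each task is stored as `<task>.jsonl` under the directory given by `LONGBENCH_DATA_PATH`; that file naming and the `load_longbench_task` helper are assumptions for illustration, not the suite's documented layout.

```python
import json
import os
from pathlib import Path
from typing import Dict, List, Optional


def load_longbench_task(task: str, data_dir: Optional[str] = None) -> List[Dict]:
    """Read every record of one task from a local JSONL file (no network access)."""
    root = Path(data_dir or os.environ.get("LONGBENCH_DATA_PATH", "data/longbench"))
    path = root / f"{task}.jsonl"  # assumed naming convention
    if not path.is_file():
        raise FileNotFoundError(f"no local data for task '{task}' at {path}")
    records: List[Dict] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

Failing loudly with `FileNotFoundError` when a task file is missing is a deliberate choice for air-gapped clusters: a silent fallback to a network download would hang or time out there.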

Development

Project Structure

rope_long_context_evaluation_suite/
├── src/rope_long_context_evaluation_suite/
│   ├── benchmarks/          # NIAH, RULER, LongBench implementations
│   ├── metrics/             # Perplexity, passkey, longppl metrics
│   ├── sweep/               # Hyperparameter optimization
│   └── core.py              # Main RoPEEvaluator class
├── scripts/
│   ├── analysis/            # Result analysis and plotting
│   ├── demos/               # Usage examples and demos
│   └── validation/          # Setup and import validation
├── examples/                # Configuration examples and tutorials
└── comprehensive_config.yaml    # Main configuration file

Architecture Overview

RoPEEvaluator: Main evaluation orchestrator handling model loading, RoPE application, and benchmark execution
Benchmark System: Modular benchmark implementations with consistent interfaces
Sweep Framework: Grid search and random search with parallel execution support
Configuration: YAML-based configuration with comprehensive validation

Performance & Optimization

Framework Optimizations

Flash Attention 2: Efficient attention computation for long sequences
Memory Management: Gradient checkpointing and automatic batch sizing
Mixed Precision: FP16/BF16 support for memory efficiency
Parallel Execution: Multi-GPU and multi-process sweep execution
Caching: Intelligent model and dataset caching

Resource Requirements

Model Size	GPU Memory	Recommended GPU	Max Context Length
1B models	~8GB	RTX 3080/4070	32K tokens
3B models	~16GB	RTX 4080/A100	32K tokens
7B models	~32GB	A100 80GB	16K tokens
13B+ models	~64GB	Multi-GPU	8K tokens
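
A rough back-of-the-envelope estimate helps explain why the maximum context length in the table shrinks as models grow. The sketch below covers only weights and the key/value cache; activations, framework overhead, and fragmentation are ignored, so real usage will be higher. The layer and head counts in the example are illustrative values, not figures taken from this suite; check the model's `config.json` for the real ones.

```python
def weight_bytes(num_params: int, bytes_per_value: int = 2) -> int:
    """Memory for model weights (2 bytes per value for bfloat16/float16)."""
    return num_params * bytes_per_value


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,
    batch_size: int = 1,
) -> int:
    """Memory for the KV cache: one key and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


GIB = 2 ** 30

# Illustrative small model: 16 layers, 8 KV heads, head dimension 64, bfloat16.
cache = kv_cache_bytes(num_layers=16, num_kv_heads=8, head_dim=64, seq_len=32768)
print(cache / GIB)  # 1.0

# Doubling the context length doubles the cache.
assert kv_cache_bytes(16, 8, 64, 65536) == 2 * cache
```

The cache grows linearly with sequence length and with layer count, which is one reason disabling it (`use_cache: false`) can help single-pass evaluations of very long inputs fit in memory.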

Performance Tips

Use bfloat16 for optimal memory/accuracy balance
Enable gradient checkpointing for longer contexts
Set use_cache: false when evaluating very long sequences
Consider model compilation with PyTorch 2.0+ for additional speedup

License

MIT License

Contributing

Contributions are welcome! Please see our contribution guidelines:

Fork & Create Branch: Fork the repo and create a feature branch
Add Tests: Include tests for new functionality
Follow Style: Use Black formatting and type hints
Documentation: Update docs for new features
Submit PR: Create a pull request with clear description

Citation

If you use this framework in your research, please cite:

@software{rope_evaluation_suite,
  title = {RoPE Long Context Evaluation Suite},
  author = {Mann Patel},
  year = {2024},
  url = {https://github.com/manncodes/rope_long_context_evaluation_suite},
  note = {A comprehensive framework for evaluating RoPE scaling methods}
}

Acknowledgments

Original RoPE paper and implementations
LongBench, NIAH, and RULER benchmark authors
PyTorch and Transformers library developers
Flash Attention authors for efficient attention implementation
