A comprehensive framework for evaluating the safety of Large Language Models (LLMs) through systematic attack and refusal testing. RedEval provides a unified, secure, and extensible platform for assessing LLM robustness against adversarial prompts and harmful content. It accompanies the paper "RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models".
📦 Dataset: The RedBench dataset is publicly available on HuggingFace at knoveleng/redbench. It includes comprehensive red teaming prompts across multiple safety categories and domains.
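Once the dependencies below are installed, the dataset can be pulled directly with the `datasets` library. A minimal sketch (the split name and printed fields are assumptions; check the dataset card for the actual schema):

```python
# Minimal sketch: load RedBench from the HuggingFace Hub.
# The "train" split and the printed fields are assumptions; see
# https://huggingface.co/datasets/knoveleng/redbench for the real schema.
from datasets import load_dataset

redbench = load_dataset("knoveleng/redbench", split="train")
print(redbench)      # number of rows and column names
print(redbench[0])   # first red-teaming prompt record
```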
```bash
# Clone the repository
git clone https://github.com/knoveleng/redeval.git
cd redeval

# Install dependencies
pip install -r requirements.txt

# Copy and configure environment variables
cp .env.template .env
# Edit .env with your API keys
```

Edit the `.env` file with your credentials:

```bash
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here
```

```bash
# Set up environment
source ./sh/setup_env.sh

# Run complete pipeline
python -m redeval.cli run-pipeline --models "Qwen/Qwen2.5-7B-Instruct" "gpt-4o-mini"

# Or run individual phases
./sh/generate_attack.sh
./sh/run_attack.sh
./sh/eval_attack.sh
```

- Overview
- Architecture
- Installation
- Configuration
- Usage
- Security Features
- API Reference
- Contributing
- License
RedEval evaluates LLM safety through two complementary approaches:
Tests LLM vulnerability to adversarial prompts using various jailbreaking techniques:
- Direct attacks: Straightforward harmful prompts
- Human jailbreaks: Human-crafted bypass techniques
- Zero-shot attacks: Automated adversarial prompt generation
Evaluates the LLM's ability to appropriately refuse harmful requests across multiple safety datasets:
- CoCoNot: Context-aware content moderation
- SGXSTest: Safety guidelines examination
- XSTest: Cross-domain safety testing
- ORBench: Over-refusal benchmarking
Calculates comprehensive safety metrics combining both attack and refusal performance.
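The actual metric is computed by `redeval/score.py`; purely as an illustration of the idea (the function name and equal weighting below are assumptions, not the project's implementation), a combined score can trade off attack robustness against refusal accuracy:

```python
# Hypothetical illustration only: combine attack and refusal results
# into one safety number. redeval/score.py defines the real metric.
def combined_safety_score(attack_successes: int, attack_total: int,
                          correct_refusals: int, refusal_total: int) -> float:
    attack_robustness = 1.0 - attack_successes / attack_total  # higher is safer
    refusal_accuracy = correct_refusals / refusal_total         # higher is safer
    return 0.5 * attack_robustness + 0.5 * refusal_accuracy     # assumed equal weights

print(combined_safety_score(12, 100, 85, 100))  # -> 0.865
```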
```
redeval/
├── redeval/                 # Core Python modules
│   ├── config.py            # Environment & configuration management
│   ├── pipeline.py          # Centralized pipeline orchestrator
│   ├── cli.py               # Unified command-line interface
│   ├── exceptions.py        # Custom exception handling
│   ├── generate_attack.py   # Attack prompt generation
│   ├── run_attack.py        # Attack execution
│   ├── eval_attack.py       # Attack evaluation
│   ├── run_refuse.py        # Refusal testing
│   ├── eval_refuse.py       # Refusal evaluation
│   └── score.py             # Metric calculation
├── sh/                      # Shell script interfaces
│   ├── setup_env.sh         # Environment setup
│   ├── generate_attack.sh   # Attack generation script
│   ├── run_attack.sh        # Attack execution script
│   ├── eval_attack.sh       # Attack evaluation script
│   ├── run_refuse.sh        # Refusal testing script
│   ├── eval_refuse.sh       # Refusal evaluation script
│   └── score.sh             # Scoring script
├── recipes/                 # Configuration files
├── logs/                    # Evaluation results
└── .env                     # Environment variables
```
```mermaid
graph TD
    A[Generate Attack Prompts] --> B[Run Attack Tests]
    B --> C[Evaluate Attack Results]
    D[Run Refuse Tests] --> E[Evaluate Refuse Results]
    C --> F[Calculate Final Scores]
    E --> F
    F --> G[Generate Reports]
```
- Python 3.8 or higher
- OpenAI API key
- HuggingFace token
- Git
- Clone Repository

  ```bash
  git clone <repository-url>
  cd redeval
  ```

- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure Environment

  ```bash
  cp .env.template .env
  # Edit .env with your API credentials
  ```

- Verify Installation

  ```bash
  python -m redeval.cli --help
  ```
RedEval uses environment variables for secure and flexible configuration:
| Variable | Description | Example |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | `sk-proj-...` |
| `HUGGINGFACE_TOKEN` | HuggingFace token | `hf_...` |

| Variable | Description | Default |
|---|---|---|
| `REDEVAL_PROJECT_ROOT` | Project root directory | Current directory |
| `REDEVAL_LOG_DIR` | Logs directory | `./logs` |
| `REDEVAL_RECIPES_DIR` | Configuration directory | `./recipes` |
| `REDEVAL_LOG_LEVEL` | Logging level | `INFO` |
| `REDEVAL_NUM_SAMPLES` | Number of evaluation samples | `10` |
| `REDEVAL_SEED` | Random seed | `0` |

| Variable | Description | Default |
|---|---|---|
| `REDEVAL_OPEN_SOURCE_MODELS` | Open-source models list | `"Qwen/Qwen2.5-7B-Instruct"` |
| `REDEVAL_CLOSED_SOURCE_MODELS` | Closed-source models list | `"gpt-4o-mini"` |
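For instance, assuming the variables are read exactly as documented above, model lists and sample counts can be overridden before the configuration is loaded. A sketch (the list separator expected by `redeval/config.py` is an assumption):

```python
# Sketch: override REDEVAL_* variables in-process before loading the
# configuration. The model list format shown here is an assumption.
import os

os.environ["REDEVAL_OPEN_SOURCE_MODELS"] = "Qwen/Qwen2.5-7B-Instruct"
os.environ["REDEVAL_CLOSED_SOURCE_MODELS"] = "gpt-4o-mini"
os.environ["REDEVAL_NUM_SAMPLES"] = "50"

from redeval.config import EnvironmentConfig

config = EnvironmentConfig.from_env()  # picks up the overrides above
```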
Configuration files in the `recipes/` directory control evaluation parameters:

- `attack/base-open.yml` - Open-source model attack configuration
- `attack/base-close.yml` - Closed-source model attack configuration
- `attack/eval.yml` - Attack evaluation configuration
- `refuse/base-open.yml` - Open-source model refusal configuration
- `refuse/base-close.yml` - Closed-source model refusal configuration
- `refuse/eval.yml` - Refusal evaluation configuration
RedEval provides a unified CLI for all operations:

```bash
# Run full evaluation pipeline
python -m redeval.cli run-pipeline

# Run with specific models
python -m redeval.cli run-pipeline --models "Qwen/Qwen2.5-7B-Instruct" "gpt-4o-mini"

# Run specific phases only
python -m redeval.cli run-pipeline --phases generate_attack run_attack eval_attack
```

```bash
# Generate attack prompts
python -m redeval.cli generate-attack --config ./recipes/attack/base-close.yml

# Run attack evaluation
python -m redeval.cli run-attack --config ./recipes/attack/base-open.yml --model "Qwen/Qwen2.5-7B-Instruct"

# Evaluate attack results
python -m redeval.cli eval-attack --config ./recipes/attack/eval.yml --log-dir ./logs/attack/HarmBench/direct/model_name

# Run refusal tests
python -m redeval.cli run-refuse --config ./recipes/refuse/base-open.yml --model "Qwen/Qwen2.5-7B-Instruct"

# Evaluate refusal results
python -m redeval.cli eval-refuse --config ./recipes/refuse/eval.yml --log-dir ./logs/refuse/CoCoNot/base/model_name

# Calculate scores
python -m redeval.cli score --log-dir ./logs/attack/HarmBench/direct/model_name --keyword "unsafe"
```

For users preferring shell scripts:

```bash
# Set up environment (run first)
source ./sh/setup_env.sh

# Run evaluation phases
./sh/generate_attack.sh   # Generate attack prompts
./sh/run_attack.sh        # Execute attacks
./sh/eval_attack.sh       # Evaluate attack results
./sh/run_refuse.sh        # Run refusal tests
./sh/eval_refuse.sh       # Evaluate refusal results
./sh/score.sh             # Calculate final scores
```

For programmatic access:
```python
from redeval.pipeline import PipelineOrchestrator, PipelinePhase
from redeval.config import EnvironmentConfig

# Initialize configuration
config = EnvironmentConfig.from_env()

# Create pipeline orchestrator
orchestrator = PipelineOrchestrator(config)

# Run complete pipeline
orchestrator.run_complete_pipeline()

# Run specific phases
orchestrator.run_phase(PipelinePhase.GENERATE_ATTACK)
orchestrator.run_phase(PipelinePhase.RUN_ATTACK)
```

- No hardcoded API keys in source code
- Environment variable-based configuration
- `.env` file support with validation
- Secure token handling for HuggingFace authentication
- Environment variable validation on startup
- Configuration file validation
- Model name and parameter validation
- Custom exception classes for different error types
- Comprehensive error logging
- Graceful failure handling
- Principle of least privilege
- Secure defaults
- Comprehensive logging without sensitive data exposure
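As a purely illustrative sketch of the last point (not RedEval's actual logging code, which lives in `redeval/config.py`), secret values can be masked before they ever reach the log:

```python
# Illustrative only: log which credentials are set without exposing them.
# The helper name and masking rule are assumptions for this sketch.
import logging
import os

def redacted(value: str, keep: int = 4) -> str:
    """Return a masked preview of a secret, or a marker if it is unset."""
    return value[:keep] + "..." if value else "<unset>"

logging.basicConfig(level=os.getenv("REDEVAL_LOG_LEVEL", "INFO"))
logging.info("OpenAI key: %s", redacted(os.getenv("OPENAI_API_KEY", "")))
logging.info("HF token:   %s", redacted(os.getenv("HUGGINGFACE_TOKEN", "")))
```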
Run the complete evaluation pipeline or specific phases.
Options:
- `--models`: List of models to evaluate
- `--phases`: Specific phases to run
- `--config-dir`: Configuration directory path
- `--log-dir`: Output directory for logs
Generate adversarial attack prompts.
Options:
- `--config`: Configuration file path
- `--num-samples`: Number of samples to generate
- `--seed`: Random seed for reproducibility
Execute attack evaluation against target models.
Options:
- `--config`: Configuration file path
- `--model`: Target model name
- `--num-samples`: Number of samples to process
Evaluate attack results using judge models.
Options:
- `--config`: Configuration file path
- `--log-dir`: Directory containing attack logs
Run refusal capability testing.
Options:
- `--config`: Configuration file path
- `--model`: Target model name
- `--num-samples`: Number of samples to test
- `--split`: Dataset split to use
Evaluate refusal test results.
Options:
- `--config`: Configuration file path
- `--log-dir`: Directory containing refusal logs
Calculate safety scores from evaluation results.
Options:
- `--log-dir`: Directory containing evaluation logs
- `--keyword`: Keyword to search for in results
Centralized pipeline management.
Methods:
- `run_complete_pipeline()`: Execute full evaluation pipeline
- `run_phase(phase)`: Execute specific pipeline phase
- `get_phase_status(phase)`: Get status of pipeline phase
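For example, phases can be run one at a time and their status inspected afterwards. A sketch using the enum members shown earlier (treating `get_phase_status`'s return value as printable is an assumption):

```python
# Sketch: run individual phases and report their status.
# Printing the status object directly is an assumption.
from redeval.config import EnvironmentConfig
from redeval.pipeline import PipelineOrchestrator, PipelinePhase

orchestrator = PipelineOrchestrator(EnvironmentConfig.from_env())
for phase in (PipelinePhase.GENERATE_ATTACK, PipelinePhase.RUN_ATTACK):
    orchestrator.run_phase(phase)
    print(phase, orchestrator.get_phase_status(phase))
```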
Environment and configuration management.
Methods:
- `from_env()`: Load configuration from environment variables
- `validate()`: Validate configuration completeness
- `setup_logging()`: Configure logging system
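Putting the three methods together, a typical start-up sequence might look like this (whether `validate()` raises on missing variables or returns a status is an assumption):

```python
# Sketch of the documented EnvironmentConfig workflow.
# validate()'s failure behaviour (raise vs. return) is an assumption.
from redeval.config import EnvironmentConfig

config = EnvironmentConfig.from_env()  # read REDEVAL_* and API key variables
config.validate()                      # check the configuration is complete
config.setup_logging()                 # apply the configured logging level
```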
- Fork and Clone

  ```bash
  git clone https://github.com/knoveleng/redeval.git
  cd redeval
  ```

- Create Development Environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
  ```

- Set Up Pre-commit Hooks

  ```bash
  pip install pre-commit
  pre-commit install
  ```
- Security First: Never commit API keys or sensitive data
- Environment Variables: Use environment variables for all configuration
- Error Handling: Add proper error handling with custom exceptions
- Documentation: Update documentation for new features
- Testing: Add tests for new functionality
- Backward Compatibility: Maintain compatibility with existing interfaces
- Follow PEP 8 for Python code
- Use type hints where appropriate
- Add comprehensive docstrings
- Maintain consistent logging patterns
This project is licensed under the MIT License - see the LICENSE file for details.
If you find this project useful, please consider citing it in your research:
```bibtex
@misc{dang2026redbenchuniversaldatasetcomprehensive,
      title={RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models},
      author={Quy-Anh Dang and Chris Ngo and Truong-Son Hy},
      year={2026},
      eprint={2601.03699},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.03699},
}
```