# 🚀 LLM-TestKit: Professional LLM Evaluation Framework

## Complete Functionality Showcase

This notebook walks through the key features of the `llm-testkit` package, an LLM evaluation framework that produces HTML reports.

### 📋 What you'll learn:
- Package installation and setup
- GPU compatibility checking
- Model evaluation workflows
- Task management and selection
- Report generation (HTML & Professional)
- Quick evaluation functions
- CLI tools overview

---


## 📦 1. Package Installation & Setup

First, let's install the package and check its basic information.


In [1]:
# Install the package (uncomment if not already installed)
# !pip install llm-testkit

# Import the package and check basic information
import llm_testkit

print(f"📦 LLM-TestKit Version: {llm_testkit.__version__}")
print(f"👨‍💻 Author: {llm_testkit.__author__}")
print(f"📄 License: {llm_testkit.__license__}")
print(f"🌐 URL: {llm_testkit.__url__}")
print(f"📧 Contact: {llm_testkit.__email__}")




INFO 06-09 07:51:36 [__init__.py:243] Automatically detected platform cuda.
✅ lm-eval loaded successfully
📦 LLM-TestKit Version: 1.1.1
👨‍💻 Author: Matthias De Paolis
📄 License: MIT
🌐 URL: https://github.com/mattdepaolis/llm-testkit
📧 Contact: mattdepaolis@users.noreply.github.com


## 🔍 2. GPU Compatibility & PyTorch Setup

The package includes intelligent GPU detection and PyTorch installation helpers.


In [2]:
# Check GPU compatibility
gpu_info = llm_testkit.check_gpu_compatibility()

print("🔍 GPU Detection Results:")
print(f"GPUs detected: {len(gpu_info['gpus_detected'])}")

if gpu_info['gpus_detected']:
    for i, gpu in enumerate(gpu_info['gpus_detected']):
        print(f"  GPU {i+1}: {gpu['name']} (Compute {gpu['compute_cap']})")
else:
    print("  No NVIDIA GPUs detected or nvidia-smi not available")

print(f"\n💡 Recommendation: {gpu_info['recommendation']}")
print(f"📦 Installation command: {gpu_info['installation_command']}")

# Check if PyTorch is available with CUDA
try:
    import torch
    print(f"\n🎯 PyTorch version: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"🚀 CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"🔥 CUDA version: {torch.version.cuda}")
        print(f"📊 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("💻 CUDA not available - using CPU")
except ImportError:
    print("❌ PyTorch not installed")


🔍 GPU Detection Results:
GPUs detected: 2
  GPU 1: NVIDIA GeForce RTX 5090 (Compute 12.0)
  GPU 2: NVIDIA GeForce RTX 4090 (Compute 8.9)

💡 Recommendation: GPU detected: NVIDIA GeForce RTX 5090. Using CUDA 12.8 for optimal performance.
📦 Installation command: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

🎯 PyTorch version: 2.7.0+cu128
🚀 CUDA available: NVIDIA GeForce RTX 4090
🔥 CUDA version: 12.8
📊 GPU Memory: 24.1 GB
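
The detection output above lists two GPUs with different compute capabilities. As a minimal sketch, assuming `gpus_detected` is a list of dicts with `name` and `compute_cap` keys (the shape the cell above indexes into), a helper can pick the card with the highest compute capability. `best_gpu` and `parse_compute_cap` are hypothetical names, not part of the package. In the output above the first detected GPU is the RTX 5090 while PyTorch reports the RTX 4090 as device 0, so a list position should not be assumed to match a `cuda:N` index.

```python
def parse_compute_cap(value):
    """Turn a compute capability such as "12.0" or 8.9 into a (major, minor) tuple."""
    major, _, minor = str(value).partition(".")
    return int(major), int(minor or 0)


def best_gpu(gpu_info):
    """Return the detected GPU dict with the highest compute capability, or None."""
    gpus = gpu_info.get("gpus_detected", [])
    if not gpus:
        return None
    # Compare as integer tuples; comparing the raw strings would rank "8.9" above "12.0".
    return max(gpus, key=lambda gpu: parse_compute_cap(gpu["compute_cap"]))


# Sample data copied from the output above; the dict shape is an assumption.
sample = {
    "gpus_detected": [
        {"name": "NVIDIA GeForce RTX 5090", "compute_cap": "12.0"},
        {"name": "NVIDIA GeForce RTX 4090", "compute_cap": "8.9"},
    ]
}
print(best_gpu(sample)["name"])
```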


## 📋 3. Available Tasks & Task Management

Let's explore what evaluation tasks are available.


In [3]:
# Display available tasks information
print("📊 Available Tasks Overview:")
print("The llm_testkit.list_available_tasks() function shows comprehensive task information:")

# Call the function to display the detailed task information
available_tasks = llm_testkit.list_available_tasks()

print("\n🎯 Task Categories Summary:")
print("The package supports several major task categories:")

# Show task category examples based on what we know from the output
task_categories = {
    "📚 Reasoning & Knowledge": ["MMLU", "ARC", "BBH", "GPQA", "TruthfulQA"],
    "🧮 Mathematics": ["GSM8K", "MATH"],
    "💻 Code Generation": ["HumanEval"],
    "🧠 Advanced Reasoning": ["MUSR", "IFEval"],
    "🏆 Leaderboard Suite": ["LEADERBOARD (39 tasks total)"]
}

for category, examples in task_categories.items():
    print(f"\n  {category}:")
    for example in examples:
        print(f"    • {example}")

print("\n🏆 Leaderboard Tasks Information:")
print(f"Total leaderboard tasks: {len(llm_testkit.LEADERBOARD_TASKS)}")
print("Sample leaderboard tasks:", list(llm_testkit.LEADERBOARD_TASKS)[:5])

print("\n💡 Key Points:")
print("  • Use task group names (e.g., 'MMLU', 'LEADERBOARD') for convenience")
print("  • Specify individual tasks for targeted evaluation")
print("  • Mix multiple tasks: --tasks mmlu,gsm8k,humaneval")
print("  • The LEADERBOARD suite includes the most comprehensive benchmarks")


📊 Available Tasks Overview:
The llm_testkit.list_available_tasks() function shows comprehensive task information:

AVAILABLE TASK GROUPS:

MMLU: Multiple choice QA covering 57 subjects across STEM, humanities, social sciences
   Contains 5 tasks. You can use 'MMLU' directly to run all tasks in this group.

MMLU-PRO: Advanced version of MMLU with more challenging questions
   Contains 2 tasks. You can use 'MMLU-PRO' directly to run all tasks in this group.

ARC: AI2 Reasoning Challenge (ARC) dataset with elementary/middle school science questions
   Contains 2 tasks. You can use 'ARC' directly to run all tasks in this group.

BBH: BIG-Bench Hard tasks - challenging subset of BIG-Bench benchmark
   Contains 2 tasks. You can use 'BBH' directly to run all tasks in this group.

GPQA: Graduate-level multiple-choice questions in biology, physics, and chemistry
   Contains 6 tasks. You can use 'GPQA' directly to run all tasks in this group.

IFEval: Instruction Following Evaluation benchmar
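
The key points in the cell above mention mixing tasks as a comma-separated string (`mmlu,gsm8k,humaneval`), while the `evaluate_model` example in section 5 passes a list. A small sketch of normalizing either form into a clean list follows; `normalize_tasks` is a hypothetical helper, not part of the package, and it leaves the case of task names untouched because the notebook does not say whether names are case-sensitive.

```python
def normalize_tasks(tasks):
    """Accept "a,b,c" or an iterable of names; return a de-duplicated list in order."""
    if isinstance(tasks, str):
        tasks = tasks.split(",")
    seen = set()
    cleaned = []
    for name in tasks:
        name = name.strip()
        # Skip empty fragments (e.g. from a trailing comma) and repeated names.
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


print(normalize_tasks("mmlu, gsm8k,humaneval,mmlu"))
```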

## ⚡ 4. Quick Evaluation Functions

The package provides convenient functions for quick evaluations.


In [None]:
# Quick evaluation example (using a small model for demonstration)
print("🚀 Running Quick Evaluation...")
print("Note: Using a small model and limited samples for demonstration purposes.")

try:
    # Quick evaluation with minimal parameters
    results = llm_testkit.quick_eval(
        model_name="microsoft/DialoGPT-small",  # Small model for demo
        tasks="arc_easy",  # Single task
        model_type="hf",   # HuggingFace backend
        limit=5,           # Very small sample for demo
        device="auto"      # Auto-detect device
    )
    
    print("✅ Evaluation completed successfully!")
    print(f"📊 Results structure: {list(results.keys())}")

    if 'results' in results:
        for task, score in results['results'].items():
            # A score may be a plain number or a dict of metrics, depending on the backend
            if isinstance(score, (int, float)):
                print(f"  📈 {task}: {score:.3f}")
            else:
                print(f"  📈 {task}: {score}")

except Exception as e:
    print(f"❌ Evaluation failed: {e}")
    print("💡 This is expected in environments without GPU access or when the model is too large.")
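
The notebook does not show what `results['results']` contains. If the backend follows the lm-eval convention, where each task maps to a dict of metrics with keys such as `acc,none` (an assumption, not confirmed here), the numeric metrics can be pulled out as sketched below. `numeric_metrics` is a hypothetical helper, and the sample values are placeholders rather than real scores.

```python
def numeric_metrics(results):
    """Map each task to its numeric metrics, skipping non-numeric entries such as aliases."""
    flat = {}
    for task, metrics in results.get("results", {}).items():
        if isinstance(metrics, dict):
            flat[task] = {
                key: value
                for key, value in metrics.items()
                if isinstance(value, (int, float)) and not isinstance(value, bool)
            }
        elif isinstance(metrics, (int, float)):
            # Some backends may report a single number per task.
            flat[task] = {"score": metrics}
    return flat


# Placeholder values for illustration only, not real evaluation results.
example = {"results": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.4, "acc_stderr,none": 0.2}}}
for task, metrics in numeric_metrics(example).items():
    for name, value in metrics.items():
        print(f"{task} {name}: {value:.3f}")
```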


## 🔧 5. Advanced Evaluation Workflow

The `evaluate_model` function exposes the full set of evaluation options, including device selection, sample limits, and report generation.


In [4]:
# Advanced evaluation example
print("🎯 Advanced Evaluation Configuration")
print("This demonstrates the full-featured evaluate_model function.")

# Example configuration, printed here for reference
print("\n💡 User's Example Configuration:")
print("""
results, output_path = llm_testkit.evaluate_model(
    model_type="hf",
    model_name="Qwen/Qwen2.5-7B-Instruct",
    tasks=["leaderboard"],
    preserve_default_fewshot=True,
    batch_size=1,  # Smallest batch size
    device="cuda:1",
    num_samples=1,  # Moderate sample size
    #quantize=True,
    #quantization_method="4bit",
    generate_report=True,
    report_format="professional",
    output_dir="./my_results"  # ← Add this to save in current directory
)
""")

print("🔧 Configuration Parameters Explained:")
config_explanations = {
    "model_type": "Backend to use ('hf' for HuggingFace, 'vllm' for vLLM)",
    "model_name": "Model identifier from HuggingFace Hub",
    "tasks": "List of evaluation tasks to run",
    "preserve_default_fewshot": "Keep original few-shot examples",
    "batch_size": "Number of samples processed together",
    "device": "GPU device to use (cuda:0, cuda:1, etc.)",
    "num_samples": "Number of samples per task to evaluate",
    "generate_report": "Create HTML/PDF reports",
    "report_format": "Style of report ('basic', 'professional')",
    "output_dir": "Directory to save results and reports"
}

for param, explanation in config_explanations.items():
    print(f"  • {param}: {explanation}")

print("\n⚠️  Note: This evaluation requires significant computational resources.")
print("    Uncomment and modify the code below when ready to run:")


🎯 Advanced Evaluation Configuration
This demonstrates the full-featured evaluate_model function.

💡 User's Example Configuration:

results, output_path = llm_testkit.evaluate_model(
    model_type="hf",
    model_name="Qwen/Qwen2.5-7B-Instruct",
    tasks=["leaderboard"],
    preserve_default_fewshot=True,
    batch_size=1,  # Smallest batch size
    device="cuda:1",
    num_samples=1,  # Moderate sample size
    #quantize=True,
    #quantization_method="4bit",
    generate_report=True,
    report_format="professional",
    output_dir="./my_results"  # ← Add this to save in current directory
)

🔧 Configuration Parameters Explained:
  • model_type: Backend to use ('hf' for HuggingFace, 'vllm' for vLLM)
  • model_name: Model identifier from HuggingFace Hub
  • tasks: List of evaluation tasks to run
  • preserve_default_fewshot: Keep original few-shot examples
  • batch_size: Number of samples processed together
  • device: GPU device to use (cuda:0, cuda:1, etc.)
 

In [4]:
# Run this cell only when you have the computational resources for a 7B model:

results, output_path = llm_testkit.evaluate_model(
    model_type="hf",
    model_name="Qwen/Qwen2.5-7B-Instruct",
    tasks=["leaderboard"],
    preserve_default_fewshot=True,
    batch_size=1,  # Smallest batch size
    device="cuda:1",  # Second GPU; use "cuda:0" on a single-GPU machine
    num_samples=1,  # One sample per task: a smoke test, not a meaningful score
    # quantize=True,
    # quantization_method="4bit",
    generate_report=True,
    report_format="professional",
    output_dir="./my_results"  # Save the JSON results under ./my_results
)

print("✅ Evaluation completed!")
print(f"📊 Results: {results}")
print(f"📄 Output saved to: {output_path}")


Evaluating model type: hf
Model: Qwen/Qwen2.5-7B-Instruct
Tasks: leaderboard
Device: cuda:1, Few-shot examples: 0
Batch size: 1
Using 1 samples per task
Using default few-shot settings for each task:
  - BBH tasks: 3-shot
  - GPQA tasks: 0-shot
  - MMLU-Pro tasks: 5-shot
  - MUSR tasks: 0-shot
  - IFEval tasks: 0-shot
  - Math-lvl-5 tasks: 4-shot
Starting evaluation on 1 tasks: leaderboard


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

[Task: leaderboard_musr_team_allocation] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[Task: leaderboard_musr_object_placements] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[Task: leaderboard_musr_murder_mysteries] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[Task: leaderboard_ifeval] has_training_docs and has_validation_

Results saved to ./my_results/results_Qwen_Qwen2.5-7B-Instruct_leaderboard_20250609_075633.json
📄 JSON results saved to: ./my_results/results_Qwen_Qwen2.5-7B-Instruct_leaderboard_20250609_075633.json
🔍 DEBUG: generate_report=True, json_output_path=./my_results/results_Qwen_Qwen2.5-7B-Instruct_leaderboard_20250609_075633.json
✅ Model results saved for comparison: /workspace/llm-testkit/comparison_results/Qwen_Qwen2.5-7B-Instruct_20250609_075633.json
✨ Professional HTML report generated: /workspace/llm-testkit/reports/results_Qwen_Qwen2.5-7B-Instruct_leaderboard_20250609_075633_professional_report.html
✅ Professional report generated: /workspace/llm-testkit/reports/results_Qwen_Qwen2.5-7B-Instruct_leaderboard_20250609_075633_professional_report.md
✅ Professional markdown report generated: /workspace/llm-testkit/reports/
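
The run above writes a timestamped JSON file into `output_dir`. A minimal sketch for picking up the most recent one in a later session is shown below, assuming the `results_*.json` naming seen in the log. `latest_results` is a hypothetical helper, and it selects by file modification time rather than by parsing the timestamp in the name.

```python
import json
from pathlib import Path


def latest_results(output_dir, pattern="results_*.json"):
    """Return (path, parsed JSON) for the most recently modified match, or None."""
    candidates = list(Path(output_dir).glob(pattern))
    if not candidates:
        return None
    newest = max(candidates, key=lambda path: path.stat().st_mtime)
    with newest.open(encoding="utf-8") as handle:
        return newest, json.load(handle)


found = latest_results("./my_results")
if found is None:
    print("No results files found yet")
else:
    path, data = found
    print(f"Loaded {path.name} with top-level entries: {list(data)}")
```

The layout of the JSON file itself is not shown in this notebook, so the sketch only lists its top-level entries.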

## 📊 6. Report Generation & Visualization

One of the key features is generating beautiful HTML reports.


In [None]:
# Demonstrate report generation capabilities
print("📄 Report Generation Features")

# Show available report formats
print("\n🎨 Available Report Formats:")
report_features = {
    "HTML Reports": [
        "Interactive web-based reports",
        "ZENO-style professional layouts", 
        "Card-based sample presentation",
        "Mobile-friendly responsive design"
    ],
    "Professional Reports": [
        "Business-ready presentations",
        "Executive summaries and insights",
        "Performance badges and recommendations",
        "Chart.js interactive visualizations"
    ],
    "Report Features": [
        "Enhanced question/choice display",
        "Smart answer highlighting (green/blue)",  
        "Confidence score visualization",
        "Activity badges for categorized tasks"
    ]
}

for category, features in report_features.items():
    print(f"\n📋 {category}:")
    for feature in features:
        print(f"  ✨ {feature}")

print("\n🚀 Quick HTML Report Example:")
print("""
# Generate report with evaluation
results, report_path = llm_testkit.quick_html_report(
    model_name="microsoft/DialoGPT-small",
    tasks="arc_easy,hellaswag", 
    limit=100,
    output_dir="./my_reports"
)
""")

print("💡 Reports include:")
print("  • Interactive charts and graphs")
print("  • Detailed sample analysis")
print("  • Performance breakdowns by task")
print("  • Professional styling and branding")


## 🛠️ 7. CLI Tools Overview

The package provides several CLI tools for different use cases.


In [None]:
import subprocess
import os

print("🖥️ Available CLI Commands:")

cli_commands = {
    "llm-eval": "Main evaluation command",
    "llm-eval-gpu-setup": "GPU detection and PyTorch setup",
    "llm-eval-demo": "Demo command for report generation",  
    "llm-eval-html": "Convert JSON results to HTML reports",
    "llm-eval-showcase": "Showcase framework capabilities"
}

print("\n📋 Command Reference:")
for cmd, desc in cli_commands.items():
    print(f"  🔧 {cmd}: {desc}")

print("\n💡 Example CLI Usage:")
cli_examples = [
    "# Basic evaluation",
    "llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks arc_easy --limit 100",
    "",
    "# Multiple tasks with professional reports",  
    "llm-eval --model hf --model_name microsoft/DialoGPT-small --tasks arc_easy,hellaswag --report_format professional",
    "",
    "# GPU-optimized evaluation",
    "llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks mmlu --device cuda:0 --batch_size 8",
    "",
    "# Generate HTML report from existing JSON",
    "llm-eval-html results.json -o report.html"
]

for example in cli_examples:
    print(f"  {example}")

# Check if commands are available
import shutil  # shutil.which is a portable replacement for shelling out to `which`

print("\n🔍 Checking CLI Command Availability:")
for cmd in cli_commands:
    if shutil.which(cmd):
        print(f"  ✅ {cmd}: Available")
    else:
        print(f"  ❌ {cmd}: Not found in PATH")
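
The example commands in the cell above share a common shape. As a sketch, the same invocations can be assembled as an argument list and handed to `subprocess.run`. `build_llm_eval_command` is a hypothetical helper, and it is only meant for flags that appear in the examples above (`--model`, `--model_name`, `--tasks`, `--limit`, `--device`, `--batch_size`, `--report_format`); it does not validate flag names.

```python
import shlex


def build_llm_eval_command(model_name, tasks, model="hf", **options):
    """Build an argv list for llm-eval; options whose value is None are skipped."""
    argv = ["llm-eval", "--model", model, "--model_name", model_name,
            "--tasks", ",".join(tasks)]
    for flag, value in options.items():
        if value is not None:
            argv += [f"--{flag}", str(value)]
    return argv


argv = build_llm_eval_command(
    "mistralai/Mistral-7B-v0.1", ["arc_easy"], limit=100, device=None
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(argv))
```

Passing the list form to `subprocess.run(argv)` avoids shell quoting issues with model names or paths.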


## 🎯 8. Use Case Examples

Different scenarios where llm-testkit excels.


In [None]:
print("🎯 LLM-TestKit Use Cases")

use_cases = {
    "🔬 Research & Development": [
        "Model comparison across architectures",
        "Performance analysis with detailed breakdowns", 
        "Publication-ready evaluation materials",
        "Hypothesis testing with statistical analysis"
    ],
    "💼 Commercial Applications": [
        "Client demonstrations with professional reports",
        "Consulting deliverables and recommendations",
        "Proof of concepts for rapid prototyping",
        "Model selection for production deployment"
    ],
    "🎓 Educational Use": [
        "Teaching materials with clear examples",
        "Student projects and coursework",
        "Research training with professional tools",
        "Comparative studies and benchmarking"
    ],
    "⚡ Quick Evaluations": [
        "Rapid model screening and filtering",
        "A/B testing between model variants",
        "Performance monitoring in production",
        "Quality assurance for model updates"
    ]
}

for category, examples in use_cases.items():
    print(f"\n{category}:")
    for example in examples:
        print(f"  • {example}")

print("\n🌟 Key Benefits:")
benefits = [
    "Professional-grade HTML reports with ZENO-style layouts",
    "Mobile-friendly responsive design for all devices", 
    "GPU optimization with CUDA 12.8 support",
    "35+ evaluation tasks across multiple domains",
    "Simple Python API and comprehensive CLI tools",
    "Business-ready presentation quality",
    "Batch processing for efficient evaluation",
    "Comprehensive error handling and logging"
]

for benefit in benefits:
    print(f"  ✨ {benefit}")

print("\n💡 Perfect for:")
perfect_for = [
    "Researchers publishing evaluation results",
    "Companies evaluating LLMs for production",
    "Consultants delivering client assessments", 
    "Educators teaching about LLM evaluation",
    "Developers comparing model performance"
]

for use in perfect_for:
    print(f"  🎯 {use}")


## 🚀 9. Getting Started - Your Next Steps

Ready to start using llm-testkit? Here's your roadmap!


In [None]:
print("🚀 Getting Started Checklist")

checklist = [
    ("✅", "Install llm-testkit package", "pip install llm-testkit"),
    ("🔧", "Setup PyTorch with CUDA 12.8", "llm_testkit.install_pytorch_for_gpu()"),
    ("🔍", "Check GPU compatibility", "llm_testkit.check_gpu_compatibility()"),
    ("📋", "Explore available tasks", "llm_testkit.list_available_tasks()"),
    ("⚡", "Run quick evaluation", "llm_testkit.quick_eval(model_name, tasks)"),
    ("📄", "Generate HTML report", "llm_testkit.quick_html_report(model_name, tasks)"),
    ("🎯", "Advanced evaluation", "llm_testkit.evaluate_model(...)"),
    ("🖥️", "Try CLI commands", "llm-eval --help")
]

print("\n📋 Step-by-Step Guide:")
for i, (icon, step, command) in enumerate(checklist, 1):
    print(f"  {i}. {icon} {step}")
    print(f"     💻 {command}")
    print()

print("🎉 You're all set to start evaluating LLMs professionally!")

print("\n📚 Additional Resources:")
resources = [
    ("🌐", "Documentation", "https://github.com/mattdepaolis/llm-testkit"),
    ("📊", "Example Reports", "Check the generated HTML files"),
    ("💬", "Support", "GitHub Issues for questions and feedback"),
    ("📦", "PyPI Package", "https://pypi.org/project/llm-testkit/"),
    ("🎯", "Best Practices", "Use CUDA 12.8 for optimal performance"),
    ("⚡", "Quick Start", "Begin with small models and limited samples")
]

for icon, resource, info in resources:
    print(f"  {icon} {resource}: {info}")

print("\n🔥 Pro Tips:")
pro_tips = [
    "Start with small models (DialoGPT-small) for testing",
    "Use limit parameter to control evaluation time",
    "Generate professional reports for presentations",
    "Leverage GPU optimization for faster evaluation",
    "Explore different task categories for comprehensive analysis"
]

for tip in pro_tips:
    print(f"  💡 {tip}")


---

## 🎊 Conclusion

This notebook has showcased the comprehensive capabilities of **llm-testkit**:

- ✨ **Professional Reports**: ZENO-style HTML reports with beautiful visualizations
- 🚀 **GPU Optimization**: Intelligent CUDA 12.8 setup for optimal performance
- 📊 **Comprehensive Tasks**: 35+ evaluation tasks across multiple domains
- ⚡ **Easy API**: Simple Python functions and CLI tools
- 💼 **Business Ready**: Professional presentation quality for all use cases

Whether you're a researcher, developer, or business professional, llm-testkit provides the tools you need for professional LLM evaluation.

**Happy Evaluating! 🎯**
