major/modelgrader
LLM Grading System

A proof-of-concept tool to grade LLM responses with and without additional context, specifically testing IBM WatsonX models on Red Hat Enterprise Linux administration questions.

Features

  • Automated Testing: Tests all available models in IBM WatsonX Dallas region
  • Context Comparison: Tests each model with and without additional context
  • Weighted Grading: Uses Google Gemini 2.5 Flash to grade responses with weighted categories:
    • Accuracy: 50%
    • Completeness: 20%
    • Clarity: 20%
    • Response Time: 10%
  • Percentile Ranking: Converts all scores to percentiles for easy comparison
  • Rich Console Output: Live progress display with colored results tables
  • Structured Logging: Plain-text structured logging with structlog
  • CSV Export: Detailed results exported to CSV with all scores and percentiles

Prerequisites

  • Python 3.12 or later
  • IBM WatsonX API key and project ID
  • Google Gemini API key
  • uv package manager

Installation

  1. Clone this repository
  2. Install dependencies using uv:
uv sync
  3. Copy .env.example to .env and fill in your API keys:
cp .env.example .env
  4. Edit .env with your credentials:
WATSONX_API_KEY=your_watsonx_api_key_here
WATSONX_PROJECT_ID=your_watsonx_project_id_here
GEMINI_API_KEY=your_gemini_api_key_here

Usage

Run the grading system:

uv run modelgrader

The tool will:

  1. Connect to IBM WatsonX and list all available models
  2. Load questions and contexts from the data/ directory
  3. Check for existing results in the CSV file (for resume capability)
  4. Test each model with each question (with and without context)
    • Skips already-tested combinations if resuming
    • Writes results to CSV immediately after each test
  5. Grade each response using Google Gemini
  6. Calculate percentile rankings for all results
  7. Display results in a formatted table
  8. Export final results with percentiles to llm_grading_results.csv

Resume Functionality

The tool automatically saves results as they come in and can resume from where it left off if interrupted:

  • Incremental Saves: Each test result is appended to the CSV immediately after completion
  • Automatic Resume: On restart, the tool checks the CSV for existing results and skips already-tested combinations
  • Progress Tracking: The console shows how many existing results were found and how many tests remain

Example Resume Output:

✓ Found 12 models
✓ Loaded 5 questions with contexts
↻ Found 45 existing results, 75 tests remaining

This means you can safely stop the tool at any time (Ctrl+C) and restart it later to continue testing.

Project Structure

modelgrader/
├── src/
│   └── modelgrader/
│       ├── __init__.py          # Main entry point
│       ├── config.py            # Configuration with pydantic-settings
│       ├── logging.py           # Structlog configuration
│       ├── models.py            # Data models
│       ├── watsonx_client.py    # IBM WatsonX integration
│       ├── gemini_grader.py     # Google Gemini grading
│       ├── console_output.py    # Rich console output
│       ├── test_runner.py       # Test orchestration
│       └── csv_writer.py        # CSV export
├── data/
│   ├── questions/               # RHEL question files
│   │   ├── question_1.txt
│   │   ├── question_2.txt
│   │   ├── question_3.txt
│   │   ├── question_4.txt
│   │   └── question_5.txt
│   └── contexts/                # Context files for each question
│       ├── context_1.txt
│       ├── context_2.txt
│       ├── context_3.txt
│       ├── context_4.txt
│       └── context_5.txt
├── tests/                       # Pytest tests
├── .env.example                 # Example environment variables
├── pyproject.toml               # Project configuration
└── README.md                    # This file

Questions Included

The tool includes 5 RHEL-related questions:

  1. SELinux Troubleshooting: How to troubleshoot SELinux denials
  2. Systemd Service Configuration: Configuring services to auto-start and restart on failure
  3. Firewall Configuration: Using firewalld to allow HTTP/HTTPS traffic
  4. Package Management: Installing specific package versions with dnf and version locking
  5. Sudo Configuration: Configuring passwordless sudo for specific commands

Each question has a corresponding context file with detailed RHEL documentation.

Grading System

Category Scores (0-100 each)

Gemini grades each response on four categories:

  • Accuracy (50% weight): Technical correctness, command accuracy
  • Completeness (20% weight): Thoroughness and detail
  • Clarity (20% weight): Organization and readability
  • Response Time (10% weight): Speed of response generation

Weighted Score

The final score is calculated as:

weighted_score = (accuracy * 0.5) + (completeness * 0.2) + (clarity * 0.2) + (response_time * 0.1)
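As a runnable sketch (the function name is illustrative; the weights are the documented ones):

```python
def weighted_score(accuracy: float, completeness: float,
                   clarity: float, response_time: float) -> float:
    """Combine the four category scores (each 0-100) using the weights
    documented above: 50% accuracy, 20% completeness, 20% clarity,
    10% response time."""
    return (accuracy * 0.5) + (completeness * 0.2) + (clarity * 0.2) + (response_time * 0.1)
```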

Percentile Ranking

After all tests complete, weighted scores are converted to percentile rankings (0-100) to make it easier to compare relative performance across all models.
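One common definition of percentile rank, the percentage of scores at or below a given score, could be sketched like this (the tool's exact method may differ):

```python
def percentile_ranks(scores: list[float]) -> list[float]:
    """Map each weighted score to the percentage of all scores it meets
    or beats, yielding a 0-100 ranking."""
    n = len(scores)
    return [100.0 * sum(s <= score for s in scores) / n for score in scores]
```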

Output

Console Output

The tool displays:

  • Live progress bar during testing
  • Results table with:
    • Model name
    • Question number
    • Context provided (Y/N)
    • Weighted score
    • Percentile rank (color-coded)
    • Individual category scores
    • Response time
  • Summary statistics:
    • Best and worst performing models
    • Average scores
    • Context impact analysis

CSV Export

Results are saved to llm_grading_results.csv with columns:

  • Model Name
  • Question
  • Context Provided
  • Response
  • Accuracy Score
  • Completeness Score
  • Clarity Score
  • Response Time
  • Weighted Score
  • Percentile Rank

Development

Running Tests

uv run pytest

Linting

uv run ruff check src/

Type Checking

uv run pyright src/

Configuration Options

All configuration can be set via environment variables or .env file:

  • WATSONX_API_KEY: IBM WatsonX API key (required)
  • WATSONX_PROJECT_ID: IBM WatsonX project ID (required)
  • WATSONX_URL: WatsonX API URL (default: https://us-south.ml.cloud.ibm.com)
  • GEMINI_API_KEY: Google Gemini API key (required)
  • GEMINI_MODEL: Gemini model to use (default: gemini-2.0-flash-exp)
  • OUTPUT_CSV_PATH: Output CSV file path (default: llm_grading_results.csv)
  • QUESTIONS_DIR: Questions directory (default: data/questions)
  • CONTEXTS_DIR: Contexts directory (default: data/contexts)
  • REQUEST_TIMEOUT: API request timeout in seconds (default: 120)
  • QUESTION_NUMBERS: Comma-separated question numbers to test (default: 1,2,3,4,5)
    • Example: QUESTION_NUMBERS=1 to test only question 1
    • Example: QUESTION_NUMBERS=1,3,5 to test questions 1, 3, and 5

Testing Subset of Questions

To reduce load or test incrementally, you can limit which questions to test by setting QUESTION_NUMBERS in your .env file:

# Test only question 1
QUESTION_NUMBERS=1

# Or test questions 1 and 2
QUESTION_NUMBERS=1,2

This is useful for:

  • Initial testing with a smaller dataset
  • Reducing API usage and costs
  • Testing specific questions of interest
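Parsing this setting can be sketched as follows. Note this is an illustrative stdlib version; per the project structure, the real configuration is handled by pydantic-settings in config.py.

```python
import os

def selected_questions(default: str = "1,2,3,4,5") -> list[int]:
    """Parse the comma-separated QUESTION_NUMBERS environment variable
    into a list of question numbers, falling back to all five."""
    raw = os.environ.get("QUESTION_NUMBERS", default)
    return [int(part) for part in raw.split(",") if part.strip()]
```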

License

This is a proof-of-concept project. Use at your own discretion.

Author

Major Hayden major@mhtx.net
