A proof-of-concept tool to grade LLM responses with and without additional context, specifically testing IBM WatsonX models on Red Hat Enterprise Linux administration questions.
- Automated Testing: Tests all available models in the IBM WatsonX Dallas region
- Context Comparison: Tests each model with and without additional context
- Weighted Grading: Uses Google Gemini 2.5 Flash to grade responses with weighted categories:
  - Accuracy: 50%
  - Completeness: 20%
  - Clarity: 20%
  - Response Time: 10%
- Percentile Ranking: Converts all scores to percentiles for easy comparison
- Rich Console Output: Live progress display with colored results tables
- Structured Logging: Plain-text structured logging with structlog
- CSV Export: Detailed results exported to CSV with all scores and percentiles
- Python 3.12 or later
- IBM WatsonX API key and project ID
- Google Gemini API key
- `uv` package manager
- Clone this repository
- Install dependencies using uv:
  ```bash
  uv sync
  ```

- Copy `.env.example` to `.env` and fill in your API keys:

  ```bash
  cp .env.example .env
  ```

- Edit `.env` with your credentials:
  ```
  WATSONX_API_KEY=your_watsonx_api_key_here
  WATSONX_PROJECT_ID=your_watsonx_project_id_here
  GEMINI_API_KEY=your_gemini_api_key_here
  ```

Run the grading system:

```bash
uv run modelgrader
```

The tool will:
- Connect to IBM WatsonX and list all available models
- Load questions and contexts from the `data/` directory
- Check for existing results in the CSV file (for resume capability)
- Test each model with each question (with and without context)
  - Skips already-tested combinations if resuming
  - Writes results to CSV immediately after each test
- Grade each response using Google Gemini
- Calculate percentile rankings for all results
- Display results in a formatted table
- Export final results with percentiles to `llm_grading_results.csv`
The tool automatically saves results as they come in and can resume from where it left off if interrupted:
- Incremental Saves: Each test result is appended to the CSV immediately after completion
- Automatic Resume: On restart, the tool checks the CSV for existing results and skips already-tested combinations
- Progress Tracking: The console shows how many existing results were found and how many tests remain
Example Resume Output:

```
✓ Found 12 models
✓ Loaded 5 questions with contexts
↻ Found 45 existing results, 75 tests remaining
```
This means you can safely stop the tool at any time (Ctrl+C) and restart it later to continue testing.
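The resume check described above can be sketched roughly as follows. This is a simplified illustration, not the tool's actual code; the function name `completed_combinations` is hypothetical, and the column names follow the CSV layout documented below:

```python
import csv
from pathlib import Path


def completed_combinations(csv_path: str) -> set[tuple[str, str, str]]:
    """Return the (model, question, context) combinations already in the
    results CSV, so they can be skipped after a restart.

    Column names are assumptions based on the documented CSV layout.
    """
    path = Path(csv_path)
    if not path.exists():
        return set()  # first run: nothing to skip
    with path.open(newline="") as f:
        return {
            (row["Model Name"], row["Question"], row["Context Provided"])
            for row in csv.DictReader(f)
        }
```

Before each test, the runner can check whether the combination is in this set and skip it if so.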
```
modelgrader/
├── src/
│   └── modelgrader/
│       ├── __init__.py          # Main entry point
│       ├── config.py            # Configuration with pydantic-settings
│       ├── logging.py           # Structlog configuration
│       ├── models.py            # Data models
│       ├── watsonx_client.py    # IBM WatsonX integration
│       ├── gemini_grader.py     # Google Gemini grading
│       ├── console_output.py    # Rich console output
│       ├── test_runner.py       # Test orchestration
│       └── csv_writer.py        # CSV export
├── data/
│   ├── questions/               # RHEL question files
│   │   ├── question_1.txt
│   │   ├── question_2.txt
│   │   ├── question_3.txt
│   │   ├── question_4.txt
│   │   └── question_5.txt
│   └── contexts/                # Context files for each question
│       ├── context_1.txt
│       ├── context_2.txt
│       ├── context_3.txt
│       ├── context_4.txt
│       └── context_5.txt
├── tests/                       # Pytest tests
├── .env.example                 # Example environment variables
├── pyproject.toml               # Project configuration
└── README.md                    # This file
```
The tool includes 5 RHEL-related questions:
- SELinux Troubleshooting: How to troubleshoot SELinux denials
- Systemd Service Configuration: Configuring services to auto-start and restart on failure
- Firewall Configuration: Using firewalld to allow HTTP/HTTPS traffic
- Package Management: Installing specific package versions with dnf and version locking
- Sudo Configuration: Configuring passwordless sudo for specific commands
Each question has a corresponding context file with detailed RHEL documentation.
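The pairing of `question_N.txt` and `context_N.txt` files can be loaded with a few lines of Python. This is a sketch, not the tool's actual loader; the function name `load_question_pairs` is hypothetical, but the file naming convention matches the layout above:

```python
from pathlib import Path


def load_question_pairs(
    questions_dir: str, contexts_dir: str, numbers: list[int]
) -> list[tuple[str, str]]:
    """Load (question, context) text pairs using the documented
    question_N.txt / context_N.txt naming convention."""
    pairs = []
    for n in numbers:
        question = (Path(questions_dir) / f"question_{n}.txt").read_text()
        context = (Path(contexts_dir) / f"context_{n}.txt").read_text()
        pairs.append((question, context))
    return pairs
```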
Gemini grades each response on four categories:
- Accuracy (50% weight): Technical correctness, command accuracy
- Completeness (20% weight): Thoroughness and detail
- Clarity (20% weight): Organization and readability
- Response Time (10% weight): Speed of response generation
The final score is calculated as:

```
weighted_score = (accuracy * 0.5) + (completeness * 0.2) + (clarity * 0.2) + (response_time * 0.1)
```
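In Python, the weighting works out like this (a minimal sketch; the hypothetical `weighted_score` helper assumes each category score is on a 0-100 scale):

```python
def weighted_score(
    accuracy: float, completeness: float, clarity: float, response_time: float
) -> float:
    """Combine four category scores (each 0-100) into one weighted score."""
    return (accuracy * 0.5) + (completeness * 0.2) + (clarity * 0.2) + (response_time * 0.1)


# Example: a response graded 90/80/85/70 across the four categories
# scores (90 * 0.5) + (80 * 0.2) + (85 * 0.2) + (70 * 0.1) = 85.0
print(weighted_score(90, 80, 85, 70))
```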
After all tests complete, weighted scores are converted to percentile rankings (0-100) to make it easier to compare relative performance across all models.
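One common way to compute such percentile ranks is the fraction of scores at or below each score; the tool's exact formula may differ, so treat this as an illustrative sketch:

```python
def percentile_ranks(scores: list[float]) -> list[float]:
    """Map each score to the percentage of scores at or below it (0-100).

    One common percentile-rank definition; the tool's exact formula
    may differ slightly.
    """
    n = len(scores)
    return [100.0 * sum(1 for other in scores if other <= s) / n for s in scores]
```

With this definition, the highest score always ranks at 100, and tied scores receive the same rank.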
The tool displays:
- Live progress bar during testing
- Results table with:
  - Model name
  - Question number
  - Context provided (Y/N)
  - Weighted score
  - Percentile rank (color-coded)
  - Individual category scores
  - Response time
- Summary statistics:
  - Best and worst performing models
  - Average scores
  - Context impact analysis
Results are saved to `llm_grading_results.csv` with columns:
- Model Name
- Question
- Context Provided
- Response
- Accuracy Score
- Completeness Score
- Clarity Score
- Response Time
- Weighted Score
- Percentile Rank
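Incremental export with these columns can be sketched with the standard `csv` module (an illustration, not the tool's actual `csv_writer.py`; the `append_result` name is hypothetical):

```python
import csv
from pathlib import Path

FIELDNAMES = [
    "Model Name", "Question", "Context Provided", "Response",
    "Accuracy Score", "Completeness Score", "Clarity Score",
    "Response Time", "Weighted Score", "Percentile Rank",
]


def append_result(csv_path: str, row: dict) -> None:
    """Append one result row, writing the header only when the file is new."""
    path = Path(csv_path)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

Appending one row at a time is what makes the resume behavior described earlier possible: every completed test survives an interruption.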
Run the tests:

```bash
uv run pytest
```

Lint the code:

```bash
uv run ruff check src/
```

Type-check the code:

```bash
uv run pyright src/
```

All configuration can be set via environment variables or a `.env` file:

- `WATSONX_API_KEY`: IBM WatsonX API key (required)
- `WATSONX_PROJECT_ID`: IBM WatsonX project ID (required)
- `WATSONX_URL`: WatsonX API URL (default: `https://us-south.ml.cloud.ibm.com`)
- `GEMINI_API_KEY`: Google Gemini API key (required)
- `GEMINI_MODEL`: Gemini model to use (default: `gemini-2.0-flash-exp`)
- `OUTPUT_CSV_PATH`: Output CSV file path (default: `llm_grading_results.csv`)
- `QUESTIONS_DIR`: Questions directory (default: `data/questions`)
- `CONTEXTS_DIR`: Contexts directory (default: `data/contexts`)
- `REQUEST_TIMEOUT`: API request timeout in seconds (default: 120)
- `QUESTION_NUMBERS`: Comma-separated question numbers to test (default: `1,2,3,4,5`)
  - Example: `QUESTION_NUMBERS=1` to test only question 1
  - Example: `QUESTION_NUMBERS=1,3,5` to test questions 1, 3, and 5
To reduce load or test incrementally, you can limit which questions to test by setting QUESTION_NUMBERS in your .env file:
```bash
# Test only question 1
QUESTION_NUMBERS=1

# Or test questions 1 and 2
QUESTION_NUMBERS=1,2
```

This is useful for:
- Initial testing with a smaller dataset
- Reducing API usage and costs
- Testing specific questions of interest
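Parsing the comma-separated `QUESTION_NUMBERS` value amounts to a one-liner; this sketch uses a hypothetical `selected_questions` helper (the real tool reads its settings through pydantic-settings in `config.py`):

```python
import os


def selected_questions(default: str = "1,2,3,4,5") -> list[int]:
    """Parse QUESTION_NUMBERS from the environment into a list of ints."""
    raw = os.environ.get("QUESTION_NUMBERS", default)
    return [int(part) for part in raw.split(",") if part.strip()]
```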
This is a proof-of-concept project. Use at your own discretion.
Major Hayden major@mhtx.net