A proof-of-concept tool to grade LLM responses with and without additional context, specifically testing IBM WatsonX models on Red Hat Enterprise Linux administration questions.
- Automated Testing: Tests all available models in the IBM WatsonX Dallas region
- Context Comparison: Tests each model with and without additional context
- Weighted Grading: Uses Google Gemini 2.5 Flash to grade responses with weighted categories:
  - Accuracy: 50%
  - Completeness: 20%
  - Clarity: 20%
  - Response Time: 10%
- Percentile Ranking: Converts all scores to percentiles for easy comparison
- Rich Console Output: Live progress display with colored results tables
- Structured Logging: Plain-text structured logging with structlog
- CSV Export: Detailed results exported to CSV with all scores and percentiles
- Python 3.12 or later
- IBM WatsonX API key and project ID
- Google Gemini API key
- `uv` package manager
- Clone this repository
- Install dependencies using uv:
  ```bash
  uv sync
  ```

- Copy `.env.example` to `.env` and fill in your API keys:

  ```bash
  cp .env.example .env
  ```

- Edit `.env` with your credentials:
  ```
  WATSONX_API_KEY=your_watsonx_api_key_here
  WATSONX_PROJECT_ID=your_watsonx_project_id_here
  GEMINI_API_KEY=your_gemini_api_key_here
  ```

Run the grading system:

```bash
uv run modelgrader
```

The tool will:
- Connect to IBM WatsonX and list all available models
- Load questions and contexts from the `data/` directory
- Check for existing results in the CSV file (for resume capability)
- Test each model with each question (with and without context)
  - Skips already-tested combinations if resuming
  - Writes results to CSV immediately after each test
- Grade each response using Google Gemini
- Calculate percentile rankings for all results
- Display results in a formatted table
- Export final results with percentiles to `llm_grading_results.csv`
The tool automatically saves results as they come in and can resume from where it left off if interrupted:
- Incremental Saves: Each test result is appended to the CSV immediately after completion
- Automatic Resume: On restart, the tool checks the CSV for existing results and skips already-tested combinations
- Progress Tracking: The console shows how many existing results were found and how many tests remain
Example Resume Output:

```
✓ Found 12 models
✓ Loaded 5 questions with contexts
↻ Found 45 existing results, 75 tests remaining
```
This means you can safely stop the tool at any time (Ctrl+C) and restart it later to continue testing.
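The resume check described above can be sketched roughly as follows. This is a simplified illustration, not the tool's actual code; the function name `completed_combinations` is hypothetical, and the column names follow the CSV layout documented below:

```python
import csv
from pathlib import Path


def completed_combinations(csv_path: str) -> set[tuple[str, str, str]]:
    """Return the (model, question, context) combinations already in the
    results CSV, so they can be skipped after a restart.

    Column names are assumptions based on the documented CSV layout.
    """
    path = Path(csv_path)
    if not path.exists():
        return set()  # first run: nothing to skip
    with path.open(newline="") as f:
        return {
            (row["Model Name"], row["Question"], row["Context Provided"])
            for row in csv.DictReader(f)
        }
```

Before each test, the runner can check whether the combination is in this set and skip it if so.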
```
modelgrader/
├── src/
│   └── modelgrader/
│       ├── __init__.py          # Main entry point
│       ├── config.py            # Configuration with pydantic-settings
│       ├── logging.py           # Structlog configuration
│       ├── models.py            # Data models
│       ├── watsonx_client.py    # IBM WatsonX integration
│       ├── gemini_grader.py     # Google Gemini grading
│       ├── console_output.py    # Rich console output
│       ├── test_runner.py       # Test orchestration
│       └── csv_writer.py        # CSV export
├── data/
│   ├── questions/               # RHEL question files
│   │   ├── question_1.txt
│   │   ├── question_2.txt
│   │   ├── question_3.txt
│   │   ├── question_4.txt
│   │   └── question_5.txt
│   └── contexts/                # Context files for each question
│       ├── context_1.txt
│       ├── context_2.txt
│       ├── context_3.txt
│       ├── context_4.txt
│       └── context_5.txt
├── tests/                       # Pytest tests
├── .env.example                 # Example environment variables
├── pyproject.toml               # Project configuration
└── README.md                    # This file
```
The tool includes 5 RHEL-related questions:
- SELinux Troubleshooting: How to troubleshoot SELinux denials
- Systemd Service Configuration: Configuring services to auto-start and restart on failure
- Firewall Configuration: Using firewalld to allow HTTP/HTTPS traffic
- Package Management: Installing specific package versions with dnf and version locking
- Sudo Configuration: Configuring passwordless sudo for specific commands
Each question has a corresponding context file with detailed RHEL documentation.
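The pairing of `question_N.txt` and `context_N.txt` files can be loaded with a few lines of Python. This is a sketch, not the tool's actual loader; the function name `load_question_pairs` is hypothetical, but the file naming convention matches the layout above:

```python
from pathlib import Path


def load_question_pairs(
    questions_dir: str, contexts_dir: str, numbers: list[int]
) -> list[tuple[str, str]]:
    """Load (question, context) text pairs using the documented
    question_N.txt / context_N.txt naming convention."""
    pairs = []
    for n in numbers:
        question = (Path(questions_dir) / f"question_{n}.txt").read_text()
        context = (Path(contexts_dir) / f"context_{n}.txt").read_text()
        pairs.append((question, context))
    return pairs
```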
Gemini grades each response on four categories:
- Accuracy (50% weight): Technical correctness, command accuracy
- Completeness (20% weight): Thoroughness and detail
- Clarity (20% weight): Organization and readability
- Response Time (10% weight): Speed of response generation
The final score is calculated as:

```
weighted_score = (accuracy * 0.5) + (completeness * 0.2) + (clarity * 0.2) + (response_time * 0.1)
```
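In Python, the weighting works out like this (a minimal sketch; the hypothetical `weighted_score` helper assumes each category score is on a 0-100 scale):

```python
def weighted_score(
    accuracy: float, completeness: float, clarity: float, response_time: float
) -> float:
    """Combine four category scores (each 0-100) into one weighted score."""
    return (accuracy * 0.5) + (completeness * 0.2) + (clarity * 0.2) + (response_time * 0.1)


# Example: a response graded 90/80/85/70 across the four categories
# scores (90 * 0.5) + (80 * 0.2) + (85 * 0.2) + (70 * 0.1) = 85.0
print(weighted_score(90, 80, 85, 70))
```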
After all tests complete, weighted scores are converted to percentile rankings (0-100) to make it easier to compare relative performance across all models.
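One common way to compute such percentile ranks is the fraction of scores at or below each score; the tool's exact formula may differ, so treat this as an illustrative sketch:

```python
def percentile_ranks(scores: list[float]) -> list[float]:
    """Map each score to the percentage of scores at or below it (0-100).

    One common percentile-rank definition; the tool's exact formula
    may differ slightly.
    """
    n = len(scores)
    return [100.0 * sum(1 for other in scores if other <= s) / n for s in scores]
```

With this definition, the highest score always ranks at 100, and tied scores receive the same rank.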
The tool displays:
- Live progress bar during testing
- Results table with:
  - Model name
  - Question number
  - Context provided (Y/N)
  - Weighted score
  - Percentile rank (color-coded)
  - Individual category scores
  - Response time
- Summary statistics:
  - Best and worst performing models
  - Average scores
  - Context impact analysis
Results are saved to `llm_grading_results.csv` with columns:
- Model Name
- Question
- Context Provided
- Response
- Accuracy Score
- Completeness Score
- Clarity Score
- Response Time
- Weighted Score
- Percentile Rank
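Incremental export with these columns can be sketched with the standard `csv` module (an illustration, not the tool's actual `csv_writer.py`; the `append_result` name is hypothetical):

```python
import csv
from pathlib import Path

FIELDNAMES = [
    "Model Name", "Question", "Context Provided", "Response",
    "Accuracy Score", "Completeness Score", "Clarity Score",
    "Response Time", "Weighted Score", "Percentile Rank",
]


def append_result(csv_path: str, row: dict) -> None:
    """Append one result row, writing the header only when the file is new."""
    path = Path(csv_path)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

Appending one row at a time is what makes the resume behavior described earlier possible: every completed test survives an interruption.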
Run the tests:

```bash
uv run pytest
```

Lint the code:

```bash
uv run ruff check src/
```

Type-check the code:

```bash
uv run pyright src/
```

All configuration can be set via environment variables or a `.env` file:

- `WATSONX_API_KEY`: IBM WatsonX API key (required)
- `WATSONX_PROJECT_ID`: IBM WatsonX project ID (required)
- `WATSONX_URL`: WatsonX API URL (default: `https://us-south.ml.cloud.ibm.com`)
- `GEMINI_API_KEY`: Google Gemini API key (required)
- `GEMINI_MODEL`: Gemini model to use (default: `gemini-2.0-flash-exp`)
- `OUTPUT_CSV_PATH`: Output CSV file path (default: `llm_grading_results.csv`)
- `QUESTIONS_DIR`: Questions directory (default: `data/questions`)
- `CONTEXTS_DIR`: Contexts directory (default: `data/contexts`)
- `REQUEST_TIMEOUT`: API request timeout in seconds (default: 120)
- `QUESTION_NUMBERS`: Comma-separated question numbers to test (default: `1,2,3,4,5`)
  - Example: `QUESTION_NUMBERS=1` to test only question 1
  - Example: `QUESTION_NUMBERS=1,3,5` to test questions 1, 3, and 5
To reduce load or test incrementally, you can limit which questions to test by setting QUESTION_NUMBERS in your .env file:
```bash
# Test only question 1
QUESTION_NUMBERS=1

# Or test questions 1 and 2
QUESTION_NUMBERS=1,2
```

This is useful for:
- Initial testing with a smaller dataset
- Reducing API usage and costs
- Testing specific questions of interest
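Parsing the comma-separated `QUESTION_NUMBERS` value amounts to a one-liner; this sketch uses a hypothetical `selected_questions` helper (the real tool reads its settings through pydantic-settings in `config.py`):

```python
import os


def selected_questions(default: str = "1,2,3,4,5") -> list[int]:
    """Parse QUESTION_NUMBERS from the environment into a list of ints."""
    raw = os.environ.get("QUESTION_NUMBERS", default)
    return [int(part) for part in raw.split(",") if part.strip()]
```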
This is a proof-of-concept project. Use at your own discretion.
Major Hayden major@mhtx.net