Compare multiple Bedrock-hosted LLMs using production-like prompts. The framework measures latency, token usage, JSON validity, and cost, aggregates results, and visualizes comparisons in a Streamlit dashboard.
The manual run guide contains detailed step-by-step instructions for:
- Setting up your environment
- Configuring AWS credentials
- Running evaluations
- Launching the dashboard
- Troubleshooting common issues
📖 Click here to read the Manual Run Guide 📖
This README provides a quick overview, but the manual guide has comprehensive instructions for first-time setup.
- Claude 3.7 Sonnet (`us.anthropic.claude-3-7-sonnet-20250219-v1:0`)
- Llama 3.2 11B Instruct (`us.meta.llama3-2-11b-instruct-v1:0`)
- Multi-model evaluation on the same prompts
- Per-request metrics: input/output tokens, latency, validity, cost
- Aggregations: p50/p95 latency, averages, validity%, cost/request
- Interactive Streamlit dashboard with visualizations
- Config-driven models and pricing (YAML)
- CSV export functionality
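At its core, each evaluation is one timed Bedrock call plus a bit of bookkeeping. The sketch below is illustrative only, not the project's `evaluator.py`; it assumes the Bedrock Converse API and a hypothetical `evaluate_once` helper with pricing passed in by the caller:

```python
import json
import time

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-2")

def evaluate_once(model_id: str, prompt: str, expect_json: bool,
                  in_price_per_1k: float, out_price_per_1k: float) -> dict:
    """Send one prompt to one Bedrock model and record the core metrics."""
    start = time.perf_counter()
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1500, "temperature": 0.7, "topP": 0.95},
    )
    latency_ms = (time.perf_counter() - start) * 1000

    usage = resp["usage"]  # token counts reported by Bedrock
    text = resp["output"]["message"]["content"][0]["text"]

    json_valid = None
    if expect_json:
        try:
            json.loads(text)
            json_valid = True
        except json.JSONDecodeError:
            json_valid = False

    cost = (usage["inputTokens"] / 1000 * in_price_per_1k
            + usage["outputTokens"] / 1000 * out_price_per_1k)
    return {
        "latency_ms": latency_ms,
        "input_tokens": usage["inputTokens"],
        "output_tokens": usage["outputTokens"],
        "json_valid": json_valid,
        "cost_usd_total": cost,
    }
```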
AICostOptimizer/
├── src/
│   ├── __init__.py
│   ├── model_registry.py
│   ├── prompt_loader.py
│   ├── tokenizers.py
│   ├── evaluator.py
│   ├── metrics_logger.py
│   ├── report_generator.py
│   ├── dashboard.py
│   └── utils/
│       ├── bedrock_client.py
│       ├── timing.py
│       └── json_utils.py
├── configs/
│   ├── models.yaml
│   └── settings.yaml
├── data/
│   ├── test_prompts.csv
│   ├── runs/
│   │   ├── raw_metrics.csv (generated)
│   │   └── model_comparison.csv (generated)
│   └── cache/
├── scripts/
│   ├── run_evaluation.py
│   └── extract_prompts_from_json.py
├── .env.example
├── requirements.txt
└── README.md
💡 For detailed step-by-step instructions, see MANUAL_RUN_GUIDE.md
This is a quick reference. For comprehensive setup instructions, please refer to the Manual Run Guide.
git clone <repository-url>
cd Optimization

📌 Next: Read MANUAL_RUN_GUIDE.md for detailed setup instructions!
Windows PowerShell:
# Create virtual environment
python -m venv .venv
# Activate virtual environment
.venv\Scripts\Activate.ps1

Windows CMD:
python -m venv .venv
.venv\Scripts\activate.bat

Linux/Mac:
python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

This will install all required packages, including:
- boto3 (AWS SDK)
- pandas (data manipulation)
- streamlit (dashboard)
- plotly (visualizations)
- tiktoken (token counting)
- tqdm (progress bars)
- and other dependencies
- `us.anthropic.claude-3-7-sonnet-20250219-v1:0` (Claude 3.7 Sonnet)
- `us.meta.llama3-2-11b-instruct-v1:0` (Llama 3.2 11B Instruct)
Make sure these models are enabled in your AWS Bedrock account before running evaluations.
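If you want to verify access up front, a quick check along these lines can help. This is a sketch, not part of the project; it assumes a reasonably recent boto3 that exposes `list_inference_profiles`, and note that the `us.`-prefixed IDs used here are cross-region inference profiles rather than plain foundation-model IDs:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-2")

# Foundation models enabled in this region (plain IDs, without the "us." prefix).
model_ids = {m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]}

# Cross-region inference profiles -- the "us."-prefixed IDs used in configs/models.yaml.
profile_ids = {p["inferenceProfileId"]
               for p in bedrock.list_inference_profiles()["inferenceProfileSummaries"]}

for wanted in ("us.anthropic.claude-3-7-sonnet-20250219-v1:0",
               "us.meta.llama3-2-11b-instruct-v1:0"):
    status = "available" if wanted in (profile_ids | model_ids) else "NOT available"
    print(f"{wanted}: {status}")
```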
You have three options for AWS credentials:
Option A: Using .env file (Recommended)
- Copy the example environment file:

  # Windows PowerShell
  Copy-Item .env.example .env

  # Linux/Mac
  cp .env.example .env

- Edit the .env file and add your AWS credentials:

  AWS_ACCESS_KEY_ID=your_access_key_here
  AWS_SECRET_ACCESS_KEY=your_secret_key_here
  AWS_REGION=us-east-2

⚠️ Important: Never commit the .env file to version control!
Option B: Using AWS Profile
If you have AWS CLI configured, you can set the profile in config.py:
AWS_PROFILE = "your-profile-name"

Or set it as an environment variable:

export AWS_PROFILE=your-profile-name

Option C: Using AWS Credentials File
Configure AWS credentials using AWS CLI:
aws configure

Or manually edit ~/.aws/credentials:
[default]
aws_access_key_id = your_access_key_here
aws_secret_access_key = your_secret_key_here
region = us-east-2

Note: If you don't set credentials, the code will try to use:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- AWS Profile (if configured)
- Default AWS credentials (~/.aws/credentials or IAM role)
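Whichever option you choose, boto3 resolves credentials in roughly that order. Here is a minimal sketch of building a Bedrock client that honors a .env file and an optional profile; it is illustrative only, not necessarily how `src/utils/bedrock_client.py` is written, and it assumes python-dotenv is installed:

```python
import os

import boto3
from dotenv import load_dotenv  # python-dotenv; only needed for the .env approach (Option A)

load_dotenv()  # exposes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION from .env

# Option B: honor AWS_PROFILE if it is set; otherwise fall back to boto3's default chain
# (environment variables, ~/.aws/credentials, or an IAM role).
profile = os.getenv("AWS_PROFILE")
session = boto3.Session(profile_name=profile) if profile else boto3.Session()

client = session.client("bedrock-runtime", region_name=os.getenv("AWS_REGION", "us-east-2"))
print("Using region:", client.meta.region_name)
```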
If you have Bedrock CloudTrail JSON logs and want to extract prompts:
- Place your JSON log file in the data/ directory (e.g., data/my_logs.json)

- Run the extraction script:

  python scripts/extract_prompts_from_json.py --input data/my_logs.json --output data/test_prompts.csv

  Or use the default file if it exists:

  python scripts/extract_prompts_from_json.py

- The script will:
  - Extract all prompts from the JSON log file
  - Combine multi-message conversations into complete prompts
  - Detect JSON-expected prompts automatically
  - Save the results to data/test_prompts.csv

- Verify the output:

  # Check the CSV file
  # Windows PowerShell
  Get-Content data/test_prompts.csv -Head 5

  # Linux/Mac
  head -5 data/test_prompts.csv
CSV Format:
prompt_id,prompt,expected_json,category
1,"Your complete prompt text here...",True,converse
2,"Another prompt...",False,general
- Edit configs/models.yaml and update it with your Bedrock model IDs:
region_name: us-east-2  # Change if your models are in a different region

models:
  - name: Claude 3.7 Sonnet
    provider: anthropic
    bedrock_model_id: us.anthropic.claude-3-7-sonnet-20250219-v1:0
    tokenizer: anthropic
    pricing:
      input_per_1k_tokens_usd: 0.008
      output_per_1k_tokens_usd: 0.024
    generation_params:
      max_tokens: 1500
      temperature: 0.7
      top_p: 0.95

  - name: Llama 3.2 11B Instruct
    provider: meta
    bedrock_model_id: us.meta.llama3-2-11b-instruct-v1:0
    tokenizer: llama
    pricing:
      input_per_1k_tokens_usd: 0.0006
      output_per_1k_tokens_usd: 0.0008
    generation_params:
      max_tokens: 1500
      temperature: 0.7
      top_p: 0.95

💡 Tips:
- Find available Bedrock models in AWS Console → Amazon Bedrock → Foundation models
- Update pricing from AWS pricing pages (prices may vary by region)
- Use model IDs that match your AWS account's available models
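The pricing fields above drive the cost metrics. A small sketch (assuming PyYAML; not the project's actual code) of how a per-request cost can be derived from them:

```python
import yaml  # PyYAML

with open("configs/models.yaml") as f:
    cfg = yaml.safe_load(f)

def request_cost_usd(model_cfg: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, derived from the per-1k-token prices in models.yaml."""
    pricing = model_cfg["pricing"]
    return (input_tokens / 1000 * pricing["input_per_1k_tokens_usd"]
            + output_tokens / 1000 * pricing["output_per_1k_tokens_usd"])

claude = next(m for m in cfg["models"] if m["name"] == "Claude 3.7 Sonnet")
print(round(request_cost_usd(claude, input_tokens=1250, output_tokens=400), 6))  # 0.0196
```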
- (Optional) Edit configs/settings.yaml for advanced settings:
region_name: us-east-1
output_dir: data/runs
use_sqlite: false
cache_enabled: true
concurrency: 4
logging_level: INFO

Test Run (First 3 Prompts - Recommended for First Time):
python scripts/run_evaluation.py --models all --limit 3

Full Evaluation (All Prompts):

python scripts/run_evaluation.py --models all --prompts data/test_prompts.csv --out data/runs

Evaluate Specific Models:

python scripts/run_evaluation.py --models "Claude 3.7 Sonnet,Llama 3.2 11B Instruct"

Other Useful Options:
# Custom output directory
python scripts/run_evaluation.py --models all --out data/my_results
# Custom run ID
python scripts/run_evaluation.py --models all --run-id my_test_run
# Skip report generation (faster for testing)
python scripts/run_evaluation.py --models all --limit 5 --skip-report

What Happens During Evaluation:
- Loads prompts from CSV
- For each prompt and each model:
- Sends request to Bedrock API
- Measures latency (start to finish)
- Counts input/output tokens
- Calculates cost based on pricing
- Validates JSON if expected
- Records all metrics
- Saves raw metrics to data/runs/raw_metrics.csv
- Generates an aggregated report to data/runs/model_comparison.csv
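The "Validates JSON if expected" step above could be implemented roughly as follows; this is a guess at the approach (tolerating a Markdown code fence around the JSON), not the actual `src/utils/json_utils.py`:

```python
import json
import re

def is_valid_json(text: str) -> bool:
    """True if the model output parses as JSON, tolerating a Markdown code fence."""
    candidate = text.strip()
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", candidate, re.DOTALL)
    if fenced:
        candidate = fenced.group(1).strip()
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"status": "ok", "items": [1, 2, 3]}'))  # True
print(is_valid_json("Sure! Here is some prose instead."))     # False
```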
Progress Indicators:
- You'll see a progress bar showing evaluation status
- Success (✅) or error (❌) indicators per evaluation
- Final summary with metrics
📖 For detailed dashboard setup and troubleshooting, see MANUAL_RUN_GUIDE.md - Step 12
- Launch the Streamlit dashboard:

  streamlit run src/dashboard.py

- The dashboard will automatically open in your browser at http://localhost:8501

- If it doesn't open automatically, copy the URL from the terminal output.
Note: If you encounter connection issues, check the Troubleshooting section in the manual guide.
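For context, the kind of view the dashboard builds can be sketched in a few lines of Streamlit and Plotly. This is illustrative only, not the actual `src/dashboard.py`:

```python
import pandas as pd
import plotly.express as px
import streamlit as st

# Load the per-request metrics produced by run_evaluation.py.
df = pd.read_csv("data/runs/raw_metrics.csv")

# Sidebar filter on model name, defaulting to all models.
all_models = sorted(df["model_name"].unique())
selected = st.sidebar.multiselect("Models", all_models, default=all_models)
filtered = df[df["model_name"].isin(selected)]

st.metric("Total evaluations", len(filtered))
st.plotly_chart(px.box(filtered, x="model_name", y="latency_ms",
                       title="Latency distribution (ms)"))
st.download_button("Download filtered CSV",
                   filtered.to_csv(index=False), "filtered_metrics.csv")
```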
Dashboard Features:
- Summary Cards: Total evaluations, success rate, total cost, models compared
- Model Comparison Table: Aggregated metrics per model
- Best Performer Highlights: Best latency, cost, and JSON validity
- Visualizations:
- Latency Tab: Box plots showing latency distribution
- Tokens Tab: Bar charts for average input/output tokens
- Cost Tab: Cost distribution and average cost per request
- JSON Validity Tab: Validity percentage by model
- Filters:
- Select specific models to compare
- Filter by prompt IDs
- Filter by status (success/error)
- Export: Download filtered results as CSV
Using the Dashboard:
- Use sidebar to select data file paths (if different from defaults)
- Apply filters to focus on specific models or prompts
- Explore visualizations in different tabs
- Export results using download buttons
Key Metrics to Compare:
- Latency (p50/p95/p99):
  - Lower is better
  - p95 shows worst-case performance
  - Important for user-facing applications

- Cost per Request:
  - Lower is better for cost optimization
  - Consider both input and output token costs
  - Multiply by expected request volume (see the projection sketch after this list)

- JSON Validity:
  - Higher percentage is better
  - Critical if you require structured outputs

- Token Usage:
  - Compare input vs. output tokens
  - Models with lower token usage may be more cost-effective

- Success Rate:
  - Check for errors in the raw metrics
  - Investigate any model-specific failures
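To turn per-request cost into a monthly estimate, a quick back-of-the-envelope projection is enough. The per-request figure below matches the sample output later in this README; the traffic volume is hypothetical:

```python
# Hypothetical traffic volume; per-request cost taken from the sample model_comparison.csv output.
avg_cost_usd_per_request = 0.005234
requests_per_day = 10_000

monthly_cost = avg_cost_usd_per_request * requests_per_day * 30
print(f"Projected monthly cost: ${monthly_cost:,.2f}")  # Projected monthly cost: $1,570.20
```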
Files Generated:
- data/runs/raw_metrics.csv: per-request detailed metrics
- data/runs/model_comparison.csv: aggregated comparison table
📖 For the comprehensive troubleshooting guide, see MANUAL_RUN_GUIDE.md - Troubleshooting Section
Solution: Install dependencies:
pip install -r requirements.txt

See MANUAL_RUN_GUIDE.md - Step 4 for detailed instructions.
Solution: Configure AWS credentials. See MANUAL_RUN_GUIDE.md - Step 5 for detailed setup instructions.
Solution: Check that your model IDs in configs/models.yaml match available models in your AWS account. See MANUAL_RUN_GUIDE.md - Step 8.
Solution: Ensure your AWS credentials have permissions for Amazon Bedrock. See MANUAL_RUN_GUIDE.md - Step 6.
Solution:
- Run evaluation first (Step 7), then check file paths in dashboard sidebar
- See MANUAL_RUN_GUIDE.md - Step 12 for dashboard launch instructions
- Check the troubleshooting section in the manual guide for connection issues
Solution:
- Use --limit to test with fewer prompts first
- Check your internet connection
- Verify Bedrock model availability in your region
- See MANUAL_RUN_GUIDE.md - Step 9 for test evaluation instructions
After running evaluation, you'll see:
✅ Found 2 model(s): ['Claude 3.7 Sonnet', 'Llama 3.2 11B Instruct']
✅ Loaded 16 prompt(s)
🏃 Run ID: run_20250101_120000
🚀 Starting evaluation...
Models: 2
Prompts: 16
Total evaluations: 32
Evaluating: 100%|████████████| 32/32 [02:15<00:00, 2.81s/eval]
✅ Evaluation complete! Collected 32 metric records
💾 Saving metrics...
Saved to: data/runs/raw_metrics.csv
📊 Generating aggregated report...
Saved to: data/runs/model_comparison.csv
📊 Summary:
model_name count avg_input_tokens p95_latency_ms avg_cost_usd_per_request
Claude 3.7 Sonnet 16 1250.3 2150.5 0.008425
Llama 3.2 11B Instruct 16 1248.7 2890.2 0.005234
Prompts CSV (data/test_prompts.csv):
prompt_id,prompt,expected_json,category
1,"Your prompt text here...",true,converse
2,"Another prompt...",false,general
Raw Metrics CSV (data/runs/raw_metrics.csv):
- timestamp, run_id, model_name, model_id, prompt_id
- input_tokens, output_tokens, latency_ms
- json_valid, error, status
- cost_usd_input, cost_usd_output, cost_usd_total
Comparison CSV (data/runs/model_comparison.csv):
- model_name, count, success_count, error_count
- avg_input_tokens, avg_output_tokens
- p50_latency_ms, p95_latency_ms, p99_latency_ms
- json_valid_pct
- avg_cost_usd_per_request, total_cost_usd
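If you want to recompute or extend the aggregated report yourself, a rough pandas sketch over raw_metrics.csv (assuming the column names listed above and a status value of "success") could look like this:

```python
import pandas as pd

raw = pd.read_csv("data/runs/raw_metrics.csv")
ok = raw[raw["status"] == "success"]  # drop failed requests before aggregating

summary = ok.groupby("model_name").agg(
    count=("latency_ms", "size"),
    avg_input_tokens=("input_tokens", "mean"),
    p50_latency_ms=("latency_ms", lambda s: s.quantile(0.50)),
    p95_latency_ms=("latency_ms", lambda s: s.quantile(0.95)),
    json_valid_pct=("json_valid", lambda s: 100 * s.mean()),
    avg_cost_usd_per_request=("cost_usd_total", "mean"),
)
print(summary.sort_values("avg_cost_usd_per_request"))
```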
- Never commit the .env file or AWS credentials to version control
- Use AWS IAM roles with minimal required permissions
- Review AWS CloudTrail logs regularly
- Keep dependencies updated for security patches
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
[Add your license here]
- Start with --limit 3 to test before a full evaluation
- Use different run IDs to compare different configurations
- Export results regularly for tracking over time
- Monitor AWS costs in CloudWatch
If you're setting up the project for the first time:
- Read the MANUAL_RUN_GUIDE.md - It has step-by-step instructions for everything
- Check the Troubleshooting section in the manual guide for common issues
- Follow the checklist at the end of the manual guide to ensure you've completed all steps
The manual guide covers:
- ✅ Virtual environment setup
- ✅ Dependency installation
- ✅ AWS credentials configuration
- ✅ Running evaluations
- ✅ Launching the dashboard
- ✅ Troubleshooting
Happy Evaluating! 🎉