# API Provider Validation Notebook

This notebook tests ONE model from EACH API provider to validate your experiment setup before submitting batch jobs.

**API Types Tested:**
1. **OpenAI** - `gpt-4o-mini` (cheapest)
2. **Anthropic** - `claude-3-haiku` (cheapest)
3. **Google** - `gemini-2-0-flash-lite` (cheapest)
4. **XAI** - `grok-3-mini` (cheapest)
5. **OpenRouter** - `amazon-nova-micro` (cheapest)
6. **Princeton Cluster** - `Llama-3.2-3B-Instruct` (local HuggingFace)

**Run this notebook before batch submissions to catch API issues early!**

In [1]:
# Setup - Add project root to path
import sys
import os
from pathlib import Path

# Find project root (parent of tests/)
notebook_dir = Path.cwd()
if notebook_dir.name == 'tests':
    project_root = notebook_dir.parent
else:
    project_root = notebook_dir

sys.path.insert(0, str(project_root))
os.chdir(project_root)

print(f"Project root: {project_root}")
print(f"Working directory: {os.getcwd()}")

Project root: /scratch/gpfs/DANQIC/jz4391/bargain
Working directory: /scratch/gpfs/DANQIC/jz4391/bargain


In [2]:
# Note: `!` runs this in a subshell, so it does not change the kernel's environment.
!source ~/.bashrc
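
If the keys are defined in `~/.bashrc` but Jupyter was started without them, one option is to read the export lines from Python and copy them into `os.environ`. The following is a minimal sketch, assuming the keys are written as simple one-line `export NAME=value` statements; `parse_exports` is a hypothetical helper, not part of the project, and it does not evaluate any shell logic.

```python
import re
import shlex

# Hypothetical helper (not part of the project): extract simple
# `export NAME=value` lines from shell-config text.
# Assumption: values are literals; $VAR expansion, command substitution,
# and conditionals are NOT evaluated.
_EXPORT_LINE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_exports(text):
    """Return {name: value} for each simple export line in `text`."""
    found = {}
    for line in text.splitlines():
        match = _EXPORT_LINE.match(line)
        if match is None:
            continue
        name, raw_value = match.groups()
        try:
            # shlex removes quotes and trailing `# comments` using POSIX rules
            tokens = shlex.split(raw_value, comments=True)
        except ValueError:
            continue  # unbalanced quotes: skip rather than guess
        found[name] = tokens[0] if tokens else ""
    return found


# Made-up example text, standing in for the contents of a shell config file
example = 'export DEMO_KEY="abc 123"\nexport OTHER=xyz  # trailing comment\nalias ll="ls -l"\n'
print(parse_exports(example))
```

In the notebook this could be applied with `os.environ.update(parse_exports(Path("~/.bashrc").expanduser().read_text()))`, run before the API key check.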

In [None]:
# Imports
import asyncio
import json
import time
from datetime import datetime
from typing import Dict, Any, Optional, Tuple
import traceback

# Check for required environment variables
print("=" * 60)
print("API KEY STATUS")
print("=" * 60)

api_keys = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
    "ANTHROPIC_API_KEY": os.getenv("ANTHROPIC_API_KEY"),
    "GOOGLE_API_KEY": os.getenv("GOOGLE_API_KEY"),
    "XAI_API_KEY": os.getenv("XAI_API_KEY"),
    "OPENROUTER_API_KEY": os.getenv("OPENROUTER_API_KEY"),
}

for key_name, key_value in api_keys.items():
    status = "SET" if key_value else "MISSING"
    icon = "\u2705" if key_value else "\u274c"
    print(f"{icon} {key_name}: {status}")

print("\nNote: Princeton Cluster models use local HuggingFace - no API key needed")

API KEY STATUS
✅ OPENAI_API_KEY: SET
✅ ANTHROPIC_API_KEY: SET
✅ GOOGLE_API_KEY: SET
✅ XAI_API_KEY: SET
✅ OPENROUTER_API_KEY: SET

Note: Princeton Cluster models use local HuggingFace - no API key needed


In [5]:
# Import project modules
from strong_models_experiment.configs import STRONG_MODELS_CONFIG
from strong_models_experiment.agents import StrongModelAgentFactory
from negotiation.llm_agents import NegotiationContext

# Display available models by API type
print("=" * 60)
print("MODELS BY API TYPE")
print("=" * 60)

models_by_api = {}
for model_name, config in STRONG_MODELS_CONFIG.items():
    # Skip deprecated models (and avoid creating empty groups for them)
    if config.get("deprecated", False):
        continue
    api_type = config.get("api_type", "unknown")
    models_by_api.setdefault(api_type, []).append(model_name)

for api_type, models in sorted(models_by_api.items()):
    print(f"\n{api_type.upper()} ({len(models)} models):")
    for m in models[:5]:  # Show first 5
        print(f"  - {m}")
    if len(models) > 5:
        print(f"  ... and {len(models) - 5} more")

MODELS BY API TYPE

ANTHROPIC (9 models):
  - claude-4-5-haiku
  - claude-4-sonnet
  - claude-sonnet-4-5
  - claude-4-1-opus
  - claude-opus-4-5-thinking-32k
  ... and 4 more

GOOGLE (7 models):
  - gemini-1-5-pro
  - gemini-2-5-pro
  - gemini-2-0-flash
  - gemini-2-0-flash-lite
  - gemini-3-pro
  ... and 2 more

OPENAI (16 models):
  - gpt-4o
  - gpt-4o-latest
  - gpt-4o-mini
  - o1
  - gpt-5-low-effort
  ... and 11 more

OPENROUTER (11 models):
  - llama-3-1-405b
  - glm-4.7
  - qwen3-max
  - deepseek-r1-0528
  - grok-4
  ... and 6 more

PRINCETON_CLUSTER (18 models):
  - gemma-3-27b-it
  - QwQ-32B
  - llama-3.3-70b-instruct
  - Qwen2.5-72B-Instruct
  - gemma-2-27b-it
  ... and 13 more

XAI (4 models):
  - grok-4-0709
  - grok-3
  - grok-3-mini
  - grok-4-1-thinking


---
## Test Configuration

Select one cheap/fast model from each provider for testing.

In [6]:
# Test models - one from each provider (cheapest/fastest options)
TEST_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-3-haiku",
    "google": "gemini-2-0-flash-lite",
    "xai": "grok-3-mini",
    "openrouter": "amazon-nova-micro",
    "princeton_cluster": "Llama-3.2-3B-Instruct",
}

# Simple test prompt
TEST_PROMPT = "Say 'Hello, I am working!' and nothing else."

print("Test models selected:")
for api_type, model in TEST_MODELS.items():
    print(f"  {api_type}: {model}")

Test models selected:
  openai: gpt-4o-mini
  anthropic: claude-3-haiku
  google: gemini-2-0-flash-lite
  xai: grok-3-mini
  openrouter: amazon-nova-micro
  princeton_cluster: Llama-3.2-3B-Instruct
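
Before spending API calls, it may help to confirm that each selected model is actually present in the config under the expected provider. The sketch below assumes only what the cells above already rely on (each config entry is a dict with an `api_type` key and an optional `deprecated` flag); `find_selection_problems` and the toy data are hypothetical and use made-up model names.

```python
from typing import Dict, List


def find_selection_problems(
    test_models: Dict[str, str], models_config: Dict[str, dict]
) -> List[str]:
    """Return human-readable problems with a provider -> model selection."""
    problems = []
    for api_type, model_name in test_models.items():
        entry = models_config.get(model_name)
        if entry is None:
            problems.append(f"{model_name}: not in config")
        elif entry.get("api_type") != api_type:
            problems.append(
                f"{model_name}: api_type is {entry.get('api_type')!r}, expected {api_type!r}"
            )
        elif entry.get("deprecated", False):
            problems.append(f"{model_name}: marked deprecated")
    return problems


# Toy stand-ins for STRONG_MODELS_CONFIG and TEST_MODELS (made-up names and values)
toy_config = {
    "model-a": {"api_type": "openai"},
    "model-b": {"api_type": "xai", "deprecated": True},
    "model-c": {"api_type": "openrouter"},
}
toy_selection = {
    "openai": "model-a",     # fine
    "xai": "model-b",        # deprecated
    "google": "model-c",     # wrong provider
    "anthropic": "model-d",  # missing from config
}

problems = find_selection_problems(toy_selection, toy_config)
for problem in problems:
    print(problem)
```

In the notebook, the call would be `find_selection_problems(TEST_MODELS, STRONG_MODELS_CONFIG)`, with an empty list meaning the selection is consistent with the config.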


---
## Helper Functions

In [7]:
# Results tracking
test_results = {}

def create_minimal_context() -> NegotiationContext:
    """
    Create a minimal NegotiationContext for testing.
    The generate_response method requires this context object.
    """
    return NegotiationContext(
        current_round=1,
        max_rounds=1,
        items=[{"name": "test_item", "value": 1}],
        agents=["test_agent"],
        agent_id="test_agent",
        preferences={"test_item": 1},
        conversation_history=[],
        turn_type="discussion"
    )


async def test_single_model(model_name: str, api_type: str) -> Tuple[bool, str, float]:
    """
    Test a single model with a simple prompt.
    
    Returns:
        Tuple of (success: bool, message: str, latency: float)
    """
    factory = StrongModelAgentFactory()
    start_time = time.time()
    
    try:
        # Create a single agent
        config = {"max_tokens_default": 100}
        agents = await factory.create_agents([model_name], config)
        
        if not agents:
            return False, "Failed to create agent (check API key)", 0.0
        
        agent = agents[0]
        
        # Create minimal context for testing
        context = create_minimal_context()
        
        # Send test prompt using correct method signature
        response = await agent.generate_response(
            context=context,
            prompt=TEST_PROMPT
        )
        
        latency = time.time() - start_time
        
        if response and response.content:
            # Truncate response for display
            content = response.content
            display_response = (content[:100] + "...") if len(content) > 100 else content
            return True, f"Response: {display_response}", latency
        else:
            return False, "Empty response received", latency
            
    except Exception as e:
        latency = time.time() - start_time
        error_msg = str(e)
        # Truncate long error messages
        if len(error_msg) > 200:
            error_msg = error_msg[:200] + "..."
        return False, f"Error: {error_msg}", latency


def print_result(api_type: str, model: str, success: bool, message: str, latency: float):
    """Pretty print test result."""
    icon = "\u2705" if success else "\u274c"
    status = "PASS" if success else "FAIL"
    print(f"\n{icon} [{api_type.upper()}] {model}")
    print(f"   Status: {status}")
    print(f"   Latency: {latency:.2f}s")
    print(f"   {message}")

---
## 1. Test OpenAI API

In [6]:
api_type = "openai"
model = TEST_MODELS[api_type]

print(f"Testing {api_type.upper()} with model: {model}")
print("-" * 50)

if not os.getenv("OPENAI_API_KEY"):
    print("\u274c OPENAI_API_KEY not set - skipping")
    test_results[api_type] = {"success": False, "message": "API key missing", "latency": 0}
else:
    success, message, latency = await test_single_model(model, api_type)
    test_results[api_type] = {"success": success, "message": message, "latency": latency, "model": model}
    print_result(api_type, model, success, message, latency)

Testing OPENAI with model: gpt-4o-mini
--------------------------------------------------

✅ [OPENAI] gpt-4o-mini
   Status: PASS
   Latency: 1.79s
   Response: Hello, I am working!


---
## 2. Test Anthropic API

In [7]:
api_type = "anthropic"
model = TEST_MODELS[api_type]

print(f"Testing {api_type.upper()} with model: {model}")
print("-" * 50)

if not os.getenv("ANTHROPIC_API_KEY"):
    print("\u274c ANTHROPIC_API_KEY not set - skipping")
    test_results[api_type] = {"success": False, "message": "API key missing", "latency": 0}
else:
    success, message, latency = await test_single_model(model, api_type)
    test_results[api_type] = {"success": success, "message": message, "latency": latency, "model": model}
    print_result(api_type, model, success, message, latency)

Testing ANTHROPIC with model: claude-3-haiku
--------------------------------------------------

✅ [ANTHROPIC] claude-3-haiku
   Status: PASS
   Latency: 1.09s
   Response: Hello, I am working!


---
## 3. Test Google API

In [8]:
api_type = "google"
model = TEST_MODELS[api_type]

print(f"Testing {api_type.upper()} with model: {model}")
print("-" * 50)

if not os.getenv("GOOGLE_API_KEY"):
    print("\u274c GOOGLE_API_KEY not set - skipping")
    test_results[api_type] = {"success": False, "message": "API key missing", "latency": 0}
else:
    success, message, latency = await test_single_model(model, api_type)
    test_results[api_type] = {"success": success, "message": message, "latency": latency, "model": model}
    print_result(api_type, model, success, message, latency)

Testing GOOGLE with model: gemini-2-0-flash-lite
--------------------------------------------------

✅ [GOOGLE] gemini-2-0-flash-lite
   Status: PASS
   Latency: 1.63s
   Response: Hello, I am working!



---
## 4. Test XAI (Grok) API

In [10]:
api_type = "xai"
model = TEST_MODELS[api_type]

print(f"Testing {api_type.upper()} with model: {model}")
print("-" * 50)

if not os.getenv("XAI_API_KEY"):
    print("\u274c XAI_API_KEY not set - skipping")
    test_results[api_type] = {"success": False, "message": "API key missing", "latency": 0}
else:
    success, message, latency = await test_single_model(model, api_type)
    test_results[api_type] = {"success": success, "message": message, "latency": latency, "model": model}
    print_result(api_type, model, success, message, latency)

Testing XAI with model: grok-3-mini
--------------------------------------------------
❌ XAI_API_KEY not set - skipping


---
## 5. Test OpenRouter API

In [10]:
api_type = "openrouter"
model = TEST_MODELS[api_type]

print(f"Testing {api_type.upper()} with model: {model}")
print("-" * 50)

if not os.getenv("OPENROUTER_API_KEY"):
    print("\u274c OPENROUTER_API_KEY not set - skipping")
    test_results[api_type] = {"success": False, "message": "API key missing", "latency": 0}
else:
    success, message, latency = await test_single_model(model, api_type)
    test_results[api_type] = {"success": success, "message": message, "latency": latency, "model": model}
    print_result(api_type, model, success, message, latency)

Testing OPENROUTER with model: amazon-nova-micro
--------------------------------------------------
[client] Wrote request_1768990049-056579.json
[client] Got response for 1768990049-056579: success

✅ [OPENROUTER] amazon-nova-micro
   Status: PASS
   Latency: 1.09s
   Response: Hello, I am working! Let's make sure this negotiation is as efficient and mutually beneficial as pos...
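
The five API cells above share one shape: look up the model, check the key, call the tester, store the result. As a sketch of how they could be collapsed into a single loop, the code below takes the tester as a parameter so it can be tried without network access; `run_api_checks` and `fake_tester` are hypothetical names, and a missing key is recorded as a failure to match the cells above.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict, Mapping, Tuple

# Provider -> environment variable, as used in the cells above
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

Tester = Callable[[str, str], Awaitable[Tuple[bool, str, float]]]


async def run_api_checks(
    test_models: Mapping[str, str],
    tester: Tester,
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, dict]:
    """Run `tester` once per provider that has a selected model."""
    results = {}
    for api_type, key_var in API_KEY_VARS.items():
        if api_type not in test_models:
            continue
        model = test_models[api_type]
        if not environ.get(key_var):
            # Same bookkeeping as the individual cells: a missing key is a failure
            results[api_type] = {
                "success": False, "message": "API key missing",
                "latency": 0, "model": model,
            }
            continue
        success, message, latency = await tester(model, api_type)
        results[api_type] = {
            "success": success, "message": message,
            "latency": latency, "model": model,
        }
    return results


async def fake_tester(model: str, api_type: str) -> Tuple[bool, str, float]:
    """Stand-in for test_single_model; makes no network calls."""
    return True, f"Response from {model}", 0.0
```

In a notebook cell this would be called as `test_results.update(await run_api_checks(TEST_MODELS, test_single_model))`; outside a notebook, wrap the call in `asyncio.run(...)`.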


---
## 6. Test Princeton Cluster (Local HuggingFace)

**Note:** This requires:
1. Running on a cluster node with GPU access (e.g., della-gpu)
2. Model weights available at `/scratch/gpfs/DANQIC/models/`
3. `transformers` and `torch` packages installed

If not on cluster, this test will be skipped.

In [11]:
api_type = "princeton_cluster"
model = TEST_MODELS[api_type]

print(f"Testing {api_type.upper()} with model: {model}")
print("-" * 50)

# Check if we're on the cluster by looking for model directory
model_config = STRONG_MODELS_CONFIG.get(model, {})
local_path = model_config.get("local_path", "")
models_base = "/scratch/gpfs/DANQIC/models"

if not os.path.exists(models_base):
    print("\u26a0\ufe0f Not on Princeton cluster - skipping local model test")
    print(f"   (Models directory not found: {models_base})")
    print("   Run this notebook on della-gpu or similar to test local models")
    test_results[api_type] = {"success": None, "message": "Not on cluster", "latency": 0}
elif local_path and not os.path.exists(local_path):
    print(f"\u274c Model path not found: {local_path}")
    print("   Available models in /scratch/gpfs/DANQIC/models/:")
    try:
        for m in sorted(os.listdir(models_base))[:10]:
            print(f"     - {m}")
    except OSError:
        pass  # directory unreadable; the listing is only a hint
    test_results[api_type] = {"success": False, "message": f"Model path missing: {local_path}", "latency": 0}
else:
    # Check for GPU availability
    try:
        import torch
        gpu_available = torch.cuda.is_available()
        if gpu_available:
            gpu_name = torch.cuda.get_device_name(0)
            print(f"   GPU detected: {gpu_name}")
        else:
            print("   \u26a0\ufe0f No GPU detected - model loading will be slow")
    except ImportError:
        print("   \u26a0\ufe0f torch not available")
    
    print(f"   Model path: {local_path}")
    print("   Loading model (this may take a while)...")
    
    success, message, latency = await test_single_model(model, api_type)
    test_results[api_type] = {"success": success, "message": message, "latency": latency, "model": model}
    print_result(api_type, model, success, message, latency)

Testing PRINCETON_CLUSTER with model: Llama-3.2-3B-Instruct
--------------------------------------------------
   GPU detected: NVIDIA A100-PCIE-40GB
   Model path: /scratch/gpfs/DANQIC/models/Llama-3.2-3B-Instruct
   Loading model (this may take a while)...


Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.70s/it]



✅ [PRINCETON_CLUSTER] Llama-3.2-3B-Instruct
   Status: PASS
   Latency: 37.70s
   Response: Hello, I am working!
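
Loading a local model takes tens of seconds, so a cheap check on the directory contents can catch an incomplete download first. This is a rough sketch, assuming the usual Hugging Face checkpoint layout (a `config.json` plus `.safetensors` or `.bin` weight files); `looks_like_hf_checkpoint` is a hypothetical helper and does not verify that the weights are complete or loadable.

```python
from typing import Iterable

# Common weight-file suffixes in Hugging Face checkpoints
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def looks_like_hf_checkpoint(filenames: Iterable[str]) -> bool:
    """Heuristic: True if the names include config.json and at least one weight file."""
    names = list(filenames)
    has_config = "config.json" in names
    has_weights = any(name.endswith(WEIGHT_SUFFIXES) for name in names)
    return has_config and has_weights


# Made-up directory listings for illustration
complete = ["config.json", "tokenizer.json", "model-00001-of-00002.safetensors"]
incomplete = ["config.json", "tokenizer.json"]
print(looks_like_hf_checkpoint(complete), looks_like_hf_checkpoint(incomplete))
```

In the cell above it could be called as `looks_like_hf_checkpoint(os.listdir(local_path))` before `test_single_model`.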


---
## Summary

In [12]:
print("=" * 60)
print("TEST RESULTS SUMMARY")
print("=" * 60)
print(f"\nTimestamp: {datetime.now().isoformat()}")
print()

passed = 0
failed = 0
skipped = 0

for api_type, result in test_results.items():
    success = result.get("success")
    model = result.get("model", TEST_MODELS.get(api_type, "unknown"))
    latency = result.get("latency", 0)
    message = result.get("message", "")
    
    if success is True:
        icon = "\u2705"
        status = "PASS"
        passed += 1
    elif success is False:
        icon = "\u274c"
        status = "FAIL"
        failed += 1
    else:
        icon = "\u26a0\ufe0f"
        status = "SKIP"
        skipped += 1
    
    print(f"{icon} {api_type.upper():20} | {status:4} | {latency:6.2f}s | {model}")

print()
print("-" * 60)
total = passed + failed + skipped
print(f"Total: {total} | Passed: {passed} | Failed: {failed} | Skipped: {skipped}")

if failed > 0:
    print("\n\u26a0\ufe0f Some tests FAILED - fix issues before running batch experiments!")
elif passed == total - skipped and passed > 0:
    print("\n\u2705 All tested APIs are working! Safe to submit batch jobs.")
else:
    print("\n\u26a0\ufe0f Some tests were skipped - verify those APIs separately if needed.")

TEST RESULTS SUMMARY

Timestamp: 2026-01-21T05:08:23.748997

✅ OPENAI               | PASS |   1.79s | gpt-4o-mini
✅ ANTHROPIC            | PASS |   1.09s | claude-3-haiku
✅ GOOGLE               | PASS |   1.63s | gemini-2-0-flash-lite
❌ XAI                  | FAIL |   0.00s | grok-3-mini
✅ OPENROUTER           | PASS |   1.09s | amazon-nova-micro
✅ PRINCETON_CLUSTER    | PASS |  37.70s | Llama-3.2-3B-Instruct

------------------------------------------------------------
Total: 6 | Passed: 5 | Failed: 1 | Skipped: 0

⚠️ Some tests FAILED - fix issues before running batch experiments!


---
## (Optional) Run a Mini Negotiation Experiment

This runs a full 2-round negotiation with 2 items to test the complete pipeline.

In [13]:
# Set to True to run a mini experiment
RUN_MINI_EXPERIMENT = False

# Choose which API to test (use one that passed above)
MINI_EXPERIMENT_API = "anthropic"  # Change as needed
MINI_EXPERIMENT_MODEL = TEST_MODELS.get(MINI_EXPERIMENT_API, "claude-3-haiku")

In [14]:
if RUN_MINI_EXPERIMENT:
    from strong_models_experiment import StrongModelsExperiment
    import tempfile
    
    print(f"Running mini experiment with {MINI_EXPERIMENT_MODEL}...")
    print("Configuration: 2 agents, 2 items, 2 rounds")
    print("-" * 50)
    
    # Use temporary output directory
    with tempfile.TemporaryDirectory() as tmp_dir:
        experiment = StrongModelsExperiment(output_dir=tmp_dir)
        
        # Run a minimal experiment
        config = {
            "m_items": 2,
            "t_rounds": 2,
            "competition_level": 0.5,
            "random_seed": 42,
        }
        
        try:
            start = time.time()
            results = await experiment.run_single_experiment(
                models=[MINI_EXPERIMENT_MODEL, MINI_EXPERIMENT_MODEL],
                experiment_config=config
            )
            elapsed = time.time() - start
            
            print(f"\n\u2705 Mini experiment completed in {elapsed:.1f}s")
            print(f"   Consensus reached: {results.consensus_reached}")
            print(f"   Final round: {results.final_round}")
            print(f"   Final utilities: {results.final_utilities}")
            
        except Exception as e:
            print(f"\n\u274c Mini experiment failed: {e}")
            traceback.print_exc()
else:
    print("Mini experiment skipped. Set RUN_MINI_EXPERIMENT = True to run.")

Mini experiment skipped. Set RUN_MINI_EXPERIMENT = True to run.


---
## Export Results

In [15]:
# Save results to JSON for CI/automation
output_path = project_root / "tests" / "provider_test_results.json"
output_path.parent.mkdir(parents=True, exist_ok=True)  # create tests/ if it does not exist

export_data = {
    "timestamp": datetime.now().isoformat(),
    "summary": {
        "passed": passed,
        "failed": failed,
        "skipped": skipped,
    },
    "results": test_results,
}

with open(output_path, "w") as f:
    json.dump(export_data, f, indent=2, default=str)

print(f"Results saved to: {output_path}")

Results saved to: /scratch/gpfs/DANQIC/jz4391/bargain/tests/provider_test_results.json
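
Since the file is meant for CI/automation, a follow-up script could turn the `summary` block into a process exit code. The sketch below assumes only the `summary` keys written by the export cell above; `exit_code_from_results` and the specific exit codes are hypothetical choices, not an existing project convention.

```python
import json


def exit_code_from_results(export_data: dict) -> int:
    """Map an exported results dict to an exit code: 0 ok, 1 failures, 2 nothing passed."""
    summary = export_data.get("summary", {})
    if summary.get("failed", 0) > 0:
        return 1
    if summary.get("passed", 0) == 0:
        return 2  # everything was skipped or the file is empty
    return 0


# Same counts as the summary printed earlier in this notebook
sample = json.loads('{"summary": {"passed": 5, "failed": 1, "skipped": 0}}')
print(exit_code_from_results(sample))
```

A CI step could then load `tests/provider_test_results.json` with `json.load` and pass the result to `sys.exit(exit_code_from_results(...))`.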


---
## Troubleshooting

### Common Issues:

1. **API Key Missing**: Set the environment variable before starting Jupyter:
   ```bash
   export OPENAI_API_KEY="sk-..."
   export ANTHROPIC_API_KEY="sk-ant-..."
   # etc.
   ```

2. **Princeton Cluster Model Not Found**: 
   - Verify the model exists: `ls /scratch/gpfs/DANQIC/models/`
   - Check `STRONG_MODELS_CONFIG` for the correct `local_path`

3. **Rate Limiting**: If you get rate limit errors, wait a minute and retry.

4. **Timeout**: Some models (especially local ones) may take longer. The default timeout is 30s.

5. **Import Errors**: Ensure you're in the project virtual environment:
   ```bash
   source .venv/bin/activate
   ```