# Test Baseline Inference Engine

This notebook smoke-tests the baseline LLM inference engine on a small sample of data.

**Tests:**
1. Load baseline model
2. Run inference on sample data
3. Display predictions vs ground truth
4. Compute basic metrics

**Expected Output:**
- Model loads successfully
- Predictions generated for all samples
- Metrics computed (accuracy, per-class stats)


In [None]:
import sys
from pathlib import Path

# Add project root to path (assumes the notebook runs from a direct subdirectory of the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src.baselines.llama_single import LlamaSingleBaseline
from src.evaluation.metrics import compute_all_metrics, print_metrics_report
from src.utils.preprocess import load_from_jsonl
import yaml
import torch

print("✅ Imports successful")


## 1. Load Configuration
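
The cell below reads six keys from the loaded config, so a missing key only surfaces as a `KeyError` partway through the printout. A minimal sketch of an up-front check follows. The helper `missing_config_keys` and the values in `example_config` are hypothetical and are not part of the project.

```python
# Keys read by the configuration cell in this notebook.
REQUIRED_KEYS = (
    "model_name",
    "dtype",
    "load_in_4bit",
    "max_new_tokens",
    "temperature",
    "prompt_template",
)


def missing_config_keys(config):
    """Return the required keys absent from config (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values for illustration only; not the project's real settings.
example_config = {
    "model_name": "placeholder-model",
    "dtype": "bfloat16",
    "load_in_4bit": False,
    "max_new_tokens": 16,
}

print(missing_config_keys(example_config))
```

With the real config, an empty list would mean every key the notebook prints is present.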


In [None]:
# Load baseline config
config_path = project_root / 'configs' / 'baseline.yaml'
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

print("Baseline Configuration:")
print(f"  Model: {config['model_name']}")
print(f"  Dtype: {config['dtype']}")
print(f"  4-bit: {config['load_in_4bit']}")
print(f"  Max tokens: {config['max_new_tokens']}")
print(f"  Temperature: {config['temperature']}")
print(f"  Prompt template: {config['prompt_template']}")


## 2. Check GPU


In [None]:
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Currently allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
else:
    print("⚠️  No GPU available")


## 3. Initialize Baseline Model

Loading the model takes roughly 30-60 seconds.


In [None]:
baseline = LlamaSingleBaseline(config)
baseline.load_model()

if torch.cuda.is_available():
    print(f"\nGPU Memory after loading:")
    print(f"  Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"  Reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")


## 4. Load Sample Data
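
For orientation, this is a rough sketch of what reading a JSONL file involves, assuming one JSON object per line. `read_jsonl` is a hypothetical stand-in and may differ from the project's `load_from_jsonl`. The demo records are made up and only reuse the field names this notebook reads later.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON object per non-empty line (stand-in, not load_from_jsonl)."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Made-up records with the fields used later in this notebook.
rows = [
    {"text": "example note A", "trigger_text": "a", "status_label": "none", "source": "demo"},
    {"text": "example note B", "trigger_text": "b", "status_label": "past", "source": "demo"},
]

# Write the records to a temporary file, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    loaded = read_jsonl(demo_path)

print(len(loaded), loaded[0]["status_label"])
```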


In [None]:
# Load processed training data
data_path = project_root / 'data' / 'processed' / 'train.jsonl'

all_samples = load_from_jsonl(data_path)
print(f"Total training samples: {len(all_samples)}")

# Take 10 samples for testing
n_samples = 10
samples = all_samples[:n_samples]
print(f"\nTesting on {len(samples)} samples")


## 5. Display Sample Examples


In [None]:
# Show the first two samples
for i, sample in enumerate(samples[:2], 1):
    if i > 1:
        print()
    print(f"Sample {i}:")
    print(f"  Text: {sample['text'][:150]}...")
    print(f"  Trigger: {sample['trigger_text']}")
    print(f"  True label: {sample['status_label']}")
    print(f"  Source: {sample['source']}")


## 6. Run Baseline Inference

Inference takes roughly 2-3 seconds per sample.
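
To compare the per-sample estimate against a real run, the latency can be measured directly. This is a minimal sketch: `time_per_sample` and `fake_predict` are hypothetical names, and `fake_predict` stands in for the model call so the example runs without a GPU.

```python
import time


def time_per_sample(predict_fn, samples):
    """Return (results, mean seconds per sample) for a one-at-a-time predictor."""
    start = time.perf_counter()
    results = [predict_fn(s) for s in samples]
    elapsed = time.perf_counter() - start
    mean = elapsed / len(samples) if samples else 0.0
    return results, mean


def fake_predict(sample):
    # Stand-in for the model call; always answers "none".
    return {**sample, "pred_label": "none"}


demo_samples = [{"text": f"note {i}"} for i in range(3)]
demo_results, mean_seconds = time_per_sample(fake_predict, demo_samples)
print(f"{len(demo_results)} samples, {mean_seconds:.6f} s/sample")
```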


In [None]:
results = baseline.predict_batch(samples, show_progress=True)
print(f"\n✅ Inference completed on {len(results)} samples")


## 7. Display Predictions


In [None]:
correct = 0

for i, result in enumerate(results, 1):
    true_label = result['status_label']
    pred_label = result['pred_label']
    pred_letter = result['pred_letter']
    is_correct = true_label == pred_label
    match = "✅" if is_correct else "❌"

    if is_correct:
        correct += 1

    print(f"\nSample {i}:")
    print(f"  Text: {result['text'][:100]}...")
    print(f"  Trigger: {result['trigger_text']}")
    print(f"  True: {true_label}")
    print(f"  Pred: {pred_label} ({pred_letter})")
    print(f"  {match}")

print(f"\n{'='*80}")
print(f"Quick Accuracy: {correct}/{len(results)} = {correct/len(results):.1%}")


## 8. Compute Detailed Metrics
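
As a reference for reading the report, here is a plain-Python sketch of accuracy and one-vs-rest per-class statistics. It is not the implementation of `compute_all_metrics`, whose output may be organised differently. `per_class_stats` and the demo labels are illustrative only.

```python
def per_class_stats(y_true, y_pred, labels):
    """Precision, recall and support per label, computed one-vs-rest."""
    stats = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        stats[label] = {
            # Report 0.0 when a ratio is undefined (no predictions or no true cases).
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "support": tp + fn,
        }
    return stats


# Made-up labels for illustration.
demo_true = ["none", "current", "past", "current"]
demo_pred = ["none", "current", "current", "none"]
demo_labels = ["none", "current", "past", "Not Applicable"]

demo_accuracy = sum(t == p for t, p in zip(demo_true, demo_pred)) / len(demo_true)
demo_stats = per_class_stats(demo_true, demo_pred, demo_labels)

print(f"accuracy = {demo_accuracy:.2f}")
for label, s in demo_stats.items():
    print(f"{label}: precision={s['precision']:.2f} recall={s['recall']:.2f} support={s['support']}")
```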


In [None]:
y_true = [r['status_label'] for r in results]
y_pred = [r['pred_label'] for r in results]

labels = ['none', 'current', 'past', 'Not Applicable']
metrics = compute_all_metrics(y_true, y_pred, labels=labels)

print_metrics_report(metrics)


## 9. Validation Checklist

✅ **Check these:**
1. Model loads without errors
2. GPU memory usage is reasonable (< 10 GB for bf16)
3. Inference completes for all samples
4. Predictions are valid labels (none/current/past/Not Applicable)
5. Some predictions match ground truth (>0% accuracy)
6. Confusion matrix makes sense (no all-zero rows/columns)
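
Checks 4 and 6 can also be done programmatically. The sketch below flags predictions outside the label set and lists confusion-matrix rows or columns that are all zero. The helper names and demo data are hypothetical, and the project's metrics module may already expose an equivalent.

```python
LABELS = ["none", "current", "past", "Not Applicable"]


def invalid_predictions(y_pred, labels):
    """Return predictions that are not in the allowed label set."""
    return [p for p in y_pred if p not in labels]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels; unknown labels are skipped."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        if t in index and p in index:
            matrix[index[t]][index[p]] += 1
    return matrix


def empty_rows_and_columns(matrix, labels):
    """Return (labels never seen as truth, labels never predicted)."""
    empty_rows = [labels[i] for i, row in enumerate(matrix) if sum(row) == 0]
    empty_cols = [labels[j] for j in range(len(labels)) if sum(row[j] for row in matrix) == 0]
    return empty_rows, empty_cols


# Made-up data; "maybe" is deliberately not a valid label.
demo_true = ["none", "current", "past", "current"]
demo_pred = ["none", "current", "current", "maybe"]

bad = invalid_predictions(demo_pred, LABELS)
cm = confusion_matrix(demo_true, demo_pred, LABELS)
empty_rows, empty_cols = empty_rows_and_columns(cm, LABELS)
print("invalid:", bad)
print("empty rows:", empty_rows)
print("empty columns:", empty_cols)
```

On a 10-sample run, an all-zero row can simply mean that class did not occur in the sample, so it is worth comparing against the true label counts before treating it as a failure.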

**Expected Performance (baseline, untrained):**
- Accuracy: Variable, typically 20-50% on small samples
- FPR (false positive rate): may be high, since the model predicts 'current' too often
- This is expected; we'll improve it with the agentic approach!
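
To make the FPR note concrete, the sketch below treats 'current' as the positive class and computes FP / (FP + TN) one-vs-rest. This definition is an assumption, and the project's metrics code may define FPR differently. The demo labels are made up.

```python
def false_positive_rate(y_true, y_pred, positive):
    """One-vs-rest FPR: share of true negatives-by-truth predicted as `positive`."""
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    # Undefined when there are no negative examples; report 0.0 in that case.
    return fp / (fp + tn) if fp + tn else 0.0


# Made-up labels where the model over-predicts 'current'.
demo_true = ["none", "past", "current", "none"]
demo_pred = ["current", "current", "current", "none"]

fpr_current = false_positive_rate(demo_true, demo_pred, "current")
print(f"FPR for 'current': {fpr_current:.2f}")
```

Here two of the three samples whose true label is not 'current' are predicted as 'current', which is the over-prediction pattern described above.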

---

**If all checks pass, proceed to Phase 3!**
