# Test Baseline Inference Engine

This notebook tests baseline LLM inference on a small sample of the processed training data.

**Tests:**
1. Load baseline model
2. Run inference on sample data
3. Display predictions vs ground truth
4. Compute basic metrics

**Expected Output:**
- Model loads successfully
- Predictions generated for all samples
- Metrics computed (accuracy, per-class stats)


In [1]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src.baselines.llama_single import LlamaSingleBaseline
from src.evaluation.metrics import compute_all_metrics, print_metrics_report
from src.utils.preprocess import load_from_jsonl
import yaml
import torch

print("✅ Imports successful")




✅ Imports successful


## 1. Load Configuration


In [2]:
# Load baseline config
config_path = project_root / 'configs' / 'baseline.yaml'
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

print("Baseline Configuration:")
print(f"  Model: {config['model_name']}")
print(f"  Dtype: {config['dtype']}")
print(f"  4-bit: {config['load_in_4bit']}")
print(f"  Max tokens: {config['max_new_tokens']}")
print(f"  Temperature: {config['temperature']}")
print(f"  Prompt template: {config['prompt_template']}")


Baseline Configuration:
  Model: meta-llama/Llama-3.1-8B-Instruct
  Dtype: bf16
  4-bit: False
  Max tokens: 8
  Temperature: 0.1
  Prompt template: status_v1
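
A missing key in `baseline.yaml` would only surface as a `KeyError` partway through the prints above. A minimal sketch of an up-front check, assuming the six keys printed in this cell are the ones the baseline needs; `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names, not part of the project:

```python
# Hypothetical helper: the required keys are assumed from the fields printed above.
REQUIRED_KEYS = (
    "model_name", "dtype", "load_in_4bit",
    "max_new_tokens", "temperature", "prompt_template",
)

def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the config dict."""
    return [key for key in required if key not in config]

# Example using the values shown in the output above.
example_config = {
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bf16",
    "load_in_4bit": False,
    "max_new_tokens": 8,
    "temperature": 0.1,
    "prompt_template": "status_v1",
}
print(missing_config_keys(example_config))
```

Running the check right after `yaml.safe_load` would report every missing key at once instead of failing on the first one.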


## 2. Check GPU


In [3]:
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Currently allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
else:
    print("⚠️  No GPU available")


CUDA available: True
Device: NVIDIA GeForce RTX 4090
Total GPU memory: 25.3 GB
Currently allocated: 0.00 GB


## 3. Initialize Baseline Model

Loading the model takes roughly 30-60 seconds.


In [4]:
baseline = LlamaSingleBaseline(config)
baseline.load_model()

if torch.cuda.is_available():
    print(f"\nGPU Memory after loading:")
    print(f"  Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"  Reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")


Loading tokenizer from meta-llama/Llama-3.1-8B-Instruct...
Loading model...


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.67it/s]

✅ Model loaded successfully

GPU Memory after loading:
  Allocated: 7.16 GB
  Reserved: 7.22 GB





## 4. Load Sample Data


In [5]:
# Load processed training data
data_path = project_root / 'data' / 'processed' / 'train.jsonl'

all_samples = load_from_jsonl(data_path)
print(f"Total training samples: {len(all_samples)}")

# Take 10 samples for testing
n_samples = 10
samples = all_samples[:n_samples]
print(f"\nTesting on {len(samples)} samples")


Total training samples: 3004

Testing on 10 samples


## 5. Display Sample Examples


In [6]:
print("Sample 1:")
print(f"  Text: {samples[0]['text'][:150]}...")
print(f"  Trigger: {samples[0]['trigger_text']}")
print(f"  True label: {samples[0]['status_label']}")
print(f"  Source: {samples[0]['source']}")

print("\nSample 2:")
print(f"  Text: {samples[1]['text'][:150]}...")
print(f"  Trigger: {samples[1]['trigger_text']}")
print(f"  True label: {samples[1]['status_label']}")
print(f"  Source: {samples[1]['source']}")


Sample 1:
  Text: Social History: Tob (-), EtOH - a glass of wine 1-2x/month, IVDU (-), lives with her husband and 9yr old daughter, does not work outside of the home....
  Trigger: IVDU
  True label: none
  Source: mimic

Sample 2:
  Text: SOCIAL HISTORY: Former real estate [**Doctor Last Name 360**], current unemployed. Lives alone. Smokes 1-1.5 packs per day x20 years. Currently admits...
  Trigger: IV drug use
  True label: none
  Source: mimic


## 6. Run Baseline Inference

This will take ~2-3 seconds per sample.


In [7]:
results = baseline.predict_batch(samples, show_progress=True)
print(f"\n✅ Inference completed on {len(results)} samples")


  0%|          | 0/10 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
 10%|█         | 1/10 [00:00<00:05,  1.53it/s]
 20%|██        | 2/10 [00:00<00:02,  2.72it/s]
 40%|████      | 4/10 [00:00<00:01,  5.52it/s]
 60%|██████    | 6/10 [00:01<00:00,  7.16it/s]
 80%|████████  | 8/10 [00:01<00:00,  8.29it/s]
10


✅ Inference completed on 10 samples





## 7. Display Predictions


In [8]:
correct = 0

for i, result in enumerate(results, 1):
    true_label = result['status_label']
    pred_label = result['pred_label']
    pred_letter = result['pred_letter']
    match = "✅" if true_label == pred_label else "❌"
    
    if true_label == pred_label:
        correct += 1
    
    print(f"\nSample {i}:")
    print(f"  Text: {result['text'][:100]}...")
    print(f"  Trigger: {result['trigger_text']}")
    print(f"  True: {true_label}")
    print(f"  Pred: {pred_label} ({pred_letter})")
    print(f"  {match}")

print(f"\n{'='*80}")
print(f"Quick Accuracy: {correct}/{len(results)} = {correct/len(results):.1%}")



Sample 1:
  Text: Social History: Tob (-), EtOH - a glass of wine 1-2x/month, IVDU (-), lives with her husband and 9yr...
  Trigger: IVDU
  True: none
  Pred: current (b)
  ❌

Sample 2:
  Text: SOCIAL HISTORY: Former real estate [**Doctor Last Name 360**], current unemployed. Lives alone. Smok...
  Trigger: IV drug use
  True: none
  Pred: Not Applicable (d)
  ❌

Sample 3:
  Text: SOCIAL HISTORY: Former real estate [**Doctor Last Name 360**], current unemployed. Lives alone. Smok...
  Trigger: recreational drug use
  True: none
  Pred: current (b)
  ❌

Sample 4:
  Text: Social History: No tobacco history. Denies excessive ETOH. Married with children. Works at the [**Co...
  Trigger: recreatinal drugs
  True: none
  Pred: current (b)
  ❌

Sample 5:
  Text: Social History: IVDA and illicit drug use (heroin, oxycontin, and cocaine) up until day of surgery. ...
  Trigger: IVDA and illicit drug use
  True: current
  Pred: current (b)
  ✅

Sample 6:
  Text: Social History: Pt lives at home 

## 8. Compute Detailed Metrics


In [9]:
y_true = [r['status_label'] for r in results]
y_pred = [r['pred_label'] for r in results]

labels = ['none', 'current', 'past', 'Not Applicable']
metrics = compute_all_metrics(y_true, y_pred, labels=labels)

print_metrics_report(metrics)


EVALUATION METRICS

Overall Metrics:
  Accuracy: 0.2000
  FPR (False Positive Rate for 'none'): 0.0000
  Total Samples: 10

Label Distribution (Ground Truth):
  current             :     2 ( 20.0%)
  none                :     8 ( 80.0%)

Label Distribution (Predicted):
  Not Applicable      :     4 ( 40.0%)
  current             :     5 ( 50.0%)
  none                :     1 ( 10.0%)

Per-Class Metrics:
Label                  Prec    Rec     F1   Supp
--------------------------------------------------
none                  1.000  0.125  0.222      8
current               0.200  0.500  0.286      2
past                  0.000  0.000  0.000      0
Not Applicable        0.000  0.000  0.000      0

Confusion Matrix:
True \ Pred               none   current      past  Not Applicable
------------------------------------------------------------------
none                         1         4         0               3
current                      0         1         0               1
past                         0    
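
The per-class numbers in the report follow directly from the confusion matrix. A minimal sketch in plain Python, not the project's `compute_all_metrics` implementation; the rows for `past` and `Not Applicable` are assumed to be all zeros because the report lists their support as 0, and `per_class_metrics` is a hypothetical helper:

```python
labels = ["none", "current", "past", "Not Applicable"]

# Rows are true labels, columns are predicted labels (same order as `labels`).
# The last two rows are assumed to be zeros (support 0 in the report).
confusion = [
    [1, 4, 0, 3],  # true: none
    [0, 1, 0, 1],  # true: current
    [0, 0, 0, 0],  # true: past
    [0, 0, 0, 0],  # true: Not Applicable
]

def per_class_metrics(confusion, labels):
    """Compute precision, recall, F1 and support for each label."""
    out = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        support = sum(confusion[i])                   # row sum: true count
        predicted = sum(row[i] for row in confusion)  # column sum: predicted count
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[label] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": support}
    return out

class_metrics = per_class_metrics(confusion, labels)
for label, m in class_metrics.items():
    print(f"{label:<20} {m['precision']:.3f} {m['recall']:.3f} "
          f"{m['f1']:.3f} {m['support']:>4}")

# Accuracy is the diagonal sum divided by the total count.
accuracy = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))
print(f"Accuracy: {accuracy:.4f}")
```

For `none`, for example, precision is 1/1 and recall is 1/8, which gives the F1 of 0.222 shown in the table.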

## 9. Validation Checklist

✅ **Check these:**
1. Model loads without errors
2. GPU memory usage is reasonable (< 10GB for bf16)
3. Inference completes for all samples
4. Predictions are valid labels (none/current/past/Not Applicable)
5. Some predictions match ground truth (>0% accuracy)
6. Confusion matrix makes sense (no all-zero rows or columns for classes that appear in the sample)
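
Checks 3 to 5 can be automated. A minimal sketch, assuming each result is a dict with the `status_label` and `pred_label` keys used earlier in this notebook; `run_sanity_checks` is a hypothetical helper and the two results below are toy data:

```python
VALID_LABELS = {"none", "current", "past", "Not Applicable"}

def run_sanity_checks(results, n_expected, valid_labels=VALID_LABELS):
    """Return a dict mapping each check name to True (pass) or False (fail)."""
    n_correct = sum(r["status_label"] == r["pred_label"] for r in results)
    return {
        "inference_complete": len(results) == n_expected,
        "labels_valid": all(r["pred_label"] in valid_labels for r in results),
        "some_correct": n_correct > 0,
    }

# Toy data for illustration only.
toy_results = [
    {"status_label": "none", "pred_label": "current"},
    {"status_label": "current", "pred_label": "current"},
]
checks = run_sanity_checks(toy_results, n_expected=2)
for name, passed in checks.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
```

In the notebook itself the call would be `run_sanity_checks(results, n_expected=len(samples))`.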

**Expected Performance (baseline, no task-specific fine-tuning):**
- Accuracy: Variable, typically 20-50% on small samples
- FPR: May be high (50-100% on small samples)
  - FPR measures how often drug use (current/past) is predicted when the truth is no use (none/Not Applicable)
  - Computed ONLY on samples whose ground truth is negative
  - Lower is better (target: <15% for safety)
- This is expected; we'll improve it with the agentic approach!
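
To see why accuracy is so variable on small samples, it helps to put an interval around it. A sketch using the Wilson score interval, which is a standard choice for proportions and not something the project code is known to use; `wilson_interval` is a hypothetical helper:

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / n_total + z**2 / (4 * n_total**2)
    ) / denom
    return center - half_width, center + half_width

# 2 correct out of 10, as in the quick accuracy check in this notebook.
low, high = wilson_interval(2, 10)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

With 2 correct out of 10 the interval spans roughly 6% to 51%, so a 10-sample accuracy says little about performance on the full dataset.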

**Understanding FPR:**
```
Negative class (no drug use): none + Not Applicable
Positive class (drug use): current + past

FPR = FP / (FP + TN)
FP = predicted positive when truth is negative
TN = predicted negative when truth is negative
```
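
The definition above translates into a few lines of Python. This is a sketch of the formula as written here, not the project's `compute_all_metrics` code; `false_positive_rate` is a hypothetical helper, and the label lists are reconstructed from the confusion matrix in section 8:

```python
NEGATIVE = {"none", "Not Applicable"}  # no drug use
POSITIVE = {"current", "past"}         # drug use

def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), computed only over negative ground-truth samples."""
    fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth not in NEGATIVE:
            continue  # positive ground truth does not enter the FPR
        if pred in POSITIVE:
            fp += 1
        else:
            tn += 1  # any non-positive prediction counts as a true negative
    return fp / (fp + tn) if (fp + tn) else 0.0

# Reconstructed from the confusion matrix: 8 true "none", 2 true "current".
y_true_example = ["none"] * 8 + ["current"] * 2
y_pred_example = (
    ["none"] + ["current"] * 4 + ["Not Applicable"] * 3  # predictions for true "none"
    + ["current", "Not Applicable"]                      # predictions for true "current"
)

print(f"FPR: {false_positive_rate(y_true_example, y_pred_example):.4f}")
```

Under this definition the 10-sample run gives 4 / (4 + 4) = 0.5, whereas the report prints 0.0000 for "FPR (False Positive Rate for 'none')". That suggests `compute_all_metrics` may use a different definition, which is worth checking before relying on the reported value.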

---

**If all checks pass, proceed to Phase 3!**
