# Workflow Test with Langfuse Tracing

This notebook tests:
1. Multi-agent extraction workflow on 5 CUAD samples
2. Langfuse observability/tracing integration
3. Model diagnostics tracking

**Prerequisites:**
- Set `ANTHROPIC_API_KEY` in `.env`
- Set Langfuse keys: `LANGFUSE_SECRET_KEY`, `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_HOST`

In [None]:
import sys
import os
sys.path.insert(0, "..")

# Load environment variables
from dotenv import load_dotenv
load_dotenv("../.env")

# Verify API key is set
assert os.getenv("ANTHROPIC_API_KEY"), "ANTHROPIC_API_KEY not set in .env"
print("Environment loaded")

Environment loaded
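
The cell above only asserts `ANTHROPIC_API_KEY`, while the prerequisites also list the Langfuse keys. A minimal sketch of checking all of them at once follows. The helper name `missing_env_vars` is hypothetical, and `LANGFUSE_HOST` is left out because the notebook falls back to `https://cloud.langfuse.com` when it is unset.

```python
import os

# Names taken from the prerequisites list; LANGFUSE_HOST is omitted
# because the notebook supplies a default for it.
REQUIRED_VARS = [
    "ANTHROPIC_API_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_PUBLIC_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Reporting every missing name in one pass avoids fixing `.env` one failed assertion at a time.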


## 1. Setup Langfuse Tracing

In [None]:
from langfuse import Langfuse

# Initialize Langfuse client
langfuse = Langfuse()

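# Verify credentials: auth_check() returns False if the keys are rejected
assert langfuse.auth_check(), "Langfuse authentication failed; check the LANGFUSE_* keys"
print(f"Langfuse host: {os.getenv('LANGFUSE_HOST', 'https://cloud.langfuse.com')}")
print("Langfuse connected!")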

Langfuse host: https://cloud.langfuse.com
Langfuse connected!


## 2. Load CUAD Test Data (5 samples)

In [None]:
from src.data import CUADDataLoader

# Load dataset
loader = CUADDataLoader()
loader.load()

# Take the first sample from each new category until we have 5.
# Note: this does not balance tiers or positive/negative examples.
samples = []
seen_categories = set()

for sample in loader:
    if sample.category not in seen_categories:
        samples.append(sample)
        seen_categories.add(sample.category)
    if len(samples) >= 5:
        break

print(f"Selected {len(samples)} samples from different categories:")
for s in samples:
    print(f"  - {s.category} ({s.tier}): {'has clause' if s.has_clause else 'no clause'}")

Selected 5 samples from different categories:
  - Document Name (common): has clause
  - Parties (common): has clause
  - Agreement Date (common): has clause
  - Effective Date (common): has clause
  - Expiration Date (common): has clause
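
All five selected samples are `common` tier and positive, so this run never exercises negative examples or other tiers. One way to get a more varied selection is to group candidates by `(tier, has_clause)` and draw from the groups in turn. The sketch below is an assumption about how that could look: the `Sample` dataclass is a hypothetical stand-in for the loader's objects (only the `category`, `tier`, and `has_clause` attributes appear in the notebook), and the tier names other than `common` are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import zip_longest


@dataclass
class Sample:  # hypothetical stand-in for the loader's sample objects
    category: str
    tier: str
    has_clause: bool


def select_diverse(samples, n):
    """Pick up to n samples, one per category, cycling over (tier, has_clause) groups."""
    groups = defaultdict(list)
    seen_categories = set()
    for s in samples:
        if s.category in seen_categories:
            continue  # keep only the first sample seen for each category
        seen_categories.add(s.category)
        groups[(s.tier, s.has_clause)].append(s)

    picked = []
    # Round-robin: first item of every group, then the second of every group, ...
    for row in zip_longest(*groups.values()):
        for s in row:
            if s is not None and len(picked) < n:
                picked.append(s)
    return picked


pool = [
    Sample("Document Name", "common", True),
    Sample("Parties", "common", True),
    Sample("Renewal Term", "rare", False),      # tier name is illustrative
    Sample("License Grant", "moderate", True),  # tier name is illustrative
]
for s in select_diverse(pool, 3):
    print(s.category, s.tier, s.has_clause)
```

With the notebook's loader, `select_diverse(loader, 5)` would replace the selection loop, assuming the loader can be iterated once from start to finish.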


## 3. Setup Agents and Orchestrator

In [None]:
from src.agents import RiskLiabilityAgent, TemporalRenewalAgent, IPCommercialAgent
from src.agents.orchestrator import Orchestrator
from src.models import ModelDiagnostics

# Create diagnostics tracker
diagnostics = ModelDiagnostics()

# Create specialist agents
specialists = {
    "risk_liability": RiskLiabilityAgent(diagnostics=diagnostics),
    "temporal_renewal": TemporalRenewalAgent(diagnostics=diagnostics),
    "ip_commercial": IPCommercialAgent(diagnostics=diagnostics),
}

# Create orchestrator
orchestrator = Orchestrator(specialists=specialists)

print("Agents initialized:")
for name, agent in specialists.items():
    print(f"  - {name}: {len(agent.config.categories)} categories")

Agents initialized:
  - risk_liability: 13 categories
  - temporal_renewal: 11 categories
  - ip_commercial: 17 categories
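
The three category counts sum to 41, which suggests each category is owned by exactly one specialist. The notebook does not show how `Orchestrator` picks a specialist, so the following is only a sketch of one plausible routing scheme: invert each specialist's category list into a lookup table. The function `build_router` and the category assignments are hypothetical; the real lists live in each agent's `config.categories`.

```python
def build_router(specialist_categories):
    """Invert {specialist: [categories]} into {category: specialist}."""
    router = {}
    for name, categories in specialist_categories.items():
        for category in categories:
            if category in router:
                raise ValueError(
                    f"{category!r} assigned to both {router[category]!r} and {name!r}"
                )
            router[category] = name
    return router


# Illustrative assignments only, not the project's actual configuration
specialist_categories = {
    "risk_liability": ["Cap On Liability", "Uncapped Liability"],
    "temporal_renewal": ["Expiration Date", "Renewal Term"],
    "ip_commercial": ["License Grant", "Exclusivity"],
}

router = build_router(specialist_categories)
print(router["Renewal Term"])
```

Raising on duplicates catches overlapping configs early, before a category is silently routed to whichever specialist was registered last.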


## 4. Run Extraction on 5 Samples

In [41]:
from langfuse import observe

results = []

@observe(name="workflow_test")
async def run_extraction(sample):
    """Run extraction with Langfuse tracing."""
    result = await orchestrator.extract(
        contract_text=sample.contract_text,
        category=sample.category,
        question=sample.question,
    )
    return result

print("Running extractions...\n")

for i, sample in enumerate(samples):
    print(f"[{i+1}/{len(samples)}] Processing: {sample.category}...")
    try:
        result = await run_extraction(sample)
        results.append({
            "sample": sample,
            "result": result,
            "success": True,
        })
        extracted = len(result.extracted_clauses)
        print(f"       -> Extracted {extracted} clause(s), confidence: {result.confidence:.2f}")
    except Exception as e:
        results.append({
            "sample": sample,
            "result": None,
            "success": False,
            "error": str(e),
        })
        print(f"       -> ERROR: {e}")

print("\nDone!")

Running extractions...

[1/5] Processing: Document Name...
       -> Extracted 2 clause(s), confidence: 1.00
[2/5] Processing: Parties...
       -> Extracted 3 clause(s), confidence: 1.00
[3/5] Processing: Agreement Date...
       -> Extracted 1 clause(s), confidence: 1.00
[4/5] Processing: Effective Date...
       -> Extracted 3 clause(s), confidence: 0.95
[5/5] Processing: Expiration Date...
       -> Extracted 3 clause(s), confidence: 0.95

Done!
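
The loop above awaits each extraction in turn, so total wall time is the sum of the individual latencies. Because `orchestrator.extract` is a coroutine, the calls could instead overlap. The sketch below shows the general pattern with `asyncio.gather` and a semaphore to cap concurrency. `fake_extract` is a hypothetical stand-in that sleeps instead of calling a model, and the concurrency limit of 3 is an arbitrary choice, not a measured rate limit.

```python
import asyncio


async def fake_extract(category):
    """Hypothetical stand-in for orchestrator.extract; sleeps instead of calling a model."""
    await asyncio.sleep(0.01)
    if category == "bad":
        raise ValueError("simulated failure")
    return f"result for {category}"


async def run_all(categories, extract, max_concurrency=3):
    """Run extract over all categories with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(category):
        async with semaphore:
            try:
                result = await extract(category)
                return {"category": category, "result": result, "success": True}
            except Exception as e:
                # Record the failure instead of cancelling the other tasks
                return {"category": category, "result": None, "success": False, "error": str(e)}

    # gather returns results in the same order as its inputs
    return await asyncio.gather(*(guarded(c) for c in categories))


outcomes = asyncio.run(run_all(["Parties", "bad", "Agreement Date"], fake_extract))
for o in outcomes:
    print(o["category"], "ok" if o["success"] else f"ERROR: {o['error']}")
```

Inside a notebook cell, where an event loop is already running, use `outcomes = await run_all(...)` instead of `asyncio.run(...)`.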


## 5. View Results

In [42]:
print("Extraction Results Summary")
print("=" * 70)

for i, r in enumerate(results):
    sample = r["sample"]
    result = r["result"]
    
    print(f"\n[{i+1}] {sample.category} ({sample.tier})")
    print(f"    Contract: {sample.contract_title[:50]}...")
    print(f"    Ground truth: {'Yes' if sample.has_clause else 'No'} ({sample.num_spans} spans)")
    
    if r["success"]:
        print(f"    Extracted: {len(result.extracted_clauses)} clause(s)")
        print(f"    Confidence: {result.confidence:.2f}")
        if result.extracted_clauses:
            preview = result.extracted_clauses[0][:100]
            print(f"    First clause: {preview}...")
    else:
        print(f"    ERROR: {r['error']}")

Extraction Results Summary

[1] Document Name (common)
    Contract: LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMEN...
    Ground truth: Yes (1 spans)
    Extracted: 2 clause(s)
    Confidence: 1.00
    First clause: DISTRIBUTOR AGREEMENT...

[2] Parties (common)
    Contract: LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMEN...
    Ground truth: Yes (5 spans)
    Extracted: 3 clause(s)
    Confidence: 1.00
    First clause: THIS DISTRIBUTOR AGREEMENT (the "Agreement") is made by and between Electric City Corp., a Delaware ...

[3] Agreement Date (common)
    Contract: LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMEN...
    Ground truth: Yes (1 spans)
    Extracted: 1 clause(s)
    Confidence: 1.00
    First clause: THIS DISTRIBUTOR AGREEMENT (the "Agreement") is made by and between Electric City Corp., a Delaware ...

[4] Effective Date (common)
    Contract: LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMEN...
    Ground truth: Yes (2 spans)
    Extracted: 3 clause(s)
    Con

## 6. Model Diagnostics

In [43]:
# Get diagnostics summary
summary = diagnostics.summary()

print("Model Diagnostics")
print("=" * 50)
print(f"Total API calls:    {summary['total_calls']}")
print(f"Successful calls:   {summary['successful_calls']}")
print(f"Failed calls:       {summary['failed_calls']}")
print()
print(f"Total input tokens: {summary['total_input_tokens']:,}")
print(f"Total output tokens: {summary['total_output_tokens']:,}")
print(f"Estimated cost:     ${summary['total_cost_usd']:.4f}")
print()
print(f"Avg latency:        {summary['avg_latency_ms']:.0f}ms")

Model Diagnostics
Total API calls:    7
Successful calls:   7
Failed calls:       0

Total input tokens: 83,130
Total output tokens: 3,456
Estimated cost:     $0.3012

Avg latency:        16552ms
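
The estimated cost above is consistent with rates of $3 per million input tokens and $15 per million output tokens. Those rates are an assumption: the notebook does not say which model or price table `ModelDiagnostics` uses. The arithmetic can be checked as follows, with `estimate_cost_usd` being a hypothetical helper.

```python
# Assumed rates in USD per million tokens; not stated anywhere in the notebook
INPUT_RATE_PER_MTOK = 3.00
OUTPUT_RATE_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens, output_tokens,
                      input_rate=INPUT_RATE_PER_MTOK,
                      output_rate=OUTPUT_RATE_PER_MTOK):
    """Return the cost in USD for the given token counts at per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Token counts copied from the diagnostics output above
cost = estimate_cost_usd(83_130, 3_456)
print(f"Estimated cost: ${cost:.4f}")  # prints "Estimated cost: $0.3012"
```

Input tokens account for about $0.25 of the total, so shortening the contract text sent to each call is the most direct lever on cost.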


## 7. Flush Langfuse Traces

In [44]:
# Flush traces to Langfuse
langfuse.flush()

print("Traces flushed to Langfuse!")
print(f"View traces at: {os.getenv('LANGFUSE_HOST', 'https://cloud.langfuse.com')}")

Traces flushed to Langfuse!
View traces at: https://cloud.langfuse.com


## 8. Quick Accuracy Check

In [38]:
# Presence-level confusion counts: was anything extracted when a clause exists?
tp = fp = fn = tn = 0

for r in results:
    if not r["success"]:
        continue
        
    sample = r["sample"]
    result = r["result"]
    
    has_ground_truth = sample.has_clause
    has_extraction = len(result.extracted_clauses) > 0
    
    if has_ground_truth and has_extraction:
        tp += 1
    elif has_ground_truth and not has_extraction:
        fn += 1
    elif not has_ground_truth and has_extraction:
        fp += 1
    else:
        tn += 1

print(f"Quick Accuracy Check ({tp + fp + fn + tn} samples)")
print("=" * 40)
print(f"True Positives:  {tp}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Negatives:  {tn}")
print()

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1:.2f}")
print()
print("Note: This is a rough check. Full evaluation uses text-level F1.")

Quick Accuracy Check (5 samples)
True Positives:  5
False Positives: 0
False Negatives: 0
True Negatives:  0

Precision: 1.00
Recall:    1.00
F1:        1.00

Note: This is a rough check. Full evaluation uses text-level F1.
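
The check above only asks whether anything was extracted, so an extraction of the wrong text still counts as a true positive. The note mentions text-level F1 without defining it. One common definition, assumed here rather than taken from the project, is the SQuAD-style token-overlap F1 between an extracted clause and a ground-truth span. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # both empty: treat as a perfect match
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(f"{token_f1('DISTRIBUTOR AGREEMENT', 'Distributor Agreement'):.2f}")
print(f"{token_f1('THIS DISTRIBUTOR AGREEMENT', 'Distributor Agreement'):.2f}")
```

The second call scores 0.80: recall is 2/2 because both reference tokens are found, while precision is 2/3 because of the extra token `this`.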


## Summary

This notebook demonstrated:
1. **Langfuse Integration**: Traces are captured and sent to Langfuse
2. **Multi-Agent Workflow**: Orchestrator routes to appropriate specialists
3. **Model Diagnostics**: Token usage, costs, and latency tracked
4. **Basic Accuracy**: Presence-level check of whether a clause was extracted when one exists

**Next steps:**
- Run on full test set with `scripts/run_baseline.py`
- Compare against zero-shot baseline
- Analyze Langfuse traces for optimization opportunities