# 🛡️ Online Validator: The Debate & Defer Framework

Welcome! This notebook demonstrates a production-grade **Cognitive Architecture** using **LangGraph** and **Langfuse**. It is designed for high-stakes, VIP e-commerce environments where a standard LLM wrapper is too risky.

### 🎯 What this simulates
This notebook simulates a VIP customer of a **premier luxury brand** aggressively demanding a refund for a high-end backpack due to a subtle color variation under sunlight. Instead of an LLM blindly processing a $600+ refund (high financial risk) or rudely rejecting the VIP (high brand risk), the graph triggers an **Online Debate Validator**. 

### 🧠 Core Concepts Implemented

**1. The "LACE" Agentic Debate**
*Inspired by Instacart's [Turbocharging Customer Support Chatbot Development](https://tech.instacart.com/turbocharging-customer-support-chatbot-development-with-llm-based-automated-evaluation-6a269aae56b2).*
An LLM should never grade its own homework. We run an online, synchronous debate before the user ever sees a response:
* **Draft Node**: Proposes a resolution.
* **Attacker Node (The Auditor)**: Actively tries to find policy violations or tone issues in the draft.
* **Defender Node**: Defends the draft logically against the critique.
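
The three roles above can be sketched in plain Python to show how each node reads the previous node's output. This is only an illustration: the real nodes live in `src/graph.py` (not shown here) and call an LLM, whereas the functions below are hypothetical stand-ins that return canned strings. The state keys `query`, `draft`, `critique`, and `defense` mirror the ones read later in this notebook.

```python
from typing import TypedDict


class DebateState(TypedDict, total=False):
    query: str
    draft: str
    critique: str
    defense: str


def draft_node(state: DebateState) -> DebateState:
    # Stand-in for the LLM call that proposes a resolution.
    return {"draft": f"Proposed resolution for: {state['query']}"}


def attacker_node(state: DebateState) -> DebateState:
    # Stand-in for the auditor that looks for policy or tone problems.
    return {"critique": f"Possible issues in: {state['draft']}"}


def defender_node(state: DebateState) -> DebateState:
    # Stand-in for the defender that answers the critique.
    return {"defense": f"Response to: {state['critique']}"}


def run_debate(query: str) -> DebateState:
    # Run the nodes in order, merging each partial update into the state.
    state: DebateState = {"query": query}
    for node in (draft_node, attacker_node, defender_node):
        state.update(node(state))
    return state


result = run_debate("VIP requests a refund for a backpack")
for key in ("draft", "critique", "defense"):
    print(f"{key}: {result[key]}")
```

Each node returns only the keys it changes, which is the same partial-update style LangGraph nodes typically use.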

**2. Epistemic Humility & The Escape Hatch**
*Inspired by Ramp's [How to Build Agents Users Can Trust](https://engineering.ramp.com/post/how-to-build-agents-users-can-trust).*
To build trust, AI must know when to say "I don't know." 
* **Judge Node**: Evaluates the debate. If the policy is unclear or the downside risk is too high, the Judge flags the state as `AMBIGUOUS`.
* **Human-in-the-Loop**: The graph pauses execution (`interrupt_before`) and escalates the ambiguous edge case to a human manager instead of replying automatically.
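
The escape hatch boils down to a routing decision after the Judge. A minimal sketch follows, assuming the state keys `verdict` and `escape_hatch_triggered` and the node name `human_escalation` that appear later in this notebook. The `respond` node name is hypothetical, and the actual routing in `src/graph.py` may differ.

```python
def route_after_judge(state: dict) -> str:
    """Return the name of the node that should run after the judge."""
    ambiguous = state.get("verdict") == "AMBIGUOUS"
    if ambiguous or state.get("escape_hatch_triggered", False):
        # Defer to a human rather than answer automatically.
        return "human_escalation"
    # Hypothetical node that sends the approved draft to the customer.
    return "respond"


print(route_after_judge({"verdict": "AMBIGUOUS"}))
print(route_after_judge({"verdict": "APPROVED", "escape_hatch_triggered": False}))
```

In LangGraph, a function like this would typically be registered with `add_conditional_edges`, and the pause itself comes from compiling the graph with a checkpointer and `interrupt_before` set to the escalation node.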

> 🔍 **Observability:** Every node execution is traced to **Langfuse** along with its token counts and latency, so opportunity cost and unit economics can be evaluated on an ongoing basis.

In [None]:
# Automatically reload imported modules like src.graph
# so that any changes to graph.py apply without restarting the kernel!
%load_ext autoreload
%autoreload 2

In [None]:
# Install required dependencies quietly for the notebook kernel
%pip install -q langchain langchain-google-genai langgraph langfuse python-dotenv grandalf


In [None]:
import os
from dotenv import load_dotenv
from langfuse.langchain import CallbackHandler
from src.graph import graph, load_prompt

# Load environment variables (API Keys)
load_dotenv()

# Initialize Langfuse observability handler
langfuse_handler = CallbackHandler()

# Load the realistic customer input signal
customer_input = load_prompt("input_client.txt")

In [None]:
from IPython.display import display, Image

# Render the Debate & Defer graph dynamically
display(Image(graph.get_graph().draw_mermaid_png()))

## Running the Debate Graph

This cell demonstrates the Instacart- and Ramp-inspired concepts directly in LangGraph, with every step observable in Langfuse.

The test query simulates a VIP customer demanding a luxury item refund, which should trigger the debate and evaluation process.

## State Management & Observability

> **Pro Tip for Luxury Retail Deployment**: Notice the `thread_id` configuration below. In an enterprise deployment, this thread ID would map to a specific entry in the customer support ticket database, tying the graph's execution memory and its Langfuse traces back to that ticket.
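
One way to keep that mapping consistent is to derive the config from the ticket ID in a single helper. The sketch below is an assumption for illustration: `build_run_config` and the `ticket_` prefix are hypothetical names, not part of LangGraph or Langfuse. The returned shape matches the `config` dictionary used in the next cell.

```python
from typing import Any, Optional


def build_run_config(ticket_id: str, callbacks: Optional[list] = None) -> dict[str, Any]:
    """Build a run config whose thread_id is derived from a support ticket ID."""
    cleaned = ticket_id.strip()
    if not cleaned:
        raise ValueError("ticket_id must be a non-empty string")
    return {
        # One thread per ticket, so a paused run can be resumed for the same ticket.
        "configurable": {"thread_id": f"ticket_{cleaned}"},
        "callbacks": list(callbacks or []),
    }


example = build_run_config("luxury_001")
print(example["configurable"]["thread_id"])
```

Reusing the same ticket ID later yields the same `thread_id`, which is what lets a checkpointed run be looked up again after a manager has reviewed it.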

In [None]:
# Define the configuration for the tracing and human-in-the-loop thread
config = {
    "configurable": {"thread_id": "vip_ticket_luxury_001"}, 
    "callbacks": [langfuse_handler]
}

print("STAGING: LOADING CUSTOMER SIGNAL...")
print(f"\n{'='*40}")
print(" CUSTOMER INPUT")
print(f"{'='*40}")
print(customer_input)

print("\nStarting Debate Graph Logic...\n")

# Stream the execution to observe the debate unfold step-by-step
from IPython.display import display, Markdown

trace_displayed = False
for event in graph.stream({"query": customer_input}, config=config):
    # Dynamically capture and display the Langfuse trace URL
    if not trace_displayed:
        try:
            trace_id = getattr(langfuse_handler, "last_trace_id", None)
            if trace_id:
                trace_url = langfuse_handler.client.get_trace_url(trace_id=trace_id)
                display(Markdown(f"**🔍 [View Trace in Langfuse]({trace_url})**"))
                trace_displayed = True
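        except Exception:
            # The trace link is best-effort: never let a Langfuse lookup
            # failure interrupt the debate run. It is retried on the next event.
            pass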
    for node_name, node_state in event.items():
        if not isinstance(node_state, dict):
            # LangGraph yields __interrupt__ events as tuples when paused
            print(f"\n{'='*40}")
            print(f" MANAGER ESCALATION: Paused at {node_name}")
            print(f"{'='*40}")
            continue
            
        print(f"\n{'='*40}")
        print(f" NODE: {node_name.upper()}")
        print(f"{'='*40}")
        if node_name == 'judge':
            # Show the full judge reasoning, not just the final verdict
            print(f"Judge Synthesis:\n{node_state.get('debate_synthesis', '')}")
            print(f"\nEscape Hatch Triggered: {node_state.get('escape_hatch_triggered', False)}")
            print(f"Verdict: {node_state.get('verdict')}")
        elif node_name == 'human_escalation':
            print(node_state.get('draft'))
        else:
            # Map each debate node to the state key that holds its output;
            # any other node falls back to 'defense', as before.
            output_keys = {'draft': 'draft', 'attacker': 'critique'}
            print(node_state.get(output_keys.get(node_name, 'defense'), ''))

## The Takeaway

> "A standard LLM wrapper is cheap to build but carries a massive opportunity cost if it hallucinates a refund policy to a high-LTV customer. By implementing an online Instacart-style debate, traced end to end in Langfuse, we increase the compute cost per ticket slightly, but we drastically reduce the risk of brand damage. Furthermore, the Ramp-style 'AMBIGUOUS' routing ensures we only spend human OpEx on the edge cases that truly require human judgment."
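
The trade-off in this takeaway can be made concrete with a back-of-the-envelope expected-cost model. Apart from the $600 refund value, which comes from the scenario in this notebook, every rate and unit cost below is a made-up placeholder. In practice these inputs would come from your own Langfuse traces and support data.

```python
def expected_cost_per_ticket(
    compute_cost: float,
    error_rate: float,
    error_cost: float,
    escalation_rate: float = 0.0,
    human_cost: float = 0.0,
) -> float:
    """Compute cost plus the expected cost of errors and of human review."""
    return compute_cost + error_rate * error_cost + escalation_rate * human_cost


REFUND_VALUE = 600.0  # from the scenario in this notebook

# All rates and unit costs below are hypothetical placeholders.
wrapper = expected_cost_per_ticket(
    compute_cost=0.01, error_rate=0.05, error_cost=REFUND_VALUE
)
debate = expected_cost_per_ticket(
    compute_cost=0.05, error_rate=0.01, error_cost=REFUND_VALUE,
    escalation_rate=0.10, human_cost=15.0,
)

print(f"single-call wrapper: ${wrapper:.2f} per ticket")
print(f"debate and defer:    ${debate:.2f} per ticket")
```

With these placeholder inputs the debate path comes out cheaper, because the lower error rate outweighs the extra compute and human review. Different inputs can flip the result, which is why the measured rates matter.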