## Observability with Opik - LLM-Native Monitoring

**Alternative Approach**: Instead of traditional observability tools (Prometheus, Grafana, Jaeger), we'll use **Opik** - an open-source LLM observability platform designed specifically for AI applications.

### Why Opik for LLM Applications?

**Traditional Tools (Prometheus/Grafana/Jaeger):**
- Generic application monitoring
- Manual instrumentation required
- Separate tools for metrics, logs, traces
- Limited LLM-specific insights

**Opik:**
- Built specifically for LLM applications
- Automatic LangChain/LangGraph integration
- Unified dashboard for traces, costs, evaluations
- Prompt versioning and experimentation
- Response quality monitoring
- Token usage and cost tracking per request

### What We'll Cover:

1. **Setup Opik** - Installation and configuration
2. **Automatic Tracing** - LangChain/LangGraph integration
3. **Custom Annotations** - Add business context
4. **Cost Tracking** - Monitor spending per user/contract type
5. **Evaluation** - Score response quality
6. **Dashboard** - View insights in Opik UI

In [1]:
# Install Opik
!pip install -q opik

print("Opik installed successfully!")

Opik installed successfully!


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-openai 0.2.1 requires openai<2.0.0,>=1.40.0, but you have openai 2.8.1 which is incompatible.



## Part 1: Install and Setup Opik

Opik provides LLM-native observability with automatic tracing for LangChain applications.

In [2]:
import sys
from pathlib import Path

# Add src directory to path
sys.path.insert(0, str(Path.cwd().parent / "src"))

# Import Opik
import opik
from opik import track
from opik.integrations.langchain import OpikTracer

print("Opik imported")
print(f"   Version: {opik.__version__}")

Opik imported
   Version: 1.9.33


### Configure Opik

Opik can run in two modes:
1. **Cloud Mode** - Free hosted service at app.comet.com/opik ✅ **RECOMMENDED**
2. **Local Mode** - Self-hosted with Docker (limited SDK tracing support)

**Important Note for Local Deployment:**
The self-hosted Opik instance (via Docker) has limited automatic tracing capabilities. While your LLM calls will work correctly, you may see 404 errors for trace endpoints. For full automatic tracing features demonstrated in this notebook, we recommend using **Opik Cloud** (free tier available).

**Alternative:** For production-grade observability with full local control, use the traditional stack in Notebook 02 (Prometheus + Grafana + Jaeger + OpenTelemetry).

In [3]:
# Configure Opik for local deployment
import opik

# Option 1: Use Opik Cloud (RECOMMENDED for full features)
# Sign up at: https://www.comet.com/signup?from=opik
# opik.configure(use_local=False)
# Then set your API key when prompted or via environment:
# os.environ["OPIK_API_KEY"] = "your_api_key_here"
# os.environ["OPIK_WORKSPACE"] = "your_workspace"

# Option 2: Use local Opik instance (self-hosted)
opik.configure(use_local=True)

print("Opik configured successfully!")
print(f"   Mode: Local (Self-Hosted)")
print(f"   UI: http://localhost:5173")
print(f"   Data stored in: ~/opik directory")
print(f"\nConfiguration saved to: ~/.opik.config")
print(f"\n💡 To use Opik Cloud instead:")
print(f"   Run: opik.configure(use_local=False)")
print(f"   Sign up: https://www.comet.com/signup?from=opik")

OPIK: Configuration completed successfully. Traces will be logged to 'Default Project' project. To change the destination project, see: https://www.comet.com/docs/opik/tracing/log_traces#configuring-the-project-name


Opik configured successfully!
   Mode: Local (Self-Hosted)
   UI: http://localhost:5173
   Data stored in: ~/opik directory

Configuration saved to: ~/.opik.config

💡 To use Opik Cloud instead:
   Run: opik.configure(use_local=False)
   Sign up: https://www.comet.com/signup?from=opik
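
The choice between the two modes above can also be driven by environment variables instead of editing the cell. The helper below is a minimal sketch, not part of the Opik SDK: `opik_configure_kwargs` is a hypothetical name, and it assumes that `opik.configure()` accepts `use_local`, `api_key`, and `workspace` keyword arguments. It only builds the arguments; the actual call is left commented out.

```python
import os


def opik_configure_kwargs(env=None):
    """Build keyword arguments for opik.configure() from environment variables.

    Hypothetical helper: uses Cloud mode when OPIK_API_KEY is set,
    otherwise falls back to the local self-hosted instance.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPIK_API_KEY")
    if not api_key:
        return {"use_local": True}

    kwargs = {"use_local": False, "api_key": api_key}
    workspace = env.get("OPIK_WORKSPACE")
    if workspace:
        kwargs["workspace"] = workspace
    return kwargs


kwargs = opik_configure_kwargs()
print("Mode:", "local" if kwargs["use_local"] else "cloud")  # never print the key itself

# Assumed usage (requires the opik package):
# import opik
# opik.configure(**kwargs)
```

Keeping the decision in one function makes it easy to run the same notebook against a local instance during development and against Opik Cloud elsewhere.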


## Part 2: Automatic LangChain Tracing

Once the `OpikTracer` callback is attached to a call, Opik traces LangChain/LangGraph operations without any further manual instrumentation.

In [4]:
# Import LangChain components
from dotenv import load_dotenv
load_dotenv(Path.cwd().parent / ".env", override=True)

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from typing import List

print("LangChain components imported")

LangChain components imported


### Initialize Opik Tracer for LangChain

The OpikTracer callback handler automatically captures all LangChain operations.

In [5]:
# Create Opik callback handler for LangChain
# Note: We'll add the graph parameter later when we create the LangGraph workflow
opik_tracer = OpikTracer(
    project_name="contract-analysis",
    tags=["notebook", "demo"]
)

print("Opik tracer initialized for LangChain")
print("   Project: contract-analysis")
print("   Tags: notebook, demo")
print("\n💡 For LangGraph workflows, pass graph parameter:")
print("   OpikTracer(graph=app.get_graph(xray=True))")

Opik tracer initialized for LangChain
   Project: contract-analysis
   Tags: notebook, demo

💡 For LangGraph workflows, pass graph parameter:
   OpikTracer(graph=app.get_graph(xray=True))


### Simple Classification Example with Automatic Tracing

Run a contract classification with automatic trace capture.

In [6]:
import os

class ContractClassification(BaseModel):
    contract_type: str = Field(description="Type: NDA, SaaS, Employment, Partnership, Unknown")
    complexity: str = Field(description="Complexity: Simple, Moderate, Complex")
    confidence_score: float = Field(description="Confidence 0-1")
    reasoning: str = Field(description="Brief explanation")

# Sample contract
sample_contract = """
NON-DISCLOSURE AGREEMENT

This Non-Disclosure Agreement is entered into as of January 1, 2024,
by and between TechCorp Inc. and John Doe.

1. CONFIDENTIAL INFORMATION
The Receiving Party agrees to maintain confidentiality of all information
disclosed by the Disclosing Party.

2. OBLIGATIONS
- Maintain confidentiality
- Not disclose to third parties
- Use only for authorized purposes

3. TERM
This Agreement remains in effect for 2 years.
"""

# Create LLM with structured output
llm = ChatOpenAI(
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    temperature=0
)
structured_llm = llm.with_structured_output(ContractClassification)

# Create prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert contract analyst. Classify contracts by type and complexity."),
    ("user", "Classify this contract:\n\n{contract_text}")
])

# Create chain
chain = prompt | structured_llm

# Run with Opik tracing (automatically captures everything!)
print("Classifying contract with Opik tracing...\n")

result = chain.invoke(
    {"contract_text": sample_contract},
    config={"callbacks": [opik_tracer]}  # This enables automatic tracing!
)

print(f"Classification Complete:")
print(f"   Type: {result.contract_type}")
print(f"   Complexity: {result.complexity}")
print(f"   Confidence: {result.confidence_score:.2%}")
print(f"   Reasoning: {result.reasoning}")
print(f"\nTrace automatically sent to Opik!")
print(f"   View at: http://localhost:5173")

OPIK: Started logging traces to the "contract-analysis" project at http://localhost:5173/api/v1/session/redirect/projects/?trace_id=019ace3f-7abd-79fe-bc87-84fb58b7e287&path=aHR0cDovL2xvY2FsaG9zdDo1MTczL2FwaS8=.


Classifying contract with Opik tracing...

Classification Complete:
   Type: NDA
   Complexity: Simple
   Confidence: 90.00%
   Reasoning: The contract is a straightforward Non-Disclosure Agreement with clear obligations and a defined term, making it simple in nature.

Trace automatically sent to Opik!
   View at: http://localhost:5173


## Part 3: Custom Function Tracking

Use the `@track` decorator to monitor custom functions and add business context.

In [7]:
from opik import track
import time

@track(
    name="extract_pdf_text",
    project_name="contract-analysis",
    tags=["pdf", "extraction"]
)
def extract_pdf_simulation(file_path: str, user_id: str):
    """Simulate PDF extraction with Opik tracking."""
    time.sleep(0.3)
    return {
        "text": sample_contract,
        "pages": 2,
        "chars": len(sample_contract)
    }


@track(
    name="validate_contract",
    project_name="contract-analysis",
    tags=["security", "validation"]
)
def validate_contract_simulation(contract_text: str, user_id: str):
    """Simulate security validation with tracking."""
    time.sleep(0.2)
    return {
        "valid": True,
        "pii_detected": False,
        "issues": []
    }


# Run tracked functions
print("Running tracked operations...\n")

pdf_result = extract_pdf_simulation("nda_standard.pdf", "user-001")
print(f"PDF extracted: {pdf_result['pages']} pages, {pdf_result['chars']} chars")

validation_result = validate_contract_simulation(sample_contract, "user-001")
print(f"Validation: {validation_result['valid']}")

print(f"\nAll operations tracked in Opik!")

Running tracked operations...

PDF extracted: 2 pages, 442 chars
Validation: True

All operations tracked in Opik!
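
To build intuition for what a tracking decorator does, here is a minimal standard-library sketch. It is not Opik's implementation: `simple_track` and `TRACE_LOG` are hypothetical names, and the "backend" is just an in-memory list. It records the function name, tags, inputs, output (or error), and duration of each call, which is roughly the kind of information a trace span carries.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory stand-in for a tracing backend


def simple_track(name=None, tags=None):
    """Illustrative decorator that records one entry per call in TRACE_LOG."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {
                "name": name or func.__name__,
                "tags": list(tags or []),
                "input": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                record["output"] = result
                return result
            except Exception as exc:
                record["error"] = repr(exc)  # keep the failure visible in the log
                raise
            finally:
                record["duration_s"] = time.perf_counter() - start
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@simple_track(name="validate_contract", tags=["security"])
def validate(contract_text, user_id):
    return {"valid": bool(contract_text.strip()), "user_id": user_id}


validate("Some contract text", user_id="user-001")
print(TRACE_LOG[0]["name"], TRACE_LOG[0]["output"])
```

Because `user_id` is a function argument, it ends up in the recorded input. The same idea explains why passing business identifiers as arguments to tracked functions makes them searchable later.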


## Part 4: Integrate with LangGraph Agent

Let's add Opik observability to our complete contract analysis workflow.
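
The next cell imports `ContractAnalysisState` and `create_initial_state` from the project's `agent.state` module, which is not shown in this notebook. The sketch below is a guess at a compatible shape, inferred only from the keys the nodes read and write here. The names `SketchContractState` and `sketch_initial_state` are hypothetical stand-ins, and the real module may define more fields or different defaults.

```python
from typing import Any, Dict, List, Optional, TypedDict


class SketchContractState(TypedDict):
    """Assumed state shape, inferred from the keys used in this notebook."""
    contract_text: str
    file_path: str
    user_id: str
    contract_type: Optional[str]
    complexity: Optional[str]
    confidence_score: Optional[float]
    key_terms: List[Dict[str, Any]]
    risks: List[Dict[str, Any]]
    errors: List[str]


def sketch_initial_state(contract_text: str, file_path: str, user_id: str) -> SketchContractState:
    """Create a fresh state; every call gets its own empty lists."""
    return SketchContractState(
        contract_text=contract_text,
        file_path=file_path,
        user_id=user_id,
        contract_type=None,
        complexity=None,
        confidence_score=None,
        key_terms=[],
        risks=[],
        errors=[],
    )


state = sketch_initial_state("NDA text...", "nda_standard.pdf", "user-001")
print(sorted(state.keys()))
```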

In [8]:
# Import agent state
from agent.state import ContractAnalysisState, create_initial_state
from langgraph.graph import StateGraph, END
import uuid
from datetime import datetime

print("Agent components imported")

# Define classification node with Opik tracking
@track(
    name="classify_contract_node",
    project_name="contract-analysis",
    tags=["langgraph", "classification"]
)
def classify_contract_with_opik(state: ContractAnalysisState) -> ContractAnalysisState:
    """Classification node with Opik observability."""
    
    try:
        # Create LLM chain
        llm = ChatOpenAI(model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"), temperature=0)
        structured_llm = llm.with_structured_output(ContractClassification)
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are an expert contract analyst. Classify contracts."),
            ("user", "Classify:\n\n{contract_text}")
        ])
        
        chain = prompt | structured_llm
        
        # Run with Opik tracing
        result = chain.invoke(
            {"contract_text": state['contract_text'][:4000]},
            config={"callbacks": [opik_tracer]}
        )
        
        # Update state
        state['contract_type'] = result.contract_type
        state['complexity'] = result.complexity
        state['confidence_score'] = result.confidence_score
        
        print(f"  Classified: {result.contract_type} ({result.confidence_score:.2%})")
        
    except Exception as e:
        state['errors'].append(f"Classification error: {str(e)}")
        state['contract_type'] = "Unknown"
        print(f"  Error: {str(e)}")
    
    return state


# Define analysis node with Opik tracking
@track(
    name="analyze_contract_node",
    project_name="contract-analysis",
    tags=["langgraph", "analysis"]
)
def analyze_contract_with_opik(state: ContractAnalysisState) -> ContractAnalysisState:
    """Analysis node with Opik observability."""
    
    # Simulate analysis
    state['key_terms'] = [
        {"term": "Confidentiality", "importance": "High"},
        {"term": "Term Length", "importance": "Medium"}
    ]
    state['risks'] = [
        {"risk": "Broad definition of confidential info", "severity": "Medium"}
    ]
    
    print(f"  Analysis: {len(state['key_terms'])} terms, {len(state['risks'])} risks")
    
    return state


print("Tracked LangGraph nodes defined")

Agent components imported
Tracked LangGraph nodes defined


### Create and Run Observed LangGraph

In [9]:
# Create workflow
workflow = StateGraph(ContractAnalysisState)

# Add nodes
workflow.add_node("classify", classify_contract_with_opik)
workflow.add_node("analyze", analyze_contract_with_opik)

# Define edges
workflow.set_entry_point("classify")
workflow.add_edge("classify", "analyze")
workflow.add_edge("analyze", END)

# Compile
contract_agent_opik = workflow.compile()

print("LangGraph workflow created")

# Create OpikTracer with graph visualization (IMPORTANT for LangGraph)
opik_tracer_langgraph = OpikTracer(
    graph=contract_agent_opik.get_graph(xray=True),  # Enable graph visualization
    project_name="contract-analysis",
    tags=["langgraph", "workflow"]
)

print("OpikTracer configured with LangGraph graph visualization")

# Run the agent with Opik observability
print("\nRunning contract analysis with Opik observability...")
print("=" * 80)

# Create initial state
initial_state = create_initial_state(
    contract_text=sample_contract,
    file_path="nda_standard.pdf",
    user_id="demo_user_opik"
)

# Execute with OpikTracer callback
final_state = contract_agent_opik.invoke(
    initial_state,
    config={"callbacks": [opik_tracer_langgraph]}  # Pass tracer as callback
)

print("\n" + "=" * 80)
print("ANALYSIS COMPLETE")
print("=" * 80)
print(f"\nContract Type: {final_state['contract_type']}")
print(f"Complexity: {final_state['complexity']}")
print(f"Confidence: {final_state['confidence_score']:.2%}")

print(f"Key Terms: {len(final_state['key_terms'])}")
print(f"Trace ID: {opik_tracer_langgraph.created_traces()[0].id if opik_tracer_langgraph.created_traces() else 'N/A'}")

print(f"Risks: {len(final_state['risks'])}")
print(f"\nFull trace with graph visualization: http://localhost:5173")

LangGraph workflow created
OpikTracer configured with LangGraph graph visualization

Running contract analysis with Opik observability...
  Classified: NDA (90.00%)
  Analysis: 2 terms, 1 risks

ANALYSIS COMPLETE

Contract Type: NDA
Complexity: Simple
Confidence: 90.00%
Key Terms: 2
Trace ID: 019ace42-3d3f-75d8-88f0-cde1e28be666
Risks: 1

Full trace with graph visualization: http://localhost:5173


## Part 5: Cost Tracking and Token Usage

Opik automatically tracks token usage and costs for OpenAI calls.

In [10]:
# Run multiple analyses to accumulate cost data
print("Running 10 analyses to track costs...\n")

contract_samples = [
    "NDA between TechCorp and vendor...",
    "SaaS subscription agreement for cloud services...",
    "Employment offer letter for software engineer...",
    "Partnership agreement for joint venture...",
    "Service Level Agreement for support services..."
]

for i, sample in enumerate(contract_samples * 2, 1):  # 10 total
    state = create_initial_state(
        contract_text=sample,
        file_path=f"contract_{i}.pdf",
        user_id=f"user-{i % 3 + 1}"  # 3 different users
    )
    
    result = classify_contract_with_opik(state)
    print(f"  {i}. {result['contract_type']:15} (User: {state['user_id']})")

print(f"\nAll analyses tracked!")
print(f"View cost breakdown by:")
print(f"   • User ID")
print(f"   • Contract type")
print(f"   • Time period")
print(f"   • Model used")
print(f"\nCheck Opik dashboard: http://localhost:5173")

Running 10 analyses to track costs...

  Classified: NDA (90.00%)
  1. NDA             (User: user-2)
  Classified: SaaS (90.00%)
  2. SaaS            (User: user-3)
  Classified: Employment (90.00%)
  3. Employment      (User: user-1)
  Classified: Partnership (85.00%)
  4. Partnership     (User: user-2)
  Classified: Unknown (70.00%)
  5. Unknown         (User: user-3)
  Classified: NDA (90.00%)
  6. NDA             (User: user-1)
  Classified: SaaS (90.00%)
  7. SaaS            (User: user-2)
  Classified: Employment (90.00%)
  8. Employment      (User: user-3)
  Classified: Partnership (85.00%)
  9. Partnership     (User: user-1)
  Classified: Unknown (70.00%)
  10. Unknown         (User: user-2)

All analyses tracked!
View cost breakdown by:
   • User ID
   • Contract type
   • Time period
   • Model used

Check Opik dashboard: http://localhost:5173
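
The arithmetic behind a per-user cost breakdown is simple enough to sketch by hand. The example below is illustrative only: the model name, the prices in `PRICE_PER_1M`, and the token counts are all made-up values, not real pricing or data from the runs above. In practice the token counts would come from the traced LLM responses.

```python
from collections import defaultdict

# HYPOTHETICAL prices in USD per 1M tokens; replace with your provider's real rates.
PRICE_PER_1M = {"example-model": {"input": 0.50, "output": 1.50}}


def call_cost(model, prompt_tokens, completion_tokens):
    """Cost of a single call: tokens times the per-token price for each direction."""
    price = PRICE_PER_1M[model]
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1_000_000


def cost_by_user(calls):
    """Sum call costs per user_id."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["user_id"]] += call_cost(
            call["model"], call["prompt_tokens"], call["completion_tokens"]
        )
    return dict(totals)


# Made-up usage records for illustration.
calls = [
    {"user_id": "user-1", "model": "example-model", "prompt_tokens": 1000, "completion_tokens": 200},
    {"user_id": "user-2", "model": "example-model", "prompt_tokens": 2000, "completion_tokens": 100},
    {"user_id": "user-1", "model": "example-model", "prompt_tokens": 500, "completion_tokens": 100},
]

for user, total in sorted(cost_by_user(calls).items()):
    print(f"{user}: ${total:.6f}")
```

The same grouping works for any other key on the record, such as contract type or model.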


## Part 6: Response Evaluation

Opik allows you to score and evaluate LLM responses for quality monitoring.

In [11]:
# Define evaluation function
@track(
    name="evaluate_classification",
    project_name="contract-analysis"
)
def evaluate_classification_quality(contract_text: str, expected_type: str):
    """Evaluate classification accuracy."""
    
    # Run classification
    llm = ChatOpenAI(model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"), temperature=0)
    structured_llm = llm.with_structured_output(ContractClassification)
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Classify the contract type."),
        ("user", "{contract_text}")
    ])
    
    chain = prompt | structured_llm
    result = chain.invoke(
        {"contract_text": contract_text},
        config={"callbacks": [opik_tracer]}
    )
    
    # Score the result
    is_correct = result.contract_type == expected_type
    confidence_score = result.confidence_score
    
    return {
        "expected": expected_type,
        "predicted": result.contract_type,
        "correct": is_correct,
        "confidence": confidence_score
    }


# Test evaluations
print("Evaluating classification quality...\n")

test_cases = [
    ("This NDA protects confidential information...", "NDA"),
    ("SaaS subscription for monthly software access...", "SaaS"),
    ("Employment contract for full-time position...", "Employment"),
]

results = []
for contract, expected in test_cases:
    result = evaluate_classification_quality(contract, expected)
    status = "✓" if result['correct'] else "✗"
    print(f"{status} Expected: {result['expected']:12} | Got: {result['predicted']:12} | Confidence: {result['confidence']:.2%}")
    results.append(result)

accuracy = sum(r['correct'] for r in results) / len(results)
avg_confidence = sum(r['confidence'] for r in results) / len(results)

print(f"\nOverall Metrics:")
print(f"   Accuracy: {accuracy:.2%}")
print(f"   Avg Confidence: {avg_confidence:.2%}")
print(f"\nView detailed evaluation in Opik!")

Evaluating classification quality...

✓ Expected: NDA          | Got: NDA          | Confidence: 90.00%
✓ Expected: SaaS         | Got: SaaS         | Confidence: 90.00%
✓ Expected: Employment   | Got: Employment   | Confidence: 90.00%

Overall Metrics:
   Accuracy: 100.00%
   Avg Confidence: 90.00%

View detailed evaluation in Opik!
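
The evaluation above compares labels with an exact string match, which can undercount correct answers when the model returns a different casing or stray whitespace. The sketch below shows one way to make the comparison more tolerant and to flag results worth reviewing. `labels_match`, `summarize_evaluations`, the `min_confidence` threshold, and the sample results are all hypothetical and chosen only for illustration.

```python
def labels_match(predicted, expected):
    """Case- and whitespace-insensitive label comparison."""
    return predicted.strip().lower() == expected.strip().lower()


def summarize_evaluations(results, min_confidence=0.8):
    """Aggregate accuracy and confidence, and collect items that need review."""
    if not results:
        raise ValueError("results must not be empty")
    scored = [
        {**r, "correct": labels_match(r["predicted"], r["expected"])} for r in results
    ]
    count = len(scored)
    return {
        "accuracy": sum(r["correct"] for r in scored) / count,
        "avg_confidence": sum(r["confidence"] for r in scored) / count,
        # Review anything that is wrong OR that the model was unsure about.
        "needs_review": [
            r for r in scored if not r["correct"] or r["confidence"] < min_confidence
        ],
    }


# Made-up results for illustration.
sample_results = [
    {"expected": "NDA", "predicted": "nda ", "confidence": 0.90},
    {"expected": "SaaS", "predicted": "SaaS", "confidence": 0.70},
    {"expected": "Employment", "predicted": "Partnership", "confidence": 0.85},
]

summary = summarize_evaluations(sample_results)
print(f"Accuracy: {summary['accuracy']:.2%}")
print(f"Needs review: {len(summary['needs_review'])}")
```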


## Part 7: Experiment Tracking

Track different prompt versions and compare performance.

In [12]:
# Test different prompt strategies
prompts_to_test = {
    "v1_basic": "Classify this contract: {contract_text}",
    
    "v2_detailed": """Analyze this contract and classify it.
Consider: legal language, parties involved, obligations, and termination clauses.
Contract: {contract_text}""",
    
    "v3_structured": """You are an expert legal analyst with 20 years experience.
Carefully read the contract below and classify its type with reasoning.

Contract Types:
- NDA: Non-Disclosure Agreement
- SaaS: Software as a Service
- Employment: Offer letters, employment contracts
- Partnership: Joint ventures, partnerships

Contract:
{contract_text}"""
}

print("Testing different prompt strategies...\n")

for version, prompt_template in prompts_to_test.items():
    # Track each experiment
    @track(
        name=f"prompt_experiment_{version}",
        project_name="contract-analysis",
        tags=["experiment", version]
    )
    def test_prompt(template):
        llm = ChatOpenAI(model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"), temperature=0)
        structured_llm = llm.with_structured_output(ContractClassification)
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are a contract analyst."),
            ("user", template)
        ])
        
        chain = prompt | structured_llm
        result = chain.invoke(
            {"contract_text": sample_contract},
            config={"callbacks": [opik_tracer]}
        )
        
        return result
    
    result = test_prompt(prompt_template)
    print(f"  {version:15} → {result.contract_type:12} (Confidence: {result.confidence_score:.2%})")

print(f"\nExperiment results tracked!")
print(f"Compare prompts in Opik to find the best performer")

Testing different prompt strategies...

  v1_basic        → NDA          (Confidence: 90.00%)
  v2_detailed     → NDA          (Confidence: 90.00%)
  v3_structured   → NDA          (Confidence: 95.00%)

Experiment results tracked!
Compare prompts in Opik to find the best performer
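
A single sample contract cannot separate prompt versions that all get it right. A more informative comparison scores every version against a small labeled test set. The sketch below assumes a pluggable `classify` callable: `compare_prompts` and `keyword_classifier` are hypothetical names, and the keyword stub merely stands in for a real LLM chain so the example runs offline.

```python
def compare_prompts(prompts, test_cases, classify):
    """Return accuracy per prompt version over (text, expected_label) test cases."""
    scores = {}
    for version, template in prompts.items():
        correct = 0
        for text, expected in test_cases:
            predicted = classify(template.format(contract_text=text))
            correct += int(predicted == expected)
        scores[version] = correct / len(test_cases)
    return scores


def keyword_classifier(prompt_text):
    """Stub standing in for an LLM call; NOT a real classifier."""
    lowered = prompt_text.lower()
    if "confidential" in lowered:
        return "NDA"
    if "subscription" in lowered:
        return "SaaS"
    return "Unknown"


prompt_versions = {
    "v1_basic": "Classify this contract: {contract_text}",
    "v2_detailed": "Analyze this contract and classify it.\nContract: {contract_text}",
}

labeled_cases = [
    ("This agreement protects confidential information.", "NDA"),
    ("Monthly subscription for software access.", "SaaS"),
    ("Offer of full-time employment.", "Employment"),
]

scores = compare_prompts(prompt_versions, labeled_cases, keyword_classifier)
for version, accuracy in scores.items():
    print(f"{version:15} accuracy: {accuracy:.2%}")
```

Swapping `keyword_classifier` for a function that invokes the traced chain would keep the scoring loop unchanged while sending each call to Opik.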


## Part 8: Starting Local Opik Instance

To view all the traces and metrics we've been collecting:

In [None]:
print("""Start Local Opik Instance:

1. Clone Opik repository (if not already done):
   
   git clone https://github.com/comet-ml/opik.git
   cd opik

2. Start Opik using Docker Compose:
   
   # Full Opik platform (recommended)
   ./opik.sh
   
   # Or directly with docker compose:
   cd deployment/docker-compose
   docker compose --profile opik up --detach

3. Configure Python SDK for local deployment:
   
   opik configure --use_local
   
   # Or in Python:
   import opik
   opik.configure(use_local=True)

4. Access Opik UI:
   
   http://localhost:5173

5. Data Storage:
   
   All data stored in: ~/opik directory
   Config file: ~/.opik.config

6. Stop Opik:

   ./opik.sh --stop

7. Alternative: Use Opik Cloud (Recommended for full features)

   - Sign up: https://www.comet.com/signup?from=opik
   - Configure: opik.configure(use_local=False)
   - Enter API key when prompted

Features in Opik UI:
   • Trace Timeline with LangGraph visualization
   • Token Usage tracking per request
   • Cost Dashboard with breakdowns
   • Prompt Version comparison
   • Quality Evaluations
   • Advanced filters (user, tags, etc.)
""")

print("Instructions displayed above")

## Comparison: Opik vs Traditional Stack

Understanding when to use each approach.

In [None]:
comparison = """
TRADITIONAL STACK (Prometheus/Grafana/Jaeger)
Pros:
  - Industry standard, well-established
  - Great for infrastructure monitoring
  - Flexible, customizable dashboards
  - Good for multi-service architectures

Cons:
  - Requires manual instrumentation for LLM calls
  - No built-in cost tracking
  - No prompt versioning
  - Separate tools for different concerns
  - Complex setup (3+ services)

OPIK
Pros:
  - Zero-config LangChain/LangGraph integration
  - Automatic token and cost tracking
  - Built-in prompt experimentation
  - Unified dashboard for everything
  - LLM-specific metrics (hallucination, response quality)
  - Single service deployment

Cons:
  - Newer, less mature ecosystem
  - Focused on LLM apps (not general purpose)
  - Limited customization vs Grafana

WHEN TO USE EACH:

Use Traditional Stack When:
  - You need infrastructure-level monitoring
  - Multi-language, multi-service architecture
  - Existing Prometheus/Grafana expertise
  - Custom metric aggregation requirements

Use Opik When:
  - Primary focus is LLM application monitoring
  - Need cost tracking per user/request
  - Want prompt A/B testing
  - Rapid prototyping and iteration
  - Simpler setup and maintenance
"""

print(comparison)

## Key Takeaways

### What We Learned:

1. **Automatic Tracing**
   - Zero-config LangChain integration via callbacks
   - Automatic token counting and cost calculation
   - Full request/response capture

2. **Custom Tracking**
   - `@track` decorator for any function
   - Rich metadata logging
   - Business context capture

3. **Cost Monitoring**
   - Per-user, per-model, per-operation costs
   - Token usage trends
   - Budget alerts (in Opik UI)

4. **Quality Evaluation**
   - Score responses for accuracy
   - Track confidence over time
   - Identify low-quality outputs

5. **Experimentation**
   - Compare prompt versions
   - A/B test different approaches
   - Find optimal configurations

---
