# 📓 The GenAI Revolution Cookbook

**Title:** How to Use Langfuse Tracing for Prompts, Evals, and Cost Control

**Description:** Instrument calls with Langfuse tracing to see traces/spans, version prompts safely, A/B on datasets, and tie evaluations to p95 latency, token cost, and acceptance-rate improvements.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



## Why Observability Matters for GenAI Applications

Building reliable, efficient GenAI systems requires more than just connecting to an LLM API. As you scale from prototype to production, you need visibility into latency, token usage, errors, and user feedback. Without observability, you're debugging blind—unable to pinpoint bottlenecks, optimize costs, or measure quality.

Langfuse is an open-source observability platform designed specifically for LLM applications. It captures traces, spans, and generations across your entire pipeline, giving you the data to improve performance, reduce costs, and iterate faster. This guide shows you how to instrument a Python application with Langfuse, from basic tracing to advanced scoring and feedback loops.

## Setting Up Langfuse

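Install the Langfuse SDK and set your credentials as environment variables. The examples in this guide call `langfuse.trace()`, the client interface of the v2 Python SDK, so the install pins `langfuse<3`:

In [None]:
%pip install "langfuse<3" openai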

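In [None]:
import os

# Placeholder values; setdefault keeps anything already exported in your shell.
os.environ.setdefault("LANGFUSE_PUBLIC_KEY", "your-public-key")
os.environ.setdefault("LANGFUSE_SECRET_KEY", "your-secret-key")
os.environ.setdefault("LANGFUSE_HOST", "https://cloud.langfuse.com")  # or self-hosted URL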

Initialize the client in your application:

In [None]:
import os
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)

This setup connects your app to Langfuse, enabling trace collection and analysis in the dashboard.
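Before constructing the client, it can help to fail fast when a credential is missing, since an unset key may otherwise only surface later as a failed upload. A minimal sketch, assuming the three environment variable names used above; `missing_langfuse_env` is a hypothetical helper, not part of the Langfuse SDK:

```python
import os

# Environment variables this guide relies on (names taken from the setup step above).
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")

def missing_langfuse_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_langfuse_env()
if missing:
    print("Missing Langfuse settings:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.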

## Creating Traces and Spans

A **trace** represents a single user request end-to-end. A **span** is a timed operation within that trace—like retrieval, generation, or tool execution.

Start a trace with metadata for filtering:

In [None]:
def start_trace(user_id: str, route: str, model: str):
    trace = langfuse.trace(
        name="chat_request",
        user_id=user_id,
        metadata={"route": route, "env": os.getenv("APP_ENV", "dev"), "model": model},
    )
    return trace

Wrap operations in spans to measure latency:

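In [None]:
import time
from contextlib import contextmanager

@contextmanager
def with_span(trace, name: str):
    # Open a span on the trace; it is closed when the with-block exits.
    span = trace.span(name=name)
    try:
        yield span
    finally:
        # Always end the span, even if the wrapped operation raises.
        span.end()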

Use spans to isolate slow components—retrieval, reranking, or generation—and optimize where it counts.
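To see how a span wrapper behaves without sending anything to Langfuse, the sketch below uses stand-in objects. `StubTrace` and `StubSpan` are hypothetical test doubles that mimic only the `span(name=...)` and `end()` calls used in this guide; they are not Langfuse classes. The wrapper, here named `span_scope`, relies on `contextlib.contextmanager` so it works in a `with` statement and ends the span even when the wrapped step raises.

```python
from contextlib import contextmanager

class StubSpan:
    """Hypothetical stand-in that records whether end() was called."""
    def __init__(self, name):
        self.name = name
        self.ended = False

    def end(self):
        self.ended = True

class StubTrace:
    """Hypothetical stand-in exposing only span(name=...)."""
    def __init__(self):
        self.spans = []

    def span(self, name):
        s = StubSpan(name)
        self.spans.append(s)
        return s

@contextmanager
def span_scope(trace, name):
    span = trace.span(name=name)
    try:
        yield span
    finally:
        span.end()  # runs on success and on error

trace = StubTrace()
with span_scope(trace, "retrieval"):
    docs = ["doc-1", "doc-2"]  # placeholder for a real retrieval step

try:
    with span_scope(trace, "generation"):
        raise RuntimeError("simulated model failure")
except RuntimeError:
    pass

print([(s.name, s.ended) for s in trace.spans])
# → [('retrieval', True), ('generation', True)]
```

Both spans are closed, including the one whose body failed, which is the property that keeps latency data complete when errors occur.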

## Tracing LLM Generations

Capture input, output, token usage, and latency for every LLM call:

In [None]:
from openai import OpenAI

client = OpenAI()

def call_llm_with_tracing(trace, messages, model="gpt-4o-mini"):
    gen = trace.generation(
        name="chat.completions",
        model=model,
        metadata={"provider": "openai"},
        input=messages,
    )
    t0 = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    t1 = time.time()

    content = resp.choices[0].message.content
    usage = {"input": resp.usage.prompt_tokens, "output": resp.usage.completion_tokens}

    gen.end(
        output=content,
        usage=usage,
        metadata={"elapsed_ms": int((t1 - t0) * 1000)},
    )
    return content, usage

This logs every generation with full context, making it easy to spot cost spikes or slow models in the Langfuse dashboard.
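Token counts become cost once multiplied by per-token prices. The sketch below shows that arithmetic for a `usage` dictionary shaped like the one returned above. The prices in `PRICE_PER_1M_TOKENS` are placeholder numbers chosen for illustration, not real provider pricing, and `estimate_cost_usd` and `example-model` are hypothetical names; substitute your provider's current rates.

```python
# Placeholder prices in USD per one million tokens (illustrative only, not real pricing).
PRICE_PER_1M_TOKENS = {
    "example-model": {"input": 1.00, "output": 2.00},
}

def estimate_cost_usd(usage, model, prices=PRICE_PER_1M_TOKENS):
    """Estimate request cost from a usage dict with 'input' and 'output' token counts."""
    rate = prices[model]
    total = usage["input"] * rate["input"] + usage["output"] * rate["output"]
    return total / 1_000_000

usage = {"input": 1200, "output": 300}
cost = estimate_cost_usd(usage, "example-model")
print(f"{cost:.6f}")
```

Keeping input and output rates separate matters because output tokens are commonly priced differently from input tokens.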

## Logging Scores and Feedback

Scores quantify quality, cost, and user satisfaction. Log them directly to traces:

In [None]:
def log_feedback_and_scores(trace, accepted: bool, helpfulness: float, grounded: bool, notes: str = ""):
    trace.score(name="accepted", value=1.0 if accepted else 0.0)
    trace.score(name="helpfulness", value=helpfulness)
    trace.score(name="groundedness", value=1.0 if grounded else 0.0, comment=notes)

Combine heuristic scores (e.g., groundedness checks) with user feedback (thumbs up/down) to track quality over time. Filter traces by score in the UI to debug low-quality outputs.
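As one example of a heuristic score, the sketch below approximates groundedness by the share of answer words that also appear in the retrieved context. This lexical-overlap check is a deliberately crude stand-in, not a method Langfuse provides, and `overlap_groundedness` and the sample strings are hypothetical. The resulting float could be passed as the `value` of a score.

```python
import re

def _words(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_groundedness(answer, context):
    """Fraction of answer words found in the context (0.0 when the answer is empty)."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    context_words = set(_words(context))
    hits = sum(1 for w in answer_words if w in context_words)
    return hits / len(answer_words)

context = "RAG combines retrieval with generation to ground answers in documents."
grounded = overlap_groundedness("RAG combines retrieval with generation.", context)
ungrounded = overlap_groundedness("Bananas are yellow.", context)
print(grounded, ungrounded)
# → 1.0 0.0
```

A check like this misses paraphrases and rewards copied text, so it is best treated as a cheap first filter alongside user feedback rather than a final quality measure.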

## Analyzing Traces in Production

End-to-end traces reveal bottlenecks across retrieval, generation, and tool calls. Each span shows latency and errors, and per-generation token usage exposes cost spikes by model and prompt. With this data you can target p95 latency and token reductions where they matter rather than guessing. For a comprehensive guide on building RAG systems, see our [5 essential steps to building agentic RAG systems with LangChain and ChromaDB](/blog/44830763/5-essential-steps-to-building-agentic-rag-systems-with-langchain-and-chromadb).

Use the Langfuse dashboard to:

- Filter traces by user, route, or model
- Aggregate latency and token usage by span
- Correlate scores with trace metadata to identify patterns

This data-driven approach replaces guesswork with actionable insights.
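The per-span aggregation can also be reproduced offline on exported data. A minimal sketch, assuming each record is a dictionary with hypothetical `span` and `latency_ms` fields (an actual export schema may differ); it computes p95 latency per span name with the nearest-rank method on synthetic samples.

```python
from collections import defaultdict

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def p95_by_span(records):
    """Group latency samples by span name and return p95 for each group."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["span"]].append(rec["latency_ms"])
    return {name: p95(samples) for name, samples in grouped.items()}

# Synthetic sample data for illustration only.
records = [{"span": "retrieval", "latency_ms": ms} for ms in range(10, 210, 10)]
records += [{"span": "generation", "latency_ms": ms} for ms in (400, 450, 500, 2000)]
print(p95_by_span(records))
# → {'retrieval': 190, 'generation': 2000}
```

The single 2000 ms outlier dominates the generation p95 here, which is why tail percentiles are more useful than averages for spotting slow requests.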

## Complete Example

Here's a full implementation combining tracing, spans, and scoring:

In [None]:
import os
import time
from contextlib import contextmanager
from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)

client = OpenAI()

def start_trace(user_id: str, route: str, model: str):
    trace = langfuse.trace(
        name="chat_request",
        user_id=user_id,
        metadata={"route": route, "env": os.getenv("APP_ENV", "dev"), "model": model},
    )
    return trace

@contextmanager
def with_span(trace, name: str):
    # Open a span on the trace; it is closed when the with-block exits.
    span = trace.span(name=name)
    try:
        yield span
    finally:
        # Always end the span, even if the wrapped operation raises.
        span.end()

def call_llm_with_tracing(trace, messages, model="gpt-4o-mini"):
    gen = trace.generation(
        name="chat.completions",
        model=model,
        metadata={"provider": "openai"},
        input=messages,
    )
    t0 = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    t1 = time.time()

    content = resp.choices[0].message.content
    usage = {"input": resp.usage.prompt_tokens, "output": resp.usage.completion_tokens}

    gen.end(
        output=content,
        usage=usage,
        metadata={"elapsed_ms": int((t1 - t0) * 1000)},
    )
    return content, usage

def openai_messages(user_input: str, system_prompt: str):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

def log_feedback_and_scores(trace, accepted: bool, helpfulness: float, grounded: bool, notes: str = ""):
    trace.score(name="accepted", value=1.0 if accepted else 0.0)
    trace.score(name="helpfulness", value=helpfulness)
    trace.score(name="groundedness", value=1.0 if grounded else 0.0, comment=notes)

# Example usage
trace = start_trace(user_id="user_123", route="/chat", model="gpt-4o-mini")
messages = openai_messages("What is RAG?", "You are a helpful assistant.")
content, usage = call_llm_with_tracing(trace, messages)
log_feedback_and_scores(trace, accepted=True, helpfulness=0.9, grounded=True)

# Events are sent in the background; flush so nothing is lost if the process exits.
langfuse.flush()

Run this code, then open the Langfuse dashboard to explore traces, spans, and scores.

## Next Steps

Start by instrumenting one critical path—your main chat endpoint or RAG pipeline. Add spans for each operation, log token usage, and capture user feedback. As you collect data, filter by latency or cost to find optimization opportunities.

Langfuse integrates with LangChain, LlamaIndex, and other frameworks, so you can extend tracing across your entire stack. For production deployments, consider self-hosting Langfuse to retain full control over your observability data.