# 📓 The GenAI Revolution Cookbook

**Title:** Structured Data Extraction with LLMs: How to Build a Pipeline

**Description:** Build a reliable structured data extraction pipeline using LLMs, LangChain, and OpenAI functions: JSON schemas, stable outputs, and fewer hallucinations, for production.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## What You Will Build

You'll build an end-to-end extractor that converts unstructured text (feedback, reports, emails, webpages) into validated JSON objects. It uses OpenAI function calling, LangChain orchestration, and Pydantic validation so that each result either conforms to the schema or fails loudly with a validation error, even at scale. If you want to avoid subtle bugs when processing text, be sure to review our guide on [tokenization pitfalls and invisible characters that break prompts and RAG](/article/tokenization-pitfalls-invisible-characters-that-break-prompts-and-rag-2).

**Core build objective:** Extract events (name, date, outcome) from any text into a validated Pydantic schema, with no free-form responses and with prompting that leaves unstated fields empty rather than guessed.

**What you'll learn:**
- Schema-first extraction with Pydantic models
- OpenAI function/tool binding for structured outputs
- LangChain orchestration and output parsing
- Test harness for extraction quality
- Chunking strategy for long documents
- Production readiness: retries, logging, deduplication

**Prerequisites:**
- Python 3.9+
- OpenAI API key
- Basic familiarity with LangChain and Pydantic

---

## Install Dependencies

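Run this cell to install the required packages. The version ranges keep the LangChain packages on one compatible release line that provides `with_structured_output` and the separate `langchain-text-splitters` package, both of which this notebook uses:

In [None]:
!pip install "langchain>=0.2,<0.3" "langchain-openai>=0.1,<0.2" "langchain-community>=0.2,<0.3" "langchain-text-splitters>=0.2,<0.3" pydantic==1.10.15 python-dotenv beautifulsoup4 requests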

---

## Configure API Key

Securely load your OpenAI API key. This cell works in both local environments (via `.env` file) and Colab (via userdata):

In [None]:
# Purpose: Load OpenAI API key from environment or Colab userdata for secure configuration.

import os
from dotenv import load_dotenv

try:
    from google.colab import userdata
except ImportError:
    userdata = None  # Not running in Colab

# Load environment variables from .env file if present
load_dotenv()

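# Fetch API key from Colab userdata if not set in environment
if "OPENAI_API_KEY" not in os.environ and userdata is not None:
    try:
        colab_key = userdata.get("OPENAI_API_KEY")
    except Exception:  # secret missing or notebook access not granted
        colab_key = None
    if colab_key:
        os.environ["OPENAI_API_KEY"] = colab_key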

# If still not set, prompt the user securely
if "OPENAI_API_KEY" not in os.environ:
    from getpass import getpass
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY must be set to proceed"

---

## Verify Model Connectivity

Confirm your credentials and model availability with a simple deterministic test:

In [None]:
# Purpose: Confirm OpenAI credentials, model availability, and deterministic response.

from langchain_openai import ChatOpenAI

# Initialize the model with temperature=0 for more stable output
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
resp = llm.invoke("Respond with the word OK only.")
print(resp.content)  # Should print: "OK"

---

## Define the Extraction Schema

Create a Pydantic model to represent the structure of an event. This schema enforces explicit field types and optional values:

In [None]:
# Purpose: Represent an event with explicit fields for extraction.

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Event(BaseModel):
    """
    Represents a single event extracted from text.

    Args:
        name (str): The exact event name or label mentioned in the text.
        date (Optional[str]): The explicit date string if present (e.g., 'June 15, 2024').
        outcome (Optional[str]): The explicit outcome/result if stated (e.g., 'canceled', 'successful').
    """
    name: str = Field(..., description="The exact event name or label mentioned in the text.")
    date: Optional[str] = Field(None, description="The explicit date string if present (e.g., 'June 15, 2024').")
    outcome: Optional[str] = Field(None, description="The explicit outcome/result if stated (e.g., 'canceled', 'successful').")

Wrap the list of events in a container schema for validation:

In [None]:
# Purpose: Wrap a list of events for structured extraction and validation.

class Extracted(BaseModel):
    """
    Container for all events extracted from a text.

    Args:
        events (List[Event]): All explicit events found in the text.
    """
    events: List[Event] = Field(default_factory=list, description="All explicit events found in the text.")

---

## Build the Extraction Chain

Define a strict system prompt to discourage hallucinations and enforce extraction discipline:

In [None]:
# Purpose: Instruct the model to extract only explicit information, no inference.

SYSTEM_PROMPT = """You are an information extraction engine.
Extract only what is explicitly stated in the input text.
Do not infer or assume missing information.
If no explicit events exist, return an empty list.
"""

Create a prompt template that combines the system instruction with user input:

In [None]:
# Purpose: Structure the prompt for system and user input.

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [("system", SYSTEM_PROMPT), ("human", "{text}")]
)

Initialize the model and bind it to the Pydantic schema using LangChain's structured output feature:

In [None]:
# Purpose: Use LangChain's structured output to return validated Pydantic models.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(Extracted)

# Compose the full chain: prompt → model → validated output
chain = prompt | structured_llm

---

## Run and Validate

Test the chain with explicit events to confirm correct extraction:

In [None]:
# Purpose: Validate extraction of multiple events with explicit dates.

text = "I attended a music festival on June 15th and a tech conference on July 20th."
result = chain.invoke({"text": text})
print(result)
# Expected: Extracted(events=[Event(name='music festival', date='June 15th', outcome=None), Event(name='tech conference', date='July 20th', outcome=None)])

Verify that the output matches the schema:

In [None]:
# Purpose: Ensure the output is a validated instance of the schema.

assert isinstance(result, Extracted)
assert all(isinstance(e, Event) for e in result.events)
print("✓ Schema validation passed")

Test with irrelevant input to confirm the chain returns an empty list:

In [None]:
# Purpose: Confirm that irrelevant text yields an empty list.

irrelevant = "This is irrelevant text."
result_empty = chain.invoke({"text": irrelevant})
assert result_empty.events == []
print("✓ Empty result for irrelevant input")

Test with partial data to ensure missing fields remain `None`:

In [None]:
# Purpose: Ensure missing fields are set to None, not guessed.

partial = "We hosted Analytics Day and later ran the Fall Hackathon."
result_partial = chain.invoke({"text": partial})
print(result_partial)
# Expected: Events with names but no dates or outcomes

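Check stability across repeated runs. Note that `temperature=0` reduces variation but does not guarantee identical outputs; setting a seed and pinning the model version improve reproducibility further, on a best-effort basis. The comparison below only checks the first event's name: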

In [None]:
# Purpose: Ensure repeated runs yield consistent outputs.

sample = "Board meeting on March 3rd; outcome: postponed."
r1 = chain.invoke({"text": sample})
r2 = chain.invoke({"text": sample})
assert r1.events[0].name == r2.events[0].name, "Output should be stable with temperature=0"
print("✓ Stability check passed")
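
The assertions above check one field at a time. For a small regression suite, it can help to score a whole result against hand-labeled expectations. The following is a minimal sketch under stated assumptions: `score_extraction` is a hypothetical helper, the events are hand-written stand-ins rather than model output, and matching on case-insensitive `(name, date)` pairs is just one reasonable choice.

```python
from types import SimpleNamespace


def score_extraction(predicted, expected):
    """Return (precision, recall) over case-insensitive (name, date) pairs."""
    def as_pairs(events):
        return {(e.name.strip().casefold(), e.date) for e in events}

    pred, gold = as_pairs(predicted), as_pairs(expected)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 1.0  # nothing predicted, nothing wrong
    recall = hits / len(gold) if gold else 1.0     # nothing expected, nothing missed
    return precision, recall


# Hand-written stand-ins; in the notebook, `predicted` would be result.events
expected = [
    SimpleNamespace(name="music festival", date="June 15th"),
    SimpleNamespace(name="tech conference", date="July 20th"),
]
predicted = [
    SimpleNamespace(name="Music Festival", date="June 15th"),
    SimpleNamespace(name="after-party", date=None),  # illustrative spurious event
]
precision, recall = score_extraction(predicted, expected)
print(precision, recall)
```

Low precision points to invented events; low recall points to missed ones. Tracking both over a fixed set of examples makes prompt or model changes easier to compare.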

---

## Process Real-World Text Data

Load a live webpage to extract events from unstructured content:

In [None]:
# Purpose: Fetch and process real-world unstructured text.

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Apollo_program")
docs = loader.load()
page = docs[0].page_content
print(f"Loaded {len(page)} characters")

For long documents, chunk the text to respect model context limits. Note that `chunk_size` and `chunk_overlap` are measured in characters here, not tokens. Use overlap to preserve context across boundaries:

In [None]:
# Purpose: Split large texts and aggregate results safely.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=300)
chunks = splitter.split_text(page)

all_events = []
for ch in chunks:
    # Extend the list with events from each chunk
    result = chain.invoke({"text": ch})
    all_events.extend(result.events)

print(f"Extracted {len(all_events)} events from {len(chunks)} chunks")
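
To make the effect of `chunk_overlap` concrete, here is a simplified, standard-library sketch of fixed-size chunking with overlap. It is an illustration only: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraphs and sentences rather than at fixed offsets.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not LangChain's algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(demo, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so an event mentioned near a boundary is likely to appear whole in at least one chunk. The cost is that the same event may be extracted twice.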

If you're concerned about losing critical information in long prompts or context windows, our article on [placing critical info in long prompts and mitigating recall loss](/article/lost-in-the-middle-placing-critical-info-in-long-prompts) offers practical advice. For a deeper dive into why LLMs may "forget" or hallucinate as context grows, see our breakdown of [context rot and how to manage LLM memory effectively](/article/context-rot-why-llms-forget-as-their-memory-grows-3).

Display a sample of extracted events:

In [None]:
# Purpose: Inspect a few results for quality and structure.

for e in all_events[:5]:
    print(e)

Deduplicate events based on key fields to remove redundant entries:

In [None]:
# Purpose: Remove duplicate events based on key fields.

def dedupe_events(events):
    """
    Deduplicate a list of Event objects by (name, date, outcome).

    Args:
        events (List[Event]): List of Event Pydantic models.

    Returns:
        List[Event]: Deduplicated list of events.
    """
    seen = set()
    unique = []
    for e in events:
        # Use a tuple of key fields as the deduplication key
        key = (e.name, e.date, e.outcome)
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

all_events = dedupe_events(all_events)
print(f"Unique events: {len(all_events)}")
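
Exact-match keys treat `"Board meeting"` and `"board  meeting "` as different events, which can happen when overlapping chunks return the same event with slightly different casing or spacing. A minimal sketch of a more forgiving key follows; `normalize_key` and `dedupe_normalized` are hypothetical names, and the `SimpleNamespace` objects stand in for `Event` instances so the cell runs without an API call.

```python
from types import SimpleNamespace


def normalize_key(event):
    """Build a case- and whitespace-insensitive key from an event's fields."""
    def norm(value):
        # None stays None so "no date" never collides with an empty string
        return None if value is None else " ".join(value.split()).casefold()
    return (norm(event.name), norm(event.date), norm(event.outcome))


def dedupe_normalized(events):
    """Keep the first occurrence of each event under the normalized key."""
    seen = set()
    unique = []
    for event in events:
        key = normalize_key(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# Stand-ins for Event objects (same attribute names as the schema above)
sample_events = [
    SimpleNamespace(name="Board meeting", date="March 3rd", outcome="postponed"),
    SimpleNamespace(name="board  meeting ", date="March 3rd", outcome="Postponed"),
    SimpleNamespace(name="music festival", date="June 15th", outcome=None),
]
deduped = dedupe_normalized(sample_events)
print([e.name for e in deduped])
# ['Board meeting', 'music festival']
```

The first spelling encountered is the one that is kept, so the output preserves the original text rather than the normalized form.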

---

## Production Readiness

Add retry logic with exponential backoff to handle rate limits and transient errors:

In [None]:
# Purpose: Add robustness to API calls in production.

import time
import random

def robust_invoke(chain, text, retries=3):
    """
    Invoke a chain with retries and exponential backoff.

    Args:
        chain: The LangChain chain to invoke.
        text (str): The input text for extraction.
        retries (int): Maximum number of attempts, including the first call.

    Returns:
        The result of chain.invoke.

    Raises:
        Exception: If all retries fail.
    """
    for i in range(retries):
        try:
            return chain.invoke({"text": text})
        except Exception as e:
            if i == retries - 1:
                raise
            # Exponential backoff with jitter
            wait = (2 ** i) + random.random()
            print(f"Retry {i+1}/{retries} after {wait:.2f}s due to: {e}")
            time.sleep(wait)
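
`robust_invoke` retries on every exception, including errors that will not succeed on a second attempt (for example an invalid API key). A variation, sketched here under stated assumptions, retries only on exception types the caller names and accepts an injectable `sleep` so the backoff can be tested without waiting. `invoke_with_backoff`, `TransientError`, and `FlakyChain` are hypothetical names used for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit or timeout error."""


def invoke_with_backoff(chain, text, attempts=3, retry_on=(TransientError,),
                        sleep=time.sleep):
    """Call chain.invoke, retrying only on the exception types in retry_on."""
    for attempt in range(attempts):
        try:
            return chain.invoke({"text": text})
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            sleep((2 ** attempt) + random.random())


class FlakyChain:
    """Test double that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def invoke(self, payload):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError("simulated rate limit")
        return {"events": [], "text": payload["text"]}


waits = []  # record requested sleeps instead of actually sleeping
flaky = FlakyChain(failures=2)
outcome = invoke_with_backoff(flaky, "hello", sleep=waits.append)
print(flaky.calls, [int(w) for w in waits])
# 3 [1, 2]
```

With the real chain you would pass the transient exception classes raised by your client library as `retry_on`; which classes those are depends on the SDK version you have installed.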

Log token usage and latency to monitor cost and performance:

In [None]:
# Purpose: Capture token usage and latency for cost and performance tracking.

import time

def invoke_with_metrics(chain, text):
    """
    Invoke the chain and log token usage and latency.

    Args:
        chain: The LangChain chain to invoke.
        text (str): The input text for extraction.

    Returns:
        The result of chain.invoke.
    """
    start = time.time()
    result = chain.invoke({"text": text})
    elapsed = time.time() - start
    # Note: Token usage requires callback or response inspection; placeholder here
    print(f"Latency: {elapsed*1000:.0f}ms")
    return result

For production logging, consider redacting sensitive data from prompts and outputs to protect PII.
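
As a starting point, a minimal sketch of regex-based redaction applied before text reaches a log. The patterns are illustrative assumptions that cover only email addresses and US-style phone numbers, and `redact` is a hypothetical helper; real PII handling usually needs broader coverage and review.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}\b")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


log_line = redact(
    "Contact jane.doe@example.com or (555) 010-9999 "
    "about the board meeting on 2024-06-15."
)
print(log_line)
# Contact [EMAIL] or [PHONE] about the board meeting on 2024-06-15.
```

The phone pattern is deliberately narrow so that ISO dates, which this pipeline cares about, are left intact.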

---

## Appendix: Alternative Approach with Function Binding

If you need fine-grained control over function calling, you can manually bind the function and parse the output. This approach is useful for debugging or custom parsing logic:

In [None]:
# Purpose: Bind OpenAI function and parse the 'events' key for fine control.

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

try:
    from langchain_community.utils.openai_functions import convert_pydantic_to_openai_function
except ImportError:
    # Fallback: the helper is defined in langchain_core
    from langchain_core.utils.function_calling import convert_pydantic_to_openai_function

extract_fn = convert_pydantic_to_openai_function(Extracted)
functions = [extract_fn]

# Force the model to call the extraction function so every reply carries function arguments
model_with_fn = llm.bind(functions=functions, function_call={"name": extract_fn["name"]})

# Parse only the 'events' key from the function call arguments
events_parser = JsonKeyOutputFunctionsParser(key_name="events")

chain_functions = prompt | model_with_fn | events_parser

# Example usage:
events = chain_functions.invoke({"text": "I attended a music festival on June 15th."})
print(events)  # Example output (model-dependent): [{'name': 'music festival', 'date': 'June 15th', 'outcome': None}]

Inspect raw function call output for debugging:

In [None]:
# Purpose: Debug by viewing both raw and parsed outputs.

from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

raw_parser = JsonOutputFunctionsParser()  # Returns full function args as dict
raw_chain = prompt | model_with_fn | raw_parser

sample = "Board meeting on March 3rd; outcome: postponed."
print("Raw:", raw_chain.invoke({"text": sample}))
print("Parsed:", chain_functions.invoke({"text": sample}))
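
For orientation when reading the raw output: with the legacy functions API, the model's reply carries a `function_call` entry whose `arguments` field is a JSON-encoded string, and the parsers above decode that string for you. The sketch below does the same by hand on a hand-written payload (not real API output); `parse_function_call` is a hypothetical helper.

```python
import json


def parse_function_call(function_call, key=None):
    """Decode a function_call dict's JSON `arguments`; optionally pick one key."""
    args = json.loads(function_call["arguments"])
    return args if key is None else args[key]


# Hand-written example shaped like a function_call entry (not real API output)
payload = {
    "name": "Extracted",
    "arguments": (
        '{"events": [{"name": "Board meeting", '
        '"date": "March 3rd", "outcome": "postponed"}]}'
    ),
}
print(parse_function_call(payload))            # full arguments, like raw_parser
print(parse_function_call(payload, "events"))  # one key, like events_parser
```

Because `arguments` is model-generated text, `json.loads` can fail on malformed output; this is one reason the validated `with_structured_output` chain is the default in this notebook.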

---

## Single-File Script

Here's a complete, runnable script that consolidates the entire pipeline:

In [None]:
# extractor.py
# Purpose: End-to-end pipeline for structured event extraction from text.

import os
from typing import List, Optional
from dotenv import load_dotenv

from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Load API key from .env or Colab userdata
try:
    from google.colab import userdata
except ImportError:
    userdata = None

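load_dotenv()
if "OPENAI_API_KEY" not in os.environ and userdata is not None:
    try:
        colab_key = userdata.get("OPENAI_API_KEY")
    except Exception:  # secret missing or notebook access not granted
        colab_key = None
    if colab_key:
        os.environ["OPENAI_API_KEY"] = colab_key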

if "OPENAI_API_KEY" not in os.environ:
    from getpass import getpass
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in your .env file or Colab userdata"

SYSTEM_PROMPT = """You are an information extraction engine.
Extract only what is explicitly stated in the input text.
Do not infer or assume missing information.
If no explicit events exist, return an empty list.
"""

class Event(BaseModel):
    """
    Represents a single event extracted from text.

    Args:
        name (str): The exact event name or label mentioned in the text.
        date (Optional[str]): The explicit date string if present.
        outcome (Optional[str]): The explicit outcome/result if stated.
    """
    name: str = Field(..., description="The exact event name or label mentioned in the text.")
    date: Optional[str] = Field(None, description="The explicit date string if present (e.g., 'June 15, 2024').")
    outcome: Optional[str] = Field(None, description="The explicit outcome/result if stated (e.g., 'canceled', 'successful').")

class Extracted(BaseModel):
    """
    Container for all events extracted from a text.

    Args:
        events (List[Event]): All explicit events found in the text.
    """
    events: List[Event] = Field(default_factory=list, description="All explicit events found in the text.")

def build_chain(model_name: str = "gpt-4o-mini"):
    """
    Build a structured output chain using LangChain's with_structured_output.

    Args:
        model_name (str): The OpenAI model to use.

    Returns:
        A LangChain chain that returns validated Pydantic models.
    """
    llm = ChatOpenAI(model=model_name, temperature=0)
    prompt = ChatPromptTemplate.from_messages([("system", SYSTEM_PROMPT), ("human", "{text}")])
    structured_llm = llm.with_structured_output(Extracted)
    return prompt | structured_llm

if __name__ == "__main__":
    chain = build_chain()

    # Test with sample input containing two events
    sample = "I attended a music festival on June 15th and a tech conference on July 20th."
    print("Structured:", chain.invoke({"text": sample}))

    # Test with irrelevant input (should return empty)
    irrelevant = "This is irrelevant text."
    print("Empty structured:", chain.invoke({"text": irrelevant}))

---

## Conclusion

You've built a JSON extraction pipeline with the core pieces needed for production: it converts unstructured text into validated, schema-conformant events. The system uses OpenAI function calling, LangChain orchestration, and Pydantic validation to enforce structure, and a strict prompt to reduce hallucinated fields.

**Key takeaways:**
- Schema-first design with Pydantic ensures type safety and validation
- Structured output chains simplify extraction and reduce parsing errors
- Chunking with overlap preserves context in long documents
- Retry logic and deduplication add production robustness

**Next steps:**
- Add a FastAPI route to serve the extractor as an API
- Implement an evaluation harness with golden examples and assertions
- Normalize extracted dates to ISO format with Pydantic validators
- Explore vendor fallback by swapping to an open-weight model
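
For the date normalization step, one possible starting point is a plain function that you could later call from a Pydantic validator. This is a sketch under stated assumptions: `to_iso_date` is a hypothetical helper, the format list covers only a few English date styles, and a date without a year is resolved only when the caller supplies `default_year`, in keeping with the "do not infer" rule of the prompt.

```python
import re
from datetime import datetime
from typing import Optional

# Formats tried in order; extend for the date styles in your own data.
# %B matches English month names under the default C/English locale.
_FORMATS = ("%B %d, %Y", "%B %d %Y", "%Y-%m-%d")


def to_iso_date(raw: Optional[str], default_year: Optional[int] = None) -> Optional[str]:
    """Convert strings like 'June 15th, 2024' to '2024-06-15'; None if unparseable."""
    if not raw:
        return None
    # Drop ordinal suffixes: 15th -> 15, 3rd -> 3
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    # A date without a year is only resolved if the caller supplies one
    if default_year is not None and not re.search(r"\d{4}", cleaned):
        cleaned = f"{cleaned}, {default_year}"
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("June 15, 2024"))
print(to_iso_date("June 15th"))
print(to_iso_date("March 3rd", default_year=2025))
```

Returning `None` for anything unparseable keeps the original string available on the event, so no information is lost when normalization fails.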

You now have a working, testable system ready to extract structured data from any text source at scale.