# 📓 The GenAI Revolution Cookbook

**Title:** Structured Data Extraction with LLMs: How to Build a Pipeline

**Description:** Build a reliable structured data extraction pipeline using LLMs, LangChain, and OpenAI function calling: JSON schemas, repeatable outputs, and validation that keeps hallucinated fields out of production data.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



High-quality structured data unlocks downstream analytics, automation, and search. This guide shows how to turn messy text into validated JSON using LLMs, OpenAI function calling, LangChain, and Pydantic, with no training data and no brittle regex. You'll build and run a Colab-ready extraction pipeline whose output is repeatable and schema-checked. If you're dealing with unpredictable input, understanding [tokenization pitfalls and invisible characters](/article/tokenization-pitfalls-invisible-characters-that-break-prompts-and-rag-2) can help you avoid subtle extraction bugs.

## Why This Approach Works

**Function Calling Enforces Structure**  
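OpenAI function calling has the model return its answer as JSON arguments for a function whose schema you define, rather than free-form text. This constrains the output to the fields you declare, though the arguments can still be malformed or incorrect, which is why the pipeline validates them afterward.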

**Pydantic Validates At The Boundary**  
Pydantic models enforce types, required fields, and constraints at runtime. Invalid payloads fail fast with clear error messages, ensuring downstream systems receive clean data.

**LangChain Orchestrates Composable Pipelines**  
LangChain chains combine prompts, models, and parsers into testable, reusable pipelines. You can swap models, extend schemas, or add retry logic without rewriting core extraction logic.

**Deterministic Output With Temperature Zero**  
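Setting temperature to zero minimizes sampling randomness. The same input then produces the same output in nearly all cases, which makes extraction predictable and testable. The API does not guarantee bit-identical results, so treat determinism as something to verify rather than assume.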

## How It Works

1. **Define Schema**: Use Pydantic models to specify the structure of extracted data (e.g., event name, date, outcome).
2. **Convert To Function Spec**: Transform the Pydantic model into an OpenAI function definition.
3. **Bind Function To Model**: Attach the function spec to the LLM so it knows to return structured JSON.
4. **Create Prompt**: Write a strict system prompt instructing the model to extract only explicit information.
5. **Build Chain**: Compose prompt, model, and output parser into a LangChain pipeline.
6. **Invoke And Validate**: Run the chain on input text and validate the output with Pydantic.

## Setup & Installation

Run this cell at the top of your Colab notebook to install all required dependencies:

In [22]:
!pip install -U "langchain>=0.2" "langchain-openai>=0.1" "langchain-community>=0.2" "langchain-text-splitters>=0.0.1" pydantic python-dotenv beautifulsoup4 html2text



Set your OpenAI API key as an environment variable before running the notebook. The cell below fails early with a clear message if the key is missing:

In [26]:
import os

required_keys = ["OPENAI_API_KEY"]
missing = [k for k in required_keys if not os.getenv(k)]
if missing:
    raise EnvironmentError(
        f"Missing required environment variables: {', '.join(missing)}\n"
        "Please set them before running the notebook. Example:\n"
        "  export OPENAI_API_KEY='your-key-here'"
    )

print("All required API keys found.")

All required API keys found.
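
In Colab, an `export` run from a `!` cell executes in a subshell and does not reach the notebook kernel, so one option is to set the variable from Python before running the check above. The sketch below is one way to do that with a hidden prompt; `ensure_api_key` and its `prompt_fn` parameter are hypothetical names introduced here, not part of any library.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_api_key(name: str = "OPENAI_API_KEY",
                   prompt_fn: Callable[[str], str] = getpass) -> bool:
    """Set `name` in os.environ from a hidden prompt if it is missing.

    Returns True if the variable is set when the function returns.
    """
    if os.getenv(name):
        return True  # already configured; never overwrite an existing value
    value = prompt_fn(f"Enter {name}: ").strip()
    if value:
        os.environ[name] = value
    return bool(os.getenv(name))


# In a notebook cell, call it once before the key check:
# ensure_api_key()
```

The prompt function is injectable so the helper can be exercised in tests without blocking on keyboard input.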


## Step-by-Step Implementation

### Step 1: Initialize The LLM

Initialize the OpenAI chat model with temperature zero so output is as repeatable as possible. `ChatOpenAI` reads `OPENAI_API_KEY` from the environment:

In [28]:
from langchain_openai import ChatOpenAI

# Use gpt-4o-mini for cost-effective, fast extraction with function calling support
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
print("LLM ready:", llm)

LLM ready: client=<openai.resources.chat.completions.completions.Completions object at 0x79d56d546b10> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x79d56d547a10> root_client=<openai.OpenAI object at 0x79d4ccdaed50> root_async_client=<openai.AsyncOpenAI object at 0x79d4da138500> model_name='gpt-4o-mini' temperature=0.0 model_kwargs={} openai_api_key=SecretStr('**********') stream_usage=True


### Step 2: Define Pydantic Models

Create Pydantic models to define the structure of extracted events. Field descriptions guide the LLM:

In [29]:
from typing import List, Optional
from pydantic import BaseModel, Field

class Event(BaseModel):
    """
    Represents a single event extracted from text.
    """
    name: str = Field(..., description="The explicit event name or title extracted verbatim from the text.")
    date: Optional[str] = Field(None, description="The explicit date as written in the text (ISO if present, else raw).")
    outcome: Optional[str] = Field(None, description="The explicit outcome/result stated in the text, if any.")

class Extracted(BaseModel):
    """
    Wrapper model for a list of extracted events.
    """
    events: List[Event] = Field(default_factory=list, description="All events explicitly mentioned in the text.")

### Step 3: Convert Schema To OpenAI Function Spec

Transform the Pydantic model into an OpenAI function definition so the model knows to return structured JSON:

In [32]:
from langchain_core.utils.function_calling import convert_to_openai_function

extract_fn = convert_to_openai_function(Extracted)

functions = [extract_fn]

### Step 4: Create A Strict System Prompt

Write a system prompt that instructs the model to extract only explicit information and avoid hallucination:

In [33]:
from langchain_core.prompts import ChatPromptTemplate

SYSTEM_PROMPT = """You are a precise information extractor.
- Extract only information explicitly present in the text.
- Do not infer, guess, or add missing details.
- If a field is not explicitly present, set it to null.
- If no events are present, return an empty list.
- Preserve original wording where reasonable."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", "{text}")
    ]
)

### Step 5: Bind Function To Model And Set Up Parsers

Bind the function spec to the LLM and configure output parsers to extract structured data:

In [35]:
from langchain_core.output_parsers.openai_functions import (
    JsonKeyOutputFunctionsParser,
    JsonOutputFunctionsParser,
)

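# Bind the function spec and force the model to call it on every request,
# so the parsers below always receive function-call arguments
model_with_fn = llm.bind(functions=functions, function_call={"name": extract_fn["name"]})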

# Parser that returns only the "events" key from the function arguments
events_only_parser = JsonKeyOutputFunctionsParser(key_name="events")

# Parser that returns the entire function arguments payload
full_args_parser = JsonOutputFunctionsParser()

### Step 6: Build LangChain Pipelines

Compose the prompt, model, and parsers into reusable chains:

In [36]:
# Chain that returns only the list of events
events_chain = prompt | model_with_fn | events_only_parser

# Chain that returns the full structured payload for validation
full_chain = prompt | model_with_fn | full_args_parser

## Run And Validate

### Basic Extraction

Test the pipeline on a simple input string:

In [37]:
text = "I attended a music festival on June 15th and a tech conference on July 20th."
events = events_chain.invoke({"text": text})
print(events)

[{'name': 'music festival', 'date': 'June 15th', 'outcome': None}, {'name': 'tech conference', 'date': 'July 20th', 'outcome': None}]


### Validate With Pydantic

Use the full chain to extract the payload and validate it with Pydantic:

In [38]:
payload = full_chain.invoke({"text": text})
validated = Extracted.model_validate(payload)
print(validated)
for ev in validated.events:
    print(ev.name, ev.date, ev.outcome)

events=[Event(name='music festival', date='June 15th', outcome=None), Event(name='tech conference', date='July 20th', outcome=None)]
music festival June 15th None
tech conference July 20th None
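
The run above shows only the success path. As a rough sketch of the failure path, the snippet below redefines trimmed-down copies of the two models so it runs on its own, then feeds in a deliberately malformed payload. `validate_payload` is a hypothetical helper name and the bad payload is invented for illustration; it assumes Pydantic v2, which the `model_validate` call above already implies.

```python
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError


class Event(BaseModel):
    name: str
    date: Optional[str] = None
    outcome: Optional[str] = None


class Extracted(BaseModel):
    events: List[Event] = Field(default_factory=list)


def validate_payload(payload: dict) -> Optional[Extracted]:
    """Return a validated model, or None after reporting each problem."""
    try:
        return Extracted.model_validate(payload)
    except ValidationError as exc:
        for err in exc.errors():
            # `loc` is the path to the offending field, e.g. ('events', 0, 'name')
            print("invalid:", err["loc"], err["msg"])
        return None


good = validate_payload({"events": [{"name": "Launch Day"}]})
# Hypothetical malformed payload: the required `name` field is missing
bad = validate_payload({"events": [{"date": "June 15th"}]})
```

Returning `None` is only one option; in a production pipeline you might instead log the error and re-raise, or route the input to a retry or review queue.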


### Edge Case: No Events

Verify the extractor returns an empty list when no events are present:

In [39]:
text_irrelevant = "This is irrelevant text with no events."
empty = events_chain.invoke({"text": text_irrelevant})
print(empty)
assert empty == [], "Extractor should not hallucinate events."

[]


### Edge Case: Missing Fields

Test extraction when some fields are missing:

In [40]:
text_partial = "Our team hosted Launch Day and later Demo Night."
partial = events_chain.invoke({"text": text_partial})
print(partial)

[{'name': 'Launch Day', 'date': None, 'outcome': None}, {'name': 'Demo Night', 'date': None, 'outcome': None}]


### Determinism Check

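Run the same input twice and confirm the outputs match. With temperature zero this passes in nearly all cases, but the API does not guarantee identical output, so an occasional failure here points to model nondeterminism rather than a pipeline bug: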

In [41]:
same1 = events_chain.invoke({"text": text})
same2 = events_chain.invoke({"text": text})
assert same1 == same2, "Outputs should be identical with temperature=0."
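
Two runs are a small sample. A slightly broader check, sketched below, hashes each result in a canonical form and reports what fraction of runs agree. `fingerprint`, `repeatability`, and `fake_extract` are hypothetical names; the stand-in extractor exists only so the example runs without an API key.

```python
import hashlib
import json
from collections import Counter
from typing import Any, Callable


def fingerprint(result: Any) -> str:
    """Hash a JSON-serializable result with stable key ordering."""
    canonical = json.dumps(result, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def repeatability(extract: Callable[[str], Any], text: str, runs: int = 5) -> float:
    """Fraction of runs that agree with the most common output (1.0 = all identical)."""
    counts = Counter(fingerprint(extract(text)) for _ in range(runs))
    return counts.most_common(1)[0][1] / runs


# Stand-in extractor so the sketch runs offline; with the chain above you
# would pass `lambda t: events_chain.invoke({"text": t})` instead.
def fake_extract(text: str) -> list:
    return [{"name": "music festival", "date": "June 15th", "outcome": None}]


print(repeatability(fake_extract, "any text"))  # 1.0 for this constant stand-in
```

A score below 1.0 on real calls is a signal to inspect which fields vary before relying on exact-match assertions in tests.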

### Real-World Data Extraction

Load a Wikipedia page and extract events from the first 10,000 characters of its text:

In [42]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Apollo_program")
docs = loader.load()
page_text = docs[0].page_content[:10000]

real_events = events_chain.invoke({"text": page_text})
print(f"Extracted {len(real_events)} events")
for e in real_events[:5]:
    print(e)



Extracted 4 events
{'name': 'Apollo 11 mission', 'date': 'July 20, 1969', 'outcome': 'astronauts Neil Armstrong and Buzz Aldrin landed their Apollo Lunar Module on the Moon and walked on the lunar surface, while Michael Collins remained in lunar orbit in the command and service module, and all three landed safely on Earth in the Pacific Ocean on July 24.'}
{'name': 'Apollo 1 cabin fire', 'date': '1967', 'outcome': 'killed the entire crew during a prelaunch test.'}
{'name': 'Apollo 13 landing', 'date': None, 'outcome': "had to be aborted after an oxygen tank exploded en route to the Moon, crippling the CSM. The crew barely managed a safe return to Earth by using the Lunar Module as a 'lifeboat' on the return journey."}
{'name': 'Apollo 17 mission', 'date': 'December 1972', 'outcome': None}


### Chunking Long Documents

For longer documents, split text into overlapping chunks, extract per chunk, then merge and deduplicate events. If you notice models missing details or hallucinating as context grows, our deep dive on [context rot and LLM memory limits](/article/context-rot-why-llms-forget-as-their-memory-grows-3) explains why this happens and how to mitigate it.

In [43]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=400)
chunks = splitter.split_text(docs[0].page_content)

all_events = []
for ch in chunks:
    all_events.extend(events_chain.invoke({"text": ch}))

# Deduplicate by (name, date) tuple; later duplicates overwrite earlier ones
unique = {(e["name"], e.get("date")): e for e in all_events}
merged_events = list(unique.values())
print(f"Merged events: {len(merged_events)}")

Merged events: 150
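
Exact-match keys treat "Apollo 11 mission" and "apollo 11  mission " as different events, so overlapping chunks can leave near-duplicates in the merged list. One possible refinement, sketched below, normalizes case and whitespace before comparing and fills in fields that the first-seen copy lacks. `norm` and `merge_events` are hypothetical helpers, and the sample events are made up for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

EventDict = Dict[str, Optional[str]]


def norm(value: Optional[str]) -> str:
    """Case-fold, trim, and collapse internal whitespace; None becomes ''."""
    return re.sub(r"\s+", " ", value or "").strip().casefold()


def merge_events(events: List[EventDict]) -> List[EventDict]:
    """Deduplicate on normalized (name, date), keeping first-seen order.

    When duplicates collide, fill in fields the kept event is missing.
    """
    merged: Dict[Tuple[str, str], EventDict] = {}
    for ev in events:
        key = (norm(ev.get("name")), norm(ev.get("date")))
        if key not in merged:
            merged[key] = dict(ev)  # copy so the input list is not mutated
            continue
        kept = merged[key]
        for field, val in ev.items():
            if kept.get(field) is None and val is not None:
                kept[field] = val
    return list(merged.values())


# Made-up sample: the first two entries differ only in case and spacing
sample = [
    {"name": "Apollo 11 mission", "date": "July 20, 1969", "outcome": None},
    {"name": "apollo 11  mission ", "date": "July 20, 1969", "outcome": "landed on the Moon"},
    {"name": "Apollo 17 mission", "date": "December 1972", "outcome": None},
]
print(len(merge_events(sample)))  # 2
```

This still will not merge paraphrased names such as "Apollo 11" and "Apollo 11 mission"; that would need fuzzy matching, which is outside the scope of this sketch.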


## Constraints And Performance

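**Token Limits**: gpt-4o-mini accepts up to 128k tokens of context, so most single pages fit in one call. Chunking is still worthwhile for long documents: smaller inputs (the example above uses 4,000-character chunks) keep each call focused and keep the returned event list well within the model's output limit.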

**Cost**: At the time of writing, gpt-4o-mini costs approximately $0.15 per 1M input tokens and $0.60 per 1M output tokens; check current pricing before budgeting. A typical extraction call uses 500–2000 tokens.
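
To turn those rates into a budget figure, a back-of-the-envelope calculation like the one below can help. The prices are the ones quoted above and may have changed; the workload numbers and the `estimate_cost` helper are hypothetical.

```python
# Prices quoted in the text above, in USD per 1M tokens; verify against current pricing.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def estimate_cost(input_tokens: int, output_tokens: int, calls: int = 1) -> float:
    """Estimated USD cost for `calls` requests of the given token sizes."""
    per_call = (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return per_call * calls


# Hypothetical workload: 10,000 calls, 1,500 input and 300 output tokens each.
print(f"${estimate_cost(1500, 300, calls=10_000):.2f}")  # $4.05
```

Because chunk overlap resends the same text, it raises the input-token side of this estimate in proportion to the overlap ratio.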

**Latency**: Expect 1–3 seconds per extraction call depending on input size and API load.

For high-volume jobs, control prompt size, reduce chunk overlap, and prefer cheaper models like gpt-4o-mini for extraction. If you're evaluating which model best fits your pipeline's speed, cost, and reliability needs, see our practical guide on [how to choose an LLM for your application](/article/how-to-choose-an-ai-model-for-your-app-speed-cost-reliability).
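
For high-volume jobs, another lever is running calls concurrently while capping how many are in flight, which keeps throughput up without tripping rate limits. The sketch below shows the pattern with `asyncio.Semaphore`; `extract_all` and `fake_extract` are hypothetical names, and the stand-in coroutine replaces a real API call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


async def extract_all(
    texts: List[str],
    extract: Callable[[str], Awaitable[list]],
    max_concurrency: int = 4,
) -> List[list]:
    """Run `extract` over all texts with at most `max_concurrency` in flight.

    Results come back in the same order as `texts`.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def one(text: str) -> list:
        async with sem:  # wait here until a slot is free
            return await extract(text)

    return await asyncio.gather(*(one(t) for t in texts))


# Stand-in coroutine so the sketch runs offline. With the chain above, an
# equivalent would be `lambda t: events_chain.ainvoke({"text": t})`.
async def fake_extract(text: str) -> list:
    await asyncio.sleep(0.01)
    return [{"name": text.upper(), "date": None, "outcome": None}]


results = asyncio.run(extract_all(["a", "b", "c"], fake_extract, max_concurrency=2))
print([r[0]["name"] for r in results])  # ['A', 'B', 'C']
```

Inside Jupyter or Colab an event loop is already running, so replace `asyncio.run(...)` with `await extract_all(...)` in a cell.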

## Conclusion

You've built a validated, repeatable extraction pipeline that converts raw text into structured JSON using OpenAI function calling, Pydantic, and LangChain. The system enforces schema compliance, reduces the room for hallucination, and produces consistent results from run to run.

**Key Design Choices**:
- **Function calling** constrains the model to return JSON arguments that follow your schema.
- **Pydantic validation** catches invalid payloads at runtime.
- **LangChain orchestration** makes the pipeline composable, testable, and extensible.
- **Temperature zero** makes output repeatable in practice, though not strictly guaranteed.

**Next Steps**:
- Add retry logic with exponential backoff for production reliability.
- Extend the schema with new fields like location or confidence scores.
- Deploy the pipeline as a REST API using FastAPI.
- Parallelize extraction across multiple documents with asyncio and rate limiting.
- Add observability with structured logging and input/output hashing to track performance and avoid logging PII.
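
As a starting point for the first item, retry with exponential backoff can be sketched in plain Python as below. `with_retry` and its parameters are hypothetical names, not a LangChain or OpenAI API, and the flaky stand-in function simulates transient failures so the example runs offline.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(
    fn: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on `retry_on` with exponential backoff and jitter.

    Makes at most `retries + 1` attempts; the last failure is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise
            # Upper bound doubles each attempt (1s, 2s, 4s, ...) and is capped.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the bound to spread out retries.
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable when retries >= 0")


# Offline demo with a stand-in that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_extract() -> list:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return []


print(with_retry(flaky_extract, base_delay=0.01), attempts["n"])  # [] 3
```

With the chain from this notebook, the call would look like `with_retry(lambda: events_chain.invoke({"text": text}))`. In practice, narrow `retry_on` to transient errors such as timeouts and rate limits so that genuine bugs fail immediately.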