# LangExtract: A Comprehensive Guide for Contributors

This notebook walks you through **everything** you need to know about [langextract](https://github.com/SalML/LangExtract) — from basic usage to its internal architecture — so you can confidently contribute to the project.

## What is LangExtract?

LangExtract is a Python library for **structured information extraction** from text using Large Language Models (LLMs). Given some text, a description of what to extract, and a few examples, it:

1. **Chunks** the text into manageable pieces
2. **Prompts** an LLM with your examples (few-shot learning)
3. **Parses** the LLM's structured output (JSON/YAML)
4. **Aligns** extracted spans back to the original text (exact character positions)
5. **Visualizes** results as interactive, color-coded HTML

Think of it as: *"I want to find all the characters, emotions, and relationships in this Shakespeare passage"* — and LangExtract handles the entire pipeline.

---

## Architecture Overview

Here's how data flows through the system:

```
User Input (text / URL / Documents)
    │
    ▼
┌─────────────────────────────────────────────┐
│  lx.extract()  — the main entry point       │
│                                              │
│  1. Model Factory  ─── creates LLM provider │
│     ├─ Gemini, OpenAI, or Ollama             │
│     └─ Applies schema constraints            │
│                                              │
│  2. Tokenizer  ─── splits text into tokens   │
│                                              │
│  3. Chunking  ─── splits into sized chunks   │
│     (controlled by max_char_buffer)           │
│                                              │
│  4. Prompt Template  ─── builds few-shot     │
│     prompts from your examples               │
│                                              │
│  5. LLM Inference  ─── calls the model       │
│     (parallel batches via max_workers)        │
│                                              │
│  6. Resolver  ─── parses JSON/YAML output    │
│     └─ Creates Extraction objects             │
│                                              │
│  7. Alignment  ─── maps extractions back     │
│     to source text (char_interval)            │
│                                              │
│  8. Multi-pass Merging (if passes > 1)       │
│     └─ Merges non-overlapping extractions    │
└─────────────────────────────────────────────┘
    │
    ▼
AnnotatedDocument (text + extractions with positions)
    │
    ├──▶ lx.io.save_annotated_documents()  →  JSONL file
    │
    └──▶ lx.visualize()  →  Interactive HTML
```

---

## Key Modules (for contributors)

| Module | What it does | Key classes/functions |
|--------|-------------|----------------------|
| `core/data.py` | Core data structures | `Extraction`, `Document`, `AnnotatedDocument`, `ExampleData`, `CharInterval`, `AlignmentStatus` |
| `annotation.py` | Orchestrates the full extraction pipeline | `Annotator.annotate_text()`, `Annotator.annotate_documents()` |
| `factory.py` | Creates LLM providers from model IDs | `create_model()`, `ModelConfig` |
| `providers/` | LLM provider implementations | `GeminiLanguageModel`, `OpenAILanguageModel`, `OllamaLanguageModel` |
| `prompting.py` | Builds few-shot prompts | `PromptTemplateStructured`, `QAPromptGenerator` |
| `chunking.py` | Splits text into sized chunks | `TextChunk` |
| `resolver.py` | Parses LLM output and aligns to text | `Resolver`, `WordAligner` |
| `core/schema.py` | Structured output schemas | `BaseSchema`, `FormatModeSchema` |
| `core/tokenizer.py` | Text tokenization | `Tokenizer`, `TokenInterval`, `TokenizedText` |
| `io.py` | Save/load annotated documents | `save_annotated_documents()`, `load_annotated_documents_jsonl()` |
| `visualization.py` | Interactive HTML visualization | `visualize()` |
| `plugins.py` | Dynamic provider registration | Plugin entry points |

---

## Part 1: Setup

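First, import langextract and load your API key from a `.env` file. The `load_dotenv` call below comes from the `python-dotenv` package, so install that alongside langextract.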

LangExtract resolves API keys from environment variables:
- `GEMINI_API_KEY` for Gemini models
- `OPENAI_API_KEY` for OpenAI models
- `LANGEXTRACT_API_KEY` as a universal fallback
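
The lookup order can be pictured with a small sketch. This is not the library's implementation: `resolve_api_key` is a hypothetical helper, and checking the provider-specific variable before `LANGEXTRACT_API_KEY` is an assumption based on the "fallback" wording above.

```python
import os

# Hypothetical mapping from provider name to its environment variable,
# taken from the list above.
PROVIDER_ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
}
FALLBACK_ENV_VAR = "LANGEXTRACT_API_KEY"


def resolve_api_key(provider, env=None):
    """Return the first non-empty key found, or None if nothing is set."""
    env = os.environ if env is None else env
    # Assumed order: provider-specific variable first, then the fallback.
    for name in (PROVIDER_ENV_VARS.get(provider), FALLBACK_ENV_VAR):
        if name and env.get(name):
            return env[name]
    return None


# An explicit dict keeps the example independent of your shell environment.
fake_env = {"LANGEXTRACT_API_KEY": "fallback-key", "OPENAI_API_KEY": "openai-key"}
print(resolve_api_key("openai", fake_env))  # openai-key
print(resolve_api_key("gemini", fake_env))  # fallback-key
```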

In [None]:
import langextract as lx
import textwrap

from dotenv import load_dotenv

load_dotenv()

---

## Part 2: Core Data Structures

Before using `lx.extract()`, you need to understand the three fundamental data structures.

### 2a. `Extraction` — a single extracted entity

An `Extraction` represents one thing you found in the text:

```python
@dataclass
class Extraction:
    extraction_class: str           # Category (e.g. "character", "emotion")
    extraction_text: str            # The exact text span (e.g. "ROMEO")
    attributes: dict | None         # Extra metadata (e.g. {"mood": "happy"})
    char_interval: CharInterval | None         # Start/end positions in the source text (None if unaligned)
    alignment_status: AlignmentStatus | None   # How well it matched (MATCH_EXACT, MATCH_FUZZY, etc.)
```

**`AlignmentStatus` values:**
- `MATCH_EXACT` — perfect token-level match
- `MATCH_FUZZY` — fuzzy overlap match (configurable threshold, default 0.75)
- `MATCH_GREATER` — matched text is longer than the extraction
- `MATCH_LESSER` — partial exact match

### 2b. `ExampleData` — a few-shot training example

You teach the LLM *what* to extract by providing examples:

```python
@dataclass
class ExampleData:
    text: str                        # The example text
    extractions: list[Extraction]    # What should be extracted from it
```

### 2c. `AnnotatedDocument` — the output

After extraction, you get back an `AnnotatedDocument`:

```python
@dataclass
class AnnotatedDocument:
    text: str                        # The original input text
    extractions: list[Extraction]    # All extractions found
    document_id: str                 # Auto-generated unique ID
```

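Extractions that LangExtract manages to align have `char_interval` set, so you know *exactly* where in the text they were found. An extraction that could not be aligned may have `char_interval` (and `alignment_status`) left as `None`, so check before indexing into the text.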
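
To make the relationship between `char_interval` and the text concrete, here is a minimal sketch using simplified stand-in dataclasses (not the real `lx.data` classes). It assumes `end_pos` is exclusive, as in Python slicing, and treats a missing interval as "no position".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharInterval:  # stand-in for the library class of the same name
    start_pos: Optional[int] = None
    end_pos: Optional[int] = None


@dataclass
class Extraction:  # simplified stand-in with only the fields used here
    extraction_class: str
    extraction_text: str
    char_interval: Optional[CharInterval] = None


def span_text(text, extraction):
    """Return the source slice for an extraction, or None if it has no position."""
    interval = extraction.char_interval
    if interval is None or interval.start_pos is None or interval.end_pos is None:
        return None
    return text[interval.start_pos:interval.end_pos]


text = "Lady Juliet gazed longingly at the stars"
juliet = Extraction("character", "Juliet", CharInterval(5, 11))
unaligned = Extraction("emotion", "longing")  # no interval attached

print(span_text(text, juliet))     # Juliet
print(span_text(text, unaligned))  # None
```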

---

## Part 3: Defining Your Extraction Task

You need two things:
1. **A prompt** — tells the LLM what kinds of entities to extract
2. **Examples** — shows the LLM the expected output format (few-shot learning)

### The Prompt

The prompt describes your extraction task in natural language. Tips:
- Be specific about what categories to extract
- Tell it to use exact text (not paraphrase)
- Mention that attributes should add meaningful context

In [None]:
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

print(prompt)

### The Examples (Few-Shot Learning)

Examples are **critical** — they teach the LLM:
- What `extraction_class` values to use (e.g. "character", "emotion", "relationship")
- What text spans to extract (`extraction_text` must be an exact substring)
- What attributes to include and how to format them

LangExtract uses these examples to:
1. Build the few-shot prompt sent to the LLM
2. Generate structured output schemas (if `use_schema_constraints=True`)
3. Validate prompt alignment (catches mistakes early)

**Important:** `extraction_text` must be an **exact substring** of the example `text`. LangExtract validates this!
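
You can run a similar check yourself before calling the library. The sketch below is a plain substring test over `(text, [extraction_text, ...])` pairs; `find_unaligned_examples` is a hypothetical helper, and LangExtract's own validation may work differently (for example, on tokens rather than raw characters).

```python
def find_unaligned_examples(examples):
    """Return (example_index, extraction_text) pairs that are not substrings."""
    problems = []
    for index, (text, extraction_texts) in enumerate(examples):
        for extraction_text in extraction_texts:
            if extraction_text not in text:
                problems.append((index, extraction_text))
    return problems


# Each example is (text, [extraction_text, ...]) for brevity.
toy_examples = [
    ("ROMEO. But soft! What light through yonder window breaks?",
     ["ROMEO", "But soft!"]),
    ("It is the east, and Juliet is the sun.",
     ["Juliet is the sun", "Juliet is the moon"]),  # second one is a paraphrase
]

print(find_unaligned_examples(toy_examples))  # [(1, 'Juliet is the moon')]
```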

In [None]:
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            # Category: "character" — we extract the character's name
            # extraction_text MUST appear exactly in the example text
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            # Category: "emotion" — we extract the phrase that conveys emotion
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            # Category: "relationship" — we extract the phrase describing a relationship
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

print(f"Example text: {examples[0].text!r}")
print(f"Number of extractions in example: {len(examples[0].extractions)}")
for ext in examples[0].extractions:
    print(f"  [{ext.extraction_class}] \"{ext.extraction_text}\" -> {ext.attributes}")

---

## Part 4: Running Extraction with `lx.extract()`

This is the main entry point. Here's what each parameter does:

| Parameter | Type | Default | What it does |
|-----------|------|---------|-------------|
| `text_or_documents` | `str`, URL, or `Iterable[Document]` | *required* | The text to extract from |
| `prompt_description` | `str` | `None` | Natural language description of your task |
| `examples` | `list[ExampleData]` | `None` | Few-shot examples (**required** in practice, despite the `None` default) |
| `model_id` | `str` | `"gemini-2.5-flash"` | Which LLM to use |
| `max_char_buffer` | `int` | `1000` | Max chars per LLM call (controls chunking) |
| `temperature` | `float` | `None` | LLM temperature (lower = more deterministic) |
| `use_schema_constraints` | `bool` | `True` | Generate structured output schema from examples |
| `batch_length` | `int` | `10` | Chunks per batch |
| `max_workers` | `int` | `10` | Parallel workers for inference |
| `extraction_passes` | `int` | `1` | Run multiple passes for better recall |
| `format_type` | `FormatType` | `None` | Force JSON or YAML output format |
| `debug` | `bool` | `False` | Enable debug logging |

**Returns:**
- `AnnotatedDocument` if the input is a single string or URL
- `list[AnnotatedDocument]` if the input is an iterable of `Document` objects

In [None]:
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # also supports: gpt-4o, llama3, etc.
)

### Inspecting the Result

The result is an `AnnotatedDocument`. Let's look at what was extracted.

In [None]:
print(f"Input text: {result.text!r}")
print(f"Document ID: {result.document_id}")
print(f"Number of extractions: {len(result.extractions)}")
print()

for i, ext in enumerate(result.extractions):
    print(f"Extraction {i+1}:")
    print(f"  Class:      {ext.extraction_class}")
    print(f"  Text:       \"{ext.extraction_text}\"")
    print(f"  Attributes: {ext.attributes}")
    print(f"  Alignment:  {ext.alignment_status}")
    # char_interval can be missing when an extraction could not be aligned,
    # so check it before reading positions or slicing the source text.
    interval = ext.char_interval
    if interval is not None and interval.start_pos is not None:
        print(f"  Position:   chars {interval.start_pos}-{interval.end_pos}")
        actual = result.text[interval.start_pos:interval.end_pos]
        print(f"  Verified:   \"{actual}\"")
    else:
        print("  Position:   (not aligned)")
    print()

---

## Part 5: Saving Results with `lx.io`

LangExtract saves results as **JSON Lines** (`.jsonl`) — one JSON object per line, one document per line.

This format is:
- Easy to stream (read one line at a time)
- Easy to append to
- Compatible with most data tools
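
The three properties above are easy to see with the standard library alone. This sketch writes, appends to, and streams a toy `.jsonl` file; the record fields are illustrative and are not meant to mirror the exact on-disk schema LangExtract produces.

```python
import json
import os
import tempfile

records = [
    {"document_id": "doc_1", "text": "First text", "extractions": []},
    {"document_id": "doc_2", "text": "Second text", "extractions": []},
]

path = os.path.join(tempfile.mkdtemp(), "toy.jsonl")

# Write: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Append: existing lines are left untouched.
extra = {"document_id": "doc_3", "text": "Third text", "extractions": []}
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps(extra) + "\n")

# Stream: parse one line at a time instead of loading one big JSON array.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print([r["document_id"] for r in loaded])  # ['doc_1', 'doc_2', 'doc_3']
```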

In [None]:
# Save to a JSONL file
lx.io.save_annotated_documents(
    [result],                           # list of AnnotatedDocument(s)
    output_name="extraction_results.jsonl",
    output_dir="."
)

In [None]:
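# You can load them back later. Wrapping the call in list() keeps len() and
# indexing working even if the loader returns a lazy iterator.
loaded_docs = list(lx.io.load_annotated_documents_jsonl("extraction_results.jsonl"))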
print(f"Loaded {len(loaded_docs)} document(s)")
print(f"First doc has {len(loaded_docs[0].extractions)} extractions")

---

## Part 6: Visualization with `lx.visualize()`

`lx.visualize()` creates an **interactive HTML widget** that:
- Color-codes each extraction class (character = blue, emotion = green, etc.)
- Highlights the extracted span in the text
- Shows attributes on hover
- Has play/pause controls to step through extractions one by one

It automatically assigns colors from a 10-color palette:
```
Light Blue, Light Green, Light Yellow, Light Red, Light Orange,
Light Purple, Light Teal, Light Pink, Light Grey, Pale Cyan
```
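
One plausible way to hand out these colors is sketched below. The palette names come from the list above, but the assignment rule (sorted class names, wrapping around after ten classes) is an assumption for illustration, and `assign_colors` is a hypothetical helper rather than a function in `visualization.py`.

```python
from itertools import cycle

PALETTE = [
    "Light Blue", "Light Green", "Light Yellow", "Light Red", "Light Orange",
    "Light Purple", "Light Teal", "Light Pink", "Light Grey", "Pale Cyan",
]


def assign_colors(extraction_classes):
    """Map each distinct class to a palette color, reusing colors after ten."""
    unique_classes = sorted(set(extraction_classes))
    # zip stops at the end of unique_classes; cycle repeats the palette as needed.
    return dict(zip(unique_classes, cycle(PALETTE)))


colors = assign_colors(["character", "emotion", "relationship", "character"])
print(colors)
```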

**Parameters:**
- `data_source`: an `AnnotatedDocument` or path to a `.jsonl` file
- `animation_speed`: seconds between extractions during auto-play (default: 1.0)
- `show_legend`: show color legend (default: True)
- `gif_optimized`: larger fonts for screen recording (default: True)

In [None]:
# Visualize directly from the JSONL file
html_content = lx.visualize("extraction_results.jsonl")

# In Jupyter, this renders inline automatically.
# To also save as a standalone HTML file:
with open("visualization.html", "w", encoding="utf-8") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)
    else:
        f.write(html_content)

html_content

---

## Part 7: Supported Models

LangExtract supports three provider families. The `model_id` string is auto-routed to the correct provider.

| Provider | Model IDs | API Key Env Var | Structured Output |
|----------|-----------|-----------------|-------------------|
| **Gemini** | `gemini-2.5-flash`, `gemini-2.5-pro`, etc. | `GEMINI_API_KEY` | Yes (JSON schema) |
| **OpenAI** | `gpt-4o`, `gpt-4o-mini`, `gpt-5`, etc. | `OPENAI_API_KEY` | No |
| **Ollama** (local) | `llama3`, `mistral`, `gemma`, `phi`, `qwen`, `deepseek`, etc. | N/A (runs locally) | No |

You can also register custom providers via the **plugin system** (entry points).

---

## Part 8: A Different Example — Custom Categories

Let's try a completely different extraction task to show how flexible LangExtract is. This time we'll extract **people, places, and activities** from casual text.

In [None]:
custom_prompt = textwrap.dedent("""\
    Extract people, places, and activities mentioned in the text.
    Use exact text for extractions. Do not paraphrase.
    Provide meaningful attributes for each entity to add context.""")

custom_examples = [
    lx.data.ExampleData(
        text="Sarah went to the library to study for her final exams.",
        extractions=[
            lx.data.Extraction(
                extraction_class="person",
                extraction_text="Sarah",
                attributes={"role": "student"}
            ),
            lx.data.Extraction(
                extraction_class="place",
                extraction_text="the library",
                attributes={"type": "academic facility"}
            ),
            lx.data.Extraction(
                extraction_class="activity",
                extraction_text="study for her final exams",
                attributes={"purpose": "exam preparation"}
            ),
        ]
    )
]

custom_input = "Anirudh loves to go to UIUC and he enjoys spending time with his friends and having good food"

custom_result = lx.extract(
    text_or_documents=custom_input,
    prompt_description=custom_prompt,
    examples=custom_examples,
    model_id="gemini-2.5-flash",
)

In [None]:
print(f"Input: {custom_result.text!r}")
print()
for ext in custom_result.extractions:
    print(f"  [{ext.extraction_class}] \"{ext.extraction_text}\" -> {ext.attributes}")

In [None]:
# Visualize this result directly from the AnnotatedDocument (no need to save first)
lx.visualize(custom_result)

---

## Part 9: Advanced Features

### 9a. Multi-pass Extraction (better recall)

Sometimes the LLM misses entities on the first pass. Setting `extraction_passes > 1` runs extraction multiple times and merges the results. When extractions from different passes overlap, the one from the earlier pass is kept (first-pass-wins).

```python
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,  # run 3 passes, merge results
)
```
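
The first-pass-wins merge can be sketched with plain tuples. This is an illustration of the idea, not the library's merge code: `merge_passes` is a hypothetical helper, extractions are reduced to `(start, end, label)` with an exclusive `end`, and overlaps inside a single pass are resolved the same way for simplicity.

```python
def overlaps(a, b):
    """True if half-open intervals (start, end) share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_passes(passes):
    """Merge per-pass lists of (start, end, label); earlier passes win on overlap."""
    merged = []
    for extractions in passes:
        for candidate in extractions:
            if not any(overlaps(candidate, kept) for kept in merged):
                merged.append(candidate)
    return sorted(merged)


pass_1 = [(0, 11, "character"), (40, 45, "emotion")]
pass_2 = [(5, 11, "character"), (18, 27, "emotion")]  # first item overlaps pass 1

print(merge_passes([pass_1, pass_2]))
# [(0, 11, 'character'), (18, 27, 'emotion'), (40, 45, 'emotion')]
```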

### 9b. Schema Constraints (more reliable output)

When `use_schema_constraints=True` (the default), LangExtract generates a JSON schema from your examples and passes it to the LLM. This forces the model to return well-structured output. Currently best supported by Gemini.

### 9c. Chunking Control

`max_char_buffer` controls how much text is sent per LLM call. For long documents:
- Smaller buffer = more API calls, but each prompt is shorter and more focused
- Larger buffer = fewer API calls, but the model may miss entities in long chunks
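
For a rough feel of the trade-off, the sketch below estimates how many LLM calls a document needs. `estimate_llm_calls` is a hypothetical helper, and the result is only a lower bound: it assumes every chunk is filled to exactly `max_char_buffer` characters, whereas real chunks break on token boundaries and are usually a little shorter.

```python
import math


def estimate_llm_calls(text_length, max_char_buffer=1000, extraction_passes=1):
    """Lower-bound estimate: chunks needed times number of passes."""
    chunks = max(1, math.ceil(text_length / max_char_buffer))
    return chunks * extraction_passes


print(estimate_llm_calls(25_000))                       # 25
print(estimate_llm_calls(25_000, max_char_buffer=500))  # 50
print(estimate_llm_calls(25_000, extraction_passes=3))  # 75
```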

### 9d. Parallel Processing

`batch_length` and `max_workers` control parallelism:
- `batch_length`: how many chunks per batch
- `max_workers`: max parallel API calls
- Effective parallelism = min(batch_length, max_workers)
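
The `min(batch_length, max_workers)` rule follows from how batched thread pools behave, as this sketch shows. It is a toy model, not the `Annotator` code: `fake_infer` stands in for a provider call and `run_in_batches` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_infer(chunk):
    """Stand-in for a provider call; a real one would hit an LLM API."""
    return chunk.upper()


def run_in_batches(chunks, batch_length=10, max_workers=10):
    """Process chunks batch by batch, with a thread pool inside each batch."""
    results = []
    for start in range(0, len(chunks), batch_length):
        batch = chunks[start:start + batch_length]
        # No more than len(batch) calls can be in flight at once, so extra
        # workers beyond the batch size would sit idle.
        workers = min(len(batch), max_workers)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results.extend(pool.map(fake_infer, batch))  # map keeps input order
    return results


chunks = [f"chunk {i}" for i in range(7)]
print(run_in_batches(chunks, batch_length=3, max_workers=10))
```

With `batch_length=3`, at most three calls run concurrently even though ten workers were allowed, which is why raising `max_workers` alone may not speed things up.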

### 9e. Processing URLs

You can pass a URL directly as input — LangExtract will download and extract from it:

```python
result = lx.extract(
    text_or_documents="https://example.com/article.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
```

### 9f. Batch Processing Multiple Documents

To process many documents at once, wrap them in `Document` objects:

```python
docs = [
    lx.data.Document(text="First document text..."),
    lx.data.Document(text="Second document text..."),
]

results = lx.extract(
    text_or_documents=docs,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)  # returns list[AnnotatedDocument]
```

---

## Part 10: Internals Deep Dive (for Contributors)

If you want to contribute to langextract, here's how the key internal components work:

### The Extraction Pipeline (step by step)

**Step 1: Model Creation** (`factory.py`)
- `factory.create_model(config)` matches `model_id` to a provider using regex patterns
- Each provider (Gemini, OpenAI, Ollama) implements `BaseLanguageModel` with an `infer()` method
- Schema constraints are applied via `apply_schema()` if supported
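
A regex-based router of this kind can be sketched in a few lines. The patterns and the `route` function below are hypothetical and only illustrate the matching idea; the real registry may use different patterns and return provider classes rather than names.

```python
import re

# Hypothetical patterns; the real registry may use different ones.
PROVIDER_PATTERNS = [
    (re.compile(r"^gemini"), "gemini"),
    (re.compile(r"^gpt-"), "openai"),
    (re.compile(r"^(llama|mistral|gemma|phi|qwen|deepseek)"), "ollama"),
]


def route(model_id):
    """Return the name of the first provider whose pattern matches model_id."""
    for pattern, provider in PROVIDER_PATTERNS:
        if pattern.search(model_id):
            return provider
    raise ValueError(f"No provider registered for model_id={model_id!r}")


for model_id in ("gemini-2.5-flash", "gpt-4o", "llama3"):
    print(model_id, "->", route(model_id))
```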

**Step 2: Tokenization** (`core/tokenizer.py`)
- Text is tokenized using a regex-based tokenizer
- `TokenizedText` is lazy-computed and cached on the `Document` object
- Each token has a `TokenInterval` tracking its position
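
As a rough picture of what a regex tokenizer with position tracking looks like, consider the sketch below. The pattern and the `Token` dataclass are illustrative stand-ins, not the definitions in `core/tokenizer.py`, and lazy caching is left out.

```python
import re
from dataclasses import dataclass

# Words, numbers, or single punctuation marks. The pattern is illustrative.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


@dataclass
class Token:  # hypothetical stand-in, not the library's token type
    text: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def tokenize(text):
    """Split text into tokens while remembering where each one came from."""
    return [Token(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


sample = "But soft! What light"
tokens = tokenize(sample)
print([t.text for t in tokens])  # ['But', 'soft', '!', 'What', 'light']
print(tokens[2])                 # Token(text='!', start=8, end=9)
```

Because every token keeps its offsets, later stages can convert a token-level match back into character positions in the original text.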

**Step 3: Chunking** (`chunking.py`)
- Text is split into `TextChunk` objects, each ≤ `max_char_buffer` characters
- Chunks include overlap to avoid splitting entities at boundaries
- Each chunk carries its `TokenInterval` for later alignment
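
The size limit can be illustrated with a greedy chunker that never splits a word. This is a simplified sketch under stated assumptions: it works on whitespace-separated words rather than `TokenInterval`s, it produces no overlap between chunks, and `chunk_text` is a hypothetical helper rather than the function in `chunking.py`.

```python
import re


def chunk_text(text, max_char_buffer=1000):
    """Greedily pack words into chunks; return (start, end) character offsets.

    A single word longer than the buffer becomes its own oversized chunk.
    """
    chunks = []
    chunk_start = None
    chunk_end = None
    for match in re.finditer(r"\S+", text):
        if chunk_start is None:
            chunk_start, chunk_end = match.start(), match.end()
        elif match.end() - chunk_start <= max_char_buffer:
            chunk_end = match.end()  # the word still fits in the current chunk
        else:
            chunks.append((chunk_start, chunk_end))
            chunk_start, chunk_end = match.start(), match.end()
    if chunk_start is not None:
        chunks.append((chunk_start, chunk_end))
    return chunks


sample = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
for start, end in chunk_text(sample, max_char_buffer=30):
    print((start, end), repr(sample[start:end]))
```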

**Step 4: Prompt Building** (`prompting.py`)
- `PromptTemplateStructured` combines your prompt description + examples into a few-shot prompt
- `QAPromptGenerator` formats each example as a Q&A pair
- The chunk text is inserted as the "question" the LLM should answer
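
The Q&A layout described above can be approximated as follows. `build_prompt` is a hypothetical helper, and the `Q:`/`A:` prefixes and the JSON shape of the answers are assumptions chosen for readability, not the exact text `QAPromptGenerator` emits.

```python
import json


def build_prompt(description, examples, chunk):
    """Assemble description, worked examples, and the new question as one string."""
    parts = [description, ""]
    for example_text, extractions in examples:
        answer = json.dumps({"extractions": extractions})
        parts.append(f"Q: {example_text}")
        parts.append(f"A: {answer}")
        parts.append("")
    # The chunk goes last as an unanswered question for the model to complete.
    parts.append(f"Q: {chunk}")
    parts.append("A: ")
    return "\n".join(parts)


toy_examples = [
    ("ROMEO. But soft!", [{"character": "ROMEO"}, {"emotion": "But soft!"}]),
]
prompt_text = build_prompt(
    "Extract characters and emotions. Use exact text.",
    toy_examples,
    "Lady Juliet gazed longingly at the stars",
)
print(prompt_text)
```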

**Step 5: Inference** (`annotation.py` + `providers/`)
- The `Annotator` sends chunks to the LLM in parallel batches
- Each provider's `infer()` method handles the actual API call
- Progress bars show processing status

**Step 6: Resolution** (`resolver.py`)
- `Resolver.resolve()` parses the LLM's JSON/YAML output
- Extracts code fences (` ```json...``` `) if present
- Creates `Extraction` objects from the parsed data
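
Fence stripping plus parsing might look like the sketch below. It handles JSON only (YAML would need a third-party parser), and `parse_llm_output` is a hypothetical helper, not `Resolver.resolve()` itself. The fence marker is built with string multiplication purely so the example does not contain a literal triple backtick.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the Markdown code fence marker
FENCE_PATTERN = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_llm_output(raw):
    """Strip an optional Markdown code fence, then parse the payload as JSON."""
    match = FENCE_PATTERN.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


fenced = (
    "Here you go:\n" + FENCE + "json\n"
    + '{"extractions": [{"character": "Juliet"}]}\n' + FENCE
)
bare = '{"extractions": []}'

print(parse_llm_output(fenced))  # {'extractions': [{'character': 'Juliet'}]}
print(parse_llm_output(bare))    # {'extractions': []}
```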

**Step 7: Alignment** (`resolver.py`)
- `WordAligner.align()` maps each extraction back to the source text
- Tries exact token matching first → `MATCH_EXACT`
- Falls back to fuzzy matching if needed → `MATCH_FUZZY` (threshold 0.75)
- Sets `char_interval` (start/end character positions) on each extraction
- Sets `alignment_status` so you know how confident the match is
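
The exact-then-fuzzy strategy can be sketched with the standard library. This is a stand-in for illustration only: it matches on raw strings and scores windows with `difflib.SequenceMatcher`, which is an assumption and may differ from how `WordAligner` computes token overlap. Only the 0.75 threshold is taken from the text above.

```python
import difflib
import re


def align(source, extraction_text, fuzzy_threshold=0.75):
    """Return (status, start, end) for the best match of extraction_text in source."""
    if not extraction_text.strip():
        return (None, None, None)

    # 1. Exact match on the raw string.
    start = source.find(extraction_text)
    if start != -1:
        return ("MATCH_EXACT", start, start + len(extraction_text))

    # 2. Fuzzy match: slide a window with the same word count over the source.
    words = list(re.finditer(r"\S+", source))
    n = len(extraction_text.split())
    best = (0.0, None, None)
    for i in range(len(words) - n + 1):
        lo, hi = words[i].start(), words[i + n - 1].end()
        ratio = difflib.SequenceMatcher(
            None, source[lo:hi].lower(), extraction_text.lower()
        ).ratio()
        if ratio > best[0]:
            best = (ratio, lo, hi)
    if best[0] >= fuzzy_threshold:
        return ("MATCH_FUZZY", best[1], best[2])
    return (None, None, None)


source = "Lady Juliet gazed longingly at the stars"
print(align(source, "Juliet"))
print(align(source, "gazed longing at the stars"))
print(align(source, "zzz qqq"))
```

The first call matches exactly, the second differs by a few characters and falls through to the fuzzy branch, and the third finds nothing above the threshold and stays unaligned.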

### Error Handling

Key exception classes in `exceptions.py`:
- `LangExtractError` — base class for all errors
- `InvalidDatasetError` — empty or invalid input dataset
- `InferenceConfigError` — model creation or config failure
- `PromptAlignmentError` — example text doesn't contain extraction_text
- `PromptBuilderError` — prompt template construction failure
- `ParseError` — template parsing failure
- `TokenUtilError` — tokenization error

---

## Part 11: How to Contribute

Now that you understand the architecture, here are some areas where contributions are valuable:

1. **New Providers** — Add support for more LLMs (Anthropic, Cohere, etc.) by implementing `BaseLanguageModel`
2. **Schema Support** — Extend structured output schemas to more providers (currently best in Gemini)
3. **Resolver Improvements** — Better parsing and alignment strategies
4. **Visualization** — New visualization modes, better interactivity
5. **Chunking Strategies** — Smarter text splitting (e.g., sentence-aware, paragraph-aware)
6. **Testing** — More unit tests, edge case coverage
7. **Documentation** — Examples, tutorials, API docs

### Adding a New Provider (quick guide)

1. Create a new file in `providers/` (e.g., `anthropic.py`)
2. Implement a class that extends `BaseLanguageModel`
3. Implement at minimum: `infer(prompt: str) -> str`
4. Register it in the router with regex patterns for matching model IDs
5. Or use the plugin system via entry points for external packages

---

## Quick Reference

```python
import langextract as lx

# 1. Define what to extract
prompt = "Extract characters and emotions. Use exact text."

# 2. Provide examples
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[lx.data.Extraction(extraction_class="...", extraction_text="...", attributes={...})]
    )
]

# 3. Extract
result = lx.extract(text_or_documents="...", prompt_description=prompt, examples=examples, model_id="gemini-2.5-flash")

# 4. Save
lx.io.save_annotated_documents([result], output_name="output.jsonl", output_dir=".")

# 5. Visualize
lx.visualize("output.jsonl")
```