# [Outlines](https://github.com/dottxt-ai/outlines) for JSON-constrained LLM output

This notebook demonstrates how to use Outlines to constrain LLM output to JSON that matches a Pydantic schema.

In [None]:
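# Install Outlines if needed. The cells below call outlines.from_transformers
# and outlines.Template; if either raises AttributeError, upgrade the package.
# !pip install -U outlines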

In [16]:
import outlines
from pydantic import BaseModel, Field
from typing import List
from transformers import AutoTokenizer, AutoModelForCausalLM
import json

In [17]:
# Define Pydantic schema for person extraction
class Person(BaseModel):
    name: str = Field(description="Person's name")
    time: str = Field(description="Time period when mentioned, or 'not specified'")
    place: str = Field(description="Geographic location, or 'not specified'")
    role: str = Field(description="Person's role or occupation, or 'not specified'")

class PeopleExtraction(BaseModel):
    people: List[Person] = Field(description="List of people extracted from text")

In [7]:
# Load your model with Outlines
model_path = "/gpfs1/llm/llama-3.2-hf/Meta-Llama-3.2-3B-Instruct"

# Initialize the Outlines model
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda"),
    AutoTokenizer.from_pretrained(model_path)
)
print("Outlines model loaded successfully!")

AttributeError: module 'outlines' has no attribute 'from_transformers'
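The `AttributeError` above means the installed `outlines` package does not expose `from_transformers`, which usually points to a version mismatch. A minimal diagnostic sketch, using only the standard library, is shown below; `outlines_status` is a hypothetical helper name, and it only reports what is installed without changing anything.

```python
from importlib import metadata, util


def outlines_status() -> dict:
    """Report whether Outlines is importable and which entry points it exposes."""
    status = {
        "installed": False,
        "version": None,
        "from_transformers": False,
        "Template": False,
    }
    # find_spec returns None when the top-level package cannot be found.
    if util.find_spec("outlines") is None:
        return status
    status["installed"] = True
    try:
        status["version"] = metadata.version("outlines")
    except metadata.PackageNotFoundError:
        pass  # importable, but no distribution metadata available
    try:
        import outlines
    except ImportError:
        return status  # present on disk but failing to import
    # These are the two attributes this notebook depends on.
    status["from_transformers"] = hasattr(outlines, "from_transformers")
    status["Template"] = hasattr(outlines, "Template")
    return status


status = outlines_status()
print(status)
```

If `from_transformers` is reported as `False`, upgrading the package and restarting the kernel before rerunning the loading cell is the likely fix.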

In [8]:
# Sample text from your existing data
sample_text = """Whether Ong would have seen cyberspace as fundamentally oral or literary, he would surely have recognized it as transformative: not just a revitalization of older forms, not just an amplification, but something wholly new. He might have sensed a coming discontinuity akin to the emergence of literacy itself. Few understood better than Ong just how profound a discontinuity that had been.
When he began his studies, "oral literature" was a common phrase. It is an oxymoron laced with anachronism; the words imply an all-too-unconscious approach to the past by way of the present. Oral literature was generally treated as a variant of writing; this, Ong said, was "rather like thinking of horses as automobiles without wheels."
"Language in fact bears the same relationship to the concept of mind that legislation bears to the concept of parliament," says Jonathan Miller: "it is a competence forever bodying itself in a series of concrete performances." Much the same might be said of writing—it is concrete performance—but when the word is instantiated in paper or stone, it takes on a separate existence as artifice. It is a product of tools, and it is a tool. And like many technologies that followed, it thereby inspired immediate detractors.
One unlikely Luddite was also one of the first long-term beneficiaries. Plato (channeling the nonwriter Socrates) warned that this technology meant impoverishment."""

In [9]:
# Method 1: Direct structured generation with Pydantic
def extract_people_with_outlines(text: str) -> str:
    """Extract people information with guaranteed JSON format using Outlines"""
    
    prompt = f"""Extract all people mentioned in the following text. For each person, provide their name, time period, location, and role. Use 'not specified' for missing information.

Text: {text}

Extract people as JSON:"""
    
    # Outlines constrains decoding to the PeopleExtraction schema; output that
    # is cut off by max_new_tokens may still be incomplete JSON
    result = model(
        prompt,
        PeopleExtraction,
        max_new_tokens=400
    )
    
    return result

print("Function defined successfully!")

Function defined successfully!


In [10]:
# Test the extraction
print("Testing Outlines-based JSON extraction...")
result = extract_people_with_outlines(sample_text)

print("\nRaw JSON result:")
print(result)

print("\nParsed and formatted result:")
parsed_result = json.loads(result)
print(json.dumps(parsed_result, indent=2))

Testing Outlines-based JSON extraction...


NameError: name 'model' is not defined
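One caveat about the parsing step: a token cap such as `max_new_tokens=400` can stop generation before the JSON object is closed, in which case `json.loads` raises. The following model-free sketch shows a defensive parse. `parse_people` and both sample strings are hypothetical and only simulate that situation.

```python
import json
from typing import Optional


def parse_people(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if the text is not complete JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Accept only the expected top-level shape: {"people": [...]}.
    if not isinstance(data, dict) or not isinstance(data.get("people"), list):
        return None
    return data


complete = (
    '{"people": [{"name": "Ong", "time": "not specified", '
    '"place": "not specified", "role": "scholar"}]}'
)
truncated = complete[:40]  # simulates output cut off by a token limit

print("complete parses:", parse_people(complete) is not None)
print("truncated parses:", parse_people(truncated) is not None)
```

Returning `None` instead of raising lets the caller decide whether to retry with a larger token budget or skip the input.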

In [11]:
# Validate with Pydantic
print("Validating with Pydantic...")
try:
    validated_result = PeopleExtraction.model_validate_json(result)
    print("✅ JSON is valid according to schema!")
    
    print("\nExtracted people:")
    for i, person in enumerate(validated_result.people, 1):
        print(f"{i}. {person.name} - {person.role} ({person.time}, {person.place})")
        
except Exception as e:
    print(f"❌ Validation failed: {e}")

Validating with Pydantic...
❌ Validation failed: name 'result' is not defined


## Alternative Outlines approaches

In [12]:
# Method 2: Using simple types for quick classification
from typing import Literal

# Simple sentiment analysis
sentiment_result = model(
    "Analyze the sentiment: 'This book completely changed my understanding of information theory!'",
    Literal["Positive", "Negative", "Neutral"]
)
print(f"Sentiment: {sentiment_result}")

# Extract specific data types
year_result = model(
    "In what year was Plato born? Answer with just the number.",
    int
)
print(f"Year: {year_result}")

NameError: name 'model' is not defined

In [13]:
# Method 3: Using Template for reusable prompts
person_template = outlines.Template.from_string("""
Extract information about people from this {{ content_type }}:

{{ text }}

Return a JSON object with a list of people, where each person has:
- name: person's name
- time: time period or 'not specified'
- place: location or 'not specified'
- role: their role or 'not specified'
""")

# Generate prompt using template
templated_prompt = person_template(
    content_type="academic text",
    text=sample_text[:500] + "..."  # Shorter text for demo
)

print("Generated prompt:")
print(templated_prompt)

# Use with structured output
template_result = model(
    templated_prompt,
    PeopleExtraction,
    max_new_tokens=300
)

print("\nTemplate result:")
template_parsed = json.loads(template_result)
print(json.dumps(template_parsed, indent=2))

AttributeError: module 'outlines' has no attribute 'Template'

In [14]:
# Method 4: Batch processing multiple texts
texts_to_analyze = [
    "Aristotle taught Alexander the Great in ancient Macedonia.",
    "Marie Curie won Nobel Prizes in both Physics and Chemistry in early 20th century France.",
    "Shakespeare wrote Hamlet during the Elizabethan era in England."
]

print("Batch processing multiple texts...")

batch_results = []
for i, text in enumerate(texts_to_analyze, 1):
    prompt = f"Extract people from this text: {text}"
    result = model(prompt, PeopleExtraction, max_new_tokens=200)
    batch_results.append(result)

    print(f"\nText {i}: {text[:50]}...")
    parsed = json.loads(result)
    for person in parsed['people']:
        print(f"  - {person['name']} ({person['role']})")

Batch processing multiple texts...


NameError: name 'model' is not defined
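In a loop like the one above, a single failing input stops the whole batch. The sketch below isolates failures per item so the remaining texts are still processed. It is not part of Outlines: `run_batch` is a hypothetical helper, and `fake_generate` is a stand-in for the model call that deliberately returns broken JSON for one input.

```python
import json
from typing import Callable, List, Tuple


def run_batch(generate: Callable[[str], str], texts: List[str]) -> Tuple[list, list]:
    """Run `generate` on each text; return (parsed results, failures)."""
    results, failures = [], []
    for index, text in enumerate(texts):
        prompt = f"Extract people from this text: {text}"
        try:
            results.append((index, json.loads(generate(prompt))))
        except Exception as exc:  # record the failure and keep going
            failures.append((index, repr(exc)))
    return results, failures


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for the model call; fails on one input."""
    if "Curie" in prompt:
        return '{"people": ['  # deliberately incomplete JSON
    return '{"people": [{"name": "stub", "role": "not specified"}]}'


texts = [
    "Aristotle taught Alexander the Great in ancient Macedonia.",
    "Marie Curie won Nobel Prizes in both Physics and Chemistry.",
    "Shakespeare wrote Hamlet during the Elizabethan era in England.",
]
results, failures = run_batch(fake_generate, texts)
print(len(results), "parsed;", len(failures), "failed")
```

With a loaded model, `generate` could be `lambda p: model(p, PeopleExtraction, max_new_tokens=200)`, matching the call used in the cell above.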

## Timing structured generation

In [15]:
import time

# Time the structured generation (perf_counter is a monotonic clock intended for measuring intervals)
start_time = time.perf_counter()
timed_result = extract_people_with_outlines(sample_text[:300])  # Shorter for speed
end_time = time.perf_counter()

print(f"Outlines structured generation took: {end_time - start_time:.2f} seconds")
print(f"Generated {len(timed_result)} characters of valid JSON")
print(f"Number of people extracted: {len(json.loads(timed_result)['people'])}")

# Check that the output parses as JSON
try:
    json.loads(timed_result)
    print("✅ Output is valid JSON!")
except json.JSONDecodeError:
    print("❌ JSON parsing failed")

NameError: name 'model' is not defined
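A single timed call is noisy, especially on the first run, when model warm-up can dominate. A minimal sketch of a repeat-and-summarise timer follows. `time_call` is a hypothetical helper, and `stand_in` replaces the real extraction call with a short sleep so the example runs without a model.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> dict:
    """Call `fn` `repeats` times and summarise wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def stand_in() -> str:
    """Hypothetical placeholder for extract_people_with_outlines(...)."""
    time.sleep(0.01)
    return '{"people": []}'


summary = time_call(stand_in, repeats=3)
print(f"median {summary['median']:.4f}s over {summary['runs']} runs")
```

With a loaded model, the callable could be `lambda: extract_people_with_outlines(sample_text[:300])`. The median is less sensitive to a slow first call than a single measurement.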