# Canonicalization

In real-world AI agent systems, conversation history and context often accumulate in messy, unstructured formats. Users express the same concepts in different ways, messages contain inconsistent formatting, entities appear with multiple variations, and temporal references use diverse phrasings. This variability makes it difficult for agents to understand and reference past interactions accurately. When an agent needs to retrieve information about "order #12345" but the conversation history contains references to "order 12345", "Order #12345", and "my order", the inconsistency hinders effective context utilization.

Canonicalization addresses this challenge by transforming unstructured or messy conversation history into normalized, schema-aligned formats that follow consistent patterns. Rather than leaving context in its raw form with all its variations and inconsistencies, we convert it into a standardized representation where entities follow uniform formats, temporal references use consistent structures, and key information is extracted into well-defined fields. This normalization makes context significantly easier for agents to parse, understand, and reference accurately.

This notebook demonstrates how to implement canonicalization from basic normalization techniques to production-ready systems. We will explore entity standardization, temporal canonicalization, schema-based conversation normalization, and complete canonicalization pipelines that transform raw conversation history into clean, structured formats optimized for agent comprehension.

In [1]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any
from datetime import datetime, timedelta
import re
import os

### Initialize the language model for canonicalization tasks

In [2]:
# Using gpt-4o-mini for cost-effective text transformation
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY", "").strip(), temperature=0)

## Basic entity normalization
The simplest form of canonicalization normalizes entities to consistent formats. In conversations, entities like order numbers, phone numbers, email addresses, and product codes often appear in multiple variations. A user might write "order #12345", "Order 12345", or "order: 12345", all referring to the same entity. These variations create noise in the context and make it harder for the agent to recognize when the same entity is being discussed.

We will implement basic entity normalization using regular expressions and string manipulation to convert entities into standard formats. This rule-based approach handles common patterns deterministically: every variation that a pattern covers is converted to a single canonical form, and anything the pattern does not cover is left unchanged. While simple, this technique provides immediate benefits for agent comprehension by reducing variability in how key identifiers appear in context.

In [3]:
def normalize_order_numbers(text: str) -> str:
    """
    Normalize order number references to a consistent format.
    Converts variations like "order #12345", "Order 12345", "order: 12345" to the canonical form "ORDER-12345".

    Args:
        text: Input text containing order number references

    Returns:
        Text with normalized order numbers
    """
    # Define regex pattern to match various order number formats
    # \s* matches zero or more whitespace characters
    # (?:number|#|:)? is a non-capturing group that optionally matches "number", "#", or ":"
    # (\d{4,6}) captures the order number digits (4 to 6 digits)
    pattern = r'order\s*(?:number|#|:)?\s*(\d{4,6})'

    # Use re.sub to replace matched patterns with canonical format
    # r'ORDER-\1' references the first captured group (\d{4,6}) from the pattern
    # flags=re.IGNORECASE ensures both "order" and "Order" are matched
    normalized = re.sub(pattern, r'ORDER-\1', text, flags=re.IGNORECASE)

    return normalized

# Test order number normalization with various formats
test_messages = [
    "I need help with my order #12345",
    "Check Order 67890 status please",
    "What about order: 11111?",
    "Order number 99999 is missing"
]

print("Order Number Normalization:")
print("="*80)
for msg in test_messages:
    normalized = normalize_order_numbers(msg)
    print(f"  '{msg}'")
    print(f"  → '{normalized}'")
    print()

Order Number Normalization:
  'I need help with my order #12345'
  → 'I need help with my ORDER-12345'

  'Check Order 67890 status please'
  → 'Check ORDER-67890 status please'

  'What about order: 11111?'
  → 'What about ORDER-11111?'

  'Order number 99999 is missing'
  → 'ORDER-99999 is missing'



The order number normalization function uses a regular expression pattern with a capturing group to identify order references in multiple formats.
- The pattern `order\s*(?:number|#|:)?\s*(\d{4,6})` matches the word "order" (case-insensitive due to the IGNORECASE flag), followed by optional whitespace, an optional separator ("number", "#", or ":"), more optional whitespace, and finally captures 4-6 digits representing the order ID.
- The non-capturing group `(?:...)` is used for the separator to match it without creating a capture group, keeping only the numeric portion as the first captured group.
- The replacement string `ORDER-\1` uses backreference `\1` to insert the captured digits, producing a uniform "ORDER-XXXXX" format. This deterministic transformation ensures that "order #12345", "Order 12345", and "order: 12345" all convert to "ORDER-12345", eliminating format variations that would otherwise make it difficult for agents to recognize identical order references.
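
The pattern described above has no boundary checks, so it can also match inside a word such as "reorder" or take the first six digits of a longer number. A minimal sketch of a stricter variant follows. The name `normalize_order_numbers_strict` and the added `\b` and `(?!\d)` guards are assumptions for illustration, not part of the notebook's function.

```python
import re

# Hypothetical stricter variant of the order pattern (an assumption, not the notebook's code).
# \b before "order" avoids matching inside words such as "reorder";
# (?!\d) after the digits rejects numbers longer than six digits.
STRICT_ORDER_PATTERN = re.compile(
    r'\border\s*(?:number|#|:)?\s*(\d{4,6})(?!\d)',
    re.IGNORECASE,
)

def normalize_order_numbers_strict(text: str) -> str:
    """Rewrite order references as ORDER-<digits>, skipping partial matches."""
    return STRICT_ORDER_PATTERN.sub(r'ORDER-\1', text)

print(normalize_order_numbers_strict("Check order #12345"))            # Check ORDER-12345
print(normalize_order_numbers_strict("Please reorder 12345 units"))    # unchanged
print(normalize_order_numbers_strict("order 1234567 looks too long"))  # unchanged
```

Whether a seven-digit number should be rejected or accepted depends on the real order ID scheme, so the digit range is worth confirming against actual data before tightening the rule.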

Phone numbers present another common normalization challenge. Users enter phone numbers in diverse formats depending on regional conventions, personal preferences, or input method constraints. The same phone number might appear as "555-123-4567", "(555) 123-4567", "555.123.4567" or even "5551234567". This formatting inconsistency creates the same recognition problem as order numbers - the agent cannot easily determine when two differently-formatted phone numbers refer to the same contact. By establishing a single canonical format for all phone numbers in the conversation history, we eliminate this ambiguity and enable reliable phone number matching and retrieval.

In [4]:
def normalize_phone_numbers(text: str) -> str:
    """
    Normalize phone number references to a consistent format.
    Converts various formats to (XXX) XXX-XXXX.

    Args:
        text: Input text containing phone numbers

    Returns:
        Text with normalized phone numbers
    """
    # Define regex pattern to match US phone numbers in various formats
    # \(? matches an optional opening parenthesis
    # \b ensures we're at a word boundary (start of phone number)
    # (\d{3}) captures the three-digit area code as group 1
    # \)? matches an optional closing parenthesis  
    # [\s.-]? matches an optional separator (space, dot, or dash)
    # (\d{3}) captures the three-digit prefix as group 2
    # [\s.-]? matches another optional separator
    # (\d{4}) captures the four-digit line number as group 3
    # \b ensures we're at a word boundary (end of phone number)
    pattern = r'\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b'

    # Replace with canonical format using captured groups
    # \1, \2, \3 reference the three captured groups (area code, prefix, line number)
    # Final format: (XXX) XXX-XXXX
    normalized = re.sub(pattern, r'(\1) \2-\3', text)

    return normalized

# Test phone number normalization with various formats
test_phone_formats = [
    "Call me at 555-123-4567",
    "My number is (555) 987-6543",
    "Reach out to 555.111.2222",
    "Contact: 5554445555"
]

print("Phone Number Normalization:")
print("="*80)
for msg in test_phone_formats:
    normalized = normalize_phone_numbers(msg)
    print(f"  '{msg}'")
    print(f"  → '{normalized}'")
    print()

Phone Number Normalization:
  'Call me at 555-123-4567'
  → 'Call me at (555) 123-4567'

  'My number is (555) 987-6543'
  → 'My number is (555) 987-6543'

  'Reach out to 555.111.2222'
  → 'Reach out to (555) 111-2222'

  'Contact: 5554445555'
  → 'Contact: (555) 444-5555'



The phone number normalization function uses a more complex regex pattern with three separate capturing groups to handle the structure of US phone numbers.
- The pattern `\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b` begins with an optional opening parenthesis `\(?`, followed by a word boundary `\b` to ensure we match complete phone numbers.
    - The first capturing group `(\d{3})` matches the three-digit area code, followed by an optional closing parenthesis `\)?` to accommodate formats like "(555)" or just "555".
    - The character class `[\s.-]?` matches an optional separator which can be a space, dot, or dash.
    - The second capturing group `(\d{3})` captures the three-digit prefix (also called the exchange code), followed by another optional separator.
    - Finally, the third capturing group `(\d{4})` matches the four-digit line number, and `\b` ensures we're at a word boundary.
- The replacement string `(\1) \2-\3` uses backreferences to reconstruct the number in the canonical format: area code in parentheses, space, prefix, dash, and line number. This transformation rewrites "555-123-4567", "(555) 987-6543", "555.111.2222", and "5554445555" in the same "(XXX) XXX-XXXX" shape, so the separator and parenthesis variations tested here no longer differ in the stored text.
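
The canonical display format is useful for stored text, but comparing two numbers is often simpler with a digits-only key. The pattern above also leaves a country-code prefix such as "+1" in place. The sketch below is one way to handle both, assuming 10-digit US numbers; `phone_key` and `same_phone` are hypothetical helper names, not functions from the notebook.

```python
import re

def phone_key(raw: str) -> str:
    """Return a digits-only key for a US number, dropping an optional leading country code 1."""
    digits = re.sub(r'\D', '', raw)  # strip everything that is not a digit
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]          # drop the US country code
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit US number: {raw!r}")
    return digits

def same_phone(a: str, b: str) -> bool:
    """Compare two phone strings by their digits-only keys."""
    return phone_key(a) == phone_key(b)

print(phone_key("+1 555.123.4567"))                  # 5551234567
print(same_phone("555-123-4567", "(555) 123-4567"))  # True
```

A key like this is meant for matching and lookup; the "(XXX) XXX-XXXX" form can still be what appears in the canonical message text.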

Email addresses, while typically following a standard format, can still introduce inconsistencies through case variations. The same email address might appear as "john.doe@example.com", "John.Doe@EXAMPLE.COM", or "JOHN.DOE@example.com". According to email standards, the domain portion of email addresses is case-insensitive, and most email providers treat the local part (before the @) as case-insensitive as well. However, when storing email addresses in conversation history, these case variations make it difficult to determine if two references point to the same email. Normalizing all email addresses to lowercase ensures consistent representation and enables reliable matching.

In [5]:
def normalize_email_addresses(text: str) -> str:
    """
    Normalize email addresses to lowercase.
    Ensures consistent casing for email references.

    Args:
        text: Input text containing email addresses

    Returns:
        Text with normalized email addresses
    """
    # Define regex pattern to match email addresses
    # [A-Za-z0-9._%+-]+ matches the local part (before @) - letters, digits, and special chars
    # @ matches the literal @ symbol
    # [A-Za-z0-9.-]+ matches the domain name - letters, digits, dots, and hyphens
    # \. matches the literal dot before the TLD
    # [A-Za-z]{2,} matches the top-level domain (2+ letters)
    # \b ensures word boundaries on both sides
    pattern = r'\b([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})\b'

    # Define a helper function to convert matched email to lowercase
    # This is applied to each match found by re.sub
    def lowercase_email(match):
        return match.group(0).lower()

    # Apply lowercase transformation to all matched email addresses
    normalized = re.sub(pattern, lowercase_email, text)

    return normalized

# Test email normalization with various casing
test_email_formats = [
    "Send confirmation to John.Doe@EXAMPLE.COM",
    "Contact SUPPORT@Company.NET for help",
    "My email is User.Name@Domain.ORG",
    "Reach out at MixedCase@test.Com"
]

print("Email Address Normalization:")
print("="*80)
for msg in test_email_formats:
    normalized = normalize_email_addresses(msg)
    print(f"  '{msg}'")
    print(f"  → '{normalized}'")
    print()

Email Address Normalization:
  'Send confirmation to John.Doe@EXAMPLE.COM'
  → 'Send confirmation to john.doe@example.com'

  'Contact SUPPORT@Company.NET for help'
  → 'Contact support@company.net for help'

  'My email is User.Name@Domain.ORG'
  → 'My email is user.name@domain.org'

  'Reach out at MixedCase@test.Com'
  → 'Reach out at mixedcase@test.com'



The email normalization function uses a regex pattern that matches the standard email address structure with a callback function for transformation.
- The pattern has three main parts: the local part `[A-Za-z0-9._%+-]+` before the @ symbol, which matches letters, digits, and common email special characters (dots, underscores, percent signs, plus signs, and hyphens); the @ symbol itself; and the domain part, where `[A-Za-z0-9.-]+` matches the domain name and is followed by a literal dot and a top-level domain of at least two letters.
    - Word boundaries `\b` ensure we match complete email addresses.
- Unlike the previous normalizations that used direct string replacement, this implementation uses a callback function `lowercase_email` that receives each match object and returns the lowercased version.
    - The `re.sub` function applies this callback to every matched email address in the text. This approach converts "John.Doe@EXAMPLE.COM", "SUPPORT@Company.NET", and "MixedCase@test.Com" all to their lowercase equivalents, ensuring that "john.doe@example.com" is recognized as the same address regardless of how the user originally typed it.
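
Lowercasing the whole address is a pragmatic choice. Strictly speaking, only the domain is guaranteed to be case-insensitive, and a mail server may treat the local part as case-sensitive. If that distinction matters, a more conservative variant can lowercase only the domain. The sketch below assumes that policy; `normalize_email_domains` is a hypothetical name and the two-group pattern is an adaptation for illustration.

```python
import re

# Group 1 captures the local part, group 2 the domain including the TLD
EMAIL_PATTERN = re.compile(
    r'\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b'
)

def normalize_email_domains(text: str) -> str:
    """Lowercase only the domain of each email address, preserving the local part."""
    def lower_domain(match: re.Match) -> str:
        local, domain = match.group(1), match.group(2)
        return f"{local}@{domain.lower()}"
    return EMAIL_PATTERN.sub(lower_domain, text)

print(normalize_email_domains("Send it to John.Doe@EXAMPLE.COM"))
# Send it to John.Doe@example.com
```

The trade-off is that "John.Doe@example.com" and "john.doe@example.com" remain distinct strings, so this variant suits systems that must not merge addresses on an assumption about the provider.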
 
Now that we have individual normalization functions for different entity types, we can combine them into a single composite function that applies all normalizations in sequence. This composite approach ensures that every message processed through the canonicalization pipeline receives consistent entity formatting across all entity types. The function chains the individual normalizers, passing the output of one as input to the next, producing text where all entities follow their canonical formats simultaneously.

In [6]:
def normalize_entities(text: str) -> str:
    """
    Apply all entity normalization rules to text.
    Chains order number, phone number, and email normalization.

    Args:
        text: Input text with various entity formats

    Returns:
        Text with all entities normalized
    """
    # Apply each normalization function in sequence
    # The output of each function becomes the input to the next
    text = normalize_order_numbers(text)
    text = normalize_phone_numbers(text)
    text = normalize_email_addresses(text)

    return text

# Test composite normalization with a message containing multiple entity types
test_composite = [
    "Hi, my order #12345 hasn't arrived. Call me at 555-123-4567 or email John.Doe@EXAMPLE.COM",
    "Order 67890 needs to be sent to (555) 987-6543, confirmed via SUPPORT@company.net",
    "Please update order: 11111 and contact me at 555.111.2222 or user@DOMAIN.ORG"
]

print("Composite Entity Normalization:")
print("="*80)
for msg in test_composite:
    normalized = normalize_entities(msg)
    print(f"Original:")
    print(f"  {msg}")
    print(f"Normalized:")
    print(f"  {normalized}")
    print()

Composite Entity Normalization:
Original:
  Hi, my order #12345 hasn't arrived. Call me at 555-123-4567 or email John.Doe@EXAMPLE.COM
Normalized:
  Hi, my ORDER-12345 hasn't arrived. Call me at (555) 123-4567 or email john.doe@example.com

Original:
  Order 67890 needs to be sent to (555) 987-6543, confirmed via SUPPORT@company.net
Normalized:
  ORDER-67890 needs to be sent to (555) 987-6543, confirmed via support@company.net

Original:
  Please update order: 11111 and contact me at 555.111.2222 or user@DOMAIN.ORG
Normalized:
  Please update ORDER-11111 and contact me at (555) 111-2222 or user@domain.org



The composite `normalize_entities` function creates a processing pipeline by sequentially applying each specialized normalization function.
- The text flows through the pipeline: first `normalize_order_numbers` transforms any order references to "ORDER-XXXXX" format, then `normalize_phone_numbers` standardizes phone numbers to "(XXX) XXX-XXXX" format, and finally `normalize_email_addresses` converts all email addresses to lowercase.
- Each function receives the output from the previous function, so the transformations are applied cumulatively.
- The example demonstrates this multi-entity normalization: a message containing "order #12345", "555-123-4567", and "John.Doe@EXAMPLE.COM" comes back with "ORDER-12345", "(555) 123-4567", and "john.doe@example.com" from a single call, which makes three regex passes over the text.
- This composable design makes it easy to add additional entity types (like product codes, tracking numbers, or account IDs) by simply adding new normalization functions to the chain. The sequential application ensures consistency and allows downstream components to rely on predictable entity formatting.
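
One way to make the chain easier to extend is to hold the normalizers in a list and build the composite function from it. The sketch below is self-contained, so it defines its own small normalizers. The tracking-number format (two letters followed by six digits) and the names `normalize_tracking_numbers`, `lowercase_emails`, and `build_pipeline` are assumptions for illustration.

```python
import re
from typing import Callable, List

Normalizer = Callable[[str], str]

def normalize_tracking_numbers(text: str) -> str:
    """Hypothetical rule: 'tracking #ab123456' becomes 'TRACK-AB123456'."""
    pattern = r'\btracking\s*(?:number|#|:)?\s*([A-Z]{2}\d{6})\b'
    return re.sub(
        pattern,
        lambda m: f"TRACK-{m.group(1).upper()}",
        text,
        flags=re.IGNORECASE,
    )

def lowercase_emails(text: str) -> str:
    """Lowercase every email address found in the text."""
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.sub(pattern, lambda m: m.group(0).lower(), text)

def build_pipeline(normalizers: List[Normalizer]) -> Normalizer:
    """Return a function that applies each normalizer in list order."""
    def run(text: str) -> str:
        for normalize in normalizers:
            text = normalize(text)
        return text
    return run

pipeline = build_pipeline([normalize_tracking_numbers, lowercase_emails])
print(pipeline("Tracking #ab123456 was emailed to Jane@EXAMPLE.com"))
# TRACK-AB123456 was emailed to jane@example.com
```

In the notebook, the same builder could be given `normalize_order_numbers`, `normalize_phone_numbers`, and `normalize_email_addresses`. Order still matters whenever one rule's output could be matched by a later rule.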

## Temporal reference canonicalization
Conversations frequently contain temporal references that use relative terms like "yesterday", "last week", "two days ago", or "next Monday". While humans easily understand these relative references, they become ambiguous in stored conversation history. A message saying "I ordered it yesterday" means something different depending on when it was said. When the agent retrieves this message days or weeks later, "yesterday" no longer refers to the correct date, causing temporal confusion.

Temporal canonicalization converts these relative time references into absolute dates or timestamps. By anchoring temporal expressions to specific dates at the time they are created, we ensure that the meaning remains consistent regardless of when the conversation is retrieved. This is particularly important for long-running agents that maintain conversation history over extended periods, where relative references would otherwise become meaningless or misleading.

In [7]:
def canonicalize_temporal_references(
    text: str, 
    reference_date: Optional[datetime] = None
) -> str:
    """
    Convert relative temporal references to absolute dates.
    Replaces terms like "yesterday", "last week" with specific dates.
    
    Args:
        text: Input text with relative temporal references
        reference_date: The "current" date for resolving relative references
                       Defaults to now if not provided
        
    Returns:
        Text with absolute date references
    """
    # Use current time as reference if not provided
    if reference_date is None:
        reference_date = datetime.now()
    
    # Collect (pattern, replacement, flags) tuples for each relative term
    # found in the text; the absolute date is computed from reference_date
    
    replacements = []
    
    # "yesterday" -> specific date
    if "yesterday" in text.lower():
        yesterday = reference_date - timedelta(days=1)
        date_str = yesterday.strftime("%B %d, %Y")
        replacements.append((r'\byesterday\b', f"on {date_str}", re.IGNORECASE))
    
    # "today" -> specific date
    if "today" in text.lower():
        today_str = reference_date.strftime("%B %d, %Y")
        replacements.append((r'\btoday\b', f"on {today_str}", re.IGNORECASE))
    
    # "tomorrow" -> specific date
    if "tomorrow" in text.lower():
        tomorrow = reference_date + timedelta(days=1)
        date_str = tomorrow.strftime("%B %d, %Y")
        replacements.append((r'\btomorrow\b', f"on {date_str}", re.IGNORECASE))
    
    # "last week" -> approximate date (7 days before the reference date)
    if "last week" in text.lower():
        last_week = reference_date - timedelta(weeks=1)
        date_str = last_week.strftime("%B %d, %Y")
        replacements.append((r'\blast week\b', f"around {date_str}", re.IGNORECASE))
    
    # "next week" -> approximate date (7 days after the reference date)
    if "next week" in text.lower():
        next_week = reference_date + timedelta(weeks=1)
        date_str = next_week.strftime("%B %d, %Y")
        replacements.append((r'\bnext week\b', f"around {date_str}", re.IGNORECASE))
    
    # "X days ago" -> specific date
    # Pattern matches "2 days ago", "5 days ago", etc.
    # \b stops the match from starting inside a longer number or word
    days_ago_pattern = r'\b(\d+)\s+days?\s+ago\b'

    def replace_days_ago(match):
        # Subtract the captured number of days from the reference date
        past_date = reference_date - timedelta(days=int(match.group(1)))
        return f"on {past_date.strftime('%B %d, %Y')}"

    # Apply the fixed-phrase replacements first, then resolve each
    # "X days ago" match in place with the callback
    result = text
    for pattern, replacement, flags in replacements:
        result = re.sub(pattern, replacement, result, flags=flags)
    result = re.sub(days_ago_pattern, replace_days_ago, result, flags=re.IGNORECASE)

    return result

# Example: Canonicalize temporal references in conversation
# Simulate a conversation from a specific date
conversation_date = datetime(2024, 3, 15, 10, 30)

temporal_messages = [
    "I placed my order yesterday and expected it today",
    "The issue started last week when I tried to update my profile",
    "Can you schedule the delivery for tomorrow?",
    "I contacted support 3 days ago but haven't heard back",
    "The payment was processed 7 days ago",
    "Let's schedule a follow-up for next week"
]

print(f"Conversation Date: {conversation_date.strftime('%B %d, %Y')}")
print("\nOriginal Messages (with relative temporal references):")
print("="*80)
for msg in temporal_messages:
    print(f"  {msg}")

print("\nCanonical Messages (with absolute dates):")
print("="*80)
for msg in temporal_messages:
    canonical = canonicalize_temporal_references(msg, conversation_date)
    print(f"  {canonical}")

# Demonstrate why this matters: retrieve the message later
print("\n" + "="*80)
print("\nWhy Canonicalization Matters:")
print(f"If we retrieve 'I placed my order yesterday' on {(conversation_date + timedelta(days=10)).strftime('%B %d, %Y')}:")
print(f"  - Without canonicalization: 'yesterday' is ambiguous/wrong")
print(f"  - With canonicalization: 'on {(conversation_date - timedelta(days=1)).strftime('%B %d, %Y')}' is precise")

Conversation Date: March 15, 2024

Original Messages (with relative temporal references):
  I placed my order yesterday and expected it today
  The issue started last week when I tried to update my profile
  Can you schedule the delivery for tomorrow?
  I contacted support 3 days ago but haven't heard back
  The payment was processed 7 days ago
  Let's schedule a follow-up for next week

Canonical Messages (with absolute dates):
  I placed my order on March 14, 2024 and expected it on March 15, 2024
  The issue started around March 08, 2024 when I tried to update my profile
  Can you schedule the delivery for on March 16, 2024?
  I contacted support on March 12, 2024 but haven't heard back
  The payment was processed on March 08, 2024
  Let's schedule a follow-up for around March 22, 2024


Why Canonicalization Matters:
If we retrieve 'I placed my order yesterday' on March 25, 2024:
  - Without canonicalization: 'yesterday' is ambiguous/wrong
  - With canonicalization: 'on March 14, 2024' is precise

Temporal canonicalization uses a reference datetime to resolve relative temporal expressions into absolute dates.
- The function maintains a list of replacement patterns, detecting common relative terms like "yesterday", "today", "tomorrow", "last week" and "next week" in the text. For each detected pattern, it calculates the corresponding absolute date using timedelta operations on the reference date.
- The "X days ago" pattern uses regex with a capturing group to extract the numeric value, then calculates the past date by subtracting that many days from the reference.
- All matched patterns are replaced with formatted absolute dates using strftime.
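- The example demonstrates how a conversation from March 15, 2024 gets canonicalized: "yesterday" becomes "on March 14, 2024", "3 days ago" becomes "on March 12, 2024", and "next week" becomes "around March 22, 2024". Because every replacement carries its own "on" or "around", a phrase that already has a preposition comes out awkwardly, as in "for on March 16, 2024". Note also that "last week" and "next week" are approximated by a single date seven days from the reference rather than by a date range.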
- This transformation ensures that when the conversation is retrieved weeks or months later, the temporal references maintain their original meaning. Without canonicalization, "yesterday" would be meaningless or misleading; with canonicalization, the specific date preserves the exact temporal context.
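
The doubled preposition in outputs such as "for on March 16, 2024" can be avoided by capturing an optional leading preposition and reusing it. The sketch below covers only "yesterday", "today", and "tomorrow". The preposition list and the name `canonicalize_day_words` are assumptions for illustration, not part of the notebook's function.

```python
import re
from datetime import datetime, timedelta

# Day offsets relative to the reference date
DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

# Optional leading preposition (group 1), then the relative day word (group 2)
DAY_WORD_PATTERN = re.compile(
    r'\b(?:(for|from|on|by|until)\s+)?(yesterday|today|tomorrow)\b',
    re.IGNORECASE,
)

def canonicalize_day_words(text: str, reference_date: datetime) -> str:
    """Replace relative day words with absolute dates, keeping an existing preposition."""
    def replace(match: re.Match) -> str:
        preposition = match.group(1)
        offset = DAY_OFFSETS[match.group(2).lower()]
        date_str = (reference_date + timedelta(days=offset)).strftime("%B %d, %Y")
        # Reuse the writer's preposition if there is one, otherwise add "on"
        return f"{preposition} {date_str}" if preposition else f"on {date_str}"
    return DAY_WORD_PATTERN.sub(replace, text)

reference = datetime(2024, 3, 15, 10, 30)
print(canonicalize_day_words("Can you schedule the delivery for tomorrow?", reference))
# Can you schedule the delivery for March 16, 2024?
print(canonicalize_day_words("I placed my order yesterday", reference))
# I placed my order on March 14, 2024
```

Using a single `re.sub` with a callback also means each match is resolved where it occurs, so no separate detection step is needed before replacing.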

## Schema-based conversation canonicalization
Beyond normalizing individual entities and temporal references, production systems benefit from transforming entire conversation turns into structured, schema-aligned formats. Raw conversation messages contain unstructured natural language with information scattered throughout. A single message might reference multiple entities, express several intents, and contain both questions and statements. This unstructured format makes it difficult for agents to quickly locate specific information or understand the conversation structure.

Schema-based canonicalization uses language models with structured output capabilities to extract information from raw messages and organize it into predefined schemas. Rather than keeping messages as free-form text, we convert them into structured objects with explicit fields for entities, intents, temporal context and other relevant information. This structured representation makes conversation history much easier to parse, search, and reference programmatically, while also ensuring consistency in how information is represented across different message types.

In [15]:
class CanonicalMessage(BaseModel):
    """
    Structured schema for canonicalized conversation messages.
    Transforms free-form conversation into consistent, parseable format.
    """
    # The original message timestamp for temporal ordering
    timestamp: str = Field(
        description="ISO-format timestamp when message was created"
    )
    
    # Who sent the message (user/assistant/system)
    speaker: str = Field(
        description="Message sender: 'user', 'assistant', or 'system'"
    )
    
    # Normalized core content
    canonical_content: str = Field(
        description="Message content with entities and temporal refs normalized"
    )
    
    # Extracted and normalized entities
    entities: Dict[str, List[str]] = Field(
        description="Extracted entities organized by type (orders, emails, phones, etc.)"
    )
    
    # Primary intent or purpose of the message
    intent: str = Field(
        description="Primary intent: question, request, statement, response, etc."
    )
    
    # Topics or themes discussed
    topics: List[str] = Field(
        description="Main topics or subjects discussed in the message"
    )

def canonicalize_message(
    message: Any,
    llm: ChatOpenAI,
    message_timestamp: Optional[datetime] = None
) -> CanonicalMessage:
    """
    Transform a raw conversation message into canonical schema format.
    Uses LLM to extract structure and normalize content.
    
    Args:
        message: Raw conversation message (HumanMessage, AIMessage, etc.)
        llm: Language model for structured extraction
        message_timestamp: When the message was created (for temporal canonicalization)
        
    Returns:
        CanonicalMessage with normalized, structured content
    """
    # Default to current time if no timestamp provided
    if message_timestamp is None:
        message_timestamp = datetime.now()
    
    # Determine speaker type from message class
    if isinstance(message, HumanMessage):
        speaker = "user"
    elif isinstance(message, AIMessage):
        speaker = "assistant"
    elif isinstance(message, SystemMessage):
        speaker = "system"
    else:
        speaker = "unknown"
    
    # Apply basic normalization first
    normalized_text = normalize_entities(message.content)
    normalized_text = canonicalize_temporal_references(
        normalized_text, 
        message_timestamp
    )
    
    # Create prompt for structured extraction
    extraction_prompt = f"""Analyze the following message and extract structured information.

Message (already normalized): {normalized_text}
Speaker: {speaker}
Timestamp: {message_timestamp.isoformat()}

Extract:
- canonical_content: The normalized message content
- entities: Dictionary of entities by type (orders, emails, phones, products, etc.)
  Example: {{"orders": ["ORDER-12345"], "emails": ["user@example.com"]}}
- intent: Primary intent (question, request, complaint, acknowledgment, etc.)
- topics: Main topics discussed (delivery, payment, support, etc.)
"""
    
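    # Use structured output to ensure schema compliance.
    # method="function_calling" is chosen here because the open-ended Dict field
    # in CanonicalMessage may not satisfy stricter JSON-schema output modes.
    llm_with_structure = llm.with_structured_output(
        CanonicalMessage,
        method="function_calling"
    )

    # Generate canonical message structure
    canonical = llm_with_structure.invoke([HumanMessage(content=extraction_prompt)])

    # Timestamp and speaker are already known, so set them directly
    # rather than relying on the model to copy them from the prompt
    canonical.timestamp = message_timestamp.isoformat()
    canonical.speaker = speaker

    return canonical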

# Example: Canonicalize a sequence of conversation messages
raw_messages = [
    (HumanMessage(content="I need help with order #12345 from yesterday"), 
     datetime(2024, 3, 15, 9, 30)),
    
    (AIMessage(content="I'd be happy to help with order 12345. Let me look that up."),
     datetime(2024, 3, 15, 9, 31)),
    
    (HumanMessage(content="It was supposed to arrive today but I haven't received it. Please contact me at 555-123-4567"),
     datetime(2024, 3, 15, 9, 32)),
    
    (AIMessage(content="I see that ORDER-12345 is delayed. I'll expedite it and call you at (555) 123-4567 tomorrow with an update."),
     datetime(2024, 3, 15, 9, 35)),
]

print("Canonicalizing Conversation Messages")
print("="*80)

canonical_conversation = []

for message, timestamp in raw_messages:
    canonical = canonicalize_message(message, llm, timestamp)
    canonical_conversation.append(canonical)
    
    print(f"\nOriginal ({canonical.speaker} at {timestamp.strftime('%H:%M')}):")
    print(f"  {message.content}")
    
    print(f"\nCanonical Structure:")
    print(f"  Timestamp: {canonical.timestamp}")
    print(f"  Speaker: {canonical.speaker}")
    print(f"  Content: {canonical.canonical_content}")
    print(f"  Intent: {canonical.intent}")
    print(f"  Entities: {canonical.entities}")
    print(f"  Topics: {canonical.topics}")
    print("  " + "-"*76)

Canonicalizing Conversation Messages

Original (user at 09:30):
  I need help with order #12345 from yesterday

Canonical Structure:
  Timestamp: 2024-03-15T09:30:00
  Speaker: user
  Content: I need help with ORDER-12345 from on March 14, 2024
  Intent: request
  Entities: {'orders': ['ORDER-12345']}
  Topics: ['support']
  ----------------------------------------------------------------------------

Original (assistant at 09:31):
  I'd be happy to help with order 12345. Let me look that up.

Canonical Structure:
  Timestamp: 2024-03-15T09:31:00
  Speaker: assistant
  Content: I'd be happy to help with ORDER-12345. Let me look that up.
  Intent: request
  Entities: {'orders': ['ORDER-12345']}
  Topics: ['support']
  ----------------------------------------------------------------------------

Original (user at 09:32):
  It was supposed to arrive today but I haven't received it. Please contact me at 555-123-4567

Canonical Structure:
  Timestamp: 2024-03-15T09:32:00
  Speaker: user
  C

Schema-based canonicalization combines rule-based normalization with LLM-powered structured extraction.
- The `CanonicalMessage` Pydantic model defines six fields that capture different aspects of a message: timestamp for temporal ordering, speaker identification, normalized content, extracted entities organized by type, primary intent and relevant topics.
- The `canonicalize_message` function first applies entity and temporal normalization to the raw message content, then constructs a prompt instructing the LLM to extract structured information from the normalized text.
- Using LangChain's `with_structured_output` method enforces that the LLM's response conforms to the `CanonicalMessage` schema. The resulting structure provides a machine-readable representation of the conversation turn.
- In the example, the raw message "I need help with order #12345 from yesterday" is transformed into a canonical structure with normalized content ("I need help with ORDER-12345 from on March 14, 2024"), extracted entities ({"orders": ["ORDER-12345"]}), identified intent ("request"), and relevant topics (["support"]). With this structure, questions like "Which orders were discussed?", "What was the user's intent?", or "When was this message sent?" can be answered by reading fields rather than re-parsing free text, which improves the agent's ability to reference and use conversation history.
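
To illustrate what answering such questions from fields looks like, the sketch below queries a small canonical history without calling an LLM. It uses a standard-library dataclass, `CanonicalRecord`, as a stand-in that mirrors the `CanonicalMessage` fields. That class and the helpers `entities_of_type` and `intents_for` are hypothetical names. The sample values are copied from the example output above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class CanonicalRecord:
    """Stdlib stand-in mirroring the CanonicalMessage fields (an assumption for this sketch)."""
    timestamp: str
    speaker: str
    canonical_content: str
    intent: str
    entities: Dict[str, List[str]] = field(default_factory=dict)
    topics: List[str] = field(default_factory=list)

def entities_of_type(history: List[CanonicalRecord], entity_type: str) -> Set[str]:
    """Collect every distinct entity of one type across the conversation."""
    found: Set[str] = set()
    for record in history:
        found.update(record.entities.get(entity_type, []))
    return found

def intents_for(history: List[CanonicalRecord], speaker: str) -> List[str]:
    """List the intents of one speaker's messages in conversation order."""
    return [record.intent for record in history if record.speaker == speaker]

history = [
    CanonicalRecord(
        timestamp="2024-03-15T09:30:00",
        speaker="user",
        canonical_content="I need help with ORDER-12345 from on March 14, 2024",
        intent="request",
        entities={"orders": ["ORDER-12345"]},
        topics=["support"],
    ),
    CanonicalRecord(
        timestamp="2024-03-15T09:31:00",
        speaker="assistant",
        canonical_content="I'd be happy to help with ORDER-12345. Let me look that up.",
        intent="request",
        entities={"orders": ["ORDER-12345"]},
        topics=["support"],
    ),
]

print(entities_of_type(history, "orders"))  # {'ORDER-12345'}
print(intents_for(history, "user"))         # ['request']
```

With the notebook's Pydantic model, the same helpers should work on `canonical_conversation` unchanged, since they only read attributes that both classes share.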