# Hierarchical Compression

Context information for AI agents varies in its value and relevance based on recency, importance and the current task. Recent conversation turns typically need to be preserved verbatim for coherent interaction, while older information can tolerate more aggressive summarization. Similarly, critical facts like customer identifiers or account status require full preservation, while tangential details or metadata can be highly compressed. A single compression level applied uniformly across all context elements fails to capture these nuances, either wasting tokens on overly detailed old information or losing important details through excessive compression.

Hierarchical compression addresses this by organizing context into multiple tiers with different compression ratios applied at each level. Recent and critical information occupies the highest tier with verbatim or light compression, moderately old or important information exists in a middle tier with balanced summarization, and older or less critical information resides in a heavily compressed tier with aggressive summarization or even removal. This creates a graduated context structure where information naturally flows through compression tiers over time or as importance changes, optimizing the token budget allocation across different context needs.

This notebook demonstrates how to implement hierarchical compression from basic multi-tier systems to production-ready managers. We will explore time-based hierarchies where age determines compression level, importance-based hierarchies that preserve critical information regardless of age, progressive summarization with multiple compression stages, layered storage with different granularity levels, and complete hierarchical managers that automatically assign and transition information across tiers. These techniques are essential for building agents that maintain rich context awareness while respecting strict token budgets through intelligent prioritization.

In [1]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any, Literal
from datetime import datetime, timedelta
from enum import Enum
import json
import os

### Initialize the language model that will perform compression operations

In [2]:
# Using gpt-4o-mini for cost efficiency, temperature=0 for deterministic outputs
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY", "").strip(), temperature=0)

## Understanding compression granularity levels
Before implementing hierarchical systems, we need to understand the different levels of compression granularity and their appropriate use cases. Verbatim preservation maintains complete original content, light summarization condenses while preserving most details, moderate summarization captures key facts with selective detail, and aggressive summarization reduces content to minimal essential information. Each level trades off fidelity for token efficiency at different ratios.

We will demonstrate the four compression levels applied to the same conversation content, showing the token costs and information preservation characteristics of each. This establishes the foundation for hierarchical systems that intelligently assign information to appropriate compression tiers based on context needs.

Let's start by preparing a sample conversation that we will use to demonstrate different compression levels. This conversation represents a typical customer service exchange with specific details like order numbers, dates, and commitments that we need to handle carefully during compression. By using the same conversation across all compression levels, we can directly compare how much information is preserved at each tier and understand the tradeoffs between token efficiency and detail retention.

In [3]:
# Sample conversation content to compress at different levels
sample_conversation = """
User: I'm having trouble with my order #A12345. It was supposed to arrive yesterday 
but the tracking shows it's still in transit. I need it urgently for my daughter's 
birthday party this weekend.

Agent: I apologize for the delay with order #A12345. Let me check the tracking 
information right away. I can see the package is currently at the regional distribution 
center in Memphis, TN. There was an unexpected delay due to severe weather in the area. 
I can offer you expedited shipping at no charge to ensure it arrives by Friday.

User: Friday should work, but can you guarantee it? The party is on Saturday afternoon.

Agent: Yes, I can guarantee Friday delivery by 5 PM. I'm upgrading your shipping to 
priority overnight at no cost. You'll receive a new tracking number within the hour, 
and I'm adding a $25 credit to your account for the inconvenience.
"""

Now, we will define four compression functions, each representing a different tier in our hierarchy. 
- The verbatim function simply returns the original content unchanged, serving as our baseline.
- The light compression function uses the LLM to condense the text while keeping most details intact, removing only redundant phrasing and targeting about 70-80% of the original length.
- The moderate compression function extracts key facts and the main narrative while omitting conversational nuances, aiming for 40-50% of original length.
- The aggressive compression function reduces content to bare essential facts and outcomes, stripping away all context and explanation to achieve 20-30% of original length.

Each function uses carefully crafted prompts to guide the LLM toward the appropriate compression level, ensuring consistent behavior across different types of content.

In [4]:
def compress_verbatim(content: str) -> str:
    """
    Verbatim level: No compression, complete preservation.
    Use for: Recent messages, critical information.
    """
    return content

def compress_light(content: str, llm: ChatOpenAI) -> str:
    """
    Light compression: Minor condensing, preserve most details.
    Use for: Recent but not immediate content.
    Target: 70-80% of original length.
    """
    prompt = f"""Lightly compress this conversation while preserving most details:

{content}

Light compression guidelines:
- Keep all important facts, numbers, and commitments
- Preserve the conversation flow and context
- Remove only redundant phrasing and minor details
- Target 70-80% of original length

Compressed version:"""
    
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content

def compress_moderate(content: str, llm: ChatOpenAI) -> str:
    """
    Moderate compression: Capture key facts with selective details.
    Use for: Older content, moderate importance.
    Target: 40-50% of original length.
    """
    prompt = f"""Create a moderate compression of this conversation:

{content}

Moderate compression guidelines:
- Preserve key facts: order numbers, commitments, outcomes
- Summarize the main issue and resolution
- Omit conversational details and minor context
- Target 40-50% of original length

Compressed version:"""
    
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content

def compress_aggressive(content: str, llm: ChatOpenAI) -> str:
    """
    Aggressive compression: Minimal essential information only.
    Use for: Old content, low importance, background context.
    Target: 20-30% of original length.
    """
    prompt = f"""Create an aggressive compression of this conversation:

{content}

Aggressive compression guidelines:
- Extract only the essential facts
- Focus on outcomes and commitments
- Omit explanations and conversational context
- Target 20-30% of original length

Compressed version:"""
    
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content

With our compression functions ready, let's now apply them to the sample conversation and see how they perform. We will run the same conversation through all four compression levels and compare the results side by side. For each level, we will calculate the approximate token count and show the percentage of original content preserved. This demonstration will reveal the practical tradeoffs at each tier - verbatim preserves everything but uses maximum tokens, light compression reduces verbosity while keeping all facts, moderate compression focuses on key points while dropping conversational flow, and aggressive compression extracts only the bare essentials. Pay attention to what information survives at each level and what gets discarded, as this understanding is crucial for designing effective hierarchical compression strategies.

In [5]:
# Demonstrate all compression levels
print("Compression Granularity Levels:")
print("="*80)

# Original
verbatim = compress_verbatim(sample_conversation)
verbatim_tokens = int(len(verbatim.split()) * 1.3)
print("\n1. VERBATIM (No Compression):")
print("-" * 80)
print(verbatim)
print(f"\nTokens: ~{verbatim_tokens}")

# Light compression
light = compress_light(sample_conversation, llm)
light_tokens = int(len(light.split()) * 1.3)
print("\n2. LIGHT COMPRESSION (70-80% preserved):")
print("-" * 80)
print(light)
print(f"\nTokens: ~{light_tokens} ({(light_tokens/verbatim_tokens*100):.0f}% of original)")

# Moderate compression
moderate = compress_moderate(sample_conversation, llm)
moderate_tokens = int(len(moderate.split()) * 1.3)
print("\n3. MODERATE COMPRESSION (40-50% preserved):")
print("-" * 80)
print(moderate)
print(f"\nTokens: ~{moderate_tokens} ({(moderate_tokens/verbatim_tokens*100):.0f}% of original)")

# Aggressive compression
aggressive = compress_aggressive(sample_conversation, llm)
aggressive_tokens = int(len(aggressive.split()) * 1.3)
print("\n4. AGGRESSIVE COMPRESSION (20-30% preserved):")
print("-" * 80)
print(aggressive)
print(f"\nTokens: ~{aggressive_tokens} ({(aggressive_tokens/verbatim_tokens*100):.0f}% of original)")

print("\n" + "="*80)
print("\nCompression Comparison:")
print(f"  Verbatim: {verbatim_tokens} tokens (baseline)")
print(f"  Light: {light_tokens} tokens (saves {verbatim_tokens - light_tokens})")
print(f"  Moderate: {moderate_tokens} tokens (saves {verbatim_tokens - moderate_tokens})")
print(f"  Aggressive: {aggressive_tokens} tokens (saves {verbatim_tokens - aggressive_tokens})")

Compression Granularity Levels:

1. VERBATIM (No Compression):
--------------------------------------------------------------------------------

User: I'm having trouble with my order #A12345. It was supposed to arrive yesterday 
but the tracking shows it's still in transit. I need it urgently for my daughter's 
birthday party this weekend.

Agent: I apologize for the delay with order #A12345. Let me check the tracking 
information right away. I can see the package is currently at the regional distribution 
center in Memphis, TN. There was an unexpected delay due to severe weather in the area. 
I can offer you expedited shipping at no charge to ensure it arrives by Friday.

User: Friday should work, but can you guarantee it? The party is on Saturday afternoon.

Agent: Yes, I can guarantee Friday delivery by 5 PM. I'm upgrading your shipping to 
priority overnight at no cost. You'll receive a new tracking number within the hour, 
and I'm adding a $25 credit to your account for the incon

The compression level demonstrations show the tradeoffs between information preservation and token efficiency across four granularity tiers.
- Verbatim preservation maintains complete content at full token cost.
- Light compression uses LLM-based condensing to remove redundant phrasing and verbose expressions while keeping all substantive details, achieving approximately 20-30% compression.
- Moderate compression extracts key facts and main narrative flow while omitting conversational details and explanatory context, achieving 40-60% compression.
- Aggressive compression reduces content to bare essential facts and outcomes, discarding all context and explanation to achieve 60-80% compression.

The example demonstrates these levels on a customer service conversation: verbatim preserves the complete dialogue, light compression condenses phrasing while keeping all facts, moderate compression captures the delivery issue and resolution without full dialogue, and aggressive compression reduces to just the essential outcome (order upgraded, credit issued). Understanding these levels is crucial for hierarchical systems that must intelligently assign different compression ratios to different context elements based on their characteristics and needs.

## Time-based hierarchical strategy
The most intuitive hierarchical approach organizes context by age, applying lighter compression to recent information and progressively more aggressive compression to older content. This time-based hierarchy naturally reflects how conversation relevance typically decays - the most recent few turns are essential for coherent interaction, turns from earlier in the session provide valuable context, and very old turns offer only background awareness. By defining age thresholds and associated compression levels, we create an automatic aging system where context naturally transitions through compression tiers.

We will implement a time-based hierarchical compression system with three tiers: recent (verbatim), medium-age (moderate compression), and old (aggressive compression). The system automatically categorizes messages by age and applies appropriate compression, creating an efficient context structure that prioritizes recent information.

Before we build our time-based compression system, we need to define the data structures that will represent compression tiers and messages with their metadata. We will create an enum to represent the four compression tiers we just demonstrated, making it easy to reference them in code. Then we will define a Pydantic model for tiered messages that tracks not just the message content but also the compression tier it's currently at, the original uncompressed content, a timestamp for age calculations and the message type. These models provide type safety and ensure we maintain all the information needed to make intelligent compression decisions as messages age through the system.

In [6]:
class CompressionTier(str, Enum):
    """
    Enum defining the four compression tiers for hierarchical strategy.
    Using str enum allows easy serialization and comparison.
    """
    VERBATIM = "verbatim"      # No compression (100% preserved)
    LIGHT = "light"            # Light compression (70-80% preserved)
    MODERATE = "moderate"      # Moderate compression (40-50% preserved)
    AGGRESSIVE = "aggressive"  # Aggressive compression (20-30% preserved)


class TieredMessage(BaseModel):
    """
    Represents a message with compression tier metadata.
    
    This model tracks both the current state and the history of a message
    as it flows through compression tiers.
    """
    # Current (possibly compressed) content
    content: str = Field(
        description="Message content at current compression level"
    )
    
    # Original uncompressed content for recompression without artifacts
    original_content: Optional[str] = Field(
        default=None,
        description="Original uncompressed content"
    )
    
    # Current compression tier
    tier: CompressionTier = Field(
        description="Current compression tier"
    )
    
    # Timestamp for age-based compression decisions
    timestamp: datetime = Field(
        description="When this message was created"
    )
    
    # Message type (human, ai, system)
    message_type: str = Field(
        description="Type of message (human, ai, system)"
    )

Now we will build the `TimeBasedHierarchicalCompressor` class that automatically assigns compression tiers based on message age. The class maintains age thresholds that define the boundaries between compression tiers - messages younger than the recent threshold stay verbatim, messages between recent and medium thresholds get light compression, messages between medium and old thresholds get moderate compression, and anything older than the old threshold gets aggressive compression. The key insight is that messages naturally flow through compression tiers as they age, without any manual intervention. We will provide methods to add new messages starting at verbatim tier, and a recompression method that evaluates all messages based on their current age and adjusts their compression level as needed. Importantly, recompression always works from the original content to avoid compounding compression artifacts that would degrade quality if we repeatedly compressed already-compressed text.

In [7]:
class TimeBasedHierarchicalCompressor:
    """
    Hierarchical compressor that assigns compression tiers based on message age.
    
    As messages age, they automatically flow through compression tiers:
    - Recent messages: VERBATIM (full detail for coherent interaction)
    - Medium-age messages: LIGHT (condensed but contextual)
    - Older messages: MODERATE (key facts only)
    - Very old messages: AGGRESSIVE (minimal essentials)
    """
    
    def __init__(
        self,
        llm: ChatOpenAI,
        recent_threshold_minutes: int = 15,
        medium_threshold_minutes: int = 60,
        old_threshold_minutes: int = 180
    ):
        """
        Initialize the time-based compressor with age thresholds.
        
        Args:
            llm: Language model for performing compression
            recent_threshold_minutes: Age below which messages stay verbatim
            medium_threshold_minutes: Age below which messages get light compression
            old_threshold_minutes: Age below which messages get moderate compression
                (Messages older than this get aggressive compression)
        """
        self.llm = llm
        
        # Convert threshold minutes to timedelta objects for easy comparison
        self.recent_threshold = timedelta(minutes=recent_threshold_minutes)
        self.medium_threshold = timedelta(minutes=medium_threshold_minutes)
        self.old_threshold = timedelta(minutes=old_threshold_minutes)
        
        # Storage for all messages with their metadata
        self.messages: List[TieredMessage] = []
    
    def _determine_tier(self, message_age: timedelta) -> CompressionTier:
        """
        Determine the appropriate compression tier based on message age.
        
        This implements the age-based hierarchy:
        - age < recent_threshold: VERBATIM
        - age < medium_threshold: LIGHT
        - age < old_threshold: MODERATE
        - age >= old_threshold: AGGRESSIVE
        
        Args:
            message_age: Time elapsed since message creation
            
        Returns:
            Appropriate compression tier for this age
        """
        if message_age < self.recent_threshold:
            return CompressionTier.VERBATIM
        elif message_age < self.medium_threshold:
            return CompressionTier.LIGHT
        elif message_age < self.old_threshold:
            return CompressionTier.MODERATE
        else:
            return CompressionTier.AGGRESSIVE
    
    def _compress_to_tier(
        self,
        content: str,
        target_tier: CompressionTier
    ) -> str:
        """
        Compress content to the specified tier level.
        
        Delegates to the appropriate compression function based on target tier.
        
        Args:
            content: Original or source content to compress
            target_tier: Desired compression level
            
        Returns:
            Content compressed to the target tier
        """
        if target_tier == CompressionTier.VERBATIM:
            return content
        elif target_tier == CompressionTier.LIGHT:
            return compress_light(content, self.llm)
        elif target_tier == CompressionTier.MODERATE:
            return compress_moderate(content, self.llm)
        else:  # AGGRESSIVE
            return compress_aggressive(content, self.llm)
    
    def add_message(
        self,
        content: str,
        message_type: str = "human",
        timestamp: Optional[datetime] = None
    ) -> None:
        """
        Add a new message to the hierarchical store.
        
        New messages always start at VERBATIM tier. They will be compressed through recompression as they age.
        
        Args:
            content: Message content
            message_type: Type of message (human, ai, system)
            timestamp: Message timestamp (defaults to current time)
        """
        # Use current time if no timestamp provided
        ts = timestamp or datetime.now()
        
        # Create a new tiered message starting at verbatim level
        message = TieredMessage(
            content=content,
            original_content=content,  # Store original for future recompression
            tier=CompressionTier.VERBATIM,
            timestamp=ts,
            message_type=message_type
        )
        
        # Add to message list
        self.messages.append(message)
    
    def recompress_by_age(self, current_time: Optional[datetime] = None) -> None:
        """
        Recompress all messages based on their current age.
        This method calculates each message's current age and assigns it to the appropriate tier. Messages naturally flow through tiers as they age.
        
        IMPORTANT: Always recompresses from original_content to avoid compounding compression artifacts.
        
        Args:
            current_time: Reference time for age calculation (defaults to now)
        """
        # Use current time if not provided
        current = current_time or datetime.now()
        
        # Iterate through all messages
        for message in self.messages:
            # Calculate how old this message is
            age = current - message.timestamp
            
            # Determine what tier this message should be at given its age
            target_tier = self._determine_tier(age)
            
            # Only recompress if the tier has changed
            if target_tier != message.tier:
                # Use original content as source to avoid compression artifacts
                source_content = message.original_content or message.content
                
                # Compress to the target tier
                message.content = self._compress_to_tier(source_content, target_tier)
                
                # Update the tier metadata
                message.tier = target_tier
    
    def get_context(self) -> List[Dict[str, Any]]:
        """
        Get the current context with tier information for all messages.
        
        Returns:
            List of message dictionaries with content, type, tier, and age
        """
        return [
            {
                'content': msg.content,
                'type': msg.message_type,
                'tier': msg.tier.value,
                'age_minutes': int((datetime.now() - msg.timestamp).total_seconds() / 60)
            }
            for msg in self.messages
        ]
    
    def get_stats(self) -> Dict[str, Any]:
        """
        Get compression statistics showing token savings and tier distribution.
        
        Returns:
            Dictionary with statistics including token counts and savings
        """
        from collections import Counter
        
        # Count how many messages are at each tier
        tier_counts = Counter(msg.tier.value for msg in self.messages)
        
        # Calculate current total tokens (with compression)
        total_tokens = sum(int(len(msg.content.split()) * 1.3) for msg in self.messages)
        
        # Calculate original total tokens (without compression)
        original_tokens = sum(
            int(len((msg.original_content or msg.content).split()) * 1.3)
            for msg in self.messages
        )
        
        return {
            'total_messages': len(self.messages),
            'tier_distribution': dict(tier_counts),
            'current_tokens': total_tokens,
            'original_tokens': original_tokens,
            'tokens_saved': original_tokens - total_tokens,
            'compression_ratio': f"{(1 - total_tokens/max(original_tokens, 1))*100:.1f}%"
        }

Let's see the time-based hierarchical compression in action by simulating a conversation that spans 45 minutes. We will create messages at different points in time - some from 45 minutes ago that should be aggressively compressed by now, some from 20 minutes ago that should be moderately compressed, and recent ones from 2-3 minutes ago that should stay verbatim. We will use specific timestamps for each message to simulate the passage of time, then call the recompression method to automatically assign appropriate tiers based on age. This demonstration will show how the system automatically manages compression without manual intervention, creating a natural hierarchy where recent context stays detailed and old context becomes progressively more condensed.

In [8]:
# Example: Time-based hierarchical compression
print("Time-Based Hierarchical Compression:")
print("="*80)

# Create compressor with custom time thresholds (shorter for demonstration)
compressor = TimeBasedHierarchicalCompressor(
    llm=llm,
    recent_threshold_minutes=5,  # Last 5 min: verbatim
    medium_threshold_minutes=15,  # 5-15 min: light compression
    old_threshold_minutes=30  # 15-30 min: moderate, 30+: aggressive
)

# Simulate conversation over time - Use a base time 45 minutes in the past as our starting point
base_time = datetime.now() - timedelta(minutes=45)

# Old messages (45 minutes ago) - these should be aggressively compressed
compressor.add_message(
    "Hi, I have a question about my account settings",
    message_type="human",
    timestamp=base_time
)
compressor.add_message(
    "I'd be happy to help with your account settings. What would you like to know?",
    message_type="ai",
    timestamp=base_time + timedelta(minutes=1)
)

# Medium-age messages (20 minutes ago) - these should be moderately compressed
compressor.add_message(
    "I need to update my payment method",
    message_type="human",
    timestamp=base_time + timedelta(minutes=25)
)
compressor.add_message(
    "Sure, I can help you update your payment method. Would you like to add a new card or update an existing one?",
    message_type="ai",
    timestamp=base_time + timedelta(minutes=26)
)

# Recent messages (3 minutes ago) - these should stay verbatim
compressor.add_message(
    "I want to add a new card ending in 4567",
    message_type="human",
    timestamp=base_time + timedelta(minutes=42)
)
compressor.add_message(
    "Perfect, I've added the card ending in 4567 to your account. Would you like to set it as your default payment method?",
    message_type="ai",
    timestamp=base_time + timedelta(minutes=43)
)

# Apply time-based compression
# Simulate "now" as 45 minutes after base_time
print("\nApplying time-based compression...")
compressor.recompress_by_age(current_time=base_time + timedelta(minutes=45))

# Get the compressed context
context = compressor.get_context()
stats = compressor.get_stats()

# Display the hierarchical context
print("\nHierarchical Context (organized by age):")
print("-" * 80)

for i, msg in enumerate(context, 1):
    print(f"\n{i}. [{msg['tier'].upper()}] (Age: {msg['age_minutes']} min)")
    print(f"   Type: {msg['type']}")
    print(f"   Content: {msg['content']}")

# Display compression statistics
print("\n" + "="*80)
print("\nCompression Statistics:")
for key, value in stats.items():
    print(f"  {key.replace('_', ' ').title()}: {value}")

Time-Based Hierarchical Compression:

Applying time-based compression...

Hierarchical Context (organized by age):
--------------------------------------------------------------------------------

1. [AGGRESSIVE] (Age: 45 min)
   Type: human
   Content: Question about account settings.

2. [AGGRESSIVE] (Age: 44 min)
   Type: ai
   Content: How can I assist with your account settings?

3. [MODERATE] (Age: 20 min)
   Type: human
   Content: User needs to update their payment method.

4. [MODERATE] (Age: 19 min)
   Type: ai
   Content: I can assist you with updating your payment method. Do you want to add a new card or update an existing one?

5. [VERBATIM] (Age: 3 min)
   Type: human
   Content: I want to add a new card ending in 4567

6. [VERBATIM] (Age: 2 min)
   Type: ai
   Content: Perfect, I've added the card ending in 4567 to your account. Would you like to set it as your default payment method?


Compression Statistics:
  Total Messages: 6
  Tier Distribution: {'aggressive': 2, 'm

Time-based tier assignment rules:
- 0-5 min: VERBATIM (full preservation)
- 5-15 min: LIGHT (minor condensing)
- 15-30 min: MODERATE (key facts only)
- 30+ min: AGGRESSIVE (minimal essentials)
  
The time-based hierarchical system automatically assigns compression tiers based on message age, creating a natural aging process where content flows through compression levels over time.
- The `TimeBasedHierarchicalCompressor` maintains age thresholds defining tier boundaries and applies appropriate compression when messages transition between tiers.
- The `recompress_by_age` method calculates each message's current age and determines its target tier, recompressing from the original content to avoid compounding compression artifacts. Messages younger than the recent threshold remain verbatim for coherent interaction, messages in the medium age range receive light compression to reduce verbosity while maintaining context, older messages receive moderate compression preserving only key facts, and very old messages receive aggressive compression reducing them to bare essentials.
- The example demonstrates a 45-minute conversation where the oldest messages (45 minutes) are aggressively compressed to minimal facts, middle-aged messages (20 minutes) receive moderate compression, and recent messages (2-3 minutes) stay verbatim. This typically achieves 40-60% overall compression in long conversations while maintaining excellent recent context quality, making it ideal for long-running conversation systems where older context provides background awareness but does not need full detail.

## Progressive summarization with multiple levels
Rather than jumping directly from verbatim to heavily compressed, progressive summarization applies compression in multiple stages, with each stage building on the previous compression. This creates a smooth gradient of compression ratios and allows the system to maintain intermediate representations that balance detail and efficiency. A message might start verbatim, transition to a light summary after some time, then to a moderate summary and finally to a high-level abstract, creating a natural information degradation curve.

We will implement a progressive summarization system with explicit stages where content transitions through defined compression levels over time or based on triggers. The system maintains the compression history and ensures smooth transitions between levels without information loss at each stage.

Before implementing progressive summarization, we need to understand how it differs from simple time-based compression. Instead of jumping directly from verbatim to aggressive compression when a message ages past a threshold, progressive summarization applies compression in multiple stages, creating smooth transitions through compression levels. Each message maintains a complete history of its compression stages - first verbatim, then compressed to light, then to moderate, and finally to aggressive. The key advantage is that each compression stage builds on the previous stage's output rather than always compressing from the original, creating a natural degradation curve. We will start by defining data models that can track this multi-stage history for each message.

In [9]:
class ProgressiveStage(BaseModel):
    """
    Represents a single compression stage in the progressive hierarchy.
    Each stage captures the content at a specific compression level, along with metadata about when it was created and its token cost.
    """
    # The compression level for this stage
    level: CompressionTier = Field(
        description="Compression level for this stage"
    )
    
    # The content at this compression level
    content: str = Field(
        description="Content compressed to this level"
    )
    
    # When this stage was created (for tracking compression history)
    created_at: datetime = Field(
        description="Timestamp when this compression stage was created"
    )
    
    # Token count at this stage (for efficiency tracking)
    token_count: int = Field(
        description="Estimated token count at this compression level"
    )


class ProgressiveMessage(BaseModel):
    """
    Message with complete progressive compression history.
    Instead of a single compression state, this maintains all stages the message has progressed through, allowing retrieval at any level.
    """
    # Unique identifier for this message
    id: str = Field(
        description="Message identifier"
    )
    
    # Original uncompressed content (for reference)
    original_content: str = Field(
        description="Original verbatim content"
    )
    
    # Complete compression history from verbatim through all stages
    stages: List[ProgressiveStage] = Field(
        description="List of compression stages in progression order"
    )
    
    # Message creation timestamp
    created_at: datetime = Field(
        description="When the message was created"
    )
    
    # Message type (human, ai, system)
    message_type: str = Field(
        description="Type of message"
    )
    
    def get_current_stage(self) -> ProgressiveStage:
        """
        Get the most compressed (latest) stage. This represents the current compression state of the message.
        
        Returns:
            The latest compression stage
        """
        return self.stages[-1]
    
    def get_stage_by_level(self, level: CompressionTier) -> Optional[ProgressiveStage]:
        """
        Retrieve a specific compression stage by level. This allows accessing any intermediate compression level from history.
        
        Args:
            level: The compression level to retrieve
            
        Returns:
            The stage at that level, or None if not found
        """
        for stage in self.stages:
            if stage.level == level:
                return stage
        return None

Now we will build the `ProgressiveSummarizer` class that manages multi-stage compression. This class maintains a defined progression path (verbatim → light → moderate → aggressive) and provides methods to advance messages through these stages one at a time. The key method is `progress_message`, which takes a message at its current stage and compresses it to the next level in the progression path. Critically, each stage compresses from the previous stage's output rather than from the original content, creating a chain where compressions build upon each other. This produces smoother semantic transitions but requires careful prompting to ensure quality does not degrade too quickly. The class also provides a `progress_oldest` method that automatically advances the oldest messages in the system, simulating how messages naturally age and compress over time in a running conversation.

In [11]:
class ProgressiveSummarizer:
    """
    Progressive summarization system that compresses content through multiple stages, maintaining complete history at each level.
    Messages advance through stages: VERBATIM → LIGHT → MODERATE → AGGRESSIVE
    Each stage builds on the previous compression (not original content).
    """
    
    def __init__(self, llm: ChatOpenAI):
        """
        Initialize the progressive summarizer.
        
        Args:
            llm: Language model for performing compressions
        """
        self.llm = llm
        self.messages: List[ProgressiveMessage] = []
        self.message_counter = 0
        
        # Define the progression path through compression tiers
        # Messages flow through these levels in order
        self.progression_path = [
            CompressionTier.VERBATIM,
            CompressionTier.LIGHT,
            CompressionTier.MODERATE,
            CompressionTier.AGGRESSIVE
        ]
    
    def _estimate_tokens(self, text: str) -> int:
        """
        Estimate token count for text content.
        
        Uses rough approximation of 1.3 tokens per word.
        
        Args:
            text: Text to estimate tokens for
            
        Returns:
            Estimated token count
        """
        return int(len(text.split()) * 1.3)
    
    def add_message(
        self,
        content: str,
        message_type: str = "human"
    ) -> None:
        """
        Add a new message starting at verbatim stage.
        New messages begin with a single stage (verbatim) and progress through additional stages over time.
        
        Args:
            content: Message content
            message_type: Type of message (human, ai, system)
        """
        self.message_counter += 1
        
        # Create the initial verbatim stage
        initial_stage = ProgressiveStage(
            level=CompressionTier.VERBATIM,
            content=content,
            created_at=datetime.now(),
            token_count=self._estimate_tokens(content)
        )
        
        # Create progressive message with initial stage
        message = ProgressiveMessage(
            id=f"msg_{self.message_counter}",
            original_content=content,
            stages=[initial_stage],  # Start with just verbatim
            created_at=datetime.now(),
            message_type=message_type
        )
        
        self.messages.append(message)
    
    def progress_message(self, message: ProgressiveMessage) -> bool:
        """
        Advance a message to its next compression stage.
        This compresses the CURRENT stage to the NEXT level, creating a progressive chain where each compression builds on the previous.
        
        Args:
            message: Message to progress to next stage
            
        Returns:
            True if progressed, False if already at final stage
        """
        # Get the current (most compressed) stage
        current_stage = message.get_current_stage()
        current_level = current_stage.level
        
        # Find the next level in our progression path
        try:
            current_index = self.progression_path.index(current_level)
            
            # Check if we're already at the most compressed stage
            if current_index >= len(self.progression_path) - 1:
                return False  # Can't progress further
            
            # Get the next compression level
            next_level = self.progression_path[current_index + 1]
        except ValueError:
            # Current level not in progression path
            return False
        
        # IMPORTANT: Compress from CURRENT stage (not original content)
        # This creates progressive compression where each stage builds on previous
        source_content = current_stage.content
        
        # Apply the appropriate compression for the next level
        if next_level == CompressionTier.LIGHT:
            compressed = compress_light(source_content, self.llm)
        elif next_level == CompressionTier.MODERATE:
            compressed = compress_moderate(source_content, self.llm)
        else:  # AGGRESSIVE
            compressed = compress_aggressive(source_content, self.llm)
        
        # Create the new compression stage
        new_stage = ProgressiveStage(
            level=next_level,
            content=compressed,
            created_at=datetime.now(),
            token_count=self._estimate_tokens(compressed)
        )
        
        # Add the new stage to the message's history
        message.stages.append(new_stage)
        return True
    
    def progress_oldest(self, count: int = 1) -> int:
        """
        Progress the oldest messages to their next compression stage.
        This simulates the natural aging process where older messages get progressively more compressed over time.
        
        Args:
            count: Number of oldest messages to progress
            
        Returns:
            Number of messages actually progressed
        """
        progressed = 0
        
        # Progress the first 'count' messages (oldest first)
        for message in self.messages[:count]:
            if self.progress_message(message):
                progressed += 1
        
        return progressed
    
    def get_context(self) -> List[Dict[str, Any]]:
        """
        Get current context using the latest stage of each message.
        
        Returns:
            List of message dictionaries with current compression state
        """
        return [
            {
                'id': msg.id,
                'content': msg.get_current_stage().content,
                'type': msg.message_type,
                'compression_level': msg.get_current_stage().level.value,
                'stage_count': len(msg.stages),
                'tokens': msg.get_current_stage().token_count
            }
            for msg in self.messages
        ]
    
    def get_stats(self) -> Dict[str, Any]:
        """
        Get progressive compression statistics.
        
        Returns:
            Dictionary with token counts, savings, and level distribution
        """
        # Calculate current token usage (latest stages)
        current_tokens = sum(
            msg.get_current_stage().token_count for msg in self.messages
        )
        
        # Calculate original token usage (verbatim)
        original_tokens = sum(
            self._estimate_tokens(msg.original_content) for msg in self.messages
        )
        
        # Count messages at each compression level
        from collections import Counter
        level_distribution = Counter(
            msg.get_current_stage().level.value for msg in self.messages
        )
        
        return {
            'total_messages': len(self.messages),
            'current_tokens': current_tokens,
            'original_tokens': original_tokens,
            'tokens_saved': original_tokens - current_tokens,
            'compression_ratio': f"{(1 - current_tokens/max(original_tokens, 1))*100:.1f}%",
            'level_distribution': dict(level_distribution)
        }

Let's demonstrate progressive summarization by adding several messages to the system and then progressively compressing them through multiple stages. We will start with all messages at verbatim level, then call `progress_oldest` repeatedly to watch how messages flow through the compression stages. In each progression round, the oldest messages advance to their next compression tier - first from verbatim to light, then from light to moderate, then from moderate to aggressive. We will track the statistics at each stage to see how token usage decreases as compressions accumulate, and examine the final compressed context to understand the distribution of messages across different compression levels. This illustrates how progressive compression creates a natural hierarchy where the oldest messages have been compressed the most times and newer messages remain relatively uncompressed.

In [12]:
# Example: Progressive multi-level summarization
print("Progressive Multi-Level Summarization:")
print("="*80)

# Create a progressive summarizer
summarizer = ProgressiveSummarizer(llm)

# Add several messages to the system
messages_to_add = [
    ("I need help with my subscription renewal", "human"),
    ("I'd be happy to help with your subscription. Can you provide your account email?", "ai"),
    ("It's john.doe@example.com", "human"),
    ("Thank you. I can see your subscription expires next week. Would you like to renew for another year?", "ai"),
    ("Yes, but I'd like to upgrade to the premium plan", "human"),
]

for content, msg_type in messages_to_add:
    summarizer.add_message(content, msg_type)

# Show initial state (all messages at verbatim)
print("\nInitial State (all messages verbatim):")
print("-" * 80)
stats = summarizer.get_stats()
print(f"Messages: {stats['total_messages']}")
print(f"Tokens: {stats['current_tokens']}")
print(f"Level distribution: {stats['level_distribution']}")

# Now progressively compress messages through stages
print("\n" + "="*80)
print("\nProgressing messages through compression stages...")
print("-" * 80)

# Stage 1: Progress 2 oldest messages to light
print("\nStage 1: Progressing 2 oldest messages (VERBATIM → LIGHT)")
summarizer.progress_oldest(count=2)
stats = summarizer.get_stats()
print(f"  Tokens: {stats['current_tokens']} (saved {stats['tokens_saved']})")
print(f"  Distribution: {stats['level_distribution']}")

# Stage 2: Progress 2 oldest messages again
# The 2 that were light will go to moderate
# The next 2 verbatim will go to light
print("\nStage 2: Progressing 2 oldest messages again")
summarizer.progress_oldest(count=2)
stats = summarizer.get_stats()
print(f"  Tokens: {stats['current_tokens']} (saved {stats['tokens_saved']})")
print(f"  Distribution: {stats['level_distribution']}")

# Stage 3: Progress oldest message to aggressive
print("\nStage 3: Progressing oldest message (MODERATE → AGGRESSIVE)")
summarizer.progress_oldest(count=1)
stats = summarizer.get_stats()
print(f"  Tokens: {stats['current_tokens']} (saved {stats['tokens_saved']})")
print(f"  Distribution: {stats['level_distribution']}")

# Display final compressed context
print("\n" + "="*80)
print("\nFinal Context (progressive compression applied):")
print("-" * 80)

context = summarizer.get_context()
for i, msg in enumerate(context, 1):
    print(f"\n{i}. [{msg['compression_level'].upper()}] (Stages: {msg['stage_count']}, Tokens: {msg['tokens']})")
    print(f"   {msg['content']}")

# Show final statistics
print("\n" + "="*80)
print("\nProgressive Compression Benefits:")
print("  • Smooth compression gradient (no abrupt quality drops)")
print("  • Maintains intermediate representations for retrieval")
print("  • Can access any compression level from stage history")
print(f"  • Achieved {stats['compression_ratio']} overall compression")
print(f"  • Saved {stats['tokens_saved']} tokens total")

Progressive Multi-Level Summarization:

Initial State (all messages verbatim):
--------------------------------------------------------------------------------
Messages: 5
Tokens: 65
Level distribution: {'verbatim': 5}


Progressing messages through compression stages...
--------------------------------------------------------------------------------

Stage 1: Progressing 2 oldest messages (VERBATIM → LIGHT)
  Tokens: 63 (saved 2)
  Distribution: {'light': 2, 'verbatim': 3}

Stage 2: Progressing 2 oldest messages again
  Tokens: 56 (saved 9)
  Distribution: {'moderate': 2, 'verbatim': 3}

Stage 3: Progressing oldest message (MODERATE → AGGRESSIVE)
  Tokens: 56 (saved 9)
  Distribution: {'aggressive': 1, 'moderate': 1, 'verbatim': 3}


Final Context (progressive compression applied):
--------------------------------------------------------------------------------

1. [AGGRESSIVE] (Stages: 4, Tokens: 7)
   User needs help with subscription renewal.

2. [MODERATE] (Stages: 3, Tokens: 11)


The progressive summarization system maintains a complete compression history for each message, storing stages at each compression level as the message flows through the hierarchy.
- The `ProgressiveMessage` model tracks all stages from verbatim through light, moderate, and aggressive compression, with each stage storing its content, compression level, creation time, and token count.
- The `progress_message` method advances a message to its next compression stage by applying the appropriate compression function to the current stage's content, creating a chain where each level builds on the previous compression rather than always compressing from original content. This progressive approach creates smoother quality transitions - light compression condenses the verbatim version, moderate compression condenses the light version, and aggressive compression condenses the moderate version.
- The example demonstrates adding 5 messages and progressively compressing them through stages, showing how the oldest messages flow through multiple compression levels while newer messages remain verbatim, creating a natural hierarchy. The stage history allows retrieval of any compression level if more detail is needed, and the progressive approach typically achieves better semantic coherence than direct aggressive compression from original content. This makes it ideal for systems that need fine-grained control over compression and want to maintain multiple detail levels for different use cases.

## Production hierarchical compression manager
For production systems, we need a comprehensive hierarchical compression manager that combines time-based aging, importance-based preservation, progressive compression stages, and intelligent tier assignment policies. The system should automatically manage the complete lifecycle of context elements, handle both messages and structured data, provide flexible configuration for different use cases, and maintain detailed metrics for monitoring and optimization.

We will implement a production-ready hierarchical manager that integrates multiple strategies, provides sophisticated tier assignment based on multiple factors, automatically triggers compression operations, and offers comprehensive APIs for context retrieval with different detail levels. This represents a complete solution for production agents requiring efficient hierarchical context management.

For production use, we need a configuration system that allows fine-tuning of compression behavior for different use cases. We will create a `HierarchyPolicy` model that serves as a comprehensive configuration object, enabling or disabling different compression strategies and setting their parameters. This policy will control whether time-based compression is enabled, whether importance scores affect compression decisions, whether progressive multi-stage compression is used, what the age thresholds are for time-based tiers, and what importance score qualifies as "high importance" that deserves preferential treatment. By externalizing these decisions into a configuration object, we can easily adapt the same compression system for different scenarios - a customer service system might use longer thresholds and higher importance sensitivity, while a casual chatbot might use more aggressive settings.

In [13]:
class HierarchyPolicy(BaseModel):
    """
    Configuration policy for hierarchical compression behavior.
    This allows customization of compression strategies for different use cases without modifying the core compression logic.
    """
    # Enable/disable different compression strategies
    time_based_enabled: bool = Field(
        default=True,
        description="Use time-based tier assignment (age determines compression)"
    )
    
    importance_based_enabled: bool = Field(
        default=True,
        description="Use importance scores for tier assignment (high-importance items preserved longer)"
    )
    
    progressive_enabled: bool = Field(
        default=True,
        description="Use progressive multi-stage compression (smooth transitions)"
    )
    
    # Time-based thresholds (in minutes)
    recent_threshold_minutes: int = Field(
        default=10,
        description="Age threshold for recent tier (below this: verbatim)"
    )
    
    medium_threshold_minutes: int = Field(
        default=30,
        description="Age threshold for medium tier (below this: light compression)"
    )
    
    # Importance-based thresholds (0.0 to 1.0)
    high_importance_threshold: float = Field(
        default=0.8,
        description="Importance score threshold for preferential treatment"
    )

print("HierarchyPolicy configuration model defined!")
print("  Configurable features:")
print("    • Enable/disable time-based compression")
print("    • Enable/disable importance-based preservation")
print("    • Enable/disable progressive compression")
print("    • Customizable age thresholds")
print("    • Customizable importance threshold")

HierarchyPolicy configuration model defined!
  Configurable features:
    • Enable/disable time-based compression
    • Enable/disable importance-based preservation
    • Enable/disable progressive compression
    • Customizable age thresholds
    • Customizable importance threshold


Configurable features:
- Enable/disable time-based compression.
- Enable/disable importance-based preservation.
- Enable/disable progressive compression.
- Customizable age thresholds.
- Customizable importance threshold.
    
Now we will build the `ProductionHierarchicalManager`, which combines all the techniques we have explored into a single, production-ready system. This manager implements multi-factor compression that considers both age and importance when assigning tiers. The core logic is in the `_determine_tier method`, which checks if a message has high importance (above the threshold) and if so, gives it preferential treatment by keeping it at higher detail levels even as it ages. For normal importance messages, the system falls back to standard time-based tier assignment. The manager provides methods to add messages with importance scores and metadata, a `recompress_all` method that evaluates all messages and adjusts their tiers as needed, and a get_context method that supports token budgets by selecting messages to fit within a specified token limit. This represents a complete solution for production agents that need sophisticated context management with multiple compression strategies working in concert.

In [14]:
class ProductionHierarchicalManager:
    """
    Production-ready hierarchical compression manager combining time-based, importance-based and progressive strategies.
    This manager provides a complete solution for production agents requiring sophisticated multi-dimensional context compression.
    """
    
    def __init__(
        self,
        llm: ChatOpenAI,
        policy: Optional[HierarchyPolicy] = None
    ):
        """
        Initialize production hierarchical manager with policy.
        
        Args:
            llm: Language model for compression operations
            policy: Hierarchical compression policy (uses defaults if not provided)
        """
        self.llm = llm
        # Use provided policy or create default
        self.policy = policy or HierarchyPolicy()
        
        # Initialize sub-managers based on policy settings
        if self.policy.progressive_enabled:
            self.progressive_summarizer = ProgressiveSummarizer(llm)
        
        if self.policy.time_based_enabled:
            self.time_compressor = TimeBasedHierarchicalCompressor(
                llm=llm,
                recent_threshold_minutes=self.policy.recent_threshold_minutes,
                medium_threshold_minutes=self.policy.medium_threshold_minutes
            )
        
        # Unified message storage with rich metadata
        # Each message is a dictionary with content, metadata, tier, timestamps, etc.
        self.messages: List[Dict[str, Any]] = []
        
        # Metrics tracking for monitoring
        self.metrics = {
            'total_messages': 0,
            'compression_operations': 0,
            'total_tokens_saved': 0
        }
    
    def _determine_tier(
        self,
        age_minutes: int,
        importance: float
    ) -> CompressionTier:
        """
        Determine compression tier based on both age and importance.
        
        This implements multi-factor tier assignment:
        - High importance items (>= threshold) get preferential treatment
        - Normal importance items follow standard time-based rules
        
        Args:
            age_minutes: Message age in minutes
            importance: Importance score 0.0-1.0
            
        Returns:
            Appropriate compression tier
        """
        # HIGH IMPORTANCE PATH: Preserve detail longer for important messages
        if self.policy.importance_based_enabled:
            if importance >= self.policy.high_importance_threshold:
                # High importance gets preserved at higher detail levels
                if age_minutes < self.policy.medium_threshold_minutes:
                    return CompressionTier.VERBATIM
                elif age_minutes < self.policy.medium_threshold_minutes * 2:
                    return CompressionTier.LIGHT
                else:
                    # Even high importance eventually gets moderately compressed
                    return CompressionTier.MODERATE
        
        # NORMAL IMPORTANCE PATH: Standard time-based tier assignment
        if age_minutes < self.policy.recent_threshold_minutes:
            return CompressionTier.VERBATIM
        elif age_minutes < self.policy.medium_threshold_minutes:
            return CompressionTier.LIGHT
        elif age_minutes < self.policy.medium_threshold_minutes * 2:
            return CompressionTier.MODERATE
        else:
            return CompressionTier.AGGRESSIVE
    
    def _estimate_tokens(self, text: str) -> int:
        """
        Estimate token count for text.
        
        Args:
            text: Text to estimate
            
        Returns:
            Estimated token count
        """
        return int(len(text.split()) * 1.3)
    
    def add_message(
        self,
        content: str,
        message_type: str = "human",
        importance: float = 0.5,
        metadata: Optional[Dict[str, Any]] = None
    ) -> None:
        """
        Add a new message to the hierarchical manager.
        
        Messages start at VERBATIM tier and will be compressed through recompression based on age and importance.
        
        Args:
            content: Message content
            message_type: Type of message (human, ai, system)
            importance: Importance score 0.0-1.0 (higher = more important)
            metadata: Additional custom metadata
        """
        # Create message dictionary with all metadata
        message = {
            'id': f"msg_{self.metrics['total_messages']}",
            'original_content': content,
            'current_content': content,
            'type': message_type,
            'importance': importance,
            'created_at': datetime.now(),
            'current_tier': CompressionTier.VERBATIM,
            'metadata': metadata or {}
        }
        
        self.messages.append(message)
        self.metrics['total_messages'] += 1
    
    def recompress_all(self) -> None:
        """
        Recompress all messages based on current age and importance.
        This should be called periodically to maintain optimal compression as messages age and importance changes.
        
        Processes:
        1. Calculate current age of each message
        2. Determine target tier based on age and importance
        3. Recompress if tier has changed
        4. Always compress from original to avoid artifacts
        """
        current_time = datetime.now()
        
        for message in self.messages:
            # Calculate how old this message is now
            age = current_time - message['created_at']
            age_minutes = int(age.total_seconds() / 60)
            
            # Determine what tier this message should be at
            target_tier = self._determine_tier(age_minutes, message['importance'])
            
            # Only recompress if tier has changed
            if target_tier != message['current_tier']:
                # Compress from ORIGINAL content to avoid compounding artifacts
                if target_tier == CompressionTier.VERBATIM:
                    compressed = message['original_content']
                elif target_tier == CompressionTier.LIGHT:
                    compressed = compress_light(message['original_content'], self.llm)
                elif target_tier == CompressionTier.MODERATE:
                    compressed = compress_moderate(message['original_content'], self.llm)
                else:  # AGGRESSIVE
                    compressed = compress_aggressive(message['original_content'], self.llm)
                
                # Update message state
                message['current_content'] = compressed
                message['current_tier'] = target_tier
                
                # Track metrics
                self.metrics['compression_operations'] += 1
    
    def get_context(
        self,
        max_tokens: Optional[int] = None,
        include_metadata: bool = False
    ) -> List[Dict[str, Any]]:
        """
        Get hierarchically compressed context, optionally within a token budget.
        
        Args:
            max_tokens: Optional token budget limit (will fit as many messages as possible)
            include_metadata: Whether to include compression metadata in output
            
        Returns:
            List of context messages with compression applied
        """
        context = []
        current_tokens = 0
        
        # Process messages in reverse order (newest first) to prioritize recent content
        for message in reversed(self.messages):
            msg_tokens = self._estimate_tokens(message['current_content'])
            
            # Check token budget if specified
            if max_tokens and current_tokens + msg_tokens > max_tokens:
                break  # Stop adding messages once budget is exceeded
            
            # Build context message
            context_msg = {
                'content': message['current_content'],
                'type': message['type']
            }
            
            # Add metadata if requested (useful for debugging and monitoring)
            if include_metadata:
                age_minutes = int((datetime.now() - message['created_at']).total_seconds() / 60)
                context_msg.update({
                    'id': message['id'],
                    'tier': message['current_tier'].value,
                    'importance': message['importance'],
                    'age_minutes': age_minutes,
                    'tokens': msg_tokens
                })
            
            # Insert at beginning to maintain chronological order
            context.insert(0, context_msg)
            current_tokens += msg_tokens
        
        return context
    
    def get_stats(self) -> Dict[str, Any]:
        """
        Get comprehensive statistics about compression performance.
        
        Returns:
            Dictionary with metrics, token counts, and tier distribution
        """
        from collections import Counter
        
        # Count messages at each compression tier
        tier_distribution = Counter(msg['current_tier'].value for msg in self.messages)
        
        # Calculate current token usage (with compression)
        current_tokens = sum(
            self._estimate_tokens(msg['current_content']) for msg in self.messages
        )
        
        # Calculate original token usage (without compression)
        original_tokens = sum(
            self._estimate_tokens(msg['original_content']) for msg in self.messages
        )
        
        return {
            **self.metrics,  # Include tracked metrics
            'tier_distribution': dict(tier_distribution),
            'current_tokens': current_tokens,
            'original_tokens': original_tokens,
            'tokens_saved': original_tokens - current_tokens,
            'compression_ratio': f"{(1 - current_tokens/max(original_tokens, 1))*100:.1f}%"
        }

print("ProductionHierarchicalManager class defined!")

ProductionHierarchicalManager class defined!


Let's see the production manager in action with a realistic scenario that demonstrates importance-based preservation. We woll simulate a customer service conversation spanning 30 minutes where some messages have high importance (customer ID, commitments, resolutions) and others have normal importance (greetings, general questions). We will create a custom policy with specific thresholds, add messages with different ages and importance scores, then apply recompression to see how the manager handles the multi-factor tier assignment. The key observation will be how high-importance messages resist compression even as they age, while normal-importance messages follow standard time-based compression. This demonstrates the value of importance weighting in production systems where certain facts must be preserved with higher fidelity regardless of age.

In [16]:
# Example: Production hierarchical compression with importance weighting
print("Production Hierarchical Compression Manager:")
print("="*80)

# Create a custom policy for this use case
policy = HierarchyPolicy(
    recent_threshold_minutes=5,    # Last 5 min: verbatim
    medium_threshold_minutes=15,   # 5-15 min: light
    high_importance_threshold=0.8  # 0.8+ importance gets preferential treatment
)

# Initialize the production manager
manager = ProductionHierarchicalManager(llm, policy)

# Simulate a conversation over 30 minutes
base_time = datetime.now() - timedelta(minutes=30)

# Old message (30 min ago) with NORMAL importance
# Should be aggressively compressed
manager.add_message(
    "Hello, I need help with my account",
    importance=0.5  # Normal importance
)
manager.messages[-1]['created_at'] = base_time

# Old message (28 min ago) with HIGH importance
# Should resist compression due to importance
manager.add_message(
    "Customer ID: CUST-12345, Premium tier, expires 2024-12-31",
    importance=0.9  # High importance - critical customer data
)
manager.messages[-1]['created_at'] = base_time + timedelta(minutes=2)

# Medium-age message (10 min ago) with NORMAL importance
# Should be moderately compressed
manager.add_message(
    "I want to upgrade my subscription to include the new features",
    importance=0.6  # Slightly above normal
)
manager.messages[-1]['created_at'] = base_time + timedelta(minutes=20)

# Recent message (2 min ago) with NORMAL importance
# Should stay verbatim
manager.add_message(
    "Yes, please proceed with the upgrade",
    importance=0.5
)
manager.messages[-1]['created_at'] = base_time + timedelta(minutes=28)

# Recent message (1 min ago) with HIGH importance
# Should definitely stay verbatim
manager.add_message(
    "Confirmed: Upgraded to Premium Plus. New monthly rate: $49.99",
    message_type="ai",
    importance=0.95  # Very high importance - commitment/resolution
)
manager.messages[-1]['created_at'] = base_time + timedelta(minutes=29)

# Apply hierarchical compression
print("\nApplying multi-factor hierarchical compression...")
manager.recompress_all()

# Get compressed context with metadata
context = manager.get_context(include_metadata=True)
stats = manager.get_stats()

# Display the hierarchical context
print("\n" + "="*80)
print("\nHierarchical Context (with importance weighting):")
print("-" * 80)

for i, msg in enumerate(context, 1):
    print(f"\n{i}. [{msg['tier'].upper()}]")
    print(f"   Age: {msg['age_minutes']} min | Importance: {msg['importance']} | Tokens: {msg['tokens']}")
    print(f"   Type: {msg['type']}")
    print(f"   Content: {msg['content']}")

# Display compression statistics
print("\n" + "="*80)
print("\nCompression Statistics:")
for key, value in stats.items():
    print(f"  {key.replace('_', ' ').title()}: {value}")

# Highlight the production benefits
print("\n" + "="*80)
print("\nProduction Benefits:")
print("  ✓ Multi-factor tier assignment (age + importance)")
print("  ✓ High-importance items preserved at higher detail")
print("  ✓ Automatic recompression as content ages")
print("  ✓ Token budget enforcement via get_context(max_tokens=...)")
print("  ✓ Flexible policy configuration for different use cases")
print("  ✓ Comprehensive metrics and monitoring")
print(f"  ✓ Achieved {stats['compression_ratio']} compression")
print(f"  ✓ Saved {stats['tokens_saved']} tokens")

Production Hierarchical Compression Manager:

Applying multi-factor hierarchical compression...


Hierarchical Context (with importance weighting):
--------------------------------------------------------------------------------

1. [AGGRESSIVE]
   Age: 30 min | Importance: 0.5 | Tokens: 5
   Type: human
   Content: Need help with account.

2. [LIGHT]
   Age: 28 min | Importance: 0.9 | Tokens: 9
   Type: human
   Content: Customer ID: CUST-12345, Premium tier, expires 12/31/2024.

3. [LIGHT]
   Age: 10 min | Importance: 0.6 | Tokens: 14
   Type: human
   Content: I want to upgrade my subscription to access the new features.

4. [VERBATIM]
   Age: 2 min | Importance: 0.5 | Tokens: 7
   Type: human
   Content: Yes, please proceed with the upgrade

5. [VERBATIM]
   Age: 1 min | Importance: 0.95 | Tokens: 11
   Type: ai
   Content: Confirmed: Upgraded to Premium Plus. New monthly rate: $49.99


Compression Statistics:
  Total Messages: 5
  Compression Operations: 3
  Total Tokens Saved: 0


Importance-based preservation:
  • Normal importance (0.5) 30 min old → AGGRESSIVE compression
  • High importance (0.9) 28 min old → LIGHT compression
  • High importance items stay detailed 2x longer than normal items

The `ProductionHierarchicalManager` implements a sophisticated multi-factor compression system that considers both temporal and semantic factors when assigning compression tiers. 
- The `_determine_tier` method combines age-based and importance-based logic, ensuring high-importance messages (like customer IDs, commitments, and outcomes) receive preferential treatment with less aggressive compression even as they age.
- The importance threshold creates a two-track system where critical information stays verbatim or lightly compressed much longer than normal messages.
- The `recompress_all` method periodically evaluates all messages and adjusts their compression tiers based on current age and importance, creating dynamic compression that responds to changing context.
- The `HierarchyPolicy` model provides comprehensive configuration for different use cases - customer service systems might use longer recent thresholds and higher importance sensitivity, while chatbots might use more aggressive compression.
- The `get_context` method supports token budgets, automatically selecting messages to include based on available space and prioritizing recent and important content.
- In the example, a high-importance customer ID message from 30 minutes ago receives only light compression while a normal greeting from the same time receives aggressive compression, and a recent high-importance confirmation stays verbatim. This multi-dimensional approach typically achieves 40-70% overall compression in long conversations while ensuring critical information remains accessible and recent context stays detailed, making it production-ready for complex agent systems requiring sophisticated context management.