# 🧩 Mini-Lab: Context Window Limits

**Module 2: LLM Core Concepts** | **Duration: ~30 min** | **Type: Mini-Lab**

---

## Learning Objectives

By the end of this mini-lab, you will be able to:

1. **Understand** what a context window is and why it matters
2. **Compare** context window sizes across different models
3. **Implement** strategies for handling long content
4. **Recognize** the "lost in the middle" phenomenon
5. **Plan** context allocation for prompts, content, and responses

## Target Concepts

| Concept | Description |
|---------|-------------|
| Context Window | Maximum number of tokens a model can process in one request |
| Tokenization | Converting text to tokens (prerequisite from mini-tokenizer) |

## Prerequisites

- **mini-tokenizer**: Understanding of tokens and token counting

## 1. Setup

In [9]:
import os
from dotenv import load_dotenv
from openai import OpenAI
import tiktoken
from IPython.display import Markdown, display

# Helper function to render LLM output as formatted markdown
def md(text):
    """Display text as rendered markdown."""
    display(Markdown(text))

load_dotenv()
client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")  # o200k_base; counts for models with other tokenizers are approximate

def count_tokens(text):
    """Count tokens in text."""
    return len(enc.encode(text))

print("✓ Setup complete")

✓ Setup complete


## 2. Understanding Context Windows

A **context window** is the maximum number of tokens a model can process in a single request. This includes:
- System prompt
- Conversation history
- User input
- Retrieved context (RAG)
- **AND** the generated response

```
┌─────────────────────────────────────────────┐
│           CONTEXT WINDOW (128K)             │
├─────────────────────────────────────────────┤
│  [System]  [History]  [Context]  [Response] │
│    500       2000       8000        4000    │
│                                             │
│  Total: 14,500 tokens used of 128,000       │
└─────────────────────────────────────────────┘
```
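
The arithmetic in the diagram can be checked in a few lines of Python. This is a minimal sketch: the helper name `remaining_tokens` and the component labels are illustrative and not part of any library. The point it makes is that the reserved response counts against the same limit as the inputs.

```python
def remaining_tokens(context_limit, components):
    """Return (used, remaining) for a dict of component name -> token count."""
    used = sum(components.values())
    return used, context_limit - used

# Numbers taken from the diagram above
components = {"system": 500, "history": 2000, "context": 8000, "response": 4000}
used, remaining = remaining_tokens(128_000, components)
print(f"Used {used:,} of 128,000 tokens; {remaining:,} remaining")
# Used 14,500 of 128,000 tokens; 113,500 remaining
```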

In [2]:
# Model context window sizes (as of 2024)
MODEL_CONTEXTS = {
    # OpenAI
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-4-turbo": 128_000,
    "gpt-3.5-turbo": 16_385,
    "gpt-3.5-turbo-16k": 16_385,
    
    # Anthropic
    "claude-3-opus": 200_000,
    "claude-3-sonnet": 200_000,
    "claude-3-haiku": 200_000,
    
    # Google
    "gemini-1.5-pro": 1_000_000,
    "gemini-1.5-flash": 1_000_000,
    
    # Open Source
    "llama-3-70b": 8_192,
    "mistral-7b": 32_768,
}

def visualize_context_sizes():
    """Visualize context window sizes."""
    print("\n📊 Context Window Comparison")
    print("="*60)
    
    max_ctx = max(MODEL_CONTEXTS.values())
    
    for model, ctx_size in sorted(MODEL_CONTEXTS.items(), key=lambda x: -x[1]):
        bar_length = int((ctx_size / max_ctx) * 40)
        bar = "█" * bar_length
        
        # Estimate pages (assuming ~500 tokens/page)
        pages = ctx_size // 500
        
        print(f"\n{model:20} {ctx_size:>10,} tokens (~{pages:,} pages)")
        print(f"                     {bar}")

visualize_context_sizes()


📊 Context Window Comparison

gemini-1.5-pro        1,000,000 tokens (~2,000 pages)
                     ████████████████████████████████████████

gemini-1.5-flash      1,000,000 tokens (~2,000 pages)
                     ████████████████████████████████████████

claude-3-opus           200,000 tokens (~400 pages)
                     ████████

claude-3-sonnet         200,000 tokens (~400 pages)
                     ████████

claude-3-haiku          200,000 tokens (~400 pages)
                     ████████

gpt-4o                  128,000 tokens (~256 pages)
                     █████

gpt-4o-mini             128,000 tokens (~256 pages)
                     █████

gpt-4-turbo             128,000 tokens (~256 pages)
                     █████

mistra

## 3. Context Budget Planning

When designing prompts, you need to allocate your context budget wisely:

In [3]:
def plan_context_budget(model, system_prompt, user_message, context="", 
                        max_response_tokens=4000, conversation_history=""):
    """Plan context budget for an API call."""
    
    context_limit = MODEL_CONTEXTS.get(model, 128_000)
    
    # Count tokens for each component
    system_tokens = count_tokens(system_prompt)
    user_tokens = count_tokens(user_message)
    context_tokens = count_tokens(context)
    history_tokens = count_tokens(conversation_history)
    
    # Calculate totals
    input_total = system_tokens + user_tokens + context_tokens + history_tokens
    reserved_for_response = max_response_tokens
    total_needed = input_total + reserved_for_response
    remaining = context_limit - total_needed
    
    print(f"\n📐 Context Budget Plan for {model}")
    print("="*50)
    print(f"Context limit: {context_limit:,} tokens")
    print(f"\n📥 INPUT ALLOCATION:")
    print(f"   System prompt:     {system_tokens:>6,} tokens")
    print(f"   User message:      {user_tokens:>6,} tokens")
    print(f"   Retrieved context: {context_tokens:>6,} tokens")
    print(f"   History:           {history_tokens:>6,} tokens")
    print(f"   ─────────────────────────────────")
    print(f"   Input subtotal:    {input_total:>6,} tokens")
    print(f"\n📤 OUTPUT RESERVATION:")
    print(f"   Max response:      {reserved_for_response:>6,} tokens")
    print(f"\n📊 SUMMARY:")
    print(f"   Total needed:      {total_needed:>6,} tokens")
    print(f"   Remaining buffer:  {remaining:>6,} tokens")
    
    usage_percent = (total_needed / context_limit) * 100
    
    if remaining < 0:
        print(f"\n⚠️  WARNING: Over budget by {-remaining:,} tokens!")
    elif usage_percent > 90:
        print(f"\n⚠️  CAUTION: Using {usage_percent:.1f}% of context")
    else:
        print(f"\n✅ OK: Using {usage_percent:.1f}% of context")
    
    return {
        "input_tokens": input_total,
        "reserved_output": reserved_for_response,
        "remaining": remaining,
        "usage_percent": usage_percent
    }

# Example: RAG application
system = "You are a helpful assistant that answers questions based on the provided context."
user = "What are the key benefits of using transformers for NLP tasks?"
retrieved_context = "\n".join([
    "Transformers introduced self-attention mechanisms that allow parallel processing.",
    "Unlike RNNs, transformers can capture long-range dependencies efficiently.",
    "The architecture enables transfer learning through pre-trained models.",
    "BERT and GPT demonstrated state-of-the-art results across NLP benchmarks.",
] * 20)  # Simulate more context

plan_context_budget(
    model="gpt-4o-mini",
    system_prompt=system,
    user_message=user,
    context=retrieved_context,
    max_response_tokens=2000
)


📐 Context Budget Plan for gpt-4o-mini
Context limit: 128,000 tokens

📥 INPUT ALLOCATION:
   System prompt:         14 tokens
   User message:          12 tokens
   Retrieved context:    980 tokens
   History:                0 tokens
   ─────────────────────────────────
   Input subtotal:     1,006 tokens

📤 OUTPUT RESERVATION:
   Max response:       2,000 tokens

📊 SUMMARY:
   Total needed:       3,006 tokens
   Remaining buffer:  124,994 tokens

✅ OK: Using 2.3% of context


{'input_tokens': 1006,
 'reserved_output': 2000,
 'remaining': 124994,
 'usage_percent': 2.3484374999999997}
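
When a plan comes out over budget, something has to be dropped. For chat applications, one common option is to discard the oldest conversation turns first. The sketch below shows one way this could look. `trim_history` and `approx_tokens` are hypothetical names, and the 4-characters-per-token estimate is a rough assumption standing in for the `count_tokens` helper from Setup.

```python
def approx_tokens(text):
    """Crude stand-in for a real tokenizer: assumes ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages, budget, count=approx_tokens):
    """Keep the most recent messages whose combined token count fits the budget.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    Returns a new list, oldest first, without modifying the input.
    """
    kept = []
    used = 0
    for message in reversed(messages):  # Walk from newest to oldest
        cost = count(message["content"])
        if used + cost > budget:
            break  # Everything older is dropped as well
        kept.append(message)
        used += cost
    kept.reverse()  # Restore chronological order
    return kept

history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 200},       # ~50 tokens
]
trimmed = trim_history(history, budget=160)
print([m["content"][0] for m in trimmed])
# ['b', 'c']
```

In the notebook, passing `count=count_tokens` would replace the character-based estimate with real token counts. Stopping at the first message that does not fit keeps the surviving history contiguous, at the cost of sometimes leaving part of the budget unused.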

## 4. The "Lost in the Middle" Problem

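Research on long-context retrieval (for example, the 2023 paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al.) found that LLMs tend to use information at the start or end of a long context more reliably than information in the middle. The test below probes this by planting one secret code at each position. In the recorded run the model retrieved all three codes, so a single short test like this one does not always reproduce the effect.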

In [4]:
import sys
sys.path.append('../data')
from load_corpus import load_documents

def test_lost_in_middle():
    """Demonstrate the lost-in-the-middle phenomenon."""
    
    # Load real documents from the corpus as filler
    docs = load_documents()
    filler_texts = [doc['content'] for doc in docs]
    filler = "\n\n".join(filler_texts)  # Combine all documents
    
    # Take the first and last thirds of the filler so the full document
    # fits within gpt-3.5-turbo's 16,385-token context window
    filler_tokens = enc.encode(filler)
    part_len = len(filler_tokens) // 3
    filler_part1 = enc.decode(filler_tokens[:part_len])
    filler_part2 = enc.decode(filler_tokens[-part_len:])
    
    # Three different secret codes - one for each position
    code_start = "SECRET-START-1234"
    code_middle = "SECRET-MIDDLE-5678"
    code_end = "SECRET-END-9012"
    
    fact_start = f"IMPORTANT: The first secret code is {code_start}."
    fact_middle = f"IMPORTANT: The second secret code is {code_middle}."
    fact_end = f"IMPORTANT: The third secret code is {code_end}."
    
    # Build ONE document with all three codes at START, MIDDLE, and END
    document = f"{fact_start}\n\n{filler_part1}\n\n{fact_middle}\n\n{filler_part2}\n\n{fact_end}"
    total_tokens = count_tokens(document)
    
    # Questions to test retrieval from each position
    test_cases = [
        ("START", "What is the FIRST secret code?", code_start),
        ("MIDDLE", "What is the SECOND secret code?", code_middle),
        ("END", "What is the THIRD secret code?", code_end),
    ]
    
    print("\n🔬 Lost-in-the-Middle Test")
    print("="*60)
    print(f"📚 Using {len(docs)} real AI Engineering documents as filler")
    print(f"📄 Total document size: {total_tokens:,} tokens")
    print(f"\n🔑 Secret codes placed at:")
    print(f"   START:  {code_start}")
    print(f"   MIDDLE: {code_middle}")
    print(f"   END:    {code_end}")
    print("\n" + "="*60)
    
    results = []
    for position, question, expected_code in test_cases:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer based only on the provided document. Be concise."},
                {"role": "user", "content": f"Document:\n{document}\n\nQuestion: {question}"}
            ],
            temperature=0,
            max_tokens=100
        )
        
        answer = response.choices[0].message.content
        found = expected_code in answer
        results.append(found)
        
        print(f"\n📍 Code at {position}:")
        print(f"   Question: {question}")
        print(f"   Expected: {expected_code}")
        print(f"   Answer: {answer}")
        print(f"   Found correctly: {'✅ Yes' if found else '❌ No'}")
    
    # Summary
    print("\n" + "="*60)
    print("📊 SUMMARY:")
    print(f"   START code found:  {'✅' if results[0] else '❌'}")
    print(f"   MIDDLE code found: {'✅' if results[1] else '❌'}")
    print(f"   END code found:    {'✅' if results[2] else '❌'}")
    
    if not results[1] and (results[0] or results[2]):
        print("\n💡 This demonstrates the 'Lost in the Middle' phenomenon!")
        print("   The model found codes at the start/end but missed the middle.")

test_lost_in_middle()


🔬 Lost-in-the-Middle Test
📚 Using 60 real AI Engineering documents as filler
📄 Total document size: 15,494 tokens

🔑 Secret codes placed at:
   START:  SECRET-START-1234
   MIDDLE: SECRET-MIDDLE-5678
   END:    SECRET-END-9012


📍 Code at START:
   Question: What is the FIRST secret code?
   Expected: SECRET-START-1234
   Answer: The first secret code is SECRET-START-1234.
   Found correctly: ✅ Yes

📍 Code at MIDDLE:
   Question: What is the SECOND secret code?
   Expected: SECRET-MIDDLE-5678
   Answer: The second secret code is SECRET-MIDDLE-5678.
   Found correctly: ✅ Yes

📍 Code at END:
   Question: What is the THIRD secret code?
   Expected: SECRET-END-9012
   Answer: The third secret code is SECRET-END-9012.
   Found correctly: ✅ Yes

📊 SUMMARY:
   START code found:  ✅
   MIDDLE code found: ✅
   END code found:    ✅
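
One mitigation sometimes used in RAG pipelines is to reorder retrieved passages so the highest-ranked ones sit at the edges of the prompt and the weakest ones fall in the middle. The sketch below assumes the input list is already sorted from most to least relevant. `reorder_for_edges` is a hypothetical helper name, and whether the reordering helps will depend on the model and the context length.

```python
def reorder_for_edges(ranked_items):
    """Place top-ranked items at the start and end, lowest-ranked in the middle.

    ranked_items: list sorted from most relevant (index 0) to least relevant.
    """
    front, back = [], []
    for i, item in enumerate(ranked_items):
        if i % 2 == 0:
            front.append(item)  # Ranks 1, 3, 5, ... fill in from the start
        else:
            back.append(item)   # Ranks 2, 4, 6, ... fill in from the end
    return front + back[::-1]

ranked = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # doc1 = most relevant
print(reorder_for_edges(ranked))
# ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

The two most relevant passages (`doc1`, `doc2`) end up first and last, while the least relevant ones (`doc5`, `doc4`) land in the middle.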


## 5. Strategies for Long Content

When content exceeds context limits, use these strategies:

In [5]:
def chunk_text(text, max_tokens=1000, overlap_tokens=100):
    """
    Split text into overlapping chunks that fit within token limits.
    Overlap helps maintain context between chunks.
    """
    # An overlap as large as the chunk size would never advance the window
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")

    tokens = enc.encode(text)
    chunks = []
    
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))  # Clamp to the document end
        chunk_tokens = tokens[start:end]
        chunks.append({
            "text": enc.decode(chunk_tokens),
            "tokens": len(chunk_tokens),
            "start_idx": start,
            "end_idx": end
        })
        if end == len(tokens):
            break  # Last chunk reached; avoid a redundant overlap-only chunk
        start = end - overlap_tokens  # Overlap for continuity
    
    return chunks

# Example: Long document
long_document = """
Chapter 1: Introduction to AI

Artificial Intelligence (AI) has transformed numerous industries over the past decade.
From healthcare diagnostics to autonomous vehicles, AI applications continue to expand.
This chapter explores the fundamentals of AI and machine learning.

Machine learning, a subset of AI, enables computers to learn from data without explicit
programming. Deep learning, using neural networks with many layers, has achieved
remarkable success in image recognition, natural language processing, and game playing.

The transformer architecture, introduced in 2017, revolutionized NLP by enabling
parallel processing of sequences through self-attention mechanisms.
""" * 20  # Simulate a longer document

print(f"📄 Original document: {count_tokens(long_document):,} tokens")
print("\n📦 Chunking with overlap:\n")

chunks = chunk_text(long_document, max_tokens=500, overlap_tokens=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk['tokens']} tokens (positions {chunk['start_idx']}-{chunk['end_idx']})")
    print(f"   Preview: {chunk['text'][:100]}...\n")

📄 Original document: 2,401 tokens

📦 Chunking with overlap:

Chunk 1: 500 tokens (positions 0-500)
   Preview: 
Chapter 1: Introduction to AI

Artificial Intelligence (AI) has transformed numerous industries ove...

Chunk 2: 500 tokens (positions 450-950)
   Preview: , and game playing.

The transformer architecture, introduced in 2017, revolutionized NLP by enablin...

Chunk 3: 500 tokens (positions 900-1400)
   Preview:  data without explicit
programming. Deep learning, using neural networks with many layers, has achie...

Chunk 4: 500 tokens (positions 1350-1850)
   Preview:  AI applications continue to expand.
This chapter explores the fundamentals of AI and machine learni...

Chunk 5: 500 tokens (positions 1800-2300)
   Preview: .

Chapter 1: Introduction to AI

Artificial Intelligence (AI) has transformed numerous industries o...

Chunk 6: 151 tokens (positions 2250-2401)
   Preview: , and game playing.

The transformer architecture, introduced in 2017, revolutionized NLP b

In [10]:
def summarize_and_process(long_text, question, max_context=4000):
    """
    Process long content by:
    1. Chunking the content
    2. Extracting relevant parts per chunk
    3. Combining and answering
    """
    
    chunks = chunk_text(long_text, max_tokens=2000, overlap_tokens=100)
    
    print(f"\n📋 Processing {len(chunks)} chunks...")
    
    # Extract relevant info from each chunk
    relevant_extracts = []
    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Extract only the sentences relevant to this question: "{question}"
                
From this text:
{chunk['text']}

Return only relevant sentences, or 'Nothing relevant' if none found."""
            }],
            temperature=0,
            max_tokens=500
        )
        extract = response.choices[0].message.content
        if "nothing relevant" not in extract.lower():
            relevant_extracts.append(extract)
            print(f"   Chunk {i+1}: Found relevant content")
        else:
            print(f"   Chunk {i+1}: No relevant content")
    
    # Combine and answer
    combined_context = "\n---\n".join(relevant_extracts)
    
    if count_tokens(combined_context) > max_context:
        print(f"\n⚠️  Combined context too large, truncating...")
        combined_tokens = enc.encode(combined_context)[:max_context]
        combined_context = enc.decode(combined_tokens)
    
    final_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Based on the following information:

{combined_context}

Answer this question: {question}"""
        }],
        temperature=0,
        max_tokens=500
    )
    
    print(f"\n✅ Final Answer:")
    md(final_response.choices[0].message.content)
    
    return final_response.choices[0].message.content

# Test with our long document
summarize_and_process(
    long_document,
    "What is the transformer architecture and when was it introduced?"
);


📋 Processing 2 chunks...
   Chunk 1: Found relevant content
   Chunk 2: Found relevant content

✅ Final Answer:


The transformer architecture is a model introduced in 2017 that revolutionized natural language processing (NLP) by enabling parallel processing of sequences through self-attention mechanisms.

## 💡 Advanced Topic: Context Manager Class

A **ContextManager** is a utility class that helps manage context window allocation for LLM API calls. It provides methods to:

- Calculate available tokens for context after accounting for system prompts and expected response length
- Select and fit context items (like RAG retrieval results) within the available token budget  
- Prepare optimized requests that maximize the use of available context

### Key Features:
- **Token counting**: Calculate exact token usage for different text inputs
- **Budget allocation**: Determine how much space is available for retrieved context
- **Smart truncation**: Select as many context items as possible that fit within the limit
- **Request preparation**: Format everything into a ready-to-use prompt structure

### Example Use Cases:
- RAG (Retrieval-Augmented Generation) systems that need to fit multiple retrieved documents
- Chat applications with long conversation histories
- Multi-document summarization tasks
- Any scenario where you need to optimize context usage
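
To make the features above concrete, here is a minimal, non-production sketch of what such a class could look like. Everything in it is an assumption made for illustration: the method names (`available_for_context`, `fit_items`, `prepare_request`), the greedy selection strategy, and the fallback estimate of roughly 4 characters per token used when no tokenizer is supplied.

```python
class ContextManager:
    """Minimal sketch of a context-budget helper (not production-ready)."""

    def __init__(self, context_limit, count_tokens=None):
        self.context_limit = context_limit
        # Assumption: ~4 characters per token when no real tokenizer is supplied
        self.count_tokens = count_tokens or (lambda text: max(1, len(text) // 4))

    def available_for_context(self, system_prompt, user_message, max_response_tokens):
        """Tokens left for retrieved context after prompts and the reserved response."""
        fixed = self.count_tokens(system_prompt) + self.count_tokens(user_message)
        return max(0, self.context_limit - fixed - max_response_tokens)

    def fit_items(self, items, budget):
        """Greedily keep items, in the given order, that still fit in the budget."""
        selected, used = [], 0
        for item in items:
            cost = self.count_tokens(item)
            if used + cost <= budget:
                selected.append(item)
                used += cost
        return selected, used

    def prepare_request(self, system_prompt, user_message, items, max_response_tokens):
        """Build a chat-style messages list with as much context as fits."""
        budget = self.available_for_context(system_prompt, user_message, max_response_tokens)
        selected, used = self.fit_items(items, budget)
        context_block = "\n\n".join(selected)
        return {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {user_message}"},
            ],
            "max_tokens": max_response_tokens,
            "context_tokens": used,
        }

manager = ContextManager(context_limit=200)
request = manager.prepare_request(
    system_prompt="Answer from the context.",  # 24 chars -> 6 estimated tokens
    user_message="What is a token?",           # 16 chars -> 4 estimated tokens
    items=["x" * 200, "y" * 400, "z" * 80],    # 50, 100, and 20 estimated tokens
    max_response_tokens=100,
)
print(request["context_tokens"])
# 70
```

With a 200-token limit, 10 tokens of prompts, and 100 tokens reserved for the response, 90 tokens remain: the first and third items fit and the second is skipped. In the notebook, passing `count_tokens=count_tokens` from Setup would give real counts. The sketch does not count the wrapper text or separators it adds around the items, which is one reason to leave a safety buffer.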

> **📚 Note**: Implementing a production-ready ContextManager involves additional considerations such as token estimation accuracy, chunk prioritization strategies, and error handling. These are covered in depth in later courses on LLM application development and RAG system architecture.

In [None]:
print("💡 Context Manager is an advanced topic for production LLM applications")
print("📚 It will be covered in detail in advanced courses on RAG and LLM app development")

💡 Context Manager is an advanced topic for production LLM applications
📚 It will be covered in detail in advanced courses on RAG and LLM app development



## 🎯 Summary

### Key Takeaways

1. **Context Window Basics**
   - Context includes: system + history + user + context + response
   - Different models have vastly different limits (8K to 1M tokens)
   - Always reserve space for the response

2. **Budget Planning**
   - Calculate token usage before API calls
   - Prioritize most relevant content
   - Leave buffer for safety

3. **Lost in the Middle**
   - Models attend more to beginning and end
   - Place important information strategically
   - Consider reranking retrieved content

4. **Long Content Strategies**
   - Chunking with overlap
   - Extract-then-synthesize
   - Use a context manager to fit content within the token budget

### Next Steps

- **mini-temperature**: Learn about generation parameters
- **mini-sampling**: Explore Top-K and Top-P
- **Module 4**: Apply context management in RAG systems