# Context Engineering: Short-Term Memory Management with Sessions from OpenAI Agents SDK 

In this cookbook, we’ll explore how to **manage context effectively using the `Session` object from the [OpenAI Agents SDK](https://github.com/openai/openai-agents-python)**.

AI agents often operate in **long-running, multi-turn interactions**, where keeping the right balance of context is critical. If too much is carried forward, the model risks distraction, inefficiency, or outright failure. If too little is preserved, the agent loses coherence. This guide focuses on two proven context management techniques—**trimming** and **compression**—to keep agents fast, reliable, and cost-efficient.


#### Why Context Management Matters

* **Sustained coherence across long threads** – Keep the agent anchored to the latest user goal without dragging along stale details. Session-level trimming and summaries prevent “yesterday’s plan” from overriding today’s ask.
* **Higher tool-call accuracy** – Focused context improves function selection and argument filling, reducing retries, timeouts, and cascading failures during multi-tool runs.
* **Lower latency & cost** – Smaller, sharper prompts cut tokens per turn and attention load.
* **Error & hallucination containment** – Summaries act as “clean rooms” that correct or omit prior mistakes; trimming avoids amplifying bad facts (“context poisoning”) turn after turn.
* **Easier debugging & observability** – Stable summaries and bounded histories make logs comparable: you can diff summaries, attribute regressions, and reproduce failures reliably.
* **Multi-issue and handoff resilience** – In multi-problem chats, per-issue mini-summaries let the agent pause/resume, escalate to humans, or hand off to another agent while staying consistent.


![Memory Comparison in AI Agents](../../images/memory_comparison.jpg)


#### Real-World Scenario

We’ll ground the techniques in a practical example of a common long-running task:

* **Multi-turn Customer Service Conversations**
In extended conversations about tech products—spanning both hardware and software—customers often surface multiple issues over time. The agent must stay consistent and goal-focused while retaining only the essentials rather than hauling along every past detail.

#### Techniques Covered

To address these challenges, we introduce two concrete approaches using the OpenAI Agents SDK:

1. **Trimming Messages** – dropping older turns while keeping the last N turns.
2. **Summarizing Messages** – compressing prior exchanges into structured, shorter representations.


<br>

## Prerequisites

Before running this cookbook, you must set up the following accounts and complete a few setup actions. These prerequisites are essential to interact with the APIs used in this project.

#### Step 0: OpenAI Account

- **Purpose:**  
  You need an OpenAI account to access language models and use the Agents SDK featured in this cookbook.

- **Action:**  
  [Sign up for an OpenAI account](https://openai.com) if you don’t already have one. Once you have an account, create an API key by visiting the [OpenAI API Keys page](https://platform.openai.com/api-keys).

#### Step 1: Install the Required Libraries

Below we install the `openai-agents` library (the [OpenAI Agents SDK](https://github.com/openai/openai-agents-python)) along with `nest_asyncio`.

In [None]:
%pip install openai-agents nest_asyncio

In [1]:
from openai import OpenAI

client = OpenAI()

In [2]:
from agents import set_tracing_disabled
set_tracing_disabled(True)

Let's test the installed libraries by defining and running an agent.

In [3]:
import asyncio
from agents import Agent, Runner


agent = Agent(
    name="Assistant",
    instructions="Reply very concisely.",
)

result = await Runner.run(agent, "Tell me why it is important to evaluate AI agents.")
print(result.final_output)


Evaluating AI agents ensures reliability, safety, ethical alignment, performance accuracy, and helps avoid biases, improving overall trust and effectiveness.


### Define Agents

We start by defining the agent with the Agents SDK.

#### Customer Service Agent

In [214]:
support_agent = Agent(
    name="Customer Support Assistant",
    model="gpt-5",
    instructions=(
        "You are a patient, step-by-step IT support assistant. "
        "Your role is to help customers troubleshoot and resolve issues with devices and software. "
        "Guidelines:\n"
        "- Be concise and use numbered steps where possible.\n"
        "- Ask only one focused, clarifying question at a time before suggesting next actions.\n"
        "- Track and remember multiple issues across the conversation; update your understanding as new problems emerge.\n"
        "- When a problem is resolved, briefly confirm closure before moving to the next.\n"
    )
)


## 1. Context Trimming 

#### Implement Custom Session Object

We use the [Session](https://openai.github.io/openai-agents-python/sessions/) object from the [OpenAI Agents Python SDK](https://openai.github.io/openai-agents-python/). Below is a `TrimmingSession` implementation that **keeps only the last N turns**, where a “turn” is one user message and everything until the next user message, including the assistant reply and any tool calls/results. It is in-memory and trims automatically on every write and read.


In [6]:
from __future__ import annotations

import asyncio
from collections import deque
from typing import Any, Deque, Dict, List, Tuple, cast

from agents.memory.session import SessionABC
from agents.items import TResponseInputItem  # typically a dict-like item


def _is_user_msg(item: TResponseInputItem) -> bool:
    """
    Heuristic: treat items with role == 'user' as user messages.
    This covers both {"role": "user", ...} and {"type": "message", "role": "user", ...}.
    Adjust this if your item schema differs.
    """
    if isinstance(item, dict):
        return cast(Dict[str, Any], item).get("role") == "user"
    # Extend here if you carry custom classes with a .role attribute:
    return getattr(item, "role", None) == "user"


class TrimmingSession(SessionABC):
    """
    Custom session that keeps only the last N user-turns.
    A 'turn' is defined as a user message and all subsequent items
    (assistant/tool calls/results) up to—but not including—the next user message.

    Works entirely in memory. If you need persistence, replace the in-memory
    deque with your storage of choice (SQLite/Redis/etc.), preserving the
    trimming logic in `_trim_to_last_turns`.
    """

    def __init__(self, session_id: str, max_turns: int = 8):
        self.session_id = session_id
        self.max_turns = max(1, max_turns)
        self._items: Deque[TResponseInputItem] = deque()  # full chronological log
        self._lock = asyncio.Lock()

    # ---- SessionABC API ----

    async def get_items(self, limit: int | None = None) -> List[TResponseInputItem]:
        """
        Return the history trimmed to the last N turns.
        If `limit` is provided, return at most that many most-recent items
        from within the trimmed history.
        """
        async with self._lock:
            trimmed = self._trim_to_last_turns(list(self._items))
            if limit is not None and limit >= 0:
                # Guard limit == 0: trimmed[-0:] would return the whole list.
                return trimmed[-limit:] if limit > 0 else []
            return trimmed

    async def add_items(self, items: List[TResponseInputItem]) -> None:
        """
        Append new items, then trim to last N turns.
        """
        if not items:
            return
        async with self._lock:
            self._items.extend(items)
            # Trim in place by rebuilding from trimmed list
            trimmed = self._trim_to_last_turns(list(self._items))
            self._items.clear()
            self._items.extend(trimmed)

    async def pop_item(self) -> TResponseInputItem | None:
        """
        Remove and return the most recent item (post-trim).
        """
        async with self._lock:
            if not self._items:
                return None
            return self._items.pop()

    async def clear_session(self) -> None:
        """
        Remove all items for this session.
        """
        async with self._lock:
            self._items.clear()

    # ---- Helpers ----

    def _trim_to_last_turns(self, items: List[TResponseInputItem]) -> List[TResponseInputItem]:
        """
        Keep only the suffix of `items` that contains the last `max_turns` user messages
        and everything after the earliest of those user messages.

        Algorithm:
          1) Scan from the end to find indices of the last `max_turns` user messages.
          2) Cut history to start from the earliest of those (inclusive).
        Edge cases:
          - If there are fewer than `max_turns` user messages, keep entire history.
          - If there are no user messages yet, treat all existing items as a single turn and keep them.
        """
        if not items:
            return items

        # Find indices of user messages scanning from the end
        user_indices: List[int] = []
        for idx in range(len(items) - 1, -1, -1):
            if _is_user_msg(items[idx]):
                user_indices.append(idx)
                if len(user_indices) >= self.max_turns:
                    break

        if not user_indices:
            # No user messages yet; keep everything
            return items

        # The earliest index among the last N user messages
        cut_from = min(user_indices)  # since we collected from the end
        return items[cut_from:]

    # ---- Optional convenience API (not part of SessionABC) ----

    async def set_max_turns(self, max_turns: int) -> None:
        async with self._lock:
            self.max_turns = max(1, int(max_turns))
            trimmed = self._trim_to_last_turns(list(self._items))
            self._items.clear()
            self._items.extend(trimmed)

    async def raw_items(self) -> List[TResponseInputItem]:
        """Return the untrimmed in-memory log (for debugging)."""
        async with self._lock:
            return list(self._items)


Let's instantiate the custom session we implemented.

In [None]:
# Keep only the last 3 turns (user + assistant/tool interactions)
session = TrimmingSession("my_session", max_turns=3)

**How to choose the right `max_turns`?**

Determining this parameter usually requires experimentation with your conversation history. One approach is to extract the total number of turns across conversations and analyze their distribution. Another option is to use an LLM to evaluate conversations—identifying how many tasks or issues each one contains and calculating the average number of turns needed per issue.
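
One way to start this experimentation is to count user turns per conversation and inspect the distribution. The sketch below assumes each logged conversation is a list of `{"role": ..., "content": ...}` dicts. The `conversations` data and the `suggest_max_turns` helper are hypothetical, and the 90th percentile is only an illustrative starting point, not a recommendation from the SDK.

```python
import math
from typing import Any, Dict, List


def count_user_turns(conversation: List[Dict[str, Any]]) -> int:
    """Count turns, where each turn starts with a user message."""
    return sum(1 for item in conversation if item.get("role") == "user")


def suggest_max_turns(
    conversations: List[List[Dict[str, Any]]], percentile: float = 0.9
) -> int:
    """Return the turn count that covers `percentile` of conversations (nearest rank)."""
    counts = sorted(count_user_turns(c) for c in conversations)
    if not counts:
        return 1
    rank = max(1, math.ceil(percentile * len(counts)))
    return max(1, counts[rank - 1])


# Hypothetical logged conversations (stand-ins for your own history).
conversations = [
    [{"role": "user", "content": "a"}, {"role": "assistant", "content": "b"}],
    [{"role": "user", "content": "a"}, {"role": "assistant", "content": "b"},
     {"role": "user", "content": "c"}, {"role": "assistant", "content": "d"}],
    [{"role": "user", "content": "a"}] * 5,
]

print(suggest_max_turns(conversations))                  # 5
print(suggest_max_turns(conversations, percentile=0.5))  # 2
```

A percentile-based value keeps most conversations untrimmed while bounding the rare very long ones; whether that trade-off suits your use case still needs to be validated against real traffic.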


In [27]:
message = "There is a red light on the dashboard."

In [28]:
result = await Runner.run(
    support_agent,
    message,
    session=session
)

In [29]:
conversation = await session.get_items()


In [30]:
conversation

[{'content': 'There is a red light on the dashboard.', 'role': 'user'},
 {'id': 'rs_68ba0a2abf3c8196adc68a947b457ae0001b849827e34d0e',
  'summary': [],
  'type': 'reasoning',
  'content': []},
 {'id': 'msg_68ba0a3480748196b7059d0e23d87350001b849827e34d0e',
  'content': [{'annotations': [],
    'text': 'Which device or system is the dashboard on (e.g., car, printer, router, software)?',
    'type': 'output_text',
    'logprobs': []}],
  'role': 'assistant',
  'status': 'completed',
  'type': 'message'}]

In [32]:
len(conversation)

3

In [31]:
conversation[0]['role'], conversation[0]['content']

('user', 'There is a red light on the dashboard.')

In [50]:
# Example flow
await session.add_items([{"role": "user", "content": "Hi, my router won't connect."}])
await session.add_items([{"role": "assistant", "content": "Let's check your firmware version."}])
await session.add_items([{"role": "user", "content": "Firmware v1.0.3; still failing."}])
await session.add_items([{"role": "assistant", "content": "Try a factory reset."}])
await session.add_items([{"role": "user", "content": "Reset done; error 42 now."}])
await session.add_items([{"role": "assistant", "content": "test1"}])
# At this point, everything *before* the earliest of the last `max_turns` user messages
# has been dropped (not summarized); the remaining turns are kept verbatim.
# The output below shows two retained turns, i.e. the result you get with max_turns=2.

history = await session.get_items()
# Pass `history` into your agent runner / responses call as the conversation context.


In [51]:
len(history)

4

In [52]:
history

[{'role': 'user', 'content': 'Firmware v1.0.3; still failing.'},
 {'role': 'assistant', 'content': 'Try a factory reset.'},
 {'role': 'user', 'content': 'Reset done; error 42 now.'},
 {'role': 'assistant', 'content': 'test1'}]

Below, you can see how the trimming session works for max_turns=3.

![Context Trimming in Session](../../images/trimingSession.jpg)

**What counts as a “turn”**

* A **turn** = one **user** message **plus everything that follows it** (assistant replies, reasoning, tool calls, tool results) **until the next user message**.

**When trimming happens**

* On **write**: `add_items(...)` appends the new items, then immediately trims the stored history.
* On **read**: `get_items(...)` returns a **trimmed** view (so even if you bypassed a write, reads won’t leak old turns).

**How it decides what to keep**

1. Treat any item with `role == "user"` as a **user message** (via `_is_user_msg`).
2. Scan the history **backwards** and collect the indices of the last **N** user messages (`max_turns`).
3. Find the **earliest** index among those N user messages.
4. **Keep everything from that index to the end**; drop everything before it.

That preserves each complete turn boundary: if the earliest kept user message is at index `k`, you also keep all assistant/tool items that came after `k`.

**Edge cases**

* **Fewer than N user messages**: keep **everything** (no trimming yet).
* **No user messages**: keep **everything** (treat as a single in-progress turn).
* **`limit` in `get_items(limit=…)`**: applied **after** trimming; returns only the last `limit` items of the already-trimmed slice.

**Tiny example**

History (old → new):

```
0: user("Hi")
1: assistant("Hello!")
2: tool_call("lookup")
3: tool_result("…")
4: user("It didn't work")
5: assistant("Try rebooting")
6: user("Rebooted, now error 42")
7: assistant("On it")
```

With `max_turns = 2`, the last two user messages are at indices **4** and **6**.
Earliest of those is **4** → keep items **4..7**, drop **0..3**.
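
The tiny example can be reproduced with a standalone sketch of the same rule. The `trim_to_last_turns` function below is a simplified, hypothetical stand-in for `TrimmingSession._trim_to_last_turns`, and the tool-call item shapes are placeholders rather than the SDK's real item schema.

```python
from typing import Any, Dict, List


def trim_to_last_turns(items: List[Dict[str, Any]], max_turns: int) -> List[Dict[str, Any]]:
    """Keep the suffix that starts at the earliest of the last `max_turns` user messages."""
    user_indices = [i for i, item in enumerate(items) if item.get("role") == "user"]
    if not user_indices:
        return items  # no user message yet: keep everything
    last_n = user_indices[-max_turns:]  # assumes max_turns >= 1
    return items[last_n[0]:]


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"type": "tool_call", "name": "lookup"},      # placeholder shape
    {"type": "tool_result", "output": "..."},     # placeholder shape
    {"role": "user", "content": "It didn't work"},
    {"role": "assistant", "content": "Try rebooting"},
    {"role": "user", "content": "Rebooted, now error 42"},
    {"role": "assistant", "content": "On it"},
]

kept = trim_to_last_turns(history, max_turns=2)
print(len(kept))           # 4
print(kept[0]["content"])  # It didn't work
```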

**Why this works well**

* You always keep **complete** turns, so the assistant retains the immediate context it needs (both the user’s last asks and the assistant/tool steps in between).
* It prevents context bloat by discarding older turns wholesale, not just messages.

**Customization knobs**

* Change `max_turns` at init or via `set_max_turns(...)`.
* Adjust `_is_user_msg(...)` if your item schema differs.
* If you’d rather cap by **message count** or **tokens**, replace `_trim_to_last_turns(...)` or add a second pass that measures tokens.
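
As an illustration of the last knob, here is a minimal sketch of a token-budget pass that drops whole turns from the front. The helper names are hypothetical, and `estimate_tokens` uses a crude "about 4 characters per token" assumption; swap in a real tokenizer if you need accurate counts.

```python
from typing import Any, Dict, List


def estimate_tokens(item: Dict[str, Any]) -> int:
    """Very rough estimate (~4 characters per token). This is an assumption, not a tokenizer."""
    return max(1, len(str(item.get("content", ""))) // 4)


def trim_to_token_budget(items: List[Dict[str, Any]], max_tokens: int) -> List[Dict[str, Any]]:
    """Drop whole turns from the front until the estimated total fits the budget.

    The most recent turn is always kept, even if it alone exceeds the budget.
    """
    if sum(estimate_tokens(it) for it in items) <= max_tokens:
        return items
    # Indices where a turn starts (each user message).
    starts = [i for i, it in enumerate(items) if it.get("role") == "user"]
    if not starts:
        return items
    for start in starts:
        kept = items[start:]
        if sum(estimate_tokens(it) for it in kept) <= max_tokens:
            return kept
    return items[starts[-1]:]


history = [
    {"role": "user", "content": "a" * 40},       # ~10 tokens
    {"role": "assistant", "content": "b" * 40},  # ~10 tokens
    {"role": "user", "content": "c" * 40},       # ~10 tokens
    {"role": "assistant", "content": "d" * 40},  # ~10 tokens
]

print(len(trim_to_token_budget(history, max_tokens=25)))  # 2
```

Cutting only at turn boundaries keeps the same property as the turn-based trimmer: the model never sees an assistant or tool item without the user message that triggered it.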


## 2. Context Summarization 

Once the history exceeds `context_limit` turns, the session keeps the most recent N user turns intact and **summarizes everything older into two synthetic messages**:

* `user`: *"Summarize the conversation we had so far."*
* `assistant`: *{generated summary}*

The shadow user prompt requesting the summary is added so the conversation keeps a natural user→assistant flow and the summary does not appear as an unprompted assistant message. The generated summary is then injected as the assistant message.

**Summarization Prompt**



A well-crafted summarization prompt is essential for preserving the context of a conversation, and it should always be tailored to the specific use case. Think of it like **being a customer support agent handing off a case to the next agent**. What concise yet critical details would they need to continue smoothly? The prompt should strike the right balance: not overloaded with unnecessary information, but not so sparse that key context is lost. Achieving this balance requires careful design and ongoing experimentation to fine-tune the level of detail.

In [66]:
SUMMARY_PROMPT = """
You are a senior customer-support assistant for tech devices, setup, and software issues.
Compress the earlier conversation into a precise, reusable snapshot for future turns.

Before you write (do this silently):
- Contradiction check: compare user claims with system instructions and tool definitions/logs; note any conflicts or reversals.
- Temporal ordering: sort key events by time; the most recent update wins. If timestamps exist, keep them.
- Hallucination control: if any fact is uncertain/not stated, mark it as UNVERIFIED rather than guessing.

Write a structured, factual summary ≤ 200 words using the sections below (use the exact headings):

• Product & Environment:
  - Device/model, OS/app versions, network/context if mentioned.

• Reported Issue:
  - Single-sentence problem statement (latest state).

• Steps Tried & Results:
  - Chronological bullets (include tool calls + outcomes, errors, codes).

• Identifiers:
  - Ticket #, device serial/model, account/email (only if provided).

• Timeline Milestones:
  - Key events with timestamps or relative order (e.g., 10:32 install → 10:41 error).

• Tool Performance Insights:
  - What tool calls worked/failed and why (if evident).

• Current Status & Blockers:
  - What’s resolved vs pending; explicit blockers preventing progress.

• Next Recommended Step:
  - One concrete action (or two alternatives) aligned with policies/tools.

Rules:
- Be concise, no fluff; use short bullets, verbs first.
- Do not invent new facts; quote error strings/codes exactly when available.
- If previous info was superseded, note “Superseded:” and omit details unless critical.
"""


**Key Principles for Designing Memory Summarization Prompts**

* **Milestones:** Highlight important events in the conversation—for example, when an issue is resolved, valuable information is uncovered, or all necessary details have been collected.

* **Contradiction Check:** Ensure the summary does not conflict with itself, system instructions or tool definitions. This is especially critical for reasoning models, which are more prone to conflicts in the context.

* **Timestamps & Temporal Flow:** Incorporate the timing of events in the summary. This helps the model reason about updates in sequence and reduces confusion about which piece of information is the most recent.

* **Chunking:** Organize details into categories or sections rather than long paragraphs. Structured grouping improves an LLM’s ability to understand relationships between pieces of information.

* **Tool Performance Insights:** Capture lessons learned from multi-turn, tool-enabled interactions—for example, noting which tools worked effectively for specific queries and why. These insights are valuable for guiding future steps.

* **Guidance & Examples:** Steer the summary with clear guidance. Where possible, extract concrete examples from the conversation history to make future turns more grounded and context-rich.

* **Hallucination Control:** Be precise in what you include. Even minor hallucinations in a summary can propagate forward, contaminating future context with inaccuracies.

* **Use Case Specificity:** Tailor the compression prompt to the specific use case. Think about how a human would track and recall information in working memory while solving the same task.

* **Model Choice:** Select a summarizer model based on use case requirements, summary length, and tradeoffs between latency and cost. In some cases, using the same model as the agent itself can be advantageous.
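
Several of these principles (chunking, hallucination control) are easier to enforce if generated summaries are checked before they are injected into the context. The sketch below is a hypothetical guardrail, not part of the SDK: it only verifies that the headings requested in `SUMMARY_PROMPT` are present and that the word limit is respected.

```python
from typing import List, Tuple

# Headings copied from the summarization prompt above.
REQUIRED_HEADINGS = [
    "Product & Environment:",
    "Reported Issue:",
    "Steps Tried & Results:",
    "Identifiers:",
    "Timeline Milestones:",
    "Tool Performance Insights:",
    "Current Status & Blockers:",
    "Next Recommended Step:",
]


def check_summary(summary: str, max_words: int = 200) -> Tuple[bool, List[str]]:
    """Return (ok, problems) for a generated summary."""
    problems = [f"missing heading: {h}" for h in REQUIRED_HEADINGS if h not in summary]
    word_count = len(summary.split())
    if word_count > max_words:
        problems.append(f"too long: {word_count} words > {max_words}")
    return (not problems, problems)


# A well-formed (if empty) summary passes; a partial one is flagged.
good = "\n".join(f"• {h}\n  - UNVERIFIED" for h in REQUIRED_HEADINGS)
print(check_summary(good))                                      # (True, [])
print(len(check_summary("Reported Issue: router down")[1]))     # 7
```

A structural check like this cannot detect factual errors; it only catches summaries that drift from the requested format, which is often an early sign that the prompt or model needs adjusting.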


In [219]:
import asyncio
from typing import Any, Dict, List, Tuple


class LLMSummarizer:
    def __init__(self, client, model="gpt-4o", max_tokens=400, tool_trim_limit=600):
        self.client = client
        self.model = model
        self.max_tokens = max_tokens            # cap on summary length (output tokens)
        self.tool_trim_limit = tool_trim_limit  # max characters kept per tool result

    async def summarize(self, messages: List[Dict[str, Any]]) -> Tuple[str, str]:
        """Return (shadow user prompt, generated summary) for `messages`."""
        user_shadow = "Summarize the conversation we had so far."
        # Map history into a compact prompt
        history_snippets = []
        for m in messages:
            role = m.get("role", "assistant")
            content = m.get("content") or ""
            if not isinstance(content, str):
                # Structured content (e.g. a list of parts) is flattened to text
                content = str(content)
            content = content.strip()
            if not content:
                continue
            # Trim very long tool blobs
            if role in ("tool", "tool_result") and len(content) > self.tool_trim_limit:
                content = content[:self.tool_trim_limit] + " …"
            history_snippets.append(f"{role.upper()}: {content}")

        # Example using Responses; adapt if you use SDK Agents runs instead
        prompt_messages = [
            {"role": "system", "content": SUMMARY_PROMPT},
            {"role": "user", "content": "\n".join(history_snippets)},
        ]
        print(len(prompt_messages))  # debug: number of messages sent to the summarizer
        # The synchronous client call runs in a worker thread so the event loop is not blocked
        resp = await asyncio.to_thread(
            self.client.responses.create,
            model=self.model,
            input=prompt_messages,
            max_output_tokens=self.max_tokens,
        )
        return user_shadow, resp.output_text
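
To exercise the summarization flow without API calls, a stub with the same `summarize` coroutine can stand in for `LLMSummarizer`. `StaticSummarizer` below is a hypothetical test double, not part of the SDK; it returns a deterministic pseudo-summary built from the input messages.

```python
import asyncio
from typing import Any, Dict, List, Tuple


class StaticSummarizer:
    """Hypothetical offline stand-in exposing the same `summarize` coroutine."""

    async def summarize(self, messages: List[Dict[str, Any]]) -> Tuple[str, str]:
        user_shadow = "Summarize the conversation we had so far."
        # Deterministic "summary": role plus a truncated content snippet per message.
        lines = [
            f"- {m.get('role', 'assistant')}: {str(m.get('content', ''))[:60]}"
            for m in messages
        ]
        return user_shadow, "\n".join(lines)


# In a notebook, use `await StaticSummarizer().summarize(...)` instead of asyncio.run.
shadow, summary = asyncio.run(
    StaticSummarizer().summarize([
        {"role": "user", "content": "Hi, my router won't connect."},
        {"role": "assistant", "content": "Let's check your firmware version."},
    ])
)
print(summary)
```

Any object with this coroutine signature can be passed as the `summarizer` argument, which makes the session logic testable independently of the model.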

In [250]:
import asyncio
import itertools
from collections import deque
from typing import Optional, List, Tuple, Dict, Any

class SummarizingSession:
    """
    Keeps the last N *user turns* verbatim (keep_last_n_turns).
    A turn = one real user message + everything that follows it (assistant replies,
    reasoning, tool calls, tool results) until the next real user message.
    Summarizes everything before that into a synthetic user→assistant pair.

    Summarization is triggered once the number of *real* user turns
    (non-synthetic 'user' messages) exceeds `context_limit`.

    Internally stores (message, metadata) records. Exposes:
      - get_items(): model-safe messages only (no metadata)
      - get_full_history(): [{ "message": msg, "metadata": meta }, ...]
    """

    # Only these keys are sent to the model. Everything else goes to metadata.
    _ALLOWED_MSG_KEYS = {"role", "content", "name"}

    def __init__(
        self,
        keep_last_n_turns: int = 3,
        context_limit: int = 3,
        summarizer: Optional["Summarizer"] = None,
        session_id: Optional[str] = None,
    ):
        assert context_limit >= 1
        assert keep_last_n_turns >= 0
        assert keep_last_n_turns <= context_limit, "keep_last_n_turns should not be greater than context_limit"
        self.keep_last_n_turns = keep_last_n_turns
        self.context_limit = context_limit
        # Each record: {"msg": {...}, "meta": {...}}
        self._records: deque[Dict[str, Dict[str, Any]]] = deque()
        self._lock = asyncio.Lock()
        self.session_id = session_id or "default"
        self.summarizer = summarizer

    # --------- public API used by your runner ---------

    async def get_items(self, limit: Optional[int] = None) -> List[Dict[str, Any]]:
        """
        Returns messages in a model-safe shape (no metadata).
        Runner.run(..., session=self) should call this.
        """
        async with self._lock:
            data = list(self._records)
        msgs = [self._sanitize_for_model(rec["msg"]) for rec in data]
        return msgs[-limit:] if limit else msgs

    async def add_items(self, items: List[Dict[str, Any]]) -> None:
        async with self._lock:
            for it in items:
                msg, meta = self._split_msg_and_meta(it)
                self._records.append({"msg": msg, "meta": meta})
            need_summary, boundary_idx = self._should_summarize_locked()

        if need_summary:
            async with self._lock:
                prefix_records = list(itertools.islice(self._records, 0, boundary_idx))
                prefix_msgs = [r["msg"] for r in prefix_records]

            user_shadow, assistant_summary = await self._summarize(prefix_msgs)

            async with self._lock:
                need_summary_now, boundary_idx_now = self._should_summarize_locked()
                if not need_summary_now:
                    # normalize anyway if summarization got skipped
                    self._normalize_synthetic_flags_locked()
                    return

                suffix_records = list(itertools.islice(self._records, boundary_idx_now, None))
                self._records.clear()

                # Synthetic summary pair keeps synthetic=True
                self._records.extend([
                    {
                        "msg": {"role": "user", "content": user_shadow},
                        "meta": {
                            "synthetic": True,
                            "kind": "history_summary_prompt",
                            "summary_for_turns": f"< all before idx {boundary_idx_now} >",
                        },
                    },
                    {
                        "msg": {"role": "assistant", "content": assistant_summary},
                        "meta": {
                            "synthetic": True,
                            "kind": "history_summary",
                            "summary_for_turns": f"< all before idx {boundary_idx_now} >",
                        },
                    },
                ])
                self._records.extend(suffix_records)

                # ✅ Ensure all real messages explicitly have synthetic=False
                self._normalize_synthetic_flags_locked()
        else:
            # ✅ Even when we don't summarize, enforce the invariant
            async with self._lock:
                self._normalize_synthetic_flags_locked()

    async def pop_item(self) -> Optional[Dict[str, Any]]:
        async with self._lock:
            if not self._records:
                return None
            rec = self._records.pop()
            return dict(rec["msg"])  # model-safe

    async def clear_session(self) -> None:
        async with self._lock:
            self._records.clear()

    def set_max_turns(self, n: int) -> None:
        """
        Back-compat: interpret as updating context_limit.
        Ensures keep_last_n_turns <= context_limit.
        """
        assert n >= 1
        self.context_limit = n
        if self.keep_last_n_turns > self.context_limit:
            self.keep_last_n_turns = self.context_limit

    # --------- full-history (for debugging/analytics/observability) ---------

    # ✅ Backfill safeguard for older records that might lack the flag
    def _normalize_synthetic_flags_locked(self) -> None:
        for rec in self._records:
            role = rec["msg"].get("role")
            if role in ("user", "assistant") and "synthetic" not in rec["meta"]:
                rec["meta"]["synthetic"] = False

    
    async def get_full_history(self, limit: Optional[int] = None) -> List[Dict[str, Any]]:
        """
        Returns combined history where each entry is:
          { "message": {role, content[, name]}, "metadata": {...} }
        This is NOT sent to the model; it's for your logs/UI/debugging.
        """
        async with self._lock:
            data = list(self._records)
        out = [{"message": dict(rec["msg"]), "metadata": dict(rec["meta"])} for rec in data]
        return out[-limit:] if limit else out

    # Backwards-compatible alias if you were using this name before
    async def get_items_with_metadata(self, limit: Optional[int] = None) -> List[Dict[str, Any]]:
        return await self.get_full_history(limit)

    # --------- helpers ---------

    def _split_msg_and_meta(self, it: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        msg = {k: v for k, v in it.items() if k in self._ALLOWED_MSG_KEYS}
        extra = {k: v for k, v in it.items() if k not in self._ALLOWED_MSG_KEYS}
        meta = dict(extra.pop("metadata", {}))
        meta.update(extra)

        if "role" not in msg or "content" not in msg:
            msg.setdefault("role", "user")
            msg.setdefault("content", str(it))

        # ✅ Default synthetic flag for real (non-summarized) messages
        role = msg.get("role")
        if role in ("user", "assistant") and "synthetic" not in meta:
            meta["synthetic"] = False
        return msg, meta

    def _sanitize_for_model(self, msg: Dict[str, Any]) -> Dict[str, Any]:
        """
        Strictly keep only allowed keys for model input.
        """
        return {k: v for k, v in msg.items() if k in self._ALLOWED_MSG_KEYS}

    def _is_user(self, rec: Dict[str, Dict[str, Any]]) -> bool:
        return rec["msg"].get("role") == "user"

    def _should_summarize_locked(self) -> Tuple[bool, int]:
        """
        Trigger summarization if the number of *real* user turns exceeds `context_limit`.

        Keep the last `keep_last_n_turns` *turns* verbatim:
        find the earliest index among the last `keep_last_n_turns` real user messages;
        everything before that index becomes the summarization prefix.

        Returns: (need_summary: bool, boundary_idx: int)
        """
        # Collect indices of real user messages (turn starts)
        user_idxs: List[int] = []
        for i, rec in enumerate(self._records):
            if self._is_user(rec) and not rec["meta"].get("synthetic", False):
                user_idxs.append(i)

        real_turns = len(user_idxs)
        if real_turns <= self.context_limit:
            return False, -1

        # Determine boundary according to "turns"
        if self.keep_last_n_turns == 0:
            # summarize everything; keep no turns verbatim
            boundary = len(self._records)
        else:
            if len(user_idxs) < self.keep_last_n_turns:
                return False, -1  # defensive; should not happen due to the check above
            # earliest index among the last N real user-turn starts
            boundary = user_idxs[-self.keep_last_n_turns]

        # If boundary is 0 and we intend to keep >=1 turn, there's nothing before to summarize.
        if boundary <= 0 and self.keep_last_n_turns > 0:
            return False, -1

        return True, boundary

    async def _summarize(self, prefix_msgs: List[Dict[str, Any]]) -> Tuple[str, str]:
        """
        Adapter to your summarizer. Provide *model-safe* messages only.
        """
        if not self.summarizer:
            # Fallback summary if no summarizer is configured
            return ("Summarize the conversation we had so far.", "Summary unavailable.")
        # Only send role/content/name to the summarizer as well
        clean_prefix = [self._sanitize_for_model(m) for m in prefix_msgs]
        return await self.summarizer.summarize(clean_prefix)


![Context Summarization in Session](../../images/SummarizingSession.jpg)

**High‑level idea**

* **A turn** = one **real user** message **plus everything that follows it** (assistant replies, tool calls/results, etc.) **until the next real user message**.
* You configure two knobs:

  * **`context_limit`**: the maximum number of **real user turns** allowed in the raw history before we summarize.
  * **`keep_last_n_turns`**: how many of the most recent **turns** to keep verbatim when we do summarize.

    * Invariant: `keep_last_n_turns <= context_limit`.
* When the number of **real** user turns exceeds `context_limit`, the session:

  1. **Summarizes** everything **before** the earliest of the last `keep_last_n_turns` turn starts,
  2. Injects a **synthetic user→assistant pair** at the top of the kept region:

     * `user`: `"Summarize the conversation we had so far."` (shadow prompt)
     * `assistant`: `{generated summary}`
  3. **Keeps** the last `keep_last_n_turns` turns **verbatim**.

This guarantees the last `keep_last_n_turns` turns are preserved exactly as they occurred, while all earlier content is compressed into the two synthetic messages.
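
The interaction between the two knobs can be sketched as a pure function. This is a simplified illustration with hypothetical helper names (`plan_summarization`, `compress`): it treats every `user` message as real, ignoring the `synthetic` metadata flag the session tracks, and takes the summary text as an argument instead of calling a model.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def plan_summarization(
    items: List[Message], context_limit: int, keep_last_n_turns: int
) -> Tuple[List[Message], List[Message]]:
    """Split `items` into (prefix_to_summarize, suffix_kept_verbatim).

    Returns an empty prefix when no summarization is needed.
    """
    turn_starts = [i for i, it in enumerate(items) if it.get("role") == "user"]
    if len(turn_starts) <= context_limit:
        return [], items
    if keep_last_n_turns == 0:
        return items, []
    boundary = turn_starts[-keep_last_n_turns]
    return items[:boundary], items[boundary:]


def compress(
    items: List[Message], context_limit: int, keep_last_n_turns: int, summary_text: str
) -> List[Message]:
    """Replace the prefix with the synthetic user/assistant pair."""
    prefix, suffix = plan_summarization(items, context_limit, keep_last_n_turns)
    if not prefix:
        return items
    return [
        {"role": "user", "content": "Summarize the conversation we had so far."},
        {"role": "assistant", "content": summary_text},
    ] + suffix


# Five user turns, each followed by one assistant reply (10 items in total).
items = []
for n in range(5):
    items.append({"role": "user", "content": f"question {n}"})
    items.append({"role": "assistant", "content": f"answer {n}"})

compressed = compress(items, context_limit=4, keep_last_n_turns=2, summary_text="(summary)")
print(len(compressed))           # 6
print(compressed[2]["content"])  # question 3
```

With five turns and `context_limit=4`, the threshold is crossed, so the first three turns collapse into the synthetic pair and the last two turns (four items) stay verbatim.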


In [244]:
session = SummarizingSession(
    keep_last_n_turns=2,
    context_limit=4,
    summarizer=LLMSummarizer(client)
)

In [None]:

# Example flow
await session.add_items([{"role": "user", "content": "Hi, my router won't connect. By the way, I am using Windows 10. I tried troubleshooting via your FAQs but I didn't get anywhere. This is my third time calling you. I am based in the US and one of your Premium customers."}])
await session.add_items([{"role": "assistant", "content": "Let's check your firmware version."}])
await session.add_items([{"role": "user", "content": "Firmware v1.0.3; still failing."}])
await session.add_items([{"role": "assistant", "content": "Try a factory reset."}])
await session.add_items([{"role": "user", "content": "Reset done; error 42 now."}])
await session.add_items([{"role": "assistant", "content": "Try to install a new firmware."}])
await session.add_items([{"role": "user", "content": "I tried but I got another error now."}])
await session.add_items([{"role": "assistant", "content": "Can you please provide me with the error code?"}])
await session.add_items([{"role": "user", "content": "It says 404 not found when I try to access the page."}])
await session.add_items([{"role": "assistant", "content": "Are you connected to the internet?"}])
# At this point, the fifth user message has pushed the turn count past context_limit=4,
# so everything *before* the earliest of the last 2 turns (keep_last_n_turns=2)
# is summarized into a synthetic pair, and those last 2 turns remain verbatim.




In [None]:
history = await session.get_items()
# Pass `history` into your agent runner / responses call as the conversation context.

In [246]:
history

[{'role': 'user', 'content': 'Summarize the conversation we had so far.'},
 {'role': 'assistant',
  'content': "• Product & Environment:\n  - Router with firmware v1.0.3, Windows 10, US-based, Premium customer.\n\n• Reported Issue:\n  - Router won't connect to the internet.\n\n• Steps Tried & Results:\n  - Followed FAQs for troubleshooting; no resolution.\n  - Checked firmware version: v1.0.3; issue persists.\n  - Performed factory reset; encountered error 42.\n\n• Identifiers:\n  - None provided.\n\n• Timeline Milestones:\n  - Initial troubleshooting via FAQs → Firmware check → Factory reset → Error 42.\n\n• Tool Performance Insights:\n  - Factory reset unsuccessful in resolving connection issue; led to error 42.\n\n• Current Status & Blockers:\n  - Connection issue unresolved; error 42 after reset is blocking progress.\n\n• Next Recommended Step:\n  - Install new firmware version compatible with device."},
 {'role': 'user', 'content': 'I tried but I got another error now.'},
 {'role'

In [213]:
print(history[1]['content'])

• Product & Environment:
  - Windows 10, router (specific model UNVERIFIED).

• Reported Issue:
  - Router won't connect.

• Steps Tried & Results:
  - Used FAQs for troubleshooting; no resolution.

• Identifiers:
  - Premium customer; based in the US (no specific identifiers provided).

• Timeline Milestones:
  - Third interaction reported by user.

• Tool Performance Insights:
  - User FAQs insufficient for resolution.

• Current Status & Blockers:
  - Connection issue unresolved; firmware version not yet checked.

• Next Recommended Step:
  - Verify and update the router firmware version.


You can use the `get_items_with_metadata` method to retrieve the session's full history, including per-message metadata, for debugging and analysis.

In [248]:
full_history = await session.get_items_with_metadata()
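
Because each record pairs a `message` with a `metadata` dict, the synthetic summary pair can be separated from real conversation items with a small helper. The sketch below assumes the record shape printed in this section (`{'message': ..., 'metadata': {'synthetic': True, ...}}`) and assumes that real messages either omit the `synthetic` flag or set it to a falsy value. `split_synthetic` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def split_synthetic(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split session records into (synthetic, real) based on metadata['synthetic']."""
    synthetic: List[Record] = []
    real: List[Record] = []
    for rec in records:
        # Assumption: missing or empty metadata means the record is a real message.
        meta = rec.get("metadata") or {}
        if meta.get("synthetic"):
            synthetic.append(rec)
        else:
            real.append(rec)
    return synthetic, real
```

Calling `split_synthetic(full_history)` would then give one list to inspect the summary block and another with the verbatim turns, which is handy when logging or diffing summaries across runs.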


In [209]:
full_history

[{'message': {'role': 'user',
   'content': 'Summarize the conversation we had so far.'},
  'metadata': {'synthetic': True,
   'kind': 'history_summary_prompt',
   'summary_for_turns': '< all before idx 6 >'}},
 {'message': {'role': 'assistant',
   'content': '**Product & Environment:**\n- Device: Router\n- OS: Windows 10\n- Firmware: v1.0.3\n\n**Reported Issue:**\n- Router fails to connect to the internet, now showing error 42.\n\n**Steps Tried & Results:**\n- Checked FAQs: No resolution.\n- Firmware version checked: v1.0.3.\n- Factory reset performed: Resulted in error 42.\n\n**Identifiers:**\n- UNVERIFIED\n\n**Timeline Milestones:**\n- User attempted FAQ troubleshooting.\n- Firmware checked after initial advice.\n- Factory reset led to error 42.\n\n**Tool Performance Insights:**\n- FAQs and basic reset process did not resolve the issue.\n\n**Current Status & Blockers:**\n- Error 42 unresolved; firmware update needed.\n\n**Next Recommended Step:**\n- Install the latest firmware updat

In [210]:
print(history[1]['content'])

**Product & Environment:**
- Device: Router
- OS: Windows 10
- Firmware: v1.0.3

**Reported Issue:**
- Router fails to connect to the internet, now showing error 42.

**Steps Tried & Results:**
- Checked FAQs: No resolution.
- Firmware version checked: v1.0.3.
- Factory reset performed: Resulted in error 42.

**Identifiers:**
- UNVERIFIED

**Timeline Milestones:**
- User attempted FAQ troubleshooting.
- Firmware checked after initial advice.
- Factory reset led to error 42.

**Tool Performance Insights:**
- FAQs and basic reset process did not resolve the issue.

**Current Status & Blockers:**
- Error 42 unresolved; firmware update needed.

**Next Recommended Step:**
- Install the latest firmware update and check for resolution.


### Notes & design choices

* **Turn boundary preserved at the “fresh” side**: the last **`keep_last_n_turns` user turns** remain verbatim; everything older is compressed.
* **Two-message summary block**: easy for downstream tooling to detect or display (`metadata.synthetic == True`).
* **Async + lock discipline**: we **release the lock** while the (potentially slow) summarization runs, then re-check the condition before applying the summary to avoid racy merges.
* **Idempotent behavior**: if more messages arrive during summarization, the post-await recheck prevents stale rewrites.
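
The lock discipline in the last two bullets can be illustrated with a toy session. This is a minimal sketch under stated assumptions, not the cookbook's `SummarizingSession`: `TinySession` is hypothetical, it summarizes the whole snapshot instead of keeping recent turns, the "LLM call" is a short sleep, and the post-await re-check used here (the snapshot must still be a prefix of the current items) is one possible staleness test.

```python
import asyncio
from typing import Dict, List

Message = Dict[str, str]


class TinySession:
    """Hypothetical toy session showing snapshot, unlock, summarize, re-check."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.items: List[Message] = []
        self._lock = asyncio.Lock()

    async def _slow_summary(self, prefix: List[Message]) -> str:
        await asyncio.sleep(0.01)  # stands in for a slow summarizer call
        return f"summary of {len(prefix)} messages"

    async def add(self, item: Message) -> None:
        async with self._lock:
            self.items.append(item)
            if len(self.items) <= self.limit:
                return
            snapshot = list(self.items)  # copy taken while holding the lock

        # Lock released here: other writers may append while we summarize.
        summary = await self._slow_summary(snapshot)

        async with self._lock:
            # Re-check: apply only if the snapshot is still a prefix of the history.
            if self.items[: len(snapshot)] != snapshot:
                return  # stale result: another task already rewrote the history
            tail = self.items[len(snapshot):]  # messages that arrived meanwhile
            self.items = [{"role": "assistant", "content": summary}] + tail
```

If two writers trigger summarization at the same time, whichever finishes first rewrites the history and the other discards its stale summary, so no message that arrived during the await is dropped.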


## Evals

At the end of the day, **evals is all you need** for context engineering. The key question to ask is: *how do we know the model isn’t “losing context” or “confusing context”?*

While a full cookbook around memory could stand on its own in the future, here are some lightweight evaluation harness ideas to start with:

* **Baseline & Deltas:** Continue running your core eval sets and compare before/after experiments to measure memory improvements.
* **LLM-as-Judge:** Use a model with a carefully designed grader prompt to evaluate summarization quality. Focus on whether it captures the most important details in the correct format.
* **Transcript Replay:** Re-run long conversations and measure next-turn accuracy with and without context trimming. Metrics could include exact match on entities/IDs and rubric-based scoring on reasoning quality.
* **Error Regression Tracking:** Watch for common failure modes—unanswered questions, dropped constraints, or unnecessary/repeated tool calls.
* **Token Pressure Checks:** Flag cases where token limits force dropping protected context. Log before/after token counts to detect when critical details are being pruned.
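
As a starting point for the exact-match idea in the list above, a summary can be scored on how many user-stated identifiers it retains. The sketch below is illustrative only: `extract_entities` and `entity_retention` are hypothetical helpers, and the two regular expressions (version strings like `v1.0.3` and codes like `error 42`) are assumptions tailored to the router example, not a general entity extractor.

```python
import re
from typing import Dict, List, Set


def extract_entities(text: str) -> Set[str]:
    """Pull version strings and 'error <n>' codes out of text (case-insensitive)."""
    lowered = text.lower()
    versions = re.findall(r"\bv\d+(?:\.\d+)+\b", lowered)
    errors = [f"error {n}" for n in re.findall(r"\berror\s+(\d+)\b", lowered)]
    return set(versions) | set(errors)


def entity_retention(transcript: List[Dict[str, str]], summary: str) -> float:
    """Fraction of entities from user messages that also appear in the summary."""
    expected: Set[str] = set()
    for message in transcript:
        if message["role"] == "user":
            expected |= extract_entities(message["content"])
    if not expected:
        return 1.0  # nothing to retain, so nothing was lost
    found = extract_entities(summary)
    return len(expected & found) / len(expected)
```

A score below 1.0 flags summaries that dropped an identifier the user supplied. Tracking this number before and after changing the summarizer prompt or `keep_last_n_turns` gives a cheap regression signal alongside rubric-based grading.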

---