## Model Context

A model context supports storage and retrieval of Chat Completion messages.
It is always used together with a model client to generate LLM-based responses.

For example, {py:mod}`~autogen_core.model_context.BufferedChatCompletionContext`
is a most-recently-used (MRU) context that keeps only the most recent `buffer_size`
messages. This helps avoid overflowing the limited context window of many LLMs.
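
To make the MRU behaviour concrete, here is a minimal pure-Python sketch of a buffered context. It is not the `autogen_core` implementation. The class name `ToyBufferedContext` is hypothetical, its methods are synchronous (the real context's methods are awaited), and it assumes the buffer keeps the full history but returns only the last `buffer_size` messages when asked.

```python
from typing import List


class ToyBufferedContext:
    """Hypothetical stand-in for a buffered (sliding-window) model context."""

    def __init__(self, buffer_size: int) -> None:
        if buffer_size <= 0:
            raise ValueError("buffer_size must be positive")
        self._buffer_size = buffer_size
        self._messages: List[str] = []  # full history, oldest first

    def add_message(self, message: str) -> None:
        self._messages.append(message)

    def get_messages(self) -> List[str]:
        # Only the most recent `buffer_size` messages are handed to the caller.
        return self._messages[-self._buffer_size :]


ctx = ToyBufferedContext(buffer_size=3)
for i in range(5):
    ctx.add_message(f"msg{i}")
print(ctx.get_messages())  # ['msg2', 'msg3', 'msg4']
```

Older messages are never sent to the model once they fall outside the window, which is what keeps the prompt size bounded.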

Let's see an example that uses
{py:mod}`~autogen_core.model_context.BufferedChatCompletionContext`.

In [1]:
from jet.logger import logger

In [2]:
from dataclasses import dataclass

from autogen_core import AgentId, MessageContext, RoutedAgent, SingleThreadedAgentRuntime, message_handler
from autogen_core.model_context import BufferedChatCompletionContext
from autogen_core.models import AssistantMessage, ChatCompletionClient, SystemMessage, UserMessage
from jet.adapters.autogen.ollama_client import OllamaChatCompletionClient

In [3]:
@dataclass
class Message:
    content: str

In [4]:
class SimpleAgentWithContext(RoutedAgent):
    def __init__(self, model_client: ChatCompletionClient) -> None:
        super().__init__("A simple agent")
        self._system_messages = [SystemMessage(content="You are a helpful AI assistant.")]
        self._model_client = model_client
        self._model_context = BufferedChatCompletionContext(buffer_size=5)

    @message_handler
    async def handle_user_message(self, message: Message, ctx: MessageContext) -> Message:
        # Prepare input to the chat completion model.
        user_message = UserMessage(content=message.content, source="user")
        # Add message to model context.
        await self._model_context.add_message(user_message)
        # Generate a response.
        response = await self._model_client.create(
            self._system_messages + (await self._model_context.get_messages()),
            cancellation_token=ctx.cancellation_token,
        )
        # The model may also return function calls; this agent only expects text.
        assert isinstance(response.content, str)
        # Add the model's response to the model context so later turns can see it.
        await self._model_context.add_message(AssistantMessage(content=response.content, source=self.metadata["type"]))
        # Return the model's response to the sender.
        return Message(content=response.content)
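
The handler follows the same three steps on every turn: add the user message, call the model with the system messages plus the buffered history, then add the reply. The sketch below replays that bookkeeping without a real LLM or runtime, to show how many messages the model receives per turn with `buffer_size=5`. `StubModel`, `handle_turn`, and the `(role, content)` tuples are hypothetical stand-ins for the `autogen_core` types, and the context is modeled as a plain list sliced to its last five entries, assuming that is how the buffered context trims history. Run it as a script; inside a notebook, use `await main()` instead of `asyncio.run(main())`, because `asyncio.run` cannot be called from an already running event loop.

```python
import asyncio
from typing import List, Tuple

Msg = Tuple[str, str]  # (role, content); hypothetical stand-in for autogen_core messages

BUFFER_SIZE = 5  # same value as BufferedChatCompletionContext(buffer_size=5) above


class StubModel:
    """Hypothetical model client that records how many messages it was sent."""

    def __init__(self) -> None:
        self.prompt_sizes: List[int] = []

    async def create(self, messages: List[Msg]) -> str:
        self.prompt_sizes.append(len(messages))
        return f"reply #{len(self.prompt_sizes)}"


async def handle_turn(model: StubModel, system: List[Msg], history: List[Msg], text: str) -> str:
    # 1. Add the user message to the context.
    history.append(("user", text))
    # 2. Call the model with the system messages plus the buffered window.
    reply = await model.create(system + history[-BUFFER_SIZE:])
    # 3. Add the assistant reply to the context so later turns can see it.
    history.append(("assistant", reply))
    return reply


async def main() -> List[int]:
    model = StubModel()
    system = [("system", "You are a helpful AI assistant.")]
    history: List[Msg] = []
    for i in range(4):
        await handle_turn(model, system, history, f"question {i + 1}")
    return model.prompt_sizes


print(asyncio.run(main()))  # [2, 4, 6, 6]
```

The prompt stops growing at six messages (one system message plus five buffered ones), no matter how long the conversation runs.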

Now let's ask a follow-up question after the first one.

In [6]:
model_client = OllamaChatCompletionClient(model="llama3.2")

runtime = SingleThreadedAgentRuntime()
await SimpleAgentWithContext.register(
    runtime,
    "simple_agent_context",
    lambda: SimpleAgentWithContext(model_client=model_client),
)
# Start the runtime processing messages.
runtime.start()
agent_id = AgentId("simple_agent_context", "default")

# First question.
message = Message("Hello, what are some fun things to do in Seattle?")
logger.debug(f"Question: {message.content}")
response = await runtime.send_message(message, agent_id)
logger.debug(f"Response: {response.content}")
logger.gray("-----")

# Second question.
message = Message("What was the first thing you mentioned?")
logger.debug(f"Question: {message.content}")
response = await runtime.send_message(message, agent_id)
logger.debug(f"Response: {response.content}")

# Stop the runtime processing messages.
await runtime.stop()
await model_client.close()

Question: Hello, what are some fun things to do in Seattle?

The second response shows that the agent can now recall its own previous reply: that reply was stored in the model context and sent back to the model along with the follow-up question.
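
Recall only works while the earlier reply is still inside the buffer. Assuming the buffer keeps the last `buffer_size` messages and each turn adds one user message and one assistant message, this sketch lists which messages the model sees when answering each question. The function `visible_window` and the `u1`/`a1` labels (user question 1, assistant answer 1) are hypothetical.

```python
from typing import List


def visible_window(turn: int, buffer_size: int = 5) -> List[str]:
    """Labels of the messages the model sees when answering question `turn` (1-based)."""
    history: List[str] = []
    for t in range(1, turn):
        history += [f"u{t}", f"a{t}"]  # earlier question and its answer
    history.append(f"u{turn}")  # the current question
    return history[-buffer_size:]


for turn in range(1, 5):
    print(turn, visible_window(turn))
# 1 ['u1']
# 2 ['u1', 'a1', 'u2']
# 3 ['u1', 'a1', 'u2', 'a2', 'u3']
# 4 ['u2', 'a2', 'u3', 'a3', 'u4']
```

With `buffer_size=5`, the second question still sees the first answer (`a1`), but by the fourth question the first exchange has dropped out of the window, so the agent would likely no longer be able to answer a question about it from context.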