### Chat Summary Memory Buffer

In this demo, we use the new ChatSummaryMemoryBuffer to limit the chat history to a certain token length, and iteratively summarize all messages that do not fit in the memory buffer. This can be useful if you want to limit costs and latency (assuming the summarization prompt uses and generates fewer tokens than including the entire history).

The original ChatMemoryBuffer gives you the option to truncate the history after a certain number of tokens, which is useful to limit costs and latency, but also removes potentially relevant information from the chat history.

The newer ChatSummaryMemoryBuffer aims to make this more flexible: instead of dropping the messages that do not fit, it summarizes them, so the user has more control over how much of the chat history is retained.
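
To make the difference concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `count_tokens`, `split_by_budget`, and the placeholder summary string are hypothetical names used only for illustration, and tokens are approximated by whitespace-separated words.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: one token per word."""
    return len(text.split())


def split_by_budget(messages: List[str], token_limit: int) -> Tuple[List[str], List[str]]:
    """Walk backwards from the newest message, keeping messages while they fit."""
    kept: List[str] = []
    used = 0
    for text in reversed(messages):
        cost = count_tokens(text)
        if used + cost > token_limit:
            break
        kept.append(text)
        used += cost
    kept.reverse()
    overflow = messages[: len(messages) - len(kept)]
    return overflow, kept


messages = [
    "What is LlamaIndex?",
    "A data framework for building LLM applications",
    "Can you give me some more details?",
    "It helps you ingest data and query it with an LLM",
]

overflow, kept = split_by_budget(messages, token_limit=20)

# Truncation: only the messages that fit survive; the overflow is discarded.
truncated_view = kept

# Summarization: the overflow is condensed into one entry (a placeholder here,
# an LLM-generated summary in practice) and placed in front of the kept messages.
placeholder = f"<summary of {len(overflow)} older messages>"
summary_view = ([placeholder] if overflow else []) + kept

print(truncated_view)
print(summary_view)
```

In both views the two most recent messages are returned verbatim; only the summarizing view still carries some trace of the two older ones.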

First, we simulate some chat history that will not fit in the memory buffer in its entirety.

In [1]:
from llama_index.core.llms import ChatMessage

chat_history = [
    ChatMessage(role="user", content="What is LlamaIndex?"),
    ChatMessage(
        role="assistant",
        content="LlamaIndex is the leading data framework for building LLM applications",
    ),
    ChatMessage(role="user", content="Can you give me some more details?"),
    ChatMessage(
        role="assistant",
        content="""LlamaIndex is a framework for building context-augmented LLM applications. Context augmentation refers to any use case that applies LLMs on top of your private or domain-specific data. Some popular use cases include the following: 
        Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation"), Document Understanding and Extraction, Autonomous Agents that can perform research and take actions
        LlamaIndex provides the tools to build any of these above use cases from prototype to production. The tools allow you to both ingest/process this data and implement complex query workflows combining data access with LLM prompting.""",
    ),
]

By supplying an llm and token_limit for summarization, we create a ChatSummaryMemoryBuffer instance.

In [2]:
from helpers import MODEL_NAME, OPENROUTER_API_KEY
from llama_index.llms.openrouter import OpenRouter
from llama_index.core.memory import ChatSummaryMemoryBuffer
import tiktoken

summarizer_llm = OpenRouter(
    api_key=OPENROUTER_API_KEY,
    model="gemini-2.0-flash",
    is_chat_model=True,
    is_function_calling_model=True,
    max_tokens=256,  # cap the length of each generated summary
)

# Approximate token counts with tiktoken's cl100k_base encoding;
# Gemini's own tokenizer may count somewhat differently.
tokenizer_fn = tiktoken.get_encoding("cl100k_base").encode
memory = ChatSummaryMemoryBuffer.from_defaults(
    chat_history=chat_history,
    llm=summarizer_llm,
    token_limit=2,  # deliberately tiny, so no message fits and everything is summarized
    tokenizer_fn=tokenizer_fn,
)

history = memory.get()

When printing the history, we can observe that older messages have been summarized.



In [None]:
print(history)

[ChatMessage(role=<MessageRole.SYSTEM: 'system'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='The conversation revolves around an explanation of LlamaIndex, described as a leading data framework for developing LLM (Large Language Model) applications. It focuses on context-augmented LLM applications, which involve using LLMs in conjunction with private or specialized data. Key use cases highlighted include question-answering chatbots (RAG systems), document understanding and extraction, and autonomous agents capable of research and action. LlamaIndex offers tools for processing data and implementing intricate query workflows that integrate data access with LLM prompts, facilitating the development of such applications from initial prototypes to full-scale production.')])]
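
The output above contains a single summary message and none of the original messages. One way to see why is to count tokens per message and compare the counts with `token_limit`. The sketch below assumes that `tokenizer_fn` is simply a callable mapping a string to a list of tokens, and uses a whitespace split as a hypothetical stand-in; a real tokenizer will produce different counts, but the conclusion for a limit of 2 is the same.

```python
from typing import Callable, List

# Assumed shape of tokenizer_fn: a callable that maps a string to a list of tokens.
# str.split is a hypothetical stand-in for a real tokenizer's encode method.
tokenizer_fn: Callable[[str], List[str]] = str.split

messages = [
    "What is LlamaIndex?",
    "Can you give me some more details?",
]
token_limit = 2

# Count tokens per message and collect the messages that would fit on their own.
token_counts = {text: len(tokenizer_fn(text)) for text in messages}
fits = [text for text, n in token_counts.items() if n <= token_limit]

for text, n in token_counts.items():
    print(f"{n:>2} tokens: {text}")
print("messages that fit within the limit:", fits)
```

Even the shortest user message exceeds the limit, so nothing can be kept verbatim and the whole history ends up in the summary.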


Let's add some new chat history.

In [None]:
new_chat_history = [
    ChatMessage(role="user", content="Why context augmentation?"),
    ChatMessage(
        role="assistant",
        content="LLMs offer a natural language interface between humans and data. Widely available models come pre-trained on huge amounts of publicly available data. However, they are not trained on your data, which may be private or specific to the problem you're trying to solve. It's behind APIs, in SQL databases, or trapped in PDFs and slide decks. LlamaIndex provides tooling to enable context augmentation. A popular example is Retrieval-Augmented Generation (RAG) which combines context with LLMs at inference time. Another is finetuning.",
    ),
    ChatMessage(role="user", content="Who is LlamaIndex for?"),
    ChatMessage(
        role="assistant",
        content="LlamaIndex provides tools for beginners, advanced users, and everyone in between. Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code. For more complex applications, our lower-level APIs allow advanced users to customize and extend any module—data connectors, indices, retrievers, query engines, reranking modules—to fit their needs.",
    ),
]
# Add the new messages to memory one at a time
for message in new_chat_history:
    memory.put(message)
history = memory.get()

The history will now be updated with a new summary, containing the latest information.

In [None]:
print(history)

[ChatMessage(role=<MessageRole.SYSTEM: 'system'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='The conversation discusses LlamaIndex, a data framework for developing context-augmented LLM applications. It explains the importance of context augmentation for integrating private or specialized data with LLMs, enhancing their capabilities for specific tasks. LlamaIndex supports various use cases like chatbots, document understanding, and autonomous agents. It caters to users of all skill levels, offering a simple high-level API for beginners and flexible lower-level APIs for advanced users to customize their applications.')])]
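
The iterative behaviour, where each new summary builds on the previous one, can be illustrated with a toy buffer. This is a sketch under stated assumptions rather than the library's code: `RollingSummary` and `fake_summarize` are hypothetical names, the summarizer is a string-joining stub instead of an LLM call, and tokens are counted as whitespace-separated words.

```python
from typing import Callable, List, Optional


def fake_summarize(previous_summary: Optional[str], overflow: List[str]) -> str:
    """Hypothetical stand-in for the LLM call: records what would be condensed."""
    parts = ([previous_summary] if previous_summary else []) + overflow
    return "SUMMARY(" + " | ".join(parts) + ")"


class RollingSummary:
    """Toy buffer: keeps recent messages verbatim and folds older ones into a summary."""

    def __init__(
        self,
        token_limit: int,
        summarize: Callable[[Optional[str], List[str]], str] = fake_summarize,
    ) -> None:
        self.token_limit = token_limit
        self.summarize = summarize
        self.summary: Optional[str] = None
        self.recent: List[str] = []

    def _recent_tokens(self) -> int:
        return sum(len(text.split()) for text in self.recent)

    def put(self, text: str) -> None:
        self.recent.append(text)
        overflow: List[str] = []
        # Evict the oldest messages until the remaining ones fit the budget.
        while self.recent and self._recent_tokens() > self.token_limit:
            overflow.append(self.recent.pop(0))
        if overflow:
            # The previous summary is passed in again, so no earlier context is lost.
            self.summary = self.summarize(self.summary, overflow)

    def get(self) -> List[str]:
        return ([self.summary] if self.summary else []) + self.recent


buf = RollingSummary(token_limit=4)
for text in ["What is LlamaIndex?", "Why context augmentation?", "Who is LlamaIndex for?"]:
    buf.put(text)

print(buf.get())
```

The nested `SUMMARY(SUMMARY(...) | ...)` string in the result shows the earlier summary being folded into the newer one, which mirrors how the second summary above still mentions topics from the first conversation.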


A larger token_limit keeps more of the recent messages verbatim and summarizes only the older ones, so the user can control the balance between retaining the full chat history and summarization.

In [None]:
memory = ChatSummaryMemoryBuffer.from_defaults(
    chat_history=chat_history + new_chat_history,
    llm=summarizer_llm,
    token_limit=256,
    tokenizer_fn=tokenizer_fn,
)
print(memory.get())

[ChatMessage(role=<MessageRole.SYSTEM: 'system'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='The conversation revolves around an explanation of LlamaIndex, described as a leading data framework for developing LLM (Large Language Model) applications. It focuses on context-augmented LLM applications, which involve using LLMs with private or specific-domain data. Key use cases highlighted include question-answering chatbots (RAG systems), document understanding and extraction, and autonomous agents capable of research and action. LlamaIndex offers tools for processing data and implementing complex query workflows, facilitating the transition from prototype to production in these applications.')]), ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='Why context augmentation?')]), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text="LLMs offer a 