# Memories: Short-Term & Long-Term

We will use the `Memory` class (from LlamaIndex) to store and retrieve both short-term and long-term memory.

You can use it on its own and orchestrate it within a custom workflow, or use it within an existing agent.

By default, short-term memory is represented as a FIFO queue of `ChatMessage` objects. Once the queue exceeds a certain size, the oldest messages, up to the flush size, are archived and optionally flushed to long-term memory blocks.

Long-term memory is represented as `Memory Block` objects. These objects receive the messages that are flushed from short-term memory, and optionally process them to extract information. Then when memory is retrieved, the short-term and long-term memories are merged together.
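
As a rough mental model of the flow described above, here is a minimal sketch in plain Python. It is not LlamaIndex's implementation: `ToyMemory`, the word-count token estimate, and the plain list used as long-term storage are all illustrative assumptions.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


class ToyMemory:
    """Illustrative only: a FIFO short-term queue that flushes its oldest messages."""

    def __init__(self, short_term_limit: int, flush_size: int) -> None:
        self.short_term_limit = short_term_limit
        self.flush_size = flush_size
        self.short_term = deque()  # newest messages, in arrival order
        self.long_term = []        # messages flushed out of the queue

    def put(self, message: str) -> None:
        self.short_term.append(message)
        if sum(count_tokens(m) for m in self.short_term) > self.short_term_limit:
            self._flush()

    def _flush(self) -> None:
        # Move the oldest messages out until roughly `flush_size` tokens are gone,
        # always keeping at least the newest message in the queue.
        flushed = 0
        while len(self.short_term) > 1 and flushed < self.flush_size:
            oldest = self.short_term.popleft()
            flushed += count_tokens(oldest)
            self.long_term.append(oldest)

    def get(self) -> list:
        # Retrieval merges long-term content with the short-term queue.
        return [f"[long-term] {m}" for m in self.long_term] + list(self.short_term)


toy_memory = ToyMemory(short_term_limit=8, flush_size=3)
for text in ["my name is Jerry", "I like dolls", "what is my name"]:
    toy_memory.put(text)

print(toy_memory.get())
```

In this toy run the third message pushes the queue over its limit, so the oldest message moves to long-term storage and shows up again, tagged, when memory is retrieved.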


## 1. Setup


In [1]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

import os, sys
from pathlib import Path

# Resolve project root robustly by finding the folder that contains `asdrp/`
PROJECT_ROOT = None
for candidate in [Path.cwd(), *Path.cwd().parents]:
    if (candidate / "asdrp").exists():
        PROJECT_ROOT = candidate
        break

if PROJECT_ROOT is None:
    raise RuntimeError("Could not find repo root containing 'asdrp'.")

print(f"project root: {PROJECT_ROOT}")
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))
        
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.settings import Settings

llm = OpenAI(model="gpt-5-mini", temperature=0.01)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.embed_model = embed_model
Settings.llm = llm


project root: /Users/pmui/SynologyDrive/research/2026/research2026/projects/memagents


## 2. Short-term Memory

Let's explore how to configure various components of short-term memory.

For visual purposes, we will set some low token limits to more easily observe the memory behavior.


In [None]:

from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=50,  # small enough to observe the memory behavior
    token_flush_size=10,
    chat_history_token_ratio=0.7,
)

Let's review the configuration we used and what it means:

- `session_id`: A unique identifier for the session. Used to mark chat messages in a SQL database as belonging to a specific session.
- `token_limit`: The maximum number of tokens that can be stored in short-term + long-term memory.
- `chat_history_token_ratio`: The ratio of tokens in the short-term chat history to the total token limit. Here this means that 50\*0.7 = 35 tokens are allocated to short-term memory, and the rest is allocated to long-term memory.
- `token_flush_size`: The number of tokens to flush to long-term memory when the token limit is exceeded. Note that we did not configure long-term memory, so these messages are merely archived in the database and removed from the short-term memory.
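
The budget arithmetic above can be sketched as a small helper. `split_token_budget` is a hypothetical function written for this notebook, not part of LlamaIndex, and the use of `round` is an assumption about how the fractional share is converted to whole tokens.

```python
def split_token_budget(token_limit: int, chat_history_token_ratio: float) -> tuple:
    """Split a total token budget into (short-term, long-term) shares."""
    # Assumption: the short-term share is the rounded product of limit and ratio.
    short_term = round(token_limit * chat_history_token_ratio)
    long_term = token_limit - short_term
    return short_term, long_term


# The configuration used in this section: 50 tokens with a 0.7 ratio.
short_term_budget, long_term_budget = split_token_budget(50, 0.7)
print(short_term_budget, long_term_budget)
```

With the values used here, 35 tokens go to the chat history and the remaining 15 are left for long-term memory.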

Using our memory, we can manually add some messages and observe how it works.


In [None]:
from llama_index.core.llms import ChatMessage

# Simulate a long conversation
for i in range(100):
    await memory.aput_messages(
        [
            ChatMessage(role="user", content="Hello, world!  Message " + str(i)),
            ChatMessage(role="assistant", content="Hello, world to you too!  Message " + str(i)),
            ChatMessage(role="user", content="What is the capital of France?  Message " + str(i)),
            ChatMessage(
                role="assistant", content="The capital of France is Paris.  Message " + str(i)
            ),
        ]
    )

Since our token limit is small, we will only see the last 2 messages in short-term memory (since this fits within the `50*0.7` limit).


In [None]:
current_chat_history = await memory.aget()
for msg in current_chat_history:
    print(msg)

If we retrieve all messages, we will find all 400 messages (100 iterations of 4 messages each).


In [None]:

all_messages = await memory.aget_all()
print(len(all_messages))

We can clear the memory at any time to start fresh.


In [None]:
await memory.areset()
all_messages = await memory.aget_all()
print(len(all_messages))

## 3. Long-term Memory

Long-term memory is represented as Memory Block objects. These objects receive the messages that are flushed from short-term memory, and optionally process them to extract information. Then when memory is retrieved, the short-term and long-term memories are merged together.


We have 3 prebuilt memory blocks:

- `StaticMemoryBlock`: A memory block that stores a static piece of information.
- `FactExtractionMemoryBlock`: A memory block that extracts facts from the chat history.
- `VectorMemoryBlock`: A memory block that stores and retrieves batches of chat messages from a vector database.

Each block has a `priority` that is used when the long-term memory + short-term memory exceeds the token limit. Priority 0 means the block will always be kept in memory; blocks with priority 1 or higher may be temporarily disabled when the limit is exceeded.
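
To make the priority rule concrete, here is a minimal sketch in plain Python. `ToyBlock` and `select_blocks` are hypothetical names, the token counts are made up, and the order in which blocks are disabled (largest priority number first) is an assumption rather than a statement about LlamaIndex internals.

```python
from dataclasses import dataclass


@dataclass
class ToyBlock:
    name: str
    priority: int  # 0 = always keep; larger numbers are dropped first (assumption)
    tokens: int    # made-up size of the block's rendered content


def select_blocks(blocks: list, budget: int) -> list:
    """Drop the lowest-importance blocks until the remainder fits the budget."""
    kept = sorted(blocks, key=lambda b: b.priority)
    while kept and sum(b.tokens for b in kept) > budget:
        if kept[-1].priority == 0:
            break  # only always-keep blocks remain, so nothing else can be dropped
        kept.pop()  # the last element has the largest priority number
    return kept


toy_blocks = [
    ToyBlock("core_info", priority=0, tokens=20),
    ToyBlock("extracted_info", priority=1, tokens=100),
    ToyBlock("vector_memory", priority=2, tokens=300),
]

print([b.name for b in select_blocks(toy_blocks, budget=150)])
```

With a budget of 150 tokens, the toy `vector_memory` block is disabled while `core_info` and `extracted_info` stay.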


In [None]:
from llama_index.core.memory import (
    StaticMemoryBlock,
    FactExtractionMemoryBlock,
    VectorMemoryBlock,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

llm = OpenAI(model="gpt-4.1-mini")
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

client = chromadb.EphemeralClient()
vector_store = ChromaVectorStore(
    chroma_collection=client.get_or_create_collection("test_collection")
)

blocks = [
    StaticMemoryBlock(
        name="core_info",
        static_content="My name is ASDRP Agent.  I live in Fremont, CA and I love to talk about nested Matryoshka dolls.",
        priority=0,
    ),
    FactExtractionMemoryBlock(
        name="extracted_info",
        llm=llm,
        max_facts=50,
        priority=1,
    ),
    VectorMemoryBlock(
        name="vector_memory",
        # required: pass in a vector store like qdrant, chroma, weaviate, milvus, etc.
        vector_store=vector_store,
        priority=2,
        embed_model=embed_model,
        # The top-k message batches to retrieve
        # similarity_top_k=2,
        # optional: How many previous messages to include in the retrieval query
        # retrieval_context_window=5
        # optional: pass optional node-postprocessors for things like similarity threshold, etc.
        # node_postprocessors=[...],
    ),
]

With our blocks created, we can pass them into the Memory class.


In [None]:
from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=30000,
    # Setting an extremely low ratio so that more tokens are flushed to long-term memory
    chat_history_token_ratio=0.02,
    token_flush_size=500,
    memory_blocks=blocks,
    # insert into the latest user message, can also be "system"
    insert_method="user",
)

With this, we can simulate a conversation with an agent and inspect the long-term memory.


In [None]:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

agent = FunctionAgent(
    tools=[],
    llm=llm,
)

user_msgs = [
    "Hi! My name is Jerry",
    "What is your opinion on nested Matryoshka dolls?",
    "What is the most popular nesting doll?",
    "In history, what is the most significant nesting doll?",
    "What is the most expensive nesting doll?",
    "I am interested in buying a nesting doll, what is the most popular nesting doll?",
    "What is the most valuable nesting doll?",
    "Last week, I bought a nesting doll.",
    "What is the most rare nesting doll?",
    "I am thinking about the historical significance of nesting dolls, what is the most interesting nesting doll?",
    "What is the most unique nesting doll?",
    "Why are nesting dolls so popular?",
    "What is the most interesting nesting doll?",
]

for user_msg in user_msgs:
    _ = await agent.run(user_msg=user_msg, memory=memory)

Now, let's inspect the most recent user message and see what the memory inserts into it.

Note that we pass in at least one chat message so that the vector memory actually runs retrieval.


In [None]:
chat_history = await memory.aget()
for chat in chat_history:
    print(f"==> {chat}")

Great, we can see that the current FIFO queue is only 2-3 messages (expected since we set the chat history token ratio to 0.02).

Now, let's inspect the long-term memory blocks that are inserted into the latest user message.


In [None]:
for block in chat_history[-2].blocks:
    print(block.text)

To use this memory outside an agent, and to highlight more of the usage, you might do something like the following:


In [None]:
new_user_msg = ChatMessage(
    role="user", content="What kind of doll was I asking about?"
)
await memory.aput(new_user_msg)

# Get the new chat history
new_chat_history = await memory.aget()
resp = await llm.achat(new_chat_history)
await memory.aput(resp.message)
print(resp.message.content)

## 4. Mem0Agent: LlamaIndex + Mem0 Memory

This section demonstrates a full Mem0-enabled agent using the LlamaIndex **Mem0Memory** wrapper. It covers setup, memory scoping, direct memory access, and test scenarios.

References:

- https://docs.mem0.ai/integrations/llama-index
- https://docs.mem0.ai/cookbooks/frameworks/llamaindex-react

```mermaid
flowchart LR
  U[User] -->|message| A[Mem0Agent]
  A -->|search context| M[(Mem0 Memory)]
  M -->|memories| A
  A -->|tool calls| T[Tools]
  A -->|response| U
  A -->|store new memory| M
```


### Setup

Install the Mem0 LlamaIndex integration (and dotenv if needed):

```
pip install llama-index-core llama-index-memory-mem0 python-dotenv
```

You will need:

- `OPENAI_API_KEY`
- `MEM0_API_KEY` (Mem0 Platform) or a Mem0 OSS config


In [2]:
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

missing = [k for k in ["OPENAI_API_KEY", "MEM0_API_KEY"] if not os.getenv(k)]
if missing:
    raise ValueError(f"Missing required environment variables: {missing}")

print("Environment ready.")


Environment ready.


### Memory Scoping with Context

Mem0 memory is scoped by `context` so you can isolate or share memory across users, agents, and runs.

```mermaid
flowchart TB
  C[Context] --> U[user_id]
  C --> A[agent_id]
  C --> R[run_id]
  U --> M[(Mem0 Memory)]
  A --> M
  R --> M
```
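
The scoping idea in the diagram can be illustrated with a toy in-memory store. This is only a sketch: `ToyScopedStore` is hypothetical, it matches on exact key equality, and it says nothing about how Mem0 actually stores or searches memories. The user `erin` is made up for the example.

```python
class ToyScopedStore:
    """Illustrative only: memories tagged with a context dict and filtered on retrieval."""

    def __init__(self) -> None:
        self._records = []  # list of (context, text) pairs

    def add(self, context: dict, text: str) -> None:
        self._records.append((dict(context), text))

    def search(self, filters: dict) -> list:
        # A record matches when every filter key equals the record's context value.
        return [
            text
            for ctx, text in self._records
            if all(ctx.get(key) == value for key, value in filters.items())
        ]


store = ToyScopedStore()
store.add({"user_id": "david", "agent_id": "agent_a"}, "prefers email")
store.add({"user_id": "erin", "agent_id": "agent_a"}, "prefers phone")

print(store.search({"user_id": "david"}))     # only david's memory
print(store.search({"agent_id": "agent_a"}))  # shared across users of agent_a
print(store.search({"user_id": "david", "agent_id": "agent_b"}))  # no match
```

Filtering on `user_id` isolates one user's memories, while filtering on `agent_id` alone shares them across every user of that agent.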


### Suppress Dependency Warnings

The following deprecation warnings come from dependencies (LlamaIndex, Pydantic), not your code. We suppress them for cleaner output.


In [None]:
import warnings

# Suppress Pydantic deprecation warnings from dependencies
warnings.filterwarnings("ignore", category=DeprecationWarning, module="pydantic")
warnings.filterwarnings("ignore", message=".*PydanticDeprecated.*")
warnings.filterwarnings("ignore", message=".*utcfromtimestamp.*")


In [3]:
from asdrp.agent.mem0_agent import Mem0Agent
from llama_index.core.tools import FunctionTool


def call_fn(name: str) -> str:
    return f"Calling {name}"


def email_fn(name: str) -> str:
    return f"Emailing {name}"


call_tool = FunctionTool.from_defaults(fn=call_fn)
email_tool = FunctionTool.from_defaults(fn=email_fn)

tools = [call_tool, email_tool]

agent = Mem0Agent(
    context={"user_id": "david", "agent_id": "mem0_agent"},
    tools=tools,
)

agent


/Users/pmui/.local/share/uv/python/cpython-3.13.3-macos-aarch64-none/lib/python3.13/inspect.py:602: PydanticDeprecatedSince20: The `__fields__` attribute is deprecated, use the `model_fields` class property instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
  value = getter(object, key)
/Users/pmui/.local/share/uv/python/cpython-3.13.3-macos-aarch64-none/lib/python3.13/inspect.py:602: PydanticDeprecatedSince20: The `__fields_set__` attribute is deprecated, use `model_fields_set` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
  value = getter(object, key)
/Users/pmui/.local/share/uv/python/cpython-3.13.3-macos-aarch64-none/lib/python3.13/inspect.py:602: PydanticDeprecatedSince211: Accessing the 'model_computed_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the m

<asdrp.agent.mem0_agent.Mem0Agent at 0x11446b4d0>

In [None]:
import asyncio


async def run_session():
    reply = await agent.achat("Hi, my name is David.")
    print(reply.response_str)

    reply = await agent.achat("I prefer email updates.")
    print(reply.response_str)

    reply = await agent.achat("Please contact me about my next order.")
    print(reply.response_str)

await run_session()


### Direct Memory Access (Search and Add)

`Mem0Agent` exposes simple helpers to search and add memories directly. This is useful for debugging, seeding, or inspection.


In [None]:
try:
    # Mem0 API requires non-empty filters for search
    results = agent.search_memories(
        "preferred communication",
        limit=3,
        filters={"user_id": "david"},
    )
    print(results)

    agent.add_memories(
        messages=[
            {"role": "user", "content": "My favorite cuisine is Italian."},
            {"role": "assistant", "content": "Got it. I will remember that."},
        ]
    )

    results = agent.search_memories(
        "favorite cuisine",
        limit=3,
        filters={"user_id": "david"},
    )
    print(results)
except Exception as exc:
    print(f"Mem0 backend not available: {exc}")


### Memory Isolation vs Shared Context

Use different `user_id` values to isolate memory, or reuse the same `user_id` to share memory across agents.


In [None]:
agent_a = Mem0Agent(context={"user_id": "shared_user", "agent_id": "agent_a"})
agent_b = Mem0Agent(context={"user_id": "shared_user", "agent_id": "agent_b"})
agent_c = Mem0Agent(context={"user_id": "isolated_user", "agent_id": "agent_c"})

async def share_and_isolate():
    await agent_a.achat("I like black coffee.")

    reply_shared = await agent_b.achat("What coffee do I like?")
    print("Shared context:", reply_shared.response_str)

    reply_isolated = await agent_c.achat("What coffee do I like?")
    print("Isolated context:", reply_isolated.response_str)

await share_and_isolate()


### Using Mem0 OSS (Optional)

If you run Mem0 OSS locally, configure the vector store, embedder, and LLM in a Mem0 config and initialize `Mem0Agent` with `use_platform=False`.


In [None]:
mem0_config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "mem0_demo",
            "host": "localhost",
            "port": 6333,
            "embedding_model_dims": 1536,
        },
    },
    "llm": {
        "provider": "openai",
        "config": {
            "model": "gpt-4.1-nano-2025-04-14",
            "temperature": 0.2,
            "max_tokens": 2000,
        },
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
    "version": "v1.1",
}

# oss_agent = Mem0Agent(
#     context={"user_id": "oss_user", "agent_id": "mem0_agent"},
#     mem0_config=mem0_config,
#     use_platform=False,
# )
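
One easy mistake in a config like this is an `embedding_model_dims` value that does not match the embedder. The check below is a hypothetical helper, not part of Mem0; it assumes the config layout shown above and only knows the output sizes of the two OpenAI `text-embedding-3` models.

```python
KNOWN_EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_embedding_dims(config: dict) -> None:
    """Raise ValueError if the vector store dims disagree with the embedder model."""
    model = config["embedder"]["config"]["model"]
    declared = config["vector_store"]["config"]["embedding_model_dims"]
    expected = KNOWN_EMBEDDING_DIMS.get(model)
    if expected is None:
        return  # unknown model: nothing to compare against
    if declared != expected:
        raise ValueError(
            f"{model} produces {expected}-dim vectors, "
            f"but embedding_model_dims is {declared}"
        )


# Trimmed-down copy of the config above, kept here so the example is self-contained.
example_config = {
    "vector_store": {"provider": "qdrant", "config": {"embedding_model_dims": 1536}},
    "embedder": {"provider": "openai", "config": {"model": "text-embedding-3-small"}},
}

check_embedding_dims(example_config)
print("embedding dimensions look consistent")
```

Running a check like this before creating the agent surfaces a mismatch early, rather than at the first insert into the vector store.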


### Testing Checklist

- Verify the agent remembers preferences across multiple sessions.
- Confirm `search_memories()` returns expected memories.
- Compare shared vs isolated contexts (`user_id`).
- Validate tool use with memory-backed prompts.
- Measure latency and token usage with and without Mem0.
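
For the latency item in the checklist, a small timing wrapper is usually enough. The sketch below is one possible approach: `timed` is a hypothetical helper and `fake_chat` is a stand-in for an agent call so the example runs without any API keys.

```python
import asyncio
import time


async def timed(coro_fn, *args, **kwargs):
    """Await `coro_fn(*args, **kwargs)` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro_fn(*args, **kwargs)
    return result, time.perf_counter() - start


async def fake_chat(message: str) -> str:
    """Hypothetical stand-in for an agent call; sleeps briefly to simulate latency."""
    await asyncio.sleep(0.01)
    return f"echo: {message}"
```

In a notebook cell you could then write `reply, seconds = await timed(agent.achat, "Hi, my name is David.")` and compare the timings of a memory-backed agent against a plain one. Outside a notebook, wrap the call in `asyncio.run(...)`.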
