# 03 - Memory Store & Recall: Context-Enhanced Responses

This notebook demonstrates **how memory transforms agent responses** by comparing
answers **with** and **without** memory context.

## What You'll Learn
1. **Baseline** - LLM response without any memory (generic, unhelpful)
2. **Hybrid recall** - how ST + LT memories are retrieved with a token budget
3. **Context injection** - how recalled memories enhance LLM responses
4. **Token budget** - how the 60/40 ST/LT split works
5. **Full agent loop** - multi-turn conversation where memory progressively improves answers

## Prerequisites
- Run Notebook 02 first (to populate ST and LT memories)

In [1]:
# ── Setup ──
import sys, os, re
sys.path.insert(0, "../src")

from dotenv import load_dotenv
load_dotenv()

# Configure loguru
from infrastructure.log import setup_logging
from loguru import logger
setup_logging("INFO", for_notebook=True)

import pandas as pd
from datetime import datetime
from sqlalchemy.orm import sessionmaker

from memory import (
    ShortTermMemoryStore,
    LongTermMemoryStore,
    MemoryRecaller,
)
from services.crm_service import get_crm_client
from infrastructure.db import create_tables, get_sql_engine
from infrastructure.db.crm_models import Patient, Booking, Doctor, Location, Specialty
from infrastructure.llm import get_chat_llm, get_default_embeddings

create_tables()
embedder = get_default_embeddings()
llm = get_chat_llm()
crm = get_crm_client()

st_store = ShortTermMemoryStore()
lt_store = LongTermMemoryStore(embedder)
recaller = MemoryRecaller(st_store, lt_store)

logger.success("✅ Short-term memory : Supabase (st_turns)")
logger.success("✅ Long-term memory  : Supabase pgvector")
logger.success("✅ Recaller          : ready")



ℹ️ ✓ Supabase SQL engine created
ℹ️ ✅ Supabase connection test: SUCCESS
ℹ️ ✅ pgvector extension: INSTALLED
ℹ️ ✓ Schema validation passed: vector(1536)
ℹ️ ✓ Database tables created/verified
✅ ✅ Short-term memory : Supabase (st_turns)
✅ ✅ Long-term memory  : Supabase pgvector
✅ ✅ Recaller          : ready


---

## Part 1 · Baseline - Query WITHOUT Memory

The LLM has **no context** about the user. Answers are generic.

In [2]:
query = "What medications does Anushka take and does she have any allergies?"

print(f"📝 Query: {query}")
print("\n🤖 Baseline Answer (NO memory):")
print("-" * 60)

response = llm.invoke(query)
baseline_answer = response.content if hasattr(response, "content") else str(response)
print(baseline_answer)
print("-" * 60)
logger.error("\n❌ The LLM has no idea about the user - generic / unhelpful response.")

📝 Query: What medications does Anushka take and does she have any allergies?

🤖 Baseline Answer (NO memory):
------------------------------------------------------------
I cannot provide you with information about Anushka's medications or allergies. That kind of information is private medical data and I do not have access to it.
------------------------------------------------------------
❌
❌ The LLM has no idea about the user - generic / unhelpful response.


---

## Part 2 · Hybrid Memory Recall (ST + LT)

The `MemoryRecaller` combines:
- **Short-term (60% budget)**: Recent conversation turns (conversational continuity)
- **Long-term (40% budget)**: Distilled facts retrieved via cosine similarity (personalisation)

Both are subject to a **token budget** (default: 500 tokens) to keep prompts efficient.
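As a rough illustration of the long-term half of this recall, the sketch below ranks stored facts by cosine similarity to a query vector. It is not the `MemoryRecaller` implementation: `cosine_similarity`, `top_k_facts`, and the toy 3-dimensional vectors are hypothetical stand-ins for the real embeddings stored in pgvector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_facts(query_vec, facts, k):
    """Return the texts of the k facts whose vectors are most similar to query_vec."""
    ranked = sorted(
        facts,
        key=lambda fact: cosine_similarity(query_vec, fact[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors (assumption): real embeddings come from the embedder, not hand-written lists.
facts = [
    ("Anushka is allergic to penicillin.", [0.9, 0.1, 0.0]),
    ("Anushka has a meniscus tear follow-up with orthopedics.", [0.0, 0.2, 0.9]),
    ("Anushka needs to inform her doctor about current medications", [0.6, 0.7, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

for text in top_k_facts(query_vec, facts, k=2):
    print(text)
```

In the notebook itself the similarity search runs inside the database; this in-memory version only shows the ranking idea.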

In [3]:
# ── Intelligent phone extraction ──
def extract_phone(text: str) -> str:
    """Extract and normalise a Sri Lankan phone number from free-form text."""
    match = re.search(r"\+?[\d][\d\s\-\.\(\)]{7,18}[\d]", text)
    if not match:
        raise ValueError("❌ No phone number found in the message!")
    raw = re.sub(r"\D", "", match.group())
    if raw.startswith("0") and len(raw) == 10:
        raw = "94" + raw[1:]           # local → international
    elif len(raw) == 9 and not raw.startswith("94"):
        raw = "94" + raw               # bare subscriber number
    logger.info(f"   Normalised → {raw}")
    return raw


# ── Identify user from a chat message (same as NB02) ──
# Using a different format to show the extractor handles it
greeting = "I'm back! It's Anushka. My mobile is +94-781-030-736, checking on my records."

user_id = extract_phone(greeting)
logger.success(f"📱 Extracted phone → user_id = {user_id}")

session_id = "nb02-demo"   # same session as NB02 so we can recall its turns

# ── Show patient record from CRM ──
patient = crm.get_patient_by_user_id(user_id)
if patient:
    df = pd.DataFrame([{
        "Field": k, "Value": v
    } for k, v in {
        "Patient ID": patient["patient_id"],
        "Full Name": patient["full_name"],
        "Phone": patient.get("phone", "-"),
        "DOB": patient.get("dob", "-"),
    }.items()])
    print("📋 Patient Record  (Supabase → patients table)")
    display(df.style.hide(axis="index"))
else:
    logger.warning("⚠️  Patient not found in CRM")

# ── Recall memories for this user ──
print("\n🧠 Recalling memories...\n")

st_turns, lt_facts = recaller.recall(
    user_id=user_id,
    session_id=session_id,
    query=query,
    k_st=6,
    k_lt=5,
    max_tokens=500,
)

print("📤 Retrieved:")
print(f"  Short-term turns : {len(st_turns)}")
print(f"  Long-term facts  : {len(lt_facts)}")

# Show what was recalled
if st_turns:
    print(f"\n Short-Term (recent conversation) ")
    for t in st_turns:
        emoji = "👤" if t.role == "user" else "🤖"
        print(f"  {emoji} {t.content[:80]}")

if lt_facts:
    print(f"\n Long-Term (distilled facts) ")
    for i, f in enumerate(lt_facts, 1):
        print(f"  {i}. {f.text}  [tags: {', '.join(f.tags)}]")

ℹ️    Normalised → 94781030736
✅ 📱 Extracted phone → user_id = 94781030736
📋 Patient Record  (Supabase → patients table)


| Field | Value |
|---|---|
| Patient ID | 12692f5d-1630-4ecd-bf7e-bcfd08260b73 |
| Full Name | Anushka Perera |
| Phone | +94781030736 |
| DOB | 1985-03-15 |



🧠 Recalling memories...

ℹ️ Retrieved 5 facts from LT memory for user 94781030736
ℹ️ Recalled 4 ST turns, 3 LT facts for user 94781030736
📤 Retrieved:
  Short-term turns : 4
  Long-term facts  : 3

 Short-Term (recent conversation) 
  👤 I'm allergic to penicillin, please always remember this.
  🤖 Important! I've noted your penicillin allergy. This is critical information.
  👤 Also remind me that I have a meniscus tear follow-up with orthopedics.
  🤖 Noted! I'll remember your orthopedics follow-up for the meniscus tear.

 Long-Term (distilled facts) 
  1. Anushka is allergic to penicillin.  [tags: allergy, penicillin, allergic_reaction]
  2. Anushka is allergic to penicillin.  [tags: allergy, penicillin, allergic_reaction, important]
  3. Anushka needs to inform her doctor about current medications  [tags: medication, communication, appointment]


---

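## Part 3 · Token Budget Analysis

The recaller splits a 500-token cap into **60% short-term** (300 tokens) and **40% long-term** (200 tokens).
The percentages printed by the cell are each store's share of the tokens actually used, so they need not match 60/40 when recall stays well under the cap.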
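The arithmetic behind the split can be sketched as follows. This assumes the two shares act as independent caps, which the notebook does not confirm; `split_budget`, `take_within_budget`, and the whitespace-based `count_words` are hypothetical helpers, not part of `MemoryRecaller`.

```python
def split_budget(max_tokens, st_percent=60):
    """Split a token cap into (short-term, long-term) budgets using integer maths."""
    st_budget = max_tokens * st_percent // 100
    lt_budget = max_tokens - st_budget
    return st_budget, lt_budget


def take_within_budget(items, budget, count_tokens):
    """Keep items in order until the next one would exceed the budget."""
    kept, used = [], 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept, used


def count_words(text):
    """Crude stand-in for a tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


st_budget, lt_budget = split_budget(500)
print(f"ST budget: {st_budget}, LT budget: {lt_budget}")

kept, used = take_within_budget(
    ["Anushka is allergic to penicillin.", "Anushka needs to inform her doctor about current medications"],
    budget=lt_budget,
    count_tokens=count_words,
)
print(f"Kept {len(kept)} facts using {used} word-tokens")
```

Under this reading, a run that uses far fewer tokens than the cap is shaped by how much content exists in each store, not by the 60/40 ratio.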

In [4]:
st_tokens = sum(recaller.count_tokens(t.content) for t in st_turns)
lt_tokens = sum(recaller.count_tokens(f.text) for f in lt_facts)
total_tokens = st_tokens + lt_tokens

print("📊 Token Budget Allocation:")
print(f"   Target  : ≤500 tokens")
print(f"   Actual  : {total_tokens} tokens")
print()
if total_tokens > 0:
    print(f"   ST (60% target) : {st_tokens} tokens ({st_tokens/total_tokens*100:.1f}%)")
    print(f"   LT (40% target) : {lt_tokens} tokens ({lt_tokens/total_tokens*100:.1f}%)")
print()
# Log at the level that matches the outcome instead of always using success
if total_tokens <= 500:
    logger.success("   ✅ Within budget!")
else:
    logger.warning(f"   ⚠️ Over budget by {total_tokens - 500} tokens")

📊 Token Budget Allocation:
   Target  : ≤500 tokens
   Actual  : 97 tokens

   ST (60% target) : 66 tokens (68.0%)
   LT (40% target) : 31 tokens (32.0%)

✅    ✅ Within budget!


---

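## Part 4 · Query WITH Memory Context

Now we inject the recalled memories into the prompt and compare the answer with the baseline.
The recalled context contains the penicillin allergy but no medication list, so expect the model to answer only the allergy half of the question.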

In [5]:
# Format recalled memories as a context string
memory_context = recaller.format_context(st_turns, lt_facts)

print("📝 Memory context that gets injected into the prompt:\n")
print(memory_context)
print("-" * 60)

📝 Memory context that gets injected into the prompt:

=== RECENT CONVERSATION ===
User: I'm allergic to penicillin, please always remember this.
Assistant: Important! I've noted your penicillin allergy. This is critical information.
User: Also remind me that I have a meniscus tear follow-up with orthopedics.
Assistant: Noted! I'll remember your orthopedics follow-up for the meniscus tear.

=== REMEMBERED FACTS ===
1. Anushka is allergic to penicillin. [allergy, penicillin, allergic_reaction]
2. Anushka is allergic to penicillin. [allergy, penicillin, allergic_reaction, important]
3. Anushka needs to inform her doctor about current medications [medication, communication, appointment]

------------------------------------------------------------
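The layout printed above can be reproduced with a few lines of string formatting. The sketch below is a guess at the shape of the output only; `format_context_sketch` is a hypothetical name, and it takes plain tuples rather than the turn and fact objects that `recaller.format_context` receives.

```python
def format_context_sketch(turns, facts):
    """Render turns [(role, content)] and facts [(text, tags)] as a prompt context string."""
    lines = []
    if turns:
        lines.append("=== RECENT CONVERSATION ===")
        for role, content in turns:
            speaker = "User" if role == "user" else "Assistant"
            lines.append(f"{speaker}: {content}")
        lines.append("")  # blank line between the two sections
    if facts:
        lines.append("=== REMEMBERED FACTS ===")
        for i, (text, tags) in enumerate(facts, 1):
            lines.append(f"{i}. {text} [{', '.join(tags)}]")
    return "\n".join(lines).strip()


turns = [
    ("user", "I'm allergic to penicillin, please always remember this."),
    ("assistant", "Important! I've noted your penicillin allergy. This is critical information."),
]
facts = [
    ("Anushka is allergic to penicillin.", ["allergy", "penicillin", "allergic_reaction"]),
]

context = format_context_sketch(turns, facts)
print(context)
```

Keeping the two sections under separate headers lets the prompt distinguish what was just said from what was distilled earlier.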


In [6]:
# Build prompt WITH memory and query the LLM
prompt_with_memory = f"""{memory_context}

USER QUERY: {query}

Answer based on the information above:"""

print(f"📝 Query: {query}")
print("\n🤖 Answer WITH Memory:")
print("-" * 60)

response = llm.invoke(prompt_with_memory)
memory_answer = response.content if hasattr(response, "content") else str(response)
print(memory_answer)
print("-" * 60)
logger.success("\n✅ With memory: the LLM now reports the user's penicillin allergy instead of refusing.")

📝 Query: What medications does Anushka take and does she have any allergies?

🤖 Answer WITH Memory:
------------------------------------------------------------
Anushka is allergic to penicillin. I do not have information about what medications she takes.
------------------------------------------------------------
✅
✅ With memory: the LLM now reports the user's penicillin allergy instead of refusing.


---

## Part 5 · Side-by-Side Comparison

In [7]:
print("=" * 72)
print("📊 MEMORY RECALL EFFECTIVENESS")
print("=" * 72)

logger.error("\n❌ WITHOUT Memory:")
print(f"   {baseline_answer[:200]}{'...' if len(baseline_answer) > 200 else ''}")

logger.success(f"\n✅ WITH Memory ({total_tokens} tokens injected):")
print(f"   {memory_answer[:200]}{'...' if len(memory_answer) > 200 else ''}")

logger.info(f"\n🎯 Key Benefit:")
print(f"   Hybrid recall (60% ST / 40% LT) provides both:")
print(f"   • Conversational continuity (ST - what was just discussed)")
print(f"   • Long-term knowledge (LT - distilled facts and preferences)")
print(f"   • Token-efficient ({total_tokens}/500 tokens used)")
print("=" * 72)

📊 MEMORY RECALL EFFECTIVENESS
❌
❌ WITHOUT Memory:
   I cannot provide you with information about Anushka's medications or allergies. That kind of information is private medical data and I do not have access to it.
✅
✅ WITH Memory (97 tokens injected):
   Anushka is allergic to penicillin. I do not have information about what medications she takes.
ℹ️
🎯 Key Benefit:
   Hybrid recall (60% ST / 40% LT) provides both:
   • Conversational continuity (ST - what was just discussed)
   • Long-term knowledge (LT - distilled facts and preferences)
   • Token-efficient (97/500 tokens used)


---

## Part 6 · Full Agent with Memory - Progressive Context Building

Watch how the **agent accumulates context** across multiple turns.
Each turn is appended to short-term memory; when distillation is triggered, it extracts long-term facts from those turns.
In the final turn, the agent is asked to summarise the user's conditions and medications from memory alone.

In [8]:
from agents import build_agent

agent = build_agent(enable_crm=True, enable_rag=True, enable_web=True)

# The first message includes the user's phone - extract it like a real system
# Using local format (078…) to show the normaliser in action
first_msg = "Hi, I'm Anushka. My mobile is 078 103 0736. I have a cardiac stress test coming up."

RECALL_USER = extract_phone(first_msg)
RECALL_SESSION = "nb03-recall"

logger.success(f"📱 Extracted phone → RECALL_USER = {RECALL_USER}")

# Show patient info from CRM for context
patient = crm.get_patient_by_user_id(RECALL_USER)
if patient:
    df = pd.DataFrame([{
        "Field": k, "Value": v
    } for k, v in {
        "Full Name": patient["full_name"],
        "Phone": patient.get("phone", "-"),
        "DOB": patient.get("dob", "-"),
    }.items()])
    print("📋 Patient Record  (Supabase → patients table)")
    display(df.style.hide(axis="index"))

# A conversation that progressively builds memory
# Tailored to Anushka Perera's CRM data:
#   - Cardiology bookings (stress test, cardiac risk)
#   - Orthopedics booking (meniscus tear)
#   - Dermatology booking (actinic keratosis monitoring)
messages = [
    first_msg,                                                            # → direct (identity + condition)
    "I take atenolol 50mg every morning for blood pressure. Please remember this.",
    "I'm allergic to penicillin. Very important - always remember!",
    "Can you find me a cardiologist?",                                    # → CRM tool
    "What is the medication administration policy at the hospital?",       # → RAG tool
    "What do you remember about my health conditions and medications?",    # → direct (from memory)
]

print("\n🔄 Progressive Memory Building")
print("=" * 72)

for i, msg in enumerate(messages, 1):
    print(f"\n{'─' * 72}")
    print(f"👤 Turn {i}: {msg}")
    print(f"{'─' * 72}")
    
    resp = agent.chat(
        user_message=msg,
        user_id=RECALL_USER,
        session_id=RECALL_SESSION,
    )
    
    # Show route and memory context size
    ctx_lines = len(resp.memory_context.strip().split("\n")) if resp.memory_context.strip() else 0
    print(f"🛤️  Route: {resp.route}" + (f" / {resp.action}" if resp.action else ""))
    print(f"📝 Memory context: {ctx_lines} lines")
    print(f"⏱️  {resp.latency_ms}ms")
    print(f"🤖 {resp.answer[:300]}{'...' if len(resp.answer) > 300 else ''}")

print(f"\n{'=' * 72}")
logger.success("✅ Progressive memory building complete!")
print("   Notice how memory_context grows with each turn.")

ℹ️ LangFuse client initialised (host=https://us.cloud.langfuse.com)
ℹ️ LLM models loaded:
ℹ️    Chat (synthesis) : google/gemini-2.5-flash
ℹ️    Router           : openai/gpt-4o-mini
ℹ️    Extractor        : llama-3.1-8b-instant
ℹ️ ✓ CRM tool loaded
ℹ️ Connected to Qdrant Cloud at https://025872ed-03a9-42c8-84ff-5caad58a460b.us-east-1-1.aws.cloud.qdrant.io
ℹ️ ✓ Qdrant KB ready - collection 'nawaloka' has 124 points, skipping ingestion
ℹ️ ✓ CAG cache ready (Qdrant collection='cag_cache', dim=1536, threshold=0.90)
ℹ️ RAGTool initialised: CAG cache (CAGCache(collection='cag_cache', threshold=0.9, ttl=86400s, entries=68, backend='qdrant')) -> CRAG (k=4, expanded_k=8, threshold=0.60)
ℹ️ ✓ RAG tool loaded (CAG-enabled)
ℹ️ CAG cache HIT (sim=1.000): 'Wh

| Field | Value |
|---|---|
| Full Name | Anushka Perera |
| Phone | +94781030736 |
| DOB | 1985-03-15 |



🔄 Progressive Memory Building


👤 Turn 1: Hi, I'm Anushka. My mobile is 078 103 0736. I have a cardiac stress test coming up.

ℹ️ Retrieved 5 facts from LT memory for user 94781030736
ℹ️ Recalled 4 ST turns, 3 LT facts for user 94781030736
ℹ️ Route: crm (action=lookup_patient, conf=0.90) - The user provided their mobile number and mentioned an upcoming cardiac stress test, indicating a need for patient lookup or related actions.
ℹ️ Dispatching CRM action: lookup_patient params={'phone': '078 103 0736'}
ℹ️ Triggering memory distillation for 94781030736
ℹ️ Upserted 4 facts to LT memory (0 new, 4 merged)
ℹ️ Distilled 4 facts for user 94781030736
🛤️  Route: crm / lookup_patient
📝 Memory context: 10 lines
⏱️  25729ms
🤖 Hello Anushka!

It's great to hear from you. I see you have a cardiac stress test coming up. I've noted your mobil

---

## Part 7 ¬∑ Memory Lifecycle Summary

```
Conversation Turn
       │
       ▼
┌──────────────────────┐
│  SHORT-TERM MEMORY   │  ← Stored immediately (ring buffer)
│  (last N turns)      │     Retrieved by recency
└──────────┬───────────┘
           │  distillation triggered?
           ▼
┌──────────────────────┐
│  DISTILLER (LLM)     │  ← Extracts facts from conversation
│  "remember", ≥5 turns│
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  LONG-TERM MEMORY    │  ← Stored with pgvector embedding
│  (semantic facts)    │     Retrieved by cosine similarity
└──────────────────────┘

EPISODIC: Full sessions stored at end-of-conversation
PROCEDURAL: Pre-loaded workflows retrieved by intent similarity
```
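The "ring buffer" behaviour of short-term memory in the diagram can be illustrated with a bounded deque. This is an in-memory sketch only: the notebook's `ShortTermMemoryStore` persists turns in Supabase, and `ShortTermBuffer` with its `add` and `recent` methods is a hypothetical stand-in.

```python
from collections import deque


class ShortTermBuffer:
    """Keeps only the most recent max_turns turns; older ones fall off the front."""

    def __init__(self, max_turns):
        self._turns = deque(maxlen=max_turns)

    def add(self, role, content):
        # Appending to a full deque silently discards the oldest entry
        self._turns.append((role, content))

    def recent(self, k):
        """Return up to k of the newest turns, oldest first."""
        if k <= 0:
            return []
        return list(self._turns)[-k:]


buf = ShortTermBuffer(max_turns=4)
for i in range(1, 7):
    buf.add("user", f"turn {i}")

print([content for _, content in buf.recent(2)])
print(len(buf.recent(10)))
```

After six inserts into a four-slot buffer, only turns 3 to 6 remain, which is why retrieval "by recency" needs no explicit clean-up step.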

### Key Takeaways

1. **Memory makes agents personal** - the same LLM gives generic vs. specific answers
2. **Token budget prevents bloat** - 60/40 ST/LT split within 500 tokens
3. **Hybrid recall** - combines recency (ST) with relevance (LT cosine similarity)
4. **Progressive context** - memory grows across turns within a session
5. **Cross-session persistence** - LT facts survive across sessions (stored in pgvector)
6. **Distillation is policy-driven** - it runs only when the trigger policy fires (the diagram above lists a "remember" cue and ≥5 turns), not on every turn
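A trigger policy of the kind hinted at by the diagram's `"remember", ≥5 turns` label might look like the sketch below. The real policy is not shown in this notebook, so `should_distill`, its parameters, and the way the two conditions are combined with "or" are all assumptions.

```python
def should_distill(message, turns_since_last, min_turns=5, keywords=("remember",)):
    """Return True if the message contains a cue word or enough turns have accumulated."""
    text = message.lower()
    if any(keyword in text for keyword in keywords):
        return True
    return turns_since_last >= min_turns


examples = [
    ("I take atenolol 50mg every morning for blood pressure. Please remember this.", 1),
    ("Can you find me a cardiologist?", 2),
    ("What is the medication administration policy at the hospital?", 5),
]
for message, turns in examples:
    print(should_distill(message, turns), "-", message[:40])
```

Gating distillation this way keeps the extractor LLM from being called on every turn, at the cost of facts reaching long-term memory slightly later.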