A conversational AI system using Ollama with hybrid memory management that combines:
- Sliding window context (last 10 turns by default)
- Long-term vector memory storage (ChromaDB)
- Asynchronous memory updates
- Session persistence
✨ Hybrid Memory Architecture
- Maintains recent conversation context for natural flow
- Stores long-term memories in vector database
- Semantic search for relevant memory retrieval
🚀 Asynchronous Processing
- Non-blocking memory extraction and storage
- Smooth conversation experience
- Background memory updates
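
A minimal sketch of the fire-and-forget pattern this describes, built on `asyncio.create_task`. The function bodies below are placeholders, not the project's actual code:

```python
import asyncio

async def extract_memories(user_msg: str, assistant_msg: str) -> list[str]:
    """Placeholder for the real extractor (pattern rules or an LLM call)."""
    await asyncio.sleep(1)                      # simulate a slow extraction step
    return [f"FACT: user said '{user_msg}'"]

async def store_memories(user_msg: str, assistant_msg: str) -> None:
    facts = await extract_memories(user_msg, assistant_msg)
    print("stored:", facts)                     # real code would write these to ChromaDB

async def chat_turn(user_msg: str) -> str:
    reply = f"echo: {user_msg}"                 # real code would stream this from Ollama
    # Fire-and-forget: return the reply immediately while extraction
    # and storage keep running in the background.
    asyncio.create_task(store_memories(user_msg, reply))
    return reply

async def main() -> None:
    print(await chat_turn("My name is Alex"))   # prints immediately
    await asyncio.sleep(2)                      # give the background task time to finish

asyncio.run(main())
```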
💾 Session Management
- Save and resume conversations
- Persistent memory across sessions
- Export conversation history
🎯 Smart Memory Extraction
- Automatic fact and preference detection
- Entity recognition
- Conversation summarization
- Python 3.10+
- Ollama - Install from ollama.ai
- Gemma3:4b-it-qat model - Pull with:
ollama pull gemma3:4b-it-qat
- Clone or download this repository
- Install dependencies:
pip install -r requirements.txt
- Ensure Ollama is running:
ollama serve
- Verify the model is available:
ollama list
python main.py
- /new - Start a new conversation
- /continue - Load a previous session
- /memories - View stored memories
- /save - Save current session
- /clear - Clear context window (keeps long-term memory)
- /exit - Save and quit
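
Internally, these commands could be routed by a loop shaped roughly like the sketch below; the real handlers live in main.py, and this is only illustrative:

```python
def chat_loop() -> None:
    """Bare-bones shape of the interactive loop; the real handlers live in main.py."""
    context: list[dict] = []
    while True:
        user_input = input("You: ").strip()
        if user_input == "/clear":
            context.clear()                     # forget recent turns, keep long-term memory
            continue
        if user_input == "/exit":
            print("Saving session and exiting...")
            break
        if user_input.startswith("/"):
            print(f"({user_input} would be handled here)")
            continue
        context.append({"role": "user", "content": user_input})
        print("Assistant: ...")                 # the real loop streams a reply from Ollama
```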
You: Hi! My name is Alex and I'm a Python developer from Seattle.
Assistant: Hello Alex! Nice to meet you! It's great to connect with a fellow Python developer...
You: What's my name?
Assistant: Your name is Alex! You mentioned you're a Python developer from Seattle.
# System automatically extracts and stores:
# - FACT: User's name is Alex
# - FACT: User is a Python developer
# - FACT: User is from Seattle
The system will remember these facts even in new sessions!
Edit config.py to customize:
# Model settings
OLLAMA_MODEL = "gemma3:4b-it-qat"
OLLAMA_HOST = "http://localhost:11434"
# Context window size
CONTEXT_WINDOW_SIZE = 10 # Recent turns to keep
# Memory retrieval
MEMORY_RETRIEVAL_COUNT = 5 # Top memories to retrieve
# Memory extraction frequency
EXTRACTION_FREQUENCY = 1 # Extract after every N turns
llm-memory/
├── main.py                # Entry point and chat loop
├── config.py              # Configuration settings
├── requirements.txt       # Dependencies
├── memory/
│   ├── vector_store.py    # ChromaDB interface
│   ├── extractor.py       # Memory extraction logic
│   └── retriever.py       # Memory retrieval and formatting
├── conversation/
│   ├── ollama_client.py   # Ollama API wrapper
│   └── context_manager.py # Sliding window management
├── sessions/              # Saved conversation sessions
└── chroma_db/             # Vector database storage
- User sends a message
- System retrieves relevant memories from vector DB
- Builds prompt with: system prompt + memories + recent context
- Streams response from Ollama
- Adds turn to context window (sliding window of last 10 turns)
- Asynchronously extracts and stores new memories
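
Put together, a single turn might look roughly like the following sketch using the ollama Python client. The `retrieve_memories` stub stands in for the ChromaDB search described below, and the names are illustrative rather than the project's actual API:

```python
from collections import deque

import ollama

context = deque(maxlen=2 * 10)                  # 10 turns = 10 user + 10 assistant messages

def retrieve_memories(query: str) -> list[str]:
    """Stand-in for the ChromaDB semantic search (see memory/retriever.py)."""
    return ["FACT: User's name is Alex"]

def chat(user_msg: str) -> str:
    memories = retrieve_memories(user_msg)
    system_prompt = ("You are a helpful assistant.\n"
                     "Known facts about the user:\n" + "\n".join(memories))
    messages = [{"role": "system", "content": system_prompt},
                *context,
                {"role": "user", "content": user_msg}]

    reply = ""
    for chunk in ollama.chat(model="gemma3:4b-it-qat", messages=messages, stream=True):
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)        # stream tokens to the terminal as they arrive
        reply += piece

    # Append the new turn; the oldest messages fall off once the window is full.
    context.append({"role": "user", "content": user_msg})
    context.append({"role": "assistant", "content": reply})
    # A real turn would also kick off asynchronous memory extraction here.
    return reply
```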
- Facts: Personal information (name, location, occupation)
- Preferences: Likes, dislikes, favorites
- Entities: Names, places, dates mentioned
- Summaries: Conversation topic summaries
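
One way these categories could be represented before being embedded and stored (a hypothetical structure, not necessarily the project's exact schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Memory:
    kind: str                      # "fact" | "preference" | "entity" | "summary"
    text: str                      # e.g. "User is a Python developer"
    conversation_id: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

m = Memory(kind="fact", text="User's name is Alex", conversation_id="session-001")
```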
Pattern-based extraction:
- "I am/I'm [something]"
- "My name is [name]"
- "I like/love/prefer [thing]"
- "I live in [place]"
LLM-based extraction:
- Uses lightweight prompting to extract facts/preferences
- Runs asynchronously to avoid blocking conversation
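
The LLM pass might be prompted roughly like this, using the ollama Python client; the prompt wording and function name are illustrative:

```python
import ollama

EXTRACTION_PROMPT = (
    "Extract durable facts and preferences about the user from this exchange.\n"
    "Return one item per line, or NONE if there is nothing worth remembering.\n\n"
    "User: {user}\nAssistant: {assistant}"
)

def extract_with_llm(user_msg: str, assistant_msg: str) -> list[str]:
    """Ask the model for memorable facts; the real system runs this in the background."""
    response = ollama.chat(
        model="gemma3:4b-it-qat",
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(user=user_msg, assistant=assistant_msg)}],
    )
    text = response["message"]["content"].strip()
    if text.upper().startswith("NONE"):
        return []
    return [line.strip() for line in text.splitlines() if line.strip()]
```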
- Semantic similarity search using embeddings
- Top 5 most relevant memories retrieved per query
- Formatted and injected into system prompt
- Maintains conversation context across sessions
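
A minimal sketch of that retrieval step with ChromaDB's Python API; the collection name and prompt formatting are assumptions, and the real code lives in memory/vector_store.py and memory/retriever.py:

```python
import chromadb

client = chromadb.PersistentClient(path="chroma_db")       # same directory as in the project layout
memories = client.get_or_create_collection("memories")     # collection name is an assumption

# Store a memory; embeddings come from the collection's default embedding function.
memories.upsert(ids=["fact-001"], documents=["User is a Python developer from Seattle"])

def retrieve(query: str, k: int = 5) -> str:
    """Return up to k relevant memories, formatted for injection into the system prompt."""
    results = memories.query(query_texts=[query], n_results=min(k, memories.count()))
    docs = results["documents"][0]
    return "Relevant memories:\n" + "\n".join(f"- {doc}" for doc in docs)

print(retrieve("What do you know about my work?"))
```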
You: I prefer async code
Assistant: [responds]
You: Can you make that async?
Assistant: [understands "that" refers to previous code]
Session 1:
You: I love Python and FastAPI
[exit and restart]
Session 2:
You: What frameworks do I like?
Assistant: You mentioned you love FastAPI!
The system maintains consistent performance even with 50+ turns by:
- Keeping only last 10 turns in active context
- Storing older information in vector DB
- Retrieving only relevant memories on-demand
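
A toy sketch of why the prompt stays bounded with a fixed-size window; the ContextManager class shown here is illustrative, not the actual implementation in conversation/context_manager.py:

```python
from collections import deque

class ContextManager:
    """Keeps only the most recent turns; older information lives on in the vector store."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)    # old turns are dropped automatically

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append({"user": user_msg, "assistant": assistant_msg})

    def clear(self) -> None:                    # what /clear does; long-term memory is untouched
        self.turns.clear()

ctx = ContextManager()
for i in range(50):                             # 50+ turns, but the active context stays bounded
    ctx.add_turn(f"message {i}", f"reply {i}")
print(len(ctx.turns))                           # 10
```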
# Start Ollama server
ollama serve
# Check if running
curl http://localhost:11434/api/tags
# Pull the model
ollama pull gemma3:4b-it-qat
# Verify
ollama list
# Delete and reinitialize
rm -rf chroma_db/
# Restart the application
Edit config.py:
OLLAMA_MODEL = "llama3.2" # or any other Ollama model
A larger window keeps more recent context but consumes more tokens:
CONTEXT_WINDOW_SIZE = 15 # Increase from default 10
View memories:
# In chat
/memories
Clear conversation memories:
# In Python
vector_store.clear_conversation_memories(conversation_id)
- chromadb - Vector database for memory storage
- sentence-transformers - Embedding generation
- ollama - Ollama Python client
- rich - Beautiful terminal UI
- aiofiles - Async file operations
- Memory extraction: Runs asynchronously, doesn't block conversation
- Context window: Fixed size prevents token bloat
- Vector search: O(log n) retrieval time
- Session files: JSON format, easily portable
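
Because session files are plain JSON, a save/resume round trip could look roughly like this sketch (paths and field names are illustrative, not the project's exact schema):

```python
import json
from pathlib import Path

SESSIONS_DIR = Path("sessions")                 # matches the project layout above

def save_session(session_id: str, turns: list[dict]) -> Path:
    SESSIONS_DIR.mkdir(exist_ok=True)
    path = SESSIONS_DIR / f"{session_id}.json"
    path.write_text(json.dumps({"id": session_id, "turns": turns}, indent=2))
    return path

def load_session(session_id: str) -> list[dict]:
    data = json.loads((SESSIONS_DIR / f"{session_id}.json").read_text())
    return data["turns"]

save_session("demo", [{"user": "Hi!", "assistant": "Hello Alex!"}])
print(load_session("demo"))
```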
- Memory importance scoring
- Memory decay (older memories fade)
- Multi-user support
- Web UI interface
- Conversation branching
- Memory conflict resolution
- Export to various formats
MIT License - Feel free to use and modify!
Contributions welcome! Please feel free to submit issues or pull requests.