A conversational AI system using Ollama with hybrid memory management that combines:
- Sliding window context (last 10 turns by default)
- Long-term vector memory storage (ChromaDB)
- Asynchronous memory updates
- Session persistence
✨ Hybrid Memory Architecture
- Maintains recent conversation context for natural flow
- Stores long-term memories in vector database
- Semantic search for relevant memory retrieval
🚀 Asynchronous Processing
- Non-blocking memory extraction and storage
- Smooth conversation experience
- Background memory updates
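
A minimal sketch of the fire-and-forget pattern this describes, built on `asyncio.create_task`. The function bodies below are placeholders, not the project's actual code:

```python
import asyncio

async def extract_memories(user_msg: str, assistant_msg: str) -> list[str]:
    """Placeholder for the real extractor (pattern rules or an LLM call)."""
    await asyncio.sleep(1)                      # simulate a slow extraction step
    return [f"FACT: user said '{user_msg}'"]

async def store_memories(user_msg: str, assistant_msg: str) -> None:
    facts = await extract_memories(user_msg, assistant_msg)
    print("stored:", facts)                     # real code would write these to ChromaDB

async def chat_turn(user_msg: str) -> str:
    reply = f"echo: {user_msg}"                 # real code would stream this from Ollama
    # Fire-and-forget: return the reply immediately while extraction
    # and storage keep running in the background.
    asyncio.create_task(store_memories(user_msg, reply))
    return reply

async def main() -> None:
    print(await chat_turn("My name is Alex"))   # prints immediately
    await asyncio.sleep(2)                      # give the background task time to finish

asyncio.run(main())
```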
💾 Session Management
- Save and resume conversations
- Persistent memory across sessions
- Export conversation history
🎯 Smart Memory Extraction
- Automatic fact and preference detection
- Entity recognition
- Conversation summarization
- Python 3.10+
- Ollama - Install from ollama.ai
- Gemma3:4b-it-qat model - Pull with:
ollama pull gemma3:4b-it-qat
- Clone or download this repository
- Install dependencies:
pip install -r requirements.txt
- Ensure Ollama is running:
ollama serve
- Verify the model is available:
ollama list
python main.py
- /new - Start a new conversation
- /continue - Load a previous session
- /memories - View stored memories
- /save - Save current session
- /clear - Clear context window (keeps long-term memory)
- /exit - Save and quit
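
Internally, these commands could be routed by a loop shaped roughly like the sketch below; the real handlers live in main.py, and this is only illustrative:

```python
def chat_loop() -> None:
    """Bare-bones shape of the interactive loop; the real handlers live in main.py."""
    context: list[dict] = []
    while True:
        user_input = input("You: ").strip()
        if user_input == "/clear":
            context.clear()                     # forget recent turns, keep long-term memory
            continue
        if user_input == "/exit":
            print("Saving session and exiting...")
            break
        if user_input.startswith("/"):
            print(f"({user_input} would be handled here)")
            continue
        context.append({"role": "user", "content": user_input})
        print("Assistant: ...")                 # the real loop streams a reply from Ollama
```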
You: Hi! My name is Alex and I'm a Python developer from Seattle.
Assistant: Hello Alex! Nice to meet you! It's great to connect with a fellow Python developer...
You: What's my name?
Assistant: Your name is Alex! You mentioned you're a Python developer from Seattle.
# System automatically extracts and stores:
# - FACT: User's name is Alex
# - FACT: User is a Python developer
# - FACT: User is from Seattle
The system will remember these facts even in new sessions!
Edit config.py to customize:
# Model settings
OLLAMA_MODEL = "gemma3:4b-it-qat"
OLLAMA_HOST = "http://localhost:11434"
# Context window size
CONTEXT_WINDOW_SIZE = 10 # Recent turns to keep
# Memory retrieval
MEMORY_RETRIEVAL_COUNT = 5 # Top memories to retrieve
# Memory extraction frequency
EXTRACTION_FREQUENCY = 1 # Extract after every N turns
llm-memory/
├── main.py                # Entry point and chat loop
├── config.py              # Configuration settings
├── requirements.txt       # Dependencies
├── memory/
│   ├── vector_store.py    # ChromaDB interface
│   ├── extractor.py       # Memory extraction logic
│   └── retriever.py       # Memory retrieval and formatting
├── conversation/
│   ├── ollama_client.py   # Ollama API wrapper
│   └── context_manager.py # Sliding window management
├── sessions/              # Saved conversation sessions
└── chroma_db/             # Vector database storage
- User sends a message
- System retrieves relevant memories from vector DB
- Builds prompt with: system prompt + memories + recent context
- Streams response from Ollama
- Adds turn to context window (sliding window of last 10 turns)
- Asynchronously extracts and stores new memories
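
Put together, a single turn might look roughly like the following sketch using the ollama Python client. The `retrieve_memories` stub stands in for the ChromaDB search described below, and the names are illustrative rather than the project's actual API:

```python
from collections import deque

import ollama

context = deque(maxlen=2 * 10)                  # 10 turns = 10 user + 10 assistant messages

def retrieve_memories(query: str) -> list[str]:
    """Stand-in for the ChromaDB semantic search (see memory/retriever.py)."""
    return ["FACT: User's name is Alex"]

def chat(user_msg: str) -> str:
    memories = retrieve_memories(user_msg)
    system_prompt = ("You are a helpful assistant.\n"
                     "Known facts about the user:\n" + "\n".join(memories))
    messages = [{"role": "system", "content": system_prompt},
                *context,
                {"role": "user", "content": user_msg}]

    reply = ""
    for chunk in ollama.chat(model="gemma3:4b-it-qat", messages=messages, stream=True):
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)        # stream tokens to the terminal as they arrive
        reply += piece

    # Append the new turn; the oldest messages fall off once the window is full.
    context.append({"role": "user", "content": user_msg})
    context.append({"role": "assistant", "content": reply})
    # A real turn would also kick off asynchronous memory extraction here.
    return reply
```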
- Facts: Personal information (name, location, occupation)
- Preferences: Likes, dislikes, favorites
- Entities: Names, places, dates mentioned
- Summaries: Conversation topic summaries
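
One way these categories could be represented before being embedded and stored (a hypothetical structure, not necessarily the project's exact schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Memory:
    kind: str                      # "fact" | "preference" | "entity" | "summary"
    text: str                      # e.g. "User is a Python developer"
    conversation_id: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

m = Memory(kind="fact", text="User's name is Alex", conversation_id="session-001")
```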
Pattern-based extraction:
- "I am/I'm [something]"
- "My name is [name]"
- "I like/love/prefer [thing]"
- "I live in [place]"
LLM-based extraction:
- Uses lightweight prompting to extract facts/preferences
- Runs asynchronously to avoid blocking conversation
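
The LLM pass might be prompted roughly like this, using the ollama Python client; the prompt wording and function name are illustrative:

```python
import ollama

EXTRACTION_PROMPT = (
    "Extract durable facts and preferences about the user from this exchange.\n"
    "Return one item per line, or NONE if there is nothing worth remembering.\n\n"
    "User: {user}\nAssistant: {assistant}"
)

def extract_with_llm(user_msg: str, assistant_msg: str) -> list[str]:
    """Ask the model for memorable facts; the real system runs this in the background."""
    response = ollama.chat(
        model="gemma3:4b-it-qat",
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(user=user_msg, assistant=assistant_msg)}],
    )
    text = response["message"]["content"].strip()
    if text.upper().startswith("NONE"):
        return []
    return [line.strip() for line in text.splitlines() if line.strip()]
```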
- Semantic similarity search using embeddings
- Top 5 most relevant memories retrieved per query
- Formatted and injected into system prompt
- Maintains conversation context across sessions
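
A minimal sketch of that retrieval step with ChromaDB's Python API; the collection name and prompt formatting are assumptions, and the real code lives in memory/vector_store.py and memory/retriever.py:

```python
import chromadb

client = chromadb.PersistentClient(path="chroma_db")       # same directory as in the project layout
memories = client.get_or_create_collection("memories")     # collection name is an assumption

# Store a memory; embeddings come from the collection's default embedding function.
memories.upsert(ids=["fact-001"], documents=["User is a Python developer from Seattle"])

def retrieve(query: str, k: int = 5) -> str:
    """Return up to k relevant memories, formatted for injection into the system prompt."""
    results = memories.query(query_texts=[query], n_results=min(k, memories.count()))
    docs = results["documents"][0]
    return "Relevant memories:\n" + "\n".join(f"- {doc}" for doc in docs)

print(retrieve("What do you know about my work?"))
```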
You: I prefer async code
Assistant: [responds]
You: Can you make that async?
Assistant: [understands "that" refers to previous code]
Session 1:
You: I love Python and FastAPI
[exit and restart]
Session 2:
You: What frameworks do I like?
Assistant: You mentioned you love FastAPI!
The system maintains consistent performance even with 50+ turns by:
- Keeping only last 10 turns in active context
- Storing older information in vector DB
- Retrieving only relevant memories on-demand
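
A toy sketch of why the prompt stays bounded with a fixed-size window; the ContextManager class shown here is illustrative, not the actual implementation in conversation/context_manager.py:

```python
from collections import deque

class ContextManager:
    """Keeps only the most recent turns; older information lives on in the vector store."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)    # old turns are dropped automatically

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append({"user": user_msg, "assistant": assistant_msg})

    def clear(self) -> None:                    # what /clear does; long-term memory is untouched
        self.turns.clear()

ctx = ContextManager()
for i in range(50):                             # 50+ turns, but the active context stays bounded
    ctx.add_turn(f"message {i}", f"reply {i}")
print(len(ctx.turns))                           # 10
```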
# Start Ollama server
ollama serve
# Check if running
curl http://localhost:11434/api/tags
# Pull the model
ollama pull gemma3:4b-it-qat
# Verify
ollama list
# Delete and reinitialize
rm -rf chroma_db/
# Restart the application
Edit config.py:
OLLAMA_MODEL = "llama3.2" # or any other Ollama model
A larger window keeps more recent context but consumes more tokens:
CONTEXT_WINDOW_SIZE = 15 # Increase from default 10
View memories:
# In chat
/memories
Clear conversation memories:
# In Python
vector_store.clear_conversation_memories(conversation_id)
- chromadb - Vector database for memory storage
- sentence-transformers - Embedding generation
- ollama - Ollama Python client
- rich - Beautiful terminal UI
- aiofiles - Async file operations
- Memory extraction: Runs asynchronously, doesn't block conversation
- Context window: Fixed size prevents token bloat
- Vector search: O(log n) retrieval time
- Session files: JSON format, easily portable
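
Because session files are plain JSON, a save/resume round trip could look roughly like this sketch (paths and field names are illustrative, not the project's exact schema):

```python
import json
from pathlib import Path

SESSIONS_DIR = Path("sessions")                 # matches the project layout above

def save_session(session_id: str, turns: list[dict]) -> Path:
    SESSIONS_DIR.mkdir(exist_ok=True)
    path = SESSIONS_DIR / f"{session_id}.json"
    path.write_text(json.dumps({"id": session_id, "turns": turns}, indent=2))
    return path

def load_session(session_id: str) -> list[dict]:
    data = json.loads((SESSIONS_DIR / f"{session_id}.json").read_text())
    return data["turns"]

save_session("demo", [{"user": "Hi!", "assistant": "Hello Alex!"}])
print(load_session("demo"))
```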
- Memory importance scoring
- Memory decay (older memories fade)
- Multi-user support
- Web UI interface
- Conversation branching
- Memory conflict resolution
- Export to various formats
MIT License - Feel free to use and modify!
Contributions welcome! Please feel free to submit issues or pull requests.