A memory architecture for local LLMs that builds a persistent, hierarchical self-model from conversations — using Pascal's triangle as the weighting function.
The core idea: instead of storing all memories equally, EigenFlame compresses experience upward through dimensional layers — episodes become beliefs, beliefs become identity, identity becomes meta-pattern, meta-pattern becomes archetype. Higher layers exert gravitational pull on every retrieval. The result is a system that doesn't just remember what was said — it distills what things meant.
```bash
# 1. Install Ollama — https://ollama.com

# 2. Pull models
ollama pull mxbai-embed-large:latest
ollama pull qwen3.5:9b

# 3. Clone and install
git clone https://github.com/latentweb/EigenFlame
cd EigenFlame
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
python -m pip install -r requirements.txt

# 4. Run
cd backend
python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# 5. Open http://localhost:8000
```

That's it. The backend serves the frontend — no separate server needed.
When you talk to a standard RAG-augmented LLM, it retrieves the most semantically similar past messages and injects them as context. That's flat retrieval — every stored exchange competes equally by cosine similarity, and recency usually wins.
EigenFlame does something different. After enough exchanges, it runs a synthesis cascade:
- Raw exchanges → 2D beliefs (cross-episode patterns)
- Beliefs → 3D identity (who this entity has become)
- Identity versions → 4D meta-pattern (how understanding is shifting over time)
- Meta-patterns → 5D archetype (the invariant beneath all change)
Each layer is weighted by its position using figurate numbers from Pascal's triangle. A synthesised identity statement from 30 turns ago outweighs a raw episode from yesterday — not because it's older, but because it survived synthesis and represents distilled understanding.
The seed — a phrase you set at session creation — is embedded as a vector and acts as a permanent gravitational anchor. Every query vector is bent toward it before retrieval happens. The archetype, once it crystallises, becomes a second anchor: the system is pulled toward both its origin and toward what it has become.
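As an illustration, the gravitational bend can be pictured as linear interpolation toward the anchor followed by renormalisation. This is a sketch only: the function name, the blend factor `g`, and the interpolation form are assumptions, not the project's actual transform (which scales with the RAG mode's gravity setting).

```python
import math

def apply_gravity(query, anchor, g=0.35):
    # Bend the query toward the anchor by linear interpolation,
    # then renormalise to unit length. The blend factor g is illustrative.
    blended = [(1 - g) * q + g * a for q, a in zip(query, anchor)]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]

query = [1.0, 0.0]   # unit query vector
seed = [0.0, 1.0]    # orthogonal seed anchor
bent = apply_gravity(query, seed)
# the bent query keeps its original direction as the dominant component,
# but now has a positive component along the seed direction
```

With a second anchor (the archetype), the same transform would simply be applied twice, or once against a blend of both anchors.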
- Backend: FastAPI + Ollama (local LLM inference)
- Memory: ChromaDB (persistent vector storage)
- Embeddings: any Ollama embedding model
- Frontend: vanilla React (JSX via Babel, no build step)
- Synthesis: LLM cascade running automatically after each exchange
No cloud dependencies. Everything runs locally.
- Python 3.11+
- Ollama — install and make sure it's running (`ollama serve`)
- Node.js is not required (frontend uses CDN React)
```bash
# Embedding model (required)
ollama pull mxbai-embed-large:latest

# Generation model — pick one or your favourite
ollama pull qwen3.5:9b       # recommended, good synthesis quality
ollama pull phi4-mini:3.8b   # faster, lighter
ollama pull llama3.2:3b      # smaller footprint
```

```bash
git clone https://github.com/latentweb/EigenFlame
cd EigenFlame

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```
EigenFlame/
├── backend/
│   ├── main.py                # FastAPI server, session management, system prompt builder
│   ├── memory.py              # FigurateMemory class, retrieval, dimensional stack
│   ├── synthesis.py           # Synthesis cascade, belief/identity/meta/archetype generation
│   └── embeddings.py          # Ollama embedding wrapper with chunking
├── frontend/
│   ├── index.html             # Entry point, loads all scripts
│   ├── app.jsx                # Root component, session routing
│   ├── constants.js           # API config, colour palette, shared utilities
│   ├── eigenflame_video.html  # Interactive architecture explainer
│   └── components/
│       ├── Engine.jsx         # Main chat interface
│       ├── Memory.jsx         # Memory panel (dimensional stack viewer)
│       ├── Screens.jsx        # Connect / Seed screens
│       ├── Modals.jsx         # Session browser + video modal
│       └── Primitives.jsx     # Shared UI components
├── data/                      # ChromaDB session data (auto-created)
├── requirements.txt
├── README.md
└── LICENSE
```
```bash
cd backend
python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Open http://localhost:8000. The backend serves the frontend directly — no separate web server needed.
```bash
./start-eigenflame.sh           # local only
./start-eigenflame.sh --ngrok   # also open a public tunnel
```

The script opens Ollama and the backend in separate terminal windows. With `--ngrok` it also starts a tunnel so you can access the app from any device.
If you want to access EigenFlame from another device (phone, tablet, remote machine):
```bash
# In a separate terminal, after the backend is running:
ngrok http 8000
```

Then open the `https://...ngrok-free.app` URL on any device. The frontend auto-detects it's running over HTTPS and routes API calls correctly — no extra configuration needed.
The advanced field on the Connect screen (hidden by default) lets you manually override the API URL. You only need this if you're running the frontend and backend on completely separate origins — which is not the normal setup.
On load, the app checks that Ollama is reachable and lists the available models. Select your generation model and embedding model, then click new session →.
If you're resuming a session after a server restart, the app auto-restores it from localStorage — no need to re-initialise manually.
The seed is the 0D origin of the session. It gets embedded as a vector and is immutable — it cannot be changed after creation. Every retrieval query is bent toward this vector. Choose it carefully.
Good seeds:
I want to think clearly and understand things deeply.
I am a researcher studying the structure of complex adaptive systems.
Richard — a state-of-the-art personal research agent seeking coherence and truth.
Avoid:
- Very long sentences (the vector averages out detail)
- Vague or contradictory framings
- Seeds that are too close to the domain of every query (gravity becomes uninformative)
You can also queue auto-prime questions — these send automatically in sequence after init, seeding the first synthesis cycle without manual typing. Useful for knowledge-compression workflows.
RAG mode — controls how aggressively the memory stack shapes responses:
- `rag: full` — low cosine floor (0.35), full gravity. Surfaces more memories, stronger seed pull.
- `rag: balanced` — default. Moderate floor (0.45), 70% gravity.
- `rag: light` — higher floor (0.55), 40% gravity. More conversational, less self-referential.
- `rag: off` — no floor, no gravity. Pure cosine retrieval.
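For reference, the four modes reduce to just two retrieval parameters. The dictionary below is a hypothetical sketch (key and field names are invented); the floor and gravity values come from the list above.

```python
# Illustrative mapping of RAG modes to retrieval parameters.
# Key and field names are hypothetical; values are from the mode list.
RAG_MODES = {
    "full":     {"cosine_floor": 0.35, "gravity": 1.0},
    "balanced": {"cosine_floor": 0.45, "gravity": 0.7},
    "light":    {"cosine_floor": 0.55, "gravity": 0.4},
    "off":      {"cosine_floor": 0.0,  "gravity": 0.0},
}

def passes_floor(cosine, mode):
    # A memory is eligible for injection only at or above the mode's floor.
    return cosine >= RAG_MODES[mode]["cosine_floor"]
```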
Persona mode — controls how the system prompt frames the model's role:
- `persona: off` — minimal. Memory is injected as silent context. The model behaves as a capable assistant; the architecture is invisible.
- `persona: neutral` — light framing (default): "you have memory, use it naturally." Slight self-awareness without theatrics.
- `persona: expressive` — full framing: "you are a continuously evolving self, not an assistant." Produces rich persona behaviour in capable models. Use deliberately.
Both settings persist per session in localStorage.
The right panel (click memory in the header) shows the full dimensional stack in real time. Each layer is editable — click any item to edit the text, which re-embeds it so retrieval stays accurate. Items can be deleted individually.
Layer indicators in the status bar:
- `ep` — episode count
- `b` — belief count
- `v1`, `v2`, … — identity versions
- `~~` — boundary reflection has fired
- `KN` — N knowledge documents loaded
**Seed (0D).** The immutable origin. Embedded once, never changed. Acts as the primary gravitational anchor for all retrieval.

**Episodes (1D).** Raw conversation exchanges. Weighted nearly flat (temperature=40) so cosine similarity drives retrieval rather than age.

**Beliefs (2D).** Synthesised from episodes when enough semantic novelty accumulates. Tagged with source attribution (self, user, interaction) by cosine similarity — no LLM introspection. Temperature=12.

**Identity (3D).** Synthesised from beliefs. A 2–3 sentence statement of who this entity has become. Versioned. Temperature=6.

**Meta-pattern (4D).** Synthesised from identity versions. Describes how the identity is changing — converging, deepening, shifting register. Temperature=3.

**Archetype (5D).** Synthesised from meta-patterns. The invariant beneath all transformation. Once it crystallises, it becomes a second gravitational anchor alongside the seed.

**Boundary reflection.** Auto-generated every 10 episodes. The model reflects on which beliefs feel genuinely its own versus absorbed from the user. Always present in the system prompt as static context.

**Knowledge.** Flat external documents. Retrieved by plain cosine similarity (no gravity transform). Add reference material here to ground the agent in external facts.
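The deterministic source attribution used by the belief layer (nearest tag by cosine similarity, no LLM introspection) can be sketched like this; the centroid inputs and function names are illustrative assumptions, not the project's actual code.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def attribute_source(belief_vec, self_centroid, user_centroid, interaction_centroid):
    # Tag a belief with the nearest centroid by cosine similarity.
    # Centroid names are illustrative; the point is that attribution
    # is a pure vector comparison, with no LLM in the loop.
    candidates = {
        "self": cosine(belief_vec, self_centroid),
        "user": cosine(belief_vec, user_centroid),
        "interaction": cosine(belief_vec, interaction_centroid),
    }
    return max(candidates, key=candidates.get)
```

Because the comparison is deterministic, the same belief always receives the same tag, regardless of generation-model temperature.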
Synthesis fires automatically after each exchange.
- Beliefs (2D) — fires when accumulated semantic novelty > 1.8, or after 8+ episodes since the last synthesis (minimum 3 episodes between events)
- Identity (3D) — fires when belief count ≥ 3
- Meta-pattern (4D) — fires when identity version count ≥ 2, then every 2 versions
- Archetype (5D) — fires when meta count ≥ 2, then every 2 meta-patterns
- Boundary — fires every 10 episodes
Manual trigger: POST /api/reflect/{session_id}
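The belief-layer trigger can be expressed as a small predicate. This is a sketch of the rules stated above, not the actual code in synthesis.py; the function and parameter names are assumptions.

```python
def should_synthesise_beliefs(novelty, episodes_since_last,
                              novelty_threshold=1.8,
                              max_gap=8, min_gap=3):
    # Belief synthesis fires on accumulated novelty or on episode count,
    # but never more often than every `min_gap` episodes.
    if episodes_since_last < min_gap:
        return False
    return novelty > novelty_threshold or episodes_since_last >= max_gap
```

The `min_gap` guard is what keeps a burst of highly novel exchanges from triggering synthesis on every turn.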
For a layer with dimension d, the weight of an item at position k from the end is C(k + d - 1, d).
Layer-specific temperatures prevent softmax saturation while preserving each layer's intent:
```python
LAYER_SOFTMAX_TEMP = {
    "episodes": 40.0,   # nearly flat — cosine is the primary signal
    "beliefs": 12.0,    # moderate — older beliefs carry more authority
    "identity": 6.0,    # stronger — established identity has weight
    "meta": 3.0,        # strong — meta barely changes
    "archetype": 1.0,   # irrelevant — only one item
}
```

Combined retrieval score: `combined = cosine_similarity × (0.6 + 0.4 × softmax_fig_weight)`
Figurate weight adds at most 40% on top of cosine similarity. Cosine is always the floor.
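Putting the pieces together, here is a sketch of the weighting pipeline. It assumes k = 1 denotes the most recent item, so older, more distilled entries receive larger raw weights; that indexing convention and the function names are assumptions.

```python
from math import comb, exp

def figurate_weights(n_items, d, temp):
    # Raw weight for the item k positions back is C(k + d - 1, d).
    # With k = 1 as the most recent item, older items get larger raw
    # weights; the temperature-scaled softmax then flattens (high temp)
    # or sharpens (low temp) that distribution per layer.
    raw = [comb(k + d - 1, d) for k in range(1, n_items + 1)]
    exps = [exp(w / temp) for w in raw]
    total = sum(exps)
    return [e / total for e in exps]   # index 0 = most recent

def combined_score(cosine_sim, fig_weight):
    # Figurate weight adds at most 40% on top of cosine; cosine is the floor.
    return cosine_sim * (0.6 + 0.4 * fig_weight)

w = figurate_weights(4, d=2, temp=12.0)   # four beliefs at belief-layer temp
```

For d = 2 the raw weights are the triangular numbers 1, 3, 6, 10; at temperature 12 the softmax leaves them only mildly graded, exactly the "older beliefs carry more authority" behaviour described above.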
All endpoints at http://localhost:8000/api/.
POST /api/session/init — create or resume a session
GET /api/sessions — list all persisted sessions
DELETE /api/sessions/{id} — delete a session
GET /api/status — Ollama connectivity + available models
POST /api/chat — streaming NDJSON chat endpoint
Stream events: retrieved, think, content, episode_saved, synthesis_start, synthesis_complete, memory_state, done, error
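A client consumes /api/chat as newline-delimited JSON, one event per line. The sketch below assumes each event is an object with a `type` field matching the names above; the exact payload shape is an assumption.

```python
import json

def parse_ndjson_stream(lines):
    # Yield one event dict per non-empty NDJSON line. The `type` field
    # matching the stream event names is an assumption about the payload.
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample of what a short stream might look like:
sample = [
    '{"type": "retrieved", "count": 3}',
    '{"type": "content", "text": "Hello"}',
    '{"type": "done"}',
]
events = list(parse_ndjson_stream(sample))
```

Against the live endpoint you would iterate the HTTP response line by line instead of a list, stopping on the `done` or `error` event.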
GET /api/memory/{session_id} — full dimensional stack
DELETE /api/memory/{session_id}/{layer}/{item_id} — delete one item
PUT /api/memory/{session_id}/{layer}/{item_id} — edit + re-embed one item
POST /api/knowledge/{session_id} — add a document
GET /api/knowledge/{session_id} — list all documents
DELETE /api/knowledge/{session_id}/{item_id} — remove a document
POST /api/reflect/{session_id}?gen_model=qwen3.5:9b — manually trigger cascade
Sessions persist to data/{session_id}/:
```
data/
└── abc123def456/
    ├── seed.json             # seed text, vector, embed model
    ├── working_memory.json   # full conversation history
    ├── synthesis_state.json  # synthesis trigger continuity
    └── [ChromaDB files]      # all dimensional layers
```
Important: a session is permanently bound to the embedding model it was created with. Switching models mid-session is not supported. If you see a shape mismatch error (`(768,) (1024,)`), check that the embed model in the UI matches what's in the session's seed.json.
**Ollama not reachable**
Make sure `ollama serve` is running (http://localhost:11434). If you moved the backend to a different machine, use the advanced field on the Connect screen to point at it.
**Shape mismatch error `(768,) (1024,)`**
The session was created with a different embedding model than what's currently selected. Check seed.json for the embed_model field. Delete the session folder to start fresh with a different model.
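A quick way to check for the mismatch programmatically, assuming only the embed_model field documented above (the helper function and throwaway demo folder are illustrative):

```python
import json
import tempfile
from pathlib import Path

def check_embed_model(session_dir, selected_model):
    # Compare the UI-selected embedding model against the one the session
    # was created with; mismatches cause shape errors at retrieval time.
    seed = json.loads((Path(session_dir) / "seed.json").read_text())
    return seed["embed_model"] == selected_model

# Demo against a throwaway session folder:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "seed.json").write_text(
        json.dumps({"seed": "example seed",
                    "embed_model": "mxbai-embed-large:latest"}))
    ok = check_embed_model(d, "mxbai-embed-large:latest")
    mismatch = check_embed_model(d, "nomic-embed-text")
```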
**Session not found after server restart**
Sessions rehydrate automatically from data/. Check that the data/ directory exists and each session subfolder has a seed.json.

**Synthesis never fires**
If novelty stays near 0.02, the embedding model is treating all inputs as semantically similar. Try more topically diverse exchanges, or lower `min_episodes` in synthesis.py.

**Responses are very long and listy**
Model behaviour issue, not architecture. Try `persona: off` and a smaller, instruction-tuned model.
- Synthesis quality depends on model capability. 8B+ models produce significantly better synthesis than 3–4B models.
- Episode collection is unbounded. At 500+ episodes, figurate weighting has diminishing value. Future work: archive synthesised episodes.
- Knowledge retrieval is flat. The knowledge tier uses raw cosine similarity, not the gravity-transformed vector. Intentional for factual grounding.
| Term | Meaning |
|---|---|
| Seed (0D) | Immutable origin phrase, permanent gravity anchor |
| Figurate weighting | Pascal's triangle combinatorial numbers weighting memory by abstraction level |
| Seed gravity | Query vector transformation toward the seed (and archetype) before retrieval |
| Synthesis cascade | Automatic upward compression: episodes → beliefs → identity → meta → archetype |
| Downward influence | Once archetype exists, belief synthesis is framed against it |
| Boundary reflection | Every 10 episodes: what is genuinely mine vs absorbed from the user? |
| Deterministic attribution | Source tags assigned by cosine similarity, not LLM introspection |
| Persona mode | How the system prompt frames the model's role |
| RAG mode | How aggressively the memory stack shapes responses |
BUSL 1.1 — free for personal and research use. Commercial use requires a separate agreement until 2030, at which point this project converts to Apache 2.0.