Personal continually-learning local LLM. Knowledge in weights, not in prompts.
Early-stage research project. The architectural thesis — hybrid Mamba+attention base with three coordinated memory channels — works end-to-end on a MacBook (see empirical findings). Agentic / tool-using flows are an open problem.
Three things make cognit different from a frontier coding agent:
- Smaller memory footprint, vastly longer context. A hybrid Mamba+attention base (Zamba2-1.2B by default) carries running conversation state in a fixed-size SSM hidden state instead of an ever-growing KV cache. Inference stays fast on a MacBook even as conversations grow, and long-context flows that would be GB-scale with pure attention stay MB-scale here.
- No prompt inflation. Personalization lives in weights (a small LoRA adapter) and persistent session state (SSM), not in ever-bigger system prompts, RAG injection, or in-context memory files. The prompt stays minimal — the model already knows you.
- Continually learns you. Mark turns in chat (or via the HTTP endpoint) and on session exit cognit runs a few gradient steps of LoRA fine-tuning. Next session, the corrections are in the weights. The model gets sharper at you the more you use it — without shipping your data to a vendor.
| Frontier agents | Cognit |
|---|---|
| Personalization in prompts (RAG, memory injection, system msgs) | Personalization in weights & state (LoRA adapter + persistent SSM state) |
| Memory grows your context cost on every call | Memory grows your adapter file (~1 MB), free at inference |
| Your data sits in prompts the vendor sees | Your data is baked into a local adapter you own |
| Personalization stops working when context fills | Adapter capacity scales with parameters, not context |
Cognit runs on a MacBook today. Hybrid Mamba+attention base (Zamba2-1.2B-Instruct-v2 by default), persistent per-session SSM state, LoRA fine-tuning between sessions, catastrophic-forgetting protection by default, OpenAI-compatible HTTP server for coding-agent integration.
```bash
pip install -e ".[server]"
```

(Python 3.10+, PyTorch 2.2+, an Apple Silicon or CUDA GPU recommended. CPU works but training is slow.)
```bash
# First run downloads Zamba2-1.2B-Instruct-v2 (~2.4 GB) and creates the
# default profile. Subsequent runs reuse the cached model.
cognit init --yes

# Chat with it
cognit chat

# One-shot
cognit generate "Once upon a time"
```

In the chat REPL, mark turns to teach cognit:
```text
> Explain how our auth flow works
cognit: ...
> /good                                 # mark the response as worth learning from
> /fix It actually uses JWT, not OAuth. # correct the response
> /exit                                 # session ends; adapter trains on marked turns
```
Next session, cognit has absorbed those corrections into its weights.
Cognit exposes an OpenAI-compatible HTTP endpoint. Any agent that takes
`OPENAI_API_BASE` works against it unchanged.
```bash
cognit serve
```

(Default port is 11435 — one above Ollama's 11434, so the two can run
side-by-side. Override with `--port N` if needed.)
Then in your agent's config:
```bash
export OPENAI_API_BASE=http://localhost:11435/v1
export OPENAI_API_KEY=not-needed
# model name: "cognit/default" (or any named profile)
```

Or with the OpenAI Python SDK directly:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="anything")
stream = client.chat.completions.create(
    model="cognit/default",
    messages=[{"role": "user", "content": "Refactor this function..."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Standard endpoints (OpenAI-compatible): `/v1/chat/completions` (with SSE streaming), `/v1/models`.
Cognit-specific extensions for learning:
- `POST /v1/cognit/sessions/{id}/mark` — promote the last captured turn to `positive` or `correction` (the agent calls this when the user accepts a suggestion / corrects the response).
- `POST /v1/cognit/profiles/{name}/train` — flush captured turns into the LoRA adapter.
- `GET /v1/cognit/profiles/{name}/status` — adapter + captures summary.
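An agent can drive these endpoints with plain HTTP. A minimal sketch using `requests`, assuming a session named `default`; the `mark` payload shape here is an assumption, not documented API:

```python
# Sketch: driving cognit's learning extensions from an agent.
# Endpoint paths are from this README; the mark payload shape is assumed.
import requests

BASE = "http://localhost:11435/v1"

# User accepted the last response: promote it for training.
requests.post(f"{BASE}/cognit/sessions/default/mark", json={"label": "positive"})

# Flush captured turns into the profile's LoRA adapter now,
# instead of waiting for session exit.
requests.post(f"{BASE}/cognit/profiles/default/train")

# Adapter + capture-queue summary.
print(requests.get(f"{BASE}/cognit/profiles/default/status").json())
```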
Pi is a terminal coding agent
from earendil-works that reads provider config from
~/.pi/agent/models.json. Cognit-on-Pi is an experimental integration
right now: tool calling isn't implemented (see open problems),
so Pi works well for paste-and-discuss flows (explaining code you paste,
planning, conceptual coding questions) but not for autonomous file
exploration. The launcher below is the easiest way to point Pi at
cognit; on the first tool-using request cognit returns a one-time notice
explaining the limit, then falls back to chat on subsequent requests.
Install Pi:
```bash
npm install -g @earendil-works/pi-coding-agent
# or: curl -fsSL https://pi.dev/install.sh | sh
```

Then point Pi at cognit with a single command:
```bash
cognit launch pi
```

This starts `cognit serve` on port 11435 in the background (logs to
~/.cognit/launch-pi.log, with automatic fallback to a free port if
something else holds 11435), registers cognit as a Pi provider while
preserving any existing entries, sets it as Pi's default, and spawns
pi interactively. On exit, the background server is stopped.
Flags:
- `--no-default` — register as a provider but leave Pi's default alone
- `--no-launch` — just write the config; don't spawn `pi`
- `--port N` — use a different port for the cognit server
- `--profile NAME` (at the top level: `cognit --profile NAME launch pi`) — register a specific profile instead of `default`
If you'd rather wire it up by hand — e.g. to register multiple cognit
profiles, or to keep Pi under your own config management — run
cognit serve in one terminal, then add cognit to
~/.pi/agent/models.json:
```json
{
  "providers": {
    "cognit": {
      "baseUrl": "http://localhost:11435/v1",
      "api": "openai-completions",
      "apiKey": "cognit",
      "models": [{ "id": "cognit/default" }]
    }
  }
}
```

Merge with any existing `providers` entries (don't replace the whole
file). To make cognit Pi's default, add to ~/.pi/agent/settings.json:
```json
{
  "defaultProvider": "cognit",
  "defaultModel": "cognit/default"
}
```

Then `pi` will route through cognit.
Cognit's memory isn't one mechanism — it's three, each operating at a different timescale and storing something different. The hybrid base isn't a stylistic choice; it's what makes this work at MacBook scale.
| Channel | Mechanism | What it stores | Persists across | Job |
|---|---|---|---|---|
| Within turn | Attention KV cache | Verbatim tokens, exact positions | Single forward pass | Exact recall — "what did the user just type 200 tokens back?" |
| Within session | SSM hidden state | Compressed running summary (gist) | Save/load — serialized to disk | Cheap long threads — resume tomorrow with conversation state intact |
| Across all sessions | LoRA adapter weights | Durable learned patterns | All future sessions, across reboots | Knowledge in weights — the model gets sharper at you over weeks |
These are not interchangeable. Don't try to smuggle exact-recall into the SSM channel; don't expect LoRA to carry the gist of an ongoing conversation. Each layer has its job.
Each channel above depends on a property of the base architecture:
- Pure attention has an unbounded KV cache that grows linearly with context. Memory and decode latency both balloon as conversations lengthen — and persisting a long session's KV across save/load would mean GB-scale state files. Fine for a chatbot, fatal for a daily driver you resume across days.
- Pure SSM has constant-size state (great for persistence) but loses the exact recall that attention gives you — "the function name three messages ago" becomes lossy.
- Hybrid keeps the SSM's constant-size state and attention's exact
recall on a recent window. Compared to pure attention at the same
context length, three concrete wins fall out:
- Much less memory. SSM layers carry O(1) state per token; the attention KV cache only spans a short recent window. At 10–100K context the difference is order-of-magnitude — the difference between "fits in laptop RAM" and "doesn't."
- Faster decode. Per-token cost doesn't grow with conversation length. The SSM half of every layer is constant-cost per token, so a 50-message thread decodes at the same speed as a 5-message one.
- Cheap online updates. Both the in-conversation SSM state (updated in place, token-by-token, O(1) per token) and the LoRA adapter (trained between sessions on the same machine) stay tractable on a laptop. At pure-attention scaling the same workloads would need server-class memory.
This is why cognit's session save/load works and why the whole system fits on a MacBook: SSM state is what serializes, hybrid runtime costs stay bounded as threads grow, and a 1-3B hybrid base plus a small LoRA can do inference and online learning in laptop-class memory.
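To put rough numbers on the memory claim, a back-of-envelope sketch (the hyperparameters are illustrative 1B-class values, not Zamba2's actual configuration):

```python
# Back-of-envelope: per-session state, pure-attention KV cache vs. fixed SSM state.
# All hyperparameters below are illustrative, not Zamba2's real config.
layers, kv_heads, head_dim, bytes_fp16 = 24, 8, 64, 2

def kv_cache_bytes(context_tokens: int) -> int:
    # K and V cached per token, per layer: grows linearly with context.
    return context_tokens * layers * kv_heads * head_dim * 2 * bytes_fp16

def ssm_state_bytes(d_inner: int = 4096, d_state: int = 16) -> int:
    # One fixed-size recurrent state per layer: independent of context length.
    return layers * d_inner * d_state * bytes_fp16

print(f"KV cache @ 100K tokens: {kv_cache_bytes(100_000) / 1e9:.1f} GB")  # ~4.9 GB
print(f"SSM state (any length): {ssm_state_bytes() / 1e6:.1f} MB")        # ~3.1 MB
```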
SSM state across sessions could in principle carry knowledge forward. In practice it doesn't, for the same reason gist-summaries don't substitute for studying: it's a lossy compression, unstructured, and the next conversation will overwrite it. What you actually want preserved across weeks of use — your codebase quirks, your writing style, terminology you've corrected — needs to live in weights. LoRA is that channel.
The base model stays frozen; only a small rank-r adapter (~1 MB at
Zamba2-1.2B) learns. Adapter updates are cheap enough to run as
background work after a chat session ends.
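For intuition, here is the standard LoRA pattern in miniature (a sketch, not cognit's actual implementation; rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = Wx + (alpha/r) * B(Ax), with W frozen. Only A and B train."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```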
When a turn is promoted to the capture queue (you /good or /fix it
in chat, or an agent calls POST /mark), cognit runs a few gradient
steps of LoRA fine-tuning on the accumulated captures — on session exit,
or via an explicit `train_pending`. The base model's weights are frozen;
only the small adapter changes.
The same training loop has two feeders, both writing into the same LoRA:
- Conversation capture — turns from chat sessions, promoted via `/good`, `/fix`, or `POST /mark`. Online, incremental, driven by how you actually use the model.
- Corpus ingestion — `QmdCorpusIngester` points at a qmd-managed collection (your notes, project docs, style guide) and trains the adapter on the documents directly. Bulk, no chat session required. Hash-tracked so unchanged docs don't re-train.
Conversation capture handles correction ("you got this wrong, here's right"); corpus ingestion handles bootstrapping ("here's everything I want you to already know on day one"). Both land in the same adapter and coexist seamlessly.
qmd as the backend — rather than a hand-rolled markdown walker — gives
cognit stable qmd://collection/path URIs, dedup, and shared state with
anything else you've connected to qmd (search, MCP, retrieval). See
examples/06_growing_brain_demo.py
for the full loop end-to-end.
Adapter interference is the real engineering risk: earlier-learned
patterns in the LoRA can be overwritten by later updates. Cognit's
mitigation is replay sampling — mixing previously-trained captures into
each new training pass. It's trivial to implement (the captures are
already in the queue) yet ~99% effective vs. an unprotected run in our
Zamba2-1.2B measurement. L2 regularization toward pre-pass weights was
nearly useless at typical LoRA scales, so it's kept as an opt-in
research knob, not the default. See
examples/08_forgetting_protection_demo.py.
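The mechanism itself is small. A minimal sketch of replay sampling, assuming captures are plain records (the ratio and names are illustrative):

```python
import random

def build_training_batch(new_captures, trained_captures, replay_ratio=0.5):
    """Mix previously-trained captures into each pass so earlier-learned
    patterns keep receiving gradient signal alongside the new ones."""
    n_replay = int(len(new_captures) * replay_ratio)
    replay = random.sample(trained_captures, min(n_replay, len(trained_captures)))
    batch = new_captures + replay
    random.shuffle(batch)
    return batch
```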
The three-channel thesis isn't just architectural framing — both non-trivial channels (SSM, LoRA) have been demonstrated end-to-end on Zamba2-1.2B running locally. Reproduce with the example scripts; numbers below are from M-series MPS runs.
examples/04_persistence_demo.py
Pre-fill a ~500-character prior context into a session, persist the SSM state to disk on exit, then load it back into a new session. Generate from a neutral prompt with and without the loaded state. Findings:
- Logit divergence at the first generated position is nonzero and meaningful — the same prompt produces measurably different next-token predictions when SSM state from the prior session is loaded.
- State file is KB-scale on disk per session, vs GB-scale for attention KV at the same context length.
- Sampled generation is topic-consistent with the prior context, not with the cold baseline.
This validates the SSM channel as actual working memory across save/load. A caveat baked into Zamba2's hybrid design: attention-heavy hybrid layers make SSM-only persistence a "gist" memory rather than exact recall — by design (see the architecture section above).
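For reference, one plausible way to compute the logit-divergence check these demos describe (a sketch, not the demo scripts' actual code):

```python
import torch

def first_step_logit_divergence(logits_a: torch.Tensor,
                                logits_b: torch.Tensor) -> float:
    """Compare next-token logits at the first generated position.
    logits_*: [batch, seq, vocab] from two runs of the same prompt
    (e.g. with vs. without loaded SSM state). Nonzero divergence means
    the state changed the model's predictions, not just the sampling."""
    return (logits_a[:, -1, :] - logits_b[:, -1, :]).abs().max().item()
```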
examples/05_lora_learning_demo.py
Baseline: generate from "The Glorpfish lives in" — Zamba2 confabulates
something (the Glorpfish is fictional). Then teach cognit 3 corrected
turns about it; session exit triggers a few gradient steps of LoRA
fine-tuning. Open a fresh session, generate from the same prompt:
- Generated continuation matches the taught content ("the purple caves of Mooncliff") in the new session — no prompting reminds the model of what it learned.
- Logit divergence at the first generated position confirms the weights changed, not just sampling artifacts.
- Adapter file is a tiny fraction of base (sub-percent of 2.4 GB) yet durably changes what cognit knows.
This validates the LoRA channel as durable across-session learning. A small artifact, layered on the shared frozen base, encodes what's been taught.
Each profile (work, personal, research, …) has its own adapter,
captures, and sessions. Switching profiles is like switching to a
different model — no cross-contamination.
```bash
cognit profile new work --model-type zamba2 --base Zyphra/Zamba2-1.2B-Instruct-v2
cognit profile new personal --model-type zamba2 --base Zyphra/Zamba2-1.2B-Instruct-v2

cognit --profile work chat --session pr-review
cognit --profile personal generate "What do I usually cook on Tuesdays?"
```

Within a profile, sessions are conversation threads — separate SSM state per thread, shared LoRA. Layout:
```text
~/.cognit/profiles/<name>/
├── profile.json             # config
├── adapter.pt               # LoRA weights for this profile
├── captures.jsonl           # capture queue
└── sessions/
    ├── default.safetensors  # SSM state per thread
    └── pr-review.safetensors
```
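Session files round-trip through safetensors. An illustrative sketch; the tensor name and shape here are assumptions, only the file layout above is cognit's:

```python
# Illustrative round-trip of a per-session SSM state file.
# Tensor names/shapes are assumptions; the path follows the layout above.
import os
import torch
from safetensors.torch import save_file, load_file

path = os.path.expanduser("~/.cognit/profiles/default/sessions/pr-review.safetensors")
os.makedirs(os.path.dirname(path), exist_ok=True)

state = {"ssm_state_layer0": torch.zeros(1, 512, 16)}  # fixed size, whatever the thread length
save_file(state, path)
restored = load_file(path)  # resume tomorrow with conversation state intact
```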
| Script | What it shows |
|---|---|
| `04_persistence_demo.py` | SSM state persistence empirically changes generation |
| `05_lora_learning_demo.py` | LoRA training durably encodes new knowledge |
| `06_growing_brain_demo.py` | Corpus ingestion + conversation capture → same adapter |
| `07_profile_isolation_demo.py` | Profile isolation (work vs personal) |
| `08_forgetting_protection_demo.py` | Catastrophic-forgetting protection measured 4 ways |
All examples run against Zamba2-1.2B (no checkpoint setup beyond
`cognit init --yes`).
Alpha. The system works end-to-end on a MacBook with Zamba2-1.2B today. What's known:
- Inference: ~5–7 tok/s for Zamba2-1.2B on M-series MPS.
- LoRA training: ~8s per step at 1.2B on MPS (slow but workable for background "train between sessions" usage).
- Sessions: SSM state persists across runs. Hybrid models lose exact-recall (attention KV) across sessions by design; LoRA covers the durable channel.
- Base models: Zamba2-1.2B / 2.7B / 7B via HuggingFace; or cognit-native checkpoints (sparse Jamba).
- HTTP server: OpenAI-compatible, SSE streaming, profile-as-model.
What's missing or rough:
- The chat REPL doesn't apply chat templates between turns (sessions rely on cache continuity; a proper instruct-style multi-turn protocol is a future improvement).
- Adapter management CLI (save/load/list/diff/share) — not yet built.
The architectural thesis is validated end-to-end (see empirical findings). What this project hasn't yet answered:
- Tool calling. Coding agents like Pi assume the model can read/write/edit files via structured tool calls. Zamba2-1.2B can't reliably produce tool-call output. Existing small specialists (Qwen2.5-1.5B, Hammer, Arch-Function) work in isolation, but composing them with cognit's LoRA learning loop is an open design question. For now, cognit returns an explicit "not implemented yet" notice + chat fallback on tool-using requests.
- Longitudinal effectiveness. Does cognit measurably get sharper at you over weeks of real use? The architecture supports it; the empirical study hasn't been run. No formal retention benchmark across many training sessions yet.
- Multi-turn coherence at scale. SSM state should carry conversation context across long threads. How does coherence hold at 100+ turns? Unmeasured.
- Bigger hybrid bases. Zamba2-7B may dissolve a lot of these limits — better chat quality, more robust to capture training, plausibly tool-capable. Memory budget on a MacBook is the trade-off.
Cognit wraps a frozen hybrid Mamba+attention base model with three wrapper layers:
- Profile — top-level identity. Owns one LoRA adapter and one capture queue.
- Session — conversation thread within a profile. Owns one SSM state (serialized between runs); attention KV stays ephemeral.
- LoRA adapter — trainable surface. Wraps Linear modules in both Mamba (`in_proj`, `out_proj`) and attention (`q/k/v/o_proj`) blocks. Base weights stay frozen; only the rank-r adapter learns.
At inference, the frozen base plus your profile's LoRA produce the response. The session's attention KV is built during the forward pass and discarded afterward; the SSM state updates in place and is checkpointed to disk on session save. Marked turns feed the capture queue; on session exit, captures train the adapter (with replay protection). The next session starts with the updated adapter and the previous SSM state restored.
MIT — see LICENSE.