Personal continually-learning local LLM. Knowledge in weights, not in prompts.
Early-stage research project. The architectural thesis — hybrid Mamba+attention base with three coordinated memory channels — works end-to-end on a MacBook (see empirical findings). Agentic / tool-using flows are an open problem.
Three things make cognit different from a frontier coding agent:
- Smaller memory footprint, vastly longer context. A hybrid Mamba+attention base (Zamba2-1.2B by default) carries running conversation state in a fixed-size SSM hidden state instead of an ever-growing KV cache. Inference stays fast on a MacBook even as conversations grow, and long-context flows that would be GB-scale with pure attention stay MB-scale here.
- No prompt inflation. Personalization lives in weights (a small LoRA adapter) and persistent session state (SSM), not in ever-bigger system prompts, RAG injection, or in-context memory files. The prompt stays minimal — the model already knows you.
- Continually learns you. Mark turns in chat (or via the HTTP endpoint) and on session exit cognit runs a few gradient steps of LoRA fine-tuning. Next session, the corrections are in the weights. The model gets sharper at you the more you use it — without shipping your data to a vendor.
| Frontier agents | Cognit |
|---|---|
| Personalization in prompts (RAG, memory injection, system msgs) | Personalization in weights & state (LoRA adapter + persistent SSM state) |
| Memory grows your context cost on every call | Memory grows your adapter file (~1 MB), free at inference |
| Your data sits in prompts the vendor sees | Your data is baked into a local adapter you own |
| Personalization stops working when context fills | Adapter capacity scales with parameters, not context |
Cognit runs on a MacBook today. Hybrid Mamba+attention base (Zamba2-1.2B-Instruct-v2 by default), persistent per-session SSM state, LoRA fine-tuning between sessions, catastrophic-forgetting protection by default, OpenAI-compatible HTTP server for coding-agent integration.
```bash
pip install -e ".[server]"
```

(Python 3.10+, PyTorch 2.2+, an Apple Silicon or CUDA GPU recommended. CPU works but training is slow.)
```bash
# First run downloads Zamba2-1.2B-Instruct-v2 (~2.4 GB) and creates the
# default profile. Subsequent runs reuse the cached model.
cognit init --yes

# Chat with it
cognit chat

# One-shot
cognit generate "Once upon a time"
```

In the chat REPL, mark turns to teach cognit:
```text
> Explain how our auth flow works
cognit: ...
> /good                                 # mark the response as worth learning from
> /fix It actually uses JWT, not OAuth. # correct the response
> /exit                                 # session ends; adapter trains on marked turns
```
Next session, cognit has absorbed those corrections into its weights.
Cognit exposes an OpenAI-compatible HTTP endpoint. Any agent that takes
`OPENAI_API_BASE` works against it unchanged.
```bash
cognit serve
```

(Default port is 11435 — one above Ollama's 11434, so the two can run
side-by-side. Override with `--port N` if needed.)
Then in your agent's config:
```bash
export OPENAI_API_BASE=http://localhost:11435/v1
export OPENAI_API_KEY=not-needed
# model name: "cognit/default" (or any named profile)
```

Or with the OpenAI Python SDK directly:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="anything")
stream = client.chat.completions.create(
    model="cognit/default",
    messages=[{"role": "user", "content": "Refactor this function..."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Standard endpoints (OpenAI-compatible): `/v1/chat/completions` (with SSE streaming), `/v1/models`.
Cognit-specific extensions for learning:
- `POST /v1/cognit/sessions/{id}/mark` — promote the last captured turn to `positive` or `correction` (the agent calls this when the user accepts a suggestion / corrects the response).
- `POST /v1/cognit/profiles/{name}/train` — flush captured turns into the LoRA adapter.
- `GET /v1/cognit/profiles/{name}/status` — adapter + captures summary.
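An agent can drive these endpoints with plain HTTP. A minimal sketch using `requests`, assuming a session named `default`; the `mark` payload shape here is an assumption, not documented API:

```python
# Sketch: driving cognit's learning extensions from an agent.
# Endpoint paths are from this README; the mark payload shape is assumed.
import requests

BASE = "http://localhost:11435/v1"

# User accepted the last response: promote it for training.
requests.post(f"{BASE}/cognit/sessions/default/mark", json={"label": "positive"})

# Flush captured turns into the profile's LoRA adapter now,
# instead of waiting for session exit.
requests.post(f"{BASE}/cognit/profiles/default/train")

# Adapter + capture-queue summary.
print(requests.get(f"{BASE}/cognit/profiles/default/status").json())
```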
Pi is a terminal coding agent
from earendil-works that reads provider config from
~/.pi/agent/models.json. Cognit-on-Pi is an experimental integration
right now: tool calling isn't implemented (see open problems),
so Pi works well for paste-and-discuss flows (explaining code you paste,
planning, conceptual coding questions) but not for autonomous file
exploration. The launcher below is the easiest way to point Pi at
cognit; on the first tool-using request cognit returns a one-time notice
explaining the limit, then falls back to chat on subsequent requests.
Install Pi:
```bash
npm install -g @earendil-works/pi-coding-agent
# or: curl -fsSL https://pi.dev/install.sh | sh
```

Then point Pi at cognit with a single command:
```bash
cognit launch pi
```

This starts `cognit serve` on port 11435 in the background (logs to
~/.cognit/launch-pi.log, with automatic fallback to a free port if
something else holds 11435), registers cognit as a Pi provider while
preserving any existing entries, sets it as Pi's default, and spawns
pi interactively. On exit, the background server is stopped.
Flags:
- `--no-default` — register as a provider but leave Pi's default alone
- `--no-launch` — just write the config; don't spawn `pi`
- `--port N` — use a different port for the cognit server
- `--profile NAME` (at the top level: `cognit --profile NAME launch pi`) — register a specific profile instead of `default`
If you'd rather wire it up by hand — e.g. to register multiple cognit
profiles, or to keep Pi under your own config management — run
cognit serve in one terminal, then add cognit to
~/.pi/agent/models.json:
```json
{
  "providers": {
    "cognit": {
      "baseUrl": "http://localhost:11435/v1",
      "api": "openai-completions",
      "apiKey": "cognit",
      "models": [{ "id": "cognit/default" }]
    }
  }
}
```

Merge with any existing `providers` entries (don't replace the whole
file). To make cognit Pi's default, add to ~/.pi/agent/settings.json:
```json
{
  "defaultProvider": "cognit",
  "defaultModel": "cognit/default"
}
```

Then `pi` will route through cognit.
Cognit's memory isn't one mechanism — it's three, each operating at a different timescale and storing something different. The hybrid base isn't a stylistic choice; it's what makes this work at MacBook scale.
| Channel | Mechanism | What it stores | Persists across | Job |
|---|---|---|---|---|
| Within turn | Attention KV cache | Verbatim tokens, exact positions | Single forward pass | Exact recall — "what did the user just type 200 tokens back?" |
| Within session | SSM hidden state | Compressed running summary (gist) | Save/load — serialized to disk | Cheap long threads — resume tomorrow with conversation state intact |
| Across all sessions | LoRA adapter weights | Durable learned patterns | All future sessions, across reboots | Knowledge in weights — the model gets sharper at you over weeks |
These are not interchangeable. Don't try to smuggle exact-recall into the SSM channel; don't expect LoRA to carry the gist of an ongoing conversation. Each layer has its job.
Each channel above depends on a property of the base architecture:
- Pure attention has an unbounded KV cache that grows linearly with context. Memory and decode latency both balloon as conversations lengthen — and persisting a long session's KV across save/load would mean GB-scale state files. Fine for a chatbot, fatal for a daily driver you resume across days.
- Pure SSM has constant-size state (great for persistence) but loses the exact recall that attention gives you — "the function name three messages ago" becomes lossy.
- Hybrid keeps the SSM's constant-size state and attention's exact
recall on a recent window. Compared to pure attention at the same
context length, three concrete wins fall out:
- Much less memory. SSM layers carry O(1) state per token; the attention KV cache only spans a short recent window. At 10–100K context the difference is order-of-magnitude — the difference between "fits in laptop RAM" and "doesn't."
- Faster decode. Per-token cost doesn't grow with conversation length. The SSM half of every layer is constant-cost per token, so a 50-message thread decodes at the same speed as a 5-message one.
- Cheap online updates. Both the in-conversation SSM state (updated in place, token-by-token, O(1) per token) and the LoRA adapter (trained between sessions on the same machine) stay tractable on a laptop. At pure-attention scaling the same workloads would need server-class memory.
This is why cognit's session save/load works and why the whole system fits on a MacBook: SSM state is what serializes, hybrid runtime costs stay bounded as threads grow, and a 1-3B hybrid base plus a small LoRA can do inference and online learning in laptop-class memory.
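To put rough numbers on the memory claim, a back-of-envelope sketch (the hyperparameters are illustrative 1B-class values, not Zamba2's actual configuration):

```python
# Back-of-envelope: per-session state, pure-attention KV cache vs. fixed SSM state.
# All hyperparameters below are illustrative, not Zamba2's real config.
layers, kv_heads, head_dim, bytes_fp16 = 24, 8, 64, 2

def kv_cache_bytes(context_tokens: int) -> int:
    # K and V cached per token, per layer: grows linearly with context.
    return context_tokens * layers * kv_heads * head_dim * 2 * bytes_fp16

def ssm_state_bytes(d_inner: int = 4096, d_state: int = 16) -> int:
    # One fixed-size recurrent state per layer: independent of context length.
    return layers * d_inner * d_state * bytes_fp16

print(f"KV cache @ 100K tokens: {kv_cache_bytes(100_000) / 1e9:.1f} GB")  # ~4.9 GB
print(f"SSM state (any length): {ssm_state_bytes() / 1e6:.1f} MB")        # ~3.1 MB
```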
SSM state across sessions could in principle carry knowledge forward. In practice it doesn't, for the same reason gist-summaries don't substitute for studying: it's a lossy compression, unstructured, and the next conversation will overwrite it. What you actually want preserved across weeks of use — your codebase quirks, your writing style, terminology you've corrected — needs to live in weights. LoRA is that channel.
The base model stays frozen; only a small rank-r adapter (~1 MB at
Zamba2-1.2B) learns. Adapter updates are cheap enough to run as
background work after a chat session ends.
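For intuition, here is the standard LoRA pattern in miniature (a sketch, not cognit's actual implementation; rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = Wx + (alpha/r) * B(Ax), with W frozen. Only A and B train."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```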
When a turn is promoted to the capture queue (you /good or /fix it
in chat, or an agent calls POST /mark), cognit runs a few gradient
steps of LoRA fine-tuning on the accumulated captures — on session exit,
or via an explicit `train_pending`. The base model's weights are frozen;
only the small adapter changes.
The same training loop has two feeders, both writing into the same LoRA:
- Conversation capture — turns from chat sessions, promoted via `/good`, `/fix`, or `POST /mark`. Online, incremental, driven by how you actually use the model.
- Corpus ingestion — `QmdCorpusIngester` points at a qmd-managed collection (your notes, project docs, style guide) and trains the adapter on the documents directly. Bulk, no chat session required. Hash-tracked so unchanged docs don't re-train.
Conversation capture handles correction ("you got this wrong, here's right"); corpus ingestion handles bootstrapping ("here's everything I want you to already know on day one"). Both land in the same adapter and coexist seamlessly.
qmd as the backend — rather than a hand-rolled markdown walker — gives
cognit stable qmd://collection/path URIs, dedup, and shared state with
anything else you've connected to qmd (search, MCP, retrieval). See
examples/06_growing_brain_demo.py
for the full loop end-to-end.
Adapter interference is the real engineering risk: earlier-learned
patterns in the LoRA can be overwritten by later updates. Cognit's
mitigation is replay sampling — mixing previously-trained captures into
each new training pass. It's trivial to implement (the captures are
already in the queue) yet ~99% effective vs. an unprotected run in our
Zamba2-1.2B measurement. L2 regularization toward pre-pass weights was
nearly useless at typical LoRA scales, so it's kept as an opt-in
research knob, not the default. See
examples/08_forgetting_protection_demo.py.
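The mechanism itself is small. A minimal sketch of replay sampling, assuming captures are plain records (the ratio and names are illustrative):

```python
import random

def build_training_batch(new_captures, trained_captures, replay_ratio=0.5):
    """Mix previously-trained captures into each pass so earlier-learned
    patterns keep receiving gradient signal alongside the new ones."""
    n_replay = int(len(new_captures) * replay_ratio)
    replay = random.sample(trained_captures, min(n_replay, len(trained_captures)))
    batch = new_captures + replay
    random.shuffle(batch)
    return batch
```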
The three-channel thesis isn't just architectural framing — both non-trivial channels (SSM, LoRA) have been demonstrated end-to-end on Zamba2-1.2B running locally. Reproduce with the example scripts; numbers below are from M-series MPS runs.
examples/04_persistence_demo.py
Pre-fill a ~500-character prior context into a session, persist the SSM state to disk on exit, then load it back into a new session. Generate from a neutral prompt with and without the loaded state. Findings:
- Logit divergence at the first generated position is nonzero and meaningful — the same prompt produces measurably different next-token predictions when SSM state from the prior session is loaded.
- State file is KB-scale on disk per session, vs GB-scale for attention KV at the same context length.
- Sampled generation is topic-consistent with the prior context, not with the cold baseline.
This validates the SSM channel as actual working memory across save/load. A caveat baked into Zamba2's hybrid design: attention-heavy hybrid layers make SSM-only persistence a "gist" memory rather than exact recall — by design (see the architecture section above).
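For reference, one plausible way to compute the logit-divergence check these demos describe (a sketch, not the demo scripts' actual code):

```python
import torch

def first_step_logit_divergence(logits_a: torch.Tensor,
                                logits_b: torch.Tensor) -> float:
    """Compare next-token logits at the first generated position.
    logits_*: [batch, seq, vocab] from two runs of the same prompt
    (e.g. with vs. without loaded SSM state). Nonzero divergence means
    the state changed the model's predictions, not just the sampling."""
    return (logits_a[:, -1, :] - logits_b[:, -1, :]).abs().max().item()
```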
examples/05_lora_learning_demo.py
Baseline: generate from "The Glorpfish lives in" — Zamba2 confabulates
something (the Glorpfish is fictional). Then teach cognit 3 corrected
turns about it; session exit triggers a few gradient steps of LoRA
fine-tuning. Open a fresh session, generate from the same prompt:
- Generated continuation matches the taught content ("the purple caves of Mooncliff") in the new session — no prompting reminds the model of what it learned.
- Logit divergence at the first generated position confirms the weights changed, not just sampling artifacts.
- Adapter file is a tiny fraction of base (sub-percent of 2.4 GB) yet durably changes what cognit knows.
This validates the LoRA channel as durable across-session learning. A small artifact, layered on the shared frozen base, encodes what's been taught.
Each profile (work, personal, research, …) has its own adapter,
captures, and sessions. Switching profiles is like switching to a
different model — no cross-contamination.
```bash
cognit profile new work --model-type zamba2 --base Zyphra/Zamba2-1.2B-Instruct-v2
cognit profile new personal --model-type zamba2 --base Zyphra/Zamba2-1.2B-Instruct-v2

cognit --profile work chat --session pr-review
cognit --profile personal generate "What do I usually cook on Tuesdays?"
```

Within a profile, sessions are conversation threads — separate SSM state per thread, shared LoRA. Layout:
```text
~/.cognit/profiles/<name>/
├── profile.json             # config
├── adapter.pt               # LoRA weights for this profile
├── captures.jsonl           # capture queue
└── sessions/
    ├── default.safetensors  # SSM state per thread
    └── pr-review.safetensors
```
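Session files round-trip through safetensors. An illustrative sketch; the tensor name and shape here are assumptions, only the file layout above is cognit's:

```python
# Illustrative round-trip of a per-session SSM state file.
# Tensor names/shapes are assumptions; the path follows the layout above.
import os
import torch
from safetensors.torch import save_file, load_file

path = os.path.expanduser("~/.cognit/profiles/default/sessions/pr-review.safetensors")
os.makedirs(os.path.dirname(path), exist_ok=True)

state = {"ssm_state_layer0": torch.zeros(1, 512, 16)}  # fixed size, whatever the thread length
save_file(state, path)
restored = load_file(path)  # resume tomorrow with conversation state intact
```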
| Script | What it shows |
|---|---|
| `04_persistence_demo.py` | SSM state persistence empirically changes generation |
| `05_lora_learning_demo.py` | LoRA training durably encodes new knowledge |
| `06_growing_brain_demo.py` | Corpus ingestion + conversation capture → same adapter |
| `07_profile_isolation_demo.py` | Profile isolation (work vs personal) |
| `08_forgetting_protection_demo.py` | Catastrophic-forgetting protection measured 4 ways |
All examples run against Zamba2-1.2B (no checkpoint setup beyond
`cognit init --yes`).
Alpha. The system works end-to-end on a MacBook with Zamba2-1.2B today. What's known:
- Inference: ~5–7 tok/s for Zamba2-1.2B on M-series MPS.
- LoRA training: ~8s per step at 1.2B on MPS (slow but workable for background "train between sessions" usage).
- Sessions: SSM state persists across runs. Hybrid models lose exact-recall (attention KV) across sessions by design; LoRA covers the durable channel.
- Base models: Zamba2-1.2B / 2.7B / 7B via HuggingFace; or cognit-native checkpoints (sparse Jamba).
- HTTP server: OpenAI-compatible, SSE streaming, profile-as-model.
What's missing or rough:
- The chat REPL doesn't apply chat templates between turns (sessions rely on cache continuity; a proper instruct-style multi-turn protocol is a future improvement).
- Adapter management CLI (save/load/list/diff/share) — not yet built.
The architectural thesis is validated end-to-end (see empirical findings). What this project hasn't yet answered:
- Tool calling. Coding agents like Pi assume the model can read/write/edit files via structured tool calls. Zamba2-1.2B can't reliably produce tool-call output. Existing small specialists (Qwen2.5-1.5B, Hammer, Arch-Function) work in isolation, but composing them with cognit's LoRA learning loop is an open design question. For now, cognit returns an explicit "not implemented yet" notice + chat fallback on tool-using requests.
- Longitudinal effectiveness. Does cognit measurably get sharper at you over weeks of real use? The architecture supports it; the empirical study hasn't been run. No formal retention benchmark across many training sessions yet.
- Multi-turn coherence at scale. SSM state should carry conversation context across long threads. How does coherence hold at 100+ turns? Unmeasured.
- Bigger hybrid bases. Zamba2-7B may dissolve a lot of these limits — better chat quality, more robust to capture training, plausibly tool-capable. Memory budget on a MacBook is the trade-off.
Cognit wraps a frozen hybrid Mamba+attention base model with three wrapper layers:
- Profile — top-level identity. Owns one LoRA adapter and one capture queue.
- Session — conversation thread within a profile. Owns one SSM state (serialized between runs); attention KV stays ephemeral.
- LoRA adapter — trainable surface. Wraps Linear modules in both Mamba (`in_proj`, `out_proj`) and attention (`q/k/v/o_proj`) blocks. Base weights stay frozen; only the rank-r adapter learns.
At inference, the frozen base plus your profile's LoRA produce the response. The session's attention KV is built during the forward pass and discarded afterward; the SSM state updates in place and is checkpointed to disk on session save. Marked turns feed the capture queue; on session exit, captures train the adapter (with replay protection). The next session starts with the updated adapter and the previous SSM state restored.
MIT — see LICENSE.