cognit


Personal continually-learning local LLM. Knowledge in weights, not in prompts.

Early-stage research project. The architectural thesis — hybrid Mamba+attention base with three coordinated memory channels — works end-to-end on a MacBook (see empirical findings). Agentic / tool-using flows are an open problem.

Three things make cognit different from a frontier coding agent:

  1. Smaller memory footprint, vastly longer context. A hybrid Mamba+attention base (Zamba2-1.2B by default) carries running conversation state in a fixed-size SSM hidden state instead of an ever-growing KV cache. Inference stays fast on a MacBook even as conversations grow, and long-context flows that would be GB-scale with a pure-attention model stay MB-scale here.
  2. No prompt inflation. Personalization lives in weights (a small LoRA adapter) and persistent session state (SSM), not in ever-bigger system prompts, RAG injection, or in-context memory files. The prompt stays minimal — the model already knows you.
  3. Continually learns you. Mark turns in chat (or via the HTTP endpoint) and on session exit cognit runs a few gradient steps of LoRA fine-tuning. Next session, the corrections are in the weights. The model gets sharper at you the more you use it — without shipping your data to a vendor.
Frontier agents | Cognit
Personalization in prompts (RAG, memory injection, system msgs) | Personalization in weights & state (LoRA adapter + persistent SSM state)
Memory grows your context cost on every call | Memory grows your adapter file (~1 MB), free at inference
Your data sits in prompts the vendor sees | Your data is baked into a local adapter you own
Personalization stops working when context fills | Adapter capacity scales with parameters, not context

Cognit runs on a MacBook today. Hybrid Mamba+attention base (Zamba2-1.2B-Instruct-v2 by default), persistent per-session SSM state, LoRA fine-tuning between sessions, catastrophic-forgetting protection by default, OpenAI-compatible HTTP server for coding-agent integration.

Install

pip install -e ".[server]"

(Python 3.10+, PyTorch 2.2+, an Apple Silicon or CUDA GPU recommended. CPU works but training is slow.)

Quick start

# First run downloads Zamba2-1.2B-Instruct-v2 (~2.4 GB) and creates the
# default profile. Subsequent runs reuse the cached model.
cognit init --yes

# Chat with it
cognit chat

# One-shot
cognit generate "Once upon a time"

In the chat REPL, mark turns to teach cognit:

> Explain how our auth flow works
cognit: ...
> /good                 # mark the response as worth learning from
> /fix It actually uses JWT, not OAuth.   # correct the response
> /exit                 # session ends; adapter trains on marked turns

Next session, cognit has absorbed those corrections into its weights.

Use with coding agents (pi, Claude Code, Codex, Cursor, …)

Cognit exposes an OpenAI-compatible HTTP endpoint. Any agent that takes OPENAI_API_BASE works against it unchanged.

cognit serve

(Default port is 11435 — one above Ollama's 11434, so the two can run side-by-side. Override with --port N if needed.)

Then in your agent's config:

export OPENAI_API_BASE=http://localhost:11435/v1
export OPENAI_API_KEY=not-needed
# model name: "cognit/default" (or any named profile)

Or with the OpenAI Python SDK directly:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11435/v1", api_key="anything")

stream = client.chat.completions.create(
    model="cognit/default",
    messages=[{"role": "user", "content": "Refactor this function..."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Standard endpoints (OpenAI-compatible): /v1/chat/completions (with SSE streaming), /v1/models.

Cognit-specific extensions for learning:

  • POST /v1/cognit/sessions/{id}/mark — promote the last captured turn to positive or correction (the agent calls this when the user accepts a suggestion / corrects the response).
  • POST /v1/cognit/profiles/{name}/train — flush captured turns into the LoRA adapter.
  • GET /v1/cognit/profiles/{name}/status — adapter + captures summary.
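
For example, an agent-side hook might call these endpoints as below. This is a minimal sketch using the requests library; the JSON payload for mark (the "kind" field) is an assumption, since the exact request schema isn't documented here, so check the server's actual schema before relying on it.

import requests

BASE = "http://localhost:11435/v1/cognit"

# Promote the last captured turn in session "pr-review" once the user accepts a suggestion.
# The payload shape is assumed; consult the server's schema for the real field names.
requests.post(f"{BASE}/sessions/pr-review/mark", json={"kind": "positive"}).raise_for_status()

# Flush the profile's captured turns into its LoRA adapter.
requests.post(f"{BASE}/profiles/default/train").raise_for_status()

# Inspect the adapter + capture-queue summary.
print(requests.get(f"{BASE}/profiles/default/status").json())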

Use with Pi (experimental)

Pi is a terminal coding agent from earendil-works that reads provider config from ~/.pi/agent/models.json. Cognit-on-Pi is an experimental integration right now: tool calling isn't implemented (see open problems), so Pi works well for paste-and-discuss flows (explaining code you paste, planning, conceptual coding questions) but not for autonomous file exploration. The launcher below is the easiest way to point Pi at cognit; on the first tool-using request cognit returns a one-time notice explaining the limit, then falls back to chat on subsequent requests.

Install Pi:

npm install -g @earendil-works/pi-coding-agent
# or: curl -fsSL https://pi.dev/install.sh | sh

Then point Pi at cognit with a single command:

cognit launch pi

This starts cognit serve on port 11435 in the background (logs to ~/.cognit/launch-pi.log, with automatic fallback to a free port if something else holds 11435), registers cognit as a Pi provider while preserving any existing entries, sets it as Pi's default, and spawns pi interactively. On exit, the background server is stopped.

Flags:

  • --no-default — register as a provider but leave Pi's default alone
  • --no-launch — just write the config; don't spawn pi
  • --port N — use a different port for the cognit server
  • --profile NAME (at the top level: cognit --profile NAME launch pi) — register a specific profile instead of default

Manual setup (skip the launcher)

If you'd rather wire it up by hand — e.g. to register multiple cognit profiles, or to keep Pi under your own config management — run cognit serve in one terminal, then add cognit to ~/.pi/agent/models.json:

{
  "providers": {
    "cognit": {
      "baseUrl": "http://localhost:11435/v1",
      "api": "openai-completions",
      "apiKey": "cognit",
      "models": [{ "id": "cognit/default" }]
    }
  }
}

Merge with any existing providers entries (don't replace the whole file). To make cognit Pi's default, add to ~/.pi/agent/settings.json:

{
  "defaultProvider": "cognit",
  "defaultModel": "cognit/default"
}

Then pi will route through cognit.

How cognit remembers and learns

Cognit's memory isn't one mechanism — it's three, each operating at a different timescale and storing something different. The hybrid base isn't a stylistic choice; it's what makes this work at MacBook scale.

Channel | Mechanism | What it stores | Persists across | Job
Within turn | Attention KV cache | Verbatim tokens, exact positions | Single forward pass | Exact recall — "what did the user just type 200 tokens back?"
Within session | SSM hidden state | Compressed running summary (gist) | Save/load — serialized to disk | Cheap long threads — resume tomorrow with conversation state intact
Across all sessions | LoRA adapter weights | Durable learned patterns | All future sessions, across reboots | Knowledge in weights — the model gets sharper at you over weeks

These are not interchangeable. Don't try to smuggle exact-recall into the SSM channel; don't expect LoRA to carry the gist of an ongoing conversation. Each layer has its job.

Why hybrid Mamba+attention is the right base

Each channel above depends on a property of the base architecture:

  • Pure attention has an unbounded KV cache that grows linearly with context. Memory and decode latency both balloon as conversations lengthen — and persisting a long session's KV across save/load would mean GB-scale state files. Fine for a chatbot, fatal for a daily-driver you resume across days.
  • Pure SSM has constant-size state (great for persistence) but loses the exact recall that attention gives you — "the function name three messages ago" becomes lossy.
  • Hybrid keeps the SSM's constant-size state and attention's exact recall on a recent window. Compared to pure attention at the same context length, three concrete wins fall out:
    • Much less memory. SSM layers carry O(1) state per token; the attention KV cache only spans a short recent window. At 10–100K context the difference is order-of-magnitude — the difference between "fits in laptop RAM" and "doesn't."
    • Faster decode. Per-token cost doesn't grow with conversation length. The SSM half of every layer is constant-cost per token, so a 50-message thread decodes at the same speed as a 5-message one.
    • Cheap online updates. Both the in-conversation SSM state (updated in place, token-by-token, O(1) per token) and the LoRA adapter (trained between sessions on the same machine) stay tractable on a laptop. At pure-attention scaling the same workloads would need server-class memory.

This is why cognit's session save/load works and why the whole system fits on a MacBook: SSM state is what serializes, hybrid runtime costs stay bounded as threads grow, and a 1-3B hybrid base plus a small LoRA can do inference and online learning in laptop-class memory.
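
To make the memory claim concrete, here is a back-of-envelope comparison. Every dimension below is an illustrative assumption for a roughly 1B-parameter model, not Zamba2-1.2B's actual configuration; the point is the scaling, not the exact numbers.

# Rough memory comparison: growing attention KV cache vs bounded hybrid state (fp16).
BYTES = 2

def pure_attention_kv_bytes(tokens, layers=24, d_model=2048):
    # Keys + values cached for every layer and every token seen so far.
    return 2 * layers * d_model * tokens * BYTES

def hybrid_state_bytes(window=2048, attn_layers=4, ssm_layers=48, d_model=2048, d_state=16):
    # Fixed-size SSM recurrent state plus a KV cache bounded to a short recent window.
    ssm = ssm_layers * d_model * d_state * BYTES
    kv = 2 * attn_layers * d_model * window * BYTES
    return ssm + kv

for tokens in (1_000, 10_000, 100_000):
    print(f"{tokens:>7} tokens: pure attention ~{pure_attention_kv_bytes(tokens) / 2**30:.2f} GiB, "
          f"hybrid ~{hybrid_state_bytes() / 2**20:.0f} MiB")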

Why LoRA on top — and not "just longer sessions"

SSM state across sessions could in principle carry knowledge forward. In practice it doesn't, for the same reason gist-summaries don't substitute for studying: it's a lossy compression, unstructured, and the next conversation will overwrite it. What you actually want preserved across weeks of use — your codebase quirks, your writing style, terminology you've corrected — needs to live in weights. LoRA is that channel.

The base model stays frozen; only a small rank-r adapter (~1 MB at Zamba2-1.2B) learns. Adapter updates are cheap enough to run as background work after a chat session ends.
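
The "~1 MB" figure falls out of LoRA's parameter count: each wrapped Linear adds only r * (d_in + d_out) trainable parameters. The rank, dimensions, and module count below are illustrative assumptions, not cognit's actual defaults.

# LoRA adapter size, back of the envelope (illustrative numbers).
rank = 8
wrapped_linears = [(2048, 2048)] * 24   # assume ~24 wrapped projections at d_model 2048
params = sum(rank * (d_in + d_out) for d_in, d_out in wrapped_linears)
print(params, "trainable params,", params * 2 / 2**20, "MiB in fp16")   # ~0.8M params, ~1.5 MiB: MB-scale either way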

The training loop

When a turn is promoted to the capture queue (you /good or /fix it in chat, or an agent calls POST /mark), cognit runs a few gradient steps of LoRA fine-tuning on the accumulated captures — on session exit, or via explicit train_pending. The base model's weights are frozen; only the small adapter changes.
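
Conceptually, the pass on session exit looks like the sketch below. This is schematic PyTorch against a HuggingFace-style causal LM, not cognit's actual code: the "lora_" parameter-name convention, the capture fields, and the hyperparameters are all assumptions.

import torch

def train_on_captures(model, tokenizer, captures, steps_per_capture=3, lr=1e-4):
    # Freeze everything, then re-enable gradients only for the LoRA adapter parameters.
    for p in model.parameters():
        p.requires_grad_(False)
    adapter_params = [p for name, p in model.named_parameters() if "lora_" in name]
    for p in adapter_params:
        p.requires_grad_(True)
    opt = torch.optim.AdamW(adapter_params, lr=lr)

    for capture in captures:                 # each capture is a marked or corrected turn
        batch = tokenizer(capture["prompt"] + capture["target"], return_tensors="pt")
        for _ in range(steps_per_capture):   # "a few gradient steps" per capture
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()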

Two ways to feed the adapter

The same training loop has two feeders, both writing into the same LoRA:

  • Conversation capture — turns from chat sessions, promoted via /good, /fix, or POST /mark. Online, incremental, driven by how you actually use the model.
  • Corpus ingestion — QmdCorpusIngester points at a qmd-managed collection (your notes, project docs, style guide) and trains the adapter on the documents directly. Bulk, no chat session required. Hash-tracked so unchanged docs don't re-train.

Conversation capture handles correction ("you got this wrong, here's right"); corpus ingestion handles bootstrapping ("here's everything I want you to already know on day one"). Both land in the same adapter and coexist seamlessly.

qmd as the backend — rather than a hand-rolled markdown walker — gives cognit stable qmd://collection/path URIs, dedup, and shared state with anything else you've connected to qmd (search, MCP, retrieval). See examples/06_growing_brain_demo.py for the full loop end-to-end.

Catastrophic-forgetting protection (default-on)

Adapter interference is the real engineering risk: earlier-learned patterns in the LoRA can be overwritten by later updates. Cognit's mitigation is replay sampling — mixing previously-trained captures into each new training pass — which is trivial to implement (the captures are already in the queue) and was ~99% effective in our measurements on Zamba2-1.2B versus an unprotected baseline. L2 regularization toward pre-pass weights was nearly useless at typical LoRA scales (kept as an opt-in research knob, not the default). See examples/08_forgetting_protection_demo.py.
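
Replay sampling itself fits in a few lines: each training pass mixes a sample of already-trained captures back in with the new ones so earlier patterns keep receiving gradient signal. The replay ratio below is illustrative, not cognit's default.

import random

def build_training_set(new_captures, trained_captures, replay_ratio=0.5):
    # Mix previously-trained captures into the new pass to protect earlier learning.
    n_replay = min(int(len(new_captures) * replay_ratio), len(trained_captures))
    mixed = new_captures + random.sample(trained_captures, n_replay)
    random.shuffle(mixed)
    return mixed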

What we've validated empirically

The three-channel thesis isn't just architectural framing — both non-trivial channels (SSM, LoRA) have been demonstrated end-to-end on Zamba2-1.2B running locally. Reproduce with the example scripts; numbers below are from M-series MPS runs.

SSM state is real cross-session memory

examples/04_persistence_demo.py

Pre-fill a ~500-character prior context into a session, persist the SSM state to disk on exit, then load it back into a new session. Generate from a neutral prompt with and without the loaded state. Findings:

  • Logit divergence at the first generated position is nonzero and meaningful — the same prompt produces measurably different next-token predictions when SSM state from the prior session is loaded.
  • State file is KB-scale on disk per session, vs GB-scale for attention KV at the same context length.
  • Sampled generation is topic-consistent with the prior context, not with the cold baseline.

This validates the SSM channel as actual working memory across save/load. One caveat: because Zamba2's hybrid layers lean on attention for verbatim recall, SSM-only persistence is a "gist" memory rather than exact recall — by design (see the architecture section above).
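
The "logit divergence" reported in these demos is just a comparison of next-token logits at the first generated position, with and without the restored state. A minimal sketch (logits_cold / logits_warm stand for the model's first-position logits from the cold and state-loaded runs; they are placeholders, not cognit's API):

import torch
import torch.nn.functional as F

def first_position_divergence(logits_cold: torch.Tensor, logits_warm: torch.Tensor):
    # Same prompt, same weights; the only difference is the restored SSM state.
    l2 = (logits_warm - logits_cold).norm().item()
    kl = F.kl_div(F.log_softmax(logits_warm, dim=-1),
                  F.log_softmax(logits_cold, dim=-1),
                  log_target=True, reduction="sum").item()
    return l2, kl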

LoRA learning is durable across sessions

examples/05_lora_learning_demo.py

Baseline: generate from "The Glorpfish lives in" — Zamba2 confabulates something (the Glorpfish is fictional). Then teach cognit 3 corrected turns about it; session exit triggers a few gradient steps of LoRA fine-tuning. Open a fresh session, generate from the same prompt:

  • Generated continuation matches the taught content ("the purple caves of Mooncliff") in the new session — no prompting reminds the model of what it learned.
  • Logit divergence at the first generated position confirms the weights changed, not just sampling artifacts.
  • Adapter file is a tiny fraction of base (sub-percent of 2.4 GB) yet durably changes what cognit knows.

This validates the LoRA channel as durable across-session learning. A small artifact, layered on the shared frozen base, encodes what's been taught.

Profiles — isolated cognit identities

Each profile (work, personal, research, …) has its own adapter, captures, and sessions. Switching profiles is like switching to a different model — no cross-contamination.

cognit profile new work --model-type zamba2 --base Zyphra/Zamba2-1.2B-Instruct-v2
cognit profile new personal --model-type zamba2 --base Zyphra/Zamba2-1.2B-Instruct-v2

cognit --profile work chat --session pr-review
cognit --profile personal generate "What do I usually cook on Tuesdays?"

Within a profile, sessions are conversation threads — separate SSM state per thread, shared LoRA. Layout:

~/.cognit/profiles/<name>/
├── profile.json       # config
├── adapter.pt         # LoRA weights for this profile
├── captures.jsonl     # capture queue
└── sessions/
    ├── default.safetensors    # SSM state per thread
    └── pr-review.safetensors

Examples

Script | What it shows
04_persistence_demo.py | SSM state persistence empirically changes generation
05_lora_learning_demo.py | LoRA training durably encodes new knowledge
06_growing_brain_demo.py | Corpus ingestion + conversation capture → same adapter
07_profile_isolation_demo.py | Profile isolation (work vs personal)
08_forgetting_protection_demo.py | Catastrophic-forgetting protection measured 4 ways

All examples run against Zamba2-1.2B (no checkpoint setup beyond cognit init --yes).

Status

Alpha. The system works end-to-end on a MacBook with Zamba2-1.2B today. What's known:

  • Inference: ~5–7 tok/s for Zamba2-1.2B on M-series MPS.
  • LoRA training: ~8s per step at 1.2B on MPS (slow but workable for background "train between sessions" usage).
  • Sessions: SSM state persists across runs. Hybrid models lose exact-recall (attention KV) across sessions by design; LoRA covers the durable channel.
  • Base models: Zamba2-1.2B / 2.7B / 7B via HuggingFace; or cognit-native checkpoints (sparse Jamba).
  • HTTP server: OpenAI-compatible, SSE streaming, profile-as-model.

What's missing or rough:

  • The chat REPL doesn't apply chat templates between turns (sessions rely on cache continuity; a proper instruct-style multi-turn protocol is a future improvement).
  • Adapter management CLI (save/load/list/diff/share) — not yet built.

Open problems

The architectural thesis is validated end-to-end (see empirical findings). What this project hasn't yet answered:

  • Tool calling. Coding agents like Pi assume the model can read/write/edit files via structured tool calls. Zamba2-1.2B can't reliably produce tool-call output. Existing small specialists (Qwen2.5-1.5B, Hammer, Arch-Function) work in isolation, but composing them with cognit's LoRA learning loop is an open design question. For now, cognit returns an explicit "not implemented yet" notice + chat fallback on tool-using requests.
  • Longitudinal effectiveness. Does cognit measurably get sharper at you over weeks of real use? The architecture supports it; the empirical study hasn't been run. No formal retention benchmark across many training sessions yet.
  • Multi-turn coherence at scale. SSM state should carry conversation context across long threads. How does coherence hold at 100+ turns? Unmeasured.
  • Bigger hybrid bases. Zamba2-7B may dissolve a lot of these limits — better chat quality, more robust to capture training, plausibly tool-capable. Memory budget on a MacBook is the trade-off.

Implementation, briefly

Cognit wraps a frozen hybrid Mamba+attention base model with three wrapper layers:

  1. Profile — top-level identity. Owns one LoRA adapter and one capture queue.
  2. Session — conversation thread within a profile. Owns one SSM state (serialized between runs); attention KV stays ephemeral.
  3. LoRA adapter — trainable surface. Wraps Linear modules in both Mamba (in_proj, out_proj) and attention (q/k/v/o_proj) blocks. Base weights stay frozen; only the rank-r adapter learns.

At inference, the frozen base + your profile's LoRA produce the response. The session's attention KV is built within the forward pass and discarded afterwards; the SSM state updates in place and is checkpointed to disk on session save. Marked turns feed the capture queue; on session exit, captures train the adapter (with replay protection). The next session starts with the updated adapter and the previous SSM state restored.
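
Wrapping the named projection Linears is the standard LoRA pattern. A generic PyTorch sketch of that wrapping (not cognit's actual wrapper; the module-name suffixes are assumptions about how the base model names its projections):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base Linear plus a trainable low-rank delta: y = W x + scale * B (A x).
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale

TARGET_SUFFIXES = ("in_proj", "out_proj", "q_proj", "k_proj", "v_proj", "o_proj")

def add_lora(model: nn.Module, r: int = 8) -> nn.Module:
    # Replace every targeted Linear in both Mamba and attention blocks with a LoRA wrapper.
    for parent in list(model.modules()):
        for child_name, child in list(parent.named_children()):
            if isinstance(child, nn.Linear) and child_name.endswith(TARGET_SUFFIXES):
                setattr(parent, child_name, LoRALinear(child, r=r))
    return model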

License

MIT — see LICENSE.
