BRB

be right back, with context / barbara remembers better

Long-term memory for Claude. Runs locally, learns silently, remembers everything.

Quick Start · How It Works · Configuration

The Problem

You tell Claude your diet, your tech stack, your project deadlines, your stock portfolio, your name. Next conversation? Gone. All of it. Every time.

BRB is a local proxy that sits between you and the Anthropic API. It learns from every conversation and injects relevant memories into future ones. No commands, no tagging, no manual work. Just talk.

[Screenshots: Session 1, you tell Claude something. Session 2, Claude remembers.]

Features

🧠 Learns automatically: extracts facts, preferences, and decisions from every conversation.

🎯 Retrieves what matters: memories are scored by topic similarity, reinforcement, recency, and confidence.

💰 No extra API costs: embedding and extraction run on local models through llama.cpp, so the memory layer makes no external calls of its own.

🔄 Self-corrects: say "my name is Leo", later say "actually it's Leoncio", and the memory updates in place.

🔒 Private: memory processing and storage are 100% local. PII is redacted before storage. Your API key passes through and is never stored.

⚡ Zero wait: retrieval runs before the request; extraction runs in the background after the response.

🔍 Transparent: GET /memories and GET /memories/search?q=... let you inspect everything stored.
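The inspection endpoints can also be called from code. The following is a minimal sketch, assuming BRB listens on its default port 3000. The helper names are hypothetical, and the response schema is not described here, so results are typed as `unknown`.

```typescript
// Hypothetical helpers; only the two GET paths come from the feature list.
const BRB_BASE = "http://localhost:3000"; // assumption: default BRB_PORT

// Build the search URL, encoding the query text as a query-string parameter.
function memorySearchUrl(query: string, base: string = BRB_BASE): string {
  const url = new URL("/memories/search", base);
  url.searchParams.set("q", query);
  return url.toString();
}

// Fetch every stored memory. The response shape is unspecified, hence `unknown`.
async function listMemories(base: string = BRB_BASE): Promise<unknown> {
  const res = await fetch(new URL("/memories", base));
  if (!res.ok) throw new Error(`GET /memories failed: ${res.status}`);
  return res.json();
}

// Search stored memories by free text.
async function searchMemories(query: string, base: string = BRB_BASE): Promise<unknown> {
  const res = await fetch(memorySearchUrl(query, base));
  if (!res.ok) throw new Error(`GET /memories/search failed: ${res.status}`);
  return res.json();
}
```

With the proxy running, `await searchMemories("dark mode")` would return whatever BRB has stored that matches the query.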

Quick Start

You need Node.js 20+ and llama.cpp built for your machine.

1. Build llama.cpp

BRB uses llama.cpp to run two local model servers — one for embeddings and one for fact extraction. You need to build it from source:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release

Mac users: add -DGGML_METAL=ON to the cmake configure step for GPU acceleration, and -j$(sysctl -n hw.ncpu) to build in parallel:
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

After building, the server binary will be at build/bin/llama-server. By default, start.sh expects llama.cpp at ~/workspace/llama.cpp. Set LLAMA_DIR to override:

export LLAMA_DIR=/your/path/to/llama.cpp

2. Download models into llama.cpp (~2.3GB total)

Models go inside the models/ directory of your llama.cpp build, not inside BRB:

llama.cpp/
├── build/bin/llama-server
└── models/
    ├── nomic-embed-text-v1.5.Q8_0.gguf        # Embedding model (~134MB)
    └── Qwen2.5-3B-Instruct-Q4_K_M.gguf        # Extraction model (~2.1GB)

mkdir -p ~/workspace/llama.cpp/models
cd ~/workspace/llama.cpp/models

curl -L -o nomic-embed-text-v1.5.Q8_0.gguf \
  https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf

curl -L -o Qwen2.5-3B-Instruct-Q4_K_M.gguf \
  https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf

3. Clone and install BRB

git clone https://github.com/lboquillon/brb.git
cd brb
npm install

4. Start model servers and proxy

chmod +x start.sh
./start.sh                              # llama.cpp on :9090 and :9091
cp .env.example .env
npm start                               # BRB on :3000

The start.sh script launches two llama.cpp server instances:

Port 9090 — Embedding server (nomic-embed-text-v1.5) with --embedding --pooling mean
Port 9091 — Extraction server (Qwen2.5-3B-Instruct) for chat completions

You can also start them manually:

~/workspace/llama.cpp/build/bin/llama-server \
  -m ~/workspace/llama.cpp/models/nomic-embed-text-v1.5.Q8_0.gguf \
  --port 9090 --embedding --pooling mean

~/workspace/llama.cpp/build/bin/llama-server \
  -m ~/workspace/llama.cpp/models/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
  --port 9091
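Once both servers are up, a quick check from TypeScript can confirm they respond. This is a sketch rather than part of BRB: it assumes llama-server's `/health` route and its OpenAI-compatible `POST /v1/embeddings` route, and the function names are made up for illustration.

```typescript
// Assumption: the embedding server from start.sh is reachable on :9090.
const EMBED_URL = "http://localhost:9090";

interface JsonPost {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the JSON request for the OpenAI-compatible embeddings route.
function buildEmbeddingRequest(text: string): JsonPost {
  return {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ input: text }),
  };
}

// True when the server answers its health check with a 2xx status.
async function isHealthy(baseUrl: string): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/health`);
    return res.ok;
  } catch {
    return false; // connection refused: server not started yet
  }
}

// Request one embedding vector for the given text.
async function embed(text: string, baseUrl: string = EMBED_URL): Promise<number[]> {
  const res = await fetch(`${baseUrl}/v1/embeddings`, buildEmbeddingRequest(text));
  if (!res.ok) throw new Error(`embedding request failed: ${res.status}`);
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}
```

If the embedding server is configured as described above, the vector returned by `embed` should have 768 entries.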

5. Point Claude at BRB

export ANTHROPIC_BASE_URL=http://localhost:3000
claude
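Other clients can be pointed at the proxy the same way, by swapping the base URL. The sketch below builds a raw Anthropic Messages API request aimed at BRB. It assumes BRB forwards the standard `/v1/messages` route unchanged; `buildMessagesRequest` and `ask` are hypothetical names, and the model name is left as a parameter.

```typescript
interface MessagesRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build a Messages API call that targets the local proxy instead of the API host.
function buildMessagesRequest(
  apiKey: string,
  model: string,
  prompt: string,
  baseUrl: string = "http://localhost:3000", // assumption: default BRB_PORT
): MessagesRequest {
  return {
    url: `${baseUrl}/v1/messages`,
    method: "POST",
    headers: {
      "x-api-key": apiKey, // passed through by the proxy, per the feature list
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model,
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Send the request through the proxy and return the parsed JSON response.
async function ask(apiKey: string, model: string, prompt: string): Promise<unknown> {
  const { url, ...init } = buildMessagesRequest(apiKey, model, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`request failed: ${res.status}`);
  return res.json();
}
```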

How It Works

You ──▶ BRB (localhost:3000) ──▶ Anthropic API
         │                            │
         │  BEFORE request:           │
         │  nomic-embed (:9090)       │
         │    embed query             │
         │        ▼                   │
         │  zvec vector search        │
         │    top 30 candidates       │
         │        ▼                   │
         │  score + rank              │
         │    inject top 10           │
         │    into system prompt ─────┘
         │                            │
You ◀────│◀─── stream response ───────┘
         │
         │  AFTER response (background):
         │  Qwen 3B (:9091)
         │    extract facts
         │        ▼
         │  redact PII
         │  deduplicate
         │  embed + store in zvec

BRB intercepts every API call. Before forwarding, it searches for relevant memories and appends them to the system prompt. After the response streams back, it extracts new facts in the background for next time.
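The injection step can be pictured roughly as follows. This is an illustrative sketch, not BRB's actual code: the `RetrievedMemory` shape, the function name, and the wording of the injected block are all assumptions, and it only handles a system prompt given as a plain string.

```typescript
// Hypothetical shape for a retrieved memory; field names are illustrative.
interface RetrievedMemory {
  text: string;
  score: number;
}

// Append the highest-scoring memories to an existing system prompt.
function injectMemories(
  system: string | undefined,
  memories: RetrievedMemory[],
  maxMemories: number = 10,
): string {
  // Sort a copy so the caller's array is left untouched.
  const top = [...memories].sort((a, b) => b.score - a.score).slice(0, maxMemories);
  if (top.length === 0) return system ?? "";
  const block = [
    "Known facts about the user from earlier conversations:",
    ...top.map((m) => `- ${m.text}`),
  ].join("\n");
  return system ? `${system}\n\n${block}` : block;
}
```

Appending rather than replacing keeps whatever system prompt the client already sent intact.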

The pieces

nomic-embed-text-v1.5 (~134MB, port :9090) converts text into 768-dimensional vectors. "I hate avocados" and "guacamole recipe" end up close in vector space; "database indexes" lands far from both. The same model is used for both storing and searching memories.
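"Close in vector space" is measured with cosine similarity. A minimal sketch of that measure, using toy 2-dimensional vectors in the usage note instead of real 768-dimensional embeddings:

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Vectors pointing the same way score 1 (`cosineSimilarity([1, 0], [1, 0])`), and perpendicular ones score 0 (`cosineSimilarity([1, 0], [0, 1])`).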

Qwen2.5-3B-Instruct (~2.1GB, port :9091) does two jobs: extract atomic facts from conversations ("User prefers dark mode", "User uses PostgreSQL") and rewrite vague follow-ups into searchable queries ("and pears?" with conversation context becomes "user pear food preference").

zvec is an embedded vector database. No server, no Docker, just a local file. Memories are stored with HNSW indexing for fast similarity search.

Scoring

Your brain doesn't treat all memories equally. Something you heard once three years ago is faint. Something people keep telling you every week is strong. And if someone asks you about cooking, your brain doesn't surface your tax documents, no matter how recent they are. Topic has to match first, then recency and repetition break the tie.

BRB works the same way. Similarity gates the score so off-topic memories can't sneak through, then reinforcement and recency rank what's left:

if similarity < 0.30 → score = 0 (hard gate)

strength  = min(1, mentions / 15) * exp(-0.009 * days_since_reinforced)
recency   = exp(-ln2/140 * days_since_last_accessed)
temporal  = 0.65 * strength + 0.35 * recency

score = similarity * (0.68 + 0.32 * temporal) + 0.05 * confidence

Signal	What it means
similarity	Cosine similarity between your message and the stored memory. This dominates the score. If the topic doesn't match, nothing else matters
strength	How many times a fact has been reinforced (mentioned again), decaying from the last reinforcement date. Mentioned 10 times last week = strong. Mentioned once 6 months ago = faded
recency	When the memory was last used in a response. Half-life of 140 days. Keeps stale facts from hogging the top 10
confidence	How sure the extraction model was (0 to 1). Weighted low, adding at most 0.05 to the score, because the model is usually either right or wrong, so its confidence does little to separate memories

Below 0.3 composite score? Dropped. Top 10 survivors get injected into the system prompt.
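The formulas above translate directly into code. The sketch below is one way to write them in TypeScript; the interface, field names, and function names are illustrative, while the constants are the ones given in the formulas and thresholds above.

```typescript
// Illustrative input shape; field names mirror the variables in the formulas.
interface MemorySignals {
  similarity: number;            // cosine similarity to the query
  mentions: number;              // times the fact was reinforced
  daysSinceReinforced: number;
  daysSinceLastAccessed: number;
  confidence: number;            // extraction confidence, 0 to 1
}

const MIN_SIMILARITY = 0.30; // hard gate
const MIN_SCORE = 0.3;       // composite threshold

function scoreMemory(m: MemorySignals): number {
  if (m.similarity < MIN_SIMILARITY) return 0; // off-topic: nothing else matters
  const strength =
    Math.min(1, m.mentions / 15) * Math.exp(-0.009 * m.daysSinceReinforced);
  const recency = Math.exp((-Math.LN2 / 140) * m.daysSinceLastAccessed); // 140-day half-life
  const temporal = 0.65 * strength + 0.35 * recency;
  return m.similarity * (0.68 + 0.32 * temporal) + 0.05 * m.confidence;
}

// Score candidates, drop those under the threshold, keep the best `limit`.
function selectTop<T extends MemorySignals>(candidates: T[], limit: number = 10): T[] {
  return candidates
    .map((c) => ({ c, score: scoreMemory(c) }))
    .filter((x) => x.score >= MIN_SCORE)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((x) => x.c);
}
```

Because the gate runs first, a memory with similarity 0.29 scores 0 no matter how often or how recently it was reinforced.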

Fact extraction runs on a 3B parameter model. It will occasionally drop qualifiers, extract from questions, or hallucinate facts. BRB uses few-shot examples and code-level filters as safety nets, but perfect extraction from a model this size isn't realistic. That's the trade-off for running entirely on your machine with zero API costs.

Configuration

Copy .env.example to .env:

BRB_PORT=3000                         # Proxy port
BRB_DATA_DIR=./data                   # Where memories live
BRB_EMBED_URL=http://localhost:9090   # Embedding server
BRB_EXTRACT_URL=http://localhost:9091 # Extraction server
BRB_EMBED_DIM=768                     # Embedding dimensions
BRB_MAX_MEMORIES=10                   # Max memories injected per request
BRB_MAX_MEMORY_TOKENS=1500            # Max tokens in injected memory block
BRB_MIN_SCORE=0.3                     # Minimum composite score threshold
BRB_MIN_SIMILARITY=0.30               # Similarity gate (below this = score 0)
BRB_DEDUP_THRESHOLD=0.82              # Cosine similarity for merging duplicates
BRB_NO_REWRITE=false                  # Skip query rewriting
BRB_LOG_LEVEL=info                    # Log level: debug, info, warn, error
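For illustration, here is how a few of these variables could be read into a typed object. This is a hedged sketch, not BRB's loader: `BrbConfig` and `loadConfig` are hypothetical names, and the fallbacks simply repeat the values listed above on the assumption that they are the defaults.

```typescript
interface BrbConfig {
  port: number;
  embedUrl: string;
  maxMemories: number;
  minScore: number;
  minSimilarity: number;
  dedupThreshold: number;
  noRewrite: boolean;
}

// Read settings from an env-like map, falling back to the documented values.
function loadConfig(env: Record<string, string | undefined>): BrbConfig {
  const num = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw.trim() === "") return fallback;
    const value = Number(raw);
    if (!Number.isFinite(value)) throw new Error(`${key} must be a number, got "${raw}"`);
    return value;
  };
  return {
    port: num("BRB_PORT", 3000),
    embedUrl: env["BRB_EMBED_URL"] ?? "http://localhost:9090",
    maxMemories: num("BRB_MAX_MEMORIES", 10),
    minScore: num("BRB_MIN_SCORE", 0.3),
    minSimilarity: num("BRB_MIN_SIMILARITY", 0.3),
    dedupThreshold: num("BRB_DEDUP_THRESHOLD", 0.82),
    noRewrite: env["BRB_NO_REWRITE"] === "true",
  };
}
```

In Node, `loadConfig(process.env)` would pick up values from the environment; failing fast on a non-numeric value avoids silently running with `NaN` thresholds.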

Memory categories

Category	Examples
`preference`	"prefers dark mode", "hates semicolons"
`project_context`	"building a REST API", "using PostgreSQL"
`technical_choice`	"chose JWT over sessions", "using Tailwind"
`personal_info`	"name is Leoncio", "based in Miami"
`decision`	"will deploy on Fly.io", "shipping v2 first"
`constraint`	"budget is $500/mo", "deadline is March 15"
`todo`	"need to fix auth bug", "migrate to v3"
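The categories map naturally onto a TypeScript union type. The sketch below is an assumption about how they might be modelled, not BRB's actual types. A guard like this is one way to reject an unexpected category coming back from the extraction model.

```typescript
// The seven category names from the table above.
const MEMORY_CATEGORIES = [
  "preference",
  "project_context",
  "technical_choice",
  "personal_info",
  "decision",
  "constraint",
  "todo",
] as const;

type MemoryCategory = (typeof MEMORY_CATEGORIES)[number];

// Type guard: narrows an arbitrary string to a known category.
function isMemoryCategory(value: string): value is MemoryCategory {
  return (MEMORY_CATEGORIES as readonly string[]).includes(value);
}
```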

Development

npm run dev        # Dev mode with auto-reload
npm test           # Run tests (requires llama.cpp servers on :9090/:9091)
npm run build      # Compile TypeScript
npm run clearData  # Nuke all memories and start fresh

Named after my daughter Barbara, who never forgets a single thing you tell her, even when you wish she would.

BRB: be right back, with context.
