A fully offline, on-device Context-Augmented Generation (CAG) support agent for gas field inspection and maintenance engineers. Built with Foundry Local, this sample shows you how to build a production-style CAG application that runs entirely on your machine, with no cloud, no API keys, and no internet required. The app automatically selects the best model for your device based on available system RAM.
New to CAG? Context-Augmented Generation is a pattern where all domain knowledge is pre-loaded into the model's context window at startup. Unlike RAG (Retrieval-Augmented Generation), which retrieves relevant chunks at query time, CAG injects the full knowledge base into the system prompt upfront. This eliminates the need for vector databases, embeddings, or retrieval pipelines, making the system simpler and faster whilst still grounding the model's answers in your documents.
Want to compare approaches? See local-rag for a RAG-based implementation of the same scenario using vector search and embeddings.
If you're a developer getting started with AI-powered applications, this project demonstrates:
- How CAG works end-to-end: document loading, context injection, and grounded generation
- Running AI models locally with Foundry Local (no GPU required, works on CPU/NPU)
- Building a mobile-responsive web UI that works in the field (large touch targets, high contrast, PWA-ready)
- Streaming AI responses using Server-Sent Events (SSE)
- Zero-infrastructure AI: no vector database, no embeddings, no retrieval pipeline
How a query flows:
- At startup, all 20 domain documents are loaded from `docs/` into memory and a document index is built
- The user types a question in the browser
- The Express server receives it and selects the top 3 most relevant documents using keyword scoring
- The chat engine builds a prompt containing the system instructions, the document index, the selected documents (~6K chars), and the user's question
- Foundry Local generates a response using the auto-selected model, grounded in the relevant context
- The response streams back to the browser token-by-token via SSE
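The final step, streaming tokens over SSE, mostly comes down to framing each token as a `data:` event. A minimal sketch (the helper name and payload shape are assumptions, not the sample's actual wire format):

```javascript
// Format one model token as a Server-Sent Events message.
// SSE frames are "data: <payload>" lines terminated by a blank line.
function formatSseToken(token) {
  return `data: ${JSON.stringify({ token })}\n\n`;
}

// Hypothetical Express handler usage:
//   res.setHeader("Content-Type", "text/event-stream");
//   for await (const token of stream) res.write(formatSseToken(token));
//   res.end();
```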
- 100% offline: no internet, no cloud, no outbound calls
- Dynamic model selection: automatically picks the best model for your device based on available RAM
- Visual startup progress: loading overlay with progress bar and step-by-step status displayed in the browser whilst the model downloads and loads
- Safety-first prompting: safety warnings surface before any procedure
- CAG context injection: answers grounded in pre-loaded gas engineering documents
- Streaming responses: real-time SSE streaming to the UI
- Mobile responsive: works on phones, tablets, and desktops in the field
- Edge/compact mode: a toggle that shrinks the prompt and output budget for high-latency or resource-constrained devices
- Field-ready UI: high contrast, large touch targets, works with gloves/PPE
| Desktop | Mobile |
|---|---|
| ![]() | ![]() |
Before you begin, make sure you have:
- Node.js ≥ 20: download from [nodejs.org](https://nodejs.org)
- Foundry Local: Microsoft's on-device AI runtime. Install it with:

  ```shell
  winget install Microsoft.FoundryLocal
  ```

- The best model is auto-selected and auto-downloaded on first run based on your device's RAM
Tip: Run `foundry model list` to check which models are already cached on your machine. Set the `FOUNDRY_MODEL` environment variable to force a specific model alias (e.g. `FOUNDRY_MODEL=phi-3.5-mini npm start`).
```shell
# 1. Clone the repository
git clone https://github.com/leestott/local-cag.git
cd local-cag

# 2. Install dependencies
npm install

# 3. Start the server (loads documents and starts Foundry Local automatically)
npm start
```

Open http://127.0.0.1:3000 in a browser. You should see the landing page with quick-action buttons and the chat input.
- The Express server starts immediately on port 3000 and begins serving the web UI.
- The browser connects to the `/api/status` SSE endpoint and displays a loading overlay with a progress bar.
- All `.md` files in `docs/` are read and parsed (including optional YAML front-matter for title, category, and ID).
- The documents are grouped by category and assembled into a structured domain context block.
- The model selector evaluates available system RAM and picks the best model from the Foundry Local catalogue (downloading it on first run if needed, with download progress streamed to the browser).
- Once the model is loaded, the overlay fades away and the chat interface becomes active.
Chat endpoints return 503 whilst the model is loading, so the UI cannot send queries before the engine is ready. There is no ingestion step, no vector database, and no embedding pipeline. Documents are loaded into memory at startup and the most relevant ones are selected per query.
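The "503 whilst loading" behaviour can be implemented as a small Express-style middleware guard. A sketch under assumed names (the sample's actual server code may structure this differently):

```javascript
// Reject chat requests with 503 until the engine reports ready.
// `isReady` is a hypothetical callback backed by the startup sequence.
function makeReadyGuard(isReady) {
  return function readyGuard(req, res, next) {
    if (!isReady()) {
      // Mirror the documented behaviour: 503 until the model is loaded.
      res.status(503).json({ error: "Model is still loading" });
      return;
    }
    next();
  };
}

// Hypothetical usage: app.post("/api/chat", makeReadyGuard(() => engineReady), chatHandler);
```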
Type a question or tap one of the quick-action buttons. The agent uses the pre-loaded domain context to generate a safety-first response:
The UI is fully responsive: the same interface works on mobile devices with appropriately sized touch targets:
To expand the knowledge base, add .md files to the docs/ folder and restart the server. Documents are loaded at startup and injected into the system prompt.
```markdown
---
title: My Procedure Title
category: Inspection Procedures
id: DOC-CUSTOM-001
---

# My Procedure Title

## Safety Warning
- Important safety note here.

## Procedure
1. Step one.
2. Step two.
```

```text
LOCAL-CAG/
├── docs/                 # 20 gas engineering domain documents
│   ├── 01-gas-leak-detection.md
│   ├── 02-regulator-fault-low-pressure.md
│   ├── 03-emergency-shutdown.md
│   ├── ...
│   └── 20-no-gas-flow-decision-tree.md
├── public/
│   └── index.html        # Field engineer web UI (single-file, no build step)
├── src/
│   ├── chatEngine.js     # Foundry Local + CAG orchestration
│   ├── config.js         # App configuration (paths, RAM budget)
│   ├── context.js        # Document loading + context block construction
│   ├── modelSelector.js  # Dynamic model selection based on device RAM
│   ├── prompts.js        # System prompts (full + compact/edge)
│   └── server.js         # Express server, SSE status broadcast, API endpoints
├── screenshots/          # App screenshots
├── test/                 # Unit tests (Node.js test runner)
├── package.json
└── README.md
```
Understanding each stage will help you adapt this pattern to your own projects.
At startup, all .md files from docs/ are read into memory. Optional YAML front-matter (title, category, ID) is parsed and used to organise the documents. A document index listing all available topics is also built so the model knows what knowledge is available.
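Parsing the optional front-matter only needs to handle flat `key: value` fields. A minimal hand-rolled sketch (the sample's real parser in `src/context.js` may differ):

```javascript
// Split optional YAML front-matter ("---" delimited key: value pairs)
// from a markdown document. Returns { meta, body }.
function parseFrontMatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { meta: {}, body: markdown };
  const meta = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: markdown.slice(match[0].length) };
}
```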
Rather than injecting all 20 documents into every prompt (which can exceed what smaller models handle efficiently on CPU), the engine selects the top 3 most relevant documents per query using keyword scoring. This reduces the context from ~41K chars to ~6K chars, enabling fast responses even on modest hardware. The full document index is always included so the model can reference any topic.
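The top-3 selection can be approximated with simple occurrence counting. A sketch, assuming each document is `{ title, content }` (the sample's actual scoring heuristics may be more refined):

```javascript
// Pick the topK documents whose text contains the most query-word hits.
function selectRelevantDocs(docs, query, topK = 3) {
  const words = query.toLowerCase().split(/\W+/).filter(w => w.length > 2);
  const scored = docs.map(doc => {
    const text = (doc.title + " " + doc.content).toLowerCase();
    let score = 0;
    for (const w of words) {
      // Count occurrences of each query word in the document text.
      score += text.split(w).length - 1;
    }
    return { doc, score };
  });
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .filter(s => s.score > 0)  // drop documents with zero relevance
    .map(s => s.doc);
}
```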
Orchestrates the CAG flow:
- Selects the most relevant documents for the user's query via `selectRelevantDocs()`
- Builds a messages array with the system prompt, the document index, the selected context, conversation history, and the user's question
- Sends it to the locally loaded model via the Foundry Local SDK (in-process, no HTTP round-trips)
- Streams the response back token-by-token
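The messages-array step above can be sketched as a pure function. Field and section names here are assumptions for illustration, not the sample's exact prompt layout:

```javascript
// Assemble the chat messages: system prompt + document index + selected
// context in the system message, then history and the new question.
function buildMessages(systemPrompt, docIndex, selectedDocs, history, question) {
  const context = selectedDocs
    .map(d => `## ${d.title}\n${d.content}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content: `${systemPrompt}\n\n# Document index\n${docIndex}\n\n# Context\n${context}`,
    },
    ...history, // prior { role, content } turns
    { role: "user", content: question },
  ];
}
```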
Two prompt variants:
- Full mode (~300 tokens): detailed instructions for safety-first, structured responses
- Edge mode (~80 tokens): minimal prompt for constrained devices with limited context windows
This project uses CAG rather than RAG. Here is how they compare:
| Aspect | CAG (this project) | RAG |
|---|---|---|
| Context delivery | All documents pre-loaded at startup; top 3 selected per query | Relevant chunks retrieved per query via vector similarity |
| Infrastructure | No vector database, no embeddings, no chunking pipeline | Requires vector store, embedding model, chunking pipeline |
| Query latency | No retrieval overhead; prompt is already assembled | Retrieval adds latency (embedding + similarity search) |
| Accuracy | Model sees relevant documents plus a full topic index | Model sees only the top-K retrieved chunks |
| Scalability | Limited by model context window size | Scales to large document collections |
| Complexity | Minimal: just load files and inject into prompt | More moving parts: chunker, embedder, vector store, retriever |
- Small, curated document sets (tens of documents, not thousands)
- Models with large context windows (e.g. Phi-4 supports 16k tokens)
- Constrained environments where simplicity and reliability matter more than scale
- Safety-critical domains where the model should see all relevant information, not just the top-K results
- Hundreds or thousands of documents that exceed the model's context window
- Dynamic document collections that change frequently and are too large to reload
- Precision-critical retrieval where only the most relevant chunks should be included
For the current use case, 20 short procedural guides on constrained local hardware, CAG delivers the best balance of simplicity, reliability, and answer quality.
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/chat` | Non-streaming chat completion |
| POST | `/api/chat/stream` | Streaming chat via SSE |
| GET | `/api/status` | SSE stream of initialisation progress (model download/load status) |
| GET | `/api/context` | List pre-loaded context documents |
| GET | `/api/health` | Health check (includes selected model and selection reason) |
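Consuming `/api/chat/stream` on the client mostly means splitting the SSE byte stream into events. A minimal parser, assuming each event is a single `data: <json>` line (the sample's real payloads may carry extra fields):

```javascript
// Parse a decoded SSE chunk into an array of JSON event payloads.
function parseSseChunk(chunk) {
  return chunk
    .split("\n\n")                          // events are blank-line separated
    .filter(evt => evt.startsWith("data: "))
    .map(evt => JSON.parse(evt.slice("data: ".length)));
}

// Hypothetical usage with fetch():
//   const reader = (await fetch("/api/chat/stream", { method: "POST" })).body.getReader();
//   then decode each chunk and pass it through parseSseChunk().
```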
The 20 included documents cover:
| # | Category | Documents |
|---|---|---|
| 1 | Safety & Compliance | Emergency shutdown, PPE, confined space, hot work permits |
| 2 | Inspection Procedures | Leak detection, pressure testing, valve inspection, pipeline integrity, pre-inspection checklist |
| 3 | Fault Diagnosis | Regulator faults, gas detector fault codes, no-gas-flow decision tree |
| 4 | Repair & Maintenance | Gasket replacement, cathodic protection, corrosion treatment, purging |
| 5 | Equipment Manuals | Compressor maintenance, sensor calibration, relief valve testing, meter installation |
Toggle Edge Mode in the UI header for constrained devices:
| Setting | Full Mode | Edge Mode |
|---|---|---|
| System prompt | ~300 tokens | ~80 tokens |
| Context | Full document content | Safety warnings and key procedures only |
| Max output tokens | 1024 | 512 |
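The toggle in the table above amounts to switching between two generation configurations. A sketch with placeholder prompt constants (the real prompts live in `src/prompts.js`; field names here are illustrative):

```javascript
// Placeholder prompts; the real ones are in src/prompts.js.
const FULL_PROMPT = "You are a safety-first gas field assistant...";  // ~300 tokens
const COMPACT_PROMPT = "Gas field assistant. Safety first.";          // ~80 tokens

// Return the generation settings for the current UI mode.
function getGenerationConfig(edgeMode) {
  return edgeMode
    ? { systemPrompt: COMPACT_PROMPT, contextMode: "safety-only", maxTokens: 512 }
    : { systemPrompt: FULL_PROMPT, contextMode: "full", maxTokens: 1024 };
}
```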
Foundry Local is Microsoft's on-device AI runtime. It lets you run small language models (SLMs) directly on your laptop or workstation, with no GPU required and no cloud dependency. The foundry-local-sdk npm package provides native bindings for direct in-process inference.
This project uses dynamic model selection: the app queries the SDK catalogue at startup, checks system RAM, and picks the largest model that fits comfortably. You can override this by setting the FOUNDRY_MODEL environment variable.
```javascript
import { FoundryLocalManager } from "foundry-local-sdk";

const manager = FoundryLocalManager.create({ appName: "my-app" });

// Auto-select the best model for this device
const models = await manager.catalog.getModels();

// ... or force a specific alias:
const model = await manager.catalog.getModel("phi-3.5-mini");
await model.load();
const chatClient = model.createChatClient();
```

CAG is a pattern where domain knowledge is pre-loaded into memory and injected into the model's context window as part of the system prompt. In this implementation, all 20 documents are loaded at startup and the most relevant ones are selected per query using keyword scoring, keeping prompts small enough for efficient CPU inference. This approach is simpler than RAG because it requires no vector database, embeddings, or retrieval infrastructure.
- CAG: pre-load all documents at startup and select the most relevant ones per query. Simple, no infrastructure, limited by context window.
- RAG: retrieve relevant chunks per query using vector similarity search. More complex, but scales to large document collections.
```shell
npm test
```

Tests use the built-in Node.js test runner (no extra dependencies). They cover configuration and server endpoints.
| Script | Command | Description |
|---|---|---|
| Start | `npm start` | Start the server (production) |
| Dev | `npm run dev` | Start with auto-restart on file changes |
| Test | `npm test` | Run unit tests |
This project is a scenario sample: you can fork it and adapt it to any domain:
- Replace the documents in `docs/` with your own `.md` files (product manuals, internal wikis, support articles)
- Edit the system prompt in `src/prompts.js` to match your domain and tone
- Force a specific model: set `FOUNDRY_MODEL=<alias>` as an environment variable, or leave it unset for automatic selection (run `foundry model list` to see available models)
- Customise the UI: the frontend is a single HTML file with inline CSS, easy to modify
MIT: this solution is a scenario sample for learning and experimentation.