ignolia/local-llm-rag-stack

🧠 Local LLM + RAG Stack on Consumer Hardware

A complete, practical guide to running private, offline AI with Retrieval-Augmented Generation (RAG) on a modern gaming PC — no cloud, no API costs, no data leaving your machine.


📋 Table of Contents

  • Why Local LLM?
  • Hardware
  • Stack Overview
  • Part 1 — Ollama Setup
  • Part 2 — Custom Modelfiles
  • Part 3 — Docker + AnythingLLM
  • Part 4 — RAG Workspaces
  • Part 5 — SearXNG: Anonymous Live Web Search
  • Part 6 — OCR Preprocessing Pipeline
  • Part 7 — Performance Notes (RTX 5060 Ti)
  • Daily Startup Sequence
  • Tips & Lessons Learned

Why Local LLM?

Running LLMs locally gives you:

  • Privacy — your documents never leave your machine
  • Zero API costs — run thousands of queries for free after setup
  • Low latency — no network round trips
  • Full control — customize context, system prompts, and model behavior
  • Offline capability — works without internet

This guide covers the complete setup I use daily for document analysis, research, and coding assistance.


Hardware

| Component | Spec |
|---|---|
| CPU | Intel Core i5-14600K |
| GPU | NVIDIA RTX 5060 Ti 16GB GDDR7 |
| RAM | 32GB DDR5 |
| OS | Windows 11 + WSL2 (Ubuntu) |

The RTX 5060 Ti with 16GB VRAM is the key enabler here. Many consumer GPUs ship with 8GB, which limits you to small models. 16GB lets you run 14B-parameter models fully in VRAM with room left for context, a significant step up in output quality.


Stack Overview

┌─────────────────────────────────────┐     ┌──────────────────────┐
│           AnythingLLM (UI)          │────▶│  SearXNG (port 8080) │
│           Port 3001 via Docker      │     │  Anonymous web search│
└──────────────┬──────────────────────┘     └──────────────────────┘
               │ Ollama API (port 11434)
┌──────────────▼──────────────────────┐
│              Ollama                 │  ← Model runner, GPU inference
│   qwen2.5:14b | deepseek-r1:14b     │
│   qwen2.5-coder:14b | nomic-embed   │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│         Your Documents              │  ← PDFs, TXT, MD (preprocessed locally)
│         (RAG Knowledge Base)        │
└─────────────────────────────────────┘

Part 1 — Ollama Setup

Install Ollama

Download from ollama.com and install.

Move model storage off your system drive

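By default, Ollama stores models under your user profile on C:. If your system drive is space-constrained, redirect storage to another drive before pulling any models. Run this in an elevated (Administrator) PowerShell, because the Machine scope writes system-wide environment variables: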

[System.Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "X:\your-storage\models", "Machine")
[System.Environment]::SetEnvironmentVariable("OLLAMA_HOME", "X:\your-storage\ollama", "Machine")

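OLLAMA_MODELS is the variable Ollama documents for relocating model storage; OLLAMA_HOME is not listed in its documentation, so the second line may have no effect. Restart Ollama and open a new PowerShell window so the change is picked up, then verify with: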

echo $env:OLLAMA_MODELS

Pull models

ollama pull qwen2.5:14b
ollama pull qwen2.5-coder:14b
ollama pull deepseek-r1:14b
ollama pull nomic-embed-text

Note: nomic-embed-text is the embedding model AnythingLLM uses for RAG. It is an embedding-only model and cannot chat, but document upload and retrieval depend on it, so pull it regardless of which chat model you use.

Verify

ollama list
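The same check can be scripted against Ollama's local HTTP API, which lists installed models at /api/tags. The sketch below is a minimal example rather than part of the original setup: the helper names and the REQUIRED list are illustrative, and it assumes Ollama is listening on its default port 11434.

```python
import json
from urllib.request import urlopen

# Models this guide pulls; adjust to match your own list.
REQUIRED = ["qwen2.5:14b", "qwen2.5-coder:14b", "deepseek-r1:14b", "nomic-embed-text"]


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required models that are absent from an /api/tags payload."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    missing = []
    for name in required:
        # Ollama reports models pulled without a tag as "<name>:latest".
        full_name = name if ":" in name else f"{name}:latest"
        if full_name not in installed:
            missing.append(name)
    return missing


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server which models it has (Ollama must be running)."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags(), REQUIRED)` returns an empty list once all four pulls have finished.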

Part 2 — Custom Modelfiles

Modelfiles let you extend base models with custom system prompts and context window settings. Store them at X:\your-storage\modelfiles\.

General-purpose assistant

File: X:\your-storage\modelfiles\assistant

FROM qwen2.5:14b
PARAMETER num_ctx 16384
SYSTEM """
You are a knowledgeable, detail-oriented assistant. Think step by step.
Provide thorough, well-structured answers. When uncertain, say so clearly.
"""

Coding assistant

File: X:\your-storage\modelfiles\coding-assistant

FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384
SYSTEM """
You are an expert software engineer. Write clean, well-commented code.
Explain your reasoning. Prefer idiomatic solutions. Flag potential issues.
Always consider edge cases and error handling.
"""

Build and register

ollama create assistant -f X:\your-storage\modelfiles\assistant
ollama create coding-assistant -f X:\your-storage\modelfiles\coding-assistant

Verify

ollama list

You should now see your custom models alongside the base ones.
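Once registered, a custom model can be called by name like any other. The following sketch sends one prompt to Ollama's /api/generate endpoint using only the standard library. The function names are illustrative, and it assumes the assistant model created above and the default port 11434.

```python
import json
from urllib.request import Request, urlopen


def build_generate_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body for /api/generate."""
    # stream=False asks Ollama for one JSON object instead of a chunk stream.
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str,
             base_url: str = "http://localhost:11434") -> str:
    """Send a prompt to a local Ollama model and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    request = Request(f"{base_url}/api/generate", data=body,
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=300) as resp:
        return json.load(resp)["response"]
```

For example, `generate("assistant", "Summarize what RAG is in two sentences.")` should answer using the system prompt and context size baked into the Modelfile.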


Part 3 — Docker + AnythingLLM

AnythingLLM provides the web UI, RAG pipeline, workspace management, and embedding integration.

Prerequisites

  • Docker Desktop installed
  • Redirect Docker disk image off C: (optional but recommended)
    • In Docker Desktop → Settings → Resources → Disk image location, change the path to X:\your-storage\docker-data

Pull and run AnythingLLM

docker pull mintplexlabs/anythingllm
docker run -d -p 3001:3001 -v X:\your-storage\anythingllm:/app/server/storage -e STORAGE_DIR=/app/server/storage --name anythingllm mintplexlabs/anythingllm

Critical: The -e STORAGE_DIR=/app/server/storage flag is required. Without it, AnythingLLM ignores your volume mount and stores data inside the container, where it is lost as soon as the container is removed or recreated (for example, when you pull a newer image).
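To confirm that a running container was started with both settings, you can inspect it. The sketch below parses the JSON printed by `docker inspect anythingllm` and checks for the mount and the environment variable. The function name is illustrative and the check is only as strict as shown.

```python
import json

STORAGE_PATH = "/app/server/storage"


def storage_is_persistent(inspect_output: str) -> bool:
    """True if the container mounts STORAGE_PATH and STORAGE_DIR points at it."""
    # `docker inspect <name>` prints a JSON array with one object per container.
    container = json.loads(inspect_output)[0]
    mounted = any(mount.get("Destination") == STORAGE_PATH
                  for mount in container.get("Mounts", []))
    env = container.get("Config", {}).get("Env") or []
    has_env = f"STORAGE_DIR={STORAGE_PATH}" in env
    return mounted and has_env
```

One way to feed it: `subprocess.run(["docker", "inspect", "anythingllm"], capture_output=True, text=True).stdout`.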

Access the UI

Open: http://localhost:3001

Connect to Ollama

In AnythingLLM Settings:

  • LLM Provider: Ollama
  • Base URL: http://host.docker.internal:11434
  • Model: qwen2.5:14b (or your preferred model)
  • Embedding Provider: Ollama
  • Embedding Model: nomic-embed-text

Use host.docker.internal — not localhost — because AnythingLLM runs inside Docker and needs to reach Ollama on the host machine.
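The rule is mechanical: any URL that works from the host with localhost needs the host swapped when it is used from inside a container. A small illustrative helper (the name url_for_container is made up for this sketch) shows the translation:

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1"}


def url_for_container(host_url: str,
                      docker_host: str = "host.docker.internal") -> str:
    """Rewrite a loopback URL so it can be reached from inside a container."""
    parts = urlsplit(host_url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return host_url  # already points somewhere other than loopback
    # Keep the port (if any) and swap only the host name.
    netloc = docker_host if parts.port is None else f"{docker_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

So the Ollama address you test in a browser, http://localhost:11434, becomes http://host.docker.internal:11434 in AnythingLLM's settings.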


Part 4 — RAG Workspaces

Workspaces in AnythingLLM are isolated RAG environments. Each has its own document collection, system prompt, and query mode.

Workspace modes

| Mode | Behavior | Best for |
|---|---|---|
| Query | Only answers from uploaded documents | Domain-specific knowledge bases |
| Chat | Uses documents + model's general knowledge | Coding help, general Q&A |

Recommended workspace structure

  • Workspace 1: Domain Knowledge Base
    • Mode: Query
    • System prompt: Focused on your specific domain (research area, technology stack, etc.)
    • Documents: Upload relevant PDFs, reports, reference material
  • Workspace 2: Coding Assistant
    • Mode: Chat
    • Model: coding-assistant (your custom modelfile)
    • Documents: API docs, internal codebase references

Uploading documents

Use the AnythingLLM UI to upload .pdf, .txt, .md, or .docx files directly into each workspace. After upload, AnythingLLM chunks and embeds them automatically using nomic-embed-text.
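To make "chunks and embeds" concrete, here is a conceptual sketch of the two steps a RAG pipeline performs: splitting text into overlapping windows, then ranking chunks by cosine similarity to a query vector. This is not AnythingLLM's actual implementation. The chunk size, overlap, and function names are assumptions chosen for illustration, and the vectors would in practice come from nomic-embed-text.

```python
import math


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]],
          k: int = 3) -> list[int]:
    """Indices of the k chunk vectors most similar to the query vector."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.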


Part 5 — SearXNG: Anonymous Live Web Search

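Pairing AnythingLLM with a local SearXNG instance keeps the whole stack self-hosted. Your private documents are vectorized locally, your LLM prompts stay on your machine, and live web research goes through a metasearch engine you run yourself. SearXNG forwards each search to upstream engines without your cookies or account identity, so no commercial search API key or user profile is involved. The upstream engines still receive the search terms, and a self-hosted instance queries them from your own IP address, so web search is neither air-gapped nor offline.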

What SearXNG adds

| Capability | Without SearXNG | With SearXNG |
|---|---|---|
| Document Q&A | ✅ | ✅ |
| Live web queries | ❌ | ✅ (anonymized) |
| Privacy | Full | Full |
| Internet required | No | Only for web queries |

Step 1 — Deploy SearXNG via Docker

docker run -d -p 8080:8080 --name searxng searxng/searxng

Step 2 — Enable JSON API format

AnythingLLM queries SearXNG via its JSON API. This must be explicitly enabled in SearXNG's settings.yml. Locate the file inside your container or mounted volume and ensure the following block is present:

search:
  formats:
    - html
    - json  # Required — AnythingLLM cannot query SearXNG without this

If you need to edit the file inside the container:

docker exec -it searxng sh
vi /etc/searxng/settings.yml

Restart the container after saving:

docker restart searxng

Verify the JSON API is live by visiting: http://localhost:8080/search?q=test&format=json
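The same verification can be done from a script. This sketch builds the search URL, fetches it, and pulls the title and URL of each hit from the `results` array in SearXNG's JSON response. Function names are illustrative, and it assumes SearXNG is published on port 8080 as above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str) -> str:
    """URL for a SearXNG search that returns JSON instead of HTML."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def top_results(payload: dict, limit: int = 5) -> list[tuple[str, str]]:
    """(title, url) pairs for the first `limit` hits in a SearXNG response."""
    hits = payload.get("results", [])[:limit]
    return [(hit.get("title", ""), hit.get("url", "")) for hit in hits]


def search(base_url: str, query: str) -> dict:
    """Run one search against a SearXNG instance (needs the json format enabled)."""
    with urlopen(build_search_url(base_url, query), timeout=15) as resp:
        return json.load(resp)
```

If the json format is not enabled in settings.yml, `search("http://localhost:8080", "test")` raises an HTTP error instead of returning results, which makes it a quick check after editing the config.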

Step 3 — Connect SearXNG to AnythingLLM

  1. Open AnythingLLM at http://localhost:3001
  2. Click the Settings gear icon (bottom left corner)
  3. Select Agent Skills from the left navigation menu
  4. Find the Web Search capability card and toggle it On
  5. In the provider dropdown, select SearXNG
  6. In the base URL field, enter your SearXNG endpoint: http://host.docker.internal:8080

Critical (Docker networking): Both AnythingLLM and SearXNG run inside Docker containers. From inside a container, localhost refers to that container's own loopback interface, not your Windows host. Use host.docker.internal:8080 so AnythingLLM's container reaches SearXNG through the port that SearXNG publishes on the host.

Step 4 — Assign Agent Model to Workspace

  1. Navigate to your workspace
  2. Open Workspace Settings
  3. Click Agent Configuration
  4. Confirm your Ollama model (e.g., qwen2.5:14b) is set as the default driving agent

Step 5 — Trigger Web Search in Chat

Standard queries use your local document vector store first. To explicitly trigger a live web search through SearXNG, prefix your message with @agent:

@agent What are the latest developments in local LLM quantization methods?

Architecture recap

AnythingLLM (Docker) ──@agent──▶ SearXNG (Docker, port 8080)
                                        │
                                        ▼
                              Web (anonymized queries)
                              Results returned locally
                              Synthesized by Ollama

Part 6 — OCR Preprocessing Pipeline

Scanned documents and image-based PDFs need OCR before AnythingLLM can use them. This pipeline extracts clean text and optionally removes formatting artifacts.

Dependencies

pip install pytesseract pillow pdf2image

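Also install the Tesseract OCR and Poppler binaries. They are system programs, not pip packages; pytesseract and pdf2image call them, so both must be on your PATH (or configured explicitly in the script).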

Basic pipeline script

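# ocr_pipeline.py
# OCR a scanned PDF into a plain-text file for upload to AnythingLLM.
import argparse
from pathlib import Path

import pytesseract
from pdf2image import convert_from_path


def ocr_pdf(input_path: str, output_path: str, dpi: int = 450) -> None:
    """Render each PDF page to an image, OCR it, and write the combined text."""
    # Higher DPI improves accuracy on small print but costs memory and time.
    images = convert_from_path(input_path, dpi=dpi)
    text_pages = []

    for i, image in enumerate(images, start=1):
        print(f"Processing page {i}/{len(images)}...")
        text_pages.append(pytesseract.image_to_string(image))

    # Mark page boundaries so text can be traced back to its source page.
    full_text = "\n\n--- Page Break ---\n\n".join(text_pages)

    # Create the output folder if needed; write_text fails on a missing parent.
    out_file = Path(output_path)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(full_text, encoding="utf-8")
    print(f"Saved: {output_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="OCR a scanned PDF to text.")
    parser.add_argument("--file", required=True, help="path to the input PDF")
    parser.add_argument("--output", required=True, help="path for the output .txt")
    parser.add_argument("--dpi", type=int, default=450, help="render resolution")
    args = parser.parse_args()

    ocr_pdf(args.file, args.output, args.dpi)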

Usage

python ocr_pipeline.py --file X:\your-documents\scanned\document.pdf --output X:\your-documents\processed\document.txt --dpi 450
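The script above writes raw Tesseract output. For the optional cleanup of formatting artifacts, a post-processing pass along these lines is one possibility. The rules and the function name clean_ocr_text are assumptions for this sketch, not part of the original pipeline, and should be tuned to your documents.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Remove common OCR layout artifacts from extracted text."""
    # Rejoin words split across a line break: "implemen-\ntation" -> "implementation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of spaces or tabs inside a line to a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The hyphen rule also merges genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so review a sample of the output before running it across a whole corpus.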

Folder structure

X:\your-documents\
├── originals\     ← Never modify; untouched backups
├── scanned\       ← Raw scanned PDFs awaiting OCR
├── processed\     ← Cleaned .txt files ready for RAG upload
└── images\        ← Extracted images if needed
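With this layout, batch processing reduces to pairing every PDF in scanned\ with a .txt target in processed\ and skipping files that were already converted. A minimal sketch, with a hypothetical helper name:

```python
from pathlib import Path


def pending_jobs(scanned_dir: str, processed_dir: str) -> list[tuple[Path, Path]]:
    """Pair each scanned PDF with its target .txt, skipping finished ones."""
    processed = Path(processed_dir)
    jobs = []
    for pdf in sorted(Path(scanned_dir).glob("*.pdf")):
        target = processed / (pdf.stem + ".txt")
        if not target.exists():  # already OCR'd on an earlier run
            jobs.append((pdf, target))
    return jobs
```

Each returned pair can be passed to ocr_pdf from ocr_pipeline.py as `ocr_pdf(str(pdf), str(target))`, which makes reruns cheap because finished documents are not processed again.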

Part 7 — Performance Notes (RTX 5060 Ti)

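The RTX 5060 Ti's 16GB of VRAM holds every 14B model in this guide entirely on the GPU. The figures below are approximate and apply to Ollama's default quantized model tags; actual usage grows with the context length (num_ctx).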

Model fit in 16GB VRAM

| Model | VRAM usage | Fits fully? |
|---|---|---|
| qwen2.5:7b | ~6GB | ✅ Yes |
| qwen2.5:14b | ~10GB | ✅ Yes |
| qwen2.5-coder:14b | ~10GB | ✅ Yes |
| deepseek-r1:14b | ~11GB | ✅ Yes |
| qwen2.5:32b | ~22GB | ❌ Requires offload |
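These figures can be sanity-checked with back-of-the-envelope arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below is a rough estimate only. The 4.5 bits per weight used in the example and the 2 GB allowance for KV cache and runtime buffers are assumptions, not measured values.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB.
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed overhead must fit in VRAM."""
    needed = estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb
```

Under those assumptions, `estimate_weights_gb(14, 4.5)` gives about 7.9 GB for the weights of a 14B model, while a 32B model needs about 18 GB for weights alone, which is why it cannot fit in 16GB without offloading.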

Token generation speed (approx.)

| Model | Tokens/sec |
|---|---|
| qwen2.5:7b | ~60–80 t/s |
| qwen2.5:14b | ~35–50 t/s |
| deepseek-r1:14b | ~30–45 t/s |
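To measure generation speed on your own hardware, Ollama's non-streaming /api/generate response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). A small helper, whose name is illustrative, turns those into tokens per second:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration fields."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0  # field missing or no tokens were generated
    # eval_duration is reported in nanoseconds; convert to seconds.
    return response.get("eval_count", 0) / (duration_ns / 1e9)
```

Running `ollama run qwen2.5:14b --verbose` prints a comparable eval rate after each reply, which is handy for a quick check without any code.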

Key tips

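  • Set num_ctx 16384 in modelfiles for long-document work. Ollama's built-in default is far smaller (2048 tokens in many releases), and a larger context window uses more VRAM.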
  • Keep other GPU-intensive tasks closed while running 14B models
  • Monitor VRAM with nvidia-smi in a separate terminal
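For logging or alerts, nvidia-smi can also print memory figures in a machine-readable form. The sketch below queries used and total memory in MiB for the first GPU. The helper names are illustrative, and it assumes nvidia-smi is on the PATH.

```python
import subprocess

# Ask nvidia-smi for bare CSV values: used and total memory in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line into integers (MiB)."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def gpu_memory_mib() -> tuple[int, int]:
    """Return (used, total) VRAM in MiB for the first GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out.splitlines()[0])
```

Calling `gpu_memory_mib()` before and after loading a model shows how much VRAM that model and its context actually occupy.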

Daily Startup Sequence

  1. Launch Docker Desktop
  2. In PowerShell: docker start searxng anythingllm (docker start accepts several container names; the && operator is not available in Windows PowerShell 5.1)
  3. Open browser: http://localhost:3001
  4. Ollama starts automatically with Windows (check system tray)

To verify all containers are running:

docker ps
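`docker ps` shows the containers but not Ollama, which runs on the host. A quick way to check all three services at once is to test whether their ports accept connections. This is a minimal sketch using the port numbers from this guide; the function names are illustrative.

```python
import socket

# Ports used throughout this guide.
SERVICES = {"Ollama": 11434, "AnythingLLM": 3001, "SearXNG": 8080}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host: str = "127.0.0.1") -> dict[str, bool]:
    """Map each service name to whether its port is accepting connections."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}
```

An open port only proves that something is listening, so a False entry tells you which service to start, while a True entry still leaves the UI or API check to you.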

Troubleshooting

Ollama

  • AnythingLLM says "cannot connect to Ollama": Confirm Ollama is running (check system tray or run ollama list). Ensure base URL is http://host.docker.internal:11434.
  • Models don't appear in dropdown: Fix the base URL connection, then refresh or run docker restart anythingllm.
  • Generation is slow/CPU-bound: Run nvidia-smi to verify GPU usage. Ensure NVIDIA drivers are up to date.

AnythingLLM

  • Data lost after the container is removed or recreated: Ensure your docker run command contains both -v X:\your-storage\anythingllm:/app/server/storage and -e STORAGE_DIR=/app/server/storage.
  • Embedding not working: Verify embedding provider is Ollama with nomic-embed-text. Re-upload documents after fixing.

SearXNG

  • @agent web search returns errors: Verify the JSON API is active at http://localhost:8080/search?q=test&format=json. If it is not, check that json is listed under search: formats: in settings.yml, then restart the container.
  • No results returned: Some engines block automated queries. Enable multiple backup engines (Google, Bing, Brave, DuckDuckGo) in settings.yml.

Tips & Lessons Learned

  • Keep model data off your system drive. C: fills up fast. Set OLLAMA_MODELS and Docker disk image locations early.
  • Use Query mode for focused knowledge bases. Chat mode mixes general knowledge, which can dilute document precision.
  • DPI matters for OCR. 450 DPI is a reliable default; go to 600 for dense text or tables.
  • host.docker.internal is how a container reaches the host. Docker Desktop resolves this name to the host machine, so use it instead of localhost for any host service called from inside a container.
  • Embed with nomic-embed-text, not your chat model. Keeps RAG snappy and offloads processing from generation models.

Roadmap

  • LlamaIndex Python pipeline for automated document ingestion
  • Continue extension setup in VS Code for inline coding assistance
  • Ingest domain-specific corpus from public sources
  • Evaluate qwen2.5:32b with partial CPU offload

License

MIT — use freely, attribution appreciated.


Built and maintained by Ritesh Kumar · Pittsburgh, PA
Feedback and PRs welcome.
