🧠 Local LLM + RAG Stack on Consumer Hardware

A complete, practical guide to running private, offline AI with Retrieval-Augmented Generation (RAG) on a modern gaming PC — no cloud, no API costs, no data leaving your machine.

📋 Table of Contents

Why Local LLM?
Hardware
Stack Overview
Part 1 — Ollama Setup
Part 2 — Custom Modelfiles
Part 3 — Docker + AnythingLLM
Part 4 — RAG Workspaces
Part 5 — SearXNG: Anonymous Live Web Search
Part 6 — OCR Preprocessing Pipeline
Part 7 — Performance Notes (RTX 5060 Ti)
Daily Startup Sequence
Troubleshooting
Tips & Lessons Learned
Roadmap
License

Why Local LLM?

Running LLMs locally gives you:

Privacy — your documents never leave your machine
Zero API costs — run thousands of queries for free after setup
Low latency — no network round trips
Full control — customize context, system prompts, and model behavior
Offline capability — works without internet

This guide covers a complete production-ready setup that I actively use daily for document analysis, research, and coding assistance.

Hardware

Component	Spec
CPU	Intel Core i5-14600K
GPU	NVIDIA RTX 5060 Ti 16GB GDDR7
RAM	32GB DDR5
OS	Windows 11 + WSL2 (Ubuntu)

The RTX 5060 Ti with 16GB VRAM is the key enabler here. Many mainstream consumer GPUs ship with 8GB, which limits you to small models. 16GB lets you run quantized 14B-parameter models fully in VRAM with room to spare, a significant leap in output quality.

Stack Overview

┌─────────────────────────────────────┐     ┌──────────────────────┐
│           AnythingLLM (UI)          │────▶│  SearXNG (port 8080) │
│           Port 3001 via Docker      │     │  Anonymous web search│
└──────────────┬──────────────────────┘     └──────────────────────┘
               │ Ollama API (port 11434)
┌──────────────▼──────────────────────┐
│              Ollama                 │  ← Model runner, GPU inference
│   qwen2.5:14b | deepseek-r1:14b    │
│   qwen2.5-coder:14b | nomic-embed  │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│         Your Documents              │  ← PDFs, TXT, MD — preprocessed locally
│         (RAG Knowledge Base)        │
└─────────────────────────────────────┘

Part 1 — Ollama Setup

Install Ollama

Download from ollama.com and install.

Move model storage off your system drive

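By default, Ollama stores models under your user profile on C:. If your system drive is short on space, redirect storage to another drive before pulling any models. Run the following in an elevated (Administrator) PowerShell, because the Machine scope writes system-wide variables: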

[System.Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "X:\your-storage\models", "Machine")
[System.Environment]::SetEnvironmentVariable("OLLAMA_HOME", "X:\your-storage\ollama", "Machine")

Restart Ollama (or reboot) after setting these so it picks up the new values, then open a new PowerShell window and verify with:

echo $env:OLLAMA_MODELS

Pull models

ollama pull qwen2.5:14b
ollama pull qwen2.5-coder:14b
ollama pull deepseek-r1:14b
ollama pull nomic-embed-text

Note: nomic-embed-text is the embedding model used by AnythingLLM for RAG. Pull this one even if you don't use it for chat.

Verify

ollama list
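The same check can be scripted against Ollama's local HTTP API. The sketch below is a minimal example, assuming Ollama is listening on its default port 11434 and that the `/api/tags` response has the shape `{"models": [{"name": ...}]}`. The helper names (`installed_names`, `missing_models`, `fetch_tags`) are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Models this guide pulls in Part 1.
REQUIRED = ["qwen2.5:14b", "qwen2.5-coder:14b", "deepseek-r1:14b", "nomic-embed-text"]


def installed_names(tags_payload: dict) -> set[str]:
    """Collect model names from an /api/tags response body."""
    return {m["name"] for m in tags_payload.get("models", [])}


def missing_models(installed: set[str], required: list[str]) -> list[str]:
    """Return the required models that are not installed.

    A requirement without an explicit tag (e.g. "nomic-embed-text") is
    matched against the ":latest" tag, which is how Ollama lists it.
    """
    missing = []
    for name in required:
        full = name if ":" in name else f"{name}:latest"
        if full not in installed:
            missing.append(name)
    return missing


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server (assumed default port)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With Ollama running, `missing_models(installed_names(fetch_tags()), REQUIRED)` should return an empty list once all four pulls have finished.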

Part 2 — Custom Modelfiles

Modelfiles let you extend base models with custom system prompts and context window settings. Store them at X:\your-storage\modelfiles\.

General-purpose assistant

File: X:\your-storage\modelfiles\assistant

FROM qwen2.5:14b
PARAMETER num_ctx 16384
SYSTEM """
You are a knowledgeable, detail-oriented assistant. Think step by step.
Provide thorough, well-structured answers. When uncertain, say so clearly.
"""

Coding assistant

File: X:\your-storage\modelfiles\coding-assistant

FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384
SYSTEM """
You are an expert software engineer. Write clean, well-commented code.
Explain your reasoning. Prefer idiomatic solutions. Flag potential issues.
Always consider edge cases and error handling.
"""

Build and register

ollama create assistant -f X:\your-storage\modelfiles\assistant
ollama create coding-assistant -f X:\your-storage\modelfiles\coding-assistant

Verify

ollama list

You should now see your custom models alongside the base ones.
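To confirm a custom model actually responds, a quick smoke test through the HTTP API can help. This is a minimal sketch, assuming Ollama on its default port and the non-streaming form of `/api/generate`. The function names are hypothetical, and `assistant` refers to the model created above.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str) -> dict:
    """Body for POST /api/generate; stream=False returns one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str,
             base_url: str = "http://localhost:11434") -> str:
    """Send the prompt to Ollama and return the generated text."""
    body = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # The first call can be slow while the model loads into VRAM.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]
```

For example, `generate("assistant", "Explain what a Modelfile is in one sentence.")` should return text shaped by the system prompt defined in the Modelfile.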

Part 3 — Docker + AnythingLLM

AnythingLLM provides the web UI, RAG pipeline, workspace management, and embedding integration.

Prerequisites

Docker Desktop installed
Redirect the Docker disk image off C: (optional but recommended): in Docker Desktop → Settings → Resources → Disk image location, change the path to X:\your-storage\docker-data

Pull and run AnythingLLM

docker pull mintplexlabs/anythingllm
docker run -d -p 3001:3001 -v X:\your-storage\anythingllm:/app/server/storage -e STORAGE_DIR=/app/server/storage --name anythingllm mintplexlabs/anythingllm

Critical: The -e STORAGE_DIR=/app/server/storage flag is required. Without it, AnythingLLM ignores your volume mount and stores data inside the container, where it is lost as soon as the container is removed or recreated.

Access the UI

Open: http://localhost:3001

Connect to Ollama

In AnythingLLM Settings:

LLM Provider: Ollama
Base URL: http://host.docker.internal:11434
Model: qwen2.5:14b (or your preferred model)
Embedding Provider: Ollama
Embedding Model: nomic-embed-text

Use host.docker.internal — not localhost — because AnythingLLM runs inside Docker and needs to reach Ollama on the host machine.

Part 4 — RAG Workspaces

Workspaces in AnythingLLM are isolated RAG environments. Each has its own document collection, system prompt, and query mode.

Workspace modes

Mode	Behavior	Best for
Query	Only answers from uploaded documents	Domain-specific knowledge bases
Chat	Uses documents + model's general knowledge	Coding help, general Q&A

Recommended workspace structure

Workspace 1: Domain Knowledge Base
- Mode: Query
- System prompt: Focused on your specific domain (research area, technology stack, etc.)
- Documents: Upload relevant PDFs, reports, reference material
Workspace 2: Coding Assistant
- Mode: Chat
- Model: coding-assistant (your custom modelfile)
- Documents: API docs, internal codebase references

Uploading documents

Use the AnythingLLM UI to upload .pdf, .txt, .md, or .docx files directly into each workspace. After upload, AnythingLLM chunks and embeds them automatically using nomic-embed-text.
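To make "chunks and embeds" concrete, here is a rough sketch of the general technique: split text into overlapping windows, embed each window, and rank windows by cosine similarity to the query. It is not AnythingLLM's actual implementation. The chunk size and overlap are arbitrary assumptions, the function names are hypothetical, and `embed` assumes Ollama's `/api/embeddings` endpoint on the default port.

```python
import json
import math
import urllib.request


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters overlapping by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once a window has reached the end of the text.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text: str, model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434") -> list[float]:
    """Request one embedding vector from Ollama (assumed default port)."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/api/embeddings", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Indices of the k chunk vectors most similar to the query vector."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]
```

In a real query, the top-ranked chunks are pasted into the prompt as context. That is why Query mode can only answer from what the retrieval step finds.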

Part 5 — SearXNG: Anonymous Live Web Search

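Pairing AnythingLLM with a local SearXNG instance keeps the whole stack self-hosted. Your private documents are vectorized locally, your LLM queries stay on your machine, and live web research goes through a metasearch engine you control. SearXNG forwards each search to upstream engines such as Google or Bing without cookies or a user profile, and no commercial search API key is involved. Only the web-search step needs internet access, so the setup is not air-gapped while that skill is in use.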

What SearXNG adds

Capability	Without SearXNG	With SearXNG
Document Q&A	✅	✅
Live web queries	❌	✅ (anonymized)
Privacy	Full	Full
Internet required	No	Only for web queries

Step 1 — Deploy SearXNG via Docker

docker run -d -p 8080:8080 --name searxng searxng/searxng

Step 2 — Enable JSON API format

AnythingLLM queries SearXNG via its JSON API. This must be explicitly enabled in SearXNG's settings.yml. Locate the file inside your container or mounted volume and ensure the following block is present:

search:
  formats:
    - html
    - json  # Required — AnythingLLM cannot query SearXNG without this

If you need to edit the file inside the container:

docker exec -it searxng sh
vi /etc/searxng/settings.yml

Restart the container after saving:

docker restart searxng

Verify the JSON API is live by visiting: http://localhost:8080/search?q=test&format=json
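The browser check can also be scripted. The sketch below is a minimal example, assuming SearXNG is published on port 8080 as in Step 1 and that the JSON response carries a `results` list whose entries have `title` and `url` fields. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url: str, query: str) -> str:
    """Build the /search URL that requests JSON output."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def summarize_results(payload: dict, limit: int = 3) -> list[tuple[str, str]]:
    """Return (title, url) pairs for the first `limit` results."""
    return [(r.get("title", ""), r.get("url", ""))
            for r in payload.get("results", [])[:limit]]


def search(base_url: str, query: str) -> dict:
    """Run one query; raises urllib.error.HTTPError if the server refuses it."""
    with urllib.request.urlopen(build_search_url(base_url, query), timeout=10) as resp:
        return json.load(resp)
```

If `search("http://localhost:8080", "test")` raises an HTTP error instead of returning a dictionary, the likely cause is that the `json` format is still missing from settings.yml.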

Step 3 — Connect SearXNG to AnythingLLM

Open AnythingLLM at http://localhost:3001
Click the Settings gear icon (bottom left corner)
Select Agent Skills from the left navigation menu
Find the Web Search capability card and toggle it On
In the provider dropdown, select SearXNG
In the base URL field, enter your SearXNG endpoint: http://host.docker.internal:8080

Critical: Docker networking. Both AnythingLLM and SearXNG run inside Docker containers. From inside a container, localhost refers to that container's own loopback, not your Windows host. Use host.docker.internal:8080 so AnythingLLM's container can reach SearXNG through the port published on the host.

Step 4 — Assign Agent Model to Workspace

Navigate to your workspace
Open Workspace Settings
Click Agent Configuration
Confirm your Ollama model (e.g., qwen2.5:14b) is set as the default driving agent

Step 5 — Trigger Web Search in Chat

Standard queries use your local document vector store first. To explicitly trigger a live web search through SearXNG, prefix your message with @agent:

@agent What are the latest developments in local LLM quantization methods?

Architecture recap

AnythingLLM (Docker) ──@agent──▶ SearXNG (Docker, port 8080)
                                        │
                                        ▼
                              Web (anonymized queries)
                              Results returned locally
                              Synthesized by Ollama

Part 6 — OCR Preprocessing Pipeline

Scanned documents and image-based PDFs need OCR before AnythingLLM can use them. The script in this section extracts the text page by page; removing formatting artifacts is an optional extra step.

Dependencies

pip install pytesseract pillow pdf2image

Also install Tesseract OCR and Poppler. pytesseract and pdf2image are thin wrappers around those programs, so both must be on your PATH.

Basic pipeline script

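# ocr_pipeline.py
"""OCR a scanned PDF into a plain-text file, one page at a time."""
import argparse
from pathlib import Path

import pytesseract
from pdf2image import convert_from_path


def ocr_pdf(input_path: str, output_path: str, dpi: int = 450) -> None:
    # Render every PDF page to a PIL image (needs Poppler on PATH).
    images = convert_from_path(input_path, dpi=dpi)
    text_pages = []

    for i, image in enumerate(images, start=1):
        print(f"Processing page {i}/{len(images)}...")
        # Run Tesseract on the page image (needs Tesseract on PATH).
        text_pages.append(pytesseract.image_to_string(image))

    # Keep a visible marker between pages so page boundaries stay recognizable.
    full_text = "\n\n--- Page Break ---\n\n".join(text_pages)
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    out.write_text(full_text, encoding="utf-8")
    print(f"Saved: {out}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="OCR a scanned PDF to text.")
    parser.add_argument("--file", required=True, help="input PDF")
    parser.add_argument("--output", required=True, help="output .txt path")
    parser.add_argument("--dpi", type=int, default=450, help="render resolution")
    args = parser.parse_args()

    ocr_pdf(args.file, args.output, args.dpi)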

Usage

python ocr_pipeline.py --file X:\your-documents\scanned\document.pdf --output X:\your-documents\processed\document.txt --dpi 450
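Raw OCR output usually carries line-break hyphenation and stray whitespace. One possible cleanup pass is sketched below. The function name `clean_ocr_text` and the specific rules are assumptions for illustration, not part of the original script.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Tidy raw OCR output before uploading it for RAG."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Strip trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of two or more spaces/tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The hyphen rule also merges genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so spot-check the output on documents where that matters. The page-break marker written by the script passes through unchanged.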

Folder structure

X:\your-documents\
├── originals\     ← Never modify; untouched backups
├── scanned\       ← Raw scanned PDFs awaiting OCR
├── processed\     ← Cleaned .txt files ready for RAG upload
└── images\        ← Extracted images if needed
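With this layout, batch processing reduces to pairing each file in scanned\ with a target in processed\. A small sketch, assuming the folder names above; `plan_ocr_jobs` is a hypothetical helper, and `ocr_pdf` is the function from the pipeline script.

```python
from pathlib import Path


def plan_ocr_jobs(scanned_dir: Path, processed_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each PDF in scanned_dir with its target .txt, skipping finished ones."""
    jobs = []
    for pdf in sorted(scanned_dir.glob("*.pdf")):
        target = processed_dir / f"{pdf.stem}.txt"
        if not target.exists():  # already OCRed on an earlier run
            jobs.append((pdf, target))
    return jobs
```

A driver loop would then call `ocr_pdf(str(pdf), str(txt))` for each pair. Because finished files are skipped, the batch can be re-run safely after an interruption.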

Part 7 — Performance Notes (RTX 5060 Ti)

With 16GB of VRAM, the RTX 5060 Ti holds every 14B model in this guide entirely on the GPU. Approximate usage on this setup:

Model fit in 16GB VRAM

Model	VRAM Usage	Fits fully?
`qwen2.5:7b`	~6GB	✅ Yes
`qwen2.5:14b`	~10GB	✅ Yes
`qwen2.5-coder:14b`	~10GB	✅ Yes
`deepseek-r1:14b`	~11GB	✅ Yes
`qwen2.5:32b`	~22GB	❌ Requires offload
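The figures in the table can be sanity-checked with back-of-envelope arithmetic: quantized weights take roughly parameters × bits per weight ÷ 8 bytes, and the KV cache grows linearly with context length. The sketch below uses assumed values (about 14.8B parameters, about 4.85 bits per weight for a 4-bit quantization, 48 layers, 8 KV heads, head dimension 128, 16-bit cache) that are not from this guide. Check the model card before relying on them.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB: keys plus values for every layer and token."""
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Assumed figures for a 14B-class model (illustrative, not measured).
weights = weights_gb(params_billion=14.8, bits_per_weight=4.85)
cache = kv_cache_gb(layers=48, kv_heads=8, head_dim=128, context_tokens=16384)
print(f"weights ~{weights:.1f} GB, KV cache at 16k context ~{cache:.1f} GB")
```

Under these assumptions a 14B model needs roughly 9 GB for weights plus about 3 GB of cache at a 16k context, which still fits in 16GB. The same formula puts a 32B model's weights alone near 20 GB.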

Token generation speed (approx.)

Model	Tokens/sec
`qwen2.5:7b`	~60–80 t/s
`qwen2.5:14b`	~35–50 t/s
`deepseek-r1:14b`	~30–45 t/s
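To measure throughput on your own hardware, the timing fields in a non-streaming `/api/generate` response can be used: `eval_count` is the number of generated tokens and `eval_duration` is the generation time in nanoseconds. A minimal sketch follows; the function name is hypothetical.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a non-streaming /api/generate response."""
    eval_count = response["eval_count"]   # tokens generated
    eval_ns = response["eval_duration"]   # nanoseconds spent generating
    return eval_count / eval_ns * 1e9 if eval_ns else 0.0
```

A response reporting 200 tokens in 5 seconds works out to 40 t/s. Running `ollama run <model> --verbose` prints a similar eval rate without any code.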

Key tips

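Set num_ctx 16384 in modelfiles for long-document work (Ollama's default is much smaller: 2048 or 4096 tokens, depending on the version). A larger context also uses more VRAM for the KV cache.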
Keep other GPU-intensive tasks closed while running 14B models
Monitor VRAM with nvidia-smi in a separate terminal
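For scripted monitoring, nvidia-smi can emit machine-readable CSV. The sketch below assumes a single GPU and uses the `--query-gpu` and `--format` options. The function names are hypothetical.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, without header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_vram(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (MiB) from nvidia-smi's CSV output."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def read_vram() -> tuple[int, int]:
    """Run nvidia-smi and return (used_mib, total_mib) for the first GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram(out.strip().splitlines()[0])
```

Calling `read_vram()` before and after loading a model gives a direct reading of how much VRAM that model and its context actually occupy.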

Daily Startup Sequence

Launch Docker Desktop
In PowerShell: docker start searxng anythingllm (docker start accepts several container names; the && operator only works in PowerShell 7 and later)
Open browser: http://localhost:3001
Ollama starts automatically with Windows (check system tray)

To verify all containers are running:

docker ps
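A running container does not guarantee the service inside is answering. The sketch below probes the three ports used in this guide from the host, which is why it uses localhost rather than host.docker.internal. The `SERVICES` mapping and function names are assumptions for illustration.

```python
import urllib.error
import urllib.request

# Host-side URLs for the services set up in this guide.
SERVICES = {
    "Ollama": "http://localhost:11434",
    "AnythingLLM": "http://localhost:3001",
    "SearXNG": "http://localhost:8080",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server responded, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timed out, or name not resolved


def check_all(services: dict[str, str], probe=is_up) -> dict[str, bool]:
    """Probe every service and map its name to up/down."""
    return {name: probe(url) for name, url in services.items()}
```

`check_all(SERVICES)` returns a dictionary such as `{"Ollama": True, ...}`; any `False` entry points at the component to restart.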

Troubleshooting

Ollama

AnythingLLM says "cannot connect to Ollama": Confirm Ollama is running (check system tray or run ollama list). Ensure base URL is http://host.docker.internal:11434.
Models don't appear in dropdown: Fix the base URL connection, then refresh or run docker restart anythingllm.
Generation is slow/CPU-bound: Run nvidia-smi to verify GPU usage, or ollama ps to see how much of the loaded model sits on the GPU. Ensure NVIDIA drivers are up to date.

AnythingLLM

Data lost after the container is recreated: Ensure your docker run command contains both -v X:\your-storage\anythingllm:/app/server/storage and -e STORAGE_DIR=/app/server/storage.
Embedding not working: Verify embedding provider is Ollama with nomic-embed-text. Re-upload documents after fixing.

SearXNG

@agent web search returns errors: Verify JSON API is active at http://localhost:8080/search?q=test&format=json. Check settings.yml format layout.
No results returned: Some engines block automated queries. Enable multiple backup engines (Google, Bing, Brave, DuckDuckGo) in settings.yml.

Tips & Lessons Learned

Keep model data off your system drive. C: fills up fast. Set OLLAMA_MODELS and Docker disk image locations early.
Use Query mode for focused knowledge bases. Chat mode mixes general knowledge, which can dilute document precision.
DPI matters for OCR. 450 DPI is a reliable default; go to 600 for dense text or tables.
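host.docker.internal is how containers reach the host. It is a hostname that Docker Desktop resolves to the host machine from inside a container; use it wherever a containerized app needs a service running on the host.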
Embed with nomic-embed-text, not your chat model. A small dedicated embedding model keeps indexing fast and leaves the chat model free for generation.

Roadmap

LlamaIndex Python pipeline for automated document ingestion
Continue extension setup in VS Code for inline coding assistance
Ingest domain-specific corpus from public sources
Evaluate qwen2.5:32b with partial CPU offload

License

MIT — use freely, attribution appreciated.

Built and maintained by Ritesh Kumar · Pittsburgh, PA
Feedback and PRs welcome.
