A complete, practical guide to running private, offline AI with Retrieval-Augmented Generation (RAG) on a modern gaming PC — no cloud, no API costs, no data leaving your machine.
📋 Table of Contents
- Why Local LLM?
- Hardware
- Stack Overview
- Part 1 — Ollama Setup
- Part 2 — Custom Modelfiles
- Part 3 — Docker + AnythingLLM
- Part 4 — RAG Workspaces
- Part 5 — SearXNG: Anonymous Live Web Search
- Part 6 — OCR Preprocessing Pipeline
- Part 7 — Performance Notes (RTX 5060 Ti)
- Daily Startup Sequence
- Tips & Lessons Learned
Running LLMs locally gives you:
- Privacy — your documents never leave your machine
- Zero API costs — run thousands of queries for free after setup
- Low latency — no network round trips
- Full control — customize context, system prompts, and model behavior
- Offline capability — works without internet
This guide covers a complete production-ready setup that I actively use daily for document analysis, research, and coding assistance.
| Component | Spec |
|---|---|
| CPU | Intel Core i5-14600K |
| GPU | NVIDIA RTX 5060 Ti 16GB GDDR7 |
| RAM | 32GB DDR5 |
| OS | Windows 11 + WSL2 (Ubuntu) |
The RTX 5060 Ti with 16GB VRAM is the key enabler here. Most consumer GPUs cap at 8GB, which limits you to small models. 16GB lets you run 14B parameter models fully in VRAM with room to spare — a significant leap in output quality.
┌─────────────────────────────────────┐ ┌──────────────────────┐
│ AnythingLLM (UI) │────▶│ SearXNG (port 8080) │
│ Port 3001 via Docker │ │ Anonymous web search│
└──────────────┬──────────────────────┘ └──────────────────────┘
│ Ollama API (port 11434)
┌──────────────▼──────────────────────┐
│ Ollama │ ← Model runner, GPU inference
│ qwen2.5:14b | deepseek-r1:14b │
│ qwen2.5-coder:14b | nomic-embed │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ Your Documents │ ← PDFs, TXT, MD — preprocessed locally
│ (RAG Knowledge Base) │
└─────────────────────────────────────┘
Download from ollama.com and install.
By default, Ollama stores models on C:. If your system drive is space-constrained, redirect to another drive before pulling any models:
[System.Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "X:\your-storage\models", "Machine")
[System.Environment]::SetEnvironmentVariable("OLLAMA_HOME", "X:\your-storage\ollama", "Machine")Restart after setting these. Verify with:
echo $env:OLLAMA_MODELSollama pull qwen2.5:14b
ollama pull qwen2.5-coder:14b
ollama pull deepseek-r1:14b
ollama pull nomic-embed-textNote: nomic-embed-text is the embedding model used by AnythingLLM for RAG. Pull this one even if you don't use it for chat.
ollama listModelfiles let you extend base models with custom system prompts and context window settings. Store them at X:\your-storage\modelfiles\.
File: X:\your-storage\modelfiles\assistant
FROM qwen2.5:14b
PARAMETER num_ctx 16384
SYSTEM """
You are a knowledgeable, detail-oriented assistant. Think step by step.
Provide thorough, well-structured answers. When uncertain, say so clearly.
"""
File: X:\your-storage\modelfiles\coding-assistant
FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384
SYSTEM """
You are an expert software engineer. Write clean, well-commented code.
Explain your reasoning. Prefer idiomatic solutions. Flag potential issues.
Always consider edge cases and error handling.
"""
ollama create assistant -f X:\your-storage\modelfiles\assistant
ollama create coding-assistant -f X:\your-storage\modelfiles\coding-assistantollama listYou should now see your custom models alongside the base ones.
AnythingLLM provides the web UI, RAG pipeline, workspace management, and embedding integration.
- Docker Desktop installed
- Redirect Docker disk image off C: (optional but recommended)
- Move Docker disk image to D: drive: In Docker Desktop → Settings → Resources → Disk image location → change to
X:\your-storage\docker-data
- Move Docker disk image to D: drive: In Docker Desktop → Settings → Resources → Disk image location → change to
docker pull mintplexlabs/anythingllm
docker run -d -p 3001:3001 -v X:\your-storage\anythingllm:/app/server/storage -e STORAGE_DIR=/app/server/storage --name anythingllm mintplexlabs/anythingllmCritical: The -e STORAGE_DIR=/app/server/storage flag is required. Without it, AnythingLLM ignores your volume mount and stores data inside the container (lost on restart).
Open: http://localhost:3001
In AnythingLLM Settings:
- LLM Provider: Ollama
- Base URL:
http://host.docker.internal:11434 - Model:
qwen2.5:14b(or your preferred model) - Embedding Provider: Ollama
- Embedding Model:
nomic-embed-text
Use host.docker.internal — not localhost — because AnythingLLM runs inside Docker and needs to reach Ollama on the host machine.
Workspaces in AnythingLLM are isolated RAG environments. Each has its own document collection, system prompt, and query mode.
| Mode | Behavior | Best for |
|---|---|---|
| Query | Only answers from uploaded documents | Domain-specific knowledge bases |
| Chat | Uses documents + model's general knowledge | Coding help, general Q&A |
- Workspace 1: Domain Knowledge Base
- Mode: Query
- System prompt: Focused on your specific domain (research area, technology stack, etc.)
- Documents: Upload relevant PDFs, reports, reference material
- Workspace 2: Coding Assistant
- Mode: Chat
- Model:
coding-assistant(your custom modelfile) - Documents: API docs, internal codebase references
Use the AnythingLLM UI to upload .pdf, .txt, .md, or .docx files directly into each workspace. After upload, AnythingLLM chunks and embeds them automatically using nomic-embed-text.
Pairing AnythingLLM with a local SearXNG instance creates a fully self-contained, air-gapped AI environment. Your private documents are vectorized locally, your LLM queries stay on your machine, and live web research is routed through an anonymous metasearch engine — no query data sent to Google, Bing, or any commercial API.
| Capability | Without SearXNG | With SearXNG |
|---|---|---|
| Document Q&A | ✅ | ✅ |
| Live web queries | ❌ | ✅ (anonymized) |
| Privacy | Full | Full |
| Internet required | No | Only for web queries |
docker run -d -p 8080:8080 --name searxng searxng/searxngAnythingLLM queries SearXNG via its JSON API. This must be explicitly enabled in SearXNG's settings.yml. Locate the file inside your container or mounted volume and ensure the following block is present:
search:
formats:
- html
- json # Required — AnythingLLM cannot query SearXNG without thisIf you need to edit the file inside the container:
docker exec -it searxng sh
vi /etc/searxng/settings.ymlRestart the container after saving:
docker restart searxngVerify the JSON API is live by visiting: http://localhost:8080/search?q=test&format=json
- Open AnythingLLM at
http://localhost:3001 - Click the Settings gear icon (bottom left corner)
- Select Agent Skills from the left navigation menu
- Find the Web Search capability card and toggle it On
- In the provider dropdown, select SearXNG
- In the base URL field, enter your SearXNG endpoint:
http://host.docker.internal:8080
Critical — Docker networking: Both AnythingLLM and SearXNG run inside Docker containers. From inside a container, localhost refers to that container's own loopback — not your Windows host. Use host.docker.internal:8080 so AnythingLLM's container can reach SearXNG on the host network.
- Navigate to your workspace
- Open Workspace Settings
- Click Agent Configuration
- Confirm your Ollama model (e.g.,
qwen2.5:14b) is set as the default driving agent
Standard queries use your local document vector store first. To explicitly trigger a live web search through SearXNG, prefix your message with @agent:
@agent What are the latest developments in local LLM quantization methods?
AnythingLLM (Docker) ──@agent──▶ SearXNG (Docker, port 8080)
│
▼
Web (anonymized queries)
Results returned locally
Synthesized by Ollama
Scanned documents and image-based PDFs need OCR before AnythingLLM can use them. This pipeline extracts clean text and optionally removes formatting artifacts.
pip install pytesseract pillow pdf2imageAlso install Tesseract OCR and Poppler.
# ocr_pipeline.py
import argparse
import pytesseract
from pdf2image import convert_from_path
from pathlib import Path
def ocr_pdf(input_path: str, output_path: str, dpi: int = 450):
images = convert_from_path(input_path, dpi=dpi)
text_pages = []
for i, image in enumerate(images):
print(f"Processing page {i+1}/{len(images)}...")
text = pytesseract.image_to_string(image)
text_pages.append(text)
full_text = "\n\n--- Page Break ---\n\n".join(text_pages)
Path(output_path).write_text(full_text, encoding="utf-8")
print(f"Saved: {output_path}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--file", required=True)
parser.add_argument("--output", required=True)
parser.add_argument("--dpi", type=int, default=450)
args = parser.parse_args()
ocr_pdf(args.file, args.output, args.dpi)python ocr_pipeline.py --file X:\your-documents\scanned\document.pdf --output X:\your-documents\processed\document.txt --dpi 450X:\your-documents\
├── originals\ ← Never modify; untouched backups
├── scanned\ ← Raw scanned PDFs awaiting OCR
├── processed\ ← Cleaned .txt files ready for RAG upload
└── images\ ← Extracted images if needed
The RTX 5060 Ti is a fantastic fit for local inference. Here's how models fit in 16GB VRAM:
| Model | VRAM Usage | Fits fully? |
|---|---|---|
qwen2.5:7b |
~6GB | ✅ Yes |
qwen2.5:14b |
~10GB | ✅ Yes |
qwen2.5-coder:14b |
~10GB | ✅ Yes |
deepseek-r1:14b |
~11GB | ✅ Yes |
qwen2.5:32b |
~22GB | ❌ Requires offload |
| Model | Tokens/sec |
|---|---|
qwen2.5:7b |
~60–80 t/s |
qwen2.5:14b |
~35–50 t/s |
deepseek-r1:14b |
~30–45 t/s |
- Set
num_ctx 16384in modelfiles for long-document work (default is 2048) - Keep other GPU-intensive tasks closed while running 14B models
- Monitor VRAM with
nvidia-smiin a separate terminal
- Launch Docker Desktop
- In PowerShell:
docker start searxng && docker start anythingllm - Open browser:
http://localhost:3001 - Ollama starts automatically with Windows (check system tray)
To verify all containers are running:
docker ps- AnythingLLM says "cannot connect to Ollama": Confirm Ollama is running (check system tray or run
ollama list). Ensure base URL ishttp://host.docker.internal:11434. - Models don't appear in dropdown: Fix the base URL connection, then refresh or run
docker restart anythingllm. - Generation is slow/CPU-bound: Run
nvidia-smito verify GPU usage. Ensure NVIDIA drivers are up to date.
- Data lost after container restart: Ensure your
docker runcommand contains both-v X:\your-storage\anythingllm:/app/server/storageand-e STORAGE_DIR=/app/server/storage. - Embedding not working: Verify embedding provider is Ollama with
nomic-embed-text. Re-upload documents after fixing.
@agentweb search returns errors: Verify JSON API is active athttp://localhost:8080/search?q=test&format=json. Checksettings.ymlformat layout.- No results returned: Some engines block automated queries. Enable multiple backup engines (Google, Bing, Brave, DuckDuckGo) in
settings.yml.
- Keep model data off your system drive. C: fills up fast. Set
OLLAMA_MODELSand Docker disk image locations early. - Use Query mode for focused knowledge bases. Chat mode mixes general knowledge, which can dilute document precision.
- DPI matters for OCR. 450 DPI is a reliable default; go to 600 for dense text or tables.
host.docker.internalis the Docker bridge. Use it to reach host services from containerized environments.- Embed with
nomic-embed-text, not your chat model. Keeps RAG snappy and offloads processing from generation models.
- LlamaIndex Python pipeline for automated document ingestion
- Continue extension setup in VS Code for inline coding assistance
- Ingest domain-specific corpus from public sources
- Evaluate
qwen2.5:32bwith partial CPU offload
MIT — use freely, attribution appreciated.
Built and maintained by Ritesh Kumar · Pittsburgh, PA
Feedback and PRs welcome.