📖 Deep Dive: For a complete architectural breakdown, data structures, and evaluation rubric details, see the Technical Write-up.
git clone https://github.com/jerusan/Ignis.git
cd ignis
cp .env.example .env # Add ANTHROPIC_API_KEY
docker compose upOr manually:
# Backend
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn backend.main:app --reload
# Frontend (new terminal)
cd frontend-v1
npm install && npm run devNobody reads a 48-page manual. But a multiprocess welder with duty cycle matrices, per-process polarity configs, wire feed tensioner specs, and weld diagnosis photos needs expert-level support — not paragraphs of text.
Before writing any code, I scraped public welding forums using Grok and categorized the results with an LLM. The goal was to understand what technicians actually ask — not infer it from the manual. The pattern was consistent: speed numbers, wiring logic, and physical component placement.
Intentionally unscalable, by design. Ignis follows a product-first ingestion strategy: build the extraction and response layers specifically for this machine, manually verify the generated JSON and markdown files for the first product, and generalize the pattern once it's proven. Welding machines vary enough in structure — duty cycle tables, process configs, wiring diagrams — that forcing a common schema across products now would produce a schema that fits none of them well. The plan is to do this properly for the first N products, let the natural shape of the data surface a common structure, and generalize from evidence rather than assumption. A horizontally-scaled RAG dump over 48 pages would be easier to build and worse to use.
The PDF is pre-extracted into structured JSON and markdown rather than queried raw on each request. Feeding raw PDF pages to the model for every question introduces high hallucination rates on dense tabular data — voltage tables, wire speed charts, duty cycle matrices. Structuring it once upfront lets the agent retrieve exact values instead of re-interpreting page layout at query time.
| Layer | File | Purpose |
|---|---|---|
| Hard specs | specs.json |
Numeric tables — duty cycles, current ranges, schema-constrained to block hallucinations on safety-critical values |
| Decision trees | diagnostic_graph.json |
Troubleshooting matrices converted to a traversable yes/no graph |
| Visual registry | visual_registry.json |
Description of each page |
| Procedural guides | chunks/*.md |
Step-by-step guides parsed via PyMuPDF + Claude, retaining strict tabular data integrity |
At runtime, search_manual applies a process-aware boost when ranking chunks: if the session state shows the user is on MIG, wire feed and spool install chunks rank higher automatically — no manual section routing required.
Three source PDFs feed the pipeline: the 48-page owner-manual.pdf, quick-start-guide.pdf, and selection-chart.pdf. All four knowledge layers are generated by a reproducible pipeline — run from repo root with the venv active:
bash run_ingestion.shThe five-stage pipeline extracts text, exports page PNGs, then runs one build script per layer. Extraction includes a selective Claude vision pass on 13 owner-manual pages — those containing polarity schematics, duty cycle matrices, weld diagnosis photos, and labeled component diagrams — where critical information exists only in image form. Text-only pages skip the vision pass entirely.
After ingestion, all extracted files are verified against the source PDF to catch conversion errors before they reach the agent.
Source data validation pipeline:
Ingestion layer output (JSON + Markdown)
↓
Gemini 3.1 Pro — flags potential hallucinations
↓
Claude Opus 4.7 — refines and corrects extracted data
↓
Gemini 3.1 Pro — second-pass verification
↓
Random manual sampling
For a production system with only 48 pages, full human verification is the right call. For this demo, the multi-model pipeline gets close enough to trust the output on spec numbers.
To ensure sub-second response times and control API costs, Ignis leverages Anthropic's ephemeral prompt caching (prompt-caching-2024-07-31).
- The Ingestion Context: The entire structured manual context (~10,000 tokens) is marked as cacheable.
- The Impact: For multi-turn diagnostic sessions, consecutive turns hit the cache, reducing input token processing costs by up to 90% and reducing user latency from ~4.5s down to sub-second responses.
Prose descriptions of hardware configuration are useless to someone wearing welding gloves. The agent routes intent to interactive components instead of generating text:
-
Duty Cycle Calculator — hard math evaluation, not estimation

-
Specifications Configurator — process + material + thickness → wire speed + voltage

-
Polarity Diagram — cable-to-socket visual showing exactly which plug goes where

The Claude Agent SDK loop acts purely as a deterministic intent router and conversational coordinator — leveraging strict tool-calling schemas to trigger custom components instead of interpreting parameters on the fly.
To prevent the LLM from fabricating numbers inside these widgets, the system enforces a strict boundary:
- The agent does not generate the numbers inside the widget. Instead, it emits a parameter block:
<artifact id="mig-polarity" type="widget" name="PolarityDiagram"> {"process": "MIG"} </artifact>
- The React frontend receives this XML block, parses the JSON input, and extracts the correct parameters directly from
specs.jsonorbaseline_grid.json. The LLM's role is restricted to intent routing, making it physically impossible for it to hallucinate values on screen.
Every answer surfaces its source. The manual page used to generate the response is returned alongside the answer as an expandable reference image toggle — users can verify the exact passage or diagram without leaving the chat.
Pixel-coordinate overlays across Front, Back, and Open-Side cabinet layouts. When a procedure is discussed, the relevant component is pinned on the machine visual.
Bi-directional state syncing: As the technician interacts with custom widgets (e.g., changing wire diameter or thickness), the frontend calls updateWorkbench(payload) to update the backend session state.
- This payload updates the backend's session state.
- The next time a chat request is submitted, these variables are injected into the agent's system prompt:
{ "process": "MIG", "voltage": "240V", "thickness": "1/8\"", "wire_size": "0.030\"" } - This keeps the Claude agent dynamically aware of the physical machine configuration without the user typing it.
Setup procedures trigger a checklist artifact. Each completed step advances the machine visual to highlight the next relevant component. Progress is tracked in a compact step bar above the chat.
Watch a sample troubleshooting walkthrough →
When the answer maps cleanly to polarity, duty cycle, or synergic parameters, the right pane auto-switches to the relevant panel and pre-fills it — no manual tab hunting.
In a safety-critical environment, a confident wrong answer is worse than no answer. Weld quality assessment via photo depends on lighting, camera angle, and distance — none of which are controlled in a shop. A VLM pattern-matching on an inconsistent input can confidently misidentify porosity as undercut or vice versa, sending a technician down the wrong fix path. Instead, Ignis routes users through the deterministic diagnostic graph, forcing observation of specific symptoms with yes/no decisions. Where visual comparison helps, canonical reference photos from the manual are surfaced so the technician makes the call themselves — not the model.
Making STT reliable in a welding shop is a separate engineering problem: sustained background noise requires audio gating, and domain-specific terms (DCEP, FCAW, synergic, TIG, MIG) have high error rates in generic speech recognition without fine-tuning on welding vocabulary. Solving that properly would dwarf the core problem in scope. A touch-first UI with large hitboxes solves the gloves problem directly and works without any of that infrastructure.
53 questions across 9 categories, scored /7: technical accuracy (0–3), tool routing (0–2), multimodal (0–2). Checks whether artifacts trigger on the right questions, citations are accurate, and responses are not verbose.
Current benchmark — 89.3% passing (≥ 6/7) · avg 6.55 / 7 · 1.3 hallucinations/run
| Category | Questions | Pass rate | Avg |
|---|---|---|---|
| adversarial | 4 | 100% | 7.00 |
| fault_code | 3 | 100% | 7.00 |
| no_info | 3 | 100% | 7.00 |
| spec | 10 | 100% | 7.00 |
| synergic | 5 | 100% | 7.00 |
| technique | 6 | 100% | 6.83 |
| polarity_setup | 6 | 100% | 6.89 |
| complex | 4 | 92% | 6.67 |
| diagnostic | 12 | 56% | 5.25 |
Spec, fault codes, synergic, and adversarial all score 100% — the categories where correctness is non-negotiable. Diagnostic is the known gap: the agent loses track of traversal state across multi-step user turns, producing non-deterministic loop paths where it skips low-level hardware checks (e.g., Eurofitting connections, relay click confirmation) that the graph requires before advancing. The hallucination rate of 1.3/run is concentrated here. Variance audits via python eval_report.py --runs 3 are being used to isolate which graph nodes are consistently skipped, with fixes targeting those missing edges directly in diagnostic_graph.json.
Every query is graded against ground-truth parameters by a Claude Sonnet judge using a strict rubric:
- Technical Accuracy (0–3): Scored on whether all facts and key numbers match ground truth. Any hallucinated spec or guessing a troubleshooting fix without tracing the diagnostic graph caps the score at 1.
- Tool Routing (0–2): Penalizes the agent for failing to call required specs or asking redundant questions already answered by the user.
- Multimodal Output (0–2): Evaluates if the required diagrams, photo comparison charts, or custom interactive widgets were correctly surfaced.
We also run variance audits (via python eval_report.py --runs 3) to calculate the score standard deviation across runs, allowing us to find and fix non-deterministic diagnostic loops.
Ignis was intentionally built "unscalable" to prioritize zero-hallucination accuracy for the Vulcan OmniPro 220. To scale this system to
- Schema Generation: Use the automated multi-model ingestion pipeline to parse new PDFs and output matching
specs.json(Layer 1) anddiagnostic_graph.json(Layer 2) files. - Spatial Alignment: Add named visual entries for new machine diagrams to
visual_registry.json(Layer 3), and define the corresponding pixel coordinates for each part infrontend-v1/src/components/SpatialViewport/registryData.ts. - Human-in-the-Loop Audit: Audit the generated schemas using the structured audit prompts in
eval/before committing to the agent knowledge base.
| Endpoint | Method | Description |
|---|---|---|
/chat |
POST |
SSE stream for agent turns (receives message history + session_id) |
/specs |
GET |
Serves structured specifications table (specs.json) |
/baseline-grid |
GET |
Serves synergic parameter grid (baseline_grid.json) |
/assets/{path} |
GET |
Serves static manual page PNG assets to the UI |
get_machine_spec— Fetches numeric ranges, duty cycles, polarity configurations, and power ratings fromspecs.json.diagnose_defect— Walks node-by-node through the traversable yes/no state machine trees indiagnostic_graph.json.get_visual— Resolves coordinate overlays, cable setups, and reference diagrams by name or semantic query fromvisual_registry.json.search_manual— Performs keyword searches over procedural guide chunks, applying a relevance boost based on the active process.get_fault_code— Looks up LCD error messages infault_codes.jsonto return exact definitions and corrective actions.get_synergic_settings— Computes wire feed speed and voltage configs for MIG/flux-cored based on material type, thickness, and wire diameter.
| Agent | Anthropic Claude Agent SDK, Claude Sonnet |
| Backend | Python, FastAPI |
| Frontend | React, TypeScript, Vite, Tailwind CSS |
| Knowledge extraction | PyMuPDF, Claude, Grok |
| Eval | Custom two-stage harness — LLM judge + exact match |
📖 Looking for more? For a complete architectural breakdown, data structures, and evaluation rubric details, see the Technical Write-up.




