Using GPT-OSS with LlamaCPP backend #1386

afg1 · 2025-10-13T13:07:13Z

afg1
Oct 13, 2025

I spotted #1358 recently, which is very helpful!

However, as far as I can tell this is using a deployed API client to serve the model.

What changes if I want to use the model more directly, for example running it in LlamaCPP? Do I still need to define the grammar to be able to use the harmony format? Or is there a specific chat template that will replicate it for llama.cpp?

Thanks!

xXMrNidaXx · 2026-02-23T13:36:38Z

xXMrNidaXx
Feb 23, 2026

Good question! Running GPT-OSS locally with llama.cpp + Guidance has some quirks.

The key difference:

API mode: Guidance sends prompts to the API, which handles tokenization internally.

Local mode: Guidance needs direct access to the tokenizer + generation loop.

For llama.cpp specifically:

Use the llama-cpp-python binding

from guidance import models

model = models.LlamaCpp(
    path="/path/to/gpt-oss.gguf",
    n_ctx=8192,
)

Chat template matters
llama.cpp uses the template embedded in the GGUF metadata. If it is missing or wrong, you can override:
```
model = models.LlamaCpp(
    path="...",
    chat_template="llama3"  # or custom template string
)
```
Grammar vs Harmony
- Harmony format is the internal Guidance representation
- It compiles to GBNF grammar for llama.cpp
- You do not need to manually define GBNF — Guidance handles the conversion

Quick test:

from guidance import models, gen

model = models.LlamaCpp("/path/to/model.gguf")
result = model + "Name: " + gen(name="name", max_tokens=10)
print(result["name"])

We have deployed GPT-OSS locally at Revolution AI — llama.cpp + Guidance works well once you get the template alignment right.

0 replies

xXMrNidaXx · 2026-02-23T16:22:51Z

xXMrNidaXx
Feb 23, 2026

GPT-OSS with LlamaCPP is a great combo! At RevolutionAI (https://revolutionai.io) we run local models.

Setup:

import guidance
from guidance.models import LlamaCpp

# Load model
model = LlamaCpp(
    model="gpt-oss-20b.gguf",
    n_ctx=8192,
    n_gpu_layers=40  # Offload to GPU
)

# Use with Guidance
with guidance(model) as g:
    g += "Question: " + query
    g += "\nAnswer: " + gen("answer", max_tokens=500)

Performance tips:

Use Q4_K_M quantization for balance
Enable GPU offload
Tune context size to needs
Batch when possible

Memory requirements:

Q4: ~12GB for 20B
Q8: ~22GB for 20B

Works great for offline/private deployments!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using GPT-OSS with LlamaCPP backend #1386

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Using GPT-OSS with LlamaCPP backend #1386

Uh oh!

afg1 Oct 13, 2025

Replies: 2 comments

Uh oh!

xXMrNidaXx Feb 23, 2026

Uh oh!

xXMrNidaXx Feb 23, 2026

afg1
Oct 13, 2025

xXMrNidaXx
Feb 23, 2026

xXMrNidaXx
Feb 23, 2026