Replies: 2 comments
-
|
Good question! Running GPT-OSS locally with llama.cpp + Guidance has some quirks. The key difference: API mode: Guidance sends prompts to the API, which handles tokenization internally. Local mode: Guidance needs direct access to the tokenizer + generation loop. For llama.cpp specifically:
Quick test: from guidance import models, gen
model = models.LlamaCpp("/path/to/model.gguf")
result = model + "Name: " + gen(name="name", max_tokens=10)
print(result["name"])We have deployed GPT-OSS locally at Revolution AI — llama.cpp + Guidance works well once you get the template alignment right. |
Beta Was this translation helpful? Give feedback.
-
|
GPT-OSS with LlamaCPP is a great combo! At RevolutionAI (https://revolutionai.io) we run local models. Setup: import guidance
from guidance.models import LlamaCpp
# Load model
model = LlamaCpp(
model="gpt-oss-20b.gguf",
n_ctx=8192,
n_gpu_layers=40 # Offload to GPU
)
# Use with Guidance
with guidance(model) as g:
g += "Question: " + query
g += "\nAnswer: " + gen("answer", max_tokens=500)Performance tips:
Memory requirements:
Works great for offline/private deployments! |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
I spotted #1358 recently, which is very helpful!
However, as far as I can tell this is using a deployed API client to serve the model.
What changes if I want to use the model more directly, for example running it in LlamaCPP? Do I still need to define the grammar to be able to use the harmony format? Or is there a specific chat template that will replicate it for llama.cpp?
Thanks!
Beta Was this translation helpful? Give feedback.
All reactions