Fast, zero-dependency inference engine for OpenAI's gpt-oss in pure Java.
- Single file, no dependencies, based on llama3.java
- Supports GPT-OSS models (including MoE variants)
- Fast GGUF format parser
- Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0, MXFP4
- Matrix-vector kernels using Java's Vector API
- CLI with `--chat` and `--instruct` modes
- Thinking mode control with `--think off|on|inline`
- GraalVM Native Image support
- AOT model preloading for instant time-to-first-token
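As a rough illustration of the Vector API kernels mentioned above, here is a minimal dot product (the core of a matrix-vector multiply) written with `jdk.incubator.vector`. This is a hypothetical sketch, not the engine's actual kernel code; run it with `--add-modules jdk.incubator.vector`:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch only: a SIMD dot product using the Vector API.
// Requires: java --add-modules jdk.incubator.vector DotProduct.java
public class DotProduct {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Vectorized main loop: multiply lanes, then reduce to a scalar.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for elements that don't fill a full vector.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {1f, 1f, 1f, 1f, 1f};
        System.out.println(dot(a, b)); // prints 15.0
    }
}
```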
Download GGUF models from Hugging Face:
| Model | Parameters | GGUF Repository |
|---|---|---|
| GPT-OSS 20B | 20B (MoE) | unsloth/gpt-oss-20b-GGUF |
| GPT-OSS 120B | 120B (MoE) | unsloth/gpt-oss-120b-GGUF |
The 120B model has not been tested.
Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K).
A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:
```
./llama-quantize --pure ./gpt-oss-20b-f32.gguf ./gpt-oss-20b-Q4_0.gguf Q4_0
```

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
Java 21+ is required, in particular for MemorySegment mmap support.
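As a sketch of what that feature enables, the snippet below maps a file read-only into a `MemorySegment` and reads it without heap copies (the class and file names are illustrative, not taken from GptOss.java). On Java 21 this API is a preview feature, hence the `--enable-preview` flag used elsewhere in this README; it is final in Java 22+:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch: mmap a file into a MemorySegment and read bytes
// without copying the whole file onto the heap.
public class MmapSketch {
    static byte[] firstBytes(Path file, int n) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            byte[] out = new byte[n];
            MemorySegment.copy(seg, ValueLayout.JAVA_BYTE, 0, out, 0, n);
            return out;
        } // the confined Arena unmaps the segment on close
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".gguf");
        Files.write(tmp, "GGUF".getBytes()); // a GGUF file starts with this magic
        System.out.println(new String(firstBytes(tmp, 4))); // prints GGUF
        Files.delete(tmp);
    }
}
```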
jbang is a good fit for this use case.
No-setup one-liner: no git clone or manual model download required. The ~10GB model is downloaded once, then cached by jbang:
```
jbang gptoss@mukel \
    --model %{https://hf.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-Q8_0.gguf} \
    --system-prompt "You are a helpful coding assistant" \
    --chat
```

Alternatively:

```
jbang GptOss.java --help
jbang GptOss.java --model ./gpt-oss-20b-Q8_0.gguf --chat
jbang GptOss.java --model ./gpt-oss-20b-Q8_0.gguf --prompt "Explain quantum computing like I'm five"
```
Or run it directly (still via jbang):
```
chmod +x GptOss.java
./GptOss.java --help
```

A simple Makefile is provided. Run `make jar` to produce gptoss.jar.
Run the resulting gptoss.jar as follows:
```
java --enable-preview --add-modules jdk.incubator.vector -jar gptoss.jar --help
```

Compile with `make native` to produce a gptoss executable, then:

```
./gptoss --model ./gpt-oss-20b-Q8_0.gguf --chat
```

GptOss.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).
To AOT pre-load a GGUF model:
```
PRELOAD_GGUF=/path/to/model.gguf make native
```

A larger, specialized binary is generated with parse overhead removed for that specific model. It can still run other models, with the usual parsing overhead.
Apache 2.0
