Fast, zero-dependency inference engine for OpenAI's gpt-oss in pure Java.
- Single file, no dependencies, based on llama3.java
- Supports GPT-OSS models (including MoE variants)
- Fast GGUF format parser
- Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0, MXFP4
- Matrix-vector kernels using Java's Vector API
- CLI with `--chat` and `--instruct` modes
- Thinking mode control with `--think off|on|inline`
- GraalVM Native Image support
- AOT model preloading for instant time-to-first-token
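As a rough illustration of the Vector API kernels mentioned above, here is a minimal dot product (the core of a matrix-vector multiply) written with `jdk.incubator.vector`. This is a hypothetical sketch, not the engine's actual kernel code; run it with `--add-modules jdk.incubator.vector`:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch only: a SIMD dot product using the Vector API.
// Requires: java --add-modules jdk.incubator.vector DotProduct.java
public class DotProduct {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Vectorized main loop: multiply lanes, then reduce to a scalar.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for elements that don't fill a full vector.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {1f, 1f, 1f, 1f, 1f};
        System.out.println(dot(a, b)); // prints 15.0
    }
}
```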
Download GGUF models from Hugging Face:
| Model | Parameters | GGUF Repository |
|---|---|---|
| GPT-OSS 20B | 20B (MoE) | unsloth/gpt-oss-20b-GGUF |
| GPT-OSS 120B | 120B (MoE) | unsloth/gpt-oss-120b-GGUF |
The 120B model has not been tested.
Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K).
A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:
```
./llama-quantize --pure ./gpt-oss-20b-f32.gguf ./gpt-oss-20b-Q4_0.gguf Q4_0
```

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
Java 21+ is required, in particular for MemorySegment mmap support.
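As a sketch of what that feature enables, the snippet below maps a file read-only into a `MemorySegment` and reads it without heap copies (the class and file names are illustrative, not taken from GptOss.java). On Java 21 this API is a preview feature, hence the `--enable-preview` flag used elsewhere in this README; it is final in Java 22+:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch: mmap a file into a MemorySegment and read bytes
// without copying the whole file onto the heap.
public class MmapSketch {
    static byte[] firstBytes(Path file, int n) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            byte[] out = new byte[n];
            MemorySegment.copy(seg, ValueLayout.JAVA_BYTE, 0, out, 0, n);
            return out;
        } // the confined Arena unmaps the segment on close
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".gguf");
        Files.write(tmp, "GGUF".getBytes()); // a GGUF file starts with this magic
        System.out.println(new String(firstBytes(tmp, 4))); // prints GGUF
        Files.delete(tmp);
    }
}
```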
jbang is a good fit for this use case.
No-setup one-liner: no git clone or manual model download required. The ~10GB model is downloaded once, then cached by jbang:
```
jbang gptoss@mukel \
    --model %{https://hf.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-Q8_0.gguf} \
    --system-prompt "You are a helpful coding assistant" \
    --chat
```

Alternatively:

```
jbang GptOss.java --help
jbang GptOss.java --model ./gpt-oss-20b-Q8_0.gguf --chat
jbang GptOss.java --model ./gpt-oss-20b-Q8_0.gguf --prompt "Explain quantum computing like I'm five"
```
Or run it directly (still via jbang):
```
chmod +x GptOss.java
./GptOss.java --help
```

A simple Makefile is provided. Run `make jar` to produce gptoss.jar.
Run the resulting gptoss.jar as follows:
```
java --enable-preview --add-modules jdk.incubator.vector -jar gptoss.jar --help
```

Compile with `make native` to produce a gptoss executable, then:

```
./gptoss --model ./gpt-oss-20b-Q8_0.gguf --chat
```

GptOss.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).
To AOT pre-load a GGUF model:
```
PRELOAD_GGUF=/path/to/model.gguf make native
```

A larger, specialized binary is generated with parse overhead removed for that specific model. It can still run other models, with the usual parsing overhead.
Apache 2.0
