Qwen3 LLM inference using the Futhark language + demo chat application
The inference engine supports KV caching and prompt extension.
The chat app has support for
- user/assistant chat
- Thinking mode on/off
- Tool calling of simple Futhark entry points
The default model is Qwen3-0.6B, which currently needs around 6 GB of VRAM.
You will need
- a recent nightly release of Futhark
- a python venv with the requirements installed
As a first step, compile the inference server:

    futhark {backend} --server qwen-f32.fut -o qwen-f32

for the float32 version (default), and

    futhark {backend} --server qwen-f16.fut -o qwen-f16

for the float16 version. You should probably compile both versions so that everything is ready for chat tests.
Then you can start the chat with python chat.py --tools (or python chat.py --help if you want to see the parameters)
and hopefully, after downloading the model weights (~1.5 GB in float16 format), start chatting with fuchat :-)
It was developed and tested on an AMD 6700XT with 12GB of VRAM, with the Futhark hip backend.
In f32 mode it reaches around 20-25 tokens/s, and around 10 tokens/s in f16 mode.
As a comparison, on the same card, llama.cpp reaches an inference speed of around 150 tokens/s with the f16 quantized model, and around 110 tokens/s with the f32 quantized model.
It is a bit surprising that the f16 version of fuchat is about half as fast as the f32 version, since we could expect a gain from the fact that only half as much weight data has to be moved through memory.
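To make that expectation concrete, here is a rough back-of-the-envelope sketch. It assumes that generating one token reads every weight once and that memory bandwidth is the only limit. The 384 GB/s figure is an assumption (the nominal spec-sheet bandwidth of the card, not something measured here), and the f32 weight size is simply taken as twice the ~1.5 GB f16 download. KV cache reads, activations and caches are ignored.

```python
GB = 1e9

# Assumption: nominal memory bandwidth of the card, not a measured value.
ASSUMED_BANDWIDTH = 384 * GB          # bytes per second

F16_WEIGHTS = 1.5 * GB                # size of the f16 weights quoted in this README
F32_WEIGHTS = 2 * F16_WEIGHTS         # assumption: f32 weights are twice as large


def bandwidth_bound(weight_bytes, bandwidth=ASSUMED_BANDWIDTH):
    """Upper bound on tokens/s if each token reads all weights exactly once."""
    return bandwidth / weight_bytes


def fraction_of_bound(measured_tokens_per_s, weight_bytes):
    """How close a measured speed is to the bandwidth-only bound."""
    return measured_tokens_per_s / bandwidth_bound(weight_bytes)


if __name__ == "__main__":
    # Measured speeds are the fuchat figures quoted above.
    for name, size, measured in [("f32", F32_WEIGHTS, 25.0), ("f16", F16_WEIGHTS, 10.0)]:
        bound = bandwidth_bound(size)
        share = fraction_of_bound(measured, size)
        print(f"{name}: bound ~{bound:.0f} tokens/s, measured {measured:.0f} ({share:.0%} of bound)")
```

Under these assumptions the f16 bound is twice the f32 bound, so a slower f16 version suggests that something other than weight traffic dominates there.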
A pure f32 version, before the KV cache was implemented, was running at 2-5 tokens/s, so the Futhark "update in-place" mechanism brings a great performance boost for this type of caching.
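The following is a small Python illustration of why in-place updates matter for a KV cache. It is not the fuchat code, and every name in it (`KVCache`, `append_by_copy`, `compare`) is hypothetical. It counts how many elements are written when each new token's row is stored in a preallocated buffer, versus when the whole cache is copied on every append, which is what a purely functional update would cost without an in-place mechanism.

```python
class KVCache:
    """Preallocated buffer: each new token overwrites one row in place."""

    def __init__(self, max_tokens, dim):
        self.rows = [[0.0] * dim for _ in range(max_tokens)]
        self.length = 0
        self.elements_written = 0

    def append(self, row):
        # Only the new row is written; existing rows are left untouched.
        self.rows[self.length] = list(row)
        self.length += 1
        self.elements_written += len(row)


def append_by_copy(cache, row):
    """Copying append: builds a fresh cache, duplicating every existing row.

    Returns the new cache and the number of elements written to build it.
    """
    new_cache = [list(r) for r in cache] + [list(row)]
    return new_cache, sum(len(r) for r in new_cache)


def compare(tokens, dim):
    """Return (elements written in place, elements written when copying)."""
    in_place = KVCache(tokens, dim)
    copied, copied_elements = [], 0
    for t in range(tokens):
        row = [float(t)] * dim
        in_place.append(row)
        copied, written = append_by_copy(copied, row)
        copied_elements += written
    assert in_place.rows == copied  # both strategies hold the same data
    return in_place.elements_written, copied_elements


if __name__ == "__main__":
    print(compare(tokens=1000, dim=64))
```

The in-place count grows linearly with the number of tokens, while the copying count grows quadratically. In Futhark, the in-place variant is expressed with uniqueness types and `with` updates rather than with mutation.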
Can Futhark reach 100 tokens/s? Gotta Go Fast! It is already impressive that Futhark can reach 25 tokens/s with a single-file, type-checked, standalone .fut file!
If you are interested in improving the speed of fuchat, a blog post was written on how to benchmark fuchat using Futhark's tools.
This is largely inspired by llaf, a lightweight GPT2 implementation in Futhark.
It wouldn't exist without the help and support of Troels Henriksen on subtleties around uniqueness and in-place updates.
@software{The_Futhark_Hackers_Futhark,
author = {The Futhark Hackers},
title = {{Futhark}},
url = {https://github.com/diku-dk/futhark}
}

@inproceedings{henriksen2017futhark,
title={Futhark: purely functional GPU-programming with nested parallelism and in-place array updates},
author={Henriksen, Troels and Serup, Niels GW and Elsman, Martin and Henglein, Fritz and Oancea, Cosmin E},
booktitle={Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation},
pages={556--571},
year={2017}
}

@phdthesis{henriksen2017design,
title={Design and implementation of the Futhark programming language},
author={Henriksen, Troels},
year={2017},
school={University of Copenhagen, Faculty of Science [Department of Computer Science]}
}