rf-detr.cpp

Brought to you by the LocalAI team, the creators of LocalAI: the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.

A C++ inference engine for Roboflow RF-DETR, built on ggml. Supports the full RF-DETR family: 5 detection variants (Nano/Small/Base/Medium/Large) and 3 segmentation variants (SegNano/SegSmall/SegMedium), with F32 / F16 / Q8_0 / Q4_K quantizations published as GGUFs on HuggingFace.

Status: end-to-end detection and segmentation work on real model weights. On rfdetr-base, C++ F16 is faster than PyTorch CPU on every COCO image we tested (about 9% at the median), matches F32 accuracy (max |Δscore| ≤ 0.006), and is 1.86x smaller than F32. Detection match vs PyTorch is 54/55 at IoU ≥ 0.95 across 7 COCO val2017 images. Mean mask IoU across segmentation variants is 0.9924.

Examples

Detection (rfdetr-base, F16):

[Images: bus + pedestrians detection; kitchen scene detection]

Segmentation (rfdetr-seg-nano, F16) with per-class mask overlay:

[Images: street scene segmentation; cats + remotes segmentation]

All outputs above were produced by rfdetr-cli detect --annotated <path>.png; the renderer draws per-class colored boxes with class name + score labels, and for segmentation models overlays the per-detection mask in the same class color.

Quickstart: prebuilt models

All 32 GGUF models (8 variants x 4 quantizations) are published on HuggingFace. Pull one and run detection in three steps (build, download, detect):

# `--recursive` is mandatory: third_party/ggml is a submodule.
# If you've already cloned without it: git submodule update --init --recursive
git clone --recursive https://github.com/mudler/rf-detr.cpp && cd rf-detr.cpp

cmake -B build -DRFDETR_BUILD_CLI=ON && cmake --build build -j

# F16 is the default we recommend: fastest on CPU, matches F32 accuracy, 1.86x smaller.
mkdir -p models
hf download mudler/rfdetr-cpp-base rfdetr-base-f16.gguf --local-dir models/

# Detect
./build/bin/rfdetr-cli detect \
    --model models/rfdetr-base-f16.gguf \
    --input my_image.jpg \
    --output detections.json \
    --threshold 0.5 --threads 8

Available pre-built repositories

| Variant | HuggingFace repo | F32 | F16 | Q8_0 | Q4_K |
|---|---|---|---|---|---|
| Nano | mudler/rfdetr-cpp-nano | 113 MB | 61 MB | 36 MB | 30 MB |
| Small | mudler/rfdetr-cpp-small | 119 MB | 64 MB | 38 MB | 31 MB |
| Base | mudler/rfdetr-cpp-base | 119 MB | 64 MB | 38 MB | 31 MB |
| Medium | mudler/rfdetr-cpp-medium | 125 MB | 67 MB | 40 MB | 32 MB |
| Large | mudler/rfdetr-cpp-large | 126 MB | 68 MB | 41 MB | 33 MB |
| Seg-Nano | mudler/rfdetr-cpp-seg-nano | 127 MB | 68 MB | 40 MB | 32 MB |
| Seg-Small | mudler/rfdetr-cpp-seg-small | 128 MB | 68 MB | 40 MB | 32 MB |
| Seg-Medium | mudler/rfdetr-cpp-seg-medium | 134 MB | 72 MB | 42 MB | 34 MB |
| Seg-Large | mudler/rfdetr-cpp-seg-large | 134 MB | 72 MB | 43 MB | 34 MB |
| Seg-XLarge | mudler/rfdetr-cpp-seg-xlarge | 141 MB | 76 MB | 45 MB | 36 MB |
| Seg-2XLarge | mudler/rfdetr-cpp-seg-2xlarge | 143 MB | 78 MB | 48 MB | 38 MB |

Use F16 by default. It matches F32 accuracy, is 1.86x smaller, and is the fastest variant on CPU on every model we measured. See Benchmarks for the full numbers.

Quickstart: segmentation with mask output

hf download mudler/rfdetr-cpp-seg-nano rfdetr-seg-nano-f16.gguf --local-dir models/

mkdir -p /tmp/seg_masks
./build/bin/rfdetr-cli detect \
    --model models/rfdetr-seg-nano-f16.gguf \
    --input /tmp/coco_sample.jpg \
    --threshold 0.5 --threads 8 \
    --masks  /tmp/seg_masks \
    --output /tmp/seg.json

ls /tmp/seg_masks/
# det_000_class1_score93.png   <- person silhouette
# det_001_class51_score84.png  <- bowl silhouette
# ...

The --masks <dir> flag writes one PNG per detection (binary mask at the original image resolution). Mask quality matches PyTorch at IoU 0.997 and 99.98% pixel agreement on Seg-Nano F32; the remaining differences are sub-pixel boundary FP rounding.
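The two mask-quality figures quoted above (IoU and pixel agreement) can be computed as in the following minimal sketch. It assumes both masks have already been decoded into equally sized 2-D arrays of 0/1 values; how the PNGs are loaded is left out, and the function name `mask_metrics` is illustrative, not part of rf-detr.cpp.

```python
def mask_metrics(pred, ref):
    """Return (iou, pixel_agreement) for two equally sized binary masks.

    Masks are lists of rows; any nonzero value counts as foreground.
    """
    if not ref or not ref[0]:
        raise ValueError("masks are empty")
    if len(pred) != len(ref) or any(len(p) != len(r) for p, r in zip(pred, ref)):
        raise ValueError("masks must have the same shape")

    inter = union = agree = total = 0
    for prow, rrow in zip(pred, ref):
        for p, r in zip(prow, rrow):
            p, r = bool(p), bool(r)
            inter += p and r      # foreground in both masks
            union += p or r       # foreground in at least one mask
            agree += p == r       # same label, foreground or background
            total += 1

    iou = inter / union if union else 1.0  # two empty masks agree perfectly
    return iou, agree / total


# Synthetic 3x4 masks: the prediction has one extra boundary pixel.
ref = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
]
iou, agreement = mask_metrics(pred, ref)
print(f"IoU={iou:.3f} agreement={agreement:.3f}")
```

Pixel agreement counts background pixels too, so it is usually higher than IoU for the same pair of masks. In the toy example a single stray pixel gives an IoU of 6/7 but an agreement of 11/12.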

Quickstart: convert from upstream

To roll your own (different variant, custom checkpoint, different quant):

# One-time: convert upstream RF-DETR .pth to GGUF (requires .venv with rfdetr).
python3 -m venv .venv && .venv/bin/pip install rfdetr

# F16: fastest on CPU, 1.86x smaller than F32, matches F32 accuracy.
.venv/bin/python scripts/convert_rfdetr_to_gguf.py \
    --variant base --dtype f16 \
    --output models/rfdetr-base-f16.gguf

# Pick a variant (nano|small|base|medium|large|seg-nano|seg-small|seg-medium|seg-large|seg-xlarge|seg-2xlarge)
.venv/bin/python scripts/convert_rfdetr_to_gguf.py \
    --variant nano --dtype f16 \
    --output models/rfdetr-nano-f16.gguf

# Re-quantize an existing F32 GGUF to any ggml type (incl. K-quants) without re-converting
./build/bin/rfdetr-cli quantize \
    models/rfdetr-base-f32.gguf models/rfdetr-base-q6_K.gguf q6_K
# Supported: f32 | f16 | q4_0 | q4_1 | q5_0 | q5_1 | q8_0 | q4_K | q5_K | q6_K

# Convert all detection variants in one shot
scripts/convert_all_variants.sh

# Build the full matrix (5 detection + 3 seg, 4 quants each, = 32 models)
scripts/build_all_quants.sh

Quickstart: fine-tuning

rf-detr.cpp is inference-only. To fine-tune RF-DETR on a custom dataset, train with the upstream rfdetr Python library, then convert the resulting checkpoint to GGUF:

.venv/bin/python scripts/convert_rfdetr_to_gguf.py \
    --checkpoint runs/my_train/checkpoint_best_total.pth \
    --variant base --dtype f16 \
    --output models/my_finetune-f16.gguf

The converter reads the head size directly from the checkpoint tensor and resizes the classification head before loading, so arbitrary num_classes values are handled automatically. See docs/finetuning.md for the end-to-end walkthrough (dataset prep, train, convert, quantize, serve), plus a smoke test using a synthetic 5-class checkpoint at scripts/build_custom_checkpoint.py.
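The idea of reading the head size from the checkpoint and resizing before loading can be sketched as follows. This is not the converter's code: tensors are modelled as nested Python lists, the tensor names `class_embed.weight` and `class_embed.bias` are hypothetical, and the 91-row default head is an arbitrary illustration.

```python
HEAD_WEIGHT = "class_embed.weight"  # hypothetical tensor name
HEAD_BIAS = "class_embed.bias"      # hypothetical tensor name


def shape(t):
    """Shape of a tensor stored as nested lists (1-D or 2-D)."""
    return (len(t), len(t[0])) if isinstance(t[0], list) else (len(t),)


def head_size(checkpoint):
    """Number of head outputs = row count of the classification weight."""
    return shape(checkpoint[HEAD_WEIGHT])[0]


def resize_head(model_state, n_out):
    """Return a copy of model_state with a zero-filled head of n_out rows."""
    hidden = shape(model_state[HEAD_WEIGHT])[1]
    resized = dict(model_state)
    resized[HEAD_WEIGHT] = [[0.0] * hidden for _ in range(n_out)]
    resized[HEAD_BIAS] = [0.0] * n_out
    return resized


def load_strict(model_state, checkpoint):
    """Copy checkpoint tensors into the model, rejecting shape mismatches."""
    loaded = dict(model_state)
    for name, tensor in checkpoint.items():
        if shape(tensor) != shape(loaded[name]):
            raise ValueError(f"shape mismatch for {name}")
        loaded[name] = tensor
    return loaded


# A model built with an illustrative 91-row head and a synthetic 5-class checkpoint.
model = {HEAD_WEIGHT: [[0.0] * 4 for _ in range(91)], HEAD_BIAS: [0.0] * 91}
ckpt = {HEAD_WEIGHT: [[0.1] * 4 for _ in range(5)], HEAD_BIAS: [0.2] * 5}

n_out = head_size(ckpt)  # read from the checkpoint itself, not from a flag
loaded = load_strict(resize_head(model, n_out), ckpt)
print(n_out, shape(loaded[HEAD_WEIGHT]))
```

Without the resize step, the strict load would fail with a shape mismatch, which is why the head size is derived from the checkpoint rather than supplied separately.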

Benchmarks

End-to-end CPU inference on AMD Ryzen 9 9950X3D (single batch, --threads 8). C++ F16 is faster than PyTorch on every image, with a model file 1.86x smaller than F32:

[Figure: latency comparison, PyTorch vs rf-detr.cpp F32 vs F16 vs Q8_0 across 7 COCO images]

| Impl | Median ms/image | Model size | Speedup vs PyTorch | Detection match (IoU ≥ 0.95) |
|---|---|---|---|---|
| Python rfdetr (PyTorch + oneDNN) | 149.5 | 120 MB | 1.00x (ref) | reference |
| C++ rf-detr.cpp F32 (T=8) | 142.5 | 120 MB | 1.05x | 54/55, max \|Δscore\| 0.045 |
| C++ rf-detr.cpp F16 (T=8) | 136.9 | 64 MB | 1.09x | 54/55, max \|Δscore\| 0.044 |
| C++ rf-detr.cpp Q8_0 (T=8) | 147.6 | 39 MB | 1.01x | 54/55, max \|Δscore\| 0.046 |

Numbers are medians (median-of-medians across 7 diverse COCO val2017 images, 3 passes of 20 iterations each, 5 warmup, 8 s cooldown between cells; see --rigorous mode in scripts/bench_community.py). Build uses -march=native plus ggml's tinyBLAS SGEMM (GGML_LLAMAFILE=ON) plus OpenMP plus a persistent ggml graph allocator.

See BENCHMARK.md for the per-image breakdown, F16 fast-path explanation, thread-scaling sweep, methodology, and reproduction recipe.
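The median-of-medians aggregation mentioned in the methodology note can be read as three nested medians: over iterations, then over passes, then over images. The sketch below shows that reading with synthetic timings. The exact nesting used by scripts/bench_community.py is not stated here, so treat this as one plausible interpretation; warmup and cooldown handling are not modelled.

```python
from statistics import median


def median_of_medians(timings_ms):
    """timings_ms maps image name -> list of passes -> list of latencies in ms.

    Returns the median over images of the median over passes of the
    median over iterations.
    """
    per_image = []
    for passes in timings_ms.values():
        pass_medians = [median(iterations) for iterations in passes]
        per_image.append(median(pass_medians))
    return median(per_image)


# Synthetic data: 3 images, 3 passes each, 3 iterations per pass.
# Each image has one outlier iteration that the medians should ignore.
timings = {
    "a.jpg": [[10, 11, 50], [10, 12, 11], [9, 10, 11]],
    "b.jpg": [[20, 21, 22], [19, 20, 80], [20, 20, 21]],
    "c.jpg": [[15, 15, 16], [14, 15, 15], [15, 90, 16]],
}
print(median_of_medians(timings))
```

Nesting medians this way keeps a single slow iteration, or a single noisy pass, from moving the reported number, which a plain mean over all samples would not do.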

Variants comparison

All 5 detection variants share the DINOv2-small backbone; they differ in input resolution and decoder layer count. C++ F16 is faster than PyTorch on four of the five; Large is the exception (see the footnote):

| Variant | Resolution | Decoder layers | C++ F16 median ms (T=8) | PyTorch median ms |
|---|---|---|---|---|
| Nano | 384 | 2 | 61.5 | 88.4 |
| Small | 512 | 3 | 116.0 | 120.5 |
| Base | 560 | 3 | 136.9 | 149.5 |
| Medium | 576 | 4 | 149.6 | 182.8 |
| Large | 704 | 4 | 237.8 | 228.7* |

* Large is the one variant where PyTorch is competitive at T=8 (within run-to-run variance).
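For reference, the per-variant speedups implied by the table above can be derived with a few lines of arithmetic. The latencies are copied from the table; the `speedup` helper is only illustrative.

```python
# (variant, C++ F16 median ms at T=8, PyTorch median ms), copied from the table
ROWS = [
    ("Nano", 61.5, 88.4),
    ("Small", 116.0, 120.5),
    ("Base", 136.9, 149.5),
    ("Medium", 149.6, 182.8),
    ("Large", 237.8, 228.7),
]


def speedup(cpp_ms, torch_ms):
    """Speedup of C++ over PyTorch; values above 1.0 mean C++ is faster."""
    return torch_ms / cpp_ms


speedups = {name: round(speedup(cpp, pt), 2) for name, cpp, pt in ROWS}
for name, value in speedups.items():
    print(f"{name:<7}{value:.2f}x")
```

This gives roughly 1.44x for Nano, 1.04x for Small, 1.09x for Base, 1.22x for Medium, and 0.96x for Large, which is the one case where PyTorch comes out slightly ahead.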

[Figure: variants overview]

Quantization tradeoffs

K-quants (Q4_K / Q5_K / Q6_K) produced via the C++ quantizer beat legacy block quants (Q4_0 / Q5_0) at the same target bit-width. The full matrix:

[Figure: quantization tradeoffs]

| Quant | Recall@0.5 | Recall@0.95 | Max \|Δscore\| | Notes |
|---|---|---|---|---|
| F32 | 1.000 | 0.989 | 0.008 | Reference |
| F16 | 1.000 | 0.989 | 0.008 | Matches F32, fastest variant |
| Q8_0 | 1.000 | 0.989 | 0.009 | 3.10x compression, no accuracy loss |
| Q6_K | 1.000 | 0.989 | 0.011 | 3.40x compression, about 10% slower than Q8_0 |
| Q5_K | 0.953 | 0.879 | 0.014 | Mild accuracy loss; still usable |
| Q4_K | 0.953 | 0.879 | 0.020 | Far lower Δscore than legacy Q4_0 (0.020 vs 0.226) at the same size |
| Q4_0 (legacy) | 0.891 | 0.727 | 0.226 | Steep accuracy drop; not recommended |
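The Recall@IoU and max |Δscore| columns can be reproduced in spirit with the sketch below. It assumes greedy, same-class matching of each reference (PyTorch) detection to the best unused candidate, and boxes in (x1, y1, x2, y2) form. Both are assumptions for illustration; the project's actual matching procedure is documented in BENCHMARK.md, not here.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def compare(reference, candidate, iou_thr):
    """Match each reference detection to the best unused same-class candidate.

    Returns (recall, max_abs_score_delta) over the matched pairs.
    """
    used = set()
    matched = 0
    max_delta = 0.0
    for ref in reference:
        best_j, best_iou = None, 0.0
        for j, cand in enumerate(candidate):
            if j in used or cand["class_id"] != ref["class_id"]:
                continue
            iou = box_iou(ref["bbox"], cand["bbox"])
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_thr:
            used.add(best_j)
            matched += 1
            delta = abs(ref["score"] - candidate[best_j]["score"])
            max_delta = max(max_delta, delta)
    recall = matched / len(reference) if reference else 1.0
    return recall, max_delta


# Synthetic detections: the second candidate box is shifted 4 px to the right.
reference = [
    {"class_id": 1, "score": 0.93, "bbox": (10, 10, 110, 210)},
    {"class_id": 51, "score": 0.84, "bbox": (200, 50, 300, 150)},
]
candidate = [
    {"class_id": 1, "score": 0.925, "bbox": (10, 10, 110, 210)},
    {"class_id": 51, "score": 0.80, "bbox": (204, 50, 304, 150)},
]
for thr in (0.5, 0.95):
    recall, delta = compare(reference, candidate, thr)
    print(f"IoU>={thr}: recall={recall:.3f} max|dscore|={delta:.3f}")
```

The shifted box has an IoU of about 0.92 with its reference, so it counts as a match at 0.5 but not at 0.95. That is why recall falls from 1.0 to 0.5 between the two thresholds in this toy case.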

Recommendation (numbers are for rfdetr-base):

  1. F16: production default. Fastest, matches F32, 1.86x smaller than F32.
  2. Q8_0: when disk size matters. 3.10x compression, no accuracy loss, about 7% latency tax vs F16.
  3. Q6_K: when you need slightly smaller than Q8_0 with near-identical accuracy.
  4. Q4_K: last resort for ≤32 MB deployments. Real but not catastrophic accuracy loss.
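To make the Q8_0 tradeoff concrete, here is a minimal sketch of block quantization in the style of ggml's Q8_0 format as generally documented: 32 values per block, one half-precision scale, and one signed byte per value. It is an illustration of the technique, not code from rf-detr.cpp or ggml, and the input weights are synthetic.

```python
import math
import struct

BLOCK = 32  # values per Q8_0 block


def to_fp16(x):
    """Round-trip a float through IEEE half precision, like a stored scale."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_half_away(x):
    """Round to the nearest integer, ties away from zero (like C's roundf)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_q8_0(values):
    """Split values into blocks of 32 and return [(scale, [int8, ...]), ...]."""
    if len(values) % BLOCK:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0                  # largest magnitude maps to +/-127
        inv = 1.0 / scale if scale else 0.0   # all-zero block: quants stay 0
        quants = [round_half_away(v * inv) for v in chunk]
        blocks.append((to_fp16(scale), quants))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats: value = scale * quant."""
    return [scale * q for scale, quants in blocks for q in quants]


# Storage cost: a 2-byte scale plus 32 one-byte quants per 32 weights.
BITS_PER_WEIGHT = (2 + BLOCK) * 8 / BLOCK

weights = [math.sin(i) * 0.5 for i in range(64)]  # synthetic data
restored = dequantize_q8_0(quantize_q8_0(weights))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"bits/weight={BITS_PER_WEIGHT}, max abs error={max_err:.5f}")
```

At 8.5 bits per weight, a fully quantized tensor is about 3.76x smaller than F32. The whole-file ratio reported above is lower (3.10x), presumably because not every tensor in the GGUF is stored in Q8_0; that explanation is an inference, not something stated in this README.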

See BENCHMARK.md for mask quality across all 12 seg cells (mask IoU stays ≥ 0.99 across F32/F16/Q8_0 on every segmentation variant).

Embedding via the C API

rf-detr.cpp exposes a flat C ABI in include/rfdetr.h for dlopen and purego.RegisterLibFunc consumers, intended for embedding in Go, Python, or any host language that can call C. It follows the same pattern LocalAI uses for its other ggml backends:

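#include <stdio.h>
#include "rfdetr.h"

enum { MAX_DETS = 100 };  /* capacity of the caller-provided output array */

int main(void) {
    /* Load the model. Designated initializers (C99) zero the other fields. */
    rfdetr_init_params p = {
        .model_path = "models/rfdetr-base-f16.gguf",
        .n_threads  = 8,
    };
    rfdetr_context* ctx = NULL;
    rfdetr_init(&p, &ctx);
    if (!ctx) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    /* Run detection on one image. */
    rfdetr_detect_params dp = {
        .image_path = "my_image.jpg",
        .threshold  = 0.5f,
    };
    rfdetr_detection dets[MAX_DETS];
    int n = 0;  /* receives the number of detections written to dets */
    rfdetr_detect(ctx, &dp, dets, MAX_DETS, &n);

    for (int i = 0; i < n; i++) {
        printf("class=%d score=%.3f bbox=[%.1f,%.1f,%.1f,%.1f]\n",
               dets[i].class_id, dets[i].score,
               dets[i].bbox[0], dets[i].bbox[1], dets[i].bbox[2], dets[i].bbox[3]);
    }

    rfdetr_free(ctx);
    return 0;
}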

Build the shared library with cmake -DRFDETR_SHARED=ON. For segmentation models, detection structs additionally carry a mask field (binary uint8 buffer, owned by the context until the next detect call).

Why rf-detr.cpp

The upstream Roboflow RF-DETR runtime is Python + PyTorch + Transformers + Supervision. rf-detr.cpp provides:

  • A native CPU runtime with no Python at inference time. The CLI is a single binary that takes a GGUF file and an image.
  • Faster than PyTorch CPU on the Nano through Medium variants (about 1.04x to 1.44x with F16 at T=8); Large is roughly on par.
  • Quantization down to about 30 MB (Q4_K) with measured accuracy tradeoffs.
  • CUDA / Metal / Vulkan support via ggml backends. CPU is the only one we ship and benchmark today; the others compile but are not yet validated.
  • A flat C ABI (include/rfdetr.h) for embedding via dlopen, purego, or cgo.
  • Parity validation against the upstream PyTorch reference, both per-module and end-to-end (see tests/test_parity_*.cpp).

Build

git clone --recursive https://github.com/mudler/rf-detr.cpp
cd rf-detr.cpp
cmake -B build -DRFDETR_BUILD_TESTS=ON -DRFDETR_BUILD_CLI=ON
cmake --build build -j
ctest --test-dir build --output-on-failure

The build applies two patches to third_party/ggml at configure time (stored in third_party/ggml-patches/). These are local performance and debug-instrumentation improvements not yet upstreamed. Re-running CMake is a no-op once they're in place. Run scripts/apply_ggml_patches.sh manually to inspect the patch flow.

CMake options

| Option | Default | Purpose |
|---|---|---|
| RFDETR_BUILD_CLI | ON | Build the rfdetr-cli binary |
| RFDETR_BUILD_TESTS | OFF | Build the ctest test suite (24 tests) |
| RFDETR_SHARED | OFF | Build librfdetr.so (shared library for embedding) |
| GGML_NATIVE | ON | Compile ggml with -march=native |
| GGML_LLAMAFILE | ON | Enable ggml's tinyBLAS SGEMM (closes most of the PyTorch gap) |
| GGML_CUDA / GGML_METAL | OFF | Enable GPU backends (untested for rf-detr.cpp, may need work) |

Tests

ctest --test-dir build --output-on-failure   # 24 ctest targets

Tests cover per-module parity vs the upstream torch reference (backbone, projector, two-stage, decoder, heads, segmentation), end-to-end detection parity, quantization sanity (F16/Q8_0/Q4_K load correctly), and per-variant load checks. The parity tests use precomputed baseline tensor bundles stored as GGUFs; regenerate them with scripts/gen_torch_baseline.py if you change the architecture.

Citation

If you use rf-detr.cpp in a publication, please cite both this work and the upstream RF-DETR paper:

@misc{rfdetrcpp2026,
  author       = {Di Giacinto, Ettore and Palethorpe, Richard},
  title        = {rf-detr.cpp: C++/ggml inference engine for RF-DETR},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/mudler/rf-detr.cpp}},
}

@software{rfdetr2025,
  author    = {Robicheaux, Peter and Popov, Matvei and Madan, Anish and Robinson, Isaac and Nelson, Joseph and Galuba, Wojciech and Wood, James and Kakanos, Sergei and Nemcek, Matthew and Hoshmand, Onur and Ramirez Castro, Carlos},
  title     = {RF-DETR},
  publisher = {GitHub},
  year      = {2025},
  url       = {https://github.com/roboflow/rf-detr},
}

The upstream RF-DETR builds on LW-DETR, DINOv2, and Deformable DETR; cite those too if relevant to your work:

@article{chen2024lwdetr,
  title   = {{LW-DETR}: A Transformer Replacement to {YOLO} for Real-Time Detection},
  author  = {Chen, Qiang and Su, Xiangbo and Zhang, Xinyu and Wang, Jian and Chen, Jiahui and Shen, Yunpeng and Han, Chuchu and Chen, Ziliang and Xu, Weixiang and Li, Fanrong and Zhang, Shan and Wang, Kun and Liu, Yong and Han, Jingdong and Ma, Zhaoxiang and Zhang, Erjin},
  journal = {arXiv preprint arXiv:2406.03459},
  year    = {2024},
}

@article{oquab2023dinov2,
  title   = {{DINOv2}: Learning Robust Visual Features without Supervision},
  author  = {Oquab, Maxime and Darcet, Timothée and Moutakanni, Théo and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},
  journal = {arXiv preprint arXiv:2304.07193},
  year    = {2023},
}

@article{zhu2020deformabledetr,
  title   = {{Deformable DETR}: Deformable Transformers for End-to-End Object Detection},
  author  = {Zhu, Xizhou and Su, Weijie and Lu, Lewei and Li, Bin and Wang, Xiaogang and Dai, Jifeng},
  journal = {arXiv preprint arXiv:2010.04159},
  year    = {2020},
}

Author

Ettore Di Giacinto (@mudler), maintainer of LocalAI. PRs welcome; see issues for the current roadmap (GPU backend validation, end-to-end seg quant comparison, etc.).

License

Apache-2.0; see LICENSE. Copyright © 2026 Ettore Di Giacinto.

The model weights remain under their upstream license: RF-DETR is Apache-2.0 (roboflow/rf-detr).
