Make dinov3 ltdetr object detection ONNX/TensorRT export work with FP16 by simonschoelly · Pull Request #751 · lightly-ai/lightly-train

simonschoelly · 2026-05-27T09:02:39Z

What has changed and why?

This PR fixes two issues with FP16 export to ONNX and TensorRT.

The export did not work when converting the model with model.half(). Instead we always export the model in fp32 precision and then manually convert the model to fp16.
Converting the model to fp16 creates overflow/NaN issues around Softmax. We avoid this by carefully using casts to FP32 around these nodes. This previously already worked for TensorRT export, but not for ONNX export.
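
To make the second issue concrete, here is a minimal sketch of how a Softmax evaluated in FP16 can end up producing NaN. It is plain Python that emulates half-precision rounding with `struct`, not code from this PR. `to_fp16` and `naive_softmax` are hypothetical helper names, and the unshifted softmax is an assumption chosen only to make the overflow easy to see.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (overflow -> inf)."""
    if math.isinf(x) or math.isnan(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Finite values beyond the FP16 range (max 65504) overflow to infinity.
        return math.copysign(math.inf, x)


def naive_softmax(logits: list[float], fp16: bool) -> list[float]:
    """Unshifted softmax; every intermediate is rounded to FP16 if requested."""
    cast = to_fp16 if fp16 else (lambda v: v)
    exps = [cast(math.exp(cast(v))) for v in logits]
    total = cast(sum(exps))
    return [cast(e / total) for e in exps]


logits = [12.0, 1.0]  # exp(12) is about 162755, above the FP16 maximum
print(naive_softmax(logits, fp16=False))  # finite probabilities
print(naive_softmax(logits, fp16=True))   # first entry is nan (inf / inf)
```

Keeping the nodes around Softmax in FP32, as described above, avoids this kind of overflow while the rest of the graph runs in FP16.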

In addition:

Removed "auto" as an option for the precision as it was not clear what exactly would happen in that case.
Exporting to FP16 now requires simplify=True, because the simplifier also resolves an issue where two nodes would have the same name in the ONNX graph.

How has it been tested?

Added some tests, especially for the removal of unnecessary Cast nodes. In addition, tested on Google Colab and ensured that the FP16 models run correctly without NaN values.

Did you update CHANGELOG.md?

Yes
Not needed (internal change)

Did you update the documentation?

Yes
Not needed (internal change without effects for user)

Copilot

Pull request overview

This PR updates the DINOv3 LT-DETR object detection ONNX export pipeline to support FP16 exports more reliably by tracing in FP32 and converting the exported ONNX graph to FP16 post-export, including cleanup of redundant Cast patterns.

Changes:

Adjust DINOv3LTDETRObjectDetection.export_onnx() to always trace in FP32 and (optionally) convert the exported ONNX model to FP16 via onnxruntime.transformers.float16, then remove redundant Cast pairs.
Add an ONNX graph utility remove_redundant_casts() plus dedicated unit tests for its behavior.
Add a GPU-gated test verifying FP16 ONNX export produces FP16 initializers for both supported decoder variants.
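
The idea behind removing redundant Cast pairs can be sketched on a toy graph. This is an illustration under stated assumptions, not the PR's `remove_redundant_casts()`: the `Node` class, the `dtypes` map, and `remove_cast_round_trips` are hypothetical, and only the lossless FP16 → FP32 → FP16 round trip is handled. Pairs that end in a graph output are left alone so that output names stay unchanged.

```python
from dataclasses import dataclass


@dataclass
class Node:
    op: str
    inputs: list[str]
    outputs: list[str]
    to: str = ""  # target dtype, only meaningful for Cast nodes


def remove_cast_round_trips(
    nodes: list[Node], dtypes: dict[str, str], graph_outputs: set[str]
) -> list[Node]:
    """Drop Cast(fp16->fp32) followed by Cast(fp32->fp16) and rewire consumers."""
    producer = {out: n for n in nodes for out in n.outputs}
    rewire: dict[str, str] = {}
    for second in nodes:
        if second.op != "Cast" or second.outputs[0] in graph_outputs:
            continue
        first = producer.get(second.inputs[0])
        if first is None or first.op != "Cast":
            continue
        source = first.inputs[0]
        # Widening to fp32 and narrowing back to fp16 is a lossless no-op.
        if dtypes.get(source) == "fp16" and (first.to, second.to) == ("fp32", "fp16"):
            rewire[second.outputs[0]] = source

    # Remove the second Cast of each pair and point its consumers at the source.
    kept = [n for n in nodes if n.outputs[0] not in rewire]
    for n in kept:
        n.inputs = [rewire.get(name, name) for name in n.inputs]

    # Remove Casts whose output is no longer consumed anywhere.
    used = {name for n in kept for name in n.inputs} | set(graph_outputs)
    return [n for n in kept if n.op != "Cast" or n.outputs[0] in used]


nodes = [
    Node("MatMul", ["q", "k"], ["scores"]),
    Node("Cast", ["scores"], ["scores_fp32"], to="fp32"),
    Node("Cast", ["scores_fp32"], ["scores_fp16"], to="fp16"),
    Node("Add", ["scores_fp16", "bias"], ["out"]),
]
cleaned = remove_cast_round_trips(nodes, {"scores": "fp16"}, {"out"})
print([n.op for n in cleaned])  # ['MatMul', 'Add']
```

The reverse pattern (FP32 → FP16 → FP32) is deliberately not removed in this sketch, because narrowing to FP16 changes the values.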

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/lightly_train/_task_models/dinov3_ltdetr_object_detection/task_model.py` | Trace ONNX in FP32, optionally post-convert to FP16, adjust verification path, and update TensorRT export precision handling. |
| `src/lightly_train/_export/onnx_helpers.py` | Add `remove_redundant_casts()` to clean up FP32↔FP16 Cast round-trips in converted ONNX graphs. |
| `tests/_task_models/dinov3_ltdetr_object_detection/test_task_model.py` | Add FP16 ONNX export test (GPU-gated) for both `rtdetrv2` and `dfine` decoders. |
| `tests/_export/test_onnx_helpers.py` | Add unit tests for `remove_redundant_casts()` rewiring/removal behavior. |
| `tests/_export/__init__.py` | Add package marker/license header for the new `tests/_export` test package. |

simonschoelly · 2026-05-27T15:20:29Z

+    for out in graph.output:
+        if out.name in rewire:
+            out.name = rewire[out.name]
+
+    new_nodes = [n for n in graph.node if id(n) not in nodes_to_remove]
+    del graph.node[:]
+    graph.node.extend(new_nodes)


That would be nice, but it makes the solution even more complicated. Let's skip it for now.

I think we can get a relatively easy fix with an Identity node. Replace lines 118-120 with

    output_names = {out.name for out in graph.output}
    identity_nodes = []
    for output_name, input_name in rewire.items():
        if output_name in output_names:
            identity_nodes.append(
                helper.make_node(
                    "Identity",
                    inputs=[input_name],
                    outputs=[output_name],
                    name=f"identity_{output_name}",
                )
            )
    graph.node.extend(identity_nodes)

Yes, I know I could do that, but it would make the code even more complicated.

If that is exactly what we would do, then I don't think it is super complex. Otherwise we risk tampering with the named outputs (like labels and scores for clsf), which people may rely on in downstream inference code.

simonschoelly · 2026-05-27T15:19:59Z

/review

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

yutong-xiang-97 · 2026-05-28T09:47:55Z

@@ -1272,10 +1294,15 @@ def export_onnx(
                dummy_input.cpu().to(torch.float32),
            )

-            # Get outputs from the ONNX model.
-            session = ort.InferenceSession(out)
+            # Get outputs from the ONNX model. Load from bytes to avoid
+            # ORT errors about missing external data when weights are inline.
+            with open(out, "rb") as f:
+                session = ort.InferenceSession(f.read())
+            onnx_input = dummy_input.cpu()
+            if precision == "fp16":
+                onnx_input = onnx_input.half()
            input_feed = {
-                "images": dummy_input.cpu().numpy(),
+                "images": onnx_input.numpy(),


What was the argument for having this ORT verifier inside export_onnx? Ultralytics has a separate val command for this, where you can choose the backend (likely doable for us after we implement the benchmarking command).

Back when I added the verifier, I think the reasoning was that one might expect the exported model to work fine, while it actually produces completely different numerical results.

We can definitely think about how to improve this in the future. Not sure if having it as a separate step would make this better though.

OK I see. I will create a Linear issue for us to look into it later. Personally I would consider abstracting the verification logic to another command and keep the verify flag in export_onnx.

yutong-xiang-97 · 2026-05-28T09:56:52Z

            min_batchsize=min_batchsize,
-            # FP32 attention scores required for FP16 model stability. Otherwise output
-            # contains NaN.
-            fp32_attention_scores=True,
+            # We convert the fp32 attention scores already during ONNX export
+            fp32_attention_scores=False,
            verbose=verbose,


More specifically, the helper calls export_onnx_fn(**onnx_args), and this wrapper never adds onnx_args["precision"], so the ONNX export defaults to fp32 while TensorRT is configured for fp16.

That is a good point. I will have to test this. I am a bit surprised that it suddenly worked for me, though.
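
The mismatch described in this thread can be reproduced with two stand-in functions. This is a hedged sketch: `export_onnx`, `export_tensorrt`, and the `forward_precision` switch are hypothetical simplifications and do not mirror the real signatures in lightly-train.

```python
from typing import Any, Callable


def export_onnx(out: str, precision: str = "fp32") -> dict[str, str]:
    """Stand-in for the ONNX export; returns what it would have written."""
    return {"path": out, "precision": precision}


def export_tensorrt(
    export_onnx_fn: Callable[..., dict[str, str]],
    onnx_args: dict[str, Any],
    precision: str,
    forward_precision: bool,
) -> dict[str, str]:
    """Stand-in for the TensorRT helper, which first exports to ONNX."""
    onnx_args = dict(onnx_args)  # do not mutate the caller's dict
    if forward_precision:
        # Without this line the ONNX export silently uses its fp32 default.
        onnx_args["precision"] = precision
    onnx_model = export_onnx_fn(**onnx_args)
    return {"onnx_precision": onnx_model["precision"], "trt_precision": precision}


args = {"out": "model.onnx"}
print(export_tensorrt(export_onnx, args, "fp16", forward_precision=False))
print(export_tensorrt(export_onnx, args, "fp16", forward_precision=True))
```

In the first call the ONNX graph is exported as fp32 while TensorRT is configured for fp16, so any FP32 casts added during the FP16 ONNX conversion would never be present.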

yutong-xiang-97 · 2026-05-28T09:50:17Z

+@pytest.mark.skipif(not RequirementCache("onnx"), reason="onnx not installed")
+@pytest.mark.skipif(
+    not RequirementCache("onnxruntime"), reason="onnxruntime not installed"
+)
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="Test requires GPU.")
+@pytest.mark.parametrize("decoder_name", ["rtdetrv2", "dfine"])
+def test_export_onnx__fp16(
+    tmp_path: Path, decoder_name: Literal["rtdetrv2", "dfine"]
+) -> None:
+    import onnx
+


Related to #751

                If True, run onnxslim to simplify and overwrite the exported model.
            verify:
-                If True, validate the ONNX file and compare outputs to a float32 CPU
+                If True, validate the ONNX filef and compare outputs to a float32 CPU


yutong-xiang-97 · 2026-05-28T11:06:56Z

+                    "Skipping ONNX full model check for fp16 model because no "
+                    "GPU is available. Run on a GPU to enable full verification."
+                )
+            else:


This way onnx.checker.check_model is silently skipped for fp16 on CPU. Consider moving it upwards.

yutong-xiang-97 · 2026-05-28T11:10:32Z

+                "Softmax",
+                "MatMul",
+            ]
+            model_fp16 = ort_float16.convert_float_to_float16(


Here the topological order of nodes can be violated. You can add a checker here.

Maybe also include it in the integration tests

I am not sure what you are saying here?

Sorry I definitely mistyped stuff. A more detailed explanation from Codex:

ort_float16.convert_float_to_float16(...) inserts Cast nodes for blocked ops, but some inserted Cast producers end up after the node that consumes them.

Concrete example from the generated graph before onnxslim:

    node 363: /backbone/rope_embed/Range
      input: /backbone/rope_embed/Range_input_cast_0
    node 3979: /backbone/rope_embed/Range_input_cast0 (Cast)
      output: /backbone/rope_embed/Range_input_cast_0

That breaks ONNX topological order because Range consumes a tensor before the graph has produced it. remove_redundant_casts() is not the root cause in this case; the graph is already invalid immediately after convert_float_to_float16.
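
A checker for this condition can be sketched without any ONNX dependency. The `Node` tuple, `find_order_violations`, and `sort_topologically` below are hypothetical names, and the tensors `limit` and `range_out` are invented for the example. Only the two node names and the Cast output name are taken from the graph dump above.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    inputs: list[str]
    outputs: list[str]


def find_order_violations(nodes: list[Node], available: set[str]) -> list[tuple[str, str]]:
    """Return (node, tensor) pairs where a tensor is consumed before it exists."""
    produced = set(available)  # graph inputs and initializers
    violations = []
    for node in nodes:
        # An empty string marks an omitted optional input and is skipped.
        violations += [(node.name, t) for t in node.inputs if t and t not in produced]
        produced.update(node.outputs)
    return violations


def sort_topologically(nodes: list[Node], available: set[str]) -> list[Node]:
    """Reorder nodes so that every producer precedes its consumers."""
    produced = set(available)
    pending = list(nodes)
    ordered: list[Node] = []
    while pending:
        ready = [n for n in pending if all(not t or t in produced for t in n.inputs)]
        if not ready:
            raise ValueError("cycle or missing producer in graph")
        ordered.extend(ready)
        for n in ready:
            produced.update(n.outputs)
        pending = [n for n in pending if n not in ready]
    return ordered


cast_out = "/backbone/rope_embed/Range_input_cast_0"
graph = [
    Node("/backbone/rope_embed/Range", [cast_out], ["range_out"]),
    Node("/backbone/rope_embed/Range_input_cast0", ["limit"], [cast_out]),
]
print(find_order_violations(graph, {"limit"}))  # Range reads the Cast output too early
print(find_order_violations(sort_topologically(graph, {"limit"}), {"limit"}))  # []
```

Running such a check right after the FP16 conversion, before any further graph rewriting, would show whether the converter or a later step introduced the ordering problem.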

yutong-xiang-97 · 2026-05-28T11:14:17Z

            import onnxruntime as ort

-            onnx.checker.check_model(out, full_check=True)
+            if precision == "fp16" and not torch.cuda.is_available():


There is an ort.get_available_providers() for these checks.

This is about onnx.checker.check_model not about onnxruntime.

I believe onnx.checker.check_model should happen in both cases (hence the comment below). What I meant is to replace torch.cuda.is_available() with something like

    providers = ort.get_available_providers()
    if "CUDAExecutionProvider" not in providers:
        ...

which is more specific.

Two failure cases:

- CUDA GPU present but `onnxruntime` (not `onnxruntime-gpu`) installed → `torch.cuda.is_available()` is True, but ORT will fall back to CPU and FP16 inference will likely fail.
- `onnxruntime-gpu` installed in an environment where torch was built CPU-only → the opposite false negative.
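
Put together, the suggestion amounts to always running the structural check and gating only the runtime comparison on the available execution providers. The sketch below shows that control flow. `plan_verification` and its step descriptions are hypothetical; `ort.get_available_providers()` is the only real library call, and it is guarded so the snippet also runs without onnxruntime installed.

```python
def plan_verification(precision: str, providers: list[str]) -> list[str]:
    """Decide which verification steps to run for an exported ONNX model."""
    # The structural check does not execute the model, so it always runs.
    steps = ["onnx.checker.check_model"]
    if precision == "fp16" and "CUDAExecutionProvider" not in providers:
        steps.append("skip runtime comparison (no CUDAExecutionProvider)")
    else:
        steps.append("compare ONNX Runtime outputs with PyTorch outputs")
    return steps


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    available = []  # treat a missing onnxruntime like a CPU-only setup

print(plan_verification("fp16", available))
```

Because the decision depends on what ONNX Runtime itself reports, it covers both failure cases listed above, independent of how torch was built.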

simonschoelly force-pushed the simon-fp16-onnx-export branch from e7cf153 to efbbfc8 Compare May 27, 2026 09:03

simonschoelly marked this pull request as ready for review May 27, 2026 13:20

Copilot AI review requested due to automatic review settings May 27, 2026 13:20

Copilot started reviewing on behalf of simonschoelly May 27, 2026 13:20

Copilot AI reviewed May 27, 2026

Fix dinov3_ltdetr_object_detection FP16 exports

d1ed192

simonschoelly force-pushed the simon-fp16-onnx-export branch from a507404 to d1ed192 Compare May 27, 2026 15:13

simonschoelly changed the title ~~Make LTDetrv2 object dection ONNX/TensorRT export work with FP16~~ Make dinov3 ltdetr object detection ONNX/TensorRT export work with FP16 May 27, 2026

simonschoelly requested a review from Copilot May 27, 2026 15:21

Copilot started reviewing on behalf of simonschoelly May 27, 2026 15:21

Copilot AI reviewed May 27, 2026

lightly-ai deleted a comment from chatgpt-codex-connector Bot May 28, 2026

yutong-xiang-97 requested changes May 28, 2026

commit

c7a8e10
