Skip to content

Make dinov3 ltdetr object detection ONNX/TensorRT export work with FP16#751

Open
simonschoelly wants to merge 2 commits into
mainfrom
simon-fp16-onnx-export
Open

Make dinov3 ltdetr object detection ONNX/TensorRT export work with FP16#751
simonschoelly wants to merge 2 commits into
mainfrom
simon-fp16-onnx-export

Conversation

@simonschoelly
Copy link
Copy Markdown
Contributor

@simonschoelly simonschoelly commented May 27, 2026

What has changed and why?

This PR fixes two issues with FP16 export to ONNX and TensorRT.

  • The export did not work when converting the model with model.half(). Instead we always export the model in fp32 precision and then manually convert the model to fp16.
  • Converting the model to fp16 creates overflow/NaN issues around Softmax. We avoid this by carefully using casts to FP32 around these nodes. This previously already worked for TensorRT export, but not for ONNX export.

In addtion

  • Removed "auto" as an option for the precision as it was not clear what exactly would happen in that case.
  • Exporting to FP16 now requires simplify=True - this is, because the simplifier also solves an issue where some two nodes would have the same name in the ONNX Graph.

How has it been tested?

Added some tests - especially for the removal of unncessary cast nodes. In addition tested on Google Colab and ensured that the FP16 models would correctly run without NaN values.

Did you update CHANGELOG.md?

  • Yes
  • Not needed (internal change)

Did you update the documentation?

  • Yes
  • Not needed (internal change without effects for user)

@simonschoelly simonschoelly force-pushed the simon-fp16-onnx-export branch from e7cf153 to efbbfc8 Compare May 27, 2026 09:03
@simonschoelly simonschoelly marked this pull request as ready for review May 27, 2026 13:20
Copilot AI review requested due to automatic review settings May 27, 2026 13:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the DINOv3 LT-DETR object detection ONNX export pipeline to support FP16 exports more reliably by tracing in FP32 and converting the exported ONNX graph to FP16 post-export, including cleanup of redundant Cast patterns.

Changes:

  • Adjust DINOv3LTDETRObjectDetection.export_onnx() to always trace in FP32 and (optionally) convert the exported ONNX model to FP16 via onnxruntime.transformers.float16, then remove redundant Cast pairs.
  • Add an ONNX graph utility remove_redundant_casts() plus dedicated unit tests for its behavior.
  • Add a GPU-gated test verifying FP16 ONNX export produces FP16 initializers for both supported decoder variants.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/lightly_train/_task_models/dinov3_ltdetr_object_detection/task_model.py Trace ONNX in FP32, optionally post-convert to FP16, adjust verification path, and update TensorRT export precision handling.
src/lightly_train/_export/onnx_helpers.py Add remove_redundant_casts() to clean up FP32↔FP16 Cast round-trips in converted ONNX graphs.
tests/_task_models/dinov3_ltdetr_object_detection/test_task_model.py Add FP16 ONNX export test (GPU-gated) for both rtdetrv2 and dfine decoders.
tests/_export/test_onnx_helpers.py Add unit tests for remove_redundant_casts() rewiring/removal behavior.
tests/_export/__init__.py Add package marker/license header for the new tests/_export test package.

Comment thread src/lightly_train/_task_models/dinov3_ltdetr_object_detection/task_model.py Outdated
Comment on lines +112 to +118
for out in graph.output:
if out.name in rewire:
out.name = rewire[out.name]

new_nodes = [n for n in graph.node if id(n) not in nodes_to_remove]
del graph.node[:]
graph.node.extend(new_nodes)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice, but makes the solution even more complicated, let's skip if for now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get a relatively easy fix with an Identity node. Replace line 118-120 with

output_names = {out.name for out in graph.output}
identity_nodes = []

for output_name, input_name in rewire.items():
    if output_name in output_names:
        identity_nodes.append(
            helper.make_node(
                "Identity",
                inputs=[input_name],
                outputs=[output_name],
                name=f"identity_{output_name}",
            )
        )

graph.node.extend(identity_nodes)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know I could do that, but it would make the code even more complicated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is exactly what we could do then I don't think it is super complex. Otherwise we are risking tampering the named outputs (like labels and scores for clsf) which people may use for downstream inference code.

@simonschoelly simonschoelly force-pushed the simon-fp16-onnx-export branch from a507404 to d1ed192 Compare May 27, 2026 15:13
@simonschoelly simonschoelly changed the title Make LTDetrv2 object dection ONNX/TensorRT export work with FP16 Make dinov3 ltdetr object detection ONNX/TensorRT export work with FP16 May 27, 2026
@simonschoelly
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment on lines 1282 to +1305
@@ -1272,10 +1294,15 @@ def export_onnx(
dummy_input.cpu().to(torch.float32),
)

# Get outputs from the ONNX model.
session = ort.InferenceSession(out)
# Get outputs from the ONNX model. Load from bytes to avoid
# ORT errors about missing external data when weights are inline.
with open(out, "rb") as f:
session = ort.InferenceSession(f.read())
onnx_input = dummy_input.cpu()
if precision == "fp16":
onnx_input = onnx_input.half()
input_feed = {
"images": dummy_input.cpu().numpy(),
"images": onnx_input.numpy(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the argument that we have this ORT verifier inside export_onnx? In Ultralytics there is a separate command val for this where you can choose the backend (likely doable for us after we implement the benchmarking command).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back when I added the verifier, I think this was just because one might expect that the model works fine when exported, but then creates completely different numerical stuff.

We can definitely think how we can improve this in the future. Not sure if having it as a separate step would make this better though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I see. I will create a Linear issue for us to look into it later. Personally I would consider abstracting the verification logic to another command and keep the verify flag in export_onnx.

Comment on lines 1415 to 1418
min_batchsize=min_batchsize,
# FP32 attention scores required for FP16 model stability. Otherwise output
# contains NaN.
fp32_attention_scores=True,
# We convert the fp32 attention scores already during ONNX export
fp32_attention_scores=False,
verbose=verbose,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically, the helper calls export_onnx_fn(**onnx_args), and this wrapper never adds onnx_args["precision"], so the ONNX export defaults to fp32 while TensorRT is configured for fp16.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point - I will have to test this - I a bit surprised why it did suddenly work for me though.

Comment on lines +503 to +513
@pytest.mark.skipif(not RequirementCache("onnx"), reason="onnx not installed")
@pytest.mark.skipif(
not RequirementCache("onnxruntime"), reason="onnxruntime not installed"
)
@pytest.mark.skipif(not torch.cuda.is_available(), reason="Test requires GPU.")
@pytest.mark.parametrize("decoder_name", ["rtdetrv2", "dfine"])
def test_export_onnx__fp16(
tmp_path: Path, decoder_name: Literal["rtdetrv2", "dfine"]
) -> None:
import onnx

Copy link
Copy Markdown
Contributor

@yutong-xiang-97 yutong-xiang-97 May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to #751

If True, run onnxslim to simplify and overwrite the exported model.
verify:
If True, validate the ONNX file and compare outputs to a float32 CPU
If True, validate the ONNX filef and compare outputs to a float32 CPU
@lightly-ai lightly-ai deleted a comment from chatgpt-codex-connector Bot May 28, 2026
Comment thread src/lightly_train/_export/onnx_helpers.py
Comment on lines +112 to +118
for out in graph.output:
if out.name in rewire:
out.name = rewire[out.name]

new_nodes = [n for n in graph.node if id(n) not in nodes_to_remove]
del graph.node[:]
graph.node.extend(new_nodes)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is exactly what we could do then I don't think it is super complex. Otherwise we are risking tampering the named outputs (like labels and scores for clsf) which people may use for downstream inference code.

Comment on lines 1282 to +1305
@@ -1272,10 +1294,15 @@ def export_onnx(
dummy_input.cpu().to(torch.float32),
)

# Get outputs from the ONNX model.
session = ort.InferenceSession(out)
# Get outputs from the ONNX model. Load from bytes to avoid
# ORT errors about missing external data when weights are inline.
with open(out, "rb") as f:
session = ort.InferenceSession(f.read())
onnx_input = dummy_input.cpu()
if precision == "fp16":
onnx_input = onnx_input.half()
input_feed = {
"images": dummy_input.cpu().numpy(),
"images": onnx_input.numpy(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the argument that we have this ORT verifier inside export_onnx? In Ultralytics there is a separate command val for this where you can choose the backend (likely doable for us after we implement the benchmarking command).

Comment on lines +503 to +513
@pytest.mark.skipif(not RequirementCache("onnx"), reason="onnx not installed")
@pytest.mark.skipif(
not RequirementCache("onnxruntime"), reason="onnxruntime not installed"
)
@pytest.mark.skipif(not torch.cuda.is_available(), reason="Test requires GPU.")
@pytest.mark.parametrize("decoder_name", ["rtdetrv2", "dfine"])
def test_export_onnx__fp16(
tmp_path: Path, decoder_name: Literal["rtdetrv2", "dfine"]
) -> None:
import onnx

Copy link
Copy Markdown
Contributor

@yutong-xiang-97 yutong-xiang-97 May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to #751

Comment on lines 1415 to 1418
min_batchsize=min_batchsize,
# FP32 attention scores required for FP16 model stability. Otherwise output
# contains NaN.
fp32_attention_scores=True,
# We convert the fp32 attention scores already during ONNX export
fp32_attention_scores=False,
verbose=verbose,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically, the helper calls export_onnx_fn(**onnx_args), and this wrapper never adds onnx_args["precision"], so the ONNX export defaults to fp32 while TensorRT is configured for fp16.

"Skipping ONNX full model check for fp16 model because no "
"GPU is available. Run on a GPU to enable full verification."
)
else:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way onnx.checker.check_model is silently skipped for fp16 on CPU. Consider moving it upwards.

"Softmax",
"MatMul",
]
model_fp16 = ort_float16.convert_float_to_float16(
Copy link
Copy Markdown
Contributor

@yutong-xiang-97 yutong-xiang-97 May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the topological order of nodes can be violated. You can add a checker here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also include it in the integration tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you are saying here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I definitely mistyped stuff. A more detailed explanation from Codex:

ort_float16.convert_float_to_float16(...) inserts Cast nodes for blocked ops, but some inserted Cast producers end up after the node that consumes them.

Concrete example from the generated graph before onnxslim:

node 363: /backbone/rope_embed/Range
  input: /backbone/rope_embed/Range_input_cast_0

node 3979: /backbone/rope_embed/Range_input_cast0 (Cast)
  output: /backbone/rope_embed/Range_input_cast_0

That breaks ONNX topological order because Range consumes a tensor before the graph has produced it. remove_redundant_casts() is not the root cause in this case; the graph is already invalid immediately after convert_float_to_float16.

import onnxruntime as ort

onnx.checker.check_model(out, full_check=True)
if precision == "fp16" and not torch.cuda.is_available():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a ort.get_available_providers() for these checks

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about onnx.checker.check_model not about onnxruntime.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe onnx.checker.check_model should happen for both cases (hence the comment below). What I meant is to replace torch.cuda.is_available() with something like

providers = ort.get_available_providers()
if "CUDAExecutionProvider" not in providers:
   ...

that is more specific.

Two failure cases

- CUDA GPU present but `onnxruntime` (not `onnxruntime-gpu`) installed → `torch.cuda.is_available()` is True, but ORT will fall back to CPU and FP16 inference will likely fail.
- `onnxruntime-gpu` installed in an environment where torch was built CPU-only → the opposite false negative.
``

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants