Make dinov3 ltdetr object detection ONNX/TensorRT export work with FP16#751
Make dinov3 ltdetr object detection ONNX/TensorRT export work with FP16#751simonschoelly wants to merge 2 commits into
Conversation
e7cf153 to
efbbfc8
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the DINOv3 LT-DETR object detection ONNX export pipeline to support FP16 exports more reliably by tracing in FP32 and converting the exported ONNX graph to FP16 post-export, including cleanup of redundant Cast patterns.
Changes:
- Adjust
DINOv3LTDETRObjectDetection.export_onnx()to always trace in FP32 and (optionally) convert the exported ONNX model to FP16 viaonnxruntime.transformers.float16, then remove redundant Cast pairs. - Add an ONNX graph utility
remove_redundant_casts()plus dedicated unit tests for its behavior. - Add a GPU-gated test verifying FP16 ONNX export produces FP16 initializers for both supported decoder variants.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/lightly_train/_task_models/dinov3_ltdetr_object_detection/task_model.py |
Trace ONNX in FP32, optionally post-convert to FP16, adjust verification path, and update TensorRT export precision handling. |
src/lightly_train/_export/onnx_helpers.py |
Add remove_redundant_casts() to clean up FP32↔FP16 Cast round-trips in converted ONNX graphs. |
tests/_task_models/dinov3_ltdetr_object_detection/test_task_model.py |
Add FP16 ONNX export test (GPU-gated) for both rtdetrv2 and dfine decoders. |
tests/_export/test_onnx_helpers.py |
Add unit tests for remove_redundant_casts() rewiring/removal behavior. |
tests/_export/__init__.py |
Add package marker/license header for the new tests/_export test package. |
| for out in graph.output: | ||
| if out.name in rewire: | ||
| out.name = rewire[out.name] | ||
|
|
||
| new_nodes = [n for n in graph.node if id(n) not in nodes_to_remove] | ||
| del graph.node[:] | ||
| graph.node.extend(new_nodes) |
There was a problem hiding this comment.
That would be nice, but makes the solution even more complicated, let's skip if for now.
There was a problem hiding this comment.
I think we can get a relatively easy fix with an Identity node. Replace line 118-120 with
output_names = {out.name for out in graph.output}
identity_nodes = []
for output_name, input_name in rewire.items():
if output_name in output_names:
identity_nodes.append(
helper.make_node(
"Identity",
inputs=[input_name],
outputs=[output_name],
name=f"identity_{output_name}",
)
)
graph.node.extend(identity_nodes)
There was a problem hiding this comment.
Yes, I know I could do that, but it would make the code even more complicated.
There was a problem hiding this comment.
If it is exactly what we could do then I don't think it is super complex. Otherwise we are risking tampering the named outputs (like labels and scores for clsf) which people may use for downstream inference code.
a507404 to
d1ed192
Compare
|
/review |
| @@ -1272,10 +1294,15 @@ def export_onnx( | |||
| dummy_input.cpu().to(torch.float32), | |||
| ) | |||
|
|
|||
| # Get outputs from the ONNX model. | |||
| session = ort.InferenceSession(out) | |||
| # Get outputs from the ONNX model. Load from bytes to avoid | |||
| # ORT errors about missing external data when weights are inline. | |||
| with open(out, "rb") as f: | |||
| session = ort.InferenceSession(f.read()) | |||
| onnx_input = dummy_input.cpu() | |||
| if precision == "fp16": | |||
| onnx_input = onnx_input.half() | |||
| input_feed = { | |||
| "images": dummy_input.cpu().numpy(), | |||
| "images": onnx_input.numpy(), | |||
There was a problem hiding this comment.
What was the argument that we have this ORT verifier inside export_onnx? In Ultralytics there is a separate command val for this where you can choose the backend (likely doable for us after we implement the benchmarking command).
There was a problem hiding this comment.
Back when I added the verifier, I think this was just because one might expect that the model works fine when exported, but then creates completely different numerical stuff.
We can definitely think how we can improve this in the future. Not sure if having it as a separate step would make this better though.
There was a problem hiding this comment.
OK I see. I will create a Linear issue for us to look into it later. Personally I would consider abstracting the verification logic to another command and keep the verify flag in export_onnx.
| min_batchsize=min_batchsize, | ||
| # FP32 attention scores required for FP16 model stability. Otherwise output | ||
| # contains NaN. | ||
| fp32_attention_scores=True, | ||
| # We convert the fp32 attention scores already during ONNX export | ||
| fp32_attention_scores=False, | ||
| verbose=verbose, |
There was a problem hiding this comment.
More specifically, the helper calls export_onnx_fn(**onnx_args), and this wrapper never adds onnx_args["precision"], so the ONNX export defaults to fp32 while TensorRT is configured for fp16.
There was a problem hiding this comment.
That is a good point - I will have to test this - I a bit surprised why it did suddenly work for me though.
| @pytest.mark.skipif(not RequirementCache("onnx"), reason="onnx not installed") | ||
| @pytest.mark.skipif( | ||
| not RequirementCache("onnxruntime"), reason="onnxruntime not installed" | ||
| ) | ||
| @pytest.mark.skipif(not torch.cuda.is_available(), reason="Test requires GPU.") | ||
| @pytest.mark.parametrize("decoder_name", ["rtdetrv2", "dfine"]) | ||
| def test_export_onnx__fp16( | ||
| tmp_path: Path, decoder_name: Literal["rtdetrv2", "dfine"] | ||
| ) -> None: | ||
| import onnx | ||
|
|
| If True, run onnxslim to simplify and overwrite the exported model. | ||
| verify: | ||
| If True, validate the ONNX file and compare outputs to a float32 CPU | ||
| If True, validate the ONNX filef and compare outputs to a float32 CPU |
| for out in graph.output: | ||
| if out.name in rewire: | ||
| out.name = rewire[out.name] | ||
|
|
||
| new_nodes = [n for n in graph.node if id(n) not in nodes_to_remove] | ||
| del graph.node[:] | ||
| graph.node.extend(new_nodes) |
There was a problem hiding this comment.
If it is exactly what we could do then I don't think it is super complex. Otherwise we are risking tampering the named outputs (like labels and scores for clsf) which people may use for downstream inference code.
| @@ -1272,10 +1294,15 @@ def export_onnx( | |||
| dummy_input.cpu().to(torch.float32), | |||
| ) | |||
|
|
|||
| # Get outputs from the ONNX model. | |||
| session = ort.InferenceSession(out) | |||
| # Get outputs from the ONNX model. Load from bytes to avoid | |||
| # ORT errors about missing external data when weights are inline. | |||
| with open(out, "rb") as f: | |||
| session = ort.InferenceSession(f.read()) | |||
| onnx_input = dummy_input.cpu() | |||
| if precision == "fp16": | |||
| onnx_input = onnx_input.half() | |||
| input_feed = { | |||
| "images": dummy_input.cpu().numpy(), | |||
| "images": onnx_input.numpy(), | |||
There was a problem hiding this comment.
What was the argument that we have this ORT verifier inside export_onnx? In Ultralytics there is a separate command val for this where you can choose the backend (likely doable for us after we implement the benchmarking command).
| @pytest.mark.skipif(not RequirementCache("onnx"), reason="onnx not installed") | ||
| @pytest.mark.skipif( | ||
| not RequirementCache("onnxruntime"), reason="onnxruntime not installed" | ||
| ) | ||
| @pytest.mark.skipif(not torch.cuda.is_available(), reason="Test requires GPU.") | ||
| @pytest.mark.parametrize("decoder_name", ["rtdetrv2", "dfine"]) | ||
| def test_export_onnx__fp16( | ||
| tmp_path: Path, decoder_name: Literal["rtdetrv2", "dfine"] | ||
| ) -> None: | ||
| import onnx | ||
|
|
| min_batchsize=min_batchsize, | ||
| # FP32 attention scores required for FP16 model stability. Otherwise output | ||
| # contains NaN. | ||
| fp32_attention_scores=True, | ||
| # We convert the fp32 attention scores already during ONNX export | ||
| fp32_attention_scores=False, | ||
| verbose=verbose, |
There was a problem hiding this comment.
More specifically, the helper calls export_onnx_fn(**onnx_args), and this wrapper never adds onnx_args["precision"], so the ONNX export defaults to fp32 while TensorRT is configured for fp16.
| "Skipping ONNX full model check for fp16 model because no " | ||
| "GPU is available. Run on a GPU to enable full verification." | ||
| ) | ||
| else: |
There was a problem hiding this comment.
This way onnx.checker.check_model is silently skipped for fp16 on CPU. Consider moving it upwards.
| "Softmax", | ||
| "MatMul", | ||
| ] | ||
| model_fp16 = ort_float16.convert_float_to_float16( |
There was a problem hiding this comment.
Here the topological order of nodes can be violated. You can add a checker here.
There was a problem hiding this comment.
Maybe also include it in the integration tests
There was a problem hiding this comment.
I am not sure what you are saying here?
There was a problem hiding this comment.
Sorry I definitely mistyped stuff. A more detailed explanation from Codex:
ort_float16.convert_float_to_float16(...) inserts Cast nodes for blocked ops, but some inserted Cast producers end up after the node that consumes them.
Concrete example from the generated graph before onnxslim:
node 363: /backbone/rope_embed/Range
input: /backbone/rope_embed/Range_input_cast_0
node 3979: /backbone/rope_embed/Range_input_cast0 (Cast)
output: /backbone/rope_embed/Range_input_cast_0
That breaks ONNX topological order because Range consumes a tensor before the graph has produced it. remove_redundant_casts() is not the root cause in this case; the graph is already invalid immediately after convert_float_to_float16.
| import onnxruntime as ort | ||
|
|
||
| onnx.checker.check_model(out, full_check=True) | ||
| if precision == "fp16" and not torch.cuda.is_available(): |
There was a problem hiding this comment.
There is a ort.get_available_providers() for these checks
There was a problem hiding this comment.
This is about onnx.checker.check_model not about onnxruntime.
There was a problem hiding this comment.
I believe onnx.checker.check_model should happen for both cases (hence the comment below). What I meant is to replace torch.cuda.is_available() with something like
providers = ort.get_available_providers()
if "CUDAExecutionProvider" not in providers:
...
that is more specific.
Two failure cases
- CUDA GPU present but `onnxruntime` (not `onnxruntime-gpu`) installed → `torch.cuda.is_available()` is True, but ORT will fall back to CPU and FP16 inference will likely fail.
- `onnxruntime-gpu` installed in an environment where torch was built CPU-only → the opposite false negative.
``
What has changed and why?
This PR fixes two issues with FP16 export to ONNX and TensorRT.
model.half(). Instead we always export the model infp32precision and then manually convert the model tofp16.fp16creates overflow/NaN issues around Softmax. We avoid this by carefully using casts to FP32 around these nodes. This previously already worked for TensorRT export, but not for ONNX export.In addtion
simplify=True- this is, because the simplifier also solves an issue where some two nodes would have the same name in the ONNX Graph.How has it been tested?
Added some tests - especially for the removal of unncessary cast nodes. In addition tested on Google Colab and ensured that the FP16 models would correctly run without NaN values.
Did you update CHANGELOG.md?
Did you update the documentation?