
Conversation

@RyanUnderhill (Contributor)

Address previous PR review comments from #1470 (#1473)
Address QNN specific regressions (#1470)
Fix array eos_token_id handling (#1463)
Constrained decoding integration (#1381)
Remove BF16 CPU from valid GQA configuration (#1469)
Avoid adding providers if not requested (#1464)
Persist provider options across ClearProviders, AppendProvider where possible (#1454)
Fix accuracy issues with Gemma models (#1448)
Add bfloat16 support in model builder (#1447)
Add final norm for LoRA models (#1446)

Update version to 0.8.0-rc3

RyanUnderhill and others added 12 commits May 12, 2025 13:41
Add final norm for LoRA models (#1446)

### Description

This PR adds the missing pattern to identify the final norm layer in
LoRA models. It also cleans up some of the classes in the model builder.

### Motivation and Context

The missing final norm layer in LoRA models caused the generated LoRA
models to be incorrect.
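As an illustration (a hypothetical sketch, not the model builder's actual code): LoRA adapters wrap the base model, so the final norm's weight name gains a prefix that a name-based pattern must tolerate.

```python
import re

# Hypothetical sketch: PEFT/LoRA checkpoints wrap the base model, so
# "model.norm.weight" shows up as "base_model.model.model.norm.weight".
# A pattern anchored only to the unwrapped name misses the final norm.
FINAL_NORM_PATTERN = re.compile(r"(^|\.)model\.norm\.weight$")

def find_final_norm(weight_names):
    return [name for name in weight_names if FINAL_NORM_PATTERN.search(name)]

print(find_final_norm([
    "model.layers.0.input_layernorm.weight",
    "base_model.model.model.norm.weight",  # LoRA-wrapped final norm
    "model.norm.weight",                   # plain final norm
]))
# -> ['base_model.model.model.norm.weight', 'model.norm.weight']
```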
Add bfloat16 support in model builder (#1447)

### Description

This PR adds
[bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
support in the model builder.

### Motivation and Context

Most SLMs and LLMs are trained in bfloat16 precision. Casting from
bfloat16 to
[float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format)
can cause accuracy loss in models (e.g. Google's Gemma model family).

The NumPy dependency when converting a
[torch.Tensor](https://pytorch.org/docs/stable/tensors.html) object to
an [ONNX
TensorProto](https://onnx.ai/onnx/api/helper.html#onnx.helper.make_tensor)
object has been removed. This will allow torch.Tensor objects in [other
precisions that are not supported in
NumPy](https://numpy.org/doc/stable/user/basics.types.html#relationship-between-numpy-data-types-and-c-data-types)
to be converted to ONNX TensorProto objects.
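A minimal sketch of the idea (the dtype map and function name are illustrative assumptions, not the model builder's actual code):

```python
import torch
from onnx import TensorProto, helper

# Illustrative dtype map; the real model builder covers more types.
TORCH_TO_ONNX = {
    torch.float32: TensorProto.FLOAT,
    torch.float16: TensorProto.FLOAT16,
    torch.bfloat16: TensorProto.BFLOAT16,
}

def to_tensor_proto(t: torch.Tensor, name: str) -> TensorProto:
    # Reinterpret the tensor as raw bytes instead of calling .numpy(),
    # which fails for bfloat16 because NumPy has no such dtype.
    raw = bytes(t.contiguous().view(torch.uint8).flatten().tolist())
    return helper.make_tensor(name, TORCH_TO_ONNX[t.dtype], list(t.shape), raw, raw=True)

proto = to_tensor_proto(torch.randn(2, 2, dtype=torch.bfloat16), "weight")
```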

This PR also fixes [this
issue](#691).
Fix accuracy issues with Gemma models (#1448)

### Description

This PR fixes accuracy issues with Google's Gemma models by using
bfloat16 precision, [always computing any LayerNorms in float32
precision](https://github.com/huggingface/transformers/blob/fee1190601b5d04ec6d3f7f58fd22788d7f3236d/src/transformers/models/gemma3/modeling_gemma3.py#L141-L146),
and always casting the output logits to float32.
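The linked transformers code upcasts before normalizing. A condensed sketch of that pattern (paraphrased from the Gemma-3 RMSNorm, not copied verbatim):

```python
import torch

class RMSNormFP32(torch.nn.Module):
    # Condensed from the Gemma-3 RMSNorm pattern: do the normalization
    # math in float32, then cast back to the activation dtype.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.float()  # rsqrt/mean in bfloat16 would lose precision
        h = h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + self.eps)
        return (h * (1.0 + self.weight.float())).type_as(x)
```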

### Motivation and Context

This PR has been tested with Gemma-2 and Gemma-3. It builds on the
bfloat16 changes from [this PR](#1447) and the
missing final norm changes from [this PR](#1420).

---------

Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: Nenad Banfic <nebanfic@microsoft.com>
Remove BF16 CPU from valid GQA configuration (#1469)

Most CPUs do not support BF16, so it is removed as an option; some of
the underlying kernel implementations are missing.
Integrate constrained decoding using the LLGuidance library.

Based on Ying's Constrained Decoding branch
(yingxiong/constrained_decoding)
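Conceptually, constrained decoding masks the model's logits so that only tokens permitted by the grammar can be sampled. A generic sketch of that idea (not the actual LLGuidance integration; names are illustrative):

```python
import numpy as np

def apply_token_mask(logits: np.ndarray, allowed_ids: list) -> np.ndarray:
    # The grammar engine (here, hypothetically LLGuidance) reports which
    # token ids keep the output inside the grammar; every other logit is
    # set to -inf so the sampler can never pick it.
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return masked

logits = np.array([1.2, 0.3, -0.5, 2.0], dtype=np.float32)
print(apply_token_mask(logits, [0, 3]))  # only ids 0 and 3 remain sampleable
```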

---------

Co-authored-by: Ying Xiong <yingxiong@microsoft.com>
Co-authored-by: Michał Moskal <michal@moskal.me>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Ryan Hill <38674843+RyanUnderhill@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Update windows packaging pipelines to use build.py (#1468)
@baijumeswani merged commit c22b15a into rel-0.8.0 on May 14, 2025
13 checks passed
@baijumeswani deleted the ryanunderhill/rc3_cherry_picks branch on May 14, 2025
@natke changed the title from "Ryanunderhill/rc3 cherry picks" to "0.8.0 rc3 cherry picks" on Jun 3, 2025