
Fix MTEBEvaluator: device mapping, padding-free inference, last-token pooling, L2 normalization#2415

Merged
jambayk merged 15 commits into main from natke/mteb-device-fix on Apr 16, 2026

Conversation

@natke
Contributor

@natke natke commented Apr 14, 2026

Fixes several issues in the MTEBEvaluator for embedding model evaluation:

Device mapping

Maps Olive's Device.GPU ("gpu") to PyTorch's "cuda" when initializing SentenceTransformer in the HF evaluation path. Also handles indexed devices (e.g. gpu:0 → cuda:0).
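
A minimal sketch of the mapping (the helper name is illustrative, not the actual Olive code):

```python
def olive_device_to_torch(device: str) -> str:
    """Map Olive device strings ("gpu", "gpu:1") to PyTorch ones ("cuda", "cuda:1")."""
    name, _, index = device.partition(":")
    if name.lower() != "gpu":
        return device  # "cpu" and other devices pass through unchanged
    return f"cuda:{index}" if index else "cuda"

# SentenceTransformer(model_name, device=olive_device_to_torch("gpu:0"))  # -> "cuda:0"
```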

Padding-free inference for GenAI

GenAI's Generator does not accept an attention_mask, so padded batches produce contaminated hidden states via self-attention to padding tokens. Fix: process each sentence individually with only its real tokens, eliminating padding entirely.
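
Roughly, the padding-free path looks like this (`run_genai_forward` is a hypothetical stand-in for the GenAI forward call, and an HF-style tokenizer is assumed; this is a sketch, not the literal diff):

```python
import numpy as np

def encode_without_padding(sentences, tokenizer, run_genai_forward):
    """Per-sentence inference so no padding tokens ever reach the model.

    run_genai_forward is a stand-in that returns hidden states of shape
    (1, seq_len, hidden_dim) for a single unpadded sequence.
    """
    embeddings = []
    for text in sentences:
        # Tokenize one sentence at a time with padding disabled: real tokens only.
        input_ids = tokenizer(text, return_tensors="np", padding=False)["input_ids"]
        hidden = run_genai_forward(input_ids)   # (1, seq_len, hidden_dim)
        embeddings.append(hidden[0, -1])        # last token is always a real token
    return np.stack(embeddings)
```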

Last-token pooling

Replaced mean pooling with last-token pooling in the GenAI and ORT wrappers to match models like Qwen3-Embedding that use pooling_mode_lasttoken=True.
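
For padded batches, last-token pooling picks the hidden state at the last position covered by the attention mask. A sketch assuming right padding, not the exact wrapper code:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last real token in each sequence.

    hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len).
    With right padding, the last real token sits at index sum(mask) - 1.
    """
    last_idx = attention_mask.sum(axis=1).astype(int) - 1           # (batch,)
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]
```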

L2 normalization

Added L2 normalization after pooling in the base encode() method, matching the 2_Normalize module in the SentenceTransformer pipeline.
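
The normalization itself is unit-length scaling of each pooled embedding (a sketch of the idea, not the literal diff):

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm, mirroring SentenceTransformer's Normalize module."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, eps, None)
```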

Results

These fixes close the score gap between HF and GenAI evaluation:

  • Before: HF 0.785 vs GenAI 0.651 (STS17 main_score)
  • After: HF 0.785 vs GenAI 0.785

natke and others added 7 commits April 14, 2026 12:47
…nsformer

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Qwen3-Embedding uses last-token pooling (not mean pooling) and L2
normalization, matching its SentenceTransformer pipeline config:
- pooling_mode_lasttoken: true
- 2_Normalize module

This fixes the ~17% score drop between HF and exported model evaluation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Last-token pooling made scores worse (0.378 vs 0.651 with mean pooling),
likely due to GenAI hidden_states not aligning with HF tokenizer
attention_mask positions. Reverting pooling to mean while keeping L2
normalization, which should still improve scores.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Temporary debug logging — remove before merge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GenAI hidden_states shape matches input_ids shape exactly (including
padding positions), so last-token pooling via attention_mask is correct.
Debug logging kept temporarily for verification.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GenAI Generator doesn't accept attention_mask, so padded batches
produce contaminated hidden states. Fix: process each sentence
individually with only its real tokens, then take last-token pooling.

This should close the gap between HF (0.785) and GenAI (0.651) scores.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 14, 2026 21:49
Contributor

Copilot AI left a comment


Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Improves correctness and consistency of MTEB embedding evaluation across HF / ORT / GenAI backends by aligning device strings, pooling strategy, padding behavior, and embedding normalization.

Changes:

  • Map Olive gpu / gpu:<idx> device strings to PyTorch cuda / cuda:<idx> for SentenceTransformer initialization.
  • Switch ORT + GenAI wrappers from mean pooling to last-token pooling; avoid padding in GenAI by encoding each sequence using only real tokens.
  • Add L2 normalization to ORT embeddings to match SentenceTransformer’s Normalize module.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| olive/evaluator/olive_evaluator.py | Normalizes Olive device strings to PyTorch-compatible cuda strings in the HF evaluation path. |
| olive/evaluator/mteb_ort.py | Adds L2 normalization, switches pooling to last-token, and removes padding from GenAI inference via per-sample processing. |

4 comment threads on olive/evaluator/mteb_ort.py (outdated)
natke and others added 4 commits April 14, 2026 15:41
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-a-time

Group sequences with equal real token counts into a single Generator
call, reducing per-sample overhead while still avoiding padding
contamination.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
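
A rough illustration of the length-grouping idea from this commit (function name and data layout are assumptions, not the actual implementation):

```python
from collections import defaultdict

def group_by_token_count(tokenized):
    """Bucket sequences by their real (unpadded) token count.

    tokenized: list of 1-D input_ids arrays, one per sentence. Each bucket can
    be stacked into a rectangular batch, so the Generator sees no padding while
    per-call overhead drops compared to strictly one sentence at a time.
    """
    buckets = defaultdict(list)
    for i, ids in enumerate(tokenized):
        buckets[len(ids)].append((i, ids))
    return buckets
```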
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3 comment threads on olive/evaluator/mteb_ort.py (fixed)
natke and others added 2 commits April 15, 2026 09:42
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
natke and others added 2 commits April 15, 2026 12:42
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jambayk jambayk merged commit 21fcca9 into main Apr 16, 2026
11 checks passed
@jambayk jambayk deleted the natke/mteb-device-fix branch April 16, 2026 20:18
xiaoyu-work pushed a commit that referenced this pull request Apr 17, 2026

Labels: None yet

Projects: None yet

4 participants