Add chat-template hooks to LMEvalORTGenAIEvaluator#2462

Merged
xiaoyu-work merged 6 commits into
microsoft:mainfrom
ykhrustalev:lmeval-ort-chat-template
May 21, 2026

@ykhrustalev
Contributor

@ykhrustalev ykhrustalev commented May 12, 2026

Describe your changes

Implement tokenizer_name and apply_chat_template on LMEvalORTGenAIEvaluator so the backend supports lm_eval.simple_evaluate(apply_chat_template=True). Without these, lm-eval raises NotImplementedError at task setup for any chat-formatted task.

This brings the backend to parity with the HuggingFace backend in lm_eval/models/huggingface.py. The HF tokenizer is loaded lazily on the first apply_chat_template call, so model directories without HF tokenizer files still work for non-chat evaluation. Generation continues to go through og.Tokenizer.
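
The lazy-loading behaviour described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the class name `LazyChatTemplate`, the `loader` parameter, and the `_load_hf_tokenizer` helper are hypothetical, and the `add_generation_prompt` argument assumes an lm-eval version whose `apply_chat_template` hook accepts it.

```python
from typing import Any, Callable, Dict, List


def _load_hf_tokenizer(model_path: str) -> Any:
    # Hypothetical helper. transformers is imported here so that defining the
    # class below does not require it to be installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_path)


class LazyChatTemplate:
    """Hypothetical stand-in for the chat-template part of the evaluator."""

    def __init__(self, model_path: str, loader: Callable[[str], Any] = _load_hf_tokenizer):
        self._model_path = model_path
        self._loader = loader
        # Deliberately not loaded here: non-chat runs never need HF tokenizer files.
        self._hf_tokenizer = None

    def apply_chat_template(
        self, chat_history: List[Dict[str, str]], add_generation_prompt: bool = True
    ) -> str:
        # Load on first use, then reuse the cached tokenizer on later calls.
        if self._hf_tokenizer is None:
            self._hf_tokenizer = self._loader(self._model_path)
        # Render the prompt as text; tokenization for generation stays elsewhere.
        return self._hf_tokenizer.apply_chat_template(
            chat_history,
            tokenize=False,
            add_generation_prompt=add_generation_prompt,
        )
```

Making the loader injectable is a choice made for this sketch only; it lets the load-once behaviour be checked with a fake tokenizer and no files on disk.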

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? Release note: Enable apply_chat_template=True in lm-eval for ortgenai-backed evaluators.

(Optional) Issue link

N/A

lm-eval's `simple_evaluate(..., apply_chat_template=True)` requires the
underlying LM class to implement `tokenizer_name` and `apply_chat_template`.
The HFLM backend has both; the ORT GenAI backend does not, so any attempt
to evaluate a chat-tuned ONNX model with chat-formatted prompts raises
`NotImplementedError: To use this model with chat templates, please
implement the 'tokenizer_name' property.`

This adds the two members with the minimum surface area:
  - `tokenizer_name` returns the model path (for lm-eval's chat-aware
    result caching), matching the HFLM convention of slash-replacement.
  - `apply_chat_template` defers to the model's HF tokenizer via
    `AutoTokenizer.apply_chat_template`, mirroring HFLM's
    implementation.

The HF tokenizer is loaded once at `__init__` purely for chat-template
rendering; token-level encode/decode still goes through `og.Tokenizer`
and the runtime, so there is no change to generation behavior or any
existing code path.

Verified end-to-end on LFM2.5-350M (int4, k_quant_mixed) MBPP:
without chat-template hooks the eval raised at task start; with them
plus `num_fewshot=0` and a chat-friendly stop list, pass@1 went from
0.0/500 to 67/500 (13.4%) -- the original 0.0 was a prompt-format
artifact (instruct model + completion-style few-shot), not a
conversion regression.
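
For orientation, an evaluation call of the kind described above might look like the sketch below. The wrapper `run_chat_eval` is hypothetical; it assumes `lm` is an already constructed evaluator instance exposing the two hooks, and it omits the stop list because the commit message does not state its contents.

```python
from typing import Any, Dict, Sequence


def run_chat_eval(lm: Any, tasks: Sequence[str], num_fewshot: int = 0) -> Dict[str, Any]:
    """Hypothetical wrapper: run lm-eval with chat templating enabled."""
    # Imported inside the function so this module can be loaded without lm-eval.
    import lm_eval

    return lm_eval.simple_evaluate(
        model=lm,                  # an LM instance rather than a registry name
        tasks=list(tasks),         # e.g. ["mbpp"]
        num_fewshot=num_fewshot,   # 0 avoids completion-style few-shot prompts
        apply_chat_template=True,  # requires tokenizer_name and apply_chat_template
    )
```

With `apply_chat_template=True`, lm-eval calls the model's `apply_chat_template` hook to render each prompt, which is why the missing hooks surfaced as an error at task setup.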
Copilot AI review requested due to automatic review settings May 12, 2026 21:06
Contributor

Copilot AI left a comment

Pull request overview

This PR adds lm-eval “chat template” integration hooks to LMEvalORTGenAIEvaluator so ORT GenAI–backed models can be evaluated via lm_eval.simple_evaluate(..., apply_chat_template=True) (matching the capability available in the HuggingFace backend).

Changes:

  • Add an HF tokenizer instance to LMEvalORTGenAIEvaluator for rendering chat templates.
  • Implement tokenizer_name for lm-eval chat-template-aware caching.
  • Implement apply_chat_template(...) by delegating to the HF tokenizer.

Comment thread olive/evaluator/lmeval_ort.py Outdated
Comment thread olive/evaluator/lmeval_ort.py Outdated
… key, tests

- Lazy-load the HF tokenizer on the first ``apply_chat_template`` call rather
  than at ``__init__``. Callers that never enable chat templating no longer
  need HF tokenizer files (``tokenizer_config.json`` etc.) in the model
  directory; eager loading would have regressed those workflows.

- ``tokenizer_name`` now replaces both POSIX and Windows path separators with
  ``__`` so the lm-eval cache identifier is stable across platforms. The
  previous implementation only handled forward slashes, leaving backslashes
  in the key on Windows because ``str(Path(...))`` preserves the native
  separator.

- Add unit tests for both behaviours:
    - ``tokenizer_name`` parametrised over POSIX, relative, and Windows-style
      paths to lock in the normalisation contract.
    - ``apply_chat_template`` verified to (a) not load the HF tokenizer at
      construction, (b) load once on first call, and (c) reuse the cached
      tokenizer on subsequent calls. ``AutoTokenizer`` is patched so the
      tests run without any HF tokenizer files on disk.
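
The separator normalisation can be sketched as below. The helper name `normalize_tokenizer_name` and the example paths are hypothetical, and the expected strings are what this sketch produces, not values taken from the PR's test suite.

```python
def normalize_tokenizer_name(model_path: str) -> str:
    """Replace Windows and POSIX separators with "__" for a platform-stable key."""
    return model_path.replace("\\", "__").replace("/", "__")


# One case per path style named above: POSIX absolute, relative, Windows-style.
CASES = {
    "/models/lfm/int4": "__models__lfm__int4",
    "models/lfm/int4": "models__lfm__int4",
    "C:\\models\\lfm\\int4": "C:__models__lfm__int4",
}

for raw, expected in CASES.items():
    assert normalize_tokenizer_name(raw) == expected
```

Operating on the string form rather than on `Path` parts keeps the result independent of the platform the code runs on, which is the property the parametrised tests are meant to lock in.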

All four new tests pass; ``test_olive_evaluator.py`` as a whole stays green
(85 passed). ``lintrunner`` reports no new warnings on the changed files.
Comment thread test/evaluator/test_olive_evaluator.py Fixed
@shaahji shaahji enabled auto-merge (squash) May 20, 2026 17:46
@shaahji
Collaborator

shaahji commented May 20, 2026

@ykhrustalev Please add the following snippet on tests that depend on lm-eval package

import importlib.util
import pytest

@pytest.mark.skipif(
    importlib.util.find_spec("lm_eval") is None,
    reason="lm_eval not installed",
)

auto-merge was automatically disabled May 20, 2026 18:32

Head branch was pushed to by a user without write access

@xiaoyu-work xiaoyu-work merged commit 5ff3a59 into microsoft:main May 21, 2026
15 checks passed