RagTokenizer: add encode and patch_token(_id) forwarding #43468

Closed
sherlock-488 wants to merge 2 commits into huggingface:main from sherlock-488:rag-35532-ragtokenizer-encode

@sherlock-488

Fixes #35532

What does this PR do?

  • Adds encode() to RagTokenizer by forwarding to current_tokenizer.
  • Adds patch_token / patch_token_id getters and setters, also forwarding to current_tokenizer.
  • Adds a lightweight regression test using dummy tokenizers (no model downloads).
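
The forwarding described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual diff: `DummyTokenizer` and `ForwardingTokenizer` are hypothetical stand-ins, the toy encoding and the token values are made up, and only `current_tokenizer`, `encode`, `patch_token` and `patch_token_id` are names taken from the description.

```python
class DummyTokenizer:
    """Hypothetical stand-in for a real tokenizer (no model download)."""

    def __init__(self):
        # Made-up values, used only for illustration.
        self.patch_token = "<patch>"
        self.patch_token_id = 7

    def encode(self, text, **kwargs):
        # Toy encoding: one id per whitespace-separated word (its length).
        return [len(word) for word in text.split()]


class ForwardingTokenizer:
    """Hypothetical wrapper that delegates to `current_tokenizer`."""

    def __init__(self, question_encoder, generator):
        self.question_encoder = question_encoder
        self.generator = generator
        # Assumption: the wrapper starts out pointing at the question encoder.
        self.current_tokenizer = self.question_encoder

    def encode(self, *args, **kwargs):
        # Forward encode() unchanged to whichever tokenizer is active.
        return self.current_tokenizer.encode(*args, **kwargs)

    @property
    def patch_token(self):
        return self.current_tokenizer.patch_token

    @patch_token.setter
    def patch_token(self, value):
        self.current_tokenizer.patch_token = value

    @property
    def patch_token_id(self):
        return self.current_tokenizer.patch_token_id

    @patch_token_id.setter
    def patch_token_id(self, value):
        self.current_tokenizer.patch_token_id = value


tokenizer = ForwardingTokenizer(DummyTokenizer(), DummyTokenizer())
ids = tokenizer.encode("hello rag")
# The setter writes through to the active tokenizer only.
tokenizer.patch_token_id = 42
```

Because the properties read and write through `current_tokenizer`, switching the active tokenizer would also switch which underlying values are seen; in this sketch the `generator` instance is left untouched by the setter.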

Why?

RagTokenizer forwards __call__() via current_tokenizer but lacks common tokenizer APIs like encode() and patch_token(_id),
which can break preprocessing code expecting a standard tokenizer interface.
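
To illustrate the failure mode, here is a small sketch under stated assumptions: `CallOnlyWrapper`, `DummyTokenizer` and `preprocess` are hypothetical names, not code from transformers. The wrapper forwards `__call__()` only, so a preprocessing helper that relies on `encode()` raises `AttributeError`.

```python
class DummyTokenizer:
    """Hypothetical tokenizer exposing both __call__ and encode."""

    def __call__(self, text):
        # Toy encoding: one id per whitespace-separated word (its length).
        return {"input_ids": [len(word) for word in text.split()]}

    def encode(self, text):
        return self(text)["input_ids"]


class CallOnlyWrapper:
    """Hypothetical wrapper that forwards __call__ but nothing else."""

    def __init__(self, inner):
        self.current_tokenizer = inner

    def __call__(self, *args, **kwargs):
        return self.current_tokenizer(*args, **kwargs)


def preprocess(tokenizer, text):
    # Hypothetical preprocessing step that expects a standard encode().
    return tokenizer.encode(text)


wrapper = CallOnlyWrapper(DummyTokenizer())
call_output = wrapper("hello rag")  # works: __call__ is forwarded

try:
    preprocess(wrapper, "hello rag")
    failure = None
except AttributeError as exc:
    # The wrapper has no encode(), so the helper cannot run.
    failure = exc
```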

Tests

python -m pytest -q tests/models/rag/test_tokenization_rag_interface.py
make style

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: rag

@sherlock-488
Author

Hi! This PR implements encode() and patch_token/patch_token_id forwarding for RagTokenizer, with a lightweight regression test (no model downloads).
Local test: python -m pytest -q tests/models/rag/test_tokenization_rag_interface.py
It looks like 2 GitHub Actions workflows are awaiting maintainer approval — could a maintainer please approve them so CI can proceed?
Also, the bot suggested run-slow: rag before merge; happy to wait for that run if needed. Thanks!

Development

Successfully merging this pull request may close these issues.

RagTokenizer Missing patch_token_id, patch_token, and encode Functionality

2 participants