feat(tokenizer): Update post-processor when special tokens are modified in TokenizersBackend#43422

Closed
harshaljanjani wants to merge 2 commits into huggingface:main from harshaljanjani:fix/tokenizers-backend-special-token-update

@harshaljanjani
Contributor

@harshaljanjani harshaljanjani commented Jan 22, 2026

What does this PR do?

The following fixes are made in this PR:

→ Override __setattr__ in TokenizersBackend to automatically update the post-processor when special tokens (bos_token, eos_token, etc.) are modified at runtime. This ensures modified special tokens are correctly applied during encoding.
→ Also re-enable the previously skipped test_users_can_modify_bos test.
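
A minimal sketch of the `__setattr__` approach described above, using a toy stand-in class rather than the real `TokenizersBackend`. The names `ToyTokenizer`, `_SPECIAL_TOKEN_ATTRS`, and `_template` are hypothetical and exist only for illustration; the actual diff in this PR may look different.

```python
class ToyTokenizer:
    """Toy stand-in for a tokenizer whose post-processor caches special tokens."""

    # Hypothetical: attributes whose modification should refresh the post-processor.
    _SPECIAL_TOKEN_ATTRS = {"bos_token", "eos_token"}

    def __init__(self, bos_token="<s>", eos_token="</s>"):
        self.bos_token = bos_token
        self.eos_token = eos_token

    def __setattr__(self, name, value):
        super().__setattr__(name, value)
        # Refresh only for special tokens, and only once all of them exist
        # (during __init__ the first assignment happens before the second).
        if name in self._SPECIAL_TOKEN_ATTRS and all(
            hasattr(self, attr) for attr in self._SPECIAL_TOKEN_ATTRS
        ):
            self.update_post_processor()

    def update_post_processor(self):
        # Snapshot the current special tokens, mimicking a cached template.
        self._template = (self.bos_token, self.eos_token)

    def encode(self, text):
        # Encoding reads the cached template, not the live attributes.
        bos, eos = self._template
        return [bos, *text.split(), eos]


tok = ToyTokenizer()
tok.bos_token = "<bos>"
print(tok.encode("hello world"))  # ['<bos>', 'hello', 'world', '</s>']
```

In this sketch the override is limited to a fixed set of attribute names, which keeps the refresh from firing on unrelated assignments.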

🚨 Co-authored with https://claude.ai/

Fixes #43421.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt2

@harshaljanjani harshaljanjani marked this pull request as ready for review January 22, 2026 19:39
@Rocketknight1
Member

Hmm, I'm very wary of this. I know Claude did manage to fix the failing test here but a big override on __setattr__ feels like the kind of thing that will create lots of maintenance headaches in future, especially when the issue is not that significant!

@harshaljanjani
Contributor Author

@Rocketknight1 Agreed; completely understand the maintenance overhead concern!
I was wondering how we could address the underlying issue, though. There’s a skipped test that currently flags this behavior as a bug, and when users set tokenizer.bos_token_id, having it silently ignored during encoding (with no mention in the docs) doesn’t seem quite right. To that end, the scope of the fix is fairly narrow: it addresses this behavior without introducing regressions.
If you’d prefer not to support runtime special token modification, I’m also happy to clarify the behavior in the docs or address the skipped test so it’s not left as a FIXME.
Happy to defer to your judgment; thanks for your time!

Collaborator

@ArthurZucker ArthurZucker left a comment

This can make sense, actually. I don't want us to rush, and it would need more testing! It's convenient, but I don't know how many people init a tokenizer, change the eos / bos first, and then use it, because we do call update_post_processor after init.
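
The behavior under discussion can be illustrated with a toy class. This is a hypothetical stand-in, not the real backend: `StaticToyTokenizer` and `_template` are invented names. It assumes the post-processor is built once at the end of init, so a later attribute change is not picked up until the refresh is called again.

```python
class StaticToyTokenizer:
    """Toy tokenizer that builds its post-processor once, at the end of init."""

    def __init__(self, bos_token="<s>", eos_token="</s>"):
        self.bos_token = bos_token
        self.eos_token = eos_token
        self.update_post_processor()  # only automatic refresh in this sketch

    def update_post_processor(self):
        # Snapshot the current special tokens into a cached template.
        self._template = (self.bos_token, self.eos_token)

    def encode(self, text):
        # Encoding reads the cached template, not the live attributes.
        bos, eos = self._template
        return [bos, *text.split(), eos]


tok = StaticToyTokenizer()
tok.bos_token = "<bos>"        # modified after init
stale = tok.encode("hi")       # still uses the template built at init
tok.update_post_processor()    # explicit refresh
fresh = tok.encode("hi")
print(stale)  # ['<s>', 'hi', '</s>']
print(fresh)  # ['<bos>', 'hi', '</s>']
```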

@harshaljanjani
Contributor Author

That’s fair; happy to iterate on the approach before moving forward. At the moment, is there a specific direction that seems right for this PR?

@ArthurZucker
Collaborator

I just don't think we need this; let's just add a feature request and see if people want it.

@harshaljanjani
Contributor Author

harshaljanjani commented Feb 2, 2026

Sounds good; let me know how you’d like me to proceed, whether that’s closing this PR or continuing the discussion in the linked issue; happy to help file or flesh out the feature request!
Update: Done; I’ll make updates here only if needed :)

@harshaljanjani harshaljanjani changed the title from "fix(tokenizer): Update post-processor when special tokens are modified in TokenizersBackend" to "feat(tokenizer): Update post-processor when special tokens are modified in TokenizersBackend" Feb 16, 2026
@harshaljanjani
Contributor Author

Closing this PR for now as this fix doesn’t seem to be warranted and the issue has been marked stale, but it may be revisited in the future!

@harshaljanjani harshaljanjani deleted the fix/tokenizers-backend-special-token-update branch March 26, 2026 06:32

Development

Successfully merging this pull request may close these issues.

[FEATURE] TokenizersBackend does not update post-processor when special tokens are modified at runtime

3 participants