feat(tokenizer): Update post-processor when special tokens are modified in TokenizersBackend#43422
Conversation
|
[For maintainers] Suggested jobs to run (before merge) run-slow: gpt2 |
|
Hmm, I'm very wary of this. I know Claude did manage to fix the failing test here but a big override on |
|
@Rocketknight1 Agreed; completely understand the maintenance overhead concern! |
ArthurZucker
left a comment
There was a problem hiding this comment.
this can make sense actually. I don't want us to rush and it would need more testing ! Its convenient but IDK how many people init a tokenizer, change the eos / bos first then use it. Cuz we do update_post_processor after init
|
That’s fair; happy to iterate on the approach before moving forward. At the moment, is there a specific direction that seems right for this PR? |
|
I just don't think we need this, let's just add feature request and see if people want it |
|
Sounds good; let me know how you’d like me to proceed, whether that’s closing this PR or continuing the discussion in the linked issue; happy to help file or flesh out the feature request! |
|
Closing this PR for now as this fix doesn’t seem to be warranted and the issue has been marked stale, but it may be revisited in the future! |
What does this PR do?
The following fixes are made in this PR:
→ Override
__setattr__inTokenizersBackendto automatically update the post-processor when special tokens (bos_token,eos_token, etc.) are modified at runtime. This ensures modified special tokens are correctly applied during encoding.→ Also re-enable the previously skipped
test_users_can_modify_bostest.🚨 Co-authored with https://claude.ai/
Fixes #43421.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.