
[WIP?] Trying to fix Transformers tokenizer issues when models have extended vocabulary and/or special tokens #766

Merged
merged 3 commits into main from spectokens on Apr 19, 2024

Conversation

@Harsha-Nori (Collaborator) commented Apr 18, 2024

Looping over len(tokenizer) causes issues when we have to fall back to an underlying sentencepiece model, as the tokenizer.sp_model doesn't have context for the extended vocabulary. I think this is part of the puzzle to solve some persistent issues like #434?

This is probably broken right now; need to fix it up.
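For concreteness, a rough sketch of the mismatch (illustrative only, not the PR's exact code; it assumes a slow transformers tokenizer that exposes the raw sentencepiece model as tokenizer.sp_model, and get_piece_size / id_to_piece come from the sentencepiece API):

```python
# len(tokenizer) counts the base vocab *plus* any added/special tokens...
total_vocab = len(tokenizer)

# ...but the raw sentencepiece model only knows the base vocabulary.
sp_vocab = tokenizer.sp_model.get_piece_size()

for token_id in range(total_vocab):
    if token_id < sp_vocab:
        # base vocabulary: safe to ask the sentencepiece model directly
        piece = tokenizer.sp_model.id_to_piece(token_id)
    else:
        # extended/special tokens live on the transformers wrapper,
        # so asking sp_model for them fails or returns garbage
        piece = tokenizer.convert_ids_to_tokens(token_id)
```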

@Harsha-Nori (Collaborator, Author) commented Apr 18, 2024

Also, a random thought: the sp_model attribute only shows up if users have sentencepiece installed. Probably not a good idea to take a hard dependency on it, but it'd be nice to alert people somehow that it might be necessary/useful? Not sure of the right way to do this off the top of my head, but I'll revisit it with fresh eyes in the morning.
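Something like this might be the shape of it (a sketch only; the helper name, warning text, and placement are hypothetical and not part of this PR):

```python
import warnings

def _get_sp_model(tokenizer):
    """Return the raw sentencepiece model if available, warning when the
    optional dependency is likely the thing that's missing."""
    if hasattr(tokenizer, "sp_model"):
        return tokenizer.sp_model
    try:
        import sentencepiece  # noqa: F401  -- optional grandchild dependency
    except ImportError:
        warnings.warn(
            "Tokenizer does not expose `sp_model`; installing the optional "
            "`sentencepiece` package may be needed for guidance to read the "
            "underlying vocabulary (`pip install sentencepiece`)."
        )
    return None
```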

@slundberg @riedgar-ms

@codecov-commenter commented Apr 18, 2024

Codecov Report

Attention: Patch coverage is 50.00000%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 65.65%. Comparing base (6167225) to head (b34a20b).

Files                                          | Patch % | Lines
guidance/models/transformers/_transformers.py  | 50.00%  | 4 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #766      +/-   ##
==========================================
- Coverage   69.67%   65.65%   -4.02%     
==========================================
  Files          55       55              
  Lines        4059     4062       +3     
==========================================
- Hits         2828     2667     -161     
- Misses       1231     1395     +164     


@riedgar-ms (Collaborator) commented:
On sentencepiece... I only found out about that by reading through the mistral issue a couple of days ago. Agree that it would be good to warn, since it appears to be not just a grandchild dependency, but an optional grandchild of an optional child.

@Harsha-Nori changed the title from "[WIP?] Trying to fix Transformers tokenizer issues when models have extended/vocabulary and/or special tokens" to "[WIP?] Trying to fix Transformers tokenizer issues when models have extended vocabulary and/or special tokens" on Apr 19, 2024
@slundberg (Contributor) commented:
LGTM. Thanks!

@slundberg slundberg merged commit 9862750 into main Apr 19, 2024
83 checks passed
@Harsha-Nori Harsha-Nori deleted the spectokens branch April 19, 2024 16:49