
[WIP?] Trying to fix Transformers tokenizer issues when models have extended vocabulary and/or special tokens #766

Merged
merged 3 commits into main from spectokens on Apr 19, 2024

Conversation

@Harsha-Nori (Collaborator) commented Apr 18, 2024

Looping over len(tokenizer) causes issues when we have to fall back to an underlying sentencepiece model, as the tokenizer.sp_model doesn't have context for the extended vocabulary. I think this is part of the puzzle to solve some persistent issues like #434?

This is probably broken right now; need to fix it up.
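For concreteness, a rough sketch of the mismatch (illustrative only, not the PR's exact code; it assumes a slow transformers tokenizer that exposes the raw sentencepiece model as tokenizer.sp_model, and get_piece_size / id_to_piece come from the sentencepiece API):

```python
# len(tokenizer) counts the base vocab *plus* any added/special tokens...
total_vocab = len(tokenizer)

# ...but the raw sentencepiece model only knows the base vocabulary.
sp_vocab = tokenizer.sp_model.get_piece_size()

for token_id in range(total_vocab):
    if token_id < sp_vocab:
        # base vocabulary: safe to ask the sentencepiece model directly
        piece = tokenizer.sp_model.id_to_piece(token_id)
    else:
        # extended/special tokens live on the transformers wrapper,
        # so asking sp_model for them fails or returns garbage
        piece = tokenizer.convert_ids_to_tokens(token_id)
```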

@Harsha-Nori (Collaborator, Author) commented Apr 18, 2024

Also, a random thought: the sp_model attribute only shows up if users have sentencepiece installed. Probably not a good idea to take a hard dependency on it, but it'd be nice to alert people somehow that it might be necessary/useful? Not sure of the right way to do this off the top of my head, but I'll revisit it with fresh eyes in the morning.
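Something like this might be the shape of it (a sketch only; the helper name, warning text, and placement are hypothetical and not part of this PR):

```python
import warnings

def _get_sp_model(tokenizer):
    """Return the raw sentencepiece model if available, warning when the
    optional dependency is likely the thing that's missing."""
    if hasattr(tokenizer, "sp_model"):
        return tokenizer.sp_model
    try:
        import sentencepiece  # noqa: F401  -- optional grandchild dependency
    except ImportError:
        warnings.warn(
            "Tokenizer does not expose `sp_model`; installing the optional "
            "`sentencepiece` package may be needed for guidance to read the "
            "underlying vocabulary (`pip install sentencepiece`)."
        )
    return None
```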

@slundberg @riedgar-ms

@codecov-commenter commented Apr 18, 2024

Codecov Report

Attention: Patch coverage is 50.00000%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 65.65%. Comparing base (6167225) to head (b34a20b).

Files                                          | Patch % | Lines
guidance/models/transformers/_transformers.py  | 50.00%  | 4 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #766      +/-   ##
==========================================
- Coverage   69.67%   65.65%   -4.02%     
==========================================
  Files          55       55              
  Lines        4059     4062       +3     
==========================================
- Hits         2828     2667     -161     
- Misses       1231     1395     +164     


@riedgar-ms (Collaborator) commented:
On sentencepiece... I only found out about that by reading through the mistral issue a couple of days ago. Agree that it would be good to warn, since it appears to be not just a grandchild dependency, but an optional grandchild of an optional child.

@Harsha-Nori changed the title from "[WIP?] Trying to fix Transformers tokenizer issues when models have extended/vocabulary and/or special tokens" to "[WIP?] Trying to fix Transformers tokenizer issues when models have extended vocabulary and/or special tokens" on Apr 19, 2024
@slundberg (Contributor) commented:
LGTM. Thanks!

@slundberg slundberg merged commit 9862750 into main Apr 19, 2024
83 checks passed
@Harsha-Nori Harsha-Nori deleted the spectokens branch April 19, 2024 16:49