[SigLIP] Add fast tokenizer #29969
ref for #33751 (comment)
should be removed
to remove
Not sure we even need that class, no? We could just use a PreTrainedTokenizerFast.
Oh, I didn't consider we could do that! Is there an example I can reference? Not sure what to do with the functions copied over from T5 here.
Also, looking more into the functions here, would it be better to move common functions like save_vocabulary (duplicated in 15 _fast files) to PreTrainedTokenizerFast?
Just a question: wouldn't that confuse users if there's no dedicated class with the name of the model?
@NielsRogge yes, I see your point; I also had the same thought. Do we have other fast tokenizers we support without a dedicated class? @ArthurZucker
We have quite a lot of other models that just use PreTrainedTokenizerFast, including Llama (Llama 3), all the Mambas, etc. Tokenizers are more prone to change than models (you could have Mamba with LlamaTokenizer), so it makes more sense to deprecate slow and model-specific ones.
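For illustration, a minimal sketch of what this looks like in practice; the Llama 3 checkpoint ID is only an example (and is gated on the Hub):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Llama 3 ships no model-specific tokenizer class: AutoTokenizer resolves
# the checkpoint to the generic fast tokenizer.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(type(tok))  # <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>

# The generic class can also be instantiated directly from a tokenizer.json,
# with no subclass involved.
tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```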
You don't need them; they are built on top of PreTrainedTokenizerFast, plus we can embed stuff inside the fast tokenizer itself.
@ArthurZucker Sorry, I'm not really understanding: do you mean to just remove this file entirely and not worry about the functions? (Or do we need to embed them somewhere?)
yeah just remove it entirely
@ArthurZucker and would we have to add a tokenizer.json to the hub?
added the fast tokenizer.json file to https://huggingface.co/google/siglip-base-patch16-224/discussions/4
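For reference, a sketch of one way such a tokenizer.json can be produced from the existing slow tokenizer, assuming the converter for SigLIP is registered in transformers.convert_slow_tokenizer (which is part of what this PR adds):

```python
from transformers import SiglipTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Load the slow, SentencePiece-based tokenizer from the Hub checkpoint.
slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")

# Convert it to a `tokenizers` Tokenizer and serialize the result as the
# tokenizer.json file that PreTrainedTokenizerFast consumes.
convert_slow_tokenizer(slow).save("tokenizer.json")
```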
Lots of # Copied from comments are missing here as well.
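(For context: this refers to the repository's # Copied from markers, which utils/check_copies.py uses to keep duplicated code in sync with its source. The class name, method, and source path below are illustrative, assuming code copied over from the T5 fast tokenizer.)

```python
from typing import Optional, Tuple

from transformers import PreTrainedTokenizerFast

class SiglipTokenizerFast(PreTrainedTokenizerFast):  # hypothetical class for illustration
    # Copied from transformers.models.t5.tokenization_t5_fast.T5TokenizerFast.save_vocabulary
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        # The body must stay a verbatim copy of the referenced T5 method;
        # utils/check_copies.py (run via `make repo-consistency`) flags any drift.
        ...  # body elided in this sketch
```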