
Example code results in input_id's of varying lengths #29

Open
plamb-viso opened this issue Feb 27, 2023 · 0 comments
plamb-viso commented Feb 27, 2023

I followed #17 (comment) in order to load the UdopTokenizer. I then followed the code examples for tokenizing text provided in rvlcdip.py

This amounts to calling tokenizer.tokenize(text) on each word, appending the resulting sub_tokens to a text_list, and then calling tokenizer.convert_tokens_to_ids on that text_list to get the input_ids. However, this always produces sequences that are longer or shorter than 512, even though tokenizer_config.json sets a "model_max_length": 512 param.
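To make the problem concrete, here is a hedged sketch of the per-word tokenization flow described above. It uses a stub tokenizer (StubTokenizer, encode_words, and the 3-character split are all hypothetical stand-ins, not code from the repo or the real UdopTokenizer) so it runs without a checkpoint; the point is only that nothing in this loop caps the output at model_max_length:

```python
# Stand-in for the loaded tokenizer: splits each word into 3-char sub-tokens
# and hashes them to fake ids. Purely illustrative; the real code would use
# the UdopTokenizer loaded per the linked comment.
class StubTokenizer:
    def tokenize(self, word):
        return [word[i:i + 3] for i in range(0, len(word), 3)]

    def convert_tokens_to_ids(self, tokens):
        return [abs(hash(t)) % 32000 for t in tokens]


tokenizer = StubTokenizer()


def encode_words(words):
    text_list = []
    for word in words:
        sub_tokens = tokenizer.tokenize(word)  # one word -> N sub-tokens
        text_list.extend(sub_tokens)           # appended to text_list
    return tokenizer.convert_tokens_to_ids(text_list)


# Sequence length depends entirely on the document's word count and how each
# word splits -- short documents come out short, long ones come out long.
short_ids = encode_words(["invoice", "total"])
long_ids = encode_words(["internationalization"] * 300)
print(len(short_ids), len(long_ids))
```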

Is this provided example code the expected way to encode text?

(It makes sense that the provided code doesn't pad/truncate on its own, but it's odd to me that rvlcdip can fine-tune correctly without a step in this tokenization code that ensures the text_list is 512 tokens long.)

EDIT: I just noticed this pad_tokens function, but it doesn't appear to be used anywhere. Is it applied automatically once RvlCdipDataset() is created? Also, it doesn't appear to do any truncation.
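For what it's worth, a minimal sketch of the missing step, combining the truncation noted above with pad_tokens-style padding (the function name, pad_id=0, and max_len=512 are my assumptions, not values taken from the repo):

```python
def pad_and_truncate(input_ids, max_len=512, pad_id=0):
    """Force input_ids to exactly max_len tokens (assumed helper, not repo code)."""
    if len(input_ids) > max_len:
        return input_ids[:max_len]                        # truncate long sequences
    return input_ids + [pad_id] * (max_len - len(input_ids))  # pad short ones


print(len(pad_and_truncate(list(range(600)))))  # long input -> truncated to 512
print(len(pad_and_truncate(list(range(10)))))   # short input -> padded to 512
```

Note that standard Hugging Face tokenizers can also do this in one call via `tokenizer(words, padding="max_length", truncation=True, max_length=512, ...)`, if the UDOP tokenizer supports that path.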
