
Example code results in input_id's of varying lengths #29

Open
plamb-viso opened this issue Feb 27, 2023 · 0 comments
plamb-viso commented Feb 27, 2023

I followed #17 (comment) in order to load the UdopTokenizer. I then followed the code examples for tokenizing text provided in rvlcdip.py

This amounts to calling tokenizer.tokenize(text) on each word, appending the resulting sub_tokens to a text_list, and then calling tokenizer.convert_tokens_to_ids on that text_list to get the input_ids. However, this always produces sequences that are longer or shorter than 512, even though tokenizer_config.json sets a "model_max_length": 512 param.
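To make the problem concrete, here is a hedged sketch of the per-word tokenization flow described above. It uses a stub tokenizer (StubTokenizer, encode_words, and the 3-character split are all hypothetical stand-ins, not code from the repo or the real UdopTokenizer) so it runs without a checkpoint; the point is only that nothing in this loop caps the output at model_max_length:

```python
# Stand-in for the loaded tokenizer: splits each word into 3-char sub-tokens
# and hashes them to fake ids. Purely illustrative; the real code would use
# the UdopTokenizer loaded per the linked comment.
class StubTokenizer:
    def tokenize(self, word):
        return [word[i:i + 3] for i in range(0, len(word), 3)]

    def convert_tokens_to_ids(self, tokens):
        return [abs(hash(t)) % 32000 for t in tokens]


tokenizer = StubTokenizer()


def encode_words(words):
    text_list = []
    for word in words:
        sub_tokens = tokenizer.tokenize(word)  # one word -> N sub-tokens
        text_list.extend(sub_tokens)           # appended to text_list
    return tokenizer.convert_tokens_to_ids(text_list)


# Sequence length depends entirely on the document's word count and how each
# word splits -- short documents come out short, long ones come out long.
short_ids = encode_words(["invoice", "total"])
long_ids = encode_words(["internationalization"] * 300)
print(len(short_ids), len(long_ids))
```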

Is this provided example code the expected way to encode text?

(It makes sense that the provided code doesn't pad/truncate on its own, but it's odd to me that rvlcdip can fine-tune correctly without a step in this tokenization code that ensures the text_list is 512 tokens long.)

EDIT: I just noticed this pad_tokens function, but it doesn't appear to be used anywhere. Is it applied automatically once RvlCdipDataset() is created? Also, it doesn't appear to do any truncation.
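For what it's worth, a minimal sketch of the missing step, combining the truncation noted above with pad_tokens-style padding (the function name, pad_id=0, and max_len=512 are my assumptions, not values taken from the repo):

```python
def pad_and_truncate(input_ids, max_len=512, pad_id=0):
    """Force input_ids to exactly max_len tokens (assumed helper, not repo code)."""
    if len(input_ids) > max_len:
        return input_ids[:max_len]                        # truncate long sequences
    return input_ids + [pad_id] * (max_len - len(input_ids))  # pad short ones


print(len(pad_and_truncate(list(range(600)))))  # long input -> truncated to 512
print(len(pad_and_truncate(list(range(10)))))   # short input -> padded to 512
```

Note that standard Hugging Face tokenizers can also do this in one call via `tokenizer(words, padding="max_length", truncation=True, max_length=512, ...)`, if the UDOP tokenizer supports that path.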
