I followed #17 (comment) in order to load the `UdopTokenizer`. I then followed the code examples for tokenizing text provided in `rvlcdip.py`.
This amounts to calling `tokenizer.tokenize(text)` on a word `text`, appending the resulting `sub_tokens` to a `text_list`, and then calling `tokenizer.convert_tokens_to_ids` on that `text_list` to get `input_ids`. However, this always results in lengths that are longer or shorter than 512, despite the fact that `tokenizer_config.json` sets `"model_max_length": 512`.
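To make the flow concrete, here is a minimal sketch of the steps described above. `ToyTokenizer` is a hypothetical stand-in, not the real `UdopTokenizer`; it only mimics the two methods named above (`tokenize` and `convert_tokens_to_ids`) so the snippet runs on its own. The helper name `encode_words` is also an assumption for illustration. The point is that nothing in this loop constrains the output length.

```python
class ToyTokenizer:
    """Hypothetical stand-in for UdopTokenizer (assumption, not the real class)."""

    def __init__(self):
        self.vocab = {}

    def tokenize(self, text):
        # Fake sub-word split: chunks of up to 3 characters.
        return [text[i:i + 3] for i in range(0, len(text), 3)]

    def convert_tokens_to_ids(self, tokens):
        # Assign a fresh id to each token the first time it is seen.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]


def encode_words(tokenizer, words):
    """Tokenize word by word, collect the sub-tokens, then convert them to ids."""
    text_list = []
    for text in words:
        sub_tokens = tokenizer.tokenize(text)
        text_list.extend(sub_tokens)
    return tokenizer.convert_tokens_to_ids(text_list)


words = ["invoice", "number", "12345"]
input_ids = encode_words(ToyTokenizer(), words)
# The length depends only on the input words, never on model_max_length.
print(len(input_ids))
```

With this flow, `len(input_ids)` simply follows the number of sub-tokens in the document, which matches what I am seeing.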
Is this provided example code the expected way to encode text?
(It makes sense that the provided code doesn't pad/truncate correctly, but it's odd to me that rvlcdip can fine-tune correctly without a step in this tokenization piece that ensures the `text_list` is 512 tokens long.)
EDIT: I just noticed this `pad_tokens` function, but it doesn't appear to be used anywhere. Is it used automatically once `RvlCdipDataset()` is created? Also, it doesn't appear to do any truncation.
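For reference, this is roughly the explicit step I expected to find somewhere in the pipeline. It is only a sketch under assumptions: `pad_or_truncate` is a hypothetical helper, not a function from the repository, and `pad_id=0` is an assumed padding id that should be replaced by whatever the tokenizer actually uses.

```python
def pad_or_truncate(input_ids, max_length=512, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long.

    pad_id=0 is an assumption; substitute the tokenizer's real padding id.
    """
    truncated = list(input_ids[:max_length])   # drop anything past max_length
    n_pad = max_length - len(truncated)        # 0 when the input was long enough
    attention_mask = [1] * len(truncated) + [0] * n_pad
    return truncated + [pad_id] * n_pad, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000)), max_length=512)
print(len(short_ids), len(long_ids))
```

Sequences shorter than `max_length` are right-padded and longer ones are cut off, so both calls return lists of the requested length.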