
Too long a sentence leads to an index error #42

Closed
lioyou opened this issue Aug 28, 2019 · 3 comments

Comments


lioyou commented Aug 28, 2019

Token indices sequence length is longer than the specified maximum sequence length for this model (1067 > 1024). Running this sequence through the model will result in indexing errors.
With the default configuration the maximum length is 1024, but train.py only enforces a minimum length and places no limit on the maximum. Should sequences also be truncated according to the length limit in the config file?

sublines = [full_tokenizer.tokenize(line) for line in sublines
            if len(line) > min_length]  # keep only sentences longer than min_length
# the warning is emitted during this conversion
sublines = [full_tokenizer.convert_tokens_to_ids(line) for line in sublines]
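If one did want to truncate as the question suggests, a minimal sketch of such a step might look like the following. The helper `truncate_to_max` is hypothetical (it is not part of the repository), and 1024 stands in for the model's maximum context length from the default config:

```python
def truncate_to_max(token_ids, max_len=1024):
    """Truncate a token-id sequence to the model's maximum context length.

    Applying this after convert_tokens_to_ids would silence the
    'Token indices sequence length is longer ...' warning, at the cost
    of discarding everything past position max_len.
    """
    return token_ids[:max_len]
```

As the replies below note, this truncation is not actually needed here, since the warning is harmless in this preprocessing step.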
@Morizeyao
Owner

No need to truncate; this warning can be ignored.

@nickchiang5121

Then... will the characters beyond the 1024-length limit still be used for training?

@Morizeyao reopened this Aug 29, 2019
@Morizeyao
Owner

Yes, they will be trained.
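A minimal sketch of why tokens beyond position 1024 still reach the model: the tokenized corpus is later sliced into fixed-size training windows, so later tokens simply land in later windows. The function name, the stride value, and the tail-handling are illustrative assumptions, not the exact code in train.py:

```python
def make_training_windows(token_ids, n_ctx=1024, stride=768):
    """Slice a long token-id sequence into overlapping windows of n_ctx tokens.

    Tokens beyond position n_ctx are not discarded: they appear in later
    windows, so the whole sequence is eventually trained on.
    """
    windows = []
    start = 0
    while start + n_ctx <= len(token_ids):
        windows.append(token_ids[start:start + n_ctx])
        start += stride
    # keep the tail so the final tokens are not dropped
    if start < len(token_ids):
        windows.append(token_ids[-n_ctx:])
    return windows
```

For the 1067-token sequence from the warning above, this would yield one window covering tokens 0-1023 plus a tail window covering the last 1024 tokens, so every token appears in some window.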
