
Too long a sentence leads to an index error #42

Closed
lioyou opened this issue Aug 28, 2019 · 3 comments

Comments


lioyou commented Aug 28, 2019

Token indices sequence length is longer than the specified maximum sequence length for this model (1067 > 1024). Running this sequence through the model will result in indexing errors.
With the default configuration the maximum length is 1024, but train.py only enforces a minimum length and places no limit on the maximum. Should sequences also be truncated according to the length limit in the config file?

sublines = [full_tokenizer.tokenize(line) for line in sublines
            if len(line) > min_length]  # keep only sentences longer than min_length
# the warning is emitted during this conversion
sublines = [full_tokenizer.convert_tokens_to_ids(line) for line in sublines]
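If one did want to truncate as the question suggests, a minimal sketch of such a step might look like the following. The helper `truncate_to_max` is hypothetical (it is not part of the repository), and 1024 stands in for the model's maximum context length from the default config:

```python
def truncate_to_max(token_ids, max_len=1024):
    """Truncate a token-id sequence to the model's maximum context length.

    Applying this after convert_tokens_to_ids would silence the
    'Token indices sequence length is longer ...' warning, at the cost
    of discarding everything past position max_len.
    """
    return token_ids[:max_len]
```

As the replies below note, this truncation is not actually needed here, since the warning is harmless in this preprocessing step.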
@Morizeyao
Owner

No need to truncate; this warning can be ignored.

@nickchiang5121

Then... will the characters beyond the 1024-length limit still be used for training?

@Morizeyao reopened this Aug 29, 2019
@Morizeyao
Owner

Yes, they will be trained.
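A minimal sketch of why tokens beyond position 1024 still reach the model: the tokenized corpus is later sliced into fixed-size training windows, so later tokens simply land in later windows. The function name, the stride value, and the tail-handling are illustrative assumptions, not the exact code in train.py:

```python
def make_training_windows(token_ids, n_ctx=1024, stride=768):
    """Slice a long token-id sequence into overlapping windows of n_ctx tokens.

    Tokens beyond position n_ctx are not discarded: they appear in later
    windows, so the whole sequence is eventually trained on.
    """
    windows = []
    start = 0
    while start + n_ctx <= len(token_ids):
        windows.append(token_ids[start:start + n_ctx])
        start += stride
    # keep the tail so the final tokens are not dropped
    if start < len(token_ids):
        windows.append(token_ids[-n_ctx:])
    return windows
```

For the 1067-token sequence from the warning above, this would yield one window covering tokens 0-1023 plus a tail window covering the last 1024 tokens, so every token appears in some window.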
