Error on the CMeEE dataset #44

Closed
nlper01 opened this issue Jul 6, 2022 · 6 comments

nlper01 commented Jul 6, 2022

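[Image: screenshot of the error]
After debugging, I found that the error comes from this line:
_dist_inputs[i, j] = dis2idx[-_dist_inputs[i, j]] + 9
I don't know how to fix it.
My processed data looks like the following. For text that mixes Chinese and English, I also split the English into individual characters. I'm not sure whether that is the right way to handle it.
[Image: screenshot of the processed data]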

ljynlp (Owner) commented Jul 6, 2022

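Most likely the sentences are too long after processing and exceed the configured limit of 1000, which causes the error. If a sentence contains many English words, I don't recommend splitting the words into letters; otherwise the sentence can easily become too long for the program to run.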
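As a rough illustration of that advice, here is a minimal sketch of a splitter that keeps each Chinese character as its own token but leaves runs of ASCII letters and digits whole. The function name `split_mixed` and the character ranges are assumptions made for this example; they are not taken from the repository.

```python
import re

# Assumption: the basic CJK Unified Ideographs block covers the Chinese text.
# Order matters: try one CJK character, then a whole ASCII word or number,
# then any other single non-space character (punctuation and so on).
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+|\S")


def split_mixed(text):
    """Split mixed Chinese-English text without breaking English words apart."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    text = "患者CT检查"
    print(split_mixed(text))                  # words kept whole
    print(len(split_mixed(text)), len(text))  # token count vs. character count
```

If the tokenization changes, the entity indices in the annotations have to be recomputed against the new token list, since they refer to positions in it.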

nlper01 (Author) commented Jul 6, 2022

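> Most likely the sentences are too long after processing and exceed the configured limit of 1000, which causes the error. If a sentence contains many English words, I don't recommend splitting the words into letters; otherwise the sentence can easily become too long for the program to run.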

OK, I'll try deleting the English sentences. Also, can this 1000 be changed? I tried changing the 1000 here:
dis2idx = np.zeros((1000), dtype='int64')
but I still get the same error.
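For context, the lookup in question can be sketched in plain Python. This is only an illustration under stated assumptions: the log-scale bucketing in `build_table` and the helpers `bucket` and `fits` are hypothetical and use a plain list instead of a NumPy array; only the table length of 1000 and the `+ 9` offset come from the lines quoted in this thread.

```python
TABLE_SIZE = 1000  # table length quoted in the thread


def build_table(size=TABLE_SIZE):
    # Assumed log-scale buckets capped at 9; illustrative, not the repo's table.
    return [min(d.bit_length(), 9) for d in range(size)]


def bucket(table, i, j):
    """Bucket id for the signed distance between positions i and j."""
    dist = i - j
    if dist < 0:
        # Mirrors the "+ 9" offset in the line quoted earlier in the thread.
        return table[-dist] + 9
    return table[dist]


def fits(sentence_length, table_size=TABLE_SIZE):
    # The largest absolute distance in a sentence is sentence_length - 1.
    return sentence_length - 1 < table_size


if __name__ == "__main__":
    table = build_table()
    print(fits(1000), fits(1001))
    try:
        bucket(table, 0, 1000)  # distance of 1000 is past the end of the table
    except IndexError as exc:
        print("IndexError:", exc)
```

In this sketch, any pair of positions that is 1000 or more apart indexes past the end of the table, which is why sentence length matters for this line.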

ljynlp (Owner) commented Jul 6, 2022

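Even if you change this value, the sentences will still exceed BERT's 512-token limit and fail in the same way. It is best to simply remove the sentences that exceed the length limit.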
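One way to do that filtering is sketched here. It assumes each sample is a dict with a `"sentence"` list of tokens, which is an assumption about the data format, and `split_by_length` and `default_count` are hypothetical names. The default counter treats every list element as one model token; a real BERT tokenizer may split an element into several word pieces, so passing a counter based on the actual tokenizer would be more accurate.

```python
MAX_BERT_TOKENS = 512  # limit cited in the thread
SPECIAL_TOKENS = 2     # [CLS] and [SEP] added around each sentence


def default_count(tokens):
    # Assumption: one model token per list element. Replace this with a
    # tokenizer-based count if elements can be split into word pieces.
    return len(tokens)


def split_by_length(samples, count_tokens=default_count, limit=MAX_BERT_TOKENS):
    """Return (kept, dropped) lists according to the token limit."""
    kept, dropped = [], []
    for sample in samples:
        total = count_tokens(sample["sentence"]) + SPECIAL_TOKENS
        if total <= limit:
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


if __name__ == "__main__":
    data = [{"sentence": ["字"] * 510}, {"sentence": ["字"] * 511}]
    kept, dropped = split_by_length(data)
    print(len(kept), len(dropped))
```

Returning the dropped samples as well makes it easy to inspect how much data is lost, or to split those sentences into shorter ones instead of discarding them.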

nlper01 (Author) commented Jul 6, 2022

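> Even if you change this value, the sentences will still exceed BERT's 512-token limit and fail in the same way. It is best to simply remove the sentences that exceed the length limit.

OK, thanks.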

nlper01 (Author) commented Jul 8, 2022

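> Even if you change this value, the sentences will still exceed BERT's 512-token limit and fail in the same way. It is best to simply remove the sentences that exceed the length limit.

[Image: screenshot of the new error]
Hello, I have removed every sentence longer than 500 tokens, but now I get a new out-of-bounds error. How can I fix this? Could it be because my dataset has more entity types than the number originally set in the code?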

ljynlp (Owner) commented Jul 9, 2022

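There may be a problem with your data processing. It is best to check, for every sample, that each entity's indices match the corresponding content in the text.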
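A check along those lines might look like the following sketch. The sample layout (a `"sentence"` token list plus a `"ner"` list whose items carry an `"index"` list) is an assumption about the processed data, and the optional `"text"` field holding the expected mention string is purely hypothetical; `check_sample` is likewise a name made up for this example.

```python
def check_sample(sample):
    """Return a list of human-readable problems found in one sample."""
    problems = []
    tokens = sample["sentence"]
    n = len(tokens)
    for k, ent in enumerate(sample.get("ner", [])):
        idx = ent["index"]
        if not idx:
            problems.append(f"entity {k}: empty index list")
            continue
        if any(i < 0 or i >= n for i in idx):
            problems.append(f"entity {k}: index out of range for length {n}")
            continue
        # Hypothetical "text" field: the mention string the span should spell.
        if "text" in ent:
            surface = "".join(tokens[i] for i in idx)
            expected = ent["text"].replace(" ", "")
            if surface != expected:
                problems.append(
                    f"entity {k}: span {surface!r} != expected {expected!r}"
                )
    return problems


if __name__ == "__main__":
    sample = {
        "sentence": ["患", "者", "CT"],
        "ner": [
            {"index": [2], "type": "pro", "text": "CT"},
            {"index": [3], "type": "pro"},  # points past the last token
        ],
    }
    for problem in check_sample(sample):
        print(problem)
```

Running this over the whole dataset before training should surface indices that were computed against a different tokenization or that point past the end of a sentence.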

nlper01 closed this as completed Jul 24, 2022