
Corpus problem (语料问题) #3

Closed · chiangandy opened this issue Jul 26, 2019 · 3 comments

chiangandy commented Jul 26, 2019

I downloaded the Doupo Cangqiong (斗破苍穹) corpus, but it is a plain text file rather than a JSON file.
So I modified train.py like this:

# doupo = json.load(f)
doupo = f.read()

But then I got this warning:

W0726 13:58:53.042458 4556223936 tokenization.py:126] Token indices sequence length is longer than the specified maximum sequence length for this BERT model (5340786 > 512). Running this sequence through BERT will result in indexing errors

What is causing this?
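
For context, a minimal sketch of the kind of change described above, assuming the corpus sits at a path like data/doupo.txt (the path and the surrounding code are illustrative, not the repo's exact train.py):

# Read a plain-text corpus instead of a JSON one.
# 'data/doupo.txt' is an assumed path for illustration.
with open('data/doupo.txt', 'r', encoding='utf8') as f:
    # doupo = json.load(f)  # original line, expects a JSON corpus
    doupo = f.read()        # plain text: read the whole file as one string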

Morizeyao (Owner) commented Jul 26, 2019

Hi, you can safely ignore this. It is a self-check done by the tokenizer in pytorch-transformers. You can either edit that library's source to turn the warning off, or just ignore it.
Hmm, there is one more option: at line 16 of train.py, set

full_tokenizer.max_len = 1e5

That works too, it's just a bit ugly.
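
A minimal sketch of that workaround, assuming the tokenizer is built the way older pytorch-transformers code does (the BertTokenizer import and the vocab path are assumptions, not necessarily the repo's exact setup):

from pytorch_transformers import BertTokenizer

# Vocab path is illustrative; point it at the repo's actual vocab file.
full_tokenizer = BertTokenizer(vocab_file='cache/vocab_small.txt')
# Raise the tokenizer's length limit so feeding a multi-megabyte corpus
# through it in one call no longer trips the max-length self-check.
full_tokenizer.max_len = 1e5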

chiangandy (Author)

Thanks a lot for the reply.
One more question about the environment: how should the execution environment be configured? It won't run on my GTX 1070 with 4 GB of RAM; it reports that memory is insufficient. Since typical consumer cards these days only have 8 GB of VRAM, can an ordinary graphics card run this BERT model?

Morizeyao (Owner)

Generally, at least a 1080Ti or a 2080Ti is recommended.
For cards with less VRAM, you can edit the contents of config/model_config_small.json to shrink the model before training.
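
For illustration, a sketch of what a shrunken config might look like. The keys follow the standard GPT-2 config format used by pytorch-transformers; the reduced values below are assumptions for a small-VRAM card, not the repo's shipped defaults, and vocab_size must match your tokenizer's vocabulary:

{
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 512,
  "n_embd": 384,
  "n_head": 6,
  "n_layer": 6,
  "n_positions": 512,
  "vocab_size": 13317
}

Reducing n_layer, n_embd, and n_ctx cuts both the parameter count and the activation memory, which is usually what makes training fit on a smaller card.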

Morizeyao pushed a commit that referenced this issue Nov 7, 2020