
Corpus problem (语料问题) #3

Closed · chiangandy opened this issue Jul 26, 2019 · 3 comments

chiangandy commented Jul 26, 2019

I downloaded the Doupo Cangqiong (斗破苍穹) corpus, but it is a plain text file rather than a JSON file.
So I modified train.py like this:

# doupo = json.load(f)
doupo = f.read()

But then I got this warning:

W0726 13:58:53.042458 4556223936 tokenization.py:126] Token indices sequence length is longer than the specified maximum sequence length for this BERT model (5340786 > 512). Running this sequence through BERT will result in indexing errors

What is causing this?
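
For context, a minimal sketch of the kind of change described above, assuming the corpus sits at a path like data/doupo.txt (the path and the surrounding code are illustrative, not the repo's exact train.py):

# Read a plain-text corpus instead of a JSON one.
# 'data/doupo.txt' is an assumed path for illustration.
with open('data/doupo.txt', 'r', encoding='utf8') as f:
    # doupo = json.load(f)  # original line, expects a JSON corpus
    doupo = f.read()        # plain text: read the whole file as one string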

Morizeyao (Owner) commented Jul 26, 2019

Hi, you can safely ignore this. It is a self-check done by the tokenizer in pytorch-transformers. You can either edit that library's source to turn the warning off, or just ignore it.
Hmm, there is one more option: at line 16 of train.py, set

full_tokenizer.max_len = 1e5

That works too, it's just a bit ugly.
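
A minimal sketch of that workaround, assuming the tokenizer is built the way older pytorch-transformers code does (the BertTokenizer import and the vocab path are assumptions, not necessarily the repo's exact setup):

from pytorch_transformers import BertTokenizer

# Vocab path is illustrative; point it at the repo's actual vocab file.
full_tokenizer = BertTokenizer(vocab_file='cache/vocab_small.txt')
# Raise the tokenizer's length limit so feeding a multi-megabyte corpus
# through it in one call no longer trips the max-length self-check.
full_tokenizer.max_len = 1e5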

chiangandy (Author)

Thanks a lot for the reply.
One more question about the environment: how should the execution environment be configured? It won't run on my GTX 1070 with 4 GB of RAM; it reports that memory is insufficient. Since typical consumer cards these days only have 8 GB of VRAM, can an ordinary graphics card run this BERT model?

Morizeyao (Owner)

Generally, at least a 1080Ti or a 2080Ti is recommended.
For cards with less VRAM, you can edit the contents of config/model_config_small.json to shrink the model before training.
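
For illustration, a sketch of what a shrunken config might look like. The keys follow the standard GPT-2 config format used by pytorch-transformers; the reduced values below are assumptions for a small-VRAM card, not the repo's shipped defaults, and vocab_size must match your tokenizer's vocabulary:

{
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 512,
  "n_embd": 384,
  "n_head": 6,
  "n_layer": 6,
  "n_positions": 512,
  "vocab_size": 13317
}

Reducing n_layer, n_embd, and n_ctx cuts both the parameter count and the activation memory, which is usually what makes training fit on a smaller card.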

Morizeyao pushed a commit that referenced this issue Nov 7, 2020