How can I train a Chinese model with this PLATO code? #25

Open
ShengXiaoXiao opened this issue Aug 13, 2020 · 1 comment

Comments

@ShengXiaoXiao

No description provided.

@sserdoubleh
Collaborator

sserdoubleh commented Aug 13, 2020

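You can prepare your corpus by following the instructions in Knover/README.md ( https://github.com/PaddlePaddle/Knover/blob/master/README.md ). To build the vocabulary, you can use the sentencepiece tool ( https://github.com/google/sentencepiece ); for the expected format, see ./package/dialog_en/vocab.txt and ./package/dialog_en/spm.model. Alternatively, you can use an existing Chinese vocabulary. If you want a different tokenizer (not the sentencepiece tokenizer), edit ./utils/tokenization.py and implement your own tokenizer (for example, one named BasicTokenizer), using the SentencePieceTokenizer implementation as a reference. Then specify the tokenizer in train_args in your config by adding the line train_args="--tokenizer BasicTokenizer".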

class SentencePieceTokenizer(object):

For the training procedure and configuration details, see Knover/README.md as well.
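As a rough illustration of the custom tokenizer route, here is a minimal sketch of what a character-level BasicTokenizer for Chinese text might look like. Everything in it is an assumption made for illustration: the constructor arguments, the `[UNK]` token, and the method names (`tokenize`, `convert_tokens_to_ids`, `convert_ids_to_tokens`) are hypothetical and not taken from Knover. The interface the training code actually expects should be checked against SentencePieceTokenizer in ./utils/tokenization.py.

```python
import re


class BasicTokenizer(object):
    """Hypothetical character-level tokenizer (illustration only).

    Assumption: the vocabulary is passed in as a list of tokens whose
    list index is the token id. The real Knover tokenizers may load the
    vocabulary differently and expose a different interface.
    """

    def __init__(self, vocab, unk_token="[UNK]"):
        if unk_token not in vocab:
            raise ValueError("unk_token must be part of the vocabulary")
        self.unk_token = unk_token
        self.vocab = {token: idx for idx, token in enumerate(vocab)}
        self.inv_vocab = {idx: token for token, idx in self.vocab.items()}

    def tokenize(self, text):
        # Keep runs of ASCII letters/digits together; split every other
        # non-whitespace character (e.g. each Chinese character) on its own.
        return re.findall(r"[A-Za-z0-9]+|\S", text)

    def convert_tokens_to_ids(self, tokens):
        # Tokens missing from the vocabulary map to the unknown-token id.
        unk_id = self.vocab[self.unk_token]
        return [self.vocab.get(token, unk_id) for token in tokens]

    def convert_ids_to_tokens(self, ids):
        # Ids missing from the vocabulary map back to the unknown token.
        return [self.inv_vocab.get(idx, self.unk_token) for idx in ids]


if __name__ == "__main__":
    tokenizer = BasicTokenizer(["[UNK]", "你", "好", "吗"])
    tokens = tokenizer.tokenize("你好 PLATO2吗")
    print(tokens)
    print(tokenizer.convert_tokens_to_ids(tokens))
```

Splitting on single characters avoids the need for a Chinese word segmenter, at the cost of longer sequences; a sentencepiece model trained on your own corpus is the alternative described above.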
