Segmentation from my self-trained model differs from the official one #23

Closed
lishouguang opened this issue Dec 15, 2016 · 4 comments

Comments

@lishouguang

lishouguang commented Dec 15, 2016

The corpus I trained on is the one downloaded from the 待字闺中 WeChat public account.

Segmenting the demo sentence “赵雅淇洒泪道歉 和林丹没有任何经济关系” gives this result:
{
"msg": "OK",
"segments": [
"赵雅",
"淇",
"洒泪",
"道",
"歉",
" ",
"和林",
"丹",
"没",
"有",
"任",
"何",
"经济",
"关",
"系"
],
"status": 0
}

Why does this happen?

The commands I ran for training are as follows:
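
One way to put a number on how fragmented this output is would be to count how many returned segments are a single character. This is a minimal sketch that assumes only the response shape shown above (`segments` as a list of strings); the helper name `single_char_ratio` is made up for illustration and is not part of kcws.

```python
import json


def single_char_ratio(response_text):
    """Return the fraction of non-whitespace segments that are one character long."""
    segments = json.loads(response_text)["segments"]
    # Ignore whitespace-only segments such as the " " between the two clauses.
    words = [s for s in segments if s.strip()]
    if not words:
        return 0.0
    return sum(1 for s in words if len(s) == 1) / len(words)


# The response pasted in this comment, collapsed onto one line.
response = (
    '{"msg": "OK", "segments": ["赵雅", "淇", "洒泪", "道", "歉", " ", '
    '"和林", "丹", "没", "有", "任", "何", "经济", "关", "系"], "status": 0}'
)
print(single_char_ratio(response))
```

For the response above, 10 of the 14 non-space segments are single characters, which is far more than a well-trained segmenter would be expected to produce on this sentence.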
cd kcws

python kcws/train/process_anno_file.py /usr/local/people2014 pre_chars_for_w2v.txt

bazel build third_party/word2vec:word2vec

./bazel-bin/third_party/word2vec/word2vec -train pre_chars_for_w2v.txt -save-vocab pre_vocab.txt -min-count 3

python kcws/train/replace_unk.py pre_vocab.txt pre_chars_for_w2v.txt chars_for_w2v.txt

./bazel-bin/third_party/word2vec/word2vec -train chars_for_w2v.txt -output kcws/models/vec.txt -size 50 -sample 1e-4 -negative 5 -hs 1 -binary 0 -iter 5

bazel build kcws/train:generate_training

./bazel-bin/kcws/train/generate_training kcws/models/vec.txt /usr/local/people2014 all.txt

python kcws/train/filter_sentence.py all.txt

python kcws/train/train_cws_lstm.py --word2vec_path kcws/models/vec.txt --train_data_path /usr/local/kcws/train.txt --test_data_path /usr/local/kcws/test.txt --max_sentence_len 80 --learning_rate 0.001

bazel build kcws/cc:dump_vocab
./bazel-bin/kcws/cc/dump_vocab kcws/models/vec.txt vocab.txt

./bazel-bin/kcws/cc/seg_backend_api --model_path=/usr/local/kcws/kcws/models/seg_model.pbtxt --vocab_path=/usr/local/kcws/vocab.txt --max_sentence_len=80

After training, the reported accuracy is 96.61%.
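
Since both the training step and `dump_vocab` read `kcws/models/vec.txt`, it may be worth confirming that the file is well formed before looking elsewhere. The sketch below assumes the standard word2vec text layout produced with `-binary 0` (a header line `vocab_size dim`, then one token followed by `dim` space-separated values per line). The function name `check_vec_lines` is hypothetical, and this check says nothing about whether the embeddings themselves are good.

```python
def check_vec_lines(lines, expected_dim=50):
    """Validate word2vec text-format lines and return (vocab_size, dim)."""
    lines = iter(lines)
    header = next(lines).split()
    vocab_size, dim = int(header[0]), int(header[1])
    if dim != expected_dim:
        raise ValueError(f"dimension is {dim}, expected {expected_dim}")
    rows = 0
    for line_no, line in enumerate(lines, start=2):
        # Split on ASCII spaces only, so a token that is itself an unusual
        # whitespace character (e.g. U+3000) is not swallowed.
        fields = line.rstrip("\n").rstrip(" ").split(" ")
        if fields == [""]:
            continue  # skip blank lines
        if len(fields) != dim + 1:
            raise ValueError(
                f"line {line_no}: {len(fields) - 1} values, expected {dim}"
            )
        rows += 1
    if rows != vocab_size:
        raise ValueError(f"header says {vocab_size} rows, found {rows}")
    return vocab_size, dim


# Tiny in-memory sample with dimension 3, for illustration only.
sample = ["2 3\n", "</s> 0.1 0.2 0.3 \n", "的 0.4 0.5 0.6 \n"]
print(check_vec_lines(sample, expected_dim=3))

# Against the real file (path taken from the commands above):
# with open("kcws/models/vec.txt", encoding="utf-8") as f:
#     check_vec_lines(f, expected_dim=50)
```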

@koth
Owner

koth commented Jan 14, 2017

I repeated the process today, and the segmentation quality was comparable to before.

@weisong82

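In the demo, how do the two addresses differ in training and in function? All I can see is that one accepts phrases and the other accepts company names.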

http://45.32.100.248:9090/
Also: a company-name recognition demo trained with the same model:
http://45.32.100.248:18080

@crownpku

crownpku commented Jan 23, 2017

I ran into the same problem. My segmentation result is as poor as the original poster's.

{
"msg": "OK",
"segments": [
"赵雅",
"淇洒泪",
"道",
"歉",
" ",
"和",
"林丹没",
"有",
"任",
"何",
"经",
"济",
"关",
"系"
],
"status": 0
}

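I took a quick look. After training finished, only basic_vocab.txt in kcws/kcws/models had been updated; seg_model.pbtxt and vec.txt had not. Is that expected, i.e. is basic_vocab.txt the only new output that the newly trained model uses?
Also, the trained model seems to be saved under /logs/finnal-model.xxx, but it does not appear to be used afterwards?
I'm a beginner here, so I would appreciate any further debugging hints.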
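
A quick way to make the observation above concrete is to list which model files are older than the training run. This is only a sketch: the helper names `stale_artifacts` and `collect_mtimes` are hypothetical, the timestamps in the example are invented, and a stale `seg_model.pbtxt` is a plausible but unconfirmed explanation for the poor output reported in this thread.

```python
import os


def stale_artifacts(mtimes, trained_at):
    """Return, sorted, the names whose modification time predates trained_at."""
    return sorted(name for name, mtime in mtimes.items() if mtime < trained_at)


def collect_mtimes(paths):
    """Map each existing path to its modification time (seconds since epoch)."""
    return {p: os.path.getmtime(p) for p in paths if os.path.exists(p)}


# Illustrative timestamps only; real values would come from collect_mtimes().
example = {"basic_vocab.txt": 200.0, "seg_model.pbtxt": 50.0, "vec.txt": 60.0}
print(stale_artifacts(example, trained_at=100.0))
```

In practice `trained_at` could be the start time of the training command, and the paths would be the files under `kcws/models` that the server is launched with.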

@koth
Owner

koth commented Feb 1, 2017

See step 6 of the updated documentation.

koth closed this as completed Feb 1, 2017