GitHub - herunkang2018/Chinese-POS-Tagging: Homework of UCAS NLP Course

Project Documentation

The project is divided into three parts; the code is structured as follows:

How to run:

1. The project uses Python 3. Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

2. Preprocess the corpus:

python preprocess.py

3. Collect the full set of tags for use in later computations:

python calc_all_tag_list.py
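
As a rough illustration of what this step involves, the sketch below collects a tag inventory from lines of whitespace-separated `word/tag` tokens. The token format is an assumption inferred from the 面试/vn correction mentioned in this README; the function name `collect_tags` and the sample lines are hypothetical and are not taken from calc_all_tag_list.py.

```python
from collections import Counter

def collect_tags(lines):
    """Count every tag seen in whitespace-separated word/tag tokens."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # rpartition splits on the LAST slash, so a word containing "/" survives
            word, sep, tag = token.rpartition("/")
            if sep and word and tag:
                counts[tag] += 1
    return counts

# Hypothetical sample lines (assumed format, not real corpus data)
sample = ["我/r 参加/v 面试/vn", "面试/vn 结束/v 了/u"]
tag_counts = collect_tags(sample)
all_tags = sorted(tag_counts)
print(all_tags)
```

Keeping the counts rather than only the set of tags makes the same pass reusable for estimating tag frequencies later.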

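4. Train the model and cross-validate it with K-fold (K=10), computing the average accuracy as well as the accuracy on out-of-vocabulary words and on ambiguous (multi-category) words: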

python tagger.py
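
The evaluation described in this step can be sketched as follows. This is a minimal illustration, not the code in tagger.py: a most-frequent-tag baseline stands in for the HMM, an ambiguous word is assumed to mean a word seen with two or more tags in the training folds, and all function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs; fold i tests on every k-th sentence from i."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test

def train_baseline(train):
    """Stand-in model (NOT the HMM): per-word tag counts plus a default tag."""
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sent in train:
        for word, tag in sent:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    return word_tags, all_tags.most_common(1)[0][0]

def evaluate(train, test):
    """Return [correct, total] counts for all, OOV, and ambiguous words."""
    word_tags, default = train_baseline(train)
    stats = {"all": [0, 0], "oov": [0, 0], "ambiguous": [0, 0]}
    for sent in test:
        for word, gold in sent:
            seen = word_tags.get(word)
            pred = seen.most_common(1)[0][0] if seen else default
            groups = ["all"]
            if seen is None:
                groups.append("oov")        # word never seen in training
            elif len(seen) > 1:
                groups.append("ambiguous")  # word seen with 2+ tags
            for g in groups:
                stats[g][0] += int(pred == gold)
                stats[g][1] += 1
    return stats

def cross_validate(sentences, k=10):
    """Average per-fold accuracy for each group, skipping empty groups."""
    per_group = defaultdict(list)
    for train, test in k_fold_splits(sentences, k):
        for group, (correct, total) in evaluate(train, test).items():
            if total:
                per_group[group].append(correct / total)
    return {g: sum(v) / len(v) for g, v in per_group.items()}

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs
toy = [
    [("我", "r"), ("研究", "v")],
    [("研究", "n"), ("很", "d"), ("好", "a")],
    [("我", "r"), ("研究", "v"), ("语言", "n")],
    [("语言", "n"), ("很", "d"), ("好", "a")],
]
print(cross_validate(toy, k=2))
```

Reporting out-of-vocabulary and ambiguous words separately shows where a tagger loses accuracy, which a single overall figure hides.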

5. Measure training and tagging performance (speed):

python tagger_calc_speed.py

6. Additional experiment: measure how the model's average accuracy changes with corpus size:

python tagger_with_diff_corpus_size.py
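
One way such a corpus-size experiment could be organized is sketched below. It is an illustration under assumptions, not the logic of tagger_with_diff_corpus_size.py: the helper `accuracy_by_corpus_size`, the fixed 10% held-out test set, and the `vocabulary_coverage` stand-in scorer are all hypothetical, and a real run would pass a function that trains the tagger and returns its accuracy.

```python
import random

def accuracy_by_corpus_size(sentences, fractions, train_and_score, seed=0):
    """Score a model trained on growing subsets of the corpus.

    train_and_score(train, test) -> float is supplied by the caller,
    so this helper does not depend on any particular model.
    """
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    # Hold out a fixed 10% test set so every size is scored on the same data
    n_test = max(1, len(shuffled) // 10)
    test, pool = shuffled[:n_test], shuffled[n_test:]
    results = []
    for frac in fractions:
        n_train = max(1, int(len(pool) * frac))
        results.append((n_train, train_and_score(pool[:n_train], test)))
    return results

def vocabulary_coverage(train, test):
    """Stand-in score: share of test tokens whose word occurs in training."""
    vocab = {word for sent in train for word, _ in sent}
    tokens = [word for sent in test for word, _ in sent]
    return sum(w in vocab for w in tokens) / len(tokens)

# Hypothetical synthetic corpus of (word, tag) sentences
corpus = [[(f"w{i % 13}", "n"), (f"w{i % 7}", "v")] for i in range(50)]
curve = accuracy_by_corpus_size(corpus, [0.1, 0.5, 1.0], vocabulary_coverage)
for n_train, score in curve:
    print(n_train, round(score, 3))
```

Because each training subset is a prefix of the same shuffled pool, larger subsets always contain the smaller ones, which keeps the curve comparable across sizes.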

Additional notes:

1. The original corpus file contains one error: one occurrence is tagged 面试/vvn. The tag has been corrected to vn.

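2. By default the decoder sums log probabilities instead of multiplying raw probabilities, which gives higher accuracy (see result/log_vs_no_log/README.txt for the experimental results). tagger.py imports viterbi_with_log by default.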

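3. The smoothing results are in the smooth directory. Current test results are somewhat worse than with simple smoothing, so further debugging is needed.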
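
The log-sum decoding mentioned in note 2 can be sketched as follows. This is a minimal Viterbi decoder over summed log probabilities, written under assumptions and not copied from viterbi_with_log.py: the function name `viterbi_log`, the `floor` parameter used for zero or missing probabilities, and the toy model values are all hypothetical. Summing logs avoids the floating-point underflow that a long product of small probabilities can cause.

```python
import math

def viterbi_log(obs, states, start_p, trans_p, emit_p, floor=1e-12):
    """Most likely state sequence for a non-empty obs, using log scores."""
    def lg(p):
        # Floor zero or missing probabilities so log() stays defined
        return math.log(max(p, floor))

    # best[t][s]: best log score of any path ending in state s at position t
    best = [{s: lg(start_p.get(s, 0)) + lg(emit_p[s].get(obs[0], 0))
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lg(trans_p[p].get(s, 0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(obs[t], 0))
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model (made-up probabilities, not estimated from the corpus)
states = ["r", "v", "n"]
start_p = {"r": 0.6, "v": 0.1, "n": 0.3}
trans_p = {
    "r": {"r": 0.1, "v": 0.7, "n": 0.2},
    "v": {"r": 0.2, "v": 0.1, "n": 0.7},
    "n": {"r": 0.1, "v": 0.5, "n": 0.4},
}
emit_p = {
    "r": {"我": 0.9},
    "v": {"研究": 0.4, "喜欢": 0.6},
    "n": {"研究": 0.6, "语言": 0.4},
}
path = viterbi_log(["我", "喜欢", "研究"], states, start_p, trans_p, emit_p)
print(path)  # ['r', 'v', 'n']
```

In this toy model 研究 can be emitted by both v and n, and the transition from v to n is what tips the decoder toward the noun reading.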

Repository files:

data
result
smooth
tag_descr
README.md
calc_all_tag_list.py
calc_corpus_vocab_num.py
convert_to_single_byte.py
paint_corpus_size_res.py
preprocess.py
requirements.txt
tagger.py
tagger_calc_speed.py
tagger_with_diff_corpus_size.py
utils.py
viterbi.py
viterbi_with_log.py
基于HMM的中文词性标注系统技术报告.pdf