pytorch 1.9
pip install transformers
pip install gensim==3.8.3
pip install pkuseg
pip install LAC
pip install snownlp
pretrained word embedding need: tencent-ailab-embedding-zh-d200-v0.1.0.txt
GPU: NVIDIA GeForce RTX 3090
cd src/preprocess
python get_word_index.py # Get the start and end index of each word after text segmentation.
python get_pretrained_word_embedding.py # Filter out the word embedding required by this data from the pre-trained word embedding.
cd ../..
python main.py
1.src/data_helper.py中的get_input_tuple函数中的CLS和SEP参数需要按照对应模型进行修改,BERT/BERT-wwm是101和102,ERNIE是1和2
2.本示例给出单个句子的处理方法,对于sentences pair和MRC任务,需要在两个句子之间插入[SEP]
3.部分数据集中有隐藏字符,模型没有处理start==end的情况,需自行删除(当遇到start==end的情况,会报错)
@inproceedings{li2022exploiting,
title={Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models},
author={Li, Wenbiao and Sun, Rui and Wu, Yunfang},
booktitle={Natural Language Processing and Chinese Computing: 11th CCF International Conference, NLPCC 2022, Guilin, China, September 24--25, 2022, Proceedings, Part I},
pages={3--15},
year={2022},
organization={Springer}
}