Requirement

PyTorch 1.9
pip install transformers
pip install gensim==3.8.3
pip install pkuseg
pip install LAC
pip install snownlp
Pre-trained word embeddings required: tencent-ailab-embedding-zh-d200-v0.1.0.txt
GPU: NVIDIA GeForce RTX 3090

Example

cd src/preprocess
python get_word_index.py # Compute the start and end index of each word after word segmentation.
python get_pretrained_word_embedding.py # Extract the word embeddings this dataset needs from the pre-trained embedding file.
cd ../..
python main.py
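
For readers who want to see what the embedding extraction step amounts to, here is a minimal sketch. It is not the code in get_pretrained_word_embedding.py. It assumes the embedding file is in word2vec text format (a header line "<count> <dim>", then one "word v1 v2 ..." line per word), and the function name filter_embeddings and its parameters are hypothetical.

```python
def filter_embeddings(src_path, dst_path, vocab):
    """Copy only the vectors whose word appears in `vocab`.

    Hypothetical helper: assumes word2vec text format, i.e. a header
    line "<count> <dim>" followed by "word v1 v2 ..." lines.
    Returns the number of vectors written.
    """
    kept = []
    with open(src_path, encoding="utf-8") as src:
        header = src.readline().split()
        dim = int(header[1])  # second header field is the vector size
        for line in src:
            # Split off the word; keep the vector text untouched.
            word, _, vector = line.rstrip("\n").partition(" ")
            if word in vocab:
                kept.append((word, vector))

    with open(dst_path, "w", encoding="utf-8") as dst:
        # Rewrite the header so the count matches the filtered file.
        dst.write(f"{len(kept)} {dim}\n")
        for word, vector in kept:
            dst.write(f"{word} {vector}\n")
    return len(kept)
```

Streaming the file line by line avoids loading the full embedding table into memory; only the matching rows are held before writing.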

Tips

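1. The CLS and SEP parameters of the get_input_tuple function in src/data_helper.py must be changed to match the model in use: 101 and 102 for BERT/BERT-wwm, 1 and 2 for ERNIE.
2. This example shows how to process a single sentence. For sentence-pair and MRC tasks, insert [SEP] between the two sentences.
3. Some datasets contain hidden characters. The model does not handle the start==end case (it raises an error when start==end occurs), so you need to remove these characters yourself.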
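
The tips above can be illustrated with a small sketch. It is not taken from src/data_helper.py: the names SPECIAL_IDS, build_input_ids and find_empty_spans are hypothetical, the dictionary keys are illustrative labels, and the trailing [SEP] after the second sentence follows the usual BERT input convention rather than anything stated in this repository. Only the id values (101/102 and 1/2) come from the tips.

```python
# (CLS id, SEP id) per model family, as quoted in tip 1.
# The keys are illustrative labels, not names used by the repository.
SPECIAL_IDS = {
    "bert": (101, 102),
    "bert-wwm": (101, 102),
    "ernie": (1, 2),
}


def build_input_ids(model, ids_a, ids_b=None):
    """Wrap token ids with CLS/SEP; hypothetical helper.

    Single sentence: [CLS] A [SEP]
    Sentence pair:   [CLS] A [SEP] B [SEP]  (assumed BERT convention)
    """
    cls_id, sep_id = SPECIAL_IDS[model]
    ids = [cls_id] + list(ids_a) + [sep_id]
    if ids_b is not None:
        ids += list(ids_b) + [sep_id]
    return ids


def find_empty_spans(spans):
    """Return positions of (start, end) word spans where start == end.

    Such spans usually point at hidden characters in the raw text,
    which should be removed from the data before training.
    """
    return [i for i, (start, end) in enumerate(spans) if start == end]
```

Running find_empty_spans over the word spans of each sample before training is one way to locate the sentences that still contain hidden characters.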

Citation

@inproceedings{li2022exploiting,
  title={Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models},
  author={Li, Wenbiao and Sun, Rui and Wu, Yunfang},
  booktitle={Natural Language Processing and Chinese Computing: 11th CCF International Conference, NLPCC 2022, Guilin, China, September 24--25, 2022, Proceedings, Part I},
  pages={3--15},
  year={2022},
  organization={Springer}
}

liwb1219/HRMF
