luxinyu1/Chinese-LS

What is Chinese-LS?

Lexical simplification (LS) aims to replace complex words in a given sentence with simpler alternatives of equivalent meaning. Chinese-LS is the first attempt at lexical simplification for Chinese. It includes a high-quality benchmark dataset and five baseline approaches:

  • Synonym dictionary-based approach

  • Word embedding-based approach

  • Pretrained language model-based approach

  • Sememe-based approach

  • Hybrid approach

The overall framework of Chinese-LS is shown below:

[Figure: Chinese-LS framework]
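
As a rough illustration of the task, the sketch below walks a single complex word through three toy stages named after the Generate, Select and Rank steps of this README. Everything in it is an assumption made for illustration: the synonym table, the frequency counts, the use of word frequency as a proxy for simplicity, and all function names are hypothetical and are not taken from the Chinese-LS code base or the paper.

```python
# Toy illustration only: the data and functions below are hypothetical and
# do not come from the Chinese-LS repository.
from typing import Dict, List

# Made-up synonym dictionary for one complex word.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "抨击": ["批评", "指责", "鞭挞"],
}

# Made-up frequency counts; a higher count is treated as "simpler" here.
TOY_FREQUENCY: Dict[str, int] = {
    "抨击": 120,
    "批评": 900,
    "指责": 450,
    "鞭挞": 15,
}


def generate(word: str) -> List[str]:
    """Return candidate substitutes for the complex word."""
    return list(TOY_SYNONYMS.get(word, []))


def select(word: str, candidates: List[str]) -> List[str]:
    """Keep candidates that are more frequent than the original word."""
    base = TOY_FREQUENCY.get(word, 0)
    return [c for c in candidates if c != word and TOY_FREQUENCY.get(c, 0) > base]


def rank(candidates: List[str]) -> List[str]:
    """Order candidates from most to least frequent."""
    return sorted(candidates, key=lambda c: TOY_FREQUENCY.get(c, 0), reverse=True)


def simplify(sentence: str, word: str) -> str:
    """Replace the first occurrence of the word with the top-ranked candidate."""
    ranked = rank(select(word, generate(word)))
    if not ranked:
        return sentence
    return sentence.replace(word, ranked[0], 1)


if __name__ == "__main__":
    print(simplify("他抨击了这个计划。", "抨击"))
```

The actual selection and ranking criteria used by Chinese-LS are described in the paper and may differ from this frequency-only heuristic.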

Quick start

Requirements

  • Python==3.7.6
  • transformers==3.5.0
  • numpy==1.18.1
  • jieba==0.42.1
  • torch==1.4.0
  • OpenHowNet==0.0.1a11
  • gensim==3.8.2

You can find the complete requirements here.
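
Before running the scripts, it can help to confirm that the installed packages match the pinned versions listed above. The following is a minimal sketch, assuming that each name in the list is also the package's distribution name on PyPI. The helper names `installed_version` and `find_mismatches` are hypothetical and are not part of this repository.

```python
# Hypothetical helper, not part of Chinese-LS: compares installed package
# versions against the pins listed in the Requirements section.
from typing import Callable, Dict, Optional, Tuple

PINNED: Dict[str, str] = {
    "transformers": "3.5.0",
    "numpy": "1.18.1",
    "jieba": "0.42.1",
    "torch": "1.4.0",
    "OpenHowNet": "0.0.1a11",
    "gensim": "3.8.2",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Python 3.7 has no importlib.metadata; fall back to setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(
    pinned: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched package to (expected version, found version or None)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter is injectable so that the comparison logic can be exercised without the packages being installed.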

Preparations

Download Pretrained Models

Chinese-LS uses the following pretrained models:

Please place the models under the ./model directory after downloading.

Run

We have already run the code for you; the intermediate results can be found in ./data.

For details of the code and algorithms, see our paper: Chinese Lexical Simplification

To reproduce the results, run the scripts in the following order:

Generate

  1. Synonym dictionary-based approach

    Run dict_generate.py

  2. Word embedding-based approach

    Run vector_generate.py

  3. Pretrained language model-based approach

    Run bert_generate.sh

  4. Sememe-based approach

    Run hownet_generate.py

  5. Hybrid approach

    Run hybrid_approach.py

Select

Run substitute_selection.py

Rank

Run substitute_ranking.py
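
The Generate, Select and Rank steps above can be chained with a small driver script. This is a minimal sketch under stated assumptions: the scripts sit in the current working directory, take no command-line arguments, and the shell script can be run with `bash`. The names `PIPELINE`, `build_command` and `run_pipeline` are hypothetical and are not part of this repository.

```python
# Hypothetical driver, not part of Chinese-LS: runs the scripts named in this
# README in the documented order.
import subprocess
import sys
from typing import List, Sequence

# Script names and order are taken from the Generate, Select and Rank sections.
PIPELINE: List[str] = [
    "dict_generate.py",
    "vector_generate.py",
    "bert_generate.sh",
    "hownet_generate.py",
    "hybrid_approach.py",
    "substitute_selection.py",
    "substitute_ranking.py",
]


def build_command(script: str) -> List[str]:
    """Build the command line for one script (assumes no arguments are needed)."""
    if script.endswith(".sh"):
        return ["bash", script]
    return [sys.executable, script]


def run_pipeline(scripts: Sequence[str] = PIPELINE, dry_run: bool = True) -> List[List[str]]:
    """Print each command in order and, unless dry_run is set, execute it."""
    commands = [build_command(script) for script in scripts]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default; pass dry_run=False to actually execute the scripts.
    run_pipeline(dry_run=True)
```

Stopping at the first failure matters here because each later step presumably reads the intermediate results written by the earlier ones.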

Experiments

Chinese-LS includes five experiments that evaluate the quality of the dataset and the performance of the five approaches. To obtain the experiment results, run experiment.py.

Citation

@article{qiang2021chinese,
    title={Chinese Lexical Simplification},
    author={Qiang, Jipeng and Lu, Xinyu and Li, Yun and Yuan, Yun-Hao and Wu, Xindong},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year={2021},
    volume={29},
    pages={1819-1828},
    doi={10.1109/TASLP.2021.3078361},
    publisher={IEEE}
}

Contact

This repo may still contain bugs, and we are working on improving its reproducibility. You are welcome to open an issue or submit a pull request to report or fix bugs.

Email: luxinyu12345@foxmail.com

License

Chinese-LS is released under the Apache License, Version 2.0.
