StandAloneSpellingCorrection

Repository for the Findings of EMNLP 2020 paper "Context-aware Stand-alone Neural Spelling Correction". This work was done by Xiangci Li as a research scientist intern at IDL of SVAIL, Baidu USA, supervised by Dr. Hairong Liu. If you have any questions, please contact lixiangci8@gmail.com.

Requirements

  • PaddlePaddle 1.6
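
The code targets the 1.6 release line of PaddlePaddle. If you install from PyPI, something like the following should work (exact version pin is an assumption; use the paddlepaddle-gpu package instead for a GPU build):

```
pip install paddlepaddle==1.6.0
```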

The following sections explain each folder in this repository.

Creating Dataset

We construct the misspelling dataset from the 1-Billion-Word-Language-Model-Benchmark. The natural misspellings come from missp.dat.txt (link) and en.natural.txt (link). We treat the sentences of the 1B dataset as gold tokens and randomly replace correct words with candidate misspellings.

dataset_train.py and dataset_train_random.py create datasets with natural misspellings and synthetic (random-character) misspellings, respectively, and store them in an input sentence file and a label file. The other scripts convert these dataset files into the corresponding formats for the different models. train_replacement.json saves the sampled "known misspelling vocabulary" described in the paper.
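
As a rough illustration of the corruption step (this is a sketch, not the repo's actual script): it assumes a misspelling dictionary in the spirit of missp.dat.txt, mapping a correct word to observed misspellings, and writes parallel input and label files with one whitespace-tokenized sentence per line. All names here (MISSPELLINGS, corrupt_sentence, the replacement rate) are illustrative.

```python
import random

# Toy stand-in for a natural-misspelling dictionary such as missp.dat.txt:
# correct word -> observed misspellings. Illustrative only.
MISSPELLINGS = {
    "receive": ["recieve", "receeve"],
    "separate": ["seperate"],
}

def corrupt_sentence(tokens, rate=0.2, rng=random):
    """Randomly replace correct words with candidate misspellings.
    Returns (corrupted tokens, gold tokens)."""
    corrupted = []
    for tok in tokens:
        candidates = MISSPELLINGS.get(tok.lower())
        if candidates and rng.random() < rate:
            corrupted.append(rng.choice(candidates))
        else:
            corrupted.append(tok)
    return corrupted, tokens

gold_sentences = ["It is hard to receive and separate mail".split()]
with open("input.txt", "w") as f_in, open("label.txt", "w") as f_out:
    for sent in gold_sentences:
        noisy, clean = corrupt_sentence(sent, rate=0.5)
        f_in.write(" ".join(noisy) + "\n")   # model input with misspellings
        f_out.write(" ".join(clean) + "\n")  # gold labels
```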

Word-wise Spelling Correction

This is an early exploration of the spelling corrector that is not covered in the paper. It directly originated from Dr. Hairong Liu's idea of using two Transformer encoders to encode characters and words, which is similar to the Word+Char encoder in the paper. The difference is that this model can only correct spellings at a given position; it cannot detect real-word misspellings.
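
As a rough illustration of the two-encoder idea (not the repo's code, which is in PaddlePaddle 1.6 DyGraph; this sketch uses PyTorch for brevity, and all dimensions and names are placeholders): a character-level Transformer summarizes each word's characters, the summary is added to the word embedding, a word-level Transformer encodes the sentence, and the output at the given position is classified over the word vocabulary.

```python
import torch
import torch.nn as nn

class WordCharCorrector(nn.Module):
    """Illustrative Word+Char-style corrector. PyTorch sketch only; the
    actual model is implemented in PaddlePaddle 1.6 DyGraph."""

    def __init__(self, n_chars, n_words, d=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d)
        self.word_emb = nn.Embedding(n_words, d)
        char_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        word_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.char_enc = nn.TransformerEncoder(char_layer, num_layers=2)
        self.word_enc = nn.TransformerEncoder(word_layer, num_layers=2)
        self.out = nn.Linear(d, n_words)

    def forward(self, char_ids, word_ids, position):
        # char_ids: (batch, seq_len, chars_per_word); word_ids: (batch, seq_len)
        b, s, c = char_ids.shape
        chars = self.char_enc(self.char_emb(char_ids.view(b * s, c)))
        char_summary = chars.mean(dim=1).view(b, s, -1)   # one vector per word
        hidden = self.word_enc(self.word_emb(word_ids) + char_summary)
        return self.out(hidden[:, position])              # logits at `position`

model = WordCharCorrector(n_chars=100, n_words=5000)
logits = model(torch.randint(0, 100, (2, 8, 12)),   # character ids
               torch.randint(0, 5000, (2, 8)),      # word ids
               position=3)
```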

The model is implemented with PaddlePaddle 1.6 DyGraph; the backbone code is from PaddlePaddle's benchmark code. configs/ contains configuration files for the different settings: standard means the Word+Char model, mix means natural + synthetic misspellings, and char means the character-only encoder.

To train the model, run python -u model.py configs/standard_config.json. Add --test for inference only. You can also run tune_model.py to launch multiple jobs on a SLURM system and train models with different configurations.
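
For example, using the commands above:

```
python -u model.py configs/standard_config.json          # train
python -u model.py configs/standard_config.json --test   # inference only
```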

LM Spelling Correction

This is the model introduced in the paper. It is essentially a modification of the ERNIE 2.0 implementation. Run tune_xxx.py to launch multiple jobs on a SLURM system, and run the corresponding test_xxx.sh for inference. The usage is the same as for ERNIE 2.0.

Baseline 1: ScRNN

ScRNN is the semi-character RNN proposed by Sakaguchi et al. (2017): each word is represented by its first character, its last character, and a bag of its internal characters, and the sequence of word representations is fed to an RNN. This is a re-implementation using PaddlePaddle 1.6 DyGraph. Run tune_robust_model.py to launch multiple jobs on a SLURM system and tune the learning rate and the LSTM hidden size. Once you have the model weights, run python robust_model.py --test for inference.
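
For reference, here is a minimal sketch of the semi-character word representation from Sakaguchi et al. (2017): the concatenation of a one-hot vector for the first character, a bag-of-characters count vector for the internal characters, and a one-hot vector for the last character. This is illustrative only, not the repo's PaddlePaddle code, and it restricts the alphabet to a-z for simplicity.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
IDX = {ch: i for i, ch in enumerate(ALPHABET)}

def semi_character_vector(word):
    """ScRNN-style representation: one-hot(first) + bag(middle) + one-hot(last).
    Characters outside a-z are ignored in this sketch."""
    first = np.zeros(len(ALPHABET))
    middle = np.zeros(len(ALPHABET))
    last = np.zeros(len(ALPHABET))
    chars = [c for c in word.lower() if c in IDX]
    if chars:
        first[IDX[chars[0]]] = 1
        last[IDX[chars[-1]]] = 1
        for c in chars[1:-1]:
            middle[IDX[c]] += 1
    return np.concatenate([first, middle, last])  # length 78

# "Cmabrigde" and "Cambridge" share first/last characters and the same bag
# of internal characters, so their representations coincide:
assert (semi_character_vector("Cmabrigde") == semi_character_vector("Cambridge")).all()
```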

Baseline 2: MUDE

MUDE is the previous state-of-the-art model for stand-alone spelling correction. It is based on PyTorch. The version here is a minimal revision of the original code so that it can take our dataset as input.

Citation

Please cite "Context-aware Stand-alone Neural Spelling Correction" as:

@inproceedings{li2020context,
  title={Context-aware Stand-alone Neural Spelling Correction},
  author={Li, Xiangci and Liu, Hairong and Huang, Liang},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings},
  pages={407--414},
  year={2020}
}
