A Neural Grammatical Error Correction System Built on Better Pre-training and Sequential Transfer Learning

Code accompanying Team Kakao&Brain's submission to the ACL 2019 BEA Workshop Shared Task.
(helo_word is our informal team name.)


ACL Anthology:


YJ Choe^, Jiyeon Ham^, Kyubyong Park^, Yeoil Yoon^

^Equal contribution.


Installation

Requires Python 3.

# apt-get packages (required for hunspell & pattern)
apt-get update
apt-get install libhunspell-dev libmysqlclient-dev -y

# pip packages
pip install --upgrade pip
pip install --upgrade -r requirements.txt
python -m spacy download en

# custom fairseq (fork of 0.6.1 with gec modifications)
pip install --editable fairseq

# errant
git clone

# pattern3 (see for any installation issues)
pip install pattern3
python -c "import site; print(site.getsitepackages())"
cp PATH_TO_SITE_PACKAGES/pattern3/text/

Download & Preprocess Data


Restricted Track

  • Prepare data for the restricted track
    python --track 1
  • Pre-train
    • Training the model automatically creates a checkpoint directory.
    • Fill its path into {ckpt_dir}.
    • Also fill the number of GPUs used for training into {ngpu}.
    python --track 1 --train-mode pretrain --model base --ngpu {ngpu}
    python --track 1 --subset valid --ckpt-dir {ckpt_dir}
  • Train
    • Evaluating the model automatically creates an output directory.
    • Fill the previous model's output directory into {prev_model_output_dir}.
    python --track 1 --train-mode train --model base --ngpu {ngpu} \
        --lr 1e-4 --max-epoch 40 --reset --prev-model-output-dir {prev_model_output_dir}
    python --track 1 --subset valid --ckpt-dir {ckpt_dir}
  • Fine-tune
    • Fill the path of the best validation report into {prev_model_output_fpath}.
    • The command below will then output a list of error types to be removed.
    • Fill them into {remove_error_type_lst}.
    python --track 1 --train-mode finetune --model base --ngpu {ngpu} \
        --lr 5e-5 --max-epoch 80 --reset --prev-model-output-dir {prev_model_output_dir}
    python --track 1 --subset valid --ckpt-dir {ckpt_dir}
    python --report {prev_model_output_fpath} \
        --max_error_types 10 --n_simulations 1000000
    python --track 1 --subset test --ckpt-fpath {ckpt_fpath} \
        --remove-unk-edits --remove-error-type-lst {remove_error_type_lst} \
        --apply-rerank --preserve-spell --max-edits 7 
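The `--remove-error-type-lst` flag suppresses corrections for error types the model handles unreliably. The post-processing step can be sketched as follows; the edit structure and function name here are illustrative assumptions, not the repo's actual API, and the error-type codes follow ERRANT's convention:

```python
# Hypothetical sketch of the --remove-error-type-lst post-processing step:
# hypothesized edits whose ERRANT error type is on a blocklist are
# discarded, keeping only corrections for types the system trusts.

def filter_edits(edits, remove_error_types):
    """Drop edits whose error type is in the removal list."""
    return [e for e in edits if e["type"] not in remove_error_types]

# Each illustrative edit: token span, ERRANT error type, correction.
edits = [
    {"span": (0, 1), "type": "R:ORTH",  "correction": "The"},
    {"span": (3, 4), "type": "U:DET",   "correction": ""},
    {"span": (5, 6), "type": "R:SPELL", "correction": "receive"},
]
kept = filter_edits(edits, remove_error_types={"U:DET"})
```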

Low Resource Track

  • Prepare data for the low resource track
    python --track 3
  • Pre-train
    python --track 3 --train-mode pretrain --model base --ngpu {ngpu}
    python --track 3 --subset valid --ckpt-dir {ckpt_dir}
  • Train
    python --track 3 --train-mode finetune --model base --ngpu {ngpu} \
        --max-epoch 40 --prev-model-output-dir {prev_model_output_dir} 
    python --track 3 --subset valid --ckpt-dir {ckpt_dir}
    python --track 3 --subset test --ckpt-fpath {ckpt_fpath} \
        --remove-unk-edits --remove-error-type-lst {remove_error_type_lst} \
        --apply-rerank --preserve-spell --max-edits 7 
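Conceptually, `--apply-rerank` rescores the model's n-best hypotheses with an additional feature and picks the hypothesis with the best combined score. A minimal sketch, assuming a hypothetical language-model score as the extra feature (the weights and score sources are illustrative, not the repo's actual reranker):

```python
# Toy reranker: combine the model score with an auxiliary score
# (here a hypothetical LM log-probability) and keep the argmax.

def rerank(hypotheses, lm_weight=0.5):
    """hypotheses: list of (text, model_score, lm_score); higher is better."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])

nbest = [
    ("He go to school.",   -1.2, -9.0),  # combined: -1.2 + 0.5 * -9.0 = -5.7
    ("He goes to school.", -1.5, -4.0),  # combined: -1.5 + 0.5 * -4.0 = -3.5
]
best = rerank(nbest)
```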

A Note on fairseq

We ran our Transformer models using fairseq-0.6.1. We had to make several modifications to the package, including our own implementation of the copy-augmented Transformer model. You can find all of our modifications in the fairseq/ directory.
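The core idea of the copy-augmented Transformer is that the final output distribution is a gated mixture of the decoder's generation distribution and a copy distribution over the source tokens. A minimal numerical sketch of that mixture, with toy values rather than anything from the actual model:

```python
# Gated mixture used by copy mechanisms:
#   p_final = (1 - gate) * p_gen + gate * p_copy
# p_gen comes from the decoder softmax; p_copy comes from attention
# over source tokens, mapped onto the vocabulary. All values are toy.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mix(p_gen, p_copy, gate):
    """Elementwise gated mixture of two distributions."""
    return [(1 - gate) * g + gate * c for g, c in zip(p_gen, p_copy)]

p_gen = softmax([2.0, 1.0, 0.1])   # decoder vocabulary distribution
p_copy = [0.0, 0.9, 0.1]           # copy distribution from source attention
p_final = mix(p_gen, p_copy, gate=0.3)
```

Because both inputs are valid probability distributions and the gate is a convex weight, the mixture is itself a valid distribution.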


If you use our code for research, please cite our work as:

    @inproceedings{choe-etal-2019-neural,
        title = "A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning",
        author = "Choe, Yo Joong  and
          Ham, Jiyeon  and
          Park, Kyubyong  and
          Yoon, Yeoil",
        booktitle = "Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications",
        month = aug,
        year = "2019",
        address = "Florence, Italy",
        publisher = "Association for Computational Linguistics",
        url = "",
        pages = "213--227",
        abstract = "Grammatical error correction can be viewed as a low-resource sequence-to-sequence task, because publicly available parallel corpora are limited. To tackle this challenge, we first generate erroneous versions of large unannotated corpora using a realistic noising function. The resulting parallel corpora are subsequently used to pre-train Transformer models. Then, by sequentially applying transfer learning, we adapt these models to the domain and style of the test set. Combined with a context-aware neural spellchecker, our system achieves competitive results in both restricted and low resource tracks in the ACL 2019 BEA Shared Task. We release all of our code and materials for reproducibility.",
    }

