# helo_word

**A Neural Grammatical Error Correction System Built on Better Pre-training and Sequential Transfer Learning**

Code accompanying Team Kakao&Brain's submission to the ACL 2019 BEA Workshop Shared Task. (`helo_word` is our informal team name.)

Paper: https://arxiv.org/abs/1907.01256

ACL Anthology: https://www.aclweb.org/anthology/papers/W/W19/W19-4423/

## Authors

YJ Choe^, Jiyeon Ham^, Kyubyong Park^, Yeoil Yoon^

^Equal contribution.

## Installation

Requires Python 3.

```bash
# apt-get packages (required for hunspell & pattern)
apt-get update
apt-get install libhunspell-dev libmysqlclient-dev -y

# pip packages
pip install --upgrade pip
pip install --upgrade -r requirements.txt
python -m spacy download en

# custom fairseq (fork of 0.6.1 with gec modifications)
pip install --editable fairseq

# errant
git clone https://github.com/chrisjbryant/errant

# pattern3 (see https://www.clips.uantwerpen.be/pages/pattern for any installation issues)
pip install pattern3
python -c "import site; print(site.getsitepackages())"
# ['PATH_TO_SITE_PACKAGES']
cp tree.py PATH_TO_SITE_PACKAGES/pattern3/text/
```
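
To sanity-check the environment afterwards, a quick import smoke test can help. This is a minimal sketch; the module names are inferred from the install steps above and may differ from what requirements.txt actually pins:

```python
# Quick import smoke test; module names are inferred from the
# install steps above and may not match requirements.txt exactly.
import fairseq   # the custom fork installed with --editable
import hunspell  # spellchecker binding (needs libhunspell-dev)
import pattern3  # inflection tools (with the patched tree.py)
import spacy

nlp = spacy.load("en")  # shortcut created by `python -m spacy download en`
print([t.text for t in nlp("Installation looks good !")])
```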

## Download & Preprocess Data

```bash
python preprocess.py
```
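
Besides downloading the shared-task data, the pre-training corpora are built by corrupting clean monolingual text with a realistic noising function (see the paper). A deliberately tiny illustration of token-level noising follows; the probabilities and operations here are made up for the sketch, and the repo's actual noiser is far more sophisticated:

```python
import random

def noise_tokens(tokens, p=0.1):
    """Illustrative noiser: randomly drop, duplicate, or swap tokens.
    Toy probabilities only; the real function models realistic errors."""
    out, i = [], 0
    while i < len(tokens):
        r = random.random()
        if r < p / 3:                        # deletion
            i += 1
            continue
        if r < 2 * p / 3:                    # duplication
            out += [tokens[i], tokens[i]]
        elif r < p and i + 1 < len(tokens):  # adjacent swap
            out += [tokens[i + 1], tokens[i]]
            i += 1
        else:                                # keep as-is
            out.append(tokens[i])
        i += 1
    return out

clean = "this is a clean sentence".split()
print(noise_tokens(clean), "->", clean)  # one (noisy, clean) training pair
```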

## Restricted Track

- Prepare data for the restricted track:

  ```bash
  python prepare.py --track 1
  ```

- Pre-train. Training automatically creates a checkpoint directory; fill its path into `{ckpt_dir}`. Also fill the number of GPUs used for training into `{ngpu}`.

  ```bash
  python train.py --track 1 --train-mode pretrain --model base --ngpu {ngpu}
  python evaluate.py --track 1 --subset valid --ckpt-dir {ckpt_dir}
  ```

- Train. Evaluation automatically creates an output directory; fill the pre-trained model's output directory into `{prev_model_output_dir}`.

  ```bash
  python train.py --track 1 --train-mode train --model base --ngpu {ngpu} \
      --lr 1e-4 --max-epoch 40 --reset --prev-model-output-dir {prev_model_output_dir}
  python evaluate.py --track 1 --subset valid --ckpt-dir {ckpt_dir}
  ```

- Fine-tune. Fill the path of the best validation report into `{prev_model_output_fpath}`; `error_type_control.py` then outputs a list of error types to remove, which goes into `{remove_error_type_lst}`. (A sketch of the error-type control idea follows this list.)

  ```bash
  python train.py --track 1 --train-mode finetune --model base --ngpu {ngpu} \
      --lr 5e-5 --max-epoch 80 --reset --prev-model-output-dir {prev_model_output_dir}
  python evaluate.py --track 1 --subset valid --ckpt-dir {ckpt_dir}
  python error_type_control.py --report {prev_model_output_fpath} \
      --max_error_types 10 --n_simulations 1000000
  python evaluate.py --track 1 --subset test --ckpt-fpath {ckpt_fpath} \
      --remove-unk-edits --remove-error-type-lst {remove_error_type_lst} \
      --apply-rerank --preserve-spell --max-edits 7
  ```
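
Per the flags above, `error_type_control.py` searches over subsets of at most 10 error types (up to 1,000,000 simulated subsets) for the removal set that maximizes the validation score. A toy illustration of that idea, with hypothetical per-type counts and an exhaustive search in place of sampling:

```python
import itertools

# Hypothetical per-error-type counts from a validation report.
counts = {
    "R:ORTH":  {"tp": 50, "fp": 10, "fn": 20},
    "M:PUNCT": {"tp": 5,  "fp": 40, "fn": 30},
    "U:DET":   {"tp": 30, "fp": 25, "fn": 15},
}

def f05(tp, fp, fn, beta=0.5):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0

def score(removed):
    # Dropping a type's edits removes its TPs and FPs, but every
    # dropped TP becomes a missed gold edit, i.e. an extra FN.
    kept = {t: c for t, c in counts.items() if t not in removed}
    tp = sum(c["tp"] for c in kept.values())
    fp = sum(c["fp"] for c in kept.values())
    fn = sum(c["fn"] for c in counts.values()) + \
         sum(counts[t]["tp"] for t in removed)
    return f05(tp, fp, fn)

# Exhaustive search here; the real script samples subsets instead.
subsets = (set(s) for k in range(len(counts) + 1)
           for s in itertools.combinations(counts, k))
best = max(subsets, key=score)
print("remove:", sorted(best), "->", round(score(best), 3))
```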

## Low Resource Track

- Prepare data for the low resource track:

  ```bash
  python prepare.py --track 3
  ```

- Pre-train:

  ```bash
  python train.py --track 3 --train-mode pretrain --model base --ngpu {ngpu}
  python evaluate.py --track 3 --subset valid --ckpt-dir {ckpt_dir}
  ```

- Train. (A sketch of the edit post-processing applied at test time follows this list.)

  ```bash
  python train.py --track 3 --train-mode finetune --model base --ngpu {ngpu} \
      --max-epoch 40 --prev-model-output-dir {prev_model_output_dir}
  python evaluate.py --track 3 --subset valid --ckpt-dir {ckpt_dir}
  python evaluate.py --track 3 --subset test --ckpt-fpath {ckpt_fpath} \
      --remove-unk-edits --remove-error-type-lst {remove_error_type_lst} \
      --apply-rerank --preserve-spell --max-edits 7
  ```
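
The `evaluate.py` flags prune the system's edits before scoring: `--remove-unk-edits` presumably drops edits that introduce unknown tokens, and `--max-edits` caps the number of edits kept per sentence. A hypothetical filter in that spirit; the edit representation here is an assumption, not the repo's actual one:

```python
def filter_edits(edits, max_edits=7, remove_unk=True):
    """edits: hypothetical (start, end, replacement) spans for one
    sentence; the repo actually works with ERRANT edit objects."""
    if remove_unk:
        # Drop edits whose replacement contains the model's <unk> token.
        edits = [e for e in edits if "<unk>" not in e[2]]
    # Trust at most max_edits edits per sentence; keep the earliest.
    return edits[:max_edits]

edits = [(0, 1, "The"), (3, 4, "<unk>"), (5, 6, "dogs")]
print(filter_edits(edits))  # [(0, 1, 'The'), (5, 6, 'dogs')]
```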

## A Note on fairseq

We ran our Transformer models using fairseq-0.6.1. We made several modifications to the package, including our own implementation of the copy-augmented Transformer model. You can find all of our modifications in `fairseq/MODIFICATIONS.md`.
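
At a high level, a copy-augmented Transformer mixes the decoder's generation distribution with a copy distribution over the source tokens, weighted by a learned gate. A self-contained sketch of that mixing step; the tensor shapes and function interface are assumptions, not the fork's actual API:

```python
import torch
import torch.nn.functional as F

def mix_copy_and_generation(gen_logits, attn_scores, src_tokens, copy_gate):
    """Mix generation and copy distributions (shapes are assumptions):
      gen_logits:  (batch, vocab)    decoder output logits
      attn_scores: (batch, src_len)  attention scores over the source
      src_tokens:  (batch, src_len)  source token ids (LongTensor)
      copy_gate:   (batch, 1)        pre-sigmoid balancing scalar
    """
    p_gen = F.softmax(gen_logits, dim=-1)
    p_attn = F.softmax(attn_scores, dim=-1)
    # Scatter attention mass onto the vocabulary ids of the source tokens.
    p_copy = torch.zeros_like(p_gen).scatter_add_(1, src_tokens, p_attn)
    alpha = torch.sigmoid(copy_gate)  # 1 -> copy, 0 -> generate
    return alpha * p_copy + (1 - alpha) * p_gen

# Toy shapes: the mixed output is a valid distribution (rows sum to 1).
batch, src_len, vocab = 2, 5, 100
mixed = mix_copy_and_generation(
    torch.randn(batch, vocab),
    torch.randn(batch, src_len),
    torch.randint(0, vocab, (batch, src_len)),
    torch.randn(batch, 1),
)
print(mixed.shape, mixed.sum(dim=-1))
```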

## Citation

If you use our code for research, please cite our work as:

```bibtex
@inproceedings{choe-etal-2019-neural,
    title = "A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning",
    author = "Choe, Yo Joong  and
      Ham, Jiyeon  and
      Park, Kyubyong  and
      Yoon, Yeoil",
    booktitle = "Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4423",
    pages = "213--227",
    abstract = "Grammatical error correction can be viewed as a low-resource sequence-to-sequence task, because publicly available parallel corpora are limited. To tackle this challenge, we first generate erroneous versions of large unannotated corpora using a realistic noising function. The resulting parallel corpora are subsequently used to pre-train Transformer models. Then, by sequentially applying transfer learning, we adapt these models to the domain and style of the test set. Combined with a context-aware neural spellchecker, our system achieves competitive results in both restricted and low resource tracks in the ACL 2019 BEA Shared Task. We release all of our code and materials for reproducibility.",
}
```