This repository contains example scripts and evaluation data for the paper Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora, accepted to the ACL'23 main conference.
We provide data for evaluating Icelandic GEC models, along with references to the data used for training the models in the paper.
All evaluation data for the models is included in the data/testsets directory. The .is_err file ending denotes the source (errored) file, and .is_corr the file containing the corrected references. We refer to the paper for a description of each test set.
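Each test set is thus a pair of parallel plain-text files, one sentence per line. A minimal sketch of loading such a pair into (source, reference) sentence pairs (the sample file names and contents below are illustrative, not part of the released data):

```python
from pathlib import Path

def load_testset(err_path, corr_path):
    """Read a parallel test set: line i of the .is_err file aligns
    with line i of the .is_corr file."""
    src = Path(err_path).read_text(encoding="utf-8").splitlines()
    ref = Path(corr_path).read_text(encoding="utf-8").splitlines()
    assert len(src) == len(ref), "parallel files must have equal length"
    return list(zip(src, ref))

# Tiny made-up example pair:
Path("sample.is_err").write_text("Þeta er dæmi\n", encoding="utf-8")
Path("sample.is_corr").write_text("Þetta er dæmi\n", encoding="utf-8")
pairs = load_testset("sample.is_err", "sample.is_corr")
print(pairs)  # [('Þeta er dæmi', 'Þetta er dæmi')]
```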
The Icelandic Error Corpus and the accompanying specialized corpora can be downloaded from the CLARIN website at the following URLs:
http://hdl.handle.net/20.500.12537/105 http://hdl.handle.net/20.500.12537/106 http://hdl.handle.net/20.500.12537/132 http://hdl.handle.net/20.500.12537/133
Note that sentences from these corpora appear in the following test sets provided with this submission: test.500.dyslex, test.500.L2, and test.500.child. If the test sets are used for evaluation, these sentences must be filtered out of the training data.
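One simple way to do this filtering (a sketch only, not necessarily the exact procedure used in the paper) is to collect all test-set sentences and drop any training pair whose source or target occurs in that set:

```python
def filter_overlap(train_pairs, test_sentences):
    """Drop training pairs whose source or target sentence also appears
    in a test set (exact match after whitespace normalization)."""
    blocked = {" ".join(s.split()) for s in test_sentences}
    return [
        (src, tgt) for src, tgt in train_pairs
        if " ".join(src.split()) not in blocked
        and " ".join(tgt.split()) not in blocked
    ]

train = [("hunddurinn hleypur", "hundurinn hleypur"),
         ("Þeta er dæmi", "Þetta er dæmi")]
test_sents = ["Þeta er dæmi", "Þetta er dæmi"]
print(filter_overlap(train, test_sents))
# [('hunddurinn hleypur', 'hundurinn hleypur')]
```

Stricter variants (e.g. filtering on near-duplicates or normalized casing) may be appropriate depending on how the training data was collected.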
For generating the synthetic error data, we used the Icelandic Gigaword Corpus. This corpus can be downloaded from CLARIN as well:
http://hdl.handle.net/20.500.12537/254
The paper describes how the synthetic data was generated by noising this corpus.
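The noising scripts in this repository implement the full pipeline; purely as a rough illustration of the idea (a toy noiser, not the error model from the paper), character-level noise can be injected along these lines:

```python
import random

def noise_sentence(sentence, p=0.1, seed=None):
    """Toy noiser: delete, duplicate, or swap characters with
    total probability p. For illustration only."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p / 3:                       # deletion
            pass
        elif r < 2 * p / 3:                 # duplication
            out.extend([chars[i], chars[i]])
        elif r < p and i + 1 < len(chars):  # swap with next character
            out.extend([chars[i + 1], chars[i]])
            i += 1
        else:                               # keep unchanged
            out.append(chars[i])
        i += 1
    return "".join(out)

print(noise_sentence("Hundurinn hleypur úti.", p=0.2, seed=42))
```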
In the example_scripts directory you can find scripts for training the different GEC models. Install the dependencies first:

pip install -r requirements.txt
For evaluation using GLEU, you need to install the GLEU package:

git clone https://github.com/cnap/gec-ranking.git

and run it with:

./gec-ranking/scripts/compute_gleu -r $REF_FILE -s $SRC_FILE -o $GENERATED_FILE > gleu_results
The scripts are organized in the following way:

- byt5 - scripts for synthetic-data training and finetuning of the byte-level ByT5 models. Uses the transformers library.
- mt5 - scripts for synthetic-data training and finetuning of the mT5 models. Uses the transformers library.
- mbart - scripts for synthetic-data training and finetuning of the mBART-ENIS models. Uses the fairseq library.
- noising - scripts for adding noise to the data. Has its own README.
- infer.py - script for inference using the trained ByT5 models. Uses the transformers library.
Note that most path arguments have been removed from the scripts; you need to add them manually.
For training the GEC models described in the paper, the following pre-trained models were used:
- mT5 (base) - Available on Hugging Face (https://huggingface.co/google/mt5-base)
- ByT5 (base) - Available on Hugging Face (https://huggingface.co/google/byt5-base)
- mBART-ENIS - This model is not currently published, but its training is described in the paper (see Appendix A). It was trained on top of the pre-trained mBART (https://github.com/facebookresearch/fairseq/tree/main/examples/mbart).
The best performing model (referred to as ByT5-Synth-550k+EC in the paper) is published on the CLARIN website:
http://hdl.handle.net/20.500.12537/255
This model is a ByT5-base model further trained for 550,000 updates on the synthetic error corpus and finetuned on the Icelandic Error Corpus.
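A byte-level model like ByT5 operates directly on UTF-8 bytes rather than on a subword vocabulary, which suits Icelandic characters such as þ, ð and accented vowels that each occupy two bytes. As a minimal sketch of this encoding (the +3 offset mirrors how the Hugging Face ByT5 tokenizer reserves ids 0-2 for pad/eos/unk, stated here as an illustration rather than a substitute for the real tokenizer):

```python
def byt5_style_ids(text, offset=3):
    """Map a string to ByT5-style token ids: UTF-8 byte values shifted
    by an offset reserved for special tokens (pad=0, eos=1, unk=2)."""
    return [b + offset for b in text.encode("utf-8")]

# 'þ' and 'ú' are single characters but two UTF-8 bytes each, so the
# byte-level sequence is longer than the character sequence:
print(len("þú"))                  # 2 characters
print(len(byt5_style_ids("þú")))  # 4 ids
print(byt5_style_ids("a"))        # [100]
```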
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we compare the use of a common subword unit vocabulary and byte-level encoding. Initial synthetic training data is created using an error-generating pipeline, and used for finetuning two subword-level models and one byte-level model. Models are then finetuned further on hand-corrected error corpora, including texts written by children, university students, dyslexic and second-language writers, and evaluated over different error types and origins. We show that a byte-level model enables higher correction quality than a subword approach, not only for simple spelling errors, but also for more complex semantic, stylistic and grammatical issues. In particular, initial training on synthetic corpora followed by finetuning on a relatively small parallel corpus of real-world errors helps the byte-level model correct a wide range of commonly occurring errors. Our experiments are run for the Icelandic language but should hold for other similar languages, particularly morphologically rich ones.
(Will be updated with the ACL Anthology citation once published.)
@article{ingolfsdottir-byte:2023,
author = "Svanhvít Lilja Ingólfsdóttir and Pétur Orri Ragnarsson and Haukur Páll Jónsson and Haukur Barri Símonarson and Vilhjálmur Þorsteinsson and Vésteinn Snæbjarnarson",
title = "{Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora}",
journal = {ArXiv},
year = {2023},
volume = {abs/2305.17906},
url = {https://arxiv.org/abs/2305.17906}}