This repository contains example scripts and evaluation data for the paper Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora, accepted to the ACL'23 main conference.
We provide data for evaluating Icelandic GEC models, along with references to the data used for training the models in the paper.
All evaluation data for the models is included in the data/testsets directory. The .is_err file ending denotes the source (errored) file, and .is_corr the file containing the corrected references. We refer to the paper for a description of each test set.
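Each test set is thus a pair of parallel plain-text files, one sentence per line. A minimal sketch of loading such a pair into (source, reference) sentence pairs (the sample file names and contents below are illustrative, not part of the released data):

```python
from pathlib import Path

def load_testset(err_path, corr_path):
    """Read a parallel test set: line i of the .is_err file aligns
    with line i of the .is_corr file."""
    src = Path(err_path).read_text(encoding="utf-8").splitlines()
    ref = Path(corr_path).read_text(encoding="utf-8").splitlines()
    assert len(src) == len(ref), "parallel files must have equal length"
    return list(zip(src, ref))

# Tiny made-up example pair:
Path("sample.is_err").write_text("Þeta er dæmi\n", encoding="utf-8")
Path("sample.is_corr").write_text("Þetta er dæmi\n", encoding="utf-8")
pairs = load_testset("sample.is_err", "sample.is_corr")
print(pairs)  # [('Þeta er dæmi', 'Þetta er dæmi')]
```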
The Icelandic Error Corpus and the accompanying specialized corpora can be downloaded from the CLARIN website at the following URLs:
http://hdl.handle.net/20.500.12537/105 http://hdl.handle.net/20.500.12537/106 http://hdl.handle.net/20.500.12537/132 http://hdl.handle.net/20.500.12537/133
Note that sentences from these corpora appear in the following test sets provided with this submission: test.500.dyslex, test.500.L2, and test.500.child. If the test sets are used for evaluation, these sentences must be filtered out of the training data.
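One simple way to do this filtering (a sketch only, not necessarily the exact procedure used in the paper) is to collect all test-set sentences and drop any training pair whose source or target occurs in that set:

```python
def filter_overlap(train_pairs, test_sentences):
    """Drop training pairs whose source or target sentence also appears
    in a test set (exact match after whitespace normalization)."""
    blocked = {" ".join(s.split()) for s in test_sentences}
    return [
        (src, tgt) for src, tgt in train_pairs
        if " ".join(src.split()) not in blocked
        and " ".join(tgt.split()) not in blocked
    ]

train = [("hunddurinn hleypur", "hundurinn hleypur"),
         ("Þeta er dæmi", "Þetta er dæmi")]
test_sents = ["Þeta er dæmi", "Þetta er dæmi"]
print(filter_overlap(train, test_sents))
# [('hunddurinn hleypur', 'hundurinn hleypur')]
```

Stricter variants (e.g. filtering on near-duplicates or normalized casing) may be appropriate depending on how the training data was collected.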
For generating the synthetic error data, we used the Icelandic Gigaword Corpus. This corpus can be downloaded from CLARIN as well:
http://hdl.handle.net/20.500.12537/254
The paper describes how the synthetic data was generated by noising this corpus.
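The noising scripts in this repository implement the full pipeline; purely as a rough illustration of the idea (a toy noiser, not the error model from the paper), character-level noise can be injected along these lines:

```python
import random

def noise_sentence(sentence, p=0.1, seed=None):
    """Toy noiser: delete, duplicate, or swap characters with
    total probability p. For illustration only."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p / 3:                       # deletion
            pass
        elif r < 2 * p / 3:                 # duplication
            out.extend([chars[i], chars[i]])
        elif r < p and i + 1 < len(chars):  # swap with next character
            out.extend([chars[i + 1], chars[i]])
            i += 1
        else:                               # keep unchanged
            out.append(chars[i])
        i += 1
    return "".join(out)

print(noise_sentence("Hundurinn hleypur úti.", p=0.2, seed=42))
```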
In the example_scripts directory you can find scripts for training the different GEC models. Install the dependencies first:

pip install -r requirements.txt
For evaluation using GLEU, you need to install the GLEU package:

git clone https://github.com/cnap/gec-ranking.git

and run it with:

./gec-ranking/scripts/compute_gleu -r $REF_FILE -s $SRC_FILE -o $GENERATED_FILE > gleu_results
The scripts are organized in the following way:

- byt5 - scripts for synthetic-data training and finetuning of the byte-level ByT5 models. Uses the transformers library.
- mt5 - scripts for synthetic-data training and finetuning of the mT5 models. Uses the transformers library.
- mbart - scripts for synthetic-data training and finetuning of the mBART-ENIS models. Uses the fairseq library.
- noising - scripts for adding noise to the data. Has its own README.
- infer.py - script for inference using the trained ByT5 models. Uses the transformers library.
Note that most path arguments have been removed from the scripts; you need to add them manually.
For training the GEC models described in the paper, the following pre-trained models were used:
- mT5 (base) - Available on Hugging Face (https://huggingface.co/google/mt5-base)
- ByT5 (base) - Available on Hugging Face (https://huggingface.co/google/byt5-base)
- mBART-ENIS - This model is not currently published, but its training is described in the paper (see Appendix A). It was trained on top of the pre-trained mBART (https://github.com/facebookresearch/fairseq/tree/main/examples/mbart).
The best performing model (referred to as ByT5-Synth-550k+EC in the paper) is published on the CLARIN website:
http://hdl.handle.net/20.500.12537/255
This model is a ByT5-base model further trained for 550,000 updates on the synthetic error corpus and finetuned on the Icelandic Error Corpus.
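A byte-level model like ByT5 operates directly on UTF-8 bytes rather than on a subword vocabulary, which suits Icelandic characters such as þ, ð and accented vowels that each occupy two bytes. As a minimal sketch of this encoding (the +3 offset mirrors how the Hugging Face ByT5 tokenizer reserves ids 0-2 for pad/eos/unk, stated here as an illustration rather than a substitute for the real tokenizer):

```python
def byt5_style_ids(text, offset=3):
    """Map a string to ByT5-style token ids: UTF-8 byte values shifted
    by an offset reserved for special tokens (pad=0, eos=1, unk=2)."""
    return [b + offset for b in text.encode("utf-8")]

# 'þ' and 'ú' are single characters but two UTF-8 bytes each, so the
# byte-level sequence is longer than the character sequence:
print(len("þú"))                  # 2 characters
print(len(byt5_style_ids("þú")))  # 4 ids
print(byt5_style_ids("a"))        # [100]
```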
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we compare the use of a common subword unit vocabulary and byte-level encoding. Initial synthetic training data is created using an error-generating pipeline, and used for finetuning two subword-level models and one byte-level model. Models are then finetuned further on hand-corrected error corpora, including texts written by children, university students, dyslexic and second-language writers, and evaluated over different error types and origins. We show that a byte-level model enables higher correction quality than a subword approach, not only for simple spelling errors, but also for more complex semantic, stylistic and grammatical issues. In particular, initial training on synthetic corpora followed by finetuning on a relatively small parallel corpus of real-world errors helps the byte-level model correct a wide range of commonly occurring errors. Our experiments are run for the Icelandic language but should hold for other similar languages, particularly morphologically rich ones.
(Will be updated with the ACL Anthology citation once published.)
@article{ingolfsdottir-byte:2023,
author = "Svanhvít Lilja Ingólfsdóttir and Pétur Orri Ragnarsson and Haukur Páll Jónsson and Haukur Barri Símonarson and Vilhjálmur Þorsteinsson and Vésteinn Snæbjarnarson",
title = "{Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora}",
journal = {ArXiv},
year = {2023},
volume = {abs/2305.17906},
url = {https://arxiv.org/abs/2305.17906}}