
UNOFFICIAL implementation of "IMPARA: Impact-Based Metric for GEC Using Parallel Data", COLING2022




This is an UNOFFICIAL implementation of IMPARA, a reference-less metric for Grammatical Error Correction (GEC), proposed in the following paper:

    title = "{IMPARA}: Impact-Based Metric for {GEC} Using Parallel Data",
    author = "Maeda, Koki  and
      Kaneko, Masahiro  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "",
    pages = "3578--3588",

Trained Quality Estimation model

I have uploaded the trained Quality Estimation (QE) model for IMPARA to the Hugging Face Hub.
gotutiyan/IMPARA-QE: model card.
You can load it with BertForSequenceClassification.from_pretrained('gotutiyan/IMPARA-QE').

gotutiyan/IMPARA-QE achieves 95.93 for Pearson's correlation and 93.01 for Spearman's (with 'bert-base-cased' as the SE model). For more information, please see here.
Note that these results do not fully reproduce those reported in the paper.



If you don't specify --restore_dir, gotutiyan/IMPARA-QE will be used for the QE model.

python \
 --src <source_file> \
 --pred <prediction_file>

# If you use your custom QE model
# python \
#  --src <source_file> \
#  --pred <prediction_file> \
#  --restore_dir <directory of your custom QE model>


from transformers import AutoTokenizer, BertForSequenceClassification
from modeling import IMPARA, SimilarityEstimator

se_model = SimilarityEstimator('bert-base-cased')
qe_model = BertForSequenceClassification.from_pretrained('gotutiyan/IMPARA-QE')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
impara = IMPARA(se_model, qe_model, tokenizer, threshold=0.9).cuda()
src_sents = ['This is a sentence .', 'This is another sentence .']
pred_sents = ['This a is sentence .', 'This is another sentence .']
scores = impara.score(src_sents, pred_sents)
print(scores) # [0.16174763441085815, 0.8121877312660217]
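
The role of the `threshold=0.9` argument can be illustrated as follows. This is a minimal sketch, assuming the formulation in the IMPARA paper, where the quality estimation score of a prediction is kept only if its similarity to the source exceeds the threshold and is set to 0 otherwise. `gate_scores` and the numbers below are hypothetical and are not taken from `modeling.py`; whether the repository uses a strict comparison is also an assumption.

```python
from typing import List


def gate_scores(
    qe_scores: List[float],
    similarities: List[float],
    threshold: float = 0.9,
) -> List[float]:
    """Keep each QE score only when the source/prediction similarity
    is above the threshold; otherwise return 0.0 for that sentence."""
    if len(qe_scores) != len(similarities):
        raise ValueError("qe_scores and similarities must have the same length")
    return [
        qe if sim > threshold else 0.0
        for qe, sim in zip(qe_scores, similarities)
    ]


# Made-up values: the second prediction drifts too far from its source.
print(gate_scores([0.81, 0.40], [0.99, 0.55]))  # [0.81, 0.0]
```

Under this reading, a prediction that changes the meaning of the source gets a score of 0 no matter how fluent it is.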

Experiment Procedure

Confirmed to work on Python 3.8.10.

1. Install

requirements.txt may contain some unrelated modules, but all necessary modules are included.

pip install -r requirements.txt

2. Prepare data

mkdir data
cd data
tar -xf release2.3.1.tar.gz 
git clone
python M2Convertor/ \
 -f release2.3.1/original/data/official-preprocessed.m2 \
 -p release2.3.1/original/data/conll13

We will use release2.3.1/original/data/conll13.src and release2.3.1/original/data/conll13.trg.

3. Create supervision data for quality estimation model

Create supervision data from the CoNLL-13 parallel data.

The script tries to create 30 samples for each sentence pair, so it first yields about 40000 supervision instances (about 1380 sentences x 30 samples). The instances are then shuffled randomly, and only the first 4096 are used.

The paper says the data is split 8:1:1 into train, valid, and test sets, so I split it 3276:410:410.

pwd # IMPARA/data
python ../ \
 --src release2.3.1/original/data/conll13.src \
 --trg release2.3.1/original/data/conll13.trg \
 > all.tsv

cat all.tsv | awk 'NR==1,NR==410 {print}' > test.tsv
cat all.tsv | awk 'NR==411,NR==820 {print}' > valid.tsv
cat all.tsv | awk 'NR>=821 {print}' > train.tsv
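
The split arithmetic can be checked with a small sketch. It assumes the same order as the awk commands above (test first, then valid, then train) and repeats the shuffle-and-truncate step only for illustration, since in this repository that step happens inside the data creation script. `split_811` and its `seed` parameter are hypothetical names.

```python
import random
from typing import List, Tuple


def split_811(
    rows: List[str], n_total: int = 4096, seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle rows, keep the first n_total, and split them 8:1:1.

    Returns (train, valid, test). The test and valid parts are taken
    from the front, mirroring the awk line ranges.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    rows = rows[:n_total]
    n_eval = round(len(rows) * 0.1)  # 4096 * 0.1 = 409.6 -> 410
    test = rows[:n_eval]
    valid = rows[n_eval:2 * n_eval]
    train = rows[2 * n_eval:]
    return train, valid, test


# About 1380 sentences x 30 samples, as placeholder rows.
all_rows = [f"row{i}" for i in range(1380 * 30)]
train, valid, test = split_811(all_rows)
print(len(train), len(valid), len(test))  # 3276 410 410
```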

4. Train the quality estimation model

Set the OUTDIR variable to a suitable output directory. Since the training data is very small, training finishes in around 10 minutes on an A100.

This script also works on multiple GPUs with the Hugging Face Accelerate module, but a single GPU is enough for training. Here is an Accelerate configuration for training on a single GPU:

accelerate config
# In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
# Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
# Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
# Do you want to use DeepSpeed? [yes/NO]: NO
# What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
# Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: NO


pwd # IMPARA/

# OUTDIR=models/model
# mkdir -p ${OUTDIR}
# accelerate launch \
#  --train_file data/train.tsv \
#  --valid_file data/valid.tsv \
#  --epochs 10 \
#  --batch_size 32 \
#  --outdir ${OUTDIR} 

The results will be saved in the following layout.

├── best
│   ├── config.json
│   ├── impara_config.json
│   └── pytorch_model.bin
├── last
│   ├── config.json
│   ├── impara_config.json
│   └── pytorch_model.bin
└── log.json
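
Before passing a checkpoint directory to --restore_dir, it can help to confirm that it contains the files shown in the tree above. The following is a minimal sketch; `missing_files` is a hypothetical helper, and the file list is simply copied from the tree rather than from the loading code.

```python
from pathlib import Path
from typing import List

# File names taken from the directory tree above.
EXPECTED_FILES = ("config.json", "impara_config.json", "pytorch_model.bin")


def missing_files(restore_dir: str) -> List[str]:
    """Return the expected checkpoint files that are absent from restore_dir."""
    root = Path(restore_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means the directory looks complete.
print(missing_files("models/model/best"))
```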

5. Evaluate

Evaluation works the same way as described in the Usage section above.

Here is an example if your trained model is saved in models/model/best.

python \
 --src <source_file> \
 --pred <prediction_file> \
 --restore_dir models/model/best

Correlation with Human Evaluation

Here is an example of computing the correlation with the Expected Wins scores of Grundkiewicz+15.

mkdir data/conll14
cd data/conll14
bash ../../
cd ../../
bash path/to/model > result.txt
python --human Grundkiewicz15_EW.txt --system result.txt

The input to the script is 12 lines containing the scores of CAMB, CUUI, AMU, POST, NTHU, RAC, UMC, PKU, SJTU, UFC, IPN, and IITB, in that order.

The trained QE model gotutiyan/IMPARA-QE achieves 95.93 for Pearson's correlation and 93.01 for Spearman's.
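
For reference, the two correlation coefficients can be computed from two aligned lists of system-level scores as in the sketch below. It uses only the standard library and is not the repository's correlation script; the function names and the example numbers are made up for illustration.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four systems, in the same system order in both lists.
human = [0.6, 0.5, 0.4, 0.3]
metric = [0.9, 0.7, 0.8, 0.1]
print(pearson(human, metric), spearman(human, metric))
```

Both lists must follow the same system order, otherwise the coefficients are meaningless.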








