Code for the paper:
Takumi Gotou, Ryo Nagata, Masato Mita and Kazuaki Hanawa. “Taking the Correction Difficulty into Account in Grammatical Error Correction Evaluation.” In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).
@inproceedings{gotou-etal-2020-taking,
title = "Taking the Correction Difficulty into Account in Grammatical Error Correction Evaluation",
author = "Gotou, Takumi and
Nagata, Ryo and
Mita, Masato and
Hanawa, Kazuaki",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2020.coling-main.188",
doi = "10.18653/v1/2020.coling-main.188",
pages = "2085--2095",
}
GoToScorer (GTS) evaluates the performance of GEC systems while taking the difficulty of error correction into account.
It has been confirmed to work with Python 3.8.0.
python gotoscorer.py -ref <ref_file> -hyp <hyp_file>
-ref <ref_file> specifies a reference M2 file and -hyp <hyp_file> specifies a hypothesis M2 file. Both files can be generated with ERRANT. See demo/ref.m2 and demo/hyp.m2 for examples.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp.m2
Output:
----- Weighted Scores -----
Sys_name Prec. Recall F F0.5 Accuracy
0 : 1.0000 0.4444 0.6154 0.8000 0.5833
1 : 0.2500 0.2222 0.2353 0.2439 0.2500
2 : 0.0000 0.0000 0.0000 0.0000 0.1667
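The Prec., Recall, F and F0.5 columns appear to follow the usual definitions applied to difficulty-weighted counts. The following minimal Python sketch is not taken from gotoscorer.py, and its function names are illustrative. It recomputes the four columns from the weighted TP, FP and FN values that the -v option reports for the demo data. Returning 0.0 when a denominator is zero is an assumption, and Accuracy is left out because its normalization is not documented here.

```python
def precision(tp: float, fp: float) -> float:
    """Weighted precision; 0.0 when TP + FP is zero (assumption)."""
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def recall(tp: float, fn: float) -> float:
    """Weighted recall; 0.0 when TP + FN is zero (assumption)."""
    return tp / (tp + fn) if tp + fn > 0 else 0.0


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta=0.5 favors precision over recall."""
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom > 0 else 0.0


# Weighted TP, FP, FN of system 0, copied from the -v demo output.
tp, fp, fn = 1.3333, 0.0, 1.6667
p, r = precision(tp, fp), recall(tp, fn)
print(f"{p:.4f} {r:.4f} {f_beta(p, r):.4f} {f_beta(p, r, 0.5):.4f}")
```

With these inputs the sketch prints 1.0000 0.4444 0.6154 0.8000, which matches the row for system 0 in the demo output.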
- -v
The output includes TP, FP, FN and TN.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp.m2 -v
----- Weighted Scores -----
Sys_name TP FP FN TN Prec. Recall F F0.5 Accuracy
0 : 1.3333 0.0000 1.6667 1.0000 1.0000 0.4444 0.6154 0.8000 0.5833
1 : 0.6667 2.0000 2.3333 0.3333 0.2500 0.2222 0.2353 0.2439 0.2500
2 : 0.0000 2.6667 3.0000 0.6667 0.0000 0.0000 0.0000 0.0000 0.1667
- -name <sys_1,sys_2,...,sys_N>
Register system names so that the output shows them in place of the numeric system IDs. Separate the names with commas.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp.m2 -name CNN,LSTM,Transformer
----- Weighted Scores -----
Sys_name Prec. Recall F F0.5 Accuracy
CNN : 1.0000 0.4444 0.6154 0.8000 0.5833
LSTM : 0.2500 0.2222 0.2353 0.2439 0.2500
Transformer: 0.0000 0.0000 0.0000 0.0000 0.1667
- -cat {1,2,3}
Compute the mean and standard deviation of the correction difficulty for each error type, listed in descending order of mean difficulty. {1,2,3} is the granularity of the error types, which behaves the same as in ERRANT.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp.m2 -cat 3
----- Category Difficulty -----
Category Ave. Std. Freq.
U:NOUN 1.00 0.00 1
M:VERB 0.67 0.00 1
U:PREP 0.67 0.00 1
R:VERB 0.67 0.00 1
R:PRON 0.00 0.00 1
M:DET 0.00 0.00 1
- -heat <output_file>
Generate a heat map of error correction difficulty. See demo/heat_map.html for an example.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp.m2 -heat demo/heat_map.html
- -gen_w_file <output_file>
Generate a weight-file. Normally, the outputs of multiple systems are required to calculate the correction difficulty, but a single system can be evaluated by using a pre-made weight-file. See demo/weight.txt for an example.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp.m2 -gen_w_file demo/weight.txt
- -w_file <weight_file>
Evaluate a system using a weight-file.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp_1sys.m2 -w_file demo/weight.txt
----- Weighted Scores -----
Sys_name Prec. Recall F F0.5 Accuracy
0 : 1.0000 0.4444 0.6154 0.8000 0.5833
- -cv <output_file>
Visualize each chunk with its weight and error type, as shown in the following example. If you specify None as the file path, the output is printed to the terminal.
$ python gotoscorer.py -ref demo/ref.m2 -hyp demo/hyp.m2 -cv None
----- Chunk Visualizer -----
orig:   |    |We |         |discussing|   |about |   | its  |   | . |    |
gold:   |    |We |have been|discussing|   |      |   |  it  |   | . |    |
weight: |0.33|0.0|  0.67   |   0.33   |0.0| 0.67 |0.0| 0.0  |0.0|0.0|0.33|
cat:    |    |   | M:VERB  |          |   |U:PREP|   |R:PRON|   |   |    |

orig:   |   | I |   |have been|   |to |     |park|   |tomorrow|   | . |   |
gold:   |   | I |   |   go    |   |to | the |park|   |        |   | . |   |
weight: |0.0|0.0|0.0|  0.67   |0.0|0.0| 0.0 |0.0 |0.0|  1.0   |0.0|0.0|0.0|
cat:    |   |   |   | R:VERB  |   |   |M:DET|    |   | U:NOUN |   |   |   |
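The weight and cat rows of the chunk visualizer carry the information that -cat summarizes. As a rough illustration (an assumption about the aggregation, not the code in gotoscorer.py), the sketch below groups the demo weights by category, uses the population standard deviation, and sorts by mean difficulty in descending order. The function name category_difficulty is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# (category, weight) pairs read off the chunk visualizer demo output.
chunks = [
    ("M:VERB", 0.67), ("U:PREP", 0.67), ("R:PRON", 0.0),
    ("R:VERB", 0.67), ("M:DET", 0.0), ("U:NOUN", 1.0),
]


def category_difficulty(pairs):
    """Return (category, mean, std, frequency) rows, hardest category first."""
    by_cat = defaultdict(list)
    for cat, weight in pairs:
        by_cat[cat].append(weight)
    rows = [(cat, mean(ws), pstdev(ws), len(ws)) for cat, ws in by_cat.items()]
    # sorted() is stable, so categories with equal means keep input order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


for cat, ave, std, freq in category_difficulty(chunks):
    print(f"{cat} {ave:.2f} {std:.2f} {freq}")
```

On the demo data this lists the categories in the same order as the -cat 3 table, starting with U:NOUN.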
GTS requires a reference M2 file and a hypothesis M2 file. You can create both with ERRANT.
Example of generating M2 files from the demo data:
- Generating demo/hyp.m2
$ errant_parallel -orig demo/orig.txt -cor demo/sys1.txt demo/sys2.txt demo/sys3.txt -out demo/hyp.m2
- Generating demo/ref.m2
$ errant_parallel -orig demo/orig.txt -cor demo/gold.txt -out demo/ref.m2
In practice, the reference M2 file is rarely generated this way, since existing gold-standard annotation files are normally used as references.
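For readers unfamiliar with the M2 format: each source sentence is stored on an S line, followed by one A line per edit whose fields are separated by |||. The sketch below shows one way to read such a file in Python. The names Edit and parse_m2 are hypothetical, and the sample text is written by hand for illustration rather than copied from demo/ref.m2. Treating the last field as the id of the annotator (or of the system, in a multi-system hypothesis file) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Edit:
    start: int        # token offset where the edit begins
    end: int          # token offset where the edit ends (exclusive)
    error_type: str   # e.g. "M:VERB"
    correction: str   # replacement text; empty for a deletion
    annotator: int    # last field of the A line


def parse_m2(text: str) -> List[Tuple[List[str], List[Edit]]]:
    """Parse M2 text into (tokens, edits) pairs, one per sentence block."""
    sentences = []
    for block in text.strip().split("\n\n"):
        tokens: List[str] = []
        edits: List[Edit] = []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
        sentences.append((tokens, edits))
    return sentences


# Hand-written sample for illustration only.
sample = """S We discussing about its .
A 1 1|||M:VERB|||have been|||REQUIRED|||-NONE-|||0
A 2 3|||U:PREP||||||REQUIRED|||-NONE-|||0
A 3 4|||R:PRON|||it|||REQUIRED|||-NONE-|||0
"""

for tokens, edits in parse_m2(sample):
    print(" ".join(tokens), [e.error_type for e in edits])
```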
GTS provides a visualizer of error correction difficulty. Errors are colored according to the correction success rate, from pale (easier) to deep (harder). Red marks errors that should be corrected (TP, FN), and blue marks places where a system changed something that should not have been changed (FP). Hovering over a colored word shows the details of the correction: the error type, the correct correction, and the weight.
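The link between success rate and weight can be sketched as follows. This is an assumption inferred from the demo, where three systems yield weights of 0.0, 0.33, 0.67 and 1.0: the weight of a chunk is taken to be one minus the fraction of systems that handled it correctly. The function name chunk_difficulty is hypothetical and the snippet is not the implementation in gotoscorer.py.

```python
from typing import Sequence


def chunk_difficulty(outcomes: Sequence[bool]) -> float:
    """Return 1 - success rate for one chunk (assumed definition).

    outcomes[i] is True when system i handled the chunk correctly, i.e.
    it fixed an error chunk or left a correct chunk unchanged.
    """
    if not outcomes:
        raise ValueError("at least one system outcome is required")
    return 1.0 - sum(outcomes) / len(outcomes)


# Three hypothetical systems: one, all, none, and two of them succeed.
for outcomes in ([True, False, False], [True, True, True],
                 [False, False, False], [True, True, False]):
    print(round(chunk_difficulty(outcomes), 2))
```

Under this assumption a chunk that every system gets right has weight 0.0 and contributes nothing to the weighted scores, while a chunk that no system gets right has weight 1.0.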