kasnerz/accuracySharedTask_CUNI-UPF

Token-Level Error Detection for Table-to-Text Generation

Code for the CUNI & UPF submission for the Shared Task in Evaluating Accuracy.

The code allows training, evaluating, and running models for automatic token-level detection of factual errors in generated descriptions of Rotowire tabular data.

For the detailed description of our system, please refer to:

Zdeněk Kasner, Simon Mille & Ondřej Dušek (2021): Text-in-Context: Token-Level Error Detection for Table-to-Text Generation. In: Proceedings of the 14th International Conference on Natural Language Generation (INLG 2021).

Quickstart

If you want to quickly try out the trained model using a command-line interface, follow these steps:

  1. Install the requirements:
pip install -r requirements.txt
  2. Download the trained checkpoint used for the submission: https://owncloud.cesnet.cz/index.php/s/lrmTSyBP9GobnVO/download
  3. Place the checkpoint in the experiments directory:
cd accuracySharedTask_CUNI-UPF
mkdir -p experiments/full_compact_r0.25_ctx40
mv DOWNLOAD_DIR/model.ckpt experiments/full_compact_r0.25_ctx40
  4. Unpack the data (generated facts + preprocessed Rotowire data):
unzip data_unpack.zip
  5. For annotating the errors using automatic context retrieval for the game GAME_ID, run the following command:
./interact.py \
    --game_idx GAME_ID \
    --experiment "full_compact_r0.25_ctx40" \
    --templates compact \
    --ctx 40

A prompt for inputting the sentence ([Sentence]:) will be displayed. You can use games.csv for sentence examples; use the number in the LINE_ID_FROM_TEST_SET column as the GAME_ID. If you want to use a custom context, omit the --game_idx argument and input the context manually.

Note: Input sentences are tokenized automatically by default. If your sentences are already tokenized, use the parameter --is_tokenized.

Example:

  • GAME_ID=563
  • sentence: The Boston Celtics defeated the Los Angeles Lakers, 113-107, at TD Garden on Wednesday.
  • output (O=OK, the token Wednesday is tagged with the NAME error):
[('The', 'O'),
 ('Boston', 'O'),
 ('Celtics', 'O'),
 ('defeated', 'O'),
 ('the', 'O'),
 ('Los', 'O'),
 ('Angeles', 'O'),
 ('Lakers', 'O'),
 (',', 'O'),
 ('113-107', 'O'),
 (',', 'O'),
 ('at', 'O'),
 ('TD', 'O'),
 ('Garden', 'O'),
 ('on', 'O'),
 ('Wednesday', 'NAME'),
 ('.', 'O')]

Model Overview

Rule-based NLG

First, a rule-based NLG system is applied to the original Rotowire tabular data. The system is based either on hand-crafted templates (simple) or on FORGe NLG (compact).

Note that the code for running the rule-based NLG systems is not included in the repository (the rule-based NLG system outputs are precomputed in data_unpack.zip/context).

Automatic token-level error detection

For detecting errors in a sentence, the c facts generated for the respective game with the highest semantic similarity to the sentence are selected (the number of facts c is set by the parameter --ctx). The semantic similarity is computed using paraphrase-distilroberta-base-v2 from SentenceTransformers.
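
A minimal sketch of this retrieval step, assuming facts are plain strings (the function name and fact format are illustrative, not the repository's actual code):

# Sketch of top-c fact retrieval via SentenceTransformers; illustrative only.
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("paraphrase-distilroberta-base-v2")

def select_context(sentence, facts, ctx=40):
    """Return the ctx facts most similar to the sentence (cosine similarity)."""
    sent_emb = sim_model.encode(sentence, convert_to_tensor=True)
    fact_embs = sim_model.encode(facts, convert_to_tensor=True)
    scores = util.cos_sim(sent_emb, fact_embs)[0]
    top = scores.topk(k=min(ctx, len(facts)))
    return [facts[i] for i in top.indices.tolist()]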

The sentence and the context are given as input to a pretrained RoBERTa LM with a token classification head, which outputs an error label for each token.

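As a rough sketch of this step (assuming HuggingFace Transformers; the label set below is based on the shared-task error categories, and the exact sentence/context pairing in the repository may differ):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set following the shared-task error categories.
LABELS = ["O", "NAME", "NUMBER", "WORD", "CONTEXT", "NOT_CHECKABLE", "OTHER"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tagger = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS))

def tag_tokens(sentence, context):
    # The sentence and the retrieved context are encoded as a text pair.
    enc = tokenizer(sentence, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = tagger(**enc).logits
    preds = logits.argmax(dim=-1)[0].tolist()
    # Note: these are subword tokens; mapping back to word-level tags is omitted.
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return [(tok, LABELS[p]) for tok, p in zip(tokens, preds)]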

Usage Instructions

Requirements

  • Python 3 & pip
  • packages: see requirements.txt for the full list

All packages can be installed using

pip install -r requirements.txt

Preprocessing

Unpacking the data

The facts generated with the rule-based NLG systems and the preprocessed Rotowire data are included in data_unpack.zip. The file needs to be unpacked prior to preprocessing:

unzip data_unpack.zip

Preprocessing human-annotated data

Use the script generate.py for preprocessing the human-annotated data from the shared task.

Example (using compact templates):

./generate.py \
    --templates compact \
    --annotations gsml.csv \
    --games games.csv \
    --output ann_compact

For generating the test splits for the 6-fold cross-validation, use the parameter --xval_splits 6. The splits will be placed in subdirectories numbered from 0 to xval_splits-1. Cross-validation is necessary only for development; if a separate test dataset is used (see the test directory), all human-annotated data can be used for training.

Generating synthetic data

Use the script generate.py for generating the synthetic data. The data is generated by adding noise to the original Rotowire human-written references.

Example (using compact templates and entity modification rate 0.25):

for SPLIT in train dev test; do
    ./generate.py \
        --modification_rate 0.25 \
        --templates compact \
        --output syn_compact_r0.25 \
        --split $SPLIT
done

The generated data will appear in the data folder.
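
As a hypothetical sketch of the noising idea (the entity typing, input format, and replacement strategy are all assumptions; the actual logic lives in generate.py):

import random

def add_noise(tokens, entities_by_type, rate=0.25):
    """Replace entity mentions at the given rate, labeling them as errors.

    Assumes `tokens` is a list of (token, entity_type_or_None) pairs and
    `entities_by_type` maps each entity type to candidate replacement values.
    """
    noised, labels = [], []
    for token, ent_type in tokens:
        if ent_type is not None and random.random() < rate:
            token = random.choice(entities_by_type[ent_type])
            labels.append(ent_type)  # the swapped token becomes a synthetic error
        else:
            labels.append("O")
        noised.append(token)
    return noised, labels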

Note that the process may take a while. For generating only a subset of the data (e.g. for parallelizing the process), use the parameters --start and --end denoting the first and last game numbers, respectively.

Rotowire split    # of games
train             3396
dev               726
test              727

For both human-annotated and synthetic data, versions of the dataset with context sizes 5, 10, 20, and 40 are generated automatically. If you wish to specify a custom context size, use the parameter --ctx. The context size is denoted in the folder name by the suffix _ctx{CONTEXT_SIZE}.

Training

Context size 40 is used in all the examples.

Human-annotated data

For training the model on the human-annotated data, use the following command:

./train.py \
    --dataset ann_compact_ctx40 \
    --experiment ann_compact_ctx40 \
    --max_epochs 10 \
    --gpus 1

Synthetic data

For training the model on the synthetic data (EMR=0.25), use the following command:

./train.py \
    --dataset syn_compact_r0.25_ctx40 \
    --experiment syn_compact_r0.25_ctx40 \
    --max_epochs 10 \
    --gpus 1

Human-annotated + synthetic data

For training the model on both the synthetic and human-annotated data:

  1. Train the model on the synthetic data (see the previous section).
  2. Load the trained checkpoint and further train it on human-annotated data:
./train.py \
    --dataset ann_compact_ctx40 \
    --experiment syn_ann_compact_ctx40 \
    --max_epochs 10 \
    --gpus 1 \
    --model_path experiments/syn_compact_r0.25_ctx40/model.ckpt

Decoding

For applying the trained model (syn+ann) to the generated sentences provided in games.csv, use the following command:

./decode.py \
    --experiment syn_ann_compact_ctx40 \
    --input_file games.csv \
    --templates compact \
    --ctx 40

The model will produce the file out.csv with annotations in the experiment directory.

Note: the model outputs an individual error category for each token. Due to the way precision is computed, this may hurt the precision score for multi-token errors. To fix this, use the script ./postprocess_submission.py to postprocess the model output:

./postprocess_submission.py experiments/syn_ann_compact_ctx40/out.csv
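
The core idea, sketched below under the assumption that consecutive tokens sharing the same non-O label form a single error span (the actual script may differ):

def merge_spans(tagged):
    """Merge consecutive tokens with the same non-'O' label into one span."""
    spans = []
    for i, (token, label) in enumerate(tagged):
        if label == "O":
            continue
        if spans and spans[-1]["label"] == label and spans[-1]["end"] == i:
            spans[-1]["tokens"].append(token)
            spans[-1]["end"] = i + 1
        else:
            spans.append({"label": label, "tokens": [token],
                          "start": i, "end": i + 1})
    return spans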

Evaluation

An evaluation script provided by the shared task organizers is included in the repository. The script is slightly modified, producing a more compact output and including an option to average results over cross-validation splits.

Evaluating the out.csv from the syn model (without cross-validation):

./evaluate.py \
    --gsml gsml.csv \
    --submitted experiments/syn_compact_r0.25_ctx40/out.csv \
    --token_lookup token_lookup.yaml

Evaluating the out.csv from the syn+ann model (averaging over 6 cross-validation splits):

./evaluate.py \
    --gsml gsml.csv \
    --submitted experiments/syn_ann_compact_ctx40/out.csv \
    --token_lookup token_lookup.yaml \
    --avg_xval 6

Miscellaneous

Submission

The file submission.csv contains the output from our system for the shared task. The test data are included in the test directory.

Visualization

The script utils/gsml_to_html.py can be used for generating an HTML file that visualizes the identified errors.

Sample output: [overview image]

Colors:

  • green: error is identified correctly
  • orange: error is identified although the token is ok
  • red: error is not identified
  • purple: error category is identified incorrectly

Citation

@inproceedings{kasner-etal-2021-text,
    title = "Text-in-Context: Token-Level Error Detection for Table-to-Text Generation",
    author = "Kasner, Zden{\v{e}}k  and
      Mille, Simon  and
      Du{\v{s}}ek, Ond{\v{r}}ej",
    booktitle = "Proceedings of the 14th International Conference on Natural Language Generation",
    month = aug,
    year = "2021",
    address = "Aberdeen, Scotland, UK",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.inlg-1.25",
    pages = "259--265"
}

