GEO-Aligned is a semantic parsing dataset that was obtained by augmenting the popular GeoQuery data with word alignment information. You can read about it in our *SEM paper.
It comes in three languages:
- English, based on the original dataset
- German, based on multilingual GeoQuery
- Italian, based on translations that were provided by two native speaker annotators as part of this work
The alignments were provided by expert annotators who found that the majority of the examples can be aligned monotonically.
The dataset can be found in the data folder and it comes in .csv format. There are five columns:
- ID : contains the ID of each example
- NL : contains the natural language queries
- MR : contains the meaning representation programs
- ALIGNMENT : contains the word alignments
- MONOTONIC : contains labels indicating whether the alignment is monotonic
There are two versions for each language:
- Files named {language}.csv contain the dataset with constants (city names, state names, etc.)
- Files named {language}_anon.csv contain the anonymized version of the data
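As a minimal sketch of how one of these files might be loaded, the snippet below builds the file name from the naming scheme above and reads the rows with Python's standard csv module. It assumes that each file starts with a header row using exactly the column names ID, NL, MR, ALIGNMENT and MONOTONIC, that fields are comma-separated, and that the files are UTF-8 encoded; none of this is confirmed here. The helper names dataset_path and load_examples, and the language value "english" in the usage comment, are hypothetical.

```python
import csv
from pathlib import Path

# Column names as listed in this README; the exact header spelling is an assumption.
COLUMNS = ["ID", "NL", "MR", "ALIGNMENT", "MONOTONIC"]


def dataset_path(language, anonymized=False, data_dir="data"):
    """Build the path to {language}.csv or {language}_anon.csv inside data_dir."""
    suffix = "_anon" if anonymized else ""
    return Path(data_dir) / f"{language}{suffix}.csv"


def load_examples(csv_path):
    """Read a dataset file into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early if the header does not match the assumed column names.
        missing = [name for name in COLUMNS if name not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing expected columns: {missing}")
        return list(reader)


# Hypothetical usage; "english" is an assumed value for {language}:
# examples = load_examples(dataset_path("english", anonymized=True))
```

All values are returned as strings, so the ALIGNMENT and MONOTONIC fields would still need to be parsed according to their actual format in the files.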
The splits folder contains lists of IDs for three different splits: a question split, a query split and a length split. The test set IDs are found in the test.txt files. Moreover, we provide three development sets, dev1.txt, dev2.txt and dev3.txt, which can be used for validation.
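The ID lists can be combined with the dataset rows to obtain a concrete test or development set. The sketch below assumes that each split file lists one ID per line and that those IDs match the values of the ID column as plain strings; the layout of the splits folder is not described here, so the path in the usage comment is purely hypothetical, as are the helper names read_ids and select_examples.

```python
from pathlib import Path


def read_ids(path):
    """Read a split file into a set of IDs, assuming one ID per line."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def select_examples(examples, ids):
    """Keep the examples whose ID is in ids, preserving their original order.

    examples is expected to be a list of dicts with an "ID" key, for instance
    rows read from one of the .csv files with csv.DictReader.
    """
    return [example for example in examples if example["ID"] in ids]


# Hypothetical usage; the path below is an assumed layout, not a documented one:
# test_ids = read_ids("splits/question/test.txt")
# test_set = select_examples(examples, test_ids)
```

Under these assumptions, the same two calls would apply to dev1.txt, dev2.txt and dev3.txt.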
If you find GEO-Aligned useful in your research, please cite this paper:
@inproceedings{locatelli-quattoni-2022-measuring,
title = "Measuring Alignment Bias in Neural Seq2seq Semantic Parsers",
author = "Locatelli, Davide and
Quattoni, Ariadna",
booktitle = "Proceedings of the 11th Joint Conference on Lexical and Computational Semantics",
month = jul,
year = "2022",
address = "Seattle, Washington",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.starsem-1.17",
doi = "10.18653/v1/2022.starsem-1.17",
pages = "200--207"
}