This repository contains the following:
- the code to build the Ukrainian silver data for coreference resolution based on OntoNotes 5.0
- the manual translation of the Winograd Schema Challenge dataset into Ukrainian
The experiments were conducted using OntoNotes 5.0 data. The corpus can be downloaded from the Linguistic Data Consortium (catalog number LDC2013T19); registration is required.
The provided code expects data in jsonlines format, so some preprocessing is necessary.
- Extract the OntoNotes 5.0 archive. If it is in the repository's root directory:
  tar -xzvf ontonotes-release-5.0_LDC2013T19.tgz
- Switch to a Python 2.7 environment (one where python runs version 2.7). This is necessary for the CoNLL scripts to run correctly. To do it with conda:
  conda create -y --name py27 python=2.7 && conda activate py27
- Run the CoNLL data preparation scripts:
  sh preprocessing/get_conll_data.sh ontonotes-release-5.0 ontonotes-ua
- Download the CoNLL scorers and Stanford Parser:
  sh preprocessing/get_third_party.sh
- Prepare your environment. To do it with conda:
  conda create -y --name ua-coref-data python=3.7 openjdk perl
  conda activate ua-coref-data
  python -m pip install -r requirements.txt
- Build the corpus in jsonlines format:
  python preprocessing/convert_to_jsonlines.py ontonotes-ua/conll-2012/ --out-dir ontonotes-ua
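For orientation, a jsonlines file holds one JSON object per line. The sketch below shows one way to read such a file in Python. The helper name `read_jsonlines` and the placeholder records are illustrative assumptions; the actual keys written by the conversion script are not documented here, so none are assumed.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def read_jsonlines(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a throwaway file with placeholder records (not real corpus data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonlines")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"id": 1}) + "\n")
        handle.write(json.dumps({"id": 2}) + "\n")
    records = list(read_jsonlines(demo_path))

print(len(records))
```

Reading line by line keeps memory use low, which may matter for the larger splits.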
Run the scripts to translate the sentences, align the mentions, and project the annotations from English to Ukrainian:
python scripts/build_silver_data.py -train -dev -test
Processing the whole corpus may take a while because of how the MT model is currently used, so you can leave out some of the split flags (-train, -dev, -test) if necessary.
The machine translation model can be specified using the --translation_model flag. Note that our experiments used the Helsinki-NLP/opus-mt-en-uk model, with alignment based on the cross-attention of the 0th head of the 1st layer. Using a different model may require changing the head and layer as well.
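To illustrate the align-and-project idea, here is a minimal sketch that works on a toy attention matrix rather than on real model output. It assumes rows correspond to target (Ukrainian) tokens and columns to source (English) tokens, and it uses a plain argmax per row. Both function names and the argmax rule are assumptions made for illustration; the logic in scripts/build_silver_data.py may differ.

```python
from typing import List, Optional, Tuple


def align_by_argmax(attention: List[List[float]]) -> List[int]:
    """For each target token (row), return the index of the source token
    (column) with the largest attention weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention]


def project_span(
    alignment: List[int], src_start: int, src_end: int
) -> Optional[Tuple[int, int]]:
    """Map an inclusive source span to the smallest target span that covers
    every target token aligned to a source token inside it."""
    targets = [t for t, s in enumerate(alignment) if src_start <= s <= src_end]
    if not targets:
        return None  # no target token is aligned to this mention
    return min(targets), max(targets)


# Toy matrix: 3 target tokens (rows) x 4 source tokens (columns).
attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
alignment = align_by_argmax(attention)
span = project_span(alignment, 2, 3)
print(alignment, span)
```

A source mention with no aligned target token yields `None` here; how such cases are handled in the actual pipeline is not specified in this README.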
The dataset contains:
Split | Documents | Sentences | Tokens | Mentions | Clusters |
---|---|---|---|---|---|
train | 2,802 | 75,187 | 1,158,965 | 161,010 | 35,025 |
dev | 343 | 9,603 | 146,210 | 20,168 | 4,533 |
test | 348 | 9,479 | 151,542 | 20,522 | 4,513 |
TOTAL | 3,493 | 94,269 | 1,456,717 | 201,700 | 44,071 |
wsc-ua contains manual translations of 263 Winograd schemas from the WSC dataset in csv and jsonlines formats.
- text: the Winograd schema in Ukrainian, tokenized
- options: the two entity options that the pronoun may be referring to
- label: the index of the correct option in options
- pronoun: the pronoun in the sequence to be resolved
- pronoun_loc: the index of the ambiguous pronoun in text
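The fields above could be used as in the following sketch. It assumes that text is stored as a list of tokens, that options holds plain strings, and that label and pronoun_loc are 0-based indices; none of this is confirmed here, so check the files before relying on it. The record is a made-up English example, not an entry from wsc-ua, and `correct_referent` is a hypothetical helper.

```python
import json
from typing import Any, Dict

# Made-up English record for illustration only; not taken from wsc-ua.
line = json.dumps({
    "text": ["The", "trophy", "does", "not", "fit", "in", "the",
             "suitcase", "because", "it", "is", "too", "big", "."],
    "options": ["The trophy", "the suitcase"],
    "label": 0,
    "pronoun": "it",
    "pronoun_loc": 9,
})


def correct_referent(record: Dict[str, Any]) -> str:
    """Return the option selected by `label`, after checking that
    `pronoun_loc` really points at `pronoun` in the token list."""
    tokens = record["text"]
    if tokens[record["pronoun_loc"]] != record["pronoun"]:
        raise ValueError("pronoun_loc does not point at the pronoun")
    return record["options"][record["label"]]


record = json.loads(line)
answer = correct_referent(record)
print(answer)
```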
No equivalent translations were found for 22 original schemas, so they were excluded:
87-88, 217-218, 221-222, 231-232, 233-234, 237-238, 243-244, 245-246, 247-248, 274-275, 276-277
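If the excluded schemas need to be filtered out programmatically, the ranges above can be expanded into individual numbers, as in this sketch. The helper name `expand_ranges` is hypothetical, and the sketch makes no assumption about which listing of the original WSC the numbers index into.

```python
from typing import List

# The excluded ranges exactly as listed above.
EXCLUDED = (
    "87-88, 217-218, 221-222, 231-232, 233-234, 237-238, "
    "243-244, 245-246, 247-248, 274-275, 276-277"
)


def expand_ranges(spec: str) -> List[int]:
    """Expand a comma-separated list of inclusive 'start-end' ranges."""
    ids: List[int] = []
    for part in spec.split(","):
        start, end = (int(value) for value in part.strip().split("-"))
        ids.extend(range(start, end + 1))
    return ids


excluded_ids = expand_ranges(EXCLUDED)
print(len(excluded_ids))
```

The length of the resulting list can be compared with the 22 excluded schemas stated above.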
Data and code improvements are welcome. Please submit a pull request.
@inproceedings{kuchmiichuk-2023-silver,
title = "Silver Data for Coreference Resolution in {U}krainian: Translation, Alignment, and Projection",
author = "Kuchmiichuk, Pavlo",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.8",
pages = "62--72",
abstract = "Low-resource languages continue to present challenges for current NLP methods, and multilingual NLP is gaining attention in the research community. One of the main issues is the lack of sufficient high-quality annotated data for low-resource languages. In this paper, we show how labeled data for high-resource languages such as English can be used in low-resource NLP. We present two silver datasets for coreference resolution in Ukrainian, adapted from existing English data by manual translation and machine translation in combination with automatic alignment and annotation projection. The code is made publicly available.",
}