
Adding nli_tr dataset #787

Merged: 11 commits into huggingface:master, Nov 12, 2020
Conversation

@e-budur (Contributor) commented on Nov 1, 2020

Hello,

In this pull request, we have implemented the necessary interface to add our recent dataset, NLI-TR. The dataset will be presented in a full paper at EMNLP 2020 this month. [arXiv link]

The dataset is a neural machine translation of the SNLI and MultiNLI datasets into Turkish, so we followed a format similar to that of the original datasets hosted on the HuggingFace datasets hub.

Our dataset is designed to be accessed as follows, following the interface of the GLUE dataset, which provides multiple datasets through a single interface on the HuggingFace datasets hub:

from datasets import load_dataset
multinli_tr = load_dataset("nli_tr", "multinli_tr")
snli_tr = load_dataset("nli_tr", "snli_tr")
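
The splits can then be indexed like any other dataset on the hub. As a quick sanity check (assuming the split names mirror the original corpora, i.e. train/validation/test for snli_tr and train/validation_matched/validation_mismatched for multinli_tr):

print(snli_tr["train"][0])
print(multinli_tr["validation_matched"][0])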

Thanks for your help in reviewing our pull request.

@lhoestq (Member) left a comment

Looks good to me!
Thanks for adding this one ;)

I left minor comments. Once they're resolved we can merge it :)

Two review threads on datasets/nli_tr/nli_tr.py (outdated, resolved)
@e-budur (Contributor, Author) commented on Nov 10, 2020

Thank you, @lhoestq, for taking the time to review our pull request. We appreciate your help.

We've made the changes you described and hope the PR is now ready to be merged. Please let me know if you have any additional requests for revisions.

Comment on lines 32 to 47
self.description = "The Natural Language Inference in Turkish (NLI-TR) is a set of two large scale datasets that were obtained by translating the foundational NLI corpora (SNLI and MNLI) using Amazon Translate."
self.homepage = "https://github.com/boun-tabi/NLI-TR"
self.citation = """\
@inproceedings{budur-etal-2020-data,
title = "Data and Representation for Turkish Natural Language Inference",
author = "Budur, Emrah and
\"{O}zçelik, Rıza and
G\"{u}ng\"{o}r, Tunga",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
abstract = "Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.",
}
"""
@lhoestq (Member) replied:

Sorry, maybe I wasn't clear about description, homepage, and citation.
To appear on the datasets hub page on huggingface.co, those three fields need to be module-level global variables _DESCRIPTION, _CITATION, and _HOMEPAGE.
Since they're global variables, you don't need to keep them in NLITRConfig; you can use the three variables directly in _info().
See squad.py for an example.
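
A minimal sketch of the suggested layout (field contents abridged; the builder class name here is illustrative):

import datasets

_DESCRIPTION = """\
The Natural Language Inference in Turkish (NLI-TR) is a set of two large scale datasets that were obtained by translating the foundational NLI corpora (SNLI and MNLI) using Amazon Translate.
"""

_HOMEPAGE = "https://github.com/boun-tabi/NLI-TR"

_CITATION = """\
@inproceedings{budur-etal-2020-data, ...}
"""

class NliTr(datasets.GeneratorBasedBuilder):  # illustrative name
    def _info(self):
        # Reference the module-level globals directly, so NLITRConfig
        # no longer needs to carry these fields.
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            homepage=_HOMEPAGE,
            citation=_CITATION,
            # features=... unchanged
        )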

@e-budur (Contributor, Author) replied:

Very sorry for the confusion. I've revised the implementation accordingly. Thank you so much for taking the time to review it.

@yjernite added this to In progress in Datasets to Add via automation on Nov 10, 2020
@lhoestq (Member) left a comment

Looks all good now, thanks :)

One more review thread on datasets/nli_tr/nli_tr.py (outdated, resolved)
@lhoestq merged commit 86d3794 into huggingface:master on Nov 12, 2020
Datasets to Add automation moved this from In progress to Done on Nov 12, 2020