
Adding nli_tr dataset #787

Merged: 11 commits into huggingface:master, Nov 12, 2020
Conversation

@e-budur (Contributor) commented on Nov 1, 2020

Hello,

In this pull request, we have implemented the necessary interface to add our recent dataset, NLI-TR. The dataset will be presented in a full paper at EMNLP 2020 this month. [arXiv link]

The dataset is a neural machine translation of the SNLI and MultiNLI datasets into Turkish, so we followed a format similar to that of the original datasets hosted on the HuggingFace datasets hub.

Our dataset is designed to be accessed as follows, following the interface of the GLUE dataset, which provides multiple datasets through a single interface on the HuggingFace datasets hub:

from datasets import load_dataset
multinli_tr = load_dataset("nli_tr", "multinli_tr")
snli_tr = load_dataset("nli_tr", "snli_tr")
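
The splits can then be indexed like any other dataset on the hub. As a quick sanity check (assuming the split names mirror the original corpora, i.e. train/validation/test for snli_tr and train/validation_matched/validation_mismatched for multinli_tr):

print(snli_tr["train"][0])
print(multinli_tr["validation_matched"][0])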

Thanks for your help in reviewing our pull request.

@lhoestq (Member) left a comment

Looks good to me!
Thanks for adding this one ;)

I left minor comments. Once they're resolved we can merge it :)

Two review threads on datasets/nli_tr/nli_tr.py (outdated, resolved)
@e-budur (Contributor, Author) commented on Nov 10, 2020

Thank you, @lhoestq, for taking the time to review our pull request. We appreciate your help.

We've made the changes you described and hope the PR is now ready to be merged. Please let me know if you have any additional requests for revisions.

Comment on lines 32 to 47
self.description = "The Natural Language Inference in Turkish (NLI-TR) is a set of two large scale datasets that were obtained by translating the foundational NLI corpora (SNLI and MNLI) using Amazon Translate."
self.homepage = "https://github.com/boun-tabi/NLI-TR"
self.citation = """\
@inproceedings{budur-etal-2020-data,
title = "Data and Representation for Turkish Natural Language Inference",
author = "Budur, Emrah and
\"{O}zçelik, Rıza and
G\"{u}ng\"{o}r, Tunga",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
abstract = "Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.",
}
"""
@lhoestq (Member) replied:

Sorry, maybe I wasn't clear about description, homepage, and citation.
To appear on the datasets hub page on huggingface.co, those three fields need to be module-level global variables _DESCRIPTION, _CITATION, and _HOMEPAGE.
Since they're global variables, you don't need to keep them in NLITRConfig; you can use the three variables directly in _info().
See squad.py for an example.
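
A minimal sketch of the suggested layout (field contents abridged; the builder class name here is illustrative):

import datasets

_DESCRIPTION = """\
The Natural Language Inference in Turkish (NLI-TR) is a set of two large scale datasets that were obtained by translating the foundational NLI corpora (SNLI and MNLI) using Amazon Translate.
"""

_HOMEPAGE = "https://github.com/boun-tabi/NLI-TR"

_CITATION = """\
@inproceedings{budur-etal-2020-data, ...}
"""

class NliTr(datasets.GeneratorBasedBuilder):  # illustrative name
    def _info(self):
        # Reference the module-level globals directly, so NLITRConfig
        # no longer needs to carry these fields.
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            homepage=_HOMEPAGE,
            citation=_CITATION,
            # features=... unchanged
        )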

@e-budur (Contributor, Author) replied:

Very sorry for the confusion. I've revised the implementation accordingly. Thank you so much for taking the time to review it.

@yjernite added this to In progress in Datasets to Add via automation on Nov 10, 2020
@lhoestq (Member) left a comment

Looks all good now, thanks :)

One more review thread on datasets/nli_tr/nli_tr.py (outdated, resolved)
@lhoestq merged commit 86d3794 into huggingface:master on Nov 12, 2020
Datasets to Add automation moved this from In progress to Done on Nov 12, 2020