Add XNLI train set #781

lhoestq · 2020-10-30T13:21:53Z

I added the train set that was built using the translated MNLI.
Now you can load the dataset specifying one language:

from datasets import load_dataset

xnli_en = load_dataset("xnli", "en")
print(xnli_en["train"][0])
# {'hypothesis': 'Product and geography are what make cream skimming work .', 'label': 1, 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography .'}
print(xnli_en["test"][0])
# {'hypothesis': 'I havent spoken to him again.', 'label': 2, 'premise': "Well, I wasn't even thinking about that, but I was so frustrated, and, I ended up talking to him again."}

Cc @sgugger
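For readers who want to interpret the integer `label` values in the examples above, here is a minimal sketch. It assumes the label order is entailment, neutral, contradiction; that order is not stated in this thread, so check it against `xnli_en["train"].features["label"].names` before relying on it. The helper `with_label_name` and the constant `ASSUMED_LABEL_NAMES` are hypothetical names used only for illustration, and the examples are copied from the output above so the snippet runs without downloading anything.

```python
# Assumed label order (not confirmed in this thread); verify with
# xnli_en["train"].features["label"].names after loading the dataset.
ASSUMED_LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def with_label_name(example):
    """Return a copy of an example with a readable `label_name` field added."""
    return {**example, "label_name": ASSUMED_LABEL_NAMES[example["label"]]}


# Examples copied from the printed output above.
train_example = {
    "hypothesis": "Product and geography are what make cream skimming work .",
    "label": 1,
    "premise": "Conceptually cream skimming has two basic dimensions - product and geography .",
}
test_example = {
    "hypothesis": "I havent spoken to him again.",
    "label": 2,
    "premise": "Well, I wasn't even thinking about that, but I was so frustrated, and, I ended up talking to him again.",
}

print(with_label_name(train_example)["label_name"])
print(with_label_name(test_example)["label_name"])
```

Under the assumed order, the train example reads as neutral and the test example as contradiction.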

YifanYangEbidanko · 2022-06-01T19:47:40Z

Hi! Thanks for adding the translated MNLI! Do you know which translation system / model was used to create the datasets in the other languages?

lhoestq · 2022-06-09T09:51:06Z

According to the paper (https://arxiv.org/pdf/1809.05053.pdf), it's the result of the work of professional translators ;)

YifanYangEbidanko · 2022-06-09T13:03:43Z

Thanks for getting back to me. The training data is not from translators, though; it appears to be machine translation for all languages. If we could find out which system was used to create the training data, that would be great! Yifan.

lhoestq · 2022-06-09T14:35:20Z

> The training data is not from translators.

What makes you think that? The paper literally says:

> we hire translators to translate the resulting sentences into 15 languages using the One Hour Translation platform.

YifanYangEbidanko · 2022-06-09T23:26:45Z

However, the annotators only did the test and validation sets, as stated in the paper: "we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages".

lhoestq added 3 commits October 30, 2020 12:59

add xnli train set

cded50e

update dataset_infos

3e1b561

add dummy data

cc5a116

yjernite added this to In progress in Datasets to Add via automation Oct 30, 2020

lhoestq merged commit 1abf805 into master Nov 9, 2020

Datasets to Add automation moved this from In progress to Done Nov 9, 2020

lhoestq deleted the add-xnli-train-set branch November 9, 2020 18:22
