
Add MultiEURLEX dataset #2865

Merged
merged 16 commits into from
Sep 10, 2021

Conversation

iliaschalkidis
Contributor

Add new MultiEURLEX Dataset

MultiEURLEX comprises 65k EU laws in 23 official EU languages (some relatively low-resource). Each EU law has been annotated with EUROVOC concepts (labels) by the Publications Office of the EU. As with the English EURLEX, the goal is to predict the relevant EUROVOC concepts (labels); this is a multi-label classification task (given the text, predict multiple labels).
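For readers new to the task: multi-label targets like these are commonly converted to multi-hot vectors before training. A minimal sketch in plain Python; the label ids below are placeholders, not real EUROVOC concept ids:

```python
# Sketch: multi-hot encoding for a multi-label task such as MultiEURLEX.
# The label ids used here are placeholders, not real EUROVOC ids.
def multi_hot(labels, label_set):
    """Turn a list of label ids into a 0/1 vector over a fixed label vocabulary."""
    index = {label: i for i, label in enumerate(sorted(label_set))}
    vec = [0] * len(index)
    for label in labels:
        vec[index[label]] = 1
    return vec

label_set = {"100147", "100149", "100160"}
print(multi_hot(["100160", "100147"], label_set))
# [1, 0, 1]
```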

@iliaschalkidis
Contributor Author

Hi @lhoestq, we have this new cool multilingual dataset coming at EMNLP 2021. It would be really nice if we could have it in Hugging Face asap. Thanks!

Member

@lhoestq lhoestq left a comment


Nice! Thanks a lot for adding this dataset :)

Also, good job on the dataset script and the dataset card!

I left a few comments.

Moreover, it looks like the dummy data zip files are quite big (>1MB). Usually we try to keep them smaller than 50kB so the datasets repo doesn't get too big. Could you please try to reduce their sizes? For example, feel free to simply keep one or two examples per jsonl file in the dummy_data.zip files.
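One way to do the trimming the reviewer asks for is a small script that rewrites the zip, keeping only the first couple of JSON lines of each `.jsonl` member. A sketch under assumed file layout; paths and member names are hypothetical:

```python
# Sketch: shrink a dummy_data.zip by keeping only the first `keep`
# JSON lines of every .jsonl member. File layout is assumed, not taken
# from the actual PR.
import zipfile

def shrink_dummy_zip(src_path, dst_path, keep=2):
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name.endswith(".jsonl"):
                lines = data.decode("utf-8").splitlines()[:keep]
                data = ("\n".join(lines) + "\n").encode("utf-8")
            dst.writestr(name, data)
```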

datasets/multi_eurlex/README.md (review thread resolved)
"text": datasets.Translation(
languages=_LANGUAGES,
),
"labels": datasets.features.Sequence(datasets.Value("string")),
Member


Here you can use the ClassLabel feature type, with the names of all the labels.

Contributor Author


I would prefer to keep the labels as a list of the original label ids (str), if this is not a big issue.

datasets/multi_eurlex/README.md (review thread resolved)
@iliaschalkidis
Contributor Author

Hi @lhoestq, I adopted most of your suggestions:

  • Dummy data files were reduced, keeping only the 2 smallest documents per subset JSONL file.
  • The README was updated with the publication URL and instructions on how to download and use the label descriptors. Excessive newlines were deleted.

I would prefer to keep the label list in a raw format (the original ids), so that people can combine the labels with additional information, or in the future explore the dataset, find inconsistencies, and fix them for a new release.

@lhoestq
Member

lhoestq commented Sep 9, 2021

Thanks for the changes :)

Regarding the labels:

If you use the ClassLabel feature type, the only change is that it will store the ids as integers instead of strings (as it currently does).
The advantage is that if people want to know which id corresponds to which label name, they can use classlabel.int2str. It is also the format that helps automate model training for classification in transformers.

Let me know if that sounds good to you or if you still want to stick with the labels as they are now.

@iliaschalkidis
Contributor Author

Hey @lhoestq, thanks for providing this information. This sounds great. I updated my code accordingly to use ClassLabel. Could you please provide a minimal example of how classlabel.int2str works in practice in my case, where labels are a sequence?

from datasets import load_dataset
dataset = load_dataset('multi_eurlex', 'all_languages')
# Read strs from the labels (list of integers) for the 1st sample of the training split

I would like to include this in the README file.

Could you also provide some info on how I could define the supervised key to automate model training, as you said?

Thanks!

@lhoestq
Member

lhoestq commented Sep 10, 2021

Thanks for the update :)

Here is an example of usage:

from datasets import load_dataset
dataset = load_dataset('multi_eurlex', 'all_languages', split='train')
classlabel = dataset.features["labels"].feature
print(dataset[0]["labels"])
# [1, 20, 7, 3, 0]
print(classlabel.int2str(dataset[0]["labels"]))
# ['100160', '100155', '100158', '100147', '100149']

The ClassLabel is simply used to define the id2label dictionary of classification models, to make the ids match between the model and the dataset. There's nothing more to do :p

I think one last thing to do is just update the dataset_infos.json file and we'll be good!
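The id2label wiring described above boils down to a pair of dictionaries derived from the ClassLabel's name list. A minimal sketch in plain Python; the label names here are placeholders, not the real EUROVOC ids:

```python
# Sketch: how a ClassLabel's ordered name list maps to the id2label /
# label2id dictionaries that transformers classification models expect.
# The names below are hypothetical EUROVOC ids, for illustration only.
names = ["100147", "100149", "100160"]

id2label = {i: name for i, name in enumerate(names)}
label2id = {name: i for i, name in enumerate(names)}

print(id2label[1])        # prints 100149
print(label2id["100160"]) # prints 2
```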

@iliaschalkidis
Contributor Author

Everything is ready! 👍

Member

@lhoestq lhoestq left a comment


Cool! LGTM :)

I just did a change in the readme (_int2str[] -> int2str())

datasets/multi_eurlex/README.md (review thread resolved)
@lhoestq lhoestq merged commit ca70e55 into huggingface:master Sep 10, 2021