
Add MultiEURLEX dataset #2865

Merged
merged 16 commits into from
Sep 10, 2021

Conversation

iliaschalkidis
Contributor

Add new MultiEURLEX Dataset

MultiEURLEX comprises 65k EU laws in 23 official EU languages (some relatively low-resource). Each EU law has been annotated with EUROVOC concepts (labels) by the Publications Office of the EU. As with the English EURLEX, the goal is to predict the relevant EUROVOC concepts (labels); this is a multi-label classification task (given the text, predict multiple labels).
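For readers new to the task: multi-label targets like these are commonly converted to multi-hot vectors before training. A minimal sketch in plain Python; the label ids below are placeholders, not real EUROVOC concept ids:

```python
# Sketch: multi-hot encoding for a multi-label task such as MultiEURLEX.
# The label ids used here are placeholders, not real EUROVOC ids.
def multi_hot(labels, label_set):
    """Turn a list of label ids into a 0/1 vector over a fixed label vocabulary."""
    index = {label: i for i, label in enumerate(sorted(label_set))}
    vec = [0] * len(index)
    for label in labels:
        vec[index[label]] = 1
    return vec

label_set = {"100147", "100149", "100160"}
print(multi_hot(["100160", "100147"], label_set))
# [1, 0, 1]
```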

@iliaschalkidis
Contributor Author

Hi @lhoestq, we have this new cool multilingual dataset coming at EMNLP 2021. It would be really nice if we could have it in Hugging Face asap. Thanks!

Member

@lhoestq lhoestq left a comment


Nice! Thanks a lot for adding this dataset :)

Also, good job on the dataset script and the dataset card!

I left a few comments.

Moreover, it looks like the dummy data zip files are quite big (>1MB). Usually we try to keep them smaller than 50kB so the datasets repo doesn't get too big. Could you please try to reduce their sizes? For example, feel free to simply keep one or two examples per jsonl file in the dummy_data.zip files.
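One way to do the trimming the reviewer asks for is a small script that rewrites the zip, keeping only the first couple of JSON lines of each `.jsonl` member. A sketch under assumed file layout; paths and member names are hypothetical:

```python
# Sketch: shrink a dummy_data.zip by keeping only the first `keep`
# JSON lines of every .jsonl member. File layout is assumed, not taken
# from the actual PR.
import zipfile

def shrink_dummy_zip(src_path, dst_path, keep=2):
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name.endswith(".jsonl"):
                lines = data.decode("utf-8").splitlines()[:keep]
                data = ("\n".join(lines) + "\n").encode("utf-8")
            dst.writestr(name, data)
```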

datasets/multi_eurlex/README.md (review thread resolved)
"text": datasets.Translation(
languages=_LANGUAGES,
),
"labels": datasets.features.Sequence(datasets.Value("string")),
Member


Here you can use the ClassLabel feature type, with the names of all the labels.

Contributor Author


I would prefer to keep the labels as a list of the original label ids (str), if this is not a big issue.

datasets/multi_eurlex/README.md (review thread resolved)
@iliaschalkidis
Contributor Author

Hi @lhoestq, I adopted most of your suggestions:

  • Dummy data files were reduced, keeping only the 2 smallest documents per subset JSONL file.
  • The README was updated with the publication URL and instructions on how to download and use the label descriptors. Excessive newlines were deleted.

I would prefer to keep the label list in a raw format (the original ids), so that people can combine the labels with additional information, or in the future explore the dataset, find inconsistencies, and fix them for a new release.

@lhoestq
Member

lhoestq commented Sep 9, 2021

Thanks for the changes :)

Regarding the labels:

If you use the ClassLabel feature type, the only change is that it will store the ids as integers instead of strings (as it currently does).
The advantage is that if people want to know which id corresponds to which label name, they can use classlabel.int2str. It is also the format that helps automate model training for classification in transformers.

Let me know if that sounds good to you or if you still want to stick with the labels as they are now.

@iliaschalkidis
Contributor Author

Hey @lhoestq, thanks for providing this information. This sounds great. I updated my code accordingly to use ClassLabel. Could you please provide a minimal example of how classlabel.int2str works in practice in my case, where labels are a sequence?

from datasets import load_dataset
dataset = load_dataset('multi_eurlex', 'all_languages')
# Read strs from the labels (list of integers) for the 1st sample of the training split

I would like to include this in the README file.

Could you also provide some info on how I could define the supervised key to automate model training, as you said?

Thanks!

@lhoestq
Member

lhoestq commented Sep 10, 2021

Thanks for the update :)

Here is an example of usage:

from datasets import load_dataset
dataset = load_dataset('multi_eurlex', 'all_languages', split='train')
classlabel = dataset.features["labels"].feature
print(dataset[0]["labels"])
# [1, 20, 7, 3, 0]
print(classlabel.int2str(dataset[0]["labels"]))
# ['100160', '100155', '100158', '100147', '100149']

The ClassLabel is simply used to define the id2label dictionary of classification models, to make the ids match between the model and the dataset. There's nothing more to do :p

I think one last thing to do is just update the dataset_infos.json file and we'll be good!
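The id2label wiring described above boils down to a pair of dictionaries derived from the ClassLabel's name list. A minimal sketch in plain Python; the label names here are placeholders, not the real EUROVOC ids:

```python
# Sketch: how a ClassLabel's ordered name list maps to the id2label /
# label2id dictionaries that transformers classification models expect.
# The names below are hypothetical EUROVOC ids, for illustration only.
names = ["100147", "100149", "100160"]

id2label = {i: name for i, name in enumerate(names)}
label2id = {name: i for i, name in enumerate(names)}

print(id2label[1])        # prints 100149
print(label2id["100160"]) # prints 2
```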

@iliaschalkidis
Contributor Author

Everything is ready! 👍

Member

@lhoestq lhoestq left a comment


Cool! LGTM :)

I just did a change in the readme (_int2str[] -> int2str())

datasets/multi_eurlex/README.md (review thread resolved)
@lhoestq lhoestq merged commit ca70e55 into huggingface:master Sep 10, 2021