Add MultiEURLEX dataset #2865
Conversation
Hi @lhoestq, we have this new cool multilingual dataset coming at EMNLP 2021. It would be really nice if we could have it in Hugging Face asap. Thanks!
Nice ! Thanks a lot for adding this dataset :)
Also good job on the dataset script and the dataset card !
I left a few comments.
Moreover it looks like the dummy data zip files are quite big (>1MB). Usually we try to have them smaller than 50kB to not make the datasets
repo too big. Could you please try to reduce their sizes ? For example feel free to simply keep one or two examples per jsonl file in the dummy_data.zip files.
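For example, one low-tech way to keep only a couple of examples per jsonl file before re-zipping — a sketch on a throwaway file (`sample.jsonl` and its contents are hypothetical):

```shell
# Create a small demo jsonl file, then keep only its first two JSON lines.
printf '{"text": "a", "labels": []}\n{"text": "b", "labels": []}\n{"text": "c", "labels": []}\n' > sample.jsonl
head -n 2 sample.jsonl > sample.trimmed.jsonl
wc -l < sample.trimmed.jsonl  # 2
```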
"text": datasets.Translation( | ||
languages=_LANGUAGES, | ||
), | ||
"labels": datasets.features.Sequence(datasets.Value("string")), |
Here you can use the ClassLabel
feature type, with the names of all the labels
I would prefer to keep the labels as a list of the original label ids (str), if this is not a big issue.
Hi @lhoestq, I adopted most of your suggestions:
I would prefer to keep the label list in a pure format (original ids), to enable people to combine those with more information, or possibly in the future explore the dataset, find inconsistencies and fix those to release a new version.
Thanks for the changes :) Regarding the labels: If you use the ClassLabel feature type, the only change is that it will store the ids as integers instead of strings (as currently). Let me know if that sounds good to you or if you still want to stick with the labels as they are now.
Hey @lhoestq, thanks for providing this information. This sounds great. I updated my code accordingly to use:

```python
from datasets import load_dataset

dataset = load_dataset('multi_eurlex', 'all_languages')
# Read strs from the labels (list of integers) for the 1st sample of the training split
```

I would like to include this in the README file. Could you also provide some info on how I could define the supervised key to automate model training, as you said? Thanks!
Thanks for the update :) Here is an example of usage:

```python
from datasets import load_dataset

dataset = load_dataset('multi_eurlex', 'all_languages', split='train')
classlabel = dataset.features["labels"].feature
print(dataset[0]["labels"])
# [1, 20, 7, 3, 0]
print(classlabel.int2str(dataset[0]["labels"]))
# ['100160', '100155', '100158', '100147', '100149']
```

The ClassLabel is simply used to define the mapping between the integer ids and the original label names. I think one last thing to do is just update the README accordingly.
Everything is ready! 👍
Cool ! LGTM :)
I just did a change in the readme (`_int2str[]` -> `int2str()`).
Add new MultiEURLEX Dataset
MultiEURLEX comprises 65k EU laws in 23 official EU languages (some relatively low-resource). Each EU law has been annotated with EUROVOC concepts (labels) by the Publications Office of the EU. As with the English EURLEX, the goal is to predict the relevant EUROVOC concepts (labels); this is a multi-label classification task (given the text, predict multiple labels).
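Since each law carries multiple EUROVOC concepts, a model typically predicts one binary indicator per concept. A minimal pure-Python sketch of building such a multi-hot target (sizes here are hypothetical; the real EUROVOC vocabulary is much larger):

```python
# Multi-label targets: one binary indicator per EUROVOC concept.
num_labels = 5          # hypothetical vocabulary size
sample_labels = [1, 3]  # integer label indices for one document
target = [1 if i in set(sample_labels) else 0 for i in range(num_labels)]
print(target)  # [0, 1, 0, 1, 0]
```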