-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upload greek-legal-code dataset #2966
Upload greek-legal-code dataset #2966
Conversation
f5cfdd5
to
775e71b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool ! Thanks for adding this one :)
I added a few suggestions in the comments. My main concern is that it would be nice to have the list of all the classes. This is useful to define the id2label
parameter when instantiating a model for text classification.
# License for the dataset if available | ||
license=_LICENSE, | ||
# Citation for the dataset | ||
citation=_CITATION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can add a text classification template:
citation=_CITATION, | |
citation=_CITATION, | |
task_templates=[TextClassification(text_column="text", label_column="label")], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
features = datasets.Features( | ||
{ | ||
"text": datasets.Value("string"), | ||
"label": datasets.Value("string"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to have the list of the possible labels here (even though they can be numerous in the "subject" case):
"label": datasets.Value("string"), | |
"label": datasets.ClassLabel(names=[...]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datasets/greek_legal_code/README.md
Outdated
`text`: (**str**) The full content of each document, which is represented by its `header` and `articles` (i.e., the `main_body`).\ | ||
`volume`: (**str**) The volume-level class it belongs to.\ | ||
`chapter`: (**str**) The chapter-level class it belongs to.\ | ||
`subject`: (**str**) The subject-level class it belongs to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can list the name of the classes here ? Or at least for the "volume" case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this dataset @christospi.
I found a potential issue (see description below).
data = json.loads(row) | ||
yield id_, { | ||
"text": data["text"], | ||
"label": data[self.config.label_type], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is an issue in the "label" field: the dataset will be identically repeated 3 times, once for each of the configuration values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @albertvillanova, thanks for reviewing.
Why is this an issue?
The goal is that the users will be able to use 3 different configurations of the very same dataset based on the examined label set they prefer.
e.g.,
from datasets import load_dataset
# Load the dataset including the volume labels
dataset = load_dataset('greek_legal_code', label_type='volume')
# Load the dataset including the chapter labels
dataset = load_dataset('greek_legal_code', label_type='chapter')
# Load the dataset including the chapter labels
dataset = load_dataset('greek_legal_code', label_type='subject')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine this way. The texts are the same and just the labels are changing depending on the granularity that you need for classification.
ab2d82a
to
2ecbeac
Compare
@albertvillanova @lhoestq thank you very much for reviewing! 🤗 I 've pushed some updates/changes as requested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks !
Let me just do some final minor changes in the dataset card:
No description provided.