Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload greek-legal-code dataset #2966

Merged
merged 3 commits into from
Oct 13, 2021

Conversation

christospi
Copy link
Contributor

No description provided.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool ! Thanks for adding this one :)

I added a few suggestions in the comments. My main concern is that it would be nice to have the list of all the classes. This is useful to define the id2label parameter when instantiating a model for text classification.

# License for the dataset if available
license=_LICENSE,
# Citation for the dataset
citation=_CITATION,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add a text classification template:

Suggested change
citation=_CITATION,
citation=_CITATION,
task_templates=[TextClassification(text_column="text", label_column="label")],

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

features = datasets.Features(
{
"text": datasets.Value("string"),
"label": datasets.Value("string"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to have the list of the possible labels here (even though they can be numerous in the "subject" case):

Suggested change
"label": datasets.Value("string"),
"label": datasets.ClassLabel(names=[...]),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datasets/greek_legal_code/README.md Outdated Show resolved Hide resolved
`text`: (**str**) The full content of each document, which is represented by its `header` and `articles` (i.e., the `main_body`).\
`volume`: (**str**) The volume-level class it belongs to.\
`chapter`: (**str**) The chapter-level class it belongs to.\
`subject`: (**str**) The subject-level class it belongs to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can list the name of the classes here ? Or at least for the "volume" case ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datasets/greek_legal_code/README.md Outdated Show resolved Hide resolved
datasets/greek_legal_code/README.md Outdated Show resolved Hide resolved
datasets/greek_legal_code/README.md Outdated Show resolved Hide resolved
Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this dataset @christospi.

I found a potential issue (see description below).

data = json.loads(row)
yield id_, {
"text": data["text"],
"label": data[self.config.label_type],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an issue in the "label" field: the dataset will be identically repeated 3 times, once for each of the configuration values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @albertvillanova, thanks for reviewing.

Why is this an issue?

The goal is that the users will be able to use 3 different configurations of the very same dataset based on the examined label set they prefer.

e.g.,

from datasets import load_dataset

# Load the dataset including the volume labels
dataset = load_dataset('greek_legal_code', label_type='volume')

# Load the dataset including the chapter labels
dataset = load_dataset('greek_legal_code', label_type='chapter')

# Load the dataset including the chapter labels
dataset = load_dataset('greek_legal_code', label_type='subject')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine this way. The texts are the same and just the labels are changing depending on the granularity that you need for classification.

@christospi
Copy link
Contributor Author

@albertvillanova @lhoestq thank you very much for reviewing! 🤗

I 've pushed some updates/changes as requested.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks !

Let me just do some final minor changes in the dataset card:

datasets/greek_legal_code/README.md Outdated Show resolved Hide resolved
datasets/greek_legal_code/README.md Outdated Show resolved Hide resolved
datasets/greek_legal_code/README.md Outdated Show resolved Hide resolved
@lhoestq lhoestq merged commit 647712f into huggingface:master Oct 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants