Loading GLUE dataset loads CoLA by default #130

Closed
zphang opened this issue May 15, 2020 · 3 comments
Labels
dataset bug A bug in a dataset script provided in the library

Comments

@zphang

zphang commented May 15, 2020

If I run:

dataset = nlp.load_dataset('glue')

The resulting dataset seems to be CoLA by default, without throwing any error. This is in contrast to calling:

metric = nlp.load_metric("glue")

which throws an error telling the user that they need to specify a task in GLUE. Should the same apply for loading datasets?

@zphang
Author

zphang commented May 15, 2020

As a follow-up to this: It looks like the actual GLUE task name is supplied as the name argument. Is there a way to check what names/sub-datasets are available under a grouping like GLUE? That information doesn't seem to be readily available in info from nlp.list_datasets().

Edit: I found the info under Glue.BUILDER_CONFIGS
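
For anyone else looking, a rough sketch of reading the config names off a builder class's BUILDER_CONFIGS list. The classes here (StandInConfig, StandInGlue) and the helper available_config_names are hypothetical stand-ins so the snippet runs without the library; the only assumption about the real class is that each entry in BUILDER_CONFIGS has a name attribute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StandInConfig:
    """Hypothetical stand-in for a builder config; only `name` is used here."""
    name: str


class StandInGlue:
    """Hypothetical stand-in for the Glue builder class defined in glue.py."""
    # Shortened list for illustration; the real class lists more configs.
    BUILDER_CONFIGS = [StandInConfig(name=n) for n in ("cola", "sst2", "mrpc")]


def available_config_names(builder_cls) -> List[str]:
    """Collect the `name` of every config declared on the builder class."""
    return [config.name for config in builder_cls.BUILDER_CONFIGS]


names = available_config_names(StandInGlue)
print(names)
```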

@thomwolf
Member

Yes, the first config is loaded by default when no name is supplied, but for GLUE this should probably throw an error instead.

We can probably just add an __init__ at the top of the class Glue(nlp.GeneratorBasedBuilder) in the glue.py script which does this check:

class Glue(nlp.GeneratorBasedBuilder):
    def __init__(self, *args, **kwargs):
        # Refuse to fall back to the first config when no name is given
        assert 'name' in kwargs and kwargs['name'] is not None, "Glue has to be called with a configuration name"
        super(Glue, self).__init__(*args, **kwargs)
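
A self-contained sketch of the same idea, assuming a stand-in base class in place of nlp.GeneratorBasedBuilder. StandInBuilder, GlueSketch and CONFIG_NAMES are hypothetical names, not part of the library, and the error text is only illustrative. The sketch raises ValueError rather than using assert, since assert statements are skipped when Python runs with -O.

```python
class StandInBuilder:
    """Hypothetical stand-in for nlp.GeneratorBasedBuilder."""

    def __init__(self, *args, **kwargs):
        # Only the config name is tracked in this sketch.
        self.name = kwargs.get("name")


class GlueSketch(StandInBuilder):
    # Shortened, illustrative list of config names.
    CONFIG_NAMES = ["cola", "sst2", "mrpc"]

    def __init__(self, *args, **kwargs):
        # Fail loudly instead of silently picking the first config.
        if kwargs.get("name") is None:
            raise ValueError(
                "Glue has to be called with a configuration name, "
                f"one of: {self.CONFIG_NAMES}"
            )
        super().__init__(*args, **kwargs)


try:
    GlueSketch()  # no name given, so this should be rejected
except ValueError as err:
    print(f"rejected: {err}")

builder = GlueSketch(name="cola")  # explicit name is accepted
print(builder.name)
```

Note that this only checks the name keyword argument; a name passed positionally would need separate handling.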

@lhoestq
Member

lhoestq commented May 27, 2020

An error is raised if the sub-dataset is not specified :)

ValueError: Config name is missing.
Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
Example of usage:
	`load_dataset('glue', 'cola')`

lhoestq closed this as completed May 27, 2020