pretty print dataset objects #725

Merged: 3 commits, Oct 23, 2020

@stas00 (Contributor) commented Oct 12, 2020

Currently, if I do:

from datasets import load_dataset
load_dataset("wikihow", 'all', data_dir="/hf/pegasus-datasets/wikihow/")

I get:


DatasetDict({'train': Dataset(features: {'text': Value(dtype='string', id=None),
'headline': Value(dtype='string', id=None), 'title': Value(dtype='string',
id=None)}, num_rows: 157252), 'validation': Dataset(features: {'text':
Value(dtype='string', id=None), 'headline': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}, num_rows: 5599), 'test':
Dataset(features: {'text': Value(dtype='string', id=None), 'headline':
Value(dtype='string', id=None), 'title': Value(dtype='string', id=None)},
num_rows: 5577)})

This is not very readable.

Can we either have a better __repr__ or have a custom method to nicely pprint the dataset object?

Here is my very simple attempt. With this PR, it produces:

DatasetDict({
  train:   Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 157252
  })
  validation:   Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 5599
  })
  test:   Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 5577
  })
})

I omitted the data types on purpose to keep the output readable, but it shouldn't be too difficult to include them as well.

Note that this PR also fixes an inconsistency in the output: on master, the Dataset repr lacks the enclosing {}, while the DatasetDict repr has them (though perhaps that was by design).

I'm not at all attached to this format; I just want something more readable. One alternative would be to serialize with json.dumps or something similar, which would make the indentation simpler.
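
A minimal sketch of the json.dumps idea, assuming each split is first reduced to a plain dict. The `splits` dict and the `pretty` helper are hypothetical stand-ins for illustration, not part of the datasets API:

```python
import json

# Hypothetical plain-dict summary of the splits shown above;
# the real Dataset objects are not JSON-serializable as-is.
splits = {
    "train": {"features": ["text", "headline", "title"], "num_rows": 157252},
    "validation": {"features": ["text", "headline", "title"], "num_rows": 5599},
    "test": {"features": ["text", "headline", "title"], "num_rows": 5577},
}


def pretty(summary, indent=4):
    """Let json.dumps handle nesting and indentation."""
    return json.dumps(summary, indent=indent)


print(pretty(splits))
```

One trade-off: with `indent` set, json.dumps puts every list element on its own line, so the output is longer than the hand-formatted version.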

Thank you.

@lhoestq (Member) left a comment

I like this format; it makes things much more readable.
Thanks!

@stas00 (Contributor, Author) commented Oct 17, 2020

Great. Since you found it useful, I improved the code a bit: the parent class now handles the indentation automatically, so the child repr doesn't need to guess its indentation level and still prints nicely on its own.

  • Do we want indent=4 or indent=2?
  • Do we want the { ... } braces or not?

Currently it uses indent=4 with curly braces, so it looks like this:

DatasetDict({
    train: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 157252
    })
    validation: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5599
    })
    test: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5577
    })
})

And the child on its own:

Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 5577
})
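
The parent-side indentation described above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: `SketchDataset` and `SketchDatasetDict` are hypothetical stand-ins, and `textwrap.indent` is one possible way to shift the child output.

```python
import textwrap


class SketchDataset:
    """Hypothetical stand-in for Dataset: column names and a row count only."""

    def __init__(self, features, num_rows):
        self.features = features
        self.num_rows = num_rows

    def __repr__(self):
        # The child always formats itself at indentation level zero.
        return (
            "Dataset({\n"
            f"    features: {self.features},\n"
            f"    num_rows: {self.num_rows}\n"
            "})"
        )


class SketchDatasetDict(dict):
    """Hypothetical stand-in for DatasetDict."""

    def __repr__(self):
        body = "\n".join(f"{name}: {split!r}" for name, split in self.items())
        # The parent shifts every child line in one pass, so the child
        # never needs to know how deeply it is nested.
        body = textwrap.indent(body, " " * 4)
        return f"DatasetDict({{\n{body}\n}})"


columns = ["text", "headline", "title"]
print(SketchDatasetDict(train=SketchDataset(columns, 157252),
                        test=SketchDataset(columns, 5577)))
```

Because the indent width appears in only one place in the parent, switching between 4 and 2 spaces for nested output is a one-line change there (the child's own literal indentation would need the same adjustment).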

@thomwolf (Member) commented

Yes! A lot better indeed!

@lhoestq (Member) left a comment

I like it this way!
Thanks :)

@lhoestq lhoestq merged commit 880c2c7 into huggingface:master Oct 23, 2020
@stas00 stas00 deleted the pprint branch October 23, 2020 16:24

3 participants