pretty print dataset objects #725

stas00 · 2020-10-12T02:03:46Z

Currently, if I do:

from datasets import load_dataset
load_dataset("wikihow", 'all', data_dir="/hf/pegasus-datasets/wikihow/")

I get:


DatasetDict({'train': Dataset(features: {'text': Value(dtype='string', id=None),
'headline': Value(dtype='string', id=None), 'title': Value(dtype='string',
id=None)}, num_rows: 157252), 'validation': Dataset(features: {'text':
Value(dtype='string', id=None), 'headline': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}, num_rows: 5599), 'test':
Dataset(features: {'text': Value(dtype='string', id=None), 'headline':
Value(dtype='string', id=None), 'title': Value(dtype='string', id=None)},
num_rows: 5577)})

This is not very readable.

Can we either have a better __repr__ or have a custom method to nicely pprint the dataset object?

Here is my very simple attempt. With this PR, it produces:

DatasetDict({
  train:   Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 157252
  })
  validation:   Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 5599
  })
  test:   Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 5577
  })
})

I did omit the data types on purpose to make it more readable, but it shouldn't be too difficult to integrate those too.

Note that this PR also fixes an inconsistency in the output: on master the enclosing {} is missing for Dataset but present for DatasetDict (though perhaps that was by design).

I'm not at all attached to this format; I just want something more readable. One approach could be to serialize with json.dumps or something similar, which would make the indentation simpler.
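For illustration only, here is a minimal sketch of what the json.dumps route could look like. It is not the code in this PR: the `splits` dict and the `pretty` helper are hypothetical stand-ins for the dataset objects, filled with the feature names and row counts shown above.

```python
import json

# Hypothetical plain-dict summary of the splits; a stand-in for the real
# DatasetDict, not part of the datasets API.
splits = {
    "train": {"features": ["text", "headline", "title"], "num_rows": 157252},
    "validation": {"features": ["text", "headline", "title"], "num_rows": 5599},
    "test": {"features": ["text", "headline", "title"], "num_rows": 5577},
}


def pretty(summary, indent=4):
    """Let json.dumps handle nesting and indentation for us."""
    return json.dumps(summary, indent=indent)


print(pretty(splits))
```

One trade-off of this approach is that json.dumps with an indent puts every list element on its own line, so the features list takes more vertical space than in the hand-rolled format above.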

Thank you.

lhoestq

I like this format, this makes things way more readable.
Thanks !

stas00 · 2020-10-17T02:57:26Z

Great. Since you found it useful, I improved the code a bit to automate the indentation in the parent class, so that the child repr doesn't need to guess the indentation level, while still repr'ing nicely on its own.

Do we want indent=4 or 2?
Do we want { ... } or not?

Currently it uses indent=4 with curly braces, so it looks like this:

DatasetDict({
    train: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 157252
    })
    validation: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5599
    })
    test: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5577
    })
})

And just the child on its own:

Dataset({
    features: ['text', 'headline', 'title'],
    num_rows: 5577
})
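The idea of letting the parent own the indentation can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `ToyDataset` and `ToyDatasetDict` are hypothetical stand-ins, and using `textwrap.indent` is an assumption made for the example.

```python
import textwrap


class ToyDataset:
    """Hypothetical stand-in for Dataset; only holds what the repr shows."""

    def __init__(self, features, num_rows):
        self.features = features
        self.num_rows = num_rows

    def __repr__(self):
        # The child formats itself as if it were at the top level.
        return (
            "Dataset({\n"
            f"    features: {self.features},\n"
            f"    num_rows: {self.num_rows}\n"
            "})"
        )


class ToyDatasetDict(dict):
    """Hypothetical stand-in for DatasetDict."""

    def __repr__(self):
        # The parent joins the child reprs and shifts the whole body right,
        # so the child never has to know how deeply it is nested.
        body = "\n".join(f"{name}: {split!r}" for name, split in self.items())
        body = textwrap.indent(body, " " * 4)
        return f"DatasetDict({{\n{body}\n}})"


cols = ["text", "headline", "title"]
print(ToyDatasetDict(train=ToyDataset(cols, 157252), test=ToyDataset(cols, 5577)))
print(ToyDataset(cols, 5577))
```

Because the indentation is applied to the already-rendered child text, the same child repr works both nested and standalone.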

thomwolf · 2020-10-17T09:41:42Z

Yes! A lot better indeed!

lhoestq

I like it this way !
Thanks :)

stas00 added 2 commits October 11, 2020 18:53

pretty print dataset objects

4b3e908

fix

271669b

lhoestq approved these changes Oct 15, 2020

automate the indentation

ce2ec62

thomwolf approved these changes Oct 17, 2020

lhoestq approved these changes Oct 23, 2020

lhoestq merged commit 880c2c7 into huggingface:master Oct 23, 2020

stas00 deleted the pprint branch October 23, 2020 16:24
