Implement to_dict and to_pandas for Dataset #1889

SBrandeis · 2021-02-16T12:38:19Z

With options to return a generator or the full dataset
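To illustrate the "generator or full dataset" option, here is a minimal sketch in plain Python. It is not the code from this PR: `to_dict_sketch` is a hypothetical helper, and the parameter names `batched` and `batch_size` are assumptions chosen for the example.

```python
from typing import Dict, Iterator, List, Union

Columns = Dict[str, List]


def to_dict_sketch(
    columns: Columns, batch_size: int = 2, batched: bool = False
) -> Union[Columns, Iterator[Columns]]:
    """Hypothetical helper: return all columns at once, or a generator of batches."""
    if not batched:
        # Full dataset in one dict (copies the column lists).
        return {name: list(values) for name, values in columns.items()}

    # All columns are assumed to have the same length.
    num_rows = len(next(iter(columns.values()))) if columns else 0

    def batches() -> Iterator[Columns]:
        # Yield consecutive slices of at most batch_size rows each.
        for start in range(0, num_rows, batch_size):
            yield {
                name: values[start:start + batch_size]
                for name, values in columns.items()
            }

    return batches()
```

With `batched=True` the caller iterates over small dicts instead of materializing everything at once, which is the point of offering both modes.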

lhoestq

Nice thanks !

Also, can you use query_table with self._indices (the same change I suggested in the to_csv PR)? That way this will support shuffled/sharded datasets.
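A rough sketch of why the indices mapping matters, using plain Python lists rather than the library's Arrow tables. `rows_through_indices` is a hypothetical stand-in for illustration only; it does not reproduce the signature or behavior of the real `query_table`.

```python
from typing import Dict, List, Optional


def rows_through_indices(
    columns: Dict[str, List], indices: Optional[List[int]] = None
) -> Dict[str, List]:
    """Hypothetical helper: read rows in the order given by an indices mapping."""
    if indices is None:
        # No mapping: the underlying storage order is the dataset order.
        return {name: list(values) for name, values in columns.items()}
    # With a mapping (shuffle, shard, select), logical row i lives at indices[i].
    return {name: [values[i] for i in indices] for name, values in columns.items()}
```

Reading the underlying table directly would return rows in storage order and ignore the shuffle or shard; going through the mapping returns what the user actually sees.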

src/datasets/arrow_dataset.py

lhoestq

Nice ! Thanks for adding the support for datasets with an indices mapping.
I added some more comments about query_table and the batched=False case, as well as suggestions to use len(self) instead of self._data.num_rows in query_table
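The `len(self)` versus `self._data.num_rows` point can be sketched with a toy class. `TinyDataset`, `table_num_rows`, and `batch_bounds` are hypothetical names for this example, not the library's API; the only assumption is that a dataset with an indices mapping can expose fewer rows than its underlying table holds.

```python
from typing import Dict, List, Optional, Tuple


class TinyDataset:
    """Toy model: column storage plus an optional indices mapping."""

    def __init__(self, columns: Dict[str, List], indices: Optional[List[int]] = None):
        self.columns = columns
        self.indices = indices

    @property
    def table_num_rows(self) -> int:
        # Row count of the underlying storage, ignoring any mapping.
        return len(next(iter(self.columns.values()))) if self.columns else 0

    def __len__(self) -> int:
        # Logical row count: the mapping decides it when present.
        return len(self.indices) if self.indices is not None else self.table_num_rows


def batch_bounds(num_rows: int, batch_size: int) -> List[Tuple[int, int]]:
    """Half-open (start, stop) ranges covering num_rows rows."""
    return [
        (start, min(start + batch_size, num_rows))
        for start in range(0, num_rows, batch_size)
    ]
```

Computing batch ranges from the storage row count would run past the end of a selected or sharded dataset, so the logical length is the safer bound.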

src/datasets/arrow_dataset.py

tests/test_arrow_dataset.py

lhoestq

Nice thank you !

lhoestq · 2021-02-18T18:42:37Z

Next step is going to be adding these two to the documentation ^^

SBrandeis requested a review from lhoestq February 16, 2021 12:38

lhoestq reviewed Feb 17, 2021

src/datasets/arrow_dataset.py (outdated, resolved)

src/datasets/arrow_dataset.py (outdated, resolved)

SBrandeis requested a review from lhoestq February 18, 2021 12:22

lhoestq reviewed Feb 18, 2021

SBrandeis added 7 commits February 18, 2021 18:11

Move DEFAULT_MAX_BATCH_SIZE to config

4ab730a

Implement to_dict and to_csv

601f71b

Use query_table and change default behaviour

1bec8e4

Change tests to use multiple columns

3ea1a6e

Change default behaviour of to_pandas

c7b8a73

Use query_table and len(self) in non-batched to_pandas and to_csv

d987d12

Update tests

ca1459b

SBrandeis force-pushed the to_dictlike branch from 3dc4c6b to ca1459b February 18, 2021 17:11

SBrandeis requested a review from lhoestq February 18, 2021 17:55

lhoestq approved these changes Feb 18, 2021

lhoestq merged commit 9acb9da into huggingface:master Feb 18, 2021
