Always update metadata in arrow schema #2274

lhoestq · 2021-04-27T19:21:57Z

We store a redundant copy of the features in the metadata of the arrow table's schema. This copy is used to recover the features when doing Dataset.from_file. The metadata are updated after each transform that changes the feature types.

For each function that transforms the feature types of the dataset, I added a step in the tests to make sure the metadata in the arrow schema are up to date.

I also added a line to update the metadata directly in the Dataset.__init__ method.
This way, even a dataset instantiated directly with __init__ will have a table with the right metadata.
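
To illustrate the invariant described above, here is a minimal toy sketch using only the standard library. It is not the code from this PR: ToyDataset, METADATA_KEY, with_features_metadata and features_from_metadata are hypothetical names made up for the example, and a plain dict stands in for the arrow schema metadata. It only shows the idea that the constructor always rewrites the serialized features, so any transform that builds a new dataset gets up-to-date metadata.

```python
import json

# Hypothetical metadata key, chosen for this sketch only.
METADATA_KEY = b"features"


def with_features_metadata(schema_metadata, features):
    """Return a copy of the schema metadata with the features serialized into it."""
    updated = dict(schema_metadata or {})
    updated[METADATA_KEY] = json.dumps(features, sort_keys=True).encode("utf-8")
    return updated


def features_from_metadata(schema_metadata):
    """Recover the features from the metadata, as a from_file-style loader would."""
    return json.loads(schema_metadata[METADATA_KEY].decode("utf-8"))


class ToyDataset:
    """Toy stand-in for a dataset: column data plus schema metadata."""

    def __init__(self, columns, features, schema_metadata=None):
        self.columns = columns
        self.features = features
        # Sync on construction, so stale or missing metadata is always overwritten.
        self.schema_metadata = with_features_metadata(schema_metadata, features)

    def cast_column(self, name, new_type, convert):
        """A transform that changes a feature type; the result carries fresh metadata."""
        columns = dict(self.columns)
        columns[name] = [convert(value) for value in self.columns[name]]
        features = dict(self.features, **{name: new_type})
        return ToyDataset(columns, features, self.schema_metadata)


# Metadata that disagrees with the features passed to the constructor.
stale = {METADATA_KEY: b'{"label": "string"}'}
ds = ToyDataset({"label": [0, 1]}, {"label": "int64"}, schema_metadata=stale)
casted = ds.cast_column("label", "string", str)
```

In this sketch, features_from_metadata(ds.schema_metadata) matches ds.features even though the dataset was built with stale metadata, and the same holds for casted after the type change. This is the kind of check the added test steps perform after each transform.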

Fix #2271.

cc @mariosasko

lhoestq added 8 commits April 23, 2021 16:29

update format, fingerprint and indices after add_item

a877fff

minor

3bd47e5

rename to item_indices_table

cffbd63

test dataset._indices

f834767

Merge branch 'master' into add_item2

d93bc76

fix class_encode_column issue

88676c9

always update metadata in arrow schema

03a57d6

Merge branch 'master' into always-update-metadata-in-arrow-schema

db02eb1

lhoestq mentioned this pull request Apr 28, 2021

Synchronize table metadata with features #2271

Closed

lhoestq merged commit 9a3fc19 into master Apr 29, 2021

lhoestq deleted the always-update-metadata-in-arrow-schema branch April 29, 2021 09:57

lhoestq mentioned this pull request Apr 29, 2021

Implement Dataset add_column #2145

Merged
