Standardizing datasets dtypes #1921

Merged
merged 1 commit into from
Feb 22, 2021
justin-yan
Contributor

This PR follows up on discussion in #1900 to have an explicit set of basic dtypes for datasets.

This moves away from str(pyarrow.DataType) as the method of choice for creating dtypes, favoring an explicit mapping to a list of supported Value dtypes.

I believe this should be backward compatible in practice. Anyone previously using Value() could only have used dtypes that had an identically named pyarrow factory function, and all of those are explicitly supported here. float32 and float64 act as the official datasets dtypes, which resolves the tension between double being the pyarrow dtype and float64 being the pyarrow type factory function.
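
To illustrate the idea, here is a minimal sketch of an explicit dtype mapping. It is not the code from this PR: the names `SUPPORTED_DTYPES`, `ARROW_STR_ALIASES`, and `canonical_dtype` are hypothetical, and the exact list of supported names is an assumption made for the example. It relies only on the fact that `str()` of pyarrow's floating-point types yields `halffloat`, `float`, and `double`, while the factory functions are named `float16`, `float32`, and `float64`.

```python
# Hypothetical sketch: these names are illustrative, not the datasets API.

# Explicit list of supported dtype names (assumed for this example).
SUPPORTED_DTYPES = frozenset({
    "bool",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "string",
})

# str() of pyarrow's floating-point types differs from the factory names:
# str(pa.float16()) == "halffloat", str(pa.float32()) == "float",
# str(pa.float64()) == "double". Map those spellings to the canonical names.
ARROW_STR_ALIASES = {
    "halffloat": "float16",
    "float": "float32",
    "double": "float64",
}


def canonical_dtype(name: str) -> str:
    """Return the canonical dtype name, or raise if it is not supported."""
    name = ARROW_STR_ALIASES.get(name, name)
    if name not in SUPPORTED_DTYPES:
        raise ValueError(f"Unsupported dtype: {name!r}")
    return name
```

With a lookup like this, a dtype string either resolves to one of the listed names or fails loudly, instead of depending on whatever `str(pyarrow.DataType)` happens to produce.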

@justin-yan
Contributor Author

@lhoestq - apologies for the multiple PRs. My previous one (#1905) got mangled by merge conflicts that I had trouble resolving, so I cherry-picked my changes onto a fresh branch here.

Member

@lhoestq lhoestq left a comment

Nice, thank you!

@lhoestq lhoestq merged commit 4c3fecc into huggingface:master Feb 22, 2021
2 participants