Standardizing datasets dtypes #1921

justin-yan · 2021-02-20T22:04:01Z

This PR follows up on discussion in #1900 to have an explicit set of basic dtypes for datasets.

This moves away from str(pyarrow.DataType) as the method of choice for creating dtypes, favoring an explicit mapping to a list of supported Value dtypes.

I believe in practice this should be backward compatible: anyone previously using Value() would only have been able to use dtypes that had an identically named pyarrow factory function, and those are all explicitly supported here. float32 and float64 act as the official datasets dtypes, which resolves the tension between double being the pyarrow dtype and float64 being the pyarrow type factory function.
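To make the naming tension concrete, here is a minimal sketch of what an explicit mapping could look like. It is not the code from this PR: `SUPPORTED_DTYPES`, `PYARROW_STR_ALIASES`, and `normalize_dtype` are hypothetical names, and the dtype list is an assumed subset. The one pyarrow fact it relies on is that `str(pa.float32())` is `"float"` and `str(pa.float64())` is `"double"`, while the factory functions are named `float32` and `float64`.

```python
# Hypothetical sketch: all names and the dtype list are assumptions for
# illustration, not the implementation merged in this PR.

# Assumed subset of supported dtype names. Each one matches a pyarrow
# factory function of the same name (pa.int32(), pa.float64(), pa.string()).
SUPPORTED_DTYPES = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float32", "float64", "string",
}

# str() of these pyarrow types differs from the factory function name:
# str(pa.float32()) == "float" and str(pa.float64()) == "double".
PYARROW_STR_ALIASES = {"float": "float32", "double": "float64"}


def normalize_dtype(name: str) -> str:
    """Map a dtype name (or a pyarrow str() alias) to one canonical name."""
    canonical = PYARROW_STR_ALIASES.get(name, name)
    if canonical not in SUPPORTED_DTYPES:
        raise ValueError(f"Unsupported dtype: {name!r}")
    return canonical


print(normalize_dtype("double"))   # float64
print(normalize_dtype("float32"))  # float32
```

With an explicit set like this, an unsupported name fails loudly instead of being passed through as whatever string pyarrow happens to print.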

justin-yan · 2021-02-20T22:05:35Z

@lhoestq - apologies for the multiple PRs. My previous one (#1905) got mangled by merge conflicts that I had trouble resolving, so I cherry-picked my changes onto a fresh branch here.

lhoestq

Nice, thank you!

Standardizing datasets dtypes to a defined set

4ef4aff

lhoestq approved these changes Feb 22, 2021

lhoestq merged commit 4c3fecc into huggingface:master Feb 22, 2021
