Issue #1895: Bugfix for string_to_arrow timestamp[ns] support #1900

Merged (6 commits into huggingface:master on Feb 19, 2021)
Conversation

@justin-yan (Contributor) commented on Feb 17, 2021:

Should resolve #1895

The main part of this PR adds additional parsing in string_to_arrow to convert the timestamp dtype strings that result from str(pa_type) back into the corresponding pa.DataType (TimestampType).
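Roughly, the idea is along these lines (an illustrative sketch only; the helper name and regex here are made up, not the exact code in the diff):

    import re

    import pyarrow as pa

    def _parse_timestamp_dtype(dtype: str) -> pa.DataType:
        """Sketch: turn 'timestamp[ns]' or 'timestamp[us, tz=UTC]' back into pa.timestamp(...)."""
        match = re.match(r"^timestamp\[(\w+)(?:,\s*tz=([^\]]+))?\]$", dtype)
        if match is None:
            raise ValueError(f"Unsupported timestamp dtype string: {dtype}")
        unit, tz = match.groups()
        return pa.timestamp(unit, tz=tz)  # tz is None when no timezone was given

    assert _parse_timestamp_dtype("timestamp[ns]") == pa.timestamp("ns")
    assert _parse_timestamp_dtype("timestamp[ns, tz=UTC]") == pa.timestamp("ns", tz="UTC")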

While adding unit tests, I noticed that the double/float types also don't invert correctly, so I added support for them as well, which I believe would hypothetically make this section of Value redundant:

    def __post_init__(self):
        if self.dtype == "double":  # fix inferred type
            self.dtype = "float64"
        if self.dtype == "float":  # fix inferred type
            self.dtype = "float32"

However, since I think Value.dtype is part of the public interface, removing that would result in a backward-incompatible change, so I didn't muck with that.

The rest of the PR consists of docstrings I added while developing locally to keep track of which functions were supposed to be inverses of each other. I thought I'd include them in case you want to keep them around, but I'm happy to remove any of them at your request!

Review thread on src/datasets/features.py (outdated, resolved):
    @@ -129,7 +172,7 @@ def cast_to_python_objects(obj: Any) -> Any:
     class Value:
         """Encapsulate an Arrow datatype for easy serialization."""

    -    dtype: str
    +    dtype: str  # The string representation of a pyarrow datatype: str(pyarrow.DataType)
@justin-yan (Contributor, Author):

From the documentation here: https://huggingface.co/docs/datasets/features.html

I think it would be helpful to clarify whether this dtype should be the name of a pyarrow function, or if it should be the string representation of a pyarrow DataType:

i.e. float64 or double?

My intuition is that if you want to accept dtypes such as timestamp[ns], then this dtype value should correspond to str(pa.DataType) rather than to pyarrow.__dict__[<factory function name>] - otherwise you have one style of dtype for certain primitives, and another altogether for timestamps, decimals, etc.:

    >>> str(pyarrow.decimal128(1))
    'decimal(1, 0)'
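The same divergence between str() and the factory-function names shows up for the float types (a quick illustrative REPL check):

    >>> import pyarrow
    >>> str(pyarrow.float64())
    'double'
    >>> str(pyarrow.float32())
    'float'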

I realize that double->float64 and float->float32 are already part of the public interface, so I wouldn't propose changing those since that would be a backward-incompatible change, but I think it would be useful to document this going forward.

Happy to accept a decision either way and to remove this comment, but I wanted to highlight it so that the maintainers can provide explicit guidance. Whichever way you choose, a comment and some additional documentation on the https://huggingface.co/docs/datasets/features.html page would probably be helpful, and I'd be happy to help with that.

Member:

I think we should have a list of supported types and their corresponding pyarrow dtypes.
What about adding a table in the documentation?

The definition of dtype as the name of a pyarrow function is invalid (given the exceptions you mentioned) and restrictive, so an explicit table would make sense IMO. Calling the pyarrow function with the same name is just an internal trick in the code, not a rule that defines the dtypes.
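Just to convey the idea, such a table could be backed by an explicit mapping from dtype strings to pyarrow types, roughly like this (hypothetical sketch, not an actual mapping in the library):

    import pyarrow as pa

    # Hypothetical explicit mapping from datasets dtype strings to pyarrow types.
    DATASETS_DTYPE_TO_ARROW = {
        "bool": pa.bool_(),
        "int64": pa.int64(),
        "float32": pa.float32(),          # pyarrow's str() calls this "float"
        "float64": pa.float64(),          # pyarrow's str() calls this "double"
        "string": pa.string(),
        "timestamp[ns]": pa.timestamp("ns"),
    }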

This also allows us to have enough flexibility to add the support for more types in the future.

Let me know what you think

@justin-yan (Contributor, Author):

If you want to maintain your own dtypes separate from pyarrow, then I think that makes a lot of sense.

In that world, perhaps there should actually be a separate function arrow_to_string that acts as an inverse of string_to_arrow, which you could explicitly control rather than relying on whatever str(pa_type) happens to do?

I'm happy to clean up my PR with this in mind, and I can open a follow-up PR that will incorporate this suggestion and add some documentation if you'd like!
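Something like this, roughly (just a sketch of the idea, special-casing the float names; not a concrete implementation):

    import pyarrow as pa

    def arrow_to_string(pa_type: pa.DataType) -> str:
        """Sketch of an inverse for string_to_arrow: map a pyarrow type to its datasets dtype string."""
        if pa.types.is_float32(pa_type):
            return "float32"  # rather than pyarrow's "float"
        if pa.types.is_float64(pa_type):
            return "float64"  # rather than pyarrow's "double"
        # Fall back to pyarrow's own string representation for everything else,
        # e.g. pa.timestamp("ns") -> "timestamp[ns]".
        return str(pa_type)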

Member:

It makes sense to have arrow_to_string indeed, thanks for the suggestion and for the help :)
And sure, feel free to add some documentation, this is greatly appreciated.

We try to have our own dtypes so that we can provide alternative backends to pyarrow, which can be a bit overkill for data that easily fits in memory IMO. This is something we're exploring.

@lhoestq (Member) left a comment:

Really cool, thank you!

Timestamp handling looks good to me.

I agree with you that we should explain in the documentation the link between the datasets feature types and the internal pyarrow types. I added a few comments.

Also thanks for the tests and the additional docstrings, this is helpful.

Review threads on src/datasets/features.py and tests/test_features.py (resolved).
justin-yan and others added 2 commits February 18, 2021 09:11
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@justin-yan (Contributor, Author):
OK! Thank you for the review - I will follow up with a separate PR for the comments here (#1900 (comment))!

@lhoestq (Member) left a comment:

Thanks a lot!
Let me apply the minor change to the docstring and remove the mention of arrow_to_datasets_dtype for now, since it doesn't exist yet :)

@lhoestq lhoestq merged commit e5ad4d2 into huggingface:master Feb 19, 2021
Successfully merging this pull request may close these issues.

Bug Report: timestamp[ns] not recognized