Issue #1895: Bugfix for string_to_arrow timestamp[ns] support #1900

Merged (6 commits into huggingface:master on Feb 19, 2021)
Conversation

@justin-yan (Contributor) commented on Feb 17, 2021:

Should resolve #1895

The main part of this PR adds additional parsing in string_to_arrow to convert the timestamp dtype strings that result from str(pa_type) back into the corresponding pa.DataType (TimestampType).
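Roughly, the idea is along these lines (an illustrative sketch only; the helper name and regex here are made up, not the exact code in the diff):

    import re

    import pyarrow as pa

    def _parse_timestamp_dtype(dtype: str) -> pa.DataType:
        """Sketch: turn 'timestamp[ns]' or 'timestamp[us, tz=UTC]' back into pa.timestamp(...)."""
        match = re.match(r"^timestamp\[(\w+)(?:,\s*tz=([^\]]+))?\]$", dtype)
        if match is None:
            raise ValueError(f"Unsupported timestamp dtype string: {dtype}")
        unit, tz = match.groups()
        return pa.timestamp(unit, tz=tz)  # tz is None when no timezone was given

    assert _parse_timestamp_dtype("timestamp[ns]") == pa.timestamp("ns")
    assert _parse_timestamp_dtype("timestamp[ns, tz=UTC]") == pa.timestamp("ns", tz="UTC")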

While adding unit tests, I noticed that the double/float types also don't invert correctly, so I added support for them as well, which I believe would hypothetically make this section of Value redundant:

    def __post_init__(self):
        if self.dtype == "double":  # fix inferred type
            self.dtype = "float64"
        if self.dtype == "float":  # fix inferred type
            self.dtype = "float32"

However, since I think Value.dtype is part of the public interface, removing that would result in a backward-incompatible change, so I didn't muck with that.

The rest of the PR consists of docstrings I added while developing locally to keep track of which functions were supposed to be inverses of each other. I thought I'd include them in case you want to keep them around, but I'm happy to remove any of them at your request!

Review thread on src/datasets/features.py (outdated, resolved):
    @@ -129,7 +172,7 @@ def cast_to_python_objects(obj: Any) -> Any:
     class Value:
         """Encapsulate an Arrow datatype for easy serialization."""

    -    dtype: str
    +    dtype: str  # The string representation of a pyarrow datatype: str(pyarrow.DataType)
@justin-yan (Contributor, Author):

From the documentation here: https://huggingface.co/docs/datasets/features.html

I think it would be helpful to clarify whether this dtype should be the name of a pyarrow function, or if it should be the string representation of a pyarrow DataType:

i.e. float64 or double?

My intuition is that if you want to accept dtypes such as timestamp[ns], then this dtype value should correspond to str(pa.DataType) rather than to pyarrow.__dict__[<factory function name>] - otherwise you have one style of dtype for certain primitives, and another altogether for timestamps, decimals, etc.:

    >>> str(pyarrow.decimal128(1))
    'decimal(1, 0)'
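The same divergence between str() and the factory-function names shows up for the float types (a quick illustrative REPL check):

    >>> import pyarrow
    >>> str(pyarrow.float64())
    'double'
    >>> str(pyarrow.float32())
    'float'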

I realize that double->float64 and float->float32 are already part of the public interface, so I wouldn't propose changing those since that would be a backward-incompatible change, but I think it would be useful to document this going forward.

Happy to accept a decision either way and to remove this comment, but I wanted to highlight it so that the maintainers can provide explicit guidance. Whichever way you choose, a comment and some additional documentation on the https://huggingface.co/docs/datasets/features.html page would probably be helpful, and I'd be happy to help with that.

Member:

I think we should have a list of supported types and their corresponding pyarrow dtypes.
What about adding a table in the documentation?

The definition of dtype as the name of a pyarrow function is invalid (given the exceptions you mentioned) and restrictive, so an explicit table would make sense IMO. Calling the pyarrow function with the same name is just an internal trick in the code, not a rule that defines the dtypes.
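Just to convey the idea, such a table could be backed by an explicit mapping from dtype strings to pyarrow types, roughly like this (hypothetical sketch, not an actual mapping in the library):

    import pyarrow as pa

    # Hypothetical explicit mapping from datasets dtype strings to pyarrow types.
    DATASETS_DTYPE_TO_ARROW = {
        "bool": pa.bool_(),
        "int64": pa.int64(),
        "float32": pa.float32(),          # pyarrow's str() calls this "float"
        "float64": pa.float64(),          # pyarrow's str() calls this "double"
        "string": pa.string(),
        "timestamp[ns]": pa.timestamp("ns"),
    }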

This also allows us to have enough flexibility to add the support for more types in the future.

Let me know what you think

@justin-yan (Contributor, Author):

If you want to maintain your own dtypes separate from pyarrow, then I think that makes a lot of sense.

In that world, perhaps there should actually be a separate function arrow_to_string that acts as an inverse of string_to_arrow, which you could explicitly control rather than relying on whatever str(pa_type) happens to do?

I'm happy to clean up my PR with this in mind, and I can open a follow-up PR that will incorporate this suggestion and add some documentation if you'd like!
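Something like this, roughly (just a sketch of the idea, special-casing the float names; not a concrete implementation):

    import pyarrow as pa

    def arrow_to_string(pa_type: pa.DataType) -> str:
        """Sketch of an inverse for string_to_arrow: map a pyarrow type to its datasets dtype string."""
        if pa.types.is_float32(pa_type):
            return "float32"  # rather than pyarrow's "float"
        if pa.types.is_float64(pa_type):
            return "float64"  # rather than pyarrow's "double"
        # Fall back to pyarrow's own string representation for everything else,
        # e.g. pa.timestamp("ns") -> "timestamp[ns]".
        return str(pa_type)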

Member:

It makes sense to have arrow_to_string indeed, thanks for the suggestion and for the help :)
And sure, feel free to add some documentation, this is greatly appreciated.

We try to have our own dtypes so that we can provide alternative backends to pyarrow, which can be a bit overkill for data that easily fits in memory IMO. This is something we're exploring.

@lhoestq (Member) left a comment:

Really cool, thank you!

Timestamp handling looks good to me.

I agree with you that we should explain in the documentation the link between the datasets feature types and the internal pyarrow types. I added a few comments.

Also thanks for the tests and the additional docstrings, this is helpful.

Review threads on src/datasets/features.py and tests/test_features.py (resolved).
justin-yan and others added 2 commits February 18, 2021 09:11
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@justin-yan (Contributor, Author):
OK! Thank you for the review - I will follow up with a separate PR for the comments here (#1900 (comment))!

@lhoestq (Member) left a comment:

Thanks a lot!
Let me apply the minor change to the docstring and remove the mention of arrow_to_datasets_dtype for now, since it doesn't exist yet :)

@lhoestq lhoestq merged commit e5ad4d2 into huggingface:master Feb 19, 2021
Successfully merging this pull request may close these issues.

Bug Report: timestamp[ns] not recognized