Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series.__getitem__ returns a datetime.timedelta for pl.Duration('ns') dtype #14695

Closed
2 tasks done
douglas-raillard-arm opened this issue Feb 26, 2024 · 5 comments
Closed
2 tasks done
Labels
A-timeseries Area: date/time functionality bug Something isn't working invalid A bug report that is not actually a bug python Related to Python Polars

Comments

@douglas-raillard-arm
Copy link
Contributor

douglas-raillard-arm commented Feb 26, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import datetime
df = pl.DataFrame(dict(a=[1]))
x = df.select(pl.col('a').cast(pl.Duration('ns')))['a'][0]
assert isinstance(x, datetime.timedelta)

Log output

N/A

Issue description

pl.Duration('ns') dtype is able to represent nanosecond-recision durations.
However, datetime.timedelta() is not able to represent nanosecond-precision durations:
https://bugs.python.org/issue15443
python/cpython#59648

This means it's currently not possible to access the real value without casting to an integer dtype first.

Expected behavior

Series.__getitem__ should return instances of a type able to represent the values stored in the series.

pandas has implemented its own equivalent type that does handle nanoseconds (possibly for unrelated reasons):
https://pandas.pydata.org/pandas-docs/version/2.1/reference/api/pandas.Timedelta.html

Installed versions

--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Linux-5.15.0-92-generic-x86_64-with-glibc2.29
Python:               3.8.10 (default, Nov 22 2023, 10:22:35) 
[GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.7.4
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

EDIT: added the CPython github issue tracker link

@douglas-raillard-arm douglas-raillard-arm added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 26, 2024
@mcrumiller
Copy link
Contributor

The same is true for both pl.Time and pl.Duration. The polars library has neither the expectation nor the guarantee that every data type has a corresponding type in the standard python library. If we want to change that, it would be a bit of an overhaul to the temporal system in place.

But I do agree that we should perhaps create a pl.Scalar class that can represent elements of the various data types. For example, if the user wanted to filter on a very specific time down to the nanosecond, they would require knowledge of how times are implemented:

>>> from datetime import time
>>> import polar as pl
>>> s = pl.time_range(time(hour=1), time(hour=1, second=1), interval="1ns")
>>> s.filter(s <= time(hour=1, nanosecond=1))  # can't do this!
>>> s.filter(s.cast(pl.Int64) <= (1_000_000_000*3600 + 1)).cast(pl.Time)  # must use integer
shape: (2,)
Series: 'literal' [time]
[
        01:00:00
        01:00:00.000000001
]

It would be nice to be able to do:

>>> s.filter(s < pl.Time.Scalar(hour=1, nanosecond=1))

...or something of the sort.

@mcrumiller
Copy link
Contributor

I think I will open an enhancement request for a scalar and see how that is received.

@stinodego stinodego added A-timeseries Area: date/time functionality invalid A bug report that is not actually a bug and removed needs triage Awaiting prioritization by a maintainer labels Feb 26, 2024
@stinodego
Copy link
Member

This is not a bug, but I can understand the need for a scalar type. We have already talked about this internally. I'll close this in favor of #14700 for now.

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Feb 26, 2024
@douglas-raillard-arm
Copy link
Contributor Author

I suppose external uses could convert back to an int using .dt.total_nanoseconds(), but I fear that many 3rd party lib will end up with the seemingly "good enough" approach of simply using the timedelta and that it will hurt the polars ecosystem. For example, a plotting library could think "how nice, we get this std lib object" and just go for it without realizing it will be a deal breaker for some users, especially since the doc does not have any particular mention about that AFAICT:
https://docs.pola.rs/py-polars/html/reference/api/polars.Duration.html#polars-duration

If we want to change that, it would be a bit of an overhaul to the temporal system in place.

Would it necessarily be ? Maybe it would be possible to subclass datetime.timedelta to add nanosecond support, and make __getitem__ return that. I haven't dug deep in polars but extracting values from a Series/DataFrame shouldn't be done if perf matters anyway, so even if that subclass ends up being a bit more expensive to build/manipulate it's probably ok

@douglas-raillard-arm
Copy link
Contributor Author

EDIT: It did not take long to find an issue of this kind, starting from polars' own .plot namespace: holoviz/hvplot#488
I'm not sure what timedelta is being referred to there (Python one ? Maybe a Javascript counterpart ?), and the issue is not limited to polars, but the idea is there, it trips people.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-timeseries Area: date/time functionality bug Something isn't working invalid A bug report that is not actually a bug python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

3 participants