Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/DES: Non-Nanosecond Tracker #46587

Open
14 of 25 tasks
jbrockmendel opened this issue Mar 31, 2022 · 10 comments
Open
14 of 25 tasks

API/DES: Non-Nanosecond Tracker #46587

jbrockmendel opened this issue Mar 31, 2022 · 10 comments
Labels
API Design Non-Nano datetime64/timedelta64 with non-nanosecond resolution

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 31, 2022

Support for non-nanosecond timedelta64, datetime64, and datetime64tz is coming along. The next big planned steps are to get the Timedelta and Timestamp scalars to support non-nano resolutions. There are a few design options that I'd like to get input on cc @pandas-dev/pandas-core

I'm doing Timedelta first mostly because that is more conducive to doing as a scoped PR with dedicated testing. Current plan is to make a dedicated constructor like Timedelta._from_value_and_reso (so i can write tests) that can can be removed once we decide on the public behavior. Which brings us to the questions:

  • Do we add a reso-like keyword to the constructors? Or use/respect "unit"?
  • With non-nano np.timedelta64 objects that don't overflow when cast to nano, do we still cast? e.g. does `np.datetime64(4, "s") become ns or stay s?
  • Similar with pytimedelta. pd.Timedelta(timedelta(days=106752)) currently raises, in the future will presumably come back with a 'us' reso. So what about currently non-raising cases like pd.Timedelta(timedelta(days=106751))? Does it stay ns or become us?

Other

  1. What happens to (class attributes) (Timestamp|Timedelta)(min|max|resolution)?

Update 2023-01-16 Remaining TODO list

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 31, 2022
@mroeschke
Copy link
Member

Quick thoughts:

Do we add a reso-like keyword to the constructors? Or use/respect "unit"?

My gut is saying this should use/respect "unit" as it's not clear to me how "unit" and a reso-like keyword would cooperate?

With non-nano np.timedelta64 objects that don't overflow when cast to nano, do we still cast?

Based on the GH issues I remember seeing in the past, generally people were hoping datetime64[s] was not cast to datetime64[ns] for example, but this behavior has been around for a while and some users may have gotten use to it. I would think this would be a welcome change as the "old" behavior could be preserved with astype("...64[ns]")?

Similar with pytimedelta...

Under the assumption unit : str, default 'ns' is still the default after this feature, I would expect "ns" for the current non-raising cases

@mroeschke mroeschke added API Design Non-Nano datetime64/timedelta64 with non-nanosecond resolution and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 1, 2022
@rhshadrach
Copy link
Member

Do keep in mind I don't have much experience using the date functionality even as-is.

Do we add a reso-like keyword to the constructors? Or use/respect "unit"?

For the "use/respect unit" option, I think this would mean that a user specifying pd.Timedelta(1, unit='days') would always have a resolution of days. Is there a downside to this? One thought is combining different units (and hence, resolutions) would constantly require conversions. But maybe different resolutions would be cached under the hood.

Another option, if being able to specify the resolution is advantageous, would be to allow units such as days[s] or days[ns] etc. Unspecified could then default to ns.

pd.Timedelta(timedelta(days=106752)) currently raises, in the future will presumably come back with a 'us' reso.

Is this value-dependent behavior? If everything "just works", then I don't have much of a concern. However, if getting back one resolution vs another can cause odd results in edge cases, this could be surprising and lead to hard to find bugs. If that is the case, I would lean toward raising and making the user specify the resolution.

@jbrockmendel
Copy link
Member Author

For the "use/respect unit" option, I think this would mean that a user specifying pd.Timedelta(1, unit='days') would always have a resolution of days. Is there a downside to this?

Not really, just an API change.

One thought is combining different units (and hence, resolutions) would constantly require conversions.

Yah, I've handled comparisons between scalars of different resolutions, but addition/subtraction (and division for Timedelta) it's not obvious what the right behavior is (i think numpy silently converts in ways that sometimes overflow, have examples/tests written down somewhere). I think we provide an .astype-like API to allow conversion and otherwise raise on potentially-ambiguous operations.

@jorisvandenbossche
Copy link
Member

  • Do we add a reso-like keyword to the constructors? Or use/respect "unit"?

The existing keyword is about "how to interpret the user input", and what we now want is "how to store the parsed value". Those two are not necessarily the same, and I personally don't think we can simply assume they are (e.g. I think it would be quite common use case that you want to convert values that represent for example hours, but you want to store then in microseconds for compatibility with other data).

In addition, we allow more units here for this keyword than what I would implement as possible resolution of the dtype.
But given the responses above, we should maybe have this discussion explicitly about which resolutions we want to support. For example, about:

For the "use/respect unit" option, I think this would mean that a user specifying pd.Timedelta(1, unit='days') would always have a resolution of days. Is there a downside to this?

My answer would be: we don't support a resolution of days, so such a resulting value is simply not possible (and thus it would be impossible to always respect the unit from the constructor.

(it's certainly unfortunate, though, that the unit keyword is not directly available for specifying the resulting storage unit .., as that would be the logical keyword name for this)

  • Similar with pytimedelta. pd.Timedelta(timedelta(days=106752)) currently raises, in the future will presumably come back with a 'us' reso. So what about currently non-raising cases like pd.Timedelta(timedelta(days=106751))? Does it stay ns or become us?

I think long term, the logical behaviour would be that any datetime.timedelta conversion by default results in "us" (since that is the "native" resolution of datetime.timedelta), regardless of the actual value.
That will be a breaking change though, and I suppose we can only make that at some point in a major release (eg when also changing the default resolution), and on the short term keep using "ns" for this (and thus value-dependent behaviour).

@jbrockmendel
Copy link
Member Author

I'm reaching a point where making further progress depends on having non-nano DatetimeTZDtype. That is easy to implement, but would be an API change bc ATM passing non-nano to the DatetimeTZDtype constructor raises (which we test).

Selfishly, it would be very convenient to break the no-changes-in-minor-versions rule and do this immediately. Since this is a pretty tiny corner case, I doubt anyone would complain. But the policy is there for a reason.

Thoughts on a) whether this is reasonable or b) viable alternatives? cc @mroeschke @jreback @jorisvandenbossche

@mroeschke
Copy link
Member

ATM passing non-nano to the DatetimeTZDtype constructor raises (which we test).

Does it raise with a not-implemented-like error message? If so, I think the "API change" could be advertised as an "enhancement"

@jbrockmendel
Copy link
Member Author

ValueError

@datapythonista
Copy link
Member

If I'm not missing anything, the only user code that could "break" with what you propose is if the user has code creating second precision datetime data, which currently breaks, and after your changes would work.

Personally I think:

  1. Is very unlikely that users have code with second precision, since it's not implemented. Or precision obtained at runtime
  2. If someone implemented this code, I assume would be to use it if this was ever implemented, and would be happy to see it finally working

If I'm not missing something, I'd personally add this directly without overcomplicating things.

@jorisvandenbossche
Copy link
Member

Indeed, enabling more keyword values is generally not seen as a breaking change, I would say. Of course, the situation is a bit different here as the keyword values already did exist elsewhere (in numpy), and so some users might have tried passing it. But relying on that passing it raises an error seems unlikely.

So for the dtype constructor, I would also directly add the ability to create dtypes with other resolutions.

@jbrockmendel
Copy link
Member Author

Status Update 2022-06-20. The constructors point is the one the most pressing.

TODO - Technical:
    - Timestamp/Timedelta constructors will choke on values outside the pydatetime/pytimedelta implementation bounds; work around this.
    - PERF
        The implementation of non-nano support means many new checks on reso, which necessarily hurts performance. Try to get some of that back. Ideas:
        - Cache Localizer
        - Localizer.utc_val_to_local_val could be nogil and avoid the `except? -1` if restricted to the non-tzlocal/zoneinfo cases.  Re-separating these might improve performance at the cost of re-duplicating some code.
    - Not Yet Implemented
        - pd.io compat
        - pd.plotting compat

TODO - API Design:
    - Constructors
        Timestamp, Timedelta, to_datetime, to_timedelta, date_range, timedelta_range, others?

        - How to specify desired resolution? Options:
            a) add a keyword e.g. "reso", e.g.

                In [4]: pd.Timestamp(4567, unit="us", reso="us")

            b) repurpose the "unit" keyword to specify the output resolution, e.g.

                In [4]: pd.Timestamp(4567, unit="us")

            The main place where this would cause an issue is if the user wants a different input resolution vs output resolution, in which case we would direct the user to convert post-construction along the lines of:

                In [5]: pd.Timestamp(4567, unit="s").as_unit("us")

            (ATM there is a private _as_unit which will likely be made public)

            c) don't specify it, infer it from the input, let users convert post-construction if they want.

            If we were starting fresh, b) would be my preferred API.  Backwards-compat concerns make a) more appealing. If we want to provide something user-facing in 1.5, a) seems like the only option.

        - Resolution Inference - scalars
            - np.datetime64, np.timedelta64 have natural resolutions attached to them. We should preserve these wherever possible and cast to the nearest support resolution otherwise.  e.g. Timestamp(np.datetime64(4, "ms")) would retain millisecond resolution.
            - pydatetime and pytimedelta objects naturally have microsecond resolution. The Timestamp/Timedelta constructors should preserve these.
            - strings are the harder case.  Should e.g. `Timestamp("2022-06-20 08:21:04")` come back with second-resolution?  I lean towards "yes".

        - Resolution Inference - arrays
            If we do inference on scalars, presumably we would want the analogous inference on arrays of scalars. This raises questions about what to do with mixed-resolution inputs:

                In [3]: pd.to_datetime(["2016-01-02 03:04:05", "2016-06-07 08:09:10.111", "2017-11-20 12:13:14:156789"])


    - Arithmetic between Timestamps/Timedelta of mismatched resolution?
        - Options:
            a) Raise in all cases
                We currently do this for division (truediv, rtruediv, floordiv, rfloordiv) on Timedelta
            b) Cast to the higher resolution like numpy:

                In [2]: td = np.timedelta64(1, "h")
                In [3]: dt = np.datetime64("1994-01-02 03:04:05")
                In [4]: dt + td
                Out[4]: numpy.datetime64('1994-01-02T04:04:05')

                In [5]: td + dt
                Out[5]: numpy.datetime64('1994-01-02T04:04:05')

            c) Cast to the lower resolution
            d) (the current choice) Cast to the lower resolution *if* doing so is lossless

                In [3]: ts = pd.Timestamp("2016-01-01 02:03:04")._as_unit("s")
                In [4]: other = pd.Timestamp(500_000_000)._as_unit("ms")

                In [5]: ts - other
                [...]
                ValueError: Timestamp subtraction with mismatched resolutions is not allowed when casting to the lower resolution would require lossy rounding.

                i) ATM (2022-06-20 16:01 UTC) this check is in place for Timestamp.__sub__(datetimelike) but not for Timedelta. The intention is to implement it for all the relevant add/sub ts/td combinations.
                ii) ATM (2022-06-19 16:01 UTC) this check is not implemented for DatetimeArray or TimedeltaArray.
            
                iii) The main reason for choosing d) over a) is that a) would break user code if/when, as expected, we change constructor behavior for Timestamp(pydatetime_object) to have microsecond resolution, e.g:
                    In [3]: pd.Timestamp.now() - pd.Timestamp(datetime.now())

    - find_common_type with mixed resolution dt64/td64
        - numpy returns object
            In [19]: np.find_common_type([np.dtype("M8[ns]"), np.dtype("M8[s]")], [])
            Out[19]: dtype('O')

        - To be consistent with the arithmetic logic described above, we would want the common type to be the lower resolution if that cost is lossless, and object otherwise.
            This is value-dependent behavior that we otherwise try to avoid.

    - (Timestamp|Timedelta).(min|max|resolution)
        - ATM these are class attributes that assume Timestamp/Timedelta are always in nanoseconds, so will be incorrect for non-nano instances.
        - One solution is a descriptor that behaves differently depending on whether it is accessed on the class or an instance. e.g. `Timestamp.min` remains unchanged but `timestamp_instance_with_second_reso.min` returns the minimum second-resolution Timestamp.
        - Another solution would be for Timestamp.min to return a dict-like so you would do e.g. `Timestamp.min["ns"]`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Non-Nano datetime64/timedelta64 with non-nanosecond resolution
Projects
None yet
Development

No branches or pull requests

5 participants