Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame from dict with non-nano Timedelta #48901

Merged
merged 6 commits into from
Nov 18, 2022

Conversation

jbrockmendel
Copy link
Member

xref #44504

This is motivated by a branch (upcoming PR) that makes the Timedelta constructor preserve resolution when passed a np.timedelta64 object. Without this change, test_constructor_dict_timedelta64_index fails because of a dict lookup in lib.fast_multiget.

For non-nano Timedeltas this does significantly hurt performance:

In [1]: import pandas as pd
In [2]: td = pd.Timedelta(days=1, minutes=3)
In [3]: td2 = td._as_unit("ms")

In [4]: %timeit hash(td)
145 ns ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)  # <- main
131 ns ± 3.43 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)  # <- PR

In [5]: %timeit hash(td2)
138 ns ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)  # <- main
1.8 µs ± 42.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <- PR

I expect there's room for optimization. Something like:

if self._is_in_pytimedelta_bounds() and not self._has_ns():
    return timedelta.__hash__(self)
else:
    return self._hash_version_in this_pr()

@mroeschke mroeschke added Timedelta Timedelta data type DataFrame DataFrame data structure Non-Nano datetime64/timedelta64 with non-nanosecond resolution labels Oct 3, 2022
# resolution.
try:
obj = self._as_reso(<NPY_DATETIMEUNIT>(self._reso + 1))
except OverflowError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the unit test hit both the except and else branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just reach one branch, will add test(s)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these get added?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, not yet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated per request

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 18, 2022
@mroeschke mroeschke added this to the 2.0 milestone Nov 18, 2022
@mroeschke mroeschke removed the Stale label Nov 18, 2022
@mroeschke mroeschke merged commit d4e3963 into pandas-dev:main Nov 18, 2022
@mroeschke
Copy link
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the bug-fast_multiget branch November 18, 2022 23:39
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
* BUG: DataFrame from dict with non-nano Timedelta

* fix __hash__

* test for both hash cases
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFrame DataFrame data structure Non-Nano datetime64/timedelta64 with non-nanosecond resolution Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants