REF/API: DatetimeTZDtype #23990

Merged
20 commits merged into pandas-dev:master on Dec 3, 2018

Conversation

5 participants
@TomAugspurger
Contributor

commented Nov 29, 2018

A cleanup of DatetimeTZDtype

  • Remove magic constructor from string. It seemed dangerous to overload unit with either a unit or a full datetime64[ns, tz] string. This required changing DatetimeTZDtype.construct_from_string and updating a number of places to use construct_from_string rather than the main __new__.
  • Change __new__ to __init__ and remove caching. It seemed to be causing more problems than it was worth. You could too easily create nonsense DatetimeTZDtypes like
In [3]: t = pd.core.dtypes.dtypes.DatetimeTZDtype(unit=None, tz=None)
  • Change .tz and .unit to properties instead of attributes. I've not provided setters. We could in theory, since we're getting rid of caching, but I'd rather wait until there's demand.
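The constructor split can be sketched in plain Python (illustrative only, not the pandas source; the regex mirrors the `_match` pattern quoted later in this thread):

```python
import re

# Illustrative sketch: after this change, parsing a full dtype string like
# "datetime64[ns, UTC]" is the job of construct_from_string, while
# __init__ only accepts an explicit `unit` and `tz`.
_match = re.compile(r"(datetime64|M8)\[(?P<unit>.+), (?P<tz>.+)\]")

def construct_from_string(string):
    """Split a dtype string into its (unit, tz) components."""
    match = _match.match(string)
    if match is None:
        raise TypeError(
            "Cannot construct a 'DatetimeTZDtype' from '{}'".format(string))
    groups = match.groupdict()
    return groups["unit"], groups["tz"]
```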

The remaining changes in the DatetimeArray PR will be to

  1. Inherit from ExtensionDtype
  2. Implement construct_array_type
  3. Register the dtype
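These steps follow the generic ExtensionDtype pattern. As a rough plain-Python stand-in for how a registered dtype's construct_from_string hook gets used (the Registry and ToyTZDtype names below are hypothetical, not pandas code):

```python
# Hypothetical stand-in for pandas' dtype registry: registered dtype
# classes each get a chance to parse a dtype string.
class Registry:
    def __init__(self):
        self.dtypes = []

    def register(self, dtype_cls):
        self.dtypes.append(dtype_cls)
        return dtype_cls

    def find(self, string):
        for dtype_cls in self.dtypes:
            try:
                return dtype_cls.construct_from_string(string)
            except TypeError:
                continue
        return None

registry = Registry()

@registry.register
class ToyTZDtype:
    def __init__(self, unit="ns", tz=None):
        self.unit, self.tz = unit, tz

    @classmethod
    def construct_from_string(cls, string):
        # Only accept the full "datetime64[unit, tz]" spelling.
        if not (string.startswith("datetime64[") and string.endswith("]")):
            raise TypeError("cannot parse {!r}".format(string))
        unit, _, tz = string[len("datetime64["):-1].partition(", ")
        return cls(unit=unit, tz=tz or None)
```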
@pep8speaks


commented Nov 29, 2018

Hello @TomAugspurger! Thanks for submitting the PR.

@jreback

Contributor

commented Nov 29, 2018

pls perf test things

caching is essential as these are created quite a lot and they are the same virtually every time
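For context, the caching being removed worked roughly like the following sketch (a reconstruction for illustration; CachedTZDtype is a hypothetical stand-in, not the pandas class):

```python
# Illustrative reconstruction of the removed caching pattern: instances
# are memoized in a class-level dict keyed by (unit, tz), so repeated
# construction with the same arguments returns the same object.
class CachedTZDtype:
    _cache = {}

    def __new__(cls, unit="ns", tz="UTC"):
        key = (unit, str(tz))
        try:
            return cls._cache[key]
        except KeyError:
            instance = super().__new__(cls)
            instance.unit, instance.tz = unit, tz
            cls._cache[key] = instance
            return instance
```

A hazard of this pattern, noted in the PR description, is that nonsense argument combinations such as (None, None) are cached just as readily as valid ones.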

@TomAugspurger

Contributor Author

commented Nov 29, 2018

On microbenchmarks, things are fine

master:

In [2]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 39 µs, sys: 31 µs, total: 70 µs
Wall time: 75.6 µs

In [3]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 15 µs, sys: 0 ns, total: 15 µs
Wall time: 18.8 µs

PR:

In [2]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 29 µs, sys: 23 µs, total: 52 µs
Wall time: 56 µs

In [3]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 16 µs, sys: 0 ns, total: 16 µs
Wall time: 19.8 µs

ASV for the timeseries, timestamps, and offsets

       before           after         ratio
     [6b3490f4]       [982c169a]
-         143±9ms          127±4ms     0.89  timeseries.DatetimeIndex.time_add_timedelta('tz_aware')
-     2.73±0.05μs      2.45±0.02μs     0.90  timestamp.TimestampOps.time_tz_convert(tzutc())

Line-profiling reveals that basically all the time is spent on the timezone check.


In [5]: %lprun -s -f DatetimeTZDtype.__init__ DatetimeTZDtype(tz='utc')
Timer unit: 1e-06 s

Total time: 4e-05 s
File: /Users/taugspurger/sandbox/pandas/pandas/core/dtypes/dtypes.py
Function: __init__ at line 494

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   494                                               def __init__(self, unit="ns", tz=None):
   495                                                   """
   496                                                   An ExtensionDtype for timezone-aware datetime data.
   497
   498                                                   Parameters
   499                                                   ----------
   500                                                   unit : str, default "ns"
   501                                                       The precision of the datetime data. Currently limited
   502                                                       to ``"ns"``.
   503                                                   tz : str, int, or datetime.tzinfo
   504                                                       The timezone.
   505
   506                                                   Raises
   507                                                   ------
   508                                                   pytz.UnknownTimeZoneError
   509                                                       When the requested timezone cannot be found.
   510
   511                                                   Examples
   512                                                   --------
   513                                                   >>> pd.core.dtypes.dtypes.DatetimeTZDtype(tz='UTC')
   514                                                   datetime64[ns, UTC]
   515
   516                                                   >>> pd.core.dtypes.dtypes.DatetimeTZDtype(tz='dateutil/US/Central')
   517                                                   datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
   518                                                   """
   519         1          7.0      7.0     17.5          if isinstance(unit, DatetimeTZDtype):
   520                                                       unit, tz = unit.unit, unit.tz
   521
   522         1          1.0      1.0      2.5          if unit != 'ns':
   523                                                       raise ValueError("DatetimeTZDtype only supports ns units")
   524
   525         1          0.0      0.0      0.0          if tz:
   526         1         27.0     27.0     67.5              tz = timezones.maybe_get_tz(tz)
   527                                                   elif tz is not None:
   528                                                       raise pytz.UnknownTimeZoneError(tz)
   529                                                   elif tz is None:
   530                                                       raise TypeError("A 'tz' is required.")
   531
   532         1          4.0      4.0     10.0          self._unit = unit
   533         1          1.0      1.0      2.5          self._tz = tz

I switched the properties to cache_readonly (which we can do, since we aren't using _cache for caching instances now).

_metadata = ('unit', 'tz')
_match = re.compile(r"(datetime64|M8)\[(?P<unit>.+), (?P<tz>.+)\]")
_cache = {}
# TODO: restore caching? who cares though? It seems needlessly complex.
# np.dtype('datetime64[ns]') isn't a singleton
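For illustration, pandas' internal cache_readonly decorator behaves much like the stdlib functools.cached_property used in this hypothetical sketch: the property body runs once per instance, the result is cached, and no setter exists.

```python
from functools import cached_property

# Stand-in sketch using functools.cached_property (Python 3.8+) in place
# of pandas' internal cache_readonly: `name` is computed on first access
# and then cached on the instance; there is no setter.
class ToyDtype:
    def __init__(self, unit="ns", tz="UTC"):
        self._unit = unit
        self._tz = tz

    @cached_property
    def name(self):
        return "datetime64[{}, {}]".format(self._unit, self._tz)
```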

@jreback

jreback Nov 29, 2018

Contributor

this is a huge performance penalty w/o caching

@TomAugspurger

TomAugspurger Nov 29, 2018

Author Contributor

Did you see the perf numbers I posted in
#23990 (comment)? It seems to be slightly faster without caching (though within noise).

@jreback

jreback Nov 29, 2018

Contributor

try running a good part of the test suite. It's the repeated construction that's a problem, not the single construction, which is fine. W/o caching you end up creating a huge number of these

@jreback

jreback Nov 29, 2018

Contributor

guess could remove the comment now

Resolved review thread on pandas/core/dtypes/dtypes.py (outdated)
@jreback

Contributor

commented Nov 29, 2018

maybe things changed since I created this originally, but w/o caching perf was not great.

TomAugspurger added some commits Nov 29, 2018

Updates
* Use pandas_dtype
* removed cache_readonly
@codecov


commented Nov 29, 2018

Codecov Report

Merging #23990 into master will increase coverage by 49.87%.
The diff coverage is 97.95%.


@@             Coverage Diff             @@
##           master   #23990       +/-   ##
===========================================
+ Coverage   42.38%   92.25%   +49.87%     
===========================================
  Files         161      161               
  Lines       51701    51689       -12     
===========================================
+ Hits        21914    47687    +25773     
+ Misses      29787     4002    -25785
Flag       Coverage Δ
#multiple  90.66% <97.95%> (?)
#single    42.51% <63.26%> (+0.12%) ⬆️

Impacted Files                        Coverage Δ
pandas/core/dtypes/common.py          95.62% <ø> (+25.23%) ⬆️
pandas/core/arrays/datetimelike.py    96.34% <100%> (+45.91%) ⬆️
pandas/core/arrays/datetimes.py       98.43% <100%> (+34.63%) ⬆️
pandas/core/internals/blocks.py       93.69% <100%> (+41.32%) ⬆️
pandas/core/dtypes/missing.py         93.1% <100%> (+35.63%) ⬆️
pandas/core/dtypes/dtypes.py          95.33% <97.14%> (+19.17%) ⬆️
pandas/core/computation/pytables.py   92.37% <0%> (+0.3%) ⬆️
pandas/io/pytables.py                 92.3% <0%> (+0.92%) ⬆️
pandas/util/_test_decorators.py       93.24% <0%> (+4.05%) ⬆️
pandas/compat/__init__.py             58.36% <0%> (+8.17%) ⬆️
... and 125 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08395af...5cde369.

Resolved review threads on pandas/core/dtypes/common.py and pandas/core/dtypes/dtypes.py (outdated)

TomAugspurger added some commits Nov 29, 2018

@TomAugspurger

Contributor Author

commented Nov 29, 2018

I've removed _coerce_to_dtype.

@jreback

Contributor

commented Nov 29, 2018

don't you need to remove the _coerce_to_dtype tests?

@TomAugspurger

Contributor Author

commented Nov 29, 2018

7ab2a74 removed all the _coerce_to_dtype tests.

@TomAugspurger

Contributor Author

commented Dec 3, 2018

Do we typically add release notes for something like this (deprecations that aren't part of the public API, but are possibly in use downstream)?

@jsexauer referenced this pull request Dec 3, 2018: DEPR: deprecations from prior versions #6581 (open, 0 of 94 tasks complete)
@TomAugspurger

Contributor Author

commented Dec 3, 2018

Added to the deprecation log.

@jreback

Contributor

commented Dec 3, 2018

Do we typically add release notes for something like this (deprecations that aren't part of the public API, but are possibly in use downstream)?

this IS certainly part of the public API as DatetimeTZDtype is public.

@jreback

Contributor

commented Dec 3, 2018

I know pyarrow uses this at the very least. (I don't know if the second form of the construction is used, though.) cc @wesm

@TomAugspurger

Contributor Author

commented Dec 3, 2018

Ah, I didn't realize it was exported in api.types; it's not in the API docs. I'll add it to api.rst at the top level (as of #23581)

@TomAugspurger

Contributor Author

commented Dec 3, 2018

Added a release note. Going to follow up on api.rst later, since we don't have an array section for datetime arrays there yet.

@jreback

jreback approved these changes Dec 3, 2018

@jreback

Contributor

commented Dec 3, 2018

thanks @TomAugspurger lgtm. let's merge on green. can you add a ref into #6581

@datapythonista

Member

commented Dec 3, 2018

Haven't seen those errors before; I guess something is failing on their side. @davidstaheli do you mind taking a look at https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=4597

@jreback

Contributor

commented Dec 3, 2018

@TomAugspurger needs a rebase as well

@datapythonista

Member

commented Dec 3, 2018

@TomAugspurger I was trying to create an incident for Azure, but the URL does some weird redirect with the language and it crashes. Not sure if it's Linux specific; can you check if the "Create incident" button works for you: https://azure.microsoft.com/en-us/support/create-ticket/ (the Basic technical support one)

Or @vtbassmatt, maybe you can help with the above? See the error in https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=4597, we are getting that often

@datapythonista

Member

commented Dec 3, 2018

I was checking, and numba is getting the same error from azure. I guess something is down for the whole azure-pipelines.

https://dev.azure.com/numba/numba/_build/results?buildId=471

@vtbassmatt

Contributor

commented Dec 3, 2018

I suspect it's a little bit of abuse fighting that was turned up a little too high. I think we've addressed it now.

@jreback jreback merged commit 3fe697f into pandas-dev:master Dec 3, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20181203.72 succeeded
Details