New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

BUG/REF: TimedeltaIndex.new #23539

Merged

jreback merged 19 commits into pandas-dev:master from jbrockmendel:pre-fixes

Nov 11, 2018

Member

jbrockmendel commented Nov 7, 2018

TimedeltaIndex.__new__ type-checking is tough to decipher, and hides multiple bugs

>>> dti = pd.date_range('2016-01-01', periods=3)
>>> pd.TimedeltaIndex(np.array(dti))
TimedeltaIndex(['16801 days', '16802 days', '16803 days'], dtype='timedelta64[ns]', freq=None)

In addition to this incorrect behavior being supported, it is actually tested in tests.indexes.timedeltas.test_ops in what I have to assume is a copy/paste mixup of some kind.

>>> pd.TimedeltaIndex([0.5])
TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)

This PR fixes the constructor to

be much more linear and clear than the existing constructor
reject datetime64 and anything that isn't explicitly supported
accept floating-dtype if and only if they can be converted losslessly: np.array([19.0]) is OK, np.array([19.5]) is not. np.nan is correctly converted to NaT

Other assorted small changes:
Move an incorrectly placed # ---- section divider in scalar timedelta tests.
Change a repeated-4-times 4-liner (setting result.freq = inferred_freq) to a 2-linear
Move raising and returning-early cases to the top of DatetimeIndex.__new__ and TimedeltaIndex.__new__

Has tests, does not have whatsnew note, pending suggestions as to which parts merit notes.

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

jbrockmendel added 4 commits

November 6, 2018 21:10


          simplify+fix+test TimedeltaIndex constructor

afc2d30


          tests for datetime64 data being invalid, floats being valid iff non-l…

e4b06ca

…ossy


          Merge branch 'master' of https://github.com/pandas-dev/pandas into pr…

231a5c1

…e-fixes


          comments and whitespace

1ff432b

pep8speaks commented Nov 7, 2018

Hello @jbrockmendel! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/arrays/datetimes.py !
There are no PEP8 issues in the file pandas/core/arrays/timedeltas.py !
There are no PEP8 issues in the file pandas/core/indexes/datetimes.py !
There are no PEP8 issues in the file pandas/core/indexes/timedeltas.py !
There are no PEP8 issues in the file pandas/tests/arithmetic/test_timedelta64.py !
There are no PEP8 issues in the file pandas/tests/indexes/timedeltas/test_construction.py !
There are no PEP8 issues in the file pandas/tests/indexes/timedeltas/test_ops.py !
There are no PEP8 issues in the file pandas/tests/scalar/timedelta/test_arithmetic.py !


          GH references

645e99c

gfyoung added Bug Refactor Datetime Timedelta labels

gfyoung requested changes

View reviewed changes

pandas/core/arrays/datetimes.py Show resolved Hide resolved

pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved

pandas/tests/indexes/timedeltas/test_construction.py Show resolved Hide resolved

pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved

pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved

pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved

Member

jorisvandenbossche commented Nov 7, 2018 •

edited

Loading

@jbrockmendel It seems you are adding net a lot of code. Is that because some of the things were previously handled by the to_timedelta call?

But that said, I think some of those things (eg the float handling) should actually be fixed in to_timedelta, and not here, and the TimedeltaIndex constructor can keep making use of that.

jreback requested changes

View reviewed changes

pandas/core/indexes/datetimes.py Outdated Show resolved Hide resolved

pandas/core/indexes/timedeltas.py Show resolved Hide resolved

pandas/core/indexes/timedeltas.py Outdated

+                      # Convert whatever we have into timedelta64[ns] dtype
+                      if is_object_dtype(data) or is_string_dtype(data):
+                          # no need to make a copy, need to convert if string-dtyped

Contributor

jreback Nov 7, 2018

why would u check is_string_dtype?

Member Author

jbrockmendel Nov 7, 2018

because we have several tests that specifically pass e.g. np.array(['1 days', '4 days'], dtype='<S6'). (BTW we don't have similar tests for DatetimeIndex, which I intend to fix in the DTI analogue of this PR)

pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved

pandas/core/indexes/timedeltas.py Outdated Show resolved Hide resolved

Member

jorisvandenbossche commented Nov 7, 2018

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@jbrockmendel as also mentioned in the other PR, Tom and I assumed that the consensus was to not do this (keep the TDA constructor simple), so if you still want to do this, please respond to the issue I linked then: #23212

Contributor

TomAugspurger commented Nov 7, 2018

What's the issue with passing a datetime64[ns] array to TDI? NumPy is happy with it

In [50]: np.asarray(dr).astype("timedelta64[ns]")
Out[50]:
array([1452816000000000000, 1452902400000000000, 1452988800000000000,
       1453075200000000000, 1453161600000000000, 1453248000000000000],
      dtype='timedelta64[ns]')

and we match that

In [51]: np.asarray(dr).astype("timedelta64[ns]") == np.asarray(pd.TimedeltaIndex(np.asarray(dr)))
Out[51]: array([ True,  True,  True,  True,  True,  True])

Contributor

jreback commented Nov 7, 2018

@TomAugspurger you can do that but IMHO this is a TypeError
if you actually want a timedelta from a datetime the. you need to subtract a datetime
this is really preventing foot shooting

Member

jorisvandenbossche commented Nov 7, 2018

I would personally be fine with raising a TypeError, since I think this can be quite surprising behaviour (it's just using the integer values). But also fine to not deviate from numpy behaviour.

But given it is numpy behaviour, so if we want to change it, I think we should deprecate it first.

Contributor

TomAugspurger commented Nov 7, 2018

IMO this is a TypeError

"is a" -> "should be"?

I don't really disagree, and we do raise for TimedeltaIndex(DatetimeIndex), so some consistency there would be nice. But I suspect that we have people relying on this, so a deprecation cycle would be needed.

Contributor

jreback commented Nov 7, 2018

ok by me to deprecate first

Contributor

TomAugspurger commented Nov 7, 2018 •

edited

Loading

this is really preventing foot shooting

I do realize my argument of "but NumPy allows it" isn't great :)

In [3]: np.array([1, np.nan]).astype(int)
Out[3]: array([                   1, -9223372036854775808])

setting aside the TDI(ndarray[timedelta64]) stuff. I think the important question is whether we want

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@jbrockmendel By "constructor" do you mean TimedeltaArray.__init__, or just a generic term for creating a TimedeltaArray from some input?

Member Author

jbrockmendel commented Nov 7, 2018

@jorisvandenbossche

But that said, I think some of those things (eg the float handling) should actually be fixed in to_timedelta, and not here,

It can be difficult to tell exactly what is handled within the to_timedelta calls because there are two of them reached under separate conditions, and most of the relevant logic is then done again below that anyway. I'd be +1 on improving to_timedelta later.

as also mentioned in the other PR, Tom and I assumed that the consensus was to not do this (keep the TDA constructor simple), so if you still want to do this, please respond to the issue I linked then: #23212

I'll restate my opinion there when I'm ready, thank you. I avoided changing the Array constructors specifically to avoid this being an issue; really shouldn't have mentioned it all in the OP.

Member Author

jbrockmendel commented Nov 7, 2018

I think the important question is whether we want [...] By "constructor" do you mean TimedeltaArray.init, or just a generic term for creating a TimedeltaArray from some input?

@TomAugspurger same as above; that is an important question, but I went out of my way to keep it totally orthogonal to this PR.

Member Author

jbrockmendel commented Nov 7, 2018

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

Contributor

TomAugspurger commented Nov 7, 2018

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

I think so.

I'll restate my opinion there when I'm ready, thank you. I avoided changing the Array constructors specifically to avoid this being an issue; really

Ok. I think I saw "refactor TimedeltaIndex.__new__" and thought it meant something closer to what we did for PeriodIndex.__new__, but we may not be there yet.

Member

jorisvandenbossche commented Nov 7, 2018

It can be difficult to tell exactly what is handled within the to_timedelta calls because there are two of them reached under separate conditions, and most of the relevant logic is then done again below that anyway. I'd be +1 on improving to_timedelta later.

The bug you are mentioning (pd.TimedeltaIndex([0.5]) ignoring the decimal) is coming from to_timedelta doing that (I just verified it is indeed taking this path). So you are repeating and fixing here what to_timedelta is already doing, so if you want to fix this behaviour, it really seems logical to me to fix this in to_timedelta instead of expanding the code here.

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

Yes, I think so.

Member Author

jbrockmendel commented Nov 7, 2018

So you are repeating and fixing here what to_timedelta is already doing, so if you want to fix this behaviour, it really seems logical to me to fix this in to_timedelta instead of expanding the code here.

AFAICT the to_timedelta call (L168 in master) is almost superfluous given what comes below (which was recently moved there IIRC because it was bizarrely located in _simple_new before that). If we disable the branch that calls to_timedelta, the only other changes we need to make everything work (or specifically, pass existing tests) is to a) convert generators to lists before passing into data = np.array(data, copy=False) and b) catch string-dtypes in the same block that catches np.object_ just below that. Which is to say: calling to_timedelta at all is highly redundant.

That said, an argument that I would find compelling is that we definitely don't want the behavior of TimedeltaIndex([x, y, z]) to diverge from the behavior of to_timedelta([x, y, z]), which this PR currently does. So I'll take a look at combining these after some more caffeine.

Member

jorisvandenbossche commented Nov 7, 2018

If we disable the branch that calls to_timedelta, the only other changes we need to make everything work (or specifically, pass existing tests) is to a) convert generators to lists before passing into data = np.array(data, copy=False) and b) catch string-dtypes in the same block that catches np.object_ just below that. Which is to say: calling to_timedelta at all is highly redundant.

If you are going to look at it, I would recommend trying to make the checks you mention here redundant, by letting to_timedelta handle it. Eg strings is something that to_timedelta is made for to handle.
I think the object dtype handling could be removed, see below (but certainly didn't check all cases, so this might be too optimistic)

Dived into the (current master) code for a moment, and you have this:

pandas/pandas/core/indexes/timedeltas.py

Lines 165 to 173 in 28a42da

    
           # convert if not already 
        
           if getattr(data, 'dtype', None) != _TD_DTYPE: 
        
               data = to_timedelta(data, unit=unit, box=False) 
        
           elif copy: 
        
               data = np.array(data, copy=True) 
        
           data = np.array(data, copy=False) 
        
           if data.dtype == np.object_: 
        
               data = array_to_timedelta64(data)

Here, the conversion of everything that is not yet a timedelta64 array is first handled with to_timedelta. But at the end of the snippet, we still call array_to_timedelta64(data) (what is used inside to_timedelta) if the data is object dtype. But didn't we already just convert everything not m8 with to_timedelta?
So this happens if data is a TimedeltaArrayMixin (or at least one of the cases): TimedeltaArrayMixin still has a m8[ns] numpy dtype, so the getattr(data, 'dtype', None) != _TD_DTYPE is passed; so we then convert it with np.array(data) to an array of Timedelta objects, to then convert it back to m8 with array_to_timedelta64.
So this is of course just a redundant back and forth conversion, that can be avoided with a check if data is a TimedeltaArrayMixin, and in that case get the underlying numpy array of it.
(similar to what DatetimeIndex also does, it has a check if isinstance(data, DatetimeArrayMixin))


          deprecate instead of raising for datetime64 dtypes

9c89746

Member Author

jbrockmendel commented Nov 7, 2018

@jorisvandenbossche I think we're in agreement on the relevant part which is that the status quo is redundant and unclear. Give me some time to adjust the PR to see if we can reach something closer to optimal.


          implement sequence_to_td64ns, deprecate datetime64 data, add and test…

ef3f277

… fastpath for inference/validation

jbrockmendel added 3 commits

November 10, 2018 08:41


          double quotes

6fda27e


          test that no copy is made with int64 data

c63796a


          update tests to TypeError

e9b5da6

jreback requested changes

View reviewed changes

Contributor

jreback left a comment

just some minor comments.

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/tools/timedeltas.py Outdated Show resolved Hide resolved

pandas/tests/indexes/timedeltas/test_arithmetic.py

-                      sn = pd.to_timedelta(Series([pd.NaT]))
+                      with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+                          # Passing datetime64-dtype data to TimedeltaIndex is deprecated
+                          sn = pd.to_timedelta(Series([pd.NaT]))

Contributor

jreback Nov 11, 2018

what i mean is that this is all nan and inference is ambiguous at this point. so I wouldn't expect the warning (unless the dtype IS explicity passed). Can address in a followup, but pls open an issue, this is IMHO a bug.

Contributor

jreback commented Nov 11, 2018

@gfyoung if you have any further comments.

jreback added this to the 0.24.0 milestone

jbrockmendel added 2 commits

November 11, 2018 09:17


          Merge branch 'master' of https://github.com/pandas-dev/pandas into pr…

b43e936

…e-fixes


          dtype=object instead of 'O'

da6b286

jreback reviewed

View reviewed changes

doc/source/whatsnew/v0.24.0.txt Show resolved Hide resolved

jsexauer mentioned this pull request

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jreback approved these changes

View reviewed changes

Contributor

jreback commented Nov 11, 2018

@jbrockmendel ok let's merge this on green. a couple of minor followups (they are unresolved conversations).


          use pytest.raises instead of tm.assert_raises_regex

898444f

datapythonista reviewed

View reviewed changes

Member

datapythonista left a comment

Added some suggestions and info about the docstrings format.

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

pandas/core/arrays/timedeltas.py Show resolved Hide resolved

jreback merged commit 2c982df into pandas-dev:master

Contributor

jreback commented Nov 11, 2018

thanks @jbrockmendel if you'd apply @datapythonista comments next pass would be great. trying to not block your followups.

thoo added a commit to thoo/pandas that referenced this pull request


          Merge remote-tracking branch 'upstream/master' into to_html-to_string

895f8cd

* upstream/master:
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)

jbrockmendel deleted the pre-fixes branch

November 11, 2018 23:42

Member Author

jbrockmendel commented Nov 11, 2018

@datapythonista I'm adding suggested edits into #23587 (needs rebasing anyway). Definitely good suggestions.

For the "unit" description the best I've come up with is "The timedelta unit to treat integers as multiples of"; I can't think of a natural-sounding way to rephrase this that doesn't end in a preposition. Thoughts?

Member

datapythonista commented Nov 12, 2018

That description sounds good to me. Can't think of anything better right now.

thoo added a commit to thoo/pandas that referenced this pull request


          Merge remote-tracking branch 'upstream/master' into io_csv_docstring_…

4e6f3a0

…fixed

* upstream/master:
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)

jorisvandenbossche reviewed

View reviewed changes

pandas/tests/arithmetic/test_timedelta64.py

                       box2 = pd.Series if box is pd.Index else box
                       expected = tm.box_expected(expected, box2)
-                      result = idx * Series(rng5f + 0.1)

Member

jorisvandenbossche Nov 12, 2018

@jbrockmendel why did you change this? Left-over from initially changing the behaviour on floats?

Member Author

jbrockmendel Nov 13, 2018

Yes.

jbrockmendel mentioned this pull request

REF/DEPR: DatetimeIndex constructor #23675

Merged

JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request


          BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)

a4c1490

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request


          BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)

c33527a

jorisvandenbossche mentioned this pull request

Regression in 0.24: to_timedelta handling of float values #25077

Closed

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request


          BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)

8028b07

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request


          BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)

13a3054

jorisvandenbossche mentioned this pull request

REGR: to_timedelta precision issues with floating data #25651

Merged

jreback mentioned this pull request

DEPR: deprecations log for removed issues #13777

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

jorisvandenbossche jorisvandenbossche left review comments

datapythonista datapythonista left review comments

gfyoung gfyoung requested changes

jreback jreback approved these changes

TomAugspurger TomAugspurger approved these changes