Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/REF: TimedeltaIndex.__new__ #23539

Merged
merged 19 commits into from Nov 11, 2018

Conversation

@jbrockmendel
Copy link
Member

commented Nov 7, 2018

TimedeltaIndex.__new__ type-checking is tough to decipher, and hides multiple bugs

>>> dti = pd.date_range('2016-01-01', periods=3)
>>> pd.TimedeltaIndex(np.array(dti))
TimedeltaIndex(['16801 days', '16802 days', '16803 days'], dtype='timedelta64[ns]', freq=None)

In addition to this incorrect behavior being supported, it is actually tested in tests.indexes.timedeltas.test_ops in what I have to assume is a copy/paste mixup of some kind.

>>> pd.TimedeltaIndex([0.5])
TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)

This PR fixes the constructor to

  1. be much more linear and clear than the existing constructor
  2. reject datetime64 and anything that isn't explicitly supported
  3. accept floating-dtype if and only if they can be converted losslessly: np.array([19.0]) is OK, np.array([19.5]) is not. np.nan is correctly converted to NaT

Other assorted small changes:
Move an incorrectly placed # ---- section divider in scalar timedelta tests.
Change a repeated-4-times 4-liner (setting result.freq = inferred_freq) to a 2-linear
Move raising and returning-early cases to the top of DatetimeIndex.__new__ and TimedeltaIndex.__new__

Has tests, does not have whatsnew note, pending suggestions as to which parts merit notes.

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@pep8speaks

This comment has been minimized.

Copy link

commented Nov 7, 2018

Hello @jbrockmendel! Thanks for submitting the PR.

Show resolved Hide resolved pandas/core/arrays/datetimes.py
Show resolved Hide resolved pandas/core/indexes/timedeltas.py Outdated
Show resolved Hide resolved pandas/tests/indexes/timedeltas/test_construction.py
Show resolved Hide resolved pandas/core/indexes/timedeltas.py Outdated
Show resolved Hide resolved pandas/core/indexes/timedeltas.py Outdated
Show resolved Hide resolved pandas/core/indexes/timedeltas.py Outdated
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Nov 7, 2018

@jbrockmendel It seems you are adding net a lot of code. Is that because some of the things were previously handled by the to_timedelta call?

But that said, I think some of those things (eg the float handling) should actually be fixed in to_timedelta, and not here, and the TimedeltaIndex constructor can keep making use of that.

Show resolved Hide resolved pandas/core/indexes/datetimes.py Outdated
Show resolved Hide resolved pandas/core/indexes/timedeltas.py

# Convert whatever we have into timedelta64[ns] dtype
if is_object_dtype(data) or is_string_dtype(data):
# no need to make a copy, need to convert if string-dtyped

This comment has been minimized.

Copy link
@jreback

jreback Nov 7, 2018

Contributor

why would u check is_string_dtype?

This comment has been minimized.

Copy link
@jbrockmendel

jbrockmendel Nov 7, 2018

Author Member

because we have several tests that specifically pass e.g. np.array(['1 days', '4 days'], dtype='<S6'). (BTW we don't have similar tests for DatetimeIndex, which I intend to fix in the DTI analogue of this PR)

Show resolved Hide resolved pandas/core/indexes/timedeltas.py Outdated
Show resolved Hide resolved pandas/core/indexes/timedeltas.py Outdated
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Nov 7, 2018

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@jbrockmendel as also mentioned in the other PR, Tom and I assumed that the consensus was to not do this (keep the TDA constructor simple), so if you still want to do this, please respond to the issue I linked then: #23212

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

What's the issue with passing a datetime64[ns] array to TDI? NumPy is happy with it

In [50]: np.asarray(dr).astype("timedelta64[ns]")
Out[50]:
array([1452816000000000000, 1452902400000000000, 1452988800000000000,
       1453075200000000000, 1453161600000000000, 1453248000000000000],
      dtype='timedelta64[ns]')

and we match that

In [51]: np.asarray(dr).astype("timedelta64[ns]") == np.asarray(pd.TimedeltaIndex(np.asarray(dr)))
Out[51]: array([ True,  True,  True,  True,  True,  True])
@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

@TomAugspurger you can do that but IMHO this is a TypeError
if you actually want a timedelta from a datetime the. you need to subtract a datetime
this is really preventing foot shooting

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Nov 7, 2018

I would personally be fine with raising a TypeError, since I think this can be quite surprising behaviour (it's just using the integer values). But also fine to not deviate from numpy behaviour.

But given it is numpy behaviour, so if we want to change it, I think we should deprecate it first.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

IMO this is a TypeError

"is a" -> "should be"?

I don't really disagree, and we do raise for TimedeltaIndex(DatetimeIndex), so some consistency there would be nice. But I suspect that we have people relying on this, so a deprecation cycle would be needed.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

ok by me to deprecate first

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

this is really preventing foot shooting

I do realize my argument of "but NumPy allows it" isn't great :)

In [3]: np.array([1, np.nan]).astype(int)
Out[3]: array([                   1, -9223372036854775808])

setting aside the TDI(ndarray[timedelta64]) stuff. I think the important question is whether we want

Does not significantly change the TimedeltaArray constructor; in an upcoming step we can move much of the logic there and have the TDI constructor wrap a/the TDA constructor.

@jbrockmendel By "constructor" do you mean TimedeltaArray.__init__, or just a generic term for creating a TimedeltaArray from some input?

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

@jorisvandenbossche

But that said, I think some of those things (eg the float handling) should actually be fixed in to_timedelta, and not here,

It can be difficult to tell exactly what is handled within the to_timedelta calls because there are two of them reached under separate conditions, and most of the relevant logic is then done again below that anyway. I'd be +1 on improving to_timedelta later.

as also mentioned in the other PR, Tom and I assumed that the consensus was to not do this (keep the TDA constructor simple), so if you still want to do this, please respond to the issue I linked then: #23212

I'll restate my opinion there when I'm ready, thank you. I avoided changing the Array constructors specifically to avoid this being an issue; really shouldn't have mentioned it all in the OP.

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

I think the important question is whether we want [...] By "constructor" do you mean TimedeltaArray.init, or just a generic term for creating a TimedeltaArray from some input?

@TomAugspurger same as above; that is an important question, but I went out of my way to keep it totally orthogonal to this PR.

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

I think so.

I'll restate my opinion there when I'm ready, thank you. I avoided changing the Array constructors specifically to avoid this being an issue; really

Ok. I think I saw "refactor TimedeltaIndex.__new__" and thought it meant something closer to what we did for PeriodIndex.__new__, but we may not be there yet.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Nov 7, 2018

It can be difficult to tell exactly what is handled within the to_timedelta calls because there are two of them reached under separate conditions, and most of the relevant logic is then done again below that anyway. I'd be +1 on improving to_timedelta later.

The bug you are mentioning (pd.TimedeltaIndex([0.5]) ignoring the decimal) is coming from to_timedelta doing that (I just verified it is indeed taking this path). So you are repeating and fixing here what to_timedelta is already doing, so if you want to fix this behaviour, it really seems logical to me to fix this in to_timedelta instead of expanding the code here.

It sounds like the discussion w/r/t TimedeltaIndex(np.array(dti)) has resolved itself overnight with an agreement that there should be a deprecation before a TypeError. Am I reading the room correctly?

Yes, I think so.

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

So you are repeating and fixing here what to_timedelta is already doing, so if you want to fix this behaviour, it really seems logical to me to fix this in to_timedelta instead of expanding the code here.

AFAICT the to_timedelta call (L168 in master) is almost superfluous given what comes below (which was recently moved there IIRC because it was bizarrely located in _simple_new before that). If we disable the branch that calls to_timedelta, the only other changes we need to make everything work (or specifically, pass existing tests) is to a) convert generators to lists before passing into data = np.array(data, copy=False) and b) catch string-dtypes in the same block that catches np.object_ just below that. Which is to say: calling to_timedelta at all is highly redundant.

That said, an argument that I would find compelling is that we definitely don't want the behavior of TimedeltaIndex([x, y, z]) to diverge from the behavior of to_timedelta([x, y, z]), which this PR currently does. So I'll take a look at combining these after some more caffeine.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Nov 7, 2018

If we disable the branch that calls to_timedelta, the only other changes we need to make everything work (or specifically, pass existing tests) is to a) convert generators to lists before passing into data = np.array(data, copy=False) and b) catch string-dtypes in the same block that catches np.object_ just below that. Which is to say: calling to_timedelta at all is highly redundant.

If you are going to look at it, I would recommend trying to make the checks you mention here redundant, by letting to_timedelta handle it. Eg strings is something that to_timedelta is made for to handle.
I think the object dtype handling could be removed, see below (but certainly didn't check all cases, so this might be too optimistic)

Dived into the (current master) code for a moment, and you have this:

# convert if not already
if getattr(data, 'dtype', None) != _TD_DTYPE:
data = to_timedelta(data, unit=unit, box=False)
elif copy:
data = np.array(data, copy=True)
data = np.array(data, copy=False)
if data.dtype == np.object_:
data = array_to_timedelta64(data)

Here, the conversion of everything that is not yet a timedelta64 array is first handled with to_timedelta. But at the end of the snippet, we still call array_to_timedelta64(data) (what is used inside to_timedelta) if the data is object dtype. But didn't we already just convert everything not m8 with to_timedelta?
So this happens if data is a TimedeltaArrayMixin (or at least one of the cases): TimedeltaArrayMixin still has a m8[ns] numpy dtype, so the getattr(data, 'dtype', None) != _TD_DTYPE is passed; so we then convert it with np.array(data) to an array of Timedelta objects, to then convert it back to m8 with array_to_timedelta64.
So this is of course just a redundant back and forth conversion, that can be avoided with a check if data is a TimedeltaArrayMixin, and in that case get the underlying numpy array of it.
(similar to what DatetimeIndex also does, it has a check if isinstance(data, DatetimeArrayMixin))

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

@jorisvandenbossche I think we're in agreement on the relevant part which is that the status quo is redundant and unclear. Give me some time to adjust the PR to see if we can reach something closer to optimal.

@jreback
Copy link
Contributor

left a comment

just some minor comments.

Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/tools/timedeltas.py Outdated
sn = pd.to_timedelta(Series([pd.NaT]))
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
# Passing datetime64-dtype data to TimedeltaIndex is deprecated
sn = pd.to_timedelta(Series([pd.NaT]))

This comment has been minimized.

Copy link
@jreback

jreback Nov 11, 2018

Contributor

what i mean is that this is all nan and inference is ambiguous at this point. so I wouldn't expect the warning (unless the dtype IS explicity passed). Can address in a followup, but pls open an issue, this is IMHO a bug.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 11, 2018

@gfyoung if you have any further comments.

@jreback jreback added this to the 0.24.0 milestone Nov 11, 2018

@jsexauer jsexauer referenced this pull request Nov 11, 2018

Open

DEPR: deprecations from prior versions #6581

0 of 94 tasks complete
@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 11, 2018

@jbrockmendel ok let's merge this on green. a couple of minor followups (they are unresolved conversations).

@datapythonista
Copy link
Member

left a comment

Added some suggestions and info about the docstrings format.

Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py
Show resolved Hide resolved pandas/core/arrays/timedeltas.py

@jreback jreback merged commit 2c982df into pandas-dev:master Nov 11, 2018

2 of 3 checks passed

pandas-dev.pandas
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 11, 2018

thanks @jbrockmendel if you'd apply @datapythonista comments next pass would be great. trying to not block your followups.

thoo added a commit to thoo/pandas that referenced this pull request Nov 11, 2018

Merge remote-tracking branch 'upstream/master' into to_html-to_string
* upstream/master:
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)

@jbrockmendel jbrockmendel deleted the jbrockmendel:pre-fixes branch Nov 11, 2018

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Nov 11, 2018

@datapythonista I'm adding suggested edits into #23587 (needs rebasing anyway). Definitely good suggestions.

For the "unit" description the best I've come up with is "The timedelta unit to treat integers as multiples of"; I can't think of a natural-sounding way to rephrase this that doesn't end in a preposition. Thoughts?

@datapythonista

This comment has been minimized.

Copy link
Member

commented Nov 12, 2018

That description sounds good to me. Can't think of anything better right now.

thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018

Merge remote-tracking branch 'upstream/master' into io_csv_docstring_…
…fixed

* upstream/master:
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
box2 = pd.Series if box is pd.Index else box
expected = tm.box_expected(expected, box2)

result = idx * Series(rng5f + 0.1)

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Nov 12, 2018

Member

@jbrockmendel why did you change this? Left-over from initially changing the behaviour on floats?

This comment has been minimized.

Copy link
@jbrockmendel

jbrockmendel Nov 13, 2018

Author Member

Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.