
DatetimeIndex lookups tz inconsistency #18435

Closed

jbrockmendel opened this issue Nov 22, 2017 · 25 comments
Labels
API Design · Compat · Indexing · Timezones

Comments

@jbrockmendel
Member

TLDR: enforcing tzaware vs tznaive compat in DatetimeIndex comparisons (#18162) appears to be inconsistent with current slicing behavior.

The following example is based off of tests.series.test_indexing.test_getitem_setitem_datetimeindex:

import numpy as np
import pandas as pd

dti = pd.date_range('1/1/1990', periods=50, freq='H', tz='US/Eastern')
ts = pd.Series(np.random.randn(50), index=dti)

lb = '1990-01-01 04:00:00'
rb = '1990-01-01 07:00:00'

The behavior we want to enforce in #18162 requires that dti < pd.Timestamp(lb) raise, as should dti < lb. At the moment they effectively get treated as UTC. But if we change this so that they do raise, the following from test_getitem_setitem_datetimeindex breaks pretty irrevocably:

ts[(ts.index >= lb) & (ts.index <= rb)]

There is also ts[lb:rb], which, if we're being strict, should raise, but at least we could make that still work. (BTW this implicitly casts lb and rb to US/Eastern, albeit in different places. So far that appears to be a related but distinct can of worms.)
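For illustration, here is a minimal sketch (reusing the dti/ts defined above) of what a strict rule would ask of users: explicitly tz-aware bounds for the mask lookup, while the string slice could keep being localized by the index:

lb_aware = pd.Timestamp(lb, tz='US/Eastern')   # explicit localization
rb_aware = pd.Timestamp(rb, tz='US/Eastern')

ts[(ts.index >= lb_aware) & (ts.index <= rb_aware)]   # mask lookup with aware bounds
ts[lb:rb]                                             # slice lookup, bounds localized by the index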

@jbrockmendel
Member Author

Looks related: #17920

@1kastner

@jbrockmendel Your abbreviations make it difficult for me to read what exactly you mean. Could you use more explicit variable naming for explaining? Thanks a lot!

@jbrockmendel
Member Author

jbrockmendel commented Nov 23, 2017

Whoops, I didn't notice at first that it was the variable names whose abbreviations were unclear; I assumed it was vernacular.

pd is standard for pandas, as is np for numpy. dti is pretty common for DatetimeIndex.

In this case lb and rb are taken from the referenced test test_getitem_setitem_datetimeindex, and are presumably short for left_bound and right_bound or something like that.

@jbrockmendel
Member Author

Migrated from #18376

I would rather go for the split between naive and timezoned datetime-like objects, and my UserWarning instead of an error was rather for backwards-compatibility

So if backward-compat weren't a concern, you would be OK with raising and requiring users to pass tzaware indexers to tzaware DatetimeIndexes? If so, then I bet @jreback will agree that correctness is more important than backwards-compat in this case and we should Do It Right. (If so, the rest of this comment is redundant)

It is motivated by a use case with data from several different data sources. Some are UTC, some are German Winter Time (without daylight saving time, definitely not a standard time), some are CET, etc. I believe that mixing different data sources often has timezone issues as a consequence. But I cannot offer proof of that.

Adding a TZ is not an option in ISO 8601 as far as I have read it, and I would try to comply with that standard for the labels. Your approach sounds quite experimental to me.

You're right that it is not part of the standard. The idea is that the "TZ" directive would be identified before the date-parsing step, but on second thought it'd be easier just to append the offset, e.g. "-0500". I wrote quickly last night; let me try to explain the suggestion more clearly.
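As a rough illustration of the append-the-offset idea (plain string parsing, not a new feature):

pd.Timestamp('1990-01-01 04:00:00')        # tz-naive
pd.Timestamp('1990-01-01 04:00:00-0500')   # tz-aware, fixed UTC-05:00 offset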

There are three issues involved here. There is the correct behavior, the convenient behavior, and the consistent behavior. The Technically Correct thing to do is to disallow tznaive/tzaware comparisons. The convenient thing to do is to Just Work when trying to index/slice a tzaware DatetimeIndex with strings. The consistent behavior is that slicing and comparison are linked, e.g. series[lower:upper] and series[(series.index >= lower) & (series.index <= upper)] match. AFAICT we cannot have all three of these at the same time.

The easy cases:

  • If a DatetimeIndex is tznaive and an indexer/slicer is tznaive, then the current behavior is both correct and convenient.
  • If a DatetimeIndex is tzaware and an indexer/slicer is tzaware, then the correct behavior and the convenient behavior are the same (I'll have to double-check before claiming this is also the same as the current behavior)

The hard cases:

But for DatetimeIndex behavior to be internally consistent, that would require that comparison methods also assume that tznaive objects implicitly have the same tz as the index. And we should definitely not be making that assumption. It would be better to raise in this case and require that users explicitly pass tzaware indexers/slicers.

@jreback
Contributor

jreback commented Nov 24, 2017

@jbrockmendel you need to strictly differentiate between strings and actual datetimes here.

comparison with datetimes is necessarily, and always will be, strict w.r.t. tz-awareness. The conversion of strings can be a question, but it is orthogonal.

@1kastner

@jbrockmendel

If a DatetimeIndex is tznaive and an indexer/slicer is tzaware, then #17920 assumes the DatetimeIndex is UTC. I would much rather raise and require the user explicitly make the index tzaware. Is there a reason why this wouldn't work in your use case?

It would work in my use case; this assumption was made so as not to break other people's code unnecessarily. I agree that your solution is cleaner, but for me pandas means having a lot of convenience, and the convenience of today creates lots of edge cases. When you look at the documentation, you can pass a lot of string labels which do not look like a datetime-like object but instead are just the year, the year and the month, or something similar. We should not assume the user will provide only ISO 8601-conformant labels. And how do you suggest a timezone-aware version of such a label should look? A small sample:

import pandas as pd

t = pd.date_range(start=pd.datetime(2000, 1, 1), periods=400, freq='d')
df = pd.DataFrame(index=t, data={"val": range(len(t))})
df.loc["2000-01":"2000-03"]
df.loc["2000-01"]

I see no way to add timezones to these kinds of labels while keeping them intuitively readable. And only if all kinds of possible labels allowed adding timezones would the strictness you suggest be suitable for this library. That would be a major update for me, rather something to be discussed for a similar but completely new library.
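For concreteness, a hedged sketch of the tz-aware variant of the same frame (timezone chosen arbitrarily), to show the kind of labels that would need a timezone spelling:

t = pd.date_range(start="2000-01-01", periods=400, freq='d', tz='Europe/Berlin')
df = pd.DataFrame(index=t, data={"val": range(len(t))})
df.loc["2000-01":"2000-03"]   # month-level labels carry no timezone information
df.loc["2000-01"]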

@1kastner

@jbrockmendel Have you made a list of possible string index labels (except the string which equals an ISO8601 string) and how you want to make them timezone aware? Just create a concept before you push more commits on this issue.

@jbrockmendel
Member Author

Have you made a list of possible string index labels (except the string which equals an ISO8601 string) and how you want to make them timezone aware?

That was the idea behind adding "TZ" at the end of a non-aware string.

pandas means having a lot of convenience

Agreed. My contention is that special-casing strings in this context and not others is going to create more corner cases; inconsistency will make things less predictable and therefore less convenient.

Just create a concept before you push more commits on this issue.

Not sure I understand this bit. This is an Issue, has no commits.

@1kastner

1kastner commented Nov 26, 2017 via email

@jorisvandenbossche
Member

@jbrockmendel you need to strictly differentiate between strings and actual datetimes here.

comparison with datetimes is necessarily, and always will be, strict w.r.t. tz-awareness. The conversion of strings can be a question, but it is orthogonal.

I agree we should consider both string and actual datetime objects.
However, shouldn't this distinction then also hold for slicing? E.g. df[pd.Timestamp(..):pd.Timestamp(..)] with naive Timestamps on a tz-aware dataframe? Because this is, for example, a case for which tests are being added in #17920 asserting that it works. Following your rule of "comparison with actual datetimes should be strict", this should then raise?

My contention is that special-casing strings in this context and not others is going to create more corner cases; inconsistency will make things less predictable and therefore less convenient.

Yes indeed. So since for slicing we want to special case strings (I don't think we should start thinking of extending datetime string formats with kind of a 'TZ' indicator), let's do it everywhere? So also for comparisons? (which we already do in master I think)

@jbrockmendel
Member Author

@jorisvandenbossche thanks for weighing in here and elsewhere.

Supposing we assume that strings are implicitly localized to the same tz as a DatetimeIndex for comparisons, do we do the same for Timestamps? i.e. are we ok with DatetimeIndex([foo]) == bar and Timestamp(foo) == bar behaving differently?

What about other DatetimeIndex binary operations that take a datetime-like arg? Subtraction is the only one that comes to mind.
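For concreteness, a sketch of the cases in question (illustrative values only; the results are exactly what is being debated here):

foo = pd.Timestamp('2016-01-01 12:00:00', tz='US/Eastern')
bar = '2016-01-01 12:00:00'             # tz-naive string

pd.DatetimeIndex([foo]) == bar          # would bar be localized to the index's tz?
pd.Timestamp(foo) == bar                # does the scalar behave the same way?
pd.DatetimeIndex([foo]) - foo           # subtraction, the other binop taking a datetime-like arg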

Conditional on #17920 going in (and I'm +0.75 on it because it is convenient), there is going to be an equivalence-breakage somewhere. Right now I'm inclined to keep comparisons technically correct (well, make them technically correct with #18376) and break the ser.loc[lb:ub] == ser[(ser.index >= lb) & (ser.index <= ub)] equivalence. That's the only case that doesn't immediately snowball into "well what about this other thing..." (I think/hope)

@jbrockmendel
Member Author

Copied from #18376:

And I think we certainly should not regard it as just a bug fix (e.g. in the whatsnew notes).

How to treat it in the whatsnew notes is above my pay grade. But over the course of this PR's history I've become increasingly convinced that the current comparison behavior is Just Plain Wrong and should be treated like a bug.

The three options on hand are 1) this PR which makes DatetimeIndex comparisons behave like all other datetime-like comparisons, 2) edit the spec to explicitly make string comparisons a special case, 3) do nothing. I'm going to assume away 3 and argue against 2.

AFAICT the main objection to 1) is that in conjunction with #17920 it breaks the equivalence between ser.loc[lower:upper] and ser[(ser.index >= lower) & (ser.index <= upper)]. But consider what other equivalences are broken by 2:

  • We can no longer count on DatetimeIndex comparisons to be transitive. a >= b and b >= c no longer implies a >= c.
  • Comparison-broadcasting becomes inconsistent. (index >= bound)[n] no longer equals index[n] >= bound (unless we then go mess with Timestamp...)
  • Comparison-boxing becomes inconsistent. index >= bound no longer necessarily equals index >= Timestamp(bound).
  • Box-conversion becomes inconsistent. index >= bound no longer necessarily equals index.astype(object) >= bound or Series(index) >= bound or DataFrame(index) >= bound

... and most of all, I am not remotely confident that this list is complete. How many places across the code-base perform comparisons with DatetimeIndex objects in one way or another? I have no idea.

A tz-aware DatetimeIndex is not easy to get by accident. If a user has one, they have made a decision that timezones matter.


Any of the available options introduces an inconsistency somewhere. AFAICT Option 1 breaks a convenience equivalency, will do so loudly, and as a result will not snowball into other inconsistencies.

Special-casing string comparisons generates a whole mess of other potential (often silent) problems that can be avoided by enforcing behavior that is already canonical.

jreback added the API Design, Compat, Indexing, and Timezones labels Dec 15, 2017
@jreback
Contributor

jreback commented Dec 15, 2017

In [3]: pd.Timestamp("2016-01-01", tz="Europe/Berlin") > pd.Timestamp("now", tz='UTC')
Out[3]: False

@jbrockmendel on the list?

@jorisvandenbossche
Member

In [3]: pd.Timestamp("2016-01-01", tz="Europe/Berlin") > pd.Timestamp("now", tz='UTC')
Out[3]: False

these should all raise (on the list)

I would say that is up for debate. We currently allow it, and I don't think there is anything ambiguous about what the result should be. Why break backwards compatibility to start erroring on this?

@jbrockmendel
Member Author

@jreback why would 2016-01-01 be > now? Assuming you ran this recently...

@jorisvandenbossche Is your last comment (the one blockquoting the 2016-01-01) about 2016-01-01? AFAICT that is unrelated to this Issue.

@jorisvandenbossche
Member

@jreback didn't mean the False was wrong, but that he wants it to raise an error (different timezones), and my answer was on that aspect.
It's related to the general topic (comparisons with timezones) but indeed a bit different (two different timezones instead of naive vs aware).

@jbrockmendel
Member Author

didn't mean the False was wrong, but that he wants it to raise an error

@jreback can you confirm this is what you intended? If so I'm surprised because the scalar behavior is pretty well-established.

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 15, 2017 via email

@jorisvandenbossche
Member

How to treat it in the whatsnew notes is above my pay grade. But over the course of this PR's history I've become increasingly convinced that the current comparison behavior is Just Plain Wrong and should be treated like a bug.

I think you are certainly experienced enough to have a valuable opinion about that!
I can certainly understand that you are opposed to keeping the behaviour (there are good reasons for that), but I don't see why it would be "plain wrong". IMO it is not wrong to interpret a string (I would even say it is not really ambiguous what it should do); it is just a choice we make: are we strict about it, or do we parse the string to the type of the timestamp it is being compared to? Both are valid choices, each with its pros and cons.

this PR which makes DatetimeIndex comparisons behave like all other datetime-like comparisons,

What do you mean by "all other datetime-like comparisons"?
And to be clear: your PR is only about disallowing comparing a tz-aware index to strings, correct? Because your following arguments sound more like arguments against all comparisons to strings (so also for tz-naive).

We can no longer count on DatetimeIndex comparisons to be transitive. a >= b and b >= c no longer implies a >= c.

I don't understand this one. Can you give a concrete example?

Comparison-broadcasting becomes inconsistent. (index >= bound)[n] no longer equals index[n] >= bound (unless we then go mess with Timestamp...)

I would say: of course, if we choose to allow strings to compare to tz-aware, we do the same for Timestamp (those currently don't even compare to strings at all, so this equivalency already does not hold at the moment).

Comparison-boxing becomes inconsistent. index >= bound no longer necessarily equals index >= Timestamp(bound).

That's true. But the same applies to slicing with strings, where this also does not hold. So I think I would be OK with that.

Box-conversion becomes inconsistent. index >= bound no longer necessarily equals index.astype(object) >= bound or Series(index) >= bound or DataFrame(index) >= bound

The boxing point is true (but for all strings, not only in the tz-aware case), but a Series with datetime values happily compares to a string at the moment.

A tz-aware DatetimeIndex is not easy to get by accident. If a user has one, they have made a decision that timezones matter.

Yes. And if a user uses strings instead of aware objects, he made the decision to see the string as 'local' time (I mean local to the data).


To be clear, I am not necessarily against disallowing the comparison of strings to tz-aware data, but I am strongly opposed to disallowing comparisons to strings in general, and it is not fully clear whether you are arguing for that or only the former.

@jbrockmendel
Member Author

See the discussion in the linked issue: #15249

Thanks, the context there helps.

we do the same for Timestamp (those currently don't even compare to strings at all, so this equivalency already does not hold at the moment)

I was under the mistaken impression that those comparisons were supported; my mistake. This tempers the strength of my opinion somewhat; I will need to think on it before deciding how much.

What do you mean by "all other datetime-like comparisons"?

Choose any two: DatetimeIndex, Timestamp, datetime, datetime64-Series, datetime64-DataFrame-column.

I don't understand this one [loss of transitivity]. Can you give a concrete example?

The following is in 0.21.1 and is part of what #18376 is intended to fix:

In [2]: tstr = '2016-01-01 12:00:00'
In [3]: west = pd.DatetimeIndex([tstr], tz='US/Pacific')
In [4]: east = pd.DatetimeIndex([tstr], tz='US/Eastern')

In [7]: east == tstr
Out[7]: array([ True], dtype=bool)

In [8]: west == tstr
Out[8]: array([ True], dtype=bool)

In [9]: east == west
Out[9]: array([False], dtype=bool)

i.e. equality is not transitive. Am I the only one who finds that troubling from an ability-to-reason-about-code perspective?

@jorisvandenbossche
Member

What do you mean by "all other datetime-like comparisons"?

Choose any two: DatetimeIndex, Timestamp, datetime, datetime64-Series, datetime64-DataFrame-column.

Well, DatetimeIndex and datetime64 Series/DataFrame all allow comparison to strings when they are tz-aware. It is only Timestamp that does not do this, but as discussed above it does not support string comparison at all. I don't think datetime is that relevant here, as we extend the standard functionality already in many ways.

I don't understand this one [loss of transitivity]. Can you give a concrete example?

The following is in 0.21.1 and is part of what #18376 is intended to fix:

The example you give is for tz-aware timestamps and strings, where indeed a >= b and b >= c no longer implies a >= c.
That is correct, but I can perfectly well make a similar example for tz-naive timestamps; it is just a consequence of allowing the convenience of interpreting strings as datetimes. So IMO it is not an argument specific to this case of tz-aware timestamps (but it could be an argument against any string interpretation at all).

In [26]: a = "2012-01-01"

In [27]: b = pd.DatetimeIndex(['2012-01-01'], tz='US/Pacific')

In [28]: c = "2012-01-01 00:00:00"

In [29]: a == b
Out[29]: array([ True], dtype=bool)

In [30]: b == c
Out[30]: array([ True], dtype=bool)

In [31]: a == c
Out[31]: False

Of course I "cheat" a bit as I have taken strings for a and c (the other way around from your example), but again, equality is not transitive.
Yes, that can make it more complex to reason about the expected outcome if you are comparing variables. But interpreting strings is there for convenience, in interactive use, and you can always choose to use actual timestamps for complex library code where those strings would be hidden in variables.

@jbrockmendel
Member Author

Of course I "cheat" a bit as I have taken strings for a and c (the other way around from your example), but again, equality is not transitive.

You are technically correct.

I'm still not wild about allowing inconsistent string comparisons through, but this isn't a hill I want to die on. If I can get your blessing on #18376 in its current form (that punts on strings, requires tzawareness-compat for everything else) I'll happily spend my time elsewhere.

@jbrockmendel
Member Author

I think this has been resolved. Closing.

@jorisvandenbossche
Member

I think this has been resolved.

In what way has it been resolved?

@jbrockmendel
Member Author

In what way has it been resolved?

I thought that #17920 had been merged; apparently I was wrong about that.
