DatetimeIndex.get_slice_bound(...) raises TypeErrors for unexpected YYYY-MM-DD/datetime.date/Timestamp combinations #35690

Closed
RhysU opened this issue Aug 12, 2020 · 7 comments · Fixed by #35848
Labels: Bug, Datetime, Indexing, Timezones


RhysU commented Aug 12, 2020

Below find a handful of reproducible observations in Pandas 1.0.5 re: DatetimeIndex.get_slice_bound(...) where I'm genuinely not sure what the correct behaviors should all be. Some of these appear related to #34077 in that slice_locs(...) uses get_slice_bound(...) under the covers.

Observations inlined and expected behaviors discussed afterwards:

########################################
OBSERVATIONS 1 using a UTC DatetimeIndex
########################################

import datetime
import pandas as pd
import pandas.util.testing as put

# Generate a UTC-localized DatetimeIndex
df = put.makeTimeDataFrame()
df = df.tz_localize("utc")
index = df.index

# Show the generated DatetimeIndex, which should look like:
# DatetimeIndex(['2000-01-03 00:00:00+00:00', '2000-01-04 00:00:00+00:00',
#                ...
#                '2000-02-10 00:00:00+00:00', '2000-02-11 00:00:00+00:00'],
#               dtype='datetime64[ns, UTC]', freq='B')
print(index)

# (A) When the date is inside the DatetimeIndex, this call completes.
index.get_slice_bound(datetime.date(2000, 1, 7), kind="ix", side="left")

# (B) Notice date before start of index
# TypeError: searchsorted requires compatible dtype or scalar, not date
index.get_slice_bound(datetime.date(2000, 1, 1), kind="ix", side="left")

# (C) Notice date after end of index
# TypeError: searchsorted requires compatible dtype or scalar, not date
index.get_slice_bound(datetime.date(2020, 1, 1), kind="ix", side="left")


# (D) When the Timestamp is inside the DatetimeIndex, this call completes
index.get_slice_bound(pd.Timestamp("2000-01-07"), kind="ix", side="left")

# (E) Notice Timestamp before start of index
# TypeError: Cannot compare tz-naive and tz-aware datetime-like objects
index.get_slice_bound(pd.Timestamp("2000-01-01"), kind="ix", side="left")

# (F) Notice Timestamp after end of index
# TypeError: Cannot compare tz-naive and tz-aware datetime-like objects
index.get_slice_bound(pd.Timestamp("2020-01-01"), kind="ix", side="left")

Discussion:

  1. Above, I do not know if (A)-(C) should behave as (E)-(F) or not.
  2. Above, I suspect (D) should raise as (E)-(F) do.
  3. I can see arguments where datetime.date's lacking tzinfo could be inferred to be the DatetimeIndex.tzinfo. Then (A)-(C) would not raise.
  4. I can see arguments where Timestamps lacking tzinfo could be inferred to be the DatetimeIndex.tzinfo. Then (D)-(F) would not raise.
  5. I don't expect any data-dependence, meaning that the specific YYYY-MM-DD should not impact if something raises.
  6. I have not tested datetime.datetime under these circumstances.
  7. I believe some of these behaviors may have changed since the 0.2-series.
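
Points 3 and 4 above can be pictured with a small standard-library sketch. It is not pandas code: `coerce_to_index_tz` and `left_bound` are hypothetical helpers, and a sorted list of `datetime` objects stands in for the DatetimeIndex. The assumption is that a label lacking tzinfo simply inherits the index's tzinfo before the search.

```python
import bisect
import datetime as dt


def coerce_to_index_tz(label, tz):
    """Promote a plain date to a datetime and attach tz if the label is naive.

    Hypothetical helper illustrating points 3 and 4; not part of pandas.
    """
    # datetime is a subclass of date, so test for plain dates explicitly.
    if isinstance(label, dt.date) and not isinstance(label, dt.datetime):
        label = dt.datetime(label.year, label.month, label.day)
    if label.tzinfo is None:
        label = label.replace(tzinfo=tz)
    return label


def left_bound(sorted_values, label, tz):
    """Return the leftmost insertion position of label in sorted_values."""
    return bisect.bisect_left(sorted_values, coerce_to_index_tz(label, tz))


utc = dt.timezone.utc
# Five consecutive UTC midnights starting 2000-01-03 (stand-in for the index).
values = [dt.datetime(2000, 1, 3, tzinfo=utc) + dt.timedelta(days=i) for i in range(5)]

before = left_bound(values, dt.date(2000, 1, 1), utc)      # date before the start
inside = left_bound(values, dt.datetime(2000, 1, 5), utc)  # naive datetime inside
after = left_bound(values, dt.date(2020, 1, 1), utc)       # date after the end
print(before, inside, after)
```

Under this assumption none of the three calls raises, whatever the label's position relative to the stored values.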

Again, observations inlined and expected behaviors discussed afterwards:

########################################################
OBSERVATIONS 2 using a DatetimeIndex with tzinfo is None
########################################################

import datetime
import pandas as pd
import pandas.util.testing as put

# Generate a non-localized DatetimeIndex
df = put.makeTimeDataFrame()
index = df.index
assert index.tzinfo is None

# Show the generated DatetimeIndex, which should look like:
# DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
#                ...
#                '2000-02-10', '2000-02-11'],
#               dtype='datetime64[ns]', freq='B')
print(index)

# (G) When the date is inside the DatetimeIndex, this call completes.
index.get_slice_bound(datetime.date(2000, 1, 7), kind="ix", side="left")

# (H) Notice date before start of index
# TypeError: searchsorted requires compatible dtype or scalar, not date
index.get_slice_bound(datetime.date(2000, 1, 1), kind="ix", side="left")

# (I) Notice date after end of index
# TypeError: searchsorted requires compatible dtype or scalar, not date
index.get_slice_bound(datetime.date(2020, 1, 1), kind="ix", side="left")


# (J) When the Timestamp is inside the DatetimeIndex, this call completes
index.get_slice_bound(pd.Timestamp("2000-01-07"), kind="ix", side="left")

# (K) Call completes for Timestamp before start of index
index.get_slice_bound(pd.Timestamp("2000-01-01"), kind="ix", side="left")

# (L) Call completes for Timestamp after end of index
index.get_slice_bound(pd.Timestamp("2020-01-01"), kind="ix", side="left")

Discussion:

  1. Above, I expect (H)-(I) to behave as (G).
  2. Above, (J)-(L) appear correct to me.
  3. I don't expect any data-dependence, meaning that the specific YYYY-MM-DD should not impact if something raises.
  4. I have not tested datetime.datetime under these circumstances.
  5. I believe some of these behaviors may have changed since the 0.2-series.
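
The "no data-dependence" expectation in point 3 can be written down as a check: for a given label type, the outcome (completes or raises) should be the same whether the label falls before, inside, or after the stored values. The sketch below is an illustration only. It uses the standard library's `bisect` on a list of naive `datetime` objects as a stand-in for the index, and `outcome` is a hypothetical helper. In this stand-in, plain `date` labels are rejected at every position, so it demonstrates data-independence, not the acceptance of dates that point 1 asks for.

```python
import bisect
import datetime as dt


def outcome(values, label):
    """Return 'ok' if the lookup completes, else the exception class name."""
    try:
        bisect.bisect_left(values, label)
        return "ok"
    except TypeError as exc:
        return type(exc).__name__


# Naive stand-in for the index: 2000-01-03 through 2000-01-07.
values = [dt.datetime(2000, 1, 3) + dt.timedelta(days=i) for i in range(5)]

# The same three positions (before, inside, after) for each label type.
days = [(2000, 1, 1), (2000, 1, 5), (2020, 1, 1)]
labels_by_type = {
    "date": [dt.date(*d) for d in days],
    "datetime": [dt.datetime(*d) for d in days],
}

# Collect the distinct outcomes seen for each label type.
outcomes = {
    name: {outcome(values, label) for label in labels}
    for name, labels in labels_by_type.items()
}
print(outcomes)
```

A data-independent implementation yields exactly one outcome per label type. The observations (G)-(I) would fail the same check, because a `date` label completes in one position and raises in the other two.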
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.14.67-ts1
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.utf8
LOCALE           : en_US.UTF-8

pandas           : 1.0.5
numpy            : 1.16.6
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.0.3
setuptools       : 40.8.0
Cython           : 0.29.20
pytest           : 5.3.1
hypothesis       : 3.57.0
sphinx           : 1.8.5
blosc            : 1.5.1
feather          : None
xlsxwriter       : 1.0.2
lxml.etree       : 4.3.4
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.5.0
pandas_datareader: None
bs4              : 4.9.1
bottleneck       : 1.2.1
fastparquet      : 0.3.3
gcsfs            : None
lxml.etree       : 4.3.4
matplotlib       : 3.0.3
numexpr          : 2.6.4
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 1.0.0
pytables         : None
pytest           : 5.3.1
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.5.1
sqlalchemy       : 1.2.1
tables           : 3.5.2
tabulate         : 0.8.3
xarray           : 0.10.0
xlrd             : 1.1.0
xlwt             : 1.3.0
xlsxwriter       : 1.0.2
numba            : 0.50.1
jreback added the Indexing and Datetime labels Aug 13, 2020
@mroeschke
Member

Sorry that this issue hasn't had followup sooner.

Note that the "ix" option has been removed since 1.1.0, but the results are identical with other kind arguments on master.

I agree with your observations but just to comment:

  1. datetime.date doesn't have 1st class support in pandas (works in some places and not others), but IMO they should be valid arguments like in get_slice_bound.

  2. The tz-aware cases that currently raise TypeError should be supported, since we already support partial indexing of a tz-aware index with a naive timestamp:

In [20]: df['2000-01-03': '2000-01-05']
Out[20]:
                                  A         B         C         D
2000-01-03 00:00:00+00:00 -0.029496 -0.654536 -0.269876  1.294191
2000-01-04 00:00:00+00:00  1.495514  0.407623 -0.092775  0.894099
2000-01-05 00:00:00+00:00 -1.488444 -0.653957  0.513663  0.642006

In [21]: df[pd.Timestamp('2000-01-03'): pd.Timestamp('2000-01-05')]
Out[21]:
                                  A         B         C         D
2000-01-03 00:00:00+00:00 -0.029496 -0.654536 -0.269876  1.294191
2000-01-04 00:00:00+00:00  1.495514  0.407623 -0.092775  0.894099
2000-01-05 00:00:00+00:00 -1.488444 -0.653957  0.513663  0.642006

mroeschke added the Bug and Timezones labels Aug 19, 2020
mroeschke self-assigned this Aug 19, 2020
@RhysU
Author

RhysU commented Aug 19, 2020

Thank you for investigating. If it would be helpful, I can directly send you some non-trivial unit tests that I have written to exercise these observations. It is very much a superset of what is reported above.

@mroeschke
Member

Sure, if you have any test cases that exercise get_slice_bound beyond what you've provided above, that'd be great!

@jbrockmendel
Member

  1. datetime.date doesn't have 1st class support in pandas (works in some places and not others), but IMO they should be valid arguments like in get_slice_bound.

I think the important thing is internal consistency. date objects are not allowed for DatetimeIndex inequality comparisons, so they shouldn't be allowed here (or the comparison behavior should be changed; xref #35466).

  2. The tz-aware cases that currently raise TypeError should be supported, since we already support partial indexing of a tz-aware index with a naive timestamp

I think that is a bug, and that we should not be allowing naive timestamps here. There was a long discussion a couple of years back that resulted in allowing naive strings, but I don't think allowing naive timestamps was ever even proposed.
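
One way to picture the distinction drawn here (naive strings allowed, naive timestamps not) is the standard-library sketch below. It is one possible reading of that policy, not pandas code: `resolve_label` is a hypothetical helper, and the assumption is that a string without an offset is interpreted in the index's time zone while a naive datetime object is rejected outright.

```python
import datetime as dt


def resolve_label(label, index_tz):
    """Turn a slice label into an aware datetime under the policy sketched above.

    Hypothetical helper, not pandas API: strings are parsed and, when they
    carry no offset, interpreted in the index's time zone; naive datetime
    objects are rejected.
    """
    if isinstance(label, str):
        parsed = dt.datetime.fromisoformat(label)
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=index_tz)
        return parsed
    if isinstance(label, dt.datetime):
        if label.tzinfo is None:
            raise TypeError("naive datetime label used with a tz-aware index")
        return label
    raise TypeError(f"unsupported label type: {type(label).__name__}")


utc = dt.timezone.utc
from_string = resolve_label("2000-01-03", utc)
from_aware = resolve_label(dt.datetime(2000, 1, 3, tzinfo=utc), utc)

# A naive datetime object is rejected regardless of its value.
try:
    resolve_label(dt.datetime(2000, 1, 3), utc)
    naive_rejected = False
except TypeError:
    naive_rejected = True

print(from_string == from_aware, naive_rejected)
```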

@RhysU
Author

RhysU commented Aug 24, 2020

If naive strings contain the same information as naive timestamps, I don't see why they should be treated differently (though I'll defer to precedent that I don't know). An object is naive or it is not.

@mroeschke
Member

I am not 100% sure what datetime.date support should be valid in pandas, but it appears that #31501 is a nod in the direction that date objects should be okay for indexing. Since get_slice_bound is also indexing related, shouldn't it be supported as well?

@jbrockmendel
Member

but it appears that #31501 is a nod in the direction that date objects should be okay for indexing

You're right. Note that #31501 was closed by #31521, where I commented that the change was OK but should be deprecated for internal consistency.

We're inconsistent because passing a date object to dti.get_loc, ser.loc, or ser.__getitem__ raises, while a slice containing a date object is accepted (by some slicing methods but not others, per this issue).

jreback added this to the 1.2 milestone Sep 2, 2020