BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603

pbotros · 2020-04-17T01:40:12Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
fs = 50000
i = pd.to_timedelta(np.arange(900 * fs, dtype=np.int) / fs * 1e9, unit='ns')
df = pd.DataFrame({'dummy': np.arange(len(i))}, index=i)
assert np.isclose(len(df['710s':'711s']) / fs, 2.0)
assert np.isclose(len(df['710s':'719s']) / fs, 10.0)
assert np.isclose(len(df['610s':'620s']) / fs, 11.0)
assert np.isclose(len(df['710s':'720.00000001s']) / fs, 10.00002)
assert np.isclose(len(df['710s':'720s']) / fs, 11.0)  # fails! Slices 70 seconds of data??

Problem description

Slicing a dataframe with a TimeDeltaIndex with the particular right bound '720s' seems to be incorrectly parsed, not returning the time slice as expected. As can be seen in the above example, other bounds work as expected, but using '720s' as the right bound returns 60 more seconds of data than it should have.

Expected Output

Slicing between '710s' and '720s' should return 11 seconds of data, as slicing '610s' and '620s' does.

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Linux OS-release : 5.0.0-29-generic machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 2.4.11
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 4.3.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0

The text was updated successfully, but these errors were encountered:

mroeschke · 2020-04-20T17:07:17Z

Actually I think this is the intended behavior.

I think your example shows partial string indexing for a TimedeltaIndex. So since the right edge of '720s' is 12 minutes, the result returns all results with a minute component of 12 minutes. Likewise since '620s' is 10 minutes 20 seconds, the result returns all results with a minute component of 10 and seconds component of 20.

I definitely think this can be better documented here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#indexing

mattbit · 2020-04-28T16:09:17Z

@mroeschke I think this is related to #21186.
While I do see the point in having partial string indexing for datetimes, in the case of TimedeltaIndex it can create some strange and counterintuitive behaviour.

Some example to explain better:

# Create a timeseries with 10 Hz timedelta index (one sample each 0.1 s)
# i.e. index contains values ['00:00:00', '00:00:00.1', '00:00:00.2', …] and
# the series values represent the sample number
idx = pd.timedelta_range(0, '10s', freq='100ms')
ts = pd.Series(np.arange(len(idx)), index=idx)

# I want to get a specific sample, at '00:00:03'
ts.loc['3s']  # returns the value at '00:00:03' (i.e. sample 30)
assert ts.loc['3s'] == 30  # indeed

# Now I want to get all samples until at '00:00:03' 
ts.loc[:'3s']  # this returns all values until '00:00:03.90' (i.e. sample 39)
assert ts.loc[:'3s'][-1] == 30  # this fails, because the last element is not 30 but 39

df.loc[:'3.000s']  # this again returns all values until '00:00:03.90'
assert ts.loc[:'3.000s'][-1] == 30  # fails, again

df.loc[:'3.001s']  # this instead returns all values until '00:00:03'
assert ts.loc[:'3.001s'][-1] == 30  # success!

# The paradox: selecting until '3.000s' returns more than selecting until '3.001s' (!)
len(ts.loc[:'3.000s']) > len(ts.loc[:'3.001s'])  # True

# Using `pandas.Timedelta` objects solves the ambiguity
ts.loc[:pd.Timedelta('3s')]  # returns all values until '00:00:03'
ts.loc[:pd.Timedelta('3s')][-1] == 30  # True

This has to do with the resolution parsed from the timedelta string. Maybe for timedelta indices it would make more sense to always use the resolution of the index?
Or provide an alternative implementation (e.g. FixedResolutionTimedeltaIndex) allowing for this use case?

pbotros · 2020-07-05T21:30:22Z

Sorry for the delay here - revisiting this. I agree with @mattbit that while it's the intended behavior, it's not very intuitive. I'd say that always using the resolution of the index makes sense - presumingly that's how non-range indexing (e.g. ts.loc['3s'] in the above example) works. Unfortunately though, this does change the behavior for existing implementations.

@mroeschke thoughts on the gravity of changing this behavior? Alternatively @mattbit's idea of a new type of Index can make the transition opt-in, but then that's one more class that people will have to think about going forward.

mroeschke · 2020-07-06T04:31:43Z

I'd be -1 to adding an entirely separate FixedResolutionTimedeltaIndex.

I'd be +0 to changing the behavior of partial string indexing. Would probably need more buy in from other core maintainers before deprecating and changing this behavior.

pbotros · 2020-07-06T17:17:15Z

Sounds good - thanks @mroeschke . Any particular core maintainers you have in mind and could add to this issue (or are they notified already)?

mroeschke · 2020-07-06T21:06:50Z

@pandas-dev/pandas-core for thoughts about changing partial string indexing for TimedeltaIndex

jorisvandenbossche · 2020-07-08T13:24:54Z

I think it would be helpful for the discussion if someone could write up some small examples with the current behaviour and proposed future behaviour with rationale for the change, and how it compares to how it works without string parsing (actual Timedelta objects) / with datetimeindex partial string indexing.
(eg the examples from #33603 (comment) might be a good start, but with including output and proposed behaviour, it might be easier to understand what this is about without requiring everyone to run the code themselves).

abrammer · 2021-06-16T20:01:39Z

I think I came just across this issue, that as above provides very counter intuitive results. Reproducible example adapted straight from the docs on slicing with partial timedelta strings. I guess the solution is to not use strings, and rather use Timedelta objects (even if they're created from the same string)

edit, I see this is pretty much the same as the above comment. so 12 months later, still confusing people.

import numpy as np
import pandas as pd
s = pd.Series(
        np.arange(100),
        index=pd.timedelta_range("1 days", periods=100, freq="h"),
)

def compare_indexing(end_time):
    print(end_time)
    print(len(s[:end_time]))
    print(sum(s.index <= pd.Timedelta(end_time)))
    assert len(s[:end_time]) == sum(s.index <= pd.Timedelta(end_time))

end_time = "1d 23 hours"
compare_indexing(end_time)

end_time = "2d 0.1 hours"  # some how works...
compare_indexing(end_time) 

end_time = "2d 0 hours"
compare_indexing(end_time)

for my own curiosity I dug into this a little bit.
From

pandas/pandas/core/indexes/timedeltas.py

Line 220 in 2cb9652

lbound = parsed.round(parsed.resolution_string)

the parsed resolution string seems to be the cause for confusion.

parsed = Timedelta(label)
print('resolution:', parsed.resolution_string)

47H
resolution: H
49H
resolution: H
48H 
resolution: D

The resolution is evaluated against the value being 0 or not, so not sure how that'd be "fixed",

pandas/pandas/_libs/tslibs/timedeltas.pyx

Line 953 in 2cb9652

if self._ns:

jbrockmendel · 2022-04-13T22:59:18Z

I think whats happening here is that we are not actually getting the resolution of the string, just the Timedelta constructed from it. By contrast with DatetimeIndex, the parsing code also returns information about the string's specificity.

randolf-scholz · 2022-11-13T22:12:36Z

The current behaviour seems self-inconsistent:

from datetime import timedelta
import numpy as np
import pandas as pd

# create 24h range with 1 min spacing
td = pd.timedelta_range(start="0h", end="24h", freq="1min").to_series()

# last timedelta included in slice
print(td[: timedelta(hours=1)].iloc[-1])          # 01:00
print(td[: np.timedelta64(1, "h")].iloc[-1])      # 01:00
print(td[: pd.Timedelta(1, "h")].iloc[-1])        # 01:00
print(td[: pd.Timedelta("1h")].iloc[-1])          # 01:00
print(td[:"1h"].iloc[-1])                         # 01:59 ✘

# with loc
print(td.loc[: timedelta(hours=1)].iloc[-1])      # 01:00
print(td.loc[: np.timedelta64(1, "h")].iloc[-1])  # 01:00
print(td.loc[: pd.Timedelta(1, "h")].iloc[-1])    # 01:00
print(td.loc[: pd.Timedelta("1h")].iloc[-1])      # 01:00
print(td.loc[:"1h"].iloc[-1])                     # 01:59 ✘

One would expect that if string values are passed they should simply be cast via pd.Timedelta.

@jorisvandenbossche The documentation tries to give some rationale for the behaviour. (or here)

In [109]: s["1 day":"2 day"]
Out[109]: 
1 days 00:00:00     0
1 days 01:00:00     1
1 days 02:00:00     2
1 days 03:00:00     3
1 days 04:00:00     4
                   ..
2 days 19:00:00    43
2 days 20:00:00    44
2 days 21:00:00    45
2 days 22:00:00    46
2 days 23:00:00    47
Freq: H, Length: 48, dtype: int64

And it is kind of neat that one can write something like dt["1 day":"1 day"] to get all values that belong to "day 1". But that is also kind of the issue, because here the lower bound and the upper bound are effectively translated to different values.

Another detail is that the slicing with a datetime.timedelta, pd.Timedelta and np.timedelta64 gives a an inclusive upper bound, which makes it difficult to select day ranges without getting the "0:00" value of the following day.

There is probably a good reason for this behaviour but the design decisions for this are likely deeply buried somewhere. In principle I think it would make more sense if time ranges were half-open intervals, just like it is with ints and floats.

pbotros added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 17, 2020

mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves Timedelta Timedelta data type and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 20, 2020

mroeschke added Docs and removed Bug labels Apr 20, 2020

mroeschke mentioned this issue May 3, 2020

DataFrame slicing TimedeltaIndex with str #21186

Closed

mroeschke added API Design Needs Discussion Requires discussion from core team before further action labels Jul 6, 2020

mroeschke added Deprecate Functionality to remove in pandas and removed API Design labels Jul 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603

BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603

pbotros commented Apr 17, 2020 •

edited

mroeschke commented Apr 20, 2020

mattbit commented Apr 28, 2020

pbotros commented Jul 5, 2020

mroeschke commented Jul 6, 2020

pbotros commented Jul 6, 2020

mroeschke commented Jul 6, 2020

jorisvandenbossche commented Jul 8, 2020

abrammer commented Jun 16, 2021 •

edited

jbrockmendel commented Apr 13, 2022

randolf-scholz commented Nov 13, 2022 •

edited

BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603

BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603

Comments

pbotros commented Apr 17, 2020 • edited

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

mroeschke commented Apr 20, 2020

mattbit commented Apr 28, 2020

pbotros commented Jul 5, 2020

mroeschke commented Jul 6, 2020

pbotros commented Jul 6, 2020

mroeschke commented Jul 6, 2020

jorisvandenbossche commented Jul 8, 2020

abrammer commented Jun 16, 2021 • edited

jbrockmendel commented Apr 13, 2022

randolf-scholz commented Nov 13, 2022 • edited

pbotros commented Apr 17, 2020 •

edited

Output of `pd.show_versions()`

abrammer commented Jun 16, 2021 •

edited

randolf-scholz commented Nov 13, 2022 •

edited