Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603

Open
2 of 3 tasks
pbotros opened this issue Apr 17, 2020 · 10 comments
Open
2 of 3 tasks
Labels
Deprecate Functionality to remove in pandas Docs Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action Timedelta Timedelta data type

Comments

@pbotros
Copy link

pbotros commented Apr 17, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
fs = 50000
i = pd.to_timedelta(np.arange(900 * fs, dtype=np.int) / fs * 1e9, unit='ns')
df = pd.DataFrame({'dummy': np.arange(len(i))}, index=i)
assert np.isclose(len(df['710s':'711s']) / fs, 2.0)
assert np.isclose(len(df['710s':'719s']) / fs, 10.0)
assert np.isclose(len(df['610s':'620s']) / fs, 11.0)
assert np.isclose(len(df['710s':'720.00000001s']) / fs, 10.00002)
assert np.isclose(len(df['710s':'720s']) / fs, 11.0)  # fails! Slices 70 seconds of data??

Problem description

Slicing a dataframe with a TimeDeltaIndex with the particular right bound '720s' seems to be incorrectly parsed, not returning the time slice as expected. As can be seen in the above example, other bounds work as expected, but using '720s' as the right bound returns 60 more seconds of data than it should have.

Expected Output

Slicing between '710s' and '720s' should return 11 seconds of data, as slicing '610s' and '620s' does.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Linux OS-release : 5.0.0-29-generic machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 2.4.11
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 4.3.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0

@pbotros pbotros added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 17, 2020
@mroeschke mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves Timedelta Timedelta data type and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 20, 2020
@mroeschke
Copy link
Member

Actually I think this is the intended behavior.

I think your example shows partial string indexing for a TimedeltaIndex. So since the right edge of '720s' is 12 minutes, the result returns all results with a minute component of 12 minutes. Likewise since '620s' is 10 minutes 20 seconds, the result returns all results with a minute component of 10 and seconds component of 20.

I definitely think this can be better documented here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#indexing

@mroeschke mroeschke added Docs and removed Bug labels Apr 20, 2020
@mattbit
Copy link

mattbit commented Apr 28, 2020

@mroeschke I think this is related to #21186.
While I do see the point in having partial string indexing for datetimes, in the case of TimedeltaIndex it can create some strange and counterintuitive behaviour.

Some example to explain better:

# Create a timeseries with 10 Hz timedelta index (one sample each 0.1 s)
# i.e. index contains values ['00:00:00', '00:00:00.1', '00:00:00.2', …] and
# the series values represent the sample number
idx = pd.timedelta_range(0, '10s', freq='100ms')
ts = pd.Series(np.arange(len(idx)), index=idx)

# I want to get a specific sample, at '00:00:03'
ts.loc['3s']  # returns the value at '00:00:03' (i.e. sample 30)
assert ts.loc['3s'] == 30  # indeed

# Now I want to get all samples until at '00:00:03' 
ts.loc[:'3s']  # this returns all values until '00:00:03.90' (i.e. sample 39)
assert ts.loc[:'3s'][-1] == 30  # this fails, because the last element is not 30 but 39

df.loc[:'3.000s']  # this again returns all values until '00:00:03.90'
assert ts.loc[:'3.000s'][-1] == 30  # fails, again

df.loc[:'3.001s']  # this instead returns all values until '00:00:03'
assert ts.loc[:'3.001s'][-1] == 30  # success!

# The paradox: selecting until '3.000s' returns more than selecting until '3.001s' (!)
len(ts.loc[:'3.000s']) > len(ts.loc[:'3.001s'])  # True

# Using `pandas.Timedelta` objects solves the ambiguity
ts.loc[:pd.Timedelta('3s')]  # returns all values until '00:00:03'
ts.loc[:pd.Timedelta('3s')][-1] == 30  # True

This has to do with the resolution parsed from the timedelta string. Maybe for timedelta indices it would make more sense to always use the resolution of the index?
Or provide an alternative implementation (e.g. FixedResolutionTimedeltaIndex) allowing for this use case?

@pbotros
Copy link
Author

pbotros commented Jul 5, 2020

Sorry for the delay here - revisiting this. I agree with @mattbit that while it's the intended behavior, it's not very intuitive. I'd say that always using the resolution of the index makes sense - presumingly that's how non-range indexing (e.g. ts.loc['3s'] in the above example) works. Unfortunately though, this does change the behavior for existing implementations.

@mroeschke thoughts on the gravity of changing this behavior? Alternatively @mattbit's idea of a new type of Index can make the transition opt-in, but then that's one more class that people will have to think about going forward.

@mroeschke
Copy link
Member

I'd be -1 to adding an entirely separate FixedResolutionTimedeltaIndex.

I'd be +0 to changing the behavior of partial string indexing. Would probably need more buy in from other core maintainers before deprecating and changing this behavior.

@mroeschke mroeschke added API Design Needs Discussion Requires discussion from core team before further action labels Jul 6, 2020
@pbotros
Copy link
Author

pbotros commented Jul 6, 2020

Sounds good - thanks @mroeschke . Any particular core maintainers you have in mind and could add to this issue (or are they notified already)?

@mroeschke
Copy link
Member

@pandas-dev/pandas-core for thoughts about changing partial string indexing for TimedeltaIndex

@jorisvandenbossche
Copy link
Member

I think it would be helpful for the discussion if someone could write up some small examples with the current behaviour and proposed future behaviour with rationale for the change, and how it compares to how it works without string parsing (actual Timedelta objects) / with datetimeindex partial string indexing.
(eg the examples from #33603 (comment) might be a good start, but with including output and proposed behaviour, it might be easier to understand what this is about without requiring everyone to run the code themselves).

@abrammer
Copy link

abrammer commented Jun 16, 2021

I think I came just across this issue, that as above provides very counter intuitive results. Reproducible example adapted straight from the docs on slicing with partial timedelta strings. I guess the solution is to not use strings, and rather use Timedelta objects (even if they're created from the same string)

edit, I see this is pretty much the same as the above comment. so 12 months later, still confusing people.

import numpy as np
import pandas as pd
s = pd.Series(
        np.arange(100),
        index=pd.timedelta_range("1 days", periods=100, freq="h"),
)

def compare_indexing(end_time):
    print(end_time)
    print(len(s[:end_time]))
    print(sum(s.index <= pd.Timedelta(end_time)))
    assert len(s[:end_time]) == sum(s.index <= pd.Timedelta(end_time))

end_time = "1d 23 hours"
compare_indexing(end_time)

end_time = "2d 0.1 hours"  # some how works...
compare_indexing(end_time) 

end_time = "2d 0 hours"
compare_indexing(end_time)

for my own curiosity I dug into this a little bit.
From

lbound = parsed.round(parsed.resolution_string)
the parsed resolution string seems to be the cause for confusion.

parsed = Timedelta(label)
print('resolution:', parsed.resolution_string)

47H
resolution: H
49H
resolution: H
48H 
resolution: D

The resolution is evaluated against the value being 0 or not, so not sure how that'd be "fixed",

@mroeschke mroeschke added Deprecate Functionality to remove in pandas and removed API Design labels Jul 31, 2021
@jbrockmendel
Copy link
Member

I think whats happening here is that we are not actually getting the resolution of the string, just the Timedelta constructed from it. By contrast with DatetimeIndex, the parsing code also returns information about the string's specificity.

@randolf-scholz
Copy link
Contributor

randolf-scholz commented Nov 13, 2022

The current behaviour seems self-inconsistent:

from datetime import timedelta
import numpy as np
import pandas as pd

# create 24h range with 1 min spacing
td = pd.timedelta_range(start="0h", end="24h", freq="1min").to_series()

# last timedelta included in slice
print(td[: timedelta(hours=1)].iloc[-1])          # 01:00
print(td[: np.timedelta64(1, "h")].iloc[-1])      # 01:00
print(td[: pd.Timedelta(1, "h")].iloc[-1])        # 01:00
print(td[: pd.Timedelta("1h")].iloc[-1])          # 01:00
print(td[:"1h"].iloc[-1])                         # 01:59 ✘

# with loc
print(td.loc[: timedelta(hours=1)].iloc[-1])      # 01:00
print(td.loc[: np.timedelta64(1, "h")].iloc[-1])  # 01:00
print(td.loc[: pd.Timedelta(1, "h")].iloc[-1])    # 01:00
print(td.loc[: pd.Timedelta("1h")].iloc[-1])      # 01:00
print(td.loc[:"1h"].iloc[-1])                     # 01:59 ✘

One would expect that if string values are passed they should simply be cast via pd.Timedelta.


@jorisvandenbossche The documentation tries to give some rationale for the behaviour. (or here)

In [109]: s["1 day":"2 day"]
Out[109]: 
1 days 00:00:00     0
1 days 01:00:00     1
1 days 02:00:00     2
1 days 03:00:00     3
1 days 04:00:00     4
                   ..
2 days 19:00:00    43
2 days 20:00:00    44
2 days 21:00:00    45
2 days 22:00:00    46
2 days 23:00:00    47
Freq: H, Length: 48, dtype: int64

And it is kind of neat that one can write something like dt["1 day":"1 day"] to get all values that belong to "day 1". But that is also kind of the issue, because here the lower bound and the upper bound are effectively translated to different values.

Another detail is that the slicing with a datetime.timedelta, pd.Timedelta and np.timedelta64 gives a an inclusive upper bound, which makes it difficult to select day ranges without getting the "0:00" value of the following day.

There is probably a good reason for this behaviour but the design decisions for this are likely deeply buried somewhere. In principle I think it would make more sense if time ranges were half-open intervals, just like it is with ints and floats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Docs Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action Timedelta Timedelta data type
Projects
None yet
Development

No branches or pull requests

7 participants