BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603
Comments
Actually, I think this is the intended behavior. I think your example shows partial string indexing for a I definitely think this can be better documented here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#indexing
@mroeschke I think this is related to #21186. Here is an example to explain it better:
import numpy as np
import pandas as pd

# Create a time series with a 10 Hz timedelta index (one sample every 0.1 s),
# i.e. the index contains the values ['00:00:00', '00:00:00.1', '00:00:00.2', …] and
# the series values represent the sample number
idx = pd.timedelta_range(0, '10s', freq='100ms')
ts = pd.Series(np.arange(len(idx)), index=idx)

# I want to get a specific sample, at '00:00:03'
ts.loc['3s']  # returns the value at '00:00:03' (i.e. sample 30)
assert ts.loc['3s'] == 30  # indeed

# Now I want to get all samples up to and including '00:00:03'
ts.loc[:'3s']  # this returns all values up to '00:00:03.9' (i.e. sample 39)
assert ts.loc[:'3s'].iloc[-1] == 30  # this fails, because the last element is not 30 but 39
ts.loc[:'3.000s']  # this again returns all values up to '00:00:03.9'
assert ts.loc[:'3.000s'].iloc[-1] == 30  # fails, again
ts.loc[:'3.001s']  # this instead returns all values up to '00:00:03'
assert ts.loc[:'3.001s'].iloc[-1] == 30  # success!

# The paradox: selecting up to '3.000s' returns more than selecting up to '3.001s' (!)
len(ts.loc[:'3.000s']) > len(ts.loc[:'3.001s'])  # True

# Using `pandas.Timedelta` objects resolves the ambiguity
ts.loc[:pd.Timedelta('3s')]  # returns all values up to '00:00:03'
ts.loc[:pd.Timedelta('3s')].iloc[-1] == 30  # True

This has to do with the resolution parsed from the timedelta string. Maybe for timedelta indices it would make more sense to always use the resolution of the index?
Sorry for the delay here - revisiting this. I agree with @mattbit that while it's the intended behavior, it's not very intuitive. I'd say that always using the resolution of the index makes sense - presumably that's how non-range indexing (e.g. @mroeschke thoughts on the gravity of changing this behavior? Alternatively, @mattbit's idea of a new type of Index can make the transition opt-in, but then that's one more class that people will have to think about going forward.
I'd be -1 to adding an entirely separate I'd be +0 to changing the behavior of partial string indexing. Would probably need more buy-in from other core maintainers before deprecating and changing this behavior.
Sounds good - thanks @mroeschke. Any particular core maintainers you have in mind and could add to this issue (or are they notified already)?
@pandas-dev/pandas-core for thoughts about changing partial string indexing for TimedeltaIndex
I think it would be helpful for the discussion if someone could write up some small examples with the current behaviour and the proposed future behaviour, with a rationale for the change, and how it compares to how it works without string parsing (actual Timedelta objects) / with DatetimeIndex partial string indexing.
I think I just came across this issue, which, as above, gives very counterintuitive results. Here is a reproducible example adapted straight from the docs on slicing with partial timedelta strings. I guess the solution is to not use strings and to use Timedelta objects instead (even if they're created from the same string).
Edit: I see this is pretty much the same as the comment above, so 12 months later this is still confusing people.
import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(100),
    index=pd.timedelta_range("1 days", periods=100, freq="h"),
)

def compare_indexing(end_time):
    # Compare string-based slicing with an explicit Timedelta comparison
    print(end_time)
    print(len(s[:end_time]))
    print(sum(s.index <= pd.Timedelta(end_time)))
    assert len(s[:end_time]) == sum(s.index <= pd.Timedelta(end_time))

end_time = "1d 23 hours"
compare_indexing(end_time)
end_time = "2d 0.1 hours"  # somehow works...
compare_indexing(end_time)
end_time = "2d 0 hours"
compare_indexing(end_time)

For my own curiosity I dug into this a little bit; see pandas/pandas/core/indexes/timedeltas.py, line 220 in 2cb9652.
The resolution is evaluated against the value being 0 or not, so I'm not sure how that'd be "fixed"; see pandas/pandas/_libs/tslibs/timedeltas.pyx, line 953 in 2cb9652.
I think what's happening here is that we are not actually getting the resolution of the string, just that of the Timedelta constructed from it. By contrast, with DatetimeIndex the parsing code also returns information about the string's specificity.
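A rough way to picture what the last two comments describe, written as a pure-Python model and not as the actual pandas code: the resolution is taken from the finest non-zero component of the parsed value, and a string right bound is then stretched to the last instant of that period. The names `inferred_resolution` and `right_bound` are hypothetical, the fallback to days for a zero value is an assumption, and `datetime.timedelta` only goes down to microseconds, so this is a sketch of the idea only.

```python
from datetime import timedelta

# Units from coarsest to finest. datetime.timedelta stops at microseconds,
# so nanoseconds are left out of this simplified model.
UNITS = [
    ("D", timedelta(days=1)),
    ("h", timedelta(hours=1)),
    ("min", timedelta(minutes=1)),
    ("s", timedelta(seconds=1)),
    ("ms", timedelta(milliseconds=1)),
    ("us", timedelta(microseconds=1)),
]


def inferred_resolution(value):
    """Return (name, step) of the finest unit with a non-zero component."""
    name, step = UNITS[0]  # assumption: an all-zero value falls back to days
    remainder = value
    for unit_name, unit_step in UNITS:
        component, remainder = divmod(remainder, unit_step)
        if component:
            name, step = unit_name, unit_step
    return name, step


def right_bound(value):
    """Model of a string right bound: the last instant of the inferred period."""
    _, step = inferred_resolution(value)
    return value + step - timedelta(microseconds=1)


# Non-negative example values taken from the bounds discussed in this thread
for label, value in [
    ("3s", timedelta(seconds=3)),
    ("3.001s", timedelta(seconds=3, milliseconds=1)),
    ("720s", timedelta(seconds=720)),
    ("1h", timedelta(hours=1)),
]:
    print(label, inferred_resolution(value)[0], right_bound(value))
```

Under this model '720s' is seen as 12 whole minutes, so its right bound lands just before 13 minutes, while '3.001s' is seen at millisecond resolution and ends up with a tighter bound than '3s'. That matches the symptoms reported in this thread.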
The current behaviour seems self-inconsistent:
from datetime import timedelta
import numpy as np
import pandas as pd
# create 24h range with 1 min spacing
td = pd.timedelta_range(start="0h", end="24h", freq="1min").to_series()
# last timedelta included in slice
print(td[: timedelta(hours=1)].iloc[-1]) # 01:00
print(td[: np.timedelta64(1, "h")].iloc[-1]) # 01:00
print(td[: pd.Timedelta(1, "h")].iloc[-1]) # 01:00
print(td[: pd.Timedelta("1h")].iloc[-1]) # 01:00
print(td[:"1h"].iloc[-1]) # 01:59 ✘
# with loc
print(td.loc[: timedelta(hours=1)].iloc[-1]) # 01:00
print(td.loc[: np.timedelta64(1, "h")].iloc[-1]) # 01:00
print(td.loc[: pd.Timedelta(1, "h")].iloc[-1]) # 01:00
print(td.loc[: pd.Timedelta("1h")].iloc[-1]) # 01:00
print(td.loc[:"1h"].iloc[-1]) # 01:59 ✘
One would expect that if string values are passed they should simply be cast via
@jorisvandenbossche The documentation tries to give some rationale for the behaviour. (or here)
And it is kind of neat that one can write something like
Another detail is that the slicing with a
There is probably a good reason for this behaviour, but the design decisions behind it are likely deeply buried somewhere. In principle, I think it would make more sense if time ranges were half-open intervals, just like it is with
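To make the closed versus half-open distinction concrete, here is a small sketch that uses only the standard library (plain `datetime.timedelta` values and `bisect`, not pandas indexing). The variable names are illustrative. It only shows how the two conventions differ for a right bound of one hour on a 1-minute grid.

```python
from bisect import bisect_left, bisect_right
from datetime import timedelta

# Sorted 1-minute grid covering 24 hours, endpoints included
index = [timedelta(minutes=m) for m in range(24 * 60 + 1)]
stop = timedelta(hours=1)

# Closed interval [start, stop]: the bound itself is included
closed = index[: bisect_right(index, stop)]

# Half-open interval [start, stop): the bound itself is excluded
half_open = index[: bisect_left(index, stop)]

print(closed[-1], len(closed))
print(half_open[-1], len(half_open))
```

With the closed convention the last element is exactly 01:00; with the half-open convention it is 00:59. Neither matches the 01:59 that the string bound "1h" produced in the example above.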
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
When slicing a DataFrame that has a TimedeltaIndex, the particular right bound '720s' seems to be parsed incorrectly, so the expected time slice is not returned. As can be seen in the above example, other bounds work as expected, but using '720s' as the right bound returns 60 more seconds of data than it should.
Expected Output
Slicing between '710s' and '720s' should return 11 seconds of data, as slicing between '610s' and '620s' does.
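The reporter's code sample did not survive in this copy of the issue, so the following is only a stand-in sketch. It assumes a frame with one row per second (the real data may differ) and uses a hypothetical helper, `slice_between`, that converts the bounds to `pd.Timedelta` objects before slicing, which is the workaround suggested in the comments.

```python
import numpy as np
import pandas as pd

# Assumed stand-in data: one row per second for 15 minutes (not the reporter's frame)
index = pd.to_timedelta(np.arange(900), unit="s")
df = pd.DataFrame({"value": np.arange(len(index))}, index=index)


def slice_between(frame, start, stop):
    """Slice with explicit Timedelta bounds; both endpoints are included."""
    return frame.loc[pd.Timedelta(start):pd.Timedelta(stop)]


window_610 = slice_between(df, "610s", "620s")
window_710 = slice_between(df, "710s", "720s")

print(len(window_610), len(window_710))
```

With explicit `pd.Timedelta` bounds both windows contain 11 rows, because no resolution is inferred from a string and the bound is compared as an exact value.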
Output of
pd.show_versions()
pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 2.4.11
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 4.3.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0