Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError on slicing with datetime or pandas.Timestamp #5821

Closed
vfilimonov opened this issue Jan 2, 2014 · 13 comments
Closed

KeyError on slicing with datetime or pandas.Timestamp #5821

vfilimonov opened this issue Jan 2, 2014 · 13 comments
Labels
Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Timeseries Usage Question
Milestone

Comments

@vfilimonov
Copy link
Contributor

related #6066 (on Float64Index too)

The 'KeyError' on slicing was discussed multiple times but I'm still not sure if the issue below is a bug or just my misunderstanding.

I'm experiencing 'KeyError' from time to time when I try to slice my dataframes with datetime or Timestamp objects, however slicing with strings works perfectly. I was unable to construct synthetic example with pandas.date_range, so I needed to upload the piece of real data where the issue appears:
https://www.dropbox.com/s/ibzbwqs35tiydyc/tmp.h5

df = pd.read_hdf('tmp.h5', 'data')
print df.head()

When I try to slice it with the pandas.Timestamp objects it results in 'KeyError':

tt1 = pd.Timestamp('2006-11-16 18:30:00')
tt2 = pd.Timestamp('2006-11-16 18:40:00')
print df.ix[tt1:tt2].head()

Same for datetime objects:

tt1 = pd.Timestamp('2006-11-16 18:30:00').to_datetime()
tt2 = pd.Timestamp('2006-11-16 18:40:00').to_datetime()
print df.ix[tt1:tt2].head()

However this slicing works perfectly:

tt1 = pd.Timestamp('2006-11-16 18:30:00').strftime('%Y-%m-%d %H:%M:%S')
tt2 = pd.Timestamp('2006-11-16 18:40:00').strftime('%Y-%m-%d %H:%M:%S')
print df[tt1:tt2].head()

Numpy version 1.8.0
Pandas version 0.13.0

@jreback
Copy link
Contributor

jreback commented Jan 2, 2014

Your index is not monotonic (e.g. sorted). This is more of an incorrect error report

If you sort it it works (with exact indexes or not)

In [14]: df.sort_index().ix[Timestamp('2006-11-16 18:30:04.101497'):Timestamp('2006-11-16 18:30:06.075980')]
Out[14]: 
                                  val
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:06.075980+00:00    0
2006-11-16 18:30:06.075980+00:00    0

[5 rows x 1 columns]

In [15]: df.sort_index().ix[Timestamp('2006-11-16 18:30:04.101497'):Timestamp('2006-11-16 18:30:06')]
Out[15]: 
                                  val
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0

[3 rows x 1 columns]

In [16]: df.sort_index().ix[Timestamp('2006-11-16 18:30:04'):Timestamp('2006-11-16 18:30:06')]
Out[16]: 
                                  val
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0

[3 rows x 1 columns]

This should be a ValueError I think; so its a 'bug' on the error report

In [17]: df.ix[Timestamp('2006-11-16 18:30:04.101497'):Timestamp('2006-11-16 18:30:06.075980')]
KeyError: 'cannot peform a slice operation on a non-unique non-monotonic index'

@jtratner agree?

@vfilimonov
Copy link
Contributor Author

@jreback thanks for pointing out the problem with my dataset! I was not aware of it.

@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

closing as not a bug

@KristianHolsheimer
Copy link
Contributor

Hi there, I know this issue is closed, since it's not a bug. I would argue, though, that the error message could point the user in the right direction (I googled the error message and it lead me here).

@leonsas
Copy link

leonsas commented Aug 31, 2016

I second Kristian. And I'll provide a bit more context which can be helpful:

trimmed_series = series[event.startTime:event.endTime]

Which throws KeyError: 1472641846000000000. Initially I thought that maybe event.endTime happens to be after the last index of series, and that maybe pandas isn't smart in dealing with out of bound errors. Obviously pandas does deal with this and it was the index not being sorted (which was a bug elsewhere in my code), which I found here. But it definitely isn't intuitive.

@jreback
Copy link
Contributor

jreback commented Aug 31, 2016

@leonsas @KristianHolsheimer

if you have a nice reproducible example, pls open a new issue (and xref this one). This is well-defined behavior. I closed this because its not a bug, though, it could/should be a ValueError with an even more informative error message.

@kdebrab
Copy link
Contributor

kdebrab commented Jan 12, 2017

As requested by @jreback, I hereby create a nice reproducible example:

import pandas as pd
index = pd.date_range('2016-10-29 23:00', '2016-10-30 3:00', freq='15T', tz='UTC')
index = index.tz_convert('Europe/Brussels').tz_localize(None)
ts = pd.Series(1, index=index)
ts.truncate(before='2016-10-30 2:10')

which raises:

KeyError: 1477793400000000000L

In above example, the index is non-sorted due to the naive local time, comprising a switch from Summer to Winter time.

Slicing the data as:

ts[pd.Timestamp('2016-10-30 2:10'):]

returns the same KeyError.

Truncating with a time that is contained inside the index works just fine (unless you choose a non-unique label):

ts.truncate(before='2016-10-30 1:30')

Interestingly:

ts['2016-10-30 2:10':]

raises no error, returning the correct (non-sorted !) result. Even when choosing a duplicate label!

All different operations work on the sorted series, e.g.:

ts.sort_index().truncate(before='2016-10-30 2:10')

Though now the result is sorted as well of course.

I'm using pandas version 0.19.1

@jreback
Copy link
Contributor

jreback commented Jan 12, 2017

@kdebrab the only thing that would be nicer would be the actual KeyError message for a not-found label. (It should show it as a Timestamp). So would take a fix for that.

you can open a new issue, or push a PR if that works for you.

@jorisvandenbossche
Copy link
Member

The truncate does just a slice under the hood, so the example of @kdebrab becomes:

In [12]: ts[pd.Timestamp('2016-1-1'):]
...
KeyError: 1451606400000000000

So which raises an error, because you are slicing with a non-present value on a non-sorted index.
To give a more simple example for this:

In [33]: s = pd.Series(range(5), index=[1,2,3,2,4])

# works because '1' is present (and is unique), so unsortedness does not matter
In [34]: s.loc[1:]
Out[34]: 
1    0
2    1
3    2
2    3
4    4
dtype: int64

# fails because '0' is not a label, because of unsortedness pandas cannot know how to handle this
In [35]: s.loc[0:]
...
KeyError: 0

I think it really be nice to have a better error message for this, as the difference between sorted/not-sorted can be very subtle, and with a sorted index, slicing with a non-present label is perfectly fine.

@jreback The only thing I don't directly find clear is why does the same example work with strings?

In [13]: ts['2016-1-1':]
Out[13]: 
2016-10-30 01:00:00    1
2016-10-30 01:15:00    1
...
2016-10-30 03:45:00    1
2016-10-30 04:00:00    1
Freq: 15T, dtype: int64

Shouldn't this be equivalent to ts[pd.Timestamp('2016-1-1'):]? Or at least in the case of ts['2016-1-1 00:00:00':] (which also works) where it cannot be seen as partial string?

@jreback
Copy link
Contributor

jreback commented Jan 12, 2017

this works with strings because a string turns into partial timestamp indexing and thus is a slice, and hence works.

so maybe we ought to always make truncate a slice, then this will just work. (internally its a scalar being passed and NOT a slice), so this is pretty easy to 'fix'.

and I agree that this should work.

So let's create 2 new issues for this

  1. better error message with a datetime aware (e.g. it should show the converted key, not the internal repr)
  2. fix .truncate to always be slice like

@jorisvandenbossche
Copy link
Member

@jreback I thought partial strings only turned into a slice when the string has lower resolution than the series. In the case of ts['2016-1-1 00:00:00':] this is certainly not the case, and should be treated as a slice with an exact Timestamp?

@jreback
Copy link
Contributor

jreback commented Jan 12, 2017

In [7]: ts.index.resolution
Out[7]: 'minute'

so I think its always treated as a slice (in this example)

aptiko added a commit to aptiko/loggertodb that referenced this issue May 27, 2019
@DanielHabenicht
Copy link

So this was never settled?
I think I will create a new bug for it as it just cost me almost an hour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Timeseries Usage Question
Projects
None yet
Development

No branches or pull requests

7 participants