
PERF: Very slow printing of Series with DatetimeIndex #19764

Closed
jorisvandenbossche opened this Issue Feb 19, 2018 · 13 comments

4 participants
@jorisvandenbossche
Member

jorisvandenbossche commented Feb 19, 2018

Consider this series with a datetime index:

s = pd.Series(np.random.randn(int(1e6)), 
              index=pd.date_range("2012-01-01", freq='s', periods=int(1e6)))

Showing this in the console or notebook (doing just s) is very slow: a noticeable delay of 1 to 2 seconds in the console, and around 8 seconds in the notebook or JupyterLab. In a plain Python console, on the other hand, the display is instant.

Further, if the series has no DatetimeIndex, the display appears instantly. The same holds if it is a frame (doing s.to_frame()), and when doing print(s) explicitly the output is actually identical yet instant. With another datetime-like index, e.g. s.to_period(), the display is also instant.

So it seems we are doing unneeded work under the hood, and it is somehow triggered by doing this in an IPython context, specifically for a DatetimeIndex.

This already seems to be present on some older versions of pandas as well.
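To confirm that the formatting itself is not the bottleneck, the plain repr can be timed directly outside of IPython. A minimal sketch (smaller data so it runs anywhere; the 1-2 second delay only shows up via IPython's display machinery, so no slowness is expected here):

```python
import time

import numpy as np
import pandas as pd

# Same setup as above, but smaller so it runs quickly anywhere.
s = pd.Series(
    np.random.randn(int(1e5)),
    index=pd.date_range("2012-01-01", freq="s", periods=int(1e5)),
)

# repr()/str() are what print(s) uses; both only format the head and tail.
t0 = time.perf_counter()
text = repr(s)
elapsed = time.perf_counter() - t0
print(f"repr took {elapsed:.4f}s, output is {len(text.splitlines())} lines")
```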

@jorisvandenbossche

Member

jorisvandenbossche commented Feb 19, 2018

cc @takluyver what is the mechanism by which an object is displayed in general? The __repr__ or __str__ itself is fast, but displaying in an IPython console is not.

@jorisvandenbossche

Member

jorisvandenbossche commented Feb 19, 2018

This seems to be present in IPython >= 6.1 (I don't see it in an environment with IPython 6.0). In an IPython console, the overhead comes from MimeBundleFormatter.__call__ calling get_real_method, which in the end does a getattr on the series to check whether a _repr_mimebundle_ method exists.

So in the end it is due to getattr taking a huge amount of time on a Series with a DatetimeIndex. With the example from above:

In [29]: %timeit getattr(s, '_bla', None)
1.39 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: s2 = s.reset_index(drop=True)

In [32]: %timeit getattr(s2, '_bla', None)
3.54 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [33]: df = s.to_frame()

In [34]: %timeit getattr(df, '_bla', None)
3.64 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And from the same profiling, it seems pandas is doing expensive get_loc calls on the index a few times, stemming from a 'contains' check in generic.py's __getattr__ (name in s.index). Timings for this:

In [41]: %timeit '_bla' in s.index
517 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [42]: %timeit '_bla' in s2.index
1.4 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [43]: %timeit '_bla' in s3.index
2.91 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

which in turn stems from a very slow get_loc in the DatetimeIndex engine. I suppose it is doing (for this case) way too much work because it tries to interpret the string as a datetime string.
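The chain described above can be seen directly: Series.__getattr__ falls back to checking whether the name is an index label, and DatetimeIndex.__contains__ routes that through get_loc, which attempts to parse the string as a datetime before giving up. A small sketch of both lookups landing on the same machinery (this reproduces the mechanism; how slow the failing lookup is depends on the pandas version):

```python
import pandas as pd

dti = pd.date_range("2012-01-01", freq="s", periods=10_000)
s = pd.Series(range(len(dti)), index=dti)

# Attribute fallback: an unknown name is tried as an index label...
in_index = "_bla" in s.index   # DatetimeIndex.__contains__ -> get_loc
print(in_index)                # False, but only after a failed datetime parse

# ...whereas a real datetime-like key resolves via the same get_loc machinery.
loc = s.index.get_loc(pd.Timestamp("2012-01-01 00:00:05"))
print(loc)
```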

@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone Feb 19, 2018

@chris-b1

Contributor

chris-b1 commented Feb 19, 2018

I didn't know we did dot-attribute indexing for a Series. Seems odd; is this something we want to support, or that actually gets used?

In [10]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [11]: s.c
Out[11]: 3
@chris-b1

Contributor

chris-b1 commented Feb 19, 2018

I guess it is fairly prominent in the indexing docs, just missed it.

https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

@jorisvandenbossche

Member

jorisvandenbossche commented Feb 19, 2018

Yeah, the DataFrame equivalent is better known and more used; I personally don't think many people do that for Series.

For the original issue, I think __getattr__ should somehow use a 'smarter' contains check, because I don't think we want to support arbitrary strings being looked up as attributes against a DatetimeIndex (I wonder whether there is even a string that would be both a valid attribute name and a valid label, as an attribute name cannot start with a number).
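One way such a 'smarter' check could look is to short-circuit before ever hitting the index engine: a valid Python attribute name can never parse as a datetime, so the fallback could skip the lookup entirely for datetime-like indexes. A purely hypothetical sketch (the helper name `maybe_index_label` is illustrative, not an actual pandas internal):

```python
import pandas as pd

def maybe_index_label(obj: pd.Series, name: str):
    """Hypothetical __getattr__ helper: only consult the index when the
    name could plausibly be a label of this index type."""
    # Datetime-like indexes can never contain a valid identifier:
    # identifiers cannot start with a digit, while datetime strings must.
    if isinstance(obj.index, (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)):
        raise AttributeError(name)
    if name in obj.index:
        return obj[name]
    raise AttributeError(name)

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(maybe_index_label(s, "c"))  # -> 3
```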

@chris-b1

Contributor

chris-b1 commented Feb 19, 2018

There also seems to be a fundamental performance problem with DatetimeIndex.get_loc, maybe the same as #17754

In [26]: dti = s.index

In [27]: dti.get_loc('2012-01-01')
Out[27]: slice(0, 86400, None)

In [28]: %timeit dti.get_loc('2012-01-01')
1.24 s ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [29]: %timeit dti.get_loc(pd.Timestamp('2012-01-01'))
11 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
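Until get_loc itself is fixed, the workaround on the user side is to hand it a Timestamp instead of a string, so the string parsing (and its slow failure path) never happens. A small sketch of the fast path:

```python
import pandas as pd

dti = pd.date_range("2012-01-01", freq="s", periods=100_000)

# Converting the key once up front avoids the repeated string-parsing
# attempts inside DatetimeIndex.get_loc.
key = pd.Timestamp("2012-01-01 00:00:10")
loc = dti.get_loc(key)
print(loc)  # position of the 10th second
```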
@jorisvandenbossche

Member

jorisvandenbossche commented Feb 20, 2018

Explored a bit what DatetimeIndex.get_loc and DatetimeEngine.get_loc are doing, and a lot is going on:

  • Something like dti.get_loc('blabla') ultimately calls DatetimeEngine.get_loc twice, and so does dti.get_loc('2012-01-01 00:00:00'), even though we can know in advance that DatetimeEngine.get_loc will fail for strings (it needs a datetime-like).
  • A single call to DatetimeEngine.get_loc('string'), although it can never succeed, takes a lot of time. The ultimate reason is that it tries values.searchsorted(key), where values is the int64 data and key is the string. Under the hood, numpy then casts the int values to string or object (whatever the common dtype between values and key is), which is slow. For a string key the resulting position is out of range (just past the right end of the values); we check this and raise a KeyError, while for a string we should never have tried in the first place.
  • If I use int(1e6) - 1 instead of int(1e6) values, we are below the size threshold and a different code path is taken. In that case the situation is better (we still do unneeded get_loc tries, but they are not as slow there).
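The dtype cast behind the second bullet can be seen in isolation: numpy's common type for int64 and a string is itself a string dtype, so every integer in the index would be stringified before the comparison. A minimal sketch of just the promotion step (a sketch of the mechanism, not pandas code):

```python
import numpy as np

values = np.arange(5, dtype="int64")
key = np.asarray("_bla")

# The common dtype numpy picks for int64 + str is a unicode string dtype
# (wide enough to hold any int64), so the whole int64 array gets cast
# element by element before searchsorted can compare -- the slow part.
common = np.promote_types(values.dtype, key.dtype)
print(common)
```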

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Apr 24, 2018

@TomAugspurger

Contributor

TomAugspurger commented Apr 26, 2018

Correctness question, is this the right output?

In [5]: pd.date_range('2017', periods=12)
Out[5]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12'],
              dtype='datetime64[ns]', freq='D')

In [6]: pd.date_range('2017', periods=12).get_loc("2017")
Out[6]: slice(0, 12, None)

The docstring says an int is returned. The closest analog I can think of is IntervalIndex with overlapping intervals.

In [12]: pd.IntervalIndex.from_tuples([(0, 1), (0, 2), (0, 3)]).get_loc(0.5)
Out[12]: array([0, 1, 2])
@TomAugspurger

Contributor

TomAugspurger commented Apr 26, 2018

Second question, should DatetimeIndex.get_loc(int) ever work? We test that it does.

In [11]: s.index.get_loc(s.index[0].value)
Out[11]: 0

That's taking the underlying integer representation. cc @jreback

@jorisvandenbossche

Member

jorisvandenbossche commented Apr 26, 2018

Not really sure, should dive into the code again.

But, to fix the actual regression, I think the easier path will be to avoid any get_loc call coming from __getattr__ altogether, as I mentioned in #19764 (comment) (not that get_loc shouldn't be fixed as well)

@TomAugspurger

Contributor

TomAugspurger commented Apr 26, 2018

Indeed, I got curious though :) Perhaps I should keep performance questions in #17754.

I'll investigate the getattr stuff now (for 0.23).

@jorisvandenbossche

Member

jorisvandenbossche commented Apr 26, 2018

So currently, Series/DataFrame.__getattr__ does an if name in self._info_axis, which calls Index.__contains__, which does the engine lookup.

So a quick dirty "hack" would be to add a _contains_getattr method to the base Index class which is equal to __contains__, and then DatetimeLikeIndex can override this to just return False (as a valid attribute name can never be contained in the index).

Is that too dirty of a hack?
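The proposal above, sketched as hypothetical standalone code (the class shapes and method name mirror the suggestion, not what eventually landed in pandas):

```python
# Hypothetical sketch: __getattr__ would consult a dedicated hook instead
# of __contains__, and datetime-like indexes override that hook.
class Index:
    def __init__(self, labels):
        self._labels = set(labels)

    def _contains_getattr(self, name):
        # Base behaviour: identical to __contains__.
        return name in self._labels

class DatetimeLikeIndex(Index):
    def _contains_getattr(self, name):
        # A valid attribute name can never be a datetime label,
        # so skip the (expensive) engine lookup entirely.
        return False

print(Index(["a", "b"])._contains_getattr("a"))
print(DatetimeLikeIndex(["2012-01-01"])._contains_getattr("_bla"))
```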

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 26, 2018

@TomAugspurger

Contributor

TomAugspurger commented Apr 26, 2018

See #20834.
