
PERF: don't call RangeIndex._data unnecessarily #26565

Merged
merged 4 commits into from Jun 1, 2019

Conversation

@topper-123
Contributor

commented May 29, 2019

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I've looked into RangeIndex and found that it creates and caches an int64 array the first time the RangeIndex._data property is accessed. This means that in many cases a RangeIndex has the same memory consumption and the same speed as an Int64Index.
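To give a rough feel for the memory difference described above, here is a minimal sketch using only the standard library. It is not pandas code: `array('q')` is used as an assumed stand-in for the int64 array that `RangeIndex._data` would build, and the variable names are illustrative.

```python
import sys
from array import array

n = 1_000_000

# A range stores only start, stop and step, so its size does not grow with n.
lazy = range(n)
lazy_bytes = sys.getsizeof(lazy)

# Materializing the same values as 64-bit integers costs 8 bytes per element.
# array('q') stands in here for the int64 ndarray that pandas would build.
materialized = array("q", lazy)
materialized_bytes = len(materialized) * materialized.itemsize

print(lazy_bytes, materialized_bytes)
```

The exact byte counts depend on the interpreter, but the lazy object stays small while the materialized one grows linearly with `n`.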

This PR improves on that situation by giving RangeIndex custom .get_loc and ._format_with_header methods. This avoids the calls to ._data in some cases, which improves both speed and memory consumption (see performance improvements below). There are probably other cases where RangeIndex._data can be avoided, which I'll investigate over the coming days.

>>> %timeit pd.RangeIndex(1_000_000).get_loc(900_000)
8.95 ms ± 485 µs per loop  # master
4.31 µs ± 303 ns per loop  # this PR
>>> rng = pd.RangeIndex(1_000_000)
>>> %timeit rng.get_loc(900_000)
17.3 µs ± 392 ns per loop  # master
547 ns ± 8.26 ns per loop  # this PR. get_loc is now lightning fast
>>> df = pd.DataFrame({'a': range(1_000_000)})
>>> %timeit df.loc[800_000: 900_000]
132 µs ± 5.79 µs per loop  # master
89 µs ± 2.95 µs per loop  # this PR
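The speed-up for `get_loc` plausibly comes from looking the key up in a Python `range` instead of in a materialized array. The sketch below illustrates that idea on a plain `range`; the function name `range_get_loc` is hypothetical and this is not the code merged in the PR.

```python
def range_get_loc(rng: range, key: int) -> int:
    """Return the position of ``key`` in ``rng`` without building an array."""
    try:
        # For integers, range.index is computed from start/stop/step
        # rather than by scanning every element.
        return rng.index(key)
    except ValueError:
        # Index lookups conventionally signal a missing label with KeyError.
        raise KeyError(key) from None


rng = range(1_000_000)
pos = range_get_loc(rng, 900_000)

# With a step, the position is (key - start) // step when key is a member.
stepped = range_get_loc(range(0, 100, 10), 30)
```

A key that is not a member of the range (for example 35 in `range(0, 100, 10)`) raises `KeyError` in this sketch.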

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch from 3e29889 to 8e4c734 May 29, 2019

@@ -64,6 +65,8 @@ class RangeIndex(Int64Index):
    _typ = 'rangeindex'
    _engine_type = libindex.Int64Engine

    # check whether self._data has been called
    _has_called_data = False  # type: bool

@topper-123

topper-123 May 29, 2019

Author Contributor

This is added to check if ._data has been called, without actually calling it.

@@ -215,6 +221,9 @@ def _format_data(self, name=None):
        # we are formatting thru the attributes
        return None

    def _format_with_header(self, header, na_rep='NaN', **kwargs):
        return header + [pprint_thing(x) for x in self._range]

@topper-123

topper-123 May 29, 2019

Author Contributor

Without this I found that reprs of small DataFrames call RangeIndex.values and therefore RangeIndex._data. This avoids that.

@jreback

jreback May 30, 2019

Contributor

could do
header + list(map(pprint_thing, self._range))
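The two spellings discussed in this thread produce the same list. A minimal sketch, using `str` as an assumed stand-in for pandas' `pprint_thing` and hypothetical function names:

```python
def format_with_header(header, rng, formatter=str):
    # List-comprehension form, as in the diff above.
    return header + [formatter(x) for x in rng]


def format_with_header_map(header, rng, formatter=str):
    # map-based form, as suggested in the review comment.
    return header + list(map(formatter, rng))


rng = range(0, 30, 10)
a = format_with_header(["idx"], rng)
b = format_with_header_map(["idx"], rng)
```

Both versions iterate over the `range` directly, so neither needs a materialized array.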

@topper-123 topper-123 added this to the 0.25.0 milestone May 29, 2019

@topper-123 topper-123 changed the title PERF: don't call RangeIndex._data unneccesary PERF: don't call RangeIndex._data unneccesarily May 29, 2019

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch from 8e4c734 to 7bc8655 May 29, 2019

@codecov

commented May 29, 2019

Codecov Report

Merging #26565 into master will decrease coverage by 50.09%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.77%   41.68%   -50.1%     
==========================================
  Files         174      174              
  Lines       50649    50663      +14     
==========================================
- Hits        46483    21118   -25365     
- Misses       4166    29545   +25379
Flag Coverage Δ
#multiple ?
#single 41.68% <92.85%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/range.py 53.76% <92.85%> (-44.22%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.14% <0%> (-89.86%) ⬇️
... and 129 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a91da0c...7bc8655.

@codecov

commented May 29, 2019

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50644    50660      +16     
==========================================
+ Hits        46516    46527      +11     
- Misses       4128     4133       +5
Flag Coverage Δ
#multiple 90.38% <100%> (ø) ⬆️
#single 41.71% <100%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/range.py 98.06% <100%> (+0.08%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.81% <0%> (-0.11%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f31865...ff805da.

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch 2 times, most recently from 6e71708 to a5cad77 May 29, 2019

@pep8speaks

commented May 29, 2019

Hello @topper-123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-01 16:07:40 UTC

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch from a5cad77 to a293738 May 29, 2019

@@ -164,6 +168,8 @@ def _simple_new(cls, start, stop=None, step=None, name=None,
        for k, v in kwargs.items():
            setattr(result, k, v)

        result._range = range(result._start, result._stop, result._step)

@jreback

jreback May 30, 2019

Contributor

we could actually remove the _start, _stop, _step properties as well?

@topper-123

topper-123 May 30, 2019

Author Contributor

Yes, I'm planning to do that in an upcoming PR.

Python 3's range supports slicing, which Python 2's xrange didn't, so this refactoring will also allow dropping the custom slicing operations in RangeIndex.
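For reference, this is what slicing a built-in `range` looks like in Python 3. The values are illustrative and not taken from the PR:

```python
r = range(0, 100, 10)

# Slicing a range returns another range; no elements are materialized.
sliced = r[2:5]        # range(20, 50, 10)
every_other = r[::2]   # range(0, 100, 20)
reversed_r = r[::-1]   # range(90, -10, -10)

print(sliced, every_other, reversed_r)
```

Because the result is itself a `range`, its start, stop and step can be read back directly from the sliced object.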

# Calling RangeIndex._data caches an array of the same length.
# This tests whether calling various methods triggers a call to RangeIndex._data.
idx = RangeIndex(0, 100, 10)
assert idx._has_called_data is False

@jreback

jreback May 30, 2019

Contributor

I would suggest that you monkeypatch the class here; it's a bit cleaner, as the code then doesn't need this attribute.

@topper-123

topper-123 May 31, 2019

Author Contributor

The _data attribute is a property (previously cache_readonly), and in neither case is it technically possible to dynamically monkey-patch _data. I could subclass RangeIndex and add a new property, but I'm not sure if that's better than this?
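The subclassing alternative mentioned here could look roughly like the sketch below. `LazyIndex` is a hypothetical stand-in for RangeIndex (it is not pandas code), and `TrackedLazyIndex` is an assumed test-only subclass that records whether `_data` was accessed.

```python
class LazyIndex:
    """Hypothetical stand-in for RangeIndex; not pandas code."""

    def __init__(self, start, stop, step):
        self._range = range(start, stop, step)

    @property
    def _data(self):
        # Materializes all values; this is the expensive path.
        return list(self._range)

    def get_loc(self, key):
        # Cheap path: works on the range and never touches _data.
        return self._range.index(key)

    def to_list(self):
        # Expensive path: goes through _data.
        return self._data


class TrackedLazyIndex(LazyIndex):
    """Test-only subclass that records whether _data was accessed."""

    data_accessed = False

    @property
    def _data(self):
        self.data_accessed = True
        return super()._data


idx = TrackedLazyIndex(0, 100, 10)
loc = idx.get_loc(20)
accessed_after_get_loc = idx.data_accessed
values = idx.to_list()
accessed_after_to_list = idx.data_accessed
```

The trade-off is that the tracking lives entirely in test code, at the cost of testing a subclass rather than the class itself.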

@topper-123 topper-123 changed the title PERF: don't call RangeIndex._data unneccesarily PERF: don't call RangeIndex._data unnecessarily May 30, 2019

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch 3 times, most recently from 8f65498 to ff805da May 31, 2019

@codecov-io

commented May 31, 2019

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50644    50659      +15     
==========================================
+ Hits        46516    46527      +11     
- Misses       4128     4132       +4
Flag Coverage Δ
#multiple 90.38% <100%> (ø) ⬆️
#single 41.73% <100%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/range.py 98.05% <100%> (+0.08%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f31865...ff805da.

@codecov-io

commented May 31, 2019

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.85%   91.85%   -0.01%     
==========================================
  Files         174      174              
  Lines       50707    50722      +15     
==========================================
+ Hits        46578    46589      +11     
- Misses       4129     4133       +4
Flag Coverage Δ
#multiple 90.39% <100%> (ø) ⬆️
#single 41.78% <100%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/range.py 98.05% <100%> (+0.08%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0dbb99e...c72758b.

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch from ff805da to 618f63f May 31, 2019

@jreback

jreback approved these changes Jun 1, 2019

Contributor

left a comment

small comment, merge on green.

def _data(self):
return np.arange(self._start, self._stop, self._step, dtype=np.int64)
if self._cached_data is None:

@jreback

jreback Jun 1, 2019

Contributor

Can you give this a doc-string (e.g. that cached_data is actually an int array and is constructed only if necessary, for performance reasons)?
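A documented lazy property along the lines requested could look like the following sketch. `LazyRange` is a hypothetical class, a plain list stands in for the int64 array, and the docstring wording is illustrative rather than what was merged.

```python
class LazyRange:
    """Hypothetical illustration of a lazily built cache; not the pandas class."""

    def __init__(self, start, stop, step=1):
        self._start, self._stop, self._step = start, stop, step
        self._cached_data = None

    @property
    def _data(self):
        """
        Materialized values, built only on first access for performance.

        The result is stored in ``_cached_data`` so that callers can check
        whether it exists without triggering its construction.
        """
        if self._cached_data is None:
            # A list stands in for the int64 array in this sketch.
            self._cached_data = list(
                range(self._start, self._stop, self._step)
            )
        return self._cached_data


r = LazyRange(0, 100, 10)
before = r._cached_data   # nothing has been built yet
first = r._data           # builds and stores the values
second = r._data          # returns the stored object
```

Checking `_cached_data is None` lets a test confirm that an operation avoided materialization, without needing a separate flag attribute.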

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch 2 times, most recently from 6e037ac to 61e93e5 Jun 1, 2019

@topper-123 topper-123 force-pushed the topper-123:range_index_calls_data branch from 61e93e5 to c72758b Jun 1, 2019

@topper-123 topper-123 merged commit 437efa6 into pandas-dev:master Jun 1, 2019

12 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
pandas-dev.pandas Build #20190601.28 succeeded
pandas-dev.pandas (Checks) Checks succeeded
pandas-dev.pandas (Docs) Docs succeeded
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded

@topper-123 topper-123 deleted the topper-123:range_index_calls_data branch Jun 1, 2019

@topper-123 topper-123 referenced this pull request Jun 2, 2019
3 of 4 tasks complete

vaibhavhrt added a commit to vaibhavhrt/pandas that referenced this pull request Jun 6, 2019
