Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Unnecessary hash table with RangeIndex #16685

Closed
chris-b1 opened this issue Jun 12, 2017 · 5 comments

Comments

@chris-b1
Copy link
Contributor

commented Jun 12, 2017

Example

def log_memory():
    import os
    import gc
    import psutil
    for i in range(3):
        gc.collect(i)
    process = psutil.Process(os.getpid())
    mem_usage = process.memory_info().rss / float(2 ** 20)
    print("[Memory usage] {:12.1f} MB".format(
        mem_usage
    ))

In [20]: df = pd.DataFrame({'a': np.arange(1000000)})

In [23]: log_memory()
[Memory usage]        132.4 MB

In [24]: df.loc[5, :]
Out[24]: 
a    5
Name: 5, dtype: int32

In [25]: log_memory()
[Memory usage]        172.2 MB

Rather than materializing the hash table, should directly convert labels into positions. Low priority in my opinion, atypical to be using loc with a RangeIndex.

pandas 0.20.2

@chris-b1 chris-b1 added this to the Next Major Release milestone Jun 12, 2017

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 14, 2017

hmm ,this should not be creating a hash table at all (though it would if it morphed to an int64index)

@chris-b1

This comment has been minimized.

Copy link
Contributor Author

commented Jun 14, 2017

The _engine for RangeIndex still points at the Int64Index one, so a hash table will be created if something calls self._engine. So either need to make a simplified RangeIndexEngine or override the indexing methods.

_engine_type = libindex.Int64Engine

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 14, 2017

I think that _engine is purely for compat on some methods. might be easiest/best to create a pretty dummy RI Engine in cython.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Dec 20, 2017

I think that _engine is purely for compat on some methods

_engine is still used in eg get_loc, so often created.

Related to this, I just stumbled on the fact that currenlty RangeIndex.memory_usage() is inaccurate, as it does not take into account the engine.

And given that the actual engine takes a lot more memory as the values (for int64index it is 5x more memory on an example case 500k range-like int index), a RangeIndex is not much more memory-efficient as a Int64Index (its big selling point I thought?) once the engine is created.

@chris-b1

This comment has been minimized.

Copy link
Contributor Author

commented Dec 20, 2017

RangeIndex is not much more memory-efficient as a Int64Index (its big selling point I thought?) once the engine is created.

This is true, but I think it would be relatively rare for the engine to populated when a RangeIndex is being used - typically it is when someone doesn't care about the index at all, so would likely be using iloc if any indexing at all.

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019

@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 Jun 30, 2019

jreback added a commit that referenced this issue Jun 30, 2019

PERF: do not instantiate IndexEngine for standard lookup over RangeIn…
…dex (#27119)

* TST: actually test #16877 on numeric index (not just RangeIndex)

* PERF: do not instantiate IndexEngine for standard lookup over RangeIndex

closes #16685

alimcmaster1 added a commit to alimcmaster1/pandas that referenced this issue Jun 30, 2019

PERF: do not instantiate IndexEngine for standard lookup over RangeIn…
…dex (pandas-dev#27119)

* TST: actually test pandas-dev#16877 on numeric index (not just RangeIndex)

* PERF: do not instantiate IndexEngine for standard lookup over RangeIndex

closes pandas-dev#16685
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.