Add a more memory-efficient RangeIndex-sort of thing to avoid large arange(N) indexes in some cases #939

Closed
wesm opened this issue Mar 18, 2012 · 13 comments · Fixed by #11892
Comments

@wesm wesm commented Mar 18, 2012

No description provided.

@hayd hayd commented Sep 2, 2014

Here's @jtratner's tree for this https://github.com/jtratner/pandas/tree/add-range-index (based off Wes').

(I keep struggling to find it. Perhaps I'll rebase and open a PR; it would be interesting to experiment with this.)

@immerrr immerrr commented Sep 2, 2014

And there's likely some code that can be salvaged from the BlockPlacement class I added when refactoring block managers.

@jreback jreback modified the milestones: 0.16, 0.15.0 Sep 23, 2014
@jreback jreback modified the milestones: 0.16, 0.15.1 Oct 7, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

@ARF1 ARF1 commented Apr 20, 2015

I am interested in RangeIndexes as well. Does anybody know what the state of @jtratner's tree is? Is there anything I can do to help push this towards a PR?

@jreback jreback commented Apr 20, 2015

It's in a reasonable state.
I would welcome an updated PR for it.

@hayd hayd commented Apr 20, 2015

I linked to it above, it's here https://github.com/jtratner/pandas/tree/add-range-index

IIRC it ought to rebase pretty cleanly (it's mainly new files). +1 to a resurrection!

@ARF1 ARF1 commented Apr 21, 2015

Ok, I rebased jtratner's tree: https://github.com/ARF1/pandas/tree/range_index

Of course it was too much to hope it would pass all the tests. One can always dream...

Having not worked on pandas internals before, I am probably not the best person to move this forward efficiently. If somebody else feels they are better suited, I would be happy to contribute individual fixes to their tree.

If there are no takers, I will give it a shot myself. Though I will probably need a fair amount of hand-holding to get started.

Running the already written tests, it appears the Index api has changed. Is that possible? Can anybody point me to the PR / issue to help me understand how to adapt the code appropriately?

pandas/core/range.py#L61-L76 has the following segment in its __init__() which seems no longer possible:

self = np.array([], dtype='int64').view(RangeIndex)
...
self.left = left

As I understand it, RangeIndex, being a subclass of Index, used to be an ndarray subclass but no longer is, right?

@jreback jreback commented Apr 21, 2015

Yes, Index is no longer a subclass of ndarray since 0.15.0. You can look at the recently merged CategoricalIndex for the API.

@jorisvandenbossche jorisvandenbossche commented Apr 21, 2015

@ARF1 The easiest thing to do is probably to just open a pull request based on your branch. Then it is easier to comment on the code and give feedback.

@ARF1 ARF1 commented Apr 21, 2015

@jorisvandenbossche Ok, I did not want to "pollute" the PR list with unfinished code. The rebased tree is available as PR #9961.

@jreback Thanks. I will take a look and see if I can find my way around...

@jreback jreback modified the milestones: 0.17.0, Next Major Release Apr 24, 2015
@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@jreback jreback added Prio-high and removed Prio-medium labels Aug 22, 2015
@jreback jreback mentioned this issue Dec 23, 2015
2 of 2 tasks complete
jreback added a commit to jreback/pandas that referenced this issue Dec 24, 2015
`RangeIndex(1, 10, 2)` is a memory-saving alternative to
`Index(np.arange(1, 10, 2))`; cf. pandas-dev#939.

This re-implementation is compatible with the current `Index()` api and is a
drop-in replacement for `Int64Index()`. It automatically converts to
Int64Index() when required by operations.

At present the type is conserved for only a small number of operations
(e.g. slicing, inner-, left- and right-joins). Most other operations
trigger creation of an equivalent Int64Index (or at least an equivalent numpy
array) and fall back to its implementation.
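
The idea described above can be illustrated with a minimal pure-Python sketch. It is not the pandas implementation: `LazyRangeIndex`, its `values` property and the `materializations` counter are hypothetical names used only for illustration, and a plain list stands in for the materialized Int64Index. The sketch stores only start/stop/step, keeps its type under slicing, and builds the dense values once on first use.

```python
import sys


class LazyRangeIndex:
    """Toy range-backed index (hypothetical class, not pandas' RangeIndex)."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)
        self._cached = None          # dense values, built on first use
        self.materializations = 0    # counter kept only for demonstration

    # Read-only properties keep the object effectively immutable.
    @property
    def start(self):
        return self._range.start

    @property
    def stop(self):
        return self._range.stop

    @property
    def step(self):
        return self._range.step

    def __len__(self):
        return len(self._range)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing a range yields another range, so the type is conserved.
            r = self._range[key]
            return LazyRangeIndex(r.start, r.stop, r.step)
        return self._range[key]

    @property
    def values(self):
        # Fallback path: materialize once, then reuse the cached list.
        if self._cached is None:
            self._cached = list(self._range)
            self.materializations += 1
        return self._cached


idx = LazyRangeIndex(1, 10, 2)
sliced = idx[1:3]                     # still a LazyRangeIndex

# The compact form has constant size; the dense form grows with N.
compact_size = sys.getsizeof(range(1_000_000))
dense_size = sys.getsizeof(list(range(1_000_000)))
```

Operations that need the actual values go through `values`, so the cost of materializing is paid at most once per object.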

This PR also extends the functionality of the `Index()` constructor to allow
creation of `RangeIndexes()` with
```
Index(20)
Index(2, 20)
Index(0, 20, 2)
```
in analogy to
```
range(20)
range(2, 20)
range(0, 20, 2)
```
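
Because a range is fully described by start, stop and step, its length follows from arithmetic alone and never requires the values to be stored. A small sketch of that calculation, checked against Python's built-in `range` (the helper name `range_len` is an assumption made for this example):

```python
def range_len(start, stop, step=1):
    """Number of elements in range(start, stop, step), computed arithmetically."""
    if step == 0:
        raise ValueError("step must not be zero")
    if step > 0:
        n = (stop - start + step - 1) // step      # ceiling division
    else:
        n = (start - stop - step - 1) // (-step)   # same, counting downwards
    return max(0, n)


# The three call forms from the commit message, with range's defaults spelled out.
lengths = [range_len(0, 20), range_len(2, 20), range_len(0, 20, 2)]
```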

restore Index() fastpath precedence

Various fixes suggested by @jreback and @shoyer

Cache a private Int64Index object the first time it or its values are required.
Restore Index(5) as error. Restore its test. Allow Index(0, 5) and Index(0, 5, 1).
Make RangeIndex immutable. See start, stop, step properties.
In test_constructor(): check class, attributes (possibly including dtype).
In test_copy(): check that copy is not identical (but equal) to the existing.
In test_duplicates(): Assert is_unique and has_duplicates return correct values.

fix slicing

fix view

Set RangeIndex as default index
* enh: set RangeIndex as default index
* fix: pandas.io.packers: encode() and decode() for RangeIndex
* enh: array argument pass-through
* fix: reindex
* fix: use _default_index() in pandas.core.frame.extract_index()
* fix: pandas.core.index.Index._is()
* fix: add RangeIndex to ABCIndexClass
* fix: use _default_index() in _get_names_from_index()
* fix: pytables tests
* fix: MultiIndex.get_level_values()
* fix: RangeIndex._shallow_copy()
* fix: null-size RangeIndex equals() comparison
* enh: make RangeIndex.is_unique immutable

enh: various performance optimizations

 * optimize argsort()
 * optimize tolist()
 * comment clean-up
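
One way such optimizations can work, sketched here on Python's built-in `range` rather than on pandas itself: the sort order of a range is determined by the sign of its step, so an argsort needs no comparisons at all. The helper names below are hypothetical, and this is an assumption about the general technique, not a copy of what the commit does.

```python
def range_argsort(r):
    """Positions that would sort range ``r`` ascending, without comparing values."""
    n = len(r)
    if r.step > 0:
        return list(range(n))              # already ascending: identity order
    return list(range(n - 1, -1, -1))      # descending: reversed positions


def range_tolist(r):
    """Build the list straight from the range, with no intermediate array."""
    return list(r)


descending = range(10, 0, -3)              # 10, 7, 4, 1
order = range_argsort(descending)
```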
jreback added a commit to jreback/pandas that referenced this issue Jan 13, 2016
jreback added a commit to jreback/pandas that referenced this issue Jan 16, 2016