Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduction of RangeIndex #9977

Closed
wants to merge 7 commits into from
Closed

Introduction of RangeIndex #9977

wants to merge 7 commits into from

Conversation

ARF1
Copy link

@ARF1 ARF1 commented Apr 24, 2015

closes #939

RangeIndex(1, 10, 2) is a memory saving alternative to Index(np.arange(1, 10,2)): c.f. #939.

This re-implementation is compatible with the current Index() api and is a drop-in replacement for Int64Index(). It automatically converts to Int64Index() when required by operations.

At present only for a minimum number of operations the type is conserved (e.g. slicing, inner-, left- and right-joins). Most other operations trigger creation of an equivalent Int64Index (or at least an equivalent numpy array) and fall back to its implementation.

This PR also extends the functionality of the Index() constructor to allow creation of RangeIndex() with

Index(20)
Index(2, 20)
Index(0, 20, 2)

in analogy to

range(20)
range(2, 20)
range(0, 20, 2)

TODO:

  • Cache a private Int64Index object the first time it or its values are required.
  • Restore Index(5) as error. Restore its test. Allow Index(0, 5) and Index(0, 5, 1).
  • Make RangeIndex immutable. See start, stop, step properties.
  • In test_constructor(): check class, attributes (possibly including dtype).
  • In test_copy(): check that copy is not identical (but equal) to the existing.
  • In test_join_non_unique(): test .append for ri.append(int64index) and int64index.append(ri).
  • In test_duplicates(): Assert is_unique and has_duplicates return correct values.
  • After consensus on desired behaviour, adjust test_view().
  • After consensus on constructor, adjust test_constructor() and possible add further tests.
  • After consensus on RangeIndex() validity, adjust test_pickle_compat_construction().
  • After consensus on slice behaviour, test for correct return type in test_slice_specialised().
  • Accept xrange objects as input
  • Setting RangeIndex as the default index.
  • Clean up Index constructor: use ABCs and named arguments
  • In __getitem__(): replace np.isscalar to exclude singe-element numpy arrays
  • Tests for pandas.io.packers

In need of consensus:

  • If we disallow Index(10) should we also disallow RangeIndex(10)? That would seem like an unfortunate api break with respect to built-in range(), xrange() and numpy arange(). Decision: allow
  • Should RangeIndex() be allowed as a constructor alternative to the kludgy Index(np.empty(0, dtype='...'))? Decision: allow
  • Should RangeIndex accept array inputs? If so, what should it do with them? Should it try to infer a corresponding RangeIndex? What should it do with inputs it cannot infer? Decision: delegate all array inputs to Int64Index.
  • Should the RangeIndex constructor accept dtype as an argument? Decision: accept, but only for array inputs.
  • If so, should it accept non-int64 dtypes as valid input? Decision: Irrelevant due to delegation.
  • If so, does it need to check that the other inputs are within the range of the dtype? Decision: Irrelevant due to delegation.
  • Related to constructor question, should slicing with a list return? Always Int64Index or RangeIndex when possible and Int64Index when not? Decision: slicing with list will always return Int64Index.
  • What should RangeIndex(10).view('i8') return? I believe RangeIndex(10). Decision: return RangeIndex(10).
  • What should RangeIndex(10).view('i4') return? I believe Index(RangeIndex(10), dtype='i4'). Decision: return Index(RangeIndex(10), dtype='i4').

ARF added 2 commits April 24, 2015 13:11
`RangeIndex(1, 10, 2)` is a memory saving alternative to
`Index(np.arange(1, 10,2))`: c.f. #939.

This re-implementation is compatible with the current `Index()` api and is a
drop-in replacement for `Int64Index()`. It automatically converts to
Int64Index() when required by operations.

At present only for a minimum number of operations the type is
conserved (e.g. slicing, inner-, left- and right-joins). Most other operations
trigger creation of an equivalent Int64Index (or at least an equivalent numpy
array) and fall back to its implementation.

This PR also extends the functionality of the `Index()` constructor to allow
creation of `RangeIndexes()` with
```
Index(20)
Index(2, 20)
Index(0, 20, 2)
```
in analogy to
```
range(20)
range(2, 20)
range(0, 20, 2)
```
@jreback
Copy link
Contributor

jreback commented Apr 24, 2015

@ARF1 looks like a good start, but need some more testing

  • https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L2145 needs to default to a RangeIndex, otherwise the entire point of this change is lost. There might be other things like this where Int64Index are hard-coded. I think need to change them (but look thru them as well)
  • need a section in tests/test_frame.py, where you check .set_index/reset_index and assigning an index. e.g. df.index = range(5) should tests that its doing the right thing (and work !)
  • need a section (make a separate class like I did for TestCategoricalIndex in tests/test_indexing) to cover various indexing cases. This might be somewhat duplicative, but you can simply cover the most common (e.g.. []/iloc/loc with scalar/slice/list/index/array indexers)
  • eventually will need a small doc/warning for 0.17.0 (you can't actually write this atm as it doesn't exist)
  • my brief look saw a couple of changed/deleted tests. you cannot do that for this kind of non-API change. That is the biggest issue here, that you inadvertantly change some API guarantees in (maybe subtle) ways. That is why we need a lot of tests/validation for this.
  • some kind of vbench / metric to see that this indeed a good idea (I think we agree it is, but a couple of simple metrics for showing memory savings / faster ops, whatever). These tend to become part of the whatsnew (like hey pandas changed this and here's why it was done).

put up a list of todo's at the top of the PR, as detailed as possible, and check off as you are done.

We can do an in-depth review once some of these are checked off.

@jreback jreback added this to the 0.17.0 milestone Apr 24, 2015
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance API Design labels Apr 24, 2015
copy = None
elif isinstance(copy, int):
range_constructor = True
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems very kludgy, what is the issue here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real issue. It is not required for RangeIndex. I just thought it might be convenient for people to be able to use e.g. Index(5) to create a RangeIndex(5).

It is kludgy since it needs to ensure, that the intent of the user really was to create a RangeIndex when calling Index(...) with a given set of parameters. I don't want to grab inputs that I shouldn't.

If the whole Index(...) calling options extension is an issue or just too complicated I am happy to take it out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, see my other comment. if data is a scalar, then its an error (and btw use lib.isscalar)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather we only implement the range like construction when using RangeIndex explicitly. To make a range index from the index constructor, you could write something like Index(range(...)) (or xrange on Python 2). This could call an alternate constructor, which might just be _simple_new. I'll follow up with a more explicit example when I'm not on my phone.

@shoyer
Copy link
Member

shoyer commented Apr 24, 2015

Here's thought on the constructor:

class Index(IndexOpsMixin, PandasObject):
    # ...

    def __new__(cls, data=None, dtype=None, copy=False, name=None, fastpath=False,
                 tupleize_cols=True, **kwargs):
        # ...

        if all(hasattr(data, a) for a in ['start', 'stop', 'step']):
            return RangeIndex._simple_new(data, name)

# ...

class RangeIndex(Int64Index):
   def __new__(cls, *args, **kwargs):
       # range should be xrange on Python 2
       range_obj = range(*args)
       name = kwargs.pop('name', None)
       if kwargs:
           raise TypeError('too many keyword arguments')
       return _simple_new(range_obj, name)

    @classmethod
    def _simple_new(cls, range_obj, name=None):
        result = object.__new__(cls)
        result._start = range_obj.start
        result._stop = range_obj.stop
        result._step = range_obj.step
        result.name = name
        return result

There are two nice things about this approach:

1, the Index constructor remains entirely backwards compatible.
2. you can explicitly make a range index with either RangeIndex(...) or Index(range(...)).

@jreback
Copy link
Contributor

jreback commented Apr 24, 2015

I think there are a ton of edge cases that can be dealt with by having a proper constructor. E.g.

Index(np.arange(5)) -> should return RangeIndex
Index(range(5)) -> RangeIndex
Index([0,1,2,3,4,5]) -> what does this return?

I don't think there will be much explict usage of RangeIndex, but it should be a drop-in replacement for Int64Index, thus should be able to accept the same exact input

@ARF1
Copy link
Author

ARF1 commented Apr 24, 2015

@jreback Thank you very much for taking the time for extensively commenting the code. I really appreciate it.

On some points I will need clarification (see comments in the line notes). Two more fundamental questions I think we should discuss follow below:

Making RangeIndex the default index

I held off on making RangeIndex the default index class since I thought a "soft phase-in" might be a better approach than an "all or nothing" approach: Once it is in the code base, people could start to use it and remaining bugs / performance issues could be ironed out before making it the default.

Also, until really all conceivable operations are conserving the RangeIndex type, there will probably be performance implications in some cases since materialization to Int64Index will be triggered by many operations (probably) which will be slower than the use of an already existing index as is the case currently.

Materialization of the index will happen repeatedly if the user calls non-RangeIndex-conserving functions repeatedly on an object indexed with a RangeIndex. (See also my line comments regarding the slicing operation.)

In its present form I do not see RangeIndex as a performance PR. In the long term I have a hunch that it could be but I think we are still far from it. Making it so, is a far larger effort I believe. Also I am not sure I am really up to that task. - I am kind of surprised I managed to get it working at all as a pandas novice. ;-)

Are you 100% certain you want to make it the default index straight away? I really was trying to keep the PR manageable to start with and optimize its performance with smaller, more manageable subsequent PRs.

changed/deleted tests

There is only the one test_constructor_corner(self) test that I deleted.

This became necessary since I suggested allowing more inputs to the Index(...) constructor. If I change that constructor I have to change its tests accordingly. (See also line comments.)

I could restore the Index(...) constructor to its current, more limited functionality in which case I can then also restore the test. Should I to that?

@shoyer
Copy link
Member

shoyer commented Apr 24, 2015

RE the cost of repeatedly allocating the values -- I would cache the array of values on the RangeIndex once accessed.

@jreback
Copy link
Contributor

jreback commented Apr 24, 2015

@ARF1 np, always nice to have someone work on the internals!

I think the original intention WAS to have RangeIndex be the default index created. However as you have noted then the semantics of the operations need to be quite clearly defined, e.g. when and when not to materialize to a baser type (e.g. we experience similar discussion on when to morph CategoricalIndex to a Index).

@shoyer does have a good point that simple, predictible rules are useful.

If its release in a more limited scope (e.g. say non-default and slices only preserving, with a similar to existing constructor). Then how/when would this actually be created / where would it be useful enough that one would explicity create it?

See that's the rub. Pandas does lots of inference for users (arguable points whether too much at times of course!).

So if somethings natural (and I think we agree this is), then it should be put in the natural places for utility.

@shoyer
Copy link
Member

shoyer commented Apr 24, 2015

@jreback My notion of a drop in replacement is different. I don't think we should allow for RangeIndex(np.arange(...)) and we certainly should not try to infer a range index from Index(np.arange(...)). That is simply too magical, and will have unexpected performance implications -- converting an integer ndarray to an Index should take constant, not linear time. If a user wants this sort of thing, it's not hard to construct the arguments to pass to RangeIndex on their own (e.g., usually this will just be RangeIndex(array[0], array[-1])).

I would like to see this enabled as the default index (when none is explicitly provided) for two reasons:

  1. Currently, the default index is very wasteful. It often gets created and then immediately thrown away. This happens all the time in user code.
  2. Testing RangeIndex as the default will be necessary to be confident that it works properly, and at that point, we've already done most of the hard work here.

@ARF1
Copy link
Author

ARF1 commented Apr 25, 2015

@shoyer Regarding your constructor suggestion above: the xrange object in python 2 is much more limited than the range object in python 3. In particular it does not have start, stop or step properties. We could infer these by testing the return from the object - except for stop for which we could at best determine a new value that will lead to the same results.

To be honest, I do not really see the point however. What sane user would write Index(xrange(0,10)) when Index(0, 10) does the same thing?

In short: of course we can do it, but is it worth the effort?

@shoyer Regarding your suggestion of caching the values of RangeIndex when they are accessed: I probably misunderstood you here but:

  • If we cache all the values in a RangeIndex why not just convert to a Int64Index in the first place? With your suggestion we are carrying around two objects describing the same thing that we need to keep synchronized. - Sounds like a nightmare to me.
  • If we cache only parts of the values in the RangeIndex: how would that work in practice?

@ARF1
Copy link
Author

ARF1 commented Apr 25, 2015

@jreback @shoyer I added a list of topics on which we still need a consensus (or alternatively an authoritative decision) to the OP. - I am off for today. Have a nice weekend.

@ARF1
Copy link
Author

ARF1 commented Apr 25, 2015

@shoyer Re caching values: Never mind, I was a bit slow. I think I now understand what you meant: For RangeIndex to have a sort of lazy instantiation of a private Int64Index object the first time the values or a Int64Index is required. I am going to implement it and we'll see whether we mean the same thing.

@shoyer
Copy link
Member

shoyer commented Apr 25, 2015

For caching values, you can use the cache_readonly decorator. So this looks something like:

@cache_readonly
def _data(self):
    # ...

@shoyer
Copy link
Member

shoyer commented Apr 25, 2015

As for handling range/xrange as an argument:

I think this is nice to support, simply because it's slightly fewer characters to type than pd.RangeIndex. It's also nice because it gives us an unambiguous way to use the generic Index constructor to make a RangeIndex. (I think both @jreback and I have noted that supporting Index(start, stop, step) seems like a bad idea.)

It's definitely unfortunate Python 2 doesn't define start, step and stop for xrange. But that information is all still accessible via repr. So you could write something like:

if PY2 and isinstance(data, xrange):
    g, = re.match('xrange\((.+)\)', repr(data)).groups()
    return RangeIndex(*map(int, m.split(', ')))

Not so elegant, but it gets the job done.

if np.isscalar(key):
n = int(key)
if n != key:
return super_getitem(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should always raise KeyError? Or do we actually allow for non-integer __getitem__ on a Int64Index? If I recall correctly, numpy has deprecated that sort of indexing...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, accepting floats that can be cast to integers is correct:

In [6]: pd.Int64Index([1,2,3,4])[1.0]
Out[6]: 2

Another issue is, that np.isscalar is too permissive. (It allows single-element numpy arrays.) Probably better to use isinstance(key, numbers.Integer) or something similar to replicate numpy __getitem__ behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is deprecated for all other index types except Float64Index ; we do not allow float indexers for a lot of reasons

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that changes things. Can we then for consistency reject all non-int-type inputs for RangeIndex? Especially for the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please do.

Cache a private Int64Index object the first time it or its values are required.
Restore Index(5) as error. Restore its test. Allow Index(0, 5) and Index(0, 5, 1).
Make RangeIndex immutable. See start, stop, step properties.
In test_constructor(): check class, attributes (possibly including dtype).
In test_copy(): check that copy is not identical (but equal) to the existing.
In test_duplicates(): Assert is_unique and has_duplicates return correct values.
@ARF1
Copy link
Author

ARF1 commented Apr 27, 2015

@jreback @shoyer I spun-off work on the "default index issue" to a new PR (#9999) to avoid messing up this PR with its more limited scope and to keep the discussions separate.

ARF added 4 commits May 2, 2015 00:03
* enh: set RangeIndex as default index
* fix: pandas.io.packers: encode() and decode() for RangeIndex
* enh: array argument pass-through
* fix: reindex
* fix: use _default_index() in pandas.core.frame.extract_index()
* fix: pandas.core.index.Index._is()
* fix: add RangeIndex to ABCIndexClass
* fix: use _default_index() in _get_names_from_index()
* fix: pytables tests
* fix: MultiIndex.get_level_values()
* fix: RangeIndex._shallow_copy()
* fix: null-size RangeIndex equals() comparison
* enh: make RangeIndex.is_unique immutable
 * optimize argsort()
 * optimize tolist()
 * comment clean-up
@ARF1
Copy link
Author

ARF1 commented May 7, 2015

I am considering making RangeIndex a subclass of slice to be able to use a RangeIndex for slicing objects without instantiating a numpy array. (Since the __getitem__() functions seem to consistently check for instances of slice before checking for instances of ndarray.)

What do you think?

@jreback
Copy link
Contributor

jreback commented May 7, 2015

no, don't subclass except for Index. This just makes things harder / more confusing.

@shoyer
Copy link
Member

shoyer commented May 7, 2015

Yeah, that seems likely to be quite problematic. For example, a negative start has a totally different meaning between range (start at that value) and slice (count from the end).

On Thu, May 7, 2015 at 9:39 AM, jreback notifications@github.com wrote:

no, don't subclass except for Index. This just makes things harder / more confusing.

Reply to this email directly or view it on GitHub:
#9977 (comment)

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@auvipy
Copy link

auvipy commented Dec 3, 2015

@ARF1 could u plz fix the merge conflict?

@jreback jreback mentioned this pull request Dec 23, 2015
2 tasks
@jreback
Copy link
Contributor

jreback commented Dec 23, 2015

replace by #11892
cc @ARF1

@jreback jreback closed this Dec 30, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a more memory-efficient RangeIndex-sort of thing to avoid large arange(N) indexes in some cases
4 participants