New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Create and propagate UInt64Index #14937

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
5 participants
@gfyoung
Member

gfyoung commented Dec 21, 2016

  1. Introduces and propagates UInt64Index, an index specifically for uint. xref #14935

  2. Patches bug from #14916 that makes maybe_convert_objects robust against the known numpy bug that uint64 cannot be compared to int64. This bug was caught during testing of UInt64Index.

UPDATE: Patched in #14951

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 21, 2016

Contributor

can you separate out

Patches bug from #14916 that makes maybe_convert_objects robust against the known numpy bug that uint64 cannot be compared to int64. This bug was caught during testing of UInt64Index.

Contributor

jreback commented Dec 21, 2016

can you separate out

Patches bug from #14916 that makes maybe_convert_objects robust against the known numpy bug that uint64 cannot be compared to int64. This bug was caught during testing of UInt64Index.

Show outdated Hide outdated doc/source/whatsnew/v0.20.0.txt
@@ -99,6 +99,7 @@ Other enhancements
unsorted MultiIndex (:issue:`11897`). This allows differentiation between errors due to lack
of sorting or an incorrect key. See :ref:`here <advanced.unsorted>`
- New ``UInt64Index`` (subclass of ``NumericIndex``) for specifically indexing unsigned integers (:issue:`14935`)

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

create a new sub-section for all uint64 issues (can be one later)

@jreback

jreback Dec 21, 2016

Contributor

create a new sub-section for all uint64 issues (can be one later)

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

Created.

@gfyoung

gfyoung Dec 22, 2016

Member

Created.

Show outdated Hide outdated pandas/index.pyx
@@ -426,6 +426,67 @@ cdef class Int64Engine(IndexEngine):
return result
cdef class UInt64Engine(IndexEngine):

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

we are going to need to simplify this boilerplate, either expanding some methods in the superclass, maybe using some tempita. I hate copy-pasting code.

@jreback

jreback Dec 21, 2016

Contributor

we are going to need to simplify this boilerplate, either expanding some methods in the superclass, maybe using some tempita. I hate copy-pasting code.

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Dec 21, 2016

Current coverage is 85.54% (diff: 95.00%)

Merging #14937 into master will increase coverage by <.01%

@@             master     #14937   diff @@
==========================================
  Files           145        145          
  Lines         51288      51350    +62   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43872      43928    +56   
- Misses         7416       7422     +6   
  Partials          0          0          

Powered by Codecov. Last update 0e219d7...8ab6fbd

codecov-io commented Dec 21, 2016

Current coverage is 85.54% (diff: 95.00%)

Merging #14937 into master will increase coverage by <.01%

@@             master     #14937   diff @@
==========================================
  Files           145        145          
  Lines         51288      51350    +62   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43872      43928    +56   
- Misses         7416       7422     +6   
  Partials          0          0          

Powered by Codecov. Last update 0e219d7...8ab6fbd

Show outdated Hide outdated pandas/index.pyx
raise KeyError(val)
cdef _maybe_get_bool_indexer(self, object val):
cdef:

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

you can prob just use fused types for routines like this (for the scalar & array type)

@jreback

jreback Dec 21, 2016

Contributor

you can prob just use fused types for routines like this (for the scalar & array type)

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

I am planning to template, so this shouldn't be problematic anymore.

@gfyoung

gfyoung Dec 22, 2016

Member

I am planning to template, so this shouldn't be problematic anymore.

Show outdated Hide outdated pandas/indexes/base.py
except (OverflowError, TypeError, ValueError):
pass
try:

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

add a comment

@jreback

jreback Dec 21, 2016

Contributor

add a comment

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

Done.

@gfyoung

gfyoung Dec 22, 2016

Member

Done.

Show outdated Hide outdated pandas/indexes/numeric.py
@@ -177,6 +177,91 @@ def _assert_safe_casting(cls, data, subarr):
Int64Index._add_logical_methods()
class UInt64Index(NumericIndex):
"""
Immutable ndarray implementing an ordered, sliceable set. The basic object

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

we should use a shared doc for these indexes __init__

@jreback

jreback Dec 21, 2016

Contributor

we should use a shared doc for these indexes __init__

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

Makes sense. Done.

@gfyoung

gfyoung Dec 22, 2016

Member

Makes sense. Done.

_default_dtype = np.uint64
@property
def inferred_type(self):

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

hmm? you don't want to differentiate

@jreback

jreback Dec 21, 2016

Contributor

hmm? you don't want to differentiate

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

I did not see a reason to do so at this point.

@gfyoung

gfyoung Dec 22, 2016

Member

I did not see a reason to do so at this point.

Show outdated Hide outdated pandas/indexes/numeric.py
return self.values.view('u8')
@property
def is_all_dates(self):

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

I think this is defined in the super class (or it should be)

@jreback

jreback Dec 21, 2016

Contributor

I think this is defined in the super class (or it should be)

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

It is defined in Index. However, that implementation is very generic and less performant. We can just return the boolean immediately if possible, which is what I do and is also done in many subclasses (e.g Int64Index, DatetimeIndex).

@gfyoung

gfyoung Dec 22, 2016

Member

It is defined in Index. However, that implementation is very generic and less performant. We can just return the boolean immediately if possible, which is what I do and is also done in many subclasses (e.g Int64Index, DatetimeIndex).

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

this should be a method on NumericIndex, which just returns False (then is overriden in the datetime ones).

@jreback

jreback Dec 25, 2016

Contributor

this should be a method on NumericIndex, which just returns False (then is overriden in the datetime ones).

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 22, 2016

Member

@jreback : See #14951 for patching maybe_convert_objects

Member

gfyoung commented Dec 22, 2016

@jreback : See #14951 for patching maybe_convert_objects

Show outdated Hide outdated doc/source/whatsnew/v0.20.0.txt
or purely non-negative, integers. Previously, handling these integers would
result in improper rounding or data-type casting, leading to incorrect results.
One notable place where this improved was in ``DataFrame`` creation (:issue:`14917`):

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

I don't think you need this example. mainly wanted to put the issues together (top section is good though).

@jreback

jreback Dec 22, 2016

Contributor

I don't think you need this example. mainly wanted to put the issues together (top section is good though).

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

Fair enough.

@gfyoung

gfyoung Dec 22, 2016

Member

Fair enough.

Show outdated Hide outdated doc/source/whatsnew/v0.20.0.txt
- Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`)
- Bug in ``Series.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14721`)
- New ``UInt64Index`` (subclass of ``NumericIndex``) for specifically indexing unsigned integers (:issue:`14935`)

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

so an example using the index might be good (e.g. using indexing)

@jreback

jreback Dec 22, 2016

Contributor

so an example using the index might be good (e.g. using indexing)

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

Sounds good. Done.

@gfyoung

gfyoung Dec 22, 2016

Member

Sounds good. Done.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 22, 2016

Contributor

so need testing / fixes for

  • assignment to DataFrame
  • set/reset index
  • indexing (getting and setting), add a new test class
Contributor

jreback commented Dec 22, 2016

so need testing / fixes for

  • assignment to DataFrame
  • set/reset index
  • indexing (getting and setting), add a new test class
@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 22, 2016

Member

Assignment to DataFrame: frame/test_alter_axes.py ?
Add a new test class : indexes/test_numeric.py ?

Set / reset index ? (not sure what you mean)
Getting / setting ? (not sure what you mean)

Member

gfyoung commented Dec 22, 2016

Assignment to DataFrame: frame/test_alter_axes.py ?
Add a new test class : indexes/test_numeric.py ?

Set / reset index ? (not sure what you mean)
Getting / setting ? (not sure what you mean)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 23, 2016

Contributor

Set / reset index ? (not sure what you mean)
df.set_index('uint64_col').reset_index()

Getting / setting ? (not sure what you mean)
df2 = df.set_index('uint64_col')
df2.loc[df.index[0]] (and =)

basically need a pandas/tests/indexing/test_uint64.py

Contributor

jreback commented Dec 23, 2016

Set / reset index ? (not sure what you mean)
df.set_index('uint64_col').reset_index()

Getting / setting ? (not sure what you mean)
df2 = df.set_index('uint64_col')
df2.loc[df.index[0]] (and =)

basically need a pandas/tests/indexing/test_uint64.py

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 23, 2016

Member

Would appending uint to the battery of tests in tests/indexing/test_indexing.py suffice then?

Member

gfyoung commented Dec 23, 2016

Would appending uint to the battery of tests in tests/indexing/test_indexing.py suffice then?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 23, 2016

Contributor

@gfyoung yes adding uint to lots of tests would definitly help.

Contributor

jreback commented Dec 23, 2016

@gfyoung yes adding uint to lots of tests would definitly help.

jreback added a commit that referenced this pull request Dec 24, 2016

BUG: Patch rank() uint64 behavior
Adds `uint64` ranking functions to `algos.pyx` to allow for proper
ranking with `uint64`.    Also introduces partial patch for
`factorize()` by adding `uint64` hashtables and vectors for  usage.
However, this patch is only partial because the larger bug of non-
support for `uint64`  in `Index` has not been fixed (**UPDATE**:
tackled in #14937):  ~~~python  >>> from pandas import Index, np  >>>
Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')  ~~~    Also patches
a bug in `UInt64HashTable` from #14915 that had an erroneous null
condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior
@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 25, 2016

Member

@jreback : Added UInt64Index to huge number of tests, and everything is green.

Member

gfyoung commented Dec 25, 2016

@jreback : Added UInt64Index to huge number of tests, and everything is green.

Show outdated Hide outdated pandas/core/indexing.py
keyarr = key
if isinstance(labels, ABCUInt64Index) and key.is_integer():
keyarr = key.astype(np.uint64)
else:

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

why is this needed? too specialy casey

@jreback

jreback Dec 25, 2016

Contributor

why is this needed? too specialy casey

This comment has been minimized.

@gfyoung

gfyoung Dec 25, 2016

Member

Specially-cased, you say? This code is WAY too int64-oriented that attempts to generalize the patch caused test failures all over the place. The reason why this is needed is because when you attempt to index a UInt64Index, the current code implicitly does a lot of casting to int64, which works for about everything else except for uint64.

@gfyoung

gfyoung Dec 25, 2016

Member

Specially-cased, you say? This code is WAY too int64-oriented that attempts to generalize the patch caused test failures all over the place. The reason why this is needed is because when you attempt to index a UInt64Index, the current code implicitly does a lot of casting to int64, which works for about everything else except for uint64.

if is_integer_dtype(keyarr) and not labels.is_integer():
keyarr = _ensure_platform_int(keyarr)
return labels.take(keyarr)
if is_integer_dtype(keyarr):

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

same here, if there are too many special cases this makes this code even more impenetrable.

@jreback

jreback Dec 25, 2016

Contributor

same here, if there are too many special cases this makes this code even more impenetrable.

This comment has been minimized.

@gfyoung

gfyoung Dec 25, 2016

Member

Same answer as above. This code is very difficult to insert uint64 compatibility into as a general feature because we have no way of differentiating between uint64 and int64 in this code without special-casing somewhere.

@gfyoung

gfyoung Dec 25, 2016

Member

Same answer as above. This code is very difficult to insert uint64 compatibility into as a general feature because we have no way of differentiating between uint64 and int64 in this code without special-casing somewhere.

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

I'd like to capture this again in a method on the Index, not sure what this should look like though.

@jreback

jreback Jan 4, 2017

Contributor

I'd like to capture this again in a method on the Index, not sure what this should look like though.

Show outdated Hide outdated pandas/core/indexing.py
# want Index objects to pass through untouched
keyarr = key
if isinstance(labels, ABCUInt64Index) and key.is_integer():
keyarr = key.astype(np.uint64)

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

smae

@jreback

jreback Dec 25, 2016

Contributor

smae

This comment has been minimized.

@gfyoung

gfyoung Dec 25, 2016

Member

Same answer as above.

@gfyoung

gfyoung Dec 25, 2016

Member

Same answer as above.

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

this should be handled using _convert_index_indexer (e.g. like _convert_list_indexer), a method on the Index object itself (and for all indexes will basically just return, except for UInt64 you can do the astyping).

@jreback

jreback Jan 4, 2017

Contributor

this should be handled using _convert_index_indexer (e.g. like _convert_list_indexer), a method on the Index object itself (and for all indexes will basically just return, except for UInt64 you can do the astyping).

Show outdated Hide outdated pandas/src/index_class_helper.pxi.in
WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
"""

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

let's split this change out to a separate PR (obviously index.pyx as well).

@jreback

jreback Dec 25, 2016

Contributor

let's split this change out to a separate PR (obviously index.pyx as well).

Show outdated Hide outdated pandas/indexes/numeric.py
"""
_typ = 'uint64index'
_arrmap = _algos.arrmap_uint64
_left_indexer_unique = _join.left_join_indexer_unique_uint64

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

in another PR, let's make the change that generates these indexers base on the dtype of the class
kind of like how we add method via the add_numeric_methods class function. Then you can run it on a class, it will introspect and add these methods. This makes less boilerplate generally.

@jreback

jreback Dec 25, 2016

Contributor

in another PR, let's make the change that generates these indexers base on the dtype of the class
kind of like how we add method via the add_numeric_methods class function. Then you can run it on a class, it will introspect and add these methods. This makes less boilerplate generally.

This comment has been minimized.

@gfyoung

gfyoung Dec 25, 2016

Member

Good idea, though most of the boilerplate you are seeing is from UInt64Index I think. Perhaps we can do that as a follow-up once UInt64Index is added?

@gfyoung

gfyoung Dec 25, 2016

Member

Good idea, though most of the boilerplate you are seeing is from UInt64Index I think. Perhaps we can do that as a follow-up once UInt64Index is added?

Show outdated Hide outdated pandas/indexes/numeric.py
return self.values.view('u8')
@property
def is_all_dates(self):

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

this should be a method on NumericIndex, which just returns False (then is overriden in the datetime ones).

@jreback

jreback Dec 25, 2016

Contributor

this should be a method on NumericIndex, which just returns False (then is overriden in the datetime ones).

Show outdated Hide outdated pandas/indexes/numeric.py
"""
_num_index_shared_docs['_convert_scalar_indexer'] = """
Convert a scalar indexer.

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

also pull out this doc-string to a separate PR. Idea is we want to do changes to make things generic in a PR before this one. Then this PR becomes much less complex.

@jreback

jreback Dec 25, 2016

Contributor

also pull out this doc-string to a separate PR. Idea is we want to do changes to make things generic in a PR before this one. Then this PR becomes much less complex.

@@ -12,7 +12,8 @@ WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
{{py:
# table_type, by_dtype
by_dtypes = [('PyObjectHashTable', 'object'), ('Int64HashTable', 'int64_t')]
by_dtypes = [('PyObjectHashTable', 'object'), ('Int64HashTable', 'int64_t'),

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

this change can also be done in same PR with the index.pyx changes.

@jreback

jreback Dec 25, 2016

Contributor

this change can also be done in same PR with the index.pyx changes.

This comment has been minimized.

@gfyoung

gfyoung Dec 31, 2016

Member

Yep, handled in #14989.

@gfyoung

gfyoung Dec 31, 2016

Member

Yep, handled in #14989.

Show outdated Hide outdated pandas/tests/frame/test_indexing.py
# with nan
df2 = df.copy()
df2.iloc[1, 1] = pd.NaT

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

its is not clear at all what you are testing here

@jreback

jreback Dec 25, 2016

Contributor

its is not clear at all what you are testing here

This comment has been minimized.

@gfyoung

gfyoung Dec 25, 2016

Member

I'm not sure what's not clear. The test name relates to setitem, and the item we are setting here is pd.NaT.

@gfyoung

gfyoung Dec 25, 2016

Member

I'm not sure what's not clear. The test name relates to setitem, and the item we are setting here is pd.NaT.

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

why are you setting with pd.NaT? are you trying to force uint64?

@jreback

jreback Jan 4, 2017

Contributor

why are you setting with pd.NaT? are you trying to force uint64?

df = self.df
idx = self.idx
# setitem

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

the main thing to test is .loc

@jreback

jreback Dec 25, 2016

Contributor

the main thing to test is .loc

This comment has been minimized.

@gfyoung

gfyoung Dec 25, 2016

Member

If you look at the rest of the changes, I added uint to a battery-ram of tests in indexes/test_indexing.py

@gfyoung

gfyoung Dec 25, 2016

Member

If you look at the rest of the changes, I added uint to a battery-ram of tests in indexes/test_indexing.py

# set/reset
df = DataFrame({'A': [0, 1, 2]}, index=idx)
result = df.reset_index()
self.assertEqual(result['foo'].dtype, np.dtype('uint64'))

This comment has been minimized.

@jreback

jreback Dec 25, 2016

Contributor

this is not clear what you are testing here

@jreback

jreback Dec 25, 2016

Contributor

this is not clear what you are testing here

This comment has been minimized.

@gfyoung

gfyoung Dec 25, 2016

Member

What do you mean? I literally copied the test class right above. This is testing reset_index.

@gfyoung

gfyoung Dec 25, 2016

Member

What do you mean? I literally copied the test class right above. This is testing reset_index.

ShaharBental added a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016

BUG: Patch rank() uint64 behavior
Adds `uint64` ranking functions to `algos.pyx` to allow for proper
ranking with `uint64`.    Also introduces partial patch for
`factorize()` by adding `uint64` hashtables and vectors for  usage.
However, this patch is only partial because the larger bug of non-
support for `uint64`  in `Index` has not been fixed (**UPDATE**:
tackled in #14937):  ~~~python  >>> from pandas import Index, np  >>>
Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')  ~~~    Also patches
a bug in `UInt64HashTable` from #14915 that had an erroneous null
condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior

jreback added a commit that referenced this pull request Dec 30, 2016

DOC: Refactor numeric index docs
Refactor `NumericIndex` docs to avoid duplicate documentation.  xref
#14937.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14985 from gfyoung/numeric-doc-refactor and squashes the following commits:

ba80766 [gfyoung] DOC: Refactor numeric index docs

@jreback jreback referenced this pull request Jan 1, 2017

Closed

groupby casting to int64 #15027

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jan 1, 2017

Contributor

@gfyoung so some additional cases to check, e.g #15027 (this may be quite involved actually. a uint group-and should be turned into a uint64 grouper).

Contributor

jreback commented Jan 1, 2017

@gfyoung so some additional cases to check, e.g #15027 (this may be quite involved actually. a uint group-and should be turned into a uint64 grouper).

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Jan 1, 2017

Member

@jreback : Yeah...looking at the code, it seems that this will be somewhat involved. Can we patch that in a separate PR (and maybe just reopen it for starters)?

Member

gfyoung commented Jan 1, 2017

@jreback : Yeah...looking at the code, it seems that this will be somewhat involved. Can we patch that in a separate PR (and maybe just reopen it for starters)?

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Jan 1, 2017

Member

@jreback : Perhaps we can just focus on making sure in this PR that the changes I have proposed are good to merge. How does the PR look at this point?

Member

gfyoung commented Jan 1, 2017

@jreback : Perhaps we can just focus on making sure in this PR that the changes I have proposed are good to merge. How does the PR look at this point?

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Jan 3, 2017

Member

@jreback : Do you think you could do a re-look over this PR and tell me what things should be changed before this can be merged?

Member

gfyoung commented Jan 3, 2017

@jreback : Do you think you could do a re-look over this PR and tell me what things should be changed before this can be merged?

@jreback

you need to have a TestUInt64Index sub-class in tests/index/test_numeric.py. maybe you have one and I didn't see it.

Show outdated Hide outdated doc/source/whatsnew/v0.20.0.txt
Pandas has significantly improved support for operations involving unsigned,
or purely non-negative, integers. Previously, handling these integers would
result in improper rounding or data-type casting, leading to incorrect results.
Notably, a new numerical index, `UInt64Index`, has been created (:issue:`14937`)

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

double back-ticks on UInt64Index

@jreback

jreback Jan 4, 2017

Contributor

double back-ticks on UInt64Index

This comment has been minimized.

@gfyoung

gfyoung Jan 5, 2017

Member

Done.

@gfyoung

gfyoung Jan 5, 2017

Member

Done.

Show outdated Hide outdated pandas/core/indexing.py
# want Index objects to pass through untouched
keyarr = key
if isinstance(labels, ABCUInt64Index) and key.is_integer():
keyarr = key.astype(np.uint64)

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

this should be handled using _convert_index_indexer (e.g. like _convert_list_indexer), a method on the Index object itself (and for all indexes will basically just return, except for UInt64 you can do the astyping).

@jreback

jreback Jan 4, 2017

Contributor

this should be handled using _convert_index_indexer (e.g. like _convert_list_indexer), a method on the Index object itself (and for all indexes will basically just return, except for UInt64 you can do the astyping).

Show outdated Hide outdated pandas/core/indexing.py
@@ -1043,12 +1051,17 @@ def _getitem_iterable(self, key, axis=0):
return self.obj.take(inds, axis=axis, convert=False)
else:
if isinstance(key, Index):
# want Index objects to pass through untouched
keyarr = key
if isinstance(labels, ABCUInt64Index) and key.is_integer():

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

_convert_index_indexer will work here

@jreback

jreback Jan 4, 2017

Contributor

_convert_index_indexer will work here

if is_integer_dtype(keyarr) and not labels.is_integer():
keyarr = _ensure_platform_int(keyarr)
return labels.take(keyarr)
if is_integer_dtype(keyarr):

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

I'd like to capture this again in a method on the Index, not sure what this should look like though.

@jreback

jreback Jan 4, 2017

Contributor

I'd like to capture this again in a method on the Index, not sure what this should look like though.

Show outdated Hide outdated pandas/indexes/base.py
# Conversion to int64 failed (possibly due to
# overflow), so let's try now with uint64.
try:
res = data.astype('u8')

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

add copy=False (and similarly to the .astype for i8 above)

@jreback

jreback Jan 4, 2017

Contributor

add copy=False (and similarly to the .astype for i8 above)

Show outdated Hide outdated pandas/indexes/numeric.py
"""
Ensure incoming data can be represented as uints.
"""
if not issubclass(data.dtype.type, np.integer):

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

should this be np.signedinteger?

@jreback

jreback Jan 4, 2017

Contributor

should this be np.signedinteger?

This comment has been minimized.

@gfyoung

gfyoung Jan 5, 2017

Member

Good catch - np.unsignedinteger in fact. I also updated Int64Index to use np.signedinteger.

@gfyoung

gfyoung Jan 5, 2017

Member

Good catch - np.unsignedinteger in fact. I also updated Int64Index to use np.signedinteger.

Show outdated Hide outdated pandas/indexes/numeric.py
_float64_descr_args = dict(
klass='Float64Index',
dtype='float64',

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

I thought you changed these doc-strings in another PR?

@jreback

jreback Jan 4, 2017

Contributor

I thought you changed these doc-strings in another PR?

This comment has been minimized.

@gfyoung

gfyoung Jan 5, 2017

Member

Duplicate _float64_descr_args. Failed rebase. Fixed.

@gfyoung

gfyoung Jan 5, 2017

Member

Duplicate _float64_descr_args. Failed rebase. Fixed.

Show outdated Hide outdated pandas/tests/frame/test_indexing.py
# with nan
df2 = df.copy()
df2.iloc[1, 1] = pd.NaT

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

why are you setting with pd.NaT? are you trying to force uint64?

@jreback

jreback Jan 4, 2017

Contributor

why are you setting with pd.NaT? are you trying to force uint64?

Show outdated Hide outdated pandas/tests/indexes/test_base.py
@@ -363,6 +364,19 @@ def test_constructor_dtypes_timedelta(self):
pd.TimedeltaIndex(list(values), dtype=dtype)]:
tm.assert_index_equal(res, idx)
def test_constructor_uint64(self):

This comment has been minimized.

@jreback

jreback Jan 4, 2017

Contributor

this should be in the testing sub-class

@jreback

jreback Jan 4, 2017

Contributor

this should be in the testing sub-class

This comment has been minimized.

@gfyoung

gfyoung Jan 5, 2017

Member

Reply to your comment about pd.NaT above: the purpose is to see what the column dtype becomes when I set an element in the UInt64Index to a NaN value. I added a comment to explain this.

@gfyoung

gfyoung Jan 5, 2017

Member

Reply to your comment about pd.NaT above: the purpose is to see what the column dtype becomes when I set an element in the UInt64Index to a NaN value. I added a comment to explain this.

This comment has been minimized.

@gfyoung

gfyoung Jan 5, 2017

Member

Reply to your comment about testing sub-class (and your other comment above): I did not have a sub-class at all, so I will add that. Let's see what breaks! 😄

@gfyoung

gfyoung Jan 5, 2017

Member

Reply to your comment about testing sub-class (and your other comment above): I did not have a sub-class at all, so I will add that. Let's see what breaks! 😄

This comment has been minimized.

@gfyoung

gfyoung Jan 5, 2017

Member

Before addressing your comments about the _convert_index_indexer refactoring, I will first try to expand the testing sub-class as much as possible. Still WIP (got it to pass all Numeric tests, but trying to see if the Int64Index tests can be generalized to include UInt64Index too)

@gfyoung

gfyoung Jan 5, 2017

Member

Before addressing your comments about the _convert_index_indexer refactoring, I will first try to expand the testing sub-class as much as possible. Still WIP (got it to pass all Numeric tests, but trying to see if the Int64Index tests can be generalized to include UInt64Index too)

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Jan 7, 2017

Member

@jreback : Tons of new tests were added with the sub-class, all of the refactorings requested, and everything is still green!

My only concern is this doc-building instance here, which has consistently failed because for some reason it cannot pick up my changes to the pxi.in files. I built the docs locally, and everything is fine. Any ideas on how to remedy this?

Member

gfyoung commented Jan 7, 2017

@jreback : Tons of new tests were added with the sub-class, all of the refactorings requested, and everything is still green!

My only concern is this doc-building instance here, which has consistently failed because for some reason it cannot pick up my changes to the pxi.in files. I built the docs locally, and everything is fine. Any ideas on how to remedy this?

Show outdated Hide outdated pandas/core/indexing.py
return labels.take(keyarr)
if is_integer_dtype(keyarr):
keyarr = labels._convert_arr_indexer(keyarr)

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

can you add a 1-liner here about what this is doing (always will take comments w.r.t. indexing code :)

@jreback

jreback Jan 9, 2017

Contributor

can you add a 1-liner here about what this is doing (always will take comments w.r.t. indexing code :)

This comment has been minimized.

@gfyoung

gfyoung Jan 10, 2017

Member

Done.

@gfyoung

gfyoung Jan 10, 2017

Member

Done.

if (res == data).all():
return Int64Index(res, copy=copy,
name=name)
except (OverflowError, TypeError, ValueError):
pass

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

shouldn't we try to convert to a UInt64 index only if we have oveflow on the int64 conversion? (and not some other error)?

I suppose the way the code is now it will just hit both paths then fall thru on a genuine Type/Value error.

@jreback

jreback Jan 9, 2017

Contributor

shouldn't we try to convert to a UInt64 index only if we have oveflow on the int64 conversion? (and not some other error)?

I suppose the way the code is now it will just hit both paths then fall thru on a genuine Type/Value error.

This comment has been minimized.

@gfyoung

gfyoung Jan 10, 2017

Member

I don't believe so. TypeError and ValueError were being guarded against for Int64Index initialization before. I see no reason to change that. The other option is to do something like this:

try:
   try:
       Int64Index(...)
   except OverflowError:
       UInt64Index(...)
except (TypeError, ValueError):
   pass

However, I am more in favor of my method at this point since it is slightly more modular.

@gfyoung

gfyoung Jan 10, 2017

Member

I don't believe so. TypeError and ValueError were being guarded against for Int64Index initialization before. I see no reason to change that. The other option is to do something like this:

try:
   try:
       Int64Index(...)
   except OverflowError:
       UInt64Index(...)
except (TypeError, ValueError):
   pass

However, I am more in favor of my method at this point since it is slightly more modular.

Show outdated Hide outdated pandas/indexes/base.py
@@ -235,9 +246,12 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
IncompatibleFrequency)
if isinstance(data, PeriodIndex):
return PeriodIndex(data, copy=copy, name=name, **kwargs)
if issubclass(data.dtype.type, np.integer):
if issubclass(data.dtype.type, np.signedinteger):

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

we should probably change all of these to use our is_*_dtype methods (but can do that in another PR)

@jreback

jreback Jan 9, 2017

Contributor

we should probably change all of these to use our is_*_dtype methods (but can do that in another PR)

This comment has been minimized.

@gfyoung

gfyoung Jan 10, 2017

Member

@jreback : Lemme try to patch this now. We have the methods defined. If I can't, I'll let you know in a subsequent comment.

@gfyoung

gfyoung Jan 10, 2017

Member

@jreback : Lemme try to patch this now. We have the methods defined. If I can't, I'll let you know in a subsequent comment.

This comment has been minimized.

@gfyoung

gfyoung Jan 10, 2017

Member

Yep, no issues with doing this now. Done.

@gfyoung

gfyoung Jan 10, 2017

Member

Yep, no issues with doing this now. Done.

Show outdated Hide outdated pandas/indexes/numeric.py
@Appender(_index_shared_docs['_convert_arr_indexer'])
def _convert_arr_indexer(self, keyarr):
if is_integer_dtype(keyarr):

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

can you put a 1-line comment on what you are doing here

@jreback

jreback Jan 9, 2017

Contributor

can you put a 1-line comment on what you are doing here

This comment has been minimized.

@gfyoung

gfyoung Jan 10, 2017

Member

Done.

@gfyoung

gfyoung Jan 10, 2017

Member

Done.

return keyarr
@Appender(_index_shared_docs['_convert_index_indexer'])
def _convert_index_indexer(self, keyarr):

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

same

@jreback

jreback Jan 9, 2017

Contributor

same

This comment has been minimized.

@gfyoung

gfyoung Jan 10, 2017

Member

Done.

@gfyoung

gfyoung Jan 10, 2017

Member

Done.

return keyarr
def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

this should prob be handled more generically in the super method (e.g. why are we not just calling _simple_new(joined...)?

@jreback

jreback Jan 9, 2017

Contributor

this should prob be handled more generically in the super method (e.g. why are we not just calling _simple_new(joined...)?

This comment has been minimized.

@gfyoung

gfyoung Jan 9, 2017

Member

Better for a refactoring PR follow-up?

@gfyoung

gfyoung Jan 9, 2017

Member

Better for a refactoring PR follow-up?

@@ -2020,3 +2022,64 @@ def test_repeat(self):
with tm.assert_produces_warning(FutureWarning):
result = idx.repeat(n=repeats)
tm.assert_index_equal(result, expected)
def test_is_monotonic_na(self):
examples = [pd.Index([np.nan]),

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

where did all of these tests come from?

@jreback

jreback Jan 9, 2017

Contributor

where did all of these tests come from?

This comment has been minimized.

@gfyoung

gfyoung Jan 9, 2017

Member

test_numeric.py under the Int64Index tests (not sure why they were there in the first place). This is just a refactoring i.e., I moved tests around (did not change them) when I was writing out the UInt64Index tests.

@gfyoung

gfyoung Jan 9, 2017

Member

test_numeric.py under the Int64Index tests (not sure why they were there in the first place). This is just a refactoring i.e., I moved tests around (did not change them) when I was writing out the UInt64Index tests.

@@ -448,7 +447,183 @@ def test_take_fill_value(self):
idx.take(np.array([1, -5]))
class TestInt64Index(Numeric, tm.TestCase):
class NumericInt(Numeric):

This comment has been minimized.

@jreback

jreback Jan 9, 2017

Contributor

nice!

@jreback

jreback Jan 9, 2017

Contributor

nice!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jan 9, 2017

Contributor

@gfyoung looks pretty good. just a few more minor comments. I still have to play with this a bit more though.

@jorisvandenbossche

Contributor

jreback commented Jan 9, 2017

@gfyoung looks pretty good. just a few more minor comments. I still have to play with this a bit more though.

@jorisvandenbossche

@jreback

jreback approved these changes Jan 9, 2017

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jan 12, 2017

Contributor

I got the same import issues when I pulled this down. maybe some dep is not recompiling? (I didn't try, but I bet it works with a full clean first)

/Users/jreback/pandas/pandas/indexes/api.py in <module>()
      4 from pandas.indexes.category import CategoricalIndex  # noqa
      5 from pandas.indexes.multi import MultiIndex  # noqa
----> 6 from pandas.indexes.numeric import (NumericIndex, Float64Index,  # noqa
      7                                     Int64Index, UInt64Index)
      8 from pandas.indexes.range import RangeIndex  # noqa

/Users/jreback/pandas/pandas/indexes/numeric.py in <module>()
    171 
    172 
--> 173 class UInt64Index(NumericIndex):
    174     __doc__ = _num_index_shared_docs['class_descr'] % _uint64_descr_args
    175 

/Users/jreback/pandas/pandas/indexes/numeric.py in UInt64Index()
    184     _na_value = 0
    185 
--> 186     _engine_type = _index.UInt64Engine
    187 
    188     _default_dtype = np.uint64

AttributeError: module 'pandas.index' has no attribute 'UInt64Engine'

Contributor

jreback commented Jan 12, 2017

I got the same import issues when I pulled this down. maybe some dep is not recompiling? (I didn't try, but I bet it works with a full clean first)

/Users/jreback/pandas/pandas/indexes/api.py in <module>()
      4 from pandas.indexes.category import CategoricalIndex  # noqa
      5 from pandas.indexes.multi import MultiIndex  # noqa
----> 6 from pandas.indexes.numeric import (NumericIndex, Float64Index,  # noqa
      7                                     Int64Index, UInt64Index)
      8 from pandas.indexes.range import RangeIndex  # noqa

/Users/jreback/pandas/pandas/indexes/numeric.py in <module>()
    171 
    172 
--> 173 class UInt64Index(NumericIndex):
    174     __doc__ = _num_index_shared_docs['class_descr'] % _uint64_descr_args
    175 

/Users/jreback/pandas/pandas/indexes/numeric.py in UInt64Index()
    184     _na_value = 0
    185 
--> 186     _engine_type = _index.UInt64Engine
    187 
    188     _default_dtype = np.uint64

AttributeError: module 'pandas.index' has no attribute 'UInt64Engine'

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Jan 12, 2017

Member

I also had to do a clean + full build to get the branch working (not convenient in this specific case, but is that actually indicating a problem?)

Member

jorisvandenbossche commented Jan 12, 2017

I also had to do a clean + full build to get the branch working (not convenient in this specific case, but is that actually indicating a problem?)

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Jan 12, 2017

Member

@jorisvandenbossche , @jreback : See my comment above. This might be indicative of a setup.py issue since we have to do a git clean before installing. It isn't breaking Travis, but this is a little disconcerting admittedly.

Member

gfyoung commented Jan 12, 2017

@jorisvandenbossche , @jreback : See my comment above. This might be indicative of a setup.py issue since we have to do a git clean before installing. It isn't breaking Travis, but this is a little disconcerting admittedly.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jan 13, 2017

Contributor

need

'depends': _pxi_dep['index']

here `https://github.com/pandas-dev/pandas/blob/master/setup.py#L493

Contributor

jreback commented Jan 13, 2017

need

'depends': _pxi_dep['index']

here `https://github.com/pandas-dev/pandas/blob/master/setup.py#L493

Show outdated Hide outdated setup.py
@@ -490,7 +490,8 @@ def pxd(name):
index={'pyxfile': 'index',
'sources': ['pandas/src/datetime/np_datetime.c',
'pandas/src/datetime/np_datetime_strings.c'],
'pxdfiles': ['src/util']},
'pxdfiles': ['src/util'],
'depends': _pxi_dep['index'] + _pxi_dep['algos']},

This comment has been minimized.

@jreback

jreback Jan 14, 2017

Contributor

well this is dependent on pandas.algos, but a change in algos.pyx doesn't actually require re-compilation of index.pyx I don't think.

@jreback

jreback Jan 14, 2017

Contributor

well this is dependent on pandas.algos, but a change in algos.pyx doesn't actually require re-compilation of index.pyx I don't think.

This comment has been minimized.

@gfyoung

gfyoung Jan 14, 2017

Member

Yeah, good point. Fixed.

@gfyoung

gfyoung Jan 14, 2017

Member

Yeah, good point. Fixed.

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Jan 14, 2017

Member

@jreback , @jorisvandenbossche : Added the dependency in setup.py as requested. The installation should work. If it does, then this should be ready to merge then.

Member

gfyoung commented Jan 14, 2017

@jreback , @jorisvandenbossche : Added the dependency in setup.py as requested. The installation should work. If it does, then this should be ready to merge then.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Jan 14, 2017

Member

Was playing a bit with it, and got into this:

In [15]: idx = pd.UInt64Index([1,2,3,4])

In [16]: s = pd.Series(range(4), index=idx)

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')
Member

jorisvandenbossche commented Jan 14, 2017

Was playing a bit with it, and got into this:

In [15]: idx = pd.UInt64Index([1,2,3,4])

In [16]: s = pd.Series(range(4), index=idx)

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')
@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Jan 15, 2017

Member

Was playing a bit with it, and got into this:

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')

This looks like NumPy's default type promotion rules for combining int64 and uint64:

In [33]: np.concatenate([np.array([1, 2], dtype=np.int64), np.array([3], dtype=np.uint64)])
Out[33]: array([ 1.,  2.,  3.])
Member

shoyer commented Jan 15, 2017

Was playing a bit with it, and got into this:

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')

This looks like NumPy's default type promotion rules for combining int64 and uint64:

In [33]: np.concatenate([np.array([1, 2], dtype=np.int64), np.array([3], dtype=np.uint64)])
Out[33]: array([ 1.,  2.,  3.])
@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Jan 15, 2017

Member

@shoyer , @jorisvandenbossche : At first glance, I would agree with that diagnosis. Combining int64 and uint64 is "dangerous" from that perspective unfortunately (and out of our control). For now, I would ensure that uint64-only interactions result in uint64.

Member

gfyoung commented Jan 15, 2017

@shoyer , @jorisvandenbossche : At first glance, I would agree with that diagnosis. Combining int64 and uint64 is "dangerous" from that perspective unfortunately (and out of our control). For now, I would ensure that uint64-only interactions result in uint64.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Jan 15, 2017

Member

Ah yes, of course, overlooked that. Naively thought for a moment this fitted in int64 and so should be that. But fully agree that we should not check for this and just follow the rules.