ENH: Create and propagate UInt64Index #14937

gfyoung · 2016-12-21T08:03:45Z

1) Introduces and propagates UInt64Index, an index specifically for uint. xref BUG: Patch rank() uint64 behavior #14935
2) Patches a bug from BUG: Convert uint64 in maybe_convert_objects #14916 so that maybe_convert_objects is robust against the known numpy bug that uint64 cannot be compared to int64. This bug was caught during testing of UInt64Index.

UPDATE: Item 2 was split out and patched in #14951

jreback · 2016-12-21T11:32:14Z

can you separate out

Patches bug from #14916 that makes maybe_convert_objects robust against the known numpy bug that uint64 cannot be compared to int64. This bug was caught during testing of UInt64Index.

jreback · 2016-12-21T11:41:11Z

doc/source/whatsnew/v0.20.0.txt

@@ -99,6 +99,7 @@ Other enhancements
  unsorted MultiIndex (:issue:`11897`). This allows differentiation between errors due to lack
  of sorting or an incorrect key. See :ref:`here <advanced.unsorted>`

+- New ``UInt64Index`` (subclass of ``NumericIndex``) for specifically indexing unsigned integers (:issue:`14935`)


create a new sub-section for all uint64 issues (can be one later)

jreback · 2016-12-21T11:42:30Z

pandas/index.pyx

@@ -426,6 +426,67 @@ cdef class Int64Engine(IndexEngine):

        return result

+cdef class UInt64Engine(IndexEngine):
+


we are going to need to simplify this boilerplate, either expanding some methods in the superclass, maybe using some tempita. I hate copy-pasting code.

codecov-io · 2016-12-21T11:42:50Z

Current coverage is 85.54% (diff: 95.00%)

Merging #14937 into master will increase coverage by <.01%

@@             master     #14937   diff @@
==========================================
  Files           145        145          
  Lines         51288      51350    +62   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43872      43928    +56   
- Misses         7416       7422     +6   
  Partials          0          0

Powered by Codecov. Last update 0e219d7...8ab6fbd

jreback · 2016-12-21T11:43:32Z

pandas/index.pyx

+            raise KeyError(val)
+
+    cdef _maybe_get_bool_indexer(self, object val):
+        cdef:


you can prob just use fused types for routines like this (for the scalar & array type)

I am planning to template, so this shouldn't be problematic anymore.

jreback · 2016-12-21T11:45:26Z

pandas/indexes/base.py

+                            except (OverflowError, TypeError, ValueError):
+                                pass
+
+                            try:


add a comment

jreback · 2016-12-21T11:46:21Z

pandas/indexes/numeric.py

@@ -177,6 +177,91 @@ def _assert_safe_casting(cls, data, subarr):
 Int64Index._add_logical_methods()


+class UInt64Index(NumericIndex):
+    """
+    Immutable ndarray implementing an ordered, sliceable set. The basic object


we should use a shared doc for these indexes __init__

Makes sense. Done.

jreback · 2016-12-21T11:47:15Z

pandas/indexes/numeric.py

+    _default_dtype = np.uint64
+
+    @property
+    def inferred_type(self):


hmm? you don't want to differentiate

I did not see a reason to do so at this point.

jreback · 2016-12-21T11:47:57Z

pandas/indexes/numeric.py

+        return self.values.view('u8')
+
+    @property
+    def is_all_dates(self):


I think this is defined in the super class (or it should be)

It is defined in Index. However, that implementation is very generic and less performant. We can just return the boolean immediately if possible, which is what I do and is also done in many subclasses (e.g. Int64Index, DatetimeIndex).

this should be a method on NumericIndex, which just returns False (then is overridden in the datetime ones).
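A minimal sketch of the layout suggested in the comment above, using stand-in classes rather than pandas' real ones. The class bodies and the `DatetimeLikeIndex` name are assumptions made purely for illustration; only the idea (a constant `False` on the numeric base class, overridden by datetime subclasses) comes from the review.

```python
import datetime


class Index:
    """Stand-in base class (hypothetical, not pandas' real Index)."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def is_all_dates(self):
        # Generic path: inspect every element, which is slow for large indexes.
        return all(isinstance(v, datetime.datetime) for v in self.values)


class NumericIndex(Index):
    @property
    def is_all_dates(self):
        # Numeric data can never be all dates, so answer immediately.
        return False


class UInt64Index(NumericIndex):
    """Inherits the constant-time answer; no per-class override needed."""


class DatetimeLikeIndex(NumericIndex):
    """Hypothetical datetime-backed subclass."""

    @property
    def is_all_dates(self):
        # Datetime-backed subclasses override the numeric default.
        return True


print(UInt64Index([1, 2, 3]).is_all_dates)   # False
print(DatetimeLikeIndex([]).is_all_dates)    # True
```

Defining the property once on the numeric base class avoids repeating the same override in each integer or float index subclass.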

gfyoung · 2016-12-22T08:04:26Z

@jreback : See #14951 for patching maybe_convert_objects

jreback · 2016-12-22T11:46:11Z

doc/source/whatsnew/v0.20.0.txt

+or purely non-negative, integers. Previously, handling these integers would
+result in improper rounding or data-type casting, leading to incorrect results.
+One notable place where this improved was in ``DataFrame`` creation (:issue:`14917`):
+


I don't think you need this example. mainly wanted to put the issues together (top section is good though).

Fair enough.

jreback · 2016-12-22T11:46:38Z

doc/source/whatsnew/v0.20.0.txt

+
+- Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`)
+- Bug in ``Series.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14721`)
+- New ``UInt64Index`` (subclass of ``NumericIndex``) for specifically indexing unsigned integers (:issue:`14935`)


so an example using the index might be good (e.g. using indexing)

Sounds good. Done.

jreback · 2016-12-22T11:48:54Z

so need testing / fixes for

assignment to DataFrame
set/reset index
indexing (getting and setting), add a new test class

gfyoung · 2016-12-22T15:52:38Z

Assignment to DataFrame: frame/test_alter_axes.py ?
Add a new test class : indexes/test_numeric.py ?

Set / reset index ? (not sure what you mean)
Getting / setting ? (not sure what you mean)

jreback · 2016-12-23T00:25:00Z

Set / reset index ? (not sure what you mean)
df.set_index('uint64_col').reset_index()

Getting / setting ? (not sure what you mean)
df2 = df.set_index('uint64_col')
df2.loc[df2.index[0]] (and the assignment form, df2.loc[...] = value)

basically need a pandas/tests/indexing/test_uint64.py

gfyoung · 2016-12-23T08:52:51Z

Would appending uint to the battery of tests in tests/indexing/test_indexing.py suffice then?

jreback · 2016-12-23T12:35:42Z

@gfyoung yes adding uint to lots of tests would definitely help.

Adds `uint64` ranking functions to `algos.pyx` to allow for proper ranking with `uint64`. Also introduces partial patch for `factorize()` by adding `uint64` hashtables and vectors for usage. However, this patch is only partial because the larger bug of non-support for `uint64` in `Index` has not been fixed (**UPDATE**: tackled in #14937):

~~~python
>>> from pandas import Index, np
>>> Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')
~~~

Also patches a bug in `UInt64HashTable` from #14915 that had an erroneous null condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior
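The `-9223372036854775808` in the commit message above is what `2**63` looks like when the same 64 bits are read as a signed integer. The sketch below shows that with the standard library only. It assumes the pre-fix behavior amounted to such a reinterpretation, and the helper name is hypothetical.

```python
import struct


def reinterpret_uint64_as_int64(value):
    """Pack value as an unsigned 64-bit integer, then read the same bytes as signed."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


print(reinterpret_uint64_as_int64(2**63))      # -9223372036854775808
print(reinterpret_uint64_as_int64(2**63 - 1))  # 9223372036854775807
```

Any value at or above `2**63` has its top bit set, so it turns negative under this reading. Those are exactly the values a dedicated unsigned index is meant to hold.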

gfyoung · 2017-01-12T23:34:37Z

@jorisvandenbossche , @jreback : See my comment above. This might be indicative of a setup.py issue since we have to do a git clean before installing. It isn't breaking Travis, but this is a little disconcerting admittedly.

jreback · 2017-01-13T14:00:20Z

need

'depends': _pxi_dep['index']

here: https://github.com/pandas-dev/pandas/blob/master/setup.py#L493

jreback · 2017-01-14T18:09:41Z

setup.py

@@ -490,7 +490,8 @@ def pxd(name):
    index={'pyxfile': 'index',
           'sources': ['pandas/src/datetime/np_datetime.c',
                       'pandas/src/datetime/np_datetime_strings.c'],
-           'pxdfiles': ['src/util']},
+           'pxdfiles': ['src/util'],
+           'depends': _pxi_dep['index'] + _pxi_dep['algos']},


well this is dependent on pandas.algos, but a change in algos.pyx doesn't actually require re-compilation of index.pyx I don't think.

Yeah, good point. Fixed.

gfyoung · 2017-01-14T22:37:26Z

@jreback , @jorisvandenbossche : Added the dependency in setup.py as requested. The installation should work. If it does, this should be ready to merge.

jorisvandenbossche · 2017-01-14T22:48:36Z

Was playing a bit with it, and got into this:

In [15]: idx = pd.UInt64Index([1,2,3,4])

In [16]: s = pd.Series(range(4), index=idx)

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')

shoyer · 2017-01-15T02:47:46Z

Was playing a bit with it, and got into this:

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')

This looks like NumPy's default type promotion rules for combining int64 and uint64:

In [33]: np.concatenate([np.array([1, 2], dtype=np.int64), np.array([3], dtype=np.uint64)])
Out[33]: array([ 1.,  2.,  3.])
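A small standard-library sketch of why this promotion ends up at float64: neither 64-bit integer type can hold both ranges, and float64 can hold the magnitudes only by giving up exactness. The helper names are hypothetical and the bounds are the usual two's-complement limits.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MIN, UINT64_MAX = 0, 2**64 - 1


def fits_int64(x):
    return INT64_MIN <= x <= INT64_MAX


def fits_uint64(x):
    return UINT64_MIN <= x <= UINT64_MAX


values = [-1, 2**63]
print([fits_int64(v) for v in values])   # [True, False]
print([fits_uint64(v) for v in values])  # [False, True]

# float64 has a 53-bit significand, so large integers are rounded.
print(float(2**53 + 1) == float(2**53))      # True
print(int(float(UINT64_MAX)) == UINT64_MAX)  # False
```

The last two lines show the cost of the float64 fallback: distinct large integers can collapse to the same float.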

gfyoung · 2017-01-15T03:06:04Z

@shoyer , @jorisvandenbossche : At first glance, I would agree with that diagnosis. Combining int64 and uint64 is "dangerous" from that perspective unfortunately (and out of our control). For now, I would ensure that uint64-only interactions result in uint64.

jorisvandenbossche · 2017-01-15T11:22:59Z

Ah yes, of course, overlooked that. Naively thought for a moment this fitted in int64 and so should be that. But fully agree that we should not check for this and just follow the rules.

jreback · 2017-01-15T16:25:25Z

your whatsnew example is a little odd FYI, maybe expand a touch, and here's a bug

In [42]:    idx = pd.UInt64Index([1, 2, 3], name='foo')
    ...:    df = pd.DataFrame({'A' : ['a', 'b', 'c'], 'B' : range(3), 'C' : np.array([3,4, 5],dtype='uint64')}, index=idx)
    ...:    
    ...: 

In [43]: df2 = df.reset_index()

In [44]: df2
Out[44]: 
   foo  A  B  C
0    1  a  0  3
1    2  b  1  4
2    3  c  2  5

In [45]: df2.dtypes
Out[45]: 
foo    uint64
A      object
B       int64
C      uint64
dtype: object

In [46]: df2.iloc[1, 0] = np.nan

In [47]: df2
Out[47]: 
   foo  A  B  C
0  1.0  a  0  3
1  NaN  b  1  4
2  3.0  c  2  5

In [48]: df2.dtypes
Out[48]: 
foo    float64
A       object
B        int64
C       uint64
dtype: object

In [49]: df2.foo.fillna(-1)
Out[49]: 
0    1.0
1   -1.0
2    3.0
Name: foo, dtype: float64

In [50]: df2.foo.fillna(-1).astype('uint64')
Out[50]: 
0                       1
1    18446744073709551615
2                       3
Name: foo, dtype: uint64

[50] is wrong

gfyoung · 2017-01-15T23:24:33Z

@jreback :

What is odd about my example, and how should I expand it if it is lacking?
What is the expected behavior for [50] in your opinion, as I have a similar question in this example:

>>> i = Index([1.1, 2, 3])
>>> i.astype('int64')  # Just plain INT64
0    1
1    2
2    3
dtype: int64

jreback · 2017-01-15T23:29:59Z

What is odd about my example, and how should I expand it if it is lacking?

I find not naming columns as odd (in examples).

In [37]: pd.DataFrame({'A' : ['a', 'b', 'c']}, index=range(3))
Out[37]: 
   A
0  a
1  b
2  c

jreback · 2017-01-15T23:31:49Z

What is the expected behavior for [50] in your opinion, as I have a similar question in this example:

I would raise, that is an invalid conversion if negative values or nan's are present

In [39]: Series([np.nan, 2, 3]).astype('int64')
ValueError: Cannot convert non-finite values (NA or inf) to integer

In fact I'd make sure to test lots of astyping (similar to what we do with ints)
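The value shown in [50] equals -1 modulo 2**64, which suggests a silent wraparound. Below is a pure-Python sketch of the stricter behavior proposed above, where negative or non-finite inputs raise. `to_uint64_checked` is a hypothetical helper for illustration, not a pandas function.

```python
import math

UINT64_MAX = 2**64 - 1


def to_uint64_checked(values):
    """Convert numbers to ints, refusing anything uint64 cannot represent."""
    result = []
    for v in values:
        if isinstance(v, float) and not math.isfinite(v):
            raise ValueError(
                "Cannot convert non-finite values (NA or inf) to integer")
        if v < 0 or v > UINT64_MAX:
            raise ValueError("value %r is out of range for uint64" % (v,))
        result.append(int(v))
    return result


print((-1) % 2**64)                   # 18446744073709551615
print(to_uint64_checked([1.0, 3.0]))  # [1, 3]
```

Calling `to_uint64_checked([1.0, -1.0, 3.0])` raises `ValueError` instead of producing the wrapped value.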

gfyoung · 2017-01-15T23:49:26Z

@jreback : The point of my example in my response was to illustrate that astype is bugged far beyond anything that I could have added with UInt64Index. I think it is better to save patching that for another time, as the bug extends beyond uint64. I will update the whatsnew though.

gfyoung · 2017-01-16T02:55:31Z

@jreback : Updated whatsnew, and everything is still green.

jreback · 2017-01-17T13:55:06Z

thanks @gfyoung !

this is great.

can you canvass any issues that are referenced and make sure that they are closed? (ping if not).

jreback · 2017-01-17T14:44:11Z

https://travis-ci.org/pandas-dev/pandas/jobs/192687204

wonder if this is a cython cache issue.

jreback · 2017-01-17T15:12:43Z

https://travis-ci.org/pandas-dev/pandas/jobs/192687206#L523 in particular is suspicious

jreback · 2017-01-17T19:21:51Z

ok I cleared the cache on master it is built now: https://travis-ci.org/pandas-dev/pandas/builds/192687200

weird, maybe the .pxi files persist somehow, so if the .pyx files (e.g. algos.pyx or index.pyx) are not changed as well, the cython cache thinks nothing needs to be rebuilt.

gfyoung · 2017-01-18T05:09:47Z

@jreback : Awesome that doc-building issue got fixed on Travis. I haven't found any related issues yet, but I'll pick them / resolve them eventually as I've done with the read_csv issues 😄

1) Introduces and propagates `UInt64Index`, an index specifically for `uint`. xref pandas-dev#14935
2) <strike>Patches bug from pandas-dev#14916 that makes `maybe_convert_objects` robust against the known `numpy` bug that `uint64` cannot be compared to `int64`. This bug was caught during testing of `UInt64Index`.</strike>

**UPDATE**: Patched in pandas-dev#14951

Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#14937 from gfyoung/create-uint64-index and squashes the following commits:

8ab6fbd [gfyoung] ENH: Create and propagate UInt64Index

gfyoung mentioned this pull request Dec 21, 2016

BUG: Patch rank() uint64 behavior #14935

Closed

jreback added the Dtype Conversions, Enhancement, and Indexing labels Dec 21, 2016

jreback reviewed Dec 21, 2016


gfyoung mentioned this pull request Dec 22, 2016

BUG: Patch maybe_convert_objects uint64 handling #14951

Merged

gfyoung force-pushed the create-uint64-index branch from bb82670 to 12c6807 Compare December 22, 2016 08:03

gfyoung force-pushed the create-uint64-index branch 2 times, most recently from dde6312 to 22402d6 Compare December 22, 2016 08:45

jreback reviewed Dec 22, 2016


gfyoung force-pushed the create-uint64-index branch from 22402d6 to 4f59a12 Compare December 22, 2016 15:48

gfyoung force-pushed the create-uint64-index branch from 4f59a12 to 3a6508d Compare December 22, 2016 17:04

gfyoung force-pushed the create-uint64-index branch 2 times, most recently from cd9c6eb to 236fcf8 Compare December 24, 2016 22:03

gfyoung force-pushed the create-uint64-index branch 2 times, most recently from 0aaca35 to 452f56d Compare January 14, 2017 10:04

jreback reviewed Jan 14, 2017


gfyoung force-pushed the create-uint64-index branch from 452f56d to 488dfd7 Compare January 14, 2017 19:49

ENH: Create and propagate UInt64Index

8ab6fbd

gfyoung force-pushed the create-uint64-index branch from 488dfd7 to 8ab6fbd Compare January 16, 2017 01:27

jreback mentioned this pull request Jan 17, 2017

COMPAT: create UInt64Block #15145

Closed

jreback closed this in 362e78d Jan 17, 2017

jreback added this to the 0.20.0 milestone Jan 17, 2017

gfyoung deleted the create-uint64-index branch January 18, 2017 05:10

jreback mentioned this pull request Jan 18, 2017

CI: invalidate cache when .pxi.in only changes #15154

Closed
