New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series.unique converts uint64 to int64 (with overflow) #14721

Closed
dnspies opened this Issue Nov 23, 2016 · 16 comments

Comments

Projects
None yet
3 participants
@dnspies

dnspies commented Nov 23, 2016

>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1,2,2**63], dtype=np.dtype('uint64'))
>>> s
0                      1
1                      2
2    9223372036854775808
dtype: uint64
>>> s.unique()
array([                   1,                    2, -9223372036854775808])

Series.unique should preserve whatever type the values are. But instead it changes from uint64 to int64.

INSTALLED VERSIONS

...
python: 3.4.3.final.0
python-bits: 64
...

...
pandas: 0.17.1
numpy: 1.10.4
...

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

this is a sympton of #4471 in general uint64 support is not great. PR's are welcome to fix that!

Contributor

jreback commented Nov 23, 2016

this is a sympton of #4471 in general uint64 support is not great. PR's are welcome to fix that!

@jreback jreback added this to the Next Major Release milestone Nov 23, 2016

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : IIUC, your statement is incorrect. The bug traces back as follows:

When you call Series.unique(), it calls the super-class implementation here. That method then calls unique1d() here. The erroneous check is here, which catches np.uint64 but then incorrectly converts them to int64.

The correct patch I think is to change the check to np.signedinteger and then have a separate one for np.unsignedinteger. However, I cannot find a proper implementation of a uint64 hashtable. In addition, things seem to get insanely complicated as we move to hashtable.pyx (and related helper .pxi.in files) khash.pxd, and khash.h, which setup all of the hashtable functionality.

The easy way out but less desirable way I think is to modify this branch here and convert the resulting array back to the original dtype.

Thoughts?

Member

gfyoung commented Nov 23, 2016

@jreback : IIUC, your statement is incorrect. The bug traces back as follows:

When you call Series.unique(), it calls the super-class implementation here. That method then calls unique1d() here. The erroneous check is here, which catches np.uint64 but then incorrectly converts them to int64.

The correct patch I think is to change the check to np.signedinteger and then have a separate one for np.unsignedinteger. However, I cannot find a proper implementation of a uint64 hashtable. In addition, things seem to get insanely complicated as we move to hashtable.pyx (and related helper .pxi.in files) khash.pxd, and khash.h, which setup all of the hashtable functionality.

The easy way out but less desirable way I think is to modify this branch here and convert the resulting array back to the original dtype.

Thoughts?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

@gfyoung I was pointing to a 'generic' issue with uint64. it doesn't work in lots of places!

Contributor

jreback commented Nov 23, 2016

@gfyoung I was pointing to a 'generic' issue with uint64. it doesn't work in lots of places!

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : Ah, fair enough. But thoughts about this?

Member

gfyoung commented Nov 23, 2016

@jreback : Ah, fair enough. But thoughts about this?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

However, I cannot find a proper implementation of a uint64 hashtable. In addition, things seem to get insanely complicated as we move to hashtable.pyx (and related helper .pxi.in files) khash.pxd, and khash.h, which setup all of the hashtable functionality.

hmm

Contributor

jreback commented Nov 23, 2016

However, I cannot find a proper implementation of a uint64 hashtable. In addition, things seem to get insanely complicated as we move to hashtable.pyx (and related helper .pxi.in files) khash.pxd, and khash.h, which setup all of the hashtable functionality.

hmm

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : If someone knows how to navigate through that swamp of hashtable code, then by all means, patch it! 😄 But yes, it's just a matter of adding another hashtable class.

Member

gfyoung commented Nov 23, 2016

@jreback : If someone knows how to navigate through that swamp of hashtable code, then by all means, patch it! 😄 But yes, it's just a matter of adding another hashtable class.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

you can use python hashing, no?

In [9]: hash(np.uint64(2**63+1))
Out[9]: 5

In [10]: hash(np.int64(2**63-1))
Out[10]: 3
Contributor

jreback commented Nov 23, 2016

you can use python hashing, no?

In [9]: hash(np.uint64(2**63+1))
Out[9]: 5

In [10]: hash(np.int64(2**63-1))
Out[10]: 3
@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : Perhaps, but then what's going on with this entire hashtable class? Why was it implemented that way if Python hashing is feasible?

Member

gfyoung commented Nov 23, 2016

@jreback : Perhaps, but then what's going on with this entire hashtable class? Why was it implemented that way if Python hashing is feasible?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

prob not that hard to write some klib type stuff in cython

Contributor

jreback commented Nov 23, 2016

prob not that hard to write some klib type stuff in cython

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

klib is >> faster than python for the types of things we do (and doesn't have as much overhead for pyobjects)

Contributor

jreback commented Nov 23, 2016

klib is >> faster than python for the types of things we do (and doesn't have as much overhead for pyobjects)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

or just add to the klib stuff (might be easier)

Contributor

jreback commented Nov 23, 2016

or just add to the klib stuff (might be easier)

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : I think adding to the klib is the best solution, but I don't really understand how all of it works.

Member

gfyoung commented Nov 23, 2016

@jreback : I think adding to the klib is the best solution, but I don't really understand how all of it works.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

me neither :<

Contributor

jreback commented Nov 23, 2016

me neither :<

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

its just template stuff in a header only library

Contributor

jreback commented Nov 23, 2016

its just template stuff in a header only library

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

you can prob copy int64 and just do a replace. (maybe)

Contributor

jreback commented Nov 23, 2016

you can prob copy int64 and just do a replace. (maybe)

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : True that file is, but I have no idea how it gets populated.

Member

gfyoung commented Nov 23, 2016

@jreback : True that file is, but I have no idea how it gets populated.

gfyoung added a commit to gfyoung/pandas that referenced this issue Dec 19, 2016

BUG: Prevent uint64 overflow in Series.unique
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow.

Closes gh-14721.

gfyoung added a commit to gfyoung/pandas that referenced this issue Dec 19, 2016

BUG: Prevent uint64 overflow in Series.unique
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes gh-14721.

gfyoung added a commit to gfyoung/pandas that referenced this issue Dec 19, 2016

BUG: Prevent uint64 overflow in Series.unique
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes gh-14721.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Dec 19, 2016

gfyoung added a commit to gfyoung/pandas that referenced this issue Dec 20, 2016

BUG: Prevent uint64 overflow in Series.unique
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes gh-14721.

@jreback jreback closed this in b35c689 Dec 20, 2016

ShaharBental added a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016

BUG: Prevent uint64 overflow in Series.unique
Introduces a `UInt64HashTable` class to hash `uint64` elements and
prevent overflow in functions like `Series.unique`.    Closes #14721.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14915 from gfyoung/uint64-hashtable-patch and squashes the following commits:

380c580 [gfyoung] BUG: Prevent uint64 overflow in Series.unique
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment