New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Prevent uint64 overflow in Series.unique #14915

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
4 participants
@gfyoung
Member

gfyoung commented Dec 19, 2016

Introduces a UInt64HashTable class to hash uint64 elements and prevent overflow in functions like Series.unique.

Closes #14721.

@sinhrks sinhrks added Algos Bug labels Dec 19, 2016

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Dec 19, 2016

Current coverage is 84.65% (diff: 100%)

Merging #14915 into master will increase coverage by <.01%

@@             master     #14915   diff @@
==========================================
  Files           144        144          
  Lines         51016      51020     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43184      43189     +5   
+ Misses         7832       7831     -1   
  Partials          0          0          

Powered by Codecov. Last update 3ccb501...380c580

codecov-io commented Dec 19, 2016

Current coverage is 84.65% (diff: 100%)

Merging #14915 into master will increase coverage by <.01%

@@             master     #14915   diff @@
==========================================
  Files           144        144          
  Lines         51016      51020     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43184      43189     +5   
+ Misses         7832       7831     -1   
  Partials          0          0          

Powered by Codecov. Last update 3ccb501...380c580

Show outdated Hide outdated pandas/tests/test_base.py
np.arange(len(xs), dtype=np.int64))
def test_get_unique(self):
s = pd.Series([1, 2, 2**63, 2**63], dtype=np.uint64)

This comment has been minimized.

@jreback

jreback Dec 19, 2016

Contributor

i think needs a bit more testing
most test are in algos for things like this

@jreback

jreback Dec 19, 2016

Contributor

i think needs a bit more testing
most test are in algos for things like this

This comment has been minimized.

@gfyoung

gfyoung Dec 19, 2016

Member

"Needs a bit more testing" is vague. What sort of cases are you thinking of? Secondly, the affected function has no relation to algos. The code path is completely different.

However, you did catch a bug in pd.unique, which does travel through algorithms.py and can now benefit from my hashtable. 😄

@gfyoung

gfyoung Dec 19, 2016

Member

"Needs a bit more testing" is vague. What sort of cases are you thinking of? Secondly, the affected function has no relation to algos. The code path is completely different.

However, you did catch a bug in pd.unique, which does travel through algorithms.py and can now benefit from my hashtable. 😄

This comment has been minimized.

@jreback

jreback Dec 19, 2016

Contributor

not on the hashtable itself. ideally you should add uint64 to lots of places that we test int64 (construction type things). IOW, to actually exercise this code. I know this is vague. happy to merge this and to do a followup with more tests (and potentially breaking things). later.

@jreback

jreback Dec 19, 2016

Contributor

not on the hashtable itself. ideally you should add uint64 to lots of places that we test int64 (construction type things). IOW, to actually exercise this code. I know this is vague. happy to merge this and to do a followup with more tests (and potentially breaking things). later.

This comment has been minimized.

@gfyoung

gfyoung Dec 19, 2016

Member

Okay, well that's a larger issue beyond this PR. However, let's not merge until I patch pd.unique, which suffers from a similar issue as Series.unique.

@gfyoung

gfyoung Dec 19, 2016

Member

Okay, well that's a larger issue beyond this PR. However, let's not merge until I patch pd.unique, which suffers from a similar issue as Series.unique.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 19, 2016

Contributor

you will need to rebase after #14919. I moved a couple of things, and added a HashTable testing section.

Though what I would really like is to systematically add unit64 to as many places as we test other dtypes and see what breaks / falls out (can do in another PR). Then create issues for incompat places.

Contributor

jreback commented Dec 19, 2016

you will need to rebase after #14919. I moved a couple of things, and added a HashTable testing section.

Though what I would really like is to systematically add unit64 to as many places as we test other dtypes and see what breaks / falls out (can do in another PR). Then create issues for incompat places.

@gfyoung gfyoung closed this Dec 19, 2016

@gfyoung gfyoung deleted the gfyoung:uint64-hashtable-patch branch Dec 19, 2016

@gfyoung gfyoung restored the gfyoung:uint64-hashtable-patch branch Dec 19, 2016

@gfyoung gfyoung reopened this Dec 19, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 19, 2016

Contributor

what was the bug (from before I moved unique)?

Contributor

jreback commented Dec 19, 2016

what was the bug (from before I moved unique)?

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 19, 2016

Member

@jreback : The diff explains it all. We were smashing all integer types together into one hashtable, which does not work if you have uint.

Member

gfyoung commented Dec 19, 2016

@jreback : The diff explains it all. We were smashing all integer types together into one hashtable, which does not work if you have uint.

@jreback jreback added this to the 0.20.0 milestone Dec 19, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 19, 2016

Contributor

ok, lgtm. ping on green.

Contributor

jreback commented Dec 19, 2016

ok, lgtm. ping on green.

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 19, 2016

Member

@jreback : Sounds good. I changed my mind about changing pd.unique in this PR. core.algorithms is riddled with uint64 bugs that need to be checked. I'll tackle in a follow-up.

Member

gfyoung commented Dec 19, 2016

@jreback : Sounds good. I changed my mind about changing pd.unique in this PR. core.algorithms is riddled with uint64 bugs that need to be checked. I'll tackle in a follow-up.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 19, 2016

Contributor

@jreback : Sounds good. I changed my mind about changing pd.unique in this PR. core.algorithms is riddled with uint64 bugs that need to be checked. I'll tackle in a follow-up.

totally fine. we can make a master issue of these if that is helpful.

Contributor

jreback commented Dec 19, 2016

@jreback : Sounds good. I changed my mind about changing pd.unique in this PR. core.algorithms is riddled with uint64 bugs that need to be checked. I'll tackle in a follow-up.

totally fine. we can make a master issue of these if that is helpful.

@gfyoung gfyoung closed this Dec 20, 2016

BUG: Prevent uint64 overflow in Series.unique
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes gh-14721.

@gfyoung gfyoung reopened this Dec 20, 2016

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 20, 2016

Member

@jreback : Everything is green, so ready to merge if there are no other concerns.

Member

gfyoung commented Dec 20, 2016

@jreback : Everything is green, so ready to merge if there are no other concerns.

@jreback jreback closed this in b35c689 Dec 20, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 20, 2016

Contributor

thanks!

as I said, ideally have an issue which greatly expands test coverage for uint64 (and just mark the tests that are failing (e.g. assert that they are failing or comment out)).

Then can come back around and fix.

Contributor

jreback commented Dec 20, 2016

thanks!

as I said, ideally have an issue which greatly expands test coverage for uint64 (and just mark the tests that are failing (e.g. assert that they are failing or comment out)).

Then can come back around and fix.

@gfyoung gfyoung deleted the gfyoung:uint64-hashtable-patch branch Dec 20, 2016

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 20, 2016

BUG: Use UInt64HashTable in pd.unique
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
gh-14915).

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 20, 2016

BUG: Use UInt64HashTable in pd.unique
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
gh-14915).

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 20, 2016

BUG: Use UInt64HashTable in pd.unique
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
gh-14915).

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 20, 2016

BUG: Use UInt64HashTable in pd.unique
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
gh-14915).

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 21, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 21, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 21, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 22, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 22, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 22, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 22, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

gfyoung added a commit to gfyoung/pandas that referenced this pull request Dec 23, 2016

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

jreback added a commit that referenced this pull request Dec 24, 2016

BUG: Patch rank() uint64 behavior
Adds `uint64` ranking functions to `algos.pyx` to allow for proper
ranking with `uint64`.    Also introduces partial patch for
`factorize()` by adding `uint64` hashtables and vectors for  usage.
However, this patch is only partial because the larger bug of non-
support for `uint64`  in `Index` has not been fixed (**UPDATE**:
tackled in #14937):  ~~~python  >>> from pandas import Index, np  >>>
Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')  ~~~    Also patches
a bug in `UInt64HashTable` from #14915 that had an erroneous null
condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior

ShaharBental added a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016

BUG: Prevent uint64 overflow in Series.unique
Introduces a `UInt64HashTable` class to hash `uint64` elements and
prevent overflow in functions like `Series.unique`.    Closes #14721.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14915 from gfyoung/uint64-hashtable-patch and squashes the following commits:

380c580 [gfyoung] BUG: Prevent uint64 overflow in Series.unique

ShaharBental added a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016

BUG: Patch rank() uint64 behavior
Adds `uint64` ranking functions to `algos.pyx` to allow for proper
ranking with `uint64`.    Also introduces partial patch for
`factorize()` by adding `uint64` hashtables and vectors for  usage.
However, this patch is only partial because the larger bug of non-
support for `uint64`  in `Index` has not been fixed (**UPDATE**:
tackled in #14937):  ~~~python  >>> from pandas import Index, np  >>>
Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')  ~~~    Also patches
a bug in `UInt64HashTable` from #14915 that had an erroneous null
condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment