New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG, TST: Check uint64 behaviour in algorithms.py #14934

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
3 participants
@gfyoung
Member

gfyoung commented Dec 20, 2016

First of a series of PR's to patch and test uint64 behaviour in core/algorithms.py. In this PR, the following functions are checked:

  1. duplicated() : robust but now has test to confirm
  2. mode() : robust but now has test to confirm
  3. unique() : non-robust but patched and tested
@@ -155,6 +155,18 @@ def is_integer_dtype(arr_or_dtype):
not issubclass(tipo, (np.datetime64, np.timedelta64)))
def is_signed_integer_dtype(arr_or_dtype):
tipo = _get_dtype_type(arr_or_dtype)
return (issubclass(tipo, np.signedinteger) and

This comment has been minimized.

@jreback

jreback Dec 20, 2016

Contributor

ok to add to public api (pandas.api)

@jreback

jreback Dec 20, 2016

Contributor

ok to add to public api (pandas.api)

This comment has been minimized.

@gfyoung

gfyoung Dec 21, 2016

Member

Yep, done.

@gfyoung

gfyoung Dec 21, 2016

Member

Yep, done.

Show outdated Hide outdated pandas/core/algorithms.py
@@ -508,6 +512,7 @@ def duplicated(values, keep='first'):
duplicated : ndarray
"""
values = com._asarray_tuplesafe(values)

This comment has been minimized.

@jreback

jreback Dec 20, 2016

Contributor

hmm, why are you doing this?

@jreback

jreback Dec 20, 2016

Contributor

hmm, why are you doing this?

This comment has been minimized.

@gfyoung

gfyoung Dec 21, 2016

Member

Your comment below answers this question. Fair enough. Will remove.

@gfyoung

gfyoung Dec 21, 2016

Member

Your comment below answers this question. Fair enough. Will remove.

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

yep I figure out why you are doing it (but still not sure its necessary). I'd rather have a more strict requirement here for internal API's (unless good reason otherwise)

@jreback

jreback Dec 21, 2016

Contributor

yep I figure out why you are doing it (but still not sure its necessary). I'd rather have a more strict requirement here for internal API's (unless good reason otherwise)

This comment has been minimized.

@gfyoung

gfyoung Dec 21, 2016

Member

Yeah, that's perfectly reasonable. I'll update the doc to make it clear to avoid confusion.

@gfyoung

gfyoung Dec 21, 2016

Member

Yeah, that's perfectly reasonable. I'll update the doc to make it clear to avoid confusion.

Show outdated Hide outdated pandas/core/algorithms.py
@@ -548,6 +567,9 @@ def mode(values):
dtype = values.dtype
if is_integer_dtype(values):
# This also works if dtype is uint64 because there is a 1-1
# correspondence between int64 and uint64, and we will cast
# all elements back to their correct uint64 values.
values = _ensure_int64(values)

This comment has been minimized.

@jreback

jreback Dec 20, 2016

Contributor

I would rather make this less magical and just select based on dtype. A read will be confused by this; better to do it on dtype.

@jreback

jreback Dec 20, 2016

Contributor

I would rather make this less magical and just select based on dtype. A read will be confused by this; better to do it on dtype.

Show outdated Hide outdated pandas/tests/test_algos.py
@@ -772,6 +779,13 @@ def test_unique_index(self):
tm.assert_numpy_array_equal(case.duplicated(),
np.array([False, False, False]))
def test_duplicated_non_dtype(self):
# Make sure we can pass in array-likes with no dtype attribute.
cases = [[1, 2, 3], tuple([1, 2, 3])]

This comment has been minimized.

@jreback

jreback Dec 20, 2016

Contributor

I don't think we need to support this. These are internal functions and you can expect a ndarray-like (and not a plain list). factorize is public and an exception.

@jreback

jreback Dec 20, 2016

Contributor

I don't think we need to support this. These are internal functions and you can expect a ndarray-like (and not a plain list). factorize is public and an exception.

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Dec 21, 2016

Current coverage is 84.65% (diff: 100%)

Merging #14934 into master will increase coverage by <.01%

@@             master     #14934   diff @@
==========================================
  Files           144        144          
  Lines         51030      51043    +13   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43199      43213    +14   
+ Misses         7831       7830     -1   
  Partials          0          0          

Powered by Codecov. Last update 74de478...6d31057

codecov-io commented Dec 21, 2016

Current coverage is 84.65% (diff: 100%)

Merging #14934 into master will increase coverage by <.01%

@@             master     #14934   diff @@
==========================================
  Files           144        144          
  Lines         51030      51043    +13   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43199      43213    +14   
+ Misses         7831       7830     -1   
  Partials          0          0          

Powered by Codecov. Last update 74de478...6d31057

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 21, 2016

Member

@jreback : Tests are green. Ready to merge if there is nothing else.

Member

gfyoung commented Dec 21, 2016

@jreback : Tests are green. Ready to merge if there is nothing else.

@jreback jreback added this to the 0.20.0 milestone Dec 21, 2016

Show outdated Hide outdated pandas/hashtable.pyx
@@ -236,6 +236,41 @@ def mode_int64(int64_t[:] values):
@cython.wraparound(False)
@cython.boundscheck(False)
def mode_uint64(uint64_t[:] values):

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

can we move these to tempita? (for mode)

@jreback

jreback Dec 21, 2016

Contributor

can we move these to tempita? (for mode)

This comment has been minimized.

@gfyoung

gfyoung Dec 21, 2016

Member

Certainly. Done.

@gfyoung

gfyoung Dec 21, 2016

Member

Certainly. Done.

Show outdated Hide outdated pandas/tests/test_algos.py
@@ -1202,6 +1209,20 @@ def test_int64_add_overflow():
b_mask=np.array([False, True]))
def test_mode_uint64():

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

can you add a test class (and consolidate the tests)

@jreback

jreback Dec 21, 2016

Contributor

can you add a test class (and consolidate the tests)

This comment has been minimized.

@gfyoung

gfyoung Dec 21, 2016

Member

Done.

@gfyoung

gfyoung Dec 21, 2016

Member

Done.

Show outdated Hide outdated pandas/tests/test_algos.py
@@ -1202,6 +1209,57 @@ def test_int64_add_overflow():
b_mask=np.array([False, True]))
class TestMode(tm.TestCase):
def test_basic(self):

This comment has been minimized.

@jreback

jreback Dec 21, 2016

Contributor

looks like there is a test for mode in series/test_analytics.py can you make sure that we are looping over all integer, uint, float32, float64, and datelike types? (most are there, just making sure).

Also ok with moving all the test_mode test from series and putting them here (but then, leave the original test with a skip and a comment that says these are tested in algorithms).

alternatively, put a note here that these are tested in the series routine.

either one is ok.

@jreback

jreback Dec 21, 2016

Contributor

looks like there is a test for mode in series/test_analytics.py can you make sure that we are looping over all integer, uint, float32, float64, and datelike types? (most are there, just making sure).

Also ok with moving all the test_mode test from series and putting them here (but then, leave the original test with a skip and a comment that says these are tested in algorithms).

alternatively, put a note here that these are tested in the series routine.

either one is ok.

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

I duplicated tests between Series and algos but also wrote additional tests for algos to account for other array-like objects. I see no reason to couple testing together.

@gfyoung

gfyoung Dec 22, 2016

Member

I duplicated tests between Series and algos but also wrote additional tests for algos to account for other array-like objects. I see no reason to couple testing together.

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Dec 22, 2016

Member

@jreback : Addressed all comments, and everything is green and passing. Ready to merge if there are no other concerns.

Member

gfyoung commented Dec 22, 2016

@jreback : Addressed all comments, and everything is green and passing. Ready to merge if there are no other concerns.

Show outdated Hide outdated pandas/core/algorithms.py
@@ -547,10 +566,12 @@ def mode(values):
constructor = Series
dtype = values.dtype
if is_integer_dtype(values):
if is_signed_integer_dtype(values):
values = _ensure_int64(values)
result = constructor(sorted(htable.mode_int64(values)), dtype=dtype)

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

side note, not sure whey sorted is used here, this does a conversion to list (then back when it is constructed).

can you do separate PR to fix this? (use np.sort)

@jreback

jreback Dec 22, 2016

Contributor

side note, not sure whey sorted is used here, this does a conversion to list (then back when it is constructed).

can you do separate PR to fix this? (use np.sort)

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

and it can be at the end of the function (and actually you can move the boiler plate to the end of the function)
(e.g. the construct / sort)

@jreback

jreback Dec 22, 2016

Contributor

and it can be at the end of the function (and actually you can move the boiler plate to the end of the function)
(e.g. the construct / sort)

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

or can you do in this PR

@jreback

jreback Dec 22, 2016

Contributor

or can you do in this PR

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

And how is that any different from np.sort, which converts to ndarray before we then convert to Series? We might as well just use Series.sort_values then?

@gfyoung

gfyoung Dec 22, 2016

Member

And how is that any different from np.sort, which converts to ndarray before we then convert to Series? We might as well just use Series.sort_values then?

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

its not, yes you could do a .sort_values() (I just don't want sorted)

@jreback

jreback Dec 22, 2016

Contributor

its not, yes you could do a .sort_values() (I just don't want sorted)

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

Can't quite move to the end (categorical doesn't sort). However, can just replace sorted with np.sort in the code.

@gfyoung

gfyoung Dec 22, 2016

Member

Can't quite move to the end (categorical doesn't sort). However, can just replace sorted with np.sort in the code.

Show outdated Hide outdated pandas/src/hashtable_func_helper.pxi.in
table = kh_init_{{table_type}}()
{{if dtype == 'object'}}
build_count_table_{{dtype}}(values, mask, table)

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

hmm, any reason NOT to make the signatures the same for build_count_table_*?

@jreback

jreback Dec 22, 2016

Contributor

hmm, any reason NOT to make the signatures the same for build_count_table_*?

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

build_count_table_object takes a mask argument, whereas the others do null detection internally and take a dropna parameter instead. It might be possible to standardize, but as all of these functions are tied to Python-facing functions (e.g. value_counts), I don't think it would be a good idea to investigate in this PR.

@gfyoung

gfyoung Dec 22, 2016

Member

build_count_table_object takes a mask argument, whereas the others do null detection internally and take a dropna parameter instead. It might be possible to standardize, but as all of these functions are tied to Python-facing functions (e.g. value_counts), I don't think it would be a good idea to investigate in this PR.

This comment has been minimized.

@jreback

jreback Dec 22, 2016

Contributor

hmm, ok, can you make an issue about this then?

this PR still need some work, so might be simpler to do that first FYI.

@jreback

jreback Dec 22, 2016

Contributor

hmm, ok, can you make an issue about this then?

this PR still need some work, so might be simpler to do that first FYI.

This comment has been minimized.

@gfyoung

gfyoung Dec 22, 2016

Member

Actually, it doesn't seem as difficult as I thought previous. I'm going to try to make the change here in this PR.

@gfyoung

gfyoung Dec 22, 2016

Member

Actually, it doesn't seem as difficult as I thought previous. I'm going to try to make the change here in this PR.

This comment has been minimized.

@jreback

jreback Dec 23, 2016

Contributor

ok, lmk then.

@jreback

jreback Dec 23, 2016

Contributor

ok, lmk then.

This comment has been minimized.

@gfyoung

gfyoung Dec 23, 2016

Member

So looking into this further, I think I understand why object was special-cased from the rest. It appears that Python's hashing for object (PyObject_Hash) does not work very well with null values. Thus, when you insert a value like nan into such a hashtable multiple times, they will be interpreted as different keys, which is obviously wrong.

I have successfully refactored the code to align signatures nonetheless.

@gfyoung

gfyoung Dec 23, 2016

Member

So looking into this further, I think I understand why object was special-cased from the rest. It appears that Python's hashing for object (PyObject_Hash) does not work very well with null values. Thus, when you insert a value like nan into such a hashtable multiple times, they will be interpreted as different keys, which is obviously wrong.

I have successfully refactored the code to align signatures nonetheless.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 22, 2016

Contributor

few comments

Contributor

jreback commented Dec 22, 2016

few comments

DOC, TST, BUG: Improve uint64 core/algos behavior
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
gh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.

@jreback jreback closed this in 5710947 Dec 23, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 23, 2016

Contributor

thanks!

Contributor

jreback commented Dec 23, 2016

thanks!

@gfyoung gfyoung deleted the gfyoung:core-algorithms-uint64 branch Dec 23, 2016

ShaharBental added a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016

BUG, TST: Check uint64 behaviour in algorithms.py
First of a series of PR's to patch and test `uint64` behaviour in
`core/algorithms.py`.  In this PR, the following functions are
checked:    1. `duplicated()` : robust but now has test to confirm  2.
`mode()` : robust but now has test to confirm  3. `unique()` : non-
robust but patched and tested

Author: gfyoung <gfyoung17@gmail.com>

Closes #14934 from gfyoung/core-algorithms-uint64 and squashes the following commits:

6d31057 [gfyoung] DOC, TST, BUG: Improve uint64 core/algos behavior

jreback added a commit that referenced this pull request Jan 22, 2017

BUG, TST: Patch handling of uint64 in algorithms
1) Patches handling of `uint64` in `value_counts`
2) Adds tests for handling of `uint64` in `factorize`

xref #14934

Author: gfyoung <gfyoung17@gmail.com>

Closes #15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits:

1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize

AnkurDedania added a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017

BUG, TST: Patch handling of uint64 in algorithms
1) Patches handling of `uint64` in `value_counts`
2) Adds tests for handling of `uint64` in `factorize`

xref #14934

Author: gfyoung <gfyoung17@gmail.com>

Closes #15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits:

1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment