BUG: .nlargest with unsigned integers #21426

Closed
eoincondron opened this Issue Jun 11, 2018 · 13 comments

@eoincondron

eoincondron commented Jun 11, 2018

Code Sample, a copy-pastable example if possible

pd.Series(np.array([0, 0, 0, 100, 1000, 10000, 100], dtype='uint32')).nlargest(5) 
0        0
1        0
2        0
5    10000
4     1000
dtype: uint32

Problem description

nlargest favours 0 over positive values. This affects both the uint32 and uint64 dtypes, and possibly other unsigned integer types.

Expected Output

5    10000
4     1000
3      100
6      100
0        0
dtype: uint32

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung

Member

gfyoung commented Jun 11, 2018

How strange! Let's figure out what's going on there...

@jschendel

Member

jschendel commented Jun 11, 2018

I suspect the issue is with this block of code:

arr, _, _ = _ensure_data(dropped.values)
if method == 'nlargest':
    arr = -arr

Specifically for uint data, I don't think -arr behaves as intended:

In [2]: arr = np.array([0, 0, 0, 100, 1000, 10000, 100], dtype='uint64')

In [3]: -arr
Out[3]:
array([                   0,                    0,                    0,
       18446744073709551516, 18446744073709550616, 18446744073709541616,
       18446744073709551516], dtype=uint64)
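
In other words, NumPy negates unsigned arrays modulo 2**64: every positive x maps to 2**64 - x, while 0 maps to itself, so the transform is not order-reversing. A minimal sketch of the wraparound:

import numpy as np

arr = np.array([0, 100, 1000], dtype='uint64')
neg = -arr  # wraps silently: x -> (2**64 - x) % 2**64

# 0 maps to itself while positive values wrap to huge numbers,
# so 0 is the smallest element of the negated array and an
# ascending selection ranks it as the "largest" original value.
assert neg[0] == 0
assert neg[1] == 2**64 - 100
assert np.argsort(neg)[0] == 0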

@gfyoung gfyoung added the Algos label Jun 11, 2018

@gfyoung

Member

gfyoung commented Jun 11, 2018

@jschendel : That would make sense. A hacky solution is to cast to int64, but then we lose half of the uint64 range, which isn't ideal. It should be possible to do this without the hack.
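
To illustrate why the cast is lossy: any uint64 value at or above 2**63 has no int64 representation and wraps to a negative number, corrupting the ordering for exactly those values. A quick sketch:

import numpy as np

big = np.array([2**63 - 1, 2**63, 2**64 - 1], dtype='uint64')
print(big.astype('int64'))
# [ 9223372036854775807 -9223372036854775808                   -1]
# the upper half of the uint64 range reinterprets as negatives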

@jschendel

Member

jschendel commented Jun 11, 2018

I'm not all that well-versed in uint operations, but does simply doing -a - 1 suffice for the uint case? It looks like -a - 1 properly reverses the order, which appears to be the intention of the code:

In [2]: a = np.array([0, 1, 18446744073709551615], dtype='uint64')

In [3]: a
Out[3]: array([                   0,                    1, 18446744073709551615], dtype=uint64)

In [4]: -a
Out[4]: array([                   0, 18446744073709551615,                    1], dtype=uint64)

In [5]: -a - 1
Out[5]: array([18446744073709551615, 18446744073709551614,                    0], dtype=uint64)
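
For unsigned dtypes, -a - 1 is exactly the bitwise complement ~a, i.e. x -> 2**64 - 1 - x, a strictly decreasing bijection on the full uint64 range, so an ascending sort of the transformed values matches a descending sort of the originals. A quick sanity check:

import numpy as np

a = np.array([0, 1, 7, 2**64 - 1], dtype='uint64')
assert np.array_equal(-a - 1, ~a)  # same transform, two spellings

# ascending order on the transformed values reverses the original order
order = np.argsort(-a - 1)
assert np.array_equal(a[order], np.sort(a)[::-1])
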
@gfyoung

Member

gfyoung commented Jun 11, 2018

Hmm...that actually might work, both for the uint and int cases in fact. Definitely would need to test the boundary cases for various (u)int dtypes to ensure correctness.
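
A hypothetical boundary-case check along those lines (the test name and parametrization below are illustrative, not taken from the pandas test suite):

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize('dtype', ['int8', 'int16', 'int32', 'int64',
                                   'uint8', 'uint16', 'uint32', 'uint64'])
def test_nlargest_boundary(dtype):
    # distinct values at and near both extremes of the dtype
    info = np.iinfo(dtype)
    s = pd.Series([info.min, info.min + 1, info.max - 1, info.max],
                  dtype=dtype)
    result = s.nlargest(2)
    expected = s.sort_values(ascending=False).head(2)
    pd.testing.assert_series_equal(result, expected)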

@gfyoung gfyoung added this to the 0.23.2 milestone Jun 11, 2018

@gfyoung

Member

gfyoung commented Jun 11, 2018

Marking this for 0.23.2, as this is certainly patchable in that time frame (I would be able to patch it if it isn't resolved by then).

@alimcmaster1

Contributor

alimcmaster1 commented Jun 11, 2018

Hi! I would like to pick this up if this is a good first issue. (Feel free to assign it to me.)

@gfyoung

Member

gfyoung commented Jun 11, 2018

@alimcmaster1 : Go for it!

@jschendel

Member

jschendel commented Jun 11, 2018

I'm actually just about to put a fix in for this.

@gfyoung

Member

gfyoung commented Jun 11, 2018

@jschendel : Thanks for letting us know!

@alimcmaster1 : Sorry about that. 😞 But you're more than welcome to review the PR that @jschendel puts up soon-ish.

@alimcmaster1

Contributor

alimcmaster1 commented Jun 11, 2018

No worries! Thanks, can do :)

@jschendel

Member

jschendel commented Jun 11, 2018

Created the PR. As another example, note that this fails for int64 on 0.23.0 as well:

In [2]: pd.__version__
Out[2]: '0.23.0'

In [3]: s = pd.Series([-9223372036854775808, 0, 9223372036854775807])

In [4]: s
Out[4]:
0   -9223372036854775808
1                      0
2    9223372036854775807
dtype: int64

In [5]: s.nlargest(2)
Out[5]:
0   -9223372036854775808
2    9223372036854775807
dtype: int64
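
The int64 failure has the same root cause: np.iinfo('int64').min negates to itself, because 2**63 is not representable in int64, so after arr = -arr the minimum still sorts first. A sketch of the wraparound:

import numpy as np

arr = np.array([-2**63, 0, 2**63 - 1], dtype='int64')
neg = -arr  # -(-2**63) overflows and wraps back to -2**63
print(neg)
# [-9223372036854775808                    0 -9223372036854775807]
# the minimum is still the smallest value after negation, so an
# ascending selection reports it as the largest original value
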
@gfyoung

Member

gfyoung commented Jun 11, 2018

Sigh...that's symptomatic of the same overflow issue seen with uint. Good catch!

@jreback jreback changed the title from `nlargest` bug with unsigned integers to BUG: .nlargest with unsigned integers Jun 12, 2018
