Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

`pandas.Categorical.from_codes` incorrectly converts NaN codes to 0. #21767

Closed
miker985 opened this issue Jul 6, 2018 · 4 comments

Comments

Projects
None yet
3 participants
@miker985
Copy link
Contributor

commented Jul 6, 2018

Code Sample, a copy-pastable example if possible

import pandas


def test_categorical_from_codes():
    "Demonstrate issue with pandas.Categorical.from_codes."
    codes = pandas.Series([1, 2, float('NaN')])
    categories = ['foo', 'bar', 'baz']
    c = pandas.Categorical.from_codes(codes, categories, ordered=True)

    expected = pandas.Categorical.from_codes([1, 2, -1], categories,
                                             ordered=True)

    pandas.util.testing.assert_categorical_equal(c, expected)

Problem description

pandas.Categorical.from_codes is incorrectly coercing NaN code values into 0 values. I think it should be raising a ValueError. I believe this is because pandas._libs.algos.ensure_int8/16/32/64 is behaving in an unanticipated manner on numpy arrays.

The call chain is:

  1. pandas.core.arrays.categorical.Categorical.from_codes
  2. pandas.core.dtypes.cast.coerce_indexer_dtype
  3. pandas._libs.algos.ensure_int8/16/32/64 (aliased to pandas.core.dtypes.common.ensure_int8/16/32/64)

When given a single value all the ensure functions behave correctly. But when given a numpy.array they do not.

import pandas
import pytest

def test_ensure_on_value():
    """
    Test that ensure_int functions raise a ValueError when converting a NaN.

    These tests ALL PASS, and should pass.
    """
    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int8(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int16(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int32(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int64(float('NaN'))


def test_ensure_on_array():
    """
    These tests ALL PASS.

    I assume all of these are supposed to raise ValueErrors.

    Interestingly I did not get 0 values with ensure_int32 and ensure_int64.
    """
    a8 = numpy.array([float('NaN')])
    c8 = pandas._libs.algos.ensure_int8(a8)
    numpy.testing.assert_array_equal(c8, numpy.array([0]))

    a16 = numpy.array([float('NaN')])
    c16 = pandas._libs.algos.ensure_int16(a16)
    numpy.testing.assert_array_equal(c16, numpy.array([0]))

    a32 = numpy.array([float('NaN')])
    c32 = pandas._libs.algos.ensure_int32(a32)
    numpy.testing.assert_array_equal(c32, numpy.array([-2147483648]))

    a64 = numpy.array([float('NaN')])
    c64 = pandas._libs.algos.ensure_int64(a64)
    numpy.testing.assert_array_equal(c64, numpy.array([-9223372036854775808]))

Expected Output

test_categorical_from_codes should pass. Instead it contains the codes [1, 2, 0].

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

pandas.show_versions()
/Users/miker985/miniconda3/envs/general3/lib/python3.6/site-packages/psycopg2/init.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.2
pytest: 3.6.3
pip: 10.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@miker985

This comment has been minimized.

Copy link
Contributor Author

commented Jul 6, 2018

I'm not familiar enough with pandas algos to know whether it's appropriate to make changes to it.

If you agree this is a problem, I'm happy to submit a patch to either Categorical.from_codes or coerce_indexer_dtype

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jul 6, 2018

I think from_codes should raise when it's np.asarray(codes) isn't an integer dtype.

NaN isn't a valid code. It should be -1.

@miker985

This comment has been minimized.

Copy link
Contributor Author

commented Jul 6, 2018

That will break the obvious fix (using .fillna()). It also breaks 8 existing tests.

>>> codes = pandas.Series([1, 2, float('NaN')])
>>> numpy.asarray(codes.fillna(-1))
array([ 1.,  2., -1.])
>>> numpy.asarray(codes.fillna(-1)).dtype
dtype('float64')

Will a simple if isna(codes).any() check suffice for raising an error?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jul 6, 2018

Right, we may want to warn on that for a version. But codes should be integers, not floats.

@miker985 miker985 referenced this issue Jul 6, 2018

Merged

Fix categorical from codes nan 21767 #21775

4 of 4 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.