Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas.Categorical.from_codes incorrectly converts NaN codes to 0. #21767

Closed
miker985 opened this issue Jul 6, 2018 · 4 comments · Fixed by #21775
Closed

pandas.Categorical.from_codes incorrectly converts NaN codes to 0. #21767

miker985 opened this issue Jul 6, 2018 · 4 comments · Fixed by #21775
Labels
Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@miker985
Copy link
Contributor

miker985 commented Jul 6, 2018

Code Sample, a copy-pastable example if possible

import pandas


def test_categorical_from_codes():
    "Demonstrate issue with pandas.Categorical.from_codes."
    codes = pandas.Series([1, 2, float('NaN')])
    categories = ['foo', 'bar', 'baz']
    c = pandas.Categorical.from_codes(codes, categories, ordered=True)

    expected = pandas.Categorical.from_codes([1, 2, -1], categories,
                                             ordered=True)

    pandas.util.testing.assert_categorical_equal(c, expected)

Problem description

pandas.Categorical.from_codes is incorrectly coercing NaN code values into 0 values. I think it should be raising a ValueError. I believe this is because pandas._libs.algos.ensure_int8/16/32/64 is behaving in an unanticipated manner on numpy arrays.

The call chain is:

  1. pandas.core.arrays.categorical.Categorical.from_codes
  2. pandas.core.dtypes.cast.coerce_indexer_dtype
  3. pandas._libs.algos.ensure_int8/16/32/64 (aliased to pandas.core.dtypes.common.ensure_int8/16/32/64)

When given a single value all the ensure functions behave correctly. But when given a numpy.array they do not.

import pandas
import pytest

def test_ensure_on_value():
    """
    Test that ensure_int functions raise a ValueError when converting a NaN.

    These tests ALL PASS, and should pass.
    """
    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int8(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int16(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int32(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int64(float('NaN'))


def test_ensure_on_array():
    """
    These tests ALL PASS.

    I assume all of these are supposed to raise ValueErrors.

    Interestingly I did not get 0 values with ensure_int32 and ensure_int64.
    """
    a8 = numpy.array([float('NaN')])
    c8 = pandas._libs.algos.ensure_int8(a8)
    numpy.testing.assert_array_equal(c8, numpy.array([0]))

    a16 = numpy.array([float('NaN')])
    c16 = pandas._libs.algos.ensure_int16(a16)
    numpy.testing.assert_array_equal(c16, numpy.array([0]))

    a32 = numpy.array([float('NaN')])
    c32 = pandas._libs.algos.ensure_int32(a32)
    numpy.testing.assert_array_equal(c32, numpy.array([-2147483648]))

    a64 = numpy.array([float('NaN')])
    c64 = pandas._libs.algos.ensure_int64(a64)
    numpy.testing.assert_array_equal(c64, numpy.array([-9223372036854775808]))

Expected Output

test_categorical_from_codes should pass. Instead it contains the codes [1, 2, 0].

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

pandas.show_versions()
/Users/miker985/miniconda3/envs/general3/lib/python3.6/site-packages/psycopg2/init.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.2
pytest: 3.6.3
pip: 10.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@miker985
Copy link
Contributor Author

miker985 commented Jul 6, 2018

I'm not familiar enough with pandas algos to know whether it's appropriate to make changes to it.

If you agree this is a problem, I'm happy to submit a patch to either Categorical.from_codes or coerce_indexer_dtype

@TomAugspurger
Copy link
Contributor

I think from_codes should raise when it's np.asarray(codes) isn't an integer dtype.

NaN isn't a valid code. It should be -1.

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Categorical Categorical Data Type labels Jul 6, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Jul 6, 2018
@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Jul 6, 2018
@miker985
Copy link
Contributor Author

miker985 commented Jul 6, 2018

That will break the obvious fix (using .fillna()). It also breaks 8 existing tests.

>>> codes = pandas.Series([1, 2, float('NaN')])
>>> numpy.asarray(codes.fillna(-1))
array([ 1.,  2., -1.])
>>> numpy.asarray(codes.fillna(-1)).dtype
dtype('float64')

Will a simple if isna(codes).any() check suffice for raising an error?

@TomAugspurger
Copy link
Contributor

Right, we may want to warn on that for a version. But codes should be integers, not floats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants