REGR: Series repr of object Index with bools and NaN is wrong #32146

disimone · 2020-02-21T10:06:03Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
# a Series with booleans and NaN
pd.Series([False, True, True, np.nan]).value_counts(dropna=False)
True     2
False    1
True     1
dtype: int64

# check the actual index of the value_counts result
pd.Series([False, True, True, np.nan]).value_counts(dropna=False).index
Index([True, False, nan], dtype='object')

# a similar Series, with ints instead of booleans, works fine
pd.Series([0, 1, 1, np.nan]).value_counts(dropna=False)
1.0    2
0.0    1
NaN    1
dtype: int64

Problem description

As shown in the example, the repr of the value_counts result is wrong when the Series contains booleans. It should report NaN as one of the values; instead, NaN is displayed as True.
Note that this is limited to Series containing booleans. A similar example with ints works fine.
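Until the repr can be trusted, the labels can be read from the index itself rather than from the printed output. The following is a minimal sketch of that check; the variable names `labels`, `n_missing`, and `n_true` are illustrative and not part of the original report.

```python
import numpy as np
import pandas as pd

counts = pd.Series([False, True, True, np.nan]).value_counts(dropna=False)

# Read the labels from the index itself instead of the printed repr.
labels = list(counts.index)

# Count labels that are genuinely missing, independent of how they print.
n_missing = sum(bool(pd.isna(label)) for label in labels)

# Count labels that are a real boolean True (not a NaN displayed as True).
n_true = sum(
    isinstance(label, (bool, np.bool_)) and bool(label) for label in labels
)
```

If `n_missing` is 1 and `n_true` is 1, the underlying data is intact and only the display is affected.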

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : None
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


AnnaDaglis · 2020-02-21T10:56:14Z

take

ts2095 · 2020-02-21T11:47:56Z

Not sure if this is related but it seems similar:

numpy.nan shows as True in index.

>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(range(3), index=[True, False, np.nan])
>>> s
True     0
False    1
True     2
dtype: int64

This True is different from the regular True in the value count:

>>> s.index.value_counts(dropna=False)
True     1
False    1
True     1
dtype: int64

But the correct value is still used "under the hood":

>>> s.index.unique()
Index([True, False, nan], dtype='object')
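A rough way to compare what is printed against what is stored is to pull the first token of each repr line and set it beside the real index values. This is only a sketch: the helper `rendered_labels` is hypothetical, and it assumes the labels contain no whitespace, which holds for this example.

```python
import numpy as np
import pandas as pd


def rendered_labels(series):
    """Return the index label shown at the start of each repr line."""
    lines = repr(series).splitlines()
    # The last line is the "dtype: ..." footer, so skip it.
    return [line.split()[0] for line in lines[:-1]]


s = pd.Series(range(3), index=[True, False, np.nan])

shown = rendered_labels(s)
stored_missing = [bool(pd.isna(label)) for label in s.index]

# Flag positions where the stored label is missing but the repr
# prints something that reads like an ordinary boolean.
mismatches = [
    pos
    for pos, (text, missing) in enumerate(zip(shown, stored_missing))
    if missing and text in ("True", "False")
]
```

On an affected version `mismatches` would be expected to contain position 2; on a release that includes the fix it should be empty.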

jorisvandenbossche · 2020-02-21T12:24:53Z

Yes, so it is not directly related to value_counts, but is a bug in the repr of an object Index containing NaN (@ts2095's first example above):

>>> s = pd.Series(range(3), index=[True, False, np.nan])
>>> s
True     0
False    1
True     2
dtype: int64

jorisvandenbossche · 2020-02-21T12:26:04Z

And this seems to be a regression compared to 0.25, so tagged it as such (and with 1.0.2 milestone)

jorisvandenbossche · 2020-02-21T12:28:22Z

And the bug should be somewhere in here:

In [1]: idx = pd.Index([True, False, np.nan], dtype=object)

In [2]: idx.format()
Out[2]: ['True ', 'False', 'True ']

jorisvandenbossche · 2020-02-21T12:41:18Z

And there, the maybe_convert_objects function is called, and it is this one that changed behaviour:

In [6]: pd._libs.lib.maybe_convert_objects(idx.values, safe=1)
Out[6]: array([ True, False,  True])

(this returns array([True, False, nan], dtype=object) in 0.25.3)

I suppose this is caused by #27335
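The expected behaviour can be pinned down with public API only. The sketch below uses `Series.infer_objects()` as a stand-in for the private `maybe_convert_objects` call; that it exercises the same conversion logic is an assumption on my part, not something established in this thread. The point is that an object array mixing bools and NaN should stay object dtype, with the NaN preserved.

```python
import numpy as np
import pandas as pd

values = np.array([True, False, np.nan], dtype=object)

# Ask pandas to infer a better dtype for the object data.
converted = pd.Series(values).infer_objects()

# A bool array cannot hold NaN, so the data should remain object dtype.
kept_object = converted.dtype == object

# The missing value must survive the conversion attempt.
nan_preserved = bool(pd.isna(converted.iloc[2]))
```

Both flags should be True when the conversion behaves as it did in 0.25.3.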

github-actions bot assigned AnnaDaglis Feb 21, 2020

jorisvandenbossche changed the title ~~value_counts produces wrong labels for nans~~ REGR: Series repr of object Index with bools and NaN is wrong Feb 21, 2020

jorisvandenbossche added the Output-Formatting and Regression labels Feb 21, 2020

jorisvandenbossche added this to the 1.0.2 milestone Feb 21, 2020

AnnaDaglis mentioned this issue Feb 25, 2020

BUG: Fixed bug, where pandas._libs.lib.maybe_convert_objects function improperly handled arrays with bools and NaNs #32242

Merged

jbrockmendel mentioned this issue Mar 4, 2020

RLS: 1.0.2 #32415

Closed

TomAugspurger closed this as completed in #32242 Mar 9, 2020
