pandas.isnull() has poor behavior for lists #20675

Closed
bfollinprm opened this Issue Apr 13, 2018 · 4 comments

bfollinprm commented Apr 13, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

pd.isnull([np.nan, 'world'])

# returns:
# array([False, False], dtype=bool)

Problem description

The output of pd.isnull/pd.isna on a list depends on the dtype NumPy infers when the list is converted to an array.
When the inferred dtype is a string type, NumPy converts np.nan to the string "nan", which pd.isnull() no longer recognizes as a null value. The underlying problem is that NumPy infers a string dtype for mixed lists containing both strings and float('nan') values. Any of the following would address it:

  • explicitly convert to object arrays instead of string arrays, as is done in pd.Series construction
  • convert to a pd.Series object instead of the numpy object (leverages the above)
  • apply pd.isna() in a list comprehension for list objects, e.g.
def isna(a):  # a is a list
    return np.array([pd.isna(el) for el in a])
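
For reference, here is a minimal sketch of the coercion and of the three options listed above. It assumes Python 3 (where the inferred dtype is unicode rather than bytes), and the variable names are illustrative only. It deliberately does not call pd.isnull() on the raw list, since that result depends on the pandas version.

```python
import numpy as np
import pandas as pd

values = [np.nan, "world"]

# Default conversion: NumPy infers a unicode string dtype for the mixed
# list, so the float NaN is stored as the string "nan".
as_str = np.asarray(values)

# Option 1: force an object array so the float NaN survives the conversion.
mask_obj = pd.isna(np.asarray(values, dtype=object))

# Option 2: go through pd.Series, which avoids the string coercion.
mask_series = pd.isna(pd.Series(values)).values

# Option 3: check each element individually.
mask_elem = np.array([pd.isna(el) for el in values])

print(as_str.dtype.kind, repr(as_str[0]))
print(mask_obj, mask_series, mask_elem)
```

All three options agree on this input; the element-wise version is the simplest but will likely be the slowest on long lists.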

Expected Output

array([True, False], dtype=bool)

or

TypeError

if it is undesirable to support lists with mixed float/string types with pd.isnull()
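
A rough sketch of what the TypeError alternative could look like from the caller's side. The function name isna_strict and the error message are hypothetical, not anything pandas provides, and the check only looks at top-level elements of a flat list.

```python
import numpy as np
import pandas as pd


def isna_strict(values):
    """Hypothetical strict variant: refuse lists that mix str and float."""
    has_str = any(isinstance(v, str) for v in values)
    has_float = any(isinstance(v, float) for v in values)
    if has_str and has_float:
        # Mixed input is exactly the case where NumPy would turn NaN into "nan".
        raise TypeError("mixed float/string lists are not supported")
    return pd.isna(np.asarray(values))


try:
    isna_strict([np.nan, "world"])
except TypeError as exc:
    print("rejected:", exc)

print(isna_strict([np.nan, 1.0]))
```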

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1048-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.2
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


Member

jorisvandenbossche commented Apr 13, 2018

As you mention, this is due to np.asarray(...) coercing everything to a string once there is a string in the input. I would say this is a bug (or design issue) in numpy, but since it is a well-known one, we should work around it, as we do in other places.

From a quick look, this might be used instead of asarray:

In [58]: pd.core.dtypes.cast.maybe_convert_platform([np.nan, 'world'])
Out[58]: array([nan, 'world'], dtype=object)
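
Since maybe_convert_platform lives under pd.core and is internal, user code may prefer to get the same effect with public calls only. The sketch below is one way to do that; isna_listlike is a hypothetical name, and this is not necessarily how the fix inside pandas would be written.

```python
import numpy as np
import pandas as pd


def isna_listlike(values):
    """Hypothetical wrapper: convert plain lists/tuples to object arrays first."""
    if isinstance(values, (list, tuple)):
        # dtype=object stops NumPy from stringifying NaN next to str entries.
        values = np.asarray(values, dtype=object)
    # Existing arrays and Series are passed through unchanged.
    return pd.isna(values)


print(isna_listlike([np.nan, "world"]))
print(isna_listlike(np.array([1.0, np.nan])))
```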

PR welcome!

@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Apr 13, 2018

Author

bfollinprm commented Apr 13, 2018

I'll work through this hopefully early next week.

Author

bfollinprm commented Apr 18, 2018

I want to update so I'm not thought of as a raise-and-dump kind of person: awaiting approval to contribute from work, but have a fix ready if/when that happens.

Member

jorisvandenbossche commented Apr 19, 2018

OK, hopefully that will be no problem, and looking forward to the PR

@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 7, 2018
