Dataframe constructor fails when given dict with None value #14381

Closed
gitj opened this Issue Oct 9, 2016 · 7 comments

gitj commented Oct 9, 2016

A small, complete example of the issue

import pandas as pd
pd.DataFrame(dict(a=None), index=[0])

In [3]: pd.DataFrame(dict(a=None),index=[0])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-20b65f605ca3> in <module>()
----> 1 pd.DataFrame(dict(a=None),index=[0])

miniconda2/envs/readout2/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    264                                  dtype=dtype, copy=copy)
    265         elif isinstance(data, dict):
--> 266             mgr = self._init_dict(data, index, columns, dtype=dtype)
    267         elif isinstance(data, ma.MaskedArray):
    268             import numpy.ma.mrecords as mrecords

miniconda2/envs/readout2/lib/python2.7/site-packages/pandas/core/frame.pyc in _init_dict(self, data, index, columns, dtype)
    400             arrays = [data[k] for k in keys]
    401 
--> 402         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    403 
    404     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

miniconda2/envs/readout2/lib/python2.7/site-packages/pandas/core/frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5382 
   5383     # don't force copy because getting jammed in an ndarray anyway
-> 5384     arrays = _homogenize(arrays, index, dtype)
   5385 
   5386     # from BlockManager perspective

miniconda2/envs/readout2/lib/python2.7/site-packages/pandas/core/frame.pyc in _homogenize(data, index, dtype)
   5693                 v = lib.fast_multiget(v, oindex.values, default=NA)
   5694             v = _sanitize_array(v, index, dtype=dtype, copy=False,
-> 5695                                 raise_cast_failure=False)
   5696 
   5697         homogenized.append(v)

miniconda2/envs/readout2/lib/python2.7/site-packages/pandas/core/series.pyc in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   2917 
   2918     # scalar like
-> 2919     if subarr.ndim == 0:
   2920         if isinstance(data, list):  # pragma: no cover
   2921             subarr = np.array(data, dtype=object)

AttributeError: 'NoneType' object has no attribute 'ndim'
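
The last frame shows the direct cause: `_sanitize_array` reads `.ndim` from a value that is still `None`. The pandas-free sketch below reproduces that failure mode and shows one defensive pattern. `ndim_of` and `broadcast_scalar` are hypothetical helpers made up for illustration; they are not pandas functions and not necessarily how the fix was implemented.

```python
def ndim_of(value):
    """Return value.ndim if the attribute exists, else 0 (treat as scalar)."""
    return getattr(value, "ndim", 0)


def broadcast_scalar(value, index):
    """Repeat a scalar (including None) once per index label.

    Hypothetical helper: list-likes and objects with ndim > 0 are
    returned as a plain list instead of being repeated.
    """
    if isinstance(value, (list, tuple)) or ndim_of(value) > 0:
        return list(value)
    return [value] * len(index)


# Reading .ndim directly from None fails the same way as the traceback.
try:
    None.ndim
except AttributeError as exc:
    error_message = str(exc)

# With the guard, a scalar None becomes a one-element column.
column = broadcast_scalar(None, index=[0])
print(error_message)
print(column)
```

The point of the guard is only that a scalar `None` has to be handled before any array attribute is read; whether the broadcast value stays `None` or becomes NaN is the separate design question discussed in the comments.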

Expected Output

This previously worked with a sensible output in 0.18.1:

In [2]: pd.DataFrame(dict(a=None),index=[0])
Out[2]:
      a
0  None

Output of pd.show_versions()

Working version:

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24
numpy: 1.11.2
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.5
lxml: 3.6.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

Broken version:

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24
numpy: 1.11.2
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.5
lxml: 3.6.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

gitj added a commit to ColumbiaCMB/kid_readout that referenced this issue Oct 9, 2016

jreback added this to the Next Major Release milestone Oct 9, 2016

jreback commented Oct 9, 2016 (Contributor)

So this works correctly in the following cases.

In [12]: pd.DataFrame(columns=['a'], index=[0])
Out[12]: 
     a
0  NaN

In [13]: pd.DataFrame(dict(a=np.nan), index=[0])
Out[13]: 
    a
0 NaN

The behavior in 0.18.1 is actually wrong: this should coerce to the np.nan case, as dtype is not specified.

Pull requests to fix this are welcome.
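
The distinction under discussion is between a column that holds the Python object `None` and one that holds a floating-point NaN. A short sketch, assuming a recent pandas and NumPy are installed; which dtype the scalar-`None` case ends up with depends on the pandas version, so the code prints it instead of assuming a value.

```python
import numpy as np
import pandas as pd

# A scalar NaN is broadcast over the index and stored as float64.
nan_frame = pd.DataFrame({"a": np.nan}, index=[0])

# A scalar None is the case from this issue; the resulting dtype depends
# on the pandas version (0.19.0 raised AttributeError here).
none_frame = pd.DataFrame({"a": None}, index=[0])

print(nan_frame["a"].dtype)   # float64
print(none_frame["a"].dtype)  # version-dependent

# Either way, pandas treats both values as missing.
print(nan_frame["a"].isna().all(), none_frame["a"].isna().all())
```

Because `isna()` reports both values as missing, code that only checks for missing data is unaffected by the choice; code that inspects `dtype` or compares against `None` is not.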

shawnheide commented Oct 11, 2016 (Contributor)

Hey @brandonmburroughs, I saw that you're working on this too and beat me to the PR. No worries, I wasn't as far along. Just wanted to let you know that the same problem shows up with the Series constructor too, i.e. Series([None]) fails to coerce to NaN.

I looked at fixing it a little further down the stack in series.py, but didn't check with any tests yet. Feel free to see my commit above that referenced this.
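
The Series case mentioned here can be inspected the same way. A hedged sketch, assuming a recent pandas: a list holding only `None` gives the constructor nothing numeric to infer from, while `None` mixed with floats is converted to NaN. The dtype of the lone-`None` case is printed, not assumed, since it may differ across versions.

```python
import pandas as pd

only_none = pd.Series([None])
mixed = pd.Series([None, 1.0])

print(only_none.dtype)        # version-dependent
print(mixed.dtype)            # float64
print(mixed.isna().tolist())  # [True, False]
```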

gitj commented Oct 11, 2016

I was going to work on a PR but looks like you guys are on top of it. Thanks!

brandonmburroughs commented Oct 11, 2016 (Contributor)

@shawnheide I actually noticed this problem after I created my PR. I created an issue (#14393) about this and there is some discussion going on there as to how to handle this as the cases are different. Depending upon how they want to handle the API design, your fix may be better suited to handle all cases.

jorisvandenbossche commented Oct 26, 2016 (Member)

@jreback Given your comment in #14393 (comment), I would personally say that the above case should not coerce to NaN, but keep the None. Thoughts?
(in any case that is the conservative road for now, as that was the behaviour in 0.18.1)

But in that case, @brandonmburroughs, your PR should be updated.

jreback commented Oct 26, 2016 (Contributor)

Yeah, I'm open to restoring the pre-0.19.0 behavior (IOW, the column remains as object); that is fine.

jorisvandenbossche commented Oct 26, 2016 (Member)

To illustrate, in pandas 0.18:

In [7]: pd.DataFrame(dict(a=[None]), index= [0])
Out[7]: 
      a
0  None

In [8]: pd.DataFrame(dict(a=None), index= [0])
Out[8]: 
      a
0  None

So for 0.19.1, I would choose to go back to 0.18.1 behaviour, so not coercing to NaN (keep as None).
We can discuss if we want to change for later releases.
