ValueError when applying a function that returns a list or tuple to a DataFrame that contains a Timestamp #17892

Closed
wdm81 opened this Issue Oct 16, 2017 · 2 comments

wdm81 commented Oct 16, 2017

Code Sample, a copy-pastable example if possible

Executing

import pandas as pd

df = pd.DataFrame({'a':[pd.Timestamp('2010-02-01'),
                        pd.Timestamp('2010-02-04'),
                        pd.Timestamp('2010-02-05'),
                        pd.Timestamp('2010-02-06')],
                   'b':[9,5,4,3], 'c':[5,3,4,2], 'd':[1,2,3,4]})

def fun(x):
    return (1,2)

df.apply(fun, axis=1)

raises an exception

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
   4309         blocks = form_blocks(arrays, names, axes)
-> 4310         mgr = BlockManager(blocks, axes)
   4311         mgr._consolidate_inplace()

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
   2794         if do_integrity_check:
-> 2795             self._verify_integrity()
   2796 

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in _verify_integrity(self)
   3005             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
-> 3006                 construction_error(tot_items, block.shape[1:], self.axes)
   3007         if len(self.items) != tot_items:

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
   4279     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 4280         passed, implied))
   4281 

ValueError: Shape of passed values is (4, 2), indices imply (4, 4)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-26-7b305f7b3474> in <module>()
      8     return (1,2)
      9 
---> 10 df.apply(fun, axis=1)

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4260                         f, axis,
   4261                         reduce=reduce,
-> 4262                         ignore_failures=ignore_failures)
   4263             else:
   4264                 return self._apply_broadcast(f, axis)

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4373                 index = None
   4374 
-> 4375             result = self._constructor(data=results, index=index)
   4376             result.columns = res_index
   4377 

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    273                                  dtype=dtype, copy=copy)
    274         elif isinstance(data, dict):
--> 275             mgr = self._init_dict(data, index, columns, dtype=dtype)
    276         elif isinstance(data, ma.MaskedArray):
    277             import numpy.ma.mrecords as mrecords

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    409             arrays = [data[k] for k in keys]
    410 
--> 411         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    412 
    413     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5504     axes = [_ensure_index(columns), _ensure_index(index)]
   5505 
-> 5506     return create_block_manager_from_arrays(arrays, arr_names, axes)
   5507 
   5508 

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
   4312         return mgr
   4313     except ValueError as e:
-> 4314         construction_error(len(arrays), arrays[0].shape, axes, e)
   4315 
   4316 

/Users/wilmat01/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
   4278         raise ValueError("Empty data passed with indices specified.")
   4279     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 4280         passed, implied))
   4281 
   4282 

ValueError: Shape of passed values is (4, 2), indices imply (4, 4)

Problem description

  • I see the same problem when fun returns a list (e.g. [1,2]) rather than a tuple.
  • The error does not occur when apply is called with axis=0.
  • The error does not occur when I replace the Timestamp column with a column of integers.
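
The three observations above can be checked side by side with a small harness. This is only a sketch: the helper name `outcome` and the integer replacement values are illustrative, and what it prints depends on the installed pandas version (the failure was reported on 0.20.3, so newer releases may report "ok" for every case).

```python
import pandas as pd

def fun(x):
    return (1, 2)

def outcome(frame, axis):
    """Return 'ok' if apply succeeds, otherwise the ValueError message."""
    try:
        frame.apply(fun, axis=axis)
        return "ok"
    except ValueError as exc:
        return "ValueError: {}".format(exc)

dates = pd.to_datetime(['2010-02-01', '2010-02-04', '2010-02-05', '2010-02-06'])
with_ts = pd.DataFrame({'a': dates, 'b': [9, 5, 4, 3],
                        'c': [5, 3, 4, 2], 'd': [1, 2, 3, 4]})
# Same frame, but with the Timestamp column replaced by integers.
ints_only = with_ts.assign(a=[1, 2, 3, 4])

report = {
    'timestamp, axis=1': outcome(with_ts, 1),
    'timestamp, axis=0': outcome(with_ts, 0),
    'integers, axis=1': outcome(ints_only, 1),
}
for case, result in report.items():
    print(case, '->', result)
```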

Expected Output

A pandas Series containing tuples:

0    (1, 2)
1    (1, 2)
2    (1, 2)
3    (1, 2)
dtype: object

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Contributor

jreback commented Oct 16, 2017

duplicate of #16353 and #15628

.apply infers the output dimension based on what you are returning, which looks exactly like a Series. This is not idiomatic pandas, not to mention non-performant.
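
For readers on newer pandas, the inference described above can be bypassed by stating the desired shape explicitly. The sketch below assumes pandas 0.23.0 or later, where, to the best of my knowledge, `DataFrame.apply` gained the `result_type` argument. The last variant avoids `apply` entirely and should not depend on that argument. The two-row frame is an abbreviated version of the one in the report.

```python
import pandas as pd

df = pd.DataFrame({'a': pd.to_datetime(['2010-02-01', '2010-02-04']),
                   'b': [9, 5]})

def fun(x):
    return (1, 2)

# One object per row: a Series whose values are the tuples.
# Assumes a pandas version that supports result_type.
as_series = df.apply(fun, axis=1, result_type='reduce')

# Spread each tuple across columns: a DataFrame with one column per element.
as_frame = df.apply(fun, axis=1, result_type='expand')

# Fallback that sidesteps apply's shape inference altogether.
manual = pd.Series([fun(row) for _, row in df.iterrows()], index=df.index)

print(as_series)
print(as_frame)
print(manual)
```

The first and third forms give the Series of tuples that the report lists as its expected output. The second gives a frame, which is usually the easier shape to work with afterwards.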

@jreback jreback closed this Oct 16, 2017

@jreback jreback added this to the No action milestone Oct 16, 2017

kmader commented Oct 23, 2017

I have the same issue (see code below). The top frame (s_df) works perfectly and the bottom one (t_df) doesn't work at all. The inconsistency is what I find troubling, because adding a column shouldn't change how .apply works. While this contrived example is very simplified, it is based on a real issue where I have a number of date columns and want to create new columns based on relationships between them (warranty_valid = purchase_date-claim_date<180 days). Is there a more idiomatic pandas way to do this?

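import pandas as pd

# Single integer column: apply returns the lists without error.
s_df = pd.DataFrame(dict(a=[1, 2]))
print(s_df.apply(lambda x: [1, 2, 3], axis=1))

# Same call after adding a datetime column: this is the one that fails.
t_df = pd.DataFrame(dict(a=[1, 2], b=pd.to_datetime(['2017-10-%02d' % i for i in [1, 2]])))
print(t_df.apply(lambda x: [1, 2, 3], axis=1))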
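
One possible answer to the question above is to skip `.apply` and operate on whole columns. The sketch below uses made-up data; the column names come from the comment, and it assumes the intended check is that the claim falls within 180 days after the purchase (so the subtraction is claim minus purchase, which is an interpretation rather than something the comment states).

```python
import pandas as pd

# Hypothetical data; only the column names are taken from the comment.
claims = pd.DataFrame({
    'purchase_date': pd.to_datetime(['2017-01-10', '2017-03-01']),
    'claim_date': pd.to_datetime(['2017-05-01', '2017-12-24']),
})

# Subtracting two datetime columns yields a timedelta column, which can be
# compared against a Timedelta directly, with no row-wise apply involved.
elapsed = claims['claim_date'] - claims['purchase_date']
claims['warranty_valid'] = elapsed < pd.Timedelta(days=180)

print(claims)
```

Because no function is applied per row, the output shape never has to be inferred, and the comparison runs on the whole column at once.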

jreback added a commit to jreback/pandas that referenced this issue Nov 30, 2017

@jreback jreback modified the milestones: No action, 0.22.0 Nov 30, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 2, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 3, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 7, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 10, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 14, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 21, 2017

jreback added a commit to jreback/pandas that referenced this issue Dec 23, 2017

jreback added a commit to jreback/pandas that referenced this issue Jan 6, 2018

API/BUG: .apply will correctly infer output shape when axis=1
closes #16353
closes #17348
closes #17437
closes #18573
closes #17970
closes #17892
closes #17602
closes #18775
closes #18901
closes #18919

jreback added a commit to jreback/pandas that referenced this issue Feb 5, 2018

jorisvandenbossche added a commit that referenced this issue Feb 7, 2018

harisbal pushed a commit to harisbal/pandas that referenced this issue Feb 28, 2018
