BUG: setting raw=True in the Dataframe.apply function causes a ValueError #34822

lfdversluis · 2020-06-16T12:23:50Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def parse_node_row(row, index):
    def unpack_bitmask(val):
        return [np.bitwise_and(np.right_shift(val, i * 16), 0xFFFF) for i in range(4)]

    arr = np.concatenate(np.apply_along_axis(unpack_bitmask, 0, row)).ravel()

    return pd.Series(arr, index=index)

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=['node', 'gpu'])

ndf = df.apply(parse_node_row, axis="columns", result_type="expand", args=(midx,))

Problem description

I am working on a dataset where the values are bitmasks: each 64-bit integer actually packs four 16-bit values (one per GPU per node). The format of the dataset is out of my control.
So, I wrote the code above to create a new dataframe with multi-index columns that hold the split values.

It works fine and all, but as I am solely using numpy functions in parse_node_row, I added raw=True in the apply call. However, this causes my script to crash! The minimal reproducible example above, with raw=True added to the apply call, shows

ValueError: Shape of passed values is (2, 8), indices imply (2, 2)

But runs just fine when raw=False. What gives?
There is no mention in the docs of any side-effects of using raw=True besides you getting a numpy array in your apply function, which is completely fine for me.
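
As a minimal sketch of the documented difference (the names `record` and `seen` are made up for illustration): with raw=False each row reaches the function as a Series, with raw=True as a NumPy array. The function returns a scalar per row here, a case that does not trigger the error above.

```python
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})
seen = []  # hypothetical helper list: records the type apply passes in

def record(row):
    seen.append(type(row).__name__)
    return row.sum()  # .sum() exists on both Series and ndarray

default_sums = df.apply(record, axis="columns")        # rows arrive as Series
raw_sums = df.apply(record, axis="columns", raw=True)  # rows arrive as ndarray

print(sorted(set(seen)))  # ['Series', 'ndarray']
```

Both calls give the same row sums; only the type seen inside the function differs.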

Expected Output

More performance!

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


TomAugspurger · 2020-06-16T13:40:53Z

Can you try with pandas master? I get

In [2]: ndf
Out[2]:
node    a                   b
gpu  gpu0 gpu1 gpu2 gpu3 gpu0 gpu1 gpu2 gpu3
0     100  200    0    0    0    0    0    0
1     300  400    0    0    0    0    0    0

lfdversluis · 2020-06-17T09:26:43Z

I think you ran it without raw=True. Here is the output:

INSTALLED VERSIONS
------------------
commit : 5fdd6f5
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.0.dev0+1887.g5fdd6f50a
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.2.2
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : 0.15.1
xlrd : None
xlwt : None
numba : 0.50.0

This is what I ran:

import numpy as np
import pandas as pd
pd.show_versions()
df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def parse_node_row(row, index):
    def unpack_bitmask(val):
        return [np.bitwise_and(np.right_shift(val, i * 16), 0xFFFF) for i in range(4)]

    arr = np.concatenate(np.apply_along_axis(unpack_bitmask, 0, row)).ravel()

    return pd.Series(arr, index=index)

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=['node', 'gpu'])

ndf = df.apply(parse_node_row, axis="columns", raw=True, result_type="expand", args=(midx,))

(I have added raw=True in this snippet for your convenience.)

Output:

ValueError: Shape of passed values is (2, 8), indices imply (2, 2)

lfdversluis · 2020-06-17T09:28:15Z

Now run it with the line ndf = df.apply(parse_node_row, axis="columns", result_type="expand", args=(midx,)) and you will see it works fine.
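
One possible workaround for the bitmask case, sketched under the assumption that the whole frame fits in memory as a single integer array: skip apply entirely, unpack with NumPy broadcasting, and pass the result to the DataFrame constructor together with the MultiIndex. The names `shifts` and `fields` are illustrative. The values are laid out node-major (all four fields of a, then all four of b) so that they line up with MultiIndex.from_product; this differs from the output shown earlier in the thread, where 200 ends up under a/gpu1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

# Bit offsets of the four 16-bit fields inside each 64-bit value.
shifts = np.arange(4) * 16

# Broadcast to shape (rows, nodes, 4): one 16-bit field per GPU per node.
fields = (df.to_numpy()[:, :, None] >> shifts) & 0xFFFF

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=["node", "gpu"])

# Flatten (nodes, 4) into 8 columns per row, matching the order of midx.
ndf = pd.DataFrame(fields.reshape(len(df), -1), index=df.index, columns=midx)
print(ndf)
```

Because no per-row Python function is called, this avoids both the raw=True error and the overhead of the non-raw apply.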

snake575 · 2020-09-18T20:12:26Z

I have a similar situation with df.apply using raw=True along with result_type="expand". It occurs with pandas>=1.1.0; pandas==1.0.5 works as expected.

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.apply(lambda row: (1, 2, 3), axis=1, raw=True, result_type="expand")

pandas 1.0.5

0    (1, 2, 3)
1    (1, 2, 3)
dtype: object

pandas >= 1.1.0

ValueError: Shape of passed values is (2, 3), indices imply (2, 2)
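
A possible way to get an expanded frame without going through raw=True, sketched here with a hypothetical per-row function `make_row`: iterate over the underlying NumPy array and build the DataFrame from the collected tuples. The function still receives plain arrays, but whether this is fast enough depends on the workload.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def make_row(row):
    # Hypothetical stand-in for a function that returns several values per row.
    return (row[0], row[1], row[0] + row[1])

# Each element of df.to_numpy() is one row as a 1-D ndarray.
expanded = pd.DataFrame([make_row(r) for r in df.to_numpy()], index=df.index)
print(expanded)
```

The number of output columns follows the length of the returned tuples rather than the number of columns in df.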

chinmayrane · 2021-01-26T23:46:21Z

I have a similar issue while migrating from 0.23.4 to 1.1.x

Essentially, I found that using raw=True and zipping the returned tuples was more efficient, but this functionality breaks in 1.1.x, and the non-raw apply method is slow.

chinmayrane · 2021-02-05T17:02:03Z

@TomAugspurger Would you know if there is any workaround for this besides not using raw=True?

TomAugspurger · 2021-02-05T17:10:13Z

I'm not sure offhand.

anirudh-hegde · 2023-08-12T04:53:11Z

Can you please assign this issue to me?
I will try to solve it.

chinmayrane · 2023-08-14T02:37:57Z

https://stackoverflow.com/questions/67678210/raw-true-causes-valueerror-in-pandas-dataframe-apply

Might be related to this - https://github.com/pandas-dev/pandas/pull/32425/files

pramodhrachuri · 2024-01-18T01:19:40Z

It looks like pandas is trying to parse the returned data if the type is object. I am trying to return a list of 400 values and I get the error "Shape of passed values is (10, 410), indices imply (10, 6)". I think the code around this line is the reason.

pandas/pandas/core/internals/construction.py

Line 237 in d4bd7e4

def ndarray_to_mgr(

Is there any way to make pandas not try to parse the output of apply?

mpierrau · 2024-02-08T14:37:56Z

Is there any progress on this issue?

lfdversluis added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 16, 2020

lfdversluis changed the title ~~BUG: setting raw=True in the Dataframe.apply apply function causes a ValueError~~ BUG: setting raw=True in the Dataframe.apply function causes a ValueError Jun 16, 2020

jbrockmendel added Apply Apply, Aggregate, Transform and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
