BUG: setting raw=True in the Dataframe.apply function causes a ValueError #34822

Open
3 tasks done
lfdversluis opened this issue Jun 16, 2020 · 11 comments
Labels
Apply (Apply, Aggregate, Transform), Bug

Comments

@lfdversluis

lfdversluis commented Jun 16, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def parse_node_row(row, index):
    def unpack_bitmask(val):
        return [np.bitwise_and(np.right_shift(val, i * 16), 0xFFFF) for i in range(4)]

    arr = np.concatenate(np.apply_along_axis(unpack_bitmask, 0, row)).ravel()

    return pd.Series(arr, index=index)

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=['node', 'gpu'])

ndf = df.apply(parse_node_row, axis="columns", result_type="expand", args=(midx,))

Problem description

I am working on a dataset where the values are bitmasks: each 64-bit integer actually packs four 16-bit values (one per GPU per node). The format of the dataset is out of my control.
So, I wrote the code above to create a new dataframe with a multi-index over the split-out data.
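
As an illustration of the layout described here, this is a minimal NumPy-only sketch of unpacking four 16-bit fields from one 64-bit integer. The helper name `unpack_fields` and the sample values are made up for illustration, and it assumes the lowest 16 bits hold the first field, which matches the shift order in the code sample above. It is not the code from the report, and it returns the fields grouped per input value.

```python
import numpy as np

def unpack_fields(values, n_fields=4, bits=16):
    """Hypothetical helper: split each integer into n_fields fields of `bits` bits."""
    values = np.asarray(values, dtype=np.uint64)
    # Shift amounts 0, 16, 32, 48 (kept as uint64 so no float promotion occurs).
    shifts = np.arange(n_fields, dtype=np.uint64) * np.uint64(bits)
    mask = np.uint64((1 << bits) - 1)  # 0xFFFF for 16-bit fields
    # Broadcasting adds a trailing axis of length n_fields.
    return (values[..., None] >> shifts) & mask

# Made-up sample: fields 100, 7, 3, 1 packed from low to high bits.
packed = 100 | (7 << 16) | (3 << 32) | (1 << 48)
fields = unpack_fields([packed, 300])
print(fields)
```

Because the shifts are broadcast, the whole column can be unpacked in one call without a Python-level loop per row.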

It works fine, but since I use only NumPy functions in parse_node_row, I added raw=True to the apply call. However, this causes my script to crash! The minimal reproducible example above, with raw=True added to the apply call, raises

ValueError: Shape of passed values is (2, 8), indices imply (2, 2)

But it runs just fine with raw=False. What gives?
The docs do not mention any side effects of using raw=True, other than that the applied function receives a NumPy array, which is completely fine for me.
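
To make the documented difference concrete, here is a small sketch that records which type the applied function receives with and without raw=True. The function `row_sum` and the two lists are hypothetical names used only for this illustration, and the function returns a scalar per row so that the result shape is not in question.

```python
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

seen_default = []  # type names seen with raw=False
seen_raw = []      # type names seen with raw=True

def row_sum(row, log):
    # Record the type of the object pandas hands to the function.
    log.append(type(row).__name__)
    return row.sum()

result_default = df.apply(row_sum, axis="columns", args=(seen_default,))
result_raw = df.apply(row_sum, axis="columns", raw=True, args=(seen_raw,))

print(set(seen_default), set(seen_raw))
```

With a scalar return value both calls give the same row sums; the difference reported in this issue only shows up once the function returns something list-like.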

Expected Output

More performance!

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@lfdversluis added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jun 16, 2020
@lfdversluis changed the title from "BUG: setting raw=True in the Dataframe.apply apply function causes a ValueError" to "BUG: setting raw=True in the Dataframe.apply function causes a ValueError" Jun 16, 2020
@TomAugspurger
Contributor

Can you try with pandas master? I get

In [2]: ndf
Out[2]:
node    a                   b
gpu  gpu0 gpu1 gpu2 gpu3 gpu0 gpu1 gpu2 gpu3
0     100  200    0    0    0    0    0    0
1     300  400    0    0    0    0    0    0

@lfdversluis
Author

lfdversluis commented Jun 17, 2020

I think you ran it without raw=True. Here is the output:

INSTALLED VERSIONS

commit : 5fdd6f5
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.0.dev0+1887.g5fdd6f50a
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.2.2
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : 0.15.1
xlrd : None
xlwt : None
numba : 0.50.0

This is what I ran:

import numpy as np
import pandas as pd
pd.show_versions()
df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def parse_node_row(row, index):
    def unpack_bitmask(val):
        return [np.bitwise_and(np.right_shift(val, i * 16), 0xFFFF) for i in range(4)]

    arr = np.concatenate(np.apply_along_axis(unpack_bitmask, 0, row)).ravel()

    return pd.Series(arr, index=index)

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=['node', 'gpu'])

ndf = df.apply(parse_node_row, axis="columns", raw=True, result_type="expand", args=(midx,))

(I have added raw=True in this snippet for your convenience.)

Output:

ValueError: Shape of passed values is (2, 8), indices imply (2, 2)

@lfdversluis
Author

Now run it with the line ndf = df.apply(parse_node_row, axis="columns", result_type="expand", args=(midx,)) and you will see it works fine.

@jbrockmendel added the Apply (Apply, Aggregate, Transform) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Sep 3, 2020
@snake575

I have a similar situation with df.apply using raw=True along with result_type="expand". It occurs with pandas>=1.1.0; pandas==1.0.5 works as expected.

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.apply(lambda row: (1, 2, 3), axis=1, raw=True, result_type="expand")

pandas 1.0.5

0    (1, 2, 3)
1    (1, 2, 3)
dtype: object

pandas >= 1.1.0

ValueError: Shape of passed values is (2, 3), indices imply (2, 2)
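
The wording of this error has the same form as the one pandas raises when a 2D array is wrapped in a DataFrame whose labels do not match the array's shape. The sketch below triggers that constructor error directly with made-up data. Whether apply with raw=True reaches the error through exactly this path is an assumption, not something confirmed in this thread.

```python
import numpy as np
import pandas as pd

# A (2, 3) block of values, like two rows that each returned three items.
values = np.array([[1, 2, 3], [1, 2, 3]])

message = None
try:
    # Only two column labels are supplied for three columns of values.
    pd.DataFrame(values, index=[0, 1], columns=["a", "b"])
except ValueError as exc:
    message = str(exc)

print(message)
```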

@chinmayrane
Contributor

I have a similar issue while migrating from 0.23.4 to 1.1.x.

Essentially, I found raw=True combined with zipping the returned tuples to be more efficient, but this functionality breaks in 1.1.x, and the non-raw apply method is slow.
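
A rough sketch of the same zip pattern without going through DataFrame.apply, for anyone hitting this: iterate over the raw NumPy rows and zip the returned tuples into columns. The function `sum_and_product` and the new column names are hypothetical, and this has not been benchmarked against raw=True.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def sum_and_product(row):
    # `row` is a plain NumPy array, as it would be with raw=True.
    return row[0] + row[1], row[0] * row[1]

# Call the function on each raw row, then transpose the list of tuples.
results = [sum_and_product(row) for row in df.to_numpy()]
sums, products = zip(*results)

df_out = df.assign(total=list(sums), product=list(products))
print(df_out)
```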

@chinmayrane
Contributor

@TomAugspurger Would you know if there is any workaround for this besides not using raw=True?

@TomAugspurger
Contributor

I'm not sure offhand.

@anirudh-hegde
Contributor

Can you please assign this issue to me?
I will try to solve it.

@chinmayrane
Contributor

chinmayrane commented Aug 14, 2023

@pramodhrachuri

It looks like pandas is trying to parse the returned data if the type is object. I am trying to return a list of 400 values and I get the error Shape of passed values is (10, 410), indices imply (10, 6). I think the code around this line is the reason.

def ndarray_to_mgr(

Is there any way to make pandas not try to parse the output of apply?
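
One possible way to sidestep the parsing, sketched under the assumption that every row returns the same number of values: skip DataFrame.apply, call the function on each row of df.to_numpy(), and stack the results yourself. The helper `expand_rows` is a hypothetical name, not a pandas API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def expand_rows(frame, func, columns=None):
    """Hypothetical helper: apply func to each raw row and stack the outputs."""
    rows = [np.asarray(func(row)) for row in frame.to_numpy()]
    # np.vstack raises if the per-row outputs differ in length.
    return pd.DataFrame(np.vstack(rows), index=frame.index, columns=columns)

# Each row of 2 values is expanded to 6 values.
out = expand_rows(df, lambda row: np.repeat(row, 3))
print(out)
```

The `columns` argument can take any index of matching length, for example a MultiIndex like the one built in the original report.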

@mpierrau

mpierrau commented Feb 8, 2024

Is there any progress on this issue?
