pandarallel_apply crashes with OverflowError: int too big to convert #63

Closed
yeus opened this issue Dec 23, 2019 · 21 comments

@yeus

yeus commented Dec 23, 2019

Hi everyone,

I am getting this error here using parallel_apply in pandas:

  File "extract_specifications.py", line 156, in <module>
    extracted_data = df.parallel_apply(extract_raw_infos, axis=1)
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 367, in closure
    kwargs,
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 239, in get_workers_args
    zip(input_files, output_files, chunk_lengths)
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 238, in <listcomp>
    for index, (input_file, output_file, chunk_length) in enumerate(
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 169, in wrapper
    time=time,
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
    return function(*args, **kwargs)
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 464, in inline
    func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
    return function(*args, **kwargs)
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in shift_instructions
    for instruction in instructions
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in <genexpr>
    for instruction in instructions
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
    return function(*args, **kwargs)
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 293, in shift_instruction
    return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
    return function(*args, **kwargs)
  File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 71, in int2python_bytes
    return int.to_bytes(item, nb_bytes, "little")
OverflowError: int too big to convert

I am using

pandarallel == 1.4.2
pandas == 0.24.2
python == 3.6.9

Any idea how to proceed from here? I have basically no idea what could cause this bug. I suspect it might be related to the size of the data I have in one column (I save HTML from web pages in there), but otherwise I have no idea. I would be happy to help fix this bug(?) given some guidance. Thanks for helping.

@nalepae
Owner

nalepae commented Dec 26, 2019

Could you please try without the progress bar and/or with only a small part of your dataset and tell me the result ?

Could you also print the result of : len(df) ?

@yeus
Author

yeus commented Dec 28, 2019

> Could you please try without the progress bar and/or with only a small part of your dataset and tell me the result ?
>
> Could you also print the result of : len(df) ?

Thanks. Without the progress bar it seems to work; thanks for the hint. As it is quite a long computation, it would be nice to have the progress bar, though.

Here is some more information about my dataset. It includes a lot of Python objects holding quite large strings (the largest being html_content).

len(df) == 4737

# columns: 
df.shape = (4737, 21)

# memory consumption of columns
df.memory_usage(deep=True) == 
Index             201736
url               524872
depth              37896
counter            37896
time              336327
filename          486537
encoding          297990
file_path         548118
tables            246232
main_content    13186430
html_content    22344093
language          279483
lists             243120
list_score         37896
description     16791103
table0d           227376
table1d           259776
tablexd           232944
table_len          37896
table_num          37896
table_score        37896
co_score           37896

@bensdm

bensdm commented Jan 20, 2020

Same here. Is there any way to make it work with the progress bar?

@MarvinT

MarvinT commented Jan 21, 2020

Why not just use the parallel version of tqdm instead of trying to build your own progress bar?

@orena1

orena1 commented Apr 20, 2020

I'm using the parallel version of tqdm and I still get this error...

@pacman100

I am also getting the same error. Setting the progress bar to False seems to work. It would be better if this worked with the progress bar enabled.

@nbrosse

nbrosse commented Sep 16, 2020

Same error here. It would be nice to have the progress bar.

@Collonville

Are there any updates for this issue? I have the same problem and can't solve it.

@aburneraccount

I get the same error no matter what size df I use or how many cores I use, on python 3.8. I checked and the value being passed to int.to_bytes in inliner.py is 308
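
For reference, the failure can be reproduced outside pandarallel. The sketch below assumes that `nb_bytes` is 1 at the failing call (the thread does not state this explicitly); `fits_in` is a hypothetical helper used only for illustration.

```python
def fits_in(value: int, nb_bytes: int) -> bool:
    """Return True if a non-negative value fits in nb_bytes unsigned bytes."""
    return 0 <= value < (1 << (8 * nb_bytes))


# 308 is the value reported above; a single unsigned byte holds at most 255.
try:
    int.to_bytes(308, 1, "little")
    raised = False
except OverflowError:
    raised = True

# With a second byte available, the same value converts without error.
two_bytes = int.to_bytes(308, 2, "little")
```

Under that assumption, the error does not depend on the size of the DataFrame, which is consistent with the report above.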

@wangchong666

just edit PYTHON_PATH/python3.6/site-packages/pandarallel/utils/inliner.py line 71

return int.to_bytes(item%(1<<nb_bytes*8), nb_bytes, "little")
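
A note on what this workaround does to an out-of-range value: the modulo discards the high bits rather than encoding them. Below is a minimal sketch, assuming `nb_bytes` is 1 and using a stand-in function (`int2python_bytes_wrapped` is a hypothetical name, not pandarallel's code):

```python
def int2python_bytes_wrapped(item: int, nb_bytes: int) -> bytes:
    # Same expression as the suggested edit: keep only the low nb_bytes * 8 bits.
    return int.to_bytes(item % (1 << nb_bytes * 8), nb_bytes, "little")


wrapped = int2python_bytes_wrapped(308, 1)

# The call no longer raises, but the byte now encodes 308 - 256 = 52.
decoded = int.from_bytes(wrapped, "little")
```

So the exception goes away, but the value written is no longer the one that was computed. Whether that matters depends on what the value represents at that point in the inliner.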

@yeus
Author

yeus commented Aug 3, 2021

> just edit PYTHON_PATH/python3.6/site-packages/pandarallel/utils/inliner.py line 71
>
> return int.to_bytes(item%(1<<nb_bytes*8), nb_bytes, "little")

Could we have this as a patch? Or does it interfere with something else?

@yechafengyun

Same Problem

@yechafengyun

Same Problem

I checked that my apply function works correctly when progress_bar is set to False. It seems to be related to the version.
Please fix it.

@RobSpectre

Same happening here. @wangchong666's fix does work - would love to see this released in the module.

@s-arjun

s-arjun commented Dec 3, 2021

Same here. @wangchong666's fix got the progress bar to start, but no progress was made; it was just stuck, even for a small dataset.
Any fix or patch would be helpful.

@RobSpectre

Confirming this bug still persists in 1.5.4.

@nalepae
Owner

nalepae commented Jan 11, 2022

On it.

@park

park commented Jan 15, 2022

Same here. Waiting for a fix. Thanks.

@dmytrostriletskyi

Same here. Waiting for a fix. Thanks.

@nalepae
Owner

nalepae commented Feb 8, 2022 via email

@nalepae
Owner

nalepae commented Mar 3, 2022

On Pandarallel v1.5.6, I temporarily deactivated the inliner, which causes this issue.

Positive impact: This bug is solved.
Negative impact: In some cases, progress bars themselves will slow down computation.
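
For background on why shifting instruction arguments can overflow: in CPython 3.6 and later, each bytecode instruction occupies two bytes (an opcode and a one-byte argument), and larger arguments are carried by preceding `EXTENDED_ARG` instructions. The sketch below is not pandarallel's inliner; `encode_instruction` is a hypothetical helper, and the opcode number in the usage lines is an arbitrary placeholder.

```python
import dis

EXTENDED_ARG = dis.opmap["EXTENDED_ARG"]


def encode_instruction(opcode: int, arg: int) -> bytes:
    """Encode one instruction, prefixing EXTENDED_ARG for arguments above 255."""
    if not 0 <= arg < (1 << 32):
        raise ValueError("argument must fit in 32 bits")
    # Split the argument into bytes, most significant first, dropping leading zeros.
    parts = arg.to_bytes(4, "big").lstrip(b"\x00") or b"\x00"
    out = bytearray()
    for high in parts[:-1]:
        out += bytes((EXTENDED_ARG, high))
    out += bytes((opcode, parts[-1]))
    return bytes(out)


opcode = 100  # placeholder opcode number, for illustration only
small = encode_instruction(opcode, 5)    # two bytes: opcode, argument
large = encode_instruction(opcode, 308)  # four bytes: EXTENDED_ARG prefix, then opcode
```

Adding an `EXTENDED_ARG` prefix lengthens the bytecode, so any jump offsets that point past the insertion would need to be recomputed as well. That makes a general fix in a bytecode rewriter more involved than widening a single byte.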

@nalepae nalepae closed this as completed Mar 3, 2022