
Maximum size exceeded #34

Closed
vvssttkk opened this issue Jul 10, 2019 · 23 comments
Assignees
Labels
bug Something isn't working

Comments

@vvssttkk

Hi,
I set pandarallel.initialize(shm_size_mb=10000), and after calling parallel_apply on my column I get the error Maximum size exceeded (2GB).

Why do I get this message when I set more than 2 GB?
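
Roughly, the setup looks like this (a minimal sketch; the column contents and the applied function are illustrative assumptions, not taken from the report):

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(shm_size_mb=10000)  # request a 10 GB shared-memory store

# a large string column; on big enough data this reportedly raises
# ArrowInvalid: Maximum size exceeded (2GB)
df = pd.DataFrame({"text": ["some text"] * 1_000_000})
result = df["text"].parallel_apply(str.upper)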

@nalepae
Owner

nalepae commented Jul 10, 2019

Hi,

Could you tell me if, when you run pandarallel.initialize(shm_size_mb=10000), you get this message in the console where you launched your Jupyter Notebook/Lab (if you use a notebook/lab), or directly in the console where you launched the script?

W0710 15:49:31.496636 6943 store.cc:1160] System memory request exceeds memory available in /dev/shm. The request is for 10000000000 bytes, and the amount available is 3579969945 bytes. You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

Caution: This message won't display if you initialize with verbose < 2

If this message appears, could you please give me the result of the following command?
$ df -h | grep /dev/shm

@vvssttkk
Author

Yes, I get it:
[screenshot of the console warning]
I don't set any verbose option.

@vvssttkk
Author

I also get this error when I set 5 GB and other values.

@nalepae
Owner

nalepae commented Jul 10, 2019

Thanks. And so, what is the result of $ df -h | grep /dev/shm (see the end of my last message)?

@vvssttkk
Author

Sorry, here it is:
tmpfs 32G 70M 32G 1% /dev/shm

@vvssttkk
Author

I see, so I don't have much memory.

@nalepae
Owner

nalepae commented Jul 10, 2019

No, it's the contrary: 1% represents the used memory, not the free memory. So it should be OK...

Can you tell me more about your computer?
OS, RAM size...

@vvssttkk
Author

Exactly, I mixed them up.
Ubuntu 18.04 LTS, 64 GB RAM, AMD Ryzen Threadripper 1920X 12-core processor, SSD:
[screenshot of disk information]

@vvssttkk
Author

By the way, with a multiprocessing Pool I can do this processing easily.

@nalepae
Owner

nalepae commented Jul 10, 2019

Well, for your case I don't really understand what's happening; I have to dig more.
I guess you already tried without increasing the SHM size?
Does it work correctly with the examples given in the documentation?

Do you have the latest version of Pandarallel? (v1.2.0; I changed how multiple initializations are handled in this version.)

@vvssttkk
Author

vvssttkk commented Jul 10, 2019

I installed the library today.
Yes, I already tried without increasing the SHM size.

@lores

lores commented Jul 17, 2019

I don't know if it's the same error, but I get the "Maximum size exceeded (2GB)" from Arrow. I run within Jupyter.

pandarallel.initialize(shm_size_mb=10000)
decoded = df.myjsonfield.parallel_apply(json.loads) # ~4M rows
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-2-102322070224> in <module>()
     28         denials_json = denials_json.append(d)
     29 
---> 30 decoded = df.myjsonfield.parallel_apply(json.loads)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     61             """Please see the docstring of this method without `parallel`"""
     62             try:
---> 63                 return func(*args, **kwargs)
     64 
     65             except _PlasmaStoreFull:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/series.py in closure(series, func, *args, **kwargs)
     69         def closure(series, func, *args, **kwargs):
     70             chunks = chunk(series.size, nb_workers)
---> 71             object_id = plasma_client.put(series)
     72 
     73             with ProcessPoolExecutor(max_workers=nb_workers) as executor:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pyarrow/_plasma.pyx in pyarrow._plasma.PlasmaClient.put()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pyarrow/serialization.pxi in pyarrow.lib.serialize()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Maximum size exceeded (2GB)

@msankith

msankith commented Jul 22, 2019

I am encountering the same error as well.

New pandarallel memory created - Size: 6000 MB
Pandarallel will run on 32 workers

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-13-fb3be564d742> in <module>()
      9     return row
     10 pandarallel.initialize(shm_size_mb=6000,nb_workers=32)
---> 11 data = data.parallel_apply(getLabel,axis=1)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     61             """Please see the docstring of this method without `parallel`"""
     62             try:
---> 63                 return func(*args, **kwargs)
     64 
     65             except _PlasmaStoreFull:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandarallel/dataframe.py in closure(df, func, *args, **kwargs)
     39             chunks = chunk(df.shape[opposite_axis], nb_workers)
     40 
---> 41             object_id = plasma_client.put(df)
     42 
     43             with ProcessPoolExecutor(max_workers=nb_workers) as executor:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pyarrow/_plasma.pyx in pyarrow._plasma.PlasmaClient.put()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pyarrow/serialization.pxi in pyarrow.lib.serialize()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Maximum size exceeded (2GB)

df -h | grep /dev/shm
tmpfs           121G   47M  121G   1% /dev/shm

@nalepae
Owner

nalepae commented Jul 22, 2019

Well, I cannot reproduce this issue on my computer.

Could one of you send me the portion of code (with the dataframe) to reproduce this issue?
If you don't want to send me the dataframe as-is for confidentiality reasons, you can replace all data with placeholders ("aaaa" for strings, 42. for floats, ...).
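
For example, one way to do that replacement while keeping the shape and string lengths of the original dataframe (a sketch; the helper name is hypothetical, and preserving sizes matters if the bug is size-dependent):

import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = 42.0
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = 42
        else:
            # keep each string's length so the serialized size stays comparable
            out[col] = out[col].astype(str).map(lambda s: "a" * len(s))
    return out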

Curiously, the "2GB" seems to be hard-coded directly in the PyArrow error message:
here, line 113

Thanks!

@nalepae nalepae self-assigned this Jul 22, 2019
@nalepae nalepae added the bug Something isn't working label Jul 22, 2019
@lores

lores commented Jul 23, 2019

I can't make it happen anymore, I'm afraid - could it be thanks to the 1.3.0 version?
I will post my dataframe if I see the error again.

@vvssttkk
Author

So, I think the problem is with the size of the strings:
when I pass a small amount of text, apply finishes, but with data over 500k it breaks.
@nalepae, I think you can test by creating the data yourself (see the sketch below).
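
For example, a synthetic series like the following should push the serialized payload past 2 GB (sizes here are illustrative assumptions):

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(shm_size_mb=10000)

# ~600k rows of ~4 KB strings, i.e. roughly 2.4 GB of character data
s = pd.Series(["x" * 4096] * 600_000)
s.parallel_apply(len)  # reportedly fails with ArrowInvalid: Maximum size exceeded (2GB)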

@vvssttkk
Author

Nice, now I get a new error: BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

@vvssttkk
Author

vvssttkk commented Jul 24, 2019

@nalepae, I think this can help you: at this moment the memory is not even half full.
[screenshot of memory usage]

@Dajre

Dajre commented Aug 2, 2019

I'm having this issue as well on Python 3.6.5, using an AWS SageMaker notebook.
I set the shm size to 100 GB and still get the same error that it exceeded 2 GB.

@slarrain

slarrain commented Sep 4, 2019

> I'm having this issue as well on Python 3.6.5, using an AWS SageMaker notebook.
> I set the shm size to 100 GB and still get the same error that it exceeded 2 GB.

Like @nalepae mentioned here, the 2GB message is hardcoded into Arrow, so it doesn't mean you are only getting 2 GB of memory.
It might be that you are getting a memory error, even though you have 100 GB of RAM.
How big is your DataFrame?
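
For reference, a quick way to measure it (a sketch with a stand-in dataframe; deep=True counts the actual string payloads, which the shallow estimate only approximates):

import pandas as pd

df = pd.DataFrame({"text": ["aaaa"] * 1_000_000})  # stand-in dataframe

size_gb = df.memory_usage(deep=True).sum() / 1e9
print(f"{size_gb:.2f} GB")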

@shon-otmazgin

I have the same issue as well.
Dataframe info: memory usage: 2.0+ GB

Machine:
200 GB RAM.
12 CORES

Setting:
pandarallel.initialize(shm_size_mb=5000)

info:
New pandarallel memory created - Size: 5000 MB
Pandarallel will run on 12 workers

And I always get the error:
ArrowInvalid: Maximum size exceeded (2GB)

@nalepae
Owner

nalepae commented Sep 16, 2019

I'm currently developing a new version of Pandarallel without PyArrow Plasma (which seems to be the cause of your bug).

This version is not yet released, but you can already try it by:

  • Cloning this git repository on your computer
  • Switching to the develop branch
  • Running $ pip install . (with the dot after install); the full command sequence is sketched below
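
Put together, the steps above look like this (assuming the standard GitHub clone URL):

$ git clone https://github.com/nalepae/pandarallel.git
$ cd pandarallel
$ git checkout develop
$ pip install .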

By default, this version of Pandarallel will try to use /dev/shm to transfer data between the main process and the workers. If you don't want to use this feature and prefer the standard multiprocessing transfer feature (pipe), you can disable it by passing use_memory_fs_is_available=False to the initialize method.

See the docstring of initialize for more information.

Note that the shm_size_mb parameter of initialize is now deprecated, since Pandarallel doesn't use PyArrow anymore.

If you choose to use /dev/shm to transfer data between the main process and the workers and you get a memory error, you can either:

  • Increase the size of this partition (see this link), or
  • Use the standard multiprocessing transfer feature (pipe) by passing use_memory_fs_is_available=False to the initialize method (see the snippet below).
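
A sketch of that second option, using the parameter name given in this comment (in later released versions this option appears as use_memory_fs=False):

from pandarallel import pandarallel

# fall back to the standard multiprocessing transfer (pipe)
pandarallel.initialize(use_memory_fs_is_available=False)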

Note that I have not (yet) re-implemented the progress bar and verbosity options in this version.

To remove this develop version and retrieve the official one:

  • $ pip uninstall pandarallel
  • $ pip install pandarallel

Please let me know if you encounter a new bug or if it works better now.

Regards,

Manu

@nalepae
Owner

nalepae commented Nov 9, 2019

Fixed with pandarallel 1.4.0

It seems your bug comes from the usage of pyarrow plasma.

pandarallel 1.4.0 does not use pyarrow plasma any more.

@nalepae nalepae closed this as completed Nov 9, 2019