
Maximum size exceeded #34

Closed
vvssttkk opened this issue Jul 10, 2019 · 23 comments
Assignees
Labels
bug Something isn't working

Comments

@vvssttkk

Hi,
I set pandarallel.initialize(shm_size_mb=10000), and after calling parallel_apply on my column I get the error Maximum size exceeded (2GB).

Why do I get this message when I set more than 2 GB?
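
Roughly, the setup looks like this (a minimal sketch; the column contents and the applied function are illustrative assumptions, not taken from the report):

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(shm_size_mb=10000)  # request a 10 GB shared-memory store

# a large string column; on big enough data this reportedly raises
# ArrowInvalid: Maximum size exceeded (2GB)
df = pd.DataFrame({"text": ["some text"] * 1_000_000})
result = df["text"].parallel_apply(str.upper)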

@nalepae
Owner

nalepae commented Jul 10, 2019

Hi,

Could you tell me if, when you run pandarallel.initialize(shm_size_mb=10000), you get this message in the console where you launched your Jupyter Notebook/Lab (if you use a notebook/lab), or directly in the console where you launched the script?

W0710 15:49:31.496636 6943 store.cc:1160] System memory request exceeds memory available in /dev/shm. The request is for 10000000000 bytes, and the amount available is 3579969945 bytes. You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

Caution: This message won't display if you initialize with verbose < 2

If this message appears, could you please give me the result of the following command?
$ df -h | grep /dev/shm

@vvssttkk
Author

Yes, I get it:
[screenshot of the console warning]
I don't set any verbose option.

@vvssttkk
Author

I also get this error when I set 5 GB and other values.

@nalepae
Owner

nalepae commented Jul 10, 2019

Thanks. And so, what is the result of $ df -h | grep /dev/shm (see the end of my last message)?

@vvssttkk
Author

Sorry, here it is:
tmpfs 32G 70M 32G 1% /dev/shm

@vvssttkk
Author

I see, so I don't have much memory.

@nalepae
Owner

nalepae commented Jul 10, 2019

No, it's the contrary: 1% represents the used memory, not the free memory. So it should be OK...

Can you tell me more about your computer?
OS, RAM size...

@vvssttkk
Author

Exactly, I mixed them up.
Ubuntu 18.04 LTS, 64 GB RAM, AMD Ryzen Threadripper 1920X 12-core processor, SSD:
[screenshot of disk information]

@vvssttkk
Author

By the way, with a multiprocessing Pool I can do this processing easily.

@nalepae
Owner

nalepae commented Jul 10, 2019

Well, for your case I don't really understand what's happening; I have to dig more.
I guess you already tried without increasing the SHM size?
Does it work correctly with the examples given in the documentation?

Do you have the latest version of Pandarallel? (v1.2.0; I changed how multiple initializations are handled in this version.)

@vvssttkk
Author

vvssttkk commented Jul 10, 2019

I installed the library today.
Yes, I already tried without increasing the SHM size.

@lores

lores commented Jul 17, 2019

I don't know if it's the same error, but I get the "Maximum size exceeded (2GB)" from Arrow. I run within Jupyter.

pandarallel.initialize(shm_size_mb=10000)
decoded = df.myjsonfield.parallel_apply(json.loads) # ~4M rows
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-2-102322070224> in <module>()
     28         denials_json = denials_json.append(d)
     29 
---> 30 decoded = df.myjsonfield.parallel_apply(json.loads)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     61             """Please see the docstring of this method without `parallel`"""
     62             try:
---> 63                 return func(*args, **kwargs)
     64 
     65             except _PlasmaStoreFull:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/series.py in closure(series, func, *args, **kwargs)
     69         def closure(series, func, *args, **kwargs):
     70             chunks = chunk(series.size, nb_workers)
---> 71             object_id = plasma_client.put(series)
     72 
     73             with ProcessPoolExecutor(max_workers=nb_workers) as executor:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pyarrow/_plasma.pyx in pyarrow._plasma.PlasmaClient.put()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pyarrow/serialization.pxi in pyarrow.lib.serialize()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Maximum size exceeded (2GB)

@msankith

msankith commented Jul 22, 2019

I am encountering the same error as well.

New pandarallel memory created - Size: 6000 MB
Pandarallel will run on 32 workers

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-13-fb3be564d742> in <module>()
      9     return row
     10 pandarallel.initialize(shm_size_mb=6000,nb_workers=32)
---> 11 data = data.parallel_apply(getLabel,axis=1)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     61             """Please see the docstring of this method without `parallel`"""
     62             try:
---> 63                 return func(*args, **kwargs)
     64 
     65             except _PlasmaStoreFull:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandarallel/dataframe.py in closure(df, func, *args, **kwargs)
     39             chunks = chunk(df.shape[opposite_axis], nb_workers)
     40 
---> 41             object_id = plasma_client.put(df)
     42 
     43             with ProcessPoolExecutor(max_workers=nb_workers) as executor:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pyarrow/_plasma.pyx in pyarrow._plasma.PlasmaClient.put()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pyarrow/serialization.pxi in pyarrow.lib.serialize()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Maximum size exceeded (2GB)

df -h | grep /dev/shm
tmpfs           121G   47M  121G   1% /dev/shm

@nalepae
Owner

nalepae commented Jul 22, 2019

Well, I cannot reproduce this issue on my computer.

Could one of you send me the portion of code (with the dataframe) to reproduce this issue?
If you don't want to send me the dataframe as-is for confidentiality reasons, you can replace all data with placeholders ("aaaa" for strings, 42. for floats, ...).
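
For example, one way to do that replacement while keeping the shape and string lengths of the original dataframe (a sketch; the helper name is hypothetical, and preserving sizes matters if the bug is size-dependent):

import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = 42.0
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = 42
        else:
            # keep each string's length so the serialized size stays comparable
            out[col] = out[col].astype(str).map(lambda s: "a" * len(s))
    return out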

Curiously, the "2GB" seems to be hard-coded directly in the PyArrow error message:
here, line 113

Thanks!

@nalepae nalepae self-assigned this Jul 22, 2019
@nalepae nalepae added the bug Something isn't working label Jul 22, 2019
@lores

lores commented Jul 23, 2019

I can't make it happen anymore, I'm afraid - could it be thanks to the 1.3.0 version?
I will post my dataframe if I see the error again.

@vvssttkk
Author

So, I think the problem is with the size of the strings:
when I pass a small amount of text, apply finishes, but with data over 500k it breaks.
@nalepae, I think you can test by creating the data yourself (see the sketch below).
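
For example, a synthetic series like the following should push the serialized payload past 2 GB (sizes here are illustrative assumptions):

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(shm_size_mb=10000)

# ~600k rows of ~4 KB strings, i.e. roughly 2.4 GB of character data
s = pd.Series(["x" * 4096] * 600_000)
s.parallel_apply(len)  # reportedly fails with ArrowInvalid: Maximum size exceeded (2GB)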

@vvssttkk
Author

Nice, now I get a new error: BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

@vvssttkk
Author

vvssttkk commented Jul 24, 2019

@nalepae, I think this can help you: at this moment the memory is not even half full.
[screenshot of memory usage]

@Dajre

Dajre commented Aug 2, 2019

I'm having this issue as well on Python 3.6.5, using an AWS SageMaker notebook.
I set the shm size to 100 GB and still get the same error that it exceeded 2 GB.

@slarrain

slarrain commented Sep 4, 2019

> I'm having this issue as well on Python 3.6.5, using an AWS SageMaker notebook.
> I set the shm size to 100 GB and still get the same error that it exceeded 2 GB.

Like @nalepae mentioned here, the 2GB message is hardcoded into Arrow, so it doesn't mean you are only getting 2 GB of memory.
It might be that you are getting a memory error, even though you have 100 GB of RAM.
How big is your DataFrame?
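
For reference, a quick way to measure it (a sketch with a stand-in dataframe; deep=True counts the actual string payloads, which the shallow estimate only approximates):

import pandas as pd

df = pd.DataFrame({"text": ["aaaa"] * 1_000_000})  # stand-in dataframe

size_gb = df.memory_usage(deep=True).sum() / 1e9
print(f"{size_gb:.2f} GB")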

@shon-otmazgin

I have the same issue as well.
Dataframe info: memory usage: 2.0+ GB

Machine:
200 GB RAM.
12 CORES

Setting:
pandarallel.initialize(shm_size_mb=5000)

info:
New pandarallel memory created - Size: 5000 MB
Pandarallel will run on 12 workers

And I always get the error:
ArrowInvalid: Maximum size exceeded (2GB)

@nalepae
Owner

nalepae commented Sep 16, 2019

I'm currently developing a new version of Pandarallel without PyArrow Plasma (which seems to be the cause of your bug).

This version is not yet released, but you can already try it by:

  • Cloning this git repository on your computer
  • Switching to the develop branch
  • Running $ pip install . (with the dot after install); the full command sequence is sketched below
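
Put together, the steps above look like this (assuming the standard GitHub clone URL):

$ git clone https://github.com/nalepae/pandarallel.git
$ cd pandarallel
$ git checkout develop
$ pip install .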

By default, this version of Pandarallel will try to use /dev/shm to transfer data between the main process and the workers. If you don't want to use this feature and prefer the standard multiprocessing transfer feature (pipe), you can disable it by passing use_memory_fs_is_available=False to the initialize method.

See the docstring of initialize for more information.

Note that the shm_size_mb parameter of initialize is now deprecated, since Pandarallel doesn't use PyArrow anymore.

If you choose to use /dev/shm to transfer data between the main process and the workers and you get a memory error, you can either:

  • Increase the size of this partition (see this link), or
  • Use the standard multiprocessing transfer feature (pipe) by passing use_memory_fs_is_available=False to the initialize method (see the snippet below).
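
A sketch of that second option, using the parameter name given in this comment (in later released versions this option appears as use_memory_fs=False):

from pandarallel import pandarallel

# fall back to the standard multiprocessing transfer (pipe)
pandarallel.initialize(use_memory_fs_is_available=False)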

Note that I have not (yet) re-implemented the progress bar and verbosity options in this version.

To remove this develop version and retrieve the official one:

  • $ pip uninstall pandarallel
  • $ pip install pandarallel

Please let me know if you encounter a new bug or if it works better now.

Regards,

Manu

@nalepae
Owner

nalepae commented Nov 9, 2019

Fixed with pandarallel 1.4.0

It seems your bug comes from the usage of pyarrow plasma.

pandarallel 1.4.0 does not use pyarrow plasma any more.

@nalepae nalepae closed this as completed Nov 9, 2019