We're trying to calculate features for a ~20k hour dataset using the kaldifeatfbank extractor, and the process keeps crashing whenever num_workers is set higher than 0. This is the code we're trying to run:
The initial error was RuntimeError: received 0 items of ancdata. After going through the issues k2-fsa/icefall#515 and pytorch/pytorch#973, we tried two suggested solutions, neither of which worked:
Increasing the ulimit soft and hard limits to 1024000, which stopped the initial error but crashed later with RuntimeError: unable to mmap 111360 bytes from file <filename not specified>: Cannot allocate memory (12).
Increasing ulimit and also setting torch.multiprocessing.set_sharing_strategy('file_system'), which crashed with a similar error: RuntimeError: unable to mmap 202944 bytes from file </torch_57144_2565496321_1982>: Cannot allocate memory (12)
These errors suggest a significant memory leak that drives RAM usage up until the process crashes.
With num_workers=0, it took 18 hours to process 14 million files with no memory leak; with num_workers=46, it took 8 hours to process 12 million files, but with the memory leak.
How do we avoid this memory growth and the resulting crash? Modifying batch_duration doesn't seem to change the VRAM usage. According to this comment, we are supposed to set the sharing strategy for each worker via a worker_init_fn; how do we go about doing that in this case? Also, this comment seems to point to the root cause of this issue. Is there something obvious we're missing in our approach that we should change for large datasets?
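For reference, the worker_init_fn approach mentioned above could be sketched roughly like this (a minimal sketch assuming a standard PyTorch DataLoader; the function name is ours, not from the library):

```python
import torch.multiprocessing as mp

def set_sharing_strategy(worker_id: int) -> None:
    # DataLoader workers are separate processes, so setting the strategy
    # only in the parent process is not enough; set it again in each worker.
    mp.set_sharing_strategy("file_system")

# Hypothetical usage:
# loader = DataLoader(dataset, num_workers=46,
#                     worker_init_fn=set_sharing_strategy)
```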
Any suggestions or guidance regarding this would be of great help.
Thank you!
I can't see any obvious part of the code that would cause memory leaks. We are not storing the objects returned from the dataloader - the results are processed and submitted to a file writing thread. Perhaps with so many workers, the file writing thread queue is growing indefinitely?
One thing worth trying is periodically pausing the main loop to let the writer thread "catch up". If this helps, we can either 1) refactor to implement a properly sized queue, 2) remove the writer thread altogether and accept the inefficiency, or 3) refactor to use a process pool executor for writing and assign more workers to it.
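To illustrate option 1), a bounded queue provides the backpressure automatically: put() blocks once maxsize items are pending, so the main loop waits for the writer instead of letting unwritten results pile up in memory. A minimal stdlib sketch (the maxsize value and names are placeholders, not the actual code in question):

```python
import queue
import threading

write_queue: queue.Queue = queue.Queue(maxsize=8)
SENTINEL = None

def writer() -> None:
    # Drain the queue and write each result to disk (placeholder body).
    while True:
        item = write_queue.get()
        if item is SENTINEL:
            break
        # ... write `item` to disk here ...
        write_queue.task_done()

t = threading.Thread(target=writer, daemon=True)
t.start()

for item in range(100):      # stands in for processed batches
    write_queue.put(item)    # blocks whenever 8 items are already pending

write_queue.put(SENTINEL)    # signal the writer to finish
t.join()
```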
If this doesn't help, another option is to go into paranoia mode and explicitly delete the cuts, audio, and feature tensors with del after they are processed.
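A minimal illustration of the paranoia-mode idea, using a bytearray as a stand-in for a large per-batch feature tensor (the function is hypothetical):

```python
import gc

def process_batch() -> int:
    feats = bytearray(10_000_000)  # stands in for a computed feature tensor
    n = len(feats)
    # ... write feats to disk here ...
    del feats                      # drop the reference as soon as possible
    gc.collect()                   # force a collection pass between batches
    return n

processed = process_batch()
```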