Feature calculation process crashing with large dataset #1364

Open
duhtapioca opened this issue Jun 26, 2024 · 1 comment

@duhtapioca

Hi,

We're trying to compute features for a ~20k-hour dataset using the KaldifeatFbank extractor, and the process keeps crashing whenever num_workers is greater than 0. This is the code we're running:

from lhotse import KaldifeatFbank, KaldifeatFbankConfig, KaldifeatFrameOptions

extractor = KaldifeatFbank(
    KaldifeatFbankConfig(device='cuda', frame_opts=KaldifeatFrameOptions(sampling_rate=8000))
)

cuts_train = cuts_train.compute_and_store_features_batch(
    extractor=extractor,
    storage_path="/temp_ssd/icefall/egs/librispeech/ASR/data/features_kaldifeatfbank/",
    num_workers=46,
)

The initial error was RuntimeError: received 0 items of ancdata. After going through k2-fsa/icefall#515 and pytorch/pytorch#973, we tried two suggested workarounds, neither of which worked:

  1. Increasing the ulimit soft and hard limits to 1024000, which stopped the initial error but later crashed with RuntimeError: unable to mmap 111360 bytes from file <filename not specified>: Cannot allocate memory (12).

  2. Increasing the ulimit and setting torch.multiprocessing.set_sharing_strategy('file_system'), which also crashed with a similar error: RuntimeError: unable to mmap 202944 bytes from file </torch_57144_2565496321_1982>: Cannot allocate memory (12). (An illustrative sketch of both workarounds is below.)
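To make the two workarounds concrete, here is an illustrative Python version of them; the exact values and placement are illustrative, not necessarily our precise setup.

import resource
import torch.multiprocessing as mp

# Raise the soft open-file-descriptor limit up to the hard limit
# (raising the hard limit itself is what we did separately via ulimit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Switch PyTorch's tensor-sharing strategy away from per-tensor file descriptors.
mp.set_sharing_strategy('file_system')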

These errors suggest that memory is leaking somewhere, driving RAM usage up until the process crashes.

With num_workers=0, it took 18 hours to compute features for 14 million files with no memory growth; with num_workers=46, it took 8 hours for 12 million files, but memory kept growing until the crash.

How do we avoid this memory growth and the crash? Changing batch_duration doesn't seem to affect the VRAM usage. According to this comment, we are supposed to set the sharing strategy for each worker via a worker_init_fn; how do we go about doing that in this case? Also, this comment seems to point to the root cause of this issue. Is there something obvious we're missing in our approach for large datasets?
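For reference, this is roughly what we understand the worker_init_fn approach to look like with a plain PyTorch DataLoader (illustrative only; compute_and_store_features_batch builds its DataLoader internally, so we don't see where to plug this in):

import torch.multiprocessing as mp

def set_file_system_sharing(worker_id: int) -> None:
    # The linked comment suggests setting the strategy inside every dataloader
    # worker, so it applies regardless of how the worker processes are started.
    mp.set_sharing_strategy('file_system')

# Hypothetical usage with a plain torch.utils.data.DataLoader:
# loader = DataLoader(dataset, num_workers=46, worker_init_fn=set_file_system_sharing)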

Any suggestions or guidance regarding this would be of great help.

Thank you!

@pzelasko
Collaborator

I can't see any obvious part of the code that would cause memory leaks. We are not storing the objects returned from the dataloader; the results are processed and submitted to a file-writing thread. Perhaps with so many workers, the file-writing thread's queue is growing indefinitely?

You could try to add a hack which replaces

                futures.append(executor.submit(_save_worker, cuts, features))

with

                # Throttle the producer: wait until the writer thread's backlog
                # drops below 1000 pending batches (this needs `import time` at
                # the top of the module).
                while executor._work_queue.qsize() > 1000:
                    time.sleep(0.1)
                futures.append(executor.submit(_save_worker, cuts, features))

to let the writer thread "catch up". If this helps, we can either:

1) refactor to implement a properly sized queue (a rough sketch of this option is below),
2) remove the writer thread altogether and accept the inefficiency, or
3) refactor to use a process pool executor for writing and assign more workers to it.
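For illustration, option 1 could look roughly like this: a bounded queue.Queue between the main loop and a single writer thread gives natural backpressure, since put() blocks once the writer falls maxsize batches behind. _save_worker is the existing writer function from the snippet above; the rest is hypothetical glue, not the current lhotse code.

import queue
import threading

write_queue = queue.Queue(maxsize=100)  # cap the number of pending batches

def writer_loop() -> None:
    while True:
        item = write_queue.get()
        if item is None:  # sentinel: no more batches
            break
        cuts, features = item
        _save_worker(cuts, features)

writer = threading.Thread(target=writer_loop, daemon=True)
writer.start()

# In the main loop, instead of executor.submit(_save_worker, cuts, features):
#     write_queue.put((cuts, features))  # blocks while the queue is full
# After the loop finishes:
#     write_queue.put(None)
#     writer.join()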

If this doesn't help, another option is to go into paranoid mode and explicitly delete the cuts, audio, and feature tensors with del after they are processed.
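For completeness, the paranoid version would look roughly like this; process_batch and submit_to_writer are placeholders for the existing steps inside compute_and_store_features_batch, not real APIs.

import gc

for batch in dataloader:
    cuts, audio = batch["cuts"], batch["audio"]  # assumed batch layout
    features = process_batch(cuts, audio)        # placeholder: feature computation
    submit_to_writer(cuts, features)             # placeholder: hand results to the writer
    # Drop references immediately so the tensors can be freed before the next
    # batch arrives; gc.collect() is extra insurance on top of the del.
    del cuts, audio, features, batch
    gc.collect()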
