
Memory is fully consumed and training quits with errors during 40k-hour ASR training #8897

Closed
haihua opened this issue Apr 12, 2024 · 12 comments
Labels: bug (Something isn't working), stale

haihua commented Apr 12, 2024

During training, we noticed CPU memory increasing over time, until about 74% of training was done and no memory was available. Training then quit with the errors below.
We are using Conformer-CTC with 1.2 TB of CPU memory and 8 dataloader workers.
We suspect this is related to PyTorch Lightning, with some memory being held until an epoch is done.
There are no errors with smaller datasets, but once the training data gets larger the bug is triggered.
Please give us tips on how to fix the problem, thanks!

[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800380 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800345 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800374 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800421 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800421 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800380 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800345 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800374 milliseconds before timing out.
asr_scripts_78/egs/sg/run.sh: line 547: 181694 Aborted (core dumped) python3 $scripts/asr_trainer.py --conf $cfg --devices $devices --accumulate_grad_batches 8 --accelerator 'cuda' --save_top_k 5 --val_check_interval 200 --checkpoint_path ${ckpt_path} --model.optim.lr 0.25 --model.optim.sched.warmup_steps 10000 --model.train_ds.max_duration 18.2 --model.train_ds.num_workers 8 --model.optim.sched.name NoamAnnealing --model.train_ds.manifest_filepath ${train_data} --resume_from_checkpoint "${pretrained_mdl}" --model.tokenizer.dir ${tokenizer} --model.tokenizer.type 'bpe' --model.train_ds.batch_size 64 --model.validation_ds.batch_size 64 --model.validation_ds.manifest_filepath ${valid_data} --model.interctc.loss_weights "[]" --model.interctc.apply_at_layers "[]" --model.optim.sched.last_epoch 25000

haihua added the bug (Something isn't working) label Apr 12, 2024
@titu1994 (Collaborator)

Is it GPU or CPU memory that is exhausted? And how many nodes are you using?

What version of NeMo are you using? Without sufficient details it's not possible to debug.

What I can say is that we train on nodes with 400 GB of RAM per node and A100 GPUs with 80 GB of GPU memory, on 90-400K hours of speech, without OOM in either CPU or GPU memory.

If you can visibly see CPU RAM constantly increasing during training, a pseudo-fix could be to use exp_manager.max_time_per_run and set it to a reasonable value, like a day; the job then stops after a day and you can restart it to sidestep the memory leak. It's not a fix, but a temporary workaround.
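
Concretely, that could look roughly like the following. This is only a sketch: it assumes the training config is a standard OmegaConf/Hydra config with an exp_manager section (the custom asr_trainer.py used in this thread may expose these options differently), the config filename is made up, and the max_time_per_run format should be verified against the NeMo version in use.

    from omegaconf import OmegaConf

    # Sketch: cap each run's wall-clock time so the job exits and is relaunched
    # before host memory is exhausted; training then resumes from the latest checkpoint.
    cfg = OmegaConf.load("asr_train_config.yaml")        # hypothetical config file
    cfg.exp_manager.max_time_per_run = "00:23:00:00"     # DD:HH:MM:SS, stop after ~23 hours
    cfg.exp_manager.resume_if_exists = True              # continue from the last checkpoint
    cfg.exp_manager.resume_ignore_no_checkpoint = True   # first run has no checkpoint yet
    # ...then build the trainer and model as usual and call
    # nemo.utils.exp_manager.exp_manager(trainer, cfg.exp_manager)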


haihua commented Apr 12, 2024

Is it GPU or CPU memory that is exhausted ? And how many nodes are you using ?

  1. CPU, not GPU.
  2. Just a single node.
     We only added one line,
     self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False)
     in the file
     nemo/collections/asr/models/ctc_models.py
     Previously we used on_epoch=True, but the problem still remains after changing it to False (see the sketch below).
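
For context, here is how that flag behaves in a plain PyTorch Lightning module (a minimal stand-in, not NeMo's actual ctc_models.py code): with on_epoch=True, Lightning caches each step's logged value in host memory for the epoch-end reduction, which can grow noticeably over a very long epoch, while on_epoch=False logs the scalar per step only.

    import torch
    import pytorch_lightning as pl

    class TinyCTCLike(pl.LightningModule):
        # Minimal stand-in module, only to illustrate the logging change above.
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 8)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss_value = torch.nn.functional.mse_loss(self.layer(x), y)
            # on_epoch=True would keep every step's value until the epoch-end reduction;
            # on_epoch=False logs the scalar per step and keeps nothing across the epoch.
            self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False)
            return loss_value

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)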

What version of NeMo are you using?

git log
commit 0d3d8fa (HEAD -> main)
Author: anteju 108555623+anteju@users.noreply.github.com
Date: Wed Nov 15 16:56:29 2023 -0800

[ASR] GSS-based mask estimator (#7849)

* Added GSS-based mask estimator for multispeaker scenarios

Signed-off-by: Ante Jukić <ajukic@nvidia.com>

* Addressed PR comments

Signed-off-by: Ante Jukić <ajukic@nvidia.com>

---------

Signed-off-by: Ante Jukić <ajukic@nvidia.com>
Co-authored-by: Taejin Park <tango4j@gmail.com>

Actually, it's very easy to verify: just submit a training task with, say, LibriSpeech data, and you can observe that CPU memory keeps increasing within an epoch (see the monitoring sketch below).
Such an increase won't hurt as long as it is slow, since after an epoch memory usage somehow goes down again. Here, if we decrease our training data to 30k hours, we can finish an epoch normally with 1.2 TB of CPU memory.
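
One way to make this measurable rather than watching top is to log host RSS from a Lightning callback. This is only a sketch using psutil (the class name is made up, the on_train_batch_end signature may differ slightly between Lightning versions, and DataLoader worker processes are counted by summing over child processes):

    import psutil
    import pytorch_lightning as pl

    class HostMemoryLogger(pl.Callback):
        # Logs the resident set size (RSS) of the trainer process plus its
        # dataloader worker children every `every_n_steps` training batches.
        def __init__(self, every_n_steps: int = 500):
            self.every_n_steps = every_n_steps
            self.proc = psutil.Process()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            if batch_idx % self.every_n_steps != 0:
                return
            rss = self.proc.memory_info().rss
            for child in self.proc.children(recursive=True):
                try:
                    rss += child.memory_info().rss   # include dataloader workers
                except psutil.NoSuchProcess:
                    pass                             # a worker exited between calls
            pl_module.log('host_rss_gb', rss / 1e9, on_step=True, prog_bar=True)

Passing an instance via Trainer(callbacks=[HostMemoryLogger()]) shows the growth directly in the progress bar and logger, which also helps tell a dataloader-side leak from one in the training loop itself.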


haihua commented Apr 12, 2024 via email

@titu1994 (Collaborator)

That NeMo version is 6 months old; can you try r1.23 and see if it persists? We do not see constantly increasing CPU memory per epoch, but that may be because we use multiple nodes (minimum 4 nodes).


riqiang-dp commented Apr 26, 2024

Hi, is this issue resolved? I've been running into the same issue. (I can confirm that it happens on 1.23 as well)


ROZBEH commented May 23, 2024

Hi there,

Just checking in and wondering whether this has been resolved?
I am facing the same issue.

Thank you.


haihua commented May 24, 2024 via email


ROZBEH commented May 24, 2024

Thanks @haihua
I'm indeed using 5 nodes with 5 GPUs each. Is that what you mean?


haihua commented May 24, 2024 via email


ROZBEH commented May 24, 2024

I see, but the above issue persists even with multi-node training, and I'd like to get it working.

github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Jun 24, 2024

github-actions bot commented Jul 1, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 1, 2024