
Memory is fully consumed and training quits with errors during 40k-hour ASR training #8897

Closed
haihua opened this issue Apr 12, 2024 · 12 comments
Labels: bug (Something isn't working), stale

haihua commented Apr 12, 2024

During training, we noticed CPU memory increasing over time, until about 74% of training was done and no memory was available. Training then quit with the errors below.
We are using Conformer-CTC with 1.2 TB of CPU memory and 8 dataloader workers.
We suspect this is related to PyTorch Lightning, with some memory being held until an epoch is done.
There are no errors with smaller datasets, but once the training data gets larger the bug is triggered.
Please give us tips on how to fix the problem, thanks!

[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800380 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800345 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800374 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800421 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800421 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800380 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800345 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800374 milliseconds before timing out.
asr_scripts_78/egs/sg/run.sh: line 547: 181694 Aborted (core dumped) python3 $scripts/asr_trainer.py --conf $cfg --devices $devices --accumulate_grad_batches 8 --accelerator 'cuda' --save_top_k 5 --val_check_interval 200 --checkpoint_path ${ckpt_path} --model.optim.lr 0.25 --model.optim.sched.warmup_steps 10000 --model.train_ds.max_duration 18.2 --model.train_ds.num_workers 8 --model.optim.sched.name NoamAnnealing --model.train_ds.manifest_filepath ${train_data} --resume_from_checkpoint "${pretrained_mdl}" --model.tokenizer.dir ${tokenizer} --model.tokenizer.type 'bpe' --model.train_ds.batch_size 64 --model.validation_ds.batch_size 64 --model.validation_ds.manifest_filepath ${valid_data} --model.interctc.loss_weights "[]" --model.interctc.apply_at_layers "[]" --model.optim.sched.last_epoch 25000

haihua added the bug (Something isn't working) label Apr 12, 2024
@titu1994 (Collaborator)

Is it GPU or CPU memory that is exhausted? And how many nodes are you using?

What version of NeMo are you using? Without sufficient details it's not possible to debug.

What I can say is that we train on nodes with 400 GB of RAM per node and A100 GPUs with 80 GB of GPU memory, on 90-400K hours of speech, without OOM in either CPU or GPU memory.

If you can visibly see CPU RAM constantly increasing during training, a pseudo-fix could be to use exp_manager.max_time_per_run and set it to a reasonable value, like a day; the job then stops after a day and you can restart it to sidestep the memory leak. It's not a fix, but a temporary workaround.
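
Concretely, that could look roughly like the following. This is only a sketch: it assumes the training config is a standard OmegaConf/Hydra config with an exp_manager section (the custom asr_trainer.py used in this thread may expose these options differently), the config filename is made up, and the max_time_per_run format should be verified against the NeMo version in use.

    from omegaconf import OmegaConf

    # Sketch: cap each run's wall-clock time so the job exits and is relaunched
    # before host memory is exhausted; training then resumes from the latest checkpoint.
    cfg = OmegaConf.load("asr_train_config.yaml")        # hypothetical config file
    cfg.exp_manager.max_time_per_run = "00:23:00:00"     # DD:HH:MM:SS, stop after ~23 hours
    cfg.exp_manager.resume_if_exists = True              # continue from the last checkpoint
    cfg.exp_manager.resume_ignore_no_checkpoint = True   # first run has no checkpoint yet
    # ...then build the trainer and model as usual and call
    # nemo.utils.exp_manager.exp_manager(trainer, cfg.exp_manager)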


haihua commented Apr 12, 2024

Is it GPU or CPU memory that is exhausted ? And how many nodes are you using ?

  1. CPU, not GPU.
  2. Just a single node.
     We only added one line,
     self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False)
     in the file
     nemo/collections/asr/models/ctc_models.py
     Previously we used on_epoch=True, but the problem still remains after changing it to False (see the sketch below).
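
For context, here is how that flag behaves in a plain PyTorch Lightning module (a minimal stand-in, not NeMo's actual ctc_models.py code): with on_epoch=True, Lightning caches each step's logged value in host memory for the epoch-end reduction, which can grow noticeably over a very long epoch, while on_epoch=False logs the scalar per step only.

    import torch
    import pytorch_lightning as pl

    class TinyCTCLike(pl.LightningModule):
        # Minimal stand-in module, only to illustrate the logging change above.
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 8)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss_value = torch.nn.functional.mse_loss(self.layer(x), y)
            # on_epoch=True would keep every step's value until the epoch-end reduction;
            # on_epoch=False logs the scalar per step and keeps nothing across the epoch.
            self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False)
            return loss_value

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)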

What version of NeMo are you using?

git log
commit 0d3d8fa (HEAD -> main)
Author: anteju 108555623+anteju@users.noreply.github.com
Date: Wed Nov 15 16:56:29 2023 -0800

[ASR] GSS-based mask estimator (#7849)

* Added GSS-based mask estimator for multispeaker scenarios

Signed-off-by: Ante Jukić <ajukic@nvidia.com>

* Addressed PR comments

Signed-off-by: Ante Jukić <ajukic@nvidia.com>

---------

Signed-off-by: Ante Jukić <ajukic@nvidia.com>
Co-authored-by: Taejin Park <tango4j@gmail.com>

Actually, it's very easy to verify: just submit a training task with, say, LibriSpeech data, and you can observe that CPU memory keeps increasing within an epoch (see the monitoring sketch below).
Such an increase won't hurt as long as it is slow, since after an epoch memory usage somehow goes down again. Here, if we decrease our training data to 30k hours, we can finish an epoch normally with 1.2 TB of CPU memory.
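
One way to make this measurable rather than watching top is to log host RSS from a Lightning callback. This is only a sketch using psutil (the class name is made up, the on_train_batch_end signature may differ slightly between Lightning versions, and DataLoader worker processes are counted by summing over child processes):

    import psutil
    import pytorch_lightning as pl

    class HostMemoryLogger(pl.Callback):
        # Logs the resident set size (RSS) of the trainer process plus its
        # dataloader worker children every `every_n_steps` training batches.
        def __init__(self, every_n_steps: int = 500):
            self.every_n_steps = every_n_steps
            self.proc = psutil.Process()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            if batch_idx % self.every_n_steps != 0:
                return
            rss = self.proc.memory_info().rss
            for child in self.proc.children(recursive=True):
                try:
                    rss += child.memory_info().rss   # include dataloader workers
                except psutil.NoSuchProcess:
                    pass                             # a worker exited between calls
            pl_module.log('host_rss_gb', rss / 1e9, on_step=True, prog_bar=True)

Passing an instance via Trainer(callbacks=[HostMemoryLogger()]) shows the growth directly in the progress bar and logger, which also helps tell a dataloader-side leak from one in the training loop itself.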


haihua commented Apr 12, 2024 via email

@titu1994 (Collaborator)

That NeMo version is 6 months old; can you try r1.23 and see if it persists? We do not see constantly increasing CPU memory per epoch, but that may be because we use multiple nodes (minimum 4 nodes).


riqiang-dp commented Apr 26, 2024

Hi, is this issue resolved? I've been running into the same issue. (I can confirm that it happens on 1.23 as well)


ROZBEH commented May 23, 2024

Hi there,

Just checking in and wondering whether this has been resolved?
I am facing the same issue.

Thank you.


haihua commented May 24, 2024 via email


ROZBEH commented May 24, 2024

Thanks @haihua
I'm indeed using 5 nodes with 5 GPUs each. Is that what you mean?


haihua commented May 24, 2024 via email


ROZBEH commented May 24, 2024

I see, but the above issue persists even with multi-node training, and I'd like to get it working.

github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Jun 24, 2024

github-actions bot commented Jul 1, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 1, 2024