
Fix for rnnt_loss.py #1177

Merged: 6 commits merged into k2-fsa:master from yfyeung-patch-1 on Apr 26, 2023
Conversation

@yfyeung (Contributor) commented on Apr 25, 2023

No description provided.

@pkufool added the ready label (Ready for review and trigger GitHub actions to run) on Apr 25, 2023
@pkufool pkufool merged commit a23383c into k2-fsa:master Apr 26, 2023
1 check passed
@yfyeung yfyeung deleted the yfyeung-patch-1 branch April 26, 2023 07:33
@danpovey (Collaborator) commented:

This PR fixes an issue where the backward pass of the simple RNNT loss uses far more memory than it needs to.
It can cause random-seeming failures where, after some time, training ends with a message like this:

    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 5.37 GiB (GPU 0; 31.75 GiB total capacity; 17.47 GiB already allocated; 12.64 GiB free; 17.83 GiB reserved in total by PyTorch)

(Note: it remains a mystery why, often, the allocation request is for much less memory than the device reports as free; that 12.64 GiB figure comes from the device_free value returned by cudaMemGetInfo(&device_free, &device_total). Possibly this has to do with other processes using the machine. Regardless, the backward pass was using far more memory than it really needs.)
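
For anyone debugging similar OOMs: the free/total figures in that error message correspond to what cudaMemGetInfo returns, which PyTorch exposes as torch.cuda.mem_get_info. Below is a minimal illustrative sketch (not part of this PR; the helper name is made up) that prints those numbers alongside the allocator's own statistics, e.g. right before calling backward() on the loss:

    import torch

    def report_cuda_memory(device: int = 0) -> None:
        """Print the device-level free/total memory (same source as the
        numbers in the OOM message) next to PyTorch's allocator stats."""
        # torch.cuda.mem_get_info wraps cudaMemGetInfo(&free, &total); bytes.
        free_bytes, total_bytes = torch.cuda.mem_get_info(device)
        gib = 1024 ** 3
        print(f"device free     : {free_bytes / gib:.2f} GiB")
        print(f"device total    : {total_bytes / gib:.2f} GiB")
        # Memory currently held by live tensors vs. cached by the allocator.
        print(f"torch allocated : {torch.cuda.memory_allocated(device) / gib:.2f} GiB")
        print(f"torch reserved  : {torch.cuda.memory_reserved(device) / gib:.2f} GiB")

    # Example: report_cuda_memory(0) just before loss.backward() shows how much
    # headroom the backward pass of the simple RNNT loss actually has.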
