Training is not converging. eval_wer sticks at ~95%. #35

stefan-falk · 2020-06-25T07:35:19Z

I was finally able to run training on a single GPU (multi-GPU does not seem to work right now), but the word error rate is not dropping.

I did not change anything in the code, and I am using the Common Voice dataset as suggested by the README.md.

As you can see below, the train_loss drops but the eval_wer goes back up after a slight drop:

Any idea where this might come from?
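
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance divided by the number of reference words. It is not taken from this repository's evaluation code, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))  # one substitution in three words
print(word_error_rate("the cat sat", ""))             # empty output: every word is a deletion
```

Under this definition an empty transcript scores exactly 100%, so an eval_wer near 95% would suggest the model gets almost no words right.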

PeiyanFlying · 2020-07-01T03:32:01Z

Excuse me, but I have another question. When I train the model, I always run out of memory, like this:

RuntimeError: CUDA out of memory. Tried to allocate 8.05 GiB (GPU 0; 23.62 GiB total capacity; 18.02 GiB already allocated; 2.84 GiB free; 19.59 GiB reserved in total by PyTorch)

I train on a single GPU with 23.6 GiB of memory. How did you manage to run the model on just one GPU?
Many thanks!

stefan-falk · 2020-07-01T08:17:20Z

@PeiyanFlying I am using a rather small batch size like 8 or 16 on a GeForce 1080 Ti (11 GB VRAM). In fact, multi-GPU seems to be broken at the moment. I am not able to use more GPUs than one at this point.
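
A back-of-the-envelope sketch of why batch size matters so much for this kind of model. It assumes the usual RNN-T formulation, in which the joint network produces one logit vector per (audio frame, label position) pair, giving a dense tensor of shape (B, T, U + 1, V). The shapes below are hypothetical and not taken from this repository's config, and `joint_tensor_gib` is an illustrative helper, not part of any library.

```python
def joint_tensor_gib(batch_size, num_frames, num_labels, vocab_size, bytes_per_value=4):
    """Size in GiB of a dense (B, T, U + 1, V) tensor of float32 values."""
    values = batch_size * num_frames * (num_labels + 1) * vocab_size
    return values * bytes_per_value / 2**30


# Hypothetical shapes: 400 encoder frames, 99 target labels, 1024 output tokens.
for batch_size in (8, 16, 64):
    size = joint_tensor_gib(batch_size, num_frames=400, num_labels=99, vocab_size=1024)
    print(f"batch {batch_size:>2}: {size:.2f} GiB for the joint logits alone")
```

The estimate grows linearly with batch size and covers only one tensor; activations and gradients for the rest of the network come on top of it.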

PeiyanFlying · 2020-07-01T19:37:59Z

> @PeiyanFlying I am using a rather small batch size like 8 or 16 on a GeForce 1080 Ti (11 GB VRAM). In fact, multi-GPU seems to be broken at the moment. I am not able to use more GPUs than one at this point.

Thank you very much. These days I am working on RNN-T training on LibriSpeech with PyTorch, but with the same config settings as this repository, it is easy to run into the OOM problem.
I will look into it.
Thanks!

stefan-falk · 2020-07-02T06:12:47Z

@PeiyanFlying Have you had any success yet? Also, could you link me to the PyTorch library you're using? I'd like to take a look in case https://github.com/noahchalifour/rnnt-speech-recognition won't work for me.

PeiyanFlying · 2020-07-03T14:39:49Z

OK, I am working on it. Once the PyTorch library runs successfully, I will give you the link.

noahchalifour · 2020-09-04T23:27:56Z

@stefan-falk I have also noticed that the model is not converging, and I have been working on a solution for a while. It seems, though, that if you use a small enough dataset (as a test), the model does converge. I read that the original paper uses massive batch sizes, and I'm not sure whether that is the reason the model is not converging. Any insights?
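
If batch size does turn out to matter, one common way to test that on a single GPU is gradient accumulation: average the gradients of several small batches before taking one optimizer step. Nobody in this thread has tried it, and it is not something the repository is known to support. The sketch below uses a toy one-parameter model in plain Python to show that the averaged gradient matches the full-batch gradient; the helper names are hypothetical.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of equally sized micro-batches."""
    if len(data) % micro_batch_size != 0:
        raise ValueError("data must split into equally sized micro-batches")
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        # In a real framework this would be a backward pass with no optimizer step.
        total += mse_grad(w, chunk)
    return total / len(chunks)


data = [(float(x), 3.0 * x) for x in range(1, 17)]  # toy targets: y = 3x
full = mse_grad(0.5, data)                           # one batch of 16
accum = accumulated_grad(0.5, data, micro_batch_size=4)  # four batches of 4
print(full, accum)
```

The equivalence holds for losses that are a mean over examples with equally sized micro-batches; layers that depend on batch statistics, such as batch normalization, would behave differently.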

WrathOfGrapes · 2020-09-22T18:22:22Z

@noahchalifour Correct me if I'm wrong, but has nobody managed to train the network from this repo to a WER of 30% or better on LibriSpeech or Common Voice?
