This repository has been archived by the owner on Sep 6, 2022. It is now read-only.

Training is not converging. eval_wer sticks at ~95%. #35

Open
stefan-falk opened this issue Jun 25, 2020 · 7 comments

Comments

@stefan-falk

I was finally able to run training on a single GPU (multi-GPU does not seem to work right now), but the word error rate is not dropping.

I did not change anything in the code, and I am using the Common Voice dataset as suggested by the README.md.

As you can see below, the train_loss drops but the eval_wer goes back up after a slight drop:

image

image

Any idea where this might come from?
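
For readers unfamiliar with the metric: the thread does not show how `eval_wer` is computed in this repository, but word error rate is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below assumes that standard definition and whitespace tokenization; the function name `word_error_rate` is hypothetical and not taken from the repo.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a model that outputs nothing scores exactly 100%, and extra inserted words can push the value above 100%, so a WER stuck near 95% means almost no reference words are being recovered.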

@PeiyanFlying

Excuse me, but I have another question. When I train the model, I always run into an "out of memory" error, like this:

RuntimeError: CUDA out of memory. Tried to allocate 8.05 GiB (GPU 0; 23.62 GiB total capacity; 18.02 GiB already allocated; 2.84 GiB free; 19.59 GiB reserved in total by PyTorch)

I train on a single GPU with 23.6 GiB of memory. So how did you manage to run the model on only one GPU?
Many thanks!
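
A rough way to see why a single RNN-T allocation can be this large: the joint network typically produces one logit per (audio frame, label position, vocabulary entry), giving a tensor of shape `(batch, T, U + 1, vocab)`. The sketch below only does that arithmetic. The sizes in the example are hypothetical, since the actual sequence lengths and vocabulary size in either setup are not stated in the thread, and it ignores gradients and every other activation.

```python
def joint_tensor_gib(batch_size: int, num_frames: int, num_labels: int,
                     vocab_size: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of one (batch, T, U + 1, vocab) tensor of logits."""
    # U + 1 accounts for the start-of-sequence position in the label axis.
    num_values = batch_size * num_frames * (num_labels + 1) * vocab_size
    return num_values * bytes_per_value / 2**30


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to illustrate the scaling.
    for batch_size in (32, 16, 8):
        gib = joint_tensor_gib(batch_size, num_frames=500,
                               num_labels=100, vocab_size=4096)
        print(f"batch={batch_size:2d}: {gib:.2f} GiB for the joint logits alone")
```

Because the size is linear in the batch size, halving the batch halves this tensor, which is consistent with small batches fitting on a smaller card.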

@stefan-falk
Author

@PeiyanFlying I am using a rather small batch size like 8 or 16 on a GeForce 1080 Ti (11 GB VRAM). In fact, multi-GPU seems to be broken at the moment. I am not able to use more GPUs than one at this point.

@PeiyanFlying

> @PeiyanFlying I am using a rather small batch size like 8 or 16 on a GeForce 1080 Ti (11 GB VRAM). In fact, multi-GPU seems to be broken at the moment. I am not able to use more GPUs than one at this point.

Thank you very much. These days I am working on RNNT training on LibriSpeech with PyTorch, but with the same config settings as this repository I easily run into the OOM problem.
I will look into it.
Thanks!

@stefan-falk
Author

@PeiyanFlying Have you had any success yet? And could you link me to the PyTorch library you're using? I'd like to take a look in case https://github.com/noahchalifour/rnnt-speech-recognition won't work for me.

@PeiyanFlying

OK, I am working on it. Once the PyTorch library runs successfully, I will give you the link.

@noahchalifour
Owner

@stefan-falk I have also noticed that the model is not converging, and I have been working on a solution for a while. It seems, though, that if you use a small enough dataset (as a test), the model does converge. I read that the original paper uses massive batch sizes, and I'm not sure whether that is the reason the model is not converging. Any insights?
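
If batch size does turn out to matter, one common workaround on a memory-limited GPU is gradient accumulation: compute gradients on several small micro-batches and combine them before a single weight update. Nothing in this thread confirms that this fixes the convergence problem, and the repository is not known to implement it. The sketch below uses a hypothetical one-parameter model `y ≈ w * x` in plain Python only to show that the accumulated gradient matches the full-batch gradient for a mean loss.

```python
def mse_gradient(w: float, batch: list) -> float:
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w: float, data: list, micro_batch_size: int) -> float:
    """Full-batch gradient assembled from micro-batches that fit in memory."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        # Weight each micro-batch mean by its size so that an uneven
        # final chunk still contributes the correct share.
        total += mse_gradient(w, micro) * len(micro)
    return total / len(data)


if __name__ == "__main__":
    # Hypothetical toy data: (x, y) pairs.
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
    print(mse_gradient(0.5, data), accumulated_gradient(0.5, data, 2))
```

The trade-off is time rather than memory: the effective batch grows with the number of accumulated micro-batches, while peak memory stays at the micro-batch level.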

@WrathOfGrapes

@noahchalifour Correct me if I'm wrong, but has nobody managed to train the network from this repo to a WER of 30% or better on Libri/common_voice?
