My training crashes with a large corpus. #516
Comments
Wow, that's really large... probably something like 100 million sentences. With that much data, you can probably sample a subset of your dataset and get the same performance.
Ensure the sample stays consistent with the full corpus and you should be good (a sampling sketch follows this comment).
This is advice from someone who mainly does machine translation.
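If you take the sampling route, here is a minimal sketch using reservoir sampling, which streams the file so memory stays proportional to the sample size rather than the corpus size. The file paths and sample size are hypothetical:

```python
import random

def reservoir_sample(in_path, out_path, k, seed=0):
    """Write k uniformly sampled lines from in_path to out_path,
    streaming the input so memory stays O(k) regardless of corpus size."""
    rng = random.Random(seed)
    sample = []
    with open(in_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                # Replace an existing entry with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(sample)

# Hypothetical example: keep 10M sentences out of ~100M.
reservoir_sample("corpus.txt", "corpus.10M.txt", 10_000_000)
```

Reservoir sampling needs only a single pass over the file, which matters when the corpus is tens of GB.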
Please try --train_extremely_large_corpus=true with spm_train. Your computer needs a reasonable amount of memory to enable this mode, though.
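For reference, the same flag is exposed through the Python API; a minimal sketch, with placeholder paths and vocab size:

```python
import sentencepiece as spm

# Equivalent to:
# spm_train --input=corpus.txt --model_prefix=m --vocab_size=32000 \
#           --model_type=unigram --train_extremely_large_corpus=true
spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # placeholder path
    model_prefix="m",                    # placeholder prefix
    vocab_size=32000,                    # placeholder size
    model_type="unigram",
    train_extremely_large_corpus=True,   # raises internal limits for huge inputs
)
```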
I got these silent crashes too, on large text files (~50 GB).
Same here: a 30 GB corpus, 180 million sentences. I only learned about --train_extremely_large_corpus=true just now; I will try it and report back. It works with "bpe", but crashed on "unigram". Steve
Still crashing on unigram!!!! Any hints?
I have a 10 GB corpus and tried to train on Google Colab with 25 GB of RAM, but it crashes during training. Is there any way to batch the input, or anything else, so that my model doesn't crash? Any suggestion would be helpful.
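One built-in way to bound memory, rather than splitting the file by hand, is to let the trainer subsample: as I understand it, --input_sentence_size caps how many sentences are loaded and --shuffle_input_sentence draws them randomly from the corpus. A sketch, with sizes guessed for a 25 GB machine:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",              # placeholder path
    model_prefix="colab_model",      # placeholder prefix
    vocab_size=32000,                # placeholder size
    model_type="unigram",
    input_sentence_size=10_000_000,  # load at most 10M sentences (guessed cap)
    shuffle_input_sentence=True,     # sample those sentences randomly
)
```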