
My training crashes with large corpus. #516

Closed
Srj opened this issue Jun 27, 2020 · 6 comments

@Srj

Srj commented Jun 27, 2020

I have a 10 GB corpus and tried to train on Google Colab with 25 GB of RAM, but it crashes during training. Is there any way to batch the data, or anything else, so that training doesn't crash? Any suggestion would be helpful.

Srj changed the title from "My training crashes." to "My training crashes with large corpus." on Jun 27, 2020
@mingruimingrui
Contributor

mingruimingrui commented Jun 27, 2020

Wow, that's really large... probably like 100 million sentences.
Typically a training set of 10 million sentences is already considered large.

With so much data, you can probably sample a subset of your dataset and get the same performance.
Perhaps read the first ~10 million sentences and split them into train and test sets (validation optional)? A minimal sketch of this is included at the end of this comment.
Training should be straightforward. From experience, you should take note of the following during evaluation:

  • out-of-vocabulary % (or % unknown tokens)
  • average tokens per word

Ensure consistency and you should be good.

It might also help to have a validation set to generate SentencePiece vocabularies if your data is extremely noisy.
A vocabulary threshold like 20 or 50 would be good.

This is advice from someone who mainly does machine translation.
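
Below is a minimal sketch (not part of the original comment) of the "read the first ~10 million sentences and split them" idea, assuming a plain-text corpus with one sentence per line; the file names and split sizes are illustrative placeholders.

```python
# Hedged sketch: stream the first ~10M lines of a large corpus and split them
# into train/validation files before running SentencePiece training.
# "corpus.txt", the output names, and the split sizes are assumptions.
TRAIN_SIZE = 10_000_000   # sentences kept for training
VALID_SIZE = 100_000      # sentences held out for validation

with open("corpus.txt", encoding="utf-8") as src, \
     open("train.txt", "w", encoding="utf-8") as train, \
     open("valid.txt", "w", encoding="utf-8") as valid:
    for i, line in enumerate(src):
        if i < TRAIN_SIZE:
            train.write(line)
        elif i < TRAIN_SIZE + VALID_SIZE:
            valid.write(line)
        else:
            break
```

Streaming line by line keeps memory usage flat regardless of corpus size, which matters on a 25 GB Colab machine.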

@taku910
Collaborator

taku910 commented Jul 23, 2020

Please try --train_extremely_large_corpus=true with spm_train. Your computer needs to have a reasonable amount of memory to enable this mode, though.
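
For reference, a hedged sketch of the same flag passed through the sentencepiece Python wrapper (keyword arguments map to spm_train flags); the input file, model prefix, vocab size, and model type below are placeholder assumptions, not values from this thread.

```python
import sentencepiece as spm

# Sketch only: train a unigram model on a large corpus with the
# --train_extremely_large_corpus behaviour enabled. All paths and
# sizes here are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="train.txt",             # one sentence per line
    model_prefix="spm_unigram",    # writes spm_unigram.model / .vocab
    vocab_size=32000,
    model_type="unigram",
    train_extremely_large_corpus=True,
)
```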

@Mistobaan

I got these silent crashes too on large text files (~50 GB).

@tanreinama

tanreinama commented Jul 25, 2020

I get a 'std::bad_alloc' error on an 11 GB Japanese corpus with the --train_extremely_large_corpus=true option.
It seems to work well on an AWS r5.24xlarge instance; memory usage is 180 GB+. Thanks.

taku910 closed this as completed on Sep 6, 2020
@thusinh1969

Same here: a 30 GB corpus, 180 million sentences. I only just learned about --train_extremely_large_corpus=true; I will try it and let you know. It works with "bpe", but crashes on unigram.

Steve

@thusinh1969

It keeps crashing on unigram. Any hint?
Steve
