
My training crashes with large corpus. #516

Closed
Srj opened this issue Jun 27, 2020 · 6 comments

@Srj

Srj commented Jun 27, 2020

I have a 10 GB corpus and tried to train on Google Colab with 25 GB of RAM, but it crashes during training. Is there any way to batch the data, or anything else, so that training doesn't crash? Any suggestion would be helpful.

Srj changed the title from "My training crashes." to "My training crashes with large corpus." on Jun 27, 2020
@mingruimingrui
Contributor

mingruimingrui commented Jun 27, 2020

Wow, that's really large... probably like 100 million sentences.
Typically a training set of 10 million sentences is already considered large.

With so much data, you can probably sample a subset of your dataset and get the same performance.
Perhaps read the first ~10 million sentences and split them into train and test sets (validation optional)? A minimal sketch of this is included at the end of this comment.
Training should be straightforward. From experience, you should take note of the following during evaluation:

  • out-of-vocabulary % (or % unknown tokens)
  • average tokens per word

Ensure consistency and you should be good.

It might also help to have a validation set to generate SentencePiece vocabularies if your data is extremely noisy.
A vocabulary threshold like 20 or 50 would be good.

This is advice from someone who mainly does machine translation.
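
Below is a minimal sketch (not part of the original comment) of the "read the first ~10 million sentences and split them" idea, assuming a plain-text corpus with one sentence per line; the file names and split sizes are illustrative placeholders.

```python
# Hedged sketch: stream the first ~10M lines of a large corpus and split them
# into train/validation files before running SentencePiece training.
# "corpus.txt", the output names, and the split sizes are assumptions.
TRAIN_SIZE = 10_000_000   # sentences kept for training
VALID_SIZE = 100_000      # sentences held out for validation

with open("corpus.txt", encoding="utf-8") as src, \
     open("train.txt", "w", encoding="utf-8") as train, \
     open("valid.txt", "w", encoding="utf-8") as valid:
    for i, line in enumerate(src):
        if i < TRAIN_SIZE:
            train.write(line)
        elif i < TRAIN_SIZE + VALID_SIZE:
            valid.write(line)
        else:
            break
```

Streaming line by line keeps memory usage flat regardless of corpus size, which matters on a 25 GB Colab machine.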

@taku910
Collaborator

taku910 commented Jul 23, 2020

Please try --train_extremely_large_corpus=true with spm_train. Your computer needs to have a reasonable amount of memory to enable this mode, though.
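
For reference, a hedged sketch of the same flag passed through the sentencepiece Python wrapper (keyword arguments map to spm_train flags); the input file, model prefix, vocab size, and model type below are placeholder assumptions, not values from this thread.

```python
import sentencepiece as spm

# Sketch only: train a unigram model on a large corpus with the
# --train_extremely_large_corpus behaviour enabled. All paths and
# sizes here are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="train.txt",             # one sentence per line
    model_prefix="spm_unigram",    # writes spm_unigram.model / .vocab
    vocab_size=32000,
    model_type="unigram",
    train_extremely_large_corpus=True,
)
```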

@Mistobaan

I got these silent crashes too on large text files (~50 GB).

@tanreinama

tanreinama commented Jul 25, 2020

I get a 'std::bad_alloc' error on an 11 GB Japanese corpus with the --train_extremely_large_corpus=true option.
It seems to work well on an AWS r5.24xlarge instance; memory usage is 180 GB+. Thanks.

taku910 closed this as completed on Sep 6, 2020
@thusinh1969

Same here: a 30 GB corpus, 180 million sentences. I only just learned about --train_extremely_large_corpus=true; I will try it and let you know. It works with "bpe", but crashes on unigram.

Steve

@thusinh1969

It keeps crashing on unigram. Any hint?
Steve
