How to process Large Train Data out of memory? #72

Closed
wsnooker opened this issue Jan 13, 2017 · 5 comments

@wsnooker

Hi, I have a model to train on a huge dataset containing 200 million sentence pairs. preprocess.lua converts all the training data into a single data file that is then loaded by train.lua. How can train.lua load subsets of the training data iteratively? The machine runs out of memory if everything is loaded at once.
Thanks in advance.

@guillaumekln
Collaborator

guillaumekln commented Jan 13, 2017

We could support shard training in the future to automatically handle this kind of scenario.

In the meantime, you have two options:

  1. Reduce your training data to a size that fits in memory. You should still get good results, as the model generalizes reasonably well to new data.
  2. Split your training data into several files and preprocess them independently (a shell sketch for producing the split files follows the data package list below). For example, to split your training data into two parts:
th preprocess.lua -train_src src-train-1.txt -train_tgt tgt-train-1.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -save_data data-1
th preprocess.lua -train_src src-train-2.txt -train_tgt tgt-train-2.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -src_vocab data-1.src.dict -tgt_vocab data-1.tgt.dict -save_data data-2

These two commands produce the data packages:

  • data-1-train.t7
  • data-2-train.t7
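
If your corpus is a single pair of aligned files, here is a minimal shell sketch for producing the split files; the input names src-train.txt and tgt-train.txt are assumptions, and the only requirement is that both files are cut at the same line so the halves stay parallel:

# Assumed inputs: src-train.txt / tgt-train.txt, aligned line by line.
# Cut both files at the same line number so the halves stay parallel.
n=$(wc -l < src-train.txt)
half=$(( (n + 1) / 2 ))
head -n "$half" src-train.txt > src-train-1.txt
tail -n +"$((half + 1))" src-train.txt > src-train-2.txt
head -n "$half" tgt-train.txt > tgt-train-1.txt
tail -n +"$((half + 1))" tgt-train.txt > tgt-train-2.txt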

Then, you start an initial training for one epoch on data-1-train.t7:

th train.lua -data data-1-train.t7 -save_model model -end_epoch 1

And retrain your model for one epoch on the second data package:

th train.lua -data data-2-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -end_epoch 1

Finally, alternate retraining on data-1-train.t7 and data-2-train.t7 for as long as required (a small shell loop automating this is sketched after the commands below).

th train.lua -data data-1-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-2-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-1-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 3 -end_epoch 3
...
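
Below is a minimal shell sketch automating this alternation, not an official OpenNMT script. It departs slightly from the numbering above: each pass over one data package counts as its own epoch, so every pass writes a unique model_epoch<N>_*.t7 checkpoint. The numeric suffix of the checkpoint name varies from run to run, so a glob resolves it.

# Epoch 1 on the first data package.
th train.lua -data data-1-train.t7 -save_model model -end_epoch 1

# Later passes alternate between the two data packages, each resuming from
# the checkpoint written by the previous pass.
shards=(data-2-train.t7 data-1-train.t7)
for epoch in 2 3 4 5 6; do
  prev=$(ls model_epoch$((epoch - 1))_*.t7)
  data=${shards[$(( epoch % 2 ))]}
  th train.lua -data "$data" -train_from "$prev" -save_model model \
    -start_epoch "$epoch" -end_epoch "$epoch"
done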

@wsnooker
Author

Great, the second option you suggest works for me. Thanks for your kind reply. Looking forward to shard training.

@wsnooker wsnooker reopened this Jan 14, 2017
@wsnooker
Author

@guillaumekln, a problem occurred with the second option you provided. When processing the split training data, preprocess.lua produced two different *.src.dict files, so the word ids conflicted during training. How can I make sure the split data shares the same dictionaries?

@wsnooker
Author

Oh, I found the parameter '-src_vocab', '', [[Path to an existing source vocabulary]] in preprocess.lua. Is this the parameter to set to the shared dictionary? When creating the dictionary with -src_vocab_size = 20000 on the first data part, it actually contains 20004 words. What should -src_vocab_size be when preprocessing the second data part: 20000 or 20004?

@guillaumekln
Collaborator

Yes, you got it; I missed that. You indeed need to share the vocabularies by passing -src_vocab and -tgt_vocab. The vocabulary size will be inferred from the file, so you do not need to set -src_vocab_size for the second part.
