How to process Large Train Data out of memory? #72

Closed
wsnooker opened this issue Jan 13, 2017 · 5 comments

@wsnooker

Hi, I have a model to train on a huge dataset containing 200 million sentence pairs. preprocess.lua converts all the training data into a single data file that is then loaded by train.lua. How can train.lua load subsets of the training data iteratively? The machine runs out of memory if everything is loaded at once.
Thanks in advance.

@guillaumekln
Collaborator

guillaumekln commented Jan 13, 2017

We could support shard training in the future to automatically handle this kind of scenario.

In the meantime, you have two options:

  1. Reduce your training data to a size that fits in memory. You should still get good results, as the model generalizes reasonably well to new data.
  2. Split your training data into several files and preprocess them independently (a shell sketch for producing the split files follows the data package list below). For example, to split your training data into two parts:
th preprocess.lua -train_src src-train-1.txt -train_tgt tgt-train-1.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -save_data data-1
th preprocess.lua -train_src src-train-2.txt -train_tgt tgt-train-2.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -src_vocab data-1.src.dict -tgt_vocab data-1.tgt.dict -save_data data-2

These two commands produce the data packages:

  • data-1-train.t7
  • data-2-train.t7
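
If your corpus is a single pair of aligned files, here is a minimal shell sketch for producing the split files; the input names src-train.txt and tgt-train.txt are assumptions, and the only requirement is that both files are cut at the same line so the halves stay parallel:

# Assumed inputs: src-train.txt / tgt-train.txt, aligned line by line.
# Cut both files at the same line number so the halves stay parallel.
n=$(wc -l < src-train.txt)
half=$(( (n + 1) / 2 ))
head -n "$half" src-train.txt > src-train-1.txt
tail -n +"$((half + 1))" src-train.txt > src-train-2.txt
head -n "$half" tgt-train.txt > tgt-train-1.txt
tail -n +"$((half + 1))" tgt-train.txt > tgt-train-2.txt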

Then, you start an initial training for one epoch on data-1-train.t7:

th train.lua -data data-1-train.t7 -save_model model -end_epoch 1

And retrain your model for one epoch on the second data package:

th train.lua -data data-2-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -end_epoch 1

Finally, alternate retraining on data-1-train.t7 and data-2-train.t7 for as long as required (a small shell loop automating this is sketched after the commands below).

th train.lua -data data-1-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-2-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-1-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 3 -end_epoch 3
...
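
Below is a minimal shell sketch automating this alternation, not an official OpenNMT script. It departs slightly from the numbering above: each pass over one data package counts as its own epoch, so every pass writes a unique model_epoch<N>_*.t7 checkpoint. The numeric suffix of the checkpoint name varies from run to run, so a glob resolves it.

# Epoch 1 on the first data package.
th train.lua -data data-1-train.t7 -save_model model -end_epoch 1

# Later passes alternate between the two data packages, each resuming from
# the checkpoint written by the previous pass.
shards=(data-2-train.t7 data-1-train.t7)
for epoch in 2 3 4 5 6; do
  prev=$(ls model_epoch$((epoch - 1))_*.t7)
  data=${shards[$(( epoch % 2 ))]}
  th train.lua -data "$data" -train_from "$prev" -save_model model \
    -start_epoch "$epoch" -end_epoch "$epoch"
done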

@wsnooker
Author

Great, the second option you suggest works for me. Thanks for your kind reply. Looking forward to shard training.

@wsnooker wsnooker reopened this Jan 14, 2017
@wsnooker
Author

@guillaumekln, a problem occurred with the second option you provided. When processing the split training data, preprocess.lua produced two different *.src.dict files, so the word ids conflicted during training. How can I make sure the split data shares the same dictionaries?

@wsnooker
Author

Oh, I found the parameter '-src_vocab', '', [[Path to an existing source vocabulary]] in preprocess.lua. Is this the parameter to set to the shared dictionary? When creating the dictionary with -src_vocab_size = 20000 on the first data part, it actually contains 20004 words. What should -src_vocab_size be when preprocessing the second data part: 20000 or 20004?

@guillaumekln
Collaborator

Yes, you got it; I missed that. You indeed need to share the vocabularies by passing -src_vocab and -tgt_vocab. The vocabulary size will be inferred from the file, so you do not need to set -src_vocab_size for the second part.
