How to process Large Train Data out of memory? #72
We could support shard training in the future to automatically handle this kind of scenario. In the meantime, you have 2 options. The second one is to split your training data manually: run preprocess.lua on each part, which produces separate data packages. Then, you start an initial training for one epoch on the first package, and retrain your model for one epoch on the second data package. Finally, you alternate retrainings on the two packages.
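A minimal command-line sketch of this workflow, assuming two splits named part1/part2 (placeholder names); exact option names such as -end_epoch and the checkpoint file naming can vary between versions, so check `th preprocess.lua -h` and `th train.lua -h` for your install:

```sh
# Preprocess each half of the corpus separately (file names are placeholders).
th preprocess.lua -train_src part1.src -train_tgt part1.tgt \
  -valid_src valid.src -valid_tgt valid.tgt -save_data data1
th preprocess.lua -train_src part2.src -train_tgt part2.tgt \
  -valid_src valid.src -valid_tgt valid.tgt -save_data data2

# Initial training for one epoch on the first data package.
th train.lua -data data1-train.t7 -save_model model -end_epoch 1

# Retrain for one epoch on the second package, starting from the saved
# checkpoint (the checkpoint file name below is illustrative).
th train.lua -data data2-train.t7 -save_model model -end_epoch 1 \
  -train_from model_epoch1.t7

# Keep alternating retrainings on data1-train.t7 and data2-train.t7.
```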
Great, the second option you suggest works for me. Thanks for your kind reply. Looking forward to seeing shard training.
@guillaumekln, a problem occurred with the second option you provided. When preprocessing the split training data, preprocess.lua produced two different dict.src files, which led to conflicting word ids during training. How can I make sure the split data parts share the same dict?
Oh, I found the parameter `'-src_vocab', '', [[Path to an existing source vocabulary]]` in preprocess.lua. Is this the parameter to set for the shared dict? When creating the dict with -src_vocab_size = 20000 on the first data part, it actually contains 20004 words. What should -src_vocab_size be when preprocessing the second data part: 20000 or 20004?
Yes, you got it, I missed that. You indeed need to share the vocabularies by using -src_vocab (and -tgt_vocab).
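For example, something like the following sketch (the .dict file names assume the usual `<save_data>.src.dict` / `<save_data>.tgt.dict` convention, which may differ in your version) makes the second package reuse the vocabularies built for the first, so word ids stay consistent:

```sh
# First part: build the vocabularies (20000 entries plus special tokens).
th preprocess.lua -train_src part1.src -train_tgt part1.tgt \
  -valid_src valid.src -valid_tgt valid.tgt \
  -save_data data1 -src_vocab_size 20000 -tgt_vocab_size 20000

# Second part: reuse the dictionaries written by the first run instead of
# rebuilding them, so both packages map words to the same ids.
th preprocess.lua -train_src part2.src -train_tgt part2.tgt \
  -valid_src valid.src -valid_tgt valid.tgt \
  -save_data data2 \
  -src_vocab data1.src.dict -tgt_vocab data1.tgt.dict
```

When an existing vocabulary is supplied, -src_vocab_size should not need to be set; the 4 extra entries you observed are the special tokens (padding, unknown, sentence start and end) added on top of the 20000 most frequent words.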
Hi, I have a model to train on a huge training dataset containing 200 million sentence pairs. preprocess.lua converts all the training data into a single data file that is then loaded by train.lua. How can train.lua load subsets of the training data iteratively? The machine runs out of memory if everything is loaded at once.
Thanks in advance.