Thanks for the great work!
I'm currently trying to reproduce the results with Wan2.1-T2V-14B. Using the default single-node training script, I noticed that each epoch, including both sampling rollouts and training optimization steps, takes around 8–10 hours to finish. May I ask whether this is expected?
In addition, when scaling to multi-node training, the current training code seems to scale the global batch size with the number of ranks while keeping num_batches_per_epoch the same for each rank. As a result, each rank does the same amount of work per epoch, so the per-epoch training time does not really decrease with more nodes. May I ask whether this behavior is intended?
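To make sure I'm describing the scaling correctly, here is a minimal sketch of the arithmetic as I understand it. Everything except num_batches_per_epoch is a hypothetical name or value chosen for illustration, not taken from the repo:

```python
def epoch_workload(num_nodes, gpus_per_node, per_gpu_batch_size, num_batches_per_epoch):
    """Rough per-epoch workload; all arguments are illustrative assumptions."""
    world_size = num_nodes * gpus_per_node
    # The global batch grows with the number of ranks...
    global_batch_size = world_size * per_gpu_batch_size
    # ...while each rank still processes the same number of batches per epoch.
    samples_per_rank = per_gpu_batch_size * num_batches_per_epoch
    samples_per_epoch = global_batch_size * num_batches_per_epoch
    return {
        "world_size": world_size,
        "global_batch_size": global_batch_size,
        "samples_per_rank": samples_per_rank,
        "samples_per_epoch": samples_per_epoch,
    }


# Hypothetical settings: 8 GPUs per node, per-GPU batch of 2, 4 batches per epoch.
single_node = epoch_workload(1, 8, 2, 4)
four_nodes = epoch_workload(4, 8, 2, 4)

# Per-rank work is unchanged, so wall-clock time per epoch stays about the same,
# while the total number of samples seen per epoch grows with the node count.
print(single_node["samples_per_rank"], four_nodes["samples_per_rank"])
print(single_node["samples_per_epoch"], four_nodes["samples_per_epoch"])
```

If this matches the actual behavior, adding nodes gives more samples per epoch rather than shorter epochs, and num_batches_per_epoch would have to be reduced by hand to shorten each epoch.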
Finally, if possible, could you share how many epochs the 1.3B and 14B models were trained for?
Thanks a lot!