Hello, I was wondering how large the batch size can be when training on TPU. I'm currently training a vanilla Transformer model in Colab and can barely fit into TPU memory. My batch size is 128, sequences are padded with the padded_batch function, and max_len is 512. It seems I'm missing something, because it is suspicious that the TPU cannot handle batches an order of magnitude larger (like 2048).
I also tried to run the TPU profiler, but I could not, since the model doesn't output anything to keep track of.
So my question is: what are the best practices for training a Trax Transformer on TPUs?
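For a rough sense of why this configuration is tight on memory, here is a back-of-envelope estimate of the attention activations alone. It assumes a base-sized Transformer with 8 attention heads, 6 layers, and float32 activations (these dimensions are my assumptions, not stated in the issue), and ignores what the XLA compiler may rematerialize:

```python
# Rough activation-memory estimate for the attention weights of a Transformer,
# on a single TPU core. Illustrative only; the real footprint also includes
# parameters, optimizer state, and the other per-layer activations.
def attention_weight_bytes(batch, seq_len, n_heads=8, dtype_bytes=4):
    # The attention-weight tensor has shape (batch, n_heads, seq_len, seq_len).
    return batch * n_heads * seq_len * seq_len * dtype_bytes

batch, seq_len, n_layers = 128, 512, 6
per_layer = attention_weight_bytes(batch, seq_len)
print(f"attention weights per layer: {per_layer / 1e9:.2f} GB")   # ~1.07 GB
print(f"across {n_layers} layers:    {per_layer * n_layers / 1e9:.2f} GB")  # ~6.4 GB
```

With batch 128 and sequences padded to 512, the attention weights alone come to roughly 6.4 GB across 6 layers, which is already close to the memory of a Colab TPU core before anything else is counted.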
We have improved memory use since that release, so it would be a good idea to try again. But please remember that a TPU core on Colab has only 8 GB of memory, so a large batch may have trouble fitting with Transformer.
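One common way to relieve this pressure is to bucket examples by length instead of padding every batch to max_len, so that short sequences can be batched more aggressively. Below is a minimal sketch using the trax.data pipeline; `my_stream` is a hypothetical generator of (inputs, targets) token arrays, and the boundaries and batch sizes are illustrative values, not tuned recommendations:

```python
import trax

# Sketch of an input pipeline that buckets examples by length rather than
# padding every batch to 512 tokens.  `my_stream` is a hypothetical generator
# yielding (inputs, targets) pairs of token arrays.
data_pipeline = trax.data.Serial(
    # Drop examples longer than the maximum length we want to train on.
    trax.data.FilterByLength(max_length=512, length_keys=[0, 1]),
    # Shorter sequences get larger batch sizes; len(batch_sizes) must be
    # len(boundaries) + 1, the last entry covering lengths up to 512.
    trax.data.BucketByLength(
        boundaries=[64, 128, 256, 512],
        batch_sizes=[512, 256, 128, 64, 32],
        length_keys=[0, 1]),
    # Mask padding tokens (id 0) out of the loss.
    trax.data.AddLossWeights(id_to_mask=0),
)

train_stream = data_pipeline(my_stream())
```

The idea is to keep the number of tokens per batch roughly constant across buckets, so peak memory stays bounded while short sequences no longer pay for padding up to 512.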