Changes Required for a TPU Pod #2017
Hello, are there any changes that need to be made to Flax/JAX code to make it run on a TPU pod rather than a single TPU node? For example, the T5 example on HuggingFace: or do we just need to run the same script on all workers, and they will communicate with each other: I am trying to run HuggingFace code on a TPU v4-64 pod, and your reply would be really helpful :)
Replies: 4 comments 12 replies
The reason for this question is that several users with access to TPU v4 have run into a significant issue: the HuggingFace training script runs at the same speed, and with the same global batch size, on a pod as it does on a single node.
The first thing to do here would be to add a log message to your script printing the output of |
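The call named in the reply above did not survive on this page. As a rough sketch of that kind of log message, assuming the goal is to see how many hosts and devices each worker's process can see, something like the following could be added near the top of the training script. `describe_topology` is a hypothetical helper name used only for illustration; `jax.process_index()`, `jax.process_count()`, `jax.local_device_count()` and `jax.device_count()` are existing JAX functions. The import is guarded only so the helper can be tried on a machine without JAX installed.

```python
def describe_topology(process_index, process_count, local_devices, global_devices):
    """Format a one-line summary of the devices visible to this process.

    Hypothetical helper for illustration; not part of JAX or the HuggingFace script.
    """
    return (
        f"process {process_index} of {process_count}: "
        f"{local_devices} local device(s), {global_devices} global device(s)"
    )


try:
    import jax
except ImportError:  # lets the helper above be exercised without JAX installed
    jax = None

if jax is not None:
    # On a pod, every worker runs this line; each should report the same
    # global device count but its own process index.
    print(
        describe_topology(
            jax.process_index(),
            jax.process_count(),
            jax.local_device_count(),
            jax.device_count(),
        )
    )
```

If every worker reports a process count of 1, or a global device count equal to its local count, the workers are probably not seeing each other. If the counts look right, the cause of the slowdown likely lies elsewhere.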
It actually discovers all devices and it prints them correctly.
Thanks again for your effort.
It actually discovers all devices and it prints them correctly.
OK, then I will report it in the JAX repo.
Thanks.