Fatal Python error: Segmentation fault, when training t5x-XXL on a TPU Pod v3-32 #340
Comments
Actually, I have some doubts about the command provided to install libraries on the TPU Pod. Could someone please check that it is correct?
For instance, I don't get why Thanks |
@adarob signal boost as this is blocking me and my PhD student @LeoLaugier, any ideas :) |
To be clear, it looks like you get a segfault in one thread but then your training completes? |
I think there is a segfault in one thread, though the training freezes (in the other threads?) so it does not complete. (I haven't retried since the day I raised the issue). It looks like issue #366. |
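For a run that segfaults in one thread and then hangs, it can help to have Python dump the stack of every thread, both at the crash and periodically while the job is frozen. The standard-library `faulthandler` module (which prints the `Fatal Python error: Segmentation fault` banner) supports both. The following is only a sketch: the helper names `enable_crash_diagnostics` and `disable_crash_diagnostics` and the 600-second timeout are illustrative assumptions, not part of t5x.

```python
import faulthandler
import sys


def enable_crash_diagnostics(log_file=None, hang_timeout_s=600.0):
    """Dump tracebacks of all threads on a fatal signal and on a suspected hang.

    `log_file` must be an open file with a real file descriptor and must stay
    open for as long as the handlers are active; it defaults to sys.stderr.
    The helper name and the default timeout are illustrative assumptions.
    """
    stream = log_file if log_file is not None else sys.stderr
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, write every thread's stack.
    faulthandler.enable(file=stream, all_threads=True)
    # Every `hang_timeout_s` seconds, write every thread's stack again, so a
    # frozen run still shows where each thread is blocked.
    faulthandler.dump_traceback_later(
        hang_timeout_s, repeat=True, file=stream, exit=False
    )
    return stream


def disable_crash_diagnostics():
    """Cancel the periodic dump and uninstall the fatal-signal handlers."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

Calling `enable_crash_diagnostics()` at the top of the training entry point on each worker would show which threads are still alive after the segfault and where they are waiting.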
We have a new API using XManager, which manages the workers, so I think it can help resolve the issue. Could you try the instructions here: https://github.com/google-research/t5x#quickstart-recommended? |
Awesome, thanks for the pointer. I'll try that asap and let you know. I guess XManager is compatible with prompt tuning right? |
Correct, XManager is an API for managing experiments, and there is no compatibility issue with prompt tuning. |
I am struggling with the same issue here. Is there any way around it other than using XManager? |
This appears to be the same issue as google/tensorstore#30 for which we will have a fix pushed out in the next day or so. |
Previously, none of the auth providers were actually thread safe, leading to intermittent crashes.

Fixes #30
Fixes google-research/t5x#340
PiperOrigin-RevId: 444883311
Change-Id: I8a49384f783a717593dc1c31f932596d12fc9c4c
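To illustrate the class of bug described in that commit message, here is a minimal Python sketch of a credential cache whose shared state is guarded by a lock, so concurrent callers cannot race on refresh. It is not tensorstore's implementation (which is C++); the class `CachedTokenProvider`, its parameters, and the one-hour lifetime are hypothetical names chosen for this example.

```python
import threading
import time


class CachedTokenProvider:
    """Hypothetical auth provider that caches a token and refreshes it safely.

    `fetch_token` is any zero-argument callable returning a token string.
    """

    def __init__(self, fetch_token, lifetime_s=3600.0, clock=time.monotonic):
        self._fetch_token = fetch_token
        self._lifetime_s = lifetime_s
        self._clock = clock
        self._lock = threading.Lock()
        self._token = None
        self._expires_at = 0.0

    def get_token(self):
        # The lock covers both the expiry check and the refresh. Without it,
        # two threads could both see an expired token and update the cached
        # fields at the same time.
        with self._lock:
            now = self._clock()
            if self._token is None or now >= self._expires_at:
                self._token = self._fetch_token()
                self._expires_at = now + self._lifetime_s
            return self._token
```

Holding the lock during the fetch keeps the sketch simple at the cost of serializing callers while a refresh is in flight.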
This should now be fixed in tensorstore 0.1.20 |
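A quick way to confirm that an environment already has the fixed release is to compare the installed tensorstore version against 0.1.20. The sketch below uses only the standard library; the helper names (`parse_version`, `has_auth_fix`, `installed_tensorstore_version`) are made up for this example, and the parser only handles simple dotted numeric versions.

```python
from importlib import metadata

# First release reported in this thread to contain the fix.
MIN_FIXED_VERSION = (0, 1, 20)


def parse_version(text):
    """Turn '0.1.20' into (0, 1, 20), reading only leading digits of each part.

    A suffix such as 'rc1' is ignored, so '0.1.20rc1' parses as (0, 1, 20).
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_auth_fix(version_text, minimum=MIN_FIXED_VERSION):
    """Return True if `version_text` is at least the first fixed release."""
    return parse_version(version_text) >= minimum


def installed_tensorstore_version():
    """Return the installed tensorstore version string, or None if absent."""
    try:
        return metadata.version("tensorstore")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_tensorstore_version()
    if found is None:
        print("tensorstore is not installed")
    elif has_auth_fix(found):
        print("tensorstore", found, "includes the fix")
    else:
        print("tensorstore", found, "is older than 0.1.20; consider upgrading")
```

On a TPU Pod this would need to be run on every worker, since each host has its own Python environment.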
Thanks @jbms! |
Hi,
I was able to train and run inference with prompt tuning on t5x-XXL on a TPU Pod v3-32 for my custom task defined from a TSV file, but I am now seeing an error that I can't understand.
I follow the instructions from prompt tuning to train and run inference on a Pod slice, except that the latest libtpu_release gives the error
TPUEmbeddingEngineState_Create not available in this library.
so I install the release from February 15, 2022. I run the following script
and get the following errors:
First:
And later
Then the run freezes. I might be missing something obvious, but I don't think I have changed anything except the data since the last time I was able to train and run inference with prompt tuning. Moreover, I was able to train on the same training data; the problems arose only when I tried to run inference.
Therefore, I'm asking if you could help me understand the issue.
Thanks in advance for your time.