Code crashes without errors when importing Trainer in TPU context #28609
Comments
I would like to work on this.
I have the same problem.
Having the same issue on Kaggle.
Gentle ping @muellerzr
The torch_xla team is aware of this and working towards fixing it |
@muellerzr is there a PR or Issue we can track and link here? |
Having the same issue on Kaggle, any update? |
@muellerzr In case it helps: I was facing a similar issue (with a different error message) on GPU as well, but installing the latest versions of the Hugging Face libraries I was using fixed it.
But this doesn't fix it on TPU.
Do you mind doing a quick sanity check following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#check-pytorchxla-version ? I believe we have a special wheel built for Kaggle that bundles libtpu with pytorch/xla, so you shouldn't need to manually install libtpu.
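The version check suggested above can be sketched as a small script. This is a minimal, hedged example: it assumes torch_xla exposes a __version__ attribute (as recent releases do) and degrades gracefully on machines where the package is not installed, rather than being the exact check from the troubleshooting guide.

```python
# Sketch of a PyTorch/XLA version sanity check, per the pytorch/xla
# TROUBLESHOOTING guide. Assumes torch_xla is installed (as on a
# Kaggle TPU VM); reports a diagnostic string instead of crashing if not.
import importlib.util


def report_xla_version() -> str:
    # find_spec lets us probe availability without triggering the import,
    # which is what crashes in the issue described above.
    if importlib.util.find_spec("torch_xla") is None:
        return "torch_xla not installed"
    import torch_xla
    return getattr(torch_xla, "__version__", "unknown")


if __name__ == "__main__":
    print("torch_xla version:", report_xla_version())
```

On a correctly provisioned Kaggle TPU VM this should print something like 2.1.0+libtpu, matching the versions reported later in this thread.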
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Having the same issue here.
We found out that the issue is
Thanks for sharing @JackCaoG! Cc @Rocketknight1 for reference |
System Info
I'm working on Kaggle with a TPU enabled (TPU VM v3-8); running !transformers-cli env returns:
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:642] File already exists in database: tsl/profiler/protobuf/trace_events.proto
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1986] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a80dd030fcf,5ab82e3a7b8f&map=
*** SIGABRT received by PID 367 (TID 367) on cpu 95 from PID 367; stack trace: ***
PC: @ 0x7a80dd07fd3c (unknown) (unknown)
@ 0x7a7f654bba19 928 (unknown)
@ 0x7a80dd030fd0 (unknown) (unknown)
@ 0x5ab82e3a7b90 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a7f654bba18,7a80dd030fcf,5ab82e3a7b8f&map=310b7ae7682f84c5c576a0b0030121f2:7a7f56a00000-7a7f656d11c0
E0119 15:49:22.169993 367 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 15:49:22.170011 367 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 15:49:22.170016 367 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 15:49:22.170041 367 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 15:49:22.170050 367 coredump_hook.cc:603] RAW: Dumping core locally.
E0119 15:50:17.482782 367 process_state.cc:808] RAW: Raising signal 6 with default behavior
Aborted (core dumped)
Importing and printing the versions manually:
torch_xla: 2.1.0+libtpu
torch: 2.1.0+cu121
transformers: 4.36.2
Who can help?
@muellerzr @stevhliu
I have been trying to port my code to TPU, but cannot manage to import the libraries.
In my code (written in PyTorch) I use the transformers library to load some pretrained LLMs, and I subclassed the Trainer class to train some custom models with RL.
The code works perfectly fine on GPU, but I can't manage to make it work on TPU: the code keeps crashing without returning any error. Documentation on how to use TPUs with a torch backend in the transformers library is still missing (two years after the page was created: https://huggingface.co/docs/transformers/v4.21.3/en/perf_train_tpu), so I have no idea whether I skipped a necessary step.
While the transformers library itself imports without problems, the whole session crashes when I try to import the Trainer class.
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
output:
->2.1.0+libtpu
->2.1.0+cu121
->4.36.2
->(crash session without outputs)
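The reproduction above (print the three library versions, then import Trainer) can be sketched as follows. The helper name safe_version is hypothetical; the guards only catch Python-level exceptions and cannot catch the hard SIGABRT described in this issue, which kills the interpreter before any handler runs.

```python
# Sketch of the repro steps: report library versions, then note where the
# session crashes. Guarded so it degrades gracefully where the libraries
# are absent.
import importlib


def safe_version(name: str) -> str:
    """Return a module's __version__, or a diagnostic string."""
    try:
        mod = importlib.import_module(name)
        return getattr(mod, "__version__", "unknown")
    except Exception as exc:  # e.g. ModuleNotFoundError
        return f"unavailable ({type(exc).__name__})"


if __name__ == "__main__":
    for lib in ("torch_xla", "torch", "transformers"):
        print(lib, "->", safe_version(lib))

    # The crash reportedly happens on the next line, with no Python
    # traceback -- the whole session aborts (SIGABRT, core dumped):
    # from transformers import Trainer
```

Note that a try/except around the Trainer import would not help here: the protobuf CHECK failure in the log raises a C++ FatalException and aborts the process, so no Python exception is ever surfaced.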
Expected behavior
It should either import the library or raise an error, not crash the whole session without a hint.