
Code crashes without errors when importing Trainer in TPU context #28609

Closed
1 of 4 tasks
samuele-bortolato opened this issue Jan 19, 2024 · 13 comments

@samuele-bortolato

System Info

I'm working on Kaggle with TPU enabled (TPU VM v3-8); running !transformers-cli env returns:

[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:642] File already exists in database: tsl/profiler/protobuf/trace_events.proto
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1986] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a80dd030fcf,5ab82e3a7b8f&map=
*** SIGABRT received by PID 367 (TID 367) on cpu 95 from PID 367; stack trace: ***
PC: @ 0x7a80dd07fd3c (unknown) (unknown)
@ 0x7a7f654bba19 928 (unknown)
@ 0x7a80dd030fd0 (unknown) (unknown)
@ 0x5ab82e3a7b90 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a7f654bba18,7a80dd030fcf,5ab82e3a7b8f&map=310b7ae7682f84c5c576a0b0030121f2:7a7f56a00000-7a7f656d11c0
E0119 15:49:22.169993 367 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 15:49:22.170011 367 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 15:49:22.170016 367 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 15:49:22.170041 367 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 15:49:22.170050 367 coredump_hook.cc:603] RAW: Dumping core locally.
E0119 15:50:17.482782 367 process_state.cc:808] RAW: Raising signal 6 with default behavior
Aborted (core dumped)

Importing and printing manually

import torch_xla
print(torch_xla.__version__)

2.1.0+libtpu

import torch
print(torch.__version__)

2.1.0+cu121

import transformers
print(transformers.__version__)

4.36.2

Who can help?

@muellerzr @stevhliu

I have been trying to port my code to TPU, but I cannot manage to import the libraries.

In my code (written in PyTorch) I use the transformers library to load some pretrained LLMs, and I subclassed the Trainer class to train some custom models with RL.
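
For context, the kind of subclass involved looks roughly like the sketch below (a hypothetical minimal example; MyRLTrainer and the loss shown are placeholders, not the actual RL training code):

from transformers import Trainer

# Hypothetical minimal Trainer subclass with a custom loss, only to
# illustrate the setup; the real RL objective is more involved.
class MyRLTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        loss = outputs.loss  # stand-in for a custom RL loss
        return (loss, outputs) if return_outputs else loss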

The code works perfectly fine on GPU, but I can't manage to make it work on TPU: it keeps crashing without returning any error. The documentation on how to use TPUs with the transformers library on a PyTorch backend is still missing (two years after the page was created: https://huggingface.co/docs/transformers/v4.21.3/en/perf_train_tpu), so I have no idea whether I skipped a necessary step.

While the transformers library itself imports without problems, the whole session crashes when I try to import the Trainer class.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch_xla
print(torch_xla.__version__)

import torch
print(torch.__version__)

import transformers
print(transformers.__version__)

from transformers import Trainer

output:
->2.1.0+libtpu
->2.1.0+cu121
->4.36.2
->(crash session without outputs)

Expected behavior

It should either import the library or throw an error, not crash the whole session without a hint.

@naseemx

naseemx commented Jan 19, 2024

I would like to work on this

@ILG2021

ILG2021 commented Jan 24, 2024

I have the same problem.

@phineas-pta

Having the same issue on Kaggle.

@amyeroberts
Collaborator

Gentle ping @muellerzr

huggingface deleted a comment from github-actions bot Mar 11, 2024
@muellerzr
Contributor

The torch_xla team is aware of this and working towards fixing it

huggingface deleted a comment from github-actions bot Apr 5, 2024
@ArthurZucker
Collaborator

@muellerzr is there a PR or Issue we can track and link here?

@sitatec

sitatec commented Apr 5, 2024

Having the same issue on Kaggle, any update?

@sitatec

sitatec commented Apr 5, 2024

@muellerzr In case it helps: when I import Trainer or SFTTrainer in the VM, no error is printed, but when I launch the script that contains the import on the TPU with accelerate launch or notebook_launcher, I get this error message:
ERROR: Unknown command line flag 'xla_latency_hiding_scheduler_rerun'
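
Roughly, the launch looks like this (a minimal sketch; train_fn is a hypothetical placeholder for the real training function):

from accelerate import notebook_launcher

def train_fn():
    from transformers import Trainer  # the error above appears once this runs on the TPU
    # ... build the model, dataset, and Trainer, then call trainer.train()

notebook_launcher(train_fn, num_processes=8)  # 8 cores on a TPU v3-8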

I was facing a similar issue (with a different error message) on GPU as well, but installing the latest versions of the Hugging Face libraries I was using fixed it:

!pip install \
git+https://github.com/huggingface/transformers.git \
git+https://github.com/huggingface/datasets.git \
git+https://github.com/huggingface/trl.git \
git+https://github.com/huggingface/peft.git \
git+https://github.com/huggingface/accelerate.git

But this doesn't fix it on TPU.

@JackCaoG

JackCaoG commented Apr 9, 2024

xla_latency_hiding_scheduler_rerun is an XLA flag whose default value we set in https://github.com/pytorch/xla/blob/66ed39ba5fa6fb487790df03a9a68a6f62f2c957/torch_xla/__init__.py#L46

Do you mind doing a quick sanity check following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#check-pytorchxla-version? I believe we have a special wheel built for Kaggle that bundles libtpu with pytorch/xla, so you shouldn't need to manually install libtpu.
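
For reference, a minimal sanity check along the lines of that guide could look like this (a sketch; the troubleshooting doc is authoritative):

import torch
import torch_xla
import torch_xla.core.xla_model as xm

print(torch.__version__)      # e.g. 2.1.0+cu121
print(torch_xla.__version__)  # e.g. 2.1.0+libtpu

dev = xm.xla_device()             # first available XLA (TPU) device
t = torch.randn(2, 2, device=dev)
print(t.device)                   # e.g. xla:0
print(t + t)                      # forces execution on the device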


github-actions bot commented May 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@elemosel

Having the same issue here

@JackCaoG

We found out that the issue is that tensorflow (the TPU version; tensorflow-cpu is fine) will always try to load libtpu first upon import. To overcome this issue you can pip uninstall tensorflow. Starting from the 2.4 release we will throw a warning message if tf is installed on the same host.
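
In a Kaggle notebook the workaround would look something like this (a sketch; restart the kernel after uninstalling):

!pip uninstall -y tensorflow  # remove the TPU build so it can't grab libtpu before torch_xla

from transformers import Trainer  # after a kernel restart, this import should no longer crash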

@amyeroberts
Collaborator

Thanks for sharing @JackCaoG! Cc @Rocketknight1 for reference
