
Code crashes without errors when importing Trainer in TPU context #28609

Closed
1 of 4 tasks
samuele-bortolato opened this issue Jan 19, 2024 · 13 comments

@samuele-bortolato

System Info

I'm working on Kaggle with TPU enabled (TPU VM v3-8); running !transformers-cli env returns:

[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:642] File already exists in database: tsl/profiler/protobuf/trace_events.proto
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1986] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a80dd030fcf,5ab82e3a7b8f&map=
*** SIGABRT received by PID 367 (TID 367) on cpu 95 from PID 367; stack trace: ***
PC: @ 0x7a80dd07fd3c (unknown) (unknown)
@ 0x7a7f654bba19 928 (unknown)
@ 0x7a80dd030fd0 (unknown) (unknown)
@ 0x5ab82e3a7b90 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7a80dd07fd3c,7a7f654bba18,7a80dd030fcf,5ab82e3a7b8f&map=310b7ae7682f84c5c576a0b0030121f2:7a7f56a00000-7a7f656d11c0
E0119 15:49:22.169993 367 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 15:49:22.170011 367 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 15:49:22.170016 367 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 15:49:22.170041 367 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 15:49:22.170050 367 coredump_hook.cc:603] RAW: Dumping core locally.
E0119 15:50:17.482782 367 process_state.cc:808] RAW: Raising signal 6 with default behavior
Aborted (core dumped)

Importing and printing manually

import torch_xla
print(torch_xla.__version__)

2.1.0+libtpu

import torch
print(torch.__version__)

2.1.0+cu121

import transformers
print(transformers.__version__)

4.36.2

Who can help?

@muellerzr @stevhliu

I have been trying to port my code to TPU, but I cannot manage to import the libraries.

In my code (written in PyTorch) I use the transformers library to load some pretrained LLMs, and I subclassed the Trainer class to train some custom models with RL.
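
For context, the kind of subclass involved looks roughly like the sketch below (a hypothetical minimal example; MyRLTrainer and the loss shown are placeholders, not the actual RL training code):

from transformers import Trainer

# Hypothetical minimal Trainer subclass with a custom loss, only to
# illustrate the setup; the real RL objective is more involved.
class MyRLTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        loss = outputs.loss  # stand-in for a custom RL loss
        return (loss, outputs) if return_outputs else loss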

The code works perfectly fine on GPU, but I can't manage to make it work on TPU: it keeps crashing without returning any error. The documentation on how to use TPUs with the transformers library on a PyTorch backend is still missing (two years after the page was created: https://huggingface.co/docs/transformers/v4.21.3/en/perf_train_tpu), so I have no idea whether I skipped a necessary step.

While the transformers library itself imports without problems, the whole session crashes when I try to import the Trainer class.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch_xla
print(torch_xla.__version__)

import torch
print(torch.__version__)

import transformers
print(transformers.__version__)

from transformers import Trainer

output:
->2.1.0+libtpu
->2.1.0+cu121
->4.36.2
->(crash session without outputs)

Expected behavior

It should either import the library or throw an error, not crash the whole session without a hint.

@naseemx

naseemx commented Jan 19, 2024

I would like to work on this

@ILG2021

ILG2021 commented Jan 24, 2024

I have the same problem.

@phineas-pta

Having the same issue on Kaggle.

@amyeroberts
Collaborator

Gentle ping @muellerzr

huggingface deleted a comment from github-actions bot Mar 11, 2024
@muellerzr
Contributor

The torch_xla team is aware of this and working towards fixing it

huggingface deleted a comment from github-actions bot Apr 5, 2024
@ArthurZucker
Collaborator

@muellerzr is there a PR or Issue we can track and link here?

@sitatec

sitatec commented Apr 5, 2024

Having the same issue on Kaggle, any update?

@sitatec

sitatec commented Apr 5, 2024

@muellerzr In case it helps: when I import Trainer or SFTTrainer in the VM, no error is printed, but when I launch the script that contains the import on the TPU with accelerate launch or notebook_launcher, I get this error message:
ERROR: Unknown command line flag 'xla_latency_hiding_scheduler_rerun'
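
Roughly, the launch looks like this (a minimal sketch; train_fn is a hypothetical placeholder for the real training function):

from accelerate import notebook_launcher

def train_fn():
    from transformers import Trainer  # the error above appears once this runs on the TPU
    # ... build the model, dataset, and Trainer, then call trainer.train()

notebook_launcher(train_fn, num_processes=8)  # 8 cores on a TPU v3-8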

I was facing a similar issue (with a different error message) on GPU as well, but installing the latest versions of the Hugging Face libraries I was using fixed it:

!pip install \
git+https://github.com/huggingface/transformers.git \
git+https://github.com/huggingface/datasets.git \
git+https://github.com/huggingface/trl.git \
git+https://github.com/huggingface/peft.git \
git+https://github.com/huggingface/accelerate.git

But this doesn't fix it on TPU.

@JackCaoG

JackCaoG commented Apr 9, 2024

xla_latency_hiding_scheduler_rerun is an XLA flag whose default value we set in https://github.com/pytorch/xla/blob/66ed39ba5fa6fb487790df03a9a68a6f62f2c957/torch_xla/__init__.py#L46

Do you mind doing a quick sanity check following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#check-pytorchxla-version? I believe we have a special wheel built for Kaggle that bundles libtpu with pytorch/xla, so you shouldn't need to manually install libtpu.
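
For reference, a minimal sanity check along the lines of that guide could look like this (a sketch; the troubleshooting doc is authoritative):

import torch
import torch_xla
import torch_xla.core.xla_model as xm

print(torch.__version__)      # e.g. 2.1.0+cu121
print(torch_xla.__version__)  # e.g. 2.1.0+libtpu

dev = xm.xla_device()             # first available XLA (TPU) device
t = torch.randn(2, 2, device=dev)
print(t.device)                   # e.g. xla:0
print(t + t)                      # forces execution on the device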


github-actions bot commented May 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@elemosel

Having the same issue here

@JackCaoG

We found out that the issue is that tensorflow (the TPU version; tensorflow-cpu is fine) will always try to load libtpu first upon import. To overcome this issue you can pip uninstall tensorflow. Starting from the 2.4 release we will throw a warning message if tf is installed on the same host.
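
In a Kaggle notebook the workaround would look something like this (a sketch; restart the kernel after uninstalling):

!pip uninstall -y tensorflow  # remove the TPU build so it can't grab libtpu before torch_xla

from transformers import Trainer  # after a kernel restart, this import should no longer crash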

@amyeroberts
Collaborator

Thanks for sharing @JackCaoG! Cc @Rocketknight1 for reference
