Seg Fault after saving checkpoints #366
Comments
I suspect this is unrelated, but am not sure what could cause it. What commit are you synced to? I'd like to see what line 693 is in checkpoints.py.
Oh, actually maybe the segfault is coming from here: `File "/home/dptam/.local/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 664 in _sda_value`. @yashk2810 any thoughts?
I don't think it's related to pxla.py (I may be wrong though). Maybe it's coming from TensorStore?
Is there a way to get a small repro?
This is commit e926532
I am experiencing a very similar error when running finetuning on a 16 pod, so I am attaching to this issue instead of opening a new one. Just like @dptam describes, the crash happens right after saving a random checkpoint. I have not installed any prompt-tuning dependencies, but I did just update t5x. Here is part of my error log from the worker that crashes:
```
I0319 18:08:55.361657 140189084843072 train.py:504] Saving checkpoint.
I0319 18:08:55.453575 140189084843072 checkpoints.py:594] Saving checkpoint for step 1020000 to gs://nb-t5x-us-central2/model_mT5X_large_16_d/checkpoint_1020000.tmp-1647713335
Fatal Python error: Segmentation fault

Thread 0x00007f69d9b42700 (most recent call first):
  File "/home/perk/.local/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 665 in _sda_value
  File "/home/perk/.local/lib/python3.8/site-packages/jax/_src/device_array.py", line 270 in __array__
  File "/home/perk/t5x/t5x/checkpoints.py", line 448 in <lambda>
  File "/home/perk/t5x/t5x/checkpoint_importer.py", line 84 in get
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57 in run
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 80 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f69da343700 (most recent call first):
  File "/home/perk/.local/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 665 in _sda_value
  File "/home/perk/.local/lib/python3.8/site-packages/jax/_src/device_array.py", line 270 in __array__
  File "/home/perk/t5x/t5x/checkpoints.py", line 448 in <lambda>
  File "/home/perk/t5x/t5x/checkpoint_importer.py", line 84 in get
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57 in run
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 80 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

<snip>
  File "/home/perk/t5x/t5x/gin_utils.py", line 105 in run
  File "../../t5x/t5x/train.py", line 632 in <module>
https://symbolize.stripped_domain/r/?trace=7f8050dfa03b,7f8050dfa0bf&map=
*** SIGSEGV (@0x7d60001ca7a), see gl__________41#s15 received by PID 117370 (TID 145145) on cpu 0; stack trace: ***
PC: @ 0x7f8050dfa03b (unknown) raise
    @ 0x7f80475ea03a 992 (unknown)
    @ 0x7f8050dfa0c0 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f8050dfa03b,7f80475ea039,7f8050dfa0bf&map=52a3e5736151c7837be510ff417da4d4:7f80374d5000-7f80479627d0
E0319 18:08:58.031300 145145 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0319 18:08:58.031700 145145 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0319 18:08:58.031717 145145 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0319 18:08:58.031721 145145 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0319 18:08:58.032034 145145 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0319 18:08:58.032048 145145 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0319 18:08:58.032051 145145 coredump_hook.cc:550] RAW: Discarding core.
https://symbolize.stripped_domain/r/?trace=7f80476d5c56,7f8050dfa0bf,7f8047605bcc,7f80475f8685,7f80475f7ed8,7f804795ba4d,7f80474fa784,7f80474f9855,7f80474f9990,7f80474fac6a,7f80474f9b3c,7f80474fa5be,7f80475ea7ea,7f8050dfa0bf&map=52a3e5736151c7837be510ff417da4d4:7f80374d5000-7f80479627d0
E0319 18:08:58.041457 145145 process_state.cc:1058] RAW: Signal 11 raised at PC: 0x7f80476d5c56 while already in FailureSignalHandler!
E0319 18:08:58.041479 145145 process_state.cc:1092] RAW: Raising 11 signal with default behavior
Segmentation fault (core dumped)
```
@yashk2810 I am not sure if you want the repro, but it is from prompt-tuning, running the prompt-tuning/prompt_tuning/scripts/sst2-demo-xxl.sh script on a v3-32.
@adarob
A wild guess is that this is related to asyncio. I noticed a TODO comment in the code near where the error occurs, saying "# TODO(adarob): Use asyncio.run in py3.7+.". I am using py3.8. Could this be related, @adarob? A small illustration of what that TODO presumably refers to is sketched below.
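For context, here is a minimal sketch of the pattern that TODO presumably points at: driving coroutines through a manually obtained event loop versus the newer `asyncio.run`. The function names below are hypothetical placeholders, not the actual t5x checkpointing code.

```python
import asyncio

async def write_chunk(name: str) -> str:
    # Placeholder for one async checkpoint-write step (hypothetical).
    await asyncio.sleep(0.01)
    return name

async def write_all() -> list:
    # Run several async writes concurrently, loosely analogous to writing
    # many checkpoint chunks at once.
    return await asyncio.gather(*(write_chunk(f"chunk{i}") for i in range(3)))

# Pre-3.7 style: fetch an event loop and drive it manually.
loop = asyncio.get_event_loop()
results_old = loop.run_until_complete(write_all())

# Python 3.7+ style the TODO refers to: asyncio.run creates a fresh loop,
# runs the coroutine to completion, and closes the loop afterwards.
results_new = asyncio.run(write_all())

print(results_old, results_new)
```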
I had the same issue but solved it by downgrading tensorstore from 0.1.18 (the default version) to 0.1.14, as mentioned here.
This may well be the same issue as google/tensorstore#30. If we are able to reproduce this issue somewhat reliably, that would greatly help in resolving it. @misawann @peregilk @dptam If this occurs often, can you clarify what makes it most likely to trigger, or ideally describe how to reproduce it?
Thanks @misawann!! I did a simple `pip install tensorstore==0.1.14`, and all three of my TPUs/pods have been running without any issues since!! I really think this has fixed it!
@jbms I have been seeing this both on single TPUs (v3 and v4) and on v4 pods. I have the most experience with the v4s, where I have a fairly standard setup using the v2-alpha-tpuv4 and v2-alpha-tpuv4-pod images, with T5X installed from the repo. When pretraining, crashes usually happen within 24 hours, often sooner. Note that the TPU image runs Tensorstore 0.1.17, while the pod image runs Tensorstore 0.1.18. So far, I have not had crashes on either TPUs or pods after downgrading to 0.1.14. The crashes happen at irregular intervals and seem to occur when a checkpoint is saved. A wild guess is that roughly every tenth checkpoint crashes, but it really varies. I have sometimes needed to restart/recover from the same checkpoint multiple times, and it has also trained without any issues for more than a day. From what I have seen, saving often triggers it sooner, so it should be easy to provoke the error. If you have trouble reproducing it, contact me directly and I will share my install/training scripts.
Thanks! I was able to get an example from another user that reproduces the problem, so I will look into it.
This is now fixed in tensorstore 0.1.20.
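For anyone hitting this later, a quick way to check which tensorstore version is installed before deciding whether to upgrade; this is just a sketch based on the comments above (fix reported in 0.1.20, workaround was pinning 0.1.14), not part of the t5x or tensorstore documentation.

```python
# Print the installed tensorstore version; the segfault discussed in this
# issue was reported fixed in tensorstore 0.1.20 (see the comment above).
from importlib.metadata import version

ts_version = version("tensorstore")
print("tensorstore", ts_version)

# If this prints something older than 0.1.20, earlier comments report that
# upgrading (or pinning 0.1.14 as a temporary workaround) avoided the crash:
#   pip install -U "tensorstore>=0.1.20"
```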
Hi,
I am getting a seg fault sometimes after the model has saved a checkpoint. It does not happen at every checkpoint, and which checkpoints it crashes after seems random. I am not sure if it is related to issue #340.
For example, I am running prompt_tuning/scripts/sst2-demo-xxl.sh, and the output is below. Thanks