
Seg Fault after saving checkpoints #366

Closed
dptam opened this issue Mar 17, 2022 · 13 comments · Fixed by #475

@dptam

dptam commented Mar 17, 2022

Hi,

I am sometimes getting a seg fault after the model has saved a checkpoint. It does not happen after every checkpoint, and which checkpoints it crashes after seems random. I am not sure whether it is related to issue #340.

For example, I am running prompt_tuning/scripts/sst2-demo-xxl.sh, and the output is below.

I0317 18:14:56.525280 140415323761728 utils.py:138] Saved Numpy Arrays for step 1104000 to gs://nicl/checkpoint_models/sst/full_dataset/prompt-tuning/t5-11b/numpy_checkpoints/checkpoint_1104000
I0317 18:14:56.604028 140415323761728 checkpoints.py:600] Saving checkpoint for step 1104000 to gs://nicl/checkpoint_models/sst/full_dataset/prompt-tuning/t5-11b/checkpoint_1104000.tmp-1647540896
I0317 18:14:56.614308 140622481194048 checkpoints.py:600] Saving checkpoint for step 1104000 to gs://nicl/checkpoint_models/sst/full_dataset/prompt-tuning/t5-11b/checkpoint_1104000.tmp-1647540896
I0317 18:14:56.624289 140590966570048 checkpoints.py:600] Saving checkpoint for step 1104000 to gs://nicl/checkpoint_models/sst/full_dataset/prompt-tuning/t5-11b/checkpoint_1104000.tmp-1647540896
I0317 18:14:56.653718 140272509271104 checkpoints.py:600] Saving checkpoint for step 1104000 to gs://nicl/checkpoint_models/sst/full_dataset/prompt-tuning/t5-11b/checkpoint_1104000.tmp-1647540896
Fatal Python error: Segmentation fault


Thread 0x00007fdb1dc01700 (most recent call first):
  File "/home/dptam/.local/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 664 in _sda_value
  File "/home/dptam/.local/lib/python3.8/site-packages/jax/_src/device_array.py", line 266 in __array__
  File "/home/dptam/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 447 in <lambda>
  File "/home/dptam/.local/lib/python3.8/site-packages/t5x/checkpoint_importer.py", line 84 in get
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57 in run
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 80 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f56809df700 (most recent call first):
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f56c7aad700 (most recent call first):
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x00007fdde29efc40 (most recent call first):
  File "/home/dptam/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 693 in _write_array
https://symbolize.stripped_domain/r/?trace=7fdde2e4203b,7fdde2e420bf,e,5ef27540f,e,26f7c5aff,f,b15f59df&map= 
E0317 18:14:57.770066  341059 process_state.cc:1062] RAW: Signal 11 raised at PC: 0x7fdde2e4203b while already in FailureSignalHandler!
E0317 18:14:57.770096  341059 process_state.cc:1065] RAW: tid: 341059 raised new signal
    @                0xf       1440  (unknown)
    @        0x25ed159b0  (unknown)  (unknown)
    @               0x10   76231216  (unknown)
    @        0x261cdc840  (unknown)  (unknown)
    @        0x2dfdd4780  (unknown)  (unknown)
    @        0x5f1f8a120  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7fdde301ffd3,7fddd98d57f9,7fdde2e420bf,7,e,25ed159af,f,261cdc83f,2dfdd477f,5f1f8a11f&map=7a511a57244151c993b16b37978e7ed7:7fddcaefd000-7fddd9c3fd50 
E0317 18:14:57.818885  341068 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0317 18:14:57.818900  341068 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0317 18:14:57.818919  341068 client.cc:221] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0317 18:14:57.818922  341068 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0317 18:14:57.818928  341068 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0317 18:14:57.818933  341068 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0317 18:14:57.818938  341068 coredump_hook.cc:550] RAW: Discarding core.
prompt_tuning/scripts/sst2-demo-xxl.sh: line 37: 337643 Segmentation fault      (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/prompts/from_class_labels.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.CLASS_LABELS="['positive', 'negative']" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.MIXTURE_OR_TASK_NAME="'taskless_glue_sst2_v200_examples'" --gin.MIXTURE_OR_TASK_MODULE="'prompt_tuning.data.glue'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_212_000" --gin.USE_CACHED_TASKS="False" --gin.BATCH_SIZE="16" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --tfds_data_dir=${TFDS_DATA_DIR}
##### Command execution on worker 3 failed with return code 139. Continuing.
prompt_tuning/scripts/sst2-demo-xxl.sh: line 37: 334750 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/prompts/from_class_labels.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.CLASS_LABELS="['positive', 'negative']" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.MIXTURE_OR_TASK_NAME="'taskless_glue_sst2_v200_examples'" --gin.MIXTURE_OR_TASK_MODULE="'prompt_tuning.data.glue'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_212_000" --gin.USE_CACHED_TASKS="False" --gin.BATCH_SIZE="16" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --tfds_data_dir=${TFDS_DATA_DIR}
##### Command execution on worker 1 failed with return code 134. Continuing.
prompt_tuning/scripts/sst2-demo-xxl.sh: line 37: 335504 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/prompts/from_class_labels.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.CLASS_LABELS="['positive', 'negative']" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.MIXTURE_OR_TASK_NAME="'taskless_glue_sst2_v200_examples'" --gin.MIXTURE_OR_TASK_MODULE="'prompt_tuning.data.glue'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_212_000" --gin.USE_CACHED_TASKS="False" --gin.BATCH_SIZE="16" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --tfds_data_dir=${TFDS_DATA_DIR}
##### Command execution on worker 0 failed with return code 134. Continuing.

Thanks

@adarob
Collaborator

adarob commented Mar 17, 2022

I suspect this is unrelated, but I'm not sure what could cause it. What commit are you synced to? I'd like to see what line 693 is in checkpoints.py.
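(If it helps, here is one quick way to check which file and line the trace refers to in an installed copy; a minimal sketch, assuming t5x is importable in the environment that crashed:)

```python
# Minimal sketch: print the installed t5x/checkpoints.py path and its line 693.
# Assumes t5x is importable from the same environment that produced the crash.
import inspect
import t5x.checkpoints

src = inspect.getsourcefile(t5x.checkpoints)
with open(src) as f:
    lines = f.readlines()
print(src)
print(lines[692].rstrip())  # line 693 (0-indexed list)
```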

@adarob
Collaborator

adarob commented Mar 17, 2022

Oh, actually maybe the segfault is coming from here: File "/home/dptam/.local/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 664 in _sda_value

@yashk2810 any thoughts?
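For context, the traceback shape above (a checkpoints.py lambda calling __array__, which lands in pxla's _sda_value inside a thread-pool worker) corresponds to a device-to-host transfer running on a background thread while the checkpoint is written. A minimal, illustrative sketch of that pattern, not t5x's actual code:

```python
# Illustrative only: a sharded device array being pulled to host memory from a
# thread-pool worker, the same call shape as in the traceback above.
from concurrent.futures import ThreadPoolExecutor

import jax
import jax.numpy as jnp
import numpy as np

# One scalar per local device, produced by pmap so the result is sharded.
x = jax.pmap(lambda a: a * 2)(jnp.arange(jax.local_device_count(), dtype=jnp.float32))

with ThreadPoolExecutor() as pool:
    # np.asarray() calls __array__, which gathers the shards off the devices
    # (the _sda_value frame in the trace).
    host_copy = pool.submit(lambda: np.asarray(x)).result()

print(host_copy)
```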

@yashk2810
Contributor

yashk2810 commented Mar 17, 2022

I don't think it's related to pxla.py (I may be wrong though). Maybe it's coming from TensorStore?

@yashk2810
Contributor

Is there a way to get a small repro?

@dptam
Author

dptam commented Mar 17, 2022

This is commit e926532.
One thing to note: when I first installed the dependencies for prompt-tuning (which include t5x) and ran it, I was getting some TPU embedding error (I don't remember exactly what). So I cloned a copy of t5x under my home directory and installed the dependencies for t5x directly, and the error was gone when running it. Both setups were using the same commits.

@peregilk

peregilk commented Mar 19, 2022

I am experiencing a very similar error running finetuning on a 16 pod, so I am attaching to this issue instead of opening a new one. Just like @dptam describes, the crashes happen right after saving a random checkpoint. I have not installed any prompt tuning dependencies, but I did just update t5x.

Here is part of my error log on the worker that crashes:

I0319 18:08:55.361657 140189084843072 train.py:504] Saving checkpoint.
I0319 18:08:55.453575 140189084843072 checkpoints.py:594] Saving checkpoint for step 1020000 to gs://nb-t5x-us-central2/model_mT5X_large_16_d/checkpoint_1020000.tmp-1647713335
Fatal Python error: Segmentation fault

Thread 0x00007f69d9b42700 (most recent call first):
  File "/home/perk/.local/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 665 in _sd
a_value
  File "/home/perk/.local/lib/python3.8/site-packages/jax/_src/device_array.py", line 270 in __a
rray__
  File "/home/perk/t5x/t5x/checkpoints.py", line 448 in <lambda>
  File "/home/perk/t5x/t5x/checkpoint_importer.py", line 84 in get
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57 in run
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 80 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f69da343700 (most recent call first):
  File "/home/perk/.local/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 665 in _sd
a_value
  File "/home/perk/.local/lib/python3.8/site-packages/jax/_src/device_array.py", line 270 in __a
rray__
  File "/home/perk/t5x/t5x/checkpoints.py", line 448 in <lambda>
  File "/home/perk/t5x/t5x/checkpoint_importer.py", line 84 in get
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57 in run
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 80 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap
<snip>
  File "/home/perk/t5x/t5x/gin_utils.py", line 105 in run
  File "../../t5x/t5x/train.py", line 632 in <module>
https://symbolize.stripped_domain/r/?trace=7f8050dfa03b,7f8050dfa0bf&map=
*** SIGSEGV (@0x7d60001ca7a), see gl__________41#s15 received by PID 117370 (TID 145145) on cpu 0; stack trace: ***
PC: @     0x7f8050dfa03b  (unknown)  raise
    @     0x7f80475ea03a        992  (unknown)
    @     0x7f8050dfa0c0  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f8050dfa03b,7f80475ea039,7f8050dfa0bf&map=52a3e5736151c7837be510ff417da4d4:7f80374d5000-7f80479627d0
E0319 18:08:58.031300  145145 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0319 18:08:58.031700  145145 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0319 18:08:58.031717  145145 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0319 18:08:58.031721  145145 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0319 18:08:58.032034  145145 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0319 18:08:58.032048  145145 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0319 18:08:58.032051  145145 coredump_hook.cc:550] RAW: Discarding core.
https://symbolize.stripped_domain/r/?trace=7f80476d5c56,7f8050dfa0bf,7f8047605bcc,7f80475f8685,7f80475f7ed8,7f804795ba4d,7f80474fa784,7f80474f9855,7f80474f9990,7f80474fac6a,7f80474f9b3c,7f80474fa5be,7f80475ea7ea,7f8050dfa0bf&map=52a3e5736151c7837be510ff417da4d4:7f80374d5000-7f80479627d0
E0319 18:08:58.041457  145145 process_state.cc:1058] RAW: Signal 11 raised at PC: 0x7f80476d5c56 while already in FailureSignalHandler!
E0319 18:08:58.041479  145145 process_state.cc:1092] RAW: Raising 11 signal with default behavior
Segmentation fault (core dumped)

@dptam
Author

dptam commented Mar 22, 2022

@yashk2810 I am not sure if you want the repro, but it is from prompt-tuning, running the prompt-tuning/prompt_tuning/scripts/sst2-demo-xxl.sh script on a v3-32.

@peregilk

@adarob
I am seeing this error quite a lot. I also noticed this part in the stacktrace:

Thread 0x00007f4a25687c40 (most recent call first):
  File "/usr/lib/python3.8/selectors.py", line 468 in select
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1823 in _run_once
  File "/usr/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603 in run_until_complete
  File "/home/perk/t5x/t5x/checkpoints.py", line 161 in _run_future_tree
  File "/home/perk/t5x/t5x/checkpoints.py", line 759 in _write_state_to_tensorstore
  File "/home/perk/t5x/t5x/checkpoints.py", line 604 in save
  File "../../t5x/t5x/train.py", line 563 in train
  File "/home/perk/.local/lib/python3.8/site-packages/gin/config.py", line 1582 in gin_wrapper
  File "../../t5x/t5x/train.py", line 657 in _main
  File "../../t5x/t5x/train.py", line 637 in main
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258 in _run_main
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312 in run
  File "/home/perk/t5x/t5x/gin_utils.py", line 105 in run
  File "../../t5x/t5x/train.py", line 659 in <module>
https://symbolize.stripped_domain/r/?trace=7f4a25ada03b,7f4a25ada0bf,11&map= 
*** SIGSEGV (@0x7d10036789c), see gl__________41#s15 received by PID 3569820 (TID 3572240) on cpu 97; stack trace: ***
PC: @     0x7f4a25ada03b  (unknown)  raise
    @     0x7f4a1c2ca03a        992  (unknown)
    @     0x7f4a25ada0c0  1922911024  (unknown)
    @               0x12  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f4a25ada03b,7f4a1c2ca039,7f4a25ada0bf,11&map=52a3e5736151c7837be510ff417da4d4:7f4a0c1b5000-7f4a1c6427d0 
E0421 07:21:57.150952 3572240 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0421 07:21:57.150974 3572240 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0421 07:21:57.151134 3572240 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0421 07:21:57.151148 3572240 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0421 07:21:57.151158 3572240 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0421 07:21:57.151163 3572240 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0421 07:21:57.151179 3572240 coredump_hook.cc:550] RAW: Discarding core.

A wild guess is that this is related to asyncio. I noticed a TODO comment in the code near where the error occurs saying "# TODO(adarob): Use asyncio.run in py3.7+.".

I am using py3.8. Could this be related, @adarob?
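(For reference, a minimal self-contained sketch of the two event-loop patterns that TODO contrasts; the coroutines below are placeholders, not t5x's actual write futures:)

```python
# Placeholder coroutines standing in for the async TensorStore writes that
# _run_future_tree gathers; only the event-loop plumbing is the point here.
import asyncio

async def fake_write(name):
    await asyncio.sleep(0)
    return name

async def write_all():
    return await asyncio.gather(*(fake_write(n) for n in ("param/a", "param/b")))

# Pre-3.7 style (the pattern the TODO refers to replacing):
loop = asyncio.new_event_loop()
try:
    results = loop.run_until_complete(write_all())
finally:
    loop.close()

# Python 3.7+ style suggested by the TODO:
results = asyncio.run(write_all())
print(results)
```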

@misawann

misawann commented Apr 21, 2022

I had the same issue but solved it by downgrading tensorstore from 0.1.18 (the default version) to 0.1.14, as mentioned here.
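(To confirm which tensorstore version a worker actually ends up with after pinning, a minimal check using only the standard library:)

```python
# Minimal sketch: report the installed tensorstore version on this worker.
# importlib.metadata ships with Python 3.8+, matching the setup in this thread.
from importlib.metadata import version

print(version("tensorstore"))  # expect "0.1.14" after the downgrade
```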

@jbms

jbms commented Apr 22, 2022

This may well be the same issue as google/tensorstore#30.

If we are able to reproduce this issue somewhat reliably that would greatly help in resolving it.

@misawann @peregilk @dptam If this occurs often, can you clarify what makes it most likely to trigger, or ideally describe how to reproduce it?

@peregilk

Thanks @misawann!! I did a simple "pip install tensorstore==0.1.14", and all three of my TPUs/pods have been running without any issues since!! I really think this has fixed it!

@jbms I have been seeing this both on single TPUs (v3 and v4) and on v4 pods. I have the most experience running on the v4s. There I have a fairly standard setup using the v2-alpha-tpuv4 and v2-alpha-tpuv4-pod images, with T5X installed from the repo. When pretraining, crashes usually happen within 24 hours, often sooner.

Note that the TPU image is running Tensorstore 0.1.17, while the pod image is running Tensorstore 0.1.18. So far, I have not had crashes on either TPUs or pods after downgrading to 0.1.14.

The crashes happen at irregular intervals and seem to occur when a checkpoint is saved. A rough guess is that a crash happens on roughly every tenth checkpoint, but it really varies. At times I have needed to restart/recover from the same checkpoint multiple times; at other times it has trained without any issues for more than a day.

From what I have seen, saving often will trigger this sooner, so it should be easy to provoke the error. If you have problems reproducing it, contact me directly and I will share my install/training scripts.

@jbms

jbms commented Apr 22, 2022

Thanks --- I was able to get an example from another user that reproduces the problem, so I will look into it.

@jbms

jbms commented May 4, 2022

This is now fixed in tensorstore 0.1.20.

copybara-service bot pushed a commit that referenced this issue May 4, 2022