Fatal Python error: Segmentation fault, when training t5x-XXL on a TPU Pod v3-32 #340

Closed
LeoLaugier opened this issue Mar 10, 2022 · 11 comments

LeoLaugier commented Mar 10, 2022

Hi,

I was previously able to train and run inference with prompt tuning on t5x-XXL on a TPU Pod v3-32 for my custom task defined from a TSV file, but I am now seeing an error that I can't understand.

I followed the instructions from prompt tuning for training and inference with a Prompt on a Pod Slice, except that the latest libtpu release gives the error TPUEmbeddingEngineState_Create not available in this library., so I installed the release from February 15, 2022.
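
Concretely, the pinned install I used looks something like this (the exact libtpu-nightly version string is my assumption based on how the jax-releases index names its nightly builds, so it may need adjusting):

# Assumed pin to the Feb 15, 2022 libtpu nightly instead of the latest release;
# the package name/version follows the naming convention of the jax-releases index.
python3 -m pip install "libtpu-nightly==0.1.dev20220215" \
  -f https://storage.googleapis.com/jax-releases/libtpu_releases.html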

I run the following script:


MODEL_DIR=${1:-${MODEL_DIR}}
TFDS_DATA_DIR=${2:-${TFDS_DATA_DIR}}

if [ -z "${MODEL_DIR}" ] || [ -z "${TFDS_DATA_DIR}" ]; then
  echo "usage: ./rec_sys.sh gs://your-bucket/path/to/model_dir gs://your-bucket/path/to/tfds/cache"
  exit 1
fi

T5X_DIR="`python3 -m prompt_tuning.scripts.find_module t5x`/.."
FLAXFORMER_DIR="`python3 -m prompt_tuning.scripts.find_module flaxformer`/.."
PROMPT_DIR="`python3 -m prompt_tuning.scripts.find_module prompt_tuning`/.."
echo "Searching for gin configs in:"
echo "- ${T5X_DIR}"
echo "- ${FLAXFORMER_DIR}"
echo "- ${PROMPT_DIR}"
echo "============================="
PRETRAINED_MODEL="gs://t5-data/pretrained_models/t5x/t5_1_1_lm100k_xxl/checkpoint_1100000"

python3 -m t5x.train \
  --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" \
  --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" \
  --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" \
  --gin.MODEL_DIR="'${MODEL_DIR}'" \
  --gin.BATCH_SIZE="16" \
  --gin.MIXTURE_OR_TASK_NAME="'yelp'" \
  --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" \
  --gin.USE_CACHED_TASKS="False" \
  --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" \
  --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" \
  --gin.TRAIN_STEPS="1_100_010"

and get the following errors:
First:


tensorstore/internal/oauth2/google_auth_provider.cc:163: Credentials file not found. NOT_FOUND: $GOOGLE_APPLICATION_CREDENTIALS is not set or corrupt.
tensorstore/internal/oauth2/google_auth_provider.cc:168: Credentials file not found. NOT_FOUND: Could not find the credentials file in the standard gcloud location [/home/leojlaugier/.config/gcloud/application_default_credentials.json]
tensorstore/internal/oauth2/google_auth_provider.cc:203: Running on GCE, using GCE Auth Provider
Fatal Python error: Segmentation fault

Thread 0x00007f3304539c40 (most recent call first):
  File "/usr/lib/python3.8/selectors.py", line 468 in select
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1823 in _run_once
  File "/usr/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603 in run_until_complete
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 160 in _run_future_tree
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 913 in _read_state_from_tensorstore
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 860 in restore
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 455 in _restore_path
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 466 in from_checkpoints
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 507 in from_checkpoint
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 522 in from_checkpoint_or_scratch
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 320 in train
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/gin/config.py", line 1582 in gin_wrapper
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 623 in _main
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 605 in main
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251 in _run_main
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303 in run
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/gin_utils.py", line 105 in run
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 625 in <module>
  File "/usr/lib/python3.8/runpy.py", line 87 in _run_code
  File "/usr/lib/python3.8/runpy.py", line 194 in _run_module_as_main

https://symbolize.stripped_domain/r/?trace=7f330498e18b,7f330498e20f,6&map=
*** SIGSEGV (@0x7d100002a60), see gl__________41#s15 received by PID 10848 (TID 12157) on cpu 19; stack trace: ***
PC: @     0x7f330498e18b  (unknown)  raise
    @     0x7f32fb6ea1fa        992  (unknown)
    @     0x7f330498e210  (unknown)  (unknown)
    @                0x7  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f330498e18b,7f32fb6ea1f9,7f330498e20f,6&map=55976a7e1de583f3a9544af1c86ac940:7f32ed01c000-7f32fba50d80
E0310 16:51:25.580514   12157 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0310 16:51:25.580525   12157 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0310 16:51:25.580535   12157 client.cc:221] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0310 16:51:25.580557   12157 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0310 16:51:25.580562   12157 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0310 16:51:25.580565   12157 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0310 16:51:25.580569   12157 coredump_hook.cc:550] RAW: Discarding core.

And later:

I0310 16:54:01.224571 140510105828416 train.py:456] Epoch 1100 of 1101
I0310 16:54:01.224764 140510105828416 train.py:462] BEGIN Train loop.
I0310 16:54:01.224818 140510105828416 train.py:467] Training for 10 steps.
I0310 16:54:01.226046 140497868457728 logging_writer.py:48] [1100000] collection=train timing/compilation_seconds=87.301567
I0310 16:54:01.230673 140510105828416 trainer.py:491] Training: step 1100000
I0310 16:54:01.635557 140510105828416 train.py:490] END Train loop.
./train_yelp_xxl.sh: line 34: 12024 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --gin.BATCH_SIZE="16" --gin.MIXTURE_OR_TASK_NAME="'yelp'" --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.USE_CACHED_TASKS="False" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_100_010"
##### Command execution on worker 1 failed with return code 134. Continuing.
./train_yelp_xxl.sh: line 34: 10848 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --gin.BATCH_SIZE="16" --gin.MIXTURE_OR_TASK_NAME="'yelp'" --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.USE_CACHED_TASKS="False" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_100_010"
##### Command execution on worker 0 failed with return code 134. Continuing.
./train_yelp_xxl.sh: line 34: 11607 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --gin.BATCH_SIZE="16" --gin.MIXTURE_OR_TASK_NAME="'yelp'" --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.USE_CACHED_TASKS="False" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_100_010"
##### Command execution on worker 3 failed with return code 134. Continuing.

Then the run freezes. I might be missing something obvious, but I don't think I've changed anything except the data since the last time I was able to train and run inference with prompt tuning. Moreover, I was able to train on the same training data; the problems only arose when I tried to run inference.
Could you please help me understand the issue?

Thanks in advance for your time.


LeoLaugier commented Mar 11, 2022

Actually, I have some doubts about the command provided to install the libraries on the TPU Pod. Could someone please check that it is correct?

$ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
  --zone ${ZONE} \
  --worker=all \
  --command="git clone --branch=main https://github.com/google-research/prompt-tuning && cd prompt-tuning && "
python3 -m pip install . -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

For instance, I don't understand why python3 -m pip install . -f https://storage.googleapis.com/jax-releases/libtpu_releases.html is not placed right after the &&, and why there is no need for [tpu] after pip install .
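
For comparison, this is the form I would have expected, with the install inside the remote --command (just my guess at the intended command, not a verified fix; whether .[tpu] is needed is exactly what I'm unsure about):

# My guess at the intended form: run the clone and the install in the same remote command on every worker.
gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
  --zone ${ZONE} \
  --worker=all \
  --command="git clone --branch=main https://github.com/google-research/prompt-tuning && cd prompt-tuning && python3 -m pip install . -f https://storage.googleapis.com/jax-releases/libtpu_releases.html"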

Thanks


iislucas commented Apr 1, 2022

@adarob signal boost: this is blocking me and my PhD student @LeoLaugier. Any ideas? :)


adarob commented Apr 1, 2022

To be clear, it looks like you get a segfault in one thread but then your training completes?

@LeoLaugier (Author)

I think there is a segfault in one thread, but the training freezes (in the other threads?), so it does not complete. (I haven't retried since the day I opened the issue.) It looks like issue #366.

@hwchung27 (Contributor)

We have a new API using XManager, which manages the workers, so I think it can help resolve the issue. Could you try the instructions here: https://github.com/google-research/t5x#quickstart-recommended?

@LeoLaugier (Author)

Awesome, thanks for the pointer. I'll try that ASAP and let you know.

I guess XManager is compatible with prompt tuning, right?


iislucas commented Apr 5, 2022

Correct. XManager is an API for managing experiments, and there's no compatibility issue with prompt-tuning.

@peregilk

I am struggling with the same issue here. Is there any way around this other than using XManager?


jbms commented Apr 27, 2022

This appears to be the same issue as google/tensorstore#30, for which we will have a fix pushed out in the next day or so.

jbms added a commit to google/tensorstore that referenced this issue Apr 27, 2022
Previously, none of the auth providers were actually thread safe,
leading to intermittent crashes.

Fixes #30
Fixes google-research/t5x#340

PiperOrigin-RevId: 444883311
Change-Id: I8a49384f783a717593dc1c31f932596d12fc9c4c

jbms commented Apr 27, 2022

This should now be fixed in tensorstore 0.1.20.
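
Something like the following should pick up the fix on all workers (this reuses the gcloud invocation pattern from earlier in this thread; adjust TPU_NAME and ZONE to your setup):

# Upgrade tensorstore to a version that includes the auth-provider thread-safety fix.
gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
  --zone ${ZONE} \
  --worker=all \
  --command="python3 -m pip install --upgrade 'tensorstore>=0.1.20'"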


adarob commented May 4, 2022

Thanks @jbms!

adarob closed this as completed May 4, 2022