Fatal Python error: Segmentation fault, when training t5x-XXL on a TPU Pod v3-32 #340

Closed
LeoLaugier opened this issue Mar 10, 2022 · 11 comments

LeoLaugier commented Mar 10, 2022

Hi,

I was previously able to train and run inference with prompt tuning on t5x-XXL on a TPU Pod v3-32 for my custom task defined from a TSV file, but I am now seeing an error that I can't understand.

I followed the instructions from prompt tuning for training and inference with a Prompt on a Pod Slice, except that the latest libtpu release gives the error TPUEmbeddingEngineState_Create not available in this library., so I installed the release from February 15, 2022.
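
Concretely, the pinned install I used looks something like this (the exact libtpu-nightly version string is my assumption based on how the jax-releases index names its nightly builds, so it may need adjusting):

# Assumed pin to the Feb 15, 2022 libtpu nightly instead of the latest release;
# the package name/version follows the naming convention of the jax-releases index.
python3 -m pip install "libtpu-nightly==0.1.dev20220215" \
  -f https://storage.googleapis.com/jax-releases/libtpu_releases.html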

I run the following script:


MODEL_DIR=${1:-${MODEL_DIR}}
TFDS_DATA_DIR=${2:-${TFDS_DATA_DIR}}

if [ -z "${MODEL_DIR}" ] || [ -z "${TFDS_DATA_DIR}" ]; then
  echo "usage: ./rec_sys.sh gs://your-bucket/path/to/model_dir gs://your-bucket/path/to/tfds/cache"
  exit 1
fi

T5X_DIR="`python3 -m prompt_tuning.scripts.find_module t5x`/.."
FLAXFORMER_DIR="`python3 -m prompt_tuning.scripts.find_module flaxformer`/.."
PROMPT_DIR="`python3 -m prompt_tuning.scripts.find_module prompt_tuning`/.."
echo "Searching for gin configs in:"
echo "- ${T5X_DIR}"
echo "- ${FLAXFORMER_DIR}"
echo "- ${PROMPT_DIR}"
echo "============================="
PRETRAINED_MODEL="gs://t5-data/pretrained_models/t5x/t5_1_1_lm100k_xxl/checkpoint_1100000"

python3 -m t5x.train \
  --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" \
  --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" \
  --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" \
  --gin.MODEL_DIR="'${MODEL_DIR}'" \
  --gin.BATCH_SIZE="16" \
  --gin.MIXTURE_OR_TASK_NAME="'yelp'" \
  --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" \
  --gin.USE_CACHED_TASKS="False" \
  --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" \
  --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" \
  --gin.TRAIN_STEPS="1_100_010"

and get the following errors:
First:


tensorstore/internal/oauth2/google_auth_provider.cc:163: Credentials file not found. NOT_FOUND: $GOOGLE_APPLICATION_CREDENTIALS is not set or corrupt.
tensorstore/internal/oauth2/google_auth_provider.cc:168: Credentials file not found. NOT_FOUND: Could not find the credentials file in the standard gcloud location [/home/leojlaugier/.config/gcloud/application_default_credentials.json]
tensorstore/internal/oauth2/google_auth_provider.cc:203: Running on GCE, using GCE Auth Provider
Fatal Python error: Segmentation fault

Thread 0x00007f3304539c40 (most recent call first):
  File "/usr/lib/python3.8/selectors.py", line 468 in select
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1823 in _run_once
  File "/usr/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603 in run_until_complete
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 160 in _run_future_tree
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 913 in _read_state_from_tensorstore
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/checkpoints.py", line 860 in restore
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 455 in _restore_path
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 466 in from_checkpoints
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 507 in from_checkpoint
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/utils.py", line 522 in from_checkpoint_or_scratch
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 320 in train
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/gin/config.py", line 1582 in gin_wrapper
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 623 in _main
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 605 in main
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251 in _run_main
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303 in run
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/gin_utils.py", line 105 in run
  File "/home/leojlaugier/.local/lib/python3.8/site-packages/t5x/train.py", line 625 in <module>
  File "/usr/lib/python3.8/runpy.py", line 87 in _run_code
  File "/usr/lib/python3.8/runpy.py", line 194 in _run_module_as_main

https://symbolize.stripped_domain/r/?trace=7f330498e18b,7f330498e20f,6&map=
*** SIGSEGV (@0x7d100002a60), see gl__________41#s15 received by PID 10848 (TID 12157) on cpu 19; stack trace: ***
PC: @     0x7f330498e18b  (unknown)  raise
    @     0x7f32fb6ea1fa        992  (unknown)
    @     0x7f330498e210  (unknown)  (unknown)
    @                0x7  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f330498e18b,7f32fb6ea1f9,7f330498e20f,6&map=55976a7e1de583f3a9544af1c86ac940:7f32ed01c000-7f32fba50d80
E0310 16:51:25.580514   12157 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0310 16:51:25.580525   12157 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0310 16:51:25.580535   12157 client.cc:221] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0310 16:51:25.580557   12157 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0310 16:51:25.580562   12157 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0310 16:51:25.580565   12157 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0310 16:51:25.580569   12157 coredump_hook.cc:550] RAW: Discarding core.

And later:

I0310 16:54:01.224571 140510105828416 train.py:456] Epoch 1100 of 1101
I0310 16:54:01.224764 140510105828416 train.py:462] BEGIN Train loop.
I0310 16:54:01.224818 140510105828416 train.py:467] Training for 10 steps.
I0310 16:54:01.226046 140497868457728 logging_writer.py:48] [1100000] collection=train timing/compilation_seconds=87.301567
I0310 16:54:01.230673 140510105828416 trainer.py:491] Training: step 1100000
I0310 16:54:01.635557 140510105828416 train.py:490] END Train loop.
./train_yelp_xxl.sh: line 34: 12024 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --gin.BATCH_SIZE="16" --gin.MIXTURE_OR_TASK_NAME="'yelp'" --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.USE_CACHED_TASKS="False" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_100_010"
##### Command execution on worker 1 failed with return code 134. Continuing.
./train_yelp_xxl.sh: line 34: 10848 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --gin.BATCH_SIZE="16" --gin.MIXTURE_OR_TASK_NAME="'yelp'" --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.USE_CACHED_TASKS="False" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_100_010"
##### Command execution on worker 0 failed with return code 134. Continuing.
./train_yelp_xxl.sh: line 34: 11607 Aborted                 (core dumped) python3 -m t5x.train --gin_search_paths="${T5X_DIR},${FLAXFORMER_DIR},${PROMPT_DIR}" --gin_file="prompt_tuning/configs/models/t5_1_1_xxl_prompt.gin" --gin_file="prompt_tuning/configs/runs/prompt_finetune.gin" --gin.MODEL_DIR="'${MODEL_DIR}'" --gin.partitioning.PjitPartitioner.model_parallel_submesh="(4, 4, 1, 2)" --gin.BATCH_SIZE="16" --gin.MIXTURE_OR_TASK_NAME="'yelp'" --gin.MIXTURE_OR_TASK_MODULE="'task_dir.mytasks'" --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 8}" --gin.USE_CACHED_TASKS="False" --gin.INITIAL_CHECKPOINT_PATH="'${PRETRAINED_MODEL}'" --gin.TRAIN_STEPS="1_100_010"
##### Command execution on worker 3 failed with return code 134. Continuing.

Then the run freezes. I might be missing something obvious, but I don't think I've changed anything except the data since the last time I was able to train and run inference with prompt tuning. Moreover, I was able to train on the same training data; the problems only arose when I tried to run inference.
Could you please help me understand the issue?

Thanks in advance for your time.


LeoLaugier commented Mar 11, 2022

Actually, I have some doubts about the command provided to install the libraries on the TPU Pod. Could someone please check that it is correct?

$ gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
  --zone ${ZONE} \
  --worker=all \
  --command="git clone --branch=main https://github.com/google-research/prompt-tuning && cd prompt-tuning && "
python3 -m pip install . -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

For instance, I don't understand why python3 -m pip install . -f https://storage.googleapis.com/jax-releases/libtpu_releases.html is not placed right after the &&, and why there is no need for [tpu] after pip install .
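
For comparison, this is the form I would have expected, with the install inside the remote --command (just my guess at the intended command, not a verified fix; whether .[tpu] is needed is exactly what I'm unsure about):

# My guess at the intended form: run the clone and the install in the same remote command on every worker.
gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
  --zone ${ZONE} \
  --worker=all \
  --command="git clone --branch=main https://github.com/google-research/prompt-tuning && cd prompt-tuning && python3 -m pip install . -f https://storage.googleapis.com/jax-releases/libtpu_releases.html"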

Thanks


iislucas commented Apr 1, 2022

@adarob signal boost: this is blocking me and my PhD student @LeoLaugier. Any ideas? :)


adarob commented Apr 1, 2022

To be clear, it looks like you get a segfault in one thread but then your training completes?

@LeoLaugier (Author)

I think there is a segfault in one thread, but the training freezes (in the other threads?), so it does not complete. (I haven't retried since the day I opened the issue.) It looks like issue #366.

@hwchung27 (Contributor)

We have a new API using XManager, which manages the workers, so I think it can help resolve the issue. Could you try the instructions here: https://github.com/google-research/t5x#quickstart-recommended?

@LeoLaugier (Author)

Awesome, thanks for the pointer. I'll try that ASAP and let you know.

I guess XManager is compatible with prompt tuning, right?


iislucas commented Apr 5, 2022

Correct. XManager is an API for managing experiments, and there's no compatibility issue with prompt-tuning.

@peregilk

I am struggling with the same issue here. Is there any way around this other than using XManager?


jbms commented Apr 27, 2022

This appears to be the same issue as google/tensorstore#30, for which we will have a fix pushed out in the next day or so.

jbms added a commit to google/tensorstore that referenced this issue Apr 27, 2022
Previously, none of the auth providers were actually thread safe,
leading to intermittent crashes.

Fixes #30
Fixes google-research/t5x#340

PiperOrigin-RevId: 444883311
Change-Id: I8a49384f783a717593dc1c31f932596d12fc9c4c

jbms commented Apr 27, 2022

This should now be fixed in tensorstore 0.1.20.
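
Something like the following should pick up the fix on all workers (this reuses the gcloud invocation pattern from earlier in this thread; adjust TPU_NAME and ZONE to your setup):

# Upgrade tensorstore to a version that includes the auth-provider thread-safety fix.
gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} \
  --zone ${ZONE} \
  --worker=all \
  --command="python3 -m pip install --upgrade 'tensorstore>=0.1.20'"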


adarob commented May 4, 2022

Thanks @jbms!

adarob closed this as completed May 4, 2022