Resnet benchmark crashed with exit code -6 #1165

gopitk · 2022-06-14T17:08:32Z

Environment

OS: [Ubuntu 20.04]
Hardware (GPU, or instance type): [AMD Instinct/ROCm 5.1.1]

To reproduce

Steps to reproduce the behavior:

Run the ResNet recipe from https://github.com/mosaicml/benchmarks/tree/main/blogs/resnet, specifically recipes/resnet50_hot.yaml.
The benchmark runs for several epochs (and appears to train normally), after which composer prints an error and terminates the run:
ERROR:composer.cli.launcher:Rank 2 crashed with exit code -6

I don't see any error or stack trace explaining why that rank exited. I enabled FileLogger and don't see an error or stack trace there either. Here are the last few lines of the rank 2 log file that was generated.

[EPOCH][batch=42500]: { "metrics/eval/Accuracy": 0.7575, "metrics/eval/CrossEntropy": 1.6468, }
[EPOCH][batch=42500]: { "epoch": 68, }
[EPOCH][batch=42500]: { "metrics/eval/Accuracy": 0.7575, "metrics/eval/CrossEntropy": 1.6468, }
[EPOCH][batch=42500]: { "epoch": 68, }
[stderr]: INFO:composer.algorithms.progressive_resizing.progressive_resizing:Applied Progressive Resizing with scale_factor=0.9820359281437125 and mode=resize.
[stderr]: Old input dimensions: (H,W)=(167, 167).
[stderr]: New input dimensions: (H,W)=(164, 164)

Is there a way to enable more logging or understand what the exit code -6 means?
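For reference, when a launcher starts its ranks through Python's subprocess machinery, a negative exit code conventionally means the child was terminated by the signal with that number, so -6 corresponds to signal 6 (SIGABRT on Linux). A minimal sketch of decoding such a code follows; the helper name `describe_exit_code` is hypothetical and is not part of composer.

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Hypothetical helper: describe a subprocess-style return code."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"


print(describe_exit_code(-6))
```

SIGABRT is typically raised when a process calls abort(), which native libraries often do after an unrecoverable error, so the underlying message usually appears on that rank's stderr rather than in a Python stack trace.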

Expected behavior

Benchmark runs to completion.

hanlint · 2022-06-14T19:22:27Z

Hi @gopitk, we haven't tested everything on AMD GPUs just yet. If you are using the composer command, run with --stderr to see the logs. @florescl has been testing on AMD GPUs internally; according to her, you'll likely see an error like:

:0:rocdevice.cpp            :2603: 7586688823011 us: 75436: [tid:0x7f54541db700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b

florescl · 2022-06-14T19:46:18Z

@gopitk to run with --stderr, you can use: composer -n num_gpus --stderr "{local_rank}.log" examples/run_composer_trainer.py -f yaml
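Once each rank writes its stderr to its own file, the files can be scanned for the ROCm runtime error quoted earlier. The following is only a sketch: the function name `find_hsa_errors`, the `*.log` glob, and the default search string are assumptions chosen to match the "{local_rank}.log" naming in the command above, not anything composer provides.

```python
from pathlib import Path


def find_hsa_errors(log_dir, pattern="HSA_STATUS_ERROR"):
    """Hypothetical helper: map each log file name to its lines containing `pattern`."""
    hits = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a log has stray bytes.
        text = path.read_text(errors="replace")
        matching = [line for line in text.splitlines() if pattern in line]
        if matching:
            hits[path.name] = matching
    return hits

# Example usage (assumes the per-rank logs are in the current directory):
# for name, lines in find_hsa_errors(".").items():
#     print(name, lines[-1])
```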

gopitk · 2022-06-15T17:50:18Z

Thanks @florescl and @hanlint. I am able to get the errors in the file now. (It was a bit confusing not to see stderr by default in composer. Maybe that could be changed?) Anyway, this seems like an issue in the underlying ROCm layers. Tagging @sunway513.

ravi-mosaicml · 2022-06-16T15:33:13Z


Hmm, the stdout and stderr for any crashed process should print by default when using the launcher script. Out of curiosity, did the error print in the console when using --stderr "{local_rank}.log", or did it only appear in the file?


ravi-mosaicml · 2022-06-16T15:55:45Z

Edit: NVM; I think I traced down the issue of why the stack traces were not printing -- see #1175.

When using the launcher script, the stack trace was not printed when a process was terminated by a POSIX signal (e.g. `SIGABRT` -- see #1165). This was intended to avoid cluttering the logs when the launcher script itself terminates a process via SIGTERM or SIGKILL. Instead of skipping the stack trace on all negative exit codes (i.e. all signals), this PR skips the stack trace only when the process was terminated via SIGKILL or SIGTERM, which are the two signals that the launcher script uses to terminate processes.
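The rule described here can be sketched in a few lines. This is an illustration of the stated behavior only, not the code from #1175; the names `LAUNCHER_SIGNALS` and `should_print_traceback` are hypothetical.

```python
import signal

# Signals the launcher itself sends to stop ranks, per the description above.
LAUNCHER_SIGNALS = {signal.SIGTERM, signal.SIGKILL}


def should_print_traceback(returncode: int) -> bool:
    """Hypothetical helper: decide whether a rank's output is worth printing."""
    if returncode == 0:
        # Clean exit: nothing to report.
        return False
    if returncode < 0 and -returncode in LAUNCHER_SIGNALS:
        # The launcher most likely killed this rank; printing would add noise.
        return False
    # Any other failure, including other signals such as SIGABRT (-6).
    return True
```

Under this rule a rank that exits with -6 is reported, while ranks that the launcher stops with SIGTERM (-15) or SIGKILL (-9) stay quiet.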

sunway513 · 2022-06-24T14:25:14Z

Hi, thanks @gopitk for reporting the issue. We have been triaging it, and a root cause has been identified. We'll try to provide a Dockerfile to work around this issue before the full fix is rolled out in an official ROCm release.

sunway513 · 2022-06-24T20:37:26Z

Here's a docker container with the MIOpen staging fixes for this issue:
rocm/rocm-dev:rocm5.1.3_ubuntu20.04_py3.7_pytorch_1.10.0_miopen_5.2_staging

The corresponding Dockerfile:

FROM rocm/pytorch:rocm5.1.3_ubuntu20.04_py3.7_pytorch_1.11.0
RUN cd ~ && wget https://www.dropbox.com/s/m7br1iidh1mxl18/build_miopen_dev.sh
RUN cd ~ && bash build_miopen_dev.sh release/rocm-rel-5.2-staging

Sample output

Epoch    62 train 100%|█████████████████████████| 625/625 [03:01<00:00,  3.44it/s, loss/train=4.3485]
Epoch    63 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.23it/s, metrics/eval/Accuracy=0.7734]
Epoch    63 train 100%|█████████████████████████| 625/625 [03:01<00:00,  3.44it/s, loss/train=3.7371]
Epoch    64 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.30it/s, metrics/eval/Accuracy=0.7732]
Epoch    64 train 100%|█████████████████████████| 625/625 [03:02<00:00,  3.43it/s, loss/train=4.1074]
Epoch    65 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.30it/s, metrics/eval/Accuracy=0.7743]
Epoch    65 train 100%|█████████████████████████| 625/625 [03:02<00:00,  3.43it/s, loss/train=4.3634]
Epoch    66 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.30it/s, metrics/eval/Accuracy=0.7745]
Epoch    66 train 100%|█████████████████████████| 625/625 [03:01<00:00,  3.44it/s, loss/train=4.2283]
Epoch    67 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.37it/s, metrics/eval/Accuracy=0.7742]
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.

gopitk · 2022-06-24T20:51:08Z

Thanks @sunway513 for the fix. Will test it and get back to you.

florescl · 2022-06-28T19:24:03Z

Thanks @sunway513 for the docker container. To double check, what is the command you ran for the sample output you copied?

sunway513 · 2022-06-30T00:28:32Z

Hi @florescl , the command I used for the above sample output was the following:
composer -n 8 train.py -f recipes/resnet50_hot.yaml --scale_schedule_ratio 0.75

In addition, I ran the workload in a loop 10 times with the following command using the updated docker container, and all runs passed:
composer -n 16 train.py -f recipes/resnet50_hot.yaml --scale_schedule_ratio 2.2
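A repeated-run check like this could be scripted roughly as follows. This is a sketch and not the script that was actually used; `run_repeatedly` and `COMPOSER_CMD` are hypothetical names, and the command list simply mirrors the one quoted above.

```python
import subprocess
import sys

# The command quoted above, split into arguments.
COMPOSER_CMD = [
    "composer", "-n", "16", "train.py",
    "-f", "recipes/resnet50_hot.yaml",
    "--scale_schedule_ratio", "2.2",
]


def run_repeatedly(cmd, times):
    """Run `cmd` up to `times` times; return how many runs succeeded before the first failure."""
    for i in range(times):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"run {i + 1} failed with exit code {result.returncode}",
                  file=sys.stderr)
            return i
    return times

# Example usage (requires composer and the benchmark repo to be set up):
# passed = run_repeatedly(COMPOSER_CMD, times=10)
```

Stopping at the first failure keeps the failing run's logs as the most recent ones on disk, which makes an intermittent crash easier to inspect.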

florescl · 2022-07-12T22:33:12Z

Closing this issue since it's fixed with the updated docker container.

gopitk added the bug label Jun 14, 2022

hanlint added the amd label Jun 14, 2022

ravi-mosaicml mentioned this issue Jun 16, 2022

Fix Printing of the Stack Trace on POSIX Signal Exits #1175

Merged

florescl closed this as completed Jul 12, 2022

junliume mentioned this issue Feb 12, 2023

[bug] found igemm solver does not support group conv ROCm/MIOpen#1979

Closed
