Resnet benchmark crashed with exit code -6 #1165

Closed
gopitk opened this issue Jun 14, 2022 · 11 comments
Labels
amd, bug (Something isn't working)

Comments

@gopitk

gopitk commented Jun 14, 2022

Environment

  • OS: [Ubuntu 20.04]
  • Hardware (GPU, or instance type): [AMD Instinct/ROCm 5.1.1]

To reproduce

Steps to reproduce the behavior:

  1. Run the ResNet benchmark from https://github.com/mosaicml/benchmarks/tree/main/blogs/resnet, specifically the recipe recipes/resnet50_hot.yaml.
  2. The benchmark runs for several epochs (and appears to be training well), after which composer prints an error and terminates the run:
    ERROR:composer.cli.launcher:Rank 2 crashed with exit code -6

I don't see any error or stack trace explaining why that rank exited. I enabled FileLogger and don't see any error or stack trace there either. Here are the last few lines of the rank 2 log file that was generated.

[EPOCH][batch=42500]: { "metrics/eval/Accuracy": 0.7575, "metrics/eval/CrossEntropy": 1.6468, }
[EPOCH][batch=42500]: { "epoch": 68, }
[EPOCH][batch=42500]: { "metrics/eval/Accuracy": 0.7575, "metrics/eval/CrossEntropy": 1.6468, }
[EPOCH][batch=42500]: { "epoch": 68, }
[stderr]: INFO:composer.algorithms.progressive_resizing.progressive_resizing:Applied Progressive Resizing with scale_factor=0.9820359281437125 and mode=resize.
[stderr]: Old input dimensions: (H,W)=(167, 167).
[stderr]: New input dimensions: (H,W)=(164, 164)

Is there a way to enable more logging or understand what the exit code -6 means?
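
On the meaning of the exit code: Python's subprocess module reports a child killed by a signal as a negative return code whose absolute value is the signal number. The sketch below decodes such a code, assuming the composer launcher follows that convention; describe_exit_code is a hypothetical helper, not part of composer.

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Describe a child process return code in subprocess's POSIX convention."""
    if returncode >= 0:
        return f"exited with status {returncode}"
    try:
        # A negative code -N means the process was terminated by signal N.
        sig = signal.Signals(-returncode)
    except ValueError:
        return f"terminated by unknown signal {-returncode}"
    return f"terminated by signal {sig.name} ({sig.value})"


print(describe_exit_code(-6))
```

On Linux, signal 6 is SIGABRT, which a process usually raises on itself by calling abort(), so the cause is more likely to show up in that rank's stderr than in the launcher output.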

Expected behavior

Benchmark runs to completion.

gopitk added the bug label Jun 14, 2022
@hanlint
Contributor

hanlint commented Jun 14, 2022

Hi @gopitk, we haven't tested everything on AMD GPUs just yet. If you are using the composer command, run it with --stderr to see the logs. @florescl has been testing on AMD GPUs internally; according to her, you'll likely see an error like:

:0:rocdevice.cpp            :2603: 7586688823011 us: 75436: [tid:0x7f54541db700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b

@florescl
Contributor

@gopitk, to run with --stderr you can use the following command, where num_gpus and yaml are placeholders for your GPU count and recipe file:
composer -n num_gpus --stderr "{local_rank}.log" examples/run_composer_trainer.py -f yaml
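
Once each rank writes its stderr to its own file, the crashing rank can be found by searching those files for the abort message. This is a minimal sketch, assuming the logs are *.log files in a single directory and that the message contains the text "aborting with error" as in the example above; find_lines is a hypothetical helper, not part of composer.

```python
from pathlib import Path
from typing import Dict, List


def find_lines(log_dir: str, needle: str = "aborting with error") -> Dict[str, List[str]]:
    """Map each *.log file in log_dir to the lines that contain needle."""
    matches: Dict[str, List[str]] = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a log holds stray bytes.
        text = path.read_text(errors="replace")
        hits = [line for line in text.splitlines() if needle in line]
        if hits:
            matches[path.name] = hits
    return matches
```

Calling find_lines(".") in the directory holding the per-rank logs returns only the files that recorded a matching line.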

hanlint added the amd label Jun 14, 2022
@gopitk
Author

gopitk commented Jun 15, 2022

Thanks @florescl and @hanlint. I can now see the errors in the file. (It was a bit confusing not to see stderr by default in composer. Maybe that is something that can be changed?) Anyway, this seems like an issue in the underlying ROCm layers. Tagging @sunway513.

@ravi-mosaicml
Contributor

Hmm, the stdout and stderr of any crashed process should be printed by default when using the launcher script. Out of curiosity, did the error print in the console when using --stderr "{local_rank}.log", or did it only appear in the file?

ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this issue Jun 16, 2022
When using the launcher script, the stack trace was not printed when a process was terminated via a POSIX signal (e.g. `SIGABRT` -- see mosaicml#1165). That behavior was meant to avoid cluttering the logs when the launcher script itself terminates a process via SIGTERM or SIGKILL.

Instead of skipping the stack trace on all negative exit codes (i.e. all signals), this PR skips the stack trace only when the process was terminated via SIGKILL or SIGTERM, which are the two signals that the launcher script uses to terminate processes.

Closes mosaicml#1165
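
The rule described in this commit message can be sketched as below. This is an illustration, not the launcher's actual code: should_print_output and LAUNCHER_SIGNALS are hypothetical names, and treating exit code 0 as "nothing to print" is an added assumption.

```python
import signal

# Signals the launcher itself uses to stop ranks, per the commit message.
LAUNCHER_SIGNALS = (signal.SIGTERM, signal.SIGKILL)


def should_print_output(returncode: int) -> bool:
    """Decide whether a finished rank's captured output should be printed."""
    if returncode == 0:
        return False  # assumption: a clean exit has nothing to report
    if returncode < 0 and -returncode in LAUNCHER_SIGNALS:
        return False  # stopped by the launcher; its output would be noise
    return True  # a real failure, including other signals such as SIGABRT
```

Under this rule a rank that exits with -6 has its output printed, whereas a blanket check for negative exit codes would have suppressed it.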
@ravi-mosaicml
Contributor

Edit: NVM; I think I traced down the issue of why the stack traces were not printing -- see #1175.

ravi-mosaicml added a commit that referenced this issue Jun 16, 2022
When using the launcher script, the stack trace was not printed when a process was terminated via a POSIX signal (e.g. `SIGABRT` -- see #1165). That behavior was meant to avoid cluttering the logs when the launcher script itself terminates a process via SIGTERM or SIGKILL.

Instead of skipping the stack trace on all negative exit codes (i.e. all signals), this PR skips the stack trace only when the process was terminated via SIGKILL or SIGTERM, which are the two signals that the launcher script uses to terminate processes.
@sunway513

Hi, thanks @gopitk for reporting the issue. We have been triaging it, and a root cause has been identified. We'll try to provide a Dockerfile to work around the issue before the full fix is rolled out with an official ROCm release.

@sunway513

Here's a docker container with the MIOpen staging fixes for this issue:
rocm/rocm-dev:rocm5.1.3_ubuntu20.04_py3.7_pytorch_1.10.0_miopen_5.2_staging

The corresponding Dockerfile:

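# Start from the ROCm 5.1.3 PyTorch image
FROM rocm/pytorch:rocm5.1.3_ubuntu20.04_py3.7_pytorch_1.11.0
# Download the MIOpen build script
RUN cd ~ && wget https://www.dropbox.com/s/m7br1iidh1mxl18/build_miopen_dev.sh
# Run it against release/rocm-rel-5.2-staging
RUN cd ~ && bash build_miopen_dev.sh release/rocm-rel-5.2-staging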

Sample output

Epoch    62 train 100%|█████████████████████████| 625/625 [03:01<00:00,  3.44it/s, loss/train=4.3485]
Epoch    63 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.23it/s, metrics/eval/Accuracy=0.7734]
Epoch    63 train 100%|█████████████████████████| 625/625 [03:01<00:00,  3.44it/s, loss/train=3.7371]
Epoch    64 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.30it/s, metrics/eval/Accuracy=0.7732]
Epoch    64 train 100%|█████████████████████████| 625/625 [03:02<00:00,  3.43it/s, loss/train=4.1074]
Epoch    65 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.30it/s, metrics/eval/Accuracy=0.7743]
Epoch    65 train 100%|█████████████████████████| 625/625 [03:02<00:00,  3.43it/s, loss/train=4.3634]
Epoch    66 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.30it/s, metrics/eval/Accuracy=0.7745]
Epoch    66 train 100%|█████████████████████████| 625/625 [03:01<00:00,  3.44it/s, loss/train=4.2283]
Epoch    67 val   100%|█████████████████████████| 25/25 [00:05<00:00,  4.37it/s, metrics/eval/Accuracy=0.7742]
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.

@gopitk
Author

gopitk commented Jun 24, 2022

Thanks @sunway513 for the fix. Will test it and get back to you.

@florescl
Contributor

florescl commented Jun 28, 2022

Thanks @sunway513 for the docker container. To double check, what is the command you ran for the sample output you copied?

@sunway513

Hi @florescl, the command I used for the above sample output was:
composer -n 8 train.py -f recipes/resnet50_hot.yaml --scale_schedule_ratio 0.75

In addition, I ran the workload in a loop 10 times with the following command in the updated Docker container, and all runs passed:
composer -n 16 train.py -f recipes/resnet50_hot.yaml --scale_schedule_ratio 2.2
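
The comment does not say how the runs were looped. One possible way is sketched below, assuming composer is on PATH and the script is started from the benchmark directory; run_repeatedly and COMPOSER_CMD are hypothetical names.

```python
import subprocess
import sys
from typing import List, Sequence

# Command copied from the comment above.
COMPOSER_CMD = [
    "composer", "-n", "16", "train.py",
    "-f", "recipes/resnet50_hot.yaml",
    "--scale_schedule_ratio", "2.2",
]


def run_repeatedly(cmd: Sequence[str], times: int) -> List[int]:
    """Run cmd up to `times` times, stopping at the first non-zero exit code."""
    codes: List[int] = []
    for attempt in range(1, times + 1):
        result = subprocess.run(list(cmd))
        codes.append(result.returncode)
        if result.returncode != 0:
            print(f"run {attempt} failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
    return codes
```

Calling run_repeatedly(COMPOSER_CMD, times=10) returns a list of ten zeros when every run passes; a shorter list ending in a non-zero code identifies the run that failed.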

@florescl
Contributor

Closing this issue since it's fixed with the updated Docker container.
