Resnet benchmark crashed with exit code -6 #1165
Comments
Hi @gopitk we haven't tested everything on AMD GPUs just yet. If you are using the
@gopitk to capture stderr, you can run `composer -n num_gpus --stderr "{local_rank}.log" examples/run_composer_trainer.py -f yaml`.
Thanks @florescl and @hanlint. I can now see the errors in the file. (It was a bit confusing not to see stderr by default in composer. Maybe that is something that can be changed?) Anyway, this looks like an issue in the underlying ROCm layers. Tagging @sunway513.
Hmm, the stdout and stderr error for any crashed process should print by default when using the launcher script. Curious did the error print in the console when using |
When using the launcher script, the stack trace was not printed when the process was terminated via a POSIX signal (e.g. `SIGABRT` -- see mosaicml#1165). This was meant to avoid cluttering the logs when the launcher script itself terminates the process via SIGTERM or SIGKILL. Instead of skipping the stack trace on all negative exit codes (i.e. all signals), this PR skips the stack trace only when the process was terminated via SIGKILL or SIGTERM, which are the two signals that the launcher script uses to terminate processes. Closes mosaicml#1165
Edit: NVM; I think I traced down the issue of why the stack traces were not printing -- see #1175.
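As a rough illustration of the rule described above (print the stack trace unless the process was stopped by SIGTERM or SIGKILL), a minimal sketch could look like the following. The helper name `should_print_stack_trace` and the constant `LAUNCHER_SIGNALS` are hypothetical and are not taken from the Composer launcher. The sketch assumes return codes follow the Python `subprocess` convention, where a negative value `-N` means the process was terminated by signal `N`.

```python
import signal

# Assumption: these are the only two signals the launcher sends when it
# tears down ranks itself, as stated in the PR description.
LAUNCHER_SIGNALS = frozenset({int(signal.SIGTERM), int(signal.SIGKILL)})


def should_print_stack_trace(returncode: int) -> bool:
    """Decide whether a crashed rank's output is worth printing.

    Hypothetical helper: a negative return code -N means "killed by signal N".
    """
    if returncode == 0:
        # Clean exit, nothing to report.
        return False
    if returncode < 0 and -returncode in LAUNCHER_SIGNALS:
        # Most likely terminated by the launcher during cleanup; stay quiet.
        return False
    # Any other signal (e.g. SIGABRT) or a non-zero exit status is a real crash.
    return True


if __name__ == "__main__":
    for code in (0, 1, -int(signal.SIGABRT), -int(signal.SIGTERM), -int(signal.SIGKILL)):
        print(code, should_print_stack_trace(code))
```

With this rule, a rank that aborts with exit code -6 is reported, while ranks that the launcher stops during shutdown are not.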
Hi, thanks @gopitk for reporting the issue. We have been triaging it, and a root cause has been identified. We'll try to provide a Dockerfile to work around this issue before the full fix is rolled out in an official ROCm release.
Here's a docker container with the MIOpen staging fixes for this issue: The corresponding Dockerfile:
Sample output
Thanks @sunway513 for the fix. Will test it and get back to you.
Thanks @sunway513 for the docker container. To double check, what is the command you ran for the sample output you copied?
Hi @florescl, the command I used for the above sample output was the following: Besides, I have looped the workload 10 times with the following command using the updated docker container, and all runs passed:
Closing this issue since it's fixed with the updated docker container.
Environment
To reproduce
Steps to reproduce the behavior:
ERROR:composer.cli.launcher:Rank 2 crashed with exit code -6
I don't see any error or stack trace explaining why the specific rank exited. I enabled FileLogger and don't see any error or stack trace there either. Here are the last few lines of the rank 2 log file that was generated.
[EPOCH][batch=42500]: { "metrics/eval/Accuracy": 0.7575, "metrics/eval/CrossEntropy": 1.6468, }
[EPOCH][batch=42500]: { "epoch": 68, }
[EPOCH][batch=42500]: { "metrics/eval/Accuracy": 0.7575, "metrics/eval/CrossEntropy": 1.6468, }
[EPOCH][batch=42500]: { "epoch": 68, }
[stderr]: INFO:composer.algorithms.progressive_resizing.progressive_resizing:Applied Progressive Resizing with scale_factor=0.9820359281437125 and mode=resize.
[stderr]: Old input dimensions: (H,W)=(167, 167).
[stderr]: New input dimensions: (H,W)=(164, 164)
Is there a way to enable more logging or understand what the exit code -6 means?
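On the meaning of the exit code: when a child process is started through Python's `subprocess` module on POSIX, a negative return code `-N` indicates the process was terminated by signal `N`, so -6 corresponds to signal 6, which is `SIGABRT` on Linux. A small sketch for decoding such codes follows; the helper name `describe_exit_code` is made up for illustration and is not part of Composer.

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Turn a subprocess-style return code into a readable description."""
    if returncode < 0:
        signum = -returncode
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Decode the code from the launcher error message above.
    print(describe_exit_code(-6))
```

`SIGABRT` is typically raised by a call to `abort()`, for example from a failed assertion in native code, which would explain why no Python stack trace appears in the rank's log.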
Expected behavior
Benchmark runs to completion.