[Bug]: Optimization of Unet fails 6950 XT #517

Open
captroper opened this issue Aug 26, 2023 · 7 comments
Labels
bug (Something isn't working), DirectML

Comments

captroper commented Aug 26, 2023

What happened?

This looks to me like the same issue as #510 and #301, though I'm no expert. I ran the following commands:

  • conda create --name olive python=3.9
  • conda activate olive
  • pip install olive-ai[directml]==0.3.1
  • git clone https://github.com/microsoft/olive --branch v0.3.1
  • cd (to relevant directory)
  • pip install -r requirements.txt
  • python stable_diffusion_xl.py --optimize

I've attached the log and a DxDiag report. The run errors out while optimizing the unet, saying "failed to run olive on gpu-dml" ... "887a0006 the gpu will not respond to more commands".

DxDiag.txt
ErrorLog.txt
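
Before digging into the unet failure, it can help to confirm that the installed ONNX Runtime build actually exposes the DirectML execution provider. The following is a minimal sketch, not part of the Olive example: the helper names `available_providers` and `has_directml` are hypothetical, and it assumes the `onnxruntime` package in the active environment is the DirectML build.

```python
def available_providers():
    """Return the execution providers ONNX Runtime reports, or [] if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return ort.get_available_providers()


def has_directml(providers):
    """True if the DirectML execution provider appears in the given provider list."""
    return "DmlExecutionProvider" in providers


providers = available_providers()
print("Providers:", providers)
print("DirectML available:", has_directml(providers))
```

If DirectML is not listed, the failure is more likely an installation problem than the GPU hang described in this issue.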

Version?

0.3.1

captroper added the bug label Aug 26, 2023
@guotuofeng
Collaborator

The following error message seems to be related to the DirectML EP.

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(896)\onnxruntime_pybind11_state.pyd!00007FFE31C80201: (caller: 00007FFE31C80C2F) Exception(2) tid(3c14) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

@guotuofeng
Collaborator

@jstoecker, do you have any insight?

guotuofeng added the waiting for response label Sep 17, 2023
@guotuofeng
Collaborator

This seems similar to #510.

@jstoecker
Contributor

This is DXGI_ERROR_DEVICE_HUNG during inference/evaluation, which typically happens when some GPU work is taking excessively long. The recent AMD driver optimizations for stable diffusion / multi-head attention target the RDNA 3 architecture (e.g., the 7000 series, like the Radeon RX 7900 XTX) but not the RDNA 2 (6000 series). Still, we can try to repro this on an RDNA card to see if anything jumps out.
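
For anyone triaging similar logs, here is a rough sketch of how the HRESULT in the error line can be picked out and mapped to a DXGI error name. It is an illustration only: the function name `find_dxgi_error` and the small lookup table are assumptions made for this example, and the table covers just a few device-loss codes rather than the full DXGI list.

```python
import re

# A few DXGI device-loss codes and their names; this is not an exhaustive table.
DXGI_ERRORS = {
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
    0x887A0020: "DXGI_ERROR_DRIVER_INTERNAL_ERROR",
}

# Matches a standalone 8-digit hex code in the DXGI range (887Axxxx).
HRESULT_RE = re.compile(r"\b(887A[0-9A-F]{4})\b", re.IGNORECASE)


def find_dxgi_error(log_text):
    """Return (code, name) for the first DXGI HRESULT found in the text, or None."""
    match = HRESULT_RE.search(log_text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, DXGI_ERRORS.get(code, "unknown DXGI error")


# Fragment of the error line quoted earlier in this thread.
sample = "Exception(2) tid(3c14) 887A0006 The GPU will not respond to more commands"
result = find_dxgi_error(sample)
if result is not None:
    print(f"0x{result[0]:08X} -> {result[1]}")
```

The name alone does not say which operator or kernel hung; it only confirms that the device stopped responding during GPU work.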

guotuofeng removed the waiting for response label Sep 19, 2023

CellerX commented Sep 26, 2023

The 6800 XT has the same error.

vibbix commented Nov 22, 2023

Same error on my 6900 XT as well, on 0.4.0.

Jerry-zirui commented Jun 21, 2024

The same error occurred on an AMD Ryzen 7 7840U with Radeon 780M Graphics.
I increased the dedicated GPU memory as mentioned in #510, but the error still occurs.


7 participants