Incorrect compute kernel from evaluator on WSL2 #284

Closed
Column01 opened this issue Aug 11, 2021 · 14 comments

@Column01

Column01 commented Aug 11, 2021

Following the other issue about this (#269), I tried installing Antares to get ROCm working on WSL2 with Ubuntu 20.04, but it does not seem to work.

When running sudo BACKEND=c-rocm_win64 make to install the ROCm backend for Windows in WSL2, it tries to evaluate a custom kernel at the end and fails to do so.

The AMD HIP driver is present (C:\Windows\System32\amdhip64.dll), and running sudo apt install rocm-dev reports that it is already installed. Antares is visible from Windows at /mnt/c/Users/Colin/ubuntu_stuff/antares/

Here is the log during the evaluation: https://gist.github.com/3c77d7003a0a212d3f30abea8ee2b9d8

It should be noted that running /opt/rocm/bin/rocminfo prints: ROCk module is NOT loaded, possibly no GPU devices. AMD has closed all issues regarding WSL2 and this error message...

Kernel version is: Linux Colin-Desktop 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@Column01
Author

Here is more info if needed. The GPU is an RX580

colin-ubuntu@Colin-Desktop:~$ rocminfo
ROCk module is NOT loaded, possibly no GPU devices
colin-ubuntu@Colin-Desktop:~$ clinfo
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.0 AMD-APP (3305.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               0

@Column01
Author

I was able to get torch's ROCm version to install, but running the antares samples will use the CPU

@ghostplant
Contributor

ghostplant commented Aug 12, 2021

As we said in #269, Antares is meant to launch ROCm device code through the native Windows ROCm driver and to help port device code to standard Win64 applications. It is not meant to restore the full ROCm stack and make PyTorch work in Linux mode (that may be possible in theory, but making it happen would be costly, and I am not sure it is worth the time).

But your logs https://gist.github.com/3c77d7003a0a212d3f30abea8ee2b9d8 and the statement "running the antares samples will use the CPU" are not expected. I am not sure whether you have installed the Windows AMDGPU driver correctly.

So can you paste the log from running bash -c 'cd .libAntares/cache/_/ && ../../evaluator.c-rocm_win64' ? It will show the cause of the error in your case.

@Column01
Author

+ /opt/rocm/bin/hipcc .antares-module-tempfile.cu --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx1010 --genco -Wno-ignored-attributes -O2 -o .antares-module-tempfile.cu.out
..\..\..\rocclr\hip_code_object.cpp:482: guarantee(false && "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!")

@Column01
Author

The AMDGPU drivers are installed correctly on Windows; the DLL mentioned in the docs is present.

@Column01
Author

If I'm reading the error correctly I might need to install an older ROCm version? But it could totally be related to the fact rocminfo displays that the ROCk module is not running, could it not?

@ghostplant
Contributor

Your log clearly shows the root cause: none of the --amdgpu-target values matches your GPU type exactly.
The --amdgpu-target list currently passed to the compiler is gfx803, gfx900, gfx906, gfx908, gfx1010.

My AMD GPU is a Radeon VII, which should match gfx906. If I remove the argument --amdgpu-target=gfx906, I get the same kind of error:

+ /opt/rocm/bin/hipcc .antares-module-tempfile.cu --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx908 --amdgpu-target=gfx1010 --genco -Wno-ignored-attributes -O2 -o .antares-module-tempfile.cu.out
..\..\..\hip_code_object.cpp:92: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

Once any --amdgpu-target matches the GPU type, it works normally:

+ /opt/rocm/bin/hipcc .antares-module-tempfile.cu --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx1010 --genco -Wno-ignored-attributes -O2 -o .antares-module-tempfile.cu.out

[EvalAgent] Results = {"K/0": 1504583185.0, "TPR": 0.000401532}

This was also tested on an AMD Navi 10, which should match gfx1010.

Besides, some GPUs of the same type may carry a special suffix such as xnack-, which can affect whether the Windows ROCm runtime driver matches your compiled target types. I am not sure whether your AMD GPU really matches gfx803 exactly. If it does, this may indicate that the Windows ROCm runtime driver has dropped support for gfx803 cards.
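
The matching described above can be sketched roughly as follows. This is only an illustration, not how Antares or the ROCm runtime actually implement the check: the helper names (parse_amdgpu_targets, base_arch, covers_device) are hypothetical, and the colon-separated suffix form (gfx906:xnack-) is an assumption about how a feature suffix would be written. A True result is necessary but not sufficient, since the driver itself may have dropped support for an architecture.

```python
import shlex
from typing import List

PREFIX = "--amdgpu-target="

def parse_amdgpu_targets(command: str) -> List[str]:
    """Collect the value of every --amdgpu-target=<arch> flag in a hipcc command line."""
    return [tok[len(PREFIX):] for tok in shlex.split(command) if tok.startswith(PREFIX)]

def base_arch(target: str) -> str:
    """Keep only the gfx name, dropping an assumed ':feature' suffix such as ':xnack-'."""
    return target.split(":", 1)[0]

def covers_device(targets: List[str], device_arch: str) -> bool:
    """True if some compiled target shares the device's base gfx name.

    Simplification: feature suffixes are ignored here, while the real
    runtime may take them into account when selecting a code object.
    """
    return base_arch(device_arch) in {base_arch(t) for t in targets}

# The command from the comment above with --amdgpu-target=gfx906 removed.
cmd = (
    "/opt/rocm/bin/hipcc .antares-module-tempfile.cu "
    "--amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx908 "
    "--amdgpu-target=gfx1010 --genco -Wno-ignored-attributes -O2 "
    "-o .antares-module-tempfile.cu.out"
)
targets = parse_amdgpu_targets(cmd)
print(targets)
print(covers_device(targets, "gfx906"))   # Radeon VII: no matching target in this list
print(covers_device(targets, "gfx1010"))  # Navi 10: a matching target is present
```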

@Column01
Author

I'm running an RX580, no idea what that one is in this naming scheme

@ghostplant
Contributor

@Column01 Just found the news: https://www.videogames.ai/2021/01/07/RX580-ROCM-40.html It seems that ROCm drivers >= 4.0 no longer support gfx803, and the same applies to Linux ROCm. If you still want to use the card for acceleration, consider "Linux + ROCm < 4.0" or "Windows + DirectX12 (over BACKEND=c-hlsl_win64)"
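
As a rough summary of the suggestion above, the choice of backend could be sketched like this. The table only covers the cards named in this thread and is not an official or complete mapping; the function name suggest_backend and the returned strings are hypothetical, while the backend identifiers c-rocm_win64 and c-hlsl_win64 are the ones used in this issue.

```python
# Card-to-target pairs as discussed in this thread; not an exhaustive table.
GFX_TARGET_BY_CARD = {
    "RX580": "gfx803",
    "RADEONVII": "gfx906",
    "NAVI10": "gfx1010",
}

# Targets reported in this thread as no longer supported by ROCm >= 4.0.
DROPPED_TARGETS = {"gfx803"}

def suggest_backend(card: str) -> str:
    """Return a backend hint for a card name, ignoring spaces and letter case."""
    key = card.replace(" ", "").upper()
    target = GFX_TARGET_BY_CARD.get(key)
    if target is None:
        return "unknown card: look up its gfx target first"
    if target in DROPPED_TARGETS:
        return "c-hlsl_win64 (DirectX12), or Linux with ROCm < 4.0"
    return "c-rocm_win64"

for name in ("RX580", "Radeon VII", "Navi 10"):
    print(name, "->", suggest_backend(name))
```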

@Column01
Author

Ugh, this is extremely frustrating. AMD keeps making stupid decisions like this and pissing consumers off. I wholeheartedly regret buying my AMD card.

@ghostplant
Contributor

ghostplant commented Aug 12, 2021

@Column01 Windows AMDGPU with the ROCm runtime was first supported around the end of 2020, and ROCm 4.0 was released after that? Maybe an older version of the Windows AMD driver still supports gfx803? But it is indeed annoying.

Fortunately, the DirectX12 runtime can always use the RX580 for acceleration, and I think its performance is not far from that of the ROCm runtime.

@Column01
Author

I will try 4.0 tomorrow and see if it works.

@Column01
Author

Tried older ROCm versions, no dice (3.9, 4.0, and 4.1 all give a longer error about /dev/kfd not existing when running rocminfo).

So the repository for ROCm lists gfx8 GPUs as compatible but full support is not guaranteed. I think this is more likely due to its being ROCm inside WSL2 and not WSL1. I'm going to dual boot Linux and run my workflows in there and hopefully, ROCm will work properly there...
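
For anyone hitting the same wall, a quick check of the two symptoms mentioned in this thread might look like the sketch below. It assumes that a WSL kernel can be recognized by "microsoft" in its version string (as in the uname output earlier in this issue) and that the Linux ROCm stack needs the /dev/kfd device node; the function names are hypothetical, and this is not a replacement for rocminfo.

```python
import os

def read_kernel_version(path: str = "/proc/version") -> str:
    """Return the kernel version string, or an empty string if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

def is_wsl(version_text: str) -> bool:
    """Heuristic: WSL kernels carry 'microsoft' in their version string."""
    return "microsoft" in version_text.lower()

def kfd_available(path: str = "/dev/kfd") -> bool:
    """True if the device node that rocminfo complained about exists."""
    return os.path.exists(path)

version = read_kernel_version()
print("Running under WSL:", is_wsl(version))
print("/dev/kfd present:", kfd_available())
```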

@Column01
Author

After lots of tearing my hair out, I gave up. This is not an antares issue, it's an "AMD being stupid" issue. GFX803 is not supported and just straight-up broke at some point. RIP AMD consumer cards for computing...
