Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clpeak and llama.cpp stuck at 100% CPU on 6.8.5 kernel #726

Closed
notsyncing opened this issue Apr 21, 2024 · 14 comments
Closed

clpeak and llama.cpp stuck at 100% CPU on 6.8.5 kernel #726

notsyncing opened this issue Apr 21, 2024 · 14 comments

Comments

@notsyncing
Copy link

As #710, @Disty0 writes:

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.
It detects and tries to run on the GPU but gets stuck with 100% single CPU core usage.
Happens on any OpenCL or SYCL app. (Kernel 6.8 is using the workaround provided in this thread.)

You can downgrade to Linux 6.8.4 for Arch Linux with these packages:
linux 6.8.4: https://archive.archlinux.org/packages/l/linux/linux-6.8.4.arch1-1-x86_64.pkg.tar.zst
linux-headers 6.8.4: https://archive.archlinux.org/packages/l/linux-headers/linux-headers-6.8.4.arch1-1-x86_64.pkg.tar.zst

This happens to me as well on both llama.cpp and clpeak.

clpeak output:

$ NEOReadDebugKeys=1 PrintDebugMessages=1 LogWaitingForCompletion=1 EventsDebugEnable=1 PrintKmdTimes=1 LogZEInfo=1 clpeak
WARNING: Failed to request OCL Turbo Boost
computeUnitsUsedForScratch: 4096
hwInfo: {512, 4096}: (16, 1, 32)

Platform: Intel(R) OpenCL Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 24.09.28717.17 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz
DeviceBinaryFormat::zebin : Unhandled SHT_NOTE section : .note.intelgt.metrics currently supports only : .note.intelgt.compat.
DeviceBinaryFormat::zebin::.ze_info : Minor version : 40 is newer than available in decoder : 39 - some features may be skipped


    Global memory bandwidth (GBPS)

Waiting for task count 0 at location 0x7fb1fbd65000 with timeout 0. Current value: 0

Waiting completed. Current value: 0

Waiting for task count 1 at location 0x7fb1fbd5f000 with timeout 0. Current value: 0

Waiting completed. Current value: 1
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
      float   : DIM:1   GWS:(33550336, 1, 1)    ELWS:(256, 1, 1)        Offset:(0, 0, 0)        AGWS:(33550336, 1, 1) LWS:(256, 1, 1) TWGS:(131056, 1, 1)     NWGS:(131056, 1, 1)     SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.
DIM:1   GWS:(33550336, 1, 1)    ELWS:(256, 1, 1)        Offset:(0, 0, 0)        AGWS:(33550336, 1, 1)   LWS:(256, 1, 1)       TWGS:(131056, 1, 1)     NWGS:(131056, 1, 1)     SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.

Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 96. Current value: 0

Waiting completed. Current value: 0

Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 0. Current value: 0

Then it stuck here and clpeak process consumes one cpu core (100% usage).

perf record -a when it stuck reports:

Samples: 537K of event 'cycles:P', Event count (approx.): 456413846417
Overhead       Samples  Command          Shared Object                                   Symbol
   9.08%         36420  clpeak           [kernel.kallsyms]                               [k] clear_bhb_loop                                                                                                                                ◆
   5.43%         21754  clpeak           [kernel.kallsyms]                               [k] __schedule                                                                                                                                    ▒
   4.73%         18984  clpeak           libc.so.6                                       [.] __sched_yield                                                                                                                                 ▒
   4.65%         18636  clpeak           [kernel.kallsyms]                               [k] _raw_spin_lock                                                                                                                                ▒
   4.64%         18584  clpeak           [vdso]                                          [.] __vdso_clock_gettime                                                                                                                          ▒
   4.61%         18482  clpeak           [kernel.kallsyms]                               [k] native_sched_clock                                                                                                                            ▒
   4.48%         17981  clpeak           [kernel.kallsyms]                               [k] psi_account_irqtime                                                                                                                           ▒
   4.01%         16069  clpeak           [kernel.kallsyms]                               [k] update_curr                                                                                                                                   ▒
   3.71%         14894  clpeak           [kernel.kallsyms]                               [k] syscall_exit_to_user_mode                                                                                                                     ▒
   3.69%         14808  clpeak           [kernel.kallsyms]                               [k] __calc_delta.constprop.0                                                                                                                      ▒
   3.66%         14687  clpeak           [kernel.kallsyms]                               [k] entry_SYSRETQ_unsafe_stack                                                                                                                    ▒
   3.50%         14043  clpeak           [kernel.kallsyms]                               [k] pick_next_task_fair                                                                                                                           ▒
   3.23%         12965  clpeak           libigdrcl.so                                    [.] 0x000000000005f724                                                                                                                            ▒
   2.76%         11064  clpeak           [kernel.kallsyms]                               [k] pick_eevdf                                                                                                                                    ▒
   2.61%         10453  clpeak           [kernel.kallsyms]                               [k] do_syscall_64                                                                                                                                 ▒
   2.40%          9623  clpeak           [kernel.kallsyms]                               [k] entry_SYSCALL_64                                                                                                                              ▒
   2.25%          9021  clpeak           [kernel.kallsyms]                               [k] update_min_vruntime                                                                                                                           ▒
   1.83%          7352  clpeak           [kernel.kallsyms]                               [k] update_curr_se                                                                                                                                ▒
   1.78%          7126  clpeak           [kernel.kallsyms]                               [k] entry_SYSCALL_64_after_hwframe                                                                                                                ▒
   1.56%          6269  clpeak           [kernel.kallsyms]                               [k] __cgroup_account_cputime                                                                                                                      ▒
   1.52%          6079  clpeak           [kernel.kallsyms]                               [k] record_times                                                                                                                                  ▒
   1.51%          6064  clpeak           [kernel.kallsyms]                               [k] update_rq_clock                                                                                                                               ▒
   1.49%          5981  clpeak           [kernel.kallsyms]                               [k] do_sched_yield                                                                                                                                ▒
   1.11%          4453  clpeak           [kernel.kallsyms]                               [k] rcu_note_context_switch                                                                                                                       ▒
   1.01%          4037  clpeak           libigdrcl.so                                    [.] 0x000000000005f726                                                                                                                            ▒
   0.91%          3651  clpeak           [kernel.kallsyms]                               [k] yield_task_fair                                                                                                                               ▒
   0.83%          3346  clpeak           [kernel.kallsyms]                               [k] entry_SYSCALL_64_safe_stack                                                                                                                   ▒
   0.82%          3285  clpeak           [kernel.kallsyms]                               [k] raw_spin_rq_lock_nested                                                                                                                       ▒
   0.78%          3113  clpeak           [kernel.kallsyms]                               [k] syscall_return_via_sysret                                                                                                                     ▒
   0.75%          3008  clpeak           [kernel.kallsyms]                               [k] schedule                                                                                                                                      ▒
   0.72%         64724  swapper          [kernel.kallsyms]                               [k] intel_idle                                                                                                                                    ▒
   0.70%          2792  clpeak           libc.so.6                                       [.] clock_gettime@@GLIBC_2.17                                                                                                                     ▒
   0.58%          2338  clpeak           [kernel.kallsyms]                               [k] cpuacct_charge                                                                                                                                ▒
   0.53%          2107  clpeak           [kernel.kallsyms]                               [k] sched_clock                                                                                                                                   ▒
   0.51%          2060  clpeak           [kernel.kallsyms]                               [k] sched_clock_cpu                                                                                                                               ▒
   0.45%          1805  clpeak           libstdc++.so.6.0.33                             [.] std::chrono::_V2::system_clock::now()                                                                                                         ▒
   0.42%          1690  clpeak           [kernel.kallsyms]                               [k] _raw_spin_unlock                                                                                                                              ▒
   0.39%          1567  clpeak           libigdrcl.so                                    [.] 0x0000000000534467                                                                                                                            ▒
   0.39%          1563  clpeak           [kernel.kallsyms]                               [k] x64_sys_call                                                                                                                                  ▒
   0.37%          1469  clpeak           [kernel.kallsyms]                               [k] __list_add_valid_or_report                                                                                                                    ▒
   0.35%          1418  clpeak           [kernel.kallsyms]                               [k] check_cfs_rq_runtime                                                                                                                          ▒
   0.35%          1402  clpeak           [kernel.kallsyms]                               [k] pick_next_entity                                                                                                                              ▒
   0.34%          1368  clpeak           [kernel.kallsyms]                               [k] cgroup_rstat_updated                                                                                                                          ▒
   0.34%          1364  clpeak           [kernel.kallsyms]                               [k] __list_del_entry_valid_or_report                                                                                                              ▒
   0.27%          1098  clpeak           [kernel.kallsyms]                               [k] syscall_exit_to_user_mode_prepare                                                                                                             ▒
   0.23%          1856  pgrep            libc.so.6                                       [.] __strncpy_evex                                                                                                                                ▒
   0.22%           882  clpeak           libigdrcl.so                                    [.] 0x000000000056e33e                                                                                                                            ▒
   0.19%           775  clpeak           [kernel.kallsyms]                               [k] __x64_sys_sched_yield                 

System information:

kernel 6.8.5-301.fc40.x86_64
Fedora Kinoite 40.20240419.n.0
intel-compute-runtime 24.09.28717.17-1.fc40.x86_64 (this also happens on latest 24.13.29138.7, installed in a ubuntu 22.04 container)

CPU: Intel Core i9-10940X
GPU: Intel Arc A770 16GB
@nyanmisaka
Copy link
Contributor

There's a i915 kernel driver regression on CCS (Compute Command Streamer) engine of DG2/Arc GPU.

https://gitlab.freedesktop.org/drm/intel/-/issues/10895

The faulty kernel commit in 6.6.26+ and 6.8.5+ is https://lore.kernel.org/stable/20240327155622.538140-4-andi.shyti@linux.intel.com/T/ And both Media-Driver and OpenCL are affected.

@JablonskiMateusz Could you please let the kernel developer aware of this regression?

@Disty0
Copy link

Disty0 commented Apr 21, 2024

This one seems to be happening with any kernel that has Spectre BHI fix in it.
(Not sure if it is related or not and disabling it doesn't change anything.)

Happens with LTS kernels as well.
Happens with compute runtime 24.13 as well.

Kernels with this issue:

  • 6.8.5 or newer
  • 6.6.26 LTS or newer

Kernels that works fine:

  • 6.8.4 or older
  • 6.6.25 LTS or older

Haven't tried kernels older than 6.6 LTS.

OS: Arch Linux
Compute Runtime: aur/intel-compute-runtime-bin 24.13.29138.7-1

CPU: AMD Ryzen 7 5800X3D
GPU: Intel ARC A770 16GB

@gnattu
Copy link

gnattu commented Apr 21, 2024

To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly

If this kernel fixes the issue, then it's likely related to the CCS change as well.

@Disty0
Copy link

Disty0 commented Apr 22, 2024

To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly

If this kernel fixes the issue, then it's likely related to the CCS change as well.

That custom kernel works fine. So it is CCS related.
Spectre BHI is just coincidence i guess.

@gnattu
Copy link

gnattu commented Apr 22, 2024

An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4

Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not.

@Disty0
Copy link

Disty0 commented Apr 22, 2024

An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4

Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not.

This one doesn't work. But it has "gray / io-wait" CPU usage instead of "red / kernel" CPU usage on htop.

@gnattu
Copy link

gnattu commented Apr 26, 2024

Just got the latest patch from Intel developers:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset

Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

@Disty0
Copy link

Disty0 commented Apr 27, 2024

Just got the latest patch from Intel developers:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset

Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

@simonlui
Copy link

simonlui commented May 2, 2024

I can confirm that the applications no longer hang and got fixed with a kernel built with the new patches on drm-intel-gt-next but compute performance seems to be lowered on my system as well. Not sure if there is something going on with the compute runtime and these new changes or if it is out of scope for this discussion here.

@gnattu
Copy link

gnattu commented May 19, 2024

Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

Can you try this kernel build to see if the performance improves?

https://github.com/gnattu/linux/releases/tag/6.9.1-jelly

@Disty0
Copy link

Disty0 commented May 19, 2024

Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

Can you try this kernel build to see if the performance improves?

https://github.com/gnattu/linux/releases/tag/6.9.1-jelly

This works fine and performance is pretty much the same as 6.8.4.
Note: I am using a PCI-E 4.0 x4 port with a PCI-E riser so can't really say much about memcpy.

6.8.4 / 6.9.1-custom:

image

@murl-digital
Copy link

Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

Can you try this kernel build to see if the performance improves?

https://github.com/gnattu/linux/releases/tag/6.9.1-jelly

Can confirm that this is an improvement. I don't know how to run a benchmark to check memcopy but just comparing to a pretty intense blender file I have lying around, there's a noticeable improvement in the viewport

@notsyncing
Copy link
Author

Now, with kernel 6.8.10 in fedora 40, it works, though performance is not great. I'm closing this. Thanks for everyone!

@simonlui
Copy link

simonlui commented Jun 5, 2024

Now, with kernel 6.8.10 in fedora 40, it works, though performance is not great. I'm closing this. Thanks for everyone!

@notsyncing Full performance needs the patch mentioned above, drm/i915/gt: Fix CCS id's calculation for CCS mode setting but it has only been merged in kernel 6.10-rc2 so it will take a while to get to Fedora.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants