liquidctl status hangs every once in a while on an (up-to-date) Commander Core #505

Closed
intelfx opened this issue Sep 18, 2022 · 24 comments · Fixed by #513
Labels
bug (Apparent bug in liquidctl)

Comments

@intelfx
Contributor

intelfx commented Sep 18, 2022

Description of the problem

I'm trying to use liquidctl (after applying #501) on a Commander Core device I have (as part of the H100i ELITE CAPELLIX CLC).
If I run liquidctl status in a loop each second, eventually an invocation hangs indefinitely and never exits.

According to strace, it hangs on a read() from a /dev/hidrawN node corresponding to the Commander Core.

Steps to reproduce

  • watch -n1 liquidctl status
  • expected behavior: no hangs
  • observed behavior: within 5-10 minutes an invocation of liquidctl would hang indefinitely

Logs

liquidctl status --debug output from an invocation reproducing the issue:

Additional information about the system

  • Affected device(s): Corsair Commander Core
  • Operating system and version: Arch Linux
  • Liquidctl installation method: liquidctl-git AUR package
  • Output of liquidctl --version: liquidctl v1.11.0.dev41+ga2b4391 (Linux-5.19.9-arch1pf4-1-x86_64-with-glibc2.36)
@intelfx added the bug (Apparent bug in liquidctl) label Sep 18, 2022
@jonasmalacofilho
Member

Cc: @Meriemi-git @ParkerMc

intelfx added a commit to intelfx/liquidctl that referenced this issue Sep 18, 2022
Sometimes a _CMD_OPEN_ENDPOINT command fails to produce a ready (00:)
response, blocking after several non-ready (f0:) responses instead.
If this happens, just retry the command and hope for the best...

TODO: only do this with commands known to be idempotent

(Hopefully) Fixes liquidctl#505.
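
In rough terms, the retry approach amounts to the following (a minimal sketch around a hypothetical send_open_endpoint() helper that returns the first byte of the device's response; not the actual patch):

def open_endpoint_with_retry(send_open_endpoint, max_retries=3):
    # Issue _CMD_OPEN_ENDPOINT until the device reports ready (0x00).
    for _ in range(max_retries):
        status = send_open_endpoint()
        if status == 0x00:  # ready response
            return
        # non-ready (0xf0): retry and hope for the best
    raise RuntimeError('device never reported ready')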
@jonasmalacofilho
Member

jonasmalacofilho commented Sep 18, 2022

Looking at your WIP proposal for a fix, it occurred to me: are you perhaps running some other instance of liquidctl (accessing that same device) while your watch -n1 liquidctl status loop is running?

Because what your fix effectively does is assume that the device will sometimes simply forget to reply (successfully) to _CMD_OPEN_ENDPOINT... but what could also happen, I think, is that the device did respond, but some other driver instance discarded the message by calling clear_enqueued_reports().

(Specifically, IIRC, the report queue is global and, on Linux/hidraw, shared by all readers).
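
For illustration, such a routine roughly amounts to draining the queue with non-blocking reads (a sketch using cython-hidapi, not liquidctl's actual implementation):

import hid  # cython-hidapi

def drain_pending_reports(dev, report_length):
    # Non-blocking reads until the queue is empty; anything already
    # enqueued, including a reply another instance is still waiting for,
    # gets thrown away.
    while dev.read(report_length, timeout_ms=0):
        pass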

@intelfx
Contributor Author

intelfx commented Sep 18, 2022

Looking at your WIP proposal for a fix, it occurred to me: are you perhaps running some other instance of liquidctl (accessing that same device) while your watch -n1 liquidctl status loop is running?

No. Even if I did, on my system liquidctl is invoked via a custom wrapper for mutual exclusion (since the #298 discussion), so this should not be happening.

It does look suspicious, though. I'm not sure how I could verify that no one else is accessing the device concurrently. Any ideas?

@jonasmalacofilho
Member

No. Even if I did, on my system liquidctl is invoked via a custom wrapper for mutual exclusion (since the #298 discussion), so this should not be happening.

Oh, right... by the way, to which file do you attach the lock for your wrapper?

It does look suspicious, though. I'm not sure how I could verify that no one else is accessing the device concurrently. Any ideas?

Hm, maybe lsof is enough? Or globally strace'ing (then filtering) the relevant syscalls?

Apart from those, there's the kernel's tracing functionality...
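
For example, assuming the device is at /dev/hidraw7 (as later in this thread), a quick check for other readers would be:

$ lsof /dev/hidraw7

(Though since liquidctl only keeps the node open briefly, lsof would probably have to run in a loop to catch anything.)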

@jonasmalacofilho
Member

Or globally strace'ing (then filtering) the relevant syscalls?

Hm, re-reading the man page for strace suggests that functionality doesn't actually exist.

I still vaguely remember reading about some way to globally (and easily) trace all syscalls on a system...

@intelfx
Contributor Author

intelfx commented Sep 18, 2022

Oh, right... by the way, to which file do you attach the lock for your wrapper?

It doesn't really matter, so I just do it on the liquidctl executable itself:

$ cat /usr/local/bin/liquidctl
#!/bin/sh
# Serialize all invocations by taking a lock on the binary itself;
# -F (--no-fork) makes flock exec the command without forking.
exec flock -F /usr/bin/liquidctl /usr/bin/liquidctl "$@"

I still vaguely remember reading about some way to globally (and easily) trace all syscalls on a system...

There's bpftrace (and friends) but I really hoped to avoid dealing with it :-)

@jonasmalacofilho
Member

It doesn't really matter, so I just do it on the liquidctl executable itself:

Yeah, as long as every instance agrees, it really doesn't matter.

By the way, are you sure /usr/local/bin/liquidctl is what's being called everywhere? Presumably you could have scripts running with different environments, where /usr/local/bin isn't in the PATH.

There's bpftrace (and friends) but I really hoped to avoid dealing with it :-)

Maybe that's what I was thinking of...

Anyway, if I'm reading the docs correctly, the fanotify APIs might have enough information (i.e. the PID that triggered the event), but that isn't currently reported by the inotifywa* CLI tools.

However, since you know the expected interval between open calls, maybe just checking for accesses that don't match that cadence, with

$ inotifywait -m --format '%T %w %e' --timefmt '%F %T' /dev/hidraw?

would be enough to at least see if further investigation is even necessary?

(Since the time resolution is rather limited, it might be good to increase the watch loop interval).

@intelfx
Contributor Author

intelfx commented Sep 18, 2022

By the way, are you sure /usr/local/bin/liquidctl is what's being called everywhere? Presumably you could have scripts running with different environments, where /usr/local/bin isn't in the PATH.

Yes, pretty sure, $PATH is set globally via systemd:

# grep -z '^PATH=' /proc/*/environ | tr '\0' '\n' | grep -v '/usr/local/bin'
/proc/154661/environ:PATH=/bin:/usr/bin:/sbin:/usr/sbin  # this single process belongs to a container and is irrelevant

Maybe that's what I was thinking of...

I have confirmed with bpftrace that there are indeed no other processes attempting to access the hidraw node.

Terminal 1:

# FOOBAR=1 watch -n1 liquidctl status

Terminal 2:

# bpftrace -B line -e 'tracepoint:syscalls:sys_enter_open, tracepoint:syscalls:sys_enter_openat /str(args->filename) == "/dev/hidraw7"/ { printf("OUT: %d %s %s\n", pid, comm, str(args->filename)); }' | grep -E 'OUT:' --line-buffered | while read _ pid comm fn; do echo "$(grep -z 'FOOBAR=' /proc/$pid/environ) $pid $comm $fn"; done
stdin:1:95-97: WARNING: Addrspace mismatch                                                                                                                                                                                                                     
tracepoint:syscalls:sys_enter_open, tracepoint:syscalls:sys_enter_openat /str(args->filename) == "/dev/hidraw7"/ { printf("OUT %d %s %s\n", pid, comm, str(args->filename)); }                                                                                 
                                                                                              ~~                                                                                                                                                               

FOOBAR=1 <pid> liquidctl /dev/hidraw7
<many lines, each tagged with FOOBAR=1>
FOOBAR=1 <pid> liquidctl /dev/hidraw7
<hang, no more lines>

@Meriemi-git
Contributor

I tried to reproduce your problem: I installed the latest liquidctl-git version (via yay) and I'm on Arch too.
For me there is no hang when looping on status:
watch -n1 liquidctl status
Every second my AIO blinks and the status is updated.

@intelfx
Contributor Author

intelfx commented Sep 19, 2022

It does not necessarily happen right away. Might be tens of minutes. Try to leave it running for some time?

Otherwise I'm not sure what the problem is then. @jonasmalacofilho suggested I might have something concurrently accessing the hidraw node and breaking the protocol, but I have verified that it's not the case (at least in userspace). Any other ideas?

@Grooty12

Happens to me as well, except it happens immediately; I don't have to wait.

@Meriemi-git
Contributor

Meriemi-git commented Sep 19, 2022

After waiting a few minutes it's stuck for me too.

@ParkerMc
Contributor

Would any of y'all mind getting a capture in Wireshark of this issue happening?

@intelfx

This comment was marked as outdated.

@Meriemi-git
Contributor

Meriemi-git commented Sep 19, 2022

I posted a Wireshark capture file; the freeze appears at the end, for a few seconds, before I stopped the capture.
I renamed it to .txt to be able to upload it; the only filters I applied were idVendor, idProduct and device_address, as I read in the documentation.
commander_core_status_freeze.pcapng.txt

@ParkerMc
Contributor

I posted a Wireshark capture file; the freeze appears at the end, for a few seconds, before I stopped the capture.

Looking at that, it appears to freeze on the read command (0x08).
@Meriemi-git Do you mind doing the same thing but running liquidctl with --debug and uploading the log and capture?

@Meriemi-git
Contributor

Yes, I will do it today.

jonasmalacofilho added a commit that referenced this issue Sep 23, 2022
Fixes: #488
Related: #497, partially: hid.device.write doesn't support a timeout
Related: #505
Related: #506
@Meriemi-git
Contributor

Meriemi-git commented Sep 27, 2022

Sorry for the delay; it took me a long time (between one and two hours) to reproduce the hang with the Wireshark capture and the --debug option.
To capture the full log I first tried tee with the watch -n 1 command, but I wasn't able to write the entire debug output, so I wrote a simple bash script to reproduce the watch behavior:

#!/bin/bash
# Run `liquidctl status --debug` once per second, appending all output
# (stdout and stderr) to a log file, with a rough elapsed counter.
i=0
while true
do
    liquidctl status --debug >> liquidctl_freeze.log 2>&1
    sleep 1
    ((i=i+1))
    echo "Duration : $i sec"
done

I had to edit the log and capture files to be able to upload them.
liquidctl_freeze_.log
liquidctl_freeze.pcapng.txt

@ParkerMc
Contributor

Thank you for doing that.
Try setting _REPORT_LENGTH and _RESPONSE_LENGTH to 96 and see if that fixes it.
https://github.com/liquidctl/liquidctl/blob/main/liquidctl/driver/commander_core.py#L20-L21
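
For anyone following along, the suggested change amounts to (a sketch; the current values are at the link above, and the final patch is what became #513):

# liquidctl/driver/commander_core.py
_REPORT_LENGTH = 96
_RESPONSE_LENGTH = 96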

@Meriemi-git
Contributor

Tested for 5 hours without any freeze, but usually the freeze can happen only after several hours, so @intelfx, can you test it?

@Meriemi-git mentioned this issue Oct 3, 2022
@Meriemi-git
Contributor

Here is the pull request (#513) to apply @ParkerMc's fix.

@jonasmalacofilho
Member

@intelfx, can you please take a look at #513 and see if that also fixes things on your system?

@intelfx
Contributor Author

intelfx commented Oct 3, 2022

Apologies for the delay. Testing now.

@intelfx
Contributor Author

intelfx commented Oct 4, 2022

Yep, it looks like #513 works for me.
