liquidctl status hangs every once in a while on an (up-to-date) Commander Core #505

Closed
intelfx opened this issue Sep 18, 2022 · 24 comments · Fixed by #513
Labels
bug (Apparent bug in liquidctl)

Comments

@intelfx
Contributor

intelfx commented Sep 18, 2022

Description of the problem

I'm trying to use liquidctl (after applying #501) on a Commander Core device I have (as part of the H100i ELITE CAPELLIX CLC).
If I run liquidctl status in a loop each second, eventually an invocation hangs indefinitely and never exits.

According to strace, it hangs on a read() from a /dev/hidrawN node corresponding to the Commander Core.

Steps to reproduce

  • watch -n1 liquidctl status
  • expected behavior: no hangs
  • observed behavior: within 5-10 minutes an invocation of liquidctl would hang indefinitely

Logs

liquidctl status --debug output from an invocation reproducing the issue:

Additional information about the system

  • Affected device(s): Corsair Commander Core
  • Operating system and version: Arch Linux
  • Liquidctl installation method: liquidctl-git AUR package
  • Output of liquidctl --version: liquidctl v1.11.0.dev41+ga2b4391 (Linux-5.19.9-arch1pf4-1-x86_64-with-glibc2.36)
@intelfx added the bug (Apparent bug in liquidctl) label Sep 18, 2022
@jonasmalacofilho
Member

Cc: @Meriemi-git @ParkerMc

intelfx added a commit to intelfx/liquidctl that referenced this issue Sep 18, 2022
Sometimes a _CMD_OPEN_ENDPOINT command fails to produce a ready (00:)
response, blocking after several non-ready (f0:) responses instead.
If this happens, just retry the command and hope for the best...

TODO: only do this with commands known to be idempotent

(Hopefully) Fixes liquidctl#505.
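
In rough terms, the retry approach amounts to the following (a minimal sketch around a hypothetical send_open_endpoint() helper that returns the first byte of the device's response; not the actual patch):

def open_endpoint_with_retry(send_open_endpoint, max_retries=3):
    # Issue _CMD_OPEN_ENDPOINT until the device reports ready (0x00).
    for _ in range(max_retries):
        status = send_open_endpoint()
        if status == 0x00:  # ready response
            return
        # non-ready (0xf0): retry and hope for the best
    raise RuntimeError('device never reported ready')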
@jonasmalacofilho
Member

jonasmalacofilho commented Sep 18, 2022

Looking at your WIP proposal for a fix, it occurred to me: are you perhaps running some other instance of liquidctl (accessing that same device) while your watch -n1 liquidctl status loop is running?

Because what your fix effectively does is assume that the device will sometimes simply forget to reply (successfully) to _CMD_OPEN_ENDPOINT... but what could also happen, I think, is that the device did respond, but some other driver instance discarded the message by calling clear_enqueued_reports().

(Specifically, IIRC, the report queue is global and, on Linux/hidraw, shared by all readers).
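
For illustration, such a routine roughly amounts to draining the queue with non-blocking reads (a sketch using cython-hidapi, not liquidctl's actual implementation):

import hid  # cython-hidapi

def drain_pending_reports(dev, report_length):
    # Non-blocking reads until the queue is empty; anything already
    # enqueued, including a reply another instance is still waiting for,
    # gets thrown away.
    while dev.read(report_length, timeout_ms=0):
        pass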

@intelfx
Contributor Author

intelfx commented Sep 18, 2022

Looking at your WIP proposal for a fix, it occurred to me: are you perhaps running some other instance of liquidctl (accessing that same device) while your watch -n1 liquidctl status loop is running?

No. Even if I did, on my system liquidctl is invoked via a custom wrapper for mutual exclusion (since the #298 discussion), so this should not be happening.

It does look suspicious, though. I'm not sure how I could verify that no one else is accessing the device concurrently. Any ideas?

@jonasmalacofilho
Member

No. Even if I did, on my system liquidctl is invoked via a custom wrapper for mutual exclusion (since the #298 discussion), so this should not be happening.

Oh, right... by the way, to which file do you attach the lock for your wrapper?

It does look suspicious, though. I'm not sure how I could verify that no one else is accessing the device concurrently. Any ideas?

Hm, maybe lsof is enough? Or globally strace'ing (then filtering) the relevant syscalls?

Apart from those, there's the kernel's tracing functionality...
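
For example, assuming the device is at /dev/hidraw7 (as later in this thread), a quick check for other readers would be:

$ lsof /dev/hidraw7

(Though since liquidctl only keeps the node open briefly, lsof would probably have to run in a loop to catch anything.)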

@jonasmalacofilho
Member

Or globally strace'ing (then filtering) the relevant syscalls?

Hm, re-reading the man page for strace suggests that functionality doesn't actually exist.

I still vaguely remember reading about some way to globally (and easily) trace all syscalls on a system...

@intelfx
Contributor Author

intelfx commented Sep 18, 2022

Oh, right... by the way, to which file do you attach the lock for your wrapper?

It doesn't really matter, so I just do it on the liquidctl executable itself:

$ cat /usr/local/bin/liquidctl
#!/bin/sh
# Serialize all invocations by taking a lock on the binary itself;
# -F (--no-fork) makes flock exec the command without forking.
exec flock -F /usr/bin/liquidctl /usr/bin/liquidctl "$@"

I still vaguely remember reading about some way to globally (and easily) trace all syscalls on a system...

There's bpftrace (and friends) but I really hoped to avoid dealing with it :-)

@jonasmalacofilho
Member

It doesn't really matter, so I just do it on the liquidctl executable itself:

Yeah, as long as every instance agrees, it really doesn't matter.

By the way, are you sure /usr/local/bin/liquidctl is what's being called everywhere? Presumably you could have scripts running with different environments, where /usr/local/bin isn't in the PATH.

There's bpftrace (and friends) but I really hoped to avoid dealing with it :-)

Maybe that's what I was thinking of...

Anyway, if I'm reading the docs correctly, the fanotify APIs might have enough information (i.e. the PID that triggered the event), but that isn't currently reported by the inotifywa* CLI tools.

However, since you know the expected interval between open calls, maybe just checking for accesses that don't match that cadence, with

$ inotifywait -m --format '%T %w %e' --timefmt '%F %T' /dev/hidraw?

would be enough to at least see if further investigation is even necessary?

(Since the time resolution is rather limited, it might be good to increase the watch loop interval).

@intelfx
Contributor Author

intelfx commented Sep 18, 2022

By the way, are you sure /usr/local/bin/liquidctl is what's being called everywhere? Presumably you could have scripts running with different environments, where /usr/local/bin isn't in the PATH.

Yes, pretty sure, $PATH is set globally via systemd:

# grep -z '^PATH=' /proc/*/environ | tr '\0' '\n' | grep -v '/usr/local/bin'
/proc/154661/environ:PATH=/bin:/usr/bin:/sbin:/usr/sbin  # this single process belongs to a container and is irrelevant

Maybe that's what I was thinking of...

I have confirmed with bpftrace that there are indeed no other processes attempting to access the hidraw node.

Terminal 1:

# FOOBAR=1 watch -n1 liquidctl status

Terminal 2:

# bpftrace -B line -e 'tracepoint:syscalls:sys_enter_open, tracepoint:syscalls:sys_enter_openat /str(args->filename) == "/dev/hidraw7"/ { printf("OUT: %d %s %s\n", pid, comm, str(args->filename)); }' | grep -E 'OUT:' --line-buffered | while read _ pid comm fn; do echo "$(grep -z 'FOOBAR=' /proc/$pid/environ) $pid $comm $fn"; done
stdin:1:95-97: WARNING: Addrspace mismatch                                                                                                                                                                                                                     
tracepoint:syscalls:sys_enter_open, tracepoint:syscalls:sys_enter_openat /str(args->filename) == "/dev/hidraw7"/ { printf("OUT %d %s %s\n", pid, comm, str(args->filename)); }                                                                                 
                                                                                              ~~                                                                                                                                                               

FOOBAR=1 <pid> liquidctl /dev/hidraw7
<many lines, each tagged with FOOBAR=1>
FOOBAR=1 <pid> liquidctl /dev/hidraw7
<hang, no more lines>

@Meriemi-git
Contributor

I tried to reproduce your problem: I installed the latest liquidctl-git version (via yay) and I'm on Arch too.
For me there is no hang when looping on status:
watch -n1 liquidctl status
Every second my AIO blinks and the status is updated.

@intelfx
Contributor Author

intelfx commented Sep 19, 2022

It does not necessarily happen right away. Might be tens of minutes. Try to leave it running for some time?

Otherwise I'm not sure what the problem is then. @jonasmalacofilho suggested I might have something concurrently accessing the hidraw node and breaking the protocol, but I have verified that it's not the case (at least in userspace). Any other ideas?

@Grooty12

Happens to me as well, except it happens immediately; I don't have to wait.

@Meriemi-git
Contributor

Meriemi-git commented Sep 19, 2022

After waiting a few minutes it's stuck for me too.

@ParkerMc
Contributor

Would any of y'all mind getting a capture in Wireshark of this issue happening?

@intelfx

This comment was marked as outdated.

@Meriemi-git
Contributor

Meriemi-git commented Sep 19, 2022

I posted a Wireshark capture file; the freeze appears at the end, for a few seconds, before I stopped the capture.
I renamed it to .txt to be able to upload it; the only filters I applied were idVendor, idProduct and device_address, as I read in the documentation.
commander_core_status_freeze.pcapng.txt

@ParkerMc
Contributor

I posted a Wireshark capture file; the freeze appears at the end, for a few seconds, before I stopped the capture.

Looking at that, it appears to freeze on the read command (0x08).
@Meriemi-git Do you mind doing the same thing but running liquidctl with --debug and uploading the log and capture?

@Meriemi-git
Contributor

Yes, I will do it today.

jonasmalacofilho added a commit that referenced this issue Sep 23, 2022
Fixes: #488
Related: #497, partially: hid.device.write doesn't support a timeout
Related: #505
Related: #506
@Meriemi-git
Contributor

Meriemi-git commented Sep 27, 2022

Sorry for the delay; it took me a long time (between one and two hours) to reproduce the hang with the Wireshark capture and the --debug option.
To capture the full log I first tried tee with the watch -n 1 command, but I wasn't able to write the entire debug output, so I wrote a simple bash script to reproduce the watch behavior:

#!/bin/bash
# Run `liquidctl status --debug` once per second, appending all output
# (stdout and stderr) to a log file, with a rough elapsed counter.
i=0
while true
do
    liquidctl status --debug >> liquidctl_freeze.log 2>&1
    sleep 1
    ((i=i+1))
    echo "Duration : $i sec"
done

I had to edit the log and capture files to be able to upload them.
liquidctl_freeze_.log
liquidctl_freeze.pcapng.txt

@ParkerMc
Contributor

Thank you for doing that.
Try setting _REPORT_LENGTH and _RESPONSE_LENGTH to 96 and see if that fixes it.
https://github.com/liquidctl/liquidctl/blob/main/liquidctl/driver/commander_core.py#L20-L21
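
For anyone following along, the suggested change amounts to (a sketch; the current values are at the link above, and the final patch is what became #513):

# liquidctl/driver/commander_core.py
_REPORT_LENGTH = 96
_RESPONSE_LENGTH = 96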

@Meriemi-git
Contributor

Tested for 5 hours without any freeze, but usually the freeze can happen only after several hours, so @intelfx, can you test it?

@Meriemi-git mentioned this issue Oct 3, 2022
@Meriemi-git
Contributor

Here is the pull request (#513) to apply @ParkerMc's fix.

@jonasmalacofilho
Member

@intelfx, can you please take a look at #513 and see if that also fixes things on your system?

@intelfx
Contributor Author

intelfx commented Oct 3, 2022

Apologies for the delay. Testing now.

@intelfx
Contributor Author

intelfx commented Oct 4, 2022

Yep, it looks like #513 works for me.
