Description
From time to time vdsm ends up in a deadlock and becomes completely unresponsive during OST runs. The problem is observed on el8stream, and it is always host-0 that is affected. OST reports it as a 'test_use_ovn_provider' failure, but a quick look at 'vdsm.log' shows that the problem happens earlier: entries in 'vdsm.log' stop at some point in time, while 'messages' and other log files show that the host stayed up for about 8 more minutes.
After attaching to the 'vdsm' process with gdb, we can see that all the threads are waiting on locks.
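For illustration only, the kind of per-thread view described above can also be produced from inside a Python process. The sketch below is not what was used for this report (that was gdb attached from outside), and `dump_thread_stacks` is a hypothetical helper name, not part of vdsm. It relies only on the standard library (`sys._current_frames`, `threading.enumerate`, `traceback.format_stack`).

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return {thread name: formatted stack} for every live thread.

    Hypothetical helper for illustration; not a vdsm API.
    """
    # Map thread identifiers to human-readable names.
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        # format_stack returns a list of strings, one per frame.
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks
```

A thread blocked on a lock shows up with its innermost frame at the acquiring call. This only helps if the process can still run Python code, for example when the dump is wired to a signal beforehand with `faulthandler.register`; for a process that is already hung, attaching from outside as done here remains the practical option.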
The deadlock timing always aligns with the SD detach/reattach tests:
https://github.com/oVirt/ovirt-system-tests/blob/master/basic-suite-master/test-scenarios/test_007_sd_reattach.py
Even after analyzing a couple of such failures, it's hard to pinpoint one specific thing that causes this problem, but the logs always end in the storage parts of vdsm.
Version-Release number of selected component (if applicable):
Latest vdsm version.
How reproducible:
Rarely, ~1 in 10 runs.
Steps to Reproduce:
- Run basic suite master on el8stream
- Check if it failed on 'test_use_ovn_provider'
- Check if 'vdsm.log' entries end a couple of minutes earlier than those from other logs, e.g. 'messages'
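The last check can be scripted. The following is a minimal sketch, assuming that 'vdsm.log' lines start with a `YYYY-MM-DD HH:MM:SS` timestamp and that 'messages' lines start with a syslog-style `Mon DD HH:MM:SS` timestamp without a year. Both layouts, the helper names, and the 2-minute threshold are assumptions and may need adjusting for the actual files.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layouts; adjust to the real log formats on the host.
VDSM_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
SYSLOG_TS = re.compile(r"^([A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2})")


def last_timestamp(lines, pattern, fmt, year=None):
    """Return the timestamp of the last line matching pattern, or None."""
    last = None
    for line in lines:
        match = pattern.match(line)
        if not match:
            continue  # e.g. traceback continuation lines
        text = match.group(1)
        if year is not None:
            # Syslog-style stamps carry no year, so prepend one.
            text = f"{year} {text}"
        last = datetime.strptime(text, fmt)
    return last


def log_gap(vdsm_lines, messages_lines, year):
    """Return how much later 'messages' ends than 'vdsm.log'."""
    vdsm_last = last_timestamp(vdsm_lines, VDSM_TS, "%Y-%m-%d %H:%M:%S")
    msg_last = last_timestamp(
        messages_lines, SYSLOG_TS, "%Y %b %d %H:%M:%S", year=year
    )
    if vdsm_last is None or msg_last is None:
        return None
    return msg_last - vdsm_last


def looks_stalled(gap, threshold=timedelta(minutes=2)):
    """True if vdsm stopped logging well before the host went down."""
    return gap is not None and gap >= threshold
```

Passing the open file objects for both logs to `log_gap` and feeding the result to `looks_stalled` gives a quick yes/no answer for a given run.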
Actual results:
vdsm ends up in a deadlock.
Expected results:
vdsm continues to operate normally.
Original bz: https://bugzilla.redhat.com/show_bug.cgi?id=2111187