vdsm hangs deadlocked after SD detach/reattach cycle #319

@ahadas

Description

From time to time, vdsm ends up deadlocked and completely unresponsive in OST. The problem is observed on el8stream, and it is always host-0 that is affected. OST reports it as a 'test_use_ovn_provider' failure, but a quick look at 'vdsm.log' shows that the problem starts earlier: entries in 'vdsm.log' stop at some point in time, while 'messages' and other log files show that the host stayed up for about 8 more minutes.

After attaching to the 'vdsm' process with gdb, we can see that all the threads are waiting on locks.
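
For anyone trying to capture the same state, a minimal sketch of dumping the backtraces of all threads from a hung process is below. It is not the exact procedure used for this report; the helper names are hypothetical, and it assumes gdb is installed and that the caller is allowed to ptrace the process (typically root).

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that dumps the backtrace of every thread.

    Hypothetical helper, not part of vdsm or OST.
    """
    return [
        "gdb",
        "-p", str(pid),                 # attach to the running process
        "-batch",                       # run the commands below, then exit
        "-ex", "thread apply all bt",   # native backtrace of each thread
    ]


def dump_thread_backtraces(pid):
    """Run gdb against pid and return whatever it printed to stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero; keep the output anyway
    )
    return result.stdout
```

The native backtraces only show C-level frames. Seeing the Python-level frames would additionally need the Python debug symbols and gdb extensions for the interpreter in use, which this sketch does not set up.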

The deadlock timing always aligns with the SD detach/reattach tests:
https://github.com/oVirt/ovirt-system-tests/blob/master/basic-suite-master/test-scenarios/test_007_sd_reattach.py

Even after analyzing a couple of such failures, it's hard to pinpoint one specific cause, but the logs always end in the storage parts of vdsm.

Version-Release number of selected component (if applicable):
Latest vdsm version.

How reproducible:
Rarely, ~1 in 10 runs.

Steps to Reproduce:

  1. Run basic suite master on el8stream
  2. Check if it failed on 'test_use_ovn_provider'
  3. Check if 'vdsm.log' entries end a couple of minutes earlier than those from other logs, e.g. 'messages'
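
The check in step 3 can be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes 'vdsm.log' lines start with a timestamp shaped like `YYYY-MM-DD HH:MM:SS,mmm+ZZZZ` and that 'messages' uses the traditional syslog prefix (`Mon DD HH:MM:SS`, no year or timezone), so the year and timezone have to be supplied by the caller. All function names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed vdsm.log timestamp layout: date, then time with milliseconds and UTC offset.
VDSM_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f%z"


def parse_vdsm_line(line):
    """Return the timestamp at the start of a vdsm.log line, or None."""
    parts = line.split(None, 2)
    if len(parts) < 2:
        return None
    try:
        return datetime.strptime(parts[0] + " " + parts[1], VDSM_TS_FORMAT)
    except ValueError:
        return None


def parse_syslog_line(line, year, tz=timezone.utc):
    """Return the timestamp of a traditional syslog line, or None.

    Syslog lines of this kind carry no year or timezone, so both are
    taken from the arguments (an assumption about the host).
    """
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        stamp = "{} {} {} {}".format(year, parts[0], parts[1], parts[2])
        return datetime.strptime(stamp, "%Y %b %d %H:%M:%S").replace(tzinfo=tz)
    except ValueError:
        return None


def last_timestamp(lines, parser):
    """Return the timestamp of the last line that the parser understands."""
    for line in reversed(list(lines)):
        ts = parser(line)
        if ts is not None:
            return ts
    return None


def log_gap(vdsm_lines, messages_lines, year, tz=timezone.utc):
    """Return how much longer 'messages' kept logging than 'vdsm.log'."""
    vdsm_last = last_timestamp(vdsm_lines, parse_vdsm_line)
    other_last = last_timestamp(
        messages_lines, lambda line: parse_syslog_line(line, year, tz)
    )
    if vdsm_last is None or other_last is None:
        return None
    return other_last - vdsm_last
```

A gap of several minutes between the two logs on a failed run would match the symptom described above; a gap of a few seconds is more likely ordinary logging jitter.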

Actual results:
vdsm ends up in a deadlock.

Expected results:
vdsm continues to operate normally.

Original bz: https://bugzilla.redhat.com/show_bug.cgi?id=2111187
