
backport: core: fix network faults handling and fencing flow #306

Merged

Conversation

mwperina
Member

This patch fixes network exception handling and the fencing flow logic.
Problems in the current code:

1. Hard fencing happens too fast since we waited on number of attempts <or>
grace period; because the number of attempts is configured to a value of "2",
the effective grace period was only ~20 seconds.

2. VdsManager::isHostInGracePeriod was called periodically both from
VdsManager::handleNetworkException and from
SshSoftFencingCommand::checkIfHostBecomeUp, which makes the logic
complex and not working as expected.
While we have to honor the network exception grace period when the host
switches to 'connecting' state based on its load (number of running VMs
and SPM status), in the soft-fencing flow the host is already in
non-responding status, another host has already taken the SPM role, and
all its running VMs are set to 'unknown' status. So we should not
consider the host load at all, and a fixed grace period (configurable,
1 minute) is enough to restart the vdsmd service on the host and get it
up and running (see the sketch below).
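The distinction between the two grace periods can be illustrated with a short sketch. This is a minimal, hypothetical Java illustration, not the actual oVirt engine code: the class and method names (GracePeriodSketch, connectingGracePeriod, softFencingGraceExpired), the constant SOFT_FENCING_GRACE, and the load factor are all made up for explanation only.

```java
import java.time.Duration;
import java.time.Instant;

class GracePeriodSketch {

    // Assumed configurable value; the 1-minute default mirrors the description
    // above, but the name and mechanism here are illustrative only.
    static final Duration SOFT_FENCING_GRACE = Duration.ofMinutes(1);

    /**
     * Grace period while the host is still in 'connecting' state:
     * scaled by the host load (running VMs, SPM role).
     */
    static Duration connectingGracePeriod(Duration base, int runningVms, boolean isSpm) {
        long factor = 1 + runningVms / 10 + (isSpm ? 1 : 0); // illustrative load factor
        return base.multipliedBy(factor);
    }

    /**
     * Grace period for the soft-fencing flow: the host is already
     * non-responding, the SPM role has moved and its VMs are 'unknown',
     * so load is irrelevant and a fixed grace period is enough.
     */
    static boolean softFencingGraceExpired(Instant notRespondingSince) {
        return Instant.now().isAfter(notRespondingSince.plus(SOFT_FENCING_GRACE));
    }
}
```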

The solution was tested with a host acting as SPM and running VMs (some of them HA),
with a non-SPM host running VMs, and with a regular host.

Results:

  1. Both the initial grace period between connecting and non-responding and
    the one between soft-fencing and hard-fencing are honored.

  2. The code is more readable and straightforward.

Signed-off-by: Eli Mesika <emesika@redhat.com>
Bug-Url: https://bugzilla.redhat.com/2071468

@mwperina mwperina changed the title core: fix network faults handling and fencing flow backport: core: fix network faults handling and fencing flow Apr 25, 2022
@mwperina mwperina requested review from sandrobonazzola and removed request for bennyz, didib, sgratch and oliel April 25, 2022 11:54
@mwperina
Member Author

Verified on master

@mwperina mwperina merged commit 646d2e2 into oVirt:ovirt-engine-4.5.0.z Apr 25, 2022
@mwperina mwperina deleted the soft-fencing-timeout-backportt branch April 25, 2022 12:40