Raft: Check suspect info once per suspect interval #1600

jyellick · 2020-07-14T19:32:39Z

Type of change

Bug fix

Description

The existing suspect logic uses a periodic checker that checks every
10s whether the Raft cluster still has quorum. If the cluster has lost
quorum, it records the time the loss began and then, every 10s, checks
whether enough time has elapsed since quorum was lost to suspect that
the OSN has been evicted.

Once that threshold has passed, if the OSN has not been evicted, or
cannot determine its eviction status, it re-checks its suspicion status
every 10s, which can generate large volumes of network traffic,
especially in heavily multichannel environments.

This commit modifies the logic to track the number of times that the
suspect checking logic has actually executed, ensuring that we check no
more than once every suspect interval (by default every 10m, instead of
every 10s).

jyellick · 2020-07-14T19:33:36Z

@yacovm @guoger @tock-ibm Could you take a look? Assuming this looks good, I think we'll want to create backports.

yacovm · 2020-07-14T19:54:58Z

orderer/consensus/etcdraft/eviction.go

@@ -60,6 +61,9 @@ func (pc *PeriodicCheck) check() {
 }

 func (pc *PeriodicCheck) conditionNotFulfilled() {
+	if pc.ReportCleared != nil && pc.conditionHoldsSince != (time.Time{}) {


Maybe instead of:
pc.conditionHoldsSince != (time.Time{})
do:
!pc.conditionHoldsSince.IsZero() ?

Sure -- that is quite a bit more graceful, will fix.

Signed-off-by: Jason Yellick <jyellick@us.ibm.com>

jyellick · 2020-07-14T20:59:12Z

@Mergifyio backport release-2.2

jyellick · 2020-07-14T20:59:19Z

@Mergifyio backport release-2.1

jyellick · 2020-07-14T20:59:26Z

@Mergifyio backport release-2.0

jyellick · 2020-07-14T20:59:33Z

@Mergifyio backport release-1.4

mergify · 2020-07-14T21:00:18Z

Command backport release-2.2: success

Backports have been created

#1601 Raft: Check suspect info once per suspect interval (bp #1600) has been created for branch release-2.2

Signed-off-by: Jason Yellick <jyellick@us.ibm.com> (cherry picked from commit c90015c)

mergify · 2020-07-14T21:00:53Z

Command backport release-2.1: failure

No backport have been created

Backport to branch release-2.1 failed

Cherry-pick of c90015c has failed:

On branch mergify/bp/release-2.1/pr-1600
Your branch is up to date with 'origin/release-2.1'.

You are currently cherry-picking commit c90015c9b.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

	modified:   orderer/consensus/etcdraft/chain.go
	modified:   orderer/consensus/etcdraft/eviction.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)

	both modified:   orderer/consensus/etcdraft/eviction_test.go

mergify · 2020-07-14T21:01:24Z

Command backport release-2.0: failure

No backport have been created

Backport to branch release-2.0 failed

Cherry-pick of c90015c has failed:

On branch mergify/bp/release-2.0/pr-1600
Your branch is up to date with 'origin/release-2.0'.

You are currently cherry-picking commit c90015c9b.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

	modified:   orderer/consensus/etcdraft/chain.go
	modified:   orderer/consensus/etcdraft/eviction.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)

	both modified:   orderer/consensus/etcdraft/eviction_test.go

mergify · 2020-07-14T21:01:54Z

Command backport release-1.4: failure

No backport have been created

Backport to branch release-1.4 failed

Cherry-pick of c90015c has failed:

On branch mergify/bp/release-1.4/pr-1600
Your branch is up to date with 'origin/release-1.4'.

You are currently cherry-picking commit c90015c9b.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

	modified:   orderer/consensus/etcdraft/chain.go

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)

	deleted by us:   orderer/consensus/etcdraft/eviction.go
	deleted by us:   orderer/consensus/etcdraft/eviction_test.go

jyellick requested a review from a team as a code owner July 14, 2020 19:32

yacovm reviewed Jul 14, 2020

jyellick force-pushed the raft-suspect-interval branch from 771e51c to 8b3eef0 Compare July 14, 2020 20:05

yacovm approved these changes Jul 14, 2020

yacovm merged commit c90015c into hyperledger:master Jul 14, 2020

mergify bot mentioned this pull request Jul 14, 2020

Raft: Check suspect info once per suspect interval (bp #1600) #1601

Merged
