Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raft: Check suspect info once per suspect interval #1600

Merged
merged 1 commit into from Jul 14, 2020

Conversation

jyellick
Copy link
Contributor

Type of change

  • Bug fix

Description

Today's existing suspect logic has a periodic checker, which checks
every 10s if the Raft cluster still has quorum. If the cluster has lost
quorum, it marks the time this event begins, then, every 10s checks to
see if 'enough' time has elapsed since the quorum was lost to suspect
that the OSN has been evicted.

If the OSN has not been evicted, or cannot determine its eviction
status, then every 10s the OSN attempts to re-check its suspicion
status, which can lead to large volumes of network traffic, especially
in significiantly multichannel environments.

This commit modifies the logic to track the number of times that the
suspect checking logic has actually executed, to ensure that we check no
more than once every suspect interval (by default every 10m, instead of
every 10s).
-->

@jyellick jyellick requested a review from a team as a code owner July 14, 2020 19:32
@jyellick
Copy link
Contributor Author

@yacovm @guoger @tock-ibm Could you take a look? Assuming this looks good, I think we'll want to create backports.

@@ -60,6 +61,9 @@ func (pc *PeriodicCheck) check() {
}

func (pc *PeriodicCheck) conditionNotFulfilled() {
if pc.ReportCleared != nil && pc.conditionHoldsSince != (time.Time{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of:
pc.conditionHoldsSince != (time.Time{}
do:
pc.conditionHoldsSince.IsZero() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure -- that is quite a bit more graceful, will fix.

Today's existing suspect logic has a periodic checker, which checks
every 10s if the Raft cluster still has quorum.  If the cluster has lost
quorum, it marks the time this event begins, then, every 10s checks to
see if 'enough' time has elapsed since the quorum was lost to suspect
that the OSN has been evicted.

If the OSN has not been evicted, or cannot determine its eviction
status, then every 10s the OSN attempts to re-check its suspicion
status, which can lead to large volumes of network traffic, especially
in significiantly multichannel environments.

This commit modifies the logic to track the number of times that the
suspect checking logic has actually executed, to ensure that we check no
more than once every suspect interval (by default every 10m, instead of
every 10s).

Signed-off-by: Jason Yellick <jyellick@us.ibm.com>
@yacovm yacovm merged commit c90015c into hyperledger:master Jul 14, 2020
@jyellick
Copy link
Contributor Author

@Mergifyio backport release-2.2

@jyellick
Copy link
Contributor Author

@Mergifyio backport release-2.1

@jyellick
Copy link
Contributor Author

@Mergifyio backport release-2.0

@jyellick
Copy link
Contributor Author

@Mergifyio backport release-1.4

@mergify
Copy link

mergify bot commented Jul 14, 2020

Command backport release-2.2: success

Backports have been created

mergify bot pushed a commit that referenced this pull request Jul 14, 2020
Today's existing suspect logic has a periodic checker, which checks
every 10s if the Raft cluster still has quorum.  If the cluster has lost
quorum, it marks the time this event begins, then, every 10s checks to
see if 'enough' time has elapsed since the quorum was lost to suspect
that the OSN has been evicted.

If the OSN has not been evicted, or cannot determine its eviction
status, then every 10s the OSN attempts to re-check its suspicion
status, which can lead to large volumes of network traffic, especially
in significiantly multichannel environments.

This commit modifies the logic to track the number of times that the
suspect checking logic has actually executed, to ensure that we check no
more than once every suspect interval (by default every 10m, instead of
every 10s).

Signed-off-by: Jason Yellick <jyellick@us.ibm.com>
(cherry picked from commit c90015c)
@mergify
Copy link

mergify bot commented Jul 14, 2020

Command backport release-2.1: failure

No backport have been created

  • Backport to branch release-2.1 failed

Cherry-pick of c90015c has failed:

On branch mergify/bp/release-2.1/pr-1600
Your branch is up to date with 'origin/release-2.1'.

You are currently cherry-picking commit c90015c9b.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

	modified:   orderer/consensus/etcdraft/chain.go
	modified:   orderer/consensus/etcdraft/eviction.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)

	both modified:   orderer/consensus/etcdraft/eviction_test.go

@mergify
Copy link

mergify bot commented Jul 14, 2020

Command backport release-2.0: failure

No backport have been created

  • Backport to branch release-2.0 failed

Cherry-pick of c90015c has failed:

On branch mergify/bp/release-2.0/pr-1600
Your branch is up to date with 'origin/release-2.0'.

You are currently cherry-picking commit c90015c9b.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

	modified:   orderer/consensus/etcdraft/chain.go
	modified:   orderer/consensus/etcdraft/eviction.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)

	both modified:   orderer/consensus/etcdraft/eviction_test.go

@mergify
Copy link

mergify bot commented Jul 14, 2020

Command backport release-1.4: failure

No backport have been created

  • Backport to branch release-1.4 failed

Cherry-pick of c90015c has failed:

On branch mergify/bp/release-1.4/pr-1600
Your branch is up to date with 'origin/release-1.4'.

You are currently cherry-picking commit c90015c9b.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

	modified:   orderer/consensus/etcdraft/chain.go

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)

	deleted by us:   orderer/consensus/etcdraft/eviction.go
	deleted by us:   orderer/consensus/etcdraft/eviction_test.go

lindluni pushed a commit that referenced this pull request Jul 15, 2020
Today's existing suspect logic has a periodic checker, which checks
every 10s if the Raft cluster still has quorum.  If the cluster has lost
quorum, it marks the time this event begins, then, every 10s checks to
see if 'enough' time has elapsed since the quorum was lost to suspect
that the OSN has been evicted.

If the OSN has not been evicted, or cannot determine its eviction
status, then every 10s the OSN attempts to re-check its suspicion
status, which can lead to large volumes of network traffic, especially
in significiantly multichannel environments.

This commit modifies the logic to track the number of times that the
suspect checking logic has actually executed, to ensure that we check no
more than once every suspect interval (by default every 10m, instead of
every 10s).

Signed-off-by: Jason Yellick <jyellick@us.ibm.com>
(cherry picked from commit c90015c)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants