raft: Ensure we still get metrics after a worker bounce #9691

Merged
merged 1 commit into juju:2.5 from babbageclunk:fix-raft-metrics-reg on Jan 31, 2019

Conversation

6 participants
@babbageclunk
Member

babbageclunk commented Jan 30, 2019

Description of change

The go-metrics library registers the returned collector with the default registry (which we don't use). Unregister it so we don't lose metrics after the worker restarts, for example if a mongo i/o timeout causes the state worker to bounce.

When raft times out starting up, we can get into a situation where a second loop is started before the old one finishes (so it hasn't yet unregistered). Use unregister/register rather than register/defer-unregister to handle this.
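
For context, a rough sketch of the unregister/register pattern this describes, using the prometheus client library; the helper name and arguments here are illustrative, not the actual juju raft worker code:

```go
// Sketch only: registerMetrics and its arguments are illustrative,
// not the real juju code.
package raftmetrics

import "github.com/prometheus/client_golang/prometheus"

// registerMetrics moves the go-metrics collector off the default
// prometheus registry (which juju never scrapes) and onto the registry
// we actually expose.
func registerMetrics(registry prometheus.Registerer, collector prometheus.Collector) error {
	// go-metrics auto-registers the collector with the default registry;
	// remove it from there so metrics only live where we expect them.
	prometheus.DefaultRegisterer.Unregister(collector)

	// Unregister before registering: if a previous incarnation of the
	// worker loop hasn't shut down yet, a collector describing the same
	// metrics may still be registered. With register/defer-unregister the
	// old loop's deferred unregister would remove the new collector; with
	// unregister/register the newest loop always wins.
	registry.Unregister(collector)
	return registry.Register(collector)
}
```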

QA steps

  • Bootstrap a controller, enable HA.
  • Use the juju_metrics command to check that juju_raft_fsm_apply_count is increasing over time on all controllers (setTime commands are applied on the leader and replicated to followers): juju_metrics 2>&1 | grep juju_raft_fsm_apply_count
  • Watch the controller debug-log for errors.
  • Trigger a state worker bounce on controller machine 1 by creating the necessary wrench file: echo io-timeout | sudo tee /var/lib/juju/wrench/state-worker
  • Remove the wrench after seeing the wrench error in the controller log.
  • Check that the raft worker start-count has increased on controller 1: juju_engine_report | less +/raft:
  • Check that the fsm_apply_count metric is still increasing over time on controller 1: juju_metrics 2>&1 | grep juju_raft_fsm_apply_count

Documentation changes

None

Bug reference

Fixes https://bugs.launchpad.net/juju/+bug/1813992

@jameinel
Member

jameinel left a comment

I'm happy with the change as you've described it, and I really appreciate the detailed 'how to reproduce', though I haven't done it myself.
Given the commented out wrench, is it possible to do it as you described? Is there a different wrench you're supposed to set?
Is there overhead from leaving that wrench in? Otherwise it seems we could use it to isolate other issues in the future?

@@ -194,6 +194,12 @@ func (w *stateWorker) loop() error {
				return errors.Trace(err)
			}
		}
		// Useful for tracking down some bugs that occur when
		// mongo is overloaded.
		// case <-time.After(30 * time.Second):


@jameinel

jameinel Jan 31, 2019

Member

If this wrench is useful, is there a reason to comment it out?


@babbageclunk

babbageclunk Jan 31, 2019

Author Member

After discussing it with you I'm putting this one back in.

@babbageclunk babbageclunk force-pushed the babbageclunk:fix-raft-metrics-reg branch from a281049 to 531ee93 Jan 31, 2019

@babbageclunk

Member Author

babbageclunk commented Jan 31, 2019

> Given the commented out wrench, is it possible to do it as you described? Is there a different wrench you're supposed to set?

Not without uncommenting the wrench, although that's the first step of the instructions ;). I'm not really sure whether the QA steps here are supposed to describe what I did to test this, or what someone else should do (in theory) - either a reviewer or someone testing? I think it was originally added as a kind of "show your work" nudge to make sure people were actually trying things manually as well as getting unit tests to pass. What do we think it should be?

I'm ambivalent about leaving wrenches in the code. I've been using them a lot in the last few days when we find bugs that we haven't been able to easily reproduce locally in other ways. They're handy, but I feel like they're distracting from the actual code flow, and I haven't been putting tests in for them (since I'm using them as a development tool) which seems a bit risky. Leaving them in commented was a bit of a compromise so a dev wouldn't have to reinvent the wheel in the future if they want to get into that state, but it's still clutter. What do you think?

> Is there overhead from leaving that wrench in? Otherwise it seems we could use it to isolate other issues in the future?

I don't think it would be much overhead - checking for a file every 30s seems innocuous. In this case it might make sense to uncomment it, since the state worker is right near the root of our dependency graph - testing how various workers cope with it bouncing is widely applicable. But other ones I've been using have been more specific.
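
For reference, a minimal sketch of how a periodic wrench check like the commented-out one above can sit in a worker select loop; the package, channel, and error value are illustrative, not the real state worker code:

```go
// Sketch only: the surrounding loop is simplified, not the actual state worker.
package stateworker

import (
	"errors"
	"time"

	"github.com/juju/juju/wrench"
)

func loop(done <-chan struct{}) error {
	for {
		select {
		case <-done:
			return nil
		// Useful for tracking down bugs that occur when mongo is
		// overloaded: bounce the worker when /var/lib/juju/wrench/state-worker
		// contains "io-timeout".
		case <-time.After(30 * time.Second):
			if wrench.IsActive("state-worker", "io-timeout") {
				return errors.New("wrench: simulated mongo i/o timeout")
			}
		}
	}
}
```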

@babbageclunk

Member Author

babbageclunk commented Jan 31, 2019

!!build!!

@mitechie

Member

mitechie commented Jan 31, 2019

!!build!!

Ensure we still get raft metrics after a bounce
The go-metrics library registers the returned collector with the default
registry (which we don't use). If we don't unregister it, we lose
metrics after the worker restarts, for example if a mongo i/o
timeout causes the state worker to bounce.

When raft times out starting up we can get a situation where a second
loop is started before the old one finishes (so it hasn't yet
unregistered). Use unregister/register rather than
register/defer-unregister to handle this.
@babbageclunk

Member Author

babbageclunk commented Jan 31, 2019

To test the raft slow-start scenario, apply this patch:

diff --git a/worker/raft/worker.go b/worker/raft/worker.go
index 20862dd..e728e11 100644
--- a/worker/raft/worker.go
+++ b/worker/raft/worker.go
@@ -21,6 +21,7 @@ import (
        "gopkg.in/juju/worker.v1/catacomb"
 
        "github.com/juju/juju/worker/raft/raftutil"
+       "github.com/juju/juju/wrench"
 )
 
 const (
@@ -343,6 +344,11 @@ func (w *Worker) loop(raftConfig *raft.Config) (loopErr error) {
        noLeaderCheck := w.config.Clock.After(noLeaderFrequency)
        lastContact := w.config.Clock.Now()
 
+       if wrench.IsActive("raft-worker", "slow-start") {
+               w.config.Logger.Warningf("raft-worker slow-start wrench - pausing for 90s")
+               time.Sleep(90 * time.Second)
+       }
+
        for {
                select {
                case <-w.catacomb.Dying():

Create the wrench, restart the controller, and remove the wrench once it has been hit. Afterwards the internal raft stats like fsm_apply_count should still be updated.

@babbageclunk babbageclunk force-pushed the babbageclunk:fix-raft-metrics-reg branch from 531ee93 to 995c114 Jan 31, 2019

@babbageclunk

Member Author

babbageclunk commented Jan 31, 2019

$$merge$$

@jujubot jujubot merged commit ed48053 into juju:2.5 Jan 31, 2019

0 of 2 checks passed

check-multi-juju: Build finished.
merge-multi-juju: CI Run in progress

@babbageclunk babbageclunk deleted the babbageclunk:fix-raft-metrics-reg branch Jan 31, 2019

@wallyworld wallyworld referenced this pull request Feb 2, 2019

Merged

Merge 2.5 into develop #9704

jujubot added a commit that referenced this pull request Feb 2, 2019

Merge pull request #9704 from wallyworld/merge-2.5-20190202
#9704

## Description of change

Merge the 2.5 branch, including the following pull requests:

#9702 Remove juju- prefix from k8s pod names
#9698 Make raft logging more useful
#9616 hubwatcher cache removal
#9691 raft metrics registration fix
#9695 log try again errors as info
#9678 API server metrics collector
#9693 support more k8s specific pod attributes
#9685 filtered access to leases in store
#9692 fix k8s charm upgrades
#9676 fix meter status watcher
#9683 handle failures writing lease changes to database
#9688 update juju txn dependency
#9684 exit http worker after a timeout even if muxes are in use
#9675 add raft transport dependencies
#9672 optimise SetSupportedContainer db writes
#9669 fix goroutine leak in log sink handler
#9666 fix generation of operator tag for k8s docker images
#9656 handle k8s app config in deploy ConfigYAML
#9631 multi-cloud controllers
#9638 add support for setting k8s service annotations

## QA steps

bootstrap and deploy