Handle PingBatcher getting restarted (bug #1703526) #7626

Merged
merged 2 commits into from Jul 11, 2017

Conversation

Projects
None yet
5 participants
Owner

jameinel commented Jul 11, 2017

Description of change

Instead of fixing a PingBatcher when initializing a Pinger,
we instead have the pinger ask for whatever the current PingBatcher is.
That way if the PingBatcher gets restarted due to an error, then the
Pingers can find the correct one.

Only other problem is we probably don't restart the Pingers either if
they encounter an error, so that needs testing as well.

Also, we're missing a test for the fact that Pinger's always use the
fresh PingBatcher (they do by inspection, but it should be actually tested.)

QA steps

  $ juju bootstrap lxd
  $ juju deploy ubuntu
  # connect to the mongo
  > use presence
  > db.presence.pings.find()
  # grab the most recent ping and add something like 60 seconds to the
  # timestamp, inject a broken record to the database, which will force
  # the PingBatcher to die.
  > db.presence.pings.insert(
{"_id": "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870", "slot" : NumberLong(1499793870), "alive" : { "1" : "a"}})
  # note that the slot is both its own field and the tail of the _id, both need to be updated
  $ juju debug-log
  # should show the PingBatcher die once the appropriate slot is the current slot.
  $ juju status
  # might show the agents go down temporarily, but ultimately they should come back up again

I did manage to QA this by injecting bad data, seeing PingBatcher die, seeing some agents show up as down, but see everything recover afterward.
Without this patch, everything stays dead after injecting bad data.

Documentation changes

Just fixes a bug.

Bug reference

https://bugs.launchpad.net/juju/+bug/1703526

jameinel added some commits Jul 11, 2017

Possible fix for 1703526.
Instead of fixing a PingBatcher when initializing a Pinger,
we instead have the pinger ask for whatever the current PingBatcher is.
That way if the PingBatcher gets restarted due to an error, then the
Pingers can find the correct one.

Only other problem is we probably don't restart the Pingers either if
they encounter an error, so that needs testing as well.

Also, we're missing a test for the fact that Pinger's always use the
fresh PingBatcher (they do by inspection, but it should be actually tested.)

@jameinel jameinel changed the title from Possible fix for 1703526. to Handle PingBatcher getting restarted (bug #1703526) Jul 11, 2017

Owner

jameinel commented Jul 11, 2017

There were a couple of bugs in my patch (u.globalAgentKey() is equivalent to m.globalKey(), while u.globalKey is actually the #charm key).
But I was able to kill PingBatcher by injecting bad data into the database, see that some agents did get 'lost' and then saw them come back again after another minute. Without this patch the agents stayed dead.
I'm a little surprised that PingBatcher dying didn't end up causing Pingers to die and get restarted, but I'll live with it.

wupeka approved these changes Jul 11, 2017

I'd add a test that checks what happens when PingBatcher dies, other than that LGTM.

Owner

wallyworld commented Jul 11, 2017

$$merge$$

Contributor

jujubot commented Jul 11, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 78ed638 into juju:2.2 Jul 11, 2017

1 check passed

github-check-merge-juju Ran tests against PR. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details
Contributor

dshcherb commented Jul 13, 2017

I've built a snap for this change and upgraded a controller. Let's see if the issue reproduces after a while. I will post more info tomorrow as it takes time to reproduce.

Contributor

dshcherb commented Jul 15, 2017

I can confirm that this change has fixed the issue both with a self-built snap and in the official 2.2.2 snap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment