remotefirewaller API: Watch machine addresses #7234

Merged
merged 6 commits into from Apr 17, 2017

Conversation

Projects
None yet
3 participants
Member

babbageclunk commented Apr 13, 2017

Description of change

The IngressAddressWatcher will now signal when a unit's address changes, as well as when units enter or leave the scope of the relation. This means that cross-model relations will still work when addresses change.

QA steps

Smoke test CMR deployment. Then have a machine's public address change in the consuming model (I'm not sure how to do this). The offering model's firewall ingress rules should be updated to match the new public addresses of the consuming units.

babbageclunk added some commits Apr 11, 2017

Unified handling of known addresses
Everything manages the known map, then when there's been a change to the
addresses or for the initial event we generate the addresses from known.
Only format the addresses once
Rather than each time the select runs.
Member

babbageclunk commented Apr 13, 2017

I was thinking about tracking the addresses by machine rather than by unit (so combining it into the machines map) but I figured it's probably enough refactoring for now.

This can be tested live on GCE can't it?

)
- // addresses holds the current set of known addresses.
- addresses, err := w.initial()
+ err = w.initial()
@wallyworld

wallyworld Apr 13, 2017

Owner

In watchers, initial() by convention has been used to return the initial set of values.
If we are changing the semantics here, the method should be renamed to initialise() or something

@babbageclunk

babbageclunk Apr 13, 2017

Member

Ok, I'll rename it.

- out = nil
+ case machineId, ok := <-w.addressChanges:
+ if !ok {
+ return errors.Errorf("address changes channel unexpectedly closed")
@wallyworld

wallyworld Apr 13, 2017

Owner

This could just be part of normal shutdown. I think this should just continue.

// If the unit is departing and we have seen its address,
// remove the address.
- address := w.known[name]
+ address, ok := w.known[name]
+ if !ok {
@wallyworld

wallyworld Apr 13, 2017

Owner

This isn't needed. It's safe to delete a non-existent key from a map

@babbageclunk

babbageclunk Apr 13, 2017

Member

I'm using ok to prevent reporting we've changed the map when the value wasn't in there to begin with - it's really there to stop the loop further down from happening.

+ if err != nil {
+ return errors.Trace(err)
+ }
+ w.machines[machine.Id()] = &machineData{
@wallyworld

wallyworld Apr 13, 2017

Owner

This doesn't look right - what happens if we have already seen that machine before. This could be the second unit assigned to a particular machine.

@babbageclunk

babbageclunk Apr 13, 2017

Member

That's handled above - if the machine data is found we just add this unit to the unit set and return.

+ mData, ok := w.machines[machineId]
+ if !ok {
+ // This shouldn't happen.
+ return errors.Errorf("missing machine data for machine %q (hosting unit %q)", machineId, unitName)
@wallyworld

wallyworld Apr 13, 2017

Owner

I think a logged warning here is ok?

@babbageclunk

babbageclunk Apr 13, 2017

Member

I guess so - it could happen if we saw the unit leave scope more than once somehow.

+ return errors.Trace(err)
+ }
+ delete(w.machines, machineId)
+ delete(w.unitToMachine, unitName)
@wallyworld

wallyworld Apr 13, 2017

Owner

Can we move this line up to just after machineId is retrieved

@babbageclunk

babbageclunk Apr 13, 2017

Member

Ok - well, to just after we check units.Size().

@babbageclunk

babbageclunk Apr 13, 2017

Member

Duh, thought you were talking about the machines line. Yes, you're right - we should do that deletion even if the machine as a whole isn't going away.

+ return machine, nil
+}
+
+func (w *IngressAddressWatcher) machineAddressChanged(machineId string) (bool, error) {
@wallyworld

wallyworld Apr 13, 2017

Owner

The method name implies this is a read only operation.
But "known" is updated.
Can we call it processMachineAddresses() or something.
The return values could be named to make it clear what is happening eg (changed bool, _ error)

@babbageclunk

babbageclunk Apr 13, 2017

Member

Ok, will do.

+func newMachineAddressWatcher(machine Machine, dest chan<- string) (*machineAddressWatcher, error) {
+ w := &machineAddressWatcher{
+ machine: machine,
+ dest: dest,
@wallyworld

wallyworld Apr 13, 2017

Owner

This should just be called addressChangeChan or something

+// machineAddressWatcher watches for machine address changes and
+// notifies the dest channel when it sees them. It isn't a watcher in
+// the strict sense since it doesn't have a Changes method.
+type machineAddressWatcher struct {
@wallyworld

wallyworld Apr 13, 2017

Owner

In other places we would have called this "machineAddressWorker"

@babbageclunk

babbageclunk Apr 13, 2017

Member

Ok, that's clearer.

@@ -302,3 +311,62 @@ func (s *addressWatcherSuite) TestTwoUnitsSameAddressOneLeaves(c *gc.C) {
wc.AssertChange()
wc.AssertNoChange()
}
+
+func (s *addressWatcherSuite) TestSeesMachineAddressChanges(c *gc.C) {
@wallyworld

wallyworld Apr 13, 2017

Owner

We need a test where a second unit is assigned to an existing machine with a unit already on it

@babbageclunk

babbageclunk Apr 13, 2017

Member

I'll add that.

Thanks, hopefully can test live...

Member

babbageclunk commented Apr 13, 2017

!!build!!

Member

babbageclunk commented Apr 13, 2017

$$merge$$

Contributor

jujubot commented Apr 13, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented Apr 13, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10675

Member

babbageclunk commented Apr 17, 2017

$$merge$$

Contributor

jujubot commented Apr 17, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Member

babbageclunk commented Apr 17, 2017

I did a GCE test on Thursday - it seemed like this change worked correctly, but from the logs it appeared that the firewaller got stuck in the call to flushInstancePorts. I can see the log message about flushing instance ports with the right change in it, but it appears to have hung in the OpenPorts call. Trying to reproduce now (the GCE instances were cleaned up over the break).

@jujubot jujubot merged commit 26f56a9 into juju:develop Apr 17, 2017

1 check failed

github-check-merge-juju Built PR, ran unit tests, and tested LXD deploy. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details

@babbageclunk babbageclunk deleted the babbageclunk:watch-machine-addresses branch Apr 17, 2017

jujubot added a commit that referenced this pull request Apr 24, 2017

Merge pull request #7246 from babbageclunk/gce-firewaller-fix
provider/gce: Don't base the name of firewall rules on hash of CIDRs

## Description of change

Found when testing #7234 - the updated addresses would be logged for updating the firewall, but the update wouldn't happen. The error on updating was being swallowed somewhere so it wasn't clear what was happening from the logs. (I'll fix that in a separate PR when I track it down.)

We were generating the name of the firewall using a hash of the new rules, and trying to update the firewall to have the new name. GCE will prevent a mismatch between the URL and the name with this
error:

```
opening port(s) [3306/tcp from 35.187.146.143/32]: googleapi: Error 400: Invalid value for field 'resource.name': 'juju-893de4-0-d4bc50'. The name of the resource in request body must be the same as the resource name in the URL, invalid
```
Since we can't update the name of the firewall rules it seems like the options are:

1. Create a new one (with the name still based on the hashed CIDRs) and then remove the old one.
2. Change the naming to not be based on hashing the CIDRs - use a random unique string instead.

Option 1 can't be done atomically, and cleaning up the mess if the create happened but the delete didn't seems like it would be nasty. So this change implements option 2.

Since the names will be unpredictable we need to hang on to them when we get the existing rules from GCE so we can send them back with updates/deletes. The old version of the code wasn't doing that - names were thrown away when the compute.Firewall is converted into []network.IngressRule because it could regenerate the name from the CIDRs.


## QA steps

* Bootstrap on GCE, create two models with a cross-model relation between mysql and wordpress. 
* Expose wordpress
* Check that you can successfully see the wordpress setup page at the wordpress public address.
* In the GCE console, stop the wordpress machine and then start it again - note that its public address has changed.
* Wait for the Juju instance poller to notice the updated public address.
* Check that the wordpress setup page is available at the new address.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment