
Ensure cross model app proxy is removed with the offer #13349

Merged: 6 commits into juju:2.9 on Sep 23, 2021

Conversation

@wallyworld (Member)

Hopefully the final fix for cross model relations issues observed on site.
When an offer is removed, any remote proxies for the consuming side (in the offering model) also need to be deleted. This wasn't always happening, partly because the removal depended on there being relations in play. The logic to remove the app proxies has been moved so that it always runs, even if there are no relations.

Because there could be orphaned remote proxies, an upgrade step is added to delete them.

Finally, to guard against the possibility that reference counting across models gets out of sync and blocks relation and unit removal (within a single model the changes run inside one txn, but that's not possible across models), if the txn slice fails to apply the first time, cross model relations are removed more forcibly on the second attempt.

QA steps

I used the repro scenario from site, primarily this bundle deployed to a model called ceph:

series: bionic
applications:
  ceph-mon:
    charm: cs:ceph-mon-55
    num_units: 3
    options:
      default-rbd-features: 1
      expected-osd-count: 3
      permit-insecure-cmr: True
  ceph-osd:
    charm: cs:ceph-osd-310
    num_units: 3
    options:
      osd-devices: /srv/ceph-osd
      bluestore: False
relations:
- - ceph-mon:osd
  - ceph-osd:mon

Then set up the relations:

juju offer ceph-mon:client ceph-mon-client
juju offer ceph-mon:radosgw ceph-mon-radogw

juju add-model consume
juju deploy cs:ceph-radosgw-296 --series bionic
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon
juju deploy glance --series bionic
juju relate admin/ceph.ceph-mon-client glance
juju deploy cs:keystone --series bionic
juju deploy cs:percona-cluster --series bionic
juju config percona-cluster min-cluster-size=1
juju relate glance percona-cluster
juju relate keystone percona-cluster
juju relate keystone glance
juju deploy cs:nova-compute --series bionic
juju relate nova-compute:image-service glance:image-service
juju relate nova-compute:ceph ceph-mon-client

You can now repeatedly

juju switch ceph
juju remove-offer ceph-mon-radogw --force -y
juju offer ceph-mon:radosgw ceph-mon-radogw
juju switch consume
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon

For each iteration, you can see that the relation created, joined, departed, and broken hooks have run:

juju show-status-log -m consume ceph-radosgw/0
juju show-status-log -m ceph ceph-mon/0

Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983

@wallyworld (Member, Author)

cmr offer stale proxies

@manadart (Member) commented Sep 22, 2021

I managed to get into a state where I think I did things too fast and had a disconnected SAAS.

22 Sep 2021 13:08:17+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/1
22 Sep 2021 13:08:19+02:00  workload   blocked    Services not running that should be: radosgw
22 Sep 2021 13:08:20+02:00  juju-unit  executing  running mon-relation-joined hook for ceph-mon-radogw/2
22 Sep 2021 13:08:20+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/2
22 Sep 2021 13:08:25+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/0
22 Sep 2021 13:08:26+02:00  workload   blocked    Services not running that should be: ceph-radosgw@rgw.juju-7f32a4-0
22 Sep 2021 13:08:27+02:00  juju-unit  idle
22 Sep 2021 13:09:24+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/1
22 Sep 2021 13:09:35+02:00  juju-unit  idle
22 Sep 2021 13:30:49+02:00  workload   active     Unit is ready
22 Sep 2021 13:30:52+02:00  juju-unit  executing  running mon-relation-departed hook for ceph-mon-radogw/0
22 Sep 2021 13:30:52+02:00  juju-unit  error      hook failed: "mon-relation-departed"
22 Sep 2021 13:30:56+02:00  juju-unit  error      hook failed: "relation-departed: relation: 1 not found"
22 Sep 2021 13:31:01+02:00  juju-unit  executing  running mon-relation-created hook
22 Sep 2021 13:31:01+02:00  juju-unit  idle
22 Sep 2021 14:01:38+02:00  juju-unit  executing  running mon-relation-broken hook
22 Sep 2021 14:01:38+02:00  juju-unit  idle
22 Sep 2021 14:02:02+02:00  juju-unit  executing  running mon-relation-created hook
22 Sep 2021 14:02:02+02:00  juju-unit  idle
22 Sep 2021 15:04:53+02:00  workload   waiting    Incomplete relations: mon

The good news is that I could go through the steps again and the relation was re-established OK.

22 Sep 2021 13:31:01+02:00  juju-unit  idle
22 Sep 2021 14:01:38+02:00  juju-unit  executing  running mon-relation-broken hook
22 Sep 2021 14:01:38+02:00  juju-unit  idle
22 Sep 2021 14:02:02+02:00  juju-unit  executing  running mon-relation-created hook
22 Sep 2021 14:02:02+02:00  juju-unit  idle
22 Sep 2021 15:07:20+02:00  juju-unit  executing  running mon-relation-broken hook
22 Sep 2021 15:07:20+02:00  juju-unit  idle
22 Sep 2021 15:07:22+02:00  juju-unit  executing  running mon-relation-created hook
22 Sep 2021 15:07:22+02:00  juju-unit  idle
22 Sep 2021 15:07:35+02:00  juju-unit  executing  running mon-relation-joined hook for ceph-mon-radogw/1
22 Sep 2021 15:07:35+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/1
22 Sep 2021 15:07:38+02:00  juju-unit  executing  running mon-relation-joined hook for ceph-mon-radogw/0
22 Sep 2021 15:07:38+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/0
22 Sep 2021 15:07:39+02:00  workload   waiting    Incomplete relations: mon
22 Sep 2021 15:07:40+02:00  juju-unit  executing  running mon-relation-joined hook for ceph-mon-radogw/2
22 Sep 2021 15:07:40+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/2
22 Sep 2021 15:07:42+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/0
22 Sep 2021 15:07:44+02:00  juju-unit  executing  running mon-relation-changed hook for ceph-mon-radogw/1
22 Sep 2021 15:07:48+02:00  workload   active     Unit is ready
22 Sep 2021 15:07:49+02:00  juju-unit  idle
Model    Controller  Cloud/Region  Version   SLA          Timestamp
consume  scratch     lxd/default   2.9.15.1  unsupported  15:37:41+02:00

SAAS             Status  Store    URL
ceph-mon-client  active  scratch  admin/ceph.ceph-mon-client
ceph-mon-radogw  active  scratch  admin/ceph.ceph-mon-radogw

App              Version  Status   Scale  Charm            Store       Channel  Rev  OS      Message
ceph-radosgw     12.2.13  active       1  ceph-radosgw     charmstore  stable   296  ubuntu  Unit is ready
glance           16.0.1   active       1  glance           charmhub    stable   512  ubuntu  Unit is ready
keystone         13.0.4   active       1  keystone         charmstore  stable   326  ubuntu  Application Ready
nova-compute     17.0.13  blocked      1  nova-compute     charmstore  stable   334  ubuntu  Missing relations: messaging
percona-cluster  5.7.20   active       1  percona-cluster  charmstore  stable   299  ubuntu  Unit is ready

Unit                Workload  Agent  Machine  Public address  Ports     Message
ceph-radosgw/0*     active    idle   0        10.161.87.57    80/tcp    Unit is ready
glance/0*           active    idle   1        10.161.87.137   9292/tcp  Unit is ready
keystone/0*         active    idle   2        10.161.87.13    5000/tcp  Unit is ready
nova-compute/0*     blocked   idle   4        10.161.87.190             Missing relations: messaging
percona-cluster/0*  active    idle   3        10.161.87.138   3306/tcp  Unit is ready

Machine  State    DNS            Inst id        Series  AZ  Message
0        started  10.161.87.57   juju-7f32a4-0  bionic      Running
1        started  10.161.87.137  juju-7f32a4-1  bionic      Running
2        started  10.161.87.13   juju-7f32a4-2  bionic      Running
3        started  10.161.87.138  juju-7f32a4-3  bionic      Running
4        started  10.161.87.190  juju-7f32a4-4  bionic      Running

Relation provider          Requirer                    Interface        Type     Message
ceph-mon-client:client     glance:ceph                 ceph-client      regular
ceph-mon-client:client     nova-compute:ceph           ceph-client      regular
ceph-mon-radogw:radosgw    ceph-radosgw:mon            ceph-radosgw     regular
ceph-radosgw:cluster       ceph-radosgw:cluster        swift-ha         peer
glance:cluster             glance:cluster              glance-ha        peer
glance:image-service       nova-compute:image-service  glance           regular
keystone:cluster           keystone:cluster            keystone-ha      peer
keystone:identity-service  glance:identity-service     keystone         regular
nova-compute:compute-peer  nova-compute:compute-peer   nova             peer
percona-cluster:cluster    percona-cluster:cluster     percona-cluster  peer
percona-cluster:shared-db  glance:shared-db            mysql-shared     regular
percona-cluster:shared-db  keystone:shared-db          mysql-shared     regular

@manadart (Member) left a comment:
The transaction logic here is horrific. I think this patch is OK, but I'd like another review on it.

continue
}
logger.Debugf("destroy consumer proxy %v for offer %v", remoteApp.Name(), op.offerName)
remoteAppOps, err := remoteApp.DestroyOperation(true).Build(attempt)
Member:

The model operations are nice for applying transactionality to a chunk of imperative logic. Side-loading them into another slice of ops is on the smelly side of clever.

@wallyworld (Member Author):

Agreed, it's not ideal. I'll see if I can add support for composing nested ops.

@wallyworld (Member Author):

I've pushed a change to improve the implementation. Not perfect, but better.

@wallyworld (Member Author):

> The transaction logic here is horrific. I think this patch is OK, but I'd like another review on it.

A major root cause of the messiness is that we don't support two-phase commit of operations across models. And even if we did, there's still the scenario where one model has permanently gone away and you need to forcibly sanitise the other model. It's exacerbated by the less-than-ideal ref count numbers we use on various parent objects; we should instead be using reference objects which can be properly tracked, allowing sensible cleanup logic to be applied. As it is now, across models, without txn guarantees, we can easily get wedged due to ref counts getting out of sync, and there's little option right now but to force through the cleanup. The ref counting used is also somewhat forced by the limitations of the client-side txns used with mongo; moving to server-side txns would certainly allow things to be done better, but that's way beyond the scope of this PR.

@hpidcock (Member) left a comment:

Just a couple of questions and a few import ordering issues, but LGTM.

Resolved review threads on:
- state/applicationoffers.go
- state/relation.go
- state/relationunit.go
- worker/uniter/relation/relationer.go
- worker/uniter/relation/relationer_test.go
@wallyworld (Member Author):

$$merge$$

@wallyworld wallyworld force-pushed the remove-orphaned-cmrapps branch 3 times, most recently from 889ee1b to 26a79ee Compare September 23, 2021 02:15
@wallyworld (Member Author):

$$merge$$

@wallyworld (Member Author):

$$merge$$

@jujubot jujubot merged commit e5f2da6 into juju:2.9 Sep 23, 2021
jujubot added a commit that referenced this pull request Sep 24, 2021
#13353

When removing and re-consuming a cross model offer, it's possible that the remote proxy is removed in the middle of the remote relation worker updating things like the macaroon to use. This PR handles that case and makes the worker more resilient to not-found errors.
Also, when a consuming saas application is dying, we ignore status updates from the offering side, as these can interfere with a clean status history.
On the offering side, we use "force" to destroy the consuming app proxy to ensure it is removed; otherwise a dying entity can remain and wedge the offering side's processing of events.

## QA steps

Deploy a cmr scenario like in #13349 and on the consuming side, remove the saas app and check logs for errors.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
jujubot added a commit that referenced this pull request Sep 24, 2021
#13354

The offering model creates a proxy for the consuming app. If the consuming side does a `remove-saas` and then consumes the same offer again, the offering model might retain stale info, especially if it is down when the consuming side does a force delete.

This PR uses an incrementing version on the consuming side. If the offering side gets a new connection and the version is newer than what it has, it force deletes the consuming proxy and starts again. This is analogous to what is done on the consuming side when the offer is force removed and then consumed again.

This is safe to do without a facade bump because the default version will be 0.

## QA steps

I tested this on top of #13353 but rebased against 2.9 to make this PR.

Deploy a cmr scenario like in #13349 and on the consuming side, remove the saas app and consume again.
Ensure relation hooks are run on both sides of the relation.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
jujubot added a commit that referenced this pull request Sep 28, 2021
#13361

Merge from 2.9 to bring forward:
- #13360 from wallyworld/simplestreams-compression
- #13359 from manadart/2.9-lxd-container-images
- #13352 from tlm/aws-instance-profile
- #13358 from jujubot/increment-to-2.9.16
- #13354 from wallyworld/refresh-consume-proxy
- #13353 from wallyworld/cmr-consume-fixes
- #13346 from SimonRichardson/raft-api-client
- #13349 from wallyworld/remove-orphaned-cmrapps
- #13348 from benhoyt/fix-secretrotate-tests
- #13119 from SimonRichardson/pass-context
- #13342 from SimonRichardson/raft-facade
- #13341 from ycliuhw/feature/quay.io

Conflicts (easy resolution):
- apiserver/common/crossmodel/interface.go
- apiserver/errors/errors.go
- apiserver/params/apierror.go
- apiserver/testserver/server.go
- scripts/win-installer/setup.iss
- snap/snapcraft.yaml
- version/version.go