Merge 2.9 into develop #13361

Merged: 41 commits, Sep 28, 2021

Conversation

manadart

Merge from 2.9 to bring forward:

Conflicts (easy resolution):

  • apiserver/common/crossmodel/interface.go
  • apiserver/errors/errors.go
  • apiserver/params/apierror.go
  • apiserver/testserver/server.go
  • scripts/win-installer/setup.iss
  • snap/snapcraft.yaml
  • version/version.go

SimonRichardson and others added 30 commits September 17, 2021 15:08
Adds a new error type - Not Leader - to the API server error types. This
will be used to tell the client that the server cannot process the raft
application, rather than the server trying to move the request to the leader itself.

Additionally, I've moved the error types to a new file to make it a lot
easier to understand the code.
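As a rough illustration of what such a typed error can look like (the package, field and constructor names below are a sketch, not necessarily the exact ones added to the apiserver error packages):

```go
// Sketch only: a typed "not leader" error in the style of the apiserver
// error types; names here are illustrative.
package apierrors

import "fmt"

// NotLeaderError indicates that this controller cannot apply the raft
// operation because it is not the current raft leader.
type NotLeaderError struct {
	serverAddress string
	serverID      string
}

func (e *NotLeaderError) Error() string {
	return fmt.Sprintf("not currently the raft leader, try %q (%s)", e.serverID, e.serverAddress)
}

// NewNotLeaderError creates a typed error that the API layer can encode
// for the client, identifying where the leader can be found.
func NewNotLeaderError(serverAddress, serverID string) error {
	return &NotLeaderError{serverAddress: serverAddress, serverID: serverID}
}
```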
The raft context is a way to mediate between the RaftOpQueue, with a
worker listening on the other end, and the facades that need to enqueue
commands onto the queue. It's done like this so it doesn't expose the
raft instance arbitrarily to the facades, limiting abuse and side effects.

Additionally, having a queue in between the instance and the facade
allows us to apply back pressure and deadlines to enqueuing.
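A minimal sketch of the enqueue-with-deadline idea, assuming a simplified queue type rather than the real RaftOpQueue:

```go
// Illustrative sketch of an op queue between facades and the raft worker
// (the real RaftOpQueue API in Juju may differ).
package queue

import "context"

// Operation is a raft command plus a channel the worker uses to report
// the result of applying it.
type Operation struct {
	Command []byte
	Done    chan error
}

// OpQueue is all that facades see; the raft instance itself stays hidden
// behind the worker that drains this queue.
type OpQueue struct {
	ops chan Operation
}

// Enqueue applies back pressure and honours deadlines: if the worker
// cannot accept or apply the command before ctx expires, the caller gets
// an error instead of blocking forever.
func (q *OpQueue) Enqueue(ctx context.Context, cmd []byte) error {
	op := Operation{Command: cmd, Done: make(chan error, 1)}
	select {
	case q.ops <- op:
	case <-ctx.Done():
		return ctx.Err()
	}
	select {
	case err := <-op.Done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```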

The code is simple, but _very_ mechanical, as we have to thread the
queue from the machine manifolds all the way to the facade context.
The facade will interact with the underlying raft queue that is exposed
via the facade context.

The facade is ultra lightweight as we've designed the facade to not have
access to the underlying raft instance.
There were two issues here:

1. In TestManySecretsTrigger and similar, we send two separate changes
to the watcher channel. After the first one, the worker starts the
timer for an hour from now. After we send the second one, the worker
receives it and calls handleSecretRotateChanges, which calls
computeNextRotateTime, which calls the clock's Now().

But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().

If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the early return doesn't happen. The
fix for this is to either:

a) in the test, insert an advanceClock call of any amount between
   sending the first change and sending the second, or
b) change the logic in computeNextRotateTime to return early and
   retain the existing timeout if the next secret rotation time is
   after OR EQUAL TO the existing timeout (instead of just after);
   see the sketch after this explanation.

Better to be safe than sorry, so I've made both fixes.

2. The second issue was in TestNewSecretTriggersBefore: if the second
advanceClock happened after the worker received the second change
and calculated the new shorter timeout but BEFORE it actually reset
the timer, the timer wouldn't be fired because it's still set to 1h.
There are better ways to fix this test, but I've spent too long on
this, so I'm just inserting a testing.ShortWait delay in the test to
ensure the advanceClock happens after the timer is updated.
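A minimal sketch of fix (b)'s comparison, written as a pure helper rather than the real computeNextRotateTime in worker/secretrotate:

```go
package secretrotate

import "time"

// nextTimeout is an illustrative sketch of fix (b): retain the current
// timeout whenever the newly computed rotation time is after OR EQUAL TO
// it (>=, not just >), so a Now() value identical to the previous one can
// no longer defeat the early return. The real computeNextRotateTime is
// more involved.
func nextTimeout(current, next time.Time) time.Time {
	if !next.Before(current) { // next >= current: keep the existing timer
		return current
	}
	return next
}
```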
juju#13341

This PR introduces `quay.io` support for

- pulling `jujud-operator`, `juju-db` and `charm-base` images from k8s controllers, application operators, and sidecar pods;
- fetching available controller versions to upgrade for `upgrade-controller` command;

Driveby: ensures `ghcr.io` and `registry.gitlab.com` are only supported for private registries;

Tested registries are listed below (the rest of the registries will be implemented and tested in follow-up PRs):

| Registry | Public | Private |
| --- | --- | --- |
| azurecr.io | | |
| docker.io | ✅ | ✅ |
| ecr.aws | | |
| gcr.io | ✅ | ✅ |
| ghcr.io | ❌ | ✅ |
| registry.gitlab.com | ❌ | ✅ |
| quay.io | ✅ | ✅ |

## Checklist

 - [ ] ~Requires a [pylibjuju](https://github.com/juju/python-libjuju) change~
 - [ ] ~Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR~
 - [ ] ~Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed~
 - [x] Comments answer the question of why design decisions were made

## QA steps

```console
# docker login with your quay.io username and password
# copy the credential from ~/.docker/config.json
$ cat quay.json
{
  "auth": "xxxx==",
  "repository": "quay.io/juju_qa",
  "serveraddress": "quay.io"
}

$ juju bootstrap microk8s k1 --config caas-image-repo="'$(cat quay.json)'" --debug

$ mkubectl -ncontroller-k1 get pod/controller-0 -o json | jq .spec.imagePullSecrets

[
  {
    "name": "juju-image-pull-secret"
  }
]

$ mkubectl -ncontroller-k1 get secret/juju-image-pull-secret -o json | jq -r '.data[".dockerconfigjson"]' | base64 --decode | jq
{
  "auths": {
    "quay.io": {
      "auth": "xxxx==",
      "username": "",
      "password": "",
      "serveraddress": "quay.io"
    }
  }
}

$ juju upgrade-controller --agent-stream=develop


# public registry:
$ juju bootstrap microk8s k1 --config caas-image-repo="quay.io/juju_qa" --debug

$ juju upgrade-controller --agent-stream=develop
```

## Documentation changes

https://discourse.charmhub.io/t/initial-private-registry-support/5079

## Bug reference


https://bugs.launchpad.net/juju/+bug/1935830
https://bugs.launchpad.net/juju/+bug/1940820
https://bugs.launchpad.net/juju/+bug/1935953
The following is just a mechanical change, moving the queue from worker
to core. As rightly pointed out, importing workers from the apiserver
seems a little bit off.
The rehydration of the not-leader error didn't need to jump through if
statements to check for values; instead we can do it in one line.
As logging parameters aren't lazy, we have to check whether trace is enabled
before we try to trace.
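For example, a sketch of the guard (assuming a loggo-style logger, which is what Juju uses; the function name, call site and message are illustrative):

```go
package example

import "github.com/juju/loggo"

// traceEnqueue sketches the pattern: Tracef's arguments are evaluated
// eagerly, so the potentially expensive formatting is guarded by the
// level check.
func traceEnqueue(logger loggo.Logger, cmd []byte) {
	if logger.IsTraceEnabled() {
		logger.Tracef("enqueueing raft command (%d bytes): %x", len(cmd), cmd)
	}
}
```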
The following handles the not-leader error differently from a normal
error. If we get a not-leader error, we should stop processing any more
leases, mark all subsequent errors as the same, and then exit out.
Normal errors we do want to keep iterating on, as we just don't know if we
can bail out early unless we have more typed error cases.
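A sketch of that distinction, with illustrative types and callbacks standing in for the real lease-application code:

```go
package example

// leaseOp is an illustrative stand-in for a lease operation.
type leaseOp struct {
	Key string
}

// applyAll sketches the behaviour: a not-leader error stops processing
// and is copied to every remaining operation, while ordinary errors are
// recorded per lease and iteration continues.
func applyAll(ops []leaseOp, apply func(leaseOp) error, isNotLeader func(error) bool) []error {
	errs := make([]error, len(ops))
	for i, op := range ops {
		err := apply(op)
		if err != nil && isNotLeader(err) {
			// Not the leader: mark this and all subsequent ops with the
			// same error and exit early.
			for j := i; j < len(ops); j++ {
				errs[j] = err
			}
			return errs
		}
		// A normal error: record it and keep iterating, since we can't
		// tell whether later ops would succeed.
		errs[i] = err
	}
	return errs
}
```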
The following is a WIP to clean up the StreamDebugLogs goroutine if
the callee wants to cancel the streaming of the logs.

I noticed the TODO whilst threading through a new debug-log parameter.
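A minimal sketch of the cleanup pattern, assuming a context-style cancellation signal rather than the exact plumbing used by StreamDebugLogs:

```go
package example

import "context"

// streamLogs sketches tying a log-streaming goroutine's lifetime to a
// context, so that cancellation cleans the goroutine up instead of
// leaking it. The channel shapes are illustrative.
func streamLogs(ctx context.Context, lines <-chan string, out chan<- string) {
	go func() {
		defer close(out)
		for {
			select {
			case <-ctx.Done():
				return // streaming cancelled: stop and exit
			case line, ok := <-lines:
				if !ok {
					return
				}
				select {
				case out <- line:
				case <-ctx.Done():
					return
				}
			}
		}
	}()
}
```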
juju#13342

The raft context is a way to mediate between the RaftOpQueue, with a
worker listening on the other end, and the facades that need to enqueue
commands onto the queue. It's done like this so it doesn't expose the
raft instance arbitrarily to the facades, limiting abuse and side effects.

Additionally, having a queue in between the instance and the facade
allows us to apply back pressure and deadlines to enqueuing.

The code is simple, but _very_ mechanical, as we have to thread the
queue from the machine manifolds all the way to the facade context.

## QA steps

```sh
$ juju bootstrap lxd test
$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE;juju.core.raftlease=TRACE;juju.worker.globalclockupdater.raft=TRACE;juju.worker.raft=TRACE"
$ juju enable-ha
$ juju debug-log -m controller
```
juju#13119

The following cleans up the StreamDebugLogs goroutine if
the callee wants to cancel the streaming of the logs.

I noticed the TODO whilst threading through a new debug-log parameter.


## QA steps

Tests pass.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1644084
The following adds an API client for the raft lease. The client is super
simple; the actual raft lease client that supplements this, on the
other hand, will not be.

For now the API client mirrors other clients.
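For flavour, this is roughly the shape Juju facade clients take; the facade name, method name and argument type below are assumptions for illustration, not necessarily what this commit adds:

```go
package raftlease

import (
	"github.com/juju/errors"

	"github.com/juju/juju/api/base"
	"github.com/juju/juju/apiserver/params"
)

// Client mirrors the shape of other Juju facade clients.
type Client struct {
	facade base.FacadeCaller
}

// NewClient wraps an API connection in a facade caller. The facade name
// is an illustrative guess.
func NewClient(caller base.APICaller) *Client {
	return &Client{facade: base.NewFacadeCaller(caller, "RaftLease")}
}

// leaseCommands is an illustrative argument struct; the real wire types
// would live in apiserver/params.
type leaseCommands struct {
	Commands []string `json:"commands"`
}

// ApplyLeaseCommands asks the controller to apply raw lease commands.
func (c *Client) ApplyLeaseCommands(commands []string) error {
	var results params.ErrorResults
	err := c.facade.FacadeCall("ApplyLease", leaseCommands{Commands: commands}, &results)
	if err != nil {
		return errors.Trace(err)
	}
	return results.Combine()
}
```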
juju#13348

The worker/secretrotate tests were failing intermittently in CI, particularly during race testing (where there were additional delays involved, not because of the race detection per se).

There were two issues here:

1. In TestManySecretsTrigger and similar, we send two separate changes
to the watcher channel. After the first one, the worker starts the
timer for an hour from now. After we send the second one, the worker
receives it and calls handleSecretRotateChanges, which calls
computeNextRotateTime, which calls the clock's Now().

But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().

If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the early return doesn't happen. The
fix for this is to either:

a) in the test, insert an advanceClock call of any amount between
 sending the first change and sending the second, or
b) change the logic in computeNextRotateTime to return early and
 retain the existing timeout if the next secret rotation time is
 after OR EQUAL TO the existing timeout (instead of just after).

Better to be safe than sorry, so I've made both fixes. Also made the
similar >= vs > fix to the timeout code in the lease manager.

2. The second issue was in TestNewSecretTriggersBefore: if the second
advanceClock happened after the worker received the second change
and calculated the new shorter timeout but BEFORE it actually reset
the timer, the timer wouldn't be fired because it's still set to 1h.
There are better ways to fix this test, but I've spent too long on
this, so I'm just inserting a testing.ShortWait delay in the test to
ensure the advanceClock happens after the timer is updated.

```sh
go test -count=10 -v -race ./worker/secretrotate/... -check.v -check.vv
```
Instance profiles created by the controller now contain the tag of the
controller that they are made for.
juju#13349

Hopefully the final fix for cross model relations issues observed on site.
When an offer is removed, any remote proxies for the consuming side (in the offering model) also need to be deleted. This wasn't always happening, partly because the removal depended on there being relations in play. The logic to remove the app proxies has been moved so it is always run even if there are no relations.

Because there could be orphaned remote proxies, an upgrade step is added to delete them.

Finally, to guard against the possibility that reference counting across models might get out of sync and block relation and unit removal (in a single model things are done inside a txn but that's not possible across models), if the txn slice fails to be applied the first time, cross model relations are more forcibly removed the second time.

## QA steps

I used the repro scenario from site, primarily this bundle deployed to a model called `ceph`:
```
series: bionic
applications:
  ceph-mon:
    charm: cs:ceph-mon-55
    num_units: 3
    options:
      default-rbd-features: 1
      expected-osd-count: 3
      permit-insecure-cmr: True
  ceph-osd:
    charm: cs:ceph-osd-310
    num_units: 3
    options:
      osd-devices: /srv/ceph-osd
      bluestore: False
relations:
- - ceph-mon:osd
  - ceph-osd:mon
```

Then set up the relations:
```
juju offer ceph-mon:client ceph-mon-client
juju offer ceph-mon:radosgw ceph-mon-radogw

juju add-model consume
juju deploy cs:ceph-radosgw-296 --series bionic
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon
juju deploy glance --series bionic
juju relate admin/ceph.ceph-mon-client glance
juju deploy cs:keystone --series bionic
juju deploy cs:percona-cluster --series bionic
juju config percona-cluster min-cluster-size=1
juju relate glance percona-cluster
juju relate keystone percona-cluster
juju relate keystone glance
juju deploy cs:nova-compute --series bionic
juju relate nova-compute:image-service glance:image-service
juju relate nova-compute:ceph ceph-mon-client
```

You can now repeatedly run:
```
juju switch ceph
juju remove-offer ceph-mon-radogw --force -y
juju offer ceph-mon:radosgw ceph-mon-radogw
juju switch consume
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon
```

For each iteration, you can see that the relation created, joined, departed, and broken hooks have run:
```
juju show-status-log -m consume ceph-radosgw/0
juju show-status-log -m ceph ceph-mon/0
```

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
The following adds a raftlease client. The aim is to provide a robust
way of sending requests whilst being honest about the errors for
requests that fail. In particular, we drop requested commands when we don't have a
connection and it's up to the callee to retry again. Previous
implementations just dropped them on the floor.
juju#13346

~~**Requires juju#13342 to land first**~~

The following adds a raftlease client. The aim is to provide a robust
way of sending requests whilst being honest about the errors for
requests that fail. In particular, we drop requested commands when we don't have a
connection and it's up to the callee to retry again. Previous
implementations just dropped them on the floor.
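A sketch of the "reject rather than silently drop" behaviour, with an illustrative Remote type standing in for the real client:

```go
package raftlease

import "errors"

// ErrNoConnection is returned when a command cannot be sent because there
// is no live connection; retrying is left to the caller. Type and field
// names here are illustrative.
var ErrNoConnection = errors.New("no raft connection available; command dropped, retry later")

type Remote struct {
	connected bool
	send      func(command string) error
}

// Request refuses the command with an explicit error when disconnected,
// rather than silently dropping it on the floor.
func (r *Remote) Request(command string) error {
	if !r.connected {
		return ErrNoConnection
	}
	return r.send(command)
}
```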

## QA steps

Current tests pass. As this hasn't been turned on yet, we just need
to ensure that existing implementations work.

```sh
$ juju bootstrap lxd test
$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE;juju.core.raftlease=TRACE;juju.worker.globalclockupdater.raft=TRACE;juju.worker.raft=TRACE"
$ juju enable-ha
$ juju debug-log -m controller
```
wallyworld and others added 10 commits September 24, 2021 09:53
juju#13353

When removing and re-consuming a cross model offer, it's possible that the remote proxy might be removed in the middle of the remote relation worker updating things like the macaroon to use. This PR handles that sort of case and makes the worker more resilient to not-found errors.
Also, when a consuming saas application is dying, we ignore status updates from the offering side as these can interfere with a clean status history.
On the offering side, we use "force" to destroy the consuming app proxy to ensure it is removed; otherwise it's possible a dying entity can remain and wedge the offering side's processing of events.

## QA steps

Deploy a cmr scenario like in juju#13349 and on the consuming side, remove the saas app and check logs for errors.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
juju#13354

The offering model creates a proxy for the consuming app. If the consuming side does a `remove-saas` and then consumes the same offer again, the offering model might retain stale info, especially if it is down when the consuming side does a force delete.

This PR uses an incrementing version on the consuming side. If the offering side gets a new connection and the version is newer than what it has, it force deletes the consuming proxy and starts again. This is analogous to what is done on the consuming side then the offer is force removed and then consumed again.

This is safe to do without a facade bump because the default version will be 0.
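A sketch of the version comparison, with a hypothetical helper standing in for the real offering-side logic:

```go
package example

// maybeRecreateProxy sketches the offering-side check: a newer consume
// version means the consuming side removed and re-consumed the offer, so
// any proxy we hold is stale and is force-removed before carrying on.
// The helper is illustrative, not Juju's actual code.
func maybeRecreateProxy(storedVersion, incomingVersion int, forceRemoveProxy func() error) (recreated bool, err error) {
	if incomingVersion > storedVersion {
		if err := forceRemoveProxy(); err != nil {
			return false, err
		}
		return true, nil
	}
	return false, nil
}
```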

## QA steps

I tested this on top of juju#13353 but rebased against 2.9 to make this PR.

Deploy a cmr scenario like in juju#13349 and on the consuming side, remove the saas app and consume again.
Ensure relation hooks are run on both sides of the relation.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
juju#13352

Instance profiles created by the controller now contain the tag of the
controller that they are made for.

## Checklist

 - [x] Requires a [pylibjuju](https://github.com/juju/python-libjuju) change
 - [x] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR
 - [x] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed
 - [x] Comments answer the question of why design decisions were made

## QA steps

Bootstrap an AWS controller with `juju bootstrap aws/ap-southeast-2 --bootstrap-constraints="instance-role=auto" testmctestface`

Check that the corresponding Instance Profile has the controller uuid tag set with:

`aws iam list-instance-profile-tags --instance-profile-name "juju-controller-testmctestface"`
Only request container images from simple-streams when searching for LXD images by alias.

VM images are not usable for containers.
juju#13359

The linked bug describes an issue where searching simple-streams for LXD images by alias sometimes returns VM images, which are unsuitable for containers.

Here we ensure that we only ever look on remote servers for _container_ images specifically.

## QA steps

Bootstrap, add some machines and ensure `juju add-machine lxd:<machine ID>` always results in successful provisioning.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1943088
juju#13360

The `streams.canonical.com` server now supports serving gzip encoding of simplestreams metadata.
This PR simply adds tests to ensure we haven't accidentally disabled compression on the HTTP transport, which would
prevent the default gzip handling from being done by the underlying transport.

Also a few drive-by lint fixes.
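A sketch of the kind of assertion involved: Go's http.Transport requests and decodes gzip automatically unless DisableCompression is set, so the test mainly needs to confirm that flag is left false (the constructor below is a stand-in, not Juju's actual code):

```go
package simplestreams_test

import (
	"net/http"
	"testing"
)

// newStreamsTransport is a stand-in for however the metadata client
// builds its transport; the point is the assertion below.
func newStreamsTransport() *http.Transport {
	return &http.Transport{}
}

// TestCompressionNotDisabled asserts that DisableCompression has not been
// set, since setting it would stop the transport from transparently
// requesting and decoding gzip-encoded simplestreams metadata.
func TestCompressionNotDisabled(t *testing.T) {
	if newStreamsTransport().DisableCompression {
		t.Fatal("compression disabled on simplestreams transport")
	}
}
```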

## QA steps

Bootstrap any provider.
@manadart manadart added the 3.0 label Sep 27, 2021
@manadart

Merge 2.9 into develop

@hpidcock

$$merge$$

@jujubot jujubot merged commit 101e9b7 into juju:develop Sep 28, 2021
@manadart manadart deleted the 2.9-into-develop branch September 28, 2021 07:27