Merge 2.9 into develop #13361
Merged
Adds a new error type, Not Leader, to the API server error types. This will be used to tell the client that the server cannot process the raft application without moving the request to the leader. Additionally, I've moved the error types to a new file to make the code much easier to understand.
The facade will interact with the underlying raft queue that is exposed via the facade context. The facade is ultra lightweight, as we've designed it not to have access to the underlying raft instance.
juju#13341

This PR introduces `quay.io` support for:

- pulling `jujud-operator`, `juju-db` and `charm-base` images from k8s controllers, application operators, and sidecar pods;
- fetching available controller versions to upgrade for the `upgrade-controller` command.

Driveby: ensures `ghcr.io` and `registry.gitlab.com` are only supported for private registries.

Tested registries are (the rest of the registries will be implemented and tested in the following PRs):

| Registry | Public | Private |
| --- | --- | --- |
| azurecr.io | | |
| docker.io | ✅ | ✅ |
| ecr.aws | | |
| gcr.io | ✅ | ✅ |
| ghcr.io | ❌ | ✅ |
| registry.gitlab.com | ❌ | ✅ |
| quay.io | ✅ | ✅ |

## Checklist

- [ ] ~Requires a [pylibjuju](https://github.com/juju/python-libjuju) change~
- [ ] ~Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR~
- [ ] ~Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed~
- [x] Comments answer the question of why design decisions were made

## QA steps

```console
# docker login with your quay.io username and password,
# then copy the credential from ~/.docker/config.json
$ cat quay.json
{
    "auth": "xxxx==",
    "repository": "quay.io/juju_qa",
    "serveraddress": "quay.io"
}
$ juju bootstrap microk8s k1 --config caas-image-repo="'$(cat quay.json)'" --debug
$ mkubectl -ncontroller-k1 get pod/controller-0 -o json | jq .spec.imagePullSecrets
[
  {
    "name": "juju-image-pull-secret"
  }
]
$ mkubectl -ncontroller-k1 get secret/juju-image-pull-secret -o json | jq -r '.data[".dockerconfigjson"]' | base64 --decode | jq
{
  "auths": {
    "quay.io": {
      "auth": "xxxx==",
      "username": "",
      "password": "",
      "serveraddress": "quay.io"
    }
  }
}
$ juju upgrade-controller --agent-stream=develop

# public registry:
$ juju bootstrap microk8s k1 --config caas-image-repo="quay.io/juju_qa" --debug
$ juju upgrade-controller --agent-stream=develop
```

## Documentation changes

https://discourse.charmhub.io/t/initial-private-registry-support/5079

## Bug reference

https://bugs.launchpad.net/juju/+bug/1935830
https://bugs.launchpad.net/juju/+bug/1940820
https://bugs.launchpad.net/juju/+bug/1935953
The following is just a mechanical change, moving the queue from worker to core. As rightly pointed out, importing workers from the apiserver seems a little bit off.
The rehydration of the not-leader error didn't need to jump through if statements to check for values; instead we can do it in one line.
As logging parameters aren't evaluated lazily, we have to check whether trace is enabled before we trace.
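The reason for the guard is that Go evaluates a call's arguments eagerly, so an expensive value passed to a trace call is computed even when the output is discarded. A small sketch with a hypothetical logger type (the guard pattern is the point, not the names):

```go
package main

import "fmt"

// logger is a toy level-gated logger for illustration.
type logger struct{ traceEnabled bool }

func (l *logger) IsTraceEnabled() bool { return l.traceEnabled }

func (l *logger) Tracef(format string, args ...interface{}) {
	if l.traceEnabled {
		fmt.Printf(format+"\n", args...)
	}
}

// expensiveDump stands in for a costly formatting call; it counts how
// many times it actually ran.
func expensiveDump(calls *int) string {
	*calls++
	return "big dump"
}

func main() {
	log := &logger{traceEnabled: false}
	calls := 0

	// Unguarded: expensiveDump runs even though nothing is logged.
	log.Tracef("state: %s", expensiveDump(&calls))

	// Guarded: the expensive call is skipped entirely when trace is off.
	if log.IsTraceEnabled() {
		log.Tracef("state: %s", expensiveDump(&calls))
	}

	fmt.Println(calls)
}
```

With trace disabled, the counter shows only the unguarded call paid the formatting cost.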
The following handles the not-leader error differently from a normal error. If we get a not-leader error, we should stop processing any more leases, mark all subsequent errors as the same, and then exit. For normal errors, we do want to keep iterating, as we just don't know if we can bail out early unless we have more typed error cases.
juju#13342

The raft context is a way to mediate between the RaftOpQueue, with a worker listening on the other end, and the facades that enqueue commands onto the queue. It's done like this so it doesn't expose the raft instance arbitrarily to the facades, limiting abuse and side effects. Additionally, having a queue between the instance and the facade allows us to apply back pressure and deadlines to enqueuing. The code is simple, but _very_ mechanical, as we have to thread the queue from the machine manifolds all the way to the facade context.

## QA steps

```sh
$ juju bootstrap lxd test
$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE;juju.core.raftlease=TRACE;juju.worker.globalclockupdater.raft=TRACE;juju.worker.raft=TRACE"
$ juju enable-ha
$ juju debug-log -m controller
```
juju#13119

The following cleans up the StreamDebugLogs goroutine if the caller wants to cancel the streaming of the logs. I noticed the TODO whilst threading through a new debug-log parameter.

## QA steps

Existing tests pass.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1644084
The following adds an API client for the raft lease. This client is super simple; the actual raft lease client that supplements it, on the other hand, will not be. For now the API client mirrors other clients.
juju#13348

The worker/secretrotate tests were failing intermittently in CI, particularly during race testing (where there were additional delays involved, not because of the race detection per se). There were two issues here:

1. In TestManySecretsTrigger and similar, we send two separate changes to the watcher channel. After the first one, the worker starts the timer for an hour from now. After we send the second one, the worker receives it and calls handleSecretRotateChanges, which calls computeNextRotateTime, which calls the clock's Now(). But meanwhile the test calls WaitAdvance to advance the clock by 90 minutes, and it's a race: if the worker calls Now() after the clock is advanced, the test succeeds, because computeNextRotateTime retains the existing rotate time and returns, then the worker receives the timeout and calls rotate(). If the worker calls Now() before the clock is advanced, the test fails, because computeNextRotateTime gets a Now() value that's exactly the same as the previous time, and the early return doesn't happen. The fix is either: a) in the test, insert an advanceClock call of any amount between sending the first change and sending the second, or b) change the logic in computeNextRotateTime to return early and retain the existing timeout if the next secret rotation time is after OR EQUAL TO the existing timeout (instead of just after). Better to be safe than sorry, so I've made both fixes. Also made the similar >= vs > fix to the timeout code in the lease manager.

2. The second issue was in TestNewSecretTriggersBefore: if the second advanceClock happened after the worker received the second change and calculated the new shorter timeout, but BEFORE it actually reset the timer, the timer wouldn't fire because it's still set to 1h. There are better ways to fix this test, but I've spent too long on this, so I'm just inserting a testing.ShortWait delay in the test to ensure the advanceClock happens after the timer is updated.

## QA steps

```sh
go test -count=10 -v -race ./worker/secretrotate/... -check.v -check.vv
```
…nted cmr txn logic which is not needed
juju#13349

Hopefully the final fix for cross model relations issues observed on site. When an offer is removed, any remote proxies for the consuming side (in the offering model) also need to be deleted. This wasn't always happening, partly because the removal depended on there being relations in play. The logic to remove the app proxies has been moved so it is always run, even if there are no relations. Because there could be orphaned remote proxies, an upgrade step is added to delete them. Finally, to guard against the possibility that reference counting across models might get out of sync and block relation and unit removal (in a single model things are done inside a txn, but that's not possible across models), if the txn slice fails to be applied the first time, cross model relations are more forcibly removed the second time.

## QA steps

I used the repro scenario from site, primarily this bundle deployed to a model called `ceph`:

```
series: bionic
applications:
  ceph-mon:
    charm: cs:ceph-mon-55
    num_units: 3
    options:
      default-rbd-features: 1
      expected-osd-count: 3
      permit-insecure-cmr: True
  ceph-osd:
    charm: cs:ceph-osd-310
    num_units: 3
    options:
      osd-devices: /srv/ceph-osd
      bluestore: False
relations:
- - ceph-mon:osd
  - ceph-osd:mon
```

Then set up the relations:

```
juju offer ceph-mon:client ceph-mon-client
juju offer ceph-mon:radosgw ceph-mon-radogw
juju add-model consume
juju deploy cs:ceph-radosgw-296 --series bionic
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon
juju deploy glance --series bionic
juju relate admin/ceph.ceph-mon-client glance
juju deploy cs:keystone --series bionic
juju deploy cs:percona-cluster --series bionic
juju config percona-cluster min-cluster-size=1
juju relate glance percona-cluster
juju relate keystone percona-cluster
juju relate keystone glance
juju deploy cs:nova-compute --series bionic
juju relate nova-compute:image-service glance:image-service
juju relate nova-compute:ceph ceph-mon-client
```

You can now repeatedly:

```
juju switch ceph
juju remove-offer ceph-mon-radogw --force -y
juju offer ceph-mon:radosgw ceph-mon-radogw
juju switch consume
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon
```

For each iteration, you can see the relation created, joined, departed, and broken hooks have run:

```
juju show-status-log -m consume ceph-radosgw/0
juju show-status-log -m ceph ceph-mon/0
```

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
juju#13346

~~**Requires juju#13342 to land first**~~

The following adds a raftlease client. The aim is to provide a robust way of sending requests whilst being honest about the errors when requests fail. In particular, we reject requested commands when we don't have a connection, and it's up to the caller to retry. Previous implementations just dropped them on the floor.

## QA steps

Current tests pass. As this hasn't been turned on yet, we just need to ensure that existing implementations work.

```sh
$ juju bootstrap lxd test
$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE;juju.core.raftlease=TRACE;juju.worker.globalclockupdater.raft=TRACE;juju.worker.raft=TRACE"
$ juju enable-ha
$ juju debug-log -m controller
```
juju#13353

When removing and re-consuming a cross model offer, it's possible that the remote proxy might be removed in the middle of the remote relation worker updating things like the macaroon to use. This PR handles that sort of case and makes the worker more resilient to not-found errors. Also, when a consuming saas application is dying, we ignore status updates from the offering side, as these can interfere with a clean status history. On the offering side, we use "force" to destroy the consuming app proxy to ensure it is removed; otherwise it's possible a dying entity can remain and wedge the offering side's processing of events.

## QA steps

Deploy a cmr scenario like in juju#13349 and, on the consuming side, remove the saas app and check logs for errors.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
juju#13354

The offering model creates a proxy for the consuming app. If the consuming side does a `remove-saas` and then consumes the same offer again, the offering model might retain stale info, especially if it is down when the consuming side does a force delete. This PR uses an incrementing version on the consuming side. If the offering side gets a new connection and the version is newer than what it has, it force deletes the consuming proxy and starts again. This is analogous to what is done on the consuming side when the offer is force removed and then consumed again. This is safe to do without a facade bump because the default version will be 0.

## QA steps

I tested this on top of juju#13353 but rebased against 2.9 to make this PR. Deploy a cmr scenario like in juju#13349 and, on the consuming side, remove the saas app and consume again. Ensure relation hooks are run on both sides of the relation.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
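The version check described above reduces to a single comparison; this sketch uses a hypothetical helper name, and the zero default is what makes the change compatible with older clients that never send a version:

```go
package main

import "fmt"

// shouldRecreateProxy reports whether the offering side should force
// delete its stale consumer proxy and start again. storedVersion is
// what the offering model recorded; offeredVersion arrives with the
// new connection. Older consumers send nothing, which decodes as 0,
// so existing deployments compare equal and are left alone.
func shouldRecreateProxy(storedVersion, offeredVersion int) bool {
	return offeredVersion > storedVersion
}

func main() {
	fmt.Println(shouldRecreateProxy(0, 1)) // re-consumed after force removal
	fmt.Println(shouldRecreateProxy(1, 1)) // same consumption, keep proxy
}
```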
Increment juju to 2.9.16
juju#13352

Instance profiles created by the controller now contain the controller tag that they are made for.

## Checklist

- [x] Requires a [pylibjuju](https://github.com/juju/python-libjuju) change
- [x] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR
- [x] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed
- [x] Comments answer the question of why design decisions were made

## QA steps

Bootstrap an AWS controller with:

```
juju bootstrap aws/ap-southeast-2 --bootstrap-constraints="instance-role=auto" testmctestface
```

Check that the corresponding instance profile has the controller uuid tag set with:

```
aws iam list-instance-profile-tags --instance-profile-name "juju-controller-testmctestface"
```
juju#13359

The linked bug describes an issue where searching simple-streams for LXD images by alias sometimes returns VM images, which are unsuitable for containers. Here we ensure that we only ever look on remote servers for _container_ images specifically.

## QA steps

Bootstrap, add some machines, and ensure `juju add-machine lxd:<machine ID>` always results in successful provisioning.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1943088
juju#13360

The `streams.canonical.com` server now supports serving gzip-encoded simple-streams metadata. This PR simply adds tests to ensure we haven't accidentally disabled compression on the HTTP transport, which would prevent the default gzip handling from being done by the underlying transport. Also a few drive-by lint fixes.

## QA steps

Bootstrap any provider.
manadart force-pushed the 2.9-into-develop branch from c424380 to 738c4f6 on September 27, 2021 08:57
wallyworld approved these changes on Sep 28, 2021
Merge from 2.9 to bring forward:
Conflicts (easy resolution):