Merge 2.9 into develop #13361

Merged: 41 commits, Sep 28, 2021

Conversation

manadart

Merge from 2.9 to bring forward:

Conflicts (easy resolution):

  • apiserver/common/crossmodel/interface.go
  • apiserver/errors/errors.go
  • apiserver/params/apierror.go
  • apiserver/testserver/server.go
  • scripts/win-installer/setup.iss
  • snap/snapcraft.yaml
  • version/version.go

SimonRichardson and others added 30 commits September 17, 2021 15:08
Adds a new error type - Not Leader - to the API server error types. This
will be used to tell the client that the server cannot process the raft
application, rather than the server trying to move the request to the leader itself.

Additionally, I've moved the error types to a new file to make it a lot
easier to understand the code.
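As a rough illustration of what such a typed error can look like (the package, field and constructor names below are a sketch, not necessarily the exact ones added to the apiserver error packages):

```go
// Sketch only: a typed "not leader" error in the style of the apiserver
// error types; names here are illustrative.
package apierrors

import "fmt"

// NotLeaderError indicates that this controller cannot apply the raft
// operation because it is not the current raft leader.
type NotLeaderError struct {
	serverAddress string
	serverID      string
}

func (e *NotLeaderError) Error() string {
	return fmt.Sprintf("not currently the raft leader, try %q (%s)", e.serverID, e.serverAddress)
}

// NewNotLeaderError creates a typed error that the API layer can encode
// for the client, identifying where the leader can be found.
func NewNotLeaderError(serverAddress, serverID string) error {
	return &NotLeaderError{serverAddress: serverAddress, serverID: serverID}
}
```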
The raft context is a way to mediate between the RaftOpQueue, with a
worker listening on the other end, and the facades that need to enqueue
commands onto the queue. It's done like this so it doesn't expose the
raft instance arbitrarily to the facades, limiting abuse and side effects.

Additionally, having a queue in between the instance and the facade
allows us to apply back pressure and deadlines to enqueuing.
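A minimal sketch of the enqueue-with-deadline idea, assuming a simplified queue type rather than the real RaftOpQueue:

```go
// Illustrative sketch of an op queue between facades and the raft worker
// (the real RaftOpQueue API in Juju may differ).
package queue

import "context"

// Operation is a raft command plus a channel the worker uses to report
// the result of applying it.
type Operation struct {
	Command []byte
	Done    chan error
}

// OpQueue is all that facades see; the raft instance itself stays hidden
// behind the worker that drains this queue.
type OpQueue struct {
	ops chan Operation
}

// Enqueue applies back pressure and honours deadlines: if the worker
// cannot accept or apply the command before ctx expires, the caller gets
// an error instead of blocking forever.
func (q *OpQueue) Enqueue(ctx context.Context, cmd []byte) error {
	op := Operation{Command: cmd, Done: make(chan error, 1)}
	select {
	case q.ops <- op:
	case <-ctx.Done():
		return ctx.Err()
	}
	select {
	case err := <-op.Done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```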

The code is simple, but _very_ mechanical, as we have to thread the
queue from the machine manifolds all the way to the facade context.
The facade will interact with the underlying raft queue that is exposed
via the facade context.

The facade is ultra lightweight as we've designed the facade to not have
access to the underlying raft instance.
There were two issues here:

1. In TestManySecretsTrigger and similar, we send two separate changes
to the watcher channel. After the first one, the worker starts the
timer for an hour from now. After we send the second one, the worker
receives it and calls handleSecretRotateChanges, which calls
computeNextRotateTime, which calls the clock's Now().

But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().

If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the early return doesn't happen. The
fix for this is to either:

a) in the test, insert an advanceClock call of any amount between
   sending the first change and sending the second, or
b) change the logic in computeNextRotateTime to return early and
   retain the existing timeout if the next secret rotation time is
   after OR EQUAL TO the existing timeout (instead of just after);
   see the sketch after this explanation.

Better to be safe than sorry, so I've made both fixes.

2. The second issue was in TestNewSecretTriggersBefore: if the second
advanceClock happened after the worker received the second change
and calculated the new shorter timeout but BEFORE it actually reset
the timer, the timer wouldn't be fired because it's still set to 1h.
There are better ways to fix this test, but I've spent too long on
this, so I'm just inserting a testing.ShortWait delay in the test to
ensure the advanceClock happens after the timer is updated.
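A minimal sketch of fix (b)'s comparison, written as a pure helper rather than the real computeNextRotateTime in worker/secretrotate:

```go
package secretrotate

import "time"

// nextTimeout is an illustrative sketch of fix (b): retain the current
// timeout whenever the newly computed rotation time is after OR EQUAL TO
// it (>=, not just >), so a Now() value identical to the previous one can
// no longer defeat the early return. The real computeNextRotateTime is
// more involved.
func nextTimeout(current, next time.Time) time.Time {
	if !next.Before(current) { // next >= current: keep the existing timer
		return current
	}
	return next
}
```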
juju#13341

This PR introduces `quay.io` support for

- pulling `jujud-operator`, `juju-db` and `charm-base` images from k8s controllers, application operators, and sidecar pods;
- fetching available controller versions to upgrade for `upgrade-controller` command;

Driveby: ensures `ghcr.io` and `registry.gitlab.com` are only supported for private registries;

Tested registries are listed below (the rest of the registries will be implemented and tested in follow-up PRs):

| Registry | Public | Private |
| --- | --- | --- |
| azurecr.io | | |
| docker.io | ✅ | ✅ |
| ecr.aws | | |
| gcr.io | ✅ | ✅ |
| ghcr.io | ❌ | ✅ |
| registry.gitlab.com | ❌ | ✅ |
| quay.io | ✅ | ✅ |

## Checklist

 - [ ] ~Requires a [pylibjuju](https://github.com/juju/python-libjuju) change~
 - [ ] ~Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR~
 - [ ] ~Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed~
 - [x] Comments answer the question of why design decisions were made

## QA steps

```console
# docker login with your quay.io username and password
# copy the credential from ~/.docker/config.json
$ cat quay.json
{
  "auth": "xxxx==",
  "repository": "quay.io/juju_qa",
  "serveraddress": "quay.io"
}

$ juju bootstrap microk8s k1 --config caas-image-repo="'$(cat quay.json)'" --debug

$ mkubectl -ncontroller-k1 get pod/controller-0 -o json | jq .spec.imagePullSecrets

[
  {
    "name": "juju-image-pull-secret"
  }
]

$ mkubectl -ncontroller-k1 get secret/juju-image-pull-secret -o json | jq -r '.data[".dockerconfigjson"]' | base64 --decode | jq
{
  "auths": {
    "quay.io": {
      "auth": "xxxx==",
      "username": "",
      "password": "",
      "serveraddress": "quay.io"
    }
  }
}

$ juju upgrade-controller --agent-stream=develop


# public registry:
$ juju bootstrap microk8s k1 --config caas-image-repo="quay.io/juju_qa" --debug

$ juju upgrade-controller --agent-stream=develop
```

## Documentation changes

https://discourse.charmhub.io/t/initial-private-registry-support/5079

## Bug reference


https://bugs.launchpad.net/juju/+bug/1935830
https://bugs.launchpad.net/juju/+bug/1940820
https://bugs.launchpad.net/juju/+bug/1935953
The following is just a mechanical change, moving the queue from worker
to core. As rightly pointed out, importing workers from the apiserver
seems a little bit off.
The rehydration of the not-leader error didn't need to jump through if
statements to check for values; instead we can do it in one line.
As logging parameters aren't lazy, we have to check whether trace is enabled
before we try to trace.
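For example, a sketch of the guard (assuming a loggo-style logger, which is what Juju uses; the function name, call site and message are illustrative):

```go
package example

import "github.com/juju/loggo"

// traceEnqueue sketches the pattern: Tracef's arguments are evaluated
// eagerly, so the potentially expensive formatting is guarded by the
// level check.
func traceEnqueue(logger loggo.Logger, cmd []byte) {
	if logger.IsTraceEnabled() {
		logger.Tracef("enqueueing raft command (%d bytes): %x", len(cmd), cmd)
	}
}
```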
The following handles the not-leader error differently from a normal
error. If we get a not-leader error, we should stop processing any more
leases, mark all subsequent errors as the same, and then exit out.
Normal errors we do want to keep iterating on, as we just don't know if we
can bail out early unless we have more typed error cases.
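A sketch of that distinction, with illustrative types and callbacks standing in for the real lease-application code:

```go
package example

// leaseOp is an illustrative stand-in for a lease operation.
type leaseOp struct {
	Key string
}

// applyAll sketches the behaviour: a not-leader error stops processing
// and is copied to every remaining operation, while ordinary errors are
// recorded per lease and iteration continues.
func applyAll(ops []leaseOp, apply func(leaseOp) error, isNotLeader func(error) bool) []error {
	errs := make([]error, len(ops))
	for i, op := range ops {
		err := apply(op)
		if err != nil && isNotLeader(err) {
			// Not the leader: mark this and all subsequent ops with the
			// same error and exit early.
			for j := i; j < len(ops); j++ {
				errs[j] = err
			}
			return errs
		}
		// A normal error: record it and keep iterating, since we can't
		// tell whether later ops would succeed.
		errs[i] = err
	}
	return errs
}
```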
The following is a WIP to clean up the StreamDebugLogs goroutine if
the callee wants to cancel the streaming of the logs.

I noticed the TODO whilst threading through a new debug-log parameter.
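A minimal sketch of the cleanup pattern, assuming a context-style cancellation signal rather than the exact plumbing used by StreamDebugLogs:

```go
package example

import "context"

// streamLogs sketches tying a log-streaming goroutine's lifetime to a
// context, so that cancellation cleans the goroutine up instead of
// leaking it. The channel shapes are illustrative.
func streamLogs(ctx context.Context, lines <-chan string, out chan<- string) {
	go func() {
		defer close(out)
		for {
			select {
			case <-ctx.Done():
				return // streaming cancelled: stop and exit
			case line, ok := <-lines:
				if !ok {
					return
				}
				select {
				case out <- line:
				case <-ctx.Done():
					return
				}
			}
		}
	}()
}
```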
juju#13342

The raft context is a way to mediate between the RaftOpQueue, with a
worker listening on the other end, and the facades that need to enqueue
commands onto the queue. It's done like this so it doesn't expose the
raft instance arbitrarily to the facades, limiting abuse and side effects.

Additionally, having a queue in between the instance and the facade
allows us to apply back pressure and deadlines to enqueuing.

The code is simple, but _very_ mechanical, as we have to thread the
queue from the machine manifolds all the way to the facade context.

## QA steps

```sh
$ juju bootstrap lxd test
$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE;juju.core.raftlease=TRACE;juju.worker.globalclockupdater.raft=TRACE;juju.worker.raft=TRACE"
$ juju enable-ha
$ juju debug-log -m controller
```
juju#13119

The following cleans up the StreamDebugLogs goroutine if
the callee wants to cancel the streaming of the logs.

I noticed the TODO whilst threading through a new debug-log parameter.


## QA steps

Tests pass.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1644084
The following adds an API client for the raft lease. The client is super
simple; the actual raft lease client that supplements this, on the
other hand, will not be.

For now the API client mirrors other clients.
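For flavour, this is roughly the shape Juju facade clients take; the facade name, method name and argument type below are assumptions for illustration, not necessarily what this commit adds:

```go
package raftlease

import (
	"github.com/juju/errors"

	"github.com/juju/juju/api/base"
	"github.com/juju/juju/apiserver/params"
)

// Client mirrors the shape of other Juju facade clients.
type Client struct {
	facade base.FacadeCaller
}

// NewClient wraps an API connection in a facade caller. The facade name
// is an illustrative guess.
func NewClient(caller base.APICaller) *Client {
	return &Client{facade: base.NewFacadeCaller(caller, "RaftLease")}
}

// leaseCommands is an illustrative argument struct; the real wire types
// would live in apiserver/params.
type leaseCommands struct {
	Commands []string `json:"commands"`
}

// ApplyLeaseCommands asks the controller to apply raw lease commands.
func (c *Client) ApplyLeaseCommands(commands []string) error {
	var results params.ErrorResults
	err := c.facade.FacadeCall("ApplyLease", leaseCommands{Commands: commands}, &results)
	if err != nil {
		return errors.Trace(err)
	}
	return results.Combine()
}
```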
juju#13348

The worker/secretrotate tests were failing intermittently in CI, particularly during race testing (where there were additional delays involved, not because of the race detection per se).

There were two issues here:

1. In TestManySecretsTrigger and similar, we send two separate changes
to the watcher channel. After the first one, the worker starts the
timer for an hour from now. After we send the second one, the worker
receives it and calls handleSecretRotateChanges, which calls
computeNextRotateTime, which calls the clock's Now().

But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().

If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the early return doesn't happen. The
fix for this is to either:

a) in the test, insert an advanceClock call of any amount between
 sending the first change and sending the second, or
b) change the logic in computeNextRotateTime to return early and
 retain the existing timeout if the next secret rotation time is
 after OR EQUAL TO the existing timeout (instead of just after).

Better to be safe than sorry, so I've made both fixes. Also made the
similar >= vs > fix to the timeout code in the lease manager.

2. The second issue was in TestNewSecretTriggersBefore: if the second
advanceClock happened after the worker received the second change
and calculated the new shorter timeout but BEFORE it actually reset
the timer, the timer wouldn't be fired because it's still set to 1h.
There are better ways to fix this test, but I've spent too long on
this, so I'm just inserting a testing.ShortWait delay in the test to
ensure the advanceClock happens after the timer is updated.

```sh
go test -count=10 -v -race ./worker/secretrotate/... -check.v -check.vv
```
Instance profiles created by the controller now contain the tag of the
controller that they are made for.
juju#13349

Hopefully the final fix for cross model relations issues observed on site.
When an offer is removed, any remote proxies for the consuming side (in the offering model) also need to be deleted. This wasn't always happening, partly because the removal depended on there being relations in play. The logic to remove the app proxies has been moved so it is always run even if there are no relations.

Because there could be orphaned remote proxies, an upgrade step is added to delete them.

Finally, to guard against the possibility that reference counting across models might get out of sync and block relation and unit removal (in a single model things are done inside a txn but that's not possible across models), if the txn slice fails to be applied the first time, cross model relations are more forcibly removed the second time.

## QA steps

I used the repro scenario from site, primarily this bundle deployed to a model called `ceph`:
```
series: bionic
applications:
  ceph-mon:
    charm: cs:ceph-mon-55
    num_units: 3
    options:
      default-rbd-features: 1
      expected-osd-count: 3
      permit-insecure-cmr: True
  ceph-osd:
    charm: cs:ceph-osd-310
    num_units: 3
    options:
      osd-devices: /srv/ceph-osd
      bluestore: False
relations:
- - ceph-mon:osd
  - ceph-osd:mon
```

Then set up the relations:
```
juju offer ceph-mon:client ceph-mon-client
juju offer ceph-mon:radosgw ceph-mon-radogw

juju add-model consume
juju deploy cs:ceph-radosgw-296 --series bionic
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon
juju deploy glance --series bionic
juju relate admin/ceph.ceph-mon-client glance
juju deploy cs:keystone --series bionic
juju deploy cs:percona-cluster --series bionic
juju config percona-cluster min-cluster-size=1
juju relate glance percona-cluster
juju relate keystone percona-cluster
juju relate keystone glance
juju deploy cs:nova-compute --series bionic
juju relate nova-compute:image-service glance:image-service
juju relate nova-compute:ceph ceph-mon-client
```

You can now repeatedly run:
```
juju switch ceph
juju remove-offer ceph-mon-radogw --force -y
juju offer ceph-mon:radosgw ceph-mon-radogw
juju switch consume
juju relate admin/ceph.ceph-mon-radogw ceph-radosgw:mon
```

For each iteration, you can see that the relation created, joined, departed, and broken hooks have run:
```
juju show-status-log -m consume ceph-radosgw/0
juju show-status-log -m ceph ceph-mon/0
```

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
The following adds a raftlease client. The aim is to provide a robust
way of sending requests whilst being honest about the errors for
requests that fail. In particular, we drop requested commands when we don't have a
connection and it's up to the callee to retry again. Previous
implementations just dropped them on the floor.
juju#13346

~~**Requires juju#13342 to land first**~~

The following adds a raftlease client. The aim is to provide a robust
way of sending requests whilst being honest about the errors for
requests that fail. In particular, we drop requested commands when we don't have a
connection and it's up to the callee to retry again. Previous
implementations just dropped them on the floor.
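A sketch of the "reject rather than silently drop" behaviour, with an illustrative Remote type standing in for the real client:

```go
package raftlease

import "errors"

// ErrNoConnection is returned when a command cannot be sent because there
// is no live connection; retrying is left to the caller. Type and field
// names here are illustrative.
var ErrNoConnection = errors.New("no raft connection available; command dropped, retry later")

type Remote struct {
	connected bool
	send      func(command string) error
}

// Request refuses the command with an explicit error when disconnected,
// rather than silently dropping it on the floor.
func (r *Remote) Request(command string) error {
	if !r.connected {
		return ErrNoConnection
	}
	return r.send(command)
}
```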

## QA steps

Current tests pass. As this hasn't been turned on yet, we just need
to ensure that existing implementations work.

```sh
$ juju bootstrap lxd test
$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE;juju.core.raftlease=TRACE;juju.worker.globalclockupdater.raft=TRACE;juju.worker.raft=TRACE"
$ juju enable-ha
$ juju debug-log -m controller
```
wallyworld and others added 10 commits September 24, 2021 09:53
juju#13353

When removing and re-consuming a cross model offer, it's possible that the remote proxy might be removed in the middle of the remote relation worker updating things like the macaroon to use. This PR handles that sort of case and makes the worker more resilient to not-found errors.
Also, when a consuming saas application is dying, we ignore status updates from the offering side as these can interfere with a clean status history.
On the offering side, we use "force" to destroy the consuming app proxy to ensure it is removed; otherwise it's possible a dying entity can remain and wedge the offering side's processing of events.

## QA steps

Deploy a cmr scenario like in juju#13349 and on the consuming side, remove the saas app and check logs for errors.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
juju#13354

The offering model creates a proxy for the consuming app. If the consuming side does a `remove-saas` and then consumes the same offer again, the offering model might retain stale info, especially if it is down when the consuming side does a force delete.

This PR uses an incrementing version on the consuming side. If the offering side gets a new connection and the version is newer than what it has, it force deletes the consuming proxy and starts again. This is analogous to what is done on the consuming side then the offer is force removed and then consumed again.

This is safe to do without a facade bump because the default version will be 0.
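A sketch of the version comparison, with a hypothetical helper standing in for the real offering-side logic:

```go
package example

// maybeRecreateProxy sketches the offering-side check: a newer consume
// version means the consuming side removed and re-consumed the offer, so
// any proxy we hold is stale and is force-removed before carrying on.
// The helper is illustrative, not Juju's actual code.
func maybeRecreateProxy(storedVersion, incomingVersion int, forceRemoveProxy func() error) (recreated bool, err error) {
	if incomingVersion > storedVersion {
		if err := forceRemoveProxy(); err != nil {
			return false, err
		}
		return true, nil
	}
	return false, nil
}
```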

## QA steps

I tested this on top of juju#13353 but rebased against 2.9 to make this PR.

Deploy a cmr scenario like in juju#13349 and on the consuming side, remove the saas app and consume again.
Ensure relation hooks are run on both sides of the relation.

## Bug reference

https://bugs.launchpad.net/charm-ceph-mon/+bug/1940983
juju#13352

Instance profiles created by the controller now contain the tag of the
controller that they are made for.

## Checklist

 - [x] Requires a [pylibjuju](https://github.com/juju/python-libjuju) change
 - [x] Added [integration tests](https://github.com/juju/juju/tree/develop/tests) for the PR
 - [x] Added or updated [doc.go](https://discourse.jujucharms.com/t/readme-in-packages/451) related to packages changed
 - [x] Comments answer the question of why design decisions were made

## QA steps

Bootstrap an AWS controller with `juju bootstrap aws/ap-southeast-2 --bootstrap-constraints="instance-role=auto" testmctestface`

Check that the corresponding Instance Profile has the controller uuid tag set with:

`aws iam list-instance-profile-tags --instance-profile-name "juju-controller-testmctestface"`
Only request container images from simple-streams when searching for LXD images by alias.

VM images are not usable for containers.
juju#13359

The linked bug describes an issue where searching simple-streams for LXD images by alias sometimes returns VM images, which are unsuitable for containers.

Here we ensure that we only ever look on remote servers for _container_ images specifically.

## QA steps

Bootstrap, add some machines and ensure `juju add-machine lxd:<machine ID>` always results in successful provisioning.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1943088
juju#13360

The `streams.canonical.com` server now supports serving gzip encoding of simplestreams metadata.
This PR simply adds tests to ensure we haven't accidentally disabled compression on the HTTP transport, which would
prevent the default gzip handling from being done by the underlying transport.

Also a few drive-by lint fixes.
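A sketch of the kind of assertion involved: Go's http.Transport requests and decodes gzip automatically unless DisableCompression is set, so the test mainly needs to confirm that flag is left false (the constructor below is a stand-in, not Juju's actual code):

```go
package simplestreams_test

import (
	"net/http"
	"testing"
)

// newStreamsTransport is a stand-in for however the metadata client
// builds its transport; the point is the assertion below.
func newStreamsTransport() *http.Transport {
	return &http.Transport{}
}

// TestCompressionNotDisabled asserts that DisableCompression has not been
// set, since setting it would stop the transport from transparently
// requesting and decoding gzip-encoded simplestreams metadata.
func TestCompressionNotDisabled(t *testing.T) {
	if newStreamsTransport().DisableCompression {
		t.Fatal("compression disabled on simplestreams transport")
	}
}
```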

## QA steps

Bootstrap any provider.
@manadart manadart added the 3.0 label Sep 27, 2021
@manadart

Merge 2.9 into develop

@hpidcock

$$merge$$

@jujubot jujubot merged commit 101e9b7 into juju:develop Sep 28, 2021
@manadart manadart deleted the 2.9-into-develop branch September 28, 2021 07:27