APIServer: Raft context #13342
Conversation
Adds a new error type, Not Leader, to the api server error types. This will be used to tell the client that the server cannot process the raft application, without trying to move the request to the leader. Additionally, I've moved the error types to a new file to make the code a lot easier to understand.
The raft context is a way to mediate between the RaftOpQueue, with a worker listening on the other end, and the facades, which enqueue commands onto the queue. It's done like this so it doesn't expose the raft instance arbitrarily to the facades, limiting abuse and side effects. Additionally, having a queue in between the instance and the facade allows us to apply back pressure and deadlines to enqueuing. The code is simple, but _very_ mechanical, as we have to thread the queue from the machine manifolds all the way to the facade context.
The facade will interact with the underlying raft queue that is exposed via the facade context. The facade is ultra-lightweight, as we've designed it to not have access to the underlying raft instance.
```diff
@@ -319,6 +320,9 @@ func AllFacades() *facade.Registry {
 	reg("ProxyUpdater", 1, proxyupdater.NewFacadeV1)
 	reg("ProxyUpdater", 2, proxyupdater.NewFacadeV2)
```
Why have extra whitespace here?
Readability more than anything.
```go
// topic).
func (facade *Facade) ApplyLease(args params.LeaseOperations) (params.ErrorResults, error) {
	results := make([]params.ErrorResult, len(args.Operations))
	for k, op := range args.Operations {
```
As the operations are applied sequentially, if the leader changes while an operation is being applied you may end up in a situation where ops [0, m] are applied and (m, n] fail.
It could be worth investigating whether we could work around this issue by creating a batch apply command which essentially is a list of the individual op.Command byte-streams and applying that in one go.
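A sketch of what such a batch command could look like, assuming a simple length-prefix framing (this is not Juju's actual wire format): the individual `op.Command` byte-streams are combined into one payload that the FSM can split and apply as a single raft log entry, avoiding the partially-applied state described above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// EncodeBatch frames each command with a uint32 length prefix, producing
// a single byte-stream that can be applied in one go.
func EncodeBatch(cmds [][]byte) []byte {
	var buf bytes.Buffer
	for _, c := range cmds {
		var n [4]byte
		binary.BigEndian.PutUint32(n[:], uint32(len(c)))
		buf.Write(n[:])
		buf.Write(c)
	}
	return buf.Bytes()
}

// DecodeBatch is the inverse, run inside the FSM's apply path.
func DecodeBatch(b []byte) ([][]byte, error) {
	var cmds [][]byte
	for len(b) > 0 {
		if len(b) < 4 {
			return nil, fmt.Errorf("truncated batch header")
		}
		n := binary.BigEndian.Uint32(b[:4])
		b = b[4:]
		if uint32(len(b)) < n {
			return nil, fmt.Errorf("truncated command")
		}
		cmds = append(cmds, b[:n])
		b = b[n:]
	}
	return cmds, nil
}

func main() {
	batch := EncodeBatch([][]byte{[]byte("claim"), []byte("extend")})
	cmds, _ := DecodeBatch(batch)
	fmt.Println(len(cmds), string(cmds[0]), string(cmds[1]))
}
```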
We can, but I wasn't going to do it for this PR, as we don't consolidate operations yet.
```diff
@@ -395,3 +396,7 @@ func IsCodeCloudRegionRequired(err error) bool {
 func IsCodeQuotaLimitExceeded(err error) bool {
 	return ErrCode(err) == CodeQuotaLimitExceeded
 }
+
+func IsCodeNotLeader(err error) bool {
```
I really think we should invest some time to create a juju-dev-helper tool to automate scaffolding work for repeated tasks. E.g.

```sh
juju-dev-helper add-server-error
juju-dev-helper add-facade
...
```
The dream.
The following is just a mechanical change, moving the queue from worker to core. As rightly pointed out, importing workers from the apiserver seems a little bit off.
The rehydration of the not-leader error didn't need to jump through if statements to check for values; instead we can do it in one line.
As logging parameters aren't lazily evaluated, we have to check whether trace is enabled before we try to trace.
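Since Go evaluates call arguments eagerly, the guard looks like this; the `Logger` here is a stub standing in for the real loggo logger:

```go
package main

import "fmt"

// Logger is a stub with loggo-like methods.
type Logger struct{ trace bool }

func (l Logger) IsTraceEnabled() bool { return l.trace }
func (l Logger) Tracef(format string, args ...interface{}) {
	fmt.Printf("TRACE "+format+"\n", args...)
}

// expensive simulates a costly argument we don't want to compute unless
// tracing is on; without the guard it would run regardless.
func expensive() string { return "rendered operation" }

func main() {
	logger := Logger{trace: false}
	if logger.IsTraceEnabled() { // guard: expensive() is never called
		logger.Tracef("applying op: %s", expensive())
	}
	fmt.Println("done")
}
```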
The following handles the not-leader error differently from a normal error. If we get a not-leader error, we should stop processing any more leases, mark all subsequent errors as the same, and then exit. For normal errors, we do want to keep iterating, as we just don't know if we can bail out early unless we have more typed error cases.
LGTM
#13346 ~~**Requires #13342 to land first**~~

The following adds a raftlease client. The aim is to provide a robust way of sending requests whilst being honest about the errors that occur. In particular, we drop requested commands when we don't have a connection, and it's up to the caller to retry. Previous implementations just dropped them on the floor.

## QA steps

Current tests pass. As this hasn't been turned on yet, we just need to ensure that existing implementations work.

```sh
$ juju bootstrap lxd test
$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE;juju.core.raftlease=TRACE;juju.worker.globalclockupdater.raft=TRACE;juju.worker.raft=TRACE"
$ juju enable-ha
$ juju debug-log -m controller
```
```go
// If no information is supplied, it is expected that the client performs their
// own algorithm to locate the leader (roundrobin or listen to the apidetails
// topic).
func (facade *Facade) ApplyLease(args params.LeaseOperations) (params.ErrorResults, error) {
```
`ApplyLeaseOp` might be a better name. A naive glance at this might suggest a lease claim, but they can be claims/extensions/revocations.
#13361 Merge from 2.9 to bring forward:

- #13360 from wallyworld/simplestreams-compression
- #13359 from manadart/2.9-lxd-container-images
- #13352 from tlm/aws-instance-profile
- #13358 from jujubot/increment-to-2.9.16
- #13354 from wallyworld/refresh-consume-proxy
- #13353 from wallyworld/cmr-consume-fixes
- #13346 from SimonRichardson/raft-api-client
- #13349 from wallyworld/remove-orphaned-cmrapps
- #13348 from benhoyt/fix-secretrotate-tests
- #13119 from SimonRichardson/pass-context
- #13342 from SimonRichardson/raft-facade
- #13341 from ycliuhw/feature/quay.io

Conflicts (easy resolution):

- apiserver/common/crossmodel/interface.go
- apiserver/errors/errors.go
- apiserver/params/apierror.go
- apiserver/testserver/server.go
- scripts/win-installer/setup.iss
- snap/snapcraft.yaml
- version/version.go