trustbundle: fixing spiffe trustanchor support to ensure delay in pro… #32369
Conversation
@costinm @stevenctl should review this with respect to the multicluster readiness work. This is a bit different since we need the roots for traffic to work.
	r.RetryCount = 0
}

func (r *trustAnchorEndpoint) AddCert(cert string) {
Can this be private?
pilot/pkg/trustbundle/trustbundle.go
Outdated
	}
} else {
	if !remoteEndpoints[endpointURI].IncrementRetryOrFail() {
		remoteTrustAnchor.PurgeCerts()
So after 1 day we purge it? This feels a bit odd to me; it seems like it's either too high or too low. Either we need to purge the certs when the endpoint is down, in which case we ought to do it sooner than a day, or we don't need to purge at all, in which case we never should.
Not sure I agree. There is clearly a need to delay removal of trust anchors (in the event that an endpoint becomes temporarily unavailable) so that data-plane traffic is not affected by endpoint flakiness.
However, I am also concerned about keeping a trust anchor in the mesh long after an endpoint becomes unreachable. What if the user is unaware of this and doesn't change meshConfig to remove the endpoint? Is the expectation that istiod will probably be rebooted at some point, which will clean up the defunct remote trust anchor? Could we increase this timeout to a week, maybe?
I understand the concern - I think the tricky thing is that there is no "good" solution, just the least bad one.
A similar issue is leaving "dead" multicluster configs around - I don't think it's right for us to prune them. We should make it very clear to users that something is wrong, with logs and metrics, but I think removing it goes a bit too far.
Especially when we have logic that is exercised only after a day, it is concerning that there is such a large gap between something going wrong and its impact becoming visible.
If traffic would fail before these are processed, we should wait to push. That means checking for some condition along the lines of istio/pilot/pkg/bootstrap/server.go, line 820 (at 21a12a7).
But to avoid blocking istiod startup indefinitely, we should mark ourselves ready regardless of true success after this timeout: istio/pilot/pkg/features/pilot.go, line 357 (at 21a12a7).
(The name of that flag isn't great for this use case.)
@howardjohn - as per the initial discussion, we would wait for certs from all remote endpoints to be fetched before we declared ourselves ready. Are we also going to wait until the certs have been transferred via xDS to the proxy? Is there any easy canonical way to determine whether the PCDS update has been pushed to all the proxies?
Istiod shouldn't wait until the proxy gets PCDS - a proxy cannot connect to istiod until istiod is ready, so that would be a circular dependency. But the proxies themselves should wait until they get it. This might already happen by blocking SDS, which in turn blocks Envoy readiness, but I am not sure if it's implemented that way.
	}
	tb.initDone.Store(false)
nit: I think the expected usage is initDone: atomic.NewBool(false), storing a pointer.
@@ -827,6 +827,9 @@ func (s *Server) cachesSynced() bool {
	if !s.configController.HasSynced() {
		return false
	}
	if !s.workloadTrustBundle.HasSynced() {
My understanding of how this works (correct me if I am wrong): istiod starts up and we try to fetch the bundle. If it succeeds, we are good to go. If it fails, we wait 30 minutes to try again, and after 30s we mark ourselves ready.
So I am not sure what the point of waiting 30s to mark ourselves ready is. We should either retry aggressively or mark ourselves ready immediately (since we are not retrying in the meantime anyway)?
I understand. My vote would be to mark ourselves ready immediately. Note that we poll each configured endpoint for a full 10 seconds waiting for it to become available. Lack of availability after 10s usually implies (a) network issues (congestion/connectivity) or (b) the endpoint is not up. In these scenarios, I think it is better not to be aggressive: back off, and try again after 30 minutes.
So if the SPIFFE bundle endpoint is down for ~10 seconds (possible even with five-nines reliability), I will fail to fetch it. We mark ourselves ready. New proxies connect, get no root cert, and all mTLS fails as a result with obscure TLS errors. We will not recover for 30 minutes.
This seems like a really dangerous idea... am I missing some reason that this is safe?
It seems equivalent to failing open if we fail to read AuthorizationPolicy from the k8s api-server.
Hmm, I see the concern here. This isn't safe in the scenario you described. However, as we discussed earlier, we don't want to unnecessarily delay istiod from starting up for this. Perhaps we could use exponential backoff here in the event of failure, until we get all certs from all remote trust anchors?
	var ok bool
	var err error

	tb.mutex.Lock()
@howardjohn - there are two goroutines that call this function, hence grabbing the write locks. The scope of this function is large now and could get larger (reading/writing other parts of the trust bundle). As I discussed earlier, I think it's better if this function is only ever accessed from one goroutine, given its scope. This would require all meshConfig updates to be written into a channel and processed by the same goroutine that processes remote trust anchor updates. Does this sound good?
/retest
@shankgan: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-06-10. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions. Created by the issue and PR lifecycle manager.
@shankgan: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
…pagation if remote endpoints are unreachable
Implements part of the fix for #31497
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Pull Request Attributes
Please check any characteristics that apply to this pull request.
[ ] Does not have any changes that may affect Istio users.