
trustbundle: fixing spiffe trustanchor support to ensure delay in propagation if remote endpoints are unreachable #32369

Closed
@shankgan wants to merge 2 commits from the spiffe_fixes branch

Conversation

@shankgan (Contributor) commented Apr 21, 2021


Implements part of the fix for #31497

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Pull Request Attributes

Please check any characteristics that apply to this pull request.

[ ] Does not have any changes that may affect Istio users.

@shankgan requested a review from a team as a code owner April 21, 2021 20:25
@google-cla bot added the "cla: yes" label (the author has signed the Google CLA) Apr 21, 2021
@istio-testing added the "size/L" label (100-499 changed lines, ignoring generated files) Apr 21, 2021
@howardjohn (Member) commented:

@costinm @stevenctl should review this with respect to the multicluster readiness work. This is a bit different since we need the roots for traffic to work.

pilot/pkg/trustbundle/trustbundle.go - two outdated review threads (resolved), plus the following:

```go
	r.RetryCount = 0
}

func (r *trustAnchorEndpoint) AddCert(cert string) {
```
Review comment (Member):

Can this be private?
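
For illustration, a minimal sketch of what the reviewer is suggesting - Go exports identifiers by capitalization, so renaming the method keeps it package-private (the struct field here is an assumption, not the PR's actual layout):

```go
// Sketch only: the certs field is assumed for illustration.
type trustAnchorEndpoint struct {
	certs []string
}

// addCert begins with a lowercase letter, so it is unexported and callable
// only from within the trustbundle package.
func (r *trustAnchorEndpoint) addCert(cert string) {
	r.certs = append(r.certs, cert)
}
```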

Another review thread on pilot/pkg/trustbundle/trustbundle.go:

```go
	}
} else {
	if !remoteEndpoints[endpointURI].IncrementRetryOrFail() {
		remoteTrustAnchor.PurgeCerts()
```
Review comment (Member):

So after 1 day we purge it? This feels a bit odd to me. It seems like it's either too high or too low. Either we need to purge the certs when the endpoint is down, in which case we ought to do it sooner than a day, or we don't need to, in which case we never should?

@shankgan (Contributor, Apr 23, 2021):

Not sure I agree. There is clearly a need to delay removal of trust anchors (in the event that an endpoint becomes temporarily unavailable) so that data-plane traffic is not affected by endpoint flakiness.

However, I am also concerned about having a trust anchor in the mesh long after an endpoint becomes unreachable. What if the user is unaware of this and doesn't change meshConfig to remove this endpoint? Is the expectation that istiod will probably be rebooted at some point, which will clean up this defunct remote trust anchor? Could we increase this timeout to a week, maybe?

Review comment (Member):

I understand the concern - I think the tricky thing is there is no "good" solution, just the least bad.

A similar issue is leaving "dead" multicluster configs around - I don't think it's right for us to prune them. We should make it very clear to users that something is wrong, with logs+metrics, but I think removing it is a bit too far.

Especially when we have some logic that will be exercised only after a day, it seems concerning that there is such a huge gap between something going wrong and the impact of that.
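
As an illustration of the "logs+metrics" alternative to pruning, here is a minimal sketch using plain Prometheus client_golang; the metric name and helper are assumptions for this example, not Istio's actual telemetry wiring:

```go
package trustbundle

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Assumed metric name, not from the PR: a per-endpoint gauge lets operators
// alert on trust-anchor endpoints that stay unreachable, instead of istiod
// silently pruning (or silently keeping) their certs.
var unreachableEndpoints = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "trustbundle_endpoint_unreachable",
		Help: "1 if a remote trust-anchor endpoint is currently unreachable.",
	},
	[]string{"endpoint"},
)

func init() {
	prometheus.MustRegister(unreachableEndpoints)
}

// markEndpointUnreachable records the endpoint's health and logs loudly on
// failure so the operator knows a stale trust anchor is still in use.
func markEndpointUnreachable(endpoint string, unreachable bool) {
	v := 0.0
	if unreachable {
		v = 1.0
		log.Printf("trust anchor endpoint %s unreachable; keeping last known certs", endpoint)
	}
	unreachableEndpoints.WithLabelValues(endpoint).Set(v)
}
```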

@shankgan marked this pull request as draft April 22, 2021 19:56
@istio-testing added the "do-not-merge/work-in-progress" label Apr 22, 2021
@stevenctl (Contributor) commented:

If traffic would fail before these are processed, we should wait to push. That means checking for some synced flag here:

```go
func (s *Server) cachesSynced() bool {
```

But to avoid indefinitely blocking istiod start, we should mark ready regardless of true success after this timeout:

```go
"PILOT_REMOTE_CLUSTER_TIMEOUT",
```

(the name of that flag isn't great for this use case)
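
A minimal sketch of this pattern under assumed names (not the actual istiod types): readiness waits for the trust bundle, but only up to a bound, mirroring how PILOT_REMOTE_CLUSTER_TIMEOUT bounds waiting for remote clusters:

```go
package trustbundle

import "time"

// TrustBundle is a stand-in for the PR's controller; the fields and method
// names are assumptions for illustration.
type TrustBundle struct {
	fetchedAll chan struct{} // closed once every remote endpoint has been fetched
	syncedCh   chan struct{} // closed once we are willing to report ready
}

func NewTrustBundle() *TrustBundle {
	return &TrustBundle{
		fetchedAll: make(chan struct{}),
		syncedCh:   make(chan struct{}),
	}
}

// markSyncedAfter reports synced when all remote trust anchors have been
// fetched, or unconditionally after timeout, whichever comes first, so
// istiod startup is never blocked indefinitely.
func (tb *TrustBundle) markSyncedAfter(timeout time.Duration) {
	go func() {
		select {
		case <-tb.fetchedAll:
		case <-time.After(timeout):
		}
		close(tb.syncedCh)
	}()
}

// HasSynced is the kind of flag cachesSynced() would consult.
func (tb *TrustBundle) HasSynced() bool {
	select {
	case <-tb.syncedCh:
		return true
	default:
		return false
	}
}
```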

@shankgan (Contributor) commented:

> If traffic would fail before these are processed, we should wait to push. That means checking for some synced flag here: `func (s *Server) cachesSynced() bool {`
> But to avoid indefinitely blocking istiod start, we should mark ready regardless of true success after this timeout: `"PILOT_REMOTE_CLUSTER_TIMEOUT",`
> (the name of that flag isn't great for this use case)

@howardjohn - as per the initial discussion, we would wait for certs from all remote endpoints to be fetched before we declared ourselves ready. Are we also going to wait until the certs have been transferred via xDS to the proxy? Is there any easy, canonical way to determine whether the PCDS update has been pushed to all the proxies?

@howardjohn (Member) commented:

> as per the initial discussion, we would wait for certs from all remote endpoints to be fetched before we declared ourselves ready. Are we also going to wait until the certs have been transferred via xDS to the proxy? Is there any easy, canonical way to determine whether the PCDS update has been pushed to all the proxies?

Istiod shouldn't wait until the proxy gets PCDS - a proxy cannot connect to istiod until istiod is ready, so that would be a circular dependency. But the proxies themselves should wait until they get it. This might already happen by blocking SDS, which in turn blocks Envoy readiness, but I am not sure if it's implemented that way.

@shankgan force-pushed the spiffe_fixes branch 2 times, most recently from 30f1a88 to a8fb0a4 May 3, 2021 17:55
@shankgan marked this pull request as ready for review May 4, 2021 15:31
@istio-testing removed the "do-not-merge/work-in-progress" label May 4, 2021
@shankgan requested review from howardjohn and myidpt May 4, 2021 16:33
@shankgan force-pushed the spiffe_fixes branch 3 times, most recently from 96757a7 to 934487b May 7, 2021 20:38
Review thread on the force-pushed code:

```go
}
tb.initDone.Store(false)
```
Review comment (Member):

nit: I think the expected usage is initDone: atomic.NewBool(false) and storing a pointer.
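
For reference, a sketch of the usage the reviewer means with go.uber.org/atomic - construct the *atomic.Bool once and keep the pointer on the struct (struct and method names assumed):

```go
package trustbundle

import "go.uber.org/atomic"

type TrustBundle struct {
	initDone *atomic.Bool // pointer, created via the constructor below
}

func newTrustBundle() *TrustBundle {
	// Reviewer's suggested pattern: initDone: atomic.NewBool(false).
	return &TrustBundle{initDone: atomic.NewBool(false)}
}

func (tb *TrustBundle) markInitDone() {
	tb.initDone.Store(true) // Load/Store go through the shared pointer
}
```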

Review thread on the cachesSynced diff:

```go
@@ -827,6 +827,9 @@ func (s *Server) cachesSynced() bool {
	if !s.configController.HasSynced() {
		return false
	}
	if !s.workloadTrustBundle.HasSynced() {
```
Review comment (Member):

My understanding of how this works (and correct me if I am wrong): Istiod starts up, and we try to fetch the bundle. If it succeeds we are good to go. If it fails, we wait 30 minutes to try again. After 30s we mark ourselves ready.

So I am not sure what the point is of waiting 30s to mark ourselves ready. We should either aggressively retry or just mark ourselves ready immediately (since we are not retrying in the meantime anyway)?

@shankgan (Contributor):

I understand. My vote would be to mark ourselves ready immediately. Note that we poll each configured endpoint for 10 whole seconds waiting for it to become available. Lack of availability after 10s usually implies (a) network issues (congestion/connectivity) or (b) the endpoint is not up. In these scenarios, I think it is better not to be aggressive, to back off, and to try again after 30 minutes.

Review comment (Member):

So if the SPIFFE bundle endpoint is down for ~10 seconds (possible even with 5 nines of reliability), I will fail to fetch it. We mark ourselves ready. New proxies connect, get no root cert, and all mTLS will fail as a result with obscure TLS errors. We will not recover for 30 minutes.

This seems like a really dangerous idea... am I missing some reason that this is safe?

It seems equivalent to failing open if we fail to read AuthorizationPolicy from the k8s api-server.

@shankgan (Contributor):

Hmm... I see the concern here. This isn't safe in the scenario you described. However, as we discussed earlier, we don't want to unnecessarily delay istiod from starting up for this. Perhaps we could have an exponential backoff here in the event of failure, until we get all certs from all remote trust anchors?
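
A sketch of the exponential backoff being proposed; the intervals and the cap are illustrative values, not from the PR:

```go
package trustbundle

import (
	"context"
	"time"
)

// fetchWithBackoff retries fetchFn until it succeeds or ctx is cancelled,
// doubling the wait between attempts. Starting at 10s (the existing per-poll
// wait) and capping at 30m (the existing poll interval) are assumptions.
func fetchWithBackoff(ctx context.Context, fetchFn func() error) error {
	backoff := 10 * time.Second
	const maxBackoff = 30 * time.Minute
	for {
		if err := fetchFn(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

This keeps istiod's startup unblocked while recovering from a brief endpoint outage in seconds rather than 30 minutes.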

Review thread:

```go
var ok bool
var err error

tb.mutex.Lock()
```
@shankgan (Contributor, May 10, 2021):

@howardjohn - there are two goroutines that call this function, hence grabbing the write locks. The scope of this function is large now and could get larger (reading/writing other parts of the trust bundle). As I discussed earlier, I think it is better if this function is only ever accessed from one goroutine, given its scope. This would require all meshConfig updates to be written into a channel and processed by the same goroutine that processes remote trust anchor updates. Does this sound good?
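
A sketch of the single-goroutine design being proposed here, with all names assumed: every source of trust-anchor changes enqueues onto one channel, and a single worker owns the state, removing the need for the write lock:

```go
package trustbundle

// trustBundleUpdate carries certs from one source, e.g. "meshconfig" or a
// remote endpoint URI.
type trustBundleUpdate struct {
	source string
	certs  []string // PEM-encoded trust anchors from that source
}

type TrustBundle struct {
	updates chan trustBundleUpdate
	// trustedCerts is touched only by the worker goroutine, so no mutex.
	trustedCerts map[string][]string
}

func NewTrustBundle() *TrustBundle {
	tb := &TrustBundle{
		updates:      make(chan trustBundleUpdate, 16),
		trustedCerts: map[string][]string{},
	}
	go tb.run()
	return tb
}

// UpdateCerts is safe to call from any goroutine (meshConfig watcher,
// remote-endpoint poller, ...): it only enqueues.
func (tb *TrustBundle) UpdateCerts(source string, certs []string) {
	tb.updates <- trustBundleUpdate{source: source, certs: certs}
}

// run is the single goroutine that owns the bundle state.
func (tb *TrustBundle) run() {
	for u := range tb.updates {
		tb.trustedCerts[u.source] = u.certs
		// ...recompute the merged bundle and trigger an xDS push here...
	}
}
```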

@istio-policy-bot added the "lifecycle/stale" label Jun 10, 2021
@shankgan (Contributor) commented:

/retest

@istio-policy-bot removed the "lifecycle/stale" label Jun 10, 2021
@istio-testing (Collaborator) commented:

@shankgan: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@istio-testing added the "needs-rebase" label Jun 21, 2021
@istio-policy-bot added the "lifecycle/stale" label Jul 11, 2021
@istio-policy-bot commented:

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-06-10. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

@istio-policy-bot added the "lifecycle/automatically-closed" label Jul 26, 2021
@shankgan reopened this Jul 26, 2021
@istio-testing (Collaborator) commented:

@shankgan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Rerun command |
| --- | --- | --- |
| integ-telemetry-mc-k8s-tests_istio | 7767285 | /test integ-telemetry-mc-k8s-tests_istio |
| integ-multicluster-k8s-tests_istio | 7767285 | /test integ-multicluster-k8s-tests_istio |
| integ-helm-tests_istio | 7767285 | /test integ-helm-tests_istio |
| integ-cni-k8s-tests_istio | 7767285 | /test integ-cni-k8s-tests_istio |
| integ-distroless-k8s-tests_istio | 7767285 | /test integ-distroless-k8s-tests_istio |
| unit-tests_istio | 7767285 | /test unit-tests_istio |
| release-test_istio | 7767285 | /test release-test_istio |
| integ-telemetry-k8s-tests_istio | 7767285 | /test integ-telemetry-k8s-tests_istio |
| integ-operator-controller-tests_istio | 7767285 | /test integ-operator-controller-tests_istio |
| lint_istio | 7767285 | /test lint_istio |
| integ-security-k8s-tests_istio | 7767285 | /test integ-security-k8s-tests_istio |
| integ-pilot-k8s-tests_istio | 7767285 | /test integ-pilot-k8s-tests_istio |
| analyze-tests_istio | 7767285 | /test analyze-tests_istio |
| integ-pilot-multicluster-tests_istio | 7767285 | /test integ-pilot-multicluster-tests_istio |
| integ-security-multicluster-tests_istio | 7767285 | /test integ-security-multicluster-tests_istio |
| release-notes_istio | 7767285 | /test release-notes_istio |
| gencheck_istio | 7767285 | /test gencheck_istio |
| integ-ipv6-k8s-tests_istio | 7767285 | /test integ-ipv6-k8s-tests_istio |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Labels

cla: yes · lifecycle/automatically-closed · lifecycle/stale · needs-rebase · size/L
6 participants