trustbundle: fixing spiffe trustanchor support to ensure delay in pro… #32369
Conversation
@costinm @stevenctl should review this with respect to the multicluster readiness work. This is a bit different since we need the roots for traffic to work.
	r.RetryCount = 0
}

func (r *trustAnchorEndpoint) AddCert(cert string) {
Can this be private?
pilot/pkg/trustbundle/trustbundle.go
Outdated
	}
} else {
	if !remoteEndpoints[endpointURI].IncrementRetryOrFail() {
		remoteTrustAnchor.PurgeCerts()
So after 1 day we purge it? This feels a bit odd to me; it seems like it's either too high or too low. Either we need to purge the certs when the endpoint is down, in which case we ought to do it sooner than a day, or we don't need to purge at all, in which case we never should.
Not sure I agree. There is clearly a need to delay removal of trust anchors (in the event that an endpoint becomes temporarily unavailable) so that data-plane traffic is not affected by endpoint flakiness.
However, I am also concerned about keeping a trust anchor in the mesh long after an endpoint becomes unreachable. What if the user is unaware of this and doesn't change meshConfig to remove the endpoint? Is the expectation that istiod will probably be rebooted at some point, which will clean up the defunct remote trust anchor? Could we increase this timeout to a week, maybe?
I understand the concern - I think the tricky thing is that there is no "good" solution, just the least bad one.
A similar issue is leaving "dead" multicluster configs around - I don't think it's right for us to prune them. We should make it very clear to users that something is wrong, with logs and metrics, but I think removing it goes a bit too far.
Especially when we have logic that is exercised only after a day, it is concerning that there is such a large gap between something going wrong and its impact becoming visible.
If traffic would fail before these are processed, we should wait to push. That means checking for some condition along the lines of istio/pilot/pkg/bootstrap/server.go, line 820 (at 21a12a7).
But to avoid blocking istiod startup indefinitely, we should mark ourselves ready regardless of true success after this timeout: istio/pilot/pkg/features/pilot.go, line 357 (at 21a12a7).
(The name of that flag isn't great for this use case.)
@howardjohn - as per the initial discussion, we would wait for certs from all remote endpoints to be fetched before we declared ourselves ready. Are we also going to wait until the certs have been transferred via xDS to the proxy? Is there any easy canonical way to determine whether the PCDS update has been pushed to all the proxies?
Istiod shouldn't wait until the proxy gets PCDS - a proxy cannot connect to istiod until istiod is ready, so that would be a circular dependency. But the proxies themselves should wait until they get it. This might already happen by blocking SDS, which in turn blocks Envoy readiness, but I am not sure if it's implemented that way.
	}
	tb.initDone.Store(false)
nit: I think the expected usage is initDone: atomic.NewBool(false), storing a pointer.
@@ -827,6 +827,9 @@ func (s *Server) cachesSynced() bool {
	if !s.configController.HasSynced() {
		return false
	}
	if !s.workloadTrustBundle.HasSynced() {
My understanding of how this works (correct me if I am wrong): istiod starts up and we try to fetch the bundle. If it succeeds, we are good to go. If it fails, we wait 30 minutes to try again, and after 30s we mark ourselves ready.
So I am not sure what the point of waiting 30s to mark ourselves ready is. We should either retry aggressively or mark ourselves ready immediately (since we are not retrying in the meantime anyway)?
I understand. My vote would be to mark ourselves ready immediately. Note that we poll each configured endpoint for a full 10 seconds waiting for it to become available. Lack of availability after 10s usually implies (a) network issues (congestion/connectivity) or (b) the endpoint is not up. In these scenarios, I think it is better not to be aggressive: back off, and try again after 30 minutes.
So if the SPIFFE bundle endpoint is down for ~10 seconds (possible even with five-nines reliability), I will fail to fetch it. We mark ourselves ready. New proxies connect, get no root cert, and all mTLS fails as a result with obscure TLS errors. We will not recover for 30 minutes.
This seems like a really dangerous idea... am I missing some reason that this is safe?
It seems equivalent to failing open if we fail to read AuthorizationPolicy from the k8s api-server.
Hmm, I see the concern here. This isn't safe in the scenario you described. However, as we discussed earlier, we don't want to unnecessarily delay istiod from starting up for this. Perhaps we could use exponential backoff here in the event of failure, until we get all certs from all remote trust anchors?
	var ok bool
	var err error

	tb.mutex.Lock()
@howardjohn - there are two goroutines that call this function, hence grabbing the write locks. The scope of this function is large now and could get larger (reading/writing other parts of the trust bundle). As I discussed earlier, I think it's better if this function is only ever accessed from one goroutine, given its scope. This would require all meshConfig updates to be written into a channel and processed by the same goroutine that processes remote trust anchor updates. Does this sound good?
/retest
@shankgan: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-06-10. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions. Created by the issue and PR lifecycle manager.
@shankgan: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
…pagation if remote endpoints are unreachable
Implements part of the fix for #31497
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Pull Request Attributes
Please check any characteristics that apply to this pull request.
[ ] Does not have any changes that may affect Istio users.