TestReachability flake; cluster warming #39720

Closed
howardjohn opened this issue Jun 30, 2022 · 12 comments · Fixed by #39746

@howardjohn (Member)

https://prow.istio.io/view/gs/istio-prow/logs/integ-security-multicluster_istio_postsubmit/1542555639413215232

Still debugging, but it looks like all clusters are warming.

@howardjohn (Member Author)

Investigated this one a bit.

warm: 2022-06-30T17:32:43Z/226
219, all eds
ready: 2022-06-30T17:32:42Z/224 and 226 - it is a mix; 114 on 226 and 219 on 224
28 eds ready, 219 unready

pushes look the same for good and bad

bad disconnects at 17:32:55.308978Z, reconnects 2s later
test fails around 33:22 and starts at 33:02, so it's all after the reconnect

reconnect has EDS first ????.

Pushed resources seem the same on reconnect

all the ready ones are vm, fake stateful, headless

VM is eds but empty?


So, a few conclusions:

  • The broken pods are reconnecting to XDS and getting stuck warming. They get the same number of resources, but are somehow still warming. The super weird thing is that the XDS requests arrive out of order: EDS, LDS, RDS, CDS.
  • VMs somehow have 0 endpoints? Also seeing this:
$ rg 'transport: authentication handshake failed'
_test_context/echo2-3-10889-state3295745563/cross-network-primary/vm-v1-749fc9946-97pcf_istio-proxy.envoy.err.log
20422:2022-06-30T17:33:31.211172Z       warning envoy config    StreamAggregatedResources gRPC config stream closed since 30s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.31:47650->172.18.3.255:15012: read: connection reset by peer"
20423:2022-06-30T17:33:37.756010Z       warning envoy config    StreamAggregatedResources gRPC config stream closed since 36s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.31:49634->172.18.3.255:15012: read: connection reset by peer"
20424:2022-06-30T17:33:40.458735Z       warning envoy config    StreamAggregatedResources gRPC config stream closed since 39s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.31:50206->172.18.3.255:15012: read: connection reset by peer"

_test_context/echo2-3-10889-state3295745563/remote/delta-v1-c44cc99b6-tl5zq_istio-proxy.log
89:2022-06-30T17:33:31.110614Z  warning envoy config    DeltaAggregatedResources gRPC config stream closed since 34s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.20.0.20:58550->10.255.20.23:15012: read: connection reset by peer"

_test_context/echo1-4-90458-state3421295793/primary/vm-v1-5f5ddc88f4-qm7rw_istio-proxy.envoy.err.log
274072:2022-06-30T17:33:40.523895Z      warning envoy config    StreamAggregatedResources gRPC config stream closed since 41s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.10.0.31:55096->172.18.3.235:15012: read: connection reset by peer"

_test_context/echo1-4-90458-state3421295793/cross-network-primary/vm-v1-749fc9946-g8fqc_istio-proxy.envoy.err.log
314374:2022-06-30T17:33:30.398699Z      warning envoy config    StreamAggregatedResources gRPC config stream closed since 30s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.30:56002->172.18.3.255:15012: read: connection reset by peer"
314375:2022-06-30T17:33:44.399088Z      warning envoy config    StreamAggregatedResources gRPC config stream closed since 44s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.30:60162->172.18.3.255:15012: read: connection reset by peer"
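
For reference, a minimal sketch of one way to check what Envoy reports as warming, by reading the dynamic_warming_clusters section of the admin /config_dump. The admin address (localhost:15000) and the warmingClusterCount helper are assumptions for illustration, not part of the test framework:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// configDump captures only the pieces of Envoy's /config_dump output we need.
type configDump struct {
	Configs []struct {
		Type                   string            `json:"@type"`
		DynamicWarmingClusters []json.RawMessage `json:"dynamic_warming_clusters"`
	} `json:"configs"`
}

// warmingClusterCount returns how many clusters Envoy reports as still warming.
func warmingClusterCount(adminAddr string) (int, error) {
	resp, err := http.Get("http://" + adminAddr + "/config_dump")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	var dump configDump
	if err := json.Unmarshal(body, &dump); err != nil {
		return 0, err
	}
	for _, c := range dump.Configs {
		if c.Type == "type.googleapis.com/envoy.admin.v3.ClustersConfigDump" {
			return len(c.DynamicWarmingClusters), nil
		}
	}
	return 0, nil
}

func main() {
	// 15000 is the sidecar's default admin port; adjust if yours differs.
	n, err := warmingClusterCount("localhost:15000")
	if err != nil {
		panic(err)
	}
	fmt.Printf("warming clusters: %d\n", n)
}
```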

@howardjohn (Member Author)

howardjohn commented Jul 1, 2022

I can reproduce this with:

diff --git a/pilot/pkg/networking/core/v1alpha3/cluster.go b/pilot/pkg/networking/core/v1alpha3/cluster.go
index a1c4dd9f9b..8e1a639587 100644
--- a/pilot/pkg/networking/core/v1alpha3/cluster.go
+++ b/pilot/pkg/networking/core/v1alpha3/cluster.go
@@ -135,52 +135,60 @@ func isClusterForServiceRemoved(cluster string, hostName string, svc *model.Serv
 
 // buildClusters builds clusters for the proxy with the services passed.
 func (configgen *ConfigGeneratorImpl) buildClusters(proxy *model.Proxy, req *model.PushRequest,
-	services []*model.Service) ([]*discovery.Resource, model.XdsLogDetails) {
+	services []*model.Service,
+) ([]*discovery.Resource, model.XdsLogDetails) {
 	clusters := make([]*cluster.Cluster, 0)
-	resources := model.Resources{}
-	envoyFilterPatches := req.Push.EnvoyFilters(proxy)
-	cb := NewClusterBuilder(proxy, req, configgen.Cache)
-	instances := proxy.ServiceInstances
-	cacheStats := cacheStats{}
-	switch proxy.Type {
-	case model.SidecarProxy:
-		// Setup outbound clusters
-		outboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_OUTBOUND}
-		ob, cs := configgen.buildOutboundClusters(cb, proxy, outboundPatcher, services)
-		cacheStats = cacheStats.merge(cs)
-		resources = append(resources, ob...)
-		// Add a blackhole and passthrough cluster for catching traffic to unresolved routes
-		clusters = outboundPatcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster(), cb.buildDefaultPassthroughCluster())
-		clusters = append(clusters, outboundPatcher.insertedClusters()...)
-
-		// Setup inbound clusters
-		inboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_INBOUND}
-		clusters = append(clusters, configgen.buildInboundClusters(cb, proxy, instances, inboundPatcher)...)
-		// Pass through clusters for inbound traffic. These cluster bind loopback-ish src address to access node local service.
-		clusters = inboundPatcher.conditionallyAppend(clusters, nil, cb.buildInboundPassthroughClusters()...)
-		clusters = append(clusters, inboundPatcher.insertedClusters()...)
-	default: // Gateways
-		patcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_GATEWAY}
-		ob, cs := configgen.buildOutboundClusters(cb, proxy, patcher, services)
-		cacheStats = cacheStats.merge(cs)
-		resources = append(resources, ob...)
-		// Gateways do not require the default passthrough cluster as they do not have original dst listeners.
-		clusters = patcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster())
-		if proxy.Type == model.Router && proxy.MergedGateway != nil && proxy.MergedGateway.ContainsAutoPassthroughGateways {
-			clusters = append(clusters, configgen.buildOutboundSniDnatClusters(proxy, req, patcher)...)
+
+	for i := 0; i < 100000; i++ {
+		c := &cluster.Cluster{
+			Name: fmt.Sprint(i),
+			AltStatName: req.Push.PushVersion,
+			ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_EDS},
 		}
-		clusters = append(clusters, patcher.insertedClusters()...)
+		maybeApplyEdsConfig(c)
+		clusters = append(clusters, c)
 	}
+	resources := model.Resources{}
+	//envoyFilterPatches := req.Push.EnvoyFilters(proxy)
+	//cb := NewClusterBuilder(proxy, req, configgen.Cache)
+	//instances := proxy.ServiceInstances
+	//cacheStats := cacheStats{}
+	//switch proxy.Type {
+	//case model.SidecarProxy:
+	//	// Setup outbound clusters
+	//	outboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_OUTBOUND}
+	//	ob, cs := configgen.buildOutboundClusters(cb, proxy, outboundPatcher, services)
+	//	cacheStats = cacheStats.merge(cs)
+	//	resources = append(resources, ob...)
+	//	// Add a blackhole and passthrough cluster for catching traffic to unresolved routes
+	//	clusters = outboundPatcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster(), cb.buildDefaultPassthroughCluster())
+	//	clusters = append(clusters, outboundPatcher.insertedClusters()...)
+	//
+	//	// Setup inbound clusters
+	//	inboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_INBOUND}
+	//	clusters = append(clusters, configgen.buildInboundClusters(cb, proxy, instances, inboundPatcher)...)
+	//	// Pass through clusters for inbound traffic. These cluster bind loopback-ish src address to access node local service.
+	//	clusters = inboundPatcher.conditionallyAppend(clusters, nil, cb.buildInboundPassthroughClusters()...)
+	//	clusters = append(clusters, inboundPatcher.insertedClusters()...)
+	//default: // Gateways
+	//	patcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_GATEWAY}
+	//	ob, cs := configgen.buildOutboundClusters(cb, proxy, patcher, services)
+	//	cacheStats = cacheStats.merge(cs)
+	//	resources = append(resources, ob...)
+	//	// Gateways do not require the default passthrough cluster as they do not have original dst listeners.
+	//	clusters = patcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster())
+	//	if proxy.Type == model.Router && proxy.MergedGateway != nil && proxy.MergedGateway.ContainsAutoPassthroughGateways {
+	//		clusters = append(clusters, configgen.buildOutboundSniDnatClusters(proxy, req, patcher)...)
+	//	}
+	//	clusters = append(clusters, patcher.insertedClusters()...)
+	//}
 
 	for _, c := range clusters {
 		resources = append(resources, &discovery.Resource{Name: c.Name, Resource: util.MessageToAny(c)})
 	}
-	resources = cb.normalizeClusters(resources)
+	//resources = cb.normalizeClusters(resources)
 
-	if cacheStats.empty() {
-		return resources, model.DefaultXdsLogDetails
-	}
-	return resources, model.XdsLogDetails{AdditionalInfo: fmt.Sprintf("cached:%v/%v", cacheStats.hits, cacheStats.hits+cacheStats.miss)}
+	return resources, model.XdsLogDetails{}
 }
 
 func shouldUseDelta(updates *model.PushRequest) bool {
diff --git a/pilot/pkg/xds/ads.go b/pilot/pkg/xds/ads.go
index e0cfa8ccff..a66e7df456 100644
--- a/pilot/pkg/xds/ads.go
+++ b/pilot/pkg/xds/ads.go
@@ -162,6 +162,9 @@ func (s *DiscoveryServer) receive(con *Connection, identities []string) {
 		}
 		// This should be only set for the first request. The node id may not be set - for example malicious clients.
 		if firstRequest {
+			if req.TypeUrl != v3.ClusterType && req.TypeUrl != v3.SecretType {
+				log.Fatalf("unexpected type %v", req.TypeUrl)
+			}
 			// probe happens before envoy sends first xDS request
 			if req.TypeUrl == v3.HealthInfoType {
 				log.Warnf("ADS: %q %s send health check probe before normal xDS request", con.PeerAddr, con.ConID)

Run proxy-local with --proxyLogLevel=config:debug and pilot-local with --keepaliveMaxServerConnectionAge=3s.

Limit Envoy CPU: sudo cpulimit --pid $( ps aux | rg envoy | rg howardjohn | awk '{print $2}') --limit 20

Send lots of pushes: while :; do curl localhost:8080/debug/adsz?push=true; sleep .5; done

howardjohn added a commit to howardjohn/istio that referenced this issue Jul 1, 2022
See envoyproxy/envoy#13009 for details

Fixes istio#38709 (previously 'fixed', but really the fix was a workaround)

Fixes istio#39720
istio-testing pushed a commit that referenced this issue Jul 2, 2022
* xds: respond to requests previously miscategorized as ACKs

See envoyproxy/envoy#13009 for details

Fixes #38709 (previously 'fixed', but really the fix was a workaround)

Fixes #39720

* fix tests

* Make integ tests more aggressive
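
For context on what that fix changes, here is a rough sketch of the ACK-versus-request decision in State-of-the-World xDS: a request whose nonce matches the last response sent is normally just an ACK, but if the set of subscribed resource names changed it still needs a response, otherwise the newly requested clusters sit in warming. The types and the shouldRespond/equalUnordered helpers below are simplified stand-ins for illustration, not the actual pilot code:

```go
package main

import (
	"fmt"
	"sort"
)

// WatchedResource tracks, per type URL, what was last pushed on the stream.
type WatchedResource struct {
	NonceSent     string   // nonce of the last response sent for this type
	ResourceNames []string // resources the proxy was subscribed to at that point
}

// Request is a trimmed-down stand-in for an xDS DiscoveryRequest.
type Request struct {
	ResponseNonce string
	ResourceNames []string
}

// shouldRespond reports whether the request needs a fresh push rather than
// being treated as a plain ACK of the previous response.
func shouldRespond(w *WatchedResource, req *Request) bool {
	if w == nil {
		// First request for this type on the stream: always respond.
		return true
	}
	if req.ResponseNonce != w.NonceSent {
		// Stale nonce: ignore and wait for the proxy to catch up.
		return false
	}
	// Nonce matches: treating this unconditionally as an ACK is the bug. If
	// the subscribed resources changed, the proxy is asking for something new
	// and must get a response, otherwise those resources stay warming.
	return !equalUnordered(req.ResourceNames, w.ResourceNames)
}

func equalUnordered(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

func main() {
	w := &WatchedResource{NonceSent: "n1", ResourceNames: []string{"outbound|80||a"}}
	req := &Request{ResponseNonce: "n1", ResourceNames: []string{"outbound|80||a", "outbound|80||b"}}
	fmt.Println(shouldRespond(w, req)) // true: same nonce, but the subscription changed
}
```
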
@howardjohn reopened this Jul 12, 2022
howardjohn added a commit to howardjohn/istio that referenced this issue Jul 12, 2022
Currently, we do not log incremental pushes at Info level. The intent
behind this is to avoid spam when we have large endpoint churn. However,
because we also do incremental pushes for Full pushes now, we are also
hiding these logs. These logs are both critical to debugging (things
like istio#39720, etc) and not spammy --
while the `Full=false` pushes may add thousands of messages, this change
only adds at most 1 log per push/proxy. For these types of pushes I
don't see a benefit to excluding only EDS.

Additionally, fix SDS to correctly assert it is incremental (when it
is).
istio-testing pushed a commit that referenced this issue Jul 12, 2022
(same commit message as above)
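
For illustration only, a tiny sketch of the logging decision described in that commit message; pushRequest and logPushDetails are made-up names rather than pilot's real types. Full pushes are logged at Info (at most one per push and proxy), while endpoint-only churn stays at Debug:

```go
package main

import "log"

// pushRequest is an assumed, simplified stand-in for pilot's push request.
type pushRequest struct {
	Full bool // config changed: at most one such log per push/proxy
}

// logPushDetails keeps Full pushes at Info even when the response itself is
// sent incrementally; endpoint-only churn (Full=false) can be thousands of
// messages, so it stays at Debug.
func logPushDetails(req pushRequest, msg string) {
	if req.Full {
		log.Printf("info\t%s", msg)
		return
	}
	log.Printf("debug\t%s", msg)
}

func main() {
	logPushDetails(pushRequest{Full: true}, "CDS: PUSH resources:226")
	logPushDetails(pushRequest{Full: false}, "EDS: incremental endpoint update")
}
```
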
@howardjohn (Member Author)

Might still be broken after #39937: https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/39945/integ-security-multicluster_istio/1547894269065302016. Hard to tell without debug logs; I need to trigger it in my PR with debug enabled.

@hzxuzhonghu (Member)

https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/39949/integ-security-multicluster_istio/1548045809608560640

Same here. In order to find out what is warming, I searched the proxy config logs but could not find anything.
Confused by these logs:

2022-07-15T21:21:05.800743Z	warn	tf	FAIL: cluster/pod remote/istio-system/istio-eastwestgateway-5f8cb48c98-fmq5m found warming resources
2022-07-15T21:21:08.447543Z	warn	tf	FAIL: cluster/pod remote/echo1-2-11752/migration-vistio-77bc9bb579-djhj4 found warming resources
2022-07-15T21:21:08.495375Z	warn	tf	FAIL: cluster/pod primary/echo1-2-11752/migration-vistio-7c9dc8d79-6njqg found warming resources
2022-07-15T21:21:08.553112Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo2-3-70653/c-v1-8bc7657d6-6njl6 found warming resources
2022-07-15T21:21:09.529097Z	warn	tf	FAIL: cluster/pod cross-network-primary/istio-system/istio-egressgateway-5988676f67-wgfwg found warming resources
2022-07-15T21:21:10.033589Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo2-3-70653/statefulset-v1-0 found warming resources
2022-07-15T21:21:10.079488Z	warn	tf	FAIL: cluster/pod remote/echo1-2-11752/delta-v1-c44cc99b6-bspwr found warming resources
2022-07-15T21:21:10.271691Z	warn	tf	FAIL: cluster/pod remote/echo2-3-70653/headless-v1-7fc8f84d4d-wcqcg found warming resources
2022-07-15T21:21:10.367917Z	warn	tf	FAIL: cluster/pod remote/echo2-3-70653/delta-v1-c44cc99b6-rwzjb found warming resources
2022-07-15T21:21:10.370993Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo1-2-11752/migration-vistio-6c786f484f-87d8l found warming resources
2022-07-15T21:21:10.462918Z	warn	tf	FAIL: cluster/pod remote/echo1-2-11752/a-v1-68584bfb78-6zqg6 found warming resources
2022-07-15T21:21:10.675447Z	warn	tf	FAIL: cluster/pod primary/echo1-2-11752/statefulset-v1-0 found warming resources
2022-07-15T21:21:10.970060Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo1-2-11752/tproxy-v1-7b9fc5fcb7-fprmb found warming resources
2022-07-15T21:21:11.389734Z	warn	tf	FAIL: cluster/pod primary/echo1-2-11752/b-v1-7cd6f464f9-ltvmd found warming resources
2022-07-15T21:21:11.668383Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo1-2-11752/a-v1-68b8dd7876-5wvb9 found warming resources

@howardjohn (Member Author)

The test framework dumps at multiple points. For example, if we have suite security with TestFoo/bar/baz, we may dump up to 4 times (there is some logic to avoid redundant ones, I think).

In this case, the warming is for the top-level security dump, which is probably just from cleaning up a bunch of resources: that triggers a bunch of pushes, and we have only been warming for 1s or so.

@hzxuzhonghu (Member)

That is a confusing log if it is only temporary output.

@howardjohn (Member Author)

howardjohn commented Jul 19, 2022 via email

@istio-policy-bot

🤔 ❄️ Hey, there's been no update for this test flake for 3 days.

Courtesy of your friendly test flake nag.

@howardjohn (Member Author)

AFAIK this issue is fixed now... closing for now. LMK if we see anything else.
