TestReachability flake; cluster warming #39720

Closed
howardjohn opened this issue Jun 30, 2022 · 12 comments · Fixed by #39746

@howardjohn (Member)

https://prow.istio.io/view/gs/istio-prow/logs/integ-security-multicluster_istio_postsubmit/1542555639413215232

Still debugging, but it looks like all clusters are warming.

@howardjohn (Member Author)

Investigated this one a bit.

warm: 2022-06-30T17:32:43Z/226
219, all eds
ready: 2022-06-30T17:32:42Z/224 and 226 - it is a mix; 114 on 226 and 219 on 224
28 eds ready, 219 unready

pushes look the same for good and bad

bad disconnects at 17:32:55.308978Z, reconnects 2s later
test fails around 33:22 and starts at 33:02, so it's all after the reconnect

reconnect has EDS first ????.

Pushed resources seem the same on reconnect

all the ready ones are vm, fake stateful, headless

VM is eds but empty?


So, a few conclusions:

  • The broken pods are reconnecting to XDS and getting stuck warming. They get the same number of resources, but are somehow still warming. The super weird thing is that the XDS requests arrive out of order: EDS, LDS, RDS, CDS.
  • VMs somehow have 0 endpoints? Also seeing this:
$ rg 'transport: authentication handshake failed'
_test_context/echo2-3-10889-state3295745563/cross-network-primary/vm-v1-749fc9946-97pcf_istio-proxy.envoy.err.log
20422:2022-06-30T17:33:31.211172Z       warning envoy config    StreamAggregatedResources gRPC config stream closed since 30s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.31:47650->172.18.3.255:15012: read: connection reset by peer"
20423:2022-06-30T17:33:37.756010Z       warning envoy config    StreamAggregatedResources gRPC config stream closed since 36s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.31:49634->172.18.3.255:15012: read: connection reset by peer"
20424:2022-06-30T17:33:40.458735Z       warning envoy config    StreamAggregatedResources gRPC config stream closed since 39s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.31:50206->172.18.3.255:15012: read: connection reset by peer"

_test_context/echo2-3-10889-state3295745563/remote/delta-v1-c44cc99b6-tl5zq_istio-proxy.log
89:2022-06-30T17:33:31.110614Z  warning envoy config    DeltaAggregatedResources gRPC config stream closed since 34s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.20.0.20:58550->10.255.20.23:15012: read: connection reset by peer"

_test_context/echo1-4-90458-state3421295793/primary/vm-v1-5f5ddc88f4-qm7rw_istio-proxy.envoy.err.log
274072:2022-06-30T17:33:40.523895Z      warning envoy config    StreamAggregatedResources gRPC config stream closed since 41s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.10.0.31:55096->172.18.3.235:15012: read: connection reset by peer"

_test_context/echo1-4-90458-state3421295793/cross-network-primary/vm-v1-749fc9946-g8fqc_istio-proxy.envoy.err.log
314374:2022-06-30T17:33:30.398699Z      warning envoy config    StreamAggregatedResources gRPC config stream closed since 30s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.30:56002->172.18.3.255:15012: read: connection reset by peer"
314375:2022-06-30T17:33:44.399088Z      warning envoy config    StreamAggregatedResources gRPC config stream closed since 44s ago: 14, connection error: desc = "transport: authentication handshake failed: read tcp 10.30.0.30:60162->172.18.3.255:15012: read: connection reset by peer"
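
For reference, a minimal sketch of one way to check what Envoy reports as warming, by reading the dynamic_warming_clusters section of the admin /config_dump. The admin address (localhost:15000) and the warmingClusterCount helper are assumptions for illustration, not part of the test framework:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// configDump captures only the pieces of Envoy's /config_dump output we need.
type configDump struct {
	Configs []struct {
		Type                   string            `json:"@type"`
		DynamicWarmingClusters []json.RawMessage `json:"dynamic_warming_clusters"`
	} `json:"configs"`
}

// warmingClusterCount returns how many clusters Envoy reports as still warming.
func warmingClusterCount(adminAddr string) (int, error) {
	resp, err := http.Get("http://" + adminAddr + "/config_dump")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	var dump configDump
	if err := json.Unmarshal(body, &dump); err != nil {
		return 0, err
	}
	for _, c := range dump.Configs {
		if c.Type == "type.googleapis.com/envoy.admin.v3.ClustersConfigDump" {
			return len(c.DynamicWarmingClusters), nil
		}
	}
	return 0, nil
}

func main() {
	// 15000 is the sidecar's default admin port; adjust if yours differs.
	n, err := warmingClusterCount("localhost:15000")
	if err != nil {
		panic(err)
	}
	fmt.Printf("warming clusters: %d\n", n)
}
```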

@howardjohn (Member Author)

howardjohn commented Jul 1, 2022

I can reproduce this with:

diff --git a/pilot/pkg/networking/core/v1alpha3/cluster.go b/pilot/pkg/networking/core/v1alpha3/cluster.go
index a1c4dd9f9b..8e1a639587 100644
--- a/pilot/pkg/networking/core/v1alpha3/cluster.go
+++ b/pilot/pkg/networking/core/v1alpha3/cluster.go
@@ -135,52 +135,60 @@ func isClusterForServiceRemoved(cluster string, hostName string, svc *model.Serv
 
 // buildClusters builds clusters for the proxy with the services passed.
 func (configgen *ConfigGeneratorImpl) buildClusters(proxy *model.Proxy, req *model.PushRequest,
-	services []*model.Service) ([]*discovery.Resource, model.XdsLogDetails) {
+	services []*model.Service,
+) ([]*discovery.Resource, model.XdsLogDetails) {
 	clusters := make([]*cluster.Cluster, 0)
-	resources := model.Resources{}
-	envoyFilterPatches := req.Push.EnvoyFilters(proxy)
-	cb := NewClusterBuilder(proxy, req, configgen.Cache)
-	instances := proxy.ServiceInstances
-	cacheStats := cacheStats{}
-	switch proxy.Type {
-	case model.SidecarProxy:
-		// Setup outbound clusters
-		outboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_OUTBOUND}
-		ob, cs := configgen.buildOutboundClusters(cb, proxy, outboundPatcher, services)
-		cacheStats = cacheStats.merge(cs)
-		resources = append(resources, ob...)
-		// Add a blackhole and passthrough cluster for catching traffic to unresolved routes
-		clusters = outboundPatcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster(), cb.buildDefaultPassthroughCluster())
-		clusters = append(clusters, outboundPatcher.insertedClusters()...)
-
-		// Setup inbound clusters
-		inboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_INBOUND}
-		clusters = append(clusters, configgen.buildInboundClusters(cb, proxy, instances, inboundPatcher)...)
-		// Pass through clusters for inbound traffic. These cluster bind loopback-ish src address to access node local service.
-		clusters = inboundPatcher.conditionallyAppend(clusters, nil, cb.buildInboundPassthroughClusters()...)
-		clusters = append(clusters, inboundPatcher.insertedClusters()...)
-	default: // Gateways
-		patcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_GATEWAY}
-		ob, cs := configgen.buildOutboundClusters(cb, proxy, patcher, services)
-		cacheStats = cacheStats.merge(cs)
-		resources = append(resources, ob...)
-		// Gateways do not require the default passthrough cluster as they do not have original dst listeners.
-		clusters = patcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster())
-		if proxy.Type == model.Router && proxy.MergedGateway != nil && proxy.MergedGateway.ContainsAutoPassthroughGateways {
-			clusters = append(clusters, configgen.buildOutboundSniDnatClusters(proxy, req, patcher)...)
+
+	for i := 0; i < 100000; i++ {
+		c := &cluster.Cluster{
+			Name: fmt.Sprint(i),
+			AltStatName: req.Push.PushVersion,
+			ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_EDS},
 		}
-		clusters = append(clusters, patcher.insertedClusters()...)
+		maybeApplyEdsConfig(c)
+		clusters = append(clusters, c)
 	}
+	resources := model.Resources{}
+	//envoyFilterPatches := req.Push.EnvoyFilters(proxy)
+	//cb := NewClusterBuilder(proxy, req, configgen.Cache)
+	//instances := proxy.ServiceInstances
+	//cacheStats := cacheStats{}
+	//switch proxy.Type {
+	//case model.SidecarProxy:
+	//	// Setup outbound clusters
+	//	outboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_OUTBOUND}
+	//	ob, cs := configgen.buildOutboundClusters(cb, proxy, outboundPatcher, services)
+	//	cacheStats = cacheStats.merge(cs)
+	//	resources = append(resources, ob...)
+	//	// Add a blackhole and passthrough cluster for catching traffic to unresolved routes
+	//	clusters = outboundPatcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster(), cb.buildDefaultPassthroughCluster())
+	//	clusters = append(clusters, outboundPatcher.insertedClusters()...)
+	//
+	//	// Setup inbound clusters
+	//	inboundPatcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_SIDECAR_INBOUND}
+	//	clusters = append(clusters, configgen.buildInboundClusters(cb, proxy, instances, inboundPatcher)...)
+	//	// Pass through clusters for inbound traffic. These cluster bind loopback-ish src address to access node local service.
+	//	clusters = inboundPatcher.conditionallyAppend(clusters, nil, cb.buildInboundPassthroughClusters()...)
+	//	clusters = append(clusters, inboundPatcher.insertedClusters()...)
+	//default: // Gateways
+	//	patcher := clusterPatcher{efw: envoyFilterPatches, pctx: networking.EnvoyFilter_GATEWAY}
+	//	ob, cs := configgen.buildOutboundClusters(cb, proxy, patcher, services)
+	//	cacheStats = cacheStats.merge(cs)
+	//	resources = append(resources, ob...)
+	//	// Gateways do not require the default passthrough cluster as they do not have original dst listeners.
+	//	clusters = patcher.conditionallyAppend(clusters, nil, cb.buildBlackHoleCluster())
+	//	if proxy.Type == model.Router && proxy.MergedGateway != nil && proxy.MergedGateway.ContainsAutoPassthroughGateways {
+	//		clusters = append(clusters, configgen.buildOutboundSniDnatClusters(proxy, req, patcher)...)
+	//	}
+	//	clusters = append(clusters, patcher.insertedClusters()...)
+	//}
 
 	for _, c := range clusters {
 		resources = append(resources, &discovery.Resource{Name: c.Name, Resource: util.MessageToAny(c)})
 	}
-	resources = cb.normalizeClusters(resources)
+	//resources = cb.normalizeClusters(resources)
 
-	if cacheStats.empty() {
-		return resources, model.DefaultXdsLogDetails
-	}
-	return resources, model.XdsLogDetails{AdditionalInfo: fmt.Sprintf("cached:%v/%v", cacheStats.hits, cacheStats.hits+cacheStats.miss)}
+	return resources, model.XdsLogDetails{}
 }
 
 func shouldUseDelta(updates *model.PushRequest) bool {
diff --git a/pilot/pkg/xds/ads.go b/pilot/pkg/xds/ads.go
index e0cfa8ccff..a66e7df456 100644
--- a/pilot/pkg/xds/ads.go
+++ b/pilot/pkg/xds/ads.go
@@ -162,6 +162,9 @@ func (s *DiscoveryServer) receive(con *Connection, identities []string) {
 		}
 		// This should be only set for the first request. The node id may not be set - for example malicious clients.
 		if firstRequest {
+			if req.TypeUrl != v3.ClusterType && req.TypeUrl != v3.SecretType {
+				log.Fatalf("unexpected type %v", req.TypeUrl)
+			}
 			// probe happens before envoy sends first xDS request
 			if req.TypeUrl == v3.HealthInfoType {
 				log.Warnf("ADS: %q %s send health check probe before normal xDS request", con.PeerAddr, con.ConID)

Run proxy-local with --proxyLogLevel=config:debug and pilot-local with --keepaliveMaxServerConnectionAge=3s.

Limit Envoy CPU: sudo cpulimit --pid $( ps aux | rg envoy | rg howardjohn | awk '{print $2}') --limit 20

Send lots of pushes: while :; do curl localhost:8080/debug/adsz?push=true; sleep .5; done

howardjohn added a commit to howardjohn/istio that referenced this issue Jul 1, 2022
See envoyproxy/envoy#13009 for details

Fixes istio#38709 (previously 'fixed', but really the fix was a workaround)

Fixes istio#39720
istio-testing pushed a commit that referenced this issue Jul 2, 2022
* xds: respond to requests previously miscategorized as ACKs

See envoyproxy/envoy#13009 for details

Fixes #38709 (previously 'fixed', but really the fix was a workaround)

Fixes #39720

* fix tests

* Make integ tests more aggressive
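
For context on what that fix changes, here is a rough sketch of the ACK-versus-request decision in State-of-the-World xDS: a request whose nonce matches the last response sent is normally just an ACK, but if the set of subscribed resource names changed it still needs a response, otherwise the newly requested clusters sit in warming. The types and the shouldRespond/equalUnordered helpers below are simplified stand-ins for illustration, not the actual pilot code:

```go
package main

import (
	"fmt"
	"sort"
)

// WatchedResource tracks, per type URL, what was last pushed on the stream.
type WatchedResource struct {
	NonceSent     string   // nonce of the last response sent for this type
	ResourceNames []string // resources the proxy was subscribed to at that point
}

// Request is a trimmed-down stand-in for an xDS DiscoveryRequest.
type Request struct {
	ResponseNonce string
	ResourceNames []string
}

// shouldRespond reports whether the request needs a fresh push rather than
// being treated as a plain ACK of the previous response.
func shouldRespond(w *WatchedResource, req *Request) bool {
	if w == nil {
		// First request for this type on the stream: always respond.
		return true
	}
	if req.ResponseNonce != w.NonceSent {
		// Stale nonce: ignore and wait for the proxy to catch up.
		return false
	}
	// Nonce matches: treating this unconditionally as an ACK is the bug. If
	// the subscribed resources changed, the proxy is asking for something new
	// and must get a response, otherwise those resources stay warming.
	return !equalUnordered(req.ResourceNames, w.ResourceNames)
}

func equalUnordered(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

func main() {
	w := &WatchedResource{NonceSent: "n1", ResourceNames: []string{"outbound|80||a"}}
	req := &Request{ResponseNonce: "n1", ResourceNames: []string{"outbound|80||a", "outbound|80||b"}}
	fmt.Println(shouldRespond(w, req)) // true: same nonce, but the subscription changed
}
```
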
@howardjohn reopened this Jul 12, 2022
howardjohn added a commit to howardjohn/istio that referenced this issue Jul 12, 2022
Currently, we do not log incremental pushes at Info level. The intent
behind this is to avoid spam when we have large endpoint churn. However,
because we also do incremental pushes for Full pushes now, we are also
hiding these logs. These logs are both critical to debugging (things
like istio#39720, etc) and not spammy --
while the `Full=false` pushes may add thousands of messages, this change
only adds at most 1 log per push/proxy. For these types of pushes I
don't see a benefit to excluding only EDS.

Additionally, fix SDS to correctly assert it is incremental (when it
is).
istio-testing pushed a commit that referenced this issue Jul 12, 2022
(same commit message as above)
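
For illustration only, a tiny sketch of the logging decision described in that commit message; pushRequest and logPushDetails are made-up names rather than pilot's real types. Full pushes are logged at Info (at most one per push and proxy), while endpoint-only churn stays at Debug:

```go
package main

import "log"

// pushRequest is an assumed, simplified stand-in for pilot's push request.
type pushRequest struct {
	Full bool // config changed: at most one such log per push/proxy
}

// logPushDetails keeps Full pushes at Info even when the response itself is
// sent incrementally; endpoint-only churn (Full=false) can be thousands of
// messages, so it stays at Debug.
func logPushDetails(req pushRequest, msg string) {
	if req.Full {
		log.Printf("info\t%s", msg)
		return
	}
	log.Printf("debug\t%s", msg)
}

func main() {
	logPushDetails(pushRequest{Full: true}, "CDS: PUSH resources:226")
	logPushDetails(pushRequest{Full: false}, "EDS: incremental endpoint update")
}
```
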
@howardjohn (Member Author)

Might still be broken after #39937: https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/39945/integ-security-multicluster_istio/1547894269065302016. Hard to tell without debug logs; I need to trigger it in my PR with debug enabled.

@hzxuzhonghu (Member)

https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/39949/integ-security-multicluster_istio/1548045809608560640

Same here. In order to find out what is warming, I searched the proxy config logs but could not find anything.
Confused by these logs:

2022-07-15T21:21:05.800743Z	warn	tf	FAIL: cluster/pod remote/istio-system/istio-eastwestgateway-5f8cb48c98-fmq5m found warming resources
2022-07-15T21:21:08.447543Z	warn	tf	FAIL: cluster/pod remote/echo1-2-11752/migration-vistio-77bc9bb579-djhj4 found warming resources
2022-07-15T21:21:08.495375Z	warn	tf	FAIL: cluster/pod primary/echo1-2-11752/migration-vistio-7c9dc8d79-6njqg found warming resources
2022-07-15T21:21:08.553112Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo2-3-70653/c-v1-8bc7657d6-6njl6 found warming resources
2022-07-15T21:21:09.529097Z	warn	tf	FAIL: cluster/pod cross-network-primary/istio-system/istio-egressgateway-5988676f67-wgfwg found warming resources
2022-07-15T21:21:10.033589Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo2-3-70653/statefulset-v1-0 found warming resources
2022-07-15T21:21:10.079488Z	warn	tf	FAIL: cluster/pod remote/echo1-2-11752/delta-v1-c44cc99b6-bspwr found warming resources
2022-07-15T21:21:10.271691Z	warn	tf	FAIL: cluster/pod remote/echo2-3-70653/headless-v1-7fc8f84d4d-wcqcg found warming resources
2022-07-15T21:21:10.367917Z	warn	tf	FAIL: cluster/pod remote/echo2-3-70653/delta-v1-c44cc99b6-rwzjb found warming resources
2022-07-15T21:21:10.370993Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo1-2-11752/migration-vistio-6c786f484f-87d8l found warming resources
2022-07-15T21:21:10.462918Z	warn	tf	FAIL: cluster/pod remote/echo1-2-11752/a-v1-68584bfb78-6zqg6 found warming resources
2022-07-15T21:21:10.675447Z	warn	tf	FAIL: cluster/pod primary/echo1-2-11752/statefulset-v1-0 found warming resources
2022-07-15T21:21:10.970060Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo1-2-11752/tproxy-v1-7b9fc5fcb7-fprmb found warming resources
2022-07-15T21:21:11.389734Z	warn	tf	FAIL: cluster/pod primary/echo1-2-11752/b-v1-7cd6f464f9-ltvmd found warming resources
2022-07-15T21:21:11.668383Z	warn	tf	FAIL: cluster/pod cross-network-primary/echo1-2-11752/a-v1-68b8dd7876-5wvb9 found warming resources

@howardjohn (Member Author)

The test framework dumps at multiple points. For example, if we have suite security with TestFoo/bar/baz, we may dump up to 4 times (there is some logic to avoid redundant ones, I think).

In this case, the warming is for the top-level security dump, which is probably just from cleaning up a bunch of resources: that triggers a bunch of pushes, and we have only been warming for 1s or so.

@hzxuzhonghu (Member)

That is a confusing log if it is only temporary output.

@howardjohn (Member Author)

howardjohn commented Jul 19, 2022 via email

@istio-policy-bot

🤔 ❄️ Hey, there's been no update for this test flake for 3 days.

Courtesy of your friendly test flake nag.

@howardjohn (Member Author)

AFAIK this issue is fixed now... closing for now. LMK if we see anything else.
