
"Failed to parse discovery request" after upgrade 1.1.5 #13853

Open
twu8 opened this issue May 6, 2019 · 10 comments

@twu8

commented May 6, 2019

Bug description
After upgrading Istio from 1.1.3 to 1.1.5, the cluster does not appear to be in a good state. The following was observed:

  1. HTTPS requests fail with status codes such as 400, both from curl and from automated test scripts. The same scripts run fine against Istio 1.1.3 with good metrics, e.g. QPS.
    Error message: LibreSSL SSL_connect: SSL_ERROR_SYSCALL
  2. HTTP seems fine, but QPS drops significantly for the same cluster and configuration.
  3. Errors were observed in the log of the ingress SDS container:
xxx@m-c02x60rajgh7$ k logs --tail=100 -n istio-system istio-ingressgateway-5899c86ddb-hfkm6 ingress-sds


2019-05-06T05:58:46.592611Z	error	Failed to parse discovery request: discovery request node:<id:"router~10.26.80.116~istio-ingressgateway-5899c86ddb-hfkm6.istio-system~istio-system.svc.cluster.local" cluster:"istio-ingressgateway" metadata:<fields:<key:"CONFIG_NAMESPACE" value:<string_value:"istio-system" > > fields:<key:"IDLE_TIMEOUT" value:<string_value:"2m" > > fields:<key:"ISTIO_META_INSTANCE_IPS" value:<string_value:"10.26.80.116,10.26.80.116" > > fields:<key:"ISTIO_PROXY_SHA" value:<string_value:"istio-proxy:73fa9b1f29f91029cc2485a685994a0d1dbcde21" > > fields:<key:"ISTIO_PROXY_VERSION" value:<string_value:"1.1.3" > > fields:<key:"ISTIO_VERSION" value:<string_value:"1.1.5" > > fields:<key:"POD_NAME" value:<string_value:"istio-ingressgateway-5899c86ddb-hfkm6" > > fields:<key:"ROUTER_MODE" value:<string_value:"sni-dnat" > > fields:<key:"USER_SDS" value:<string_value:"true" > > fields:<key:"istio" value:<string_value:"sidecar" > > > locality:<> build_version:"73fa9b1f29f91029cc2485a685994a0d1dbcde21/1.11.0-dev/Clean/RELEASE/BoringSSL" > type_url:"type.googleapis.com/envoy.api.v2.auth.Secret"  has invalid resourceNames []

Expected behavior

HTTPS should work fine after the upgrade to Istio 1.1.5, and metrics such as QPS should be on par with Istio 1.1.3.

Steps to reproduce the bug

  1. Istio 1.1.3 is installed in the cluster.
  2. Upgrade Istio to 1.1.5 using Helm (a sketch of the command follows).
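
For reference, the upgrade command was roughly of the following shape. The release name and chart path are assumptions based on the standard Istio 1.1 Helm layout, not taken from this report:

# Assumed release name "istio" and charts from the istio-1.1.5 download;
# adjust to your actual release name and chart location.
helm upgrade istio install/kubernetes/helm/istio --namespace istio-system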

Version (include the output of istioctl version --remote and kubectl version)

➜ ~ istioctl version --remote
client version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Clean", GitTag:"1.1.4-10-g9b6d31b"}
citadel version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
galley version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
galley version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
ingressgateway version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Clean", GitTag:"1.1.4-10-g9b6d31b"}
ingressgateway version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Clean", GitTag:"1.1.4-10-g9b6d31b"}
pilot version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
pilot version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
sidecar-injector version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
sidecar-injector version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
telemetry version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
telemetry version: version.BuildInfo{Version:"1.1.5", GitRevision:"9b6d31b74d1c0cc9358cc82d395b53f71393326b-dirty", User:"root", Host:"3e29fde4-6c3f-11e9-b00d-0a580a2c0205", GolangVersion:"go1.10.4", DockerHub:"docker.io/istio", BuildStatus:"Modified", GitTag:"1.1.4-10-g9b6d31b"}
➜ ~

app@cpeus567b000000:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:26:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?
Using Helm.

Environment where bug was observed (cloud vendor, OS, etc)

Azure cluster
Linux

Affected product area (please put an X in all that apply)

[x] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[x] Security
[x] Test and Release
[ ] User Experience

Additionally, please consider attaching a cluster state archive (dump file) to this issue.

@duderino added this to the 1.1 milestone May 6, 2019

@howardjohn

Member

commented May 6, 2019

func parseDiscoveryRequest(discReq *xdsapi.DiscoveryRequest) (string /*resourceName*/, error) {
    if discReq.Node.Id == "" {
        return "", fmt.Errorf("discovery request %+v missing node id", discReq)
    }
    if len(discReq.ResourceNames) == 1 {
        return discReq.ResourceNames[0], nil
    }
    return "", fmt.Errorf("discovery request %+v has invalid resourceNames %+v", discReq, discReq.ResourceNames)
}

This requires that resourceNames be non-empty, even though the proto field's documentation:
// List of resources to subscribe to, e.g. list of cluster names or a route
// configuration name. If this is empty, all resources for the API are
// returned. LDS/CDS expect empty resource_names, since this is global
// discovery for the Envoy instance. The LDS and CDS responses will then imply
// a number of resources that need to be fetched via EDS/RDS, which will be
// explicitly enumerated in resource_names.
ResourceNames []string `protobuf:"bytes,3,rep,name=resource_names,json=resourceNames,proto3" json:"resource_names,omitempty"`
suggests that an empty request is valid.
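
For discussion, here is a minimal sketch of how the parser could tolerate an empty list, per the proto comment above. This is a hypothetical variant, not the actual fix that landed:

// Hypothetical variant for discussion only; assumes the same imports
// and aliases (fmt, xdsapi) as the snippet quoted above.
func parseDiscoveryRequest(discReq *xdsapi.DiscoveryRequest) (string /*resourceName*/, error) {
    if discReq.Node.Id == "" {
        return "", fmt.Errorf("discovery request %+v missing node id", discReq)
    }
    switch len(discReq.ResourceNames) {
    case 0:
        // Per the xDS proto comment, an empty list means "all resources
        // for the API", so treat it as valid rather than a parse error.
        return "", nil
    case 1:
        return discReq.ResourceNames[0], nil
    default:
        return "", fmt.Errorf("discovery request %+v has invalid resourceNames %+v", discReq, discReq.ResourceNames)
    }
}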

@JimmyCYJ

Contributor

commented May 6, 2019

There are no significant changes to the ingress gateway agent between 1.1.3 and 1.1.5. Perhaps 1.1.5 upgrades the Envoy version, and that introduces the issue?

@quanjielin

Contributor

commented May 6, 2019

@twu8, a few questions:

  1. Does HTTPS fail every time or only intermittently (e.g. only on some runs, or only when you recreate the k8s secret)?
  2. Do you have node agent logs?
@duderino

Contributor

commented May 7, 2019

@JimmyCYJ the same version of Envoy is used in 1.1.3, 1.1.4, and 1.1.5.

@JimmyCYJ

Contributor

commented May 7, 2019

@duderino Thanks Joshua. Then let's check the ingress-sds and ingress-proxy logs to figure out what happened.

@incfly

Member

commented May 9, 2019

/cc @incfly

@JimmyCYJ

Contributor

commented May 16, 2019

#14080 fixes the issue. Shall we close it now?

@howardjohn

Member

commented May 16, 2019

@JimmyCYJ my fix has nothing to do with SDS; did you link the wrong PR?

@JimmyCYJ

Contributor

commented May 16, 2019

The SDS issue is a race condition in the node agent when the Gateway CRD and the k8s secret are deleted. @philrud's script could reproduce it, and the node agent fix addresses that issue.

This issue is not about SDS; it is about a config push that causes a listener to get stuck in the warming state during the upgrade. If #14080 is not the fix, would you mind adding a link to the right fix?

@howardjohn

Member

commented May 16, 2019

Maybe I am misunderstanding this issue, but it seems to be about Envoy sending a request with an empty list of ResourceNames, and SDS responding with an error?

I highly doubt this issue is related to #14080, which fixes an issue where you have two gateways with overlapping host names in the https field created in the same second (a pretty small group of people, I would imagine).

Maybe I missed something though.
