Internal Kubernetes API Calls Blocked by Istio #8696

Open
rsnj opened this issue Sep 13, 2018 · 11 comments

@rsnj

commented Sep 13, 2018

Describe the bug
I'm installing a monitoring service into my pod, and it tries to call the Kubernetes API server. The Istio sidecar blocks this request. If I disable istio-injection and redeploy, everything works as expected. Do I need to enable anything to make this work?

Expected behavior
My pods can access the internal Kubernetes API

Steps to reproduce the bug

curl https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/namespaces/default/pods

Run from inside my pod, this command does not respond.
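
A rough Python equivalent of the curl check above, with an explicit timeout so that a blocked call fails fast instead of hanging. This is only a sketch: it assumes the pod mounts the default service account token and CA bundle at the standard paths, that the host is an IPv4 address or DNS name, and the function names `build_request` and `api_status` are made up for illustration.

```python
import os
import ssl
import urllib.error
import urllib.request

# Standard in-pod service account mount (assumed to be present).
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def build_request(env, token, path):
    """Build an authenticated request from the in-cluster env vars."""
    host = env["KUBERNETES_SERVICE_HOST"]
    port = env["KUBERNETES_SERVICE_PORT"]
    url = f"https://{host}:{port}{path}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def api_status(path="/api/v1/namespaces/default/pods", timeout=10.0):
    """Return the HTTP status code, or None if the call timed out or failed to connect."""
    with open(os.path.join(SA_DIR, "token")) as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=os.path.join(SA_DIR, "ca.crt"))
    req = build_request(os.environ, token, path)
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The API server answered (for example 403), so it is reachable.
        return exc.code
    except OSError:
        # Covers URLError, TLS errors, and socket timeouts.
        return None
```

Calling `api_status()` inside the pod and getting `None` after the timeout would match the "does not respond" symptom; any status code, even 403, means the request reached the API server.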

Version
Istio:

Version: 1.0.2
GitRevision: d639408fded355fb906ef2a1f9e8ffddc24c3d64
User: root@
Hub: gcr.io/istio-release
GolangVersion: go1.10.1
BuildStatus: Clean

Kubernetes:

Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-08T16:31:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Installation

helm install install/kubernetes/helm/istio \
    --name istio \
    --namespace istio-system \
    --set certmanager.enabled=true

Environment
Microsoft Azure AKS

@rsnj changed the title from "Internal Kubernetes API Calls Blocked by Istio Sidecar" to "Internal Kubernetes API Calls Blocked by Istio" on Sep 20, 2018

@krancour

Contributor

commented Oct 19, 2018

I can confirm this issue exists and is also the root cause of Knative not working on AKS-- their autoscaler, which is a controller, is unable to sync with the Kubernetes apiserver. Disabling the istio-proxy sidecar on that pod fixes things, but my sense is that's not the right thing to do.

It's perplexing to me that this occurs in AKS but not elsewhere.

Can someone help me troubleshoot this? (I work for Azure.)

@adinunzio84

commented Nov 28, 2018

We have this problem too. To account for the fact that the apiserver is now accessed over a public URL, I made the following ServiceEntry and VirtualService:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: azmk8s-ext
spec:
  hosts:
  - "<my-cluster>.hcp.centralus.azmk8s.io"
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: tls-routing
spec:
  hosts:
  - <my-cluster>.hcp.centralus.azmk8s.io
  tls:
  - match:
    - port: 443
      sniHosts:
      - <my-cluster>.hcp.centralus.azmk8s.io
    route:
    - destination:
        host: <my-cluster>.hcp.centralus.azmk8s.io

This gets it to the point where I can access the api-server, but after 5 minutes, it stops working, and calls to the api-server hang. Also, calls to Go's net.LookupIP(host) hang during this period, when providing the FQDN of the AKS apiserver.

I found that if I wait 10-15 minutes, the problem seems to resolve itself, but it starts failing again after another 5 minutes. I also found that making a request to the apiserver while it's working seems to delay the point where it stops. For example, I made a request, then another one a minute later, and it started failing 5 minutes after the second request, not the first.

I should mention that curl requests made directly to the API server when I kubectl exec into the pod DO succeed. That made me think that maybe it's client-go that I was using incorrectly, but the fact that net.LookupIP(host) would also hang made me decide that probably isn't the case...
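
To tell a hanging DNS lookup apart from a failing one, the lookup can be run with a deadline. The sketch below uses Python's `socket.getaddrinfo` as a stand-in for Go's `net.LookupIP`; `lookup_with_timeout` and its `resolver` parameter are hypothetical names chosen for this example, and the hostname in the usage note is a placeholder.

```python
import socket
import threading


def lookup_with_timeout(host, timeout=5.0, resolver=socket.getaddrinfo):
    """Resolve host in a daemon thread.

    Returns ("ok", [ips]), ("error", []) or ("timeout", []).
    """
    result = {}

    def worker():
        try:
            result["infos"] = resolver(host, None)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The resolver is still blocked; the daemon thread is abandoned.
        return "timeout", []
    if "error" in result:
        return "error", []
    # Each getaddrinfo entry ends with a sockaddr whose first item is the IP.
    return "ok", sorted({info[4][0] for info in result["infos"]})
```

For example, `lookup_with_timeout("<my-cluster>.hcp.centralus.azmk8s.io")` returning `("timeout", [])` during the failure window would confirm that name resolution itself is hanging rather than returning an error.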

@mkjoerg


commented Dec 24, 2018

I am having the same issue with the RabbitMQ Kubernetes peer discovery plugin, which wants to list the other pods but can't connect to the AKS API server. The external service entry didn't work; the only way to fix this was to set the IP range.

I only experienced this issue on AKS.

@adinunzio84


commented Dec 24, 2018

@mkjoerg, does the plug-in use client-go? If so, there's a strange issue that seems to only happen when combining Istio, AKS, and client-go (but any 2 are fine): kubernetes/client-go#527

@mkjoerg


commented Dec 24, 2018

@adinunzio84, no, the RabbitMQ plugin is Erlang-based.

@rsnj referenced this issue on Jan 10, 2019: Support Istio #169 (Open)

@krancour

Contributor

commented Jan 29, 2019

While kubernetes/client-go#527 may technically still be an issue, I want to point out that there is now, at least, a workaround in place on the AKS end.

A mutating webhook is now overwriting environment variables such as KUBERNETES_SERVICE_HOST for all pods with the external DNS name for your Kubernetes apiserver. The practical effect of this is that traffic bound for the apiserver exits and re-enters the cluster (which isn't ideal-- hence why I count this as a workaround rather than a solution). The load balancer(s) that are involved in this alternative route to the apiserver are not subject to the difficulties explained in kubernetes/client-go#527.

While this may be more of a workaround than a strategic solution, it's fair to say that this issue is effectively remediated. Should we consider closing it?

EDIT: Because the apiserver address will appear to be external, you do have to add an appropriate ServiceEntry.
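
Since the host now comes from the rewritten environment variable, one way to avoid hard-coding it is to generate the ServiceEntry from whatever the pod was given. This is a sketch only: the fields mirror the ServiceEntry posted earlier in this thread, the resource name `apiserver-ext` is made up, and the output is JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json
import os


def service_entry_for(host, port=443, name="apiserver-ext"):
    """Build a ServiceEntry manifest (as a dict) for an external apiserver host."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "ServiceEntry",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "location": "MESH_EXTERNAL",
            "ports": [{"number": port, "name": "https", "protocol": "HTTPS"}],
            "resolution": "DNS",
        },
    }


def render(env=os.environ):
    """Render the manifest as JSON for the host and port found in env."""
    host = env["KUBERNETES_SERVICE_HOST"]
    port = int(env.get("KUBERNETES_SERVICE_PORT", "443"))
    return json.dumps(service_entry_for(host, port), indent=2)
```

Running `print(render())` inside an affected pod shows the manifest for the DNS name that pod actually sees. It only makes sense when the variable holds a DNS name rather than a cluster IP.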

@adinunzio84


commented Jan 29, 2019

@krancour are you sure about: "The load balancer(s) that are involved in this alternative route to the apiserver are not subject to the difficulties explained in kubernetes/client-go#527."?

I posted more info about the issue I describe in kubernetes/client-go#527, and it seems it's actually more related to the load balancers involved with AKS than to client-go.

If applications with Istio sidecars are able to access the API server after the 5-minute window (where the LB closes the connection), then I think this can be closed. Otherwise, in my opinion, this should remain open and depends on envoyproxy/envoy#3634
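
The timing reported earlier in this thread (failures starting 5 minutes after the most recent request, not the first) is what an idle timeout that restarts on every request would produce. A toy model of that reasoning, assuming a 5-minute idle timeout as discussed above; `expected_drop_time` is a hypothetical helper, not part of any library:

```python
from datetime import datetime, timedelta


def expected_drop_time(request_times, idle_timeout=timedelta(minutes=5)):
    """Return when an idle connection would be dropped.

    The idle timer restarts on every request, so only the latest one matters.
    """
    return max(request_times) + idle_timeout


first = datetime(2018, 11, 28, 12, 0)
second = first + timedelta(minutes=1)
# Two requests one minute apart: the drop is expected 5 minutes after the second.
print(expected_drop_time([first, second]))  # 2018-11-28 12:06:00
```

If this model holds, sending a request more often than the idle timeout would keep postponing the drop, which matches the observation but is not a fix.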

@krancour

Contributor

commented Jan 29, 2019

@adinunzio84, the cluster-internal load balancers and externally facing load balancers are different. The externally facing load balancers have had TCP reset as an opt-in "preview" feature since mid-September, while the cluster-internal load balancers, if I understand correctly, still lack this feature.

https://azure.microsoft.com/en-us/updates/load-balancer-outbound-rules/

Oddly, when I dig down into the LB details, I cannot see any evidence that AKS actually enabled the feature in question when it deployed the cluster. However, I am currently observing correct/desired behavior.

I'll follow up with my colleagues on the AKS team to figure out what's going on here.

If you want to try this yourself, perhaps you can independently verify / refute that this works as I claim.

@adinunzio84


commented Jan 29, 2019

Sure, I'll test it out when I have a chance. If I understand correctly, that TCP reset preview feature is for the Standard Load Balancer. One of the people I spoke with on the AKS team said that AKS does not have Standard Load Balancer support enabled yet, but it should happen soon.

@krancour

Contributor

commented Jan 29, 2019

That is correct. The post I linked to does reference standard load balancers, whilst AKS is currently using basic load balancers only-- which deepens the mystery of why this is now working.

@vcanaa


commented Mar 5, 2019

Internal Kubernetes API calls seem to fail for the first few seconds. You might find this repro useful: see the update in #12187
