Some times Pilot is skipping sending endpoint updates for a new pod #19025

Closed · ramaraochavali opened this issue Nov 17, 2019 · 11 comments · Fixed by #19035

@ramaraochavali (Contributor) commented Nov 17, 2019

Bug description
Sometimes when a pod gets restarted, the new pod IP is not sent to all connected Envoys. The old pod IP is removed from the cluster members, but the new pod IP does not appear in the cluster members. Only when we restart Pilot do all the Envoys get the updated endpoint.

Expected behavior
Pilot should send the new pod IP to all the Envoys, and the Envoys should have it in their cluster member list.

Steps to reproduce the bug
This is not easy to reproduce; we need to keep deleting the pod until it happens.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

Master

How was Istio installed?

Helm

Environment where bug was observed (cloud vendor, OS, etc)
K8s

@ramaraochavali changed the title from "Some times Pilot is skipping endpoint updates for a new pod" to "Some times Pilot is skipping sending endpoint updates for a new pod" on Nov 17, 2019
@ramaraochavali (Contributor, Author) commented Nov 17, 2019

@howardjohn I suspect this might have been caused by #18412.

Here is what I found when this happened:

  • The Pilot log showed "Endpoint without pod 10.251.130.228 service.service-mesh" for the new pod instance that came up.
  • When I go to /debug/edsz, the endpoint list has all three endpoints, but the affected pod (10.251.130.228) has no locality information and no TLS mode set. Since this view is rendered from the pre-populated LoadAssignments in the EdsClusters (which in turn are populated from the service registry by calling InstancesByPort), the endpoint is visible here:
"endpoints": [
    {
      "locality": {
        "zone": "b08-45"
      },
      "lbEndpoints": [
        {
          "endpoint": {
            "address": {
              "socketAddress": {
                "address": "10.251.130.18",
                "portValue": 7443
              }
            }
          },
          "metadata": {
            "filterMetadata": {
              "envoy.transport_socket_match": {
                  "tlsMode": "istio"
                }
            }
          },
          "loadBalancingWeight": 1
        }
      ],
      "loadBalancingWeight": 1
    },
    {
      "lbEndpoints": [
        {
          "endpoint": {
            "address": {
              "socketAddress": {
                "address": "10.251.130.228",
                "portValue": 7443
              }
            }
          },
          "loadBalancingWeight": 1
        }
      ],
      "loadBalancingWeight": 1
    },
    {
      "locality": {
        "zone": "h38-45"
      },
      "lbEndpoints": [
        {
          "endpoint": {
            "address": {
              "socketAddress": {
                "address": "10.251.130.78",
                "portValue": 7443
              }
            }
          },
          "metadata": {
            "filterMetadata": {
              "envoy.transport_socket_match": {
                  "tlsMode": "istio"
                }
            }
          },
          "loadBalancingWeight": 1
        }
      ],
      "loadBalancingWeight": 1
    }
  ]
},
  • However, the stat pilot_xds_eds_instances{cluster="outbound_.7443_._.service.service-mesh.svc.cluster.local"} shows only 2, which means it is aware of only the two endpoints that have localities. This stat is populated from EndpointShards, and EndpointShards was never updated with this endpoint.
  • To confirm that, I tried triggering a push via /debug/adsz?push=true - even after this, the new endpoint did not go to the Envoys.
  • The fact that the TLS mode is empty suggests that if the labels are updated after the endpoint notification arrives, we skip the update because of the above PR.

Does the above make sense? I think that, in general, we should trigger an EDS update when labels change, because locality and other attributes might have changed.
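
For context, here is a minimal sketch of the endpoint-handling path I believe is involved, based on pilot/pkg/serviceregistry/kube/controller/controller.go; the helper names (podCache, getPodByIP, buildEndpoints) and the exact shape are assumptions for illustration, not a copy of the actual controller code:

```go
package kube

import (
	"log"

	v1 "k8s.io/api/core/v1"
)

// podCache stands in for the controller's pod-by-IP cache (name assumed).
type podCache interface {
	getPodByIP(ip string) *v1.Pod
}

// buildEndpoints sketches the assumed shape of the kube registry handler:
// endpoint addresses whose pod is not yet in the local cache are skipped,
// so their locality and tlsMode never make it into EndpointShards.
func buildEndpoints(pods podCache, addresses []v1.EndpointAddress, svc, ns string) []v1.EndpointAddress {
	var kept []v1.EndpointAddress
	for _, ea := range addresses {
		pod := pods.getPodByIP(ea.IP) // local informer cache, eventually consistent
		if pod == nil {
			// Matches the "Endpoint without pod" message seen in the Pilot log.
			log.Printf("Endpoint without pod %s %s.%s", ea.IP, svc, ns)
			continue // the new pod IP is dropped from this EDS update
		}
		kept = append(kept, ea)
	}
	return kept
}
```

If the pod (and therefore its tlsMode and locality labels) only becomes resolvable once the pod event arrives later, nothing in this path re-adds the endpoint to the shards, which would explain the symptom above.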

@hzxuzhonghu (Member) commented Nov 17, 2019

@ramaraochavali In general, we do not support workload label updates at the moment.

@ramaraochavali (Contributor, Author) commented Nov 17, 2019

@hzxuzhonghu I see. But shouldn't the standard labels like security.istio.io/tlsMode and the k8s locality labels be present on the endpoint? Those should not be missing from the endpoints, right?

Or are you suggesting I should only check for changes in those labels?

@howardjohn (Member) commented Nov 17, 2019

I don't think either of those is a label on the endpoints? I thought the TLS one is on the pod and locality is on the node.

@ramaraochavali (Contributor, Author) commented Nov 17, 2019

Yeah, you are right. I verified the TLS label - it is on the pod. Since the pod was not available when this endpoint arrived, it never entered the endpoint shards - is that right? How do we handle that case then?

@ramaraochavali (Contributor, Author) commented Nov 17, 2019

IIUC, the pod cache is eventually consistent. If the pod is not available for this new endpoint IP here https://github.com/istio/istio/blob/master/pilot/pkg/serviceregistry/kube/controller/controller.go#L938, then instead of skipping that endpoint, should we refresh the pod list by calling c.pods.informer.GetStore().List(), or is there a better way?
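
For reference, a minimal sketch of what that lookup would do (findPodInStore is a hypothetical helper; cache.Store is the client-go interface backing informers) - it only scans whatever the informer has already synced locally:

```go
package kube

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// findPodInStore scans the informer's local store for a pod with the given IP.
// It never talks to the apiserver, so if the informer has not yet processed
// the Pod ADD event, re-listing the same store still returns nothing.
func findPodInStore(store cache.Store, ip string) *v1.Pod {
	for _, obj := range store.List() { // snapshot of the local cache only
		if pod, ok := obj.(*v1.Pod); ok && pod.Status.PodIP == ip {
			return pod
		}
	}
	return nil
}
```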

@hzxuzhonghu (Member) commented Nov 18, 2019

One option is to get the pod from the k8s apiserver directly; it must exist there.

c.pods.informer.GetStore().List()

This may not be feasible - the pod may not have been refreshed into the informer store yet.
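
A minimal sketch of the direct-apiserver option, assuming a kubernetes.Interface client is available; the helper name is illustrative, and the List signature follows client-go of this era (newer client-go versions also take a context.Context). status.podIP is a standard pod field selector:

```go
package kube

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getPodByIPFromAPIServer is a hypothetical fallback: when the local cache
// misses, ask the apiserver directly for the pod owning the endpoint IP so
// that labels, locality and tlsMode can still be resolved.
func getPodByIPFromAPIServer(client kubernetes.Interface, namespace, ip string) (*v1.Pod, error) {
	pods, err := client.CoreV1().Pods(namespace).List(metav1.ListOptions{
		FieldSelector: "status.podIP=" + ip,
	})
	if err != nil {
		return nil, err
	}
	if len(pods.Items) == 0 {
		return nil, fmt.Errorf("no pod found for IP %s in namespace %s", ip, namespace)
	}
	return &pods.Items[0], nil
}
```

The tradeoff is an extra apiserver round trip on the cache-miss path, which should be rare since it only happens when the endpoint event races ahead of the pod event.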

@hzxuzhonghu (Member) commented Nov 18, 2019

The other option is to trigger a full xDS push for all proxies, as before.

@howardjohn (Member) commented Nov 18, 2019

Once the pod does get added, don't we trigger something?

@hzxuzhonghu (Member) commented Nov 18, 2019

Only an xDS push for the related proxy is triggered, and that is not enough in many cases.

@ramaraochavali (Contributor, Author) commented Nov 18, 2019

@hzxuzhonghu @howardjohn I implemented fetching the pod directly from the K8s API in #19035. Please check.
