Stale node IPs on endpoint resource #69668

Open
lfundaro opened this Issue Oct 11, 2018 · 3 comments

lfundaro commented Oct 11, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
The endpoint contains IPs that are no longer assigned to any node in the cluster, leading to communication failures when pods use the service.

What you expected to happen:
The IPs listed on the endpoint should be in sync with the IPs of the cluster nodes, i.e. the endpoint IPs should be a subset of the node IPs.

How to reproduce it (as minimally and precisely as possible):
We don't know. We think this happens more often in our QA environment, where we run preemptible machines and suspect that either the shutdown hooks of these machines are not running properly or the master is not aware of these nodes going down. However, we have also started seeing this in prod clusters, where we don't run preemptible machines but do run the cluster autoscaler, which triggers node shutdowns.

Anything else we need to know?:
This is happening with a DaemonSet running nginx-ingress-controller with hostNetwork: true. We have two services pointing to these pods (a simplified sketch of the DaemonSet itself follows the two Service manifests below):

---
apiVersion: v1
kind: Service
metadata:
  name: foo
  namespace: default
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: ingress-lb
    active: "yes"
---
apiVersion: v1
kind: Service
metadata:
  name: ingress
  namespace: default
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    nodePort: 30000
  selector:
    app: ingress-lb
    active: "yes"

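For context, here is a simplified sketch of the DaemonSet side of this setup. The name and image below are illustrative placeholders; only hostNetwork: true and the pod labels matching the Service selectors above come from our actual configuration:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-lb          # illustrative name
  namespace: default
spec:
  selector:
    matchLabels:
      app: ingress-lb
  template:
    metadata:
      labels:
        app: ingress-lb
        active: "yes"       # matches the selector of both Services above
    spec:
      hostNetwork: true     # pods bind directly to the node's network, so endpoint IPs are node IPs
      containers:
      - name: nginx-ingress-controller
        image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller   # tag omitted
        ports:
        - containerPort: 80
          name: http
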
ingress.default's underlying endpoint is in sync with all the node IPs, but foo.default is not.
When doing

$> kubectl describe ep foo
Name:         foo
Namespace:    default
Labels:       <none>
Annotations:  <none>
Subsets:
  Addresses:          10.240.0.106,.....more-ips-here......,10.240.250.177
  NotReadyAddresses:  10.240.0.55,10.240.0.91
  Ports:
    Name  Port  Protocol
    ----  ----  --------
    http  80    TCP

Events:  <none>

The NotReadyAddresses don't even exist on the cluster, and the Addresses contain some IPs that are also not part of the cluster.
When doing
$> kubectl get ep foo -o yaml
we see node names that don't exist in the cluster as well.

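For what it's worth, here is a quick way to cross-check the two sets of IPs (a sketch; it assumes the node addresses of type InternalIP are the ones that should show up on the endpoint, and /tmp/node-ips and /tmp/ep-ips are just scratch files):

$> kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}' | tr ' ' '\n' | sort > /tmp/node-ips
$> kubectl get ep foo -o jsonpath='{.subsets[*].addresses[*].ip} {.subsets[*].notReadyAddresses[*].ip}' | tr ' ' '\n' | grep -v '^$' | sort > /tmp/ep-ips
$> comm -13 /tmp/node-ips /tmp/ep-ips    # endpoint IPs that do not belong to any node
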
Looking through this repo's issues, I found two other issues that seem to be related or claim the same problem: #48396 and #56972

Environment:

  • Kubernetes version (use kubectl version): master: 1.10.6-gke.2, nodes: v1.10.5-gke.4
  • Cloud provider or hardware configuration: GCP
  • OS (e.g. from /etc/os-release):
BUILD_ID=10452.109.0
NAME="Container-Optimized OS"
KERNEL_COMMIT_ID=a64d42388743f77dc01fef398d96ffdda96b321b
GOOGLE_CRASH_ID=Lakitu
VERSION_ID=66
BUG_REPORT_URL=https://crbug.com/new
PRETTY_NAME="Container-Optimized OS from Google"
VERSION=66
GOOGLE_METRICS_PRODUCT_ID=26
HOME_URL="https://cloud.google.com/compute/docs/containers/vm-image/"
ID=cos
  • Kernel (e.g. uname -a): Linux gke-prod-ex-1-pool-2-828f4e0e-03bv 4.14.22+ #1 SMP Sat Aug 4 10:28:50 PDT 2018 x86_64 Intel(R) Xeon(R) CPU @ 2.50GHz GenuineIntel GNU/Linux
  • Install tools:
  • Others:

lfundaro commented Oct 11, 2018

/sig gcp
/sig autoscaling
/sig network

MrHohn (Member) commented Oct 11, 2018

Echoing #48396 (comment) here, likely fixed by #68575.


lfundaro commented Oct 12, 2018

Thank you @MrHohn. Looking into the PR, it likely is the fix to our problem! Unfortunately this fix was introduced in v1.12.1 and GCP is currently at v1.10.7, so there's no way for me to verify it. I will leave it to the admins of this repo to decide whether to close this issue or not.
