
Connectivity issue in newly spun up cluster #16126

Closed
mzeeshan1 opened this issue Nov 23, 2023 · 5 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mzeeshan1

mzeeshan1 commented Nov 23, 2023

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.27.0 (git-v1.27.0)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.24.16

3. What cloud provider are you using?
aws

4. What commands did you run? What is the simplest way to reproduce this issue?
#create a new cluster

kops create cluster --name=test-cluster.k8s.local --cloud aws --zones eu-central-1a --kubernetes-version v1.24.16 --networking kube-router

Wait for cluster to spin up

#update the cluster

kops update cluster --name=test-cluster.k8s.local --yes --admin

*********************************************************************************

A new kubernetes version is available: 1.24.17
Upgrading is recommended (try kops upgrade cluster)

More information: https://github.com/kubernetes/kops/blob/master/permalinks/upgrade_k8s.md#1.24.17

*********************************************************************************

I1123 18:06:09.361024   12454 executor.go:111] Tasks: 0 done / 99 total; 45 can run
W1123 18:06:09.536646   12454 vfs_keystorereader.go:143] CA private key was not found
I1123 18:06:09.543792   12454 keypair.go:226] Issuing new certificate: "etcd-manager-ca-events"
I1123 18:06:09.543793   12454 keypair.go:226] Issuing new certificate: "etcd-peers-ca-main"
I1123 18:06:09.544936   12454 keypair.go:226] Issuing new certificate: "etcd-peers-ca-events"
I1123 18:06:09.559616   12454 keypair.go:226] Issuing new certificate: "apiserver-aggregator-ca"
I1123 18:06:09.564005   12454 keypair.go:226] Issuing new certificate: "etcd-clients-ca"
I1123 18:06:09.564098   12454 keypair.go:226] Issuing new certificate: "etcd-manager-ca-main"
W1123 18:06:09.613163   12454 vfs_keystorereader.go:143] CA private key was not found
I1123 18:06:09.637009   12454 keypair.go:226] Issuing new certificate: "service-account"
I1123 18:06:09.653641   12454 keypair.go:226] Issuing new certificate: "kubernetes-ca"
I1123 18:06:10.504079   12454 executor.go:111] Tasks: 45 done / 99 total; 19 can run
I1123 18:06:11.437882   12454 executor.go:111] Tasks: 64 done / 99 total; 25 can run
I1123 18:06:12.461030   12454 executor.go:111] Tasks: 89 done / 99 total; 2 can run
I1123 18:06:12.684433   12454 executor.go:111] Tasks: 91 done / 99 total; 4 can run
I1123 18:06:13.295390   12454 executor.go:111] Tasks: 95 done / 99 total; 2 can run
I1123 18:06:14.245057   12454 executor.go:155] No progress made, sleeping before retrying 2 task(s)
I1123 18:06:24.247643   12454 executor.go:111] Tasks: 95 done / 99 total; 2 can run
I1123 18:06:25.510459   12454 executor.go:111] Tasks: 97 done / 99 total; 2 can run
I1123 18:06:25.568554   12454 executor.go:111] Tasks: 99 done / 99 total; 0 can run
I1123 18:06:25.698164   12454 update_cluster.go:323] Exporting kubeconfig for cluster
kOps has set your kubectl context to test-cluster.k8s.local

Cluster is starting.  It should be ready in a few minutes.

Suggestions:
 * validate cluster: kops validate cluster --wait 10m
 * list nodes: kubectl get nodes --show-labels
 * ssh to a control-plane node: ssh -i ~/.ssh/id_rsa ubuntu@
 * the ubuntu user is specific to Ubuntu. If not using Ubuntu please use the appropriate user based on your OS.
 * read about installing addons at: https://kops.sigs.k8s.io/addons.
 
kops edit ig nodes-eu-central-1a

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-11-23T17:03:40Z"
  labels:
    kops.k8s.io/cluster: test-cluster.k8s.local
  name: nodes-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231004
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.medium
  maxSize: 2
  minSize: 2
  role: Node
  subnets:
  - eu-central-1a

Change min and max nodes to 2

kops update cluster --name=test-cluster.k8s.local --yes --admin

*********************************************************************************

A new kubernetes version is available: 1.24.17
Upgrading is recommended (try kops upgrade cluster)

More information: https://github.com/kubernetes/kops/blob/master/permalinks/upgrade_k8s.md#1.24.17

*********************************************************************************

I1123 18:22:47.004765   13985 executor.go:111] Tasks: 0 done / 99 total; 45 can run
I1123 18:22:47.501234   13985 executor.go:111] Tasks: 45 done / 99 total; 19 can run
I1123 18:22:47.990546   13985 executor.go:111] Tasks: 64 done / 99 total; 25 can run
I1123 18:22:48.398206   13985 executor.go:111] Tasks: 89 done / 99 total; 2 can run
I1123 18:22:48.592505   13985 executor.go:111] Tasks: 91 done / 99 total; 4 can run
I1123 18:22:49.017687   13985 executor.go:111] Tasks: 95 done / 99 total; 2 can run
I1123 18:22:49.287993   13985 executor.go:111] Tasks: 97 done / 99 total; 2 can run
I1123 18:22:49.386978   13985 executor.go:111] Tasks: 99 done / 99 total; 0 can run
I1123 18:22:49.446827   13985 update_cluster.go:323] Exporting kubeconfig for cluster
kOps has set your kubectl context to test-cluster.k8s.local

Cluster changes have been applied to the cloud.


Changes may require instances to restart: kops rolling-update cluster
cat whoami

apiVersion: v1
kind: Service
metadata:
  annotations:
    #kube-router.io/service.dsr: tunnel
    kube-router.io/service.local: "true"
    purpose: "Creates a VIP for balancing an application"
  labels:
    name: whoami
  name: whoami
  namespace: default
spec:
  ports:
  - name: flask
    port: 5000
    protocol: TCP
    targetPort: 5000
  selector:
    name: whoami
  type: LoadBalancer

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: whoami
  namespace: default
spec:
  selector:
    matchLabels:
      name: whoami
  template:
    metadata:
      labels:
        name: whoami
    spec:
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
        - name: whoami
          image: "docker.io/containous/whoami"
          imagePullPolicy: Always
          command: ["/whoami"]
          args: ["--port", "5000"]
kubectl apply -f whoami                                        
service/whoami created
daemonset.apps/whoami created
kubectl get nodes      
NAME                  STATUS   ROLES           AGE   VERSION
i-0535b8ca26a11ff5c   Ready    node            56s   v1.24.16
i-0993be5d74b1f8a1e   Ready    control-plane   15m   v1.24.16
i-0c51517a5f08ccc69   Ready    node            13m   v1.24.16
kubectl get pods -n default -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP           NODE                  NOMINATED NODE   READINESS GATES
whoami-sxmr5   1/1     Running   0          62s   100.96.0.3   i-0993be5d74b1f8a1e   <none>           <none>
whoami-twsfh   1/1     Running   0          62s   100.96.2.3   i-0535b8ca26a11ff5c   <none>           <none>
whoami-v8fdn   1/1     Running   0          62s   100.96.1.8   i-0c51517a5f08ccc69   <none>           <none>
kubectl debug -it whoami-v8fdn --image=nicolaka/netshoot --target=whoami --share-processes=true
Targeting container "whoami". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-wr2z8.


If you don't see a command prompt, try pressing enter.

whoami-v8fdn% 
whoami-v8fdn% ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default 
    link/ether 12:9f:05:dd:8e:cd brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 100.96.1.8/24 brd 100.96.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::109f:5ff:fedd:8ecd/64 scope link 
       valid_lft forever preferred_lft forever
whoami-v8fdn% 
whoami-v8fdn% 
whoami-v8fdn% 
whoami-v8fdn% ping 100.96.2.3
PING 100.96.2.3 (100.96.2.3) 56(84) bytes of data.
^C
--- 100.96.2.3 ping statistics ---
29 packets transmitted, 0 received, 100% packet loss, time 28663ms

In the above commands I am trying to ping a pod on the newly added worker node from a debug pod on an existing worker node in the same region, and the connectivity fails.
If I restart kube-router on the master node, the connectivity begins to work.
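A rough way to check this is to compare the pod CIDRs Kubernetes assigned to the nodes with the routes kube-router installed on an existing worker; the kube-router DaemonSet name and namespace in the restart command are assumptions about how the kops addon deploys it:

# Pod CIDR assigned to each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR

# On an existing worker: is there a route for the new node's pod CIDR (100.96.2.0/24 here)?
ip route | grep 100.96.2

# The workaround above restarts kube-router on the master node; this restarts every kube-router pod
kubectl -n kube-system rollout restart ds/kube-router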

5. What happened after the commands executed?
Pods on newly added nodes were not pingable from nodes in the same subnet.

6. What did you expect to happen?
I expected pods on newly added nodes to be pingable from nodes in the same subnet.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-11-23T17:03:40Z"
  name: test-cluster.k8s.local
spec:
  api:
    loadBalancer:
      class: Network
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-bucket-zeeshan/test-cluster.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-eu-central-1a
      name: a
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-eu-central-1a
      name: a
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: v1.24.16
  networkCIDR: 172.20.0.0/16
  networking:
    kuberouter: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-central-1a
    type: Public
    zone: eu-central-1a
  topology:
    dns:
      type: Private
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-11-23T17:03:40Z"
  labels:
    kops.k8s.io/cluster: test-cluster.k8s.local
  name: control-plane-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231004
  instanceMetadata:
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-11-23T17:03:40Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: test-cluster.k8s.local
  name: nodes-eu-central-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231004
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.medium
  maxSize: 2
  minSize: 2
  role: Node
  subnets:
  - eu-central-1a

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

The issue does not appear with Kubernetes version 1.23.17 but appears with Kubernetes 1.24.16. Can anyone please confirm whether they face the same issue with kops 1.27.0, Kubernetes 1.24.16, and kube-router as the CNI running in the default mode it ships with in kOps?
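For anyone comparing setups, the mode kube-router runs in can be read from the arguments of its DaemonSet; the DaemonSet name and namespace here are my assumption of how the kops addon deploys it:

# Show the kube-router image and the flags it was started with
kubectl -n kube-system get ds kube-router -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'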

@k8s-ci-robot added the kind/bug label Nov 23, 2023
@mzeeshan1
Author

mzeeshan1 commented Dec 1, 2023

For anyone else coming here facing the same issue: the problem started appearing when Kubernetes was updated to v1.24.x. The reason was that kube-router was not able to disable the source/destination check on newly spun-up EC2 instances because it lacked the permissions to do so. Why these permissions changed after upgrading the Kubernetes version to 1.24.x, I don't know. If anyone knows, please leave a comment.
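For reference, the attribute can be inspected and toggled by hand with the AWS CLI; this is only a sketch of a manual check (the instance ID is the new worker from the output above), not something kops runs for you:

# Is the source/destination check still enabled on the new node?
aws ec2 describe-instance-attribute --instance-id i-0535b8ca26a11ff5c --attribute sourceDestCheck

# Disable it by hand; kube-router normally does this itself, which is what needs the ec2:ModifyInstanceAttribute permission
aws ec2 modify-instance-attribute --instance-id i-0535b8ca26a11ff5c --no-source-dest-check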
When a discovery store was provided with the --discovery-store flag, the issue was no longer reproducible. How does the Kubernetes version affect the kops discovery store and these permissions? And was this mentioned anywhere in the kops documentation (it should be, because it is a breaking change)? If anyone knows, please leave a comment.
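For completeness, this is roughly what the create command looks like with a discovery store supplied; the bucket name is a placeholder:

kops create cluster --name=test-cluster.k8s.local --cloud aws --zones eu-central-1a \
  --kubernetes-version v1.24.16 --networking kube-router \
  --discovery-store=s3://example-publicly-readable-bucket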

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 3, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 2, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) May 2, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
