kubectl logs POD broken when using amazon-vpc-cni-k8s (kubelet registers the wrong IP) #4218

Closed
dezmodue opened this issue Jan 8, 2018 · 14 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dezmodue
Contributor

dezmodue commented Jan 8, 2018

  1. What kops version are you running? The command kops version will display
    this information.
    kops and nodeup built from master 2fdf834 - golang 1.8.5

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.
    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

  3. What cloud provider are you using?
    AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?
    kubectl logs POD -n NAMESPACE

  5. What happened after the commands executed?

Error from server: Get https://10.103.20.110:10250/containerLogs/monitoring/grafana-0/grafana: dial tcp 10.103.20.110:10250: getsockopt: connection timed out
  6. What did you expect to happen?
    The container logs to be displayed

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -oyaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-01-03T13:20:31Z
  name: megamind.mycompany.io
spec:
  additionalPolicies:
    master: |
      [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:AttachNetworkInterface",
                "ec2:DeleteNetworkInterface",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:AssignPrivateIpAddresses"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "tag:TagResources",
            "Resource": "*"
        }
      ]
    node: |
      [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:AttachNetworkInterface",
                "ec2:DeleteNetworkInterface",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:AssignPrivateIpAddresses"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "tag:TagResources",
            "Resource": "*"
        }
      ]
  api:
    dns: {}
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://megamind-dev-kops-state/megamind.mycompany.io
  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-eu-west-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-eu-west-1c
      name: c
    name: main
    version: 3.1.11
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-eu-west-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-eu-west-1c
      name: c
    name: events
    version: 3.1.11
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 1.2.3.4/32
  kubernetesVersion: 1.8.4
  masterInternalName: api.internal.megamind.mycompany.io
  masterPublicName: api.megamind.mycompany.io
  networkCIDR: 10.103.0.0/16
  networkID: vpc-01234567
  networking:
    amazonvpc: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 1.2.3.4/32
  subnets:
  - cidr: 10.103.16.0/21
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.103.32.0/21
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.103.48.0/21
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.103.64.0/21
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.103.80.0/21
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.103.112.0/21
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    bastion:
      bastionPublicName: bastion.megamind.mycompany.io
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-03T13:20:34Z
  labels:
    kops.k8s.io/cluster: megamind.mycompany.io
  name: bastions
spec:
  associatePublicIp: true
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastion-eu-west-1a
  role: Bastion
  subnets:
  - utility-eu-west-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-03T13:20:32Z
  labels:
    kops.k8s.io/cluster: megamind.mycompany.io
  name: master-eu-west-1a
spec:
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  subnets:
  - eu-west-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-03T13:20:32Z
  labels:
    kops.k8s.io/cluster: megamind.mycompany.io
  name: master-eu-west-1b
spec:
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1b
  role: Master
  subnets:
  - eu-west-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-03T13:20:32Z
  labels:
    kops.k8s.io/cluster: megamind.mycompany.io
  name: master-eu-west-1c
spec:
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1c
  role: Master
  subnets:
  - eu-west-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-03T13:20:34Z
  labels:
    kops.k8s.io/cluster: megamind.mycompany.io
  name: nodes-eu-west-1
spec:
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
  machineType: r4.2xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-eu-west-1
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

  9. Anything else do we need to know?
    This error seems to be caused by the kubelet registering the wrong IP: specifically, the kubelet appears to report one of the secondary private IPs rather than the primary one.
    In the example error reported:

Error from server: Get https://10.103.20.110:10250/containerLogs/monitoring/grafana-0/grafana: dial tcp 10.103.20.110:10250: getsockopt: connection timed out

10.103.20.110 is a secondary private IP and it is indeed the IP shown by kubectl describe node:

Addresses:
  InternalIP:   10.103.20.110
  InternalDNS:  ip-10-103-21-40.megamind.internal
  Hostname:     ip-10-103-21-40.megamind.internal
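
The mismatch above can be checked mechanically. The following is a minimal sketch, assuming the node uses the default IP-based EC2 private DNS name (ip-A-B-C-D.<domain>), which encodes the instance's primary private IPv4 address. The function names are hypothetical and are not part of kops or kubectl.

```shell
# Hypothetical helpers (not part of kops or kubectl).

# Derive the primary private IPv4 from an IP-based EC2 hostname,
# e.g. ip-10-103-21-40.megamind.internal -> 10.103.21.40.
# Returns 1 if the first DNS label does not look like ip-A-B-C-D.
hostname_to_ip() {
  local label="${1%%.*}"   # keep only the first DNS label
  case "$label" in
    ip-[0-9]*-[0-9]*-[0-9]*-[0-9]*) ;;
    *) return 1 ;;
  esac
  printf '%s\n' "${label#ip-}" | sed 's/-/./g'
}

# Exit 0 if the registered InternalIP equals the IP encoded in the hostname,
# 1 if they differ, 2 if the hostname is not IP-based.
node_ip_matches_hostname() {
  local internal_ip="$1" hostname="$2" expected
  expected="$(hostname_to_ip "$hostname")" || return 2
  [ "$internal_ip" = "$expected" ]
}
```

With the values from this report, node_ip_matches_hostname 10.103.20.110 ip-10-103-21-40.megamind.internal exits with status 1, which flags the node. On a live cluster the InternalIP could be read with kubectl get node NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'.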

Locally, curl works against the primary IPs of both eth0 and eth1.

The problem has also been occurring on the master nodes, where it manifests as new nodes being unable to join the cluster because the kubelet cannot contact the API (same reason: wrong IPs).

As a PoC, I built a modified nodeup that passes the flag --node-ip=LOCAL-IPV4 to the kubelet (where LOCAL-IPV4 is the result of curl http://169.254.169.254/latest/meta-data/local-ipv4).

With that in place, the masters and nodes build correctly and kubectl logs works as expected.
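
The PoC idea can be sketched in shell as follows. This is an illustration only, not the modified nodeup itself. It assumes the instance metadata endpoint answers plain GET requests, as in the curl command above, and all function names are hypothetical.

```shell
# Hypothetical helpers; not the actual nodeup implementation.

# Fetch the instance's primary private IPv4 from EC2 instance metadata.
# Assumes plain GET requests are accepted, as in the curl command in this issue.
fetch_local_ipv4() {
  curl -sf --max-time 2 http://169.254.169.254/latest/meta-data/local-ipv4
}

# Succeed only if the argument is a dotted-quad IPv4 address.
is_ipv4() {
  local ip="$1" octet
  case "$ip" in
    ""|*[!0-9.]*|.*|*.|*..*) return 1 ;;   # empty, bad chars, or empty octet
  esac
  local IFS=.
  set -- $ip                               # split on dots into $1..$n
  [ $# -eq 4 ] || return 1
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
}

# Print the kubelet flag for the given address, refusing non-IPv4 input
# so that an empty or failed metadata lookup never yields "--node-ip=".
node_ip_flag() {
  is_ipv4 "$1" || return 1
  printf '%s%s\n' '--node-ip=' "$1"
}
```

A wrapper could then append "$(node_ip_flag "$(fetch_local_ipv4)")" to the kubelet arguments. Validating before printing is a deliberate choice here: if the metadata lookup fails, the function returns non-zero instead of emitting a malformed flag.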

@chrislovecnm
Contributor

Can you please file this issue with https://github.com/aws/amazon-vpc-cni-k8s?

@dezmodue
Contributor Author

dezmodue commented Jan 9, 2018

Hi @chrislovecnm, I will do so, as I cannot reproduce the same issue when running a private kops cluster with flannel-vxlan and simply adding ENIs and IPs to the nodes.

@liwenwu-amazon

Hi @dezmodue @chrislovecnm, the CNI plugin back-end (L-IPAM) allocates IP addresses on the primary ENI right after it is initialized. So by the time the kubelet reports the node address, the primary ENI already has multiple secondary IP addresses assigned.

I like the idea in your PoC of passing --node-ip=LOCAL-IPV4 to the kubelet. Can we use that to fix this issue?
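
To see the secondary addresses described above on a given instance, something like the following sketch may help. It assumes the instance metadata endpoint accepts plain GET requests and that the mac and network/interfaces/macs/<mac>/local-ipv4s metadata entries describe the primary ENI. The function and variable names are hypothetical.

```shell
# Hypothetical helpers for inspecting the addresses on the primary ENI.
METADATA_URL=http://169.254.169.254/latest/meta-data

# List every private IPv4 assigned to the primary ENI, one per line.
# Assumes plain GET requests to instance metadata are accepted.
fetch_eni_ipv4s() {
  local mac
  mac="$(curl -sf --max-time 2 "$METADATA_URL/mac")" || return 1
  curl -sf --max-time 2 "$METADATA_URL/network/interfaces/macs/$mac/local-ipv4s"
}

# Read a newline-separated address list on stdin and print every address
# except the primary one passed as $1, i.e. the secondary IPs.
secondary_ipv4s() {
  awk -v primary="$1" 'NF && $1 != primary { print $1 }'
}
```

On a node, fetch_eni_ipv4s | secondary_ipv4s "$(curl -sf http://169.254.169.254/latest/meta-data/local-ipv4)" would print the addresses that the kubelet should not register as its InternalIP.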

@dezmodue
Contributor Author

dezmodue commented Jan 16, 2018

@chrislovecnm, if that is a satisfactory solution, I could try to work on it (with some guidance on how it should be implemented).

@chrislovecnm
Contributor

@liwenwu-amazon what is the recommended solution?

@liwenwu-amazon

The recommended solution is:

  • kubelets must also explicitly specify the primary IPv4 address on the primary ENI as their node-ip, for example:
--node-ip=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)

@chrislovecnm
Contributor

If someone wants to contribute this, I can provide more details on how to do this in nodeup.

@dezmodue
Contributor Author

dezmodue commented Feb 9, 2018

@chrislovecnm I would like to contribute

@dezmodue
Contributor Author

dezmodue commented Feb 9, 2018

@chrislovecnm I gave it a try - #4417 - let me know

@dezmodue
Contributor Author

@chrislovecnm, is it OK to close this issue, since #4417 was released with 1.9.0?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 16, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 15, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

