
nodeup error downloading assets from storage.googleapis.com when using official Ubuntu 20.04 AMI #10206

Closed
CheyiLin opened this issue Nov 10, 2020 · 22 comments

Comments

@CheyiLin
Contributor

CheyiLin commented Nov 10, 2020

1. What kops version are you running?
Version 1.18.2 (git-84495481e4)

2. What Kubernetes version are you running?

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.13", GitCommit:"30d651da517185653e34e7ab99a792be6a3d9495", GitTreeState:"clean", BuildDate:"2020-10-15T01:06:31Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.13", GitCommit:"30d651da517185653e34e7ab99a792be6a3d9495", GitTreeState:"clean", BuildDate:"2020-10-15T00:59:17Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --name my-cluster-name --yes

5. What happened after the commands executed?
The new node hasn't joined the cluster after more than 20 minutes.

kops logs
...
I1110 16:21:43.388467   23599 instancegroups.go:143] waiting for 15s after detaching instance
I1110 16:21:58.391501   23599 instancegroups.go:383] Validating the cluster.
I1110 16:22:01.241446   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": InstanceGroup "nodes-a1" did not have enough nodes 0 vs 1.
I1110 16:22:34.057496   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:23:07.378220   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:23:40.039431   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:24:12.913123   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:24:45.776340   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:25:18.334645   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:25:50.946073   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:26:24.604930   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:26:57.899771   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:27:30.567355   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:28:03.045796   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:28:35.805567   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:29:08.260794   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:29:41.070586   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:30:13.450487   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:30:46.739787   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:31:20.317658   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:31:52.790707   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:32:25.907071   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:32:58.928122   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:33:32.108136   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:34:04.983422   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:34:37.503496   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:35:10.819474   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:35:43.688290   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:36:16.256246   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:36:48.916137   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:37:21.481365   23599 instancegroups.go:437] Cluster did not pass validation within deadline: machine "i-0de5271375ff81ce1" has not yet joined cluster.
E1110 16:37:21.481387   23599 instancegroups.go:388] Cluster did not validate within 15m0s
I1110 16:37:21.481393   23599 instancegroups.go:383] Validating the cluster.
I1110 16:37:23.467696   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:37:55.861082   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:38:28.627294   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:39:01.600674   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:39:34.421850   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
I1110 16:40:06.954583   23599 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": machine "i-0de5271375ff81ce1" has not yet joined cluster.
...
kops-configuration logs
-- Logs begin at Tue 2020-11-10 05:30:22 UTC, end at Tue 2020-11-10 08:38:17 UTC. --
Nov 10 08:22:48 ip-aa-bb-cc-dd systemd[1]: Starting Run kops bootstrap (nodeup)...
-- Subject: A start job for unit kops-configuration.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit kops-configuration.service has begun execution.
--
-- The job identifier is 532.
Nov 10 08:22:48 ip-aa-bb-cc-dd nodeup[1094]: nodeup version 1.18.2 (git-84495481e4)
Nov 10 08:22:48 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:22:48.890020    1094 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
Nov 10 08:24:48 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:24:48.890432    1094 assetstore.go:203] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:24:48 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:24:48.891951    1094 main.go:138] got error running nodeup (will retry in 30s): error adding asset "e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d@https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:25:19 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:25:19.983102    1094 files.go:103] Hash did not match for "/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet": actual=sha256:a82068288cb43cf86d958233912a320bf093b61a04eeb20549daec776fc5a947 vs expected=sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d
Nov 10 08:25:19 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:25:19.988730    1094 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
Nov 10 08:27:20 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:27:20.270909    1094 assetstore.go:203] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:27:20 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:27:20.270973    1094 main.go:138] got error running nodeup (will retry in 30s): error adding asset "e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d@https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:27:50 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:27:50.448024    1094 files.go:103] Hash did not match for "/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet": actual=sha256:8b4c7ca1689d5484dce846e08f296a03200c05f92e8a2de438d61da570a38d9d vs expected=sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d
Nov 10 08:27:50 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:27:50.453848    1094 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
Nov 10 08:29:50 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:29:50.454465    1094 assetstore.go:203] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:29:50 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:29:50.455273    1094 main.go:138] got error running nodeup (will retry in 30s): error adding asset "e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d@https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:30:20 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:30:20.633689    1094 files.go:103] Hash did not match for "/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet": actual=sha256:d0ba48d28a01ee3821db53313bd6318dcb2667f329f6d4c7cfc51a8b1b4dbd96 vs expected=sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d
Nov 10 08:30:20 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:30:20.640534    1094 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
Nov 10 08:32:20 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:32:20.656351    1094 assetstore.go:203] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:32:20 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:32:20.657007    1094 main.go:138] got error running nodeup (will retry in 30s): error adding asset "e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d@https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:32:50 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:32:50.834808    1094 files.go:103] Hash did not match for "/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet": actual=sha256:d71116f32aefc3047cb88144d67ae11807abb992e679cfce1c5266812f67d82f vs expected=sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d
Nov 10 08:32:50 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:32:50.840658    1094 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
Nov 10 08:34:50 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:34:50.852161    1094 assetstore.go:203] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:34:50 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:34:50.852237    1094 main.go:138] got error running nodeup (will retry in 30s): error adding asset "e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d@https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:35:21 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:35:21.025586    1094 files.go:103] Hash did not match for "/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet": actual=sha256:e0835088000e8d677a33eb825d2343960f602c83de1eeb5639ef99786f1b931e vs expected=sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d
Nov 10 08:35:21 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:35:21.031075    1094 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
Nov 10 08:37:21 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:37:21.046954    1094 assetstore.go:203] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:37:21 ip-aa-bb-cc-dd nodeup[1094]: W1110 08:37:21.047019    1094 main.go:138] got error running nodeup (will retry in 30s): error adding asset "e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d@https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet": net/http: request canceled (Client.Timeout exceeded while reading body)
Nov 10 08:37:51 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:37:51.226316    1094 files.go:103] Hash did not match for "/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet": actual=sha256:02b9b011a59df81232d038791e227c3961c37787d878b7b46f4a06f2bb088efe vs expected=sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d
Nov 10 08:37:51 ip-aa-bb-cc-dd nodeup[1094]: I1110 08:37:51.232073    1094 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
...

6. What did you expect to happen?
The new node would join the cluster within a few minutes.

7. Please provide your cluster manifest

kops cluster & ig manifests
---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my-cluster-name.k8s.local
spec:
  additionalNetworkCIDRs:
  - my-network-cidr
  api:
    loadBalancer:
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://my-k8s-state/my-cluster-name.k8s.local
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-a1
      name: master-a1
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-a1
      name: master-a1
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
    node:
    - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    disableBasicAuth: true
    eventTTL: 3h0m0s
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubeDNS:
    coreDNSImage: k8s.gcr.io/coredns:1.7.0
    externalCoreFile: |-
      # forward AWS records to Amazon DNS server within VPC
      amazonaws.com:53 {
        errors
        prometheus :9153
        forward . 169.254.169.253
        cache 30
      }
      # forward external records to public DNS servers
      # public DNS providers: Google > Cloudflare > Quad9 (IBM)
      com net org io {
        errors
        prometheus :9153
        forward . 8.8.8.8 1.1.1.1 9.9.9.9 {
          policy sequential
        }
        cache 30
      }
      .:53 {
        log . "* - {type} {class} {name} {proto} {size} {rcode} {rsize} {duration} > {remote}"
        errors
        health
        autopath @kubernetes
        kubernetes cluster.local. in-addr.arpa ip6.arpa {
          pods verified
          endpoint_pod_names
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        cache 30
        loadbalance
        reload
        ready
      }
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    enforceNodeAllocatable: pods
    kubeReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
    systemReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
  kubernetesApiAccess:
  - 10.0.0.0/8
  kubernetesVersion: 1.17.13
  networkCIDR: my-network-cidr
  networkID: my-vpc-id
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - id: subnet-000001
    name: subnet-a1
    type: Private
    zone: us-west-2a
  - id: subnet-000002
    name: subnet-b1
    type: Private
    zone: us-west-2b
  - id: subnet-000003
    name: subnet-c1
    type: Private
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: master-a1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: master-a1
  role: Master
  rootVolumeOptimization: true
  subnets:
  - subnet-a1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-a1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-a1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-a1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-b1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-b1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-b1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-c1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-c1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-c1
  suspendProcesses:
  - AZRebalance

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

I think it might be a network performance issue with Ubuntu 20.04 on AWS, or a nodeup issue, not something related to the kops command-line tool.

9. Anything else do we need to know?
This issue does not occur when using the official Debian Stretch AMI kope.io/k8s-1.17-debian-stretch-amd64-hvm-ebs-2020-07-20.

UPDATE

I SSHed into that node and ran wget on the file that nodeup was trying to download; it took almost 5 minutes.

$ wget https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet

--2020-11-10 09:19:23--  https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.27.144, 172.217.160.112, 216.58.200.48, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.27.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 111712216 (107M) [application/octet-stream]
Saving to: ‘kubelet’

kubelet                100%[=============================================>] 106.54M   386KB/s in 4m 50s

Other nodes within the same VPC (not yet rolling-updated, still using the Debian Stretch AMI) don't have this issue:

$ wget https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet
--2020-11-10 09:29:15--  https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.24.16, 216.58.200.48, 172.217.160.80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.24.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 111712216 (107M) [application/octet-stream]
Saving to: ‘kubelet’

kubelet                100%[=============================================>] 106.54M  29.5MB/s in 4.6s

2020-11-10 09:29:21 (23.3 MB/s) - ‘kubelet’ saved [111712216/111712216]
@CheyiLin CheyiLin changed the title nodeup error downloading assets from storage.googleapis.com when using official ubuntu 20.04 AMI nodeup error downloading assets from storage.googleapis.com when using official Ubuntu 20.04 AMI Nov 10, 2020
@hakman
Member

hakman commented Nov 10, 2020

@CheyiLin just out of curiosity, what AWS region are you using?

@CheyiLin
Contributor Author

CheyiLin commented Nov 10, 2020

@hakman I'm using us-west-2.
I just tested with a larger instance type (c5.large) and got the same result.

And I found that PR #9136 hard-coded the HTTP timeout:

// Create a client with a shorter timeout
httpClient := http.Client{
    Timeout: 2 * time.Minute,
}

So nodes will never become ready if kubelet cannot be downloaded within 2 minutes.
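
A quick, hedged way to check from an affected node whether the download even fits inside that 2-minute window (a rough sketch using curl and timeout; nodeup itself uses Go's http.Client, so this only approximates its behaviour):

#!/bin/bash
# Rough sketch: nodeup's HTTP client times out after 2 minutes, so the
# ~107 MB kubelet binary needs roughly 1 MB/s sustained to succeed.
URL="https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
if timeout 120 curl -fsSL -o /dev/null "$URL"; then
    echo "download completed within nodeup's 2-minute timeout"
else
    echo "download did not finish within 2 minutes; nodeup would keep retrying and failing"
fi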

@hakman
Member

hakman commented Nov 10, 2020

Hmm, this is extremely slow. I will ask around whether k8s has other, faster mirrors.

100 MB in 2 minutes is very reasonable in this century. It is quite puzzling that one download finished in 4 minutes and the other in 4 seconds.

@CheyiLin
Contributor Author

@hakman Thank you.

I just realized why switching back to the Debian Stretch AMI works as expected.

On the Debian Stretch AMI, nodeup downloads only a few assets (docker, cri-containerd, the protokube tarball, and the protokube images), not big binaries like kubelet and kubectl. That's why it is faster than on the Ubuntu 20.04 AMI.

@CheyiLin
Contributor Author

UPDATE

The Debian Stretch AMI I used, kope.io/k8s-1.17-debian-stretch-amd64-hvm-ebs-2020-07-20, removed [1] the prebaked kubelet and kubectl.

So it is weird that, even though the Debian Stretch AMI still needs to download kubelet and kubectl from storage.googleapis.com, it is much faster [2] than the Ubuntu 20.04 AMI, which downloads them from the same place.

[1] kubernetes-sigs/image-builder@54c1026
[2] kops-configuration logs for the Debian Stretch AMI; both files downloaded in a few seconds.

-- Logs begin at Thu 2020-09-03 02:28:54 UTC, end at Wed 2020-11-11 03:28:31 UTC. --
Nov 11 02:17:42 ip-10-104-191-7 systemd[1]: Starting Run kops bootstrap (nodeup)...
Nov 11 02:17:42 ip-10-104-191-7 nodeup[945]: nodeup version 1.18.2 (git-84495481e4)
Nov 11 02:17:42 ip-10-104-191-7 nodeup[945]: I1111 02:17:42.114084     945 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet"
Nov 11 02:17:48 ip-10-104-191-7 nodeup[945]: I1111 02:17:48.821078     945 files.go:100] Hash matched for "/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet": sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d
Nov 11 02:17:48 ip-10-104-191-7 nodeup[945]: I1111 02:17:48.821131     945 assetstore.go:225] added asset "kubelet" for &{"/var/cache/nodeup/sha256:e71c3ce50f93abc2735ba601781355d86a49aec992e8cb235a369254c304fa7d_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubelet"}
Nov 11 02:17:48 ip-10-104-191-7 nodeup[945]: I1111 02:17:48.821226     945 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubectl"
Nov 11 02:17:51 ip-10-104-191-7 nodeup[945]: I1111 02:17:51.252850     945 files.go:100] Hash matched for "/var/cache/nodeup/sha256:25824bf20d8d9501e9b6cabdd6bd25ec10cc649d4e3155dba86037f57bba842e_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubectl": sha256:25824bf20d8d9501e9b6cabdd6bd25ec10cc649d4e3155dba86037f57bba842e
Nov 11 02:17:51 ip-10-104-191-7 nodeup[945]: I1111 02:17:51.252899     945 assetstore.go:225] added asset "kubectl" for &{"/var/cache/nodeup/sha256:25824bf20d8d9501e9b6cabdd6bd25ec10cc649d4e3155dba86037f57bba842e_https___storage_googleapis_com_kubernetes-release_release_v1_17_13_bin_linux_amd64_kubectl"}
...

@hakman
Member

hakman commented Nov 11, 2020

Can't say I can reproduce this even on a t3.small instance. Any special config on your side?

ubuntu@ip-10-4-211-54:~$ wget https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet
--2020-11-11 05:35:54--  https://storage.googleapis.com/kubernetes-release/release/v1.17.13/bin/linux/amd64/kubelet
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.21.240, 172.217.23.176, 172.217.22.80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.21.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 111712216 (107M) [application/octet-stream]
Saving to: ‘kubelet’

kubelet 100%[===================================================>] 106.54M  73.0MB/s    in 1.5s    

2020-11-11 05:35:56 (73.0 MB/s) - ‘kubelet’ saved [111712216/111712216]

ubuntu@ip-10-4-211-54:~$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"

@yuha0

yuha0 commented Nov 12, 2020

I am facing the exact same issue and I am afraid it is not related to kops.

We are experiencing slow download speeds in a private subnet when connecting through an internet gateway, and the slowness is only observed when downloading from Google-hosted servers:

  • Downloading kubelet from storage.googleapis.com, container images from GCR, the Google Chrome browser, and even the Go tarball from golang.org is all very slow.
  • But downloading anything from anywhere non-Google is fast (we tried Ubuntu images from different mirrors in North America, some software packages from Microsoft, and some random files on the internet, and they were all very fast).

This can be reproduced in our environment (subnet) on a standalone EC2 instance not managed by a kops ASG. It does not happen 100% of the time, but probably 90%. All existing, old EC2 instances in the same subnet still have fast download speeds; this only affects newly created instances. All instances use a pinned Ubuntu 20.04 AMI, and new worker nodes were always able to download kubelet quickly until this week.

We captured some pcaps and found that those new, slow instances appear to use too large a TCP MSS for some reason (8961 bytes), causing fragmentation/reassembly. On the existing old, fast instances, the TCP MSS is 1430. On the slow instances, if we manually adjust the MSS (for instance, ip route add 216.58.199.80/32 via 10.162.0.1 advmss 1340), the download speed recovers.

Also, we haven't seen this issue when using Debian or Ubuntu 18.04, both of which ship older kernels.
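
A small sketch of how one might confirm the MSS observation above and apply the per-route workaround from this comment (the destination IP, gateway, and advmss value are just the examples mentioned here, not values to copy blindly):

#!/bin/bash
# Sketch: inspect the MSS negotiated towards storage.googleapis.com, then
# optionally pin a smaller advmss for that destination only.
# IP/gateway values below are placeholders.
DEST=216.58.199.80   # one storage.googleapis.com address (example)
GW=10.162.0.1        # subnet gateway (example)

# Show per-connection TCP details (including mss) for sockets to that host.
ss -ti dst "$DEST"

# Pin a smaller advertised MSS for this destination only.
sudo ip route add "${DEST}/32" via "$GW" advmss 1340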

@hakman
Member

hakman commented Nov 12, 2020

Interesting, @yuha0. Any idea in which Ubuntu AMI this was changed?

------------------------------------------------------------------------------------------------
|                                        DescribeImages                                        |
+---------------------------+------------------------------------------------------------------+
|  2020-04-23T11:35:29.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423  |
|  2020-05-29T01:39:29.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200528  |
|  2020-06-10T01:14:32.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200609  |
|  2020-06-26T16:09:58.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200625  |
|  2020-07-02T03:20:26.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200701  |
|  2020-07-16T19:18:36.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716  |
|  2020-07-21T00:35:45.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200720  |
|  2020-07-30T15:39:59.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200729  |
|  2020-08-18T17:27:08.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200817  |
|  2020-09-04T22:46:25.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200903  |
|  2020-09-08T00:56:09.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907  |
|  2020-09-17T16:24:38.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200916  |
|  2020-09-25T01:16:48.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200924  |
|  2020-10-14T16:39:06.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014  |
|  2020-10-27T01:02:33.000Z |  ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201026  |
+---------------------------+------------------------------------------------------------------+

@hongymagic

hongymagic commented Nov 12, 2020

Last I heard, this was potentially introduced somewhere after kernel version 4.19.86.

While this behaviour still exists, and occurs with some random frequency (i.e., a lot of focal images worked okay, then after a certain point all new instances using focal started to exhibit this behaviour), we don't have any conclusive evidence of the exact cause yet.

Even later builds of the bionic image exhibit this same intermittent behaviour. But the one that has been going okay so far is:

Linux 4.15.0-1031-aws #33-Ubuntu SMP Fri Dec 7 09:32:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

@abuehaze

abuehaze commented Nov 13, 2020

Hey,

This behavior was introduced in kernel 4.19.86 by https://lore.kernel.org/patchwork/patch/1157936/.

Workarounds (see the command sketch below the list):

1. Decrease the TCP socket default rmem from 131072 to 87380 (reliable):

   sysctl -a | grep -i net.ipv4.tcp_rmem
   net.ipv4.tcp_rmem = 4096 131072 6291456

   sysctl -w net.ipv4.tcp_rmem="4096 87380 6291456"
   net.ipv4.tcp_rmem = 4096 87380 6291456

   sysctl -a | grep -i net.ipv4.tcp_rmem
   net.ipv4.tcp_rmem = 4096 87380 6291456

2. Revert https://lore.kernel.org/patchwork/patch/1157936/ (reliable)
3. Decrease the MTU to 2000 or lower (not 100% reliable)
4. Change sysctl_tcp_adv_win_scale from the default 1 to 0 or 2 (not 100% reliable)
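
For reference, a sketch of what workarounds 1, 3, and 4 look like as commands (the interface name and exact values are illustrative; per the list above, only the tcp_rmem change is considered fully reliable):

#!/bin/bash
# Sketch of workarounds 1, 3 and 4 above; adjust names/values to your setup.

# 1. Lower the default TCP receive buffer (reliable).
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 6291456"

# 3. Lower the MTU (not 100% reliable); "ens5" is a placeholder interface name.
sudo ip link set dev ens5 mtu 2000

# 4. Change tcp_adv_win_scale from its default of 1 (not 100% reliable).
sudo sysctl -w net.ipv4.tcp_adv_win_scale=2

# Persist the tcp_rmem change across reboots.
echo "net.ipv4.tcp_rmem = 4096 87380 6291456" | sudo tee /etc/sysctl.d/99-tcp-rmem.conf
sudo sysctl --system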

Technical explanation:

- Before https://lore.kernel.org/patchwork/patch/1157936/ was merged, we were actually relying on the tcp_fixup_rcvbuf function to increase the receive buffer while the connection was in progress. In this case, receive buffer auto-tuning starts after receiving one advertised window's worth of data; however, after the initial receive buffer was raised by patch a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), the receive buffer may take too long to start growing.

- After patch a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") was merged, we no longer have tcp_fixup_rcvbuf, which raised the receive buffer by multiplying rcvmem by 4 when sysctl_tcp_moderate_rcvbuf is enabled; that indeed gave the sending side more freedom to send data, which is reflected in the download speed from the start of the connection.

- Regarding why decreasing the MTU or the receive buffer improves performance: going through the code, the dynamic receive window now relies on the socket receive space, which is the minimum of tp->rcv_wnd (controlled by the tcp_rmem parameter) and TCP_INIT_CWND * tp->advmss, i.e. 10 * MSS (controlled by the MTU). As we can see, decreasing one of these parameters (MTU or tcp_rmem) decreases the initial rcvq_space.space, so dynamic scaling kicks off earlier, because the window is scaled whenever the receive space is found to be insufficient to accommodate the received data. The patch https://www.spinics.net/lists/netdev/msg526390.html actually tried to solve this issue, but it relies on a low advertised MSS, which we can't have with jumbo frames configured, as in this case.

http://lkml.iu.edu/hypermail/linux/kernel/1912.2/01978.html

https://www.spinics.net/lists/netdev/msg526390.html

Kernel 4.19.86
https://elixir.bootlin.com/linux/v4.19.86/source/net/ipv4/tcp_input.c#L441

tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);

Kernel 4.19.85

https://elixir.bootlin.com/linux/v4.19.85/source/net/ipv4/tcp_input.c#L451

tp->rcvq_space.space = tp->rcv_wnd;

- After I disable net.ipv4.tcp_moderate_rcvbuf on a 4.14 kernel, rcv_space is clearly capped, because tcp_fixup_rcvbuf won't increase rcvmem and thus the socket rcvbuf (receive space):

https://elixir.bootlin.com/linux/v4.19.85/source/net/ipv4/tcp_input.c#L441

static void tcp_fixup_rcvbuf(struct sock *sk)
{
	u32 mss = tcp_sk(sk)->advmss;
	int rcvmem;

	rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
		 tcp_default_init_rwnd(mss);

	/* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
	 * Allow enough cushion so that sender is not limited by our window
	 */
	if (sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)
		/* increases rcvmem (the receive space) when this flag is enabled */
		rcvmem <<= 2;

	if (sk->sk_rcvbuf < rcvmem)
		sk->sk_rcvbuf = min(rcvmem, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
}

@abuehaze

For awareness, this is a kernel bug. I have submitted patch [1] upstream to get this fixed, and [2] should be in the stable queue at the moment.

You can also apply the workaround below until the fix is merged:

$ echo "net.ipv4.tcp_rmem=4096 196608 6291456" >> /etc/sysctl.conf
$ sysctl -p

Links:

[1] https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/
[2] https://lore.kernel.org/netdev/20201208.162852.2205708169665484487.davem@davemloft.net/

@hakman
Member

hakman commented Dec 14, 2020

Very nice. Thanks a lot for the fix and for the explanations @abuehaze. Probably this will be picked up by Ubuntu pretty soon.

@justinsb
Member

justinsb commented Jan 24, 2021

By default kops overrides a set of sysctls from the OS defaults:

"# Increase the number of connections",
"net.core.somaxconn = 32768",
"",
"# Maximum Socket Receive Buffer",
"net.core.rmem_max = 16777216",
"",
"# Maximum Socket Send Buffer",
"net.core.wmem_max = 16777216",
"",
"# Increase the maximum total buffer-space allocatable",
"net.ipv4.tcp_wmem = 4096 12582912 16777216",
"net.ipv4.tcp_rmem = 4096 12582912 16777216",
"",

(And there are more in that file!)

In particular: net.ipv4.tcp_rmem = 4096 12582912 16777216; which has a much higher second number (the default receive buffer size, IIUC).

Do I understand correctly that we should reduce the size of the default buffer, because it's stopping TCP autotuning?

Also, if any of the TCP experts on this thread have any input on those values as defaults, that would be greatly appreciated - they haven't been touched since about 2017 as far as I can see, and 4 years is a long time in kubernetes! The values were originally based on nginx performance recommendations (provided here).

@justinsb
Member

justinsb commented Jan 24, 2021

Although of course I've realized that these TCP settings are applied by nodeup, so they don't apply to the download of nodeup itself. I don't think we have a dependency configured so that we download only after applying the sysctls, so I would guess they won't apply to the download of kubelet etc. either.

A potential workaround could therefore be to explicitly set net.ipv4.tcp_rmem in the bootstrap script, and then let nodeup set it again anyway.

Would still appreciate any guidance on whether the values we're setting make sense, but this feels like a fairly easy and safe fix!

justinsb added a commit to justinsb/kops that referenced this issue Jan 25, 2021
This ensures that we're using our settings for downloading nodeup
itself and any assets that nodeup downloads.  This is a workaround for
reported problems with the initial download on some kernels otherwise.

Issue kubernetes#10206
@CheyiLin
Contributor Author

As a workaround, I set the sysctl parameters via additionalUserData and made sure they won't be overridden by nodeup. In the past 2 months, every new node has joined the cluster within a reasonable ~5 minutes.

  additionalUserData:
  - name: custom-sysctl.sh
    type: text/x-shellscript
    content: |
      #!/bin/bash
      set -xe
      cat <<EOF > /etc/sysctl.d/999-k8s-custom.conf
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.ipv4.tcp_rmem = 4096 87380 16777216
      net.ipv4.tcp_wmem = 4096 87380 16777216
      EOF
      sysctl -p /etc/sysctl.d/999-k8s-custom.conf

hakman pushed a commit to hakman/kops that referenced this issue Jan 25, 2021
This ensures that we're using our settings for downloading nodeup
itself and any assets that nodeup downloads.  This is a workaround for
reported problems with the initial download on some kernels otherwise.

Issue kubernetes#10206
@callum-p

callum-p commented Feb 6, 2021

I'm still hitting this issue intermittently, even with the newest Ubuntu AMI and kops 1.19, which includes the fix from #10654.

DOboznyi pushed a commit to DOboznyi/kops that referenced this issue Feb 10, 2021
This ensures that we're using our settings for downloading nodeup
itself and any assets that nodeup downloads.  This is a workaround for
reported problems with the initial download on some kernels otherwise.

Issue kubernetes#10206

Signed-off-by: Dmytro Oboznyi <dmytro.oboznyi@syncier.com>
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 10, 2021
@fvasco
Contributor

fvasco commented May 28, 2021

We detected a similar error when launching two nodes at the same time.
One node works as expected and it downloads kubelet in 0.7s.

The other one logs:

May 28 13:53:23 ubuntu kernel: Linux version 5.4.0-1045-aws (buildd@lcy01-amd64-026) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #47-Ubuntu SMP Tue Apr 13 07:02:25 UTC 2021 (Ubuntu 5.4.0-1045.47-aws 5.4.101)
...
May 28 13:53:45 hidden-name nodeup[1304]: I0528 13:53:45.749057    1304 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet"
...
May 28 13:55:45 hidden-name nodeup[1304]: W0528 13:55:45.749528    1304 assetstore.go:248] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error downloading HTTP content from "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": c>
May 28 13:55:45 hidden-name nodeup[1304]: W0528 13:55:45.749578    1304 main.go:133] got error running nodeup (will retry in 30s): error adding asset "7688a663dd06222d337c8fdb5b05e1d9377e6d64aa048c6acf484bc3f2a596a8@https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error downloading HTTP conte>
...
May 28 13:56:16 hidden-name nodeup[1304]: I0528 13:56:16.333319    1304 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet"
...
May 28 13:58:16 hidden-name nodeup[1304]: W0528 13:58:16.333554    1304 assetstore.go:248] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error doing HTTP fetch of "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": Get "https:>
May 28 13:58:16 hidden-name nodeup[1304]: W0528 13:58:16.333591    1304 main.go:133] got error running nodeup (will retry in 30s): error adding asset "7688a663dd06222d337c8fdb5b05e1d9377e6d64aa048c6acf484bc3f2a596a8@https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error doing HTTP fetch of "h>
...
May 28 13:58:46 hidden-name nodeup[1304]: I0528 13:58:46.719200    1304 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet"
...
May 28 14:00:46 hidden-name nodeup[1304]: W0528 14:00:46.719423    1304 assetstore.go:248] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error doing HTTP fetch of "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": Get "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
May 28 14:00:46 hidden-name nodeup[1304]: W0528 14:00:46.719458    1304 main.go:133] got error running nodeup (will retry in 30s): error adding asset "7688a663dd06222d337c8fdb5b05e1d9377e6d64aa048c6acf484bc3f2a596a8@https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error doing HTTP fetch of "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": Get "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
...
May 28 14:01:17 hidden-name nodeup[1304]: I0528 14:01:17.056870    1304 http.go:78] Downloading "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet"
...
May 28 14:03:17 hidden-name nodeup[1304]: W0528 14:03:17.057099    1304 assetstore.go:248] error downloading url "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error doing HTTP fetch of "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": Get "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
May 28 14:03:17 hidden-name nodeup[1304]: W0528 14:03:17.057144    1304 main.go:133] got error running nodeup (will retry in 30s): error adding asset "7688a663dd06222d337c8fdb5b05e1d9377e6d64aa048c6acf484bc3f2a596a8@https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": error doing HTTP fetch of "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": Get "https://storage.googleapis.com/kubernetes-release/release/v1.20.6/bin/linux/amd64/kubelet": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Is this error related?

@callum-p


@fvasco yes, those are the same symptoms :(

@hakman
Member

hakman commented May 31, 2021

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 31, 2021
@olemarkus olemarkus removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 2, 2022
@olemarkus
Member

Believe this one to be solved now.

/close

@k8s-ci-robot
Contributor

@olemarkus: Closing this issue.

In response to this:

Believe this one to be solved now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
