Restore hangs waiting for load balancer ip #1365

Closed · lennart opened this issue May 29, 2024 · 2 comments

lennart (Contributor) commented May 29, 2024

Description

When restoring from an etcd snapshot (as described in the docs), the terraform apply step hangs waiting for the load balancer IP:

...
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl apply -k /var/post_install
module.kube-hetzner.null_resource.kustomization: Still creating... [10s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/cert-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/longhorn-system unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/system-upgrade unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/traefik unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): serviceaccount/hcloud-cloud-controller-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): serviceaccount/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): serviceaccount/system-upgrade unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): role.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrole.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): rolebinding.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrolebinding.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrolebinding.rbac.authorization.k8s.io/system-upgrade unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrolebinding.rbac.authorization.k8s.io/system:hcloud-cloud-controller-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): configmap/default-controller-env unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): deployment.apps/hcloud-cloud-controller-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): deployment.apps/system-upgrade-controller unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): daemonset.apps/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/cert-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/cilium unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/longhorn unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/traefik unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): + echo 'Waiting for the system-upgrade-controller deployment to become available...'
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for the system-upgrade-controller deployment to become available...
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade wait --for=condition=available --timeout=360s deployment/system-upgrade-controller
module.kube-hetzner.null_resource.kustomization (remote-exec): deployment.apps/system-upgrade-controller condition met
module.kube-hetzner.null_resource.kustomization (remote-exec): + sleep 7
module.kube-hetzner.null_resource.kustomization: Still creating... [20s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade apply -f /var/post_install/plans.yaml
module.kube-hetzner.null_resource.kustomization (remote-exec): plan.upgrade.cattle.io/k3s-agent unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): plan.upgrade.cattle.io/k3s-server unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): + timeout 360 bash
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
...
module.kube-hetzner.null_resource.kustomization: Still creating... [6m10s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization: Still creating... [6m20s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube-hetzner.null_resource.kustomization,
│   on .terraform/modules/kube-hetzner/init.tf line 291, in resource "null_resource" "kustomization":
│  291:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_883860071.sh": Process exited with status
│ 124
╵
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1

I SSHed into one of the control plane nodes and confirmed that the etcd snapshot is in fact restored. However, since the traefik service was removed during the restore (for understandable reasons), I am wondering how the apply step is supposed to finish. This looks like a deadlock: the rest of the deployment waits for the load balancer IP, but it will never get one, because the traefik service (which carries the annotations that connect it to the Hetzner load balancer) was removed.
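
The 360-second wait that eventually exits with status 124 is presumably just polling the traefik service for an ingress IP, which can never appear once the service object is gone. A minimal sketch of checking the same thing by hand, assuming the service is named traefik in the traefik namespace (matching the etcd keys deleted in the postinstall script below):

# Prints the external IP of the traefik LoadBalancer service, or nothing if
# the service is missing or has not been assigned an IP yet:
kubectl -n traefik get svc traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Rough equivalent of the loop that times out after 360 seconds:
timeout 360 bash -c 'until [ -n "$(kubectl -n traefik get svc traefik \
    -o jsonpath="{.status.loadBalancer.ingress[0].ip}" 2>/dev/null)" ]; do
  echo "Waiting for load-balancer to get an IP..."
  sleep 5
done'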

I understand that not deleting the traefik service might alter existing load balancers, which is unwanted, but is there something I am missing about how the traefik service eventually gets restored? I can see from the deploy logs that the post_install kustomizations are applied and that the traefik helm chart is reported as unchanged. In order to recreate the traefik service, I had to:

  • ssh into one of the control planes
  • delete the helmchart resource
  • change the kube.tf to not restore from an etcd snapshot anymore (set it to an empty string)
  • deploy the cluster

This re-ran the helm install job and created the service. Re-running the terraform apply step then finished, because the load balancer now had targets (it did not before the service was recreated).
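
For reference, the manual recovery boiled down to roughly the following. This is a sketch from memory; the namespace of the traefik HelmChart resource depends on where the module places it, so check with the first command before deleting anything:

# Find the traefik HelmChart resource (run on a control plane node via ssh):
kubectl get helmcharts.helm.cattle.io -A | grep traefik

# Delete it so the k3s helm controller re-installs the chart, which recreates
# the traefik LoadBalancer service (replace <namespace> with the one found above):
kubectl -n <namespace> delete helmchart traefik

# In kube.tf, stop restoring from the snapshot:
#   etcd_snapshot_name = ""
# then re-run the deployment so the kustomization step can finish:
terraform apply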

I guess this is not the way a restore is supposed to work, so I am interested in where I went wrong, or whether the restore process is meant to work differently.

Kube.tf file

locals {
  hcloud_token = "xxxxxxxxxxx"
  k3s_token = var.k3s_token # this is secret information, hence it is passed as an environment variable
  etcd_version = "v3.5.9"

  s3_access_key_id = "xxxxx"
  etcd_snapshot_name = "etcd-snapshot-al0-control-plane-nbg1-gph-1716834604"
  etcd_s3_endpoint   = "s3.nl-ams.scw.cloud"
  etcd_s3_bucket     = "al0-landscape-k3s-etcd-snapshots"
  etcd_s3_region     = "nl-ams"
  etcd_s3_access_key = local.s3_access_key_id
  etcd_s3_secret_key = var.s3_secret_key # this is secret information, hence it is passed as an environment variable
}

variable "k3s_token" {
  sensitive = true
  type      = string
}

variable "cloudflare_api_token" {
  type = string


  sensitive = true
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
  k3s_token = local.k3s_token

  postinstall_exec = local.etcd_snapshot_name == "" ? [] : [
    (
      <<-EOF
      export CLUSTERINIT=$(cat /etc/rancher/k3s/config.yaml | grep -i '"cluster-init": true')
      if [ -n "$CLUSTERINIT" ]; then
        echo indeed this is the first control plane node > /tmp/restorenotes
        k3s server \
          --cluster-reset \
          --etcd-s3 \
          --cluster-reset-restore-path=${local.etcd_snapshot_name} \
          --etcd-s3-endpoint=${local.etcd_s3_endpoint} \
          --etcd-s3-bucket=${local.etcd_s3_bucket} \
          --etcd-s3-access-key=${local.etcd_s3_access_key} \
          --etcd-s3-secret-key=${local.etcd_s3_secret_key} \
          --etcd-s3-region=${local.etcd_s3_region}
        mv /etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.backup.yaml
        ETCD_VER=${local.etcd_version}
        case "$(uname -m)" in
            aarch64) ETCD_ARCH="arm64" ;;
            x86_64) ETCD_ARCH="amd64" ;;
        esac;
        DOWNLOAD_URL=https://github.com/etcd-io/etcd/releases/download
        rm -f /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz
        curl -L $DOWNLOAD_URL/$ETCD_VER/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz -o /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz
        tar xzvf /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz -C /usr/local/bin --strip-components=1
        rm -f /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz

        etcd --version
        etcdctl version
        nohup etcd --data-dir /var/lib/rancher/k3s/server/db/etcd &
        echo $! > save_pid.txt
        etcdctl del /registry/services/specs/traefik/traefik
        etcdctl del /registry/services/endpoints/traefik/traefik
        OLD_NODES=$(etcdctl get "" --prefix --keys-only | grep /registry/minions/ | cut -c 19-)
        for NODE in $OLD_NODES; do
          for KEY in $(etcdctl get "" --prefix --keys-only | grep $NODE); do
            etcdctl del $KEY
          done
        done

        kill -9 `cat save_pid.txt`
        rm save_pid.txt
      else
        echo this is not the first control plane node > /tmp/restorenotes
      fi
      EOF
    )
  ]
  source = "github.com/lennart/terraform-hcloud-kube-hetzner?ref=feature%2Fsys-upgrade-disable-eviction-flag"

  system_upgrade_enable_eviction = false
  ssh_port = 23232
  ssh_public_key = "xyz"
  ssh_private_key = null
  ssh_max_auth_tries = 10
  network_region = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name = "control-plane-fsn1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
      zram_size = "2G" # remember to add the suffix, examples: 512M, 1G
      kubelet_args = ["runtime-request-timeout=10m0s"]
    },
    {
      name = "control-plane-nbg1",
      server_type = "cax11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
      kubelet_args = ["runtime-request-timeout=10m0s"]

      zram_size = "2G" # remember to add the suffix, examples: 512M, 1G
    },
    {
      name = "control-plane-hel1",
      server_type = "cax11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
      kubelet_args = ["runtime-request-timeout=10m0s"]

      zram_size = "2G" # remember to add the suffix, examples: 512M, 1G
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-arm",
      server_type = "cax31",
      location    = "nbg1",
      labels = [
        "node.longhorn.io/create-default-disk=config",
        "node.kubernetes.io/server-usage=local-storage-only"
      ],
      taints = [],
      nodes = {
        "3" : {
          zram_size = "4G" # remember to add the suffix, examples: 512M, 1G
        },
      }
      longhorn_volume_size = 0
      zram_size = "4G" # remember to add the suffix, examples: 512M, 1G
      kubelet_args = ["runtime-request-timeout=10m0s"]
    },
    {
      name        = "agent-x86",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.longhorn.io/create-default-disk=config",
        "node.kubernetes.io/server-usage=storage"
      ],
      taints = [],
      nodes = {
        "1" : {
          zram_size = "3G" # remember to add the suffix, examples: 512M, 1G
        },
      }
      longhorn_volume_size = 30
      zram_size = "3G" # remember to add the suffix, examples: 512M, 1G
      kubelet_args = ["runtime-request-timeout=10m0s"]
    },
  ]
  enable_wireguard = true
  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"
  base_domain = "beta.al0.de"
  etcd_s3_backup = {
    etcd-s3-endpoint            = local.etcd_s3_endpoint
    etcd-s3-access-key          = local.etcd_s3_access_key
    etcd-s3-secret-key          = local.etcd_s3_secret_key
    etcd-s3-bucket              = local.etcd_s3_bucket
    etcd-s3-region              = local.etcd_s3_region
    etcd-snapshot-schedule-cron = "*/10 * * * *"
    etcd-snapshot-retention     = 9 # 3 of each control plane node
  }
  enable_longhorn = true
  disable_hetzner_csi = true
  kured_options = {
    "drain-timeout" : "5m",
    "force-reboot" : "true"
  }
  cluster_name = "al0"

  k3s_exec_server_args = "--kube-apiserver-arg enable-admission-plugins=PodTolerationRestriction,PodNodeSelector"
  cni_plugin = "cilium"
  disable_kube_proxy = true
  disable_network_policy = true
  dns_servers = [
    "1.1.1.1",
    "8.8.8.8",
    "2606:4700:4700::1111",
  ]
  extra_kustomize_deployment_commands = <<-EOT
    kubectl -n cert-manager wait --for condition=established --timeout=120s crds/clusterissuers.cert-manager.io
    sleep 7
    kubectl -n cert-manager wait --for condition=Available --timeout=120s deployment.apps/cert-manager-webhook
    kubectl apply -f /var/user_kustomize/letsencrypt-cloudflare.yaml
    kubectl delete secret git-via-ssh-server-keys
    kubectl create secret generic git-via-ssh-server-keys --from-file ssh_host_rsa_key=<(openssl genpkey -algorithm rsa -outform PEM 2> /dev/null)
    kubectl annotate nodes -l node.kubernetes.io/server-usage=storage node.longhorn.io/default-disks-config='[ { "path":"/var/lib/longhorn","allowScheduling":true, "storageReserved":42949672960, "tags":[ "nvme" ]}, { "name":"hcloud-volume", "path":"/var/longhorn","allowScheduling":true, "storageReserved":3221225000,"tags":[ "ssd" ] }]'
    kubectl annotate nodes -l node.kubernetes.io/server-usage=local-storage-only node.longhorn.io/default-disks-config='[ { "path":"/var/lib/longhorn","allowScheduling":true, "storageReserved": 10737420000, "tags":[ "nvme" ] }]'
  EOT
  extra_kustomize_parameters = {
    cloudflare_api_token : var.cloudflare_api_token
    s3_access_key_id : local.s3_access_key_id
    s3_secret_key : var.s3_secret_key
  }
  create_kubeconfig = false

  longhorn_values = <<EOT
defaultSettings:
  createDefaultDiskLabeledNodes: true
  kubernetesClusterAutoscalerEnabled: false
  defaultDataPath: /var/longhorn
  node-down-pod-deletion-policy: delete-both-statefulset-and-deployment-pod
  backupTarget: s3://al0-landscape-longhorn-system-backup@nl-ams/
  backupTargetCredentialSecret: longhorn-s3-backups
persistence:
  defaultFsType: ext4
  defaultClassReplicaCount: 1
  defaultClass: false
  EOT

  traefik_version = "27.0.2"
}

provider "hcloud" {
  token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.43.0"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

output "k3s_token" {
  value     = module.kube-hetzner.k3s_token
  sensitive = true
}

variable "s3_secret_key" {
  sensitive = true
  type      = string
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}

Screenshots

No response

Platform

linux

lennart added the bug label May 29, 2024
mysticaltech (Collaborator) commented

@lennart Try unsetting the version of traefik and running terraform init -upgrade
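
In kube.tf terms, that suggestion amounts to roughly the following (a sketch; traefik_version is the attribute pinned in the config above):

# In kube.tf, remove or comment out the pin so the module falls back to its
# default traefik chart version:
#   # traefik_version = "27.0.2"

# Then refresh module and provider versions and re-apply:
terraform init -upgrade
terraform apply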

lennart (Contributor, Author) commented Jun 16, 2024

Thanks for the reply, I will try and report back. (I had problems with some kustomizations that depend on a specific version of the traefik helm chart, which is why I pinned the version; I guess I can lift that restriction again now.)

mysticaltech changed the title from "[Bug]: Restore hangs waiting for load balancer ip" to "Restore hangs waiting for load balancer ip" Jun 21, 2024
mysticaltech removed the bug label Jun 21, 2024
kube-hetzner locked and limited conversation to collaborators Jun 21, 2024
mysticaltech converted this issue into discussion #1387 Jun 21, 2024
