When restoring from an etcd snapshot (as described in the docs), the terraform apply step hangs waiting for the load balancer IP:
...
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl apply -k /var/post_install
module.kube-hetzner.null_resource.kustomization: Still creating... [10s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/cert-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/longhorn-system unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/system-upgrade unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): namespace/traefik unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): serviceaccount/hcloud-cloud-controller-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): serviceaccount/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): serviceaccount/system-upgrade unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): role.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrole.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): rolebinding.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrolebinding.rbac.authorization.k8s.io/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrolebinding.rbac.authorization.k8s.io/system-upgrade unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): clusterrolebinding.rbac.authorization.k8s.io/system:hcloud-cloud-controller-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): configmap/default-controller-env unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): deployment.apps/hcloud-cloud-controller-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): deployment.apps/system-upgrade-controller unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): daemonset.apps/kured unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/cert-manager unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/cilium unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/longhorn unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): helmchart.helm.cattle.io/traefik unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): + echo 'Waiting for the system-upgrade-controller deployment to become available...'
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for the system-upgrade-controller deployment to become available...
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade wait --for=condition=available --timeout=360s deployment/system-upgrade-controller
module.kube-hetzner.null_resource.kustomization (remote-exec): deployment.apps/system-upgrade-controller condition met
module.kube-hetzner.null_resource.kustomization (remote-exec): + sleep 7
module.kube-hetzner.null_resource.kustomization: Still creating... [20s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade apply -f /var/post_install/plans.yaml
module.kube-hetzner.null_resource.kustomization (remote-exec): plan.upgrade.cattle.io/k3s-agent unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): plan.upgrade.cattle.io/k3s-server unchanged
module.kube-hetzner.null_resource.kustomization (remote-exec): + timeout 360 bash
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
...
module.kube-hetzner.null_resource.kustomization: Still creating... [6m10s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
module.kube-hetzner.null_resource.kustomization: Still creating... [6m20s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for load-balancer to get an IP...
╷
│ Error: remote-exec provisioner error
│
│ with module.kube-hetzner.null_resource.kustomization,
│ on .terraform/modules/kube-hetzner/init.tf line 291, in resource "null_resource" "kustomization":
│ 291: provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_883860071.sh": Process exited with status
│ 124
╵
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1
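For context on the "Process exited with status 124" above: that exit code comes from timeout(1), which the provisioner uses for its 360-second load-balancer wait loop (`timeout 360 bash`). When the deadline expires, timeout kills the loop and exits with 124. A minimal reproduction of that exit code:

```shell
# timeout(1) exits with 124 when the supervised command hits the deadline,
# which is exactly what happens to the 360-second load-balancer wait loop.
code=0
timeout 1 bash -c 'while true; do sleep 0.2; done' || code=$?
echo "exit code: $code"   # → exit code: 124
```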
I SSH'ed into one of the control plane nodes and confirmed that the etcd snapshot is in fact restored. However, since the traefik service is removed during the restore (for good reasons), I am wondering how the apply step is supposed to finish. This looks like a deadlock: the rest of the deployment waits for the load balancer IP, but it will never get one, because the traefik service (which carries the annotations that connect it to the Hetzner load balancer) was removed.
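The stuck wait can be sketched as a simple polling loop. Here `get_lb_ip` is a hypothetical stand-in for the provisioner's real query of the traefik Service's LoadBalancer status; with the Service deleted, the real query returns empty forever, so the loop below only terminates because the sketch simulates the IP appearing:

```shell
tmp="$(mktemp)"
# Hypothetical stand-in for querying the Service's LoadBalancer ingress IP;
# with the traefik Service deleted, this would stay empty forever.
get_lb_ip() { cat "$tmp"; }
# Simulate the IP eventually appearing (it never does during the broken restore).
( sleep 1; echo "10.0.0.5" > "$tmp" ) &
until ip="$(get_lb_ip)" && [ -n "$ip" ]; do
  echo "Waiting for load-balancer to get an IP..."
  sleep 0.2
done
echo "got: $ip"
rm -f "$tmp"
```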
I understand that not deleting the traefik service might alter existing load balancers, which is unwanted, but is there something I am missing about how the traefik service is eventually restored? I can see from the deploy logs that the post_install kustomizations are applied and that the traefik helm chart is reported as unchanged. In order to recreate the traefik service, I had to:
ssh into one of the control planes
delete the helmchart resource
change the kube.tf to not restore from an etcd snapshot anymore (set it to an empty string)
deploy the cluster
This re-ran the helm install job and recreated the service. Re-running the terraform apply step could then finish, as the load balancer now contained targets (it did not before the service was recreated).
I guess this is not the way a restore is supposed to work, so I am interested in where I went wrong, or whether restore used to work differently...
Kube.tf file
locals {
  hcloud_token       = "xxxxxxxxxxx"
  k3s_token          = var.k3s_token # this is secret information, hence it is passed as an environment variable
  etcd_version       = "v3.5.9"
  s3_access_key_id   = "xxxxx"
  etcd_snapshot_name = "etcd-snapshot-al0-control-plane-nbg1-gph-1716834604"
  etcd_s3_endpoint   = "s3.nl-ams.scw.cloud"
  etcd_s3_bucket     = "al0-landscape-k3s-etcd-snapshots"
  etcd_s3_region     = "nl-ams"
  etcd_s3_access_key = local.s3_access_key_id
  etcd_s3_secret_key = var.s3_secret_key # this is secret information, hence it is passed as an environment variable
}

variable "k3s_token" {
  sensitive = true
  type      = string
}

variable "cloudflare_api_token" {
  type      = string
  sensitive = true
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
  k3s_token    = local.k3s_token

  postinstall_exec = local.etcd_snapshot_name == "" ? [] : [
    (
      <<-EOF
      export CLUSTERINIT=$(cat /etc/rancher/k3s/config.yaml | grep -i '"cluster-init": true')
      if [ -n "$CLUSTERINIT" ]; then
        echo indeed this is the first control plane node > /tmp/restorenotes
        k3s server \
          --cluster-reset \
          --etcd-s3 \
          --cluster-reset-restore-path=${local.etcd_snapshot_name} \
          --etcd-s3-endpoint=${local.etcd_s3_endpoint} \
          --etcd-s3-bucket=${local.etcd_s3_bucket} \
          --etcd-s3-access-key=${local.etcd_s3_access_key} \
          --etcd-s3-secret-key=${local.etcd_s3_secret_key} \
          --etcd-s3-region=${local.etcd_s3_region}
        mv /etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.backup.yaml
        ETCD_VER=${local.etcd_version}
        case "$(uname -m)" in
          aarch64) ETCD_ARCH="arm64" ;;
          x86_64) ETCD_ARCH="amd64" ;;
        esac
        DOWNLOAD_URL=https://github.com/etcd-io/etcd/releases/download
        rm -f /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz
        curl -L $DOWNLOAD_URL/$ETCD_VER/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz -o /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz
        tar xzvf /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz -C /usr/local/bin --strip-components=1
        rm -f /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz
        etcd --version
        etcdctl version
        nohup etcd --data-dir /var/lib/rancher/k3s/server/db/etcd &
        echo $! > save_pid.txt
        etcdctl del /registry/services/specs/traefik/traefik
        etcdctl del /registry/services/endpoints/traefik/traefik
        OLD_NODES=$(etcdctl get "" --prefix --keys-only | grep /registry/minions/ | cut -c 19-)
        for NODE in $OLD_NODES; do
          for KEY in $(etcdctl get "" --prefix --keys-only | grep $NODE); do
            etcdctl del $KEY
          done
        done
        kill -9 `cat save_pid.txt`
        rm save_pid.txt
      else
        echo this is not the first control plane node > /tmp/restorenotes
      fi
      EOF
    )
  ]
  source                         = "github.com/lennart/terraform-hcloud-kube-hetzner?ref=feature%2Fsys-upgrade-disable-eviction-flag"
  system_upgrade_enable_eviction = false
  ssh_port                       = 23232
  ssh_public_key                 = "xyz"
  ssh_private_key                = null
  ssh_max_auth_tries             = 10
  network_region                 = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
      zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      kubelet_args = ["runtime-request-timeout=10m0s"]
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cax11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
      kubelet_args = ["runtime-request-timeout=10m0s"]
      zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
    },
    {
      name        = "control-plane-hel1",
      server_type = "cax11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
      kubelet_args = ["runtime-request-timeout=10m0s"]
      zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
    }
  ]
  agent_nodepools = [
    {
      name        = "agent-arm",
      server_type = "cax31",
      location    = "nbg1",
      labels = [
        "node.longhorn.io/create-default-disk=config",
        "node.kubernetes.io/server-usage=local-storage-only"
      ],
      taints = [],
      nodes = {
        "3" : {
          zram_size = "4G" # remember to add the suffix, examples: 512M, 1G
        },
      }
      longhorn_volume_size = 0
      zram_size            = "4G" # remember to add the suffix, examples: 512M, 1G
      kubelet_args         = ["runtime-request-timeout=10m0s"]
    },
    {
      name        = "agent-x86",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.longhorn.io/create-default-disk=config",
        "node.kubernetes.io/server-usage=storage"
      ],
      taints = [],
      nodes = {
        "1" : {
          zram_size = "3G" # remember to add the suffix, examples: 512M, 1G
        },
      }
      longhorn_volume_size = 30
      zram_size            = "3G" # remember to add the suffix, examples: 512M, 1G
      kubelet_args         = ["runtime-request-timeout=10m0s"]
    },
  ]
  enable_wireguard       = true
  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"
  base_domain            = "beta.al0.de"

  etcd_s3_backup = {
    etcd-s3-endpoint            = local.etcd_s3_endpoint
    etcd-s3-access-key          = local.etcd_s3_access_key
    etcd-s3-secret-key          = local.etcd_s3_secret_key
    etcd-s3-bucket              = local.etcd_s3_bucket
    etcd-s3-region              = local.etcd_s3_region
    etcd-snapshot-schedule-cron = "*/10 * * * *"
    etcd-snapshot-retention     = 9 # 3 of each control plane node
  }
  enable_longhorn     = true
  disable_hetzner_csi = true

  kured_options = {
    "drain-timeout" : "5m",
    "force-reboot" : "true"
  }
  cluster_name           = "al0"
  k3s_exec_server_args   = "--kube-apiserver-arg enable-admission-plugins=PodTolerationRestriction,PodNodeSelector"
  cni_plugin             = "cilium"
  disable_kube_proxy     = true
  disable_network_policy = true

  dns_servers = [
    "1.1.1.1",
    "8.8.8.8",
    "2606:4700:4700::1111",
  ]
  extra_kustomize_deployment_commands = <<-EOT
    kubectl -n cert-manager wait --for condition=established --timeout=120s crds/clusterissuers.cert-manager.io
    sleep 7
    kubectl -n cert-manager wait --for condition=Available --timeout=120s deployment.apps/cert-manager-webhook
    kubectl apply -f /var/user_kustomize/letsencrypt-cloudflare.yaml
    kubectl delete secret git-via-ssh-server-keys
    kubectl create secret generic git-via-ssh-server-keys --from-file ssh_host_rsa_key=<(openssl genpkey -algorithm rsa -outform PEM 2> /dev/null)
    kubectl annotate nodes -l node.kubernetes.io/server-usage=storage node.longhorn.io/default-disks-config='[ { "path":"/var/lib/longhorn","allowScheduling":true, "storageReserved":42949672960, "tags":[ "nvme" ]}, { "name":"hcloud-volume", "path":"/var/longhorn","allowScheduling":true, "storageReserved":3221225000,"tags":[ "ssd" ] }]'
    kubectl annotate nodes -l node.kubernetes.io/server-usage=local-storage-only node.longhorn.io/default-disks-config='[ { "path":"/var/lib/longhorn","allowScheduling":true, "storageReserved": 10737420000, "tags":[ "nvme" ] }]'
  EOT

  extra_kustomize_parameters = {
    cloudflare_api_token : var.cloudflare_api_token
    s3_access_key_id : local.s3_access_key_id
    s3_secret_key : var.s3_secret_key
  }
  create_kubeconfig = false

  longhorn_values = <<EOT
defaultSettings:
  createDefaultDiskLabeledNodes: true
  kubernetesClusterAutoscalerEnabled: false
  defaultDataPath: /var/longhorn
  node-down-pod-deletion-policy: delete-both-statefulset-and-deployment-pod
  backupTarget: s3://al0-landscape-longhorn-system-backup@nl-ams/
  backupTargetCredentialSecret: longhorn-s3-backups
persistence:
  defaultFsType: ext4
  defaultClassReplicaCount: 1
  defaultClass: false
EOT

  traefik_version = "27.0.2"
}
provider "hcloud" {
  token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.43.0"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

output "k3s_token" {
  value     = module.kube-hetzner.k3s_token
  sensitive = true
}

variable "s3_secret_key" {
  sensitive = true
  type      = string
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}
Screenshots
No response
Platform
linux
Thanks for the reply, I will try it and report back. (I had problems with some kustomizations depending on a certain version of the traefik helm chart, so I locked the version; I guess I can now lift this restriction again.)
mysticaltech changed the title from "[Bug]: Restore hangs waiting for load balancer ip" to "Restore hangs waiting for load balancer ip" on Jun 21, 2024