
Upgrading a clean cluster 1.27 to 1.28 - one of the nodes stuck in emergency mode #1355

Closed
asafbennatan opened this issue May 22, 2024 · 1 comment

Comments


asafbennatan commented May 22, 2024

Description

I installed the cluster on 1.27 and, once that finished, upgraded it to 1.28 without doing anything else (~3:42 UTC).
All nodes came back up and running except one (see screenshot).
I used the Hetzner console to have a look at that node; its terminal was stuck in emergency mode (see screenshot 2).
I pressed 'Enter', everything started, and the node is now online again and upgraded (this was at 6:41 UTC).

Following the OS's recommendation to look into journalctl -xb (output_reducted.txt attached), as far as I can gather the root cause is that /boot/writable could not be mounted.

Any idea why this would happen?
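
For reference, this is roughly what the emergency shell investigation looked like; it is a generic sketch rather than a copy of my session, and the boot-writable.mount unit name is my assumption based on systemd's path escaping, not something taken from the attached log:

# Inspect the current boot's log, as the emergency prompt suggests
journalctl -xb | grep -i -B 2 -A 5 "writable"

# Check the systemd mount unit for /boot/writable
# (unit name assumed from systemd path escaping: /boot/writable -> boot-writable.mount)
systemctl status boot-writable.mount

# See which device /boot/writable should be mounted from, and its filesystem state
grep writable /etc/fstab
lsblk -f

# Retry the mount once the device looks healthy
mount /boot/writable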

Kube.tf file

## All values are referenced from here - https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/kube.tf.example

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }

  source = "kube-hetzner/kube-hetzner/hcloud"
  
  hcloud_token = var.hcloud_token

  rancher_install_channel = "latest"
  initial_k3s_channel = "v1.28"

  version = "2.13.5"
  # ssh_port = 2222
  base_domain = "${replace(var.app_name,"-",".")}.XXX.XX"
  cluster_name = "${var.app_name}"
  # rancher_hostname = "XX.XX.XX"

  enable_cert_manager = false
  enable_rancher = false
  enable_longhorn = false
  # enable_traefik = false
  enable_klipper_metal_lb = "false"
  control_plane_lb_enable_public_interface = true
  # enable_nginx = true

  load_balancer_disable_public_network = false

  ssh_public_key = file("./ssh-key/id_rsa.pub")

  # For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_private_key = file("./ssh-key/id_rsa")
  network_region = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "control-plane-hel1",
      server_type = "cx21",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "workload-agent-0",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.kubernetes.io/pool=workload-agent-cx41"
      ],
      taints      = [],
      count       = 3,
      # longhorn_volume_size = 50
    },
    {
      name        = "longhorn-agent-0",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.kubernetes.io/server-usage=storage",
        "node.kubernetes.io/pool=longhorn-agent-0"
      ],
      taints      = [],
      count       = 3,
      longhorn_volume_size = 50
    }
  ]

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  ### The following values are entirely optional (and can be removed from this if unused)

  # You can refine a base domain name to be used in the form nodename.base_domain for setting the reverse DNS inside Hetzner
  

  # To use local storage on the nodes, you can enable Longhorn, default is "false".

  # The file system type for Longhorn, if enabled (ext4 is the default, otherwise you can choose xfs)
  # longhorn_fstype = "xfs"

  # how many replica volumes should longhorn create (default is 3)
  longhorn_replica_count = 1
  disable_hetzner_csi = false

  kured_options = {
    "concurrency": 3
  }

  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can set this to "false". Default is "true".


  # We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
  # as the free version of Traefik causes a little bit of downtime when the certificates get renewed. For proper SSL management,
  # we instead recommend you use cert-manager, which you can easily deploy with helm; see https://cert-manager.io/.
  # traefik_acme_tls = true
  ingress_controller = "none"  
  automatically_upgrade_os = true

  
  allow_scheduling_on_control_plane = false  
  automatically_upgrade_k3s = true

  cni_plugin = "cilium"
  cilium_version = "v1.15.4"
  cilium_routing_mode = "native"
}

Screenshots

Status after upgrade: (screenshot)

Stuck in emergency mode: (screenshot)

output_reducted.txt

Platform

Linux

asafbennatan added the bug label on May 22, 2024
mysticaltech (Collaborator) commented:
@asafbennatan You are in HA, so just turn the node off and back on. Normally it should work.
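
For anyone who would rather not use the Hetzner console UI, a minimal sketch of that power cycle with the hcloud CLI; NODE_NAME is a placeholder for the affected server's name as shown by hcloud server list:

# Identify the affected node
hcloud server list

# Power it off and back on (NODE_NAME is a placeholder)
hcloud server poweroff NODE_NAME
hcloud server poweron NODE_NAME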

mysticaltech removed the bug label on May 23, 2024
mysticaltech changed the title from "[Bug]: upgrading a clean cluster (just installed) 1.27 to 1.28 - one of the nodes stuck in emergency mode" to "Upgrading a clean cluster 1.27 to 1.28 - one of the nodes stuck in emergency mode" on May 23, 2024
kube-hetzner locked and limited the conversation to collaborators on May 23, 2024
mysticaltech converted this issue into discussion #1362 on May 23, 2024

