
Upgrading a clean cluster 1.27 to 1.28 - one of the nodes stuck in emergency mode #1355

Closed
asafbennatan opened this issue May 22, 2024 · 1 comment

Comments


asafbennatan commented May 22, 2024

Description

I installed the cluster on 1.27 and, once that finished, upgraded it to 1.28 without doing anything else (~3:42 UTC).
All nodes came back up and running except one (see screenshot).
I used the Hetzner console to have a look at that node; its terminal was stuck in emergency mode (see screenshot 2).
I pressed 'Enter', everything started, and the node is now online again and upgraded (this was at 6:41 UTC).

Following the OS's recommendation to look into journalctl -xb (output_reducted.txt attached), as far as I can gather the root cause is that /boot/writable could not be mounted.

Any idea why this would happen?
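
For reference, this is roughly what the emergency shell investigation looked like; it is a generic sketch rather than a copy of my session, and the boot-writable.mount unit name is my assumption based on systemd's path escaping, not something taken from the attached log:

# Inspect the current boot's log, as the emergency prompt suggests
journalctl -xb | grep -i -B 2 -A 5 "writable"

# Check the systemd mount unit for /boot/writable
# (unit name assumed from systemd path escaping: /boot/writable -> boot-writable.mount)
systemctl status boot-writable.mount

# See which device /boot/writable should be mounted from, and its filesystem state
grep writable /etc/fstab
lsblk -f

# Retry the mount once the device looks healthy
mount /boot/writable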

Kube.tf file

## All values are referenced from here - https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/kube.tf.example

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }

  source = "kube-hetzner/kube-hetzner/hcloud"
  
  hcloud_token = var.hcloud_token

  rancher_install_channel = "latest"
  initial_k3s_channel = "v1.28"

  version = "2.13.5"
  # ssh_port = 2222
  base_domain = "${replace(var.app_name,"-",".")}.XXX.XX"
  cluster_name = "${var.app_name}"
  # rancher_hostname = "XX.XX.XX"

  enable_cert_manager = false
  enable_rancher = false
  enable_longhorn = false
  # enable_traefik = false
  enable_klipper_metal_lb = "false"
  control_plane_lb_enable_public_interface = true
  # enable_nginx = true

  load_balancer_disable_public_network = false

  ssh_public_key = file("./ssh-key/id_rsa.pub")

  # For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_private_key = file("./ssh-key/id_rsa")
  network_region = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "control-plane-hel1",
      server_type = "cx21",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "workload-agent-0",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.kubernetes.io/pool=workload-agent-cx41"
      ],
      taints      = [],
      count       = 3,
      # longhorn_volume_size = 50
    },
    {
      name        = "longhorn-agent-0",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.kubernetes.io/server-usage=storage",
        "node.kubernetes.io/pool=longhorn-agent-0"
      ],
      taints      = [],
      count       = 3,
      longhorn_volume_size = 50
    }
  ]

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  ### The following values are entirely optional (and can be removed from this if unused)

  # You can refine a base domain name to be used in the form nodename.base_domain for setting the reverse DNS inside Hetzner
  

  # To use local storage on the nodes, you can enable Longhorn, default is "false".

  # The file system type for Longhorn, if enabled (ext4 is the default, otherwise you can choose xfs)
  # longhorn_fstype = "xfs"

  # how many replica volumes should longhorn create (default is 3)
  longhorn_replica_count = 1
  disable_hetzner_csi = false

  kured_options = {
    "concurrency": 3
  }

  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can set this to "false". Default is "true".


  # We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
  # as the free version of Traefik causes a little bit of downtime when the certificates get renewed. For proper SSL management,
  # we instead recommend you use cert-manager, which you can easily deploy with helm; see https://cert-manager.io/.
  # traefik_acme_tls = true
  ingress_controller = "none"  
  automatically_upgrade_os = true

  
  allow_scheduling_on_control_plane = false  
  automatically_upgrade_k3s = true

  cni_plugin = "cilium"
  cilium_version = "v1.15.4"
  cilium_routing_mode = "native"
}

Screenshots

Status after upgrade: (screenshot)

Stuck in emergency mode: (screenshot)

output_reducted.txt

Platform

Linux

asafbennatan added the bug label on May 22, 2024
mysticaltech (Collaborator) commented:
@asafbennatan You are in HA, so just turn the node off and back on. Normally it should work.
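
For anyone who would rather not use the Hetzner console UI, a minimal sketch of that power cycle with the hcloud CLI; NODE_NAME is a placeholder for the affected server's name as shown by hcloud server list:

# Identify the affected node
hcloud server list

# Power it off and back on (NODE_NAME is a placeholder)
hcloud server poweroff NODE_NAME
hcloud server poweron NODE_NAME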

mysticaltech removed the bug label on May 23, 2024
mysticaltech changed the title from "[Bug]: upgrading a clean cluster (just installed) 1.27 to 1.28 - one of the nodes stuck in emergency mode" to "Upgrading a clean cluster 1.27 to 1.28 - one of the nodes stuck in emergency mode" on May 23, 2024
kube-hetzner locked and limited the conversation to collaborators on May 23, 2024
mysticaltech converted this issue into discussion #1362 on May 23, 2024

