ImagePullBackOff on hcloud-csi-node and hcloud-csi-controller #442

Closed
melalj opened this issue Dec 5, 2022 · 7 comments

@melalj commented Dec 5, 2022

I started a fresh cluster and the terraform run went well, but after running kubectl get pods --all-namespaces I see the hcloud-csi-node and hcloud-csi-controller pods stuck with the error ImagePullBackOff.

When I inspect one of them, I see that pulls from k8s.gcr.io are being rejected with 403 Forbidden:

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  5m12s                default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         3m8s                 default-scheduler  Successfully assigned kube-system/hcloud-csi-controller-bb7658b8f-5fbbq to mycluster-it05-worker-large-lmb
  Normal   Pulled            3m7s                 kubelet            Successfully pulled image "hetznercloud/hcloud-csi-driver:2.1.0" in 130.499337ms
  Warning  Failed            3m7s                 kubelet            Failed to pull image "k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2": rpc error: code = Unknown desc = failed to pull and unpack image "k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2": failed to resolve reference "k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2": pulling from host k8s.gcr.io failed with status code [manifests v2.2.2]: 403 Forbidden
  Warning  Failed            3m7s                 kubelet            Error: ErrImagePull
  Normal   Pulling           3m7s                 kubelet            Pulling image "k8s.gcr.io/sig-storage/csi-resizer:v1.2.0"
  Warning  Failed            3m7s                 kubelet            Failed to pull image "k8s.gcr.io/sig-storage/csi-resizer:v1.2.0": rpc error: code = Unknown desc = failed to pull and unpack image "k8s.gcr.io/sig-storage/csi-resizer:v1.2.0": failed to resolve reference "k8s.gcr.io/sig-storage/csi-resizer:v1.2.0": pulling from host k8s.gcr.io failed with status code [manifests v1.2.0]: 403 Forbidden
  Warning  Failed            3m7s                 kubelet            Error: ErrImagePull
  Normal   Pulling           3m7s                 kubelet            Pulling image "k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2"
  Normal   Created           3m7s                 kubelet            Created container hcloud-csi-driver
  Warning  Failed            3m7s                 kubelet            Error: ErrImagePull
  Normal   Pulling           3m7s                 kubelet            Pulling image "hetznercloud/hcloud-csi-driver:2.1.0"
  Normal   Pulling           3m7s                 kubelet            Pulling image "k8s.gcr.io/sig-storage/csi-attacher:v3.2.1"
  Warning  Failed            3m7s                 kubelet            Failed to pull image "k8s.gcr.io/sig-storage/csi-attacher:v3.2.1": rpc error: code = Unknown desc = failed to pull and unpack image "k8s.gcr.io/sig-storage/csi-attacher:v3.2.1": failed to resolve reference "k8s.gcr.io/sig-storage/csi-attacher:v3.2.1": pulling from host k8s.gcr.io failed with status code [manifests v3.2.1]: 403 Forbidden
  Warning  Failed            3m7s                 kubelet            Failed to pull image "k8s.gcr.io/sig-storage/livenessprobe:v2.3.0": rpc error: code = Unknown desc = failed to pull and unpack image "k8s.gcr.io/sig-storage/livenessprobe:v2.3.0": failed to resolve reference "k8s.gcr.io/sig-storage/livenessprobe:v2.3.0": pulling from host k8s.gcr.io failed with status code [manifests v2.3.0]: 403 Forbidden
  Normal   Pulling           3m7s                 kubelet            Pulling image "k8s.gcr.io/sig-storage/livenessprobe:v2.3.0"
  Normal   Started           3m7s                 kubelet            Started container hcloud-csi-driver
  Warning  Failed            3m7s                 kubelet            Error: ErrImagePull
  Warning  Failed            3m6s                 kubelet            Error: ImagePullBackOff
  Warning  Failed            3m6s                 kubelet            Error: ImagePullBackOff
  Normal   BackOff           3m6s                 kubelet            Back-off pulling image "k8s.gcr.io/sig-storage/csi-resizer:v1.2.0"
  Warning  Failed            3m6s                 kubelet            Error: ImagePullBackOff
  Normal   BackOff           3m6s                 kubelet            Back-off pulling image "k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2"
  Normal   BackOff           3m6s                 kubelet            Back-off pulling image "k8s.gcr.io/sig-storage/livenessprobe:v2.3.0"
  Warning  Failed            3m6s                 kubelet            Error: ImagePullBackOff
  Normal   BackOff           3m5s (x2 over 3m6s)  kubelet            Back-off pulling image "k8s.gcr.io/sig-storage/csi-attacher:v3.2.1"

FYI here's my terraform file:

terraform {
  required_version = ">= 1.3.5"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "1.36.1"
    }
  }
  backend "gcs" {
    bucket = "xxx"
    credentials = "xxx"
  }
}

provider "hcloud" {
  token = var.hcloud_token
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "1.6.8"
  ssh_public_key = file(var.ssh_public_key)
  ssh_private_key = file(var.ssh_private_key)
  network_region = var.network_region
  enable_cert_manager = false

  control_plane_nodepools = [
    {
      name        = "master",
      server_type = "cpx11",
      location    = var.node_location,
      labels      = [],
      taints      = [],
      count       = var.node_count_master
    }
  ]

  agent_nodepools = [
    {
      name        = "worker-small",
      server_type = "cpx11",
      location    = var.node_location,
      labels      = [],
      taints = [],
      count       = var.node_count_workers_small
    },
    {
      name        = "worker-medium",
      server_type = "cpx21",
      location    = var.node_location,
      labels      = [],
      taints      = [],
      count = var.node_count_workers_medium
    },
    {
      name        = "worker-large",
      server_type = "cpx31",
      location    = var.node_location,
      labels      = [],
      taints      = [],
      count = var.node_count_workers_large
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = var.node_location
  base_domain = ""
  cluster_name = var.cluster_name

  extra_firewall_rules = [
    # all TCP
    {
      description     = "TCP all"
      direction       = "out"
      protocol        = "tcp"
      port            = "any"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    },
    # all UDP
    {
      description     = "UDP all"
      direction       = "out"
      protocol        = "udp"
      port            = "any"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

Any help would be much appreciated.

@mysticaltech (Collaborator)

@melalj Please run terraform destroy -auto-approve and terraform init -upgrade, then try again.

Also try clearing the content of extra_firewall_rules; I think the rules you have are blocking all outgoing comms, and that may be interfering with the container pulls.

@mysticaltech (Collaborator)

Here are the default firewall rules if you are curious; most of them are needed for the proper functioning of the cluster.

base_firewall_rules = concat([

@melalj (Author) commented Dec 6, 2022

I filed a ticket on the hcloud csi-driver repo (hetznercloud/csi-driver#339 (comment)), and they mentioned that it might be due to a temporary outage of the k8s.gcr.io registry. I solved it by using a proxy registry.
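
For reference, a minimal sketch of that kind of redirect using the module's k3s_registries input (assuming a module version that supports it; the exact YAML is illustrative, not my literal config):

  # Sketch only: goes inside the module "kube-hetzner" block. The k3s_registries
  # input takes a k3s registries.yaml document. This mirrors the deprecated
  # k8s.gcr.io host to its successor registry.k8s.io so the CSI sidecar images
  # can still be resolved when k8s.gcr.io misbehaves.
  k3s_registries = <<-EOT
    mirrors:
      k8s.gcr.io:
        endpoint:
          - "https://registry.k8s.io"
  EOT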

Regarding the firewall rules, what I did was open all outgoing network traffic:

extra_firewall_rules = [
    # all TCP
    {
      description     = "TCP all"
      direction       = "out"
      protocol        = "tcp"
      port            = "any"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    },
    # all UDP
    {
      description     = "UDP all"
      direction       = "out"
      protocol        = "udp"
      port            = "any"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]

Any insights on why this is not a good practice?

Thanks :)

@mysticaltech (Collaborator)

My bad @melalj, yes, you did indeed open the outgoing traffic to the world. I got confused for a sec there.

IMHO I wouldn't call that best practice; unless you know a service needs a specific port, it's better to keep things closed. You can always open a port later by adding a firewall rule for it in terraform and applying again.
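
For example, a sketch of a narrower rule that only opens a single outbound port (587, SMTP submission, is just a hypothetical example here):

  # Sketch only: opens one specific outbound TCP port instead of all of them.
  extra_firewall_rules = [
    {
      description     = "Outbound SMTP submission (hypothetical example)"
      direction       = "out"
      protocol        = "tcp"
      port            = "587"
      source_ips      = []
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]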

@mysticaltech (Collaborator)

@melalj Out of curiosity, by proxy registry do you mean you used our new k3s_registries feature that adds support for k3s private registries, or something else?

@melalj (Author) commented Dec 7, 2022

Exactly :) that feature came in pretty handy!

I used a self-hosted Sonatype Nexus registry that hosts my private Docker images but also proxies all public registries (docker.io, k8s.gcr.io, registry.k8s.io, ...).
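
Roughly, the k3s_registries config looks like the sketch below (nexus.example.com and the credentials are placeholders, not my real setup):

  # Sketch only: every public registry is mirrored through the Nexus proxy host.
  k3s_registries = <<-EOT
    mirrors:
      docker.io:
        endpoint:
          - "https://nexus.example.com"
      k8s.gcr.io:
        endpoint:
          - "https://nexus.example.com"
      registry.k8s.io:
        endpoint:
          - "https://nexus.example.com"
    configs:
      "nexus.example.com":
        auth:
          username: "pull-user"   # hypothetical read-only account
          password: "changeme"    # placeholder
  EOT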

@mysticaltech (Collaborator)

Wonderful, good to hear!

@s3rius Thanks again for your contribution! 🙏
