Random download failures - 403 errors [hetzner] #138

Closed
marblerun opened this issue Jan 12, 2023 · 69 comments
@marblerun

Hi,

Attempting to build a 3-node Kubernetes cluster using kubespray (latest) on Hetzner Cloud instances running Debian 11.

The first attempt failed due to a download failure for kubeadm on 1 of the 3 instances. Confirmed with a local download: 1 failure, 2 successes.
I swapped in a replacement instance and moved past this point; I assumed possible IP blacklisting, though this was not confirmed.

All 3 instances then downloaded 4 Calico networking containers and came to the pause 3.7 download, which uses a command like this:

root@kube-3:~# /usr/local/bin/nerdctl -n k8s.io pull --quiet registry.k8s.io/pause:3.7
root@kube-3:~# nerdctl images
REPOSITORY               TAG    IMAGE ID        CREATED         PLATFORM       SIZE         BLOB SIZE
registry.k8s.io/pause    3.7    bb6ed397957e    4 seconds ago   linux/amd64    700.0 KiB    304.0 KiB

On the failing instance, we see the following error when the pull is run by hand; via kubespray it tries 4 times and then fails the whole install at that point.

root@kube-2:~# /usr/local/bin/nerdctl -n k8s.io pull --quiet registry.k8s.io/pause:3.7
FATA[0000] failed to resolve reference "registry.k8s.io/pause:3.7": unexpected status from HEAD request to https://registry.k8s.io/v2/pause/manifests/3.7: 403 Forbidden

Do you have any idea why the download from this registry might be failing, and is there any alternative source I could try?

The IP address starts and ends as shown below; the pull was run a couple of minutes ago.

Thu 12 Jan 2023 02:52:21 PM UTC

65.x.x.244

Many thanks

Mike

@BenTheElder
Member

That endpoint works fine from here:

$ curl -IL https://registry.k8s.io/v2/pause/manifests/3.7
HTTP/2 307 
content-type: text/html; charset=utf-8
location: https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.7
x-cloud-trace-context: 9e8f3405a102bf4332d81593461d200a
date: Thu, 12 Jan 2023 20:13:55 GMT
server: Google Frontend
via: 1.1 google
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000

HTTP/2 200 
content-length: 2761
content-type: application/vnd.docker.distribution.manifest.list.v2+json
docker-content-digest: sha256:bb6ed397957e9ca7c65ada0db5c5d1c707c9c8afc80a94acbe69f3ae76988f0c
docker-distribution-api-version: registry/2.0
date: Thu, 12 Jan 2023 20:13:55 GMT
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

Is there a proxy involved?

Can nerdctl produce more verbose results? That path should have served a redirect to some other backend.
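One way to get more verbose output (a sketch, assuming the installed nerdctl supports its global --debug flag):

# The global --debug flag enables debug-level logging, which may surface the failing redirect.
nerdctl --debug -n k8s.io pull registry.k8s.io/pause:3.7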

@BenTheElder
Member

We don't even have code to serve 403 in the registry.k8s.io application, so that would be coming from the backing store we redirect to, but from the logs above we can't see that part.

@marblerun
Author

Thanks Ben,

As a temporary fix, I looked at the kubespray logs, downloaded the missing elements on a working instance, exported them to a local file, copied that over, and imported them back into the instance that is being blocked. I now have a working cluster, but it is concerning that access seems to be blocked in an arbitrary fashion.
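A minimal sketch of that export/import workaround, assuming ssh access between the instances (the host names and image are illustrative):

# On a working instance: pull the image and export it to a tar archive.
/usr/local/bin/nerdctl -n k8s.io pull registry.k8s.io/pause:3.7
/usr/local/bin/nerdctl -n k8s.io save -o pause-3.7.tar registry.k8s.io/pause:3.7

# Copy the archive to the blocked instance and import it there.
scp pause-3.7.tar root@kube-2:/tmp/
ssh root@kube-2 '/usr/local/bin/nerdctl -n k8s.io load -i /tmp/pause-3.7.tar'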

Have a good weekend,

Mike

@tcahill

tcahill commented Jan 18, 2023

I'm seeing the same behavior in a similar context. I'm trying to install the kube-prometheus-stack helm chart on a k3s cluster in Hetzner Cloud (hosted in their Oregon location) and getting a 403 when pulling registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0. Interestingly I'm only seeing this behavior on one of the three hosts comprising my cluster, which are all running Ubuntu 22. It's also not consistent on the problematic host - I occasionally get a successful response but primarily see 403s.

> We don't even have code to serve 403 in the registry.k8s.io application

For me the 403 is appearing without following the redirect:

curl -v https://registry.k8s.io/v2/pause/manifests/3.7
*   Trying 34.107.244.51:443...
* Connected to registry.k8s.io (34.107.244.51) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* [CONN-0-0][CF-SSL] TLSv1.0 (OUT), TLS header, Certificate Status (22):
* [CONN-0-0][CF-SSL] TLSv1.3 (OUT), TLS handshake, Client hello (1):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Server hello (2):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Finished (20):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Certificate (11):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, CERT verify (15):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Finished (20):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Finished (20):
* [CONN-0-0][CF-SSL] TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=registry.k8s.io
*  start date: Dec 31 01:52:06 2022 GMT
*  expire date: Mar 31 02:44:39 2023 GMT
*  subjectAltName: host "registry.k8s.io" matched cert's "registry.k8s.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Supplemental data (23):
* h2h3 [:method: GET]
* h2h3 [:path: /v2/pause/manifests/3.7]
* h2h3 [:scheme: https]
* h2h3 [:authority: registry.k8s.io]
* h2h3 [user-agent: curl/7.87.0]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7f356982fa90)
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Supplemental data (23):
> GET /v2/pause/manifests/3.7 HTTP/2
> Host: registry.k8s.io
> user-agent: curl/7.87.0
> accept: */*
> 
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):
< HTTP/2 403 
< content-type: text/html; charset=UTF-8
< referrer-policy: no-referrer
< content-length: 317
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
< 
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>403 Forbidden</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Forbidden</h1>
<h2>Your client does not have permission to get URL <code>/v2/pause/manifests/3.7</code> from this server.</h2>
<h2></h2>
</body></html>
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Supplemental data (23):
* Connection #0 to host registry.k8s.io left intact

@BenTheElder
Member

Thanks for the additional logs.

cc @ameukam maybe cloud armor? I forgot about that dimension in the actual deployment.

This definitely looks like it's coming from the infra in front of the app; we also don't serve HTML, only redirects (or simple API errors).

@BenTheElder
Member

@ameukam and I discussed this yesterday.

This appears to be coming from the cloud loadbalancer security policy (we're using cloud armor, configured here: https://github.com/kubernetes/k8s.io/blob/f858f4680ada6385eaa4c76b2a295e33ec0ed51c/infra/gcp/terraform/k8s-infra-oci-proxy-prod/network.tf#L112).

I don't think we're doing anything special here; best guess is Hetzner IPs have been flagged for abuse?

I actually can't seem to find these particular requests in the loadbalancer logs; otherwise we could see what preconfigured rule this is hitting.

@BenTheElder
Member

I can see other 403s served by the security policy for more obviously problematic incoming requests like https://registry.k8s.io/?../../../../../../../../../../../etc/profile

@mysticaltech

mysticaltech commented Jan 26, 2023

Folks, I can confirm this issue shows up randomly when pulling CSI images. It seems that some IPs are blacklisted or something!

This has been a huge issue this last month for us! It started in late December.

@mysticaltech

mysticaltech commented Jan 26, 2023

> @ameukam and I discussed this yesterday.
>
> This appears to be coming from the cloud loadbalancer security policy (we're using cloud armor, configured here: https://github.com/kubernetes/k8s.io/blob/f858f4680ada6385eaa4c76b2a295e33ec0ed51c/infra/gcp/terraform/k8s-infra-oci-proxy-prod/network.tf#L112).
>
> I don't think we're doing anything special here; best guess is Hetzner IPs have been flagged for abuse?
>
> I actually can't seem to find these particular requests in the loadbalancer logs; otherwise we could see what preconfigured rule this is hitting.

That would make absolute sense! Somehow, some Hetzner IPs seem to be blacklisted. For our Kube-Hetzner project, it's been a real pain. Please fix 🙏

kube-hetzner/terraform-hcloud-kube-hetzner#524
kube-hetzner/terraform-hcloud-kube-hetzner#451
kube-hetzner/terraform-hcloud-kube-hetzner#442

@dims
Member

dims commented Jan 26, 2023

@mysticaltech can you please drop a few IP addresses of boxes that seem to have trouble?

@mysticaltech

@dims Definitely, I can try to get some.

@aleksasiriski Could you fetch some of the 10 IPs that you had reserved as static IPs because they were blocked by registry.k8s.io when used for nodes?

@mysticaltech

@dims I just deployed a test cluster of 10 nodes, and got "lucky" on one of them. The one affected IP is 5.75.240.113.

@aleksasiriski

> @dims Definitely, I can try to get some.
>
> @aleksasiriski Could you fetch some of the 10 IPs that you had reserved as static IPs because they were blocked by registry.k8s.io when used for nodes?

I had like 3 IPs that were blacklisted; I'll try to fetch them later today (UTC+1) when I'm home.

@dims
Member

dims commented Jan 26, 2023

> I just deployed a test cluster of 10 nodes, and got "lucky" on one of them. The one affected IP is 5.75.240.113.

[attachment: downloaded-logs-20230126-065347.json.txt]

I see 4 hits, all with a valid redirect using HTTP status 307s, no 403s at all.

The code it hits is here:
https://cs.k8s.io/?q=StatusTemporaryRedirect&i=nope&files=handlers.go&excludeFiles=&repos=kubernetes/registry.k8s.io

@mysticaltech

@dims Thanks for looking into this. The 403s are most probably appearing further down the request chain. As stated by @BenTheElder, it could be your LB security policy (cloud armor), configured here: https://github.com/kubernetes/k8s.io/blob/f858f4680ada6385eaa4c76b2a295e33ec0ed51c/infra/gcp/terraform/k8s-infra-oci-proxy-prod/network.tf#L112

@mysticaltech

mysticaltech commented Jan 26, 2023

Also @dims, something interesting discovered by one of our users: if they tried to pull the image manually with crictl pull a bunch of times, it would actually work at some point, as if the node were magically whitelisted again; afterward, pulling other images works too.

Sometimes it works after 100 tries, sometimes it just does not work, so it's a hit-or-miss situation! All this to say, there's something up with your LB IMHO.
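A minimal sketch of that retry workaround (the image name is the one reported above; the attempt cap and sleep interval are arbitrary):

# Retry the pull until the 403s stop; give up after 100 attempts.
for i in $(seq 1 100); do
  crictl pull registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0 && break
  sleep 5
done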

@mysticaltech

mysticaltech commented Jan 26, 2023

@dims I have created another small test cluster; the IP above, 5.75.240.113, has been reused, and it is failing again. I will leave it on for 24h so that you can gather more logs.

pulling from host registry.k8s.io failed with status code [manifests v2.7.0]: 403 Forbidden

@mysticaltech

Now if I ssh into the node and run the crictl pull command, I get the same error.

@mysticaltech

@dims Also an interesting finding: if I simply issue curl -v https://registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0 a few times in a row, it randomly returns either a 404 or a 403.

@dims
Member

dims commented Jan 26, 2023

@mysticaltech yeah, looks like there is very little tolerance for your range of IPs from Hetzner.

@mysticaltech

Exactly! Which is really a pain when working with Kubernetes. If it's possible to fix, that would be awesome.

@mysticaltech

@dims Did you do something? Because it started to work.

@dims
Member

dims commented Jan 26, 2023

@mysticaltech nope. The theory is still the same: cloud armor!

@mysticaltech

Oh my! Maybe some kind of form to request whitelisting? That would be kind of good. But not great for autoscaling nodes, for instance.

@valkenburg-prevue-ch

Hi, I'm getting an HTTP 404 code on this request:

{"access_time":"26/Jan/2023:17:04:49 +0000","upstream_cache_status":"","method":"GET","uri":"/artifacts-downloads/namespaces/k8s-artifacts-prod/repositories/images/downloads/ALMFTafKuNkWwH8ArOFD4KogY3p5kp9zcsZSbyhKLzMCEPih3pGxlf8hdweputz3nxUZBrevwToc16OLF7zMqHYUiYRUHvlEfEVSsuu2L5J4uzlOgj_1BY7ZHOHwmRLsHwyaJ8TQE8XlkrCSQSak71-6ZVgvBT9nv57reoR-AE6o4ei_iszTDpPq2xtnFA4tZpIL0tBJor_u8ZoD83KGOGN-aAHsqelMjVqLR5fPp3uluRC1I8coYtFZgafJjEKsqrkeVUdt9hQTHpQ-dGdlbIBOVPWaZCl1IeoDzlHcwrybwcYTB8hyYzJ--mHnaZWfOWs8i2p-dFzdPy68CBTaXgW-gDRymEFDCJe_3b8GhvFMnOOo0ldCZEk4K2fJsnTt_gMC2-4y1zr5k_TrUmcrV_nt8bo4tw4cvYCvb9EJn7GQ3LbkY41avfNbipQmoBkR-rZ9lPhySAVcmiharpD7gJYrqvSxSafP_IBJ3Oxkt0_aUY4A9n4qeqtZRZeSE-BoWdGhiagQVnPWDewkpAMY2M9XfotDZhOUIR_kb8nYWzSi4cjECfltywKzgriY2IT0TS1GoHBLwuJPpGrRFR0afzF-BOQTR8SUnb0b70zprBC8lSc4HkzzW_4MiPBbxPGpa6OXiIZbvjO6ORb-YXGXwCSsee4nkheizN1xTof6z_GHPVJFqhNRqNJSaN8Jfm2Dd0w0C6MBrkTP34K2hKnORXI=","request_type":"unknown","status":"404","bytes_sent":"19","upstream_response_time":"0.000, 0.048","host":"registry.k8s.io","proxy_host":"registry.k8s.io","upstream":"[2600:1901:0:1013::]:443, 34.107.244.51:443"}

from IP 78.47.222.2 (same project and provider as @mysticaltech). Does posting IPs and info like this help? Or is there something else I can provide?

@BenTheElder
Member

@valkenburg-prevue-ch /artifacts-downloads/.* is not a valid OCI distribution API path. The 404 is because that URL does not exist.


> Also @dims, something interesting, kube-hetzner/terraform-hcloud-kube-hetzner#451 (comment),

This actually points to the issue with Hetzner IPs existing with plain GCR as well.

k8s.gcr.io is a special alias domain provided by GCR, but it has the same allow-listing etc. as any other gcr.io registry.

Kubernetes doesn't run that infra; it just populates the images.

> Maybe some kind of form to request whitelisting?

I'm not sure how well this would scale given the largely volunteer staffing we have for this sort of free image host ...

It seems registry.k8s.io has no regression here vs k8s.gcr.io, though I can't recall ever having seen a similar issue reported to Kubernetes previously.

@BenTheElder
Member

At present I would recommend mirroring images, which also helps us reduce our massive distribution costs and reallocate resources towards testing etc.

@mysticaltech

@BenTheElder Thanks for clarifying. But Hetzner Cloud is still a major European cloud; not fully supporting it is a shame IMHO, and for a young open-source project like ours, we don't yet have the resources to deploy a full-blown mirror.

However, if we were to do that, how would you recommend we proceed? This is something we have obviously thought about; we have already considered both https://docs.k3s.io/installation/private-registry and https://github.com/goharbor/harbor. Would you recommend anything else that is an easy fix for this particular issue?

@BenTheElder
Member

> curl -v https://registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0 a few times in a row. It randomly returns either a 404 or a 403.

Again, this is not a valid API path, so the 404s are expected; the request is invalid. The 403s are seemingly due to the security mechanism(s).

I recommend crane pull --verbose registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0 /dev/null to see what valid request paths look like, or the distribution spec.
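For comparison, a well-formed manifest request for that image follows the distribution spec's /v2/<name>/manifests/<reference> layout, e.g.:

curl -IL https://registry.k8s.io/v2/sig-storage/csi-node-driver-registrar/manifests/v2.7.0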


> Thanks for clarifying. But Hetzner Cloud is still a major European cloud; not fully supporting it is a shame IMHO, and for a young open-source project like ours, we don't yet have the resources to deploy a full-blown mirror.

I hear that, but even as a large open source project we have constrained resources to host things, and we're not actively choosing to block these IPs; some security layer on our donated hosting infrastructure is blocking them. At the moment, keeping things online and trying to bring our spend back within budget is a bigger priority than resolving an issue that was also present in the previous infrastructure, and even that is a bit of a stretch. Open source staffing is hard :(

Perhaps you could ask your users to mirror for themselves if they encounter issues like this.

Hetzner might also have thoughts about this issue? It seems in their best interest to avoid what seems to be an IP reputation issue.

Searching online, I see similar discussions for Amazon CloudFront and Cloudflare with respect to Hetzner IP ban issues.

> However, if we were to do that, how would you recommend we proceed? This is something we have obviously thought about; we have already considered both https://docs.k3s.io/installation/private-registry and https://github.com/goharbor/harbor. Would you recommend anything else that is an easy fix for this particular issue?

Mirroring guides are something I hope to get folks to contribute. Options will depend on the tools involved client-side (like the container runtime).

For consuming a mirror, I usually recommend containerd's mirroring config (as dockershim is deprecated); cri-o has something similar I believe.
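A minimal sketch of that containerd config, using containerd's certs.d host-config layout (mirror.example.com is a placeholder for whatever registry host you mirror to):

# Requires config_path to be set in /etc/containerd/config.toml:
#   [plugins."io.containerd.grpc.v1.cri".registry]
#     config_path = "/etc/containerd/certs.d"
mkdir -p /etc/containerd/certs.d/registry.k8s.io
cat <<'EOF' > /etc/containerd/certs.d/registry.k8s.io/hosts.toml
# Try the mirror first for pulls; fall back to the real registry.
server = "https://registry.k8s.io"

[host."https://mirror.example.com"]
  capabilities = ["pull", "resolve"]
EOF
systemctl restart containerd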

For hosting a mirror, I recommend roughly: populate images with crane cp upstream mirror, where mirror is any preferred registry host. There are many other options like Harbor, though, that I've not personally used.
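Populating such a mirror can be a one-liner per image (a sketch; mirror.example.com is again a placeholder, and this should run from a machine that can reach registry.k8s.io, such as a development laptop):

# crane cp copies an image, with all its manifests and layers, between registries.
crane cp registry.k8s.io/pause:3.7 mirror.example.com/pause:3.7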

@guettli

guettli commented Mar 21, 2024

Today, on this machine, IPv4 was blocked and IPv6 worked:

❯ curl -6 -L https://registry.k8s.io/v2/pause/manifests/3.7 > /dev/null 
❯ curl -4 -L https://registry.k8s.io/v2/pause/manifests/3.7
<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 403 (Forbidden)!!1</title>
❯ nslookup registry.k8s.io
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
Name:   registry.k8s.io
Address: 34.96.108.209
Name:   registry.k8s.io
Address: 2600:1901:0:bbc4::
❯ curl -4 ipecho.net/plain
95.217.9.112

❯ curl -6 ipecho.net/plain
2a01:4f9:c011:8866::1

@rbjorklin

I'm finding myself in a similar predicament. I'm seeing ImagePullBackOff from nodes with these IPs:

  • 5.161.124.7
  • 5.161.233.212

Mirroring is a fine solution assuming the nodes I'm trying to mirror via can reach registry.k8s.io in the first place.

Here's to hoping someone at Google reads this and unblocks the Hetzner IP ranges associated with AS213230.

@BenTheElder
Member

> I'm finding myself in a similar predicament. I'm seeing ImagePullBackOff from nodes with these IPs

Sorry, unfortunately we cannot do more here. Please see the note at the top of: https://github.com/kubernetes/registry.k8s.io/blob/main/docs/debugging.md#debugging-issues-with-registryk8sio

> Mirroring is a fine solution assuming the nodes I'm trying to mirror via can reach registry.k8s.io in the first place.

The intent is that you populate the mirror from elsewhere (e.g. even your local development machine could push to a mirror) for more reliable consumption from your hosts / users.

Again: https://registry.k8s.io#stability

These images are being hosted for free download at great expense and run by ~volunteers from multiple companies and independents.

> Here's to hoping someone at Google reads this and unblocks the Hetzner IP ranges associated with AS213230.

There are people from Google working on this project 👋 but unfortunately I cannot publicly discuss the specifics of GCP's restrictions.

I will point out, however, that what is happening is not new to registry.k8s.io; it also applied to k8s.gcr.io and prior hosts (which were 100% funded by Google).

If someone wants to come work with SIG K8s Infra on an alternate implementation with sponsorship from other vendors, there are details about how to contact and participate in the README.

Alternatively if someone wanted to investigate hosting a mirror for Hetzner, that would be great, feel free to reach out.
https://github.com/kubernetes/registry.k8s.io#community-discussion-contribution-and-support

@mysticaltech

@apricote FYI the above. Running a registry.k8s.io mirror for Hetzner would be great.

@rbjorklin

For anyone still facing this problem I have been able to work around it by deploying peerd in my cluster.

@vitobotta

> For anyone still facing this problem I have been able to work around it by deploying peerd in my cluster.

Hi! I installed peerd in k3s (had to build the image with a changed containerd socket path, and it's now running), but how do I use it? Thanks

@rbjorklin

@vitobotta

  • Ensure /etc/containerd/config.toml contains:
[plugins."io.containerd.grpc.v1.cri".registry]
   config_path = "/etc/containerd/certs.d"

After that images will automatically be pulled from other nodes in your cluster if they are present.

@vitobotta

> @vitobotta
>
> • Ensure /etc/containerd/config.toml contains:
>
> [plugins."io.containerd.grpc.v1.cri".registry]
>    config_path = "/etc/containerd/certs.d"
>
> After that images will automatically be pulled from other nodes in your cluster if they are present.

Thanks :) In the meantime I ended up using https://github.com/spegel-org/spegel since it doesn't require me to open any ports in the firewall. Peerd does if I am not mistaken, right?

@mysticaltech

@valkenburg-prevue-ch FYI the above. The landscape of solutions for this has evolved fast! 🤯

@mysticaltech

> Thanks :) In the meantime I ended up using https://github.com/spegel-org/spegel since it doesn't require me to open any ports in the firewall. Peerd does if I am not mistaken, right?

@vitobotta Any tips on the config for k3s (or any other cluster)? Is it straightforward?

@mysticaltech

@phillebaba Your project is resolving a big need, thank you for that 🙏

@valkenburg-prevue-ch

> @valkenburg-prevue-ch FYI the above. The landscape of solutions for this has evolved fast! 🤯

Yeah, I've been following this discussion closely! Very interested.

@vitobotta

> > Thanks :) In the meantime I ended up using https://github.com/spegel-org/spegel since it doesn't require me to open any ports in the firewall. Peerd does if I am not mistaken, right?
>
> @vitobotta Any tips on the config for k3s (or any other cluster)? Is it straightforward?

Yes, there are some settings that differ on k3s, so here's how I ended up configuring it after some investigation. Hope it can save you and/or others some time:

helm upgrade --install \
  --version v0.0.22 \
  --create-namespace \
  --namespace spegel \
  --set spegel.containerdSock=/run/k3s/containerd/containerd.sock \
  --set spegel.containerdContentPath=/var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content \
  --set spegel.containerdRegistryConfigPath=/var/lib/rancher/k3s/agent/etc/containerd/certs.d \
  --set spegel.logLevel="DEBUG" \
  spegel oci://ghcr.io/spegel-org/helm-charts/spegel

The only problem I have encountered with Spegel is that even though it's a DaemonSet, for some reason at most exactly 100 pods are up and running. With larger clusters (I tried with 400- and 500-node clusters) it always maxes out at 100 pods, and all the others on the other nodes remain in a non-running state. I opened an issue about it here: spegel-org/spegel#459.

Other than that it seems to work pretty well. Like I mentioned, I tested with clusters of up to 500 nodes to increase the likelihood of getting some problematic IPs, and in fact every time there were many among that large number of nodes; thanks to Spegel, all pods that require images from problematic registries started without any issue. I also see a nice boost in the time it takes for a node to acquire an image from other nodes, so deployments scale more quickly, which is awesome.

@mysticaltech

Thanks @vitobotta, appreciate it. I guess 100 IPs should be enough to successfully do the job. FYI, @valkenburg-prevue-ch just found out that Spegel is already integrated within k3s and can be enabled with the --embedded-registry flag. https://docs.k3s.io/installation/registry-mirror 🥳
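A sketch of what enabling that looks like, based on the linked docs (assumes a systemd-managed k3s; TCP 5001 between nodes comes up later in this thread):

# On server nodes: turn on the embedded registry mirror.
echo "embedded-registry: true" >> /etc/rancher/k3s/config.yaml

# On all nodes: declare which registries should be mirrored peer-to-peer.
cat <<'EOF' > /etc/rancher/k3s/registries.yaml
mirrors:
  registry.k8s.io:
EOF

# Restart for the change to take effect (see the restart caveat discussed below).
systemctl restart k3s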

@vitobotta

> Thanks @vitobotta, appreciate it. I guess 100 IPs should be enough to successfully do the job. FYI, @valkenburg-prevue-ch just found out that Spegel is already integrated within k3s and can be enabled with the --embedded-registry flag. https://docs.k3s.io/installation/registry-mirror 🥳

I know, and I forgot to mention it. Maybe it's because it's still experimental and perhaps buggy, but I couldn't get the embedded Spegel support to work after many attempts. It also seems that with that version you need to open a port in the firewall. Perhaps I will try again when I have some more time.

@mysticaltech

mysticaltech commented May 1, 2024

@vitobotta Ah ok, good to know! So maybe for now your helm setup will be best to get the latest and greatest 🙏 @valkenburg-prevue-ch FYI

@vitobotta

> @vitobotta Ah ok, good to know! So maybe for now your helm setup will be best to get the latest and greatest 🙏 @valkenburg-prevue-ch FYI

I am trying the embedded spegel now again. Let's see if I can figure it out.

@phillebaba

Just to add to this: if you are running k3s, I would suggest using the embedded Spegel. It works just as well without having to deal with DaemonSets.

@vitobotta

> Just to add to this: if you are running k3s, I would suggest using the embedded Spegel. It works just as well without having to deal with DaemonSets.

Like I mentioned in a previous comment, I couldn't get it to work, so I tried the Helm installation of Spegel and that worked.

I am trying the embedded one now again and still can't get it to work. I have also tried with port 5001 open for the peer-to-peer exchange, but it's not working. Nothing is even listening on that port on the nodes, as if the embedded registry is not configured at all, but I am indeed using the --embedded-registry flag on the servers. Any suggestions?

@vitobotta

I think I know what the problem might be: when I create a cluster in Hetzner without a private network and then install the Hetzner Cloud Controller Manager, the CCM populates the external IP field of the nodes, leaving the internal IP unset. But the k3s documentation about the embedded registry mirror talks about communication over internal IPs, so perhaps that's why it's not working.

@vitobotta

@phillebaba have you actually gotten the embedded registry mirror working?

@vitobotta

Got it working! My mistake was that I enabled the embedded registry on existing clusters without restarting the agents. When I restart the agents or create a new cluster, everything works. It seems to require port 5001 to be open in the firewall though... will test this more.

@valkenburg-prevue-ch

Awesome news! Thanks for doing all this and reporting here.

In which firewall do you have to open port 5001, though? The Hetzner firewall only applies to the public internet, right? Isn't the private network always all open? Or am I missing something in our setup of MicroOS; is there a firewall too?

@vitobotta

> Awesome news! Thanks for doing all this and reporting here.
>
> In which firewall do you have to open port 5001, though? The Hetzner firewall only applies to the public internet, right? Isn't the private network always all open? Or am I missing something in our setup of MicroOS; is there a firewall too?

Yep, the public firewall. There are no restrictions within the private network afaik. The reason I am testing without private networks is that they support max 100 nodes, so it's impossible to create a large cluster with them. I have tested (using hetzner-k3s, my tool) with clusters of up to 500 nodes using the public network, and I could probably scale into the thousands now that I have added support for Cilium as CNI and for external datastores like Postgres instead of etcd. I wish I had the money to experiment with more nodes lol.

@valkenburg-prevue-ch

Thanks for clarifying. Do I understand correctly that for up to 100 nodes one does not need to open anything on the firewall, and that your use case with everything over public IPs might be "beyond the scope of the default supported setups"?

@vitobotta

Correct. If you use the private network, you don't need to open anything in the firewall, provided you configure everything to use the private interface.

@mysticaltech

Thanks @vitobotta and @phillebaba for sharing, really appreciate it.
