
Support for bare-metal workers #113

Closed
ByteAlex opened this issue Nov 8, 2020 · 30 comments
Labels
enhancement (New feature or request), stale

Comments

@ByteAlex

ByteAlex commented Nov 8, 2020

Hello,

is it possible to add servers from the Hetzner Robot to the cluster created with the CCM?

I've been using a K3s cluster that I bootstrapped manually, and when I tried to install the hcloud CCM, the hcloud:// provider did not work for any of the servers, neither the Cloud nor the Robot ones.

Now I've bootstrapped a cluster using kubeadm and followed the instructions, and the hcloud:// provider seems to be working. However, I still have my bare-metal servers, and before I let them join the cluster and possibly break the CCM, I'd rather ask for clarification first.

My expectations would be:

  • Pods requesting a PVC can't be scheduled on a Robot server (see the sketch at the end of this comment for a manual approximation)
  • Robot servers will be added to Load Balancers using their "external IP"

Thank you!
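For reference, a blunt manual approximation of the first expectation, assuming nothing automatic exists yet, would be to taint the Robot nodes and only add a toleration to workloads that do not need hcloud volumes; the taint key below is made up for illustration:

$ kubectl taint node <robot-node> example.com/robot-server=true:NoSchedule   # hypothetical taint key, <robot-node> is a placeholder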

@malikkirchner

The bare metal support would be highly appreciated. A label that causes the CCM to ignore bare-metal nodes would be fine as an intermediate step; that would keep the CCM functional and useful in the meantime.
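For the load-balancer side specifically there is already an upstream well-known exclusion label that the service controller honors; this is only a partial stopgap, it does not help with the node lifecycle, and whether this CCM's load-balancer code respects it would need to be verified:

$ kubectl label node <robot-node> node.kubernetes.io/exclude-from-external-load-balancers=true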

@LKaemmerling
Member

Additional (already closed) issues:
#9
#5

There are a few problems with adding dedicated servers as real "nodes" to the k8s cluster.

  1. Dedicated servers have a completely different API; using a Hetzner Cloud token does not allow fetching data about a root server.
  2. Based on the spec (https://kubernetes.io/docs/concepts/architecture/cloud-controller/#node-controller), k8s deactivates all nodes that are not known to the cloud provider (see the check below).

We will look into how we can improve this, but I cannot promise anything.
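You can see what the node controller matches on by listing each node's providerID; cloud servers get an hcloud:// ID from the CCM, while Robot servers never will (plain kubectl, nothing hcloud-specific):

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID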

@ctodea

ctodea commented Dec 18, 2020

Any update on this?

@batistein
Contributor

@malikkirchner

malikkirchner commented Mar 18, 2021

@ctodea we managed to get a cluster working, where most nodes, including the master, are cloud servers. And some nodes are root servers, e.g. for databases. Basically the root servers should be mostly ignored by the CCM and CSI plugin. Maybe this helps:

You need to connect the root servers via vSwitch, though.

Maybe #172 results in a mainline solution ...

@ctodea

ctodea commented Mar 18, 2021

Many thanks for the update @malikkirchner @batistein
I'll give it a try, but unfortunately I guess it won't be any time soon.

@identw

identw commented Mar 21, 2021

@ctodea we managed to get a cluster working, where most nodes, including the master, are cloud servers. And some nodes are root servers, e.g. for databases. Basically the root servers should be mostly ignored by the CCM and CSI plugin. Maybe this helps:

Hi @malikkirchner
I can see from the code that you are skipping creating routes for root servers because the API doesn't allow it (https://github.com/xelonic/hcloud-cloud-controller-manager/blob/root-server-support/hcloud/routes.go#L104). But I don't understand how pod-to-pod communication between cloud and dedicated nodes works for you.
For example:
10.240.0.2 - cloud node, 10.244.0.0/24 pod network on the cloud node
10.240.1.2 - dedicated node, 10.244.1.0/24 pod network on the dedicated node

But you can't create the route 10.244.1.0/24 via 10.240.1.2 in the API. So how does communication between pods in the 10.244.0.0/24 and 10.244.1.0/24 networks work for you?
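For concreteness, the route that the route controller would normally create, and that the API refuses when the gateway is a Robot/vSwitch address, corresponds to something like the following, assuming the hcloud CLI's network add-route subcommand and the example addresses above:

$ hcloud network add-route <network-name> --destination 10.244.1.0/24 --gateway 10.240.1.2   # rejected when the gateway is a Robot/vSwitch address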

@malikkirchner

Hi @identw,

that is an excellent point, I do not know and was wondering myself. According to #133 (comment) that should never have worked. We are using kubeadm to set up the cluster and Cilium as the CNI plugin. I am happy to share the exact config if you are interested.

I have two guesses as to how this can 'work'. Either the vSwitch does some routing that I do not understand, or Cilium somehow manages to route to the root server. Leakage over the public interface is ruled out by the root server's Hetzner firewall.

It is possible, though, that this is a bug that will be fixed and then stop working, like #133. If so, I was wondering whether it would make sense to use a WireGuard peer-to-peer layer between all nodes, as a kind of unified substrate for Cilium.

Any clarification on this topic is highly appreciated.

@identw

identw commented Mar 22, 2021

@malikkirchner

that is an excellent point, I do not know and was wondering myself

Cilium uses an overlay network between nodes (vxlan or geneve) by default; maybe you haven't disabled it?
Check your cilium configmap. For example:

$ kubectl -n kube-system get cm cilium-config -o yaml | grep "tunnel"
  tunnel: vxlan

This configuration will work either way, even without Hetzner Cloud networks and the vSwitch.
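For contrast, and hedged because the exact keys depend on the Cilium version, a native-routing setup, which is the case where the cloud routes actually matter, would show something like this in the same ConfigMap instead:

$ kubectl -n kube-system get cm cilium-config -o yaml | grep -E "tunnel|native-routing|auto-direct"
  auto-direct-node-routes: "false"
  native-routing-cidr: 10.244.0.0/16
  tunnel: disabled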

I was wondering if it would make sense, to use a layer of wireguard peer-to-peer between all nodes, kinda as a unified substrate for cilium

For Cilium this is not necessary, since it already knows how to build tunnels between nodes and does so by default. If encryption is required, Cilium supports IPsec (https://docs.cilium.io/en/v1.9/gettingstarted/encryption/).

Also, I recommend paying attention to latency when connecting a vSwitch to the cloud network:

ping from cloud node to dedicated node via public ip:

$ ping 135.181.96.131
PING 135.181.96.131 (135.181.96.131) 56(84) bytes of data.
64 bytes from 135.181.96.131: icmp_seq=1 ttl=59 time=0.442 ms
64 bytes from 135.181.96.131: icmp_seq=2 ttl=59 time=0.372 ms
64 bytes from 135.181.96.131: icmp_seq=3 ttl=59 time=0.460 ms
64 bytes from 135.181.96.131: icmp_seq=4 ttl=59 time=0.539 ms

ping from cloud node to same dedicated node via vswitch:

$ ping 10.240.1.2
PING 10.240.1.2 (10.240.1.2) 56(84) bytes of data.
64 bytes from 10.240.1.2: icmp_seq=1 ttl=63 time=47.4 ms
64 bytes from 10.240.1.2: icmp_seq=2 ttl=63 time=47.0 ms
64 bytes from 10.240.1.2: icmp_seq=3 ttl=63 time=46.9 ms
64 bytes from 10.240.1.2: icmp_seq=4 ttl=63 time=46.9 ms

~0.5ms via public network vs ~46.5ms via private network =(.

@malikkirchner

malikkirchner commented Mar 22, 2021

@identw thank you for the hint, you are right: our Cilium uses vxlan as the tunnel. That explains why it works. We deploy Istio on top of Cilium, so I guess there is no real need for Cilium encryption for us at the moment. As I understand it, enabling Cilium encryption also conflicts with some Istio features.

The ping from a cloud server to the dedicated server via vSwitch is not that bad for us:

# ping starfleet-janeway 
PING starfleet-janeway (10.0.1.2) 56(84) bytes of data.
64 bytes from starfleet-janeway (10.0.1.2): icmp_seq=1 ttl=63 time=3.70 ms
64 bytes from starfleet-janeway (10.0.1.2): icmp_seq=2 ttl=63 time=3.57 ms

Our cloud nodes are hosted in nbg1-dc3 and the dedicated server lives in fsn1-dc15. I guess it would be even better if we moved the cloud nodes to Falkenstein.

FYI we encountered a problem with Cilium and systemd in Debian bullseye, buster is fine: cilium/cilium#14658.

@identw

identw commented Mar 22, 2021

@malikkirchner

As I understand enabling the Cilium encryption also conflicts with some features of Istio.

I mentioned encryption because you wrote about WireGuard. Encryption is optional.

The ping from a cloud server to the dedicated server via vSwitch is not that bad for us:

Not so bad. I tested in the hel1 location (dedicated node from hel1-dc4, cloud node from hel1-dc2).

FYI we encountered a problem with Cilium and systemd in Debian bullseye, buster is fine: cilium/cilium#14658.

Thank you, interesting. I actually also use Cilium without kube-proxy, but I have not seen this bug.

@github-actions
Contributor

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

@github-actions github-actions bot added the stale label May 21, 2021
@Bessonov

further action occurs

@LKaemmerling added the enhancement (New feature or request) label and removed the stale label on May 21, 2021
@Donatas-L

I saw that someone made a repo (https://github.com/identw/hetzner-cloud-controller-manager) to solve this; has anyone tried it?

@randrusiak

Any updates here? @LKaemmerling, are you going to implement support for root servers soon?

@hendrikkiedrowski

@Donatas-L I tried it. It works great, with a few caveats. It would need a bit of attention from the community to keep pace with the development by the Hetzner team. @LKaemmerling, you may also want to have a look here; maybe you can take this idea ;)

@github-actions
Contributor

github-actions bot commented Nov 5, 2021

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

@acjohnson

acjohnson commented Nov 16, 2021

I am also interested in using bare-metal workers via vSwitch and have it working with the Calico CNI. Any chance this could be mainlined in the hcloud-cloud-controller-manager?

@wethinkagile

If we want to push the European cloud forward, we need to push the excellent Hetzner to grow beyond itself. That way, many open-source cloud projects and startups with GDPR/DSGVO-compliant ISMSs will be able to get founded in Europe. tl;dr: yes, I'm interested too.

@acjohnson

acjohnson commented Nov 17, 2021

I went ahead and rebased the work that @malikkirchner did against master from this repo and built a new image with a few fixes that seemed to be required to use Hetzner Robot servers via vSwitch/Cloud Networks.

src: https://github.com/acjohnson/hcloud-cloud-controller-manager/tree/root-server-support
image: https://hub.docker.com/r/acjohnson/hcloud-cloud-controller-manager

This seems to work almost perfectly, with only a couple of transient messages in the cloud controller's logs, such as:

I1117 01:31:27.718391       1 util.go:39] hcloud/getServerByName: server with name kube02 not found, are the name in the Hetzner Cloud and the node name identical?
E1117 01:31:27.718445       1 node_controller.go:245] Error getting node addresses for node "kube02": error fetching node by provider ID: hcloud/instances.NodeAddressesByProviderID: hcloud/providerIDToServerID: missing prefix hcloud://: , and error by node name: hcloud/instances.NodeAddresses: instance not found

...but otherwise load balancer creation works and ignores all nodes that have the instance.hetzner.cloud/is-root-server=true label set.

I'd file a PR but this really isn't my work, just a few fixes on top of what y'all have already done.

Hoping something more legit will make its way into this repo but for now this will have to do.
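For anyone trying that image: marking the Robot nodes looks roughly like this, assuming the label is applied by hand rather than by the controller itself (kube02 is just the node name from the log excerpt above):

$ kubectl label node kube02 instance.hetzner.cloud/is-root-server=true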

@acjohnson

@LKaemmerling would you consider reopening this issue? There is a fair bit of support for this feature and quite a bit of hacking that has gone into it already.

@malikkirchner

@acjohnson thank you for improving on Boris' change.

@maaft

maaft commented Oct 5, 2022

Uhm, why is this closed? Currently it does not work. What can I do, please? Are there any step-by-step instructions for how I can provision a load balancer connected to my 3 root servers?

@batistein
Contributor

@batistein
Contributor

It's already fully integrated with: https://github.com/syself/cluster-api-provider-hetzner

@maaft

maaft commented Oct 5, 2022

Ah, yes. I read about that CAPI provider a few days ago. Thanks, mate!

@maaft

maaft commented Oct 5, 2022

I'm getting Cloud provider could not be initialized: unknown cloud provider "hetzner" in the logs.

Any Idea how to fix this?

@batistein
Contributor

batistein commented Oct 5, 2022

Sounds like you have the wrong provider argument in the deployment... Did you only replace the image? See: https://github.com/syself/hetzner-cloud-controller-manager/blob/master/deploy/ccm.yaml#L63
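A quick way to check which argument is actually deployed, assuming the deployment is named hcloud-cloud-controller-manager in kube-system as in the stock manifest:

$ kubectl -n kube-system get deployment hcloud-cloud-controller-manager -o yaml | grep -e "--cloud-provider"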

@maaft

maaft commented Oct 5, 2022

Well, after removing the "old" ccm, I installed the suggested one with:

kubectl apply -f https://github.com/syself/hetzner-cloud-controller-manager/releases/latest/download/ccm.yaml

Which contains:

containers:
  - image: quay.io/syself/hetzner-cloud-controller-manager:v1.13.0-0.0.1
    name: hcloud-cloud-controller-manager
    command:
      - "/bin/hetzner-cloud-controller-manager"
      - "--cloud-provider=hetzner"
      - "--leader-elect=false"
      - "--allow-untagged-cloud"

Are any Slack/Discord channels available? I don't want to spam this issue further.

@batistein
Contributor

Kubernetes Slack workspace, channel #hetzner
