Extend cluster to more than 50 nodepools #894

Closed
thomasprade opened this issue Jul 20, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@thomasprade
Contributor

Description

I have an existing cluster with ~40 nodepools. After adding ~15 more nodepools to the agent_nodepools list, terraform apply failed with the following error message:

│ Error: subnet limit reached (resource_limit_exceeded)
│ 
│   with module.kube-hetzner.hcloud_network_subnet.agent[50],
│   on ../kubehetzner/main.tf line 42, in resource "hcloud_network_subnet" "agent":
│   42: resource "hcloud_network_subnet" "agent" {
│ 
╵

The documentation in the kube.tf file states that the maximum number of nodepools is 255 in total.
But since a new subnet is created for every nodepool, the limit of 50 subnets per network set by Hetzner prevents the creation of that many nodepools.

I could not find any option to configure this behaviour, so the actual limit on combined nodepools is apparently 50, due to the Hetzner subnet limitation.

A potential solution would be to add an optional subnet parameter to the nodepool configuration, so that multiple nodepools can be configured to use the same internal subnet, or to disable the creation of per-nodepool subnets altogether and simply add all nodes to the overarching 10.0.0.0/8 network.
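
A purely hypothetical sketch of what the first option could look like (the subnet key does not exist in the module today; the key name and CIDR are made up for illustration):

agent_nodepools = [
  {
    name        = "agent-group-a",
    server_type = "cpx11",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 1,
    subnet      = "10.1.0.0/16" # hypothetical: several nodepools could point at the same subnet
  },
  {
    name        = "agent-group-b",
    server_type = "cpx11",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 1,
    subnet      = "10.1.0.0/16" # same subnet as above, so no extra hcloud_network_subnet would be created
  }
]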

Kube.tf file

...
module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token

  source = "../terraform-hcloud-kube-hetzner"
  ssh_public_key = file("~/.ssh/id_ed25519.pub")
  ssh_private_key = file("~/.ssh/id_ed25519")
  network_region = "eu-central"

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cx21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cx21",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    ## more than 47 nodepools
  ]  

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  # load_balancer_type     = "lb11"
  # load_balancer_location = "fsn1"

  enable_klipper_metal_lb = "true"

  initial_k3s_channel = "v1.25"

  use_cluster_name_in_node_name = false

  enable_rancher = true

  rancher_hostname = "rancher.mydomain.xyz"
}
  
provider "hcloud" {
  token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}

terraform {
  required_version = ">= 1.4.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.41.0"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}

Screenshots

No response

Platform

Ubuntu

@thomasprade thomasprade added the bug Something isn't working label Jul 20, 2023
@mysticaltech
Collaborator

@thomasprade It makes sense. Please have a look at the subnet creation logic in locals.tf. If you can think of a way to solve this, a PR is most welcome; of course it needs to be optional behind a variable flag, so as to stay backward compatible. It's kind of a deep problem because of how the cluster networking works (but it's really the best solution we could find). And I believe there is also a limit of 500 nodes per Hetzner network, so I would imagine you are going to hit that pretty soon too.

What you could try is creating a VPN network overlay with Tailscale on top of multiple clusters. See https://docs.k3s.io/networking#multicluster-cidr-experimental. A PR is welcome for that feature too.

As a plan C, there's also Submariner (submariner.io), which was built by the Rancher team. But the Tailscale solution seems better.

@mysticaltech mysticaltech added enhancement New feature or request and removed bug Something isn't working labels Jul 22, 2023
@mysticaltech mysticaltech changed the title [Bug]: Cannot extend cluster to more than 50 nodepools Extend cluster to more than 50 nodepools Jul 22, 2023
@M4t7e
Contributor

M4t7e commented Jul 24, 2023

I think there is a limit of 100 nodes per network: https://docs.hetzner.com/cloud/networks/faq#are-there-any-limits-on-how-networks-can-be-used

You can attach up to 100 servers to a Network.

I am not sure whether it is possible to raise this limit via a request to Hetzner.

@M4t7e
Contributor

M4t7e commented Jul 24, 2023

@thomasprade Can you explain your use case in a bit more detail? It feels like you are creating a separate nodepool for each node. Just wondering why you can't simply increase the node count per nodepool, like this:

agent_nodepools = [
    {
      name        = "agent-small",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 10
    }
]

@thomasprade
Contributor Author

@M4t7e That is correct, each nodepool only has a single server in it.
Our application is (unfortunately) completely monolithic, and since we promise a certain amount of compute power to each of our customers, we provision one server for each deployment of our application, i.e. for each customer.
As for using multiple nodes within one nodepool, we decided against that for easier management of the customer-specific servers: if a customer terminates their contract, we can simply deprovision the node by setting its count to 0.

I know this is not the intended way to use a tool like this, or even Kubernetes in general, but it allows us to manage both servers and application deployments more easily.

So technically my issue is pretty much an edge case, but others may hit this problem as well, so a fix, or at least a warning/disclaimer, is in order.

@maggie44
Contributor

maggie44 commented Jul 24, 2023

I subscribed to this one as I am going to have a similar issue in the future, and for similar reasons. If there were a way to label nodes in groups of three when increasing the count, I wouldn't need to scale the nodepools. Instead, I have to create groups of nodepools, which lets me specify a cluster of three to deploy to through my Kubernetes config, like this:

default = [
    {
      name        = "x1-1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = ["cluster=a1"],
      taints      = [],
      count       = 1
    },
    {
      name        = "x1-2",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = ["cluster=a1"],
      taints      = [],
      count       = 1
    },
    {
      name        = "x1-3",
      server_type = "cpx11",
      location    = "hel1",
      labels      = ["cluster=a1"],
      taints      = [],
      count       = 1
    },
    {
      name        = "x2-1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = ["cluster=a2"],
      taints      = [],
      count       = 1
    },
    {
      name        = "x2-2",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = ["cluster=a2"],
      taints      = [],
      count       = 1
    },
    {
      name        = "x2-3",
      server_type = "cpx11",
      location    = "hel1",
      labels      = ["cluster=a2"],
      taints      = [],
      count       = 1
    }
  ]

Refactoring the monolithic app that runs on it is certainly the ideal solution, but that has to be a longer term objective.

@thomasprade
Contributor Author

@mysticaltech While debugging a few attempts at the subnet creation logic, I also found that creating the 42nd nodepool results in an IP address conflict with the default cluster CIDR of k3s, which is 10.42.0.0/16.

I tried setting the cluster_ipv4_cidr variable to something like 10.200.0.0/16, and the change is reflected in the deployment of the hcloud controller manager, but all pods are still given IPs in the default 10.42.x.x range.
Only when I set k3s_exec_server_args = "--cluster-cidr=10.200.0.0/24" are pods scheduled into that IP range.

So I added the argument to the k3s server start command in locals.tf:69, appending --cluster-cidr=${var.cluster_ipv4_cidr} after the k3s_exec_server_args. Since this variable's default matches the k3s default, it should be completely backwards compatible.
A PR for this change is coming ASAP.
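
For reference, a rough sketch of the idea (the surrounding locals in locals.tf are simplified here, and the exact wiring is an assumption; only cluster_ipv4_cidr and k3s_exec_server_args are taken from the module's inputs):

# Simplified illustration of appending the flag to the k3s server arguments
locals {
  k3s_server_args = join(" ", compact([
    var.k3s_exec_server_args,                  # user-provided extra flags
    "--cluster-cidr=${var.cluster_ipv4_cidr}", # defaults to the k3s default, 10.42.0.0/16
  ]))
}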

This definitely isn't a solution for the whole problem that @maggie44 is also facing, but at least it allows creating nodepools up to the limit of 50 in total (for now).

@M4t7e
Contributor

M4t7e commented Jul 24, 2023

@thomasprade That's a nice observation! I recently found another problem with cluster_ipv4_cidr. I use it with Cilium (WireGuard enabled), and there it works if you change it: the pods get IPs from the specified range, because Cilium takes care of that. The k3s config is not adjusted, as you've also seen, but k3s still expects the native default range (10.42.0.0/16). See: https://docs.k3s.io/cli/server#networking

The issue here is that the Hetzner network routes for the pod ranges are still created by k3s (my assumption), and they are totally decoupled from Cilium's network assignments. I am still trying to debug all the details, but my guess is that routed outbound connections (without SNAT) from the pods to/via the Hetzner network can break, because the Hetzner router does not have the correct pod routes for each node (it still has the k3s routes derived from the default 10.42.0.0/16). This might be an advanced use case (e.g. connecting the k3s cluster to another DC via VPN, or a pod contacting the Hetzner ingress LB via its internal IP; in general, everything that leaves the overlay network towards the Hetzner network), but I don't know exactly whether this is a big issue (maybe it is mitigated automatically by the CNI's masquerading), and if so, for which use cases exactly.

In general, my expectation was that k3s decides which ranges a node is allowed to use for its pods, sets the Hetzner network routes accordingly, and the CNI just uses what k3s assigned to the node. I don't have enough experience with k3s and additional CNI network assignments, and I am still trying to figure out the best practices here...

Whatever we do here, we should try to follow best practices and find a solution that works for all CNIs in a generic way.

@thomasprade
Contributor Author

@M4t7e Good to know that the pod IP issue behaves differently with Cilium.

In #902 I added the custom cluster and service CIDRs to the k3s start command, which makes k3s reliably use an IP range other than 10.42.0.0/16. If you could try these changes with your Cilium setup, that would be great.

A component that is, AFAIK, also involved in pod IP scheduling, or rather in the addition of the routes you mentioned, is the hcloud controller manager.

Right now I'm testing some solutions that omit the creation of a subnet for each nodepool altogether, i.e. just one subnet for all agents and one for all control planes.
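
As a rough illustration of that direction (hypothetical; the resource and network names are assumptions, not the module's current code), the per-nodepool subnet resources could collapse into a single shared subnet per role:

# Hypothetical: one shared subnet for all agent nodepools instead of one per nodepool
resource "hcloud_network_subnet" "agents" {
  network_id   = hcloud_network.k3s.id # assumed name of the cluster network resource
  type         = "cloud"
  network_zone = var.network_region
  ip_range     = "10.1.0.0/16"         # all agent nodes would draw their IPs from this single range
}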

@maggie44
Contributor

maggie44 commented Jul 24, 2023

I'm thinking something like this may be the best solution for me:

agent_nodepools = [
    {
      name        = "my_nodes_location1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count      = [
        {
            name = "node1", 
            labels = ["label1"]
        }, 
        {
            name = "node2",
            labels = ["label2"]
        },
        {
            name = "unique_useful_user_id_for_reference",
            labels = ["label2"]
        }
      ], # <-- starts 3 nodes
    }
  ]

For @thomasprade, without knowing the specifics of the use case, this might look something like:

control_plane_nodepools = [
    {
      name        = "my_nodes_location1",
      server_type = "cax11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count      = [{name = "customer1", labels = ["customer1"]}, {name = "node2", labels = []}],
    },
    {
      name        = "my_nodes_location2",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count      = [{name = "customer1", labels = ["customer1"]}, {name = "node2", labels = []}],
    },
    {
      name        = "my_nodes_location3",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count      = [{name = "customer1", labels = ["customer1"]}, {name = "node2", labels = []}],
    }
  ]

A deployment for customer 1 could then be created, using a NodeSelector based on the label "customer1", probably in its own namespace (a concrete sketch follows after the example below). The namespace could then be deleted when the customer leaves and the node deprovisioned:

count = [{}, {name = "node2", labels = []}],

Not super confident that this particular syntax is the best, but it serves as an example.
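
To make the NodeSelector idea above concrete, here is a minimal sketch using the hashicorp/kubernetes provider (the provider setup, label key, and image are assumptions; it only presumes the nodepool label ends up on the node as customer=customer1):

# Hypothetical: pin a customer's workload to the nodes labeled for that customer
resource "kubernetes_namespace_v1" "customer1" {
  metadata {
    name = "customer1"
  }
}

resource "kubernetes_deployment_v1" "customer1_app" {
  metadata {
    name      = "app"
    namespace = kubernetes_namespace_v1.customer1.metadata[0].name
  }
  spec {
    replicas = 1
    selector {
      match_labels = { app = "monolith" }
    }
    template {
      metadata {
        labels = { app = "monolith" }
      }
      spec {
        node_selector = {
          customer = "customer1" # matches a nodepool label like "customer=customer1"
        }
        container {
          name  = "monolith"
          image = "registry.example.com/monolith:latest" # placeholder image
        }
      }
    }
  }
}

Deleting the namespace, and removing that node's entry from the count list, would then deprovision everything for that customer.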

I think it is safe to say the two of us are looking for particularly niche setups. A better way to articulate this as a feature for the kube-hetzner project would be:

adding the ability to apply labels to individual nodes in a nodepool.

It wouldn't mean we would be able to use 255 nodepools as originally thought, but it would permit a similar effect.

@mysticaltech
Collaborator

@M4t7e @thomasprade It is indeed the cloud controller, in our case the hcloud cloud controller, that assigns the routes. You can see them in its pod's logs, and also when running hcloud network inspect ....

In the past with Cilium, I was able to make it work with the Hetzner network directly, without an overlay network (native routing). It was partly working but not super stable; however, a user recently found out that the MTU for Cilium was not set correctly, and that is now fixed, so maybe that was the cause.

@mysticaltech
Collaborator

@thomasprade I submitted your issue to GPT-4 out of curiosity, and you may find it interesting. I think it's really worth considering tweaking your workflow; here is what it said:


While it's a deviation from the conventional use of Kubernetes, your reasons for such an approach are understood. Let's explore some suggestions and thoughts:

  1. Embracing Kubernetes: Even though Kubernetes is often associated with microservices, monolithic applications can definitely run on it. The one-server-per-customer strategy is atypical, but perhaps there's a middle ground where you can maintain management convenience while leveraging Kubernetes more effectively.

  2. Namespace Isolation: If your primary objective is to isolate customer deployments, have you considered using Kubernetes namespaces? By assigning a namespace to each customer, you can isolate and manage resources on a per-customer basis.

  3. Node Tainting and Affinities: To ensure each customer's deployment only runs on its specific node, consider node taints and tolerations along with node/pod affinities. This way, only the designated customer's pods will run on their dedicated node.

  4. Cost Efficiency: Allocating a full server to each customer might not be the most economical, especially if not all customers utilize their server's resources entirely. Perhaps grouping smaller customers onto shared nodes and reserving dedicated nodes for larger customers could be more efficient.

  5. Infrastructure as Code (IaC): To streamline the provisioning and deprovisioning of customer resources, tools like Terraform or Pulumi might be beneficial. These can help automate the management of infrastructure elements specific to each customer.

  6. Logging and Monitoring: It's crucial to have robust monitoring and logging in place, especially when dedicating entire nodes to individual customers. Tools such as Prometheus, Grafana, or the ELK stack can provide invaluable insights.

  7. Documentation: Given the potential implications for others, it might be valuable to document this scenario in your project's FAQ or issues section. This could assist others who adopt similar deployment strategies.

  8. Feedback Loop: I encourage you to continuously share feedback or any modifications you make to meet your requirements. Collaboration can lead to a project that's both robust and versatile.

  9. Scaling Considerations: As your application grows, this approach may present scaling challenges. We should discuss potential scaling scenarios and how to utilize Kubernetes' inherent scaling mechanisms more efficiently.

  10. Potential Redesign: If there's ever a window of opportunity, consider breaking your monolithic application into smaller microservices. This could not only optimize Kubernetes usage but also offer improved scalability and flexibility.

In conclusion, while unconventional, the flexibility of Kubernetes allows for various deployment strategies. What's crucial is aligning these strategies with business needs while being aware of best practices and potential challenges.

@mysticaltech
Collaborator

@maggie44 Let's open a proper feature request for what you need in the issues section, as it seems related to the node labels we talked about the other day. (In discussions, I may forget about it, but not in the issues.)

@mysticaltech
Collaborator

@thomasprade Closing this for now, as there is not much we can do. The limitation, as you mentioned above, comes from the Hetzner network. I will adjust the docs to set proper expectations.

@thomasprade
Contributor Author

@mysticaltech In my PR #902 I already updated/clarified the documentation in the kube.tf to a certain degree. I hope that is at least somewhat helpful/sufficient.

I will also further investigate a potential solution for the subnet limitation, and if I get to something practical I will open a new PR.

Thanks for the help and for the suggestions above 👍
