
Azure Kubernetes publick8s suffers from SNAT port exhaustion: network slowness #3908

Closed
dduportal opened this issue Jan 15, 2024 · 17 comments

@dduportal
Contributor

Service(s)

Azure

Summary

The AKS cluster `publick8s` has been suffering from SNAT port exhaustion for around a month (example below covering the last 24 hours):

(Screenshot: 2024-01-15 17:37:50)

It notably causes network slowness.

Reproduction steps

No response

@dduportal dduportal added the triage Incoming issues that need review label Jan 15, 2024
@dduportal dduportal changed the title Azure Kubernetes publick8s suffers from SNAT port ehxautsion: network slowness Azure Kubernetes publick8s suffers from SNAT port exhaustion: network slowness Jan 15, 2024
@dduportal
Contributor Author

dduportal commented Jan 15, 2024

  • The ci.jenkins.io networks could be moved out of the publick8s network to separate outbound routes:
    • Since the agents are using the new subscription, we could set it up to use a NAT gateway for outbound traffic like we already did with trusted.ci and cert.ci
    • ci.jenkins.io is currently using a public IP for outbound: moving its controller VM/disk to the secondary subscription would ensure:
      • Complete removal of the public VNet peering (and separation of concerns) as ci.jenkins.io would be on another network
      • More usage accounted on that subscription's billing (~$450 monthly)
(Screenshot: 2024-01-15 18:29:12)

@lemeurherve lemeurherve changed the title Azure Kubernetes publick8s suffers from SNAT port exhaustion: network slowness Azure Kubernetes publick8s suffers from SNAT port exhaustion: network slowness Jan 16, 2024
@smerle33 smerle33 added this to the infra-team-sync-2024-01-23 milestone Jan 16, 2024
@dduportal dduportal self-assigned this Jan 17, 2024
@dduportal dduportal removed the triage Incoming issues that need review label Jan 17, 2024
dduportal added a commit to jenkins-infra/azure that referenced this issue Jan 17, 2024
…nnections (#579)

Ref. jenkins-infra/helpdesk#3908

This PR tunes the network outbound method used for both `publick8s` and
`privatek8s` (using loadbalancers) with:

- TCP idle timeout decreased from 30 min (default) to 4 min to recycle
sockets far more often
- Forced static allocation of `3200` (and `1600`) ports on the public
outbound IPs as per the Azure metrics (these values are the upper bounds
observed on each cluster's SNAT connection diagrams).
- Note that this disables dynamic allocation: it would only be a problem
if we had more than 50 nodes per cluster, which is not the case for these 2.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
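
For illustration, a minimal Terraform sketch of the kind of outbound load balancer tuning this commit describes, assuming the cluster is declared through the `azurerm` provider (all names and values here are illustrative, not the actual jenkins-infra/azure code):

```hcl
# Sketch only: AKS outbound load balancer tuning, illustrative names and values.
resource "azurerm_kubernetes_cluster" "publick8s" {
  name                = "publick8s"
  location            = "East US 2"        # assumption
  resource_group_name = "publick8s"        # assumption
  dns_prefix          = "publick8s"

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_D4s_v3"         # assumption
    node_count = 3
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    outbound_type  = "loadBalancer"

    load_balancer_profile {
      # Recycle idle sockets much faster than the 30 min default.
      idle_timeout_in_minutes = 4
      # Static SNAT port allocation per node (disables dynamic allocation,
      # fine as long as the cluster stays well under the node count limit).
      outbound_ports_allocated = 3200
    }
  }
}
```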
@dduportal
Contributor Author

Update:

Next step:

  • NAT gateway for ci.jenkins.io sponsorship network
  • migration of ci.jenkins.io VM to the sponsorship network

@dduportal
Contributor Author

Update:

  • ci.jenkins.io is now using a NAT gateway for its outbound traffic: if any slowness remains, it won't be related to SNAT port exhaustion

@dduportal
Contributor Author

Update:

* Loadbalancer tuned for both `publick8s` and `privatek8s`: [feat(publick8s,privatek8s) set up outbound LB to support more SNAT connections azure#579](https://github.com/jenkins-infra/azure/pull/579)
  
  * It required hotfixes [hotfix(publick8s) correct outbound LB SNAT port azure#580](https://github.com/jenkins-infra/azure/pull/580), [hotfix(privatek8s) correct outbound LB SNAT port azure#581](https://github.com/jenkins-infra/azure/pull/581) and [hotfix(publick8s,privatek8s) set up LB ports amount to valid values azure#582](https://github.com/jenkins-infra/azure/pull/582), as the specified max outbound port number must be:

    * Per VM
    * A multiple of 8 (or, said differently, a value that divides evenly into the 64,000 max ports available per public IP)
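
As a worked example of that constraint (illustrative numbers, not necessarily the exact values used here): with 3 outbound IPv4 addresses there are 3 × 64,000 = 192,000 SNAT ports in total; capped at 50 nodes, that allows at most 3,840 ports per VM, which is already a multiple of 8. In Terraform this could be computed as:

```hcl
# Sketch only: compute a valid per-VM outbound port allocation.
locals {
  outbound_ipv4_count = 3   # assumption: number of outbound public IPv4 addresses
  max_node_count      = 50  # assumption: upper bound on cluster nodes
  # 64,000 SNAT ports per public IP, split across the node cap,
  # rounded down to the nearest multiple of 8 as Azure requires.
  outbound_ports_allocated = floor(local.outbound_ipv4_count * 64000 / local.max_node_count / 8) * 8
}
```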

No SNAT port exhaustion detected in the past 12 hours as per Azure metrics:

(Screenshots: 2024-01-18 13:00:57 and 2024-01-18 13:01:05)

@dduportal
Contributor Author

  • The ci.jenkins.io networks could be moved out of the publick8s network to separate outbound routes:

    * Since the agents are using the new [subscription](https://github.com/jenkins-infra/helpdesk/issues/3818), we could set it up to use a NAT gateway for outbound traffic like we [already did with trusted.ci and cert.ci](https://github.com/jenkins-infra/azure/pull/567)
    * ci.jenkins.io is currently using a public IP for outbound: moving its controller VM/disk to the secondary subscription would ensure:

      * Complete removal of the public VNet peering (and separation of concerns) as ci.jenkins.io would be on another network
      * More usage accounted on that subscription's billing (~$450 monthly)
    
(Screenshot: 2024-01-15 18:29:12)

Tracking the ci.jenkins.io migration in #3913

@dduportal dduportal reopened this Jan 19, 2024
@dduportal
Contributor Author

While working on #3837 (comment), we saw the problem reappear due to additional nodes.

(Screenshot: 2024-01-19 11:16:25)

It sounds like we should add more public IPs to increase the threshold. If that does not suffice, we'll have to plan a cluster re-creation during the Kubernetes 1.27 upgrade with a new subnet (and an associated NAT gateway).

dduportal added a commit to jenkins-infra/azure that referenced this issue Jan 19, 2024
…NAT exhaustion (#587)

Related to jenkins-infra/helpdesk#3908

This PR increases the number of public IPs used for outbound connections
in `publick8s` in an attempt to raise the SNAT exhaustion threshold.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

dduportal commented Jan 19, 2024

Update:

dduportal added a commit to jenkins-infra/kubernetes-management that referenced this issue Jan 19, 2024
@dduportal
Contributor Author

Update:

* Opened a PR to increase the number of public IPs: [fix(publick8s) add more public IPs for outbound connection to avoid SNAT exhaustion azure#587](https://github.com/jenkins-infra/azure/pull/587)

  * Required a hotfix, as increasing the number of IPv4 addresses also required specifying the number of IPv6 addresses (set to 2): [jenkins-infra/azure@f3ba3d8](https://github.com/jenkins-infra/azure/commit/f3ba3d81e758a3a1d9d033d51ab99893d14a6d7a)

* Once deployed, we'll have to retrieve the values of these public IPs and add them to https://github.com/jenkins-infra/kubernetes-management/blob/main/config/ldap.yaml

  * Current outbound IPs:
outbound publick8s:

- 2603:1030:403:3::106
- 2603:1030:403:3::217
- 20.85.71.108
- 20.22.30.9
- 20.22.30.74

outbound privatek8s:

- 20.22.6.81

Let's see how the weekend goes.
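
For context, a hedged sketch of how these outbound IP counts can be expressed in the AKS load balancer profile with the `azurerm` provider (values are illustrative; the real change lives in azure#587 and its hotfix):

```hcl
# Sketch only: inside the azurerm_kubernetes_cluster resource for publick8s.
network_profile {
  network_plugin = "azure"
  outbound_type  = "loadBalancer"

  load_balancer_profile {
    idle_timeout_in_minutes  = 4
    outbound_ports_allocated = 3200
    # Raising the IPv4 count also requires setting the IPv6 count
    # explicitly on a dual-stack cluster.
    managed_outbound_ip_count   = 3
    managed_outbound_ipv6_count = 2
  }
}
```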

@dduportal
Contributor Author

Alas, we still see SNAT port exhaustion. Let's go with the "add a NAT gateway, but not explicitly" approach described in https://www.danielstechblog.io/preventing-snat-port-exhaustion-on-azure-kubernetes-service-with-virtual-network-nat/ (and other posts around the internet).
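
A minimal `azurerm` sketch of that approach: create a NAT gateway with its own public IP and associate it with the AKS node subnet, so outbound traffic stops consuming load balancer SNAT ports. All names below are hypothetical:

```hcl
# Sketch only: subnet-level NAT gateway for AKS outbound traffic.
resource "azurerm_public_ip" "nat_outbound" {
  name                = "publick8s-nat-outbound"   # hypothetical name
  location            = "East US 2"                # assumption
  resource_group_name = "publick8s"                # assumption
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "publick8s" {
  name                    = "publick8s-nat-gateway" # hypothetical name
  location                = "East US 2"
  resource_group_name     = "publick8s"
  sku_name                = "Standard"
  idle_timeout_in_minutes = 4
}

resource "azurerm_nat_gateway_public_ip_association" "publick8s" {
  nat_gateway_id       = azurerm_nat_gateway.publick8s.id
  public_ip_address_id = azurerm_public_ip.nat_outbound.id
}

resource "azurerm_subnet_nat_gateway_association" "publick8s" {
  # Assumes the AKS node subnet is managed in the same Terraform code base.
  subnet_id      = azurerm_subnet.publick8s.id
  nat_gateway_id = azurerm_nat_gateway.publick8s.id
}
```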

@dduportal
Contributor Author

Update: opened jenkins-infra/azure-net#198, let's prepare this PR and check the SNAT metrics before and after deploying to confirm the SNAT exhaustion disappears.

If it does, we'll then decrease the number of LB outbound IPs to pay less.

@dduportal
Contributor Author

Update:

  • Dashboard for SNAT tracking created in Azure (go to the "Dashboard Hub" section -> select the "AKS Outbound SNAT Connection" shared dashboard)
(Screenshot: 2024-01-23 14:40:28)

dduportal added a commit to jenkins-infra/kubernetes-management that referenced this issue Jan 23, 2024
dduportal added a commit to jenkins-infra/azure that referenced this issue Jan 23, 2024
Ref. jenkins-infra/helpdesk#3908

This PR adds the NAT gateway public IP to the allow list for both
`publick8s` and `privatek8s` to ensure all requests originating from
inside the clusters (autoscaler, node healthchecks, API commands for
`kubectl logs/exec`, etc.) are allowed to reach the control plane.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
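
A hedged sketch of what such an allow-list entry could look like on the AKS resource with the `azurerm` provider (the IP and surrounding arguments are placeholders, not the actual configuration):

```hcl
# Sketch only: allow the NAT gateway outbound IP to reach the API server,
# so in-cluster traffic (autoscaler, kubectl logs/exec, health checks)
# egressing through the NAT gateway is not blocked by the allow list.
resource "azurerm_kubernetes_cluster" "publick8s" {
  # ... other arguments as before ...

  api_server_access_profile {
    authorized_ip_ranges = [
      "20.0.0.10/32", # hypothetical NAT gateway public IP
      # ... existing admin/service ranges kept as-is ...
    ]
  }
}
```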
@dduportal
Contributor Author

Update: we'll delay the switch to a NAT gateway for after the 2.426.3 LTS release

@dduportal
Contributor Author

Update: we'll delay the switch to a NAT gateway for after the 2.426.3 LTS release

Let's go! Ref. jenkins-infra/azure-net#201

@dduportal
Contributor Author

Looks good for now:

(Screenshot: 2024-01-24 16:47:03)

@dduportal
Contributor Author

dduportal commented Jan 26, 2024

The fix was effective: we no longer see any SNAT exhaustion 🥳

(Screenshot: 2024-01-26 08:07:25)

@dduportal
Contributor Author

Confirmed after the weekend: we can close this issue:

(Screenshot: 2024-01-29 08:47:21)
