
[ci.jenkins.io] Migrate ci.jenkins.io EKS clusters out from CloudBees AWS account #3954

Open
dduportal opened this issue Feb 16, 2024 · 10 comments

@dduportal
Contributor

dduportal commented Feb 16, 2024

Service(s)

AWS, Azure, ci.jenkins.io, sponsors

Summary

Today, ci.jenkins.io utilizes 2 EKS clusters to spin up ephemeral agents (for plugin and BOM builds). These clusters are hosted in a CloudBees-sponsored account (historically used to host a lot of Jenkins services).

We want to move these clusters out of the CloudBees AWS account to ensure non-CloudBees Jenkins contributors can manage them, and to use credits from other sponsors: AWS, DigitalOcean and Azure have all granted us credits.

Initial working path (destination: AWS sponsored account)

AWS is sponsoring the Jenkins project with $60,000 for 2024, applied to a brand new AWS account.

We want to migrate the 2 clusters used by ci.jenkins.io into this new AWS account:

  • Moving out of the CloudBees-owned AWS account allows non-CloudBees employees to help manage these resources
  • Consuming these credits is key to ensuring this sponsorship continues over the long term

Updated working path

As discussed during the 2 previous infra SIG meetings, we have around $28k of credits on the sponsored Azure account, which expire at the end of August 2024 (originally May 2024, but @MarkEWaite asked for an extension of this deadline ❤️), while both the DigitalOcean and AWS (non-CloudBees) accounts have credits until January 2025.

=> As such, let's start by using a Kubernetes cluster on the sponsored Azure subscription (AKS) to consume these credits until the end of summer, before moving to the new AWS account.


Notes 📖

A few elements for planning these migrations:

Reproduction steps

No response

@dduportal
Contributor Author

First things first: connected with the jenkins-infra-team account (and its shared TOTP for 2FA) and was able to confirm we have the $60,000 credits:

(Screenshot from 2024-04-03 showing the $60,000 credits in the AWS console.)

@dduportal
Contributor Author

dduportal commented Apr 5, 2024

Update: proposal to bootstrap the AWS account. To be discussed and validated during the next weekly team meeting.

  • Root account:

  • Each Jenkins Infra team member ("OPS") will have a nominative AWS account with mandatory password and MFA, no API access (only Web Console) and only the permission to assume a role based on their "trust" level.

  • The following roles are proposed:

    • infra-admin: allows management of usual resources (EC2, EKS, S3, etc.) but also access (read only) to billing
    • infra-user: allows management of usual resources (EC2, EKS, S3, etc.)
    • infra-read: allows access (read-only) of usual resources (EC2, EKS, S3, etc.)
  • The infrastructure as code (jenkins-infra/aws, Terraform project) will have 2 IAM users, and each one will only be able to assume a role.

  • The "Assume Role" mechanism means AWS STS will be used to generate tokens valid for 1 hour (i.e. whether the Web Console or the API is used, the credential is only valid for 1 hour). It will require additional commands for end users or Terraform, but it avoids keeping API keys unchanged for months (or years). A sketch of this setup follows the list below.

  • We won't use the AWS IAM Identity Center as it is overkill (we only have one AWS account with just a few resources).

  • We won't deploy anything outside of a base region (possibly 2), in a single AZ per region (no HA: if it fails, it fails).

  • The scope of resources must be ephemeral workloads only, ideally just for ci.jenkins.io: it is a public service, so its workloads are considered unsafe and untrusted by default (hence no mixing with other controllers such as infra.ci.jenkins.io).
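For illustration, here is a minimal Terraform sketch of the "assume role" part of this proposal; the account ID, region, role name and IAM user name are assumptions, not the real values:

```hcl
# Sketch only: account ID, region, role and user names are illustrative assumptions.
provider "aws" {
  region = "us-east-2" # assumed base region

  assume_role {
    # STS issues short-lived credentials (1 hour) instead of long-lived API keys
    role_arn     = "arn:aws:iam::123456789012:role/infra-admin"
    session_name = "terraform-jenkins-infra"
  }
}

# Trust policy allowing only the dedicated Terraform IAM user to assume the role
resource "aws_iam_role" "infra_admin" {
  name                 = "infra-admin"
  max_session_duration = 3600 # 1 hour, matching the proposal above

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::123456789012:user/terraform-jenkins-infra" }
    }]
  })
}
```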

@dduportal dduportal changed the title [AWS] Migrate ci.jenkins.io EKS clusters from CloudBees AWS account to Jenkins AWS (sponsored) account [ci.jenkins.io] Migrate ci.jenkins.io EKS clusters out from CloudBees AWS account May 2, 2024
@dduportal
Contributor Author

Update:

  • Body of this issue updated to reflect the new destination chosen during the team meeting: until the end of August 2024, we want ci.jenkins.io to use Azure credits (instead of non-CloudBees AWS or DigitalOcean credits)

@dduportal
Contributor Author

dduportal commented May 6, 2024

Update: proposal for the new AKS cluster to be created soon:

  • Subscription: the Azure "sponsored" subscription (same as
  • Name: cijenkinsio-agents-1. This name is valid as per https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/aks-common-issues-faq#what-naming-restrictions-are-enforced-for-aks-resources-and-parameters-
    • No dots; only hyphens, letters and numbers. Less than 63 characters.
    • Full name of the service (cijenkinsio) to make identification easier
    • agents wording to make explicit this is the only acceptable usage for this cluster
    • Suffix -1 as we'll most probably need to create more clusters in the future (AWS and possibly DOKS): migration will be easier if we can increment while keeping the same naming convention
  • Connectivity:
    • API access restricted to Jenkins Infra admins, privatek8s (to allow infra.ci to operate the cluster with Terraform) and of course the ci.jenkins.io controller's subnet
    • Should use its own subnet in the "public_sponsorship" virtual network.
      • ⚠️ Need to refine the network sizing: this vnet is a /14 and already has 2 x /24 subnets. Need to carefully plan the sizing of the new subnet with the AKS network rules + the sizing of nodes and pods.
      • No ingress controller to be added:
        • None was present on cik8s
        • The ACP instance will be internal only: we'll set it up with the Kubernetes Service internal DNS.
          • 💡 If ACP cannot be used with the internal SVC hostname, we can always install a "private" ingress controller as a fallback, but only as a last resort
      • No network segregation between node pools: this cluster is not multi-tenant
  • Node pools:
    • All Linux node pools should use Azure Linux as the base OS (already done for the infra.ci and release.ci controllers)
    • Naming convention: given the constraints when naming node pools (ref. ):
      • Linux Node pools (12 char. max.) will have:
        • the OS on 1-char (l for Azure Linux, w for Windows, u for Ubuntu Linux)
        • the CPU arch on 3 chars (x86 for Intel/AMD x86_64 or a64 for arm64)
        • The OS on 3 chars (lin)
        • The number of pod agents expected to run on a single node of this pool (an integer), preceded by an n (n3, n24, etc.), on 3 chars max
        • An optional suffix of 3 letters to specify a custom usage. May replace the sizing if needed.
      • Linux node pools naming examples:
        • lx86n3 => Azure Linux x86_64 nodes which can run 3 "normal" pod agents at the same time
        • lx86n4bom => Azure Linux x86_64 nodes which can run 4 ("bom" only) pod agents at the same time
        • ua64n24bom => Ubuntu Linux arm64 nodes which can run 24 ("bom" only) pod agents at the same time
        • la64n2side => Azure Linux arm64 nodes used to run 2 "side" pods (e.g. custom applications such as ACP).
      • Windows Node pools (6 char. max. which is trickier) will have:
        • The OS (Windows) as 1-char prefix (w of course)
        • The Windows edition on 4 chars (2019, 2022, etc.)
        • An optional suffix for node pool rotation
    • Expecting the following mappings (compared to the existing cik8s EKS cluster); a Terraform sketch of the agent node pools follows this list:
      • tiny_ondemand_linux => will be a system pool following AKS good practices (HA, etc.). Should only host the Azure/AKS technical side services (CSI, CNI, etc.), not ours
      • default_linux_az1 => la64n2app (2 "app" pods per node: ACP and datadog-cluster's agent)
        • May change to la64n3app if we add falco or any other tool
        • ⚠️ Need to refine the nodes sizing
      • spot_linux_4xlarge => lx86n3agt1 (Azure Linux node pool number "1" for agents, supporting 3x pod agents)
        • ⚠️ Need to refine the node sizing: should be 16 vCPUs / 32 GB RAM / 90+ GB disk to match the current EKS sizing
        • No spot instances, as we don't have access to them in the Azure subscription
        • Auto-scaled with minimum of 0 and maximum of 50 (same as EKS)
      • spot_linux_4xlarge_bom => lx86n3bom1 (Azure Linux node pool number "1" for BOM only supporting 3x pod agents)
        • Same as spot_linux_4xlarge, except taints will be added to ensure only BOM builds use this node pool
      • spot_linux_24xlarge_bom is not retained
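To make the mapping above concrete, here is a minimal Terraform sketch (azurerm provider, 3.x syntax) of the two agent node pools; the VM size, disk size and taint key are assumptions since the node sizing still has to be refined, and the cluster resource is assumed to be defined elsewhere:

```hcl
# Sketch only: VM size, disk and taint key are assumptions until sizing is refined.
resource "azurerm_kubernetes_cluster_node_pool" "lx86n3agt1" {
  name                  = "lx86n3agt1" # Azure Linux, x86_64, 3 pod agents per node
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  os_sku                = "AzureLinux"
  vm_size               = "Standard_D16s_v5" # assumed ~16 vCPUs to match the EKS sizing
  os_disk_size_gb       = 100

  enable_auto_scaling = true
  min_count           = 0
  max_count           = 50 # same bounds as on EKS
}

# Same pool but dedicated to BOM builds: a taint keeps other agents away,
# and only the BOM pod templates will carry the matching toleration.
resource "azurerm_kubernetes_cluster_node_pool" "lx86n3bom1" {
  name                  = "lx86n3bom1"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  os_sku                = "AzureLinux"
  vm_size               = "Standard_D16s_v5"
  os_disk_size_gb       = 100

  enable_auto_scaling = true
  min_count           = 0
  max_count           = 50

  node_taints = ["ci.jenkins.io/bom=true:NoSchedule"] # hypothetical taint key
}
```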

@dduportal
Contributor Author

Network considerations:

  • We'll create a private cluster (https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=azure-portal) to ensure no external API access is possible

    • Only the ci.jenkins.io controller, Jenkins admins and the infra.ci.jenkins.io agents will communicate with the API of this new cluster
    • ci.jenkins.io is on the same vnet: the private endpoint will be used to reach the AKS API
    • Jenkins admins will have to use a VPN connection; the VPN VM is peered to the public-sponsorship network => we'll need to set up a public DNS record to ensure it works.
    • infra.ci.jenkins.io agents are in a private network which may be peered to the public-sponsorship network (or could be set up with a private endpoint)
  • The selected network mode will be "Azure CNI Overlay" as per https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl#choosing-a-network-model-to-use

    • Most of the agent pod network communication should be either to the internal ACP or the ci.jenkins.io controller (another subnet of the same vnet).
    • We don't want to use advanced / edge features of AKS such as virtual nodes => the NAT between the pod and node subnets is acceptable
    • We want to limit the amount of used IPs in the subnets while ensuring we can grow the cluster
  • No inbound method is expected (we won't use an inbound LB)

  • The outbound method should be a "User assigned NAT gateway", which will be the NAT gateway associated with the "public-sponsorship" network (same as the ci.jenkins.io VM and ACI agents)

    • We'll then use a Standard LB SKU as per the doc
  • IP address planning (ref. https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl#ip-address-planning); a Terraform sketch of the resulting network profile follows this list

    • Cluster Nodes:
      • The former cik8s cluster was set up to handle a maximum of 117 nodes (102 without the experimental 24x node pool we won't add in AKS) with 30 pods per node max
      • The former eks-public cluster was set up to handle a maximum of 4 nodes with 15 pods max per node
      • We plan to add (soon) node pools for Linux arm64, Windows 2019 and possibly Windows 2022
      • Proposal: a /24 subnet for nodes, allowing ~250 nodes max, is enough => if we hit that limit we can add more pods per node!
    • Pods: 250 pods per node (maximum) is clearly a limit we won't reach until we run out of credits on Azure ;) The implicit default /24 internal CIDR per node is good enough.
      • Pod CIDR: 10.50.0.0/24 to ensure no overlap with ANY of the peered networks. Note that /24 is mandatory
    • Kubernetes service address range: Let's use the internal default (we don't need a lot of internal services)
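A minimal Terraform sketch of the resulting cluster and its network profile; the region, resource group, subnet reference and system pool size are assumptions, and the pod CIDR shown is the 10.100.0.0/14 range referenced later in this thread (Azure CNI Overlay assigns a /24 out of this range to each node):

```hcl
# Sketch only: region, resource group, subnet and system pool size are assumptions.
resource "azurerm_kubernetes_cluster" "cijenkinsio_agents_1" {
  name                    = "cijenkinsio-agents-1"
  location                = "East US 2"            # assumed region
  resource_group_name     = "cijenkinsio-agents-1" # assumed resource group
  dns_prefix              = "cijenkinsio-agents-1"
  private_cluster_enabled = true # API server reachable only through its private endpoint

  default_node_pool {
    name           = "syspool"         # system pool for Azure/AKS technical services only
    vm_size        = "Standard_D4s_v5" # assumed size
    node_count     = 3
    vnet_subnet_id = azurerm_subnet.cijenkinsio_agents_1.id # dedicated /24 node subnet
  }

  network_profile {
    network_plugin      = "azure"
    network_plugin_mode = "overlay"                # Azure CNI Overlay
    outbound_type       = "userAssignedNATGateway" # reuse the public-sponsorship NAT gateway
    load_balancer_sku   = "standard"
    pod_cidr            = "10.100.0.0/14"          # a /24 is carved out of this range per node
  }

  identity {
    type = "SystemAssigned"
  }
}
```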

@dduportal
Contributor Author

dduportal commented May 6, 2024

Nodes sizing considerations:

@dduportal
Contributor Author

@dduportal
Contributor Author

Update:

  • Edited [ci.jenkins.io] Migrate ci.jenkins.io EKS clusters out from CloudBees AWS account #3954 (comment) to match the new naming convention for node pools, as per discussion with @lemeurherve
  • Manually tested the creation of a cluster with the expected parameters to verify the "private" access:
    • A few minor tweaks are needed in Terraform, but they are only provider "logic" and it works
    • Access to the AKS cluster from the Azure UI and from my admin machine both work using the public DNS record, and require the VPN to be connected, as expected
    • The ci.jio controller's NSG requires an update to reach the control plane (port 443) => incoming PR (a sketch of such a rule follows this list)
    • An NSG is required for the Kubernetes agents => incoming PR for a Terraform module, as we'll need the same for the infra.ci and release.ci Kubernetes agents
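As an illustration of the incoming NSG change, here is a minimal Terraform sketch of an outbound rule letting the controller reach the AKS control plane on port 443; the rule name, priority, CIDRs and referenced resources are assumptions:

```hcl
# Sketch only: name, priority, CIDRs and referenced resources are illustrative.
resource "azurerm_network_security_rule" "allow_out_cijio_to_aks_api" {
  name                        = "allow-outbound-https-to-cijenkinsio-agents-1"
  priority                    = 4050
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"         # AKS control plane API
  source_address_prefix       = "10.0.0.0/24" # assumed ci.jenkins.io controller subnet
  destination_address_prefix  = "10.0.1.0/24" # assumed AKS private endpoint subnet
  resource_group_name         = azurerm_resource_group.public.name
  network_security_group_name = azurerm_network_security_group.ci_jenkins_io.name
}
```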

dduportal added a commit to jenkins-infra/azure that referenced this issue May 10, 2024
Related to jenkins-infra/helpdesk#3954

Blocked by jenkins-infra/shared-tools#146

This PR introduces a new AKS cluster to host the ci.jenkins.io container agent workloads, with the [specified](jenkins-infra/helpdesk#3954 (comment)) attributes:

- [Private cluster](jenkins-infra/helpdesk#3954 (comment)) (i.e. the API is not exposed except internally), which means we need to be on the cluster's network to reach it => it might need subsequent PRs to fine-tune the infra.ci.jenkins.io agent network accesses.
- Outbound with NAT gateway and no ingress (as per
jenkins-infra/helpdesk#3954 (comment))
- Initial set of node pools with the [proposed
sizings](jenkins-infra/helpdesk#3954 (comment))

Notes:

- Allowing ci.jenkins.io to reach the AKS API of this cluster requires a few additional NSG rules, specified in the `ci.jenkins.io.tf` file
- The PR jenkins-infra/shared-tools#146 is needed so we can set up NSG rules to restrict the agents' inbound and outbound network requests.

---------

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
dduportal added a commit to jenkins-infra/azure that referenced this issue May 11, 2024
Second attempt at creating the new cluster (after #693 was rolled back by #694)

> Related to jenkins-infra/helpdesk#3954
> 
> Blocked by jenkins-infra/shared-tools#146
> 
> This PR introduces a new AKS cluster to host the ci.jenkins.io container agent workloads, with the [specified](jenkins-infra/helpdesk#3954 (comment)) attributes:
> 
> - [Private cluster](jenkins-infra/helpdesk#3954 (comment)) (i.e. the API is not exposed except internally), which means we need to be on the cluster's network to reach it => it might need subsequent PRs to fine-tune the infra.ci.jenkins.io agent network accesses.
> - Outbound with NAT gateway and no ingress (as per
jenkins-infra/helpdesk#3954 (comment))
> - Initial set of node pools with the [proposed
sizings](jenkins-infra/helpdesk#3954 (comment))
> 
> Notes:
> 
> - Allowing ci.jenkins.io to reach the AKS API of this cluster requires a few additional NSG rules, specified in the `ci.jenkins.io.tf` file
> - The PR jenkins-infra/shared-tools#146 is needed so we can set up NSG rules to restrict the agents' inbound and outbound network requests.

The following elements were changed since the first attempt:

- Commented out the Kubernetes configuration (until the infra.ci configuration is tuned to reach the API control plane) to avoid a failing deployment during the initial bootstrap
- Fixed the "inbound agent" module to ensure naming of NSG and its
security rule won't fail like they did on the initial deployment (ref.
jenkins-infra/shared-tools@f251e97)

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

dduportal commented May 13, 2024

Update: the cluster is created after many retries:

=> the cluster is now created with its node pools, and the Terraform project works as expected. Access works from ci.jenkins.io AND through the VPN.

Next steps:

  • Required validations:
    • Check (and fix if needed) the AKS API (private) access from infra.ci pod agents to ensure we can manage Kubernetes
    • Check that the user-assigned outbound gateway is used properly (by checking the 2 outbound public IPs) from the new AKS cluster
    • Check that the Azure diagnostic tools do not point out any error + that there is no network overlap between the Pod CIDR and our networks
  • Add cluster to kubernetes-management in infra.ci.jenkins.io:
    • Set up admin credential
    • Install datadog, jenkins-agent and jenkins-agents-bom releases at first (no ACP yet, no ingress/cert-manager/acme/docker-registry-credz)
  • Add initial template on ci.jenkins.io for testing agent creation
    • Set up ci.jio agent user AND agents-bom credentials
    • Create 2 new Kubernetes clouds with custom labels
    • Verify we can create agents on both clouds
    • Add the setup in Puppet with updated resource requirements/limits (less CPU requested, to pack 4 agents per node; more memory for both)
    • Verify the new setup is valid (spinning up 5 agents and 5 BOM agents)
  • Add ACP to the mix
    • Need to add tolerations support to the ACP Helm chart
    • Deploy ACP without ingress
    • Add and validate a new settings.xml (without username/password and using the internal SVC)
  • E2E Validation with jenkins-infra-plugin-test
  • Disable ci.jenkins.io artifact storing in S3
    • Check the disk size beforehand!
  • Then prepare migration:
    • Announce to developers (status/mailing list/IRC)
    • In the ci.jenkins.io Jenkins controller, replace the current AWS + DigitalOcean Kubernetes agent "cloud" templates with only the new one (remove the former, update the labels of the latter)
  • Cleanup:
    • After 1-2 days of BOM and plugin builds (and no disk-full issues!)
    • EKS clusters removal (cik8s and eks-public)
      • ci.jio JCasC and credentials cleanup
      • kubernetes management
      • Terraform
      • Cloud resources
    • DOKS clusters removal (doks and doks-public)
      • ci.jio JCasC and credentials cleanup
      • kubernetes management
      • Terraform
      • Cloud resources

dduportal added a commit to jenkins-infra/azure that referenced this issue May 13, 2024
…ge kubernetes resources (#696)

Ref.
jenkins-infra/helpdesk#3954 (comment)

This PR adds managed resources to create a kubernetes cluster-admin
account to be used by infra.ci.jenkins.io to run the
jenkins-infra/kubernetes-management jobs on the new ci.jio agent 1
cluster.

---------

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Co-authored-by: Tim Jacomb <21194782+timja@users.noreply.github.com>
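For context, the kind of resource this PR manages could look like the following minimal Terraform sketch (kubernetes provider); the account name and namespace are assumptions, the real definitions live in jenkins-infra/azure:

```hcl
# Sketch only: the service account name and namespace are illustrative.
resource "kubernetes_service_account" "infra_ci_jenkins_io_admin" {
  metadata {
    name      = "infra-ci-jenkins-io-admin" # hypothetical account name
    namespace = "kube-system"
  }
}

resource "kubernetes_cluster_role_binding" "infra_ci_jenkins_io_admin" {
  metadata {
    name = "infra-ci-jenkins-io-admin"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin" # built-in Kubernetes cluster-admin role
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.infra_ci_jenkins_io_admin.metadata[0].name
    namespace = "kube-system"
  }
}
```
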
dduportal added a commit to jenkins-infra/azure that referenced this issue May 13, 2024
…d pod issues (#698)

Related to jenkins-infra/helpdesk#3954

After a pairing session with @smerle33 where we discovered that the
applications running in the AKS cluster's `kube-system` namespace were
failing due to timeouts, we realized that the NSG in charge of setting
up rules for inbound agents is not shaped to handle an AKS cluster, as it
blocks technical requests within the cluster:

- Pods (`10.100.0.0/14`) to internal DNS (`10.0.0.0`) server
- Kubelet (unknown CIDR) to pods (`10.100.0.0/14`)
- Pods (`10.100.0.0/14`) to Kubernetes services
- Etc.

This PR removes the NSG (for now) until we have deployed the cluster

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>