[ci.jenkins.io] Migrate ci.jenkins.io EKS clusters out from CloudBees AWS account #3954
Update: proposal to bootstrap the AWS account. To be discussed and validated during the next weekly team meeting.
Update:
Update: proposal for the new AKS cluster to be created soon:
Network considerations:
Node sizing considerations:
Update: first wave of PRs on the network part:
Update:
Related to jenkins-infra/helpdesk#3954

Blocked by jenkins-infra/shared-tools#146

This PR introduces a new AKS cluster to host the ci.jenkins.io container agents workload with the [specified](jenkins-infra/helpdesk#3954 (comment)) attributes:

- [Private cluster](jenkins-infra/helpdesk#3954 (comment)) (i.e. the API is only exposed internally), which means we need access to the cluster network to reach it => it might need subsequent PRs to fine-tune the infra.ci.jenkins.io agent network accesses.
- Outbound through a NAT gateway and no ingress (as per jenkins-infra/helpdesk#3954 (comment))
- Initial set of node pools with the [proposed sizings](jenkins-infra/helpdesk#3954 (comment))

Notes:

- Allowing ci.jenkins.io to reach the AKS API of this cluster requires a few additional NSG rules, specified in the `ci.jenkins.io.tf` file
- The PR jenkins-infra/shared-tools#146 is needed so we can set up NSG rules to restrict the agents' inbound and outbound network requests.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
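As a rough illustration of the kind of NSG rule mentioned for `ci.jenkins.io.tf`, here is a Terraform sketch of allowing the ci.jenkins.io controller to reach a private AKS API endpoint. All names, IPs, and priorities below are hypothetical placeholders, not the actual jenkins-infra values:

```hcl
# Hypothetical sketch: allow the ci.jenkins.io controller to reach
# the private AKS API endpoint over HTTPS. Not the real rule.
resource "azurerm_network_security_rule" "allow_ci_to_aks_api" {
  name                        = "allow-ci-controller-to-aks-api"
  priority                    = 100
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefix       = "10.0.1.4/32"  # hypothetical controller IP
  destination_address_prefix  = "10.1.0.0/24"  # hypothetical API endpoint subnet
  resource_group_name         = azurerm_resource_group.example.name
  network_security_group_name = azurerm_network_security_group.example.name
}
```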
Second attempt at creating the new cluster (after #693 was rolled back by #694)

> Related to jenkins-infra/helpdesk#3954
>
> Blocked by jenkins-infra/shared-tools#146
>
> This PR introduces a new AKS cluster to host the ci.jenkins.io container agents workload with the [specified](jenkins-infra/helpdesk#3954 (comment)) attributes:
>
> - [Private cluster](jenkins-infra/helpdesk#3954 (comment)) (i.e. the API is only exposed internally), which means we need access to the cluster network to reach it => it might need subsequent PRs to fine-tune the infra.ci.jenkins.io agent network accesses.
> - Outbound through a NAT gateway and no ingress (as per jenkins-infra/helpdesk#3954 (comment))
> - Initial set of node pools with the [proposed sizings](jenkins-infra/helpdesk#3954 (comment))
>
> Notes:
>
> - Allowing ci.jenkins.io to reach the AKS API of this cluster requires a few additional NSG rules, specified in the `ci.jenkins.io.tf` file
> - The PR jenkins-infra/shared-tools#146 is needed so we can set up NSG rules to restrict the agents' inbound and outbound network requests.

The following elements were changed since the first attempt:

- Commented out the Kubernetes configuration (until the infra.ci configuration is tuned to reach the API control plane) to avoid an initially failing deployment (during bootstrap)
- Fixed the "inbound agent" module to ensure the naming of the NSG and its security rule won't fail like they did on the initial deployment (ref. jenkins-infra/shared-tools@f251e97)

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
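The cluster attributes listed above (private API, NAT gateway egress, dedicated node pools) can be sketched with the `azurerm` Terraform provider roughly as follows. This is a minimal illustration under assumed names and sizes, not the actual jenkins-infra/azure code:

```hcl
# Hypothetical sketch of a private AKS cluster with NAT-gateway-only egress.
resource "azurerm_kubernetes_cluster" "ci_agents" {
  name                    = "ci-agents-example"
  location                = azurerm_resource_group.example.location
  resource_group_name     = azurerm_resource_group.example.name
  dns_prefix              = "ci-agents-example"
  private_cluster_enabled = true # API server only reachable from internal networks

  default_node_pool {
    name           = "system"
    vm_size        = "Standard_D4s_v3" # illustrative size
    node_count     = 1
    vnet_subnet_id = azurerm_subnet.example.id
  }

  network_profile {
    network_plugin = "azure"
    # Egress goes through the NAT gateway attached to the subnet;
    # no public load balancer ingress is created for outbound traffic.
    outbound_type = "userAssignedNATGateway"
  }

  identity {
    type = "SystemAssigned"
  }
}
```

Additional agent node pools would then be added with `azurerm_kubernetes_cluster_node_pool` resources referencing this cluster.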
Update: the cluster is created after many retries:
=> The cluster is now created with its node pools, and the Terraform project works as expected. Access works from ci.jenkins.io AND through the VPN. Next steps:
…ge kubernetes resources (#696)

Ref. jenkins-infra/helpdesk#3954 (comment)

This PR adds managed resources to create a Kubernetes cluster-admin account to be used by infra.ci.jenkins.io to run the jenkins-infra/kubernetes-management jobs on the new ci.jio agent 1 cluster.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Co-authored-by: Tim Jacomb <21194782+timja@users.noreply.github.com>
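A cluster-admin account for an automation controller like infra.ci.jenkins.io is typically a ServiceAccount bound to the built-in `cluster-admin` ClusterRole. A minimal sketch with the Terraform `kubernetes` provider, using hypothetical names (not the actual #696 code):

```hcl
# Hypothetical sketch: a service account with cluster-admin rights,
# intended for an automation controller (e.g. infra.ci.jenkins.io).
resource "kubernetes_service_account" "infraci" {
  metadata {
    name      = "infraci-admin"  # hypothetical name
    namespace = "kube-system"
  }
}

resource "kubernetes_cluster_role_binding" "infraci_admin" {
  metadata {
    name = "infraci-cluster-admin"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin" # built-in Kubernetes ClusterRole
  }
  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.infraci.metadata[0].name
    namespace = "kube-system"
  }
}
```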
…d pod issues (#698)

Related to jenkins-infra/helpdesk#3954

After a pairing session with @smerle33, where we discovered that the applications running in the AKS cluster's `kube-system` namespace were failing due to timeouts, we realized that the NSG in charge of setting up rules for inbound agents is not shaped to handle an AKS cluster, as it blocks technical requests within the cluster:

- Pods (`10.100.0.0/14`) to the internal DNS (`10.0.0.0`) server
- Kubelet (unknown CIDR) to pods (`10.100.0.0/14`)
- Pods (`10.100.0.0/14`) to Kubernetes services
- Etc.

This PR removes the NSG (for now) until we have deployed the cluster.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
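If the NSG is later reintroduced, the blocked intra-cluster flows would need explicit allow rules. A sketch of one such rule for the pods-to-DNS flow, with a hypothetical DNS service IP and priority (only the pod CIDR comes from the list above):

```hcl
# Hypothetical sketch: allow pod-to-cluster-DNS traffic through the NSG.
# The DNS service IP and priority are placeholders.
resource "azurerm_network_security_rule" "allow_pods_to_dns" {
  name                        = "allow-pods-to-cluster-dns"
  priority                    = 110
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "*"   # DNS uses both UDP and TCP on port 53
  source_port_range           = "*"
  destination_port_range      = "53"
  source_address_prefix       = "10.100.0.0/14" # pod CIDR
  destination_address_prefix  = "10.0.0.10/32"  # hypothetical cluster DNS IP
  resource_group_name         = azurerm_resource_group.example.name
  network_security_group_name = azurerm_network_security_group.example.name
}
```

Similar rules would be needed for kubelet-to-pod and pod-to-service traffic, which is why removing the NSG temporarily was the simpler path.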
Update:
Service(s)
AWS, Azure, ci.jenkins.io, sponsors
Summary
Today, ci.jenkins.io utilizes 2 EKS clusters to spin up ephemeral agents (for plugin and BOM builds). These clusters are hosted in a CloudBees-sponsored account (historically used to host a lot of Jenkins services).
We want to move these clusters out of the CloudBees AWS account to ensure non-CloudBees Jenkins contributors can manage them, and to use credits from other sponsors, as AWS, DigitalOcean, and Azure have all given us credits to be used.
Initial working path (destination: AWS sponsored account)
Updated working path
As discussed during the 2 previous infra SIG meetings, we have around $28k of credits on the Azure sponsored account which expire at the end of August 2024 (was May 2024, but @MarkEWaite asked for an extension of this deadline ❤️), while both the DigitalOcean and AWS (non-CloudBees) accounts have credits until January 2025.
=> As such, let's start by using a Kubernetes cluster in the Azure (sponsored) AKS to use these credits until the end of summer, before moving to the new AWS account.
Notes 📖
A few elements for planning these migrations:
This is a good opportunity to re-assess the naming convention we used for the jenkins-infra/aws project: `cik8s` and `eks-public`, for instance. The Terraform module for EKS has a major version upgrade currently waiting (20.x): Bump version of the Terraform module "eks" to 20.10.0 aws#517. It features breaking changes around the management of the EKS configmap. Upgrading the module by using the new version on a fresh new cluster would avoid a tedious migration of existing ones.
We have an upcoming Kubernetes 1.27 upgrade: it will most probably be applied to the AWS clusters first, but we have to keep it in mind.
We'll have to define at least 2 different AWS providers in the Terraform project to allow managing both accounts at the same time: https://build5nines.com/terraform-deploy-to-multiple-aws-accounts-in-single-project/ (we already have this kind of pattern with Azure).
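The multi-account pattern relies on Terraform provider aliases: one default `aws` provider per account, with resources selecting the account explicitly. A minimal sketch (the alias, profile, and resource names here are hypothetical):

```hcl
# Default provider: the existing (CloudBees-sponsored) account.
provider "aws" {
  region = "us-east-2"
}

# Aliased provider for the new sponsored account.
provider "aws" {
  alias   = "jenkins_sponsorship"   # hypothetical alias
  region  = "us-east-2"
  profile = "jenkins-sponsorship"   # hypothetical AWS CLI profile
}

# Resources pick the target account via the "provider" meta-argument;
# omitting it uses the default (first) provider.
resource "aws_vpc" "sponsored" {
  provider   = aws.jenkins_sponsorship
  cidr_block = "10.0.0.0/16"
}
```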
Reproduction steps
No response