
Add a new private Kubernetes cluster in the new sponsored Azure subscription #3923

Closed
16 tasks done
smerle33 opened this issue Jan 26, 2024 · 10 comments

Comments

smerle33 commented Jan 26, 2024

Service(s)

Azure, infra.ci.jenkins.io

Summary

as per #3918 (comment)

=> Let's scope the initial implementation to only infra.ci.jenkins.io agents, and only 1 "non system" nodepool of type linux/arm64 so we can start switching workloads out of privatek8s.

Add a new AKS cluster dedicated to infra.ci.jenkins.io and release.ci.jenkins.io Kubernetes agents

Use the new cluster from infra.ci.jenkins.io:

  • Add the generated secrets from Kubernetes in the infra.ci controller
    • Add to sops
    • Add to JCasC
  • Create a new Kubernetes cloud in infra.ci (see the JCasC sketch below)
    • Manually first
    • Port as code in JCasC
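
For reference, a minimal JCasC sketch of such a cloud entry (a sketch only: the cloud name, namespace, and credentials ID below are placeholders, with the real values coming from the sops-managed secrets):

jenkins:
  clouds:
    - kubernetes:
        name: "infracijenkinsio-agents-1"                 # placeholder cloud name
        serverUrl: "${KUBERNETES_API_URL}"                # injected from the sops-managed secrets
        namespace: "jenkins-agents"                       # hypothetical namespace
        credentialsId: "infracijenkinsio-agents-1-token"  # hypothetical credential ID
        containerCapStr: "20"                             # cap on concurrent agent pods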

Migration/cleanup during this time:

  • Switch the workload to the new Kubernetes cloud kubernetes-infrasicjio-agents1
  • Check all jobs: for each one using a storage account, check whether its network rules need to be extended to allow the new cluster
  • Remove the old Kubernetes cloud from infra.ci to avoid consuming resources there

Definition of done

When the cost is moved from the CDF-paid account to the sponsored one.

smerle33 commented Jun 4, 2024

Update: proposal for the new AKS cluster to be created soon:

  • Subscription: the Azure "sponsored" subscription (same as cijenkinsio-agents-1 for ci.jenkins.io, see #3954)

  • Name: infracijenkinsio-agents-1. This name is valid as per learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/aks-common-issues-faq#what-naming-restrictions-are-enforced-for-aks-resources-and-parameters-

    • No dots, only hyphens, letters and numbers. Less than 63 characters.
    • Full name of the service (infracijenkinsioagents1) to make identification easier
    • The agents wording makes explicit that this is the only acceptable usage for this cluster
    • Suffix -1 as we'll most probably need to create more clusters in the future (AWS and eventually DOKS): migration will be easier if we can increment while keeping the same naming convention
  • Connectivity:

    • API access restricted to Jenkins Infra admins and privatek8s (to allow infra.ci to operate the cluster with Terraform AND to spawn agents)
      • Should use its own subnet in the ["infra_ci_jenkins_io_sponsorship"] virtual network.
      • ⚠️ Need to refine the network sizing: this vnet is a /23 and already has 2 x /24 subnets, which fill its address space entirely, so the new subnet (AKS network rules + sizing of nodes and pods) must be planned carefully.
      • No ingress controller to be added:
        • None was present on cik8s
        • The ACP is not needed on infra.
        • No network segregation between node pools: this cluster is not multi-tenant
  • Node pools (same convention as ci.jenkins.io, see #3954 (comment)):

    • All Linux node pools should use Azure Linux as their base OS (already done for the infra.ci and release.ci controllers)

    • Naming convention, given the constraints on node pool names (ref. ):

      • Linux node pools (12 char. max.) will encode:
        • the OS on 1 char (l for Azure Linux, w for Windows, u for Ubuntu Linux)
        • the CPU architecture on 3 chars (x86 for Intel/AMD x86_64, a64 for arm64)
        • the estimated number of pod agents expected to run on a single node of the pool, as an integer prefixed with n (n3, n24, etc.), on 3 chars max
        • an optional 3-letter suffix for a custom usage, which may replace the sizing if needed
    • Linux node pool naming examples:

      • lx86n3 => Azure Linux x86_64 nodes, each able to run 3 "normal" pod agents at the same time
      • la64n2side => Azure Linux arm64 nodes used to run 2 "side" pods (e.g. custom applications such as the ACP)
    • No Windows node pool is planned

    • Expecting the following mappings (compared to the existing privatek8s AKS cluster):

      • syspool => exact match with privatek8s
        • Auto-scaled with a minimum of 1 and a maximum of 3, as per recommendations
      • infracipool => lx86n41 (Azure Linux node pool for agents number "1", supporting 4x pod agents)
        • No spot instances as we don't have access to them in this Azure subscription
        • Auto-scaled with a minimum of 0 and a maximum of 20 (minimum 0 as there are no spot instances)
      • infraciarm64 => la64n21 (Azure Linux arm64 node pool for agents number "1", supporting 2x pod agents)
        • No spot instances as we don't have access to them in this Azure subscription
        • Auto-scaled with a minimum of 0 and a maximum of 20 (minimum 0 as there are no spot instances)

EDIT:
pool names: lx86n14agt1 and la64n14agt1
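
For illustration only, a hedged pod spec excerpt showing how an agent pod could be pinned to the arm64 agent pool using standard node labels (the pod name is hypothetical; kubernetes.io/arch is a well-known Kubernetes label, and AKS sets the agentpool label to the node pool name):

apiVersion: v1
kind: Pod
metadata:
  name: example-arm64-agent        # hypothetical pod name
spec:
  nodeSelector:
    kubernetes.io/arch: arm64      # well-known architecture label
    agentpool: la64n14agt1         # AKS sets this label to the node pool name
  containers:
    - name: jnlp
      image: jenkins/inbound-agent:latest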

smerle33 added a commit to jenkins-infra/azure that referenced this issue Jun 10, 2024
…nts (#715)

as per
jenkins-infra/helpdesk#3923 (comment):
a Kubernetes cluster within the sponsored Azure subscription.

Split into 3 PRs:
  - creation of the cluster (this one)
  - creation of the nodes
  - creation of kubernetes-admin-sa with the module

Depends on jenkins-infra/azure-net#249 for the
network definition
smerle33 commented Jun 11, 2024

The only expected "application" is Datadog's cluster-agent (2 pods). We should reuse the system pool to run it as it's not a heavy consumer: confirmed by checking the load on the ci.jenkins.io-agents-1 cluster, which has the same kind of node pool.

TODO: add a taint toleration so the Datadog cluster-agent can be scheduled on the system pool, whose default node pool carries the "CriticalAddonsOnly=true:NoSchedule" taint (see the sketch below).
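
A minimal sketch of the corresponding Helm values, assuming the Datadog chart's clusterAgent.tolerations setting is used for this:

clusterAgent:
  tolerations:
    # Allow the cluster-agent pods to schedule on the tainted system pool
    - key: "CriticalAddonsOnly"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"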

smerle33 added a commit to jenkins-infra/azure that referenced this issue Jun 11, 2024
as per jenkins-infra/helpdesk#3923
and following #715 

this PR creates 3 node pools:
- an "application" pool on arm64
- an agent pool on x86_64
- an agent pool on arm64

---------

Co-authored-by: Damien Duportal <damien.duportal@gmail.com>
smerle33 commented

The datadog namespace was created manually:

$ kubectl get ns
NAME              STATUS   AGE
datadog           Active   5s
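
If/when this manual step is ported to configuration as code, the equivalent declarative definition would be a plain Namespace manifest:

apiVersion: v1
kind: Namespace
metadata:
  name: datadog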

dduportal commented

Reopening: The cluster is not used by infra.ci yet

smerle33 added a commit to jenkins-infra/azure that referenced this issue Jun 20, 2024
…each aks api (#735)

as per jenkins-infra/helpdesk#3923

we need to allow new agents from the cluster `infracijenkinsioagents1`
to access the AKS API of privatek8s and publick8s.
smerle33 commented

We had to:

dduportal commented

Update: infra.ci.jenkins.io is now using the new cluster.

=> All Kubernetes, Terraform and Website jobs are green

dduportal commented

Keeping the issue open over the weekend. Some cleanup might be needed, but the CDF is not paying for these agents anymore!

dduportal added a commit to jenkins-infra/azure that referenced this issue Jun 21, 2024
Related to jenkins-infra/helpdesk#3923,

This PR removes the 2 infra.ci agent node pools (now unused)

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
dduportal commented

Update: a bit of cleanup with jenkins-infra/azure#747 and jenkins-infra/kubernetes-management#5343

dduportal commented

Closing as it works as expected
