Spring 2023: Decrease AWS costs #3502

Closed
dduportal opened this issue Apr 11, 2023 · 10 comments

Comments

@dduportal
Contributor

More than a year ago, #2646 was closed after we got the AWS spending under control.

But we're back to an unsustainable AWS bill: in March 2023, we spent ~$18,000 on AWS.

As a reminder, the following Jenkins Infrastructure elements are present in AWS:

  • us-east-1:
    • The VM hosting pkg.origin.jenkins.io (packaging and serving Jenkins packages) and the Update Center index (updates.jenkins.io / updates.jenkins-ci.org)
    • The 3 VMs hosting trusted.ci.jenkins.io (SSH bastion bounce, the controller and the permanent agent with the jenkins-infra/update_center2 cache)
    • The VM hosting the service usage.jenkins.io
    • The VM hosting the service census.jenkins.io
    • The associated resources: data disks, snapshots, SSH keys, security groups
  • us-east-2:
    • An EKS cluster named cik8s used for the Linux container agents for ci.jenkins.io (and associated resources: node pools, networks, etc.)
    • An EKS cluster named eks-public used for hosting the Artifact Caching Registry for AWS (and associated resources: node pools, networks, load balancers, NAT gateway, etc.)
    • Ephemeral VM agents for ci.jenkins.io:
      • Since #3421 (EC2s are not available), only the Linux arm64 machines remain here (Linux x86 and Windows-Server-* are now in Azure)
      • Associated resources (security groups, network, SSH keys, etc.)
      • The resources for the jenkins-infra/packer-images process to build AMIs

As discussed with @lemeurherve and @smerle33 during today's team mob-programming session about cloud budgets:

Screenshot 2023-04-07 at 20 53 49

  • We confirmed that there is a huge cost due to the outbound bandwidth in us-east-1, associated with the VM hosting pkg.origin.jenkins.io. Moving this VM to another cloud where bandwidth is cheaper would clearly save us $3,500-4,000 per month! Tracked in [INFRA-3100] Migrate updates.jenkins.io to another Cloud #2649

Screenshot 2023-04-07 at 21 06 44

  • The "first level" cost breakdown on us-east-2 shows we have the following elements to take care of:
    • Snapshots, which are likely due to the packer image build process, can be cleaned up (~$1,200 monthly!!)
    • The cluster cik8s has different direct costs:
      • Any SpotUsage:m5* is a cost related to the VMs used as nodes. Decreasing this cost means decreasing the build rate or build times, or optimizing the agent packing (checking pod limits, packing more pods by using bigger instances, avoiding BOM builds when not needed, etc.).
      • The USE2-NAT-Gateway cost needs to be analysed in more detail. It could be the "symmetric" counterpart of [ci.jenkins.io] Azure billing shows huge cloud cost due to outbound bandwidth #3485 (involving the stash and archiveArtifacts steps sending data to the ci.jenkins.io controller in Azure), and it could also come from the Artifact Caching Proxy downloads from JFrog (when artifacts are not in the cache).

Screenshot 2023-04-07 at 20 57 46

@dduportal
Contributor Author

Update: #2846 is now closed with the following results:

  • Number of AMIs went from ~3000 to ~75
  • Number of snapshots went from ~4500 to ~85

=> The garbage collection now tracks and removes these 2 resource types on each build of jenkins-infra/packer-images.
=> We expect a gain of ~$50 per day; let's check this in 2 days (time for AWS to compute the billing)
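
For readers who want a feel for what such a garbage collection involves, here is a minimal sketch of a retention rule. It is not the actual cleanup from #2846: the `Resource` class, the 30-day maximum age, the "keep the 2 most recent" rule and the resource IDs are all hypothetical, and the real job would also have to list and delete resources through the AWS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Resource:
    """Hypothetical, simplified view of an AMI or an EBS snapshot."""
    resource_id: str
    created_at: datetime
    in_use: bool = False  # e.g. still referenced by an agent template


def select_for_deletion(resources, now, max_age=timedelta(days=30), keep_latest=2):
    """Return the IDs that are safe to delete under the assumed policy:
    older than max_age, not in use, and not among the keep_latest newest."""
    newest_first = sorted(resources, key=lambda r: r.created_at, reverse=True)
    protected = {r.resource_id for r in newest_first[:keep_latest]}
    return [
        r.resource_id
        for r in newest_first
        if r.resource_id not in protected
        and not r.in_use
        and now - r.created_at > max_age
    ]


# Made-up inventory, only to exercise the rule.
now = datetime(2023, 4, 14, tzinfo=timezone.utc)
inventory = [
    Resource("ami-old", now - timedelta(days=90)),
    Resource("ami-old-used", now - timedelta(days=90), in_use=True),
    Resource("ami-recent", now - timedelta(days=3)),
    Resource("ami-new", now - timedelta(days=1)),
]
print(select_for_deletion(inventory, now))  # ['ami-old']
```

Keeping the selection as a pure function makes it easy to dry-run against an inventory dump before letting anything delete real AMIs or snapshots.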

@dduportal
Contributor Author

Update: as per jenkins-infra/packer-images#596, we should no longer build EC2 templates with packer-images for Linux x86 and Windows (*), as we do not use them.

Not much cost gain to expect: it's currently ~$1 daily, so it might not even be visible. But it's of interest to us (less complex pipeline).

@dduportal
Contributor Author

dduportal commented Apr 17, 2023

Resource tuning for ci.jenkins.io tracked in #3521

@dduportal
Contributor Author

dduportal commented Apr 24, 2023

  • Global status: the April forecast is ~$14,000. That is $4,000 less than March, half of it due to our cleanup efforts

Screenshot 2023-04-24 at 12 34 25

  • Check about using S3 for storing ci.jenkins.io artifacts:

    • Pure S3 costs are forecast to be well under $100 per month (1% of today's monthly budget, 2% of the targeted monthly budget) => looks acceptable
    • It's too early to tell if the S3 traffic would have an impact on the NAT gateway transfer
  • Check about the ci.jenkins.io agents:

    • ✅ Switching to better-suited instance sizes (no longer using most of the m*.4xlarge) for the usual node pool had a positive impact: the last 7 days cost us $542.21 in *.4xlarge spot instances (~$77.46 per day on average) for ~3,300 hours of compute, while the 7 days before cost us twice as much ($1,029.68) for slightly less compute time (~3,200 hours).
      • Cost per hour went from $0.32 to $0.16
    • ❌ Using bigger machines for BOM builds is not fulfilling its promise for now (ref. chore(pipeline) switch to podTemplate instead of label to use the new node pool jenkinsci/bom#1969), due to slowness issues when highly parallelizing. Work is being done on this topic, but it's neither easy nor obvious.
      • The current experiment is roughly $1 per hour of compute
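
The per-hour and per-day figures for the usual node pool can be re-derived from the weekly totals quoted above. A small sketch, keeping in mind that the compute hours are rounded approximations from the billing console:

```python
def cost_per_hour(total_cost_usd: float, compute_hours: float) -> float:
    """Average cost of one hour of compute over a billing window."""
    return total_cost_usd / compute_hours


# Weekly totals quoted in the comment above (hours are approximate).
previous_week = cost_per_hour(1029.68, 3200)
last_week = cost_per_hour(542.21, 3300)
daily_average = 542.21 / 7

print(f"{previous_week:.2f} -> {last_week:.2f} USD per compute hour")
# 0.32 -> 0.16 USD per compute hour
print(f"{daily_average:.2f} USD per day")
# 77.46 USD per day
```

Comparing cost per compute hour rather than raw weekly totals is what makes the two weeks comparable, since the amount of compute differed slightly.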

Screenshot 2023-04-24 at 12 35 33

Screenshot 2023-04-24 at 12 44 03

Screenshot 2023-04-24 at 12 44 49

Screenshot 2023-04-24 at 12 49 46

@dduportal
Contributor Author

Report about the month of April:

  • We were able to contain costs to $13,766.71 ($5,000 less than March)
  • The cost is split roughly half / half between the 2 usages (us-east-1: web services, us-east-2: CI workloads)

Screenshot 2023-05-02 at 12 31 47

The effort to remove all resources from us-east-1 should help cut the bill by half

@dduportal
Contributor Author

Please note that @basil's PR jenkinsci/bom#2031 on the BOM would also greatly decrease the costs on us-east-2.

@dduportal
Contributor Author

dduportal commented May 12, 2023

Update: the tuning of the BOM builds was pretty effective, and we now see sustainable usage in Ohio (us-east-2):

Screenshot 2023-05-12 at 19 23 47

The main focus has to be on us-east-1 for now. The outbound bandwidth of the update center is next in line: #2649

Screenshot 2023-05-12 at 19 24 22

@dduportal
Contributor Author

Created a daily monitor for cost anomalies; let's see how it behaves.
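
The comment does not say which tool backs this monitor or how it decides what counts as an anomaly. Purely as an illustration of the idea, a daily check could compare today's cost with the recent history and flag large deviations. The function name, the threshold of 3 standard deviations and the sample costs below are all assumptions, not the monitor's real configuration.

```python
from statistics import mean, stdev


def is_anomalous(history, today, threshold=3.0):
    """Flag today's cost (USD) when it sits more than `threshold` sample
    standard deviations above the mean of the preceding daily costs."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold


# Hypothetical daily costs for one week, in USD.
last_week = [77.0, 80.0, 75.0, 79.0, 78.0, 76.0, 81.0]
print(is_anomalous(last_week, 150.0))  # True
print(is_anomalous(last_week, 80.0))   # False
```

Only upward deviations are flagged here, since an unexpectedly cheap day is not something to be alerted about.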

@dduportal
Contributor Author

Update:

  • The bill kept decreasing over the past 2 months, but the decrease is slowing down: May's bill was $11,862.43 and June's bill was $10,425.52 (see the trends of the past 3 and 6 months to view this decrease)
  • No more unusual alerts since the measures were taken (BOM builds are safer, optimizations, monitoring, etc.)
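
To put numbers on "slowing down", the month-over-month change can be computed from the exact bills reported in this issue. March is left out because only an approximate figure (~$18,000) was given. The $5,000 value is the monthly goal stated in this issue; the variable and function names are just for this sketch.

```python
BILLS = [("April", 13766.71), ("May", 11862.43), ("June", 10425.52)]
MONTHLY_GOAL_USD = 5000.00


def month_over_month(bills):
    """Return (month, absolute decrease in USD, decrease in %) for each
    month compared with the one before it."""
    changes = []
    for (_, previous), (month, current) in zip(bills, bills[1:]):
        saved = previous - current
        changes.append((month, round(saved, 2), round(saved / previous * 100, 1)))
    return changes


for month, saved, percent in month_over_month(BILLS):
    print(f"{month}: -${saved:,.2f} ({percent}%)")
# May: -$1,904.28 (13.8%)
# June: -$1,436.91 (12.1%)

gap = round(BILLS[-1][1] - MONTHLY_GOAL_USD, 2)
print(f"Remaining gap to the goal: ${gap:,.2f}")
# Remaining gap to the goal: $5,425.52
```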

We're closing this issue, as it was scoped to analysing and controlling unexpected costs, and only for Spring 2023.
Goals were met: cost monitoring and control have improved.

Next steps for summer 2023 are in #3662, to keep working toward the $5,000 monthly goal.
