
Setup a budget and budget alerts #1375

Closed
spiffxp opened this issue Oct 29, 2020 · 24 comments
Assignees
Labels
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.

Comments

@spiffxp
Member

spiffxp commented Oct 29, 2020

ref: https://cloud.google.com/billing/docs/how-to/budgets

Currently we review our billing reports at each meeting, which means we'll notice abnormalities within a 14-day window. As our utilization increases, it would be wise for us to use a budget and alerts to catch things sooner.

I tried experimenting with my account and didn't have sufficient privileges. We should start there.

/priority important-longterm
/wg k8s-infra

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. wg/k8s-infra labels Oct 29, 2020
@spiffxp spiffxp added this to Needs Triage in sig-k8s-infra Jan 21, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2021
@spiffxp
Member Author

spiffxp commented Feb 8, 2021

/remove-lifecycle stale
/assign @thockin
I'm assigning you to get your input on whether you think this is worth investing time in.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2021
ameukam added a commit to ameukam/k8s.io that referenced this issue Apr 26, 2021
Add a defined budget to `k8s-infra-ii-sandbox` but also use the project
to experiment GCP budgets.

Ref: kubernetes#1375

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
ameukam added a commit to ameukam/k8s.io that referenced this issue Apr 26, 2021
Add a defined budget to `k8s-infra-ii-sandbox` but also use the project
to experiment GCP budgets.

Ref: kubernetes#1375

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
@ameukam ameukam added this to the v1.22 milestone May 4, 2021
@thockin
Member

thockin commented Jun 7, 2021

I think it is long-term valuable but not near-term

@spiffxp
Member Author

spiffxp commented Sep 29, 2021

/remove-priority important-longterm
/priority critical-urgent
/milestone v1.23

We discussed at the last meeting that our spend looks like it's going to put us very near the threshold this year.

It's time to come up with a plan for how to make sure we don't cross it, and how to detect if we are about to. Maybe it's not worth implementing technically with cloud budgets, but we should then at least know what number over what period is a flashing danger sign, and have some kind of framework / guidance for what to do next once we see it.

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Sep 29, 2021
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.24, v1.23 Sep 29, 2021
@spiffxp spiffxp moved this from Needs Triage to Backlog (existing infra) in sig-k8s-infra Sep 29, 2021
@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed wg/k8s-infra labels Sep 29, 2021
@spiffxp
Member Author

spiffxp commented Oct 14, 2021

#2940 - Adds a monthly budget for k8s-infra as a whole. We'll get e-mail alerts if we hit 90% (225K) for the month (which we have been crossing continually since August, but with no alerts set up) and 100% (which we crossed once in August accidentally, due to 5k node clusters hanging around for too long).
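The arithmetic behind those alert levels is trivial, but worth pinning down. A minimal sketch; the 250K/mo amount is taken from this comment, the function name is just illustrative:

```python
# Monthly budget and alert thresholds, as described above (USD).
MONTHLY_BUDGET = 250_000  # from the comment: 90% = 225K

def alert_thresholds(budget, percents=(0.9, 1.0)):
    """Return the spend levels at which each alert fires."""
    return {p: budget * p for p in percents}

thresholds = alert_thresholds(MONTHLY_BUDGET)
print(thresholds)  # {0.9: 225000.0, 1.0: 250000.0}
```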

@spiffxp
Member Author

spiffxp commented Oct 18, 2021

> What bin is egress bandwidth going into?

Egress is charged to the project hosting the artifacts being transferred, so regardless of which SKU it's billed against, it all goes against the k8s-artifacts-prod project

From https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e/page/bPVn
[screenshot: billing report breakdown from the Data Studio link above]

> Would it be possible to get the artifacts broken out in terms of size in bytes instead of $/months?

```sql
select
    sum(cost) as total_cost,
    sku.description as sku,
    sum(usage.amount_in_pricing_units) as amount,
    usage.pricing_unit as pricing_unit,
    invoice.month
from
    `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
    billing_account_id = "018801-93540E-22A20E"
    and project.name = 'k8s-artifacts-prod'
    and usage.pricing_unit = 'gibibyte'
group by
    invoice.month,
    sku,
    pricing_unit
order by
    invoice.month desc, total_cost desc
```

The units here are GiB (the query filters on pricing_unit = 'gibibyte')
[screenshot: query results, per-SKU GiB and cost by invoice month]
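For readers without BigQuery access, the grouping that query performs can be mimicked over exported rows in a few lines. The rows below are made up for illustration, not real billing data:

```python
from collections import defaultdict

# Aggregate billing-export rows the same way the SQL above does:
# group by (invoice month, SKU), summing cost and usage amount.
# These rows are illustrative, not real billing data.
rows = [
    {"month": "202110", "sku": "Download Worldwide", "cost": 10.0, "amount": 120.0},
    {"month": "202110", "sku": "Download Worldwide", "cost": 5.0,  "amount": 60.0},
    {"month": "202110", "sku": "Download APAC",      "cost": 8.0,  "amount": 40.0},
]

totals = defaultdict(lambda: {"cost": 0.0, "amount": 0.0})
for r in rows:
    key = (r["month"], r["sku"])
    totals[key]["cost"] += r["cost"]
    totals[key]["amount"] += r["amount"]

# Order by month desc, then cost desc, like the SQL's ORDER BY.
for (month, sku), t in sorted(totals.items(), key=lambda kv: (kv[0][0], -kv[1]["cost"])):
    print(month, sku, t["cost"], t["amount"])
```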

From https://console.cloud.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22storage.googleapis.com%2Fnetwork%2Fsent_bytes_count%5C%22%20resource.type%3D%5C%22gcs_bucket%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22resource.label.%5C%22bucket_name%5C%22%22%5D%7D,%7B%22perSeriesAligner%22:%22ALIGN_NONE%22,%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D,%22pickTimeSeriesFilter%22:%7B%22rankingMethod%22:%22METHOD_MAX%22,%22numTimeSeries%22:%225%22,%22direction%22:%22TOP%22%7D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22,%22legendTemplate%22:%22$%7Bresource.labels.bucket_name%7D%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%226w%22%7D%7D&project=kubernetes-public

Bytes sent, top 5 by max value over the last 6 weeks (I don't think our Cloud Monitoring retention goes further back than that)
[screenshot: bytes sent per second, top 5 GCS buckets]
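For back-of-the-envelope checks, a sustained bytes-per-second rate from that chart converts to monthly volume like this (the example rate is hypothetical, not read off the chart):

```python
# Convert a sustained egress rate (bytes/s) into monthly volume in GiB.
# The example rate below is hypothetical.
def monthly_gib(bytes_per_second, days=30):
    seconds = days * 24 * 3600
    return bytes_per_second * seconds / 2**30  # bytes -> GiB

# About 2414 GiB/month at a sustained 1 MB/s.
print(round(monthly_gib(1_000_000), 1))
```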

I would defer to @BobyMCbobs and @Riaankl to provide a report on which specific artifacts are how large, and how often they're being transferred. That said, I think this is a problem of volume and not specific artifacts.

@spiffxp
Member Author

spiffxp commented Oct 18, 2021

#1834 (comment) is our umbrella issue for mitigating artifact hosting costs by use of mirrors, which would allow us to mitigate costs due to large consumers by having them pull from mirrors located closer to them or on their own infra. The comment I'm linking posits that if we could use something like Cloud CDN we could also lower the cost of hosting regardless of where requests are coming from.

It is unclear whether this is possible for container images hosted at k8s.gcr.io, which account for the vast majority of bytes transferred. They live in a subdomain of gcr.io that I'm not sure we can take ownership of (i.e. replace the endpoint); my understanding is it was provided to us internally.

@Riaankl
Contributor

Riaankl commented Oct 19, 2021

@jhoblitt we have a report on artifact traffic. This data runs from 9 April to September 2021.
There are several graphs and tables. Here are tables that might answer some of your questions:
[image: artifact traffic report tables]

@jhoblitt

@spiffxp Thanks for doing that extra analysis. I agree that this sounds more like a pure popularity problem rather than bloated artifacts. I'm not sure what a fitted slope works out to, but I'm going to guess that transfers will grow faster than GCP bandwidth prices will decrease in the near term, and will eventually exceed the total cost envelope. Has there been any discussion of moving away from gcr.io? I would easily believe it will take > 3 years to shift the majority of pulls over to a k8s project registry.
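The "fitted slope" mentioned above is just an ordinary least-squares fit over monthly transfer totals. A self-contained sketch; the monthly numbers are made up, not from the billing data:

```python
# Ordinary least-squares slope over a series of monthly totals,
# i.e. average growth per month. Pure Python, no dependencies.
def fitted_slope(ys):
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

monthly_tib = [400, 420, 445, 470, 500]  # hypothetical TiB per month
print(fitted_slope(monthly_tib))  # 25.0 -> growing ~25 TiB/month
```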

@jhoblitt

@Riaankl I was wondering if there were large artifacts that could be put on a diet, but nothing is showing up in the top 10.

@Riaankl
Contributor

Riaankl commented Oct 19, 2021

With #1834 we aim to get 2-3 redirector PoCs up, whereby cloud providers could host local copies of the artifacts, and routing is decided by the redirector based on the requesting IP's ASN information. That way the load is spread across all providers. Ideally the complete set of artifacts would be hosted by the participants.
80% of the traffic is related to <30 images.
[image: traffic share by image]
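The "80% of traffic from <30 images" figure comes from a cumulative-share calculation. A sketch with a made-up popularity distribution (pull counts are illustrative, not the real ones):

```python
# How many images account for a given share of traffic?
# The pull counts below are made up for illustration.
def images_for_share(pulls, share=0.8):
    """Smallest number of top images whose pulls reach `share` of the total."""
    total = sum(pulls)
    running = 0
    for i, p in enumerate(sorted(pulls, reverse=True), start=1):
        running += p
        if running / total >= share:
            return i
    return len(pulls)

pulls = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]
print(images_for_share(pulls))  # 2 -> the top 2 images cover >= 80%
```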

@ameukam
Member

ameukam commented Dec 6, 2021

/milestone v1.24

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.23, v1.24 Dec 6, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2022
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2022
@ameukam
Member

ameukam commented Mar 7, 2022

/remove-lifecycle stale

@ameukam ameukam added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 22, 2022
@ameukam ameukam moved this from Backlog (existing infra) to In Progress in sig-k8s-infra Apr 25, 2022
@ameukam
Member

ameukam commented May 12, 2022

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.24 milestone May 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@ameukam
Member

ameukam commented Aug 19, 2022

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 19, 2022
@BenTheElder
Member

What exactly do we see as outstanding here?

@spiffxp
Member Author

spiffxp commented Nov 11, 2022

> spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1

FWIW I can't access this

> What exactly do we see as outstanding here?

I agree with capping this off as the first pass. I think we'll want to revisit how we track our budget in the new year, and that should probably be a separate issue.

Things you might want to consider before capping this off:

  • The current budget alerts at 90% and 100% of 250K/mo (3M/yr). Since we're running over that rate, the alerts are going to be noise for those watching "are we out of credits for the year". Disable it and set up a new budget that tracks our remaining spend for the year?
  • The alerts currently go out to k8s-infra leads; consider adding a wider audience?
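The "remaining spend for the year" idea in the first bullet boils down to simple arithmetic. A sketch using the 3M/yr cap from above; the spend-to-date figure is hypothetical:

```python
# Track remaining yearly credits and the monthly rate that stays within them.
# The 3M/yr cap is from the discussion above; spend-to-date is hypothetical.
def remaining_budget(yearly_cap, spent_to_date):
    return yearly_cap - spent_to_date

def sustainable_monthly_rate(yearly_cap, spent_to_date, months_left):
    """Average monthly spend that stays within the yearly cap."""
    return remaining_budget(yearly_cap, spent_to_date) / months_left

cap = 3_000_000
spent = 2_600_000  # hypothetical spend through October
print(sustainable_monthly_rate(cap, spent, months_left=2))  # 200000.0
```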

I'll leave it to @ameukam or others to close if you're fine with this as-is.


@ameukam
Member

ameukam commented Nov 12, 2022

I think it's OK to close this. Let's revisit budget tracking for next year in a separate issue. The various attempts to move workloads to other cloud providers will hopefully impact the overall 2023 budget.

/close

@k8s-ci-robot
Contributor

@ameukam: Closing this issue.

In response to this:

> I think it's OK to close this. Let's revisit budget tracking for next year in a separate issue. The various attempts to move workloads to other cloud providers will hopefully impact the overall 2023 budget.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sig-k8s-infra automation moved this from In Progress to Done Nov 12, 2022