carefully consider enabling artifact registry user quotas #153

Closed · BenTheElder opened this issue Feb 14, 2023 · 15 comments

Labels: priority/important-soon, sig/k8s-infra

Comments

@BenTheElder

Currently quotas are per region/project. For us that means each AR region.

https://cloud.google.com/artifact-registry/quotas#project-quota

60000 requests per minute in each region or multi-region.
18000 write or delete requests per minute in each region or multi-region.

In most cases, a single HTTP request or API call counts as a single request. However, some operations count as multiple requests. For example, a batch request like ImportAptArtifacts might charge quota for each item in the batch. A Docker pull or push usually makes multiple HTTP requests, so quota is charged for each request.

We've seen https://github.com/kubernetes-sigs/promo-tools and now #151 (while in development, with excessive concurrency) hit 429 Too Many Requests, which currently causes an outage for the involved region during that 1-minute quota window.

We should fix our tools (I've throttled #151) but intentional abuse is also a concern.

Per-user request quota
By default, projects have unlimited per-user quota. You can optionally cap per-user quota within a project. Per-user quota applies per authenticated user or per client IP address for unauthenticated requests to a public repository.

We should look at enabling this. It's worth noting that doing so may break the image promoter if it exceeds the new per-user quota, but we can adjust for that.

We can also request a quota increase. Jon suggested 2x for the busier regions would probably be reasonable for us.
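
For a rough sense of how quickly "excessive concurrency" burns through a shared 60,000 RPM regional quota, here's a back-of-the-envelope sketch (the worker count and per-worker request rate are illustrative assumptions, not measurements of promo-tools or #151):

```go
package main

import "fmt"

func main() {
	// Published per-region AR quota (requests per minute).
	const regionalRPM = 60000

	// Illustrative worker pool: assumptions for the arithmetic only.
	const (
		workers       = 100 // concurrent goroutines
		reqPerSecEach = 10  // HTTP requests each worker issues per second
	)

	perMinute := workers * reqPerSecEach * 60
	fmt.Printf("aggregate load: %d requests/minute\n", perMinute) // 60000
	fmt.Printf("regional quota consumed: %d%%\n", perMinute*100/regionalRPM)
	// A single client at this rate exhausts the entire region's quota,
	// so every other client in that region gets 429s for the rest of
	// the 1-minute quota window.
}
```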

@BenTheElder

@ameukam filed an initial quota increase request.

We still need to make sure that we can safely enable per-user quotas (we're a bit of an odd user), including considering quota needs for promo-tools.

Setting a per-user quota should at least help us avoid accidentally DOSing ourselves.
We may need to consider further action to deal with malicious intent.

@ameukam

ameukam commented Feb 15, 2023

Filed a request for 2x the default quota (60,000) for:
europe-west2
europe-west4
us-central1
us-east4

@ameukam

ameukam commented Feb 16, 2023

Requests have been approved for only 2 regions:

(screenshot of the approved quota increases)

@BenTheElder

I can't create public read AR instances at work, but we could enable it on one of the low-traffic regions and run some tests before rolling it out to more regions.

GCR had:

There is a fixed quota limit on requests to Container Registry hosts. This limit is per client IP address.

50,000 HTTP requests every 10 minutes
1,000,000 HTTP requests per day

So ~83.3 QPS per user. That would only cover 12 such users per region on a shared AR quota of 60,000 RPM.

k8s.gcr.io previously saw something like 2,500 QPS globally, so maybe we impose 1% of that, or 25 QPS per user.
That would be 40 users per region at 60,000 RPM and ~67 per region in the 100,000 RPM regions.
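
For reference, the arithmetic behind those numbers as a small worked example (it uses only the figures quoted above):

```go
package main

import "fmt"

func main() {
	// GCR's per-IP limit: 50,000 HTTP requests per 10 minutes.
	gcrQPS := 50000.0 / (10 * 60) // ≈ 83.3 QPS per user
	gcrRPM := gcrQPS * 60         // ≈ 5,000 requests/minute per user

	fmt.Printf("GCR-equivalent cap: %.1f QPS (%.0f RPM) per user\n", gcrQPS, gcrRPM)
	fmt.Printf("users per 60,000 RPM region at that cap: %.0f\n", 60000/gcrRPM) // 12

	// Alternative: 1% of the ~2,500 QPS global k8s.gcr.io traffic per user.
	altQPS := 2500.0 * 0.01 // 25 QPS
	altRPM := altQPS * 60   // 1,500 RPM per user
	fmt.Printf("users per region at 25 QPS: %.0f (60k RPM) / %.0f (100k RPM)\n",
		60000/altRPM, 100000/altRPM) // 40 / ~67
}
```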

I'm a little worried about the image promoter, but we're also only talking about read calls here, and the promoter should not be using up all the quota anyhow. We really should impose some per-user cap to avoid a single client loading up a region again.

@BenTheElder

BenTheElder commented Mar 2, 2023

Thought about this some more:

  • Let's just set this consistently in all regions. Regions with higher regional quota will also have more usage and shouldn't have a higher per-user quota; we should just have a consistent per-IP/user, per-region limit.
  • Let's start with a ~83 QPS cap (≈5,000 requests/minute) per user/IP per region. That at least gets us something in place and doesn't notably regress from GCR for end users. We may want to lower it later, and we may want to request more shared per-region quota in the future when we see more of an uptick in traffic.

cc some folks I think may have thoughts / suggestions about this: @upodroid @ameukam @justinsb @dims

@upodroid

upodroid commented Mar 2, 2023

Can we set quotas for authenticated vs unauthenticated calls? That would be a very nice feature, as it would exclude calls made by our tooling from the general per-user/per-IP quota.

@BenTheElder

I don't think that's available, but I'm not certain.

I think we want to avoid our tooling making really excessive API usage itself, though, which is actually my original motivation for setting one at all: I managed to DoS a region with geranos by using 100% of the regional quota in us-central1 while aggressively scanning images 😅

Since working on geranos further, I'm convinced that we really should not need this many API calls for our tools. Nearly everything in the registry is content-addressed and we can list content (google.Walk etc) and skip already-handled digests with comparatively few API calls (just the list calls, which return tags with their digests and other metadata) if we build our tools properly.
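
A minimal sketch of that listing pattern using go-containerregistry's google.Walk (the repository path and the in-memory "seen" map are placeholders for illustration, not how geranos is actually structured):

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/google"
)

func main() {
	// Placeholder repository; substitute one of our AR-backed repos.
	root, err := name.NewRepository("us-central1-docker.pkg.dev/example-project/images")
	if err != nil {
		log.Fatal(err)
	}

	// Digests we've already handled (e.g. already synced or promoted).
	seen := map[string]bool{}

	// google.Walk issues one list call per (sub)repository; each response
	// already includes every manifest's digest, tags, and metadata, so we
	// can skip known content without any per-image API calls.
	err = google.Walk(root, func(repo name.Repository, tags *google.Tags, err error) error {
		if err != nil {
			return err
		}
		for digest, manifest := range tags.Manifests {
			if seen[digest] {
				continue // content-addressed: nothing to do for known digests
			}
			fmt.Printf("new digest %s in %s (tags: %v)\n", digest, repo, manifest.Tags)
			seen[digest] = true
		}
		return nil
	}, google.WithAuthFromKeychain(authn.DefaultKeychain))
	if err != nil {
		log.Fatal(err)
	}
}
```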

@BenTheElder

BenTheElder commented Mar 3, 2023

Looked in cloud console and we have the option to set:

Requests per project per user per minute per user
Write requests per project per user per minute per user

The latter should only apply to our tools, and we can probably leave it uncapped for now?

It does not appear that there is a separate "per user" vs. "per IP" option; IP is just the fallback "user" for unauthenticated calls.

BenTheElder added the priority/important-soon and sig/k8s-infra labels on Mar 7, 2023
@BenTheElder

Also, these limits are not region-scoped, so that solves that part.

IMO: let's start by setting Requests per project per user per minute per user to 5000 to be somewhat comparable to GCR (GCR's 50,000 requests per 10 minutes works out to 5,000 per minute). We can refine from there, but we should probably not leave it unset for long, especially as we're looking at driving more traffic here.

@BenTheElder

I'd also actually suggest we lean lower, but I'm not sure by how much, and I'd prefer we get a non-infinite limit in place sooner and iterate.

Most users really should not need high QPS. Even 1,000 RPM per IP/user per region is probably more than sufficient, except perhaps for NAT situations, and even then GCR was only set to ~5x that, I think.

For a sense of scale, our normal image pull traffic is only a few thousand QPS on average globally, and a lot of normal user API calls should be offloaded to S3, never reaching AR.

@BenTheElder

On a call with @ameukam just now, we've enabled a 5,000 limit to start and are making sure everything looks good. So far so good; doing some more tests.

@ameukam

ameukam commented Mar 8, 2023

The Requests per project per user per minute per user quota is now set to 5000.

@BenTheElder

We hit this ourselves: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ar-to-s3-sync/1633597203790434304

Reduced our concurrency in #163, still hit this:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ar-to-s3-sync/1633611571659804672

Will reduce again and look into proper rate limiting ...

The good news is, this confirms it works as expected, and individual clients cannot consume excessive quota now.
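
For the "proper rate limiting" piece, the usual shape is a single token-bucket limiter shared by all workers and budgeted below the per-user cap, e.g. with golang.org/x/time/rate. A sketch assuming the 5,000 requests/minute per-user quota we set; the transport wrapper and budget are illustrative, not what the sync job or geranos actually implements:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// limitedTransport is an illustrative http.RoundTripper that blocks until
// the shared limiter allows another request, keeping the whole process
// under the per-user quota no matter how many goroutines are pulling.
type limitedTransport struct {
	limiter *rate.Limiter
	next    http.RoundTripper
}

func (t *limitedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if err := t.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return t.next.RoundTrip(req)
}

func newLimitedClient() *http.Client {
	// Budget below the 5,000 requests/minute per-user cap: ~60 QPS
	// sustained (3,600 RPM) with a small burst, leaving headroom for
	// retries and for a pull being several HTTP requests.
	lim := rate.NewLimiter(rate.Limit(60), 30)
	return &http.Client{
		Transport: &limitedTransport{limiter: lim, next: http.DefaultTransport},
	}
}

func main() {
	client := newLimitedClient()
	// Every request made through this client is throttled by the limiter.
	resp, err := client.Get("https://us-central1-docker.pkg.dev/v2/")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("status:", resp.Status)
}
```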

@BenTheElder

So far this is working fine.

We can consider reducing the limits later, but closing for now as we have enabled them.

@BenTheElder

see: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1678861049737169 / kubernetes-sigs/promo-tools#771

kpromo ran into this when promoting Kubernetes releases (though the periodic job had previously been passing); we're adapting the rate limiter from geranos and adjusting the jobs: kubernetes/test-infra#29060
