carefully consider enabling artifact registry user quotas #153

Closed · BenTheElder opened this issue Feb 14, 2023 · 15 comments

Labels: priority/important-soon, sig/k8s-infra

Comments

@BenTheElder

Currently quotas are per region/project. For us that means each AR region.

https://cloud.google.com/artifact-registry/quotas#project-quota

60000 requests per minute in each region or multi-region.
18000 write or delete requests per minute in each region or multi-region.

In most cases, a single HTTP request or API call counts as a single request. However, some operations count as multiple requests. For example, a batch request like ImportAptArtifacts might charge quota for each item in the batch. A Docker pull or push usually makes multiple HTTP requests, so quota is charged for each request.

We've seen https://github.com/kubernetes-sigs/promo-tools and now #151 (while in development, with excessive concurrency) hit 429 Too Many Requests, which currently causes an outage for the involved region during that 1-minute quota window.

We should fix our tools (I've throttled #151) but intentional abuse is also a concern.

Per-user request quota
By default, projects have unlimited per-user quota. You can optionally cap per-user quota within a project. Per-user quota applies per authenticated user or per client IP address for unauthenticated requests to a public repository.

We should look at enabling this. It's worth noting that doing so may break the image promoter if it exceeds the new per-user quota, but we can adjust for that.

We can also request a quota increase. Jon suggested 2x for the busier regions would probably be reasonable for us.
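
For a rough sense of how quickly "excessive concurrency" burns through a shared 60,000 RPM regional quota, here's a back-of-the-envelope sketch (the worker count and per-worker request rate are illustrative assumptions, not measurements of promo-tools or #151):

```go
package main

import "fmt"

func main() {
	// Published per-region AR quota (requests per minute).
	const regionalRPM = 60000

	// Illustrative worker pool: assumptions for the arithmetic only.
	const (
		workers       = 100 // concurrent goroutines
		reqPerSecEach = 10  // HTTP requests each worker issues per second
	)

	perMinute := workers * reqPerSecEach * 60
	fmt.Printf("aggregate load: %d requests/minute\n", perMinute) // 60000
	fmt.Printf("regional quota consumed: %d%%\n", perMinute*100/regionalRPM)
	// A single client at this rate exhausts the entire region's quota,
	// so every other client in that region gets 429s for the rest of
	// the 1-minute quota window.
}
```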

@BenTheElder

@ameukam filed an initial quota increase request.

We still need to make sure that we can safely enable per-user quotas (we're a bit of an odd user), including considering quota needs for promo-tools.

Setting a per-user quota should at least help us avoid accidentally DOSing ourselves.
We may need to consider further action to deal with malicious intent.

@ameukam

ameukam commented Feb 15, 2023

Filed a request for 2x the default quota (60,000) for:
europe-west2
europe-west4
us-central1
us-east4

@ameukam

ameukam commented Feb 16, 2023

Requests have been approved for only 2 regions:

(screenshot of the approved quota increases)

@BenTheElder

I can't create public read AR instances at work, but we could enable it on one of the low-traffic regions and run some tests before rolling it out to more regions.

GCR had:

There is a fixed quota limit on requests to Container Registry hosts. This limit is per client IP address.

50,000 HTTP requests every 10 minutes
1,000,000 HTTP requests per day

So ~83.3 QPS per user. That would only cover 12 such users per region on a shared AR quota of 60,000 RPM.

k8s.gcr.io previously saw something like 2,500 QPS globally, so maybe we impose 1% of that, or 25 QPS per user.
That would be 40 users per region at 60,000 RPM and ~67 per region in the 100,000 RPM regions.
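
For reference, the arithmetic behind those numbers as a small worked example (it uses only the figures quoted above):

```go
package main

import "fmt"

func main() {
	// GCR's per-IP limit: 50,000 HTTP requests per 10 minutes.
	gcrQPS := 50000.0 / (10 * 60) // ≈ 83.3 QPS per user
	gcrRPM := gcrQPS * 60         // ≈ 5,000 requests/minute per user

	fmt.Printf("GCR-equivalent cap: %.1f QPS (%.0f RPM) per user\n", gcrQPS, gcrRPM)
	fmt.Printf("users per 60,000 RPM region at that cap: %.0f\n", 60000/gcrRPM) // 12

	// Alternative: 1% of the ~2,500 QPS global k8s.gcr.io traffic per user.
	altQPS := 2500.0 * 0.01 // 25 QPS
	altRPM := altQPS * 60   // 1,500 RPM per user
	fmt.Printf("users per region at 25 QPS: %.0f (60k RPM) / %.0f (100k RPM)\n",
		60000/altRPM, 100000/altRPM) // 40 / ~67
}
```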

I'm a little worried about the image promoter, but we're also only talking about read calls here, and the promoter should not be using up all the quota anyhow. We really should impose some per-user cap to avoid a single client loading up a region again.

@BenTheElder

BenTheElder commented Mar 2, 2023

Thought about this some more:

  • Let's just set this consistently in all regions. Regions with higher regional quota will also have more usage and shouldn't have a higher per-user quota; we should just have a consistent per-IP/user, per-region limit.
  • Let's start with a ~83 QPS cap (≈5,000 requests/minute) per user/IP per region. That at least gets us something in place and doesn't notably regress from GCR for end users. We may want to lower it later, and we may want to request more shared per-region quota in the future when we see more of an uptick in traffic.

cc some folks I think may have thoughts / suggestions about this: @upodroid @ameukam @justinsb @dims

@upodroid

upodroid commented Mar 2, 2023

Can we set quotas for authenticated vs unauthenticated calls? That would be a very nice feature, as it would exclude calls made by our tooling from the general per-user/per-IP quota.

@BenTheElder

I don't think that's available, but I'm not certain.

I think we want to avoid our tooling making really excessive API usage itself, though, which is actually my original motivation for setting one at all: I managed to DoS a region with geranos by using 100% of the regional quota in us-central1 while aggressively scanning images 😅

Since working on geranos further, I'm convinced that we really should not need this many API calls for our tools. Nearly everything in the registry is content-addressed and we can list content (google.Walk etc) and skip already-handled digests with comparatively few API calls (just the list calls, which return tags with their digests and other metadata) if we build our tools properly.
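
A minimal sketch of that listing pattern using go-containerregistry's google.Walk (the repository path and the in-memory "seen" map are placeholders for illustration, not how geranos is actually structured):

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/google"
)

func main() {
	// Placeholder repository; substitute one of our AR-backed repos.
	root, err := name.NewRepository("us-central1-docker.pkg.dev/example-project/images")
	if err != nil {
		log.Fatal(err)
	}

	// Digests we've already handled (e.g. already synced or promoted).
	seen := map[string]bool{}

	// google.Walk issues one list call per (sub)repository; each response
	// already includes every manifest's digest, tags, and metadata, so we
	// can skip known content without any per-image API calls.
	err = google.Walk(root, func(repo name.Repository, tags *google.Tags, err error) error {
		if err != nil {
			return err
		}
		for digest, manifest := range tags.Manifests {
			if seen[digest] {
				continue // content-addressed: nothing to do for known digests
			}
			fmt.Printf("new digest %s in %s (tags: %v)\n", digest, repo, manifest.Tags)
			seen[digest] = true
		}
		return nil
	}, google.WithAuthFromKeychain(authn.DefaultKeychain))
	if err != nil {
		log.Fatal(err)
	}
}
```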

@BenTheElder

BenTheElder commented Mar 3, 2023

Looked in cloud console and we have the option to set:

Requests per project per user per minute per user
Write requests per project per user per minute per user

The latter should only apply to our tools, and we can probably leave it uncapped for now?

It does not appear that there is a separate "per user" vs. "per IP" option; IP is just the fallback "user" for unauthenticated calls.

BenTheElder added the priority/important-soon and sig/k8s-infra labels on Mar 7, 2023
@BenTheElder

Also, these limits are not region-scoped, so that solves that part.

IMO: let's start by setting Requests per project per user per minute per user to 5000 to be somewhat comparable to GCR (GCR's 50,000 requests per 10 minutes works out to 5,000 per minute). We can refine from there, but we should probably not leave it unset for long, especially as we're looking at driving more traffic here.

@BenTheElder

I'd also actually suggest we lean lower, but I'm not sure by how much, and I'd prefer we get a non-infinite limit in place sooner and iterate.

Most users really should not need high QPS. Even 1,000 RPM per IP/user per region is probably more than sufficient, except perhaps for NAT situations, and even then GCR was only set to ~5x that, I think.

For a sense of scale, our normal image pull traffic is only a few thousand QPS on average globally, and a lot of normal user API calls should be offloaded to S3, never reaching AR.

@BenTheElder

On a call with @ameukam just now, we've enabled a 5,000 limit to start and are making sure everything looks good. So far so good; doing some more tests.

@ameukam

ameukam commented Mar 8, 2023

The Requests per project per user per minute per user quota is now set to 5000.

@BenTheElder

We hit this ourselves: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ar-to-s3-sync/1633597203790434304

Reduced our concurrency in #163, still hit this:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ar-to-s3-sync/1633611571659804672

Will reduce again and look into proper rate limiting ...

The good news is, this confirms it works as expected, and individual clients cannot consume excessive quota now.
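
For the "proper rate limiting" piece, the usual shape is a single token-bucket limiter shared by all workers and budgeted below the per-user cap, e.g. with golang.org/x/time/rate. A sketch assuming the 5,000 requests/minute per-user quota we set; the transport wrapper and budget are illustrative, not what the sync job or geranos actually implements:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// limitedTransport is an illustrative http.RoundTripper that blocks until
// the shared limiter allows another request, keeping the whole process
// under the per-user quota no matter how many goroutines are pulling.
type limitedTransport struct {
	limiter *rate.Limiter
	next    http.RoundTripper
}

func (t *limitedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if err := t.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return t.next.RoundTrip(req)
}

func newLimitedClient() *http.Client {
	// Budget below the 5,000 requests/minute per-user cap: ~60 QPS
	// sustained (3,600 RPM) with a small burst, leaving headroom for
	// retries and for a pull being several HTTP requests.
	lim := rate.NewLimiter(rate.Limit(60), 30)
	return &http.Client{
		Transport: &limitedTransport{limiter: lim, next: http.DefaultTransport},
	}
}

func main() {
	client := newLimitedClient()
	// Every request made through this client is throttled by the limiter.
	resp, err := client.Get("https://us-central1-docker.pkg.dev/v2/")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("status:", resp.Status)
}
```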

@BenTheElder

So far this is working fine.

We can consider reducing the limits later, but closing for now as we have enabled them.

@BenTheElder

see: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1678861049737169 / kubernetes-sigs/promo-tools#771

kpromo ran into this when promoting Kubernetes releases (though the periodic job had previously been passing); we're adapting the rate limiter from geranos and adjusting the jobs: kubernetes/test-infra#29060
