carefully consider enabling artifact registry user quotas #153
@ameukam filed an initial quota increase request. We still need to make sure that we can safely enable per-user quotas (we're a bit of an odd user), including considering quota needs for promo-tools. Setting a per-user quota should at least help us avoid accidentally DoSing ourselves.
Filed a request for 2x the default quota (60,000) for:
I can't create public read AR instances at work, but we could enable it on one of the low-traffic regions and run some tests before rolling it out to more regions. GCR had:
So that's ~83.33 QPS per user, which would allow only 12 such users per region against a shared AR quota of 60,000 RPM. k8s.gcr.io previously saw something like 2,500 QPS globally, so maybe we impose 1% of that, or 25 QPS per user. I'm a little worried about the image promoter, but we're also only talking about read calls here, and the promoter should not be using up all the quota anyhow. We really should impose some per-user cap to avoid a single client loading up a region again.
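To sanity-check the arithmetic, assuming GCR's former per-user limit was 5,000 RPM (which is what the 83.33 QPS figure implies; the helper names below are hypothetical, not from any of our tools):

```go
package main

import "fmt"

// rpmToQPS converts a requests-per-minute quota into queries per second.
func rpmToQPS(rpm float64) float64 {
	return rpm / 60.0
}

// usersPerRegion estimates how many clients at perUserRPM fit within a
// shared regional quota of regionRPM before the region is saturated.
func usersPerRegion(regionRPM, perUserRPM float64) float64 {
	return regionRPM / perUserRPM
}

func main() {
	// 5,000 RPM per user (assumed, from the 83.33 QPS figure above).
	fmt.Printf("%.2f QPS per user\n", rpmToQPS(5000)) // 83.33
	// 60,000 RPM shared regional quota / 5,000 RPM per user = 12 users.
	fmt.Printf("%.0f users per region\n", usersPerRegion(60000, 5000)) // 12
}
```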
Thought about this some more:
cc some folks I think may have thoughts / suggestions about this: @upodroid @ameukam @justinsb @dims
Can we set quotas for authenticated vs. unauthenticated calls? That would be a very nice feature, as it would exclude calls made by our tooling from the general per-user/per-IP quota.
I don't think that's available, but I'm not certain. I think we want to avoid our tooling making really excessive API usage itself, though, which is actually my original motivation for setting one at all: I managed to DoS a region with geranos by using 100% of the regional quota in us-central1 aggressively scanning images 😅 Since working on geranos further, I'm convinced that we really should not need this many API calls for our tools. Nearly everything in the registry is content-addressed and we can list content (
Looked in the Cloud Console, and we have the option to set:
The latter should only apply to our tools, and we can probably leave it uncapped for now? It does not appear there is a distinct "per user" vs. "per IP" quota; IP is just the fallback "user" for unauthenticated calls.
Also, these limits are not region-scoped, so that solves that part. IMO: let's start by setting
I'd also actually suggest we lean lower, but I'm not sure by how much, and I'd prefer we get a non-infinite limit in place sooner and iterate. Most users really should not need high QPS. Even 1,000 RPM per IP/user per region is probably more than sufficient, except perhaps for NAT situations, and even then GCR was only set to ~5x that, I think. For a sense of scale: our normal image pull traffic is only a few thousand QPS on average globally, and a lot of normal user API calls should be offloaded to S3, never reaching AR.
On a call with @ameukam now, we've enabled a 5,000 limit to start and will make sure everything looks good. So far so good; doing some more tests.
Request
We hit this ourselves: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ar-to-s3-sync/1633597203790434304. We reduced our concurrency in #163 but still hit this; will reduce again and look into proper rate limiting. The good news is we've confirmed that this works as expected and individual clients cannot consume excessive quota now.
So far this is working fine. We can consider reducing them later, but closing for now as we have enabled them. |
See: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1678861049737169 / kubernetes-sigs/promo-tools#771. kpromo ran into this when promoting Kubernetes releases (though the periodic had previously been passing). We're adapting the rate limiter from geranos and adjusting the jobs: kubernetes/test-infra#29060
Currently quotas are per region/project. For us that means each AR region.
https://cloud.google.com/artifact-registry/quotas#project-quota
We've seen kubernetes-sigs/promo-tools, and now #151 while in development (excessive concurrency), hit 429 Too Many Requests, which currently causes an outage for the involved region during that 1-minute quota window. We should fix our tools (I've throttled #151), but intentional abuse is also a concern.
We should look at enabling this. It's worth noting that doing so may break the image promoter if it exceeds the new per-user quota, but we can adjust for that.
We can also request a quota increase. Jon suggested 2x for the busier regions would probably be reasonable for us.