Scaling Interconnect #15348

Open
hennevogel opened this issue Dec 11, 2023 · 6 comments
Labels
API Things regarding our API Feature Reference Server 🖥️ Things related to build.opensuse.org

Comments

@hennevogel
Member

Is your feature request related to a problem? Please describe.

If you run a popular public instance of OBS (let's call it https://build.opensuse.org) that many other OBS instances interconnect to in order to build against some popular project (let's call it openSUSE:Factory), sooner or later events on your OBS (say, a glibc update that triggers a rebuild of many packages in one of your popular projects) will cause a storm of interconnect API requests, and you will run into scaling issues.

Describe the solution you'd like
Scale better.

@hennevogel hennevogel added Feature Reference Server 🖥️ Things related to build.opensuse.org API Things regarding our API labels Dec 11, 2023
@bmwiedemann
Member

One idea is to add a feature to fetch these packages from a scalable package-serving infrastructure, let's call it download.opensuse.org. For that we need a way to link certain OBS projects to certain download base URLs (maybe even more than one per project).

A similar mechanism could be used by osc build.
E.g. a recent Leap build printed

45/46 (SUSE:SLE-15-SP4:GA) libXfixes-devel-6.0.0-150400.1.4.x86_64.rpm
SUSE:SLE-15-SP4:GA/libXfixes-devel: attempting download from api, since not found

when it could have got the file from download.opensuse.org/distribution/leap/15.5/repo/oss/x86_64/
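
The project-to-URL link described above could look roughly like the following. This is only a sketch: the `DOWNLOAD_BASES` table, the `candidate_urls` helper, and the `https://` scheme are assumptions made for illustration, and neither OBS nor osc is known to expose such a mapping today. The single entry reuses the project and download path from the osc output above.

```python
# Hypothetical mapping from an OBS project to one or more download base URLs.
# Only the entry below is taken from the example in this thread.
DOWNLOAD_BASES = {
    "SUSE:SLE-15-SP4:GA": [
        "https://download.opensuse.org/distribution/leap/15.5/repo/oss/x86_64/",
    ],
}


def candidate_urls(project, filename, bases=DOWNLOAD_BASES):
    """Return the URLs to try before falling back to the API."""
    return [base.rstrip("/") + "/" + filename for base in bases.get(project, [])]


if __name__ == "__main__":
    rpm = "libXfixes-devel-6.0.0-150400.1.4.x86_64.rpm"
    for url in candidate_urls("SUSE:SLE-15-SP4:GA", rpm):
        print(url)
```

A project without an entry yields an empty list, so a client would keep its current behavior and download from the API.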

@mlschroe
Member

#15361 should make it much better

@mlschroe
Member

(the original report, not the osc side)

@hennevogel
Member Author

@mlschroe it helped a little but it's still not enough.

This is a typical situation right now

Screenshot from 2024-01-26 14-04-23
(live on obs-measure)

The majority of those interconnect requests come from SUSE, followed by other well-known interconnects. All in all, there are 25 interconnect instances that make more than 1K requests per hour whenever something happens that causes them to re-evaluate openSUSE/SUSE distros.

These are the requests that were made more than 1K times during the hour from 10:00 to 11:00:

   1038 /public/build/openSUSE:Leap:15.6/standard/x86_64/_repository
   1142 /public/build/openSUSE:Backports:SLE-15-SP6:Checks/standard/x86_64/_repository
   1196 /public/build/openSUSE:Backports:SLE-15-SP6/standard/x86_64/_repository
   1212 /public/build/SUSE:SLE-15-SP6:GA/pool/x86_64/_repository
   1236 /public/source/openSUSE:Leap:15.5:Update/_config
   1250 /public/source/openSUSE:Backports:SLE-15-SP5:Update/_config
   1264 /public/source/openSUSE:Backports:SLE-15-SP5:Checks/_config
   1264 /public/source/openSUSE:Leap:15.5:Update/_meta
   1272 /public/source/openSUSE:Leap:15.5/_config
   1310 /public/source/Ubuntu:22.04/_config
   1310 /public/source/openSUSE:Backports:SLE-15-SP5/_config
   1661 /public/build/SUSE:SLE-15-SP5:GA/standard/x86_64/_repository
   1826 /public/source/SUSE:SLE-15-SP5:Update/_config
   2010 /public/build/openSUSE:Leap:15.4:Update/standard/x86_64/_repository
   2029 /public/build/openSUSE:Backports:SLE-15-SP4:Update/standard/x86_64/_repository
   2084 /public/build/openSUSE:Leap:15.4/standard/x86_64/_repository
   2112 /public/build/openSUSE:Backports:SLE-15-SP4:Checks/standard/x86_64/_repository
   2226 /public/build/openSUSE:Backports:SLE-15-SP4/standard/x86_64/_repository
   2309 /public/source/SUSE:SLE-15-SP5:GA/_config
   2519 /public/source/SUSE:SLE-15-SP4:Update/_config
   2580 /public/source/SUSE:SLE-15-SP4:GA/_config
   2589 /public/source/Ubuntu:22.04/_meta
   2601 /public/source/SUSE:SLE-15-SP3:GA/_config
   2603 /public/source/SUSE:SLE-15-SP2:Update/_config
   2604 /public/source/SUSE:SLE-15-SP2:GA/_config
   2608 /public/source/SUSE:SLE-15-SP1:Update/_config
   2620 /public/source/SUSE:SLE-15-SP3:Update/_config
   2629 /public/source/SUSE:SLE-15-SP1:GA/_config
   2638 /public/source/SUSE:SLE-15:Update/_config
   2645 /public/source/SUSE:SLE-15:GA/_config
   2675 /public/source/SUSE:SLE-15:GA/_meta
   3097 /public/build/openSUSE:Leap:15.5:Update/standard/x86_64/_repository
   3101 /public/build/openSUSE:Backports:SLE-15-SP5:Update/standard/x86_64/_repository
   3144 /public/build/openSUSE:Leap:15.5/standard/x86_64/_repository
   3213 /public/build/openSUSE:Backports:SLE-15-SP5:Checks/standard/x86_64/_repository
   3312 /public/build/openSUSE:Backports:SLE-15-SP5/standard/x86_64/_repository
   3591 /public/build/SUSE:SLE-15:GA/pool/x86_64/_repository
   3775 /public/build/SUSE:SLE-15-SP1:GA/pool/x86_64/_repository
   3840 /public/build/SUSE:SLE-15-SP1:Update/pool/x86_64/_repository
   3884 /public/build/SUSE:SLE-15-SP5:Update/pool/x86_64/_repository
   3949 /public/build/SUSE:SLE-15-SP2:GA/pool/x86_64/_repository
   4183 /public/build/SUSE:SLE-15:Update/pool/x86_64/_repository
   4290 /public/build/SUSE:SLE-15-SP2:Update/pool/x86_64/_repository
   4312 /public/build/SUSE:SLE-15-SP3:GA/pool/x86_64/_repository
   4935 /public/build/SUSE:SLE-15-SP3:Update/pool/x86_64/_repository
   5683 /public/build/SUSE:SLE-15-SP5:GA/pool/x86_64/_repository
   6318 /public/build/SUSE:SLE-15-SP4:GA/pool/x86_64/_repository
   7271 /public/build/SUSE:SLE-15-SP4:Update/pool/x86_64/_repository
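
A tally like the one above can be reproduced from an access log. The following is a minimal sketch, assuming log lines shaped like the typical request quoted further down in this thread; `count_paths` and `REQUEST_RE` are illustrative names and not part of OBS.

```python
import re
from collections import Counter

# Matches the request part of a log line, e.g.
#   "GET /public/source/SUSE:SLE-15:GA/_config HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD) (\S+) HTTP/[\d.]+"')


def count_paths(lines, threshold=1000):
    """Return (count, path) pairs for paths requested more than
    `threshold` times, sorted by ascending count."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            # Drop the query string so all binary= variants count as one path.
            path = match.group(1).split("?", 1)[0]
            counts[path] += 1
    return sorted((n, p) for p, n in counts.items() if n > threshold)


if __name__ == "__main__":
    import sys

    for n, path in count_paths(sys.stdin):
        print(f"{n:7d} {path}")
```

Fed one hour of log lines on standard input, it prints one count and path per line, in the same layout as the list above.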

@hennevogel
Member Author

@mlschroe maybe you have another idea for serialization?

@bmwiedemann
Member

bmwiedemann commented Feb 2, 2024

What we know about the problem:

  • triggered by a released maintenance update
  • 50 OBSes get notified and fetch updates, schedule rebuilds
  • the many requests for GA/pool repos indicate OBSes go into "lost sync" state = bad
  • 30000 requests answered at a rate of 50/s -> DDoS causes 10+ minutes of slowness
  • typical request:
"90.187.xxIPxx, 172.16.42.25" - [01/Feb/2024:14:30:56 +0000] "GET /public/build/SUSE:SLE-15-SP3:Update/pool/x86_64/_repository?view=binaryversions&nometa=1&binary=aaa_base&binary=attr&binary=diffutils&binary=filesystem&binary=fillup&binary=glibc&binary=grep&binary=libgcc_s1&binary=libncurses6&binary=pam&binary=rpm&binary=sed&binary=tar&binary=libz1&binary=liblzma5&binary=libacl1&binary=libattr1&binary=libpopt0&binary=liblua5_3-5&binary=libpcre1&binary=libcrypt1&binary=perl-base&binary=libdb-4_8&binary=libmnl0&binary=build&binary=rpm-build&binary=gcc-PIE&binary=findutils&binary=binutils&binary=psmisc&binary=glibc-devel&binary=make&binary=gcc&binary=gawk&binary=glibc-locale&binary=gzip&binary=which&binary=xz&binary=file&binary=systemd-rpm-macros&binary=patch&binary=dwz&binary=libgdbm4&binary=update-alternatives&binary=libctf-nobfd0&binary=libctf0&binary=libcap-ng0&binary=libutempter0&binary=libxcrypt-devel&binary=gcc7&binary=cpp&binary=glibc-locale-base&binary=libmagic1&binary=terminfo-base&binary=system-user-root&binary=libtirpc3&binary=libnsl2&binary=libcrack2&binary=pkg-config&binary=libgmp10&binary=libmpfr6&binary=libisl15&binary=libmpc3&binary=cpp7&binary=libasan4&binary=libcilkrts5&binary=libubsan0&binary=file-magic&binary=libtirpc-netconfig&binary=cracklib&binary=gettext-tools-mini&binary=libstdc%2B%2B6&binary=libatomic1&binary=libgomp1&binary=libitm1&binary=liblsan0&binary=libmpx2&binary=libmpxwrappers2&binary=libtsan0&binary=gettext-runtime-mini&binary=libkeyutils1&binary=libverto1&binary=pam-modules&binary=perl&binary=build-mkbaselibs&binary=brp-check-suse&binary=rpmlint-Factory&binary=hostname&binary=brp-extract-appdata&binary=brp-extract-translations&binary=rpmlint-Factory-strict&binary=build-compare&binary=ncurses-utils HTTP/1.1" 200 5276 "-" "BSRPC 0.9.1" 5
  • mls says, binaryversions requests come from obs build workers (relayed through src-server).
    • While requests to a single repo are serialized, SLE-15 repo-overlay probably means that IBS can still send 10 parallel requests.
    • The more workers there are, the more requests they send.
    • There is no cache in src-server atm, because it is designed to be state-less and does not get information for cache-invalidation (only scheduler does)
  • typical response size is 5kB
  • fills up all slots in OBS backend and frontend
  • backend requests take 2-10s to respond (waiting in queue for a slot?)
  • IBS was only responsible for 16% of requests in one measurement (940 of 5827)
  • a single request for _repositories took 330 ms on a fast backend without DDoS
  • OBSes request the same path multiple times
    • some requests have different binary= param
    • e.g. 84 requests from IBS for path=/public/build/SUSE:SLE-15-SP1:Update/pool/x86_64/_repository params={"view"=>"binaryversions", "nometa"=>"1", "binary"=>"ncurses-utils", "withevr"=>"1", "project"=>"SUSE:SLE-15-SP1:Update", "repository"=>"pool", "arch"=>"x86_64", "package"=>"_repository"}
      • but production.log does not track binary= param correctly
  • IBS-requests come from 8 different IPs - so a bit harder to account for
  • This is how it looks in monitoring
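
To quantify the repeated requests noted in the list above, requests can be grouped by a normalized key so that two requests differing only in parameter order count as the same. A minimal sketch, assuming the request path and query string are available in full (the list notes that production.log does not track `binary=` correctly, so the source would have to be a log that does); `request_key` and `duplicate_counts` are hypothetical helper names.

```python
from collections import Counter
from urllib.parse import parse_qs


def request_key(url):
    """Build an order-independent key from a request path and query string."""
    path, _, query = url.partition("?")
    params = parse_qs(query)
    # The set of requested binaries matters, not the order they were listed in.
    binaries = tuple(sorted(params.pop("binary", [])))
    others = tuple(sorted((name, tuple(values)) for name, values in params.items()))
    return path, others, binaries


def duplicate_counts(urls):
    """Count how often each normalized request occurs; keep only repeats."""
    counts = Counter(request_key(url) for url in urls)
    return {key: n for key, n in counts.items() if n > 1}
```

Requests whose keys match asked for the same information, so they are the ones a client could avoid re-sending.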

Some ideas on how we can improve the situation:

  • somehow prevent lost sync state
  • stop re-requesting the same info
    • needs updates in dozens of OBSes
  • rate-limit requests on the client side
    • needs updates in dozens of OBSes
  • rate-limit requests on the server side
    • e.g. reduce number of slots available to this type of request to 10 - should allow us to answer 30/s without over-loading
      • can be done in haproxy between login-proxy and frontend
  • cache responses on the server side (needs active cache-invalidation upon updates before notifying OBSes)
    • will not work well with variations in binary= params
  • use If-Modified-Since header to avoid re-transmit of identical data - maybe not much benefit
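
The server-side slot limit from the list above can be illustrated as follows. With 10 slots and the 330 ms per request measured on an unloaded backend, the upper bound is roughly 10 / 0.33 ≈ 30 requests per second, which matches the estimate in the list. The sketch below is not the haproxy setup proposed there; `SlotLimiter` and `max_throughput` are illustrative names, and rejecting excess requests rather than queueing them is a choice made for this example.

```python
import threading


def max_throughput(slots, service_time_s):
    """Upper bound on requests per second with `slots` parallel requests."""
    return slots / service_time_s


class SlotLimiter:
    """Admit at most `slots` concurrent requests and reject the rest."""

    def __init__(self, slots):
        self._sem = threading.BoundedSemaphore(slots)

    def try_handle(self, handler):
        # Non-blocking acquire: when all slots are busy, shed the request
        # instead of letting it wait and occupy a frontend slot.
        if not self._sem.acquire(blocking=False):
            return None  # the caller would answer with an error status (assumption)
        try:
            return handler()
        finally:
            self._sem.release()


if __name__ == "__main__":
    print(round(max_throughput(10, 0.33)))  # about 30 requests per second
```

The point of the limit is that this one request type can no longer fill every backend and frontend slot, so other traffic keeps being served during a storm.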
