Large scale considerations #173

Open
adriendelsalle opened this issue Nov 19, 2020 · 10 comments
@adriendelsalle
Member

I would like to open this issue to list the points that are important to keep in mind as Quetz is developed with large-scale use in view.

What I have in mind:

Language or dependencies

  • what is the maximum load that FastAPI can handle?
  • choice of Python as the base language for backend operations (extracting tarballs, generating JSON patches, etc.)
    • context of providing views depending on the user's authorizations, partially handled by database requests
    • multi-threading for CPU-bound ops (see the sketch at the end of this comment)
    • etc.

Database/storage

  • even with PostgreSQL, projections of the data volume and ops/s we need to be able to handle
  • do we expect to need machinery to speed up requests (caching) in other databases? on the filesystem?
  • impact of the filesystem; best choice for read/write operations
  • need for distributed filesystems?

Others

  • role-based vs. attribute-based access control?

This is just a draft, to be updated with contributions (concerns, solutions, links to PRs, etc.)!
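
For the CPU-bound point above, here is a rough sketch (hypothetical endpoint, placeholder paths and pool size, not actual Quetz code) of pushing tarball extraction into a process pool so it does not block the async event loop:

```python
import asyncio
import tarfile
from concurrent.futures import ProcessPoolExecutor

from fastapi import FastAPI

app = FastAPI()
pool = ProcessPoolExecutor(max_workers=4)  # pool size is a placeholder


def extract_tarball(path: str, dest: str) -> None:
    # Runs in a worker process, so the web worker's GIL is not held.
    with tarfile.open(path) as tar:
        tar.extractall(dest)


@app.post("/api/extract")  # hypothetical endpoint, for illustration only
async def extract(path: str):
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(pool, extract_tarball, path, "/tmp/extracted")
    return {"status": "ok"}
```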

@bollwyvl

Thanks for bringing these up! A couple thoughts:

python/fastapi perf:
yeah, sure, python isn't rust or c++. fastapi is down around 250 in the benchmark game, so there are plenty of other things to choose from. PyPy could potentially jump it up a hair, though I don't think all the deps are there yet. but man, I'd sure like a conda package repo that spoke graphql! anyhow, the variant that does best also uses orjson, but who knows, maybe simdjson or one of the others has even more to say. aside: hadn't heard of apidaora (currently the leading python framework in those benchmarks)... learn some new web junk every day!

distributed filesystem:
perhaps not what @adriendelsalle had in mind, but ipfs is a very interesting beast, as it theoretically has no single point of failure. I've almost got it built for conda-forge, which is cute, but what's more interesting is it can handle netflix-level volume/velocity. If a community (say conda-forge) can fiat a peer-of-last-resort (seems like 2tb of conda-forge would be ~$100/mo from a pinning service), cloudflare will foot the bill (for now) for CDN, and quetz would be none the wiser when replicating it... or some deeper integration would be possible. an ipfs-native client hardly seems infeasible at this point.

database:
this is one of the places where the go-to fastapi/sqlalchemy ORM strategy can be a bear. if we're specifically talking pg, it's possible to use the binary protocol (even with the ORM) via asyncpg, which handles a number of issues on both the database and the app server by doing less work.
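
For illustration, a minimal sketch (assuming SQLAlchemy >= 1.4; the DSN and table name are placeholders, not Quetz's actual configuration) of pointing the ORM at the asyncpg driver:

```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker

# asyncpg speaks PostgreSQL's binary protocol, so less text parsing happens
# on both ends; the DSN below is a placeholder.
engine = create_async_engine(
    "postgresql+asyncpg://quetz:secret@localhost/quetz",
    pool_size=20,
    max_overflow=10,
)
async_session = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)


async def count_packages() -> int:
    # "packages" is an illustrative table name.
    async with async_session() as session:
        result = await session.execute(text("SELECT count(*) FROM packages"))
        return result.scalar_one()
```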

@btel
Collaborator

btel commented Feb 11, 2021

BTW, we have done some load testing using locust, and we can process around 100 rps (requests per second) on a standard laptop with a single quetz worker (for the download endpoint, which generates a redirect to an S3 file).

[chart: locust load-test results for the quetz download endpoint (locust_quetz)]
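
A minimal locustfile sketch for that kind of test (the URL path and package names below are placeholders, not the exact ones used):

```python
from locust import HttpUser, between, task


class DownloadUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def download_package(self):
        # The endpoint returns a redirect to S3, so don't follow it:
        # we want to measure quetz, not the object store.
        self.client.get(
            "/get/my-channel/linux-64/some-package-1.0-0.tar.bz2",  # placeholder
            allow_redirects=False,
            name="/get/[channel]/[subdir]/[package]",
        )
```

Run with e.g. `locust -f locustfile.py --host http://localhost:8000`.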

@bollwyvl

@btel Aw, yeah, locust is wonderful! (disclaimer: I maintain the conda-forge feedstock 👿). For giggles, can you toss in the stats summary output? While pretty, I find the charts lie, as small error counts, etc. can still look flat.

It would be lovely to have this under test... not for absolute numbers, but to catch significant regressions (e.g. starts throwing lots of 500s). Basically, CI caches the repo's HEAD summary, PRs download that, and start failing if the numbers change by SOME_THRESHOLD, depending on the route.
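
Roughly, such a gate could look like this (a sketch only, assuming locust was run with --csv so per-route stats land in a *_stats.csv file; the file names and SOME_THRESHOLD are placeholders, and the CSV column names can differ between locust versions):

```python
import csv
import sys

SOME_THRESHOLD = 0.20  # fail if a route's average response time regresses > 20%


def load_stats(path):
    # Column names follow recent locust CSV output and may need adjusting.
    with open(path, newline="") as f:
        return {
            row["Name"]: float(row["Average Response Time"])
            for row in csv.DictReader(f)
        }


baseline = load_stats("head_stats.csv")  # cached summary from the repo's HEAD
current = load_stats("pr_stats.csv")     # summary produced by the PR's run

failed = False
for route, head_ms in baseline.items():
    pr_ms = current.get(route)
    if pr_ms is not None and pr_ms > head_ms * (1 + SOME_THRESHOLD):
        print(f"regression on {route}: {head_ms:.0f}ms -> {pr_ms:.0f}ms")
        failed = True

sys.exit(1 if failed else 0)
```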

To that point, having this for every route is important, especially with a couple admins and a horde of users changing lots of stuff (especially permissions!) at a furious rate, as it can reveal nasty things like full-table database locks which don't get caught when tested in isolation.

Another tool in the shed, to both improve the baseline and help debug perf regressions, is the opencensus stack, with a simple example here. It looks like there is some work going on to give finer-grained insight into the fastapi side, while the sqlalchemy integration is already very robust. I've used the jaeger integration (also on conda-forge, might need some maintainer ❤️) for reporting.
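
Roughly, the wiring looks like this (a sketch only; the service name, sampling rate, and agent address are placeholders, and it assumes the opencensus sqlalchemy and jaeger extensions are installed):

```python
from opencensus.ext.jaeger.trace_exporter import JaegerExporter
from opencensus.trace import config_integration
from opencensus.trace.samplers import ProbabilitySampler
from opencensus.trace.tracer import Tracer

# Auto-instrument SQLAlchemy so every query shows up as a span.
config_integration.trace_integrations(["sqlalchemy"])

exporter = JaegerExporter(
    service_name="quetz",        # placeholder service name
    agent_host_name="localhost",
    agent_port=6831,
)
tracer = Tracer(exporter=exporter, sampler=ProbabilitySampler(rate=1.0))

with tracer.span(name="download-package"):
    pass  # handler work goes here; nested ORM calls get their own spans
```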

I've yet to do a FULL full-stack integration with opencensus-web, but this is the real cadillac, as you can trace from a button getting pushed in the SPA to pixels on the page for a single request, which is a thing of beauty when it works properly.

Having all these hooks built in to the various tiers, ready to be turned on by a site admin, can help them really own an application, beyond simple log mining, and can yield much better issue reports. Trying to get this level of insight from a "hostile" application is... harder.

@btel
Collaborator

btel commented Feb 11, 2021

hi @bollwyvl, thanks for the valuable suggestions. Automating load testing is definitely on our roadmap. I haven't ever used the opencensus stack; it's definitely something I would like to investigate. Thanks again for the pointers!

@btel
Collaborator

btel commented Feb 11, 2021

I forgot about the locust stats; I need to re-generate them because, stupidly, I did not keep them.

btw we benchmarked the download endpoint, because it's the one that's going to be most frequently hit by users (and CIs), but I agree we should test other endpoints as well.

Bartosz

@bollwyvl

minor update: we got the jaeger-feedstock updated to the most recent version (go has been rethinking their packaging approach, har).

@bollwyvl

another update: go-ipfs-feedstock should exist soon (not up yet, but GH is having a bad day, I guess). @wolfv @yuvipanda and I have been semi-seriously kicking around ideas on federated stuff for a while, so I guess it's a little more real (to me) now!

@atrawog
Collaborator

atrawog commented May 30, 2021

What could speed up quetz by a lot is a smart caching system. Most quetz content consists of static files that don't get updated very often, and there is no need to make a db or even a fastapi request for content that has already been served and hasn't changed since.

For an in-memory cache, fastapi-cache is the obvious choice. But the easiest strategy to implement is probably generating proper ETags in fastapi and then putting a really large NGINX content cache on top. NGINX is really fast at serving static content, and if most requests get cached, raw fastapi performance is less of an issue.
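
A sketch of the ETag half of that (the route, storage path, and hashing choice are illustrative only; a real implementation would cache the hash rather than recompute it per request):

```python
import hashlib

from fastapi import FastAPI, Request, Response
from fastapi.responses import FileResponse

app = FastAPI()


def compute_etag(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


@app.get("/channels/{channel}/repodata.json")  # illustrative route
async def repodata(channel: str, request: Request):
    path = f"/data/{channel}/repodata.json"  # illustrative storage layout
    etag = compute_etag(path)
    if request.headers.get("if-none-match") == etag:
        # Client (or the NGINX cache in front) already has the current version.
        return Response(status_code=304, headers={"ETag": etag})
    return FileResponse(path, headers={"ETag": etag})
```

With ETags in place, an NGINX proxy cache with `proxy_cache_revalidate on` can revalidate stale entries via If-None-Match instead of pulling the full response from fastapi every time.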

The other things that might be worth a look are Backblaze for storage and Cloudflare as a CDN.

@atrawog
Collaborator

atrawog commented May 30, 2021

> another update: go-ipfs-feedstock should exist soon (not up yet, but GH is having a bad day, I guess). @wolfv @yuvipanda and I have been semi-seriously kicking around ideas on federated stuff for a while, so I guess it's a little more real (to me) now!

Personally, I wouldn't dare to put a production system right on top of IPFS. But IPFS could be the perfect solution for a long-term package archive and/or package distribution system.

@yuvipanda

I spent time trying to run IPFS for one of my side projects, but switched back to an S3 API instead. You still need to run pinning nodes, and the well-tested setups pin content to filesystems. So you end up needing to run a cluster that requires you to run file systems, which can get messy. Latency was also highly variable. I think it's getting better, but it's not useful at medium to large scales right now.
