
WIP: Limit Thumbnail Generation Concurrency #3794

Conversation

@t-muehlberger (Contributor) commented Nov 20, 2023

This should fix the issue described in #3759. I tested this solution in my PB deployment and it no longer ran out of memory.

This solution does not use a separate worker pool; instead it simply blocks the request handler if too many thumbnails are being generated at the same time (a minimal sketch of the idea is shown after the TODO list below). I think this is the simpler approach.

It will also be easier to remove the semaphore in case this fix is replaced by a more general solution in the new FS abstraction (like limiting the number of open files or the number of bytes held in memory).

The limit should probably be configurable. @ganigeorgiev do you think I should read an environment variable such as MAX_THUMBNAIL_CONCURRENCY?

…t thumbnail generation processes.

TODO:
- [ ] Add tests/benchmarks to verify that memory consumption is lowered
- [ ] Cleanup => move code related to concurrency to separate file
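
A minimal sketch of the blocking-semaphore idea, assuming a hypothetical `generateThumb` helper and the suggested `MAX_THUMBNAIL_CONCURRENCY` environment variable (the names are illustrative and this is not the actual PR diff):

```go
package thumblimit

import (
	"os"
	"runtime"
	"strconv"
)

// thumbSem is a counting semaphore implemented as a buffered channel.
// Acquiring a slot blocks the request handler once the limit is reached.
var thumbSem = make(chan struct{}, thumbLimit())

// thumbLimit reads MAX_THUMBNAIL_CONCURRENCY and falls back to
// runtime.NumCPU()*2 when the variable is missing or invalid.
func thumbLimit() int {
	if v, err := strconv.Atoi(os.Getenv("MAX_THUMBNAIL_CONCURRENCY")); err == nil && v > 0 {
		return v
	}
	return runtime.NumCPU() * 2
}

// generateThumbLimited wraps the hypothetical generateThumb helper so that at
// most thumbLimit() thumbnails are generated at the same time; additional
// requests simply wait for a free slot.
func generateThumbLimited(originalPath, thumbPath string) error {
	thumbSem <- struct{}{}        // acquire a slot (blocks when the limit is hit)
	defer func() { <-thumbSem }() // release the slot when done
	return generateThumb(originalPath, thumbPath)
}

// generateThumb stands in for the real resize-and-store logic.
func generateThumb(originalPath, thumbPath string) error { return nil }
```

With this shape, removing the limiter later only means deleting the channel and the wrapper function.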
@ganigeorgiev (Member) commented Nov 20, 2023

Without tests showing that the thumbs generation is the culprit and not the aws-sdk-go file upload, I don't feel confident merging it, sorry.

What is runtime.NumCPU() * 2 based on? Does it have a significant impact on the memory if it is higher? Could you elaborate on how you are testing to reproduce your results?
What is your memory usage when GOMEMLIMIT is set to ~70% of your max memory?

Some notes on the code:

  • The semaphore variable could be moved to a field inside the fileApi struct.
  • I'm not sure whether it would be better to hold the connection open instead of returning 503 (or even to return the original image instead).
  • The double fs.Exists ideally should be avoided if possible. In the past users have complained about hitting S3 rate limits (see #28 "Add CDN field to the S3 configuration") and I want to minimize the extra S3 calls (e.g. we can store the currently generating thumb name in a concurrency-safe map[string]struct{}{}; see the sketch after this list).
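
One possible shape of that concurrency-safe in-progress map, sketched with a plain mutex (the helper names are illustrative, not an existing API):

```go
package thumbtrack

import "sync"

// inflight tracks the names of thumbs that are currently being generated so
// that concurrent requests for the same thumb don't each hit the storage.
var (
	inflightMu sync.Mutex
	inflight   = map[string]struct{}{}
)

// tryStart marks a thumb as being generated. It returns false when another
// request is already generating the same thumb name.
func tryStart(name string) bool {
	inflightMu.Lock()
	defer inflightMu.Unlock()
	if _, ok := inflight[name]; ok {
		return false
	}
	inflight[name] = struct{}{}
	return true
}

// finish removes the thumb name from the in-progress set.
func finish(name string) {
	inflightMu.Lock()
	defer inflightMu.Unlock()
	delete(inflight, name)
}
```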

Note that it is possible that the higher memory consumption is caused by aws/aws-sdk-go#4916.

Before complicating the implementation further, I suggest preparing a test case (or at least some manual steps) that consistently triggers the issue and starting the profiling from there.
We can try changing back to aws-sdk-go-v2 to see if there would be any improvement (we used v2 in the past but had to revert to v1 because of #2231). If v2 performs better, we can try to upgrade again, apply a fix for the GCS headers, and avoid the locks altogether.

@t-muehlberger (Contributor, Author) commented

Alright, let me get back to you with a benchmark or script to replicate the issue.

runtime.NumCPU() * 2 seemed to utilize the CPU pretty well on my small Hetzner VPS with Backblaze B2 as the storage bucket. This assumes that roughly 50% of the time is spent fetching images (IO-bound) and 50% resizing them (CPU-bound), but I did not measure that. Also, the factors involved (S3 latency, image size, available CPU/RAM) vary widely between deployments.

With limited concurrency the memory usage always stayed around GOMEMLIMIT or a little above, even when setting GOMEMLIMIT as low as 256MiB. The memory actually in use at any moment is probably a lot less, with a lot of garbage being generated and cleaned up as soon as GOMEMLIMIT is reached.

I will keep digging into the issue, but it will probably take me a few days to find the time. You can close the PR in the meantime if you want to keep GitHub clean.

ganigeorgiev added a commit that referenced this pull request Dec 4, 2023
Co-authored-by: Tobias Muehlberger <tobias@muehlberger.dev>
@ganigeorgiev (Member) commented

I was able to experiment a little with this and can confirm that the thumb generation of large images requires a lot of memory allocations and that the problem is not just the S3 file upload, as the higher memory consumption can be observed even when only the local filesystem is used.

I performed a test locally with a ~7MB 9319x5792px jpg, generating 3 different thumbs concurrently (100x100, 500x500 and 1000x1000) for both storage options. runtime.MemStats.Alloc indicated ~1.2GB for the local filesystem case and ~2GB for S3. Encoding the images just with the std image/jpeg also seems to produce similar results, which I guess makes sense as imaging uses the std lib under the hood (in all cases the draw.NearestNeighbor resampling filter was used as a base).
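
For reference, a rough, single-image approximation of that kind of measurement using the std image/jpeg decoder and golang.org/x/image/draw with the NearestNeighbor filter (the file name and target size are placeholders, and this is not the exact test code):

```go
package main

import (
	"fmt"
	"image"
	"image/jpeg"
	"os"
	"runtime"

	"golang.org/x/image/draw"
)

func main() {
	f, err := os.Open("large.jpg") // placeholder for the ~7MB test image
	if err != nil {
		panic(err)
	}
	defer f.Close()

	src, err := jpeg.Decode(f)
	if err != nil {
		panic(err)
	}

	// Resize with the same NearestNeighbor filter mentioned above.
	dst := image.NewRGBA(image.Rect(0, 0, 1000, 1000))
	draw.NearestNeighbor.Scale(dst, dst.Bounds(), src, src.Bounds(), draw.Over, nil)

	// Report the currently allocated heap bytes after the resize.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Alloc: %d MiB\n", m.Alloc/1024/1024)
}
```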

So I don't think there is much we can do (at least not easily) to optimize the memory allocations in this case.

I've pushed a slightly modified version of your PR to the develop branch that should also cover the case where multiple concurrent requests for the same thumb are executed (there is also a small automated test to ensure that new thumbs are saved properly). The changes will be available with the upcoming v0.20.0 release.

I still don't exactly like the runtime.NumCPU() usage (by default I changed it to runtime.NumCPU() + 1) because the number of logical CPUs is not directly related to the memory allocations, which are the actual problem here, but since I don't have a better solution at the moment I'll keep it as it is for now and will eventually adjust it based on user feedback and my tests with Presentator.

But in general, if you are still experiencing OOM errors, you can try combining the latest changes with GOMEMLIMIT, as this is expected to keep the GC-managed allocations within your desired limit (I'll consider adding a note about it in the documentation).
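
For reference, GOMEMLIMIT is read by the Go runtime itself (Go 1.19+), so it can simply be set in the environment before starting the binary; the programmatic equivalent is runtime/debug.SetMemoryLimit, roughly:

```go
package main

import "runtime/debug"

func main() {
	// Roughly equivalent to starting the process with GOMEMLIMIT=512MiB:
	// sets a soft memory limit of 512 MiB for the Go garbage collector.
	debug.SetMemoryLimit(512 << 20)

	// ... start the application as usual ...
}
```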
