Charts themselves may need caching in fluxv2 plugin #3636
Yep. Another possibility is to do something similar to helm where a
yeah, I've thought about that one yesterday too. I thought our helm support caches all chart versions. If I remember correctly, the bitnami repo has around 14K versions and I just thought 'heck no': caching 14K things when in most cases you might need just one or two packages is terrible signal-to-noise wise. But also, if I remember correctly, there are something like 100 or so distinct packages in the bitnami repo, so caching the 2-3 latest versions would result in 200-300 things being cached, which is not great but of course much better than caching 14K. So I'll consider this approach as well.

Side note: this is where the exact implementation of 'caching' is actually pretty important. It's not so bad to have 14K rows in a SQL table (on disk). Who cares if only one or two rows are ever needed - disk space is cheap. But the same set of data stored in memory, in-process, is kind of bad - process memory footprint is very important
I've been thinking about this on and off for the past few days. Here is where I currently stand (hint: not quite where I'd like to be). One could argue there are two use cases here to speed up:
Ideally, I would really love to solve (1), which obviously would also solve (2). It's kind of a "first impression" thing. I don't want there to be an (however misguided or superficial) impression by the end user that flux is somehow slower than, say, direct-helm and therefore "inferior". However, I can't think of an elegant solution to (1) yet :-( Here is why: I don't want to go down the same path the direct-helm plugin uses to address this issue, which is to download and cache all the charts locally, all done in the background once the repo has been added. Even if we are talking about having the 2-3 latest versions of the charts, it's still hundreds of charts cached for the bitnami repo.

Why don't I like that? Because in my mind, the "proper" way to get charts with flux is to create a HelmChart CR, wait for it to reconcile and then get the chart tarball cached locally in the cluster. I don't like this idea of having to create, say, 300 CRs when the user adds the bitnami repo. As in, at the end of this one could do

If I had to solve just (2) by itself, that I could do fairly elegantly. My preference would be to implement a garbage collector that deletes all but the N most recently accessed charts. It would be a relatively simple solution to implement and light on process footprint too. But again, I really want to solve (1), just haven't found an elegant way to do it yet. Thoughts are welcome.
I'd say this issue is totally applicable to the helm-direct plugin too. Currently, we are solving (2), but the delay is only noticeable when installing/upgrading a chart for the very first time. I've just started looking into this issue today, so I still don't have a good foundation for the whole thing. However, I'm sharing some thoughts:
I haven't had a chance to think about it carefully, but it seems quite odd to me that we do have a CR for the appRepos spinning up several jobs... and we end up calling and parsing the index.yaml by hand each time we deploy something. I don't think we should cache everything indefinitely in memory as we are currently doing; at the very least we should implement a kind of LRU cache to prevent the memory from increasing in O(#repos*avgRepoSize). So this issue is pretty similar to what I also wanted to solve (e.g. see #3677) (well, mine is less ambitious, just reducing the memory consumption in our current scenario), so I'm also really interested in the thoughts from you all here :P
+1, if possible, but there may be other ways that involve perceived impression vs actual time taken.
The premise of that thought wouldn't even enter my mind... let's just focus on helping each other improve the product; you shouldn't be worried about this, IMO.
Totally agree - this doesn't appear to be an option that can be considered. Even if we added them a few at a time, cached, then removed them, it'd still generate all the churn.
Not 100% sure about this one. That layer (flux) is there for installing packages from repositories. It's not there for displaying a browsable catalog of packages. So I don't think that layer is necessarily there to help you solve this problem. I wouldn't see an issue with a go-routine that begins extracting and caching the readme etc. of the latest version for each package (only).
Just some thoughts about perceived time: If we did have (just) the metadata of the latest chart version for each package in a repo already cached, the initial display would be without delay, and if the user changes versions, we can update some data on the page immediately (version mainly) while replacing the other metadata in the background, so that it's much less of an issue (ie. not waiting 2s while a spinner spins).
I agree with Michael on this. The HelmChart CR has a function for the HelmRelease. I do not think you are really using any of its features. I would think it is ok to bypass the CR.
I was thinking something along the same lines. In the UI, a user will select a package from the catalog. This means the UI already has all the "summary" information. helm cli is doing the same. It caches the index.yaml, and then pulls in the tar files only when needed.
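The helm-cli pattern described here - keep the index cached, pull a tarball only the first time it is actually requested - amounts to a memoized fetch. A minimal sketch (the `tarballCache` type and its `fetch` hook are hypothetical stand-ins for a real HTTP download):

```go
package main

import (
	"fmt"
	"sync"
)

// tarballCache downloads a chart tarball only the first time it is
// requested, then serves the cached bytes. fetch is a stand-in for
// the real HTTP download; all names here are hypothetical.
type tarballCache struct {
	mu    sync.Mutex
	blobs map[string][]byte
	fetch func(url string) ([]byte, error)
}

func (t *tarballCache) Get(url string) ([]byte, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if b, ok := t.blobs[url]; ok {
		return b, nil // cache hit: no network call
	}
	b, err := t.fetch(url)
	if err != nil {
		return nil, err
	}
	t.blobs[url] = b
	return b, nil
}

func main() {
	calls := 0
	c := &tarballCache{
		blobs: map[string][]byte{},
		fetch: func(url string) ([]byte, error) {
			calls++ // count simulated network fetches
			return []byte("tgz-bytes-for-" + url), nil
		},
	}
	c.Get("repo/apache-1.0.0.tgz")
	c.Get("repo/apache-1.0.0.tgz") // second request is a cache hit
	fmt.Println(calls)
}
```

Holding the mutex across the fetch serializes downloads; a production version would use per-key locking (e.g. singleflight) instead.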
Thank you everyone for your feedback. Some specific things:
I was under the impression helm-direct was solving (1) as well
Are you talking about direct-helm? I wasn't aware of any cache there that stores things in memory
I don't know about that. This issue doesn't have to do with parsing the repo index. Are you referring to data stored in postgresql or some other cache I am not aware of?
Wasn't your mind I was worried about. End user or customer or anyone in general that doesn't know how "the sausage is made", so to speak
Yes, both you and Dimitri are essentially saying that. I am going to try and work something out along these lines
Yep, I was talking about a sort of cache of the index.yaml [1] which, AFAIK, is used just in the helm-direct plugin each time it does [1] https://github.com/kubeapps/kubeapps/blob/master/pkg/chart/chart.go#L155
I meant that we may need to also cache the charts themselves even in the helm-direct plugin to avoid re-fetching the index.yaml again and again.
Gasp! I didn't realize we had a global variable map that stores a potentially unbounded number of entries that never expire. And this is in addition to the postgresql that stores the indexes too. Yikes. I thought we only had postgresql to fetch the data from, but I was wrong. You're right. This is a very bad anti-pattern and should be fixed right away!

Ahem, full disclosure: turns out I do the same with the redis cache, which I remembered at some point to fix (I vaguely remember me and Michael discussing putting a TTL on cache entries, but I dropped the ball on that), so I put a TODO in there to fix this. I promise to fix this very soon with the redis cache. I know it's bad. It's a different issue than this one, because this one merely ponders whether or not we should cache chart details for the flux plugin. Nevertheless, it just became my top priority
Guys, FYI an update: I have implemented one tentative approach where when you add a new repo source, it caches the parsed index (as it already does) but now also the details of only THE LATEST version of each chart (hopefully that's the one that would be displayed in the catalog by default, so it looks instantaneous (a cache hit)). Just FYI, the chart details are README, values and schema files. Not zipped or anything, just in plain text. So, after bitnami is added to flux, I end up with about 100 keys in Redis (one for helmrepository #66 and the rest are details for each chart):
The whole thing ends up taking about 25MB of Redis memory. Just checking if this sounds like a reasonable approach and I should continue, or try something else
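The "cache only THE LATEST version of each chart" pass described above boils down to collapsing the parsed index from all ~14K versions to one entry per distinct chart. A sketch of that selection step (the `chartVersion` type is a hypothetical stand-in for the plugin's real index model; it assumes the index lists versions newest-first, as `helm repo index` produces):

```go
package main

import "fmt"

// chartVersion is a stripped-down, hypothetical stand-in for one entry
// in a parsed helm repo index; the real plugin types differ.
type chartVersion struct {
	Version string
	URL     string
}

// latestPerChart keeps only the first listed version of each chart,
// assuming the index orders versions newest-first (the convention of
// `helm repo index`). The result is the set of chart details that
// would be pre-cached in Redis right after a repo is added.
func latestPerChart(index map[string][]chartVersion) map[string]chartVersion {
	latest := map[string]chartVersion{}
	for name, versions := range index {
		if len(versions) > 0 {
			latest[name] = versions[0]
		}
	}
	return latest
}

func main() {
	// Toy index: 2 charts, 2 versions each -> only 2 entries cached.
	index := map[string][]chartVersion{
		"apache": {{Version: "8.0.2"}, {Version: "8.0.1"}},
		"redis":  {{Version: "15.4.0"}, {Version: "15.3.2"}},
	}
	latest := latestPerChart(index)
	fmt.Println(len(latest), latest["apache"].Version)
}
```

For an index whose version ordering is not guaranteed, a semver comparison would replace the `versions[0]` shortcut.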
Wow just 25MB, that's great. This approach will speed up the user-perceived app performance, as they will see something as soon as they select something. That said, I just wonder (not meaning we should/shouldn't do it), whether we should make this "number of cached versions" configurable. Like in "latest version + latest-N previous ones" or "latest + latest-N and then LRU based on user-selection".
OK, thanks, Antonio, that's encouraging. Sounds like I'm on the right track. If I can cache 1, I can cache N, that's not too big a leap. I am considering another variant of this solution that would cache chart .tgz files instead of plain text. The chart details are pretty much text files (such as README, the chart manifest yaml, etc.) which compress REALLY well. What is 140KB uncompressed might end up being less than 10KB compressed. So caching, say, 300 charts may not be a big deal if compressed. Typical time/space tradeoff: we would save a lot in space and lose a little bit in time, as it takes more cycles to uncompress data on-the-fly. Seems like a reasonable tradeoff to me. I will get back to you when I have the numbers. BTW, this is the same thing that flux does when you create an instance of the HelmChart CRD - they cache the chart .tgz file in the cluster.
Nice work Greg! I agree, the main thing we want to solve is the initial load time, so it's much less of an issue that a user needs to wait when selecting a different version (a cache miss, which I assume is then added to the cache). Out of interest, how long does it take to cache just the first versions? (I assume it does this after the plugin is registered?)
Thank you, Michael. I have the numbers now from my latest experiment. bitnami + the 100 latest chart versions in the original format (.tgz) is 18MB, vs. 25MB uncompressed. I thought there'd be a bigger gain, but maybe the cache entry for the bitnami repo itself (the parsed model that includes metadata for all 14K versions) is the big elephant in the room, so everything else is small in comparison. Oh well, I still like that approach.
and the last one ends at
So that's a minute. Obviously this work can and will benefit from parallelization. I will have several (5 or 10) workers doing this concurrently. So that number is going to change
Here are a few more numbers from my MacBook I just collected:
Flux plugin GetAvailablePackageDetail (cache hit) on my local branch with the implementation described above:
Flux plugin GetAvailablePackageDetail (cache miss):
Looks OK to me
…in (#3899)

* step 1
* step 2
* step 3: add basic auth support to chart cache. Next: TLS and http proxy config
* step 3a: clean up bootstrap of cache
* step 3b: use helm HttpGetter for downloading charts like flux does
* step 3c: get rid of HashSet in favor of sets.String
* step 4: added TLS support in chart cache
* step 4: http proxy support in chart cache. Next: integration test that should surface some RBAC limitations
* step 5: added integration test for private repo with basic auth. Next: transient HTTP failures for chart cache
* step 6: handle transient HTTP errors
* step 7: unit tests don't bootstrap chart cache unless they actually need it
* step 7a: fix an intermittent unit test failure
* clean up minor TODOs
* minor TODOs
* fixed minor TODOs
* + integration test for scenario when GetAvailablePackageDetail fails due to RBAC settings
* move ChartCache into cache package
* small performance optimization in chart cache
* some cleanup of resync() logic
* clean up resync unit test
* more resync-related goodness
* small fixes and cleanup
* more cleanup
* minor fixes
* small cleanup
* make use of nifty statuserror.FromK8sError() func
* moved PackageAppVersionsSummary func and unit test into pkg so that it can be used by multiple plug-ins
* add support for plugin config YAML
* moved ResourceRefsFromManifest func into pkg so that it can be re-used by multiple plug-ins. 2 unit tests are commented out, I have yet to figure out how to fix them; also, some feedback from Antonio
* - couple unintended changes
* update comment
Just moving from the PR discussion to an actual issue, as requested by @gfichtenholt, where Greg said:
This was in response to my comment that: