-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIP for download cache (charts & provenance files) #185
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
--- | ||
hip: "0013" | ||
title: "Download-cache for charts & provenance files" | ||
authors: [ "Matt Farina <matt.farina@suse.com","Christian Richter <crichter@suse.com>" ] | ||
created: "2021-20-20" | ||
type: "feature" | ||
status: "draft" | ||
--- | ||
|
||
## Abstract | ||
|
||
This document describes the introduction of a download-cache for charts and provenance files to helm. This will speed up repeated operations and enable offline access to charts in repositories that have been previously pulled. | ||
|
||
## Motivation | ||
|
||
There are three primary motivations for leveraging a cache the impact [both the application operator and the application distributor](https://github.com/helm/community/blob/main/user-profiles.md). | ||
|
||
When repeatedly installing, upgrading, or showing information about a chart version within a repository, Helm will download the chart each time. When downloading the chart Helm currently puts the chart into the repositories cache directory (where index files are cached) but never leverages the cache. Helm cannot use the chart version in the cache because the identifier is the name and charts version (e.g., `wordpress-1.2.3.tgz`). A name and version is insufficient as the same chart and version could come from two or more different repositories and have different content. | ||
|
||
The first motivation is to make the cache useful by handling it in a manner where the downloaded chart's identifier can be make unique enough to be useful. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. typo: made* |
||
|
||
The second motivation is around use and experience. Downloading the chart version each time an operation is run on a chart in a repository can lead to wasted time. For example, if one run two show commands (i.e., `helm show readme example-repo/foo` and `helm show values example-repo/foo`) the chart will be downloaded twice from the repository. This can be avoided with the use of a cache. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "if one run two show" -- not sure I understand There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh... I think it's supposed to be "If one runs two show commands". |
||
|
||
The third motivation is for Helm repository owners. Cached charts means fewer downloads from repositories. As the Helm maintainers saw when operating their own repository, bandwidth use from a Helm repository can become large. There is a cost to that for the host. Reducing that number can have a real financial impact on the host. | ||
|
||
## Rationale | ||
|
||
The cache proposed here is designed to work with programmatic use through the SDK and use existing information about charts for identifiers. | ||
|
||
### Digest Identifier | ||
|
||
The digest of a chart, used in provenance files and repository indexes, provides a unique identifier about a chart. A digest for a chart is a sha512 hash. This hash will be the identifier in the cache. | ||
|
||
Using the digest is both unique and useful for keeping the size of the cache minimal. If the same chart is in two repositories, such as when a chart is copied from an upstream repository to an organizations internal repository, the hash is the same and the cached contents are the same. This means a cache based on this hash can be content addressable. | ||
|
||
### SDK Usage | ||
|
||
The Helm SDK is used by other applications to interact with instances running in clusters, charts, and repositories. Those applications should not be coupled to the way in which the Helm Client works with its storage (i.e. the file system). Applications using the SDK may want to have the cache reside in memory or in another system (e.g., object storage). The design of this cache system will use an interface and provide an implementation that uses the file system for the Helm Client cache. | ||
|
||
## Specification | ||
|
||
The `ChartDownloader` will be extended to provide the caching features. To do this, two new properties will be added to the `ChartDownloader` to accommodate a cache for both the chart archives and the provenance files. These can be stored in two separate locations. | ||
|
||
When the `ChartDownloader` resolves a chart in a repository it will also obtain the digest from the repository index. This piece of information is already available in the index. Using the digest, `ChartDownloader` will check if the chart archive and provenance file are already in the cache. If the chart archive is not in the cache the `ChartDownloader` will retrieve it and place it in the cache prior to returning as before. The same will happen for provenance files. If no provenance file is found on download the cache will be marked in a manner to note that none was available. If the files are in the cache they will be returned rather than downloading the files again. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So let's say that an attacker compromises a chart repository index file, but not the packages (a possible scenario, given Helm's design). What if the attacker replaced the correct SHA with a fake SHA that pointed to another package? Might this result in a way for an attacker to replace the user's intended package with another package (provided that other package is in the user's cache)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If an attacker can compromise the index file they could point a chart version to a different file to download install. There is already a vector there if there is no provenance file. Using just a the hash could lead to an issue if someone modifies it. Since the index file contains more information than the hash, we could use that metadata (e.g. name and version) to verify it is the right package after looking it up in the cache. If that does not match we could download the file listed in the index. Would that work? |
||
|
||
The [cache provided for the Helm client](https://github.com/rancher/sandbox/gofilecache) is based on the Go build cache and uses the file system. It uses the first two characters of the hash as the top directory followed by the rest of the hash for the second level directory. This is similar to the way in with Git stores objects on disk. Both the chart and provenance caches will live within the cache directory Helm already uses. | ||
|
||
The Helm Client would have two new environment variables and flags to specify the locations for the provenance files and chart archive cache. These would follow the same format as the current environment variable use for individual caching locations. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For posterity, can you list those environment variable names here? That way users can refer to this document instead of looking up the source code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not entirely sure that allowing caches to be outside of a single top-level directory is a good idea, and I don't see this pattern replicated in similar systems. This is hard to debug, hard to trace, and could lead to bewildering behavior when multiple users/tools are attempting to share the same cache. Can you please explain the rationale for allowing this very specific feature of Helm to be configured? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Helm currently has
If we allow the current cache to be changed should the new caches be able to be changed, too? We just tried to follow the existing pattern. It is easy enough not to. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a command line flag for overriding download chart name? e.g. my complex service may have an Istio gateway, then some other gateway. and both of them named their charts 'gateway'. |
||
|
||
## Backwards compatibility | ||
|
||
Helm charts are untouched in their structure. There is no impact to the charts themselves. | ||
|
||
Index files, where digests are looked up, remain untouched. There is no impact to the index files. | ||
|
||
The existing Helm cache directory is used for the cache storage. This location needs to exist for storing index files locally. There is no impact to the cache location. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But if you introduce new env vars for configuring cache dir, it is not clear that this is true. |
||
|
||
Those who use the `ChartDownloader` from the Helm SDK will have two new properties to define when instantiating it. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are these properties optional? What do these properties do when set? What happens if the user does not set these new properties? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same question. I feel like this needs some clarification to show that it is not a breaking change. In order to not be breaking, existing userland code must not need to be changed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you look at the chart downloader you'll see there is a property for the Repository Cache. This is the specific cache for the repository and not the cache location in general. See my other comment on the environment variables for the way those are setup. So, the idea is to pass in the locations for the new cache here following the existing pattern the repository cache uses. The design is simply based to follow existing patterns. We might want to discuss this in a dev call. The new properties would need to be optional. If they are not set a default location would then need to be used. |
||
|
||
## Security implications | ||
|
||
If a chart archive or a provenance file is in the cache it will be used instead of being downloaded. This means that a malicious user could place a chart archive in the cache in the location where the good one should be. When Helm looks up the chart listed in the repository it would retrieve the the malicious chart from the cache instead. | ||
|
||
For this to work the malicious user would need access to the system Helm is running on with write access to the cache. The current cache directory does not provide write access to those who are not the user or a system admin. That would mean the malicious user would need access as the user or a system admin to exploit this issue. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not necessarily true. A malicious user needs to only (a) compel a user to fetch another chart, or (b) assume the user has already fetched the corrupted chart. For example, say there is a chart There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the example here. If an attacker can alter the index file they can already point someone to a bad file to install. I think we need to add signing to them but that's another HIP. As I noted earlier, we could inspect the chart to make sure it is the right name and version. Would that work? |
||
|
||
To further mitigate the issue, `ChartDownloader` will check that the archive conforms to the digest that is being requested. If it does not the `ChartDownloader` will retrieve it from the source. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is checked using the cached index file, correct? Could it be possible for an attacker to modify both digests (the .tgz in the cache, and the string in the index) to forge a "valid" chart digest? I think there is an underlying assumption that the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This solution does not mitigate the above. One could attempt to match the local cached metadata (chart name and version, for instance) with the info in the index. That would help, but would not totally solve the problem. Checking that the actual SHA is correct would otherwise require Helm to fetch the entire binary package and digest it, which completely defeats the purpose of a cache. A provenance file solves the problem if they are used. But otherwise, there is likely a whole in the security model of this feature. |
||
|
||
This does mean that if the source chart archive does not have the same digest the one listed in the index it will be retrieved every time and bypass the cache. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Coming in late, but I'd suggest the least-surprise approach here would be to fail in this case, rather than accept the source archive with an unexpected digest. Motivated by helm/helm#7623 where we're not doing this correctly now. As I mentioned in my comment there, I would implement this HIP by never bypassing the cache. I'd always have the downloader (OCI Registry or Chart Repository) downloading into the cache, hashing as it's written to ensure that the cache entry's content matches its address before marking the entry as usable; and separately consumers would always pull from the cache, hashing as they read to ensure the cache entry's content matches its address before using the data so-read. For the 'no disk access' case, the "cache" there could be in-memory blobs or streaming. This is mostly based on what I've seen of the OCI container image layer downloading/extracting mechanisms in containerd, using its "content store" the same way we're talking about a "cache" here. Of course, if I've misunderstood the intent of this HIP, that's fine too. |
||
|
||
## How to teach this | ||
|
||
The Helm client will have a flag to bypass the cache. This will be in the client documentation alongside the other flags. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again for posterity, can we give that flag a name so this document can be self-describing? |
||
|
||
## Reference implementation | ||
|
||
An implementation of the design described in this HIP is under development [here](https://github.com/dragonchaser/helm/tree/add_download_cache). | ||
|
||
## Rejected ideas | ||
|
||
There are two rejected ideas with regard to the cache. | ||
|
||
### Using The Current Repository Cache | ||
|
||
Charts are currently downloaded to the `$HELM_CACHE_HOME/repository` directory in the form `name-1.2.3.tgz`. This directory could be used to store chart archives and provenance files using the digest as the name. This is not ideal as users debugging issues would have many files and would not be able to easily identify the repositories. It would also make clearing the cache of charts and provenance files while preserving the repository cache more difficult. | ||
|
||
A better solution is to use the same caching methods as other disk based object caches. | ||
|
||
### Placing Files In The ~/.cache Directory | ||
|
||
One suggestion was to use the `~/.cache` directory which is common on POSIX systems to hold the cache. This does not follow the current caching directory used by Helm that works across platform. A better solution is to use the existing Helm cache location. | ||
|
||
## Open issues | ||
|
||
There are no open issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: that* impact