Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIP for download cache (charts & provenance files) #185

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions hips/hip-0013.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
---
hip: "0013"
title: "Download-cache for charts & provenance files"
authors: [ "Matt Farina <matt.farina@suse.com","Christian Richter <crichter@suse.com>" ]
created: "2021-20-20"
type: "feature"
status: "draft"
---

## Abstract

This document describes the introduction of a download-cache for charts and provenance files to helm. This will speed up repeated operations and enable offline access to charts in repositories that have been previously pulled.

## Motivation

There are three primary motivations for leveraging a cache the impact [both the application operator and the application distributor](https://github.com/helm/community/blob/main/user-profiles.md).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: that* impact


When repeatedly installing, upgrading, or showing information about a chart version within a repository, Helm will download the chart each time. When downloading the chart Helm currently puts the chart into the repositories cache directory (where index files are cached) but never leverages the cache. Helm cannot use the chart version in the cache because the identifier is the name and charts version (e.g., `wordpress-1.2.3.tgz`). A name and version is insufficient as the same chart and version could come from two or more different repositories and have different content.

The first motivation is to make the cache useful by handling it in a manner where the downloaded chart's identifier can be make unique enough to be useful.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: made*


The second motivation is around use and experience. Downloading the chart version each time an operation is run on a chart in a repository can lead to wasted time. For example, if one run two show commands (i.e., `helm show readme example-repo/foo` and `helm show values example-repo/foo`) the chart will be downloaded twice from the repository. This can be avoided with the use of a cache.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"if one run two show" -- not sure I understand

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... I think it's supposed to be "If one runs two show commands".


The third motivation is for Helm repository owners. Cached charts means fewer downloads from repositories. As the Helm maintainers saw when operating their own repository, bandwidth use from a Helm repository can become large. There is a cost to that for the host. Reducing that number can have a real financial impact on the host.

## Rationale

The cache proposed here is designed to work with programmatic use through the SDK and use existing information about charts for identifiers.

### Digest Identifier

The digest of a chart, used in provenance files and repository indexes, provides a unique identifier about a chart. A digest for a chart is a sha512 hash. This hash will be the identifier in the cache.

Using the digest is both unique and useful for keeping the size of the cache minimal. If the same chart is in two repositories, such as when a chart is copied from an upstream repository to an organizations internal repository, the hash is the same and the cached contents are the same. This means a cache based on this hash can be content addressable.

### SDK Usage

The Helm SDK is used by other applications to interact with instances running in clusters, charts, and repositories. Those applications should not be coupled to the way in which the Helm Client works with its storage (i.e. the file system). Applications using the SDK may want to have the cache reside in memory or in another system (e.g., object storage). The design of this cache system will use an interface and provide an implementation that uses the file system for the Helm Client cache.

## Specification

The `ChartDownloader` will be extended to provide the caching features. To do this, two new properties will be added to the `ChartDownloader` to accommodate a cache for both the chart archives and the provenance files. These can be stored in two separate locations.

When the `ChartDownloader` resolves a chart in a repository it will also obtain the digest from the repository index. This piece of information is already available in the index. Using the digest, `ChartDownloader` will check if the chart archive and provenance file are already in the cache. If the chart archive is not in the cache the `ChartDownloader` will retrieve it and place it in the cache prior to returning as before. The same will happen for provenance files. If no provenance file is found on download the cache will be marked in a manner to note that none was available. If the files are in the cache they will be returned rather than downloading the files again.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's say that an attacker compromises a chart repository index file, but not the packages (a possible scenario, given Helm's design).

What if the attacker replaced the correct SHA with a fake SHA that pointed to another package? Might this result in a way for an attacker to replace the user's intended package with another package (provided that other package is in the user's cache)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an attacker can compromise the index file they could point a chart version to a different file to download install. There is already a vector there if there is no provenance file.

Using just a the hash could lead to an issue if someone modifies it. Since the index file contains more information than the hash, we could use that metadata (e.g. name and version) to verify it is the right package after looking it up in the cache. If that does not match we could download the file listed in the index.

Would that work?


The [cache provided for the Helm client](https://github.com/rancher/sandbox/gofilecache) is based on the Go build cache and uses the file system. It uses the first two characters of the hash as the top directory followed by the rest of the hash for the second level directory. This is similar to the way in with Git stores objects on disk. Both the chart and provenance caches will live within the cache directory Helm already uses.

The Helm Client would have two new environment variables and flags to specify the locations for the provenance files and chart archive cache. These would follow the same format as the current environment variable use for individual caching locations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity, can you list those environment variable names here? That way users can refer to this document instead of looking up the source code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure that allowing caches to be outside of a single top-level directory is a good idea, and I don't see this pattern replicated in similar systems. This is hard to debug, hard to trace, and could lead to bewildering behavior when multiple users/tools are attempting to share the same cache.

Can you please explain the rationale for allowing this very specific feature of Helm to be configured?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helm currently has HELM_CACHE_HOME and HELM_REPOSITORY_CACHE. The HELM_REPOSITORY_CACHE is where index files and charts are downloaded to. The two additional ones were to continue the pattern of allowing them to be overridden.

HELM_CACHE_HOME currently provides a default base for caching.

HELM_REPOSITORY_CACHE, by default, is relative to HELM_CACHE_HOME.

If we allow the current cache to be changed should the new caches be able to be changed, too? We just tried to follow the existing pattern. It is easy enough not to.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a command line flag for overriding download chart name? e.g. my complex service may have an Istio gateway, then some other gateway. and both of them named their charts 'gateway'.


## Backwards compatibility

Helm charts are untouched in their structure. There is no impact to the charts themselves.

Index files, where digests are looked up, remain untouched. There is no impact to the index files.

The existing Helm cache directory is used for the cache storage. This location needs to exist for storing index files locally. There is no impact to the cache location.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you introduce new env vars for configuring cache dir, it is not clear that this is true.


Those who use the `ChartDownloader` from the Helm SDK will have two new properties to define when instantiating it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these properties optional?

What do these properties do when set?

What happens if the user does not set these new properties?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question. I feel like this needs some clarification to show that it is not a breaking change. In order to not be breaking, existing userland code must not need to be changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the chart downloader you'll see there is a property for the Repository Cache. This is the specific cache for the repository and not the cache location in general. See my other comment on the environment variables for the way those are setup.

So, the idea is to pass in the locations for the new cache here following the existing pattern the repository cache uses. The design is simply based to follow existing patterns. We might want to discuss this in a dev call.

The new properties would need to be optional. If they are not set a default location would then need to be used.


## Security implications

If a chart archive or a provenance file is in the cache it will be used instead of being downloaded. This means that a malicious user could place a chart archive in the cache in the location where the good one should be. When Helm looks up the chart listed in the repository it would retrieve the the malicious chart from the cache instead.

For this to work the malicious user would need access to the system Helm is running on with write access to the cache. The current cache directory does not provide write access to those who are not the user or a system admin. That would mean the malicious user would need access as the user or a system admin to exploit this issue.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily true. A malicious user needs to only (a) compel a user to fetch another chart, or (b) assume the user has already fetched the corrupted chart.

For example, say there is a chart foo-1.2.3 that has a security vulnerability. The maintainers update and release foo-1.2.4 to fix. The attacker can assume that many of the users who update from 1.2.3 to 1.2.4 may also have the compromised 1.2.3 version in their cache. So an attacker could alter the SHA in the index to point to the older compromised version, and in so doing potentially prevent the user from fetching the updated version, instead using the known-bad version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example here.

If an attacker can alter the index file they can already point someone to a bad file to install. I think we need to add signing to them but that's another HIP.

As I noted earlier, we could inspect the chart to make sure it is the right name and version. Would that work?


To further mitigate the issue, `ChartDownloader` will check that the archive conforms to the digest that is being requested. If it does not the `ChartDownloader` will retrieve it from the source.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is checked using the cached index file, correct? Could it be possible for an attacker to modify both digests (the .tgz in the cache, and the string in the index) to forge a "valid" chart digest?

I think there is an underlying assumption that the --sign | --verify flag will check for malicious intent, but I just wanted to check and clarify how can we mitigate a local attack vector. Or rather, if it's an unreasonable attack vector to mitigate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution does not mitigate the above.

One could attempt to match the local cached metadata (chart name and version, for instance) with the info in the index. That would help, but would not totally solve the problem.

Checking that the actual SHA is correct would otherwise require Helm to fetch the entire binary package and digest it, which completely defeats the purpose of a cache.

A provenance file solves the problem if they are used. But otherwise, there is likely a whole in the security model of this feature.


This does mean that if the source chart archive does not have the same digest the one listed in the index it will be retrieved every time and bypass the cache.
Copy link

@TBBle TBBle Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming in late, but I'd suggest the least-surprise approach here would be to fail in this case, rather than accept the source archive with an unexpected digest. Motivated by helm/helm#7623 where we're not doing this correctly now.

As I mentioned in my comment there, I would implement this HIP by never bypassing the cache. I'd always have the downloader (OCI Registry or Chart Repository) downloading into the cache, hashing as it's written to ensure that the cache entry's content matches its address before marking the entry as usable; and separately consumers would always pull from the cache, hashing as they read to ensure the cache entry's content matches its address before using the data so-read.

For the 'no disk access' case, the "cache" there could be in-memory blobs or streaming.

This is mostly based on what I've seen of the OCI container image layer downloading/extracting mechanisms in containerd, using its "content store" the same way we're talking about a "cache" here.

Of course, if I've misunderstood the intent of this HIP, that's fine too.


## How to teach this

The Helm client will have a flag to bypass the cache. This will be in the client documentation alongside the other flags.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again for posterity, can we give that flag a name so this document can be self-describing?


## Reference implementation

An implementation of the design described in this HIP is under development [here](https://github.com/dragonchaser/helm/tree/add_download_cache).

## Rejected ideas

There are two rejected ideas with regard to the cache.

### Using The Current Repository Cache

Charts are currently downloaded to the `$HELM_CACHE_HOME/repository` directory in the form `name-1.2.3.tgz`. This directory could be used to store chart archives and provenance files using the digest as the name. This is not ideal as users debugging issues would have many files and would not be able to easily identify the repositories. It would also make clearing the cache of charts and provenance files while preserving the repository cache more difficult.

A better solution is to use the same caching methods as other disk based object caches.

### Placing Files In The ~/.cache Directory

One suggestion was to use the `~/.cache` directory which is common on POSIX systems to hold the cache. This does not follow the current caching directory used by Helm that works across platform. A better solution is to use the existing Helm cache location.

## Open issues

There are no open issues