Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: chart repository v2 (usage of nested index and it's impact on operations like search, fetch, repo update, install) #3557

Open
ankushchadha opened this issue Feb 22, 2018 · 24 comments

Comments

Projects
None yet
10 participants
@ankushchadha
Copy link

commented Feb 22, 2018

Scale has atleast two dimensions - one is the # of charts and second is the variety and the number of devices running the helm cli.

With a single index file, there are following problems -
What we observed is that with 100k charts, the memory being consumed while adding a helm repository via cli (helm repo add) was 1GB+. A good chunk of the memory is being used during conversion of yaml to json to object.
In addition, any update related operation on the index.yaml will take more time as we scale more. Example: Updating absolute urls.

One way to solve this problem is to partition the index file and follow the nested indexing approach. The most popular approach is -

  1. A topmost index.yaml file that only includes the chart names
  2. Subfolders based on the name of the cart.
  3. A sub-index file under every chart subfolder that includes following section
  • latest (in case the version is not mentioned during helm fetch)
  • versions section (includes all versions of the chart that is part of the repository)
  • metadata (description, giturls, authors, etc) - This can even be moved further down and is debatable.

Example:
repo1:
index.yaml (no metadata such as description, etc)
/chartName1/index.yaml (includes latest and versions)
/chartName2/index.yaml
/chartName3/index.yaml

@jascott1 jascott1 added the proposal label Feb 23, 2018

@elioengcomp

This comment has been minimized.

Copy link

commented Feb 23, 2018

Once the index file is partitioned and this folder structure is introduced, it will require some sort of push operation on the CLI to automate and hide this layout enforcement from the user.

@jdolitsky

This comment has been minimized.

Copy link
Member

commented Feb 25, 2018

@elioengcomp I'm not sure why the addition of this format requires a push operation, can you elaborate on that? I may be misunderstanding.

@ankushchadha few ideas to add here:

  1. This should become apiVersion: v2, so that Helm CLI can maintain backward compatibility, and support both v1 and v2 formats for the time being
  2. The sub-index file should have a different name, such as versions.yaml so as not to be confused with index.yaml.
  3. The URLs for chart names should be prefixed with /charts. For example, /charts/chartName1/versions.yaml instead of /chartName1/versions.yaml. This will allow for people to be more flexible with ingress rules etc.
  4. The chart .tgz URLs, if relative, should be relative to the chart path. For example, if a URL of x/y/z/mychart-0.1.0.tgz is listed in sub-index file /charts/mychart/versions.yaml, the Helm CLI should resolve to /charts/mychart/x/y/z/mychart-0.1.0.tgz
  5. The top level index.yaml should potentially contain, at the very least, metadata regarding the latest version of each chart. This way, running helm search myrepo/ will not require downloading more than one file.
  6. The helm search command should be modified to prevent potentially downloading hunderds/thousands of sub-index files at one time. If we agree on this, I can open a separate proposal issue. Instead of searching by substring, we should add positional arguments. For example, instead of helm search myrepo/ -l, if you want to use the -l flag, you will have to do something like helm search myrepo mychart -l.

I think we need to be careful about how we approach #5. Although it will take longer to get to 100k unique chart names vs. 100k chart versions (individual .tgzs), it will happen.

Thoughts?

@elioengcomp

This comment has been minimized.

Copy link

commented Feb 26, 2018

@jdolitsky I think it would be easier to perform indexing operations and to maintain the repository if all packages follow a well defined repository layout, something like /charts/<chartName>/<chartName>-<version>.tgz.

In that case, we may need a mechanism to enforce that layout and hide its complexity from the user for every new package installed. In this case, an operation like helm push myrepo <mychart>-<version>.tgz that automatically uploads the chart to the right location would be handful.

Another alternative is the usage of some relative path approach like you described in item 4. But that is going to make merging repositories a more complicated task. We may need to merge repositories to make virtual repositories possible.

Best.
Elio

@jdolitsky

This comment has been minimized.

Copy link
Member

commented Feb 26, 2018

@elioengcomp I'm not opposed to the idea, but I think that this will become difficult considering the number of options for hosting a chart repo (GitHub, S3, Artifactory, ChartMuseum).

We would need to agree on an API spec and auth mechanism, otherwise support tons of different options. These push operations might be more appropriate as Helm plugins.

@mattfarina can you weigh in?

@bacongobbler bacongobbler added this to the Upcoming - Major milestone Feb 26, 2018

@ljnelson

This comment has been minimized.

Copy link

commented Mar 1, 2018

In terms of prior art in this space, you may want to look at the details of how (the battle-tested, scalable) Maven Central (and Maven repositories in general) is organized and built, since with this proposal chart repositories start to resemble it greatly. Of note is that a Maven repository is not required to have search functionality itself (it's usually broken out as a separate implementation-specific concern; see search.maven.org for one such implementation).

@jdolitsky

This comment has been minimized.

Copy link
Member

commented Mar 1, 2018

@elioengcomp I think we need to table the discussion around a push command for now. Or lets open up a separate issue on this. I think it will be a long enough discussion enough just agreeing on the spec.

@bacongobbler in terms of timing (I see you have added the "Major" milestone)- since the index file API is versioned, could we technically get this change into Helm 2 so that we can start working with the server side of things? apiVersion: v1 could remain the default, then perhaps in Helm 3 the default becomes apiVersion: v2?

@bacongobbler

This comment has been minimized.

Copy link
Member

commented Mar 1, 2018

As far as I recall, technically speaking the only reason index.yaml really exists today is for two things: helm search, and for hosting charts on locations other than the chart repository for redundancy purposes. I'd love to know if I'm wrong :)

For helm search, after helm repo update is called, index.yaml is fetched from the repository for Helm to perform local "lookups" on that cached index. This is also how helm-whatup works; it performs a version diff against index.yaml vs what's in the cluster.

Regarding hosting charts across multiple servers, I believe that was a use case we tried to keep open but nobody appeared interested in the idea (or at least, we haven't actually seen any interest from the community on this). To explain how the implementation was designed, helm repo update fetches down the index from the repository. helm install then looks up the packaged chart listed in the chart's URLs field, fetches the package, verifies (if --verify is present and the chart is signed), serializes it and finally ships it to tiller.

That brings up the question: Are chart repository administrators not concerned about hosting the index and the charts on a single service? IOW, are chart repository administrators not too concerned with redundant "index" servers? e.g. DockerHub has a "gateway" endpoint that redirects docker pull requests to a geo-local CDN hosting the actual layers and manifests.

I bring this up to hopefully bring some context into my next thought...

Of note is that a Maven repository is not required to have search functionality itself (it's usually broken out as a separate implementation-specific concern; see search.maven.org for one such implementation).

That might be an interesting proposal to consider for Helm 3: what if we made helm search completely optional, and got rid of index.yaml entirely and delegated that to other applications such as kubeapps (or in CodeFresh and JFrog's situations, their platforms)? Are there other parts of the system that rely on index.yaml to function?

@bacongobbler

This comment has been minimized.

Copy link
Member

commented Mar 1, 2018

@bacongobbler in terms of timing (I see you have added the "Major" milestone)- since the index file API is versioned, could we technically get this change into Helm 2 so that we can start working with the server side of things? apiVersion: v1 could remain the default, then perhaps in Helm 3 the default becomes apiVersion: v2?

I don't see why not. I flagged this as a discussion for helm 3, but if there was a way we can tackle this for Helm 2 I don't see why not.

Then again, see my comment for further thoughts :)

@bacongobbler bacongobbler removed this from the Upcoming - Major milestone Mar 1, 2018

@bacongobbler

This comment has been minimized.

Copy link
Member

commented Mar 1, 2018

I think we need to table the discussion around a push command for now.

Yes, agreed. That should be discussed in another ticket.

@jainishshah17

This comment has been minimized.

Copy link

commented Mar 2, 2018

I agree with index structure proposed by @elioengcomp @ankushchadha. For helm search once we have new indexing mechanism. We add API to search chart in the repository and make it server-side operation (Always updated result as compare to the local index.yaml).

@ankushchadha

This comment has been minimized.

Copy link
Author

commented Mar 2, 2018

@jdolitsky - Can you elaborate 4?

5 & 6 - The more metadata we add to the topmost index file, the more problems we encounter with scale. This is exactly what we should avoid. Maybe search is a different topic to discuss based on v2 of an index.

Let's take an example

helm repo update
helm search stable/artifactory

  1. If artifactory is missing from the repo level index.yaml file (that was downloaded as part of helm repo update), then 0 results
  2. If artifactory is part of the repo level index.yaml, then it will download versions.yaml and then check if latest or specific version exists (based on the search query)

This is one of the ways to solve the problem on the client side. I am sure there are more ways to solve this problem on the client side. The other way is to push the logic to the server side as indicated by @jainishshah17

Thoughts?

@jdolitsky

This comment has been minimized.

Copy link
Member

commented Mar 2, 2018

@ankushchadha for number 4, Im just trying to say that the spec does not require charts to be in a specific location. But given Matt's comments, perhaps it should. This way we can get rid of the URLs section entirely. This would make the v2 spec even simpler.

On that note, can we update the title of this issue to something like "Chart Repository v2 API"?

@bacongobbler -

That brings up the question: Are chart repository administrators not concerned about hosting the index and the charts on a single service?

I havent heard of anybody concerned about this. There might be a desire for people to store charts in different locations on the backend, but that should be up the routing service to resolve for you. Throw some crazy CDN setup on top of it if you need. One hostname.

Are there other parts of the system that rely on index.yaml to function?

I think these: helm fetch helm inspect helm install

But if the path to chart .tgz's are tied to a path spec, it should not matter.

@jainishshah17 -

add API to search chart in the repository and make it server-side operation

Yes! +100 to moving search to server. We just need to agree on a spec.

What if helm search <repo>/<query> maps to GET <repo_url>/charts?q=<query>

@ankushchadha ankushchadha changed the title Proposal: partition the index.yaml file to tackle scale Proposal: chart repository v2 (usage of nested index and it's impact on operations like search, fetch, repo update, install) Mar 5, 2018

@mattfarina

This comment has been minimized.

Copy link
Collaborator

commented Mar 8, 2018

Several things come to mind...

I'll first touch on some considerations on the use of index.yaml files at all:

  1. I know there is a desire to remove index.yaml files and query an endpoint on the server for searches. This won't work for those using object storage or a static webserver which are popular options to host repositories.
  2. Search currently works across multiple repositories. For example, I could have 3 repositories (stable, incubator, and private). When I perform a search it is able to search all of them at the same time without leaking details of my search to any of them. That means my search for a private package doesn't have information leaked to a public service. It helps to avoid accidental information leakage that should not happen.

These kinds of situations need a sane and safe path forward.

As for the breakup of the index.yaml file...

  1. Right now helm fetch needs a full URL (e.g, helm fetch https://kubernetes-charts.storage.googleapis.com/mariadb-2.1.7.tgz). A package can be fetched without adding a repository. When fetching a package it's not possible to ask for the latest or a semver range without fetching the registry index in theory. Imagine doing something like helm fetch https://kubernetes-charts.storage.googleapis.com/mariadb@^2.1 without having added a registry. With sub-index files this could be possible by fetching the sub-index file for mariadb, figure out the version, and fetch the right file. All without needing to add the whole registry. This would work for a static web server and wouldn't share any extra information with the host.
  2. I think the breakup of the index.yaml file would work for search, install, fetch, and inspect as they work today. The one wrinkle is exposing search details with the -l flag. For example, if I did a helm search chartmuseum -l with the new structure it would find it using the simpler index.yaml file and then need to download the partial for chartmuseum (at least the first time or if there is a change). This would expose to the host that someone looked up data on version for chartmuesum. If the client didn't perform a fetch almost immediately the host could deduce that it was used for search. Since everything is still scoped to fetching from just the registry that has the chart there isn't a concern for accidental search leakage while searching multiple repos. There would be offline effects and some additional latency due to network effects. Not sure -l is used that often to have a major impact.

At the moment I don't see any v2 blockers for the index changes.

@mattfarina

This comment has been minimized.

Copy link
Collaborator

commented Mar 8, 2018

@jdolitsky

What if helm search / maps to GET <repo_url>/charts?q=

There would be a couple effects....

  1. Object Storage and static web servers couldn't support this form of search.
  2. It doesn't allow searching multiple repositories at once which I can do today. If I have 3 repositories (stable, incubator, and private) I can currently search them all with helm search foo. That would change.

Are chart repository administrators not concerned about hosting the index and the charts on a single service?

I'm starting to think we should query (a survey) helm users to figure out their needs. Gather some data instead of assuming.

@jdolitsky

This comment has been minimized.

Copy link
Member

commented Mar 22, 2018

Hi all, please see helm/community#18 for a continuation of this discussion

@saracen

This comment has been minimized.

Copy link

commented Mar 23, 2018

I'm just catching up on some of the issues here. But regarding search... why not make it a client-side and server-side ability?

A dynamically generated index.yaml could provide the helm client with hints about what it supports. It could tell the Helm client that it supports search and how it supports it. The same could perhaps be done to inform Helm of the repository's folder structure.

It's likely that people would start out with small static hosted indexes (easily searchable client-side) and then out grow it when it no longer meets their needs.

@jdolitsky

This comment has been minimized.

Copy link
Member

commented Mar 23, 2018

@saracen thats a good point. if they reach a certain point where they are at the scale that this is an issue, their index.yaml can have a remote_search: true flag. And this will be the default for indexes provided by artifactory/chartmuseum

@saracen

This comment has been minimized.

Copy link

commented Mar 23, 2018

@jdolitsky A step further would be service discovery. For example, Github's pages have this in their head section:

<link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="GitHub">

This points to their opensearch document. That document tells browsers and other clients how search is implemented and at what URLs.

You'll see that there's some templating in the URL, It's pretty powerful. People may wish to continue to use a completely static hosted solution and rely on myhelmchartindexer.com to provide search.

@mattfarina

This comment has been minimized.

Copy link
Collaborator

commented Apr 12, 2018

I've been doing a little analysis of memory usage and parsers. The info is up at https://github.com/mattfarina/yamlbench

@DarthFennec

This comment has been minimized.

Copy link

commented May 23, 2018

I figured I'd add my thoughts on the memory efficiency issue.

Would there be much downside to switching to an index.json instead of a YAML? I don't think there's anything in the index that can't be properly represented in JSON, and index files aren't exactly meant to be read or edited directly by humans anyway (which afaik is the main reason to use YAML). I'm seeing encoding/json in @mattfarina's yamlbench, and it seems to be a good few times more efficient than any of the YAML parsers, and there are plenty of parsers that are even more efficient (easyjson, ffjson, json-iterator, etc).

The way the current YAML library (ghodss/yaml) appears to work is, read the the entire index into memory, use go-yaml to unmarshal to an object tree, walk the entire tree to build a new cleaned-up tree, use encoding to marshal the new tree to an in-memory JSON string, then use encoding again to unmarshal that string to an IndexFile. No wonder it's so inefficient, there are five different copies of the entire document stored in memory at once. Of course, just switching to JSON would remove most of these steps, and cut things down significantly. But couldn't we go further?

It seems to me that many of the situations where index files are read are simple single-pass lookups. The file is loaded into a byte array, parsed in its entirety to an object tree, then one chart is picked out by name and its URLs are returned, and everything else is GC'd. Seems like a bit of a waste, no? Why not stream the file into a fixed buffer, parse each chart entry as they appear, ignore entries until we find the one we're looking for, then close the stream? That's constant memory usage, regardless of the size of the file. Granted, this can't be applied when you actually need multiple passes and random access to the index data, but even in those situations, a decent amount of memory might be saved by parsing a stream directly instead of reading the entire file to a byte array (which is how it's done right now).

Those are my thoughts, anyway. I noticed that most of the conversation here seems to be focused on how to deal with issues caused by splitting indexes or removing them entirely, and I figured I'd voice a different approach.

@bacongobbler

This comment has been minimized.

Copy link
Member

commented May 24, 2018

Yes I think that's in line with what @mattfarina's been proposing in past helm dev calls (and what kickstarted the yamlbench repo). Adding index.json support to the chart repository spec as well as the Helm client may be a viable approach, then we can start figuring out if sharding's still necessary. I'd prefer we start with simpler, smaller wins and work our way up to more advanced scenarios like the one @ankushchadha is proposing.

@mattfarina

This comment has been minimized.

Copy link
Collaborator

commented May 24, 2018

I like the idea of simpler wins. Given some of the non-standard library parsers claim to be better on memory I'll do some analysis on those to see if we can avoid sharding. With CI systems creating just so many releases we want to make sure the v2 solution scales the way continuous deploy's do.

@mattfarina

This comment has been minimized.

Copy link
Collaborator

commented Jun 19, 2018

I've been running some numbers on splitting the chart index into an index with just the latest chart version and then separate chart files with all the versions. You can see the results at https://github.com/mattfarina/helm-index-json This builds on the previous work I did on yaml vs json at https://github.com/mattfarina/yamlbench

The memory and performance of working with files is significantly different if we switch to json and split the files.

@mattfarina

This comment has been minimized.

Copy link
Collaborator

commented Jul 23, 2018

I've created a Helm v3 proposal to implement the split at helm/community#27

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.