Adding distributed search proposal #24

Merged
merged 4 commits into from Jul 26, 2018

Contributor

@mattfarina mattfarina commented Jun 6, 2018

cc: @prydonius @unguiculus @technosophos @michelleN

The best correlation I have to this is packagist for PHP.

Please nit as this is a draft.

Member

@prydonius prydonius left a comment

Thanks for putting this together, Matt. This looks like a good start to figuring out how we go about this.

  1. Should the Helm CLI have a way to query this API to search (instead of the local cache)?
  2. Is there a way we can make installing a chart from a non-checked out repo easier? Similar to checking out Homebrew taps and installing a formula in one command? (e.g. the ks CLI for ksonnet can be installed with brew install ksonnet/tap/ks, if the ksonnet tap is not checked out, Homebrew will add it).
  3. This may be another proposal, but do we want to propose a way we can maintain the level of trust with charts or repos? Ideas that have been thrown around are ratings/stars, some concept of "trusted" repos if all charts continuously pass linting/test requirements used in kubernetes/charts.

## A Single Search Location

The goal of this is to have a search site and API that enables the search of many public repositories. Private repositories are a separate scope and can operate with existing tools.


prydonius Jun 6, 2018
Member

Might be good to specifically state hosting repositories as a non-goal.


mattfarina Jun 6, 2018
Author Contributor

Good idea. Updated.


To help, and possibly enforce, quality of charts we need to provide tools that can perform an analysis of charts to help validate quality. These tools exist for the stable and incubator charts today. They will be packaged in a manner others can consume and leverage within their workflows.

Note, work on this step has already begun.


prydonius Jun 6, 2018
Member

Is this referring to @unguiculus' work on https://github.com/kubernetes-helm/chart-testing? Do we want to link to that specifically?


mattfarina Jun 6, 2018
Author Contributor

yes. I added a link

The following are outstanding actions that need to be worked out but can happen after the proposal is accepted:

* [ ] Decide on the hosting location for this search.
* [ ] Decide on and documented the requirements for listed repositories


prydonius Jun 6, 2018
Member

s/documented/document


mattfarina Jun 6, 2018
Author Contributor

Good catch. Fixed

Contributor Author

@mattfarina mattfarina commented Jun 6, 2018

@prydonius

> Should the Helm CLI have a way to query this API to search (instead of the local cache)?

I have two thoughts on this...

  1. The local cache of repos should still be used because it can hold private repo information and it doesn't leak details on the search to a 3rd party. It's also contextually set to that person rather than all the things.
  2. I like the idea of providing an option to work with this service. Possibly a flag or something else. But, I think this is a Helm feature addition and should happen after we get this service up and prove we can do it.

Basically, I would do Helm integration as a second proposal when the time is right.

> Is there a way we can make installing a chart from a non-checked out repo easier? Similar to checking out Homebrew taps and installing a formula in one command? (e.g. the ks CLI for ksonnet can be installed with brew install ksonnet/tap/ks, if the ksonnet tap is not checked out, Homebrew will add it).

Right now, in Helm 2, you can do something like:

$ helm install https://kubernetes-charts.storage.googleapis.com/acs-engine-autoscaler-2.2.0.tgz

What we're really talking about doing is aliases (similar to the way stable and incubator work today) and doing aliases in a shared manner. I would file this as a separate Helm change and we need to put a lot of work into this because:

  • Homebrew dropped the official taps. Now a tap either needs to live under a GitHub org or you have to pass in its location explicitly. Homebrew no longer maintains shared knowledge of taps.
  • We need to figure out how to make it work with names people are already using and how to handle conflicts.

I like the overall idea but I would punt on it as a secondary thing that can come in a follow-up proposal where the nuances of the hard parts can be debated.

I do have ideas for another proposal, coming soon, that would enable things like:

$ helm install https://kubernetes-charts.storage.googleapis.com/acs-engine-autoscaler

or

$ helm install https://kubernetes-charts.storage.googleapis.com/acs-engine-autoscaler#^2.2.0

The idea is to make the experience easier by not needing to know all the version information.
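
A minimal sketch of how a reference such as `...acs-engine-autoscaler#^2.2.0` might be interpreted. Nothing here is an existing Helm feature: the `URL#constraint` syntax is only the idea floated above, and `split_chart_ref`, `satisfies`, and `pick_version` are hypothetical helper names. The caret range is assumed to mean "at least this version, below the next major version", and pre-release versions are not handled.

```python
from urllib.parse import urldefrag


def parse_version(text):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints (no pre-release handling)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def split_chart_ref(ref):
    """Split 'URL#constraint' into (url, constraint); constraint is None if absent."""
    url, fragment = urldefrag(ref)
    return url, (fragment or None)


def satisfies(version, constraint):
    """Check a parsed version against a hypothetical constraint.

    Only two forms are handled: None (any version) and a caret range
    '^X.Y.Z', assumed to mean >= X.Y.Z and < (X+1).0.0.
    """
    if constraint is None:
        return True
    if not constraint.startswith("^"):
        raise ValueError(f"unsupported constraint: {constraint!r}")
    floor = parse_version(constraint[1:])
    ceiling = (floor[0] + 1, 0, 0)
    return floor <= version < ceiling


def pick_version(available, constraint):
    """Return the newest available version string that meets the constraint."""
    candidates = [v for v in available if satisfies(parse_version(v), constraint)]
    if not candidates:
        return None
    return max(candidates, key=parse_version)


url, constraint = split_chart_ref(
    "https://kubernetes-charts.storage.googleapis.com/acs-engine-autoscaler#^2.2.0"
)
chosen = pick_version(["2.1.9", "2.2.0", "2.3.1", "3.0.0"], constraint)
```

With no fragment, the same helper simply picks the newest version, which matches the goal of not needing to know the version information up front.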

> This may be another proposal, but do we want to propose a way we can maintain the level of trust with charts or repos? Ideas that have been thrown around are ratings/stars, some concept of "trusted" repos if all charts continuously pass linting/test requirements used in kubernetes/charts.

Trust is a squishy thing. Who decides trust? It's the end users, and different people will decide differently. If we say something is trusted but it fails that trust, what happens to the Helm project? It'll lose some trust.

I like the approach of giving people information and letting them decide how to trust. We set some minimum bar for inclusion and after that the trust is variable. Some things we could do:

  • If the repo is on a site like GitHub we could pull in and display some details
  • Tell people if a chart is signed or not and provide public key info on it
  • Provide a badge if they are running different levels of CI tools on the charts

Oracle is a company that really brings up the trust question. Some people don't trust them because of all the closed source they do. Others trust them because they have contracts. Trust is not a one size fits all situation. So, I would propose we empower people doing searches rather than try to decide trust for them.

Member

@prydonius prydonius commented Jun 6, 2018

> Basically, I would do Helm integration as a second proposal when the time is right.

Thanks, makes sense. I also agree that this should not be the default behaviour but a flag or another command.

> Right now, in Helm 2, you can do something like:
>
> $ helm install https://kubernetes-charts.storage.googleapis.com/acs-engine-autoscaler-2.2.0.tgz

Interesting, I didn't know this. Sounds like we could easily make this nicer with some sort of vanity URL, e.g. helm install helm.sh/prydonius/wordpress to get the latest and helm install helm.sh/prydonius/wordpress#^2.2.0 with some sort of version tag as you suggest. Where prydonius could be a repo I've added to the site. I like it!

> So, I would propose we empower people doing searches rather than try to decide trust for them.

You're absolutely right here, and I agree that the trust should be decided based on all the information we can provide to a user. I do think it benefits users to display repos and charts that are higher rated, have a higher frequency of contributions, and follow all CI best practices more prominently, though. We need to choose the metrics we use for "trust" carefully, but I don't think that prevents us from being able to sort by them.

Contributor Author

@mattfarina mattfarina commented Jun 7, 2018

@prydonius Interesting idea with the vanity locations like helm.sh/prydonius/wordpress. Imagine the short name (e.g., helm.sh/prydonius) being a list of all the charts in a repo. We would have to make this work separately from GitHub.
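
A rough sketch of how a vanity reference like `helm.sh/prydonius/wordpress` could be resolved, assuming the search site keeps a mapping from namespace to repository URL. Everything here is hypothetical: the `REGISTRY` contents, the example.com URL, and the `resolve_vanity` helper are illustrations, not part of Helm or Monocular.

```python
# Hypothetical registry kept by the search site: namespace -> chart repository URL.
REGISTRY = {
    "prydonius": "https://charts.example.com/prydonius",
}


def resolve_vanity(ref, registry, host="helm.sh"):
    """Resolve 'host/namespace[/chart][#constraint]' against the registry.

    A missing chart name means "list every chart in that repository",
    following the short-name idea discussed above.
    """
    name, _, constraint = ref.partition("#")
    parts = name.split("/")
    if parts[0] != host or len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"not a vanity reference: {ref!r}")
    namespace = parts[1]
    if namespace not in registry:
        raise KeyError(f"unknown namespace: {namespace!r}")
    return {
        "repo_url": registry[namespace],
        "chart": parts[2] if len(parts) == 3 else None,
        "constraint": constraint or None,
    }


single_chart = resolve_vanity("helm.sh/prydonius/wordpress#^2.2.0", REGISTRY)
whole_repo = resolve_vanity("helm.sh/prydonius", REGISTRY)
```

Because the lookup goes through the site's own registry rather than a code host, it works independently of GitHub, as the comment above calls for.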

Member

@scottrigby scottrigby commented Jun 12, 2018

@mattfarina This proposal summarizes all the important things I can recall from past conversations around helm registry (which I realize now has hit a dead end as a solution, but the key requirements are similar): distributed source model with centralized info, ability to register, discoverability, ease of DX for contributors and users, important analytics for users in deciding which package to try (the trust factor).

@prydonius I like the account (user/org) namespacing and helm.sh vanity URL ideas too. Probably because I'm already familiar with this from other communities (packagist etc), but also because I think this helps with trust and distinguishing one author or group's version of a package from another. 👏 👏


@omkensey omkensey commented Jun 13, 2018

The other thing this makes me think of is the yum/apt repo ecosystem. Yum repos can publish (and EPEL does, among I think others) a package that users can install that adds the signing keys for that repo as trusted keys, sets up the repo in the yum config, etc. I could see helm repos publishing some sort of signed metadata bundle that allows that too.

Also like the Linux package ecosystem, we're going to have to deal with package conflicts. Currently this is handled with Linux package repos by repo priorities being set, explicitly or implicitly, and if multiple repos have a given package name/arch/version to be installed, highest-priority repo wins. This also allows users to implicitly indicate, in a way, their trust level of a repo -- "if stable has that chart I don't want to get it from honest-jims-lightly-used-charts, but I do want to use Jim for charts only he provides". (This does open up the possibility of a malicious or compromised repo manipulating chart versions (or a poorly-run repo doing so accidentally) to win dependency resolution even if they're lower priority, by falsely claiming they provide a later version of a chart than all other repos -- so we may want to set some basic rules around chart name/version conflicts with things in the stable repo. EPEL has this kind of relationship with the RHEL/CentOS repos.)
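
To make the conflict scenario concrete, here is a small sketch comparing two resolution strategies. It is not how Helm or yum resolve packages: the repository data, the priority numbers (a larger number means higher priority, an arbitrary choice for this sketch), and both function names are assumptions for illustration.

```python
def parse_version(text):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def resolve_by_priority(chart, repos):
    """Highest-priority repo offering the chart wins; versions only matter inside it."""
    offering = [repo for repo in repos if chart in repo["charts"]]
    if not offering:
        return None
    best = max(offering, key=lambda repo: repo["priority"])
    return best["name"], max(best["charts"][chart], key=parse_version)


def resolve_by_version(chart, repos):
    """Naive alternative: newest version across all repos wins, priority breaks ties."""
    candidates = [
        (parse_version(version), repo["priority"], repo["name"], version)
        for repo in repos
        for version in repo["charts"].get(chart, [])
    ]
    if not candidates:
        return None
    _, _, name, version = max(candidates)
    return name, version


# Hypothetical repositories; the low-priority one claims an inflated version.
repos = [
    {"name": "stable", "priority": 10, "charts": {"mysql": ["0.8.0", "0.8.2"]}},
    {
        "name": "honest-jims-lightly-used-charts",
        "priority": 1,
        "charts": {"mysql": ["99.0.0"], "jims-special": ["1.0.0"]},
    },
]

by_priority = resolve_by_priority("mysql", repos)
by_version = resolve_by_version("mysql", repos)
only_jim = resolve_by_priority("jims-special", repos)
```

Under the priority-first strategy the inflated `99.0.0` is ignored because `stable` already offers the chart, while the version-first strategy hands the win to the lower-priority repo. Charts that only Jim provides still resolve to his repo either way.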

We don't want to be in charge of setting trust levels, but like we do with stable charts, I do think we want to set basic criteria for a repo's inclusion in whatever index is created. I like the compliance badge idea. We should make sure it's visually clear that we're not saying this repo or its charts are trustworthy, only that it passed some basic functional tests.

Tangentially, this also brings up a separate idea -- mirroring. It's out of scope for this discussion but we may want to consider creating/promoting some tools for creating mirrors of repos, both for added resiliency to the public and for enabling repos to be easily used behind strict corporate firewalls.

Contributor Author

@mattfarina mattfarina commented Jun 14, 2018

@omkensey I see where you are going with this. A few thoughts...

  • You might want to look at #20 as it talks about signing. Helm charts can be signed today. We just don't see much of that. It would be nice to make that experience easier for everyone.
  • Helm was designed to be like APT/yum so any correlations should not be a surprise. There was intent in that.
  • If I'm using, for example, apt and I want to install something by name there is just a name. For example, apt-get install wget. With helm there is the difference in the repo short name. For example, helm install stable/mysql. There are no conflicts because of the namespacing by repo. Don't you think this removes the need for priorities?

It's worth noting, for those that don't know, that you can do something like this today without having added a repo:

$ helm install https://kubernetes-charts.storage.googleapis.com/drupal-1.0.0.tgz

Mirroring
There are already tools for mirroring today. For example, if I understand it right, Artifactory is being used as a mirror right now. One of the top user agents to read from the community repo is Artifactory as it pulls in charts. It would be nice to see some open source tools that do this as well.

Member

@jdolitsky jdolitsky commented Jun 18, 2018

@mattfarina this all sounds great. In terms of mirroring, @omkensey, this is something not yet implemented, but it has been asked for in chartmuseum. This can be part of the provided "Repository Tools".

Few items that are unclear to me (and can probably be discussed in a separate PR after merge):

1.) "The goal of this is to have a search site and API" - this sounds like it overlaps with functionality provided by Monocular. @prydonius Should we break off the Monocular API into its own project to support this functionality? Is this something that should be added to chartmuseum, with the ability to provide search on locally hosted repos out-of-the-box? I'm not sure if this proposal is suggesting a new tool or not.

2.) Auth. How is authentication handled? How/will the central service delegate the responsibility of authorization to individual repos? This is one of the bigger challenges of this in my opinion.

Contributor Author

@mattfarina mattfarina commented Jun 18, 2018

@jdolitsky To respond to your points...

  1. Couldn't this be implemented by using Monocular? Monocular is software, not a public service. Running an instance of it as a public service could do this. hub.kubeapps.com is using Monocular + some other things.
  2. Since Helm is part of the CNCF and the CNCF is part of the Linux Foundation, I'm curious about https://identity.linuxfoundation.org/. I'm only at the curious stage rather than knowing enough about it to have an opinion.
Member

@prydonius prydonius commented Jun 18, 2018

I also think that Monocular is the best place to go and implement this, having been originally designed for this purpose. Monocular already supports aggregating multiple repositories today, though repositories can only currently be configured in the global configuration. We would want to extend this to make it possible for logged-in users to add their own repositories under a namespace.

Perhaps a good next step here, after there is consensus on the overall goal, is to put together a proposal on the features we will need to implement in Monocular to get the desired functionality.

Contributor Author

@mattfarina mattfarina commented Jun 19, 2018

@prydonius for a v0.1 is there anything beyond straight monocular we should have?

Member

@jdolitsky jdolitsky commented Jun 19, 2018

@mattfarina Are we talking about hosting chart packages? Or only pointers to user-hosted repos?

Member

@prydonius prydonius commented Jun 19, 2018

> @prydonius for a v0.1 is there anything beyond straight monocular we should have?

That would be a good start, along with a Monocular config file to list the repositories the site will index. However, I think it would be good to start discussing what work we'd want to do beyond that (e.g. pagination, improved search).

Contributor Author

@mattfarina mattfarina commented Jun 20, 2018

@jdolitsky Just to restate here, we are not looking at the hosting of packages. Rather, we want to make packages hosted in a distributed manner, in many different repositories, to be discoverable.

@prydonius Here are a few ideas that come to mind for me:

  • A homepage that displays something other than the full list of charts. Maybe have displays of the latest additions, those recently updated, or something else. If it's a paginated list of everything, people might try to game it to have theirs at the top of the first page.
  • Integration with external systems to display data. This could vary by system. For example, if the repo is hosted on GitHub (there's a plugin for that) we could display some metadata like stars, traffic, or something else.

Given a little time I'll try to come up with some more ideas.

Member

@bacongobbler bacongobbler commented Jul 18, 2018

One thing this document doesn't go into much detail on: who has permission to add repositories to the list or remove them from it? I'm assuming that's a responsibility of the chart maintainers?

Putting it another way: If I were to maintain my own repository of charts and there was decent uptake from the community, how would I go through the process to get it added to this service?


@jzelinskie jzelinskie commented Jul 19, 2018

I like the simplicity of the godoc.org model where user traffic drives discovery of content. However, I think it'd be hard to apply here. Do we want this to be completely equalizing or do we want to aggregate a list of trusted sources and have a push protocol for them to notify the indexer of new content being available?

Contributor Author

@mattfarina mattfarina commented Jul 24, 2018

@bacongobbler The discussion so far has been that it would be the charts maintainers. There would be documented criteria, which are still TBD.

@jzelinskie Here's my take on your questions...

> do we want to aggregate a list of trusted sources and have a push protocol for them to notify the indexer of new content being available?

I think this will change over time. Monocular already has a method. We would start with Monocular roughly as it sits today and then iterate to improve on it. This isn't a case of building something new.

> I like the simplicity of the godoc.org model where user traffic drives discovery of content.

The godoc model has lots of faults too. For example, how do you differentiate between the root and all the forks when they are picked up by godoc? For package search this is a problem, especially when a project has numerous contributors whose forks end up being indexed.

When you distribute you don't have access to the download data to try to differentiate either.

I'm personally partial to the packagist model... https://packagist.org/. Someone needs to choose to list something when they want it to be discoverable. The intended central sources of truth end up being listed rather than all the forks. Yet, the search is really just a public search and metadata cache. The sources of truth reside in 3rd parties.
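
A minimal sketch of the "public search and metadata cache" idea, assuming each listed repository exposes an index shaped like Helm's `index.yaml`, with an `entries` map from chart name to a list of version records. The indexes are shown already parsed into dicts. The repository names, chart data, and the `build_search_cache` and `search` helpers are hypothetical, and this is not how Monocular is implemented.

```python
def parse_version(text):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def build_search_cache(indexes):
    """Flatten per-repo indexes into one list of searchable metadata records.

    Only metadata is cached; the charts themselves stay in their own repos.
    """
    cache = []
    for repo_name, index in indexes.items():
        for chart_name, versions in index.get("entries", {}).items():
            if not versions:
                continue
            latest = max(versions, key=lambda v: parse_version(v["version"]))
            cache.append({
                "ref": f"{repo_name}/{chart_name}",
                "version": latest["version"],
                "description": latest.get("description", ""),
                "keywords": latest.get("keywords", []),
            })
    return sorted(cache, key=lambda entry: entry["ref"])


def search(cache, term):
    """Case-insensitive substring match on reference, description, and keywords."""
    term = term.lower()
    return [
        entry["ref"]
        for entry in cache
        if term in entry["ref"].lower()
        or term in entry["description"].lower()
        or any(term in keyword.lower() for keyword in entry["keywords"])
    ]


# Hypothetical listed repositories, already fetched and parsed.
indexes = {
    "stable": {"entries": {
        "mysql": [
            {"version": "0.8.0", "description": "Fast, reliable SQL database", "keywords": ["mysql", "database"]},
            {"version": "0.8.2", "description": "Fast, reliable SQL database", "keywords": ["mysql", "database"]},
        ],
        "drupal": [{"version": "1.0.0", "description": "CMS", "keywords": ["cms"]}],
    }},
    "example": {"entries": {
        "mysql": [{"version": "1.2.0", "description": "MySQL packaged by Example Org"}],
    }},
}

cache = build_search_cache(indexes)
mysql_hits = search(cache, "mysql")
```

Only repositories someone chose to list appear in `indexes`, so forks are not picked up automatically, and the repo-qualified references keep two charts named `mysql` distinct.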

Signed-off-by: Matt Farina <matt@mattfarina.com>
Contributor Author

@mattfarina mattfarina commented Jul 26, 2018

This was voted on by the Helm maintainers and passed. We will be moving forward with distributed search of charts hosted by the Helm project.

@mattfarina mattfarina merged commit 16c2cc0 into helm:master Jul 26, 2018
1 check passed
cla/linuxfoundation mattfarina authorized
@mattfarina mattfarina deleted the mattfarina:proposal-distributed-search branch Jul 26, 2018