Optimize scanReposForURL function #1904

technosophos · 2017-02-02T17:04:15Z

Per #1855, there is probably a way to optimize the traversal algorithm (maybe with a cache) that is used by scanReposForURL()

thomastaylor312 · 2017-04-15T19:20:09Z

@technosophos This is probably a good one to try and do as part of 2.4 given its focus 😄, should we bring it in?

fejta-bot · 2017-12-23T07:58:39Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

thomastaylor312 · 2018-01-04T07:21:25Z

/lifecycle frozen

irajdeep · 2019-08-19T09:30:32Z

@technosophos I am planning to work on this if no one else is currently working on it :)

thomastaylor312 · 2019-08-19T15:21:56Z

@irajdeep Sounds great! I would target Helm v3 as we are mostly focusing on bug fixes for v2

irajdeep · 2019-08-19T15:38:05Z

sounds cool, I will start working on it then

This commit is an attempt to address concerns scoped in helm#1904. `ChartDownloader.scanReposForURL()` is called when a fetching chart is identified by an absolute URL and no trivial reference between the chart and parental repository could be established. The operation requires a full linear repository scan. The runtime perfromance degenerates in a case of a non-existing URL as a full scan goes over the full range of existing repos, all their ChartVersion entries and each version of an entry. The motivation for this change is to speed up this operation, ideally make the aforementioned operation to run in nearly constant amortized time (a detailed poreformance analysis is presented below). There is a constraint that should be taken into account before a conclusion about overall sense can be made: the function call exists in a vacuum: there is no continuous runtime between consecutive runs of the function. I.e. we can't use an in-memory structure that's created once and should consider using a persistent cache approach (temporary cache files). Dealing with a FS might come with a singnificant performance penalty but in the light of the aforementioned situation that seems decent. This commit introduces a concept of a secondary repository index: an extended cached index structure based on the primary repo index. Effectively it builds an inverted index based on the data read from index.yaml. In the current implementation it only builds inverred URL index. The proposed structure of a secondary file looks like: indexes: byURL: chart_url_1: name: chart_name_1 version: chart_version_1 chart_url_2: name: chart_name_2 version: chart_version_2 Ideally, the structure should provide a nearly constant lookup time in order to provide any sensible result. On the other hand, repository cache is a self-contained substance which exists independently from other repositories. Therefore it was decided to create a secondary index file for every repository. In this case we're still dealing with a linear traversal of repositories but relying on an assumption that an average # of repositories is smaller than an average # of entries in them. The assumed invariant is irrelevant to the existing algorithm as it's runtime doesn't change if the ratio changes. In the current implementation, `scanReposForURL` performs cached index.yaml loading on linear repo traversal. In the proposed implementation this operation is substituted by reading index-secondary.yaml insted. We believe it has no impact on the operation runtime complexity: the files are comparable in size and nesting complexity. This commit contains a new benchmark suite to test the performance of the new implementation. For the sake of comparability we preserved the current implementation of `scanReposForURL` but renamed it to `scanReposForURLBak`. The results of a benchmark run: goos: darwin goarch: amd64 pkg: helm.sh/helm/pkg/downloader BenchmarkScanReposForURL/with_secondary_index-4 50000 24649 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22762 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 23170 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22712 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 22346 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22658 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23544 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23062 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22476 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22621 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 144900320 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 178732070 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 130792721 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 170880342 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 144174110 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 147588994 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 155225613 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 205336598 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 168166931 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 152456294 ns/op Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>

This commit is an attempt to address concerns scoped in helm#1904. `ChartDownloader.scanReposForURL()` is called when a fetching chart is identified by an absolute URL and no trivial reference between the chart and parental repository could be established. The operation requires a full linear repository scan. The runtime perfromance degenerates in a case of a non-existing URL as a full scan goes over the full range of existing repos, all their ChartVersion entries and each version of an entry. The motivation for this change is to speed up this operation, ideally make the aforementioned operation to run in nearly constant amortized time (a detailed poreformance analysis is presented below). There is a constraint that should be taken into account before a conclusion about overall sense can be made: the function call exists in a vacuum: there is no continuous runtime between consecutive runs of the function. I.e. we can't use an in-memory structure that's created once and should consider using a persistent cache approach (temporary cache files). Dealing with a FS might come with a singnificant performance penalty but in the light of the aforementioned situation that seems decent. This commit introduces a concept of a secondary repository index: an extended cached index structure based on the primary repo index. Effectively it builds an inverted index based on the data read from index.yaml. In the current implementation it only builds inverred URL index. The proposed structure of a secondary file looks like: indexes: byURL: chart_url_1: name: chart_name_1 version: chart_version_1 chart_url_2: name: chart_name_2 version: chart_version_2 Ideally, the structure should provide a nearly constant lookup time in order to provide any sensible result. On the other hand, repository cache is a self-contained substance which exists independently from other repositories. Therefore it was decided to create a secondary index file for every repository. In this case we're still dealing with a linear traversal of repositories but relying on an assumption that an average # of repositories is smaller than an average # of entries in them. The assumed invariant is irrelevant to the existing algorithm as it's runtime doesn't change if the ratio changes. In the current implementation, `scanReposForURL` performs cached index.yaml loading on linear repo traversal. In the proposed implementation this operation is substituted by reading index-secondary.yaml insted. We believe it has no impact on the operation runtime complexity: the files are comparable in size and nesting complexity. The results of a benchmark run: goos: darwin goarch: amd64 pkg: helm.sh/helm/pkg/downloader BenchmarkScanReposForURL/with_secondary_index-4 50000 24649 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22762 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 23170 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22712 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 22346 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22658 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23544 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23062 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22476 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22621 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 144900320 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 178732070 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 130792721 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 170880342 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 144174110 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 147588994 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 155225613 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 205336598 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 168166931 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 152456294 ns/op Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>

github-actions · 2020-11-22T00:30:45Z

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

This fixes the bug by not including provided values files in the array of generated values, which is evalutated in a defered block. Resolves helm#1904

technosophos mentioned this issue Feb 2, 2017

fix(helm): fix 'helm dep up' to fetch by URL #1855

Merged

technosophos added this to the 2.4.0-Triage milestone Feb 14, 2017

thomastaylor312 added the refactor label Apr 15, 2017

k8s-ci-robot added the lifecycle/stale label Dec 23, 2017

k8s-ci-robot added the lifecycle/frozen label Jan 4, 2018

bacongobbler removed lifecycle/frozen labels Sep 24, 2018

osdrv mentioned this issue Sep 10, 2019

[Proposal] Implemented secondary repository indexing refs #1904 #6405

Closed

3 tasks

bacongobbler removed this from the Upcoming - Minor milestone Aug 14, 2020

github-actions bot added the Stale label Nov 22, 2020

github-actions bot closed this as completed Dec 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize scanReposForURL function #1904

Optimize scanReposForURL function #1904

technosophos commented Feb 2, 2017

thomastaylor312 commented Apr 15, 2017

fejta-bot commented Dec 23, 2017

thomastaylor312 commented Jan 4, 2018

irajdeep commented Aug 19, 2019

thomastaylor312 commented Aug 19, 2019

irajdeep commented Aug 19, 2019

github-actions bot commented Nov 22, 2020

Optimize scanReposForURL function #1904

Optimize scanReposForURL function #1904

Comments

technosophos commented Feb 2, 2017

thomastaylor312 commented Apr 15, 2017

fejta-bot commented Dec 23, 2017

thomastaylor312 commented Jan 4, 2018

irajdeep commented Aug 19, 2019

thomastaylor312 commented Aug 19, 2019

irajdeep commented Aug 19, 2019

github-actions bot commented Nov 22, 2020