-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize scanReposForURL function #1904
Comments
@technosophos This is probably a good one to try and do as part of 2.4 given its focus 😄, should we bring it in? |
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or |
/lifecycle frozen |
@technosophos I am planning to work on this if no one else is currently working on it :) |
@irajdeep Sounds great! I would target Helm v3 as we are mostly focusing on bug fixes for v2 |
sounds cool, I will start working on it then |
This commit is an attempt to address concerns scoped in helm#1904. `ChartDownloader.scanReposForURL()` is called when a fetching chart is identified by an absolute URL and no trivial reference between the chart and parental repository could be established. The operation requires a full linear repository scan. The runtime perfromance degenerates in a case of a non-existing URL as a full scan goes over the full range of existing repos, all their ChartVersion entries and each version of an entry. The motivation for this change is to speed up this operation, ideally make the aforementioned operation to run in nearly constant amortized time (a detailed poreformance analysis is presented below). There is a constraint that should be taken into account before a conclusion about overall sense can be made: the function call exists in a vacuum: there is no continuous runtime between consecutive runs of the function. I.e. we can't use an in-memory structure that's created once and should consider using a persistent cache approach (temporary cache files). Dealing with a FS might come with a singnificant performance penalty but in the light of the aforementioned situation that seems decent. This commit introduces a concept of a secondary repository index: an extended cached index structure based on the primary repo index. Effectively it builds an inverted index based on the data read from index.yaml. In the current implementation it only builds inverred URL index. The proposed structure of a secondary file looks like: indexes: byURL: chart_url_1: name: chart_name_1 version: chart_version_1 chart_url_2: name: chart_name_2 version: chart_version_2 Ideally, the structure should provide a nearly constant lookup time in order to provide any sensible result. On the other hand, repository cache is a self-contained substance which exists independently from other repositories. Therefore it was decided to create a secondary index file for every repository. In this case we're still dealing with a linear traversal of repositories but relying on an assumption that an average # of repositories is smaller than an average # of entries in them. The assumed invariant is irrelevant to the existing algorithm as it's runtime doesn't change if the ratio changes. In the current implementation, `scanReposForURL` performs cached index.yaml loading on linear repo traversal. In the proposed implementation this operation is substituted by reading index-secondary.yaml insted. We believe it has no impact on the operation runtime complexity: the files are comparable in size and nesting complexity. This commit contains a new benchmark suite to test the performance of the new implementation. For the sake of comparability we preserved the current implementation of `scanReposForURL` but renamed it to `scanReposForURLBak`. The results of a benchmark run: goos: darwin goarch: amd64 pkg: helm.sh/helm/pkg/downloader BenchmarkScanReposForURL/with_secondary_index-4 50000 24649 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22762 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 23170 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22712 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 22346 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22658 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23544 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23062 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22476 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22621 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 144900320 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 178732070 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 130792721 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 170880342 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 144174110 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 147588994 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 155225613 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 205336598 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 168166931 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 152456294 ns/op Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
This commit is an attempt to address concerns scoped in helm#1904. `ChartDownloader.scanReposForURL()` is called when a fetching chart is identified by an absolute URL and no trivial reference between the chart and parental repository could be established. The operation requires a full linear repository scan. The runtime perfromance degenerates in a case of a non-existing URL as a full scan goes over the full range of existing repos, all their ChartVersion entries and each version of an entry. The motivation for this change is to speed up this operation, ideally make the aforementioned operation to run in nearly constant amortized time (a detailed poreformance analysis is presented below). There is a constraint that should be taken into account before a conclusion about overall sense can be made: the function call exists in a vacuum: there is no continuous runtime between consecutive runs of the function. I.e. we can't use an in-memory structure that's created once and should consider using a persistent cache approach (temporary cache files). Dealing with a FS might come with a singnificant performance penalty but in the light of the aforementioned situation that seems decent. This commit introduces a concept of a secondary repository index: an extended cached index structure based on the primary repo index. Effectively it builds an inverted index based on the data read from index.yaml. In the current implementation it only builds inverred URL index. The proposed structure of a secondary file looks like: indexes: byURL: chart_url_1: name: chart_name_1 version: chart_version_1 chart_url_2: name: chart_name_2 version: chart_version_2 Ideally, the structure should provide a nearly constant lookup time in order to provide any sensible result. On the other hand, repository cache is a self-contained substance which exists independently from other repositories. Therefore it was decided to create a secondary index file for every repository. In this case we're still dealing with a linear traversal of repositories but relying on an assumption that an average # of repositories is smaller than an average # of entries in them. The assumed invariant is irrelevant to the existing algorithm as it's runtime doesn't change if the ratio changes. In the current implementation, `scanReposForURL` performs cached index.yaml loading on linear repo traversal. In the proposed implementation this operation is substituted by reading index-secondary.yaml insted. We believe it has no impact on the operation runtime complexity: the files are comparable in size and nesting complexity. This commit contains a new benchmark suite to test the performance of the new implementation. For the sake of comparability we preserved the current implementation of `scanReposForURL` but renamed it to `scanReposForURLBak`. The results of a benchmark run: goos: darwin goarch: amd64 pkg: helm.sh/helm/pkg/downloader BenchmarkScanReposForURL/with_secondary_index-4 50000 24649 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22762 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 23170 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22712 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 22346 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22658 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23544 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23062 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22476 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22621 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 144900320 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 178732070 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 130792721 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 170880342 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 144174110 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 147588994 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 155225613 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 205336598 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 168166931 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 152456294 ns/op Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
This commit is an attempt to address concerns scoped in helm#1904. `ChartDownloader.scanReposForURL()` is called when a fetching chart is identified by an absolute URL and no trivial reference between the chart and parental repository could be established. The operation requires a full linear repository scan. The runtime perfromance degenerates in a case of a non-existing URL as a full scan goes over the full range of existing repos, all their ChartVersion entries and each version of an entry. The motivation for this change is to speed up this operation, ideally make the aforementioned operation to run in nearly constant amortized time (a detailed poreformance analysis is presented below). There is a constraint that should be taken into account before a conclusion about overall sense can be made: the function call exists in a vacuum: there is no continuous runtime between consecutive runs of the function. I.e. we can't use an in-memory structure that's created once and should consider using a persistent cache approach (temporary cache files). Dealing with a FS might come with a singnificant performance penalty but in the light of the aforementioned situation that seems decent. This commit introduces a concept of a secondary repository index: an extended cached index structure based on the primary repo index. Effectively it builds an inverted index based on the data read from index.yaml. In the current implementation it only builds inverred URL index. The proposed structure of a secondary file looks like: indexes: byURL: chart_url_1: name: chart_name_1 version: chart_version_1 chart_url_2: name: chart_name_2 version: chart_version_2 Ideally, the structure should provide a nearly constant lookup time in order to provide any sensible result. On the other hand, repository cache is a self-contained substance which exists independently from other repositories. Therefore it was decided to create a secondary index file for every repository. In this case we're still dealing with a linear traversal of repositories but relying on an assumption that an average # of repositories is smaller than an average # of entries in them. The assumed invariant is irrelevant to the existing algorithm as it's runtime doesn't change if the ratio changes. In the current implementation, `scanReposForURL` performs cached index.yaml loading on linear repo traversal. In the proposed implementation this operation is substituted by reading index-secondary.yaml insted. We believe it has no impact on the operation runtime complexity: the files are comparable in size and nesting complexity. The results of a benchmark run: goos: darwin goarch: amd64 pkg: helm.sh/helm/pkg/downloader BenchmarkScanReposForURL/with_secondary_index-4 50000 24649 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22762 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 23170 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22712 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 22346 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22658 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23544 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23062 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22476 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22621 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 144900320 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 178732070 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 130792721 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 170880342 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 144174110 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 147588994 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 155225613 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 205336598 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 168166931 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 152456294 ns/op Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
This commit is an attempt to address concerns scoped in helm#1904. `ChartDownloader.scanReposForURL()` is called when a fetching chart is identified by an absolute URL and no trivial reference between the chart and parental repository could be established. The operation requires a full linear repository scan. The runtime perfromance degenerates in a case of a non-existing URL as a full scan goes over the full range of existing repos, all their ChartVersion entries and each version of an entry. The motivation for this change is to speed up this operation, ideally make the aforementioned operation to run in nearly constant amortized time (a detailed poreformance analysis is presented below). There is a constraint that should be taken into account before a conclusion about overall sense can be made: the function call exists in a vacuum: there is no continuous runtime between consecutive runs of the function. I.e. we can't use an in-memory structure that's created once and should consider using a persistent cache approach (temporary cache files). Dealing with a FS might come with a singnificant performance penalty but in the light of the aforementioned situation that seems decent. This commit introduces a concept of a secondary repository index: an extended cached index structure based on the primary repo index. Effectively it builds an inverted index based on the data read from index.yaml. In the current implementation it only builds inverred URL index. The proposed structure of a secondary file looks like: indexes: byURL: chart_url_1: name: chart_name_1 version: chart_version_1 chart_url_2: name: chart_name_2 version: chart_version_2 Ideally, the structure should provide a nearly constant lookup time in order to provide any sensible result. On the other hand, repository cache is a self-contained substance which exists independently from other repositories. Therefore it was decided to create a secondary index file for every repository. In this case we're still dealing with a linear traversal of repositories but relying on an assumption that an average # of repositories is smaller than an average # of entries in them. The assumed invariant is irrelevant to the existing algorithm as it's runtime doesn't change if the ratio changes. In the current implementation, `scanReposForURL` performs cached index.yaml loading on linear repo traversal. In the proposed implementation this operation is substituted by reading index-secondary.yaml insted. We believe it has no impact on the operation runtime complexity: the files are comparable in size and nesting complexity. The results of a benchmark run: goos: darwin goarch: amd64 pkg: helm.sh/helm/pkg/downloader BenchmarkScanReposForURL/with_secondary_index-4 50000 24649 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22762 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 23170 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22712 ns/op BenchmarkScanReposForURL/with_secondary_index-4 50000 22346 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22658 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23544 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 23062 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22476 ns/op BenchmarkScanReposForURL/with_secondary_index-4 100000 22621 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 144900320 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 178732070 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 130792721 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 170880342 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 144174110 ns/op BenchmarkScanReposForURL/no_secondary_index-4 20 147588994 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 155225613 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 205336598 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 168166931 ns/op BenchmarkScanReposForURL/no_secondary_index-4 10 152456294 ns/op Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs. |
This fixes the bug by not including provided values files in the array of generated values, which is evalutated in a defered block. Resolves helm#1904
Per #1855, there is probably a way to optimize the traversal algorithm (maybe with a cache) that is used by scanReposForURL()
The text was updated successfully, but these errors were encountered: