Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize scanReposForURL function #1904

Closed
technosophos opened this issue Feb 2, 2017 · 7 comments
Closed

Optimize scanReposForURL function #1904

technosophos opened this issue Feb 2, 2017 · 7 comments

Comments

@technosophos
Copy link
Member

Per #1855, there is probably a way to optimize the traversal algorithm (maybe with a cache) that is used by scanReposForURL()

@thomastaylor312
Copy link
Contributor

@technosophos This is probably a good one to try and do as part of 2.4 given its focus 😄, should we bring it in?

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@thomastaylor312
Copy link
Contributor

/lifecycle frozen

@irajdeep
Copy link

@technosophos I am planning to work on this if no one else is currently working on it :)

@thomastaylor312
Copy link
Contributor

@irajdeep Sounds great! I would target Helm v3 as we are mostly focusing on bug fixes for v2

@irajdeep
Copy link

sounds cool, I will start working on it then

osdrv pushed a commit to osdrv/helm that referenced this issue Sep 10, 2019
This commit is an attempt to address concerns scoped in helm#1904.

`ChartDownloader.scanReposForURL()` is called when a fetching chart is
identified by an absolute URL and no trivial reference between the chart
and parental repository could be established. The operation requires a
full linear repository scan. The runtime perfromance degenerates in a
case of a non-existing URL as a full scan goes over the full range of
existing repos, all their ChartVersion entries and each version of an entry.

The motivation for this change is to speed up this operation, ideally
make the aforementioned operation to run in nearly constant amortized time (a
detailed poreformance analysis is presented below).

There is a constraint that should be taken into account before a conclusion
about overall sense can be made: the function call exists in a vacuum: there is
no continuous runtime between consecutive runs of the function. I.e. we can't
use an in-memory structure that's created once and should consider using
a persistent cache approach (temporary cache files). Dealing with a FS might
come with a singnificant performance penalty but in the light of the
aforementioned situation that seems decent.

This commit introduces a concept of a secondary repository index: an
extended cached index structure based on the primary repo index.
Effectively it builds an inverted index based on the data read from
index.yaml. In the current implementation it only builds inverred URL
index.

The proposed structure of a secondary file looks like:

	indexes:
		byURL:
			chart_url_1:
				name: chart_name_1
				version: chart_version_1
			chart_url_2:
				name: chart_name_2
				version: chart_version_2

Ideally, the structure should provide a nearly constant lookup time in order to
provide any sensible result. On the other hand, repository cache is a
self-contained substance which exists independently from other repositories.
Therefore it was decided to create a secondary index file for every repository.
In this case we're still dealing with a linear traversal of repositories but
relying on an assumption that an average # of repositories is smaller than an
average # of entries in them. The assumed invariant is irrelevant to the
existing algorithm as it's runtime doesn't change if the ratio changes.

In the current implementation, `scanReposForURL` performs cached index.yaml
loading on linear repo traversal. In the proposed implementation this operation
is substituted by reading index-secondary.yaml insted. We believe it has no
impact on the operation runtime complexity: the files are comparable in size
and nesting complexity.

This commit contains a new benchmark suite to test the performance of the new
implementation. For the sake of comparability we preserved the current
implementation of `scanReposForURL` but renamed it to `scanReposForURLBak`.

The results of a benchmark run:

	goos: darwin
	goarch: amd64
	pkg: helm.sh/helm/pkg/downloader
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             24649 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22762 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             23170 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22712 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             22346 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22658 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23544 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23062 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22476 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22621 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         144900320 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         178732070 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         130792721 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         170880342 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         144174110 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         147588994 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         155225613 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         205336598 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         168166931 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         152456294 ns/op

Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
osdrv pushed a commit to osdrv/helm that referenced this issue Sep 16, 2019
This commit is an attempt to address concerns scoped in helm#1904.

`ChartDownloader.scanReposForURL()` is called when a fetching chart is
identified by an absolute URL and no trivial reference between the chart
and parental repository could be established. The operation requires a
full linear repository scan. The runtime perfromance degenerates in a
case of a non-existing URL as a full scan goes over the full range of
existing repos, all their ChartVersion entries and each version of an entry.

The motivation for this change is to speed up this operation, ideally
make the aforementioned operation to run in nearly constant amortized time (a
detailed poreformance analysis is presented below).

There is a constraint that should be taken into account before a conclusion
about overall sense can be made: the function call exists in a vacuum: there is
no continuous runtime between consecutive runs of the function. I.e. we can't
use an in-memory structure that's created once and should consider using
a persistent cache approach (temporary cache files). Dealing with a FS might
come with a singnificant performance penalty but in the light of the
aforementioned situation that seems decent.

This commit introduces a concept of a secondary repository index: an
extended cached index structure based on the primary repo index.
Effectively it builds an inverted index based on the data read from
index.yaml. In the current implementation it only builds inverred URL
index.

The proposed structure of a secondary file looks like:

	indexes:
		byURL:
			chart_url_1:
				name: chart_name_1
				version: chart_version_1
			chart_url_2:
				name: chart_name_2
				version: chart_version_2

Ideally, the structure should provide a nearly constant lookup time in order to
provide any sensible result. On the other hand, repository cache is a
self-contained substance which exists independently from other repositories.
Therefore it was decided to create a secondary index file for every repository.
In this case we're still dealing with a linear traversal of repositories but
relying on an assumption that an average # of repositories is smaller than an
average # of entries in them. The assumed invariant is irrelevant to the
existing algorithm as it's runtime doesn't change if the ratio changes.

In the current implementation, `scanReposForURL` performs cached index.yaml
loading on linear repo traversal. In the proposed implementation this operation
is substituted by reading index-secondary.yaml insted. We believe it has no
impact on the operation runtime complexity: the files are comparable in size
and nesting complexity.

This commit contains a new benchmark suite to test the performance of the new
implementation. For the sake of comparability we preserved the current
implementation of `scanReposForURL` but renamed it to `scanReposForURLBak`.

The results of a benchmark run:

	goos: darwin
	goarch: amd64
	pkg: helm.sh/helm/pkg/downloader
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             24649 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22762 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             23170 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22712 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             22346 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22658 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23544 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23062 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22476 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22621 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         144900320 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         178732070 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         130792721 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         170880342 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         144174110 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         147588994 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         155225613 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         205336598 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         168166931 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         152456294 ns/op

Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
osdrv pushed a commit to osdrv/helm that referenced this issue Oct 7, 2019
This commit is an attempt to address concerns scoped in helm#1904.

`ChartDownloader.scanReposForURL()` is called when a fetching chart is
identified by an absolute URL and no trivial reference between the chart
and parental repository could be established. The operation requires a
full linear repository scan. The runtime perfromance degenerates in a
case of a non-existing URL as a full scan goes over the full range of
existing repos, all their ChartVersion entries and each version of an entry.

The motivation for this change is to speed up this operation, ideally
make the aforementioned operation to run in nearly constant amortized time (a
detailed poreformance analysis is presented below).

There is a constraint that should be taken into account before a conclusion
about overall sense can be made: the function call exists in a vacuum: there is
no continuous runtime between consecutive runs of the function. I.e. we can't
use an in-memory structure that's created once and should consider using
a persistent cache approach (temporary cache files). Dealing with a FS might
come with a singnificant performance penalty but in the light of the
aforementioned situation that seems decent.

This commit introduces a concept of a secondary repository index: an
extended cached index structure based on the primary repo index.
Effectively it builds an inverted index based on the data read from
index.yaml. In the current implementation it only builds inverred URL
index.

The proposed structure of a secondary file looks like:

	indexes:
		byURL:
			chart_url_1:
				name: chart_name_1
				version: chart_version_1
			chart_url_2:
				name: chart_name_2
				version: chart_version_2

Ideally, the structure should provide a nearly constant lookup time in order to
provide any sensible result. On the other hand, repository cache is a
self-contained substance which exists independently from other repositories.
Therefore it was decided to create a secondary index file for every repository.
In this case we're still dealing with a linear traversal of repositories but
relying on an assumption that an average # of repositories is smaller than an
average # of entries in them. The assumed invariant is irrelevant to the
existing algorithm as it's runtime doesn't change if the ratio changes.

In the current implementation, `scanReposForURL` performs cached index.yaml
loading on linear repo traversal. In the proposed implementation this operation
is substituted by reading index-secondary.yaml insted. We believe it has no
impact on the operation runtime complexity: the files are comparable in size
and nesting complexity.

The results of a benchmark run:

	goos: darwin
	goarch: amd64
	pkg: helm.sh/helm/pkg/downloader
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             24649 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22762 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             23170 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22712 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             22346 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22658 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23544 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23062 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22476 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22621 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         144900320 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         178732070 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         130792721 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         170880342 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         144174110 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         147588994 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         155225613 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         205336598 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         168166931 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         152456294 ns/op

Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
osdrv pushed a commit to osdrv/helm that referenced this issue Oct 7, 2019
This commit is an attempt to address concerns scoped in helm#1904.

`ChartDownloader.scanReposForURL()` is called when a fetching chart is
identified by an absolute URL and no trivial reference between the chart
and parental repository could be established. The operation requires a
full linear repository scan. The runtime perfromance degenerates in a
case of a non-existing URL as a full scan goes over the full range of
existing repos, all their ChartVersion entries and each version of an entry.

The motivation for this change is to speed up this operation, ideally
make the aforementioned operation to run in nearly constant amortized time (a
detailed poreformance analysis is presented below).

There is a constraint that should be taken into account before a conclusion
about overall sense can be made: the function call exists in a vacuum: there is
no continuous runtime between consecutive runs of the function. I.e. we can't
use an in-memory structure that's created once and should consider using
a persistent cache approach (temporary cache files). Dealing with a FS might
come with a singnificant performance penalty but in the light of the
aforementioned situation that seems decent.

This commit introduces a concept of a secondary repository index: an
extended cached index structure based on the primary repo index.
Effectively it builds an inverted index based on the data read from
index.yaml. In the current implementation it only builds inverred URL
index.

The proposed structure of a secondary file looks like:

	indexes:
		byURL:
			chart_url_1:
				name: chart_name_1
				version: chart_version_1
			chart_url_2:
				name: chart_name_2
				version: chart_version_2

Ideally, the structure should provide a nearly constant lookup time in order to
provide any sensible result. On the other hand, repository cache is a
self-contained substance which exists independently from other repositories.
Therefore it was decided to create a secondary index file for every repository.
In this case we're still dealing with a linear traversal of repositories but
relying on an assumption that an average # of repositories is smaller than an
average # of entries in them. The assumed invariant is irrelevant to the
existing algorithm as it's runtime doesn't change if the ratio changes.

In the current implementation, `scanReposForURL` performs cached index.yaml
loading on linear repo traversal. In the proposed implementation this operation
is substituted by reading index-secondary.yaml insted. We believe it has no
impact on the operation runtime complexity: the files are comparable in size
and nesting complexity.

The results of a benchmark run:

	goos: darwin
	goarch: amd64
	pkg: helm.sh/helm/pkg/downloader
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             24649 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22762 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             23170 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22712 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                    50000             22346 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22658 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23544 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             23062 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22476 ns/op
	BenchmarkScanReposForURL/with_secondary_index-4                   100000             22621 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         144900320 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         178732070 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         130792721 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         170880342 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         144174110 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         20         147588994 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         155225613 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         205336598 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         168166931 ns/op
	BenchmarkScanReposForURL/no_secondary_index-4                         10         152456294 ns/op

Signed-off-by: Oleg Sidorov <oleg.sidorov@booking.com>
@bacongobbler bacongobbler removed this from the Upcoming - Minor milestone Aug 14, 2020
@github-actions
Copy link

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Nov 22, 2020
MichaelMorrisEst pushed a commit to Nordix/helm that referenced this issue Nov 17, 2023
This fixes the bug by not including provided values files in the array of generated values, which is evalutated in a defered block.
Resolves helm#1904
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants