Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/helm: migrate to secret storage backend in namespace of CR #1102

Merged

Conversation

joelanford
Copy link
Member

Description of the change:

  • Migrates the helm-operator from an in-memory storage backend to a secret storage backend so that release state is persisted to the cluster.
  • Migrates from an operator-global storage backend to CR-specific storage backends so that release secrets are stored in the same namespace as the CR.

Motivation for the change:
This PR is the first step towards completing #1100. We'll also be able to close #917, whose goal is to migrate to a persistent storage backend.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 13, 2019
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 4, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 4, 2019
Copy link
Member

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits/suggestions left. Overall looks good to me. My only concern is that this is a tempory/transition state, what is the reason behind this? Can't we have a check for this, to create the Secret. Not too familiar with the problem, so just curious to learn more.

pkg/helm/release/manager.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
pkg/helm/release/manager_factory.go Outdated Show resolved Hide resolved
@joelanford
Copy link
Member Author

@lilic

Not too familiar with the problem, so just curious to learn more.

The background is that for Helm (specifically the Tiller components of Helm) to function properly, there needs to be a storage backend that stores and manages the release state for every release. Helm has a storage interface for managing this.

When starting the helm-operator for the first time, the storage backend is empty. As CRs are added and updated and releases are installed and upgraded, respectively, the storage backend is populated and updated with the latest release state.

We're currently using the in-memory storage backend, which means that all of this state disappears when the operator is restarted, so there needs to be some way to repopulate the storage backend when it comes back up. To do that, we save the release state to the CR status and use the syncReleaseStatus function to make sure that the storage backend is populated at the very beginning of the reconcile function (before running install/upgrade/uninstall tasks)

This PR swaps out the in-memory storage backend for a secret backend, which means that release state will be stored via the storage backend with secrets, rather than in memory. When the operator restarts, it will no longer need to sync release state to its backend because it's all already there, stored in the cluster as secrets.

My only concern is that this is a tempory/transition state, what is the reason behind this?

The catch with the above is that we need to gracefully handle transitioning release state from the CR status to the secret-based backend.

As a follow-on to this PR, we plan to store only the most relevant bits of information about the release (e.g. release name and version) in the CR status and remove the rest of the full release state.

With that in mind, we could plan on leaving this syncReleaseStatus function in place, such that it only runs if the CR status contains the full release state. In the follow-on PR, we'll stop saving the full release state to the CR status, and we can update the syncReleaseStatus function to additionally remove the release state from the CR status once it has been migrated to the new secret-based backend.

Does that make sense? WDYT?

@lilic
Copy link
Member

lilic commented Mar 6, 2019

Thanks for the detailed explanation! It makes sense, maybe we should document this for the user, that we create the Secret in the cluster that stores the information about the release state.

With that in mind, we could plan on leaving this syncReleaseStatus function in place...

I agree, we should leave it, if at all possible, mainly since users tend to skip releases when upgrading.

... , such that it only runs if the CR status contains the full release state.

Can we just always have it run whenever the secrets are not present in the cluster. For example, what if it happens that the secrets get deleted by accident but the CR status doesn't contain the full release state? Ah, nevermind, the secret creation won't work then as there is no CR status to write to the secret 🤦‍♀️ . But still curious about the question, can the user in a way create the Secret with the release information "manually"?

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 8, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2019
@joelanford
Copy link
Member Author

joelanford commented Mar 12, 2019

@lilic

what if it happens that the secrets get deleted by accident but the CR status doesn't contain the full release state?

If the secret (but not the release resources) associated with the deployed release gets deleted and there's no release state to sync from the CR, the operator will think that the CR represents a new release that needs to be installed. It will:

  1. Attempt to install a new release, which will fail because the release resources already exist
  2. Uninstall the failed release in an attempt to rollback to the previous state, which deletes the new secret and existing release resources
  3. Return an error from the Reconcile function, causing the CR to be requeued for reconciliation.
  4. During the next reconciliation, attempt to install a new release, which should succeed since the release resources were deleted during the uninstall of the failed release.

This is essentially the same thing that would happen if we didn't have this transition code to sync the CR status into the new secret backend.

But still curious about the question, can the user in a way create the Secret with the release information "manually"?

It would be difficult to do it by hand, since the Helm package responsible for creating the release secrets stores the state as a gzipped protobuf. Theoretically, they could be restored from a backup though.

Copy link
Contributor

@hasbro17 hasbro17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying!

lgtm 🎉

Copy link
Member

@shawn-hurley shawn-hurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I would consider moving the code from the internal function into the external one, just removes the extra layer.

I would also consider renaming the unstructured arg r to something like un or cr or resource. Just didn't know what it was as I was reading the code and had to read back to figure it out.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 18, 2019
@joelanford joelanford merged commit c412807 into operator-framework:master Mar 18, 2019
@joelanford joelanford deleted the helm-storage-cr-namespace branch March 18, 2019 19:46
@joelanford joelanford mentioned this pull request Mar 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants