This repository has been archived by the owner on Jan 25, 2019. It is now read-only.

Improve release name uniqueness and ownership #57

Merged: 6 commits, Oct 25, 2018

Conversation

joelanford (Member)

Description of change
This PR updates the way the release name is determined for a CR, first by checking new annotations that allow users to customize the release name. If no annotations are set, it falls back to using a tiller-provided release name, which is guaranteed to be unique.

The PR also adds checks to ensure that a CR "owns" a release name before making changes for that release. This prevents two CRs from updating each other's releases and prevents a CR from uninstalling another CR's release.

Motivation for change
See #49 for details

	if err != nil {
		return r, needsUpdate, fmt.Errorf("install error: %s", err)
	}
	needsUpdate = true
	logrus.Infof("Installed release for %s release=%s", ResourceString(r), updatedRelease.GetName())

} else if status.Release == nil {
hasbro17 (Collaborator)

@joelanford I'm wondering if this scenario is possible:

  • There is no deployed release, so we install the release c.installRelease()
  • We fail to update the status.Release
  • Retry the reconcile loop and this time c.storageBackend.Deployed() returns latestRelease, nil.
  • status.Release==nil from our last failed status update and we return an error thinking this release is not owned by the CR (even though it actually is).

joelanford (Member Author)

Yeah, that definitely sounds possible. Off the top of my head, I can think of a few things we can try:

  • If we fail to update, try to roll back the release to whatever the previous status.Release was, or uninstall it if it was nil. Depending on the error, the rollback may also fail.
  • Rather than using c.storageBackend.Deployed() as the latestRelease, use status.Release.
  • Keep an in-memory map of resource UUID to release name.
  • See if there's a way to store the resource UUID in the release (maybe with the chart values), such that c.storageBackend.Deployed() would "know" its owner.

Each of these may cause other interesting failure scenarios though, so I'd have to experiment to know whether any of these are feasible.

Any other ideas?

joelanford (Member Author)

See if there's a way to store the resource UUID in the release (maybe with the chart values), such that c.storageBackend.Deployed() would "know" its owner.

Turns out it's pretty easy to add an arbitrary config value to the release, which gets saved with the release in the storage backend. It seems like this is probably more robust than the other options. The latest commit changes the ownership checks to use this method instead.
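For reference, the general shape of that approach is something like the sketch below. The values key and helper names here are only illustrative, not the exact identifiers from this PR:

package controller

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// Illustrative key under which the owner UID is stored with the release values;
// the actual key name used in the PR may differ.
const uidValuesKey = "helm.operator-sdk/cr-uid"

// withOwnerUID merges the CR's UID into the values that get stored with the release.
func withOwnerUID(values map[string]interface{}, r *unstructured.Unstructured) map[string]interface{} {
	if values == nil {
		values = map[string]interface{}{}
	}
	values[uidValuesKey] = string(r.GetUID())
	return values
}

// ownedBy reports whether a deployed release's stored values carry this CR's UID.
func ownedBy(releaseValues map[string]interface{}, r *unstructured.Unstructured) bool {
	uid, ok := releaseValues[uidValuesKey].(string)
	return ok && uid == string(r.GetUID())
}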

	return fmt.Sprintf("%s-%s", operatorName, r.GetName())

func getReleaseName(r *unstructured.Unstructured) string {
	status := v1alpha1.StatusFor(r)
	if status.Release != nil {
hasbro17 (Collaborator)

@joelanford The implementation seems fine to me but I'm thinking we should try our best to avoid relying on Status for our control logic. That seems to be the upstream convention as well.

https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#spec-and-status
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/principles.md#api

Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation.

The valueResourceUID is useful in letting us verify the correct owner of a pre-existing release.
But we still end up storing the release name in the Status to look up again in the future.

Right now with this implementation what happens if:

  • We deploy a new release with no release name, which generates a new unique release name, e.g. "foo"
      c.installRelease(tiller, r.GetNamespace(), "", chart, config)
  • Fail to update the status with updatedRelease=foo
  • Retry the reconcile. See that status.Release is empty from our last failed attempt.
  • Retry deploying the release with no release name
      c.installRelease(tiller, r.GetNamespace(), "", chart, config)
  • Does that give us a new deployed release with a new unique name updatedRelease=bar?

Ideally we could make the UUID part of the release name, e.g. cr.name+UUID, to make it fully unique. That avoids the problem of saving the release name, since we can reconstruct it just by looking at the CR, and it also solves the ownership problem.
But there's the issue of the 53 char limit. I know UUIDs are 36 chars, so we could stay within the limit, but I'm wondering: how human-readable do release names need to be?

joelanford (Member Author)

Yeah, you're right. That would result in a separate new release.

Thanks for linking the conventions and principles. That's good info!

I did think about using UUIDs in the release name, but there are some downsides:

  • As you point out, UUIDs would take up 36 of the 53 characters, leaving only 17 if we wanted to include the CR name. That may be cutting it a little too close for comfort.
  • Release names end up in the generated resource names for all helm chart resources, so users would end up with really long names for everything managed by the helm operator.

We could do something like cr.Name + last_N_chars(cr.UUID), where N is < 36. Assuming the UUIDs are uniformly random hex characters, the probability of collision would likely be sufficiently low at N=8.
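(Back-of-the-envelope, assuming those 8 hex chars really were uniformly random: with 16^8 ≈ 4.3 x 10^9 possible suffixes, the birthday bound puts the collision probability for k CRs at roughly k^2 / (2 * 16^8), i.e. about 1.2 x 10^-4 for k = 1,000.)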

If we want better human readability, another option could be that we add the release name to the CR (somewhere in metadata or spec maybe?) prior to any attempt to invoke the release installation, update, etc. That way a CR never results in any deployed release until it has been updated with a release name.

Increases uniqueness of release name while also minimizing complexity
that would arise with generating and tracking release name ownership
across CR updates and reconciliation.

Removes release name annotations, since it would be unnecessarily
complex to handle a release name change due to a new or updated
annotation value.
joelanford (Member Author)

Reverted ownership checks and updated getReleaseName(r) to use the CR name and the first 8 hex chars of the CR UID.

I also removed the annotations since it would be difficult to handle release name changes in the case where the annotations are added or updated after a release has been made.

func releaseName(r *unstructured.Unstructured) string {
	return fmt.Sprintf("%s-%s", operatorName, r.GetName())
}

func getReleaseName(r *unstructured.Unstructured) string {
	shortUID := strings.Replace(string(r.GetUID()), "-", "", -1)[:8]
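Filled out (the diff excerpt above omits the new function's return statement), the updated function is roughly the following sketch, based on the description above:

package controller

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// getReleaseName builds the release name from the CR name plus the first
// 8 hex chars of the CR's UID (hyphens stripped); the exact formatting in
// the PR may differ.
func getReleaseName(r *unstructured.Unstructured) string {
	shortUID := strings.Replace(string(r.GetUID()), "-", "", -1)[:8]
	return fmt.Sprintf("%s-%s", r.GetName(), shortUID)
}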
hasbro17 (Collaborator)

Although perhaps unlikely, we could still get collisions with just the last 8 chars of the UUID.
I think some parts of the UUID are usually constant, depending on the implementation.

We could reduce those odds by base64 encoding the UUID from 32 chars down to 22 chars.
e.g. https://gist.github.com/hasbro17/201d727ca6402c067dd399a8a059c32c
based on https://medium.com/@ablab/url-safe-uuid-encoding-in-go-4ba3b4ce4206

Or we could hash the 32 chars (minus the 4 hyphens) down to 8 or 10 chars. That might be a little safer than just lopping off the last 8.

So we can explore the best option, but at least add a TODO for now on how we could improve the short UID generation.
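A rough sketch of the base64 idea above (identifiers here are illustrative, and the follow-up below explains why this ultimately didn't work for resource names): decode the 36-char UUID into its 16 raw bytes and re-encode them with unpadded URL-safe base64, which yields 22 chars.

package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"strings"
)

// shortUID re-encodes a standard 36-char UUID string as 22 chars of
// URL-safe, unpadded base64. Error handling is omitted for brevity.
func shortUID(uid string) string {
	raw, _ := hex.DecodeString(strings.Replace(uid, "-", "", -1)) // 16 bytes
	return base64.RawURLEncoding.EncodeToString(raw)              // 22 chars
}

func main() {
	fmt.Println(shortUID("bf852ded-d7ea-11e8-ab5b-88c3b95d461f"))
}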

joelanford (Member Author)

Looks like Kubernetes uses UUID v1, which uses a combination of the clock and the MAC address of the host.

I spun up a deployment with two replicas and compared their UIDs:

$ kubectl get pods -o jsonpath={..metadata.uid} -l release=app1-bf7534ac | tr " " "\n"
bf852ded-d7ea-11e8-ab5b-88c3b95d461f
bf85debb-d7ea-11e8-ab5b-88c3b95d461f

Only the first 8 chars differ, so my initial stab (actually using the first 8, not last 8) was complete luck. I'm thinking this is worth fixing before it's integrated into the SDK so we don't have to deal with users' release names changing when they upgrade their base helm operator image.

I like the base64 idea, so I'll give that a go. That would give users 31 characters for the CR name, which seems like enough. I'll test what happens with a long CR name and add an additional check if necessary.

joelanford (Member Author)

Turns out k8s doesn't like uppercase letters in resource names, so the base64 approach didn't work:

ERRO[0003] failed to reconcile release for apiVersion=apache.org/v1alpha1 kind=Tomcat name=default/app1: install error: release app1-v3U0rNfqEeirW4jDuV1GHw failed: Service "app1-v3U0rNfqEeirW4jDuV1GHw-tomcat" is invalid: metadata.name: Invalid value: "app1-v3U0rNfqEeirW4jDuV1GHw-tomcat": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name',  or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')

I switched to a base36 encoding (lowercase letters and digits 0-9), which works. It increases the length to 25 characters, so still not too bad.

INFO[0235] Reconciled release for apiVersion=apache.org/v1alpha1 kind=Tomcat name=default/app1 release=app1-41arrw2nar998e3vumtsue9lr
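The base36 variant is along these lines (a sketch, not necessarily the exact helper from the PR): treat the 16 UID bytes as one big integer and print it in base 36, which yields at most 25 lowercase alphanumeric chars.

package main

import (
	"encoding/hex"
	"fmt"
	"math/big"
	"strings"
)

// shortUID renders a standard 36-char UUID as base36 (digits and lowercase
// letters), at most 25 chars. Error handling is omitted for brevity.
func shortUID(uid string) string {
	raw, _ := hex.DecodeString(strings.Replace(uid, "-", "", -1)) // 16 bytes
	return new(big.Int).SetBytes(raw).Text(36)
}

func main() {
	fmt.Println(shortUID("bf852ded-d7ea-11e8-ab5b-88c3b95d461f"))
}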

The resulting resource names are a bit long, but maybe still within the realm of reasonableness:

$ kubectl get all -l release=app1-41arrw2nar998e3vumtsue9lr
NAME                                                         READY   STATUS    RESTARTS   AGE
pod/app1-41arrw2nar998e3vumtsue9lr-tomcat-75b55c8855-9frcr   2/2     Running   0          5m
pod/app1-41arrw2nar998e3vumtsue9lr-tomcat-75b55c8855-qv966   0/2     Pending   0          5m

NAME                                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/app1-41arrw2nar998e3vumtsue9lr-tomcat   ClusterIP   10.109.202.233   <none>        80/TCP    5m

NAME                                                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/app1-41arrw2nar998e3vumtsue9lr-tomcat   2         2         2            1           5m

NAME                                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/app1-41arrw2nar998e3vumtsue9lr-tomcat-75b55c8855   2         2         1       5m

For release names that are too long, the operator prints a helpful error message:

ERRO[0000] failed to reconcile release for apiVersion=apache.org/v1alpha1 kind=Tomcat name=default/my-super-long-name-for-a-tomcat: install error: release name "my-super-long-name-for-a-tomcat-86jzwwlaodwvoiarjie4v87bz" exceeds max length of 53

As a separate bit of work, I don't think we're actually updating the CR status when we fail, so I'll add an issue for that.

WDYT?

hasbro17 (Collaborator)

Yes, let's follow up on the status update in a separate PR/issue.

And there might be a way to shorten the UUID further, maybe by hashing it. But this is good enough for now.

hasbro17 (Collaborator) left a comment

LGTM after nit
