Unversioned deployment with multiple pods does not update all pods #110

Closed
larte opened this issue Oct 27, 2017 · 22 comments

@larte

larte commented Oct 27, 2017

I have a deployment with 5 replicas following a :latest tag. From the logs, I can see that keel resets the image to 0.0.0 and after 5 seconds applies :latest.

The deployment seems to revert back to the previous ReplicaSet, and the rollout does not continue for the remaining pods. The single pod that was recreated still belongs to the same ReplicaSet, but the latest image was pulled via imagePullPolicy. During the reset, I can see 2 pods in ErrImagePull state.

I'm running keel 0.5.0-rc.1 with native webhooks and keel.sh/policy: force on Kubernetes 1.7.x.

The events from the deployment are:

Normal  ScalingReplicaSet  2m     deployment-controller  Scaled up replica set app-1687492293 to 1
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled down replica set app-3177265212 to 4
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled up replica set app-1687492293 to 2
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled up replica set app-3177265212 to 5
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled down replica set app-1687492293 to 0

and in the end, the pods are:

Name                        RC                             CREATED
app-3177265212-40zzc    app-3177265212   2017-10-27T13:26:34Z
app-3177265212-85jsp    app-3177265212   2017-10-27T13:43:07Z   <- This pod has changed
app-3177265212-b2mws    app-3177265212   2017-10-27T13:26:34Z
app-3177265212-p9mkc    app-3177265212   2017-10-27T13:26:34Z
app-3177265212-qth3s    app-3177265212   2017-10-27T13:26:34Z
@rusenask
Collaborator

It seems that Kubernetes doesn't manage to destroy the existing replicas in time. Could you try scaling down to 1 replica and then trying an update? If that solves the issue, Keel could do it for you. I imagine the workflow could be:

  1. Set replicas to 1 or 0
  2. Set tag to 0.0.0
  3. Set replica count and tag to whatever you had before
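
For illustration, a rough sketch of the scale-down/scale-up part of that workflow using client-go (recent API signatures rather than the release-6.0 mentioned later in this thread; the package, function names and timing are made up for the sketch, not Keel's actual code):

package forceupdate

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// forceUpdateByScaling scales a deployment down to zero and back so that every
// pod is recreated and, with imagePullPolicy: Always, pulls the tag again.
func forceUpdateByScaling(ctx context.Context, client kubernetes.Interface, ns, name string) error {
    deployments := client.AppsV1().Deployments(ns)

    dep, err := deployments.Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }

    // remember the desired replica count (defaults to 1 when unset)
    original := int32(1)
    if dep.Spec.Replicas != nil {
        original = *dep.Spec.Replicas
    }

    zero := int32(0)
    dep.Spec.Replicas = &zero
    if dep, err = deployments.Update(ctx, dep, metav1.UpdateOptions{}); err != nil {
        return err
    }

    // give the controller time to terminate the old pods before scaling back up
    time.Sleep(10 * time.Second)

    dep.Spec.Replicas = &original
    _, err = deployments.Update(ctx, dep, metav1.UpdateOptions{})
    return err
}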

@larte
Author

larte commented Oct 27, 2017

That would work for non-production workloads, which applies to working with a latest tag anyway.

The other option would be to set spec.strategy.type to 'Recreate' in the deployment, which also results in some downtime but wouldn't require changes in keel.

I'm currently trying out a very rough patch where no reset is performed; instead, an env variable is set on each container, resulting in a new ReplicaSet each time. What is your opinion on this? I remember seeing some discussion earlier on the force-policy feature ticket.
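
For reference, a minimal sketch of that env-variable approach with client-go (the FORCE_UPDATE_AT marker name and the function are hypothetical, not the actual patch):

package forceupdate

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// forceUpdateByEnvVar bumps a timestamp env var on every container so the pod
// template changes and the deployment controller rolls out a new ReplicaSet.
func forceUpdateByEnvVar(ctx context.Context, client kubernetes.Interface, ns, name string) error {
    deployments := client.AppsV1().Deployments(ns)

    dep, err := deployments.Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }

    // the marker name is arbitrary; any template change triggers a rollout
    stamp := corev1.EnvVar{Name: "FORCE_UPDATE_AT", Value: time.Now().Format(time.RFC3339)}
    for i := range dep.Spec.Template.Spec.Containers {
        c := &dep.Spec.Template.Spec.Containers[i]
        updated := false
        for j := range c.Env {
            if c.Env[j].Name == stamp.Name { // refresh the marker if it already exists
                c.Env[j].Value = stamp.Value
                updated = true
            }
        }
        if !updated {
            c.Env = append(c.Env, stamp)
        }
    }

    _, err = deployments.Update(ctx, dep, metav1.UpdateOptions{})
    return err
}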

@rusenask
Collaborator

rusenask commented Oct 27, 2017

Could work. Another option is to terminate pods; it could be done "slowly", so it's almost like a rolling update. Regarding non-production workloads - I guess it's reasonable to expect that production workloads would be versioned.

Regarding that patch - feel free to open a work in progress PR :)

@taylorchu

I tested keel 0.4.7 with GKE server version 1.8, and "force update" does not work for me.

Here is the sequence of events that happened:

1. the scheduler assigns the allocation to a node, and the replicaset-controller creates a new pod.
2. the image is set to 0.0.0 by keel. Since the image does not exist, it shows `Failed to pull image` and `Error syncing pod`.
3. the node backs off pulling the image.
4. the new pod is deleted.

The notification I got said the image was updated successfully, yet the pod was not updated at all. (There is only 1 pod.)

Instead of pulling tag 0.0.0, which creates unexpected "fail to pull" events in the cluster (confusing unless you know keel very well), we only have to change the replica count from N to 0 and then from 0 to N.

@larte @rusenask

@rusenask
Collaborator

rusenask commented Nov 7, 2017

Seems like the k8s scheduler behaviour changed. I think force update should be reimplemented with your suggestion. Seems like a clean approach.

@taylorchu do you have to wait a little bit when you set replicas to 0, or does it terminate pods immediately?

@taylorchu

No, I do not set the replica count to 0 myself.

@rusenask
Collaborator

@taylorchu started looking at this issue. One problem with setting replicas to 0 is that the autoscaler would stop working (it has to be unset).

What about terminating all pods? That would result in k8s recreating them. If it were done with some breaks in the process, it could even mean no downtime.
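
A minimal sketch of that pod-termination idea with client-go (the fixed pause is a crude placeholder for a proper readiness check; this is illustrative, not Keel's implementation):

package forceupdate

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// forceUpdateByPodDeletion deletes a deployment's pods one at a time, pausing
// between deletions so replacements can come up before the next pod goes away.
// With imagePullPolicy: Always, each replacement pulls the tag again.
func forceUpdateByPodDeletion(ctx context.Context, client kubernetes.Interface, ns, selector string, pause time.Duration) error {
    pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
    if err != nil {
        return err
    }
    for _, pod := range pods.Items {
        if err := client.CoreV1().Pods(ns).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
            return err
        }
        time.Sleep(pause) // crude stand-in for waiting until the replacement pod is Ready
    }
    return nil
}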

@The-Loeki
Contributor

We're just starting to use Keel (on GCP K8s 1.8.7) and are hit with this problem on 0.6.1.
It occurred to me that the only reason Keel has to do this (and apparently b0rks it up) is that Replication Controllers already explicitly refuse to redeploy an unchanged spec (as per the K8s docs).

As my 5c, I think emulating the rolling update would be the cleanest way to go.

Also, we're quite happy running (a carefully selected set of) latest-tagged containers in production, and some apps have a gitflow where the master branch is always deemed stable; merges to that branch are only approved once they're production-ready, which again makes a stable/latest tag a valid use case.

@rusenask
Collaborator

rusenask commented Mar 1, 2018

Hi, thanks. Will get this sorted ASAP. Do you think my suggested strategy of terminating pods would do the job? A terminated pod will always pull the new version, as I understand it.

@The-Loeki
Contributor

Well AFAIK you'd need to set imagePullPolicy: Always on the Deployment, but other than that, we'd be perfectly happy with it; it's pretty much what we're doing manually now.
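
For completeness, a tiny sketch of that prerequisite using the client-go types (illustrative only): flip every container in the pod template to Always before relying on pod recreation.

package forceupdate

import (
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
)

// ensureAlwaysPull sets imagePullPolicy: Always on every container so that
// recreated pods actually pull the :latest tag again instead of reusing the
// cached image on the node.
func ensureAlwaysPull(dep *appsv1.Deployment) {
    for i := range dep.Spec.Template.Spec.Containers {
        dep.Spec.Template.Spec.Containers[i].ImagePullPolicy = corev1.PullAlways
    }
}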

@rusenask
Collaborator

rusenask commented Mar 1, 2018

Awesome, I am a bit swamped by work these days but will try to add and test this strategy either this evening or on the weekend :)

@The-Loeki
Contributor

That would be awesome! We'll be more than happy to help you test the changes if you like.

@rusenask
Collaborator

rusenask commented Mar 4, 2018

Hi @The-Loeki, just pushed an alpha tag built from #154. Did some testing and it seems to be a reliable way to force updates for the same image tag.

It would be nice if you did more testing, as it should also solve the other issue, #153 (I even added a unit test for that specific Docker registry :)).

Migrated client-go (which is now split into multiple repos) to release-6.0, which should ensure that everything works for the foreseeable future. There were a bunch of other dependency updates which required more changes (how we parse images), so any additional testing would be really welcome :)

@The-Loeki
Contributor

The-Loeki commented Mar 9, 2018

Hi @rusenask thanks for your hard work :) Today we've done the first round of testing on the alpha tag.

The Good

  • Zalando works w00t
  • We've done two deployment upgrades with 'latest' tags to see how it works and it looks nice for now. We'll be doing a bunch more of those and be sure to let you know!

The Bad

  • RBAC permissions need to be fixed to allow deletion of pods.
    Question: Do you need to delete replicasets and replicacontrollers as well?
    I'll hack up a PR

The Ugly

time="2018-03-09T10:23:29Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=coreos/dex tag=v2.9.0
2018/03/09 10:23:30 registry failed ping request, error: Get https://quay.io/v2/: http: non-successful response (status=401 body="{\"error\": \"Invalid bearer token format\"}")
time="2018-03-09T10:23:30Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/coreos/dex/manifests/v2.9.0 repository=coreos/dex reference=v2.9.0"
time="2018-03-09T10:23:30Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:c9ab4b2f064b8dd3cde614af50d5f1c49d6c45603ce377022c15bc9aa217e2db" image="quay.io/coreos/dex:v2.9.0" job_name=quay.io/coreos/dex schedule="@every 24h"
time="2018-03-09T10:23:37Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T10:23:37Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T10:23:37Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=jetstack/cert-manager-controller tag=v0.2.3
2018/03/09 10:23:37 registry failed ping request, error: Get https://quay.io/v2/: http: non-successful response (status=401 body="{\"error\": \"Invalid bearer token format\"}")
time="2018-03-09T10:23:37Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 repository=jetstack/cert-manager-controller reference=v0.2.3"
time="2018-03-09T10:23:37Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:6bccc03f2e98e34f2b1782d29aed77763e93ea81de96f246ebeb81effd947085" image="quay.io/jetstack/cert-manager-controller:v0.2.3" job_name=quay.io/jetstack/cert-manager-controller schedule="@every 24h"
time="2018-03-09T10:24:35Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T10:24:35Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T10:25:30Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"

curl -m 5 -Lv -H "Content-Type: application/json" https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 of course 'just works'

I'd venture from the logs that it tries to auth against Quay with the empty/nonexistent secret or something, but that's just a guess.
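
For what it's worth, a minimal sketch of the digest lookup without the /v2/ ping, using only the Go standard library (this is not Keel's registry client, and it assumes a public repository where no bearer token is needed, which is what the curl test above suggests):

package forceupdate

import (
    "fmt"
    "net/http"
)

// manifestDigest fetches an image digest from quay.io with a single HEAD
// request against the manifest endpoint, skipping the /v2/ ping that produced
// the 401 noise in the log above.
func manifestDigest(repository, tag string) (string, error) {
    url := fmt.Sprintf("https://quay.io/v2/%s/manifests/%s", repository, tag)
    req, err := http.NewRequest(http.MethodHead, url, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status %d for %s", resp.StatusCode, url)
    }
    // the registry returns the digest in a response header on HEAD requests
    return resp.Header.Get("Docker-Content-Digest"), nil
}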

@rusenask
Collaborator

rusenask commented Mar 9, 2018

Hi @The-Loeki thanks for trying it out :)

Great regarding the good part.

As for the ugly part, maybe it's angry about empty credentials (try sending empty basic auth). Not sure what changed, though. I will dig into it.

@The-Loeki
Contributor

Did you see my updated comments? I'm hacking up a PR with fixed RBAC perms, but I'm not sure if you need to be able to delete replicasets & controllers too?

@rusenask
Collaborator

rusenask commented Mar 9, 2018

No, I didn't see it. Yeah, totally forgot that perms are required for deletion. Only pod deletion permissions are required, thanks!

Regarding Quay:

After adding a simple unit test that does pretty much the same thing as the Zalando registry test, Quay returns an error (every registry wants to be unique). Will get it fixed.

@The-Loeki
Contributor

We'll be deploying Harbor as our own registry service soon, so you might want to get more coffee ;)

@rusenask
Collaborator

rusenask commented Mar 9, 2018

at least it's open source :)

@rusenask
Collaborator

rusenask commented Mar 9, 2018

Apparently that error was just a log of a failed ping; the manifest was retrieved successfully. I have removed the Ping function from the registry client, as I can see that the public index.docker.io doesn't have that endpoint anymore either. A new alpha image is available.

Merging into master branch.

@The-Loeki
Contributor

Looks much better indeed

[theloeki@murphy ~]$ kubectl -n kube-system logs -f keel-85f9fd6447-4gtt2 |grep quay
time="2018-03-09T12:19:13Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=coreos/dex tag=v2.9.0
time="2018-03-09T12:19:13Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/coreos/dex/manifests/v2.9.0 repository=coreos/dex reference=v2.9.0"
time="2018-03-09T12:19:14Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:c9ab4b2f064b8dd3cde614af50d5f1c49d6c45603ce377022c15bc9aa217e2db" image="quay.io/coreos/dex:v2.9.0" job_name=quay.io/coreos/dex schedule="@every 24h"
time="2018-03-09T12:19:18Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T12:19:18Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T12:19:18Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=jetstack/cert-manager-controller tag=v0.2.3
time="2018-03-09T12:19:18Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 repository=jetstack/cert-manager-controller reference=v0.2.3"
time="2018-03-09T12:19:19Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:6bccc03f2e98e34f2b1782d29aed77763e93ea81de96f246ebeb81effd947085" image="quay.io/jetstack/cert-manager-controller:v0.2.3" job_name=quay.io/jetstack/cert-manager-controller schedule="@every 24h"
time="2018-03-09T12:20:15Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T12:20:15Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io

@rusenask
Collaborator

Fixed, available from 0.7.x.
