Unversioned deployment with multiple pods does not update all pods #110

Closed
larte opened this issue Oct 27, 2017 · 22 comments

@larte

larte commented Oct 27, 2017

I have a deployment with 5 replicas following a :latest tag. From the logs, I can see that keel resets the image to 0.0.0 and after 5 seconds applies :latest.

The deployment seems to revert back to the previous ReplicaSet, and the rollout does not continue for the remaining pods. The single pod that was recreated still belongs to the same ReplicaSet, but the latest image was pulled via imagePullPolicy. During the reset, I can see 2 pods in ErrImagePull state.

I'm running keel 0.5.0-rc.1 with native webhooks and keel.sh/policy: force on Kubernetes 1.7.x.

The events from the deployment are:

Normal  ScalingReplicaSet  2m     deployment-controller  Scaled up replica set app-1687492293 to 1
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled down replica set app-3177265212 to 4
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled up replica set app-1687492293 to 2
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled up replica set app-3177265212 to 5
Normal  ScalingReplicaSet  2m     deployment-controller  Scaled down replica set app-1687492293 to 0

and in the end, the pods are:

Name                        RC                             CREATED
app-3177265212-40zzc    app-3177265212   2017-10-27T13:26:34Z
app-3177265212-85jsp    app-3177265212   2017-10-27T13:43:07Z   <- This pod has changed
app-3177265212-b2mws    app-3177265212   2017-10-27T13:26:34Z
app-3177265212-p9mkc    app-3177265212   2017-10-27T13:26:34Z
app-3177265212-qth3s    app-3177265212   2017-10-27T13:26:34Z
@rusenask
Collaborator

It seems that Kubernetes doesn't manage to destroy the existing replicas in time. Could you try scaling down to 1 replica and then trying an update? If that solves the issue, Keel could do it for you. I imagine the workflow could be:

  1. Set replicas to 1 or 0
  2. Set tag to 0.0.0
  3. Set replica count and tag to whatever you had before
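
For illustration, a rough sketch of the scale-down/scale-up part of that workflow using client-go (recent API signatures rather than the release-6.0 mentioned later in this thread; the package, function names and timing are made up for the sketch, not Keel's actual code):

package forceupdate

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// forceUpdateByScaling scales a deployment down to zero and back so that every
// pod is recreated and, with imagePullPolicy: Always, pulls the tag again.
func forceUpdateByScaling(ctx context.Context, client kubernetes.Interface, ns, name string) error {
    deployments := client.AppsV1().Deployments(ns)

    dep, err := deployments.Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }

    // remember the desired replica count (defaults to 1 when unset)
    original := int32(1)
    if dep.Spec.Replicas != nil {
        original = *dep.Spec.Replicas
    }

    zero := int32(0)
    dep.Spec.Replicas = &zero
    if dep, err = deployments.Update(ctx, dep, metav1.UpdateOptions{}); err != nil {
        return err
    }

    // give the controller time to terminate the old pods before scaling back up
    time.Sleep(10 * time.Second)

    dep.Spec.Replicas = &original
    _, err = deployments.Update(ctx, dep, metav1.UpdateOptions{})
    return err
}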

@larte
Author

larte commented Oct 27, 2017

That would work for non-production workloads, which applies to working with a latest tag anyway.

The other option would be to set spec.strategy.type to 'Recreate' in the deployment, which also results in some downtime but wouldn't require changes in keel.

I'm currently trying out a very rough patch where no reset is performed; instead, an env variable is set on each container, resulting in a new ReplicaSet each time. What is your opinion on this? I remember seeing some discussion earlier on the force-policy feature ticket.
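
For reference, a minimal sketch of that env-variable approach with client-go (the FORCE_UPDATE_AT marker name and the function are hypothetical, not the actual patch):

package forceupdate

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// forceUpdateByEnvVar bumps a timestamp env var on every container so the pod
// template changes and the deployment controller rolls out a new ReplicaSet.
func forceUpdateByEnvVar(ctx context.Context, client kubernetes.Interface, ns, name string) error {
    deployments := client.AppsV1().Deployments(ns)

    dep, err := deployments.Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }

    // the marker name is arbitrary; any template change triggers a rollout
    stamp := corev1.EnvVar{Name: "FORCE_UPDATE_AT", Value: time.Now().Format(time.RFC3339)}
    for i := range dep.Spec.Template.Spec.Containers {
        c := &dep.Spec.Template.Spec.Containers[i]
        updated := false
        for j := range c.Env {
            if c.Env[j].Name == stamp.Name { // refresh the marker if it already exists
                c.Env[j].Value = stamp.Value
                updated = true
            }
        }
        if !updated {
            c.Env = append(c.Env, stamp)
        }
    }

    _, err = deployments.Update(ctx, dep, metav1.UpdateOptions{})
    return err
}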

@rusenask
Collaborator

rusenask commented Oct 27, 2017

Could work. Another option is to terminate pods; it could be done "slowly", so it's almost like a rolling update. Regarding non-production workloads - I guess it's reasonable to expect that production workloads would be versioned.

Regarding that patch - feel free to open a work in progress PR :)

@taylorchu

I tested keel 0.4.7 with GKE server version 1.8, and "force update" does not work for me.

Here is the sequence of events that happened:

1. the scheduler assigns the allocation to a node, and the replicaset-controller creates a new pod.
2. the image is set to 0.0.0 by keel. Since the image does not exist, it shows `Failed to pull image` and `Error syncing pod`.
3. the node backs off pulling the image.
4. the new pod is deleted.

The notification I got said the image was updated successfully, yet the pod was not updated at all. (There is only 1 pod.)

Instead of pulling tag 0.0.0, which creates unexpected "fail to pull" events in the cluster (confusing unless you know keel very well), we only have to change the replica count from N to 0 and then from 0 to N.

@larte @rusenask

@rusenask
Collaborator

rusenask commented Nov 7, 2017

Seems like the k8s scheduler behaviour changed. I think force update should be reimplemented with your suggestion. Seems like a clean approach.

@taylorchu do you have to wait a little bit when you set replicas to 0, or does it terminate pods immediately?

@taylorchu

No, I do not set the replica count to 0 myself.

@rusenask
Collaborator

@taylorchu started looking at this issue. One problem with setting replicas to 0 is that the autoscaler would stop working (it has to be unset).

What about terminating all pods? That would result in k8s recreating them. If it were done with some breaks in the process, it could even mean no downtime.
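
A minimal sketch of that pod-termination idea with client-go (the fixed pause is a crude placeholder for a proper readiness check; this is illustrative, not Keel's implementation):

package forceupdate

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// forceUpdateByPodDeletion deletes a deployment's pods one at a time, pausing
// between deletions so replacements can come up before the next pod goes away.
// With imagePullPolicy: Always, each replacement pulls the tag again.
func forceUpdateByPodDeletion(ctx context.Context, client kubernetes.Interface, ns, selector string, pause time.Duration) error {
    pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
    if err != nil {
        return err
    }
    for _, pod := range pods.Items {
        if err := client.CoreV1().Pods(ns).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
            return err
        }
        time.Sleep(pause) // crude stand-in for waiting until the replacement pod is Ready
    }
    return nil
}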

@The-Loeki
Contributor

We're just starting to use Keel (on GCP K8s 1.8.7) and are hit with this problem on 0.6.1.
It occurred to me that the only reason Keel has to do this (and apparently b0rks it up) is that Replication Controllers already explicitly refuse to redeploy an unchanged spec (as per the K8s docs).

As my 5c, I think emulating the rolling update would be the cleanest way to go.

Also, we're quite happy running (a carefully selected set of) latest-tagged containers in production, and some apps have a gitflow where the master branch is always deemed stable; merges to that branch are only approved once they're production-ready, which again makes a stable/latest tag a valid use case.

@rusenask
Collaborator

rusenask commented Mar 1, 2018

Hi, thanks. Will get this sorted ASAP. Do you think my suggested strategy of terminating pods would do the job? A terminated pod will always pull the new version, as I understand it.

@The-Loeki
Contributor

Well AFAIK you'd need to set imagePullPolicy: Always on the Deployment, but other than that, we'd be perfectly happy with it; it's pretty much what we're doing manually now.
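
For completeness, a tiny sketch of that prerequisite using the client-go types (illustrative only): flip every container in the pod template to Always before relying on pod recreation.

package forceupdate

import (
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
)

// ensureAlwaysPull sets imagePullPolicy: Always on every container so that
// recreated pods actually pull the :latest tag again instead of reusing the
// cached image on the node.
func ensureAlwaysPull(dep *appsv1.Deployment) {
    for i := range dep.Spec.Template.Spec.Containers {
        dep.Spec.Template.Spec.Containers[i].ImagePullPolicy = corev1.PullAlways
    }
}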

@rusenask
Collaborator

rusenask commented Mar 1, 2018

Awesome, I am a bit swamped by work these days but will try to add and test this strategy either this evening or on the weekend :)

@The-Loeki
Contributor

That would be awesome! We'll be more than happy to help you test the changes if you like.

@rusenask
Collaborator

rusenask commented Mar 4, 2018

Hi @The-Loeki, just pushed an alpha tag built from #154. Did some testing and it seems to be a reliable way to force updates for the same image tag.

It would be nice if you did more testing, as it should also solve the other issue, #153 (I even added a unit test for that specific Docker registry :)).

Migrated client-go (which is now split into multiple repos) to release-6.0, which should ensure that everything works for the foreseeable future. There were a bunch of other dependency updates which required more changes (how we parse images), so any additional testing would be really welcome :)

@The-Loeki
Contributor

The-Loeki commented Mar 9, 2018

Hi @rusenask thanks for your hard work :) Today we've done the first round of testing on the alpha tag.

The Good

  • Zalando works w00t
  • We've done two deployment upgrades with 'latest' tags to see how it works and it looks nice for now. We'll be doing a bunch more of those and be sure to let you know!

The Bad

  • RBAC permissions need to be fixed to allow deletion of pods.
    Question: Do you need to delete replicasets and replicacontrollers as well?
    I'll hack up a PR

The Ugly

time="2018-03-09T10:23:29Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=coreos/dex tag=v2.9.0
2018/03/09 10:23:30 registry failed ping request, error: Get https://quay.io/v2/: http: non-successful response (status=401 body="{\"error\": \"Invalid bearer token format\"}")
time="2018-03-09T10:23:30Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/coreos/dex/manifests/v2.9.0 repository=coreos/dex reference=v2.9.0"
time="2018-03-09T10:23:30Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:c9ab4b2f064b8dd3cde614af50d5f1c49d6c45603ce377022c15bc9aa217e2db" image="quay.io/coreos/dex:v2.9.0" job_name=quay.io/coreos/dex schedule="@every 24h"
time="2018-03-09T10:23:37Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T10:23:37Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T10:23:37Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=jetstack/cert-manager-controller tag=v0.2.3
2018/03/09 10:23:37 registry failed ping request, error: Get https://quay.io/v2/: http: non-successful response (status=401 body="{\"error\": \"Invalid bearer token format\"}")
time="2018-03-09T10:23:37Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 repository=jetstack/cert-manager-controller reference=v0.2.3"
time="2018-03-09T10:23:37Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:6bccc03f2e98e34f2b1782d29aed77763e93ea81de96f246ebeb81effd947085" image="quay.io/jetstack/cert-manager-controller:v0.2.3" job_name=quay.io/jetstack/cert-manager-controller schedule="@every 24h"
time="2018-03-09T10:24:35Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T10:24:35Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T10:25:30Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"

curl -m 5 -Lv -H "Content-Type: application/json" https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 of course 'just works'

I'd venture from the logs that it tries to auth against Quay with the empty/nonexistent secret or something, but that's just a guess.
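
For what it's worth, a minimal sketch of the digest lookup without the /v2/ ping, using only the Go standard library (this is not Keel's registry client, and it assumes a public repository where no bearer token is needed, which is what the curl test above suggests):

package forceupdate

import (
    "fmt"
    "net/http"
)

// manifestDigest fetches an image digest from quay.io with a single HEAD
// request against the manifest endpoint, skipping the /v2/ ping that produced
// the 401 noise in the log above.
func manifestDigest(repository, tag string) (string, error) {
    url := fmt.Sprintf("https://quay.io/v2/%s/manifests/%s", repository, tag)
    req, err := http.NewRequest(http.MethodHead, url, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status %d for %s", resp.StatusCode, url)
    }
    // the registry returns the digest in a response header on HEAD requests
    return resp.Header.Get("Docker-Content-Digest"), nil
}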

@rusenask
Collaborator

rusenask commented Mar 9, 2018

Hi @The-Loeki thanks for trying it out :)

Great regarding the good part.

As for the ugly part, maybe it's angry about empty credentials (try sending empty basic auth). Not sure what changed, though. I will dig into it.

@The-Loeki
Contributor

Did you see my updated comments? I'm hacking up a PR with fixed RBAC perms, but I'm not sure if you need to be able to delete replicasets & controllers too?

@rusenask
Collaborator

rusenask commented Mar 9, 2018

No, I didn't see it. Yeah, totally forgot that perms are required for deletion. Only pod deletion permissions are required, thanks!

Regarding Quay:

After adding a simple unit test that does pretty much the same thing as the Zalando registry test, Quay returns an error (every registry wants to be unique). Will get it fixed.

@The-Loeki
Contributor

We'll be deploying Harbor as our own registry service soon, so you might want to get more coffee ;)

@rusenask
Collaborator

rusenask commented Mar 9, 2018

at least it's open source :)

@rusenask
Collaborator

rusenask commented Mar 9, 2018

Apparently that error was just a log of a failed ping; the manifest was retrieved successfully. I have removed the Ping function from the registry client, as I can see that the public index.docker.io doesn't have that endpoint anymore either. A new alpha image is available.

Merging into master branch.

@The-Loeki
Contributor

Looks much better indeed

[theloeki@murphy ~]$ kubectl -n kube-system logs -f keel-85f9fd6447-4gtt2 |grep quay
time="2018-03-09T12:19:13Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=coreos/dex tag=v2.9.0
time="2018-03-09T12:19:13Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/coreos/dex/manifests/v2.9.0 repository=coreos/dex reference=v2.9.0"
time="2018-03-09T12:19:14Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:c9ab4b2f064b8dd3cde614af50d5f1c49d6c45603ce377022c15bc9aa217e2db" image="quay.io/coreos/dex:v2.9.0" job_name=quay.io/coreos/dex schedule="@every 24h"
time="2018-03-09T12:19:18Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T12:19:18Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io
time="2018-03-09T12:19:18Z" level=debug msg="registry client: getting digest" registry="https://quay.io" repository=jetstack/cert-manager-controller tag=v0.2.3
time="2018-03-09T12:19:18Z" level=debug msg="registry.manifest.head url=https://quay.io/v2/jetstack/cert-manager-controller/manifests/v0.2.3 repository=jetstack/cert-manager-controller reference=v0.2.3"
time="2018-03-09T12:19:19Z" level=info msg="trigger.poll.RepositoryWatcher: new watch repository tags job added" digest="sha256:6bccc03f2e98e34f2b1782d29aed77763e93ea81de96f246ebeb81effd947085" image="quay.io/jetstack/cert-manager-controller:v0.2.3" job_name=quay.io/jetstack/cert-manager-controller schedule="@every 24h"
time="2018-03-09T12:20:15Z" level=debug msg="secrets.defaultGetter.lookupSecrets: pod secrets found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" provider=helm registry=quay.io secrets="[]"
time="2018-03-09T12:20:15Z" level=debug msg="secrets.defaultGetter.lookupSecrets: no secrets for image found" image=quay.io/jetstack/cert-manager-controller namespace=kube-system pod_selector="app=cert-manager,release=cert-manager" pods_checked=1 provider=helm registry=quay.io

@rusenask
Collaborator

Fixed, available from 0.7.x.
