
Fluentd-scaler causing fluentd pod deletions and messes with ds-controller #61190

Closed
shyamjvs opened this issue Mar 14, 2018 · 37 comments · Fixed by #61225
Labels: kind/bug · milestone/needs-approval · priority/important-soon · sig/instrumentation

@shyamjvs
Member

Forking from #60500 (comment):

To summarize, here's what we observed:

  1. PATCH daemonset calls are coming every minute from both fluentd-scaler and addon-manager (verified by turning each of them on/off individually). Things we need to understand here (see also the diagnostic sketch after this list):
  • Is the daemonset object continuously toggling between 2 states? (we know that its RV is increasing continuously)
  • If yes, what field(s) in the object are changing? IIRC the value of some label/annotation (I think 'UpdatedPodsScheduled') is changing (probably related to 2 below)
  • Also, why does/should the fluentd-scaler send any API request if the resources are already set to the right value?
  2. Fluentd pods are getting deleted and recreated by the daemonset-controller when the scaler is enabled (as was also seen in #60500 (comment), "Increase in apiserver/c-m mem usage by 100-200 MB in 100-node cluster"). Why is this happening? One thing to note here - all those DELETE calls are preceded by PUT pod-status calls from the respective kubelets (but maybe that's expected).
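
To make the toggling question concrete, here is a minimal client-go sketch (for illustration only - it assumes a recent client-go with the context-taking Watch signature, and the kube-system/fluentd-gcp-v3.0.0 names) that watches the ds and prints the resourceVersion and status it sees on every event:

```go
// Illustrative diagnostic only (not from this issue): watch the fluentd
// DaemonSet and print its resourceVersion and status on every event, to see
// whether the spec is really stable while the RV keeps increasing.
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the fluentd DaemonSet in kube-system (names assumed).
	w, err := client.AppsV1().DaemonSets("kube-system").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=fluentd-gcp-v3.0.0",
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		ds, ok := ev.Object.(*appsv1.DaemonSet)
		if !ok {
			continue
		}
		// If only ResourceVersion and Status differ between events, the spec
		// is stable and the churn comes from status updates alone.
		fmt.Printf("%s rv=%s generation=%d status=%+v\n",
			ev.Type, ds.ResourceVersion, ds.Generation, ds.Status)
	}
}
```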

cc @kubernetes/sig-instrumentation-bugs @crassirostris @liggitt
/priority critical-urgent
/assign @x13n

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 14, 2018
@shyamjvs shyamjvs added this to the v1.10 milestone Mar 14, 2018
@shyamjvs
Member Author

Marked this as a release-blocker since at least some part of it seems like a correctness issue (i.e., fluentd pods dying and getting recreated).
Note also that this is causing increased mem-usage in the daemonset-controller (shown here: #60500 (comment)).

@x13n
Member

x13n commented Mar 15, 2018

Thanks for filing this!

  • Is the daemonset object continuously toggling between 2 states? (we know that its RV is increasing continuously)
  • If yes, what field(s) in the object are changing? IIRC the value of some label/annotation (I think 'UpdatedPodsScheduled') is changing (probably related to 2 below)

The only thing that was changing (apart from ds/fluentd-gcp status) was the resource version.

@x13n
Member

x13n commented Mar 15, 2018

I modified the fluentd scaler to use `--dry-run` with `kubectl set resources` to check whether there would be a change, but it still keeps updating even though there is no change. Apparently, the change-detection logic inside kubectl is broken...

@x13n
Member

x13n commented Mar 15, 2018

Btw, the addon manager sends tons of PATCH requests for all resources every minute anyway, so even though I will fix the fluentd-scaler not to send them, it won't solve everything.

x13n added a commit to x13n/kubernetes that referenced this issue Mar 15, 2018
Fixes kubernetes#61190.

This version verifies on its own whether resources should be updated or not, instead of relying on `kubectl set resources`.
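
The scaler itself lives outside this repo, but the idea is roughly the sketch below (the names and structure here are assumptions, not the scaler's actual code): compare the container's current requests/limits with the desired ones and skip the API call entirely when they already match.

```go
// Rough sketch of the "verify before updating" idea described above; not the
// fluentd-gcp-scaler's real implementation.
package scaler

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// resourcesUpToDate reports whether a container already has exactly the
// desired requests and limits, so the caller can skip a PATCH that would
// otherwise be a no-op.
func resourcesUpToDate(c corev1.Container, desired corev1.ResourceRequirements) bool {
	return quantitiesEqual(c.Resources.Requests, desired.Requests) &&
		quantitiesEqual(c.Resources.Limits, desired.Limits)
}

func quantitiesEqual(have, want corev1.ResourceList) bool {
	if len(have) != len(want) {
		return false
	}
	for name, wantQty := range want {
		haveQty, ok := have[name]
		if !ok || haveQty.Cmp(wantQty) != 0 {
			return false
		}
	}
	return true
}

// desiredRequirements builds the target requests from strings such as
// "100m" CPU and "200Mi" memory (placeholder values).
func desiredRequirements(cpu, memory string) corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse(cpu),
			corev1.ResourceMemory: resource.MustParse(memory),
		},
	}
}
```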
k8s-github-robot pushed a commit that referenced this issue Mar 15, 2018
Automatic merge from submit-queue (batch tested with PRs 60888, 61225). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Bump fluentd-gcp-scaler version

**What this PR does / why we need it**:
This version verifies on its own whether resources should be updated or not, instead of relying on `kubectl set resources`.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #61190


**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```

cc @shyamjvs
@jberkus

jberkus commented Mar 16, 2018

Should this issue actually be closed? It was auto-closed by the bot, but nobody has confirmed that the actual issue is fixed.

@krzyzacy krzyzacy reopened this Mar 16, 2018
@krzyzacy
Member

I'll let @shyamjvs confirm (looks like he'll be back on Monday).

@jdumars
Member

jdumars commented Mar 16, 2018

/status in-progress

@shyamjvs
Member Author

Yes, we should keep this issue open until we fully understand what's going wrong and fix the root cause.
I believe much of the discussion about this is happening under the related issue #61189.

@shyamjvs
Member Author

Ok... seems like I missed Wojtek's comment above.

@shyamjvs
Member Author

So I just checked the ds-controller code, and it seems like we update the ds status only if this condition is violated:

func storeDaemonSetStatus(dsClient unversionedapps.DaemonSetInterface, ds *apps.DaemonSet, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable int) error {
	if int(ds.Status.DesiredNumberScheduled) == desiredNumberScheduled &&
		int(ds.Status.CurrentNumberScheduled) == currentNumberScheduled &&
		int(ds.Status.NumberMisscheduled) == numberMisscheduled &&
		int(ds.Status.NumberReady) == numberReady &&
		int(ds.Status.UpdatedNumberScheduled) == updatedNumberScheduled &&
		int(ds.Status.NumberAvailable) == numberAvailable &&
		int(ds.Status.NumberUnavailable) == numberUnavailable &&
		ds.Status.ObservedGeneration >= ds.Generation {
		return nil
	}

Which means one of those fields probably isn't matching. I'll try to add some logs and run the test at --v=4 to get more info.
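
(For reference, "adding some logs at --v=4" means a guarded debug log along these lines - a made-up example using the glog package the controllers used at the time, not a line from the real controller:)

```go
// Hypothetical --v=4 guarded debug log; the helper and message are made up.
package daemon

import "github.com/golang/glog"

// logStatusMismatch only prints when kube-controller-manager runs with
// --v=4 or higher, so it is cheap to leave in for a debugging build.
func logStatusMismatch(ds, field string, want, got int) {
	glog.V(4).Infof("DaemonSet %s status mismatch on %s: want=%d got=%d", ds, field, want, got)
}
```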

@liggitt
Member

liggitt commented Mar 21, 2018

This is the code that is supposed to protect us from the situation where something must change when we write an object to etcd (the resource version is not stored in etcd):
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L325

So there should never be a situation where only the resource version is changing. If there is, it's a bug that needs to be fixed. @liggitt - FYI

anything is possible, but we do have tests ensuring this works properly (no-op updates do not actually result in an update+resourceVersion bump in etcd): https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store_test.go#L674-L728
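
For readers not following the links, the guard works roughly like this heavily simplified sketch (not the real store.go code): the updated object is serialized and compared against what is already stored, and an identical result means no write and no new resourceVersion.

```go
// Heavily simplified illustration of the no-op-update guard in
// staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go; not the real code.
package storage

import "bytes"

// guaranteedUpdateSketch applies the caller's update function and skips the
// etcd write entirely when the serialized object did not change, so a no-op
// update never bumps the resourceVersion.
func guaranteedUpdateSketch(existing []byte, update func([]byte) []byte) (wrote bool, stored []byte) {
	updated := update(existing)
	if bytes.Equal(existing, updated) {
		// No-op: return the current state untouched.
		return false, existing
	}
	// The real code issues an etcd transaction guarded by the key's
	// mod_revision here; only a successful write advances the resourceVersion.
	return true, updated
}
```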

@shyamjvs
Member Author

So I ran some 100-node experiments against #61472 with more debug logs, and I think I now understand the reason for this. The controller-manager logs with and without the scaler mainly show the following difference:

I0321 14:26:34.315600       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-j5vvh], deleting 1
I0321 14:27:26.218875       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-jl74c], deleting 1
I0321 14:28:26.005293       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-89dvm], deleting 1
I0321 14:29:24.596255       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-l4qtp], deleting 1

These logs show up only when the scaler is active, and the pod deletions are happening as part of the rolling upgrade of the daemonset to the newer version (the one having resources set). You can see the pods are deleted one by one, which also explains why it takes so long (and why deletions are still happening even 30 mins after cluster creation). The reason they're deleted one by one is that the ds-controller's rolling-upgrade strategy has maxUnavailable=1 by default. Proof:

I0321 15:29:42.284073       1 update.go:431]  DaemonSet kube-system/fluentd-gcp-v3.0.0, maxUnavailable: 1, numUnavailable: 0
I0321 15:29:43.490056       1 update.go:431]  DaemonSet kube-system/fluentd-gcp-v3.0.0, maxUnavailable: 1, numUnavailable: 1

And, from my understanding, the increased mem-usage is because of the rolling upgrade performed by the ds-controller (over a prolonged period) - it's the only difference I observe in the logs.
Further, the reason the RV was increasing is that a few status fields ('NumberUnavailable', 'NumberAvailable', etc.) fluctuate during the rolling upgrade.
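
For reference, this is the knob being discussed, expressed with the Go API types (a sketch - the concrete value is illustrative, not the one later merged):

```go
// Sketch of a DaemonSet update strategy with a raised maxUnavailable; by
// default RollingUpdate uses maxUnavailable=1, which is why the fluentd pods
// above were replaced one at a time.
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func fluentdUpdateStrategy() appsv1.DaemonSetUpdateStrategy {
	maxUnavailable := intstr.FromInt(100) // illustrative "large enough" value
	return appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```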

@shyamjvs
Member Author

I spoke offline with @x13n and suggested that we should increase maxUnavailable for the fluentd daemonset to a large enough value so that we're not bottlenecked by it. My reasoning is:

  • fluentd is not a user workload that would suffer service downtime if many replicas are unavailable at once
  • fluentds on nodes are independent of each other (i.e., one fluentd shouldn't need to wait for another node's fluentd to come up)
  • we should make the scaling happen as fast as possible - otherwise the pods may not be resized in time to cope with the increased log load in the cluster

I'm going to make that change and test it against my PR (thanks @x13n for pointing out that we can change maxUnavailable directly in the ds config).

@shyamjvs shyamjvs self-assigned this Mar 22, 2018
@shyamjvs
Member Author

So increasing maxUnavailable indeed seems to help (ref: c-m logs):

I0322 12:18:37.297447       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-6hvnl ... fluentd-gcp-v3.0.0-fcpwz], deleting 101

All 101 replicas were deleted at once instead of one by one, and this happened within about 2 minutes of the controller first seeing the original fluentd ds:

I0322 12:16:49.538882       1 daemon_controller.go:164] Adding daemon set fluentd-gcp-v3.0.0

We should get that PR into 1.10.

@x13n
Member

x13n commented Mar 22, 2018

Thanks for confirming this! Did increasing maxUnavailable also reduce c-m memory usage?

@shyamjvs
Member Author

So the run was still over the threshold, but I was expecting that since I ran c-m with increased verbosity (which can cause it).
That said, I'll observe the usage once that PR gets in - and, from my understanding, it's still possible to see higher values due to the rolling upgrade potentially happening during the test.

@jdumars jdumars added this to the v1.11 milestone Mar 23, 2018
k8s-github-robot pushed a commit that referenced this issue Mar 25, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Increase fluentd rolling-upgrade maxUnavailable to large value

~For testing wrt #61190 (comment)
Fixes issue #61190 wrt slow rolling-upgrade

/cc @x13n @wojtek-t 
/sig instrumentation
/kind bug
/priority critical-urgent

```release-note
NONE
```
@x13n
Member

x13n commented Mar 26, 2018

@shyamjvs Is there anything else to do here or should this be closed now?

@shyamjvs
Member Author

shyamjvs commented Mar 26, 2018 via email

@shyamjvs shyamjvs added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 26, 2018
@shyamjvs
Member Author

So bumping maxUnavailable for the fluentd rolling upgrade in #61472 is causing a couple of issues:

  • Our latest 5k-node performance run failed in BeforeSuite, where we check that ~all kube-system pods are running or ready. The reason is that quite a few fluentd pods were killed as part of the rolling upgrade and we timed out before all of them could come back up
  • Our 100-node performance runs are flaking due to high CPU/mem-usage of c-m observed during the density test, in cases where the fluentd upgrade happens during the test - which is kind of expected

So at this point, we probably want to choose one of 2 options:

  1. Revert #61472 (i.e. go back to the default maxUnavailable=1 rolling upgrade)
  2. Increase our CPU/mem thresholds to eliminate those flakes in 100-node job and increase the "max-allowed-not-ready-pods" in kube-system ns to a higher value, so that our 5k-node test doesn't fail in BeforeSuite

Thoughts?

@wojtek-t
Member

TBH, I don't know why it was merged without waiting for opinions...

I really vote for reverting that.

Increase our CPU/mem thresholds to eliminate those flakes in 100-node job and increase the "max-allowed-not-ready-pods" in kube-system ns to a higher value, so that our 5k-node test doesn't fail in BeforeSuite

We don't want this - it may mean making the cluster "not fully usable" from the user's perspective, which is not really what we want to test.

@shyamjvs
Member Author

SGTM. Let's revert it and increase our thresholds for the time being. The discussion about changing fluentd rolling-upgrades can continue in the background.

k8s-github-robot pushed a commit that referenced this issue Mar 27, 2018
…e-change

Automatic merge from submit-queue (batch tested with PRs 60519, 61099, 61218, 61166, 61714). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Revert "Increase fluentd rolling-upgrade maxUnavailable to large value"

This reverts commit 7dd6adc.

Ref #61190 (comment)

/cc @wojtek-t 

```release-note
NONE
```
k8s-github-robot pushed a commit that referenced this issue Mar 27, 2018
Automatic merge from submit-queue (batch tested with PRs 60499, 61715, 61688, 61300, 58787). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Increase cpu/mem thresholds for c-m in density test

Follows #61190 (comment)

/cc @wojtek-t 

```release-note
NONE
```
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Approval

@shyamjvs @x13n @kubernetes/sig-instrumentation-misc

Action required: This issue must have the status/approved-for-milestone label applied by a SIG maintainer. If the label is not applied within 5 days, the issue will be moved out of the v1.11 milestone.

Issue Labels
  • sig/instrumentation: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@shyamjvs
Member Author

shyamjvs commented Mar 29, 2018

This issue is now fixed by the PRs above. We're having continuous green runs of the job now.

prameshj pushed a commit to prameshj/kubernetes that referenced this issue Jun 1, 2018
Fixes kubernetes#61190.

This version verifies on its own whether resources should be updated or not, instead of relying on `kubectl set resources`.