
Fluentd-scaler causing fluentd pod deletions and messes with ds-controller #61190

Closed
shyamjvs opened this issue Mar 14, 2018 · 37 comments · Fixed by #61225
Labels: kind/bug · milestone/needs-approval · priority/important-soon · sig/instrumentation

@shyamjvs
Member

Forking from #60500 (comment):

To summarize, here's what we observed:

  1. PATCH daemonset calls are coming every minute from both fluentd-scaler and addon-manager (verified by turning each of them on/off individually). Things we need to understand here (see also the diagnostic sketch after this list):
  • Is the daemonset object continuously toggling between 2 states? (we know that its RV is increasing continuously)
  • If yes, what field(s) in the object are changing? IIRC the value of some label/annotation (I think 'UpdatedPodsScheduled') is changing (probably related to 2 below)
  • Also, why does/should the fluentd-scaler send any API request if the resources are already set to the right value?
  2. Fluentd pods are getting deleted and recreated by the daemonset-controller when the scaler is enabled (as was also seen in #60500 (comment), "Increase in apiserver/c-m mem usage by 100-200 MB in 100-node cluster"). Why is this happening? One thing to note here - all those DELETE calls are preceded by PUT pod-status calls from the respective kubelets (but maybe that's expected).
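
To make the toggling question concrete, here is a minimal client-go sketch (for illustration only - it assumes a recent client-go with the context-taking Watch signature, and the kube-system/fluentd-gcp-v3.0.0 names) that watches the ds and prints the resourceVersion and status it sees on every event:

```go
// Illustrative diagnostic only (not from this issue): watch the fluentd
// DaemonSet and print its resourceVersion and status on every event, to see
// whether the spec is really stable while the RV keeps increasing.
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the fluentd DaemonSet in kube-system (names assumed).
	w, err := client.AppsV1().DaemonSets("kube-system").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=fluentd-gcp-v3.0.0",
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		ds, ok := ev.Object.(*appsv1.DaemonSet)
		if !ok {
			continue
		}
		// If only ResourceVersion and Status differ between events, the spec
		// is stable and the churn comes from status updates alone.
		fmt.Printf("%s rv=%s generation=%d status=%+v\n",
			ev.Type, ds.ResourceVersion, ds.Generation, ds.Status)
	}
}
```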

cc @kubernetes/sig-instrumentation-bugs @crassirostris @liggitt
/priority critical-urgent
/assign @x13n

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 14, 2018
@shyamjvs shyamjvs added this to the v1.10 milestone Mar 14, 2018
@shyamjvs
Member Author

Marked this as a release-blocker since at least some part of it seems like a correctness issue (i.e., fluentd pods dying and getting recreated).
Note also that this is causing increased mem-usage in the daemonset-controller (shown here: #60500 (comment)).

@x13n
Member

x13n commented Mar 15, 2018

Thanks for filing this!

  • Is the daemonset object continuously toggling between 2 states? (we know that its RV is increasing continuously)
  • If yes, what field(s) in the object are changing? IIRC the value of some label/annotation (I think 'UpdatedPodsScheduled') is changing (probably related to 2 below)

The only thing that was changing (apart from ds/fluentd-gcp status) was the resource version.

@x13n
Member

x13n commented Mar 15, 2018

I modified the fluentd scaler to use `--dry-run` with `kubectl set resources` to check whether there would be a change, but it still keeps updating even though there is no change. Apparently, the change-detection logic inside kubectl is broken...

@x13n
Member

x13n commented Mar 15, 2018

Btw, the addon manager sends tons of PATCH requests for all resources every minute anyway, so even though I will fix the fluentd-scaler not to send them, it won't solve everything.

x13n added a commit to x13n/kubernetes that referenced this issue Mar 15, 2018
Fixes kubernetes#61190.

This version verifies on its own whether resources should be updated or not, instead of relying on `kubectl set resources`.
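
The scaler itself lives outside this repo, but the idea is roughly the sketch below (the names and structure here are assumptions, not the scaler's actual code): compare the container's current requests/limits with the desired ones and skip the API call entirely when they already match.

```go
// Rough sketch of the "verify before updating" idea described above; not the
// fluentd-gcp-scaler's real implementation.
package scaler

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// resourcesUpToDate reports whether a container already has exactly the
// desired requests and limits, so the caller can skip a PATCH that would
// otherwise be a no-op.
func resourcesUpToDate(c corev1.Container, desired corev1.ResourceRequirements) bool {
	return quantitiesEqual(c.Resources.Requests, desired.Requests) &&
		quantitiesEqual(c.Resources.Limits, desired.Limits)
}

func quantitiesEqual(have, want corev1.ResourceList) bool {
	if len(have) != len(want) {
		return false
	}
	for name, wantQty := range want {
		haveQty, ok := have[name]
		if !ok || haveQty.Cmp(wantQty) != 0 {
			return false
		}
	}
	return true
}

// desiredRequirements builds the target requests from strings such as
// "100m" CPU and "200Mi" memory (placeholder values).
func desiredRequirements(cpu, memory string) corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse(cpu),
			corev1.ResourceMemory: resource.MustParse(memory),
		},
	}
}
```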
k8s-github-robot pushed a commit that referenced this issue Mar 15, 2018
Automatic merge from submit-queue (batch tested with PRs 60888, 61225). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Bump fluentd-gcp-scaler version

**What this PR does / why we need it**:
This version verifies on its own whether resources should be updated or not, instead of relying on `kubectl set resources`.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #61190


**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```

cc @shyamjvs
@jberkus

jberkus commented Mar 16, 2018

Should this issue actually be closed? It was auto-closed by the bot, but nobody has confirmed that the actual issue is fixed.

@krzyzacy krzyzacy reopened this Mar 16, 2018
@krzyzacy
Member

I'll let @shyamjvs confirm (looks like he'll be back on Monday).

@jdumars
Member

jdumars commented Mar 16, 2018

/status in-progress

@shyamjvs
Member Author

Yes, we should keep this issue open until we fully understand what's going wrong and fix the root cause.
I believe much of the discussion about this is happening under the related issue #61189.

@shyamjvs
Member Author

Ok... seems like I missed Wojtek's comment above.

@shyamjvs
Member Author

So I just checked the ds-controller code, and it seems like we update the ds status only if this condition is violated:

func storeDaemonSetStatus(dsClient unversionedapps.DaemonSetInterface, ds *apps.DaemonSet, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable int) error {
	if int(ds.Status.DesiredNumberScheduled) == desiredNumberScheduled &&
		int(ds.Status.CurrentNumberScheduled) == currentNumberScheduled &&
		int(ds.Status.NumberMisscheduled) == numberMisscheduled &&
		int(ds.Status.NumberReady) == numberReady &&
		int(ds.Status.UpdatedNumberScheduled) == updatedNumberScheduled &&
		int(ds.Status.NumberAvailable) == numberAvailable &&
		int(ds.Status.NumberUnavailable) == numberUnavailable &&
		ds.Status.ObservedGeneration >= ds.Generation {
		return nil
	}

Which means one of those fields probably isn't matching. I'll try to add some logs and run the test at --v=4 to get more info.
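
(For reference, "adding some logs at --v=4" means a guarded debug log along these lines - a made-up example using the glog package the controllers used at the time, not a line from the real controller:)

```go
// Hypothetical --v=4 guarded debug log; the helper and message are made up.
package daemon

import "github.com/golang/glog"

// logStatusMismatch only prints when kube-controller-manager runs with
// --v=4 or higher, so it is cheap to leave in for a debugging build.
func logStatusMismatch(ds, field string, want, got int) {
	glog.V(4).Infof("DaemonSet %s status mismatch on %s: want=%d got=%d", ds, field, want, got)
}
```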

@liggitt
Member

liggitt commented Mar 21, 2018

This is the code that is supposed to protect us from the situation where something must change when we write an object to etcd (the resource version is not stored in etcd):
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L325

So there should never be a situation where only the resource version is changing. If there is, it's a bug that needs to be fixed. @liggitt - FYI

anything is possible, but we do have tests ensuring this works properly (no-op updates do not actually result in an update+resourceVersion bump in etcd): https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store_test.go#L674-L728
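
For readers not following the links, the guard works roughly like this heavily simplified sketch (not the real store.go code): the updated object is serialized and compared against what is already stored, and an identical result means no write and no new resourceVersion.

```go
// Heavily simplified illustration of the no-op-update guard in
// staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go; not the real code.
package storage

import "bytes"

// guaranteedUpdateSketch applies the caller's update function and skips the
// etcd write entirely when the serialized object did not change, so a no-op
// update never bumps the resourceVersion.
func guaranteedUpdateSketch(existing []byte, update func([]byte) []byte) (wrote bool, stored []byte) {
	updated := update(existing)
	if bytes.Equal(existing, updated) {
		// No-op: return the current state untouched.
		return false, existing
	}
	// The real code issues an etcd transaction guarded by the key's
	// mod_revision here; only a successful write advances the resourceVersion.
	return true, updated
}
```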

@shyamjvs
Member Author

So I ran some 100-node experiments against #61472 with more debug logs, and I think I now understand the reason for this. The controller-manager logs with and without the scaler mainly show the following difference:

I0321 14:26:34.315600       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-j5vvh], deleting 1
I0321 14:27:26.218875       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-jl74c], deleting 1
I0321 14:28:26.005293       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-89dvm], deleting 1
I0321 14:29:24.596255       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-l4qtp], deleting 1

These logs show up only when the scaler is active, and the pod deletions are happening as part of the rolling upgrade of the daemonset to the newer version (the one having resources set). You can see the pods are deleted one by one, which also explains why it takes so long (and why deletions are still happening even 30 mins after cluster creation). The reason they're deleted one by one is that the ds-controller's rolling-upgrade strategy has maxUnavailable=1 by default. Proof:

I0321 15:29:42.284073       1 update.go:431]  DaemonSet kube-system/fluentd-gcp-v3.0.0, maxUnavailable: 1, numUnavailable: 0
I0321 15:29:43.490056       1 update.go:431]  DaemonSet kube-system/fluentd-gcp-v3.0.0, maxUnavailable: 1, numUnavailable: 1

And, from my understanding, the increased mem-usage is because of the rolling upgrade performed by the ds-controller (over a prolonged period) - it's the only difference I observe in the logs.
Further, the reason the RV was increasing is that a few status fields ('NumberUnavailable', 'NumberAvailable', etc.) fluctuate during the rolling upgrade.
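
For reference, this is the knob being discussed, expressed with the Go API types (a sketch - the concrete value is illustrative, not the one later merged):

```go
// Sketch of a DaemonSet update strategy with a raised maxUnavailable; by
// default RollingUpdate uses maxUnavailable=1, which is why the fluentd pods
// above were replaced one at a time.
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func fluentdUpdateStrategy() appsv1.DaemonSetUpdateStrategy {
	maxUnavailable := intstr.FromInt(100) // illustrative "large enough" value
	return appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```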

@shyamjvs
Member Author

I spoke offline with @x13n and suggested that we should increase maxUnavailable for the fluentd daemonset to a large enough value so that we're not bottlenecked by it. My reasoning is:

  • fluentd is not a user workload that would suffer service downtime if many replicas are unavailable at once
  • fluentds on nodes are independent of each other (i.e., one fluentd shouldn't need to wait for another node's fluentd to come up)
  • we should make the scaling happen as fast as possible - otherwise the pods may not be resized in time to cope with the increased log load in the cluster

I'm going to make that change and test it against my PR (thanks @x13n for pointing out that we can change maxUnavailable directly in the ds config).

@shyamjvs shyamjvs self-assigned this Mar 22, 2018
@shyamjvs
Member Author

So increasing maxUnavailable indeed seems to help (ref: c-m logs):

I0322 12:18:37.297447       1 daemon_controller.go:986] Pods to delete for daemon set fluentd-gcp-v3.0.0: [fluentd-gcp-v3.0.0-6hvnl ... fluentd-gcp-v3.0.0-fcpwz], deleting 101

All 101 replicas were deleted at once instead of one by one, and this happened within about 2 minutes of the controller first seeing the original fluentd ds:

I0322 12:16:49.538882       1 daemon_controller.go:164] Adding daemon set fluentd-gcp-v3.0.0

We should get that PR into 1.10.

@x13n
Member

x13n commented Mar 22, 2018

Thanks for confirming this! Did increasing maxUnavailable also reduce c-m memory usage?

@shyamjvs
Member Author

So the run was still over the threshold, but I was expecting that since I ran c-m with increased verbosity (which can cause it).
That said, I'll observe the usage once that PR gets in - and, from my understanding, it's still possible to see higher values due to the rolling upgrade potentially happening during the test.

@jdumars jdumars added this to the v1.11 milestone Mar 23, 2018
k8s-github-robot pushed a commit that referenced this issue Mar 25, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Increase fluentd rolling-upgrade maxUnavailable to large value

~For testing wrt #61190 (comment)
Fixes issue #61190 wrt slow rolling-upgrade

/cc @x13n @wojtek-t 
/sig instrumentation
/kind bug
/priority critical-urgent

```release-note
NONE
```
@x13n
Member

x13n commented Mar 26, 2018

@shyamjvs Is there anything else to do here or should this be closed now?

@shyamjvs
Member Author

shyamjvs commented Mar 26, 2018 via email

@shyamjvs shyamjvs added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 26, 2018
@shyamjvs
Member Author

So bumping maxUnavailable for the fluentd rolling upgrade in #61472 is causing a couple of issues:

  • Our latest 5k-node performance run failed in BeforeSuite, where we check that ~all kube-system pods are running or ready. The reason is that quite a few fluentd pods were killed as part of the rolling upgrade and we timed out before all of them could come back up
  • Our 100-node performance runs are flaking due to high CPU/mem-usage of c-m observed during the density test, in cases where the fluentd upgrade happens during the test - which is kind of expected

So at this point, we probably want to choose one of 2 options:

  1. Revert #61472 (i.e. go back to the default maxUnavailable=1 rolling upgrade)
  2. Increase our CPU/mem thresholds to eliminate those flakes in 100-node job and increase the "max-allowed-not-ready-pods" in kube-system ns to a higher value, so that our 5k-node test doesn't fail in BeforeSuite

Thoughts?

@wojtek-t
Member

TBH, I don't know why it was merged without waiting for opinions...

I really vote for reverting that.

Increase our CPU/mem thresholds to eliminate those flakes in 100-node job and increase the "max-allowed-not-ready-pods" in kube-system ns to a higher value, so that our 5k-node test doesn't fail in BeforeSuite

We don't want this - it may mean making the cluster "not fully usable" from the user's perspective, which is not really what we want to test.

@shyamjvs
Member Author

SGTM. Let's revert it and increase our thresholds for the time being. The discussion about changing fluentd rolling-upgrades can continue in the background.

k8s-github-robot pushed a commit that referenced this issue Mar 27, 2018
…e-change

Automatic merge from submit-queue (batch tested with PRs 60519, 61099, 61218, 61166, 61714). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Revert "Increase fluentd rolling-upgrade maxUnavailable to large value"

This reverts commit 7dd6adc.

Ref #61190 (comment)

/cc @wojtek-t 

```release-note
NONE
```
k8s-github-robot pushed a commit that referenced this issue Mar 27, 2018
Automatic merge from submit-queue (batch tested with PRs 60499, 61715, 61688, 61300, 58787). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Increase cpu/mem thresholds for c-m in density test

Follows #61190 (comment)

/cc @wojtek-t 

```release-note
NONE
```
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Approval

@shyamjvs @x13n @kubernetes/sig-instrumentation-misc

Action required: This issue must have the status/approved-for-milestone label applied by a SIG maintainer. If the label is not applied within 5 days, the issue will be moved out of the v1.11 milestone.

Issue Labels
  • sig/instrumentation: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@shyamjvs
Member Author

shyamjvs commented Mar 29, 2018

This issue is now fixed by the PRs above. We're having continuous green runs of the job now.

prameshj pushed a commit to prameshj/kubernetes that referenced this issue Jun 1, 2018
Fixes kubernetes#61190.

This version verifies on its own whether resources should be updated or not, instead of relying on `kubectl set resources`.