fluentd doesn't receive the signal TERM #68064

Merged 1 commit on Sep 25, 2018

@gianrubio gianrubio commented Aug 30, 2018

Sometimes when my fluentd daemonset restarts, it sends all the logs again, even logs already sent.

My first assumption was that the pos file was misconfigured, but after a long time working on this I found the real cause. When Kubernetes stops fluentd, the fluentd process does not receive the TERM signal, which breaks the position file. This happens because a CMD written without "[" uses the shell form, so the command is started by a shell without exec and the signal is not passed on to the fluentd process.
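
To illustrate the exec point above, here is a minimal sketch using plain sh rather than the actual image; the commands are stand-ins, not the contents of the fluentd Dockerfile or its start script. It prints the PID of a wrapping shell and the PID of the command it starts, with and without exec. Only with exec does the command keep the shell's PID, which is the PID a TERM signal would be sent to.

```shell
#!/bin/sh
# Sketch only: plain sh stands in for the container's start command.
# Nothing here is taken from the fluentd image itself.

# Without exec: the inner command runs as a child of the outer shell and
# gets a new PID, so a signal sent to the outer shell's PID is not
# delivered to it. The trailing "true" keeps the outer shell alive so the
# inner command really is a separate child process.
without_exec=$(sh -c 'echo $$; sh -c "echo \$\$"; true')

# With exec: the inner command replaces the outer shell and keeps its PID,
# so a signal sent to that PID is delivered to the command itself.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

echo "without exec (shell PID, command PID): $(echo $without_exec)"
echo "with exec    (shell PID, command PID): $(echo $with_exec)"
```

In the first line of output the two PIDs differ; in the second they are identical.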

To reproduce the error, delete the fluentd pod: you will never see a message saying that fluentd has been killed (e.g. ...fluentd main process get SIGTERM).

After I applied the patch, fluentd started receiving the SIGTERM.

Fluentd logs after the patch:

2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: fluentd main process get SIGTERM
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: getting start to shutdown main process
2018-08-30 11:42:43 +0000 [info]: fluent/log.rb:322:info: fluentd worker is now stopping worker=0
2018-08-30 11:42:43 +0000 [info]: fluent/log.rb:322:info: shutting down fluentd worker worker=0
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="fluentd-containers.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:forward plugin_id="object:3fb7a18ccf88"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:prometheus plugin_id="object:3fb79eb05834"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:monitor_agent plugin_id="object:3fb79ec9ea4c"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:prometheus_monitor plugin_id="object:3fb79ef47e9c"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:prometheus_output_monitor plugin_id="object:3fb79ec5412c"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:prometheus_tail_monitor plugin_id="object:3fb79ec2c258"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="minion"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="startupscript.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="docker.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="kubelet.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="kube-proxy.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="kube-apiserver.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="kube-controller-manager.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="kube-scheduler.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="rescheduler.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="glbc.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:tail plugin_id="cluster-autoscaler.log"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:systemd plugin_id="journald-docker"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:systemd plugin_id="journald-kubelet"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on input plugin type=:systemd plugin_id="journald-node-problem-detector"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on output plugin type=:detect_exceptions plugin_id="raw.kubernetes"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on output plugin type=:elasticsearch plugin_id="apps"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on output plugin type=:elasticsearch plugin_id="infrastructure"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on filter plugin type=:kubernetes_metadata plugin_id="object:3fb7a35c0ce8"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on filter plugin type=:record_transformer plugin_id="object:3fb7a0208d4c"
2018-08-30 11:42:43 +0000 [debug]: fluent/log.rb:302:debug: calling stop on output plugin type=:null plugin_id="object:3fb7a19121c8"

Release note:

Pass signals to fluentd process

@k8s-ci-robot added the do-not-merge/work-in-progress, size/XS, do-not-merge/release-note-label-needed, and cncf-cla: yes labels Aug 30, 2018
@k8s-ci-robot added the sig/cluster-lifecycle label Aug 30, 2018
@gianrubio changed the title from "WIP fluentd-elasticsearch signal" to "[fluentd-elasticsearch] doesn't receive the signal TERM" Aug 30, 2018
@k8s-ci-robot removed the do-not-merge/work-in-progress label Aug 30, 2018
@gianrubio
Contributor Author

/assign @coffeepac

@gianrubio changed the title from "[fluentd-elasticsearch] doesn't receive the signal TERM" to "fluentd doesn't receive the signal TERM" Aug 30, 2018
@gianrubio
Contributor Author

/retest

@coffeepac
Contributor

/sig instrumentation

@k8s-ci-robot added the sig/instrumentation label Aug 30, 2018
@coffeepac
Contributor

/remove-sig cluster-lifecycle

@k8s-ci-robot removed the sig/cluster-lifecycle label Aug 30, 2018
@coffeepac
Contributor

/approve

@k8s-ci-robot added the approved label Aug 30, 2018
@coffeepac
Contributor

@gianrubio please add a release note; this is something people have asked about in the past.

@k8s-ci-robot added the release-note label and removed the do-not-merge/release-note-label-needed label Sep 3, 2018
@gianrubio
Contributor Author

done @coffeepac

@coffeepac
Contributor

/test pull-kubernetes-e2e-gke

@coffeepac
Contributor

/lgtm

@k8s-ci-robot added the lgtm label Sep 4, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: coffeepac, gianrubio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coffeepac
Contributor

@gianrubio we missed the 1.12 code freeze by a day. My bad, it was today. I had a busy weekend. Code freeze runs until Sept 18th, so this will get merged then.

@coffeepac
Contributor

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gke

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

2 similar comments

@gianrubio
Contributor Author

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

3 similar comments

@gianrubio
Contributor Author

@coffeepac any idea how to fix this flaky test?

@gianrubio
Contributor Author

/retest

@coffeepac
Contributor

@gianrubio Looks like it passed. Unfortunately, fejta-bot is gonna be running the tests every 72 hours until this PR gets merged.

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

3 similar comments

@k8s-ci-robot merged commit 9c29560 into kubernetes:master Sep 25, 2018
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.