
Now i am running daemonset in all node, but how do i verify it is useful? #106

Closed
strugglingyouth opened this issue Mar 30, 2017 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@strugglingyouth

What happens when a node goes wrong after the DaemonSet is created? Is the node set to unschedulable?

I am now running the DaemonSet on all nodes, but how do I verify that it works?

It is a bit difficult to simulate these problems:

Hardware issues: bad CPU, memory, or disk;
Kernel issues: kernel deadlock, corrupted file system;
Container runtime issues: unresponsive runtime daemon;
...

@Random-Liu
Member

The first thing you may want to check is kubectl describe nodes, to see whether the KernelDeadlock node condition shows up.

Currently, kernel problem detection is purely based on the kernel log, so you can inject kernel log entries and see whether NPD reacts accordingly, e.g. the problems in https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems.
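A minimal sketch of this check. The KernelOops pattern below is taken from the NPD startup log later in this thread; the injection path and node name are placeholders you need to adapt to your setup:

```shell
# Pattern from NPD's default KernelOops rule (seen in the startup log below)
pattern='BUG: unable to handle kernel NULL pointer dereference at .*'
# A message you could inject on the node; verify locally that it matches first:
msg='BUG: unable to handle kernel NULL pointer dereference at 00000000deadbeef'
echo "$msg" | grep -qE "$pattern" && echo "pattern matches"
# On the node (path is an example; use the log file your config points at):
#   echo "kernel: [12345.678901] $msg" | sudo tee -a /var/log/kern.log
# Then look for the resulting event/condition:
#   kubectl describe node <node-name>
```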

@strugglingyouth
Author

I tested with Kubernetes 1.6 and injected kernel log entries into /var/log/dmesg, but nothing happened. Am I doing this right?

@Random-Liu
Copy link
Member

Random-Liu commented Mar 31, 2017

@strugglingyouth Did you change the configuration correspondingly? https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-filelog.json

You may want to inject log into /var/log/kern.log. Or you may want to change the configuration to point to /var/log/dmesg.
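For reference, a sketch of what the filelog-based monitor config looks like when pointed at /var/log/dmesg. The regexes and field values here are illustrative from memory; copy the real ones from the linked kernel-monitor-filelog.json, and note the conditions/rules arrays are elided:

```json
{
  "plugin": "filelog",
  "pluginConfig": {
    "timestamp": "^.{15}",
    "message": "kernel: \\[.*\\] (.*)",
    "timestampFormat": "Jan _2 15:04:05"
  },
  "logPath": "/var/log/dmesg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [],
  "rules": []
}
```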

@hardikdr
Member

Or you can also make use of the script https://github.com/kubernetes/node-problem-detector/blob/master/test/kernel_log_generator/generator.sh, either via Docker or by setting the PROBLEM environment variable manually in the script and running it on the host machine.
For verification, it is also useful to put a watch on the events and look for the ones coming from the node:
kubectl get events | grep Node
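To illustrate what that grep keeps, here is a sketch against sample output (the event lines are invented for illustration, not from a real cluster):

```shell
# Illustrative 'kubectl get events' output (sample lines, not real cluster data)
events='LAST SEEN  TYPE     REASON      OBJECT         MESSAGE
1m         Warning  KernelOops  node/worker-1  kernel: BUG: unable to handle kernel NULL pointer dereference
3m         Normal   Scheduled   pod/web-1      Successfully assigned web-1'
# The grep suggested above keeps only node-related lines:
printf '%s\n' "$events" | grep 'node/'
# Live equivalent: kubectl get events --watch | grep Node
```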

@xiuli2

xiuli2 commented Sep 26, 2017

@Random-Liu I simulated problems like the following in /var/log/kern.log:

kernel:[534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2
kernel: Memory cgroup out of memory: Kill process 1012 (heapster) score 1035 or sacrifice child
kernel: Killed process 1012 (heapster) total-vm:327128kB, anon-rss:306328kB, file-rss:11132kB

I am not sure whether these log lines are valid for NPD. Is the format right? Is there a specific style, or is a timestamp needed?

After manually modifying kern.log:

A: I ran the command "kubectl get events | grep Node", but the expected OOM events did not appear.

B: I ran the command "oc describe node NodeName" to view the Conditions and Events sections and check whether the OOM and unregister_netdevice errors were caught, but there were no related Conditions or Events. Are these the correct steps to check that NPD works?

Could you please show the detailed steps to verify that it works? Thanks in advance!
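On the timestamp question: the filelog monitor parses syslog-style lines, and the OOMKilling rule visible in the NPD startup log below is a two-line pattern, so both "Kill process ..." and "Killed process ..." must appear consecutively. A sketch of building a line in roughly the expected shape (the exact format is an assumption; check it against the timestamp/message regexes in your kernel-monitor config):

```shell
# Build a kern.log-style line with a current syslog timestamp
# (format is an assumption; match it against your monitor's pluginConfig)
ts=$(date '+%b %e %H:%M:%S')
line="$ts myhost kernel: [534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2"
echo "$line"
# Append on the node:  echo "$line" | sudo tee -a /var/log/kern.log
# Note: the OOMKilling rule needs BOTH lines, back to back:
#   kernel: Kill process 1012 (heapster) score 1035 or sacrifice child
#   kernel: Killed process 1012 (heapster) total-vm:327128kB, anon-rss:306328kB, file-rss:11132kB
```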

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@weinliu

weinliu commented Apr 9, 2018

/reopen
Hi @Random-Liu, I have the same question as @xiuli2: how can I verify that NPD is working?
I have docker-monitor.json and kernel-monitor.json enabled: --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json

Then I generated fake journald log entries:

# echo "BUG: unable to handle kernel NULL pointer dereference at 12345" |systemd-cat
# echo "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/aufs/layerdb/tmp/layer-632022140 /var/lib/docker/image/aufs/layerdb/sha256/011b303988d241a4ae28a6b82b0d8262751ef02910f0ae2265cb637504b72e36: directory not empty" | systemd-cat

But I cannot see any NPD output from kubectl get events | grep Node or kubectl describe node NodeName.

Neither does kubectl logs show anything beyond startup:

# kubectl logs node-problem-detector-72c4w
I0409 01:24:20.074715       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0409 01:24:20.074881       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.074999       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0409 01:24:20.075013       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.075872       1 log_monitor.go:72] Start log monitor
I0409 01:24:20.097735       1 log_watcher.go:69] Start watching journald
I0409 01:24:20.097789       1 log_monitor.go:72] Start log monitor
I0409 01:24:20.099870       1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-04-09 01:24:20.099830399 -0400 EDT m=+0.043253068 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
I0409 01:24:20.114934       1 log_watcher.go:69] Start watching journald
I0409 01:24:20.115008       1 problem_detector.go:73] Problem detector started
I0409 01:24:20.117235       1 log_monitor.go:163] Initialize condition generated: []

Thanks a lot.
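One likely reason the injected lines are never matched: the startup log above shows PluginConfig:map[source:kernel], i.e. the journald watcher filters entries by a source identifier, while plain systemd-cat records messages under its own identifier. Tagging the message is a guess worth trying (-t sets the syslog identifier):

```shell
# Tag injected journal entries so they match the monitor's source filter
# (that the filter keys on the identifier is an inference from the log above)
if command -v systemd-cat >/dev/null 2>&1; then
  echo "BUG: unable to handle kernel NULL pointer dereference at 12345" \
    | systemd-cat -t kernel 2>/dev/null || true
  echo "attempted injection with identifier 'kernel'"
else
  echo "systemd-cat not available; run this on the node itself"
fi
# Likewise for the docker-monitor (source: docker): ... | systemd-cat -t docker
```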

@k8s-ci-robot
Contributor

@weinliu: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
(comment quoted in full above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
