
Now i am running daemonset in all node, but how do i verify it is useful? #106

Closed
strugglingyouth opened this issue Mar 30, 2017 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@strugglingyouth

What happens when a node goes wrong after the DaemonSet is created? Is the node set to unschedulable?

I am now running the DaemonSet on all nodes, but how do I verify that it works?

It is a bit difficult to simulate these problems:

Hardware issues: bad CPU, memory, or disk;
Kernel issues: kernel deadlock, corrupted file system;
Container runtime issues: unresponsive runtime daemon;
...

@Random-Liu
Member

The first thing you may want to check is kubectl describe nodes, to see whether the KernelDeadlock node condition shows up.

Currently, kernel problem detection is purely based on the kernel log, so you can inject kernel log entries and see whether NPD reacts accordingly, e.g. the problems in https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems.
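A minimal sketch of this check. The KernelOops pattern below is taken from the NPD startup log later in this thread; the injection path and node name are placeholders you need to adapt to your setup:

```shell
# Pattern from NPD's default KernelOops rule (seen in the startup log below)
pattern='BUG: unable to handle kernel NULL pointer dereference at .*'
# A message you could inject on the node; verify locally that it matches first:
msg='BUG: unable to handle kernel NULL pointer dereference at 00000000deadbeef'
echo "$msg" | grep -qE "$pattern" && echo "pattern matches"
# On the node (path is an example; use the log file your config points at):
#   echo "kernel: [12345.678901] $msg" | sudo tee -a /var/log/kern.log
# Then look for the resulting event/condition:
#   kubectl describe node <node-name>
```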

@strugglingyouth
Author

I tested with Kubernetes 1.6 and injected kernel log entries into /var/log/dmesg, but nothing happened. Am I doing this right?

@Random-Liu
Copy link
Member

Random-Liu commented Mar 31, 2017

@strugglingyouth Did you change the configuration correspondingly? https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-filelog.json

You may want to inject log into /var/log/kern.log. Or you may want to change the configuration to point to /var/log/dmesg.
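For reference, a sketch of what the filelog-based monitor config looks like when pointed at /var/log/dmesg. The regexes and field values here are illustrative from memory; copy the real ones from the linked kernel-monitor-filelog.json, and note the conditions/rules arrays are elided:

```json
{
  "plugin": "filelog",
  "pluginConfig": {
    "timestamp": "^.{15}",
    "message": "kernel: \\[.*\\] (.*)",
    "timestampFormat": "Jan _2 15:04:05"
  },
  "logPath": "/var/log/dmesg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [],
  "rules": []
}
```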

@hardikdr
Member

Or you can also make use of the script https://github.com/kubernetes/node-problem-detector/blob/master/test/kernel_log_generator/generator.sh, either via Docker or by setting the PROBLEM environment variable manually in the script and running it on the host machine.
For verification, it is also useful to put a watch on the events and look for the ones coming from the node:
kubectl get events | grep Node
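To illustrate what that grep keeps, here is a sketch against sample output (the event lines are invented for illustration, not from a real cluster):

```shell
# Illustrative 'kubectl get events' output (sample lines, not real cluster data)
events='LAST SEEN  TYPE     REASON      OBJECT         MESSAGE
1m         Warning  KernelOops  node/worker-1  kernel: BUG: unable to handle kernel NULL pointer dereference
3m         Normal   Scheduled   pod/web-1      Successfully assigned web-1'
# The grep suggested above keeps only node-related lines:
printf '%s\n' "$events" | grep 'node/'
# Live equivalent: kubectl get events --watch | grep Node
```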

@xiuli2

xiuli2 commented Sep 26, 2017

@Random-Liu I simulated problems like the following in /var/log/kern.log:

kernel:[534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2
kernel: Memory cgroup out of memory: Kill process 1012 (heapster) score 1035 or sacrifice child
kernel: Killed process 1012 (heapster) total-vm:327128kB, anon-rss:306328kB, file-rss:11132kB

I am not sure whether these log lines are valid for NPD. Is the format right? Is there a specific style, or is a timestamp needed?

After manually modifying kern.log:

A: I ran the command "kubectl get events | grep Node", but the expected OOM events did not appear.

B: I ran the command "oc describe node NodeName" to view the Conditions and Events sections and check whether the OOM and unregister_netdevice errors were caught, but there were no related Conditions or Events. Are these the correct steps to check that NPD works?

Could you please show the detailed steps to verify that it works? Thanks in advance!
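On the timestamp question: the filelog monitor parses syslog-style lines, and the OOMKilling rule visible in the NPD startup log below is a two-line pattern, so both "Kill process ..." and "Killed process ..." must appear consecutively. A sketch of building a line in roughly the expected shape (the exact format is an assumption; check it against the timestamp/message regexes in your kernel-monitor config):

```shell
# Build a kern.log-style line with a current syslog timestamp
# (format is an assumption; match it against your monitor's pluginConfig)
ts=$(date '+%b %e %H:%M:%S')
line="$ts myhost kernel: [534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2"
echo "$line"
# Append on the node:  echo "$line" | sudo tee -a /var/log/kern.log
# Note: the OOMKilling rule needs BOTH lines, back to back:
#   kernel: Kill process 1012 (heapster) score 1035 or sacrifice child
#   kernel: Killed process 1012 (heapster) total-vm:327128kB, anon-rss:306328kB, file-rss:11132kB
```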

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@weinliu

weinliu commented Apr 9, 2018

/reopen
Hi @Random-Liu, I have the same question as @xiuli2: how can I verify that NPD is working?
I have docker-monitor.json and kernel-monitor.json enabled: --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json

Then I generated fake journald log entries:

# echo "BUG: unable to handle kernel NULL pointer dereference at 12345" |systemd-cat
# echo "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/aufs/layerdb/tmp/layer-632022140 /var/lib/docker/image/aufs/layerdb/sha256/011b303988d241a4ae28a6b82b0d8262751ef02910f0ae2265cb637504b72e36: directory not empty" | systemd-cat

But I cannot see any NPD output from kubectl get events | grep Node or kubectl describe node NodeName.

Neither does kubectl logs show anything beyond startup:

# kubectl logs node-problem-detector-72c4w
I0409 01:24:20.074715       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0409 01:24:20.074881       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.074999       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0409 01:24:20.075013       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.075872       1 log_monitor.go:72] Start log monitor
I0409 01:24:20.097735       1 log_watcher.go:69] Start watching journald
I0409 01:24:20.097789       1 log_monitor.go:72] Start log monitor
I0409 01:24:20.099870       1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-04-09 01:24:20.099830399 -0400 EDT m=+0.043253068 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
I0409 01:24:20.114934       1 log_watcher.go:69] Start watching journald
I0409 01:24:20.115008       1 problem_detector.go:73] Problem detector started
I0409 01:24:20.117235       1 log_monitor.go:163] Initialize condition generated: []

Thanks a lot.
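One likely reason the injected lines are never matched: the startup log above shows PluginConfig:map[source:kernel], i.e. the journald watcher filters entries by a source identifier, while plain systemd-cat records messages under its own identifier. Tagging the message is a guess worth trying (-t sets the syslog identifier):

```shell
# Tag injected journal entries so they match the monitor's source filter
# (that the filter keys on the identifier is an inference from the log above)
if command -v systemd-cat >/dev/null 2>&1; then
  echo "BUG: unable to handle kernel NULL pointer dereference at 12345" \
    | systemd-cat -t kernel 2>/dev/null || true
  echo "attempted injection with identifier 'kernel'"
else
  echo "systemd-cat not available; run this on the node itself"
fi
# Likewise for the docker-monitor (source: docker): ... | systemd-cat -t docker
```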

@k8s-ci-robot
Contributor

@weinliu: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
(comment quoted in full above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
