after upgrading npd from v0.8.13 to v0.8.15, containers with unready status: [node-problem-detector] #865

Closed
pacoxu opened this issue Feb 26, 2024 · 10 comments

Comments

@pacoxu
Member

pacoxu commented Feb 26, 2024

With v0.8.15:

kubernetes/kubernetes#123114 (comment)

F0216 07:10:52.272154       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0004f9fc0 TimeoutString:0xc0004f9fd0 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00056bb90 Concurrency:0xc00056bba0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc0004ded90] EnableMetricsReporting:0xc00056bba8}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0004f9ff0 Timeout:1m0s}
I0216 07:10:53.766460       1 log_monitor.go:78] Finish parsing log monitor config file /config/kernel-monitor.json: {WatcherConfig:{Plugin:kmsg PluginConfig:map[] LogPath:/dev/kmsg Lookback:5m Delay:} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*} {Type:temporary Condition: Reason:TaskHung Pattern:task [\S ]+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:temporary Condition: Reason:Ext4Error Pattern:EXT4-fs error .*} {Type:temporary Condition: Reason:Ext4Warning Pattern:EXT4-fs warning .*} {Type:temporary Condition: Reason:IOError Pattern:Buffer I/O error .*} {Type:temporary Condition: Reason:MemoryReadError Pattern:CE memory read error .*} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only}] EnableMetricsReporting:0xc0006e858e}
I0216 07:10:53.766678       1 log_watchers.go:40] Use log watcher of plugin "kmsg"
I0216 07:10:53.767239       1 log_monitor.go:78] Finish parsing log monitor config file /config/systemd-monitor.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:systemd] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source:systemd-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:KubeletStart Pattern:Started Kubernetes kubelet.} {Type:temporary Condition: Reason:DockerStart Pattern:Starting Docker Application Container Engine...} {Type:temporary Condition: Reason:ContainerdStart Pattern:Starting containerd container runtime...}] EnableMetricsReporting:0xc0006e8cea}
I0216 07:10:53.767297       1 log_watchers.go:40] Use log watcher of plugin "journald"
F0216 07:10:53.770149       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc00035c8f0 TimeoutString:0xc00035c900 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00013af60 Concurrency:0xc00013af70 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc00039c000] EnableMetricsReporting:0xc00013af78}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc00035c920 Timeout:1m0s}
F0216 07:11:07.771971       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0006ec490 TimeoutString:0xc0006ec4a0 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000146cd0 Concurrency:0xc000146ce0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc000418cb0] EnableMetricsReporting:0xc000146ce8}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0006ec4c0 Timeout:1m0s}
F0216 07:11:35.668334       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0006b2c70 TimeoutString:0xc0006b2c80 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc0006cc9d0 Concurrency:0xc0006cc9e0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc000418d20] EnableMetricsReporting:0xc0006cc9e8}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0006b2ca0 Timeout:1m0s}
F0216 07:12:26.664969       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0004ae2a0 TimeoutString:0xc0004ae2b0 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000015f50 Concurrency:0xc000015f60 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc0000cc5b0] EnableMetricsReporting:0xc000015f68}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0004ae2d0 Timeout:1m0s}
F0216 07:13:57.671249       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc000660240 TimeoutString:0xc000660250 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000146510 Concurrency:0xc000146520 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc000418000] EnableMetricsReporting:0xc000146528}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc000660270 Timeout:1m0s}
F0216 07:16:42.672808       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc000572d60 TimeoutString:0xc000572d70 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00051dc20 Concurrency:0xc00051dc30 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc0001bd810] EnableMetricsReporting:0xc00051dc38}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc000572d90 Timeout:1m0s}
F0216 07:21:50.960997       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0001179f0 TimeoutString:0xc000117a00 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000533c70 Concurrency:0xc000533c80 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc00050b650] EnableMetricsReporting:0xc000533c88}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc000117a20 Timeout:1m0s}
@pacoxu
Member Author

pacoxu commented Feb 26, 2024

Failed to validate custom plugin config 

{
  Plugin:custom 
  PluginGlobalConfig:{
    InvokeIntervalString:0xc0004711c0 
    TimeoutString:0xc0004711d0 
    InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00056f4c0 Concurrency:0xc00056f4d0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f
  } 
  Source:kernel-monitor 
  DefaultConditions:[
    {
      Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly
    }
  ] 
  Rules:[0xc0005ece00] 
  EnableMetricsReporting:0xc00056f4d8
}:

rule path "/home/kubernetes/bin/log-counter" does not exist. 
Rule: &{
  Type:permanent 
  Condition:FrequentUnregisterNetDevice 
  Reason:UnregisterNetDevice 
  Path:/home/kubernetes/bin/log-counter 
  Args:[
    --journald-source=kernel 
    --log-path=/var/log/journal 
    --lookback=20m
    --count=3 
    --pattern=
    unregister_netdevice: waiting for \w+ to become free. Usage count = \d+
    ]
  TimeoutString:0xc0004711f0 Timeout:1m0s}
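The fatal log above comes down to a check that the rule's `Path` exists on disk. As a minimal sketch of that kind of check (this is not the actual node-problem-detector source; the `rule` type and `validateRulePath` function are hypothetical names used only for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// rule is a hypothetical, trimmed-down stand-in for a custom plugin rule.
// Only the fields needed for the path check are kept.
type rule struct {
	Condition string
	Path      string
}

// validateRulePath returns an error when the plugin binary referenced by
// the rule is missing or cannot be inspected.
func validateRulePath(r rule) error {
	if r.Path == "" {
		return errors.New("rule path is empty")
	}
	if _, err := os.Stat(r.Path); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return fmt.Errorf("rule path %q does not exist", r.Path)
		}
		return fmt.Errorf("rule path %q cannot be checked: %w", r.Path, err)
	}
	return nil
}

func main() {
	r := rule{
		Condition: "FrequentUnregisterNetDevice",
		Path:      "/home/kubernetes/bin/log-counter",
	}
	if err := validateRulePath(r); err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("rule path OK")
}
```

Run inside an image that lacks `log-counter`, a check like this would fail for the rule shown in the log, which matches the crash loop seen after the upgrade.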

"path": "/home/kubernetes/bin/log-counter",

@pacoxu
Member Author

pacoxu commented Feb 26, 2024

v0.8.15 lost the log-counter binary after #801 @hakman @vteratipally

➜  ~ docker run -it --rm --entrypoint=ls registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.15 /home/kubernetes/bin/
health-checker
➜  ~ docker run -it --rm --entrypoint=ls registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.13 /home/kubernetes/bin/
health-checker	log-counter

@pacoxu
Member Author

pacoxu commented Feb 26, 2024

A local build shows that log-counter is not built because journald support is disabled:

WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
		-o bin/node-problem-detector \
		-ldflags '-X k8s.io/node-problem-detector/pkg/version.version=v0.8.15-20-gd1166d34' \
		-tags "" \
		./cmd/nodeproblemdetector
echo "Warning: log-counter requires journald, skipping."
Warning: log-counter requires journald, skipping.

@hakman
Member

hakman commented Feb 26, 2024

My guess is that it happens because of the way CloudBuild runs:

ENABLE_JOURNALD=0 make push-container

@pacoxu
Member Author

pacoxu commented Feb 26, 2024

This seems to be intentional, per kubernetes/test-infra#23202 (comment)?

@pacoxu
Member Author

pacoxu commented Feb 26, 2024

/cc @SergeyKanzhelev

@hakman do you have some proposals to fix this?

  • Can we set ENABLE_JOURNALD=1?
  • We should also fail the image build instead of silently skipping log-counter.
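One way to make a missing binary fail loudly is a post-build check on the image's bin directory. The following is only a sketch under assumptions: `missingBinaries` is a hypothetical helper, not part of the node-problem-detector build, and the directory and binary names are passed on the command line rather than taken from any real build configuration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// missingBinaries returns the names in required that are not regular
// files under dir, preserving the order of required.
func missingBinaries(dir string, required []string) []string {
	var missing []string
	for _, name := range required {
		info, err := os.Stat(filepath.Join(dir, name))
		if err != nil || !info.Mode().IsRegular() {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: checkbins DIR BINARY...")
		return
	}
	dir, required := os.Args[1], os.Args[2:]
	if missing := missingBinaries(dir, required); len(missing) > 0 {
		// Exit non-zero so a CI step wrapping this check fails the build.
		fmt.Fprintf(os.Stderr, "missing binaries in %s: %s\n", dir, strings.Join(missing, ", "))
		os.Exit(1)
	}
	fmt.Println("all required binaries present")
}
```

For example, it could be invoked as `go run checkbins.go /home/kubernetes/bin health-checker log-counter` against the unpacked image contents. With the v0.8.15 layout listed earlier in this thread, it would report `log-counter` as missing and exit non-zero.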

@hakman
Member

hakman commented Feb 26, 2024

@pacoxu @SergeyKanzhelev Let's give #867 a try.

@hakman
Member

hakman commented Feb 28, 2024

@pacoxu could you give gcr.io/k8s-staging-npd/node-problem-detector:master a try?
If all is OK, we could do a release.

@wangzhen127
Member

It looks like the issue was resolved in kubernetes/kubernetes#123114.

/close

@k8s-ci-robot
Contributor

@wangzhen127: Closing this issue.

In response to this:

It looks like the issue was resolved in kubernetes/kubernetes#123114.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
