Hey team,
I'm running into an issue with node-problem-detector where the kmsg file descriptor gets closed after the watcher is overloaded with messages:
E1103 12:19:38.657492 1 log_watcher_linux.go:102] Kmsg channel closed
...
E1103 12:19:45.959686 1 log_monitor.go:136] Log channel closed: /config/kernel-monitor.json
sudo journalctl --since "2025-11-03 12:19:37" --until "2025-11-03 12:19:40" | grep -E "error|fail|killed|oom|memory" | wc -l
56477
When the kmsg FD is closed, the pod keeps running without any indication that something is wrong. Is there a way to detect this and restart the container, or some other way to mitigate it?
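One crude stopgap I've been considering (just a sketch, not battle-tested; the namespace and label selector below are assumptions based on a typical install) is an external cron job that greps recent NPD pod logs for the error and deletes the affected pod so the DaemonSet recreates it:

#!/bin/bash
# restart_on_kmsg_close.sh — workaround sketch; namespace/label are assumptions
# Delete any NPD pod whose recent logs show the kmsg watcher has died,
# so the DaemonSet controller recreates it with a fresh /dev/kmsg reader.
NAMESPACE=${NAMESPACE:-kube-system}
SELECTOR=${SELECTOR:-app=node-problem-detector}

for pod in $(kubectl -n "$NAMESPACE" get pods -l "$SELECTOR" -o name); do
  if kubectl -n "$NAMESPACE" logs "$pod" --since=10m | grep -q "Kmsg channel closed"; then
    echo "$(date -Is) restarting $pod: kmsg watcher closed"
    kubectl -n "$NAMESPACE" delete "$pod"
  fi
done

This only treats the symptom, though, so I'd still like a supported way for the container itself to fail or recover when the channel closes.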
Related issues: #1003 and #1004
Kubernetes version: 1.33
Node Problem Detector version: v1.34.0
Steps to reproduce:
Prereqs:
- Kubernetes cluster with node-problem-detector daemonset installed
Steps:
- SSH to one of the nodes in the cluster
- Run the script below
#!/bin/bash
# minimal_fast_flood.sh — flood /dev/kmsg with kernel-log-style messages
COUNT=${1:-100000}     # number of messages to write (default 100000)
UPDATE_EVERY=5000      # progress update interval
# Re-exec under sudo if not already root (writing to /dev/kmsg requires it)
[ "$EUID" -ne 0 ] && exec sudo bash "$0" "$@"
echo "Flooding: $COUNT messages | $(date '+%H:%M:%S')"
START=$(date +%s)
exec 3>/dev/kmsg
for ((i=1; i<=COUNT; i++)); do
  # Mimics a kernel "task blocked" line of the kind the kernel monitor matches
  echo "task docker:abc123 blocked for more than 60 seconds." >&3
  ((i % UPDATE_EVERY == 0)) && echo "  $i/$COUNT"
done
exec 3>&-
ELAPSED=$(($(date +%s) - START))
(( ELAPSED == 0 )) && ELAPSED=1   # avoid division by zero on very fast runs
echo "Done: ${ELAPSED}s ($((COUNT / ELAPSED)) msg/s)"
- Check the node-problem-detector pod logs for the errors shown above (example command below)
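For example (the namespace and label selector here are assumptions; adjust to your deployment):

# Assumes NPD runs as a DaemonSet in kube-system with this label
kubectl -n kube-system logs -l app=node-problem-detector --tail=500 \
  | grep -E "Kmsg channel closed|Log channel closed"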