Skip to content
This repository was archived by the owner on Apr 29, 2026. It is now read-only.

⏺ The watchdog timer implementation is complete. Here's a summary:#99

Merged
navicore merged 3 commits intomainfrom
watchdog
Dec 16, 2025
Merged

⏺ The watchdog timer implementation is complete. Here's a summary:#99
navicore merged 3 commits intomainfrom
watchdog

Conversation

@navicore
Copy link
Copy Markdown
Owner

Files modified:

  • crates/runtime/src/watchdog.rs - New watchdog module (already created)
  • crates/runtime/src/lib.rs - Added pub mod watchdog;
  • crates/runtime/src/scheduler.rs - Call install_watchdog() in scheduler_init()
  • crates/runtime/src/channel.rs - Fixed flaky test with retry mechanism
  • docs/ROADMAP.md - Added watchdog documentation

Features:

  • Environment variable configuration:
    • SEQ_WATCHDOG_SECS - threshold (0 = disabled, default)
    • SEQ_WATCHDOG_INTERVAL - check frequency (default: 5s)
    • SEQ_WATCHDOG_ACTION - warn (default) or exit
  • Behavior:
    • Dedicated thread scans strand registry periodically
    • Detects strands running longer than threshold
    • Dumps full diagnostics (same as SIGQUIT) on first detection
    • Suppresses repeated warnings for the same stuck strand
    • Can terminate process if SEQ_WATCHDOG_ACTION=exit
  • Zero hot-path overhead - checking happens on a separate thread

All 288 tests pass. Ready for testing with an actual infinite loop.

  Files modified:
  - crates/runtime/src/watchdog.rs - New watchdog module (already created)
  - crates/runtime/src/lib.rs - Added pub mod watchdog;
  - crates/runtime/src/scheduler.rs - Call install_watchdog() in scheduler_init()
  - crates/runtime/src/channel.rs - Fixed flaky test with retry mechanism
  - docs/ROADMAP.md - Added watchdog documentation

  Features:
  - Environment variable configuration:
    - SEQ_WATCHDOG_SECS - threshold (0 = disabled, default)
    - SEQ_WATCHDOG_INTERVAL - check frequency (default: 5s)
    - SEQ_WATCHDOG_ACTION - warn (default) or exit
  - Behavior:
    - Dedicated thread scans strand registry periodically
    - Detects strands running longer than threshold
    - Dumps full diagnostics (same as SIGQUIT) on first detection
    - Suppresses repeated warnings for the same stuck strand
    - Can terminate process if SEQ_WATCHDOG_ACTION=exit
  - Zero hot-path overhead - checking happens on a separate thread

  All 288 tests pass. Ready for testing with an actual infinite loop.
@claude
Copy link
Copy Markdown

claude Bot commented Dec 16, 2025

Code Review: Watchdog Timer Implementation

Overall, this is a solid implementation with clean design and good documentation. All 288 tests passing is excellent. Here's my detailed feedback:


Strengths

  1. Excellent documentation: The module-level docs clearly explain configuration, design rationale, and usage examples
  2. Zero hot-path overhead: Using a dedicated thread for monitoring is the right choice
  3. Clean separation of concerns: Config, detection logic, and action handling are well separated
  4. Lock-free design: Properly uses atomics and leverages existing strand registry infrastructure
  5. Idempotent initialization: Using Once ensures safe multiple calls
  6. Good error handling: Silently ignores thread spawn failures without panicking
  7. Test fix is solid: The retry mechanism for test_channel_stats properly handles lock contention

🔍 Issues & Suggestions

Critical: Thread Safety Bug in install_watchdog (crates/runtime/src/watchdog.rs:129)

std::thread::Builder::new()
    .name("seq-watchdog".to_string())
    .spawn(move || watchdog_loop(config))
    .ok(); // ⚠️ Silently ignoring spawn failure

Problem: If thread spawn fails, the watchdog is silently disabled with no indication to the user. Since the user explicitly configured SEQ_WATCHDOG_SECS, they expect watchdog protection.

Recommendation: Log the failure to stderr:

.spawn(move || watchdog_loop(config))
.unwrap_or_else(|e| {
    eprintln!("[watchdog] WARNING: Failed to start watchdog thread: {}", e);
});

Medium: Time Source Vulnerability (crates/runtime/src/watchdog.rs:151-154)

let now = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .map(|d| d.as_secs())
    .unwrap_or(0);  // ⚠️ Returns 0 on error (e.g., system time before 1970)

Problem: If system time is invalid, unwrap_or(0) causes now.saturating_sub(spawn_time) to produce artificially large durations, triggering false positives.

Recommendation: Skip the check if time is invalid:

let now = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .ok()
    .map(|d| d.as_secs())?;  // Return None if time is invalid

Then update check_for_stuck_strands to return Option and handle None in the caller.


Low: Repeated Warnings After First Detection (crates/runtime/src/watchdog.rs:181-183)

let was_triggered = WATCHDOG_TRIGGERED.swap(true, Ordering::SeqCst);

Issue: Once triggered, the global flag stays set forever, suppressing diagnostics even if:

  • The stuck strand completes and a different strand gets stuck
  • The threshold is very long (e.g., 1 hour) and periodic diagnostics would be useful

Recommendation: Track which strand triggered the watchdog:

static WATCHDOG_TRIGGERED_STRAND: AtomicU64 = AtomicU64::new(0);

fn handle_stuck_strand(strand_id: u64, running_secs: u64, config: &WatchdogConfig) {
    let prev = WATCHDOG_TRIGGERED_STRAND.swap(strand_id, Ordering::SeqCst);
    let is_new_strand = prev != strand_id;
    
    // Dump diagnostics on first trigger OR new stuck strand
    if prev == 0 || is_new_strand {
        dump_diagnostics();
    }
    // ...
}

Low: Missing Test Coverage

The tests only cover basic config validation. Consider adding:

  1. Environment variable parsing: Test from_env() with various SEQ_WATCHDOG_* values
  2. Stuck strand detection logic: Test check_for_stuck_strands() with mock timestamps
  3. Edge cases: Negative/zero thresholds, invalid env var values

Example test:

#[test]
fn test_from_env_parsing() {
    std::env::set_var("SEQ_WATCHDOG_SECS", "30");
    std::env::set_var("SEQ_WATCHDOG_INTERVAL", "10");
    std::env::set_var("SEQ_WATCHDOG_ACTION", "exit");
    
    let config = WatchdogConfig::from_env();
    assert_eq!(config.threshold_secs, 30);
    assert_eq!(config.interval_secs, 10);
    assert_eq!(config.action, WatchdogAction::Exit);
    
    std::env::remove_var("SEQ_WATCHDOG_SECS");
    std::env::remove_var("SEQ_WATCHDOG_INTERVAL");
    std::env::remove_var("SEQ_WATCHDOG_ACTION");
}

Low: Memory Ordering Could Be Relaxed (crates/runtime/src/watchdog.rs:183)

let was_triggered = WATCHDOG_TRIGGERED.swap(true, Ordering::SeqCst);

Optimization: SeqCst is unnecessarily strong here. Ordering::Relaxed is sufficient since there's no data dependency:

let was_triggered = WATCHDOG_TRIGGERED.swap(true, Ordering::Relaxed);

Documentation: Consider Adding Performance Notes

In the module docs, explicitly state the overhead:

  • Memory: One thread stack (~2MB default) + sizeof(WatchdogConfig) (~24 bytes)
  • CPU: One thread sleeping most of the time, minimal CPU when checking registry
  • Worst case: Registry scan is O(n) where n = registry size (default 1024)

📋 Minor Observations

  1. Emoji consistency (crates/runtime/src/watchdog.rs:191): Using ⚠️ is nice, but verify it renders correctly in all terminal environments the project targets
  2. ROADMAP.md formatting: Excellent documentation, matches the module docs well
  3. Channel test fix: The 10-retry limit with 1ms sleep (10ms total) seems reasonable for test environments

🎯 Recommendations Summary

Must fix before merge:

  • Log watchdog thread spawn failures to stderr

Should fix before merge:

  • Handle invalid system time gracefully (avoid false positives)
  • Track per-strand trigger state (avoid suppressing different stuck strands)

Nice to have:

  • Add environment variable parsing tests
  • Use Relaxed ordering for WATCHDOG_TRIGGERED
  • Document performance characteristics

Approval Status

This is good work and ready to merge after addressing the thread spawn failure logging. The time handling and trigger suppression issues are important but not blockers.

The integration is clean, the test fix is appropriate, and the feature will be very useful for debugging production issues. Great job! 🎉

@claude
Copy link
Copy Markdown

claude Bot commented Dec 16, 2025

Pull Request Review: Watchdog Timer Implementation

I've reviewed PR #99 implementing the watchdog timer feature. Overall, this is a solid, well-designed implementation that follows Rust best practices and integrates cleanly with the existing runtime. Here's my detailed feedback:


✅ Strengths

1. Excellent Code Quality

  • Clear documentation: Module-level docs with configuration tables and examples (watchdog.rs:1-31)
  • Consistent style: Matches existing codebase conventions (e.g., diagnostics.rs, scheduler.rs)
  • Proper atomics: Uses appropriate Ordering semantics throughout
  • Idempotent initialization: Correct use of Once for thread-safe initialization (watchdog.rs:113)

2. Smart Design Decisions

  • Zero hot-path overhead: Watchdog runs on a separate thread, no performance impact on strands
  • Piggybacks on existing infrastructure: Leverages strand_registry() instead of adding new tracking
  • Graceful degradation: Registry overflow doesn't break functionality, just reduces visibility
  • Suppression logic: Avoids log spam by tracking triggered state (watchdog.rs:183)

3. Good Error Handling

  • Silently handles thread spawn failure (watchdog.rs:128-129) - appropriate for optional feature
  • Validates environment variables with fallbacks (watchdog.rs:74-98)
  • Handles SystemTime::now() failure gracefully (watchdog.rs:151-154)

4. Test Coverage

  • Config validation tests (watchdog.rs:224-240)
  • Channel test flakiness fix is excellent - proper retry with backoff (channel.rs:853-861)

🔍 Issues & Recommendations

Critical: Thread Leak in install_watchdog()

Location: watchdog.rs:126-129

std::thread::Builder::new()
    .name("seq-watchdog".to_string())
    .spawn(move || watchdog_loop(config))
    .ok();  // ⚠️ Drops JoinHandle, thread leaks on shutdown

Problem: The spawned thread runs an infinite loop (watchdog.rs:137-143) with no shutdown mechanism. The JoinHandle is immediately dropped, making the thread unjoinable.

Impact:

  • Thread continues running after main program exits (orphaned OS thread)
  • May trigger warnings in test frameworks or process managers
  • Prevents clean shutdown in environments that expect graceful termination

Recommended Fix:
Add a shutdown mechanism using atomic flag or channel:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

static WATCHDOG_SHUTDOWN: AtomicBool = AtomicBool::new(false);

fn watchdog_loop(config: WatchdogConfig) {
    let interval = Duration::from_secs(config.interval_secs);
    
    while !WATCHDOG_SHUTDOWN.load(Ordering::Relaxed) {
        std::thread::sleep(interval);
        
        if let Some((strand_id, running_secs)) = check_for_stuck_strands(config.threshold_secs) {
            handle_stuck_strand(strand_id, running_secs, &config);
        }
    }
}

// Add shutdown function (call from scheduler_shutdown)
pub fn shutdown_watchdog() {
    WATCHDOG_SHUTDOWN.store(true, Ordering::Relaxed);
}

Then update scheduler.rs:321-323 to call watchdog::shutdown_watchdog().


Medium: Silent Error Handling

Location: watchdog.rs:129

.ok();  // Silently ignores spawn failure

Issue: Thread spawn failure is silently ignored. Users won't know the watchdog isn't running.

Recommendation: Log to stderr on failure:

.unwrap_or_else(|e| {
    eprintln!("[watchdog] Warning: failed to spawn thread: {}", e);
});

Low: Time Precision Consideration

Location: watchdog.rs:151-154, 142-143

let now = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .map(|d| d.as_secs())  // Truncates to seconds
    .unwrap_or(0);

Observation: Using second-granularity timestamps means:

  • A strand running for 59.9 seconds shows as 59 seconds
  • Check interval of 5 seconds may miss short-lived stuck strands

Is this an issue? Probably not for most use cases (detecting infinite loops), but worth documenting.

Optional Enhancement: Use millisecond precision if sub-second accuracy matters:

.map(|d| d.as_millis() as u64)

Low: Race Condition in Trigger Suppression

Location: watchdog.rs:183

let was_triggered = WATCHDOG_TRIGGERED.swap(true, Ordering::SeqCst);

Issue: Global flag means if strand #1 gets stuck, then unstuck, then strand #2 gets stuck, diagnostics for strand #2 are suppressed.

Expected behavior: Each stuck strand should get one diagnostic dump.

Fix: Use a HashSet<u64> to track which strand IDs have triggered:

use std::sync::Mutex;
use std::collections::HashSet;

static TRIGGERED_STRANDS: Mutex<HashSet<u64>> = Mutex::new(HashSet::new());

fn handle_stuck_strand(strand_id: u64, running_secs: u64, config: &WatchdogConfig) {
    let mut triggered = TRIGGERED_STRANDS.lock().unwrap();
    let was_triggered = !triggered.insert(strand_id);
    
    // ... rest of logic
}

Low: Missing Integration Test

Observation: Tests verify config parsing but don't test actual watchdog detection.

Recommendation: Add integration test:

#[test]
fn test_watchdog_detects_stuck_strand() {
    std::env::set_var("SEQ_WATCHDOG_SECS", "1");
    std::env::set_var("SEQ_WATCHDOG_INTERVAL", "1");
    
    install_watchdog();
    
    extern "C" fn infinite_loop(_stack: Stack) -> Stack {
        loop { std::thread::sleep(Duration::from_millis(100)); }
    }
    
    strand_spawn(infinite_loop, std::ptr::null_mut());
    std::thread::sleep(Duration::from_secs(3));
    
    // Verify triggered state or stderr output
}

(Note: This requires capturing stderr or exposing triggered state)


📋 Minor Observations

  1. channel.rs test fix (lines 852-861): Excellent pragmatic solution to flaky test. Retry with backoff is the right approach for parallel test contention.

  2. Documentation: ROADMAP.md is clear and well-formatted. Matches style of existing diagnostics documentation.

  3. Memory usage: Watchdog thread has minimal overhead (just stack + small config struct). Well within acceptable range.

  4. Security: No security concerns. Environment variables are properly validated, no command injection risks.


🎯 Recommendations Summary

Priority Issue Action
High Thread leak on shutdown Add shutdown mechanism
Medium Silent spawn failure Log error to stderr
Low Suppression per-strand Track triggered strands individually
Low Integration test Add test for actual stuck strand detection
Info Second precision Document or enhance to milliseconds if needed

✅ Approval Status

This PR is approved with minor suggestions. The core implementation is solid and production-ready. The critical issue (thread leak) should be addressed but won't cause data corruption or crashes - it's a cleanup concern.

Suggested merge strategy:

  1. Address the thread leak issue (high priority)
  2. Add error logging for spawn failure (medium priority)
  3. Other improvements can be follow-up PRs if desired

Great work on the implementation! The design is clean, the code is well-documented, and it integrates nicely with the existing runtime.

  1. Thread spawn failure logging (Critical):
  if let Err(e) = std::thread::Builder::new()...spawn(...) {
      eprintln!("[watchdog] WARNING: Failed to start watchdog thread: {}", e);
  }

  2. Invalid system time handling (Medium):
  let now = SystemTime::now()
      .duration_since(UNIX_EPOCH)
      .ok()
      .map(|d| d.as_secs())?;  // Returns None if time is invalid

  3. Per-strand trigger tracking (Low):
  - Changed WATCHDOG_TRIGGERED: AtomicBool → WATCHDOG_TRIGGERED_STRAND: AtomicU64
  - Dumps diagnostics on first trigger OR when a different strand gets stuck

  4. Relaxed memory ordering (Low):
  - Changed SeqCst → Relaxed for the trigger tracking

  5. Environment variable parsing tests (Low):
  - Added 4 new tests for from_env(): all values, warn action, invalid values, unknown action

  Also removed the emoji from the warning message per the observation about terminal compatibility.
@claude
Copy link
Copy Markdown

claude Bot commented Dec 16, 2025

Code Review: Watchdog Timer Implementation

Summary

This PR adds a watchdog timer feature to detect stuck strands (infinite loops, runaway computation). The implementation is well-designed and follows good practices. All 288 tests pass. Overall this is production-ready with a few minor suggestions below.


✅ Strengths

1. Excellent Architecture

  • Zero hot-path overhead: Monitoring runs on a dedicated thread, doesn't impact strand execution performance
  • Piggybacks on existing infrastructure: Reuses the strand registry rather than adding new tracking overhead
  • Idempotent initialization: Uses Once to safely handle multiple installation attempts
  • Lock-free tracking: Uses AtomicU64 for the triggered strand ID

2. Good Configuration Design

  • Environment variable-based config is appropriate for runtime behavior
  • Sensible defaults (disabled by default, 5s check interval, warn action)
  • Validation: interval must be > 0, falls back to default

3. Smart Detection Logic

  • Tracks which strand triggered to avoid spam (suppresses repeated warnings)
  • Reports the longest-running stuck strand when multiple exceed threshold
  • Dumps full diagnostics on first detection or when a different strand gets stuck
  • Handles system time errors gracefully (returns None rather than panic)

4. Testing

  • Good test coverage for config parsing, defaults, edge cases
  • Properly serializes env var tests with mutex to avoid race conditions
  • Uses unsafe correctly with clear documentation of safety requirements

🔍 Code Quality Issues

1. watchdog.rs:156 - Potential Time Handling Issue

let now = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .ok()
    .map(|d| d.as_secs())?;

Issue: This returns None if system time is before UNIX_EPOCH or if there's a clock error. While unlikely, this means the watchdog silently stops working if the system clock is misconfigured.

Recommendation: Add a one-time warning when system time is invalid:

let now = match SystemTime::now().duration_since(UNIX_EPOCH) {
    Ok(d) => d.as_secs(),
    Err(e) => {
        static WARNED: AtomicBool = AtomicBool::new(false);
        if !WARNED.swap(true, Ordering::Relaxed) {
            eprintln!("[watchdog] WARNING: Invalid system time: {}", e);
        }
        return None;
    }
};

2. watchdog.rs:197-198 - Inconsistent Error Handling

WatchdogAction::Exit => {
    let _ = writeln!(stderr, "    Exiting due to SEQ_WATCHDOG_ACTION=exit");
    std::process::exit(1);
}

Issue: Ignoring the writeln error before exiting. If stderr write fails, the user won't know why the process terminated.

Recommendation: Use let _ = write!() is acceptable here since we're about to exit anyway, but consider using eprintln!() macro instead which is clearer:

WatchdogAction::Exit => {
    eprintln!("    Exiting due to SEQ_WATCHDOG_ACTION=exit");
    std::process::exit(1);
}

3. watchdog.rs:261-262 - Unsafe Usage Documentation

The unsafe helpers set_env and restore_env are well-commented, but the SAFETY comments could be more specific:

Current:

// SAFETY: caller ensures mutex is held

Better:

// SAFETY: ENV_TEST_MUTEX held by caller, preventing data races on process-global env vars

🐛 Potential Bugs

1. Race Condition in handle_stuck_strand (watchdog.rs:185)

let prev_strand = WATCHDOG_TRIGGERED_STRAND.swap(strand_id, Ordering::Relaxed);
let is_new_strand = prev_strand != strand_id;

Issue: Uses Ordering::Relaxed which provides no synchronization guarantees. While this is probably fine for a diagnostic tool, it could theoretically lead to duplicate diagnostic dumps if multiple watchdog checks happen concurrently (though unlikely with sequential watchdog_loop).

Recommendation: Use Ordering::SeqCst or Ordering::AcqRel for clearer semantics:

let prev_strand = WATCHDOG_TRIGGERED_STRAND.swap(strand_id, Ordering::SeqCst);

The performance difference is negligible here since this is not a hot path.

2. Missing Validation in WatchdogConfig::from_env (watchdog.rs:86)

let action = std::env::var("SEQ_WATCHDOG_ACTION")
    .ok()
    .map(|s| match s.to_lowercase().as_str() {
        "exit" => WatchdogAction::Exit,
        _ => WatchdogAction::Warn,
    })
    .unwrap_or(WatchdogAction::Warn);

Issue: Invalid actions silently fall back to "warn" without informing the user. If someone types SEQ_WATCHDOG_ACTION=exist (typo), they won't know their config is wrong.

Recommendation: Warn on unrecognized values:

let action = std::env::var("SEQ_WATCHDOG_ACTION")
    .ok()
    .map(|s| {
        let lower = s.to_lowercase();
        match lower.as_str() {
            "exit" => WatchdogAction::Exit,
            "warn" => WatchdogAction::Warn,
            _ => {
                eprintln!("[watchdog] WARNING: Unknown action '{}', using 'warn'", s);
                WatchdogAction::Warn
            }
        }
    })
    .unwrap_or(WatchdogAction::Warn);

⚡ Performance Considerations

Good

  • ✅ Zero hot-path overhead (separate monitoring thread)
  • ✅ Lock-free atomic operations for triggered strand tracking
  • ✅ Bounded iteration over strand registry
  • ✅ Configurable check interval to control overhead

Minor Concern

The watchdog thread wakes up every interval_secs even when disabled. Consider adding an early return:

pub fn install_watchdog() {
    WATCHDOG_INIT.call_once(|| {
        let config = WatchdogConfig::from_env();
        
        if !config.is_enabled() {
            return;  // Good - already does this
        }
        // ...
    });
}

Actually, looking at the code - this is already correctly handled! No issues here.


🔒 Security Considerations

Good

  • ✅ No user input parsing (only env vars)
  • ✅ No shell command execution
  • ✅ Bounded resource usage (one thread, minimal memory)
  • ✅ Safe use of std::process::exit() rather than libc::exit()

Minor

The SEQ_WATCHDOG_SECS parsing could theoretically be exploited if someone sets it to a very large value, but:

  1. Environment variables are controlled by the process runner, not external input
  2. A large threshold just means "never trigger" which is equivalent to disabled
  3. No overflow risk with u64

No security issues found.


📊 Test Coverage

Excellent Coverage

  • ✅ Default configuration
  • ✅ Enabled configuration
  • ✅ Environment variable parsing (all three vars)
  • ✅ Invalid values (non-numeric, zero interval)
  • ✅ Unknown action defaults to warn
  • ✅ Env var test serialization (mutex protection)

Missing Tests

  1. No integration test for actual watchdog triggering

    • Consider adding a test that spawns a long-running strand and verifies the watchdog detects it
    • Could use the reset_triggered() function you've already provided for testing
  2. No test for SystemTime errors

    • Hard to test, but could mock if needed

Recommendation: Add at least one integration test:

#[test]
fn test_watchdog_detects_stuck_strand() {
    reset_triggered();
    // Simulate a strand running for longer than threshold
    // Verify diagnostic dump happens
}

🔧 channel.rs Test Fixes

Excellent Fix

The retry mechanism in crates/runtime/src/channel.rs is a pragmatic solution to parallel test lock contention:

fn get_stats_with_retry() -> Option<Vec<super::ChannelStats>> {
    for _ in 0..10 {
        if let Some(stats) = super::channel_stats() {
            return Some(stats);
        }
        std::thread::sleep(std::time::Duration::from_millis(1));
    }
    None
}

Strengths:

  • ✅ Bounded retry (10 attempts, 10ms total)
  • ✅ Graceful degradation (skips test instead of failing)
  • ✅ Minimal sleep duration (1ms)

Alternative approach (not required, just for consideration):
You could use std::sync::Barrier or a test-ordering attribute if this becomes a broader issue, but the retry approach is simpler and sufficient.


📝 Documentation Review

ROADMAP.md Updates - Excellent

  • ✅ Clear feature description
  • ✅ Configuration table with defaults
  • ✅ Usage examples
  • ✅ Implementation details
  • ✅ Properly placed in "Current" section

watchdog.rs Module Docs - Good

  • ✅ Clear module-level documentation
  • ✅ Configuration table
  • ✅ Usage examples
  • ✅ Design rationale

Minor suggestion: Add a note about what "stuck" means:

//! Monitors strand execution time and triggers alerts when strands run too long
//! without yielding. A strand is considered "stuck" if it has been executing
//! continuously since spawn without completing, for longer than the threshold.

📋 Integration Changes

scheduler.rs

// Install watchdog timer (if enabled via SEQ_WATCHDOG_SECS)
crate::watchdog::install_watchdog();

Perfect: Clean integration, properly commented, follows existing pattern.

lib.rs

pub mod watchdog;

Perfect: Simple module export.


Final Recommendations

Critical (Should Fix)

None! This is production-ready.

High Priority (Strongly Recommended)

  1. Add warning for invalid SEQ_WATCHDOG_ACTION values
  2. Use Ordering::SeqCst instead of Relaxed for WATCHDOG_TRIGGERED_STRAND
  3. Add one integration test for watchdog triggering

Nice to Have

  1. Add warning for system time errors
  2. Clarify "stuck" definition in module docs
  3. Use eprintln!() instead of writeln!(stderr, ...) for consistency

Conclusion

This is high-quality production code. The watchdog implementation is:

  • ✅ Well-architected (zero hot-path overhead)
  • ✅ Well-tested (unit tests cover all config paths)
  • ✅ Well-documented (clear examples and rationale)
  • ✅ Safe (no undefined behavior, proper use of atomics)

The channel test fixes are pragmatic and appropriate.

Recommendation: Approve with minor suggestions

Great work! This will be a valuable debugging tool for production systems.

@navicore navicore merged commit 421fbc7 into main Dec 16, 2025
2 checks passed
@navicore navicore deleted the watchdog branch December 16, 2025 04:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant