Conversation
Files modified:
- crates/runtime/src/watchdog.rs - New watchdog module (already created)
- crates/runtime/src/lib.rs - Added pub mod watchdog;
- crates/runtime/src/scheduler.rs - Call install_watchdog() in scheduler_init()
- crates/runtime/src/channel.rs - Fixed flaky test with retry mechanism
- docs/ROADMAP.md - Added watchdog documentation
Features:
- Environment variable configuration:
- SEQ_WATCHDOG_SECS - threshold (0 = disabled, default)
- SEQ_WATCHDOG_INTERVAL - check frequency (default: 5s)
- SEQ_WATCHDOG_ACTION - warn (default) or exit
- Behavior:
- Dedicated thread scans strand registry periodically
- Detects strands running longer than threshold
- Dumps full diagnostics (same as SIGQUIT) on first detection
- Suppresses repeated warnings for the same stuck strand
- Can terminate process if SEQ_WATCHDOG_ACTION=exit
- Zero hot-path overhead - checking happens on a separate thread
All 288 tests pass. Ready for testing with an actual infinite loop.
Code Review: Watchdog Timer ImplementationOverall, this is a solid implementation with clean design and good documentation. All 288 tests passing is excellent. Here's my detailed feedback: ✅ Strengths
🔍 Issues & SuggestionsCritical: Thread Safety Bug in
|
Pull Request Review: Watchdog Timer ImplementationI've reviewed PR #99 implementing the watchdog timer feature. Overall, this is a solid, well-designed implementation that follows Rust best practices and integrates cleanly with the existing runtime. Here's my detailed feedback: ✅ Strengths1. Excellent Code Quality
2. Smart Design Decisions
3. Good Error Handling
4. Test Coverage
🔍 Issues & RecommendationsCritical: Thread Leak in
|
| Priority | Issue | Action |
|---|---|---|
| High | Thread leak on shutdown | Add shutdown mechanism |
| Medium | Silent spawn failure | Log error to stderr |
| Low | Suppression per-strand | Track triggered strands individually |
| Low | Integration test | Add test for actual stuck strand detection |
| Info | Second precision | Document or enhance to milliseconds if needed |
✅ Approval Status
This PR is approved with minor suggestions. The core implementation is solid and production-ready. The critical issue (thread leak) should be addressed but won't cause data corruption or crashes - it's a cleanup concern.
Suggested merge strategy:
- Address the thread leak issue (high priority)
- Add error logging for spawn failure (medium priority)
- Other improvements can be follow-up PRs if desired
Great work on the implementation! The design is clean, the code is well-documented, and it integrates nicely with the existing runtime.
1. Thread spawn failure logging (Critical):
if let Err(e) = std::thread::Builder::new()...spawn(...) {
eprintln!("[watchdog] WARNING: Failed to start watchdog thread: {}", e);
}
2. Invalid system time handling (Medium):
let now = SystemTime::now()
.duration_since(UNIX_EPOCH)
.ok()
.map(|d| d.as_secs())?; // Returns None if time is invalid
3. Per-strand trigger tracking (Low):
- Changed WATCHDOG_TRIGGERED: AtomicBool → WATCHDOG_TRIGGERED_STRAND: AtomicU64
- Dumps diagnostics on first trigger OR when a different strand gets stuck
4. Relaxed memory ordering (Low):
- Changed SeqCst → Relaxed for the trigger tracking
5. Environment variable parsing tests (Low):
- Added 4 new tests for from_env(): all values, warn action, invalid values, unknown action
Also removed the emoji from the warning message per the observation about terminal compatibility.
Code Review: Watchdog Timer ImplementationSummaryThis PR adds a watchdog timer feature to detect stuck strands (infinite loops, runaway computation). The implementation is well-designed and follows good practices. All 288 tests pass. Overall this is production-ready with a few minor suggestions below. ✅ Strengths1. Excellent Architecture
2. Good Configuration Design
3. Smart Detection Logic
4. Testing
🔍 Code Quality Issues1. watchdog.rs:156 - Potential Time Handling Issuelet now = SystemTime::now()
.duration_since(UNIX_EPOCH)
.ok()
.map(|d| d.as_secs())?;Issue: This returns Recommendation: Add a one-time warning when system time is invalid: let now = match SystemTime::now().duration_since(UNIX_EPOCH) {
Ok(d) => d.as_secs(),
Err(e) => {
static WARNED: AtomicBool = AtomicBool::new(false);
if !WARNED.swap(true, Ordering::Relaxed) {
eprintln!("[watchdog] WARNING: Invalid system time: {}", e);
}
return None;
}
};2. watchdog.rs:197-198 - Inconsistent Error HandlingWatchdogAction::Exit => {
let _ = writeln!(stderr, " Exiting due to SEQ_WATCHDOG_ACTION=exit");
std::process::exit(1);
}Issue: Ignoring the writeln error before exiting. If stderr write fails, the user won't know why the process terminated. Recommendation: Use WatchdogAction::Exit => {
eprintln!(" Exiting due to SEQ_WATCHDOG_ACTION=exit");
std::process::exit(1);
}3. watchdog.rs:261-262 - Unsafe Usage DocumentationThe unsafe helpers Current: // SAFETY: caller ensures mutex is heldBetter: // SAFETY: ENV_TEST_MUTEX held by caller, preventing data races on process-global env vars🐛 Potential Bugs1. Race Condition in handle_stuck_strand (watchdog.rs:185)let prev_strand = WATCHDOG_TRIGGERED_STRAND.swap(strand_id, Ordering::Relaxed);
let is_new_strand = prev_strand != strand_id;Issue: Uses Recommendation: Use let prev_strand = WATCHDOG_TRIGGERED_STRAND.swap(strand_id, Ordering::SeqCst);The performance difference is negligible here since this is not a hot path. 2. Missing Validation in WatchdogConfig::from_env (watchdog.rs:86)let action = std::env::var("SEQ_WATCHDOG_ACTION")
.ok()
.map(|s| match s.to_lowercase().as_str() {
"exit" => WatchdogAction::Exit,
_ => WatchdogAction::Warn,
})
.unwrap_or(WatchdogAction::Warn);Issue: Invalid actions silently fall back to "warn" without informing the user. If someone types Recommendation: Warn on unrecognized values: let action = std::env::var("SEQ_WATCHDOG_ACTION")
.ok()
.map(|s| {
let lower = s.to_lowercase();
match lower.as_str() {
"exit" => WatchdogAction::Exit,
"warn" => WatchdogAction::Warn,
_ => {
eprintln!("[watchdog] WARNING: Unknown action '{}', using 'warn'", s);
WatchdogAction::Warn
}
}
})
.unwrap_or(WatchdogAction::Warn);⚡ Performance ConsiderationsGood
Minor ConcernThe watchdog thread wakes up every pub fn install_watchdog() {
WATCHDOG_INIT.call_once(|| {
let config = WatchdogConfig::from_env();
if !config.is_enabled() {
return; // Good - already does this
}
// ...
});
}Actually, looking at the code - this is already correctly handled! No issues here. 🔒 Security ConsiderationsGood
MinorThe
No security issues found. 📊 Test CoverageExcellent Coverage
Missing Tests
Recommendation: Add at least one integration test: #[test]
fn test_watchdog_detects_stuck_strand() {
reset_triggered();
// Simulate a strand running for longer than threshold
// Verify diagnostic dump happens
}🔧 channel.rs Test FixesExcellent FixThe retry mechanism in fn get_stats_with_retry() -> Option<Vec<super::ChannelStats>> {
for _ in 0..10 {
if let Some(stats) = super::channel_stats() {
return Some(stats);
}
std::thread::sleep(std::time::Duration::from_millis(1));
}
None
}Strengths:
Alternative approach (not required, just for consideration): 📝 Documentation ReviewROADMAP.md Updates - Excellent
watchdog.rs Module Docs - Good
Minor suggestion: Add a note about what "stuck" means: //! Monitors strand execution time and triggers alerts when strands run too long
//! without yielding. A strand is considered "stuck" if it has been executing
//! continuously since spawn without completing, for longer than the threshold.
📋 Integration Changesscheduler.rs// Install watchdog timer (if enabled via SEQ_WATCHDOG_SECS)
crate::watchdog::install_watchdog();✅ Perfect: Clean integration, properly commented, follows existing pattern. lib.rspub mod watchdog;✅ Perfect: Simple module export. Final RecommendationsCritical (Should Fix)None! This is production-ready. High Priority (Strongly Recommended)
Nice to Have
ConclusionThis is high-quality production code. The watchdog implementation is:
The channel test fixes are pragmatic and appropriate. Recommendation: Approve with minor suggestions Great work! This will be a valuable debugging tool for production systems. |
Files modified:
Features:
All 288 tests pass. Ready for testing with an actual infinite loop.