Skip to content

Fix critical deployment bugs from audit#88

Merged
jonathonbyrdziak merged 2 commits intomasterfrom
fix/audit-critical-bugs
Mar 22, 2026
Merged

Fix critical deployment bugs from audit#88
jonathonbyrdziak merged 2 commits intomasterfrom
fix/audit-critical-bugs

Conversation

@jonathonbyrdziak
Copy link
Copy Markdown
Contributor

Summary

Fixes 8 critical/high-severity bugs found during deployment system audit. These bugs caused:

  • Failed deployments becoming permanently un-retryable
  • Lock file state being wiped on every watcher start
  • Guaranteed downtime during blue-green promotion
  • Silent failures due to swallowed stderr
  • Daemon processes using stale cached data

Bugs Fixed

Bug Severity Description
#1 Critical PID key mismatch: consumers read git.slave.pid, writer uses slave.pid
#2 Critical release.current written on failed deploys — failures become permanent
#3 High promote() stops old containers before starting new — guaranteed downtime
#7 High Singleton cache never cleared in daemon — stale data after disk changes
#12 Critical JsonLock::delete() on every GitSlave start wipes ALL lock file state
#13 Medium Zero logging in release-watcher poll loop; credential refresh after API call
#14 Medium Shell::run() appends 2>&1 to all commands, swallowing stderr
#15 Medium isLockedPIDStillRunning() reads Json instead of JsonLock for PID

Files Changed

Test plan

  • Deploy to EC2 slave node
  • Verify protocol restart starts watcher without wiping lock file
  • Verify release-watcher logs show poll activity every cycle
  • Wait for token expiry (1 hour) — confirm credential refresh works
  • Trigger a release and verify blue-green promotion has zero downtime
  • Kill a build mid-way — verify next cycle retries (release.current not set)

🤖 Generated with Claude Code

jonathonbyrdziak and others added 2 commits March 21, 2026 18:36
Bug #1: PID key mismatch — ProtocolStart, ProtocolStop, IncidentDetector
read 'git.slave.pid' but GitSlave writes 'slave.pid'. Fixed all consumers.

Bug #2: release-watcher writes release.current on failed deploys, making
failures permanently un-retryable. Now only updates on successful promotion.

Bug #3: BlueGreen::promote() stops old containers before starting new,
guaranteeing downtime. Rewritten to start new first, then stop old, then
restart old on shadow ports for standby rollback.

Bug #7: JsonLock/Json singleton caches never cleared in long-running
release-watcher daemon. Added clearInstances() calls each poll cycle.

Bug #12: GitSlave::execute() calls JsonLock::delete() on every start,
wiping ALL lock file state (release.current, blue-green state, etc).
Changed to only clear the stale PID key.

Bug #13: release-watcher had zero logging between startup and release
detection. Added comprehensive poll cycle logging and moved credential
refresh before API calls.

Bug #14: Shell::run() appended '2>&1' to every command, swallowing
stderr and mixing it with stdout. Removed — callers that need stderr
redirect explicitly.

Bug #15: Shell::isLockedPIDStillRunning() read from Json (protocol.json)
instead of JsonLock (protocol.lock) for PID. Fixed import and call.

Also: Config::clearInstances() added for daemon cache clearing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts with PR #87's merge into master:
- ProtocolStart, ProtocolStop, IncidentDetector: use DeploymentState API
  (supersedes the manual PID key fix since DeploymentState handles it)
- release-watcher: keep master's wlog() version, re-apply bug fixes:
  - Bug #7: singleton cache clearing each poll cycle
  - Bug #13: credential refresh failure logging
  - Bug #2: release.current only written on successful promotion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jonathonbyrdziak jonathonbyrdziak merged commit c47c46a into master Mar 22, 2026
5 of 8 checks passed
@jonathonbyrdziak jonathonbyrdziak deleted the fix/audit-critical-bugs branch March 22, 2026 02:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant