Failure recovery hooks and retry count variables #242

mailiam · 2026-02-21T02:24:26Z


Current Behavior

In pitchfork, daemons can be configured with a retry count to automatically restart on failure (e.g., retry = 3 allows up to 3 restarts). However, once retries are exhausted, the daemon simply stays in an "Errored" state with no further action: no cleanup, no notifications, and no alternative recovery mechanism. The retry count is tracked internally but isn't exposed to the daemon's run command.

This limits robustness for production-like scenarios where you might want to handle failures gracefully after all retries fail.

Proposed Enhancements

On-Failure Hook: Add a new config option like on_failure = "cleanup.sh" in pitchfork.toml. When retries are fully exhausted, execute this hook once (e.g., for logging, sending alerts, or running cleanup scripts).
- Or it could use the existing dependency / trigger config (not a concrete idea yet)
Retry Count Injection: Pass the current retry attempt as an environment variable (e.g., PITCHFORK_RETRIED=3) to the daemon's run command, allowing the script to adjust behavior based on the attempt number (e.g., exponential backoff or different logic per retry).
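
To make the proposed semantics concrete, here is a minimal sketch of the supervision loop in plain shell. It only models the behavior described above and is not pitchfork's actual implementation: the `supervise` function and its argument order are hypothetical, and `PITCHFORK_RETRIED` is just the example name used in this proposal.

```shell
# Hypothetical model of the proposed behavior; not pitchfork's real code.
# Usage: supervise RETRY_LIMIT ON_FAILURE_CMD RUN_CMD
supervise() {
    retry_limit=$1
    on_failure=$2
    run_cmd=$3
    attempt=0   # number of retries performed so far (0 on the first run)

    while :; do
        # Inject the retry count into the run command's environment only.
        if PITCHFORK_RETRIED=$attempt sh -c "$run_cmd"; then
            return 0
        fi
        # Stop once the configured number of retries has been used up.
        if [ "$attempt" -ge "$retry_limit" ]; then
            break
        fi
        attempt=$((attempt + 1))
    done

    # Retries exhausted: run the on_failure hook exactly once, if one is set.
    if [ -n "$on_failure" ]; then
        PITCHFORK_RETRIED=$attempt sh -c "$on_failure"
    fi
    return 1
}
```

For example, `supervise 3 "./scripts/cleanup-cache.sh" "node server.js"` would start the server once, restart it up to 3 times, and then run the cleanup script a single time if every attempt failed.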

Use Case Examples

Database Cleanup on API Failure: An API daemon retries 3 times on connection errors. After exhaustion, run an on_failure script to flush caches or notify DevOps.
```
[daemons.api]
run = "node server.js"
retry = 3
on_failure = "./scripts/cleanup-cache.sh"
```
Benefit: Prevents stale data buildup without manual intervention.
Progressive Retry Logic: A batch job script uses PITCHFORK_RETRIED to increase delays (e.g., sleep longer on higher attempts).
```
# In your run script
# Default to 0 so the script still works when the variable is unset.
delay=$(( ${PITCHFORK_RETRIED:-0} * 10 ))  # 0s on first run, 10s on retry 1, etc.
sleep "$delay" && ./batch-process.sh
```
Benefit: Smarter backoff without pitchfork handling complex logic.
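
If exponential rather than linear backoff is wanted, the same idea could be sketched as a small helper. Everything here is an assumption for illustration: `backoff_delay` is a hypothetical function, the 5-second base and 300-second cap are arbitrary defaults, and the retry count is passed as an argument because the final name of the injected variable is not settled in this post.

```shell
# Hypothetical helper: prints the delay in seconds for a given retry count.
# Usage: backoff_delay RETRIES [BASE_SECONDS] [MAX_SECONDS]
backoff_delay() {
    retries=${1:-0}
    base=${2:-5}
    max=${3:-300}

    # No delay on the very first run.
    if [ "$retries" -le 0 ]; then
        echo 0
        return 0
    fi

    # Double the delay for each retry after the first, stopping at the cap.
    delay=$base
    i=1
    while [ "$i" -lt "$retries" ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$max" ]; then
        delay=$max
    fi
    echo "$delay"
}
```

A run script could then call something like `sleep "$(backoff_delay "${PITCHFORK_RETRIED:-0}")" && ./batch-process.sh`, swapping in whatever variable name pitchfork ends up injecting.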
Alerting on Critical Failures: After retries fail for a critical service, trigger an alert via the hook.
```
[daemons.monitor]
run = "python monitor.py"
retry = 5
on_failure = "curl -X POST https://alerts.example.com/webhook"
```
Benefit: Integrates with external monitoring without polling logs.
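
A bare `curl -X POST` sends no body, so the receiving side cannot tell which daemon failed. A slightly richer hook script might look like the sketch below. The function names, the JSON fields, and the idea of passing the daemon name and retry count as arguments are all assumptions made for illustration; the proposal does not define what information pitchfork would hand to the hook.

```shell
# Hypothetical on_failure hook helpers; names and payload shape are assumptions.

# Build a small JSON payload. Assumes the daemon name contains no characters
# that need JSON escaping and that the retry count is a plain integer.
build_alert_payload() {
    daemon=$1
    retries=${2:-0}
    printf '{"daemon":"%s","retries":%s,"status":"errored"}' "$daemon" "$retries"
}

# POST the payload to a webhook URL.
# --fail makes curl exit non-zero on HTTP 4xx/5xx responses,
# --max-time keeps a hung endpoint from blocking the hook forever.
send_alert() {
    curl --fail --silent --show-error --max-time 10 \
        -X POST -H 'Content-Type: application/json' \
        -d "$(build_alert_payload "$1" "$2")" \
        "$3"
}
```

The hook could then end with a call such as `send_alert monitor 5 https://alerts.example.com/webhook`.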

What do you think? Are there other ways to handle this?

gaojunran · 2026-02-21T09:51:42Z


we will have #215.

3 replies

mailiam Feb 22, 2026
Author

Looks great! Forgive my searching skill 😅

mailiam Feb 22, 2026
Author

> we will have #215.

Wait, wouldn't we still benefit from an exposed retry count variable?

gaojunran Feb 23, 2026

according to #245, PITCHFORK_RETRY_COUNT will be injected.
