FPM delayed process restarting #9632

bukka · 2022-09-28T17:48:01Z

Description

There is an issue in FPM that happens when a child crashes shortly after it starts. This can cause a loop of child restarts, which can leave the master process unresponsive because it is overloaded with handling restarts (signal events). One such case was described in bug #61558. Although that particular issue was fixed, the general problem should be addressed because it can still happen due to a crash in an extension or a similar problem.

The solution could be to introduce an increasing delay between process restarts. The idea is to measure how many restarts were done in the last second, or some other sensible interval. If the count goes over, let's say, 2 x pm.max_children, we would set a delay before starting a new child. If it happens again within that interval plus the current delay, we would increase the delay. If there are no restarts, we would decrease the delay so that we can recover in case it was just a temporary problem, however unlikely that is.
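To make the idea more concrete, here is a rough sketch of the delay adjustment as a small standalone model. It is not FPM code: the class name, the millisecond units, the doubling/halving steps, and the base and maximum delays are all assumptions made for illustration. Only the 2 x pm.max_children threshold comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class RestartThrottle:
    """Hypothetical model of the proposed restart delay (not FPM code)."""

    max_children: int            # stands in for pm.max_children
    base_delay_ms: int = 100     # assumed first non-zero delay
    max_delay_ms: int = 10_000   # assumed upper bound for the delay
    delay_ms: int = 0            # current delay applied before starting a child

    def update(self, restarts_in_interval: int) -> int:
        """Adjust the delay from the restarts counted in (interval + current delay)."""
        threshold = 2 * self.max_children
        if restarts_in_interval > threshold:
            if self.delay_ms == 0:
                # First time over the threshold: start delaying.
                self.delay_ms = self.base_delay_ms
            else:
                # Over the threshold again: increase the delay (doubling is an assumption).
                self.delay_ms = min(self.delay_ms * 2, self.max_delay_ms)
        elif restarts_in_interval == 0:
            # No restarts at all: decrease the delay so the pool can recover.
            self.delay_ms //= 2
            if self.delay_ms < self.base_delay_ms:
                self.delay_ms = 0
        # Anything between 1 and the threshold leaves the delay unchanged.
        return self.delay_ms
```

With max_children set to 5, a reading of 11 restarts would switch the delay on, a second such reading would double it, and readings of zero restarts would step it back down until it is removed again.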

Unfortunately it brings various challenges:

The scoreboard would need to be extended with some extra data (probably just one field counting the number of children created since the start), and some sort of scoreboard history would need to be introduced to compare data between specific intervals. It could be done using scoreboard snapshots taken during server maintenance or managed by its own periodic events. The history should be easy to query, so a suitable structure might need to be introduced.
The current delay might also need to be stored in the scoreboard as it's a shared value. Together with the above check, this will increase the number of reads and writes to the scoreboard, so we might need smarter locking first.
Starting the child would need to move to a separate event so that the delay can be applied. It might be a good thing in general, but it's not clear whether it could introduce regressions if used for all starts, so we might need some abstraction that still lets us trigger immediate starts directly without going through the event loop. It might need some experimenting as well.
The ondemand pm needs consideration, as it has natural starts when scaling up. Surely we don't want to delay those.
Configuration for the specific parameters, so users can tweak them if the defaults are too strict or too lax for their workload. We should have good enough defaults, but we obviously cannot make them optimal for all workloads.
Finding the right defaults, which will require comprehensive testing with different sorts of configurations.
Possibly preventing an infinite wait loop: it should be enough to limit it to the sum of max children in all pools. Alternatively, it might be worth giving another try to "Add FPM process.restart_batch_size option" (#9027), which primarily catches the pid of the terminated child. That might need some extra checking on Mac, though, as it was failing there, and it's not clear that there is much benefit in it.
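As an illustration of the first challenge, the scoreboard history could be modelled as a small ring buffer of snapshots of the "children created" counter. This is only a sketch under assumptions: the class, its method names, and the fixed capacity are hypothetical, and a real implementation would live in shared memory with appropriate locking.

```python
from collections import deque


class SpawnHistory:
    """Hypothetical ring buffer of (timestamp, total children started) snapshots."""

    def __init__(self, capacity: int = 16) -> None:
        # deque with maxlen drops the oldest snapshot automatically when full.
        self.snapshots = deque(maxlen=capacity)

    def record(self, now: float, total_started: int) -> None:
        """Store a snapshot, e.g. from a periodic maintenance event."""
        self.snapshots.append((now, total_started))

    def started_since(self, now: float, interval: float) -> int:
        """Children started between the oldest snapshot inside the interval and the newest one."""
        if not self.snapshots:
            return 0
        latest_total = self.snapshots[-1][1]
        for timestamp, total in self.snapshots:
            if now - timestamp <= interval:
                return latest_total - total
        return 0
```

The difference between two snapshots gives the number of starts in that interval, which is the value the restart check would compare against its threshold. The resolution is limited by how often snapshots are taken.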


bukka added Feature SAPI: fpm labels Sep 28, 2022

bukka mentioned this issue Sep 28, 2022

Add FPM process.restart_batch_size option #9027

Closed
