FPM has an issue when a child crashes shortly after it starts. This can cause a loop of child restarts, which can leave the master process unresponsive because it is overloaded with handling the restarts (signal events). One such case was described in bug #61558. Although that particular issue was fixed, the general problem should be addressed because it can still happen due to a crash in an extension or a similar problem.
The solution could be to introduce an increasing delay between process restarts. The idea is to measure how many restarts happened in the last second or some other sensible interval. If the count exceeds, let's say, 2 x pm.max_children, we would set a delay before starting a new child. If that happens again within the same interval plus the current delay, we increase the delay. If there are no restarts, we decrease the delay so that we can recover in case it was just a temporary problem, however unlikely that is.
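To make the idea more concrete, here is a minimal sketch of the delay logic in C. It is only an illustration under assumptions: the struct, the function names and the constants (one-second window, 100 ms first delay, doubling, 8 s cap) are all hypothetical and do not exist in FPM, and the scoreboard and event loop are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and values throughout; none of this exists in FPM. */
enum {
    THROTTLE_WINDOW_MS    = 1000, /* assumed base measuring window */
    THROTTLE_MIN_DELAY_MS = 100,  /* assumed first non-zero delay */
    THROTTLE_MAX_DELAY_MS = 8000  /* assumed upper bound for the delay */
};

struct restart_throttle {
    uint64_t window_start_ms; /* start of the current measuring window */
    unsigned restarts;        /* restarts counted in the current window */
    unsigned delay_ms;        /* delay to apply before the next child start */
};

/* Record one child restart at now_ms and return the delay to apply
 * before starting the replacement child. */
unsigned throttle_on_restart(struct restart_throttle *t, uint64_t now_ms,
                             unsigned max_children)
{
    assert(max_children > 0);

    /* The window is the base interval plus the current delay. */
    uint64_t window = (uint64_t)THROTTLE_WINDOW_MS + t->delay_ms;
    if (now_ms - t->window_start_ms >= window) {
        t->window_start_ms = now_ms;
        t->restarts = 0;
    }

    t->restarts++;
    if (t->restarts > 2 * max_children) {
        /* Too many restarts in the window: start or double the delay. */
        unsigned next = t->delay_ms == 0 ? THROTTLE_MIN_DELAY_MS
                                         : t->delay_ms * 2;
        t->delay_ms = next > THROTTLE_MAX_DELAY_MS ? THROTTLE_MAX_DELAY_MS
                                                   : next;
        t->window_start_ms = now_ms;
        t->restarts = 0;
    }
    return t->delay_ms;
}

/* Called periodically; halves the delay (down to zero) once a full
 * window has passed without any restart. */
void throttle_on_idle(struct restart_throttle *t, uint64_t now_ms)
{
    uint64_t window = (uint64_t)THROTTLE_WINDOW_MS + t->delay_ms;
    if (now_ms - t->window_start_ms < window) {
        return;
    }
    if (t->restarts == 0) {
        t->delay_ms = t->delay_ms > THROTTLE_MIN_DELAY_MS ? t->delay_ms / 2
                                                          : 0;
    }
    t->window_start_ms = now_ms;
    t->restarts = 0;
}
```

With max_children set to 2, the fifth restart inside one window is the first to get a non-zero delay, and a quiet window later brings the delay back down step by step.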
Unfortunately it brings various challenges:
The scoreboard would need to be extended with some extra data (probably just one field counting the number of children created since the start), and some sort of scoreboard history would need to be introduced to compare data between specific intervals. This could be done using scoreboard snapshots taken during server maintenance or managed by its own periodic events. The history should be easy to query, so a suitable structure might need to be introduced.
The current delay might also need to be stored in the scoreboard as it's a shared value. Together with the above check, this increases the number of reads and writes to the scoreboard, so we might need smarter locking first.
Starting a child would need to move to a separate event so that the delay can be applied. That might be a good thing in general, but it's not clear whether it could introduce a regression if used for all starts, so we might need some abstraction that still lets us trigger immediate starts directly without going through the event loop. It might need some experimenting as well.
The ondemand pm needs consideration, as it has natural starts when scaling up. Surely we don't want to delay those.
Configuration for the specific parameters, so users can tweak them if the defaults are too strict or too lax for their workload. We should have good enough defaults, but we obviously cannot make them optimal for all workloads.
Finding the right defaults, which will require comprehensive testing with different sorts of configurations.
Possibly prevent an infinite wait loop - it should be enough to limit it to the sum of max children in all pools. Or it might be worth giving another try to "Add FPM process.restart_batch_size option" (#9027), which primarily catches the pid of the terminated child. This might need some extra checking on Mac, though, as it was failing there, and it's not clear whether there is much benefit in it.
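As a rough illustration of the scoreboard history mentioned in the first point, the snapshots could live in a small ring buffer that stores the running "children started" counter together with a timestamp. This is a sketch only: the structure, the function names and the depth of 8 slots are assumptions, not existing FPM code, and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SNAPSHOT_SLOTS 8 /* assumed history depth */

/* Hypothetical snapshot of the "children started" counter. */
struct start_snapshot {
    uint64_t taken_ms;         /* when the snapshot was taken */
    uint64_t children_started; /* counter value at that time */
};

struct start_history {
    struct start_snapshot slots[SNAPSHOT_SLOTS];
    size_t next; /* slot that the next snapshot overwrites */
    size_t used; /* number of valid slots */
};

/* Store a snapshot, overwriting the oldest one when the ring is full. */
void history_record(struct start_history *h, uint64_t now_ms,
                    uint64_t children_started)
{
    h->slots[h->next] = (struct start_snapshot){ now_ms, children_started };
    h->next = (h->next + 1) % SNAPSHOT_SLOTS;
    if (h->used < SNAPSHOT_SLOTS) {
        h->used++;
    }
}

/* Number of children started since the oldest snapshot that is at most
 * interval_ms old; returns 0 if no snapshot falls inside the interval.
 * Assumes now_ms is not earlier than any stored timestamp. */
uint64_t history_started_within(const struct start_history *h,
                                uint64_t now_ms, uint64_t interval_ms,
                                uint64_t current_total)
{
    assert(h->used <= SNAPSHOT_SLOTS);

    /* Walk from the oldest stored snapshot to the newest. */
    for (size_t i = 0; i < h->used; i++) {
        size_t idx = (h->next + SNAPSHOT_SLOTS - h->used + i) % SNAPSHOT_SLOTS;
        const struct start_snapshot *s = &h->slots[idx];
        if (now_ms - s->taken_ms <= interval_ms) {
            return current_total - s->children_started;
        }
    }
    return 0;
}
```

A fixed-size ring keeps the lookup bounded and avoids allocation in the master process, at the cost of a limited look-back period.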