Add warning and config variables for heartbeat timeout/interval #3245
Conversation
|
Hello @daviddavis! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 16, 2018 at 15:16 Hours UTC |
| PULP_PROCESS_HEARTBEAT_INTERVAL = 5 | ||
|
|
||
| # The amount of time (in seconds) after which a Celery process is considered missing. | ||
| PULP_PROCESS_TIMEOUT_INTERVAL = 25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to remove these constants from common and add them to pulp.server.constants. For more info see https://pulp.plan.io/issues/3135#note-22.
server/etc/pulp/server.conf
Outdated
| @@ -318,6 +318,9 @@ | |||
| # | |||
| # login_method: Select the SASL login method used to connect to the broker. This should be left | |||
| # unset except in special cases such as SSL client certificate authentication. | |||
| # | |||
| # worker_failover_time: The maximum time a worker will run for before being considered dead and | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should probably be named something other than worker_failover_time since it doesn't apply only when HA is being used.
Maybe worker_expiration_time. I don't care too much about the name though so if you prefer it this way that's fine with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a name like worker_timeout or worker_missing_time or worker_missing_timeout ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The killed in most cases makes it sound like we're actively killing the process which we don't do. I see you meant though. I'm also wondering about a little bit of guidance here too. Here's an idea, feel free to edit wholly/rewrite:
The amount of time (in seconds) before considering a worker as missing. If Pulp's mongo database has slow I/O, then setting a higher number may resolve issues where workers are going missing incorrectly. Defaults to 30.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for worker_timeout or worker_missing_timeout. I also really like that description.
server/pulp/server/constants.py
Outdated
|
|
||
| # The amount of time (in seconds) between process wakeups to "heartbeat" and perform | ||
| # their tasks. | ||
| PULP_PROCESS_HEARTBEAT_INTERVAL = int(config.getint('tasks', 'worker_failover_time') / 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably ok as-is, but maybe it would be good to add a brief explanation behind the math and magic numbers going on, the distinction between worker_failover_time and the process_timeout_interval, etc.
I'm fine without it too, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a link to the redmine comments where this is discussed would be a good way to help us (and others) remember.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on that
|
We didn't talk about this before, but what about adding another another "timing check" here too? I think it's possible that the read operations also be really slow and that if they are so slow that they are > the heartbeat interval than that is probably also bad. I was thinking of the same check in between the creation of |
|
Thanks @daviddavis. The code you pushed all looks exactly correct. I forgot to ask on the first round of feedback, but can a release note also be added about the new setting. This would be going into 2.16 I think. That is the last thing I can think of. Thanks a lot for putting this together. It's really an important fix for our users. |
| =========== | ||
|
|
||
| New Features | ||
| --------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs builder warning: underline too short
|
ok test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daviddavis Thank you so much for this work! It looks exactly correct to me. I'm +1 to merging.
closes pulp#3245 [Issue 3245](https://pulp.plan.io/issues/3245) (cherry picked from commit a8db40b)
closes #3245 [Issue 3245](https://pulp.plan.io/issues/3245) (cherry picked from commit a8db40b)
refs #3135
https://pulp.plan.io/issues/3135