This repository has been archived by the owner on Dec 7, 2022. It is now read-only.

Add warning and config variables for heartbeat timeout/interval #3245

Merged
merged 1 commit into from Jan 16, 2018

daviddavis
Contributor

@dralley closed this in a8db40b on Jan 8, 2018
@dralley reopened this on Jan 8, 2018
@pep8speaks

pep8speaks commented Jan 9, 2018

Hello @daviddavis! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 16, 2018 at 15:16 Hours UTC

PULP_PROCESS_HEARTBEAT_INTERVAL = 5

# The amount of time (in seconds) after which a Celery process is considered missing.
PULP_PROCESS_TIMEOUT_INTERVAL = 25
Contributor Author

I had to remove these constants from common and add them to pulp.server.constants. For more info see https://pulp.plan.io/issues/3135#note-22.

@daviddavis changed the title from "Add warning for heartbeats taking too long" to "Add warning and config variables for heartbeat timeout/interval" on Jan 10, 2018
@daviddavis
Contributor Author

@dralley @bmbouter can you review/re-review my changes?

@@ -318,6 +318,9 @@
#
# login_method: Select the SASL login method used to connect to the broker. This should be left
# unset except in special cases such as SSL client certificate authentication.
#
# worker_failover_time: The maximum time a worker will run for before being considered dead and
Contributor

It should probably be named something other than worker_failover_time since it doesn't apply only when HA is being used.

Maybe worker_expiration_time. I don't care too much about the name though so if you prefer it this way that's fine with me.

Member

What about a name like worker_timeout, worker_missing_time, or worker_missing_timeout?

Member

The killed in most cases makes it sound like we're actively killing the process, which we don't do. I see what you meant, though. I'm also wondering about adding a little bit of guidance here too. Here's an idea; feel free to edit it or rewrite it entirely:

The amount of time (in seconds) before considering a worker as missing. If Pulp's mongo database has slow I/O, then setting a higher number may resolve issues where workers are going missing incorrectly. Defaults to 30.

Contributor

@dralley, Jan 12, 2018

+1 for worker_timeout or worker_missing_timeout. I also really like that description.


# The amount of time (in seconds) between process wakeups to "heartbeat" and perform
# their tasks.
PULP_PROCESS_HEARTBEAT_INTERVAL = int(config.getint('tasks', 'worker_failover_time') / 5)
Contributor

This is probably ok as-is, but maybe it would be good to add a brief explanation behind the math and magic numbers going on, the distinction between worker_failover_time and the process_timeout_interval, etc.

I'm fine without it too, though.

Member

I think a link to the redmine comments where this is discussed would be a good way to help us (and others) remember.

Contributor

+1 on that
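For readers following this thread, the arithmetic under discussion can be sketched roughly as below. This is only an illustration, not the merged code: it uses Python 3's `configparser`, the option name `worker_timeout` is the one favored in this review rather than a confirmed final name, the default of 30 comes from the description suggested above, and `heartbeat_settings`, `DEFAULT_WORKER_TIMEOUT`, and `HEARTBEATS_PER_TIMEOUT` are hypothetical names. The divisor of 5 matches both the diff and the earlier hardcoded values (5 and 25).

```python
import configparser

# Hypothetical names and defaults, for illustration only.
DEFAULT_WORKER_TIMEOUT = 30  # seconds; the default suggested in this review
HEARTBEATS_PER_TIMEOUT = 5   # the divisor used in the diff under review


def heartbeat_settings(config):
    """Return (heartbeat_interval, worker_timeout) in seconds."""
    timeout = config.getint('tasks', 'worker_timeout',
                            fallback=DEFAULT_WORKER_TIMEOUT)
    # The heartbeat interval is a fixed fraction of the timeout, so a
    # process gets several heartbeat attempts per timeout window.
    interval = timeout // HEARTBEATS_PER_TIMEOUT
    return interval, timeout


config = configparser.ConfigParser()
config.read_string("[tasks]\nworker_timeout = 30\n")
interval, timeout = heartbeat_settings(config)
print(interval, timeout)
```

Tying the interval to the timeout means an admin who raises the timeout for slow database I/O also gets a proportionally longer heartbeat interval.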

@bmbouter
Member

We didn't talk about this before, but what about adding another "timing check" here too? I think it's possible that the read operations are also really slow, and if they are so slow that they take longer than the heartbeat interval, then that is probably also bad. I was thinking of the same check between the creation of now on L#72 and worker_list being created on L#74.

@daviddavis
Contributor Author

@dralley @bmbouter thanks for the reviews. I think I addressed all the feedback.

@bmbouter
Member

Thanks @daviddavis. The code you pushed all looks exactly correct. I forgot to ask on the first round of feedback, but can a release note also be added about the new setting? This would be going into 2.16, I think. That is the last thing I can think of. Thanks a lot for putting this together. It's a really important fix for our users.

===========

New Features
---------
Contributor

Docs builder warning: underline too short
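For context, reStructuredText expects a section underline to be at least as long as the title text. A small check along these lines (the helper name `underline_too_short` is hypothetical, and this is not part of the docs builder) shows why the hunk above triggers the warning:

```python
def underline_too_short(title, underline):
    """Return True if an RST section underline is shorter than its title."""
    return len(underline.rstrip()) < len(title.rstrip())


# The heading from the hunk above: a 12-character title, 9 dashes.
print(underline_too_short("New Features", "---------"))
# The same title with a 12-character underline.
print(underline_too_short("New Features", "------------"))
```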

@daviddavis
Contributor Author

ok test

@daviddavis
Contributor Author

@bmbouter @dralley fixed the docs.

Member

@bmbouter left a comment

@daviddavis Thank you so much for this work! It looks exactly correct to me. I'm +1 to merging.

@daviddavis merged commit ca05377 into pulp:master on Jan 16, 2018
pcreech pushed a commit to pcreech/pulp that referenced this pull request Jan 17, 2018
pcreech pushed a commit that referenced this pull request Jan 22, 2018