Simplify worker 'online' state tracking #3369
Conversation
Hello @dralley! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 22, 2018 at 15:49 Hours UTC |
and have a recent heartbeat. "Recent" is defined here as "within the pulp process timeout | ||
interval". | ||
To be considered 'online', a worker must have a recent heartbeat timestamp and must not | ||
have the 'gracefully_stopped' flag set to True. "Recent" is defined here as "within |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs duplicate the docs in the online()
method which makes me concerned about maintaining them. Can these be removed and point at the online()
via a Sphinx link instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to because, since the code also has to be mostly duplicated but isn't exactly the same, both will need to be refactored together in sync anyway, and fixing the docstrings would be part of that refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree since the code is also duplicated and deduplicating that in this PR isn't the goal.
@@ -165,17 +182,29 @@ class TaskLock(Model): | |||
lock (models.TextField): The name of the lock acquired | |||
|
|||
""" | |||
CELERY_BEAT = 'CeleryBeat' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
niiiice
RESOURCE_MANAGER = 'ResourceManager' | ||
LOCK_STRINGS = ( | ||
(CELERY_BEAT, 'Celery Beat Lock'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -76,17 +77,20 @@ def check_celery_processes(): | |||
now = timezone.now() | |||
oldest_heartbeat_time = now - timedelta(seconds=TASKING_CONSTANTS.PULP_PROCESS_TIMEOUT_INTERVAL) | |||
|
|||
for worker in Worker.objects.filter(last_heartbeat__lt=oldest_heartbeat_time, online=True): | |||
for worker in Worker.objects.filter(last_heartbeat__lt=oldest_heartbeat_time, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should switch to using the Worker.objects.online_workers()
manager queryset method. It seems to do the same thing. Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually doing precisely the opposite. It's getting workers that have timed out but haven't had gracefully_stopped
set to True
, e.g. crashed/missing workers.
We would have to create a new helper called missing_workers()
or something along those lines, which I would be fine with.
msg = _("Worker '%s' has gone missing, removing from list of workers") % worker.name | ||
_logger.error(msg) | ||
|
||
mark_worker_offline(worker.name) | ||
|
||
worker_count = Worker.objects.exclude( | ||
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME).filter(online=True).count() | ||
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME)\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar for this, can this be refactored to use Worker.objects.online_workers()
. I think the current approach would include non-gracefully stopped workers incorrectly without a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, because this code only wants "workers" and doesn't care about resource managers, and Worker.objects.online_workers()
returns both.
I think it would be entirely reasonable to rewrite Worker.objects.online_workers()
to exclude the resource manager but then the status API couldn't use it... so it would just be shifting the problem somewhere else.
Alternatively, we could have a kwarg to toggle whether or not to exclude the resource manager from the query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Querysets are chainable so you can Worker.objects.online_workers().exclude(name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME)without having to "build" those capabilities into the
online_workers` manager method itself.
|
||
resource_manager_count = Worker.objects.filter( | ||
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME).filter(online=True).count() | ||
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME)\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar for this, can this be refactored to use Worker.objects.online_workers()
. I think the current approach would include non-gracefully stopped workers incorrectly without a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again nope, for the same reason. It's only counting resource managers.
@@ -147,14 +151,15 @@ def mark_worker_offline(name, normal_shutdown=False): | |||
_logger.info(msg) | |||
|
|||
try: | |||
worker = Worker.objects.get(name=name, online=True) | |||
worker = Worker.objects.get(name=name, gracefully_stopped=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this use the online()
property? I suspect it cannot because it's a property and not in the DB. I believe this will incorrectly also include non-gracefully stopped workers. Perhaps we could use this instead:
Worker.objects.online_workers.get(name=name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this is in mark_worker_offline
so maybe the query you have is right.
+1 to serializing both |
In terms of removing the timestamp, I think we should do that as a separate piece of work. I think the design you propose is OK I'm just hesitant to jump into it while this work is in-progress. How does that sound? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @dralley I requested some changes so let me know when I should review again. I also plan to do full hand testing after the code-level review. Thank you for contributing!
I'm fine with leaving the timestamp in for now. Something else I would like to do is have some way to know whether a resource manager is actually doing anything or if it's just HA sitting around waiting for the lock to be freed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing. Ignore this approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is all correct, I requested a few small changes. I'm going to hand test now since all my changes are to docstrings.
Also @dralley would you be willing to file a Pulp smash test plan for this?
pulpcore/pulpcore/app/models/task.py
Outdated
the pulp process timeout interval". | ||
|
||
Returns: | ||
bool: Whether the worker is online or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be clearer if it was written True if the worker is online, False otherwise
? Just an idea.
pulpcore/pulpcore/app/models/task.py
Outdated
now = timezone.now() | ||
age_threshold = now - timedelta(seconds=TASKING_CONSTANTS.PULP_PROCESS_TIMEOUT_INTERVAL) | ||
|
||
return not self.gracefully_stopped and self.last_heartbeat > age_threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not an important aspect, but to be consistent w/ the inequality used in the manager method I think this should this be >=
pulpcore/pulpcore/app/models/task.py
Outdated
""" | ||
Update the timestamp field to now and save it. | ||
|
||
Warnings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a Napolean style doc aspect but it's not. Can Warnings:
not be used and instead move that text to be just another sentence on L#214?
@@ -115,10 +115,17 @@ class WorkerSerializer(ModelSerializer): | |||
) | |||
|
|||
online = serializers.BooleanField( | |||
help_text='Whether the worker is online or not. Defaults to True.', | |||
help_text=_('Whether the worker is online'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here, I'm hoping the values True and False will be explicitly called out with its case for any bool
) | ||
|
||
gracefully_stopped = serializers.BooleanField( | ||
help_text=_('If this worker is currently offline, this represents whether the worker \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here w/ the True and False values themselves called out in the docstring help.
read_only=True | ||
) | ||
|
||
class Meta: | ||
model = models.Worker | ||
fields = ModelSerializer.Meta.fields + ('name', 'last_heartbeat', 'online') | ||
fields = ModelSerializer.Meta.fields + ('name', 'last_heartbeat', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this line do? It looks right but I want to learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the docs page: http://www.django-rest-framework.org/api-guide/serializers/#specifying-which-fields-to-include
It looks like it's ensuring that all of the fields specified to include on the base ModelSerializer are included on the serializer for the final WorkerSerializer, but I'm not sure if there is a better way to accomplish this. Maybe worth looking into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
I found one issue:
Reload the browsable API for worker 2 that was killed. Observe that online = False which is correct, but gracefully_shutdown=True which is not correct because it was not gracefully shutdown. |
@@ -232,6 +212,7 @@ def get_resource_manager_lock(name): | |||
worker.save_heartbeat() | |||
|
|||
try: | |||
# this whole block is gross and we shouldn't do it this way. TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but this comment probably should be removed (my opinion)
Here is why: https://github.com/pulp/pulp/pull/3369/files#diff-b9365647194214f918f5930ca9f8eb93L153 I suppose |
I see utility in having a value that tells you "this worker crashed" but we also do need to keep track of workers that have already been cleaned up so we're not wasting cycles and for the reason you just pointed out you can't really do this with one variable. |
I like the user value and utility of having the data that "this worker crashed". It's useful in a variety of ways. We could add another field called cleaned_up and never expose that to the user. Could the logic then be built on that?I would like to keep that out of the API some because I don't think we are going to use it long term with the switch to Celery4 there won't be a cleanup. What do you think about ^? |
@bmbouter updated. I recommend looking at the commits separately to make things easier. |
@dralley I manually reviewed the second commit. It all looks good. I just retested it now, and it tests exactly as I expected it to for both the graceful and non-graceful cases. To merge can you either squash it into one commit or add the redmine association keywords to the second commit? If we merge as-is Redmine won't have all the commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dralley this work looks exactly right. These improvements will really benefit Pulp users. Thank you so much for working on this!
closes #3121
https://pulp.plan.io/issues/3121