Skip to content
This repository has been archived by the owner on Dec 7, 2022. It is now read-only.

Simplify worker 'online' state tracking #3369

Merged
merged 2 commits into from Mar 22, 2018
Merged

Simplify worker 'online' state tracking #3369

merged 2 commits into from Mar 22, 2018

Conversation

dralley
Copy link
Contributor

@dralley dralley commented Mar 12, 2018

@pep8speaks
Copy link

pep8speaks commented Mar 12, 2018

Hello @dralley! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 22, 2018 at 15:49 Hours UTC

@dralley dralley added the 3.0 label Mar 12, 2018
@dralley dralley requested a review from daviddavis March 13, 2018 05:25
and have a recent heartbeat. "Recent" is defined here as "within the pulp process timeout
interval".
To be considered 'online', a worker must have a recent heartbeat timestamp and must not
have the 'gracefully_stopped' flag set to True. "Recent" is defined here as "within
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docs duplicate the docs in the online() method which makes me concerned about maintaining them. Can these be removed and point at the online() via a Sphinx link instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to because, since the code also has to be mostly duplicated but isn't exactly the same, both will need to be refactored together in sync anyway, and fixing the docstrings would be part of that refactoring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree since the code is also duplicated and deduplicating that in this PR isn't the goal.

@@ -165,17 +182,29 @@ class TaskLock(Model):
lock (models.TextField): The name of the lock acquired

"""
CELERY_BEAT = 'CeleryBeat'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

niiiice

RESOURCE_MANAGER = 'ResourceManager'
LOCK_STRINGS = (
(CELERY_BEAT, 'Celery Beat Lock'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -76,17 +77,20 @@ def check_celery_processes():
now = timezone.now()
oldest_heartbeat_time = now - timedelta(seconds=TASKING_CONSTANTS.PULP_PROCESS_TIMEOUT_INTERVAL)

for worker in Worker.objects.filter(last_heartbeat__lt=oldest_heartbeat_time, online=True):
for worker in Worker.objects.filter(last_heartbeat__lt=oldest_heartbeat_time,
Copy link
Member

@bmbouter bmbouter Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should switch to using the Worker.objects.online_workers() manager queryset method. It seems to do the same thing. Is that right?

Copy link
Contributor Author

@dralley dralley Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually doing precisely the opposite. It's getting workers that have timed out but haven't had gracefully_stopped set to True, e.g. crashed/missing workers.

We would have to create a new helper called missing_workers() or something along those lines, which I would be fine with.

msg = _("Worker '%s' has gone missing, removing from list of workers") % worker.name
_logger.error(msg)

mark_worker_offline(worker.name)

worker_count = Worker.objects.exclude(
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME).filter(online=True).count()
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME)\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar for this, can this be refactored to use Worker.objects.online_workers(). I think the current approach would include non-gracefully stopped workers incorrectly without a change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, because this code only wants "workers" and doesn't care about resource managers, and Worker.objects.online_workers() returns both.

I think it would be entirely reasonable to rewrite Worker.objects.online_workers() to exclude the resource manager but then the status API couldn't use it... so it would just be shifting the problem somewhere else.

Alternatively, we could have a kwarg to toggle whether or not to exclude the resource manager from the query.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Querysets are chainable so you can Worker.objects.online_workers().exclude(name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME)without having to "build" those capabilities into theonline_workers` manager method itself.


resource_manager_count = Worker.objects.filter(
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME).filter(online=True).count()
name__startswith=TASKING_CONSTANTS.RESOURCE_MANAGER_WORKER_NAME)\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar for this, can this be refactored to use Worker.objects.online_workers(). I think the current approach would include non-gracefully stopped workers incorrectly without a change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again nope, for the same reason. It's only counting resource managers.

@@ -147,14 +151,15 @@ def mark_worker_offline(name, normal_shutdown=False):
_logger.info(msg)

try:
worker = Worker.objects.get(name=name, online=True)
worker = Worker.objects.get(name=name, gracefully_stopped=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this use the online() property? I suspect it cannot because it's a property and not in the DB. I believe this will incorrectly also include non-gracefully stopped workers. Perhaps we could use this instead:

Worker.objects.online_workers.get(name=name)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is in mark_worker_offline so maybe the query you have is right.

@bmbouter
Copy link
Member

+1 to serializing both gracefully_shutdown and online.

@bmbouter
Copy link
Member

In terms of removing the timestamp, I think we should do that as a separate piece of work. I think the design you propose is OK I'm just hesitant to jump into it while this work is in-progress. How does that sound?

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dralley I requested some changes so let me know when I should review again. I also plan to do full hand testing after the code-level review. Thank you for contributing!

@dralley
Copy link
Contributor Author

dralley commented Mar 13, 2018

I'm fine with leaving the timestamp in for now.

Something else I would like to do is have some way to know whether a resource manager is actually doing anything or if it's just HA sitting around waiting for the lock to be freed.

@dralley dralley removed the request for review from daviddavis March 14, 2018 19:58
daviddavis
daviddavis previously approved these changes Mar 14, 2018
Copy link
Contributor

@daviddavis daviddavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing. Ignore this approval.

@daviddavis daviddavis dismissed their stale review March 14, 2018 21:24

Was just testing.

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is all correct, I requested a few small changes. I'm going to hand test now since all my changes are to docstrings.

Also @dralley would you be willing to file a Pulp smash test plan for this?

the pulp process timeout interval".

Returns:
bool: Whether the worker is online or not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be clearer if it was written True if the worker is online, False otherwise? Just an idea.

now = timezone.now()
age_threshold = now - timedelta(seconds=TASKING_CONSTANTS.PULP_PROCESS_TIMEOUT_INTERVAL)

return not self.gracefully_stopped and self.last_heartbeat > age_threshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not an important aspect, but to be consistent w/ the inequality used in the manager method I think this should this be >=

"""
Update the timestamp field to now and save it.

Warnings:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a Napolean style doc aspect but it's not. Can Warnings: not be used and instead move that text to be just another sentence on L#214?

@@ -115,10 +115,17 @@ class WorkerSerializer(ModelSerializer):
)

online = serializers.BooleanField(
help_text='Whether the worker is online or not. Defaults to True.',
help_text=_('Whether the worker is online'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here, I'm hoping the values True and False will be explicitly called out with its case for any bool

)

gracefully_stopped = serializers.BooleanField(
help_text=_('If this worker is currently offline, this represents whether the worker \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here w/ the True and False values themselves called out in the docstring help.

read_only=True
)

class Meta:
model = models.Worker
fields = ModelSerializer.Meta.fields + ('name', 'last_heartbeat', 'online')
fields = ModelSerializer.Meta.fields + ('name', 'last_heartbeat',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this line do? It looks right but I want to learn more.

Copy link
Contributor Author

@dralley dralley Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the docs page: http://www.django-rest-framework.org/api-guide/serializers/#specifying-which-fields-to-include

It looks like it's ensuring that all of the fields specified to include on the base ModelSerializer are included on the serializer for the final WorkerSerializer, but I'm not sure if there is a better way to accomplish this. Maybe worth looking into.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@bmbouter
Copy link
Member

I found one issue:

  1. Have workers started normally
  2. Look at worker2 via the browsable API and see that online=True and gracefully_shutdown=False
  3. pkill -9 worker 2 non gracefully: sudo pkill -9 -f _2
  4. Wait enough time to see log statements like this appear:
Mar 15 19:30:21 pulp3.dev celery[21727]: [2018-03-15 19:30:21,158: ERROR/MainProcess] Worker 'reserved_resource_worker_2@pulp3.dev' has gone missing, removing from list of workers
Mar 15 19:30:21 pulp3.dev celery[21727]: [2018-03-15 19:30:21,160: ERROR/MainProcess] The worker named reserved_resource_worker_2@pulp3.dev is missing. Canceling the tasks in its queue.

Reload the browsable API for worker 2 that was killed. Observe that online = False which is correct, but gracefully_shutdown=True which is not correct because it was not gracefully shutdown.

@@ -232,6 +212,7 @@ def get_resource_manager_lock(name):
worker.save_heartbeat()

try:
# this whole block is gross and we shouldn't do it this way. TODO
Copy link
Member

@bmbouter bmbouter Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but this comment probably should be removed (my opinion)

@dralley
Copy link
Contributor Author

dralley commented Mar 15, 2018

Here is why: https://github.com/pulp/pulp/pull/3369/files#diff-b9365647194214f918f5930ca9f8eb93L153

I suppose gracefully_shutdown actually, functionally means something more along the lines of "this worker has been cleaned up after". What do you think we should do about this?

@dralley
Copy link
Contributor Author

dralley commented Mar 15, 2018

I see utility in having a value that tells you "this worker crashed" but we also do need to keep track of workers that have already been cleaned up so we're not wasting cycles and for the reason you just pointed out you can't really do this with one variable.

@bmbouter
Copy link
Member

I like the user value and utility of having the data that "this worker crashed". It's useful in a variety of ways.

We could add another field called cleaned_up and never expose that to the user. Could the logic then be built on that?I would like to keep that out of the API some because I don't think we are going to use it long term with the switch to Celery4 there won't be a cleanup.

What do you think about ^?

@dralley
Copy link
Contributor Author

dralley commented Mar 16, 2018

@bmbouter updated. I recommend looking at the commits separately to make things easier.

@bmbouter
Copy link
Member

@dralley I manually reviewed the second commit. It all looks good. I just retested it now, and it tests exactly as I expected it to for both the graceful and non-graceful cases. To merge can you either squash it into one commit or add the redmine association keywords to the second commit? If we merge as-is Redmine won't have all the commits.

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dralley this work looks exactly right. These improvements will really benefit Pulp users. Thank you so much for working on this!

@dralley dralley merged commit d4466ce into pulp:3.0-dev Mar 22, 2018
@dralley dralley deleted the 3121-worker-online branch March 22, 2018 15:57
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants