Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark abandoned tasks as "failed" #1649

Merged
merged 1 commit into from
Sep 28, 2021
Merged

Conversation

mdellweg
Copy link
Member

fixes #9247

@pulpbot
Copy link
Member

pulpbot commented Sep 27, 2021

Attached issue: https://pulp.plan.io/issues/9247

@dralley
Copy link
Contributor

dralley commented Sep 27, 2021

When I trigger a failure, the task status is correct, but in the logs it looks like it's "failing" when it tries to save a null state.

Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/entrypoint.py", line 46, in worker
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     NewPulpWorker().run_forever()
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/pulpcore_worker.py", line 308, in run_forever
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290187]: pulp [None]: pulpcore.tasking.pulpcore_worker:INFO: Cleaning up and canceling Task 5faa90e3-8c9a-483a-9467-3ab894a8b82d
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     self.supervise_task(task)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/pulpcore_worker.py", line 297, in supervise_task
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     self.cancel_abandoned_task(task, cancel_state)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/pulpcore_worker.py", line 150, in cancel_abandoned_task
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     Task.objects.filter(pk=task.pk, state=TASK_STATES.CANCELING).update(state=final_state)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/query.py", line 783, in update
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     rows = query.get_compiler(self.db).execute_sql(CURSOR)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1559, in execute_sql
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     cursor = super().execute_sql(result_type)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     cursor.execute(sql, params)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     return executor(sql, params, many, context)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     return self.cursor.execute(sql, params)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     raise dj_exc_value.with_traceback(traceback) from exc_value
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]:     return self.cursor.execute(sql, params)
Sep 27 14:24:29 pulp3-source-fedora34.localhost.example.com pulpcore-worker[290186]: django.db.utils.IntegrityError: null value in column "state" of relation "core_task" violates not-null constraint

@mdellweg
Copy link
Member Author

When I trigger a failure, the task status is correct, but in the logs it looks like it's "failing" when it tries to save a null state.

Yeah, i can see, where that is coming from.

@dralley
Copy link
Contributor

dralley commented Sep 27, 2021

It works now, but the logs are still a bit misleading. When the task aborts / disappears, it still prints "Cleaning up and canceling Task ..." when it should say that it's moving to failed.

@mdellweg
Copy link
Member Author

@dralley how about that?

Task.objects.filter(pk=task.pk, state=TASK_STATES.RUNNING).update(
state=TASK_STATES.CANCELING
)
task.refresh_from_db()
if task.state == TASK_STATES.CANCELING:
_logger.info(_("Cleaning up Task %s and marking as %s"), task.pk, final_state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the capitalization of Task consistent with the other messages?

Copy link
Contributor

@dralley dralley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually tested with:

  • SIGABRT from within task (os.abort())
  • SIGKILL (sudo kill -9 $pid)
  • Python exception from within task
  • Task cancellation

It works and logs appropriate messages in all situations, thank you

@dralley
Copy link
Contributor

dralley commented Sep 27, 2021

I think we're still missing a small piece of functionality that had been discussed earlier. It logs "Cleaning up X and marking as failed" but there is no message to call out that the worker process disappeared

https://pulp.plan.io/issues/9247#note-3

@mdellweg
Copy link
Member Author

I now see the following:

[
  {
    "pulp_href": "/pulp/api/v3/tasks/0d01973a-56b1-472d-9347-b4b7ecad05d6/",
    "pulp_created": "2021-09-28T08:33:27.907425Z",
    "state": "canceled",
    "name": "pulp_herminig.app.tasks.benchmark.test_task",
    "logging_cid": "8825c817a34d4406b79e0e739044cd63",
    "started_at": "2021-09-28T08:33:27.940186Z",
    "finished_at": "2021-09-28T08:33:47.087975Z",
    "error": null,
    "worker": "/pulp/api/v3/workers/e69b0e8b-9776-4ce9-a3b4-942f0648654f/",
    "parent_task": null,
    "child_tasks": [],
    "task_group": "/pulp/api/v3/task-groups/809b3b41-e269-4cab-9efc-032c06ecb299/",
    "progress_reports": [],
    "created_resources": [],
    "reserved_resources_record": []
  },
  {
    "pulp_href": "/pulp/api/v3/tasks/d9d0b347-30db-4874-9409-e018b05cb33f/",
    "pulp_created": "2021-09-28T08:31:43.734470Z",
    "state": "failed",
    "name": "pulp_herminig.app.tasks.benchmark.test_task",
    "logging_cid": "01129c8ff3cd4670ab8c52fb67787676",
    "started_at": "2021-09-28T08:31:49.073222Z",
    "finished_at": "2021-09-28T08:31:55.766986Z",
    "error": {
      "reason": "Worker has gone missing."
    },
    "worker": "/pulp/api/v3/workers/52c0900b-0297-4a5a-976c-bdf2195446f6/",
    "parent_task": null,
    "child_tasks": [],
    "task_group": "/pulp/api/v3/task-groups/45239142-7dce-46f5-8b11-a3d38c862ba1/",
    "progress_reports": [],
    "created_resources": [],
    "reserved_resources_record": []
  },
  {
    "pulp_href": "/pulp/api/v3/tasks/d4fc9f51-cbe1-4dfd-bfe2-b81a32ca7b74/",
    "pulp_created": "2021-09-28T08:30:34.706707Z",
    "state": "failed",
    "name": "pulp_herminig.app.tasks.benchmark.test_task",
    "logging_cid": "3310ff88568e474d9f1141535c879220",
    "started_at": "2021-09-28T08:30:34.754476Z",
    "finished_at": "2021-09-28T08:30:59.025217Z",
    "error": {
      "reason": "Aborted during worker shutdown."
    },
    "worker": null,
    "parent_task": null,
    "child_tasks": [],
    "task_group": "/pulp/api/v3/task-groups/a8ab99cc-6de5-409d-a76f-5ec217e17779/",
    "progress_reports": [],
    "created_resources": [],
    "reserved_resources_record": []
  }
]

Task.objects.filter(pk=task.pk, state=TASK_STATES.RUNNING).update(
state=TASK_STATES.CANCELING
)
task.refresh_from_db()
if task.state == TASK_STATES.CANCELING:
_logger.info(_("Cleaning up task %s and marking as %s"), task.pk, final_state)
Copy link
Contributor

@dralley dralley Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the event of failed task (or code-wise perhaps if reason), could we have this message print {...} marking as failed. reason: Worker has gone missing.

Copy link
Contributor

@dralley dralley Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

if reason:
    msg = _("Cleaning up task %s and marking as %s. Reason: %s")
    args = (task.pk, final_state, reason)
else:
    msg = _("Cleaning up task %s and marking as %s.)
    args = (task.pk, final_state)
_logger.info(msg, *args)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A clearer log message would be nice.

@@ -129,26 +129,39 @@ def beat(self):
def notify_workers(self):
self.cursor.execute("NOTIFY pulp_worker_wakeup")

def cancel_abandoned_task(self, task):
def cancel_abandoned_task(self, task, final_state, reason=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this method get args and kwargs docs?

"""Cancel and clean up an abandoned task.

This function must only be called while holding the lock for that task.
This function must only be called while holding the lock for that task. It is a no-op if
the task is neither in "running" nor "canceling" state.

Return ``True`` if the task was actually canceled, ``False`` otherwise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also follow the napolean docstring format.

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hand tested this and it worked well. I saw Cleaning up Task b087bd6f-dc8c-45e3-9984-e70b598159f1 and marking as failed in my logs and the task state was at failed. Thanks @mdellweg !

@bmbouter bmbouter merged commit 6fb9109 into pulp:master Sep 28, 2021
@mdellweg mdellweg deleted the fail_abandoned_tasks branch September 28, 2021 20:56
@salman8328
Copy link

salman8328 commented May 19, 2023

hey !,
when the worker goes missing, the task is shown as failed, but the task list shows a few sub-tasks are still running, as shown below:

{
"pulp_href": "/pulp/api/v3/tasks/9f53c6f6-bc00-4da0-b2e7-67175486b7ba/",
"pulp_created": "2023-05-17T13:30:42.169868Z",
"state": "failed",
"name": "pulp_rpm.app.tasks.synchronizing.synchronize",
"logging_cid": "9dd033c5e9234b0aa9d4ad95ff90867f",
"started_at": "2023-05-17T13:30:42.220159Z",
"finished_at": "2023-05-19T05:35:58.765626Z",
"error": {
"reason": "Worker has gone missing."
},
"worker": null,
"parent_task": null,
"child_tasks": [],
"task_group": null,
"progress_reports": [
{
"message": "Downloading Artifacts",
"code": "sync.downloading.artifacts",
"state": "running",
"total": null,
"done": 7872,
"suffix": null
},
{
"message": "Parsed Advisories",
"code": "sync.parsing.advisories",
"state": "completed",
"total": 986,
"done": 986,
"suffix": null
},
{
"message": "Parsed Modulemd",
"code": "sync.parsing.modulemds",
"state": "completed",
"total": 45,
"done": 45,
"suffix": null
},
{
"message": "Parsed Modulemd-defaults",
"code": "sync.parsing.modulemd_defaults",
"state": "completed",
"total": 7,
"done": 7,
"suffix": null
},
{
"message": "Parsed Modulemd Obsolete",
"code": "sync.parsing.modulemd_obsoletes",
"state": "completed",
"total": 0,
"done": 0,
"suffix": null
},
{
"message": "Skipping Packages",
"code": "sync.skipped.packages",
"state": "completed",
"total": 0,
"done": 0,
"suffix": null
},
{
"message": "Parsed Packages",
"code": "sync.parsing.packages",
"state": "completed",
"total": 8267,
"done": 8267,
"suffix": null
},
{
"message": "Downloading Metadata Files",
"code": "sync.downloading.metadata",
"state": "completed",
"total": null,
"done": 6,
"suffix": null
},
{
"message": "Associating Content",
"code": "associating.content",
"state": "running",
"total": null,
"done": 9005,
"suffix": null
}
],
"created_resources": [],
"reserved_resources_record": [
"/pulp/api/v3/repositories/rpm/rpm/dbf41fe8-c225-453a-9f5f-b339715b263c/",
"shared:/pulp/api/v3/remotes/rpm/rpm/7bd62f03-318e-417a-ae03-00d64d16e5df/"
]
},

this could be just a display issue but it is really confusing to see, running there.

and after the worker goes missing, the pulp worker list endpoint still shows the state=missing workers in worker list, which is pretty weird.

will these workers be deleted later on ? or will pulp still try to wait for those workers to be online indefinetly?

@dralley, thank you !

@dralley
Copy link
Contributor

dralley commented May 19, 2023

They're not subtasks, just progress reports. Issue is filed here #3609

I think the workers do get cleaned up automatically after a week or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants