Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kolibri shutdown blocked by running cancellable task #10249

Closed
bjester opened this issue Mar 14, 2023 · 5 comments · Fixed by #11120
Closed

Kolibri shutdown blocked by running cancellable task #10249

bjester opened this issue Mar 14, 2023 · 5 comments · Fixed by #11120
Assignees
Labels
DEV: backend Python, databases, networking, filesystem...

Comments

@bjester
Copy link
Contributor

bjester commented Mar 14, 2023

Observed behavior

Kolibri wouldn't shutdown while a content import task was running. Observe the timestamps around ^C in the logs below-- above it was also already during shutdown.

Errors and logs

INFO     2023-03-14 13:22:59,426 Keyboard interrupt caught. Exiting.
INFO     2023-03-14 13:22:59,426 Bus state: STOP
INFO     2023-03-14 13:22:59,588 HTTP Server kolibri.utils.server.Server(('0.0.0.0', 8080)) shut down
INFO     2023-03-14 13:23:00,034 HTTP Server kolibri.utils.server.Server(('0.0.0.0', 34365)) shut down
INFO     2023-03-14 13:23:00,034 Joining 'ProcessControlPlugin'
INFO     2023-03-14 13:23:00,404 Stopped thread 'ProcessControlPlugin'.
INFO     2023-03-14 13:23:00,404 Joining 'ZeroConfPlugin'
INFO     2023-03-14 13:23:01,399 Stopped thread 'ZeroConfPlugin'.
INFO     2023-03-14 13:23:01,654 Asking job schedulers to shut down.
INFO     2023-03-14 13:23:01,654 Canceling job id 6ead5d86df944986848a66e1585c6a48.
^C
^[[1;5F
INFO     2023-03-14 13:27:29,520 Setting availability to True of 1474 LocalFile objects based on passed in checksums
INFO     2023-03-14 13:27:29,527 Setting availability of non-topic ContentNode objects based on LocalFile availability in 1 batches of 10000
INFO     2023-03-14 13:27:29,537 Annotating ContentNode objects with children for 3 levels
INFO     2023-03-14 13:27:29,540 Annotating ContentNode objects with children for level 3
INFO     2023-03-14 13:27:29,542 Annotating ContentNode objects with children for level 2
INFO     2023-03-14 13:27:29,542 Annotating ContentNode objects with children for level 1
INFO     2023-03-14 13:27:29,569 Bus state: IDLE
INFO     2023-03-14 13:27:29,569 Bus state: EXIT
INFO     2023-03-14 13:27:29,569 Waiting for child threads to terminate...
INFO     2023-03-14 13:27:29,569 Bus state: EXITED

Expected behavior

It seems that this should not have blocked Kolibri shutdown, since the task is cancellable. So for some reason the cancellation of the job, as captured by the logs, didn't actually stop the content import task.

User-facing consequences

In ability to shutdown Kolibri until tasks are complete

Steps to reproduce

  1. Start a content import that should take a while
  2. Attempt to shutdown Kolibri

Context

Ubuntu 22.04
Kolibri develop pex https://github.com/learningequality/kolibri/suites/11465479013/artifacts/591721049
Chrome

@bjester bjester added the DEV: backend Python, databases, networking, filesystem... label Mar 14, 2023
@photon0205
Copy link
Contributor

@bjester I would like to work on this issue

@rtibbles
Copy link
Member

Hi @photon0205, I have assigned you. First step would be to replicate this error (it looks like @bjester was importing resources for a channel).

The next step would be to look at the worker implementation and see how it handles shutdown to see why his might be happening.

@dbnicholson
Copy link
Contributor

We run into this all the time if there's a worker downloading content from a task. In the log it reports that the job is being cancelled, but nothing actually happens.

First, the task is received and it starts downloading the items:

DEBUG 2023-08-17 07:12:10,950 cherrypy.access 127.0.0.1 - - "POST /api/tasks/tasks/" 200 0 "" "curl/7.74.0"
DEBUG 2023-08-17 07:12:10,972 urllib3.connectionpool Starting new HTTPS connection (1): studio.learningequality.org:443
DEBUG 2023-08-17 07:12:11,225 urllib3.connectionpool https://studio.learningequality.org:443 "GET /api/public/v1/channels/lookup/2f95235c3709511fa12d007f31ed6
a7b HTTP/1.1" 200 None
DEBUG 2023-08-17 07:12:11,307 urllib3.connectionpool Starting new HTTPS connection (1): studio.learningequality.org:443
DEBUG 2023-08-17 07:12:11,465 urllib3.connectionpool https://studio.learningequality.org:443 "HEAD /content/storage/e/6/e610bd5a54c93a5550a77a8828a915d5.png H
TTP/1.1" 200 0
DEBUG 2023-08-17 07:12:11,556 urllib3.connectionpool https://studio.learningequality.org:443 "GET /content/storage/e/6/e610bd5a54c93a5550a77a8828a915d5.png HT
TP/1.1" 206 133718

Then I interrupt the server to stop it and it says it's canceling the task:

INFO 2023-08-17 07:12:14,033 kolibri.utils.server Caught signal SIGINT.
DEBUG 2023-08-17 07:12:14,034 urllib3.connectionpool https://studio.learningequality.org:443 "GET /content/storage/1/1/11acc0389000ac01255a56088a9bf73e.png HTTP/1.1" 206 128702
INFO 2023-08-17 07:12:14,035 kolibri.utils.server Keyboard interrupt caught. Exiting.
INFO 2023-08-17 07:12:14,037 kolibri.utils.server Bus state: STOP
INFO 2023-08-17 07:12:14,116 kolibri.utils.server HTTP Server kolibri.utils.server.Server(('0.0.0.0', 39877)) shut down
DEBUG 2023-08-17 07:12:14,364 urllib3.connectionpool https://studio.learningequality.org:443 "HEAD /content/storage/2/4/2460a226770606206260106e22e743ba.png HTTP/1.1" 200 0
DEBUG 2023-08-17 07:12:14,443 urllib3.connectionpool https://studio.learningequality.org:443 "GET /content/storage/2/4/2460a226770606206260106e22e743ba.png HTTP/1.1" 206 138061
INFO 2023-08-17 07:12:14,479 kolibri.utils.server HTTP Server kolibri.utils.server.Server(('0.0.0.0', 8080)) shut down
INFO 2023-08-17 07:12:14,480 kolibri.utils.server Joining 'ZeroConfPlugin'
DEBUG 2023-08-17 07:12:14,758 urllib3.connectionpool https://studio.learningequality.org:443 "HEAD /content/storage/f/1/f12a24c01ad4feebc55f1c915a2cdc6f.png HTTP/1.1" 200 0
DEBUG 2023-08-17 07:12:14,844 urllib3.connectionpool https://studio.learningequality.org:443 "GET /content/storage/f/1/f12a24c01ad4feebc55f1c915a2cdc6f.png HTTP/1.1" 206 66215
INFO 2023-08-17 07:12:14,998 kolibri.utils.server Stopped thread 'ZeroConfPlugin'.
DEBUG 2023-08-17 07:12:15,017 urllib3.connectionpool https://studio.learningequality.org:443 "HEAD /content/storage/7/2/7291ae066b1a2086cfd3528cfe94c51c.png HTTP/1.1" 200 0
DEBUG 2023-08-17 07:12:15,228 urllib3.connectionpool https://studio.learningequality.org:443 "GET /content/storage/7/2/7291ae066b1a2086cfd3528cfe94c51c.png HTTP/1.1" 206 35127
INFO 2023-08-17 07:12:15,265 kolibri.core.tasks.worker Asking job schedulers to shut down.
INFO 2023-08-17 07:12:15,266 kolibri.core.tasks.worker Canceling job id a8531ca3de91493e84b4cc221f537790.
DEBUG 2023-08-17 07:12:15,373 kolibri JOBCHECKER shut down event received; closing.

But the downloads keep running. Finally when they're done the server shuts down:

DEBUG 2023-08-17 07:15:37,625 urllib3.connectionpool https://studio.learningequality.org:443 "GET /content/storage/d/d/dde21e80075dce78d0e4028519c197f6.mp4 HTTP/1.1" 206 50140116
INFO 2023-08-17 07:15:41,723 kolibri.core.content.utils.annotation Setting availability to True of 202 LocalFile objects based on passed in checksums
INFO 2023-08-17 07:15:41,731 kolibri.core.content.utils.annotation Setting availability of non-topic ContentNode objects based on LocalFile availability in 1 batches of 10000
INFO 2023-08-17 07:15:41,738 kolibri.core.content.utils.annotation Annotating ContentNode objects with children for 2 levels
INFO 2023-08-17 07:15:41,741 kolibri.core.content.utils.annotation Annotating ContentNode objects with children for level 2
INFO 2023-08-17 07:15:41,743 kolibri.core.content.utils.annotation Annotating ContentNode objects with children for level 1
DEBUG 2023-08-17 07:15:41,746 kolibri.core.content.utils.annotation Recursive topic tree annotation took 0 seconds
INFO 2023-08-17 07:15:41,789 kolibri.utils.server Joining 'ProcessControlPlugin'
INFO 2023-08-17 07:15:42,254 kolibri.utils.server Stopped thread 'ProcessControlPlugin'.
INFO 2023-08-17 07:15:42,254 kolibri.utils.server Bus state: IDLE
INFO 2023-08-17 07:15:42,254 kolibri.utils.server Bus state: EXIT
INFO 2023-08-17 07:15:42,264 kolibri.utils.server Waiting for child threads to terminate...
INFO 2023-08-17 07:15:42,265 kolibri.utils.server Bus state: EXITED

@dbnicholson
Copy link
Contributor

@rtibbles what do you think is going on here? Does the Future used for content imports not support cancelling?

@dbnicholson
Copy link
Contributor

Oh, Futures just can't be cancelled once they're running - https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future.cancel. I confirmed with the following patch:

diff --git a/kolibri/core/tasks/worker.py b/kolibri/core/tasks/worker.py
index 9861a441c4..a0f1164a09 100644
--- a/kolibri/core/tasks/worker.py
+++ b/kolibri/core/tasks/worker.py
@@ -72,7 +72,8 @@ class Worker(object):
         job_ids = list(self.future_job_mapping.keys())
         for job_id in job_ids:
             logger.info("Canceling job id {}.".format(job_id))
-            self.cancel(job_id)
+            if not self.cancel(job_id):
+                logger.warning("Job id {} could not be cancelled".format(job_id))
         # Now shutdown the workers
         self.workers.shutdown(wait=wait)
 

Which resulted in:

WARNING  2023-08-17 07:27:39,566 Job id 100b46ee3a0845ef99ff9d1102269dd9 could not be cancelled

This stack overflow post has some more details about it. Presumably this is because you can't cancel a thread. You could workaround it by using a process pool, but you'd have to have a custom executor to kill the running processes.

I guess to do this generically in Kolibri there would have to be a separate mechanism to signal a running task that it needs to stop. Possibly how you'd do it is that you pass an Event into the task function which it could monitor. When cancelling tasks, you'd signal the event if the task was running and block until the future completed.

@rtibbles rtibbles assigned rtibbles and unassigned photon0205 Aug 17, 2023
dbnicholson added a commit to endlessm/kolibri-installer-android that referenced this issue Aug 17, 2023
Running Kolibri tasks cannot be cancelled[1] and shutting down the
process bus running them will block until the running tasks complete.
This means the Kolibri server can't be stopped quickly or reliably
during the activity's `onActivityStopped` hook.

To workaround that, move the task workers spawned from the
`ServicesPlugin` into a separate process bus. This bus will only be
stopped when the activity is destroyed (or killed). Besides making the
server lifecycle much more reliable, it allows tasks to continue running
and starting in the background as long as the activity process is alive.

Ideally the server and workers would be separate services outside of the
main webview activity, but python-for-android makes that really hard to
accomplish.

1. learningequality/kolibri#10249

Helps: endlessm/kolibri-explore-plugin#740
dbnicholson added a commit to endlessm/kolibri-installer-android that referenced this issue Aug 17, 2023
Running Kolibri tasks cannot be cancelled[1] and shutting down the
process bus running them will block until the running tasks complete.
This means the Kolibri server can't be stopped quickly or reliably
during the activity's `onActivityStopped` hook.

To workaround that, move the task workers spawned from the
`ServicesPlugin` into a separate process bus. This bus will only be
stopped when the activity is destroyed (or killed). Besides making the
server lifecycle much more reliable, it allows tasks to continue running
and starting in the background as long as the activity process is alive.

Ideally the server and workers would be separate services outside of the
main webview activity, but python-for-android makes that really hard to
accomplish.

1. learningequality/kolibri#10249

Helps: endlessm/kolibri-explore-plugin#740
dbnicholson added a commit to endlessm/kolibri-installer-android that referenced this issue Aug 17, 2023
Running Kolibri tasks cannot be cancelled[1] and shutting down the
process bus running them will block until the running tasks complete.
This means the Kolibri server can't be stopped quickly or reliably
during the activity's `onActivityStopped` hook.

To workaround that, move the task workers spawned from the
`ServicesPlugin` into a separate process bus. This bus will only be
stopped when the activity is destroyed (or killed). Besides making the
server lifecycle much more reliable, it allows tasks to continue running
and starting in the background as long as the activity process is alive.

Ideally the server and workers would be separate services outside of the
main webview activity, but python-for-android makes that really hard to
accomplish.

1. learningequality/kolibri#10249

Helps: endlessm/kolibri-explore-plugin#740
dylanmccall pushed a commit to endlessm/kolibri-installer-android that referenced this issue Aug 17, 2023
Running Kolibri tasks cannot be cancelled[1] and shutting down the
process bus running them will block until the running tasks complete.
This means the Kolibri server can't be stopped quickly or reliably
during the activity's `onActivityStopped` hook.

To workaround that, move the task workers spawned from the
`ServicesPlugin` into a separate process bus. This bus will only be
stopped when the activity is destroyed (or killed). Besides making the
server lifecycle much more reliable, it allows tasks to continue running
and starting in the background as long as the activity process is alive.

Ideally the server and workers would be separate services outside of the
main webview activity, but python-for-android makes that really hard to
accomplish.

1. learningequality/kolibri#10249

Helps: endlessm/kolibri-explore-plugin#740
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DEV: backend Python, databases, networking, filesystem...
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants