Allow tasks to execute on the host set in "host" field #991

Closed
invenio-developers opened this Issue · 13 comments

2 participants

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-04-03

Before running a task, we bind it to one host. However, previously,
when trying to run it on that same host, the task was also excluded.

@kaplun
Collaborator

Originally on 2012-04-03

Hi Alessio,

can you give a bit more detail on this ticket? What exactly is it trying to solve? What do you mean by "previously"?

Cheers!
Sam

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-04-03

self.tie_task_to_host() returns False when the host field is not empty.
As a result, at line 1071, where it is called, we never enter the code path where
we use os.system(COMMAND). Thus the command is never run when host is not ''
(the local hostname should be a valid value).

@kaplun
Collaborator

Originally on 2012-04-03

  • What version of Invenio do you refer to? (I ask since bibsched is being heavily modified.) Do you refer to the latest master?
  • I think what you report as a bug is instead a feature. tie_task_to_host will explicitly check that the given task hasn't already been tied to a host, and is hence free to be scheduled on the current host. Otherwise, if the host variable was already set to a given value, this means that the task has already been tied, and we should hence avoid executing it twice (either on the current host, if host has the value of the current host, or on a different host).

Why do you think the command is never run?
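
The check-then-claim behaviour described above can be sketched as a single conditional UPDATE. This is only an illustration under stated assumptions: it uses an in-memory sqlite3 table instead of Invenio's real database layer, the standalone function signature is hypothetical (the real code is a method, self.tie_task_to_host()), and only the schTASK table name and its id, proc and host columns come from this ticket.

```python
import sqlite3


def tie_task_to_host(conn, task_id, hostname):
    """Try to claim a task for `hostname` (hypothetical standalone signature).

    Returns True only if the task was still untied (host == '') and this
    call tied it; returns False if the host field was already set.
    """
    cur = conn.execute(
        "UPDATE schTASK SET host = ? WHERE id = ? AND host = ''",
        (hostname, task_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Toy stand-in for the real schTASK table (assumption: reduced columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schTASK ("
    "id INTEGER PRIMARY KEY, proc TEXT, host TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO schTASK (id, proc, host) VALUES (?, ?, ?)",
    [(475, "inveniogc", ""), (476, "bibauthorid", "aso.local")],
)
conn.commit()

# An untied task can be claimed once.
free_claimed = tie_task_to_host(conn, 475, "aso.local")
# A task whose host is already set is refused, even by that same host.
stale_claimed = tie_task_to_host(conn, 476, "aso.local")
```

With this reading, a task left behind with a non-empty host (for example after a crash) is never picked up again, which is the situation this ticket reports.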

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-04-03

To be more explicit, here is the current state of my queue as stored in schTASK:

+-----+-------------------+---------------------+---------+----------+-----------+------------+
| id  | proc              | runtime             | status  | priority | host      | sequenceid |
+-----+-------------------+---------------------+---------+----------+-----------+------------+
| 476 | bibauthorid       | 2012-04-02 18:48:24 | WAITING |        0 | aso.local |       NULL |
| 475 | inveniogc         | 2012-04-02 18:48:30 | WAITING |        0 |           |       NULL |
| 486 | bibreformat       | 2012-04-02 18:48:30 | WAITING |        0 |           |       NULL |
| 468 | bibindex          | 2012-04-02 18:48:36 | WAITING |        0 |           |       NULL |
| 471 | bibindex:fulltext | 2012-04-02 18:48:40 | WAITING |        0 |           |       NULL |
| 469 | webcoll           | 2012-04-02 18:48:54 | WAITING |        0 |           |       NULL |
| 472 | bibindex:author   | 2012-04-02 18:48:56 | WAITING |        0 |           |       NULL |
| 470 | bibindex:global   | 2012-04-02 18:49:00 | WAITING |        0 |           |       NULL |
| 473 | bibrank           | 2012-04-02 18:49:06 | WAITING |        0 |           |       NULL |
| 500 | bibreformat       | 2012-04-02 21:30:28 | WAITING |        0 |           |       NULL |
+-----+-------------------+---------------------+---------+----------+-----------+------------+

The first task (id 476, host aso.local) does not run.

Yes, this problem is present on current master.

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-04-03

After your explanation, I understand that this state can only be reached after triggering issue #943.

@kaplun
Collaborator

Originally on 2012-04-03

I see!

Indeed, I was about to write to you that this state should never be reached :-) I guess that by catching the SystemExit exception (as happens in the other ticket), this situation should no longer be reachable, since tasks will cleanly set the host value back to the empty string.

So we can close this ticket as invalid, I guess :-)

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-04-04

I am afraid we still need to address this.

Actually, I encountered this problem again with an invalid interpreter for bibauthorid.
It seems like there are many ways a task can fail.
I am attaching a different way to tackle the problem.

@kaplun
Collaborator

Originally on 2012-04-04

Hi Alessio,

unfortunately, you can't really have bibsched declare the task to be in ERROR state, because most of the time the task is simply taking too long to start (e.g. because it creates enormous data structures upon start, as happens with citation dictionaries).

Maybe we can still go in this direction (i.e. to have bibsched decide the task has failed, rather than the task), by first inspecting the task pid file and trying to ping the task with a UNIX signal. If no pid file exists, or if the number in the pid file does not correspond to a live task, then indeed bibsched can declare the task as ERROR as you propose. This can be implemented by exploiting the already existing get_task_pid function.
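
A minimal sketch of that liveness check follows. It assumes the pid file contains just the pid as text; read_pid_file, is_process_alive and task_looks_dead are hypothetical helper names for illustration, not the existing get_task_pid function. Signal 0 is used because it only tests whether the process exists and delivers nothing to it.

```python
import errno
import os


def read_pid_file(path):
    """Return the pid stored in `path`, or None if the file is missing or malformed."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None


def is_process_alive(pid):
    """Probe `pid` with signal 0, which checks existence without delivering a signal."""
    if pid is None or pid <= 0:
        # pid 0 or negative would address process groups, not one task.
        return False
    try:
        os.kill(pid, 0)
    except OSError as err:
        # EPERM means the process exists but belongs to another user.
        return err.errno == errno.EPERM
    return True


def task_looks_dead(pid_file_path):
    """True if there is no pid file or its pid does not match a live process."""
    return not is_process_alive(read_pid_file(pid_file_path))
```

Note that a recycled pid would make a dead task look alive, so this check can only err on the side of not declaring ERROR.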

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-04-04

Let's take this further,

I see multiple reasons, independent of our code, that could make a task slow to start.
In particular, an unresponsive AFS can leave the processes in D state for several seconds.
Or there is a problem on startup, so the task never finishes starting and we enter some deadlock.

We have the same exact problem with RUNNING tasks.
If they were killed by the OOM killer, they remain in RUNNING status. I have already had this happen too.

Depending on our need to differentiate a starting task from a running task.
If needed, we can add a STARTING status; if not, we mark the task as RUNNING.
All tasks in STARTING status can behave like RUNNING ones, so the queue will still enter the deadlock we currently have.

In a separate ticket we handle pinging tasks regularly to check that they are not dead.

Viable solution?

@kaplun
Collaborator

Originally on 2012-04-04

Replying to [comment:11 adeiana]:

I see multiple reasons, independent of our code, that could make a task slow to start.
In particular, an unresponsive AFS can leave the processes in D state for several seconds.
Or there is a problem on startup, so the task never finishes starting and we enter some deadlock.

Yep!

We have the same exact problem with RUNNING tasks.
If they were killed by the OOM killer, they remain in RUNNING status. I have already had this happen too.

Depending on our need to differentiate a starting task from a running task.

This is already the case: right before executing a task, bibsched changes its status from WAITING to SCHEDULED. It is then the responsibility of the task to change it from SCHEDULED to RUNNING (in order to prove it is in a healthy state).
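
As a rough sketch of that handshake, assuming nothing beyond the two transitions just described (the ALLOWED table and the transition helper are hypothetical names, not bibsched code):

```python
# Which actor is expected to perform each status change, per the description above.
ALLOWED = {
    ("WAITING", "SCHEDULED"): "bibsched",  # scheduler hands the task out
    ("SCHEDULED", "RUNNING"): "task",      # task proves it started healthily
}


def transition(current, new, actor):
    """Return the new status, or raise ValueError if `actor` may not make this change."""
    if ALLOWED.get((current, new)) != actor:
        raise ValueError(
            "%s may not move a task from %s to %s" % (actor, current, new)
        )
    return new
```

A task that stays in SCHEDULED therefore never confirmed its own startup, which is the signal bibsched could act on.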

If needed, we can add a STARTING status; if not, we mark the task as RUNNING.
All tasks in STARTING status can behave like RUNNING ones, so the queue will still enter the deadlock we currently have.

In a separate ticket we handle pinging tasks regularly to check that they are not dead.

Ouch, I thought we were already doing this in bibsched (and probably this was the case in old versions, but the code somehow is no longer there).

Viable solution?

Regularly pinging the running tasks, as in the above case, is definitely a viable solution. Special care must be taken, however: if the task is in status RUNNING and we fail to ping it, we have to check that it has not meanwhile changed its status to DONE or something else, as the task might end at the very moment we are pinging it :-)
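
The re-check can be sketched like this. It is only an illustration: reap_if_crashed and its callback parameters are hypothetical, and a plain dict stands in for the status column of the task table.

```python
def reap_if_crashed(task_id, get_status, is_alive, set_status):
    """Mark a RUNNING task as ERROR only if it is dead and still RUNNING afterwards."""
    if get_status(task_id) != "RUNNING":
        return False
    if is_alive(task_id):
        return False
    # The ping failed. Re-read the status: the task may have finished
    # (and exited) between the first status read and the ping.
    if get_status(task_id) != "RUNNING":
        return False
    set_status(task_id, "ERROR")
    return True


# Toy status table (assumption: stands in for the real database).
statuses = {1: "RUNNING", 2: "RUNNING"}


def never_alive(task_id):
    """Simulates a task that crashed: the ping always fails."""
    return False


def finishes_during_ping(task_id):
    """Simulates a task that completes at the very moment it is pinged."""
    statuses[task_id] = "DONE"
    return False


crashed = reap_if_crashed(1, statuses.get, never_alive, statuses.__setitem__)
finished = reap_if_crashed(2, statuses.get, finishes_during_ping, statuses.__setitem__)
```

A small window remains between the second read and the write; with a real database, a conditional update that only touches rows still in RUNNING would close it.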

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-04-04

Indeed, the SCHEDULED status is exactly that. So that is handled.
The reason I reach that deadlock state is that the SCHEDULED status is lost as soon as I re-enable the automatic queue without acknowledging that task first.
I don't think that behavior is nice. Should we change it?

@kaplun
Collaborator

Originally on 2012-04-04

Indeed, this would become a leftover now that we have implemented the nice switch to ERROR (when using task_low_level_submission and wrong args), and once we implement the regular pinging of tasks to check whether they are alive.

I will prepare a patch with all the ideas we gathered in this useful ticket :-)

@invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2014-01-09

In 826e8d5:

#CommitTicketReference repository="invenio" revision="826e8d5068b01ff48f0a8dc11361b3ff36ff2c86"
BibSched: many improvements

- Displays a Yes/No box to make sure you don't delete tasks by mistake.

- In the bibsched daemon, every 50 cycles, check for local tasks that have
  crashed.

- Store debug mode in database so that we can switch it on and off without
  restarting the bibsched daemons.

- Fixes a bug that would mark a task as crashed because the pid of that
  task would not exist anymore, even though the task had completed properly.

- Press B to load bibsched.log in your pager.

- Fixes the task options panel when a task has a very long list of arguments.
  (closes #1177)

- Tasks in about to stop are not going to sleep as soon as possible anymore.
  This proved annoying because instead of stopping they would just wait in
  SLEEPING and you would have to wake them up manually in order to make them
  stop.

- Adds a help panel in the bibsched monitor accessed via the "h" keystroke.

- Limits the progress column char length to match the database schema

- Adds the username of person doing the action in bibsched when running
  a task manually or editing the motd

- Adds --host which allows to force the execution of a task to a
  certain host (closes #991)

- Prevents non-concurrent tasks from waking up too early and preventing
  higher-priority tasks from running.

- Flush logs after writing each message.
  This can be useful when using a filesystem that buffers your writes,
  like AFS, and you want to check the logs from a different server than the
  one the task is running on.

- Confirmation dialog before deleting periodic tasks

- Bind signal USR2 to starting foo remote console
  to debug running bibtasks.

- If you ask a task to stop (status is set to "ABOUT TO STOP") and then you
  lower the priority of the task, say to -11, the scheduler changes the status
  of the task to "ABOUT TO SLEEP", ignoring the previous status.

- When --fixed-time is set and a task is postponed we used the regular
  sleeptime (to respect the fixed time) instead of running as soon as possible
  (in this case the beginning of the allowed times by --limit)
  e.g. A task is scheduled to run between monday and friday and sleep 24 hours
       and is supposed to run at 7am.
       Old behavior, on saturday 7am, it is postponed to run on monday morning
       at midnight.
       New behavior, it is postponed to run sunday, 7am. On sunday it is
       postponed to monday 7am.

- Fixes a bug sleeping a monotask that needs to run instead of
  ourselves.

- Adds STOPPED to displayed status in default bibsched view

- Removes the ability to force run manually tasks via the bibsched
  monitor and out of their time limit (specified via -L 00:40-05:00)

Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>
Reviewed-by: Samuele Kaplun <samuele.kaplun@cern.ch>
@kaplun kaplun referenced this issue from a commit in kaplun/invenio
@Osso Osso BibSched: many improvements
4c9eac5
@kaplun kaplun referenced this issue from a commit in kaplun/invenio
@Osso Osso BibSched: many improvements
826e8d5