Merge pull request #90 from nesi/kill_task
Added a note about killing tasks.
MattBixley committed Sep 18, 2023
2 parents ad2a731 + b553ba4 commit be78d22
Showing 5 changed files with 78 additions and 52 deletions.
11 changes: 9 additions & 2 deletions _config.yml
Original file line number Diff line number Diff line change
@@ -47,11 +47,18 @@ sched:
efficiency: "nn_seff"
projectcode: "nesi99991"

# For 'R'
example:
lang: "R"
shell: "Rscript"
script: "array_sum.r"
module: "R/4.1.0-gimkl-2020a"

episode_order:
- 01-cluster
- 02-filedir
- 03-break1
- 04-moduless
- 04-modules
- 05-scheduler
- 06-lunch
- 07-resources
@@ -95,7 +102,7 @@ working_dir:

# Start time in minutes (0 to be clock-independent, 540 to show a start at 09:00 am).
# 600 is 10am
start_time: 780
start_time: 600
# Start time for people wanting to skip bash content
hpc_start_time: 780
# Lesson to start at.
101 changes: 60 additions & 41 deletions _episodes/05-scheduler.md
Original file line number Diff line number Diff line change
@@ -63,11 +63,7 @@ Lets try this now, create and open a new file in your current directory called `


```
{{ site.remote.bash_shebang }}
module load R/4.1.0-gimkl-2020a
Rscript array_sum.r
echo "Done!"
{% include example_scripts/example-job.sh %}
```
{: .language-bash}

@@ -77,7 +73,6 @@ echo "Done!"
>
{: .callout}


We can now run this script using
```
{{ site.remote.prompt }} bash example-job.sh
@@ -100,6 +95,13 @@ Done!

You will get the output printed to your terminal as if you had just run those commands one after another.

> ## Cancelling Commands
>
> You can kill a currently running task by pressing the keys <kbd>ctrl</kbd> + <kbd>c</kbd>.
> If you just want your terminal back but want the task to continue running, you can suspend it by pressing <kbd>ctrl</kbd> + <kbd>z</kbd> and then resume it in the background with the `bg` command.
> Note, a backgrounded task is still attached to your terminal session, and will be killed when you close the terminal (if you need to keep running a task after you log out, have a look at [tmux](https://support.nesi.org.nz/hc/en-gb/articles/4563511601679-tmux-Reference-sheet)).
{: .callout}
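
Key presses are not available inside a script, so the sketch below shows a related approach under a few assumptions: it starts a task directly in the background with `&`, checks that it is running, and then stops it with `kill`. Here `sleep 60` is only a stand-in for a real long-running task, and `kill` sends `SIGTERM` by default rather than the interrupt signal sent by the keyboard.

```bash
# Start a stand-in long-running task in the background with '&'.
sleep 60 &
task_pid=$!   # '$!' holds the process ID of the most recent background task

# 'kill -0' sends no signal; it only checks that the process exists.
if kill -0 "$task_pid" 2>/dev/null; then
    echo "Task $task_pid is running in the background"
fi

# Ask the task to stop (SIGTERM by default), then wait for it to exit.
kill "$task_pid"
wait "$task_pid" 2>/dev/null
exit_code=$?
echo "Task ended with exit code $exit_code"
```

An exit code above 128 reported by `wait` generally means the task was ended by a signal rather than finishing on its own.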

## Scheduled Batch Job

Up until now the scheduler has not been involved; our scripts were run directly on the login node (or Jupyter node).
@@ -173,7 +175,7 @@ Now, rather than running our script with `bash` we _submit_ it to the scheduler
And that's all we need to do to submit a job. Our work is done -- now the
scheduler takes over and tries to run the job for us.

## Checking on our Job
## Checking on Running/Pending Jobs

While the job is waiting
to run, it goes into a list of jobs called the *queue*. To check on our job's
@@ -187,9 +189,57 @@ status, we check the queue using the command

{% include {{ site.snippets }}/scheduler/basic-job-status.snip %}

We can see many details about our job, the most important being its _STATE_. The most common states you might see are:

- `PENDING`: The job is waiting in the queue, likely for resources to free up or because other jobs have higher priority.
- `RUNNING`: The job has been sent to a compute node and it is processing our commands.
- `COMPLETED`: Your commands completed successfully as far as Slurm can tell (e.g. exit 0).
- `FAILED`: Your commands finished with an error as far as Slurm can tell (e.g. exit not 0).
- `CANCELLED`: The job was cancelled before it finished, for example with `{{ site.sched.del }}`.
- `TIMEOUT`: Your job ran for longer than its `--time` limit and was killed.
- `OUT_OF_MEMORY`: Your job tried to use more memory than it was allocated (`--mem`) and was killed.
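
As a rough illustration of how these states might be handled in a script, the sketch below maps a state name to a short description. The function `describe_state` is a hypothetical helper written for this example and is not part of Slurm. The commented usage line assumes `squeue` is available on your cluster and that `231964` is your job's ID.

```bash
# Hypothetical helper: turn a Slurm job state into a short description.
describe_state() {
    case "$1" in
        PENDING)       echo "waiting in the queue" ;;
        RUNNING)       echo "running on a compute node" ;;
        COMPLETED)     echo "finished with exit code 0" ;;
        FAILED)        echo "finished with a non-zero exit code" ;;
        CANCELLED*)    echo "cancelled before it finished" ;;
        TIMEOUT)       echo "killed for exceeding --time" ;;
        OUT_OF_MEMORY) echo "killed for exceeding --mem" ;;
        *)             echo "unrecognised state: $1" ;;
    esac
}

# Example with a fixed state name:
describe_state PENDING

# Possible usage on a cluster (assumes squeue exists and the job is still queued):
# describe_state "$(squeue -h -j 231964 -o %T)"
```

The `CANCELLED*` pattern uses a wildcard because some Slurm tools append extra text after the state name, such as who cancelled the job.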

If we were too slow, and the job has already finished (and therefore not in the queue) there is another command we can use `{{ site.sched.hist }}` (**s**lurm **acc**oun**t**). By default `{{ site.sched.hist }}` only includes jobs submitted by you, so no need to include additional commands at this point.
## Cancelling Jobs

Sometimes we'll make a mistake and need to cancel a job. This can be done with
the `{{ site.sched.del }}` command.

<!-- ```
{{ site.remote.prompt }} {{ site.sched.submit.name }} {% if site.sched.submit.options != '' %}{{ site.sched.submit.options }} {% endif %}example-job.sl
{{ site.remote.prompt }} {{ site.sched.status }} {{ site.sched.flag.me }}
```
{: .language-bash} -->

<!-- {% include {{ site.snippets }}/scheduler/terminate-job-begin.snip %} -->

To cancel the job, we first need its 'JobId', which can be found in the output of `{{ site.sched.status }} {{ site.sched.flag.me }}`.

```
{{ site.remote.prompt }} {{site.sched.del }} 231964
```
{: .language-bash}

A clean return of your command prompt indicates that the request to cancel the job was
successful.

Now checking `{{ site.sched.status }}` again, the job should be gone.

```
{{ site.remote.prompt }} {{ site.sched.status }} {{ site.sched.flag.me }}
```
{: .language-bash}

{% include {{ site.snippets }}/scheduler/terminate-job-cancel.snip %}

(If it isn't, wait a few seconds and try again.)

{% include {{ site.snippets }}/scheduler/terminate-multiple-jobs.snip %}

## Checking Finished Jobs

There is another command `{{ site.sched.hist }}` (**s**lurm **acc**oun**t**) that includes jobs that have finished.
By default `{{ site.sched.hist }}` only includes jobs submitted by you, so no need to include additional commands at this point.

```
{{ site.remote.prompt }} {{ site.sched.hist }}
@@ -206,10 +256,10 @@ This can be suppressed using the flag `-X`.
> On the login node, when we ran the bash script, the output was printed to the terminal.
> Slurm batch job output is typically redirected to a file. By default this is a file named `slurm-<job-id>.out` in the directory where the job was submitted; this can be changed with the Slurm parameter `--output`.
{: .discussion}

>
> > ## Hint
> >
> > You can use the *manual pages* for {{ site.sched.name }} utilities to find
> > You can use the _manual pages_ for {{ site.sched.name }} utilities to find
> > more about their capabilities. On the command line, these are accessed
> > through the `man` utility: run `man <program-name>`. You can find the same
> > information online by searching "man <program-name>".
@@ -269,37 +319,6 @@ restrain their job to the requested resources or kill the job outright. Other
jobs on the node will be unaffected. This means that one user cannot mess up
the experience of others, the only jobs affected by a mistake in scheduling
will be their own. -->

## Cancelling a Job

Sometimes we'll make a mistake and need to cancel a job. This can be done with
the `{{ site.sched.del }}` command. Let's submit a job and then cancel it using
its job number (remember to change the walltime so that it runs long enough for
you to cancel it before it is killed!).

```
{{ site.remote.prompt }} {{ site.sched.submit.name }} {% if site.sched.submit.options != '' %}{{ site.sched.submit.options }} {% endif %}example-job.sl
{{ site.remote.prompt }} {{ site.sched.status }} {{ site.sched.flag.me }}
```
{: .language-bash}

{% include {{ site.snippets }}/scheduler/terminate-job-begin.snip %}

Now cancel the job with its job number (printed in your terminal). A clean
return of your command prompt indicates that the request to cancel the job was
successful.

```
{{ site.remote.prompt }} {{site.sched.del }} 23229413
# It might take a minute for the job to disappear from the queue...
{{ site.remote.prompt }} {{ site.sched.status }} {{ site.sched.flag.me }}
```
{: .language-bash}

{% include {{ site.snippets }}/scheduler/terminate-job-cancel.snip %}

{% include {{ site.snippets }}/scheduler/terminate-multiple-jobs.snip %}

<!-- ## Other Types of Jobs
Up to this point, we've focused on running jobs in batch mode.
5 changes: 5 additions & 0 deletions _includes/example_scripts/example-job.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
#!/bin/bash -e

module load R/4.3.1-gimkl-2022a
Rscript array_sum.r
echo "Done!"
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
```
JobID JobName Alloc Elapsed TotalCPU ReqMem MaxRSS State
--------------- ---------------- ----- ----------- ------------ ------- -------- ----------
31060451 example-job.sl 2 00:00:48 00:33.548 1G COMPLETED
31060451.batch batch 2 00:00:48 00:33.547 102048K COMPLETED
31060451.extern extern 2 00:00:48 00:00:00 0 COMPLETED
31060451 example-job.sl 2 00:00:48 00:33.548 1G CANCELLED
31060451.batch batch 2 00:00:48 00:33.547 102048K CANCELLED
31060451.extern extern 2 00:00:48 00:00:00 0 CANCELLED
```
{: .output}
Original file line number Diff line number Diff line change
@@ -2,9 +2,4 @@
JOBID USER ACCOUNT NAME CPUS MIN_MEM PARTITI START_TIME TIME_LEFT STATE NODELIST(REASON)
231964 yourUsername {{site.sched.projectcode}} example-job.sl 1 512M large N/A 1:00 PENDING (Priority)
```
{: .output}

We can see many details about our job, most importantly is it's _STATE_. Sometimes our jobs might need to wait in a queue, so it's state is `PENDING`, likely waiting for resources or
because other jobs have priority. If we are lucky it will have a state of `RUNNING` which means the job has
been sent to a compute node and it is processing our commands. If we are unlucky the job will have an `ERROR` state, menaing something has
gone wrong with our job submission. In many cases this is caused by a typo in the submit script.
{: .output}
