
Update to 22.05.09 #57

Merged
merged 52 commits into hpcugent:22.05.ug on May 5, 2023
Conversation

@itkovian (Member) commented May 5, 2023

No description provided.

mcmult and others added 30 commits January 26, 2023 14:40
CR_SOCKET*.

gres_select_filter_sock_core - don't mark req_sock[s] when not required

If there are no GRES bound to the socket (for instance CountOnly GRES,
or just sock_gres->cnt_any_sock), don't mark the socket as required.

Continuation of commit 73a650f

Bug 15654
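
A minimal sketch of the guard described above, assuming the req_sock[] and
sock_gres names from the commit message (bits_by_sock is an assumed field
for the per-socket GRES bitmaps; this is not the verbatim Slurm source):

/* Only mark socket s as required when some GRES is actually bound to
 * it; CountOnly GRES and GRES counted via sock_gres->cnt_any_sock
 * impose no socket binding. */
if (sock_gres->bits_by_sock && sock_gres->bits_by_sock[s] &&
    bit_set_count(sock_gres->bits_by_sock[s]))
	req_sock[s] = true;
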
Unified list of tests moved to the main testsuite/README.
Renamed python tests now start with number 101_1.

Bug 15814
Since --prefer is now another dimension of job_queue, we need to
handle it for job arrays as we do for partition and reservation.

Before going to the "next task" we need to make sure that feature_list_use
is set correctly, so we don't attempt starting with --prefer if the
job_queue_rec we're dealing with has use_prefer = false.

Bug 15886
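
A hedged sketch of the idea: feature_list_use comes from the commit
message, while the remaining job_details field names (features/prefer and
their parsed lists) are assumptions.

/* Before evaluating this job_queue_rec (including an array's "next
 * task"), point the scheduler at the feature list that matches the
 * record's use_prefer flag. */
if (job_queue_rec->use_prefer) {
	job_ptr->details->features_use = job_ptr->details->prefer;
	job_ptr->details->feature_list_use = job_ptr->details->prefer_list;
} else {
	job_ptr->details->features_use = job_ptr->details->features;
	job_ptr->details->feature_list_use = job_ptr->details->feature_list;
}
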
Regression from 49826b9

Bug 15857
This can happen if the requested task count is lower than the node
count. This commit restores the behavior from before ca510f2

Bug 15857
In _opt_verify(), the number of tasks was calculated by
ntasks_per_node * min_nodes. However, the step may be allocated more
nodes than min_nodes, so after the slurmctld has created the step
allocation, srun needs to recalculate the number of tasks
based on the actual number of allocated nodes.

Bug 15857
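
Roughly, the recalculation could look like this (a sketch, not the actual
srun code; alloc_node_cnt stands in for however srun learns the granted
node count):

/* After slurmctld creates the step allocation, derive the task count
 * from the node count actually granted, not the requested minimum. */
if ((opt.ntasks_per_node != NO_VAL) && !opt.ntasks_set)
	opt.ntasks = opt.ntasks_per_node * alloc_node_cnt;
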
This fixes some valid step requests that would hang instead of running.
For example:

PartitionName=def1 Nodes=ALL MinNodes=2
$ srun -p def1 -n1 -c1 hostname
<job is allocated but the step hangs>

Bug 15857
Since commit e652133, requesting GPUs is mandatory with --cpus-per-gpu,
like it is for --mem-per-gpu, or the job is rejected.

Bug 15945
Swap mgr->event_fd[0] to non-blocking in init_con_mgr(). This corrects
the case where a prior run of _watch() reads all of the pending bytes,
causing the next run of _watch() to block in read() on mgr->event_fd[0].
The existing read() code was already written to handle a non-blocking
descriptor, so no secondary modifications are required.

Bug 15884
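
The mechanics, outside of any Slurm internals: a standalone illustration
of switching a descriptor to non-blocking, so a read() with no pending
bytes returns -1/EAGAIN instead of stalling the watch loop.

#include <fcntl.h>

/* Put fd into non-blocking mode; an empty read() then fails with
 * EAGAIN/EWOULDBLOCK instead of blocking the event loop. */
static int fd_set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
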
When an interactive job (salloc with
LaunchParameters=use_interactive_step or srun --interactive) is
scheduled on powered down nodes, the salloc or srun will wait in a loop
for slurm_job_node_ready() to say the nodes are ready (i.e. registered
with the controller and not running a prolog).

After all the nodes have registered, slurm_job_node_ready() will report
that the job and nodes are ready, and salloc will launch the interactive
step.

For normal steps, job_config_fini(), which launches the prolog RPC, is
called when creating the step if all the nodes have registered.

For batch jobs, job_config_fini() is called by job_time_limit() after
all the nodes have registered.

For interactive steps, job_config_fini() isn't called at step creation;
the job waits until job_time_limit() catches it.

So there is a window in which salloc/srun can run the interactive step
before the controller has launched the prolog RPC.

The slurmd should block a task launch until the prolog RPC has run if
the prolog RPC is expected to run. Due to a separate issue where the
slurmd thinks the prolog RPC has already run
(_wait_for_job_running_prolog()), the slurmd launches the task without
first running the prolog and other setup (e.g. job_container/tmpfs,
pam_slurm_adopt, x11 forwarding). This can be handled separately.

The solution is to have slurm_job_node_ready() wait until the job is no
longer configuring (i.e. job_config_fini() has been called) before
considering the nodes to be ready for the job, which in turn ensures
interactive jobs will not launch before the prolog runs.

Bug 15993
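
A hedged sketch of the check: IS_JOB_CONFIGURING() and READY_NODE_STATE
are real Slurm names, but the surrounding logic is approximated.

/* Report the nodes as not ready while the job is still configuring,
 * i.e. before job_config_fini() has launched the prolog RPC, so
 * salloc/srun keep polling instead of launching the step early. */
if (IS_JOB_CONFIGURING(job_ptr))
	rc &= ~READY_NODE_STATE;
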

Co-authored-by: Brian Christiansen <brian@schedmd.com>
If the provided <list> is NULL, the task/affinity functions
_validate_[map,mask]() would make slurmd segfault, because NULL is handed
as the first argument to strtok_r() in order to parse the requested list.

Example reproducers:

$ srun --cpu-bind=mask_cpu:, true
$ srun --cpu-bind=map_cpu:, true
$ srun --cpu-bind=mask_cpu:0x1 : true
$ srun --cpu-bind=map_cpu:0 : true

Bug 16067
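
The crash class and its guard, as a self-contained sketch (not the Slurm
source): strtok_r() with a NULL string on the first call is undefined
behaviour and typically segfaults.

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static bool validate_cpu_list(const char *list)
{
	char *buf, *tok, *save_ptr = NULL;

	if (!list || !list[0])	/* the missing NULL/empty check */
		return false;

	if (!(buf = strdup(list)))
		return false;
	for (tok = strtok_r(buf, ",", &save_ptr); tok;
	     tok = strtok_r(NULL, ",", &save_ptr)) {
		/* validate each map/mask entry here */
	}
	free(buf);
	return true;
}
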
Backfill's planned_bitmap isn't updated when a node is deleted from
node_record_table_ptr so the planned_bitmap may be pointing to a deleted
node. In 23.02, this isn't an issue because next_node_bitmap() skips any
set bit whose node has been deleted.

Bug 16165
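
For reference, the 23.02 iteration pattern mentioned above looks roughly
like this (signature approximated):

/* next_node_bitmap() only returns live node records for set bits, so a
 * bit left behind by a deleted node is skipped, never dereferenced. */
node_record_t *node_ptr;

for (int i = 0; (node_ptr = next_node_bitmap(planned_bitmap, &i)); i++) {
	/* node_ptr is guaranteed to be an existing node here */
}
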
If a node was deleted while being in the planned state the bit would be
left in the planned_bitmap until a restart or a new node was added to
the empty node slot.

Bug 16165

Signed-off-by: Skyler Malinowski <malinowski@schedmd.com>
If slurmctld were to restart right after a planned node had a planned job
terminate, the node could remain planned when reading from state save
(e.g. slurmctld -R with the node in NODE_STATE_DYNAMIC_NORM). Moreover,
the node was saved as PLANNED but planned_bitmap is missing that node
until it becomes PLANNED again.

Bug 16165
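
One plausible reconciliation on state load, sketched with the real
NODE_STATE_PLANNED flag and bitstring helpers but invented surrounding
logic:

/* Keep the saved PLANNED flag and planned_bitmap consistent: if the
 * flag survived a restart without a matching planned_bitmap entry,
 * drop it rather than leave the node stuck as planned. */
if ((node_ptr->node_state & NODE_STATE_PLANNED) &&
    !bit_test(planned_bitmap, node_ptr->index))
	node_ptr->node_state &= ~NODE_STATE_PLANNED;
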
Swap reference to "dbv0.0.38_meta".

Bug 16149
The rc of the make command used was not forwarding failures because it
was not an actual make check.

Bug 16262
Bug 16262

Signed-off-by: Ben Glines <ben.glines@schedmd.com>
Ensure jobs submitted via atf.submit_job() are tracked
and cancelled even if not in auto-config mode

Bug 16326
Track job ids created with alloc_job_id for teardown
when the testsuite ends

Bug 16326
jvilarru and others added 22 commits April 7, 2023 08:07
Code is extracted from _check_job_credential in preparation for the next
commit.

Bug 15417
Code is extracted from _check_job_credential in preparation for the next
commit.

Bug 15417
Since the change introduced in 21.08 to use [job|step]_mem_alloc as the
reference for the memory limitation in the credential, a message received
from an old version (20.11) will not have the *_mem_alloc fields, so we
need to do it the old way, using *_mem_limit instead. This also checks
protocol_version to decide whether calling _get_ncpus is needed.
This commit also changes the slurm_cred_get_mem signature in order to
receive the step and job cpu count so that the number of cpus can be
calculated.

Bug 15417
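
A hedged sketch of the fallback: MEM_PER_CPU is the real flag bit marking
a per-CPU memory value, while the protocol macro and credential field
names are approximated.

/* Credentials from pre-21.08 peers carry only *_mem_limit. Convert a
 * per-CPU limit using the caller-supplied CPU count, which is why
 * slurm_cred_get_mem() now receives the job/step CPU counts. */
if (cred->protocol_version < SLURM_21_08_PROTOCOL_VERSION) {
	if (cred->job_mem_limit & MEM_PER_CPU)
		*job_mem = (cred->job_mem_limit & ~MEM_PER_CPU) * job_cpus;
	else
		*job_mem = cred->job_mem_limit;
}
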
Continuation of commit 7cfcbf3

Bug 15417
The argv array from Slurm opt parsing is freed before being packed.

Bug 16425
The argv array from Slurm opt parsing is freed before being packed.

Bug 16425
The argv array from Slurm opt parsing is freed before being packed.

Bug 16425
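
The bug class in a hypothetical shape (pack_argv() and free_opt_argv()
are invented names): the message must be packed while the argv array is
still alive.

/* Broken ordering: the packed buffer reads memory that option parsing
 * has already released. */
free_opt_argv(&opt);
pack_argv(desc->argc, desc->argv, buffer);	/* use-after-free */

/* Fixed ordering: pack first, free afterwards. */
pack_argv(desc->argc, desc->argv, buffer);
free_opt_argv(&opt);
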
Commit c3cc3d4 effectively made the fix from 368745b dead. Move the
check for non-bound GRES, marking all cores required if we use a GRES
that is not bound to any socket.

Bug 16297
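
A sketch of the relocated check (field names assumed, as in the earlier
sketch): a GRES with no socket binding can run anywhere, so every core
must be marked required rather than only cores on "required" sockets.

if (!sock_gres->bits_by_sock || sock_gres->cnt_any_sock) {
	/* GRES usable from any socket: mark every core required */
	for (int c = 0; c < tot_cores; c++)
		bit_set(req_cores, c);
}
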
The situation where gres_min_cpus is lower than avail_cpus can happen
quite often in the current state of the code. The value of gres_min_cpus
is originally calculated in gres_select_filter_sock_core(), where we may
not know how many tasks will run on the node. The number of required
sockets is calculated there to run as many tasks as possible on the node.

The overall logic may be improved later, which may make this condition
unexpected; however, that won't happen on a tagged branch, and we don't
want to show an error while it may be correct behavior.

Bug 16297
Was introduced in 23.02, but added to the 22.05 docs inadvertently in
commit 5793da7.

Bug 16532
Regression in 11712ce.

If we fail to find the node in the configuration, we should fall back to
using the raw host value we have in step_ptr.

This may happen if the configured NodeHostname differs from the
gethostname() result on the host, but matches one of the aliases.

Another common case is for srun running on a submit host that is not
configured as a Slurm node.

Bug 16401
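
A sketch of the fallback using the real lookup helpers find_node_record()
and slurm_set_addr(); the surrounding control flow is approximated.

node_record_t *node_ptr = find_node_record(step_ptr->host);

if (node_ptr)	/* configured node: use its configured address */
	slurm_set_addr(&addr, port, node_ptr->comm_name);
else		/* e.g. srun on a submit host not configured as a node */
	slurm_set_addr(&addr, port, step_ptr->host);
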
Create test for --switches option with topology/tree and
select/cons_tres

Bug 10061
Update slurm.spec as well.
@stdweird merged commit a8e72e1 into hpcugent:22.05.ug on May 5, 2023
1 check passed