forked from SchedMD/slurm
Update to 22.05.09 #57
Merged
Conversation
CR_SOCKET*. gres_select_filter_sock_core - don't mark req_sock[s] when not required. If there are no GRES bound to the socket (for instance CountOnly GRES, or just sock_gres->cnt_any_sock), don't mark the socket as required. Continuation of commit 73a650f. Bug 15654
Unified list of tests moved to main testsuite/README. Renamed python tests to start with number 101_1. Bug 15814
Bug 15886
Since --prefer is now another dimension of job_queue, we need to handle it for job arrays as we do for partition and reservation. Before going to the "next task" we need to make sure that feature_list_use is set correctly, so we don't attempt a --prefer start when the job_queue_rec we're dealing with has use_prefer = false. Bug 15886
Regression from 49826b9 Bug 15857
This can happen if the requested tasks count is lower than the nodes count. This commit restores behavior before ca510f2 Bug 15857
In _opt_verify(), the number of tasks was calculated by ntasks_per_node * min_nodes. However, the step may be allocated more nodes than min_nodes, so after the slurmctld has created the step allocation, srun needs to recalculate the number of tasks based on the actual number of allocated nodes. Bug 15857
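The recalculation described above can be sketched as follows. This is a minimal illustration under the assumptions stated in the commit message; the function name is hypothetical, not srun's actual code:

```c
#include <stdint.h>

/* Hypothetical sketch of the fix: after the slurmctld creates the step
 * allocation, derive the task count from the node count actually
 * allocated, rather than the min_nodes value used in _opt_verify(). */
static uint32_t recalc_ntasks(uint32_t ntasks_per_node,
			      uint32_t allocated_node_cnt)
{
	/* Before the fix, srun effectively used
	 * ntasks_per_node * min_nodes, which undercounts whenever the
	 * step is allocated more nodes than min_nodes. */
	return ntasks_per_node * allocated_node_cnt;
}
```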
This fixes some valid step requests that hang instead of running. For example:

PartitionName=def1 Nodes=ALL MinNodes=2

$ srun -p def1 -n1 -c1 hostname
<job is allocated but the step hangs>

Bug 15857
Since commit e652133, requesting GPUs is mandatory for --cpus-per-gpu, like it is for --mem-per-gpu; otherwise the job is rejected. Bug 15945
Swap mgr->event_fd[0] to non-blocking in init_con_mgr(). This corrects the possibility that a prior run of _watch() reads all of the pending bytes, leaving the next run of _watch() with a blocking read() of mgr->event_fd[0]. The existing read() code was already written to allow this, requiring no secondary modifications. Bug 15884
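The behavior this fix relies on can be sketched as follows. This is a minimal illustration of switching a pipe fd to non-blocking mode with fcntl(); the helper names are hypothetical and this is not Slurm's actual con_mgr code:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put an fd into non-blocking mode, preserving its other flags. */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Demonstrate the property the fix depends on: a read() on a drained
 * non-blocking pipe fails immediately with EAGAIN instead of blocking.
 * Returns 1 on the expected outcome, 0 otherwise. */
static int nonblocking_read_demo(void)
{
	int pipefd[2];
	char buf[8];
	ssize_t n;
	int ok;

	if (pipe(pipefd) < 0 || set_nonblocking(pipefd[0]) < 0)
		return 0;

	n = read(pipefd[0], buf, sizeof(buf));	/* nothing was written */
	ok = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

	close(pipefd[0]);
	close(pipefd[1]);
	return ok;
}
```

With the fd blocking, the same read() would hang until another writer produced bytes, which is exactly the stall described above.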
When an interactive job (salloc with LaunchParameters=use_interactive_step, or srun --interactive) is scheduled on powered-down nodes, the salloc or srun will wait in a loop for slurm_job_node_ready() to say the nodes are ready (i.e. registered with the controller and not running a prolog). After all the nodes have registered, slurm_job_node_ready() will report that the job and nodes are ready, and salloc will launch the interactive step.

For normal steps, job_config_fini(), which launches the prolog RPC, is called when creating the step if all the nodes have registered. For batch jobs, job_config_fini() is called by job_time_limit() after all the nodes have registered. For interactive steps, job_config_fini() isn't called at step creation and waits until job_time_limit() catches it. So there is a window in which the salloc/srun can run the interactive step before the controller has launched the prolog RPC.

The slurmd should block a task launch until the prolog RPC has run, if the prolog RPC is expected to run. Due to a separate issue where the slurmd thinks the prolog RPC has already run (_wait_for_job_running_prolog()), the slurmd launches the task without first running the prolog and other setup (i.e. job_container/tmpfs, pam_slurm_adopt, x11 forwarding). This can be handled separately.

The solution is to have slurm_job_node_ready() wait until the job is no longer configuring (i.e. job_config_fini() has been called) before considering the nodes to be ready for the job, which in turn ensures interactive jobs will not launch before the prolog runs.

Bug 15993
Co-authored-by: Brian Christiansen <brian@schedmd.com>
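The readiness gate described above can be sketched as a toy model. The struct and function names here are illustrative stand-ins, not Slurm's internal representation:

```c
#include <stdbool.h>

/* Toy model of the fix: a job's nodes are reported "ready" only once
 * every allocated node has registered AND the job is no longer
 * configuring (i.e. job_config_fini() has launched the prolog RPC). */
struct job_state {
	bool nodes_registered;	/* all allocated nodes have registered */
	bool configuring;	/* job_config_fini() not yet called */
};

static bool job_node_ready(const struct job_state *job)
{
	/* The old behavior effectively checked nodes_registered alone,
	 * opening the window where an interactive step could launch
	 * before the prolog RPC was sent. */
	return job->nodes_registered && !job->configuring;
}
```

Under this model, a job whose nodes have registered but which is still configuring is not yet ready, which closes the race for interactive steps.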
If the provided <list> is NULL, the task/affinity functions _validate_[map,mask]() would make slurmd segfault, because NULL is handed as the first argument to strtok_r() in order to parse the requested list. Example reproducers:

$ srun --cpu-bind=mask_cpu:, true
$ srun --cpu-bind=map_cpu:, true
$ srun --cpu-bind=mask_cpu:0x1 : true
$ srun --cpu-bind=map_cpu:0 : true

Bug 16067
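The defensive check can be sketched like this. The function name is hypothetical and this is not the actual task/affinity code; it only illustrates rejecting a NULL list before strtok_r() can dereference it:

```c
#include <string.h>

/* Hypothetical sketch of the fix: validate a cpu-bind list, returning
 * the token count, or -1 for a NULL list.  Previously the NULL case
 * fell straight through to strtok_r(list, ...), crashing slurmd. */
static int validate_cpu_bind_list(char *list)
{
	char *save = NULL, *tok;
	int ntok = 0;

	if (!list)	/* the previously missing guard */
		return -1;

	for (tok = strtok_r(list, ",", &save); tok;
	     tok = strtok_r(NULL, ",", &save))
		ntok++;

	return ntok;
}
```

Note that strtok_r() modifies its input in place, so callers must pass a writable buffer, never a string literal.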
Backfill's planned_bitmap isn't updated when a node is deleted from node_record_table_ptr so the planned_bitmap may be pointing to a deleted node. In 23.02, this isn't an issue because of the use of next_node_bitmap() which will skip any nodes that are deleted at the set bit. Bug 16165
If a node was deleted while being in the planned state the bit would be left in the planned_bitmap until a restart or a new node was added to the empty node slot. Bug 16165 Signed-off-by: Skyler Malinowski <malinowski@schedmd.com>
If slurmctld were to restart right after a planned node had a planned job terminated, the node could remain planned when reading from state save (e.g. slurmctld -R, node is NODE_STATE_DYNAMIC_NORM). Moreover, the node was saved as PLANNED, but the planned_bitmap is missing that node until it becomes PLANNED again. Bug 16165
Fix mis-merge from f4b164e. Bug 16149
Swap reference to "dbv0.0.38_meta". Bug 16149
The rc of the make command used was not forwarding the failures because it was not an actual make check. Bug 16262
Bug 16262 Signed-off-by: Ben Glines <ben.glines@schedmd.com>
Regression from commit 9210bea. Bug 16302
Ensure jobs submitted via atf.submit_job() are tracked and cancelled even if not in auto-config mode Bug 16326
Track job ids created with alloc_job_id for teardown when the testsuite ends Bug 16326
Code is extracted from _check_job_credential in preparation for the next commit. Bug 15417
Code is extracted from _check_job_credential in preparation for the next commit. Bug 15417
Since the change introduced in 21.08 to use [job|step]_mem_alloc as the reference for the memory limitation in the credential, in case that a message from an old version (20.11) is received, this will not have the *_mem_alloc field, thus we need to do it the old way, by using the *_mem_limit instead. Also checks the protocol_version to decide whether calling _get_ncpus is needed. This commit also changes the slurm_cred_get_mem signature in order to receive the step and job cpu count so that the number of cpus can be calculated. Bug 15417
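The version-dependent fallback might look roughly like the sketch below. All names, the version constant, and the old-way arithmetic are assumptions for illustration only; Slurm's real slurm_cred_get_mem and its memory-limit semantics are more involved (e.g. per-node vs per-cpu limits):

```c
#include <stdint.h>

/* Stand-in for the 20.11-era wire protocol version; hypothetical. */
#define OLD_PROTOCOL_VERSION 0x2500

/* Illustrative sketch of the compatibility logic: newer credentials
 * carry an explicit memory allocation (*_mem_alloc); older ones only
 * carry a limit (*_mem_limit), so memory must be derived the old way
 * from the limit and the cpu count. */
static uint64_t cred_get_mem(uint16_t protocol_version,
			     uint64_t mem_alloc,
			     uint64_t mem_limit_per_cpu,
			     uint32_t ncpus)
{
	if (protocol_version > OLD_PROTOCOL_VERSION)
		return mem_alloc;	/* new way: field present in cred */
	/* old way: no *_mem_alloc field; scale the per-cpu limit by the
	 * cpu count computed from the job/step cpu counts (_get_ncpus) */
	return mem_limit_per_cpu * ncpus;
}
```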
Bug 15417
Continuation of commit 7cfcbf3 Bug 15417
The argv array from Slurm opt parsing is freed before being packed. Bug 16425
The argv array from Slurm opt parsing is freed before being packed. Bug 16425
The argv array from Slurm opt parsing is freed before being packed. Bug 16425
Bug 16425
The situation where gres_min_cpus is lower than avail_cpus happens quite often in the current state of the code. The value of gres_min_cpus is originally calculated in gres_select_filter_sock_core, where we may not know how many tasks will run on the node. The number of required sockets is calculated there to run as many tasks as possible on the node. The overall logic may be improved, which may result in the condition being unexpected; however, this won't happen on a tagged branch, and we don't want to show an error while it may be correct behavior. Bug 16297
Was introduced in 23.02, but added to the 22.05 docs inadvertently in commit 5793da7. Bug 16532
Bug 16378
Regression in 11712ce. If we fail to find the node in the configuration we should fallback to using the raw host value we have in step_ptr. This may happen if the configured NodeHostname differs from the gethostname() result on the host, but matches one of the aliases. Another common case is for srun running on a submit host that is not configured as a Slurm node. Bug 16401
Create test for --switches option with topology/tree and select/cons_tres Bug 10061
Update slurm.spec as well.
Tag 22.05.9 release.