openmpi 4.1.5 won't spawn ([asustx:2550375] UNPACK-OPAL-VALUE: UNSUPPORTED TYPE 33 FOR KEY) #11749
Comments
What version of OpenPMIx are you using? If you are using OpenPMIx v4.2.3, this is possibly a dup of #11729.
Thanks for your comment. I'm using OpenPMIx 4.2.3. I'll try the patch from that issue and report back here.
This is not a dup of #11729. I rebuilt, and I then checked the latest PKGBUILD (updated 2 weeks ago) in the official Arch Linux repository, finding it already had the patch (#11472) included.
@sukanka Thanks for the confirmation. Are you working with @bchareyre, or is your report independent of theirs? I did some git bisecting, and it looks like the fix from #11472 is what actually broke spawn (with both the bundled version of PMIx and v4.2.3). We need to investigate further to figure out how we broke it / how to fix it.
Yeah, I work with @bchareyre. We have been discussing this for a few days. (You can have a look at the URL posted in the very first comment.)
Please give the inline patch below a try. FWIW, I am not convinced this is the most elegant fix (since we convert …):

```diff
--- orig/openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix3x.c	2023-06-14 11:57:42.000000000 +0900
+++ openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix3x.c	2023-06-14 12:00:23.116704037 +0900
@@ -843,10 +843,6 @@
                 i->value.data.bo.size = 0;
             }
             break;
-        case OPAL_PERSIST:
-            i->value.type = PMIX_PERSIST;
-            i->value.data.persist = pmix3x_convert_opalpersist((opal_pmix_persistence_t)kv->data.uint8);
-            break;
         case OPAL_SCOPE:
             i->value.type = PMIX_SCOPE;
             i->value.data.scope = pmix3x_convert_opalscope((opal_pmix_scope_t)kv->data.uint8);
```
Thanks for your reply, but this does not seem to work. I rebuilt openmpi with the patch from #11472 plus the one from your comment. Then I re-ran the demo script, and the output is:

```
➜ openmpi python3 script.py
[asustx:3025739] PACK-OPAL-VALUE: UNSUPPORTED TYPE 0 FOR KEY
[asustx:3025739] [[3101,0],0] ORTE_ERROR_LOG: Error in file orted/pmix/pmix_server_pub.c at line 312
Traceback (most recent call last):
  File "/tmp/RustDesk/openmpi/script.py", line 6, in <module>
    comm = world.Spawn("python3", args=[sys.argv[0]], maxprocs=3).Merge()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mpi4py/MPI/Comm.pyx", line 1931, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
➜ openmpi [asustx:3025739] PACK-OPAL-VALUE: UNSUPPORTED TYPE 0 FOR KEY
[asustx:3025739] [[3101,0],0] ORTE_ERROR_LOG: Error in file orted/pmix/pmix_server_pub.c at line 312
```

Note: I ran it only once, but the output looks like the script was run twice. And when I press …
My bad, it seems I messed up my test environment. What about this one? It works for me if I build from the sources and manually apply this patch on top of the existing one.

```diff
--- orig/openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix3x.c	2023-06-14 21:45:40.159479390 +0900
+++ openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix3x.c	2023-06-14 21:48:02.469473048 +0900
@@ -807,14 +807,17 @@
             PMIX_INFO_LOAD(i, kv->key, &kv->data.time, PMIX_TIME);
             break;
         case OPAL_STATUS:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_STATUS;
             i->value.data.status = pmix3x_convert_opalrc(kv->data.status);
             break;
         case OPAL_VPID:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_PROC_RANK;
             i->value.data.rank = pmix3x_convert_opalrank(kv->data.name.vpid);
             break;
         case OPAL_NAME:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_PROC;
             /* have to stringify the jobid */
             PMIX_PROC_CREATE(i->value.data.proc, 1);
@@ -833,6 +836,7 @@
             i->value.data.proc->rank = pmix3x_convert_opalrank(kv->data.name.vpid);
             break;
         case OPAL_BYTE_OBJECT:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_BYTE_OBJECT;
             if (NULL != kv->data.bo.bytes) {
                 i->value.data.bo.bytes = (char*)malloc(kv->data.bo.size);
@@ -844,18 +848,22 @@
             }
             break;
         case OPAL_PERSIST:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_PERSIST;
             i->value.data.persist = pmix3x_convert_opalpersist((opal_pmix_persistence_t)kv->data.uint8);
             break;
         case OPAL_SCOPE:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_SCOPE;
             i->value.data.scope = pmix3x_convert_opalscope((opal_pmix_scope_t)kv->data.uint8);
             break;
         case OPAL_DATA_RANGE:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_DATA_RANGE;
             i->value.data.range = pmix3x_convert_opalrange((opal_pmix_data_range_t)kv->data.uint8);
             break;
         case OPAL_PROC_STATE:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_PROC_STATE;
             /* the OPAL layer doesn't have any concept of proc state,
              * so the ORTE layer is responsible for converting it */
@@ -873,6 +881,7 @@
              * opal_value_t's that we need to convert to a pmix_data_array
              * of pmix_info_t structures */
             list = (opal_list_t*)kv->data.ptr;
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_DATA_ARRAY;
             i->value.data.darray = (pmix_data_array_t*)malloc(sizeof(pmix_data_array_t));
             i->value.data.darray->type = PMIX_INFO;
@@ -893,6 +902,7 @@
             }
             break;
         case OPAL_PROC_INFO:
+            PMIX_LOAD_KEY(i->key, kv->key);
             i->value.type = PMIX_PROC_INFO;
             PMIX_PROC_INFO_CREATE(i->value.data.pinfo, 1);
             /* see if this job is in our list of known nspaces */
```
@ggouaillardet Works for me! Here's my test program in C:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(NULL, NULL);
    MPI_Comm comm;
    MPI_Comm parent, child;
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        int errors[3];
        MPI_Comm_spawn(argv[0], NULL, 3, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &child, errors);
        MPI_Intercomm_merge(child, 0, &comm);
        MPI_Comm_free(&child);
    } else {
        MPI_Intercomm_merge(parent, 1, &comm);
    }
    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("Hello, I am rank %d in the merged comm\n", rank);
    MPI_Barrier(comm);
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```

@bchareyre @sukanka If this patch works for you, I'll make a PR and get it merged into the v4.1.x branch.
Thanks! This patch works!

Awesome! Thanks @ggouaillardet for fixing, and thank you @sukanka for following up.

Fixed in #11767. Will be included in Open MPI v4.1.6.
This MWE is broken again in a later version: the compiled program can print the expected messages, but it never terminates; it just gets stuck instead.
Hello,

An issue was revealed by regression tests of yade-dem on Arch Linux [1]. The tests pass with 4.1.4 and earlier; they fail with 4.1.5.

Note: the test involves Python bindings via mpi4py.

[1] https://gitlab.com/yade-dev/trunk/-/issues/309

openmpi-4.1.4 was built with extra-x86_64-build -r /tmp/arch-chroot (with the PKGBUILD in the official Arch Linux repo).
openmpi-4.1.5 is from the official Arch Linux repo.
python 3.11.3-1
python-mpi4py 3.1.4-3

Details of the problem:
Expected behavior with the script below (obtained with openmpi-4.1.4): …
With openmpi-4.1.5, the script does not terminate, and the output is: …
The (python) test script: