PPC64 builds broken on master #874
Comments
jsquyres added the bug label Sep 8, 2015
|
@jsquyres it isn't in the PMIx library, but in the OPAL integration with that library. Looks like some kind of issue with moving data between OPAL and PMIx structures. |
|
I found several issues related to big endian (I reproduced them on a SPARC architecture); the inlined patch fixes some of them.
- does work
- suffers from an intermittent hang
- always fails with the following error
BTW, is PMIx supposed to work on a heterogeneous cluster (e.g. a mix of big and little endian)? Here is the patch:
diff --git a/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c b/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c
index f1db83c..faa9a6e 100644
--- a/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c
+++ b/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c
@@ -11,6 +11,8 @@
* All rights reserved.
* Copyright (c) 2011-2013 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2014-2015 Intel, Inc. All rights reserved.
+ * Copyright (c) 2015 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -264,7 +266,7 @@ int pmix_bfrop_pack_int64(pmix_buffer_t *buffer, const void *src,
int32_t num_vals, pmix_data_type_t type)
{
int32_t i;
- uint64_t tmp, *srctmp = (uint64_t*) src;
+ uint64_t tmp, tmp2;
char *dst;
size_t bytes_packed = num_vals * sizeof(tmp);
@@ -275,7 +277,8 @@ int pmix_bfrop_pack_int64(pmix_buffer_t *buffer, const void *src,
}
for (i = 0; i < num_vals; ++i) {
- tmp = pmix_hton64(srctmp[i]);
+ memcpy(&tmp2, (char *)src+i*sizeof(uint64_t), sizeof(uint64_t));
+ tmp = pmix_hton64(tmp2);
memcpy(dst, &tmp, sizeof(tmp));
dst += sizeof(tmp);
}
diff --git a/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c b/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
index 95e8aa5..c6c3496 100644
--- a/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
+++ b/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
@@ -230,8 +230,7 @@ int PMIx_Init(pmix_proc_t *proc)
return PMIX_ERR_BAD_PARAM;
}
- ++pmix_globals.init_cntr;
- if (1 < pmix_globals.init_cntr) {
+ if (0 < pmix_globals.init_cntr) {
/* since we have been called before, the nspace and
* rank should be known. So return them here if
* requested */
@@ -339,6 +338,9 @@ int PMIx_Init(pmix_proc_t *proc)
rc = cb.status;
PMIX_DESTRUCT(&cb);
+ if (PMIX_SUCCESS == rc) {
+ pmix_globals.init_cntr++;
+ }
return rc;
}
diff --git a/opal/mca/pmix/pmix1xx/pmix_pmix1.c b/opal/mca/pmix/pmix1xx/pmix_pmix1.c
index 3abeee9..84c7774 100644
--- a/opal/mca/pmix/pmix1xx/pmix_pmix1.c
+++ b/opal/mca/pmix/pmix1xx/pmix_pmix1.c
@@ -281,7 +281,7 @@ void pmix1_value_load(pmix_value_t *v,
break;
case OPAL_SIZE:
v->type = PMIX_SIZE;
- memcpy(&(v->data.size), &kv->data.size, sizeof(size_t));
+ v->data.size = (size_t)kv->data.size;
break;
case OPAL_PID:
v->type = PMIX_PID;
@@ -344,7 +344,7 @@ void pmix1_value_load(pmix_value_t *v,
if (NULL != kv->data.bo.bytes) {
v->data.bo.bytes = (char*)malloc(kv->data.bo.size);
memcpy(v->data.bo.bytes, kv->data.bo.bytes, kv->data.bo.size);
- memcpy(&(v->data.bo.size), &kv->data.bo.size, sizeof(size_t));
+ v->data.bo.size = (size_t)kv->data.bo.size;
} else {
v->data.bo.bytes = NULL;
v->data.bo.size = 0;
@@ -382,7 +382,7 @@ int pmix1_value_unload(opal_value_t *kv,
break;
case PMIX_SIZE:
kv->type = OPAL_SIZE;
- memcpy(&kv->data.size, &(v->data.size), sizeof(size_t));
+ kv->data.size = (int)v->data.size;
break;
case PMIX_PID:
kv->type = OPAL_PID;
@@ -444,7 +444,7 @@ int pmix1_value_unload(opal_value_t *kv,
kv->type = OPAL_BYTE_OBJECT;
if (NULL != v->data.bo.bytes && 0 < v->data.bo.size) {
kv->data.bo.bytes = (uint8_t*)v->data.bo.bytes;
- kv->data.bo.size = v->data.bo.size;
+ kv->data.bo.size = (int)v->data.bo.size;
} else {
kv->data.bo.bytes = NULL;
kv->data.bo.size = 0; |
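For illustration, a minimal standalone sketch (not taken from the PMIx sources) of the unaligned-access problem the pack.c hunk above avoids: casting an arbitrarily-offset buffer pointer to uint64_t* and dereferencing it can trap on strict-alignment CPUs such as SPARC, whereas copying the bytes into an aligned local with memcpy is always safe.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Payload deliberately starts at an odd offset, as can happen when
     * 64-bit values are packed back-to-back with strings in a buffer. */
    char buf[16] = {0};
    uint64_t value = 0x1122334455667788ULL;
    memcpy(buf + 1, &value, sizeof(value));

    /* Unsafe on strict-alignment hosts: the cast asserts 8-byte alignment
     * that buf+1 does not have, so the load may trap (SIGBUS on SPARC).
     *   uint64_t bad = *(uint64_t *)(buf + 1);
     */

    /* Safe everywhere: copy into an aligned local first, the same pattern
     * the patched pmix_bfrop_pack_int64() uses before byte-swapping. */
    uint64_t tmp;
    memcpy(&tmp, buf + 1, sizeof(tmp));
    printf("read back 0x%016llx\n", (unsigned long long)tmp);
    return 0;
}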
|
Ah crud - I thought I'd been careful enough about avoiding the alignment issues, but obviously not. I'll take a gander at these in the morning and bring them into PMIx - will also see if I can spot any additional problems. Thanks! |
|
@rhc54 note there are two kinds of issues
|
|
I can confirm that I no longer get segfaults on ppc64, but I see the same errors as @ggouaillardet. |
|
@ggouaillardet PMIx is supposed to work on hetero clusters - it has the equivalent of the OPAL DSS for packing/unpacking to make it work. Did you see something that would prevent it? |
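As a rough sketch of that idea (hypothetical helpers, not the actual PMIx or OPAL DSS API): packing integers into one fixed byte order on the wire is what lets big- and little-endian peers exchange values safely.

#include <stdint.h>
#include <stdio.h>

/* Encode a 64-bit value as big-endian wire bytes, whatever the host order. */
static void pack_u64(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        dst[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/* Decode big-endian wire bytes back into a host-order value. */
static uint64_t unpack_u64(const uint8_t *src)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | src[i];
    }
    return v;
}

int main(void)
{
    uint8_t wire[8];
    pack_u64(wire, 0x1122334455667788ULL);
    /* A big-endian and a little-endian peer both recover the same value. */
    printf("0x%016llx\n", (unsigned long long)unpack_u64(wire));
    return 0;
}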
|
@nysal Thanks! |
|
@rhc54 my question about hetero clusters was a naive one. With the following inlined one-line patch, I get no more issues with singleton and mpirun -np 1:
diff --git a/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c b/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c
index faa9a6e..cf453ee 100644
--- a/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c
+++ b/opal/mca/pmix/pmix1xx/pmix/src/buffer_ops/pack.c
@@ -643,7 +643,7 @@ int pmix_bfrop_pack_proc(pmix_buffer_t *buffer, const void *src,
if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, &ptr, 1, PMIX_STRING))) {
return ret;
}
- if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_sizet(buffer, &proc[i].rank, 1, PMIX_INT))) {
+ if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer, &proc[i].rank, 1, PMIX_INT))) {
return ret;
}
} |
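To illustrate why the sizet/int mismatch matters (a standalone sketch using a hypothetical big-endian wire encoding, not the real PMIx buffer format): if the sender packs the 4-byte rank through an 8-byte routine while the receiver unpacks 4 bytes for PMIX_INT, the receiver sees the zero high half of the value and the rest of the stream is shifted by four bytes.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t wire[8];
    uint64_t rank = 5;   /* sender packs the rank through the 8-byte path */

    /* hypothetical big-endian wire encoding of the 8-byte value */
    for (int i = 0; i < 8; i++) {
        wire[i] = (uint8_t)(rank >> (56 - 8 * i));
    }

    /* The receiver expects a 4-byte PMIX_INT and consumes only the first
     * four wire bytes, i.e. the zero-filled high half of the 64-bit value. */
    uint32_t decoded = 0;
    for (int i = 0; i < 4; i++) {
        decoded = (decoded << 8) | wire[i];
    }

    printf("packed rank %llu, decoded rank %u\n",
           (unsigned long long)rank, decoded);   /* 5 vs 0, with 4 bytes left over */
    return 0;
}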
|
@ggouaillardet Oh wow - yeah, that would definitely be wrong! Sorry for that typo. |
|
And here is another two-line patch that is needed to fix mpirun -np n helloworld with n > 1 on a big endian arch:
diff --git a/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server.c b/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server.c
index 2dbfb2b..3b311be 100644
--- a/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server.c
+++ b/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server.c
@@ -381,7 +381,8 @@ static void _register_nspace(int sd, short args, void *cbdata)
pmix_setup_caddy_t *cd = (pmix_setup_caddy_t*)cbdata;
pmix_nspace_t *nptr, *tmp;
pmix_status_t rc;
- size_t i, j, size, rank;
+ size_t i, j, size;
+ int rank;
pmix_kval_t kv;
char **nodes=NULL, **procs=NULL;
pmix_buffer_t buf2; |
|
So it seems all patches from @ggouaillardet are already in the master branch, and the test cases now seem to start up correctly. The tests are, however, not running correctly:
and then it seems to hang forever. If I remove the parameters |
|
This parameter is set to use the collective module optimized for shared memory with intra-node communicators. |
|
The
I was just using the defaults. Maybe needs some fixing too. |
|
@jsquyres is there any reason why we use coll/sm by default? |
|
Is there a reason not to use coll sm? |
|
Not really, except for a memory leak in v1.10 when a communicator is freed. |
|
Maybe the real question is why coll sm has a low priority by default... is there a known issue with it? (there's something in the back of my head that says that there is, but I don't recall what it is offhand...) |
|
@jsquyres Should we close this one, as the original issue seems to be fixed? I can confirm that some basic tests I ran on ppc64 pass. For the sm coll component maybe we can open another issue? |
doko42 commented Feb 12, 2016
|
Seeing similar issues on powerpc 32-bit: https://bugs.debian.org/814183. Are these fixes applied to the 1.10.x series as well? |
|
The fix was for PMIx, and there is no such thing in the v1.10 series. |
|
@doko42 From that Debian bug report, it's not clear what the error is. Is there a corefile or some other error product that shows what has failed? |
|
@doko42 I haven't tried running powerpc 32-bit in a while. Could you attach gdb to the hung tasks and get a backtrace? Do simple examples shipped with ompi work? |
|
@adrianreber Is this issue still relevant? |
|
@jsquyres I do not see the segfault mentioned in this ticket anymore. All tests of the intel_tests directory are running on all branches on ppc64 without errors. Seems fixed, yes. |
|
@adrianreber Thanks. So I'm closing this issue. |
jsquyres commented Sep 8, 2015
Per http://www.open-mpi.org/community/lists/devel/2015/09/17979.php, @adrianreber reports:
For the past few days, the MTT runs on my ppc64 systems have been failing with:
I do not think I see these kinds of errors on any of the other MTT setups, so it might be ppc64 related. Just wanted to point it out.
@rhc54 I see that the failure is in PMIX...?
@gpaulsen @nysal Pinging the PPC64/IBM people...