Skip to content

Conversation

@bosilca
Copy link
Member

@bosilca bosilca commented Mar 7, 2018

Fix the calculation of the number of elements in the datatype. As the loop variable has been set outside the internal loop, we were not correctly updating the loop stats, and we were counting more elements than actually existing in the datatype representation.

The loop should be updated on each internal iteration.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bosilca bosilca requested a review from edgargabriel March 7, 2018 18:19
@bosilca bosilca added this to the v3.0.1 milestone Mar 7, 2018
@ibm-ompi
Copy link

ibm-ompi commented Mar 7, 2018

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/2f5b1e23de66e06b8982acd00e2c336f

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@artpol84
Copy link
Contributor

artpol84 commented Mar 7, 2018

@jladd-mlnx @Di0gen @yosefe
@jsquyres just pinged me about this Jenkins failure.
It looks like ompi can’t find UCX. Can someone have a look?

@jsquyres
Copy link
Member

jsquyres commented Mar 7, 2018

I pushed a missing include file fix to this PR. If it passes CI, I'll squash it back down to George's original commit and then we can merge.

@bosilca
Copy link
Member Author

bosilca commented Mar 7, 2018

I compile in debug mode, and the output was available there. I'll squash.

@jsquyres
Copy link
Member

jsquyres commented Mar 7, 2018

@bosilca Cool -- thanks!

@bosilca bosilca force-pushed the topic/get_element_fix branch from fd4dd7f to 999de13 Compare March 7, 2018 19:23
@edgargabriel
Copy link
Member

bot:mellanox:retest

1 similar comment
@alinask
Copy link
Member

alinask commented Mar 8, 2018

bot:mellanox:retest

@artpol84
Copy link
Contributor

artpol84 commented Mar 8, 2018

@alinask thanks!

@edgargabriel
Copy link
Member

Are we good to merge this pr, i.e. has the squash been already done?

@jsquyres
Copy link
Member

jsquyres commented Mar 8, 2018

@edgargabriel Yes, squash has been done -- we're good to go.

@jsquyres jsquyres merged commit 70c59f7 into open-mpi:master Mar 8, 2018
@jsquyres
Copy link
Member

jsquyres commented Mar 8, 2018

@bosilca This PR changed the default value of the MCA param opal_cuda_verbose from 0 to -1.

Was that intended? That doesn't seem to have anything to do with this GET_ELEMENT fix.

@jsquyres
Copy link
Member

jsquyres commented Mar 8, 2018

@bosilca Is this fix also applicable to v2.1.x? The cherry picks did not apply anywhere close to clean for me.

@bosilca
Copy link
Member Author

bosilca commented Mar 8, 2018

@jsquyres the opal_cuda_verbose was not supposed to be part of this. Fortunately, it has no impact on the code. I will change back in master, I have another cleanup pending.

The bug is present in 2.1, and as you noticed this patch will not apply as is. One will have to search for the lines in this patch and apply them by hand. Got few minutes before my plane takes of, here is the patch for 2.1 (please test I haven't tried it yet).

diff --git a/opal/datatype/opal_datatype_get_count.c b/opal/datatype/opal_datatype_get_count.c
index 7b539fbec8..7b33b73480 100644
--- a/opal/datatype/opal_datatype_get_count.c
+++ b/opal/datatype/opal_datatype_get_count.c
@@ -1,6 +1,6 @@
 /* -*- Mode: C; c-basic-offset:4 ; -*- */
 /*
- * Copyright (c) 2004-2009 The University of Tennessee and The University
+ * Copyright (c) 2004-2018 The University of Tennessee and The University
  *                         of Tennessee Research Foundation.  All rights
  *                         reserved.
  * Copyright (c) 2009      Oak Ridge National Labs.  All rights reserved.
@@ -58,9 +58,8 @@ ssize_t opal_datatype_get_element_count( const opal_datatype_t* datatype, size_t
             continue;
         }
         if( OPAL_DATATYPE_LOOP == pElems[pos_desc].elem.common.type ) {
-            ddt_loop_desc_t* loop = &(pElems[pos_desc].loop);
             do {
-                PUSH_STACK( pStack, stack_pos, pos_desc, OPAL_DATATYPE_LOOP, loop->loops, 0 );
+                PUSH_STACK( pStack, stack_pos, pos_desc, OPAL_DATATYPE_LOOP, pElems[pos_desc].loop.loops, 0 );
                 pos_desc++;
             } while( OPAL_DATATYPE_LOOP == pElems[pos_desc].elem.common.type ); /* let's start another loop */
             DDT_DUMP_STACK( pStack, stack_pos, pElems, "advance loops" );
@@ -121,9 +120,8 @@ int32_t opal_datatype_set_element_count( const opal_datatype_t* datatype, size_t
             continue;
         }
         if( OPAL_DATATYPE_LOOP == pElems[pos_desc].elem.common.type ) {
-            ddt_loop_desc_t* loop = &(pElems[pos_desc].loop);
             do {
-                PUSH_STACK( pStack, stack_pos, pos_desc, OPAL_DATATYPE_LOOP, loop->loops, 0 );
+                PUSH_STACK( pStack, stack_pos, pos_desc, OPAL_DATATYPE_LOOP, pElems[pos_desc].loop.loops, 0 );
                 pos_desc++;
             } while( OPAL_DATATYPE_LOOP == pElems[pos_desc].elem.common.type ); /* let's start another loop */
             DDT_DUMP_STACK( pStack, stack_pos, pElems, "advance loops" );

@bosilca bosilca deleted the topic/get_element_fix branch October 18, 2018 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants