Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove inter-dependencies between OSC modules. #4527

Merged
merged 1 commit into from Feb 9, 2018

Conversation

clementFoyer
Copy link
Contributor

The osc monitoring component needed to include other OSC components header in order to be able to access communicator through the component specific ompi_osc_*_module_t structures. Now we keep a hashtable to keep track of the window specific communicator. This commit remove the dependency, and resolve the issue #4523 opened by @tkordenbrock .

@bosilca @jsquyres Please review.

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

OSC_MONITORING_MODULE_TEMPLATE_GENERATE(portals4, ompi_osc_portals4_module_t, comm)
#undef GET_MODULE
#endif /* OMPI_WITH_OSC_PORTALS4 */
OSC_MONITORING_MODULE_TEMPLATE_GENERATE(rdma)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still feels pretty brittle (although it is a step up). There probably not a generic solution, but a less-brittle version would be to have an autogenerated file from configure with these OSC_MONITORING_MODULE_TEMPLATE_GENERATE() statements, which was generated during configure time by looping over mca_ompi_osc_m4_config_component_list and mca_ompi_osc_no_config_component_list.

As an example, I notice that there's no template for the ucx component.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I have not came back for a few few months, and I missed this new component. You are totally right on how careful this whole component have to be handle. I had not thought of this m4 solution, but that's a very good idea.

@bwbarrett
Copy link
Member

ok to test

@bosilca
Copy link
Member

bosilca commented Nov 23, 2017

Another possible approach is to unify the different OSC modules structs by moving the ompi_communicator_t pointer from the specialized modules down into the base ompi_osc_base_module_t structure (in MPI every window has it's corresponding communicator).

@clementFoyer
Copy link
Contributor Author

clementFoyer commented Nov 24, 2017

Maybe this PR should not be merged yet, as it may be followed by another change to remove the introduced hashtable.
Also, should the upcoming commits affect all osc components that saved the communicator to their own module structures? And finally, should the communicator field in the base structure be at the same nested level as the functions?
I.E. :

struct ompi_osc_base_module_3_0_0_t {
    struct ompi_communicator_t *comm;
    ompi_osc_base_module_win_shared_query_fn_t osc_win_shared_query;
    ompi_osc_base_module_win_attach_fn_t osc_win_attach;
    ompi_osc_base_module_win_detach_fn_t osc_win_detach;
    ompi_osc_base_module_free_fn_t osc_free;
    ompi_osc_base_module_put_fn_t osc_put;
    ompi_osc_base_module_get_fn_t osc_get;
    ompi_osc_base_module_accumulate_fn_t osc_accumulate;
    ompi_osc_base_module_compare_and_swap_fn_t osc_compare_and_swap;
    ompi_osc_base_module_fetch_and_op_fn_t osc_fetch_and_op;
    ompi_osc_base_module_get_accumulate_fn_t osc_get_accumulate;
    ompi_osc_base_module_rput_fn_t osc_rput;
    ompi_osc_base_module_rget_fn_t osc_rget;
    ompi_osc_base_module_raccumulate_fn_t osc_raccumulate;
    ompi_osc_base_module_rget_accumulate_fn_t osc_rget_accumulate;
    ompi_osc_base_module_fence_fn_t osc_fence;
    ompi_osc_base_module_start_fn_t osc_start;
    ompi_osc_base_module_complete_fn_t osc_complete;
    ompi_osc_base_module_post_fn_t osc_post;
    ompi_osc_base_module_wait_fn_t osc_wait;
    ompi_osc_base_module_test_fn_t osc_test;
    ompi_osc_base_module_lock_fn_t osc_lock;
    ompi_osc_base_module_unlock_fn_t osc_unlock;
    ompi_osc_base_module_lock_all_fn_t osc_lock_all;
    ompi_osc_base_module_unlock_all_fn_t osc_unlock_all;
    ompi_osc_base_module_sync_fn_t osc_sync;
    ompi_osc_base_module_flush_fn_t osc_flush;
    ompi_osc_base_module_flush_all_fn_t osc_flush_all;
    ompi_osc_base_module_flush_local_fn_t osc_flush_local;
    ompi_osc_base_module_flush_local_all_fn_t osc_flush_local_all;
};
typedef struct ompi_osc_base_module_3_0_0_t ompi_osc_base_module_3_0_0_t;
typedef ompi_osc_base_module_3_0_0_t ompi_osc_base_module_t;

or

struct ompi_osc_base_module_3_0_0_fn_t {
    ompi_osc_base_module_win_shared_query_fn_t osc_win_shared_query;
    ompi_osc_base_module_win_attach_fn_t osc_win_attach;
    ompi_osc_base_module_win_detach_fn_t osc_win_detach;
    ompi_osc_base_module_free_fn_t osc_free;
    ompi_osc_base_module_put_fn_t osc_put;
    ompi_osc_base_module_get_fn_t osc_get;
    ompi_osc_base_module_accumulate_fn_t osc_accumulate;
    ompi_osc_base_module_compare_and_swap_fn_t osc_compare_and_swap;
    ompi_osc_base_module_fetch_and_op_fn_t osc_fetch_and_op;
    ompi_osc_base_module_get_accumulate_fn_t osc_get_accumulate;
    ompi_osc_base_module_rput_fn_t osc_rput;
    ompi_osc_base_module_rget_fn_t osc_rget;
    ompi_osc_base_module_raccumulate_fn_t osc_raccumulate;
    ompi_osc_base_module_rget_accumulate_fn_t osc_rget_accumulate;
    ompi_osc_base_module_fence_fn_t osc_fence;
    ompi_osc_base_module_start_fn_t osc_start;
    ompi_osc_base_module_complete_fn_t osc_complete;
    ompi_osc_base_module_post_fn_t osc_post;
    ompi_osc_base_module_wait_fn_t osc_wait;
    ompi_osc_base_module_test_fn_t osc_test;
    ompi_osc_base_module_lock_fn_t osc_lock;
    ompi_osc_base_module_unlock_fn_t osc_unlock;
    ompi_osc_base_module_lock_all_fn_t osc_lock_all;
    ompi_osc_base_module_unlock_all_fn_t osc_unlock_all;
    ompi_osc_base_module_sync_fn_t osc_sync;
    ompi_osc_base_module_flush_fn_t osc_flush;
    ompi_osc_base_module_flush_all_fn_t osc_flush_all;
    ompi_osc_base_module_flush_local_fn_t osc_flush_local;
    ompi_osc_base_module_flush_local_all_fn_t osc_flush_local_all;
};
typedef struct ompi_osc_base_module_3_0_0_fn_t ompi_osc_base_module_3_0_0_fn_t;
struct ompi_osc_base_module_3_0_0_t {
    struct ompi_communicator_t *comm;
    ompi_osc_base_module_3_0_0_fn_t module;
};
typedef ompi_osc_base_module_3_0_0_t ompi_osc_base_module_t;

@hppritcha
Copy link
Member

OS-X is not happy:

aking all in mca/osc/monitoring
  CC       osc_monitoring_component.lo
In file included from osc_monitoring_component.c:24:
./osc_monitoring_template_gen.h:40:3: error: expected ';' after expression
-n     
  ^
  ;
./osc_monitoring_template_gen.h:40:2: error: use of undeclared identifier 'n'
-n     
 ^
./osc_monitoring_template_gen.h:43:3: error: expected ';' after expression
-n     } else 
  ^
  ;
./osc_monitoring_template_gen.h:43:2: error: use of undeclared identifier 'n'
-n     } else 
 ^
./osc_monitoring_template_gen.h:46:3: error: expected ';' after expression
-n     } else 
  ^
  ;
./osc_monitoring_template_gen.h:46:2: error: use of undeclared identifier 'n'
-n     } else 
 ^
./osc_monitoring_template_gen.h:49:3: error: expected ';' after expression
-n     } else 
  ^
  ;
./osc_monitoring_template_gen.h:49:2: error: use of undeclared identifier 'n'
-n     } else 
 ^
./osc_monitoring_template_gen.h:52:3: error: expected ';' after expression
-n     } else 
  ^
  ;
./osc_monitoring_template_gen.h:52:2: error: use of undeclared identifier 'n'
-n     } else 
 ^
./osc_monitoring_template_gen.h:55:3: error: expected ';' after expression
-n     } else 
  ^
  ;
./osc_monitoring_template_gen.h:55:2: error: use of undeclared identifier 'n'
-n     } else 
 ^
12 errors generated.
make[2]: *** [osc_monitoring_component.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1
Build step 'Execute shell' marked build as failure
Finished: FAILURE
Page generated: Nov 26, 2017 3:35:18 AM UTCREST APIJenkins ver. 2.91

@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/5692e25afdb5edea04a1451fad841cf2

@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/a6c633a1019927effca8fdabc35289eb

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these commits should be squashed.

@@ -17,7 +17,8 @@ monitoring_sources = \
osc_monitoring_active_target.h \
osc_monitoring_dynamic.h \
osc_monitoring_module.h \
osc_monitoring_template.h
osc_monitoring_template.h \
osc_monitoring_template_gen.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

template_gen.h shouldn't be in monitoring_sources. Since it's generated at every run of configure, it shouldn't be distributed in the tarball (which monitoring_sources will be). You probably need to add a nodist_mca_osc_monitoring_la_SOURCES line (I think that's right, my Automake foo is getting rusty).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 2011660 .
I'm not a automake guru, but it seems like adding osc_monitoring_template_gen.h to the nodist_mcacomponent_HEADERS target worked as intended.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I saw that in the next commit (I review commits independently, because changes should be bisectable). You need to merge those two commits into one before merging this PR to master.

done
echo '' >>[$1]
cat <<EOF >>[$1]
return OMPI_ERR_NOT_SUPPORTED;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say that this shell script is really ugly to read. It would be a lot clearer (in my opinion) if you didn't try to get the else if on the same line and just had some slightly uglier code generated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say I totally agree with you :-D And it actually led to issue with OS-X, and replacing "echo -n" with "printf" did not worked either. Commit 2011660 fixes the issue. And at this point I start feeling like I should not have kept the wrong commits, and only keep once everything was fixed. I'll do that tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option you might want to do here is not generate the inline code here in the .h file, but rather just instantiate an argv-style array of the component names. That's the only thing that needs to be generated (the array of component names), right? Then the code can be beautifully indented in a .c file. 😄

[$1],
[$2])

mca_ompi_osc_monitoring_generate_templates([$OMPI_TOP_SRCDIR/ompi/mca/osc/monitoring/osc_monitoring_template_gen.h],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised this worked with "make distcheck"; you should be writing into the build directory, not the source directory. You actually shouldn't need the prefix ($OMPI_TOP_SRCDIR/) at all; configure is by definition run at the top of the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did not. I think the cleancheck failed (but I did not know that at that point. I fought latter with targets to make it succeed. And I reached beyond despair at some point... But now it seems fine)

Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh; this commit should be squashed into the previous one.

Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be squashed into 08ec754

@bosilca
Copy link
Member

bosilca commented Nov 30, 2017

As we can use the group to translate ranks why do we need the knowledge about the other OSC modules ?

@clementFoyer
Copy link
Contributor Author

The OSC component specific module is kept in the window. As it keeps all the structured data needed by a component to be effective, it cannot be replaced in the window structure, because some component are expecting these information to be there, and are accessing directly through the window given as parameter. I would have love to "extend" these module to add my own informations, but I would have to be able to know their module sizes, and thus, I would have to include their module headers. Otherwise I could create a module big enough to contain any OSC module but that is quite fragile and it would get us back to the original issue that had me to create this PR.
In order to intercept the call, I cannot replace this module. So I change the module functions to my templated OSC component specific functions, so for each OSC function, and for each component, I can intercept the call, record data, then forward to the originally targeted function.

@clementFoyer clementFoyer force-pushed the osc-no-includes branch 3 times, most recently from 704d6e1 to 8127b2d Compare December 4, 2017 16:58
@hppritcha
Copy link
Member

bot:ompi:retest

@clementFoyer
Copy link
Contributor Author

Everything here is squashed into two commits. The first one does the first quick solution that translate rank from windows' group to MPI_COMM_WORLD rank. The second commit enables the automatic generation at config time. IMO, it is relevant to keep these two commits separated. From the differents tests I've made, all seems to work smoothly.

As the monitoring was introduced into v3.1, this PR may be also added to the target v3.1.x

Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you get a chance, it would be awesome to fix the one Makefile bit, but let's ship it...


noinst_LTLIBRARIES = $(component_noinst)
libmca_osc_monitoring_la_SOURCES = $(monitoring_sources)
libmca_osc_monitoring_la_LDFLAGS = -module -avoid-version

# C file depends on the header file generated at configure time
$(srcdir)/osc_monitoring_component.c: $(builddir)/osc_monitoring_template_gen.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this rule? You don't have any generating rules, so all it's doing is expressing a dependency, which should be picked up in dependency generation. It's not harmful, but it's odd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expressing the dependency was the only point. I wasn't sure whether or not it would have been expressed otherwise. I think I tried to check that if the generated .h file was edited by hand (should not be, but who knows...) osc_monitoring_component.c was compiled again, and I think it was not. If it should be here, I'll remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can look at the .deps/... file to make sure, but the compiler should generate a dependency line. In general we don't add extra dependency information unless we're adding a build rule. So it's not required, but it's also not harmful. If you can change it, that would be great. But I'm also fine if you change it in a follow-up commit.

@bwbarrett
Copy link
Member

@jsquyres, I think @clementFoyer has addressed your squash concern; mind dismissing your review (or doing another review)?

@clementFoyer
Copy link
Contributor Author

There was a prototype of a function that became useless and needed to be removed. I did it and removed the useless dependency explicitation from Makefile.am

#define MCA_OSC_MONITORING_GEN_TEMPLATE_H

#include <string.h>
#include <ompi_config.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<ompi_config.h> should be first.

AC_CONFIG_COMMANDS([$1],
[filename="$1"
components=`echo "$2" | sed -e 's/,/ /g' -e 's/monitoring//'`
cat <<EOF >$filename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: this indenting is inconsistent with the indenting below (i.e., the indenting below is nice 4-space indenting, but this is much larger than that).

#
AC_DEFUN([mca_ompi_osc_monitoring_generate_templates],[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m4 macros as usually named with upper case, so that we know they are macros.

/* Include template generating macros */
#include "osc_monitoring_template.h"
/**************************************/
/* Include templated macros */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just macros, right? There's also inline functions.

done
echo '' >>[$1]
cat <<EOF >>[$1]
return OMPI_ERR_NOT_SUPPORTED;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option you might want to do here is not generate the inline code here in the .h file, but rather just instantiate an argv-style array of the component names. That's the only thing that needs to be generated (the array of component names), right? Then the code can be beautifully indented in a .c file. 😄

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that iterating over the already-existing ompi_osc_base_framework.framework_components list isn't sufficient such that you need to build your own? I'm sorry I didn't make that clear in my initial review...

Also, you don't need to keep all the commits in this PR.

The 1st one adds some hard-coded component names that you then remove in later commits. That kind of stuff can be squashed together so that we only merge the final/desired/good code (multiple commits on a PR is fine -- it's just a little silly to have ABC code in commit X, and then remove ABC code in commit Y on the same PR. Know what I mean?).

@@ -40,10 +40,10 @@

/* Define and set the ompi_osc_monitoring_## template ##_template
* variable. The functions recorded here are linked to the original
* functions of the original {template} module that were replaced.
* functions of the original {template} module that weas replaced.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

@clementFoyer
Copy link
Contributor Author

Not sure where you ask about the ompi_osc_base_framework.framework_components. Is it the list created in the configure.m4 file or when I iterate in osc_monitoring_component.c?

@jsquyres
Copy link
Member

jsquyres commented Jan 9, 2018

@clementFoyer This list should already exist and be fully populated by the time your monitoring component is opened. I.e., you shouldn't have to do anything to create this list -- you should be able to just use it. It should contain the names of all the components that are available (but not necessarily selected).

@bwbarrett
Copy link
Member

I wouldn't have squashed all the commits into one commit (I disagree with Jeff there). But since you did, you should make the commit message read like one commit.

I'm not sure I understand Jeff's comment about iterating over the framework list, because that's a run-time construct. Since the code needs the list at compile time (to build the templates), that list won't do what you want. Now, there's the question of whether you still need the templates, but I'm pretty sure the answer is yes, because you need to know the size of the module for the underlying transport in order to do the replace logic. So I think this is the best you can do...

@clementFoyer
Copy link
Contributor Author

If needed I still have a version somewhere where everything is not squashed, if that would be required.

Two things about the list : I need the list to omit the monitoring component, and it is used at compilation time as it is used in the macros.

Saddly, I still need the templates as I cannot simply replace the modules in the windows structures. Without it, I cannot know wich was the underlying osc omponent. I originally wanted to be able to replace the module pointers by inheriting from them, but that would require whether to know the size of the module, or to include the other headers, which led to the original issue.

@jsquyres
Copy link
Member

jsquyres commented Jan 9, 2018

Sorry, I wasn't advocating for squashing into a single commit; multiple commits are both fine and good. But I do dislike seeing:

  1. Commit A introduces code XYZ.
  2. Commit B removes code XYZ.

I also agree with Brian -- I forgot to mention that one of your commit messages looks like a squash of 2 commit messages. If you're going to squash, please make the commit message look like a single commit message.

In terms of iterating over the list of components, let me go read the code again; perhaps I misunderstood what it was trying to do...

@jsquyres
Copy link
Member

jsquyres commented Jan 9, 2018

Ok, I understand your code now -- you did it all at compile time with macros and m4 (I think my bias would have been to do it at run time iterating over the framework component list and swapping around function pointers, which is why my brain got confused when I read your code).

@jsquyres
Copy link
Member

jsquyres commented Jan 9, 2018

BTW, I still left the "X" review so that the commits can be cleaned up per Brian's and my comments. Thanks!

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I missed that a new commit was put here to be reviewed. I have super minor nit-pick requests. Other than that, I think this is good.

@@ -94,6 +94,13 @@ static inline int mca_common_monitoring_get_world_rank(int dst, struct ompi_comm
return ret;
}

/* Translate the rank from the given communicator of a process to its rank in MPI_COMM_RANK. */
static inline int mca_common_monitoring_get_world_rank(int dst, struct ompi_communicator_t*comm,
int*world_rank)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why rename this function, and leave a 1-line inline alias function to get to the new function? Why not just call the new function name directly?

#
# Additional copyrights may follow
# Append to file $1 an array of components by listing component names in $2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify this comment: this function does not append to the end of an existing file -- it overwrites file $1.


noinst_LTLIBRARIES = $(component_noinst)
libmca_osc_monitoring_la_SOURCES = $(monitoring_sources)
libmca_osc_monitoring_la_LDFLAGS = -module -avoid-version

distclean-local:
rm -f $(builddir)/osc_monitoring_template_gen.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: you can add this file to DISTCLEANFILES and not have to make a distclean-local rule.

The osc monitoring component needed to include other OSC components
header in order to be able tu access communicator through the
component specific ompi_osc_*_module_t structures. This commit remove
the dependency, and resolve the issue open-mpi#4523.

Extend the common monitoring API.

  * Now it's possible to translate from local rank to world rank from
    both the communicator and the group.
  * Remove useless hashtable as we directly use the w_group contained
    in window structure.

Add automatic generation at config time.

The templates are expanded at configure time. It creates a new header
file that generates all the variables/functions needed. Adding this
during the autogen automagicaly generates for each of the available
modules the proper functions.

Only keep a generated argv-style array.

Following Jeff's advice, the configure.m4 file generate a simple array
of module variables to be iterated over to find the proper module.

Signed-off-by: Clement Foyer <clement.foyer@inria.fr>
@clementFoyer
Copy link
Contributor Author

@jsquyres Every thing seems to pass. The requested changes have been made.

@jsquyres
Copy link
Member

jsquyres commented Feb 8, 2018

@bwbarrett A bunch of things have changed since you reviewed. Care to review again, or are you good / we should merge?

@jsquyres
Copy link
Member

jsquyres commented Feb 9, 2018

I'll take Brian's silence and his prior approval as a positive sign. Let's merge.

Thanks for your patience, @clementFoyer!

@jsquyres jsquyres merged commit e7f91f8 into open-mpi:master Feb 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants