Skip to content

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Apr 9, 2015

This commit enables the use of opal classes after a call to
opal_class_finalize. This is needed to support re-initializing opal
after calling opal_finalize_util (). This is needed to support the
following without leaking:

MPI_T_init_thread ();
MPI_T_finalize ();

MPI_Init ();
MPI_Finalize ();

Before this commit the above code would crash in MPI_Init because the
constructor array for some class was freed by opal_class_finalize ().

The fix is to turn the cls_initialized member of opal_class_t into an
init epoch identifier and compare it against opal_class_init_epoch
instead of 1. On each call to opal_class_finalize the
opal_class_init_epoch counter is incremented forcing re-initialization
of classes after finalize.

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

This commit enables the use of opal classes after a call to
opal_class_finalize. This is needed to support re-initializing opal
after calling opal_finalize_util (). This is needed to support the
following without leaking:

MPI_T_init_thread ();
MPI_T_finalize ();

MPI_Init ();
MPI_Finalize ();

Before this commit the above code would crash in MPI_Init because the
constructor array for some class was freed by opal_class_finalize ().

The fix is to turn the cls_initialized member of opal_class_t into an
init epoch identifier and compare it against opal_class_init_epoch
instead of 1. On each call to opal_class_finalize the
opal_class_init_epoch counter is incremented forcing re-initialization
of classes after finalize.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Copy link
Member Author

hjelmn commented Apr 9, 2015

@jsquyres, @rolfv

I don't know why I didn't think of this sooner. The fix is simple but does require re-compilation of components. So not possible for 1.8 but will fix the problem going forward.

@bosilca
Copy link
Member

bosilca commented Apr 9, 2015

It looks like an extremely convoluted solution. Why is setting cls->cls_initialized to 0 upon destruction not an acceptable approach?

@hjelmn
Copy link
Member Author

hjelmn commented Apr 9, 2015

@bosilca Because the class may be in a dynamic component that has been dlclose'd. I couldn't find a way around that.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 9, 2015

I still like the idea of having a destructor function in opal. I feel it is a cleaner solution to the problem but I know there is some resistance to the idea. This fixes the underlying problem without a destructor.

@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/424/
Test PASSed.

@bosilca
Copy link
Member

bosilca commented Apr 9, 2015

But if the component has been closed I don't see how your proposed patch fixes the problem.

The destructor approach seems somehow cleaner, but still a stretch for a environment that clearly defines initialization and finalization functions.

Why not extending the class structure to contain a pointer to the component that defines it. This will create a link between the registered classes and the component that registered them. Upon registration a class increases the ref_count of it's parent class, as well as the ref_count of the component holding the class definition. Upon class destruction, we do the opposite. The ref_count on the component will prevent the system from unloading them for as long as they have a class registered.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 9, 2015

If a dlclose'd component is re-loaded its classes will again have cls_initialized set to 0 and they will go through the class initialization. That has always worked and that behavior will not change. This commit ensures that we get the same behavior from classes not in dlclose'd components (opal/class, etc) by ensuring their cls_initialized member is stale without having to see which classes can still be accessed.

As for adding a pointer to the component. That will require changing OBJ_CLASS_INSTANCE to include a pointer to a component (if there is one). That is a lot of work but may be the correct solution in the end.

hjelmn added a commit to hjelmn/ompi that referenced this pull request Apr 14, 2015
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to open-mpi#517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (open-mpi#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit to hjelmn/ompi that referenced this pull request Apr 14, 2015
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to open-mpi#517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (open-mpi#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit to hjelmn/ompi that referenced this pull request Apr 14, 2015
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to open-mpi#517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (open-mpi#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit to hjelmn/ompi that referenced this pull request Apr 14, 2015
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to open-mpi#517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (open-mpi#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit to hjelmn/ompi that referenced this pull request Apr 17, 2015
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to open-mpi#517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (open-mpi#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
bosilca pushed a commit to ICLDisco/ompi that referenced this pull request May 1, 2015
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to open-mpi#517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (open-mpi#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hppritcha
Copy link
Member

Can one of the admins verify this patch?

@rhc54
Copy link
Contributor

rhc54 commented May 11, 2015

Who or what is an "admin"??

@hjelmn
Copy link
Member Author

hjelmn commented May 11, 2015

I would still like to see this make it into master but I would like to verify that there are no performance implications of this change. I doubt that there are but we do have a lot of OBJ_CONSTRUCT calls in the critical path.

bosilca pushed a commit to eddy16112/ompi that referenced this pull request Jun 18, 2015
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to open-mpi#517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (open-mpi#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hppritcha
Copy link
Member

bot:retest

@jsquyres
Copy link
Member

@hjelmn Where are we on this?

@hjelmn
Copy link
Member Author

hjelmn commented Aug 17, 2015

This PR works fine for what it was intended for. The problem is that now that we have an opal destructor it is no longer necessary except in the small number of cases where there is neither a destructor attribute or a fini function. This can happen on OS X with the PGI compiler.

@jsquyres
Copy link
Member

@bosilca What's your current feeling on this one?

@bosilca
Copy link
Member

bosilca commented Aug 18, 2015

It does have it's benefit. Go for it 👍 !

@hjelmn
Copy link
Member Author

hjelmn commented Aug 18, 2015

:bot:retest

hjelmn added a commit that referenced this pull request Aug 31, 2015
opal/class: enable use of opal classes after opal_class_finalize
@hjelmn hjelmn merged commit 3c34f6f into open-mpi:master Aug 31, 2015
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
@hjelmn hjelmn deleted the class_fix branch May 23, 2016 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants