Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
git master acc2a70
(current latest)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a git clone, with --enable-debug.
Please describe the system on which you are running
- Operating system/version: Ubuntu 16.04 x86_64
- Computer hardware: Intel desktop
- Network type: N/A
Details of the problem
A simple no-op test program causes a use-after-free in MPI_Finalize.
Test program:
program a
use mpi
implicit none
integer :: err
call MPI_Init(err)
call MPI_Finalize(err)
end program a
Run with valgrind ./a.out
Valgrind error:
Invalid read of size 4
at 0x5B9FFB4: opal_hash_table_get_value_ptr (opal_hash_table.c:653)
by 0x5BE61F2: find_component (mca_base_component_repository.c:316)
by 0x5BE628D: mca_base_component_repository_release (mca_base_component_repository.c:338)
by 0x5BE786B: mca_base_component_unload (mca_base_components_close.c:46)
by 0x5BE78FF: mca_base_component_close (mca_base_components_close.c:65)
by 0x5BE7987: mca_base_components_close (mca_base_components_close.c:91)
by 0x5BE792E: mca_base_framework_components_close (mca_base_components_close.c:71)
by 0x5C6ACB8: opal_installdirs_base_close (installdirs_base_components.c:171)
by 0x5BF78A2: mca_base_framework_close (mca_base_framework.c:252)
by 0x5BF7C3B: mca_base_framework_close_list (mca_base_framework.c:292)
by 0x5BAF20B: opal_finalize_cleanup_domain (opal_finalize.c:136)
by 0x5BAF36B: opal_finalize_util (opal_finalize.c:151)
Address 0x7605d80 is 2,144 bytes inside a block of size 8,672 free'd
at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x5B9F184: opal_hash_table_destruct (opal_hash_table.c:144)
by 0x5BE5380: opal_obj_run_destructors (opal_object.h:462)
by 0x5BE6D64: mca_base_component_repository_finalize (mca_base_component_repository.c:562)
by 0x5BE3805: mca_base_close (mca_base_close.c:59)
by 0x5BAF20B: opal_finalize_cleanup_domain (opal_finalize.c:136)
by 0x5BAF36B: opal_finalize_util (opal_finalize.c:151)
by 0x57F80E1: ompi_mpi_finalize (ompi_mpi_finalize.c:495)
by 0x58396B0: PMPI_Finalize (pfinalize.c:54)
by 0x4E8423E: PMPI_FINALIZE (pfinalize_f.c:71)
by 0x400A15: MAIN__ (in [...]/a.out)
by 0x400A4C: main (in [...]/a.out)
Block was alloc'd at
at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x5B9F21C: opal_hash_table_init2 (opal_hash_table.c:167)
by 0x5B9F2DB: opal_hash_table_init (opal_hash_table.c:185)
by 0x5BE5FFE: mca_base_component_repository_init (mca_base_component_repository.c:258)
by 0x5BE8272: mca_base_open (mca_base_open.c:172)
by 0x5BAFF34: opal_init_util (opal_init.c:501)
by 0x57F56EE: ompi_mpi_init (ompi_mpi_init.c:428)
by 0x584765B: PMPI_Init (pinit.c:67)
by 0x4E87D3E: mpi_init (pinit_f.c:87)
by 0x400A09: MAIN__ (in [...]/a.out)
by 0x400A4C: main (in [...]/a.out)
The order in which opal_finalize_util (via opal_finalize_cleanup_domain) calls its finalizers is
mca_base_close
opal_dss_close
opal_datatype_finalize
opal_net_finalize
opal_deregister_params
mca_base_var_finalize
opal_util_keyval_parse_finalize
opal_show_help_finalize
opal_malloc_finalize
opal_output_finalize
mca_base_framework_close_list(opal_init_util_frameworks)
The first of these ends up calling mca_base_component_repository_finalize.
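The last one calls mca_base_framework_close on installdirs, which tries to release its components through mca_base_component_repository_release. That lookup (find_component, then opal_hash_table_get_value_ptr) reads the hash table owned by mca_base_component_repository, which has already been destructed.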
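The ordering problem can be sketched with a toy model. This is not Open MPI code: every toy_* name is hypothetical, the repository is reduced to a single "live" flag, and the framework close reports an error where the real code reads freed memory. Swapping the order is shown only to illustrate the dependency, not as a proposed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: true while the component repository is usable. */
static bool repo_live = false;

typedef int (*toy_finalizer_fn)(void);

/* Models mca_base_close: tears down the component repository. */
static int toy_base_close(void) {
    repo_live = false;
    return 0;
}

/* Models mca_base_framework_close_list: releasing components needs the
 * repository. Returns -1 instead of touching destroyed state. */
static int toy_framework_close_list(void) {
    return repo_live ? 0 : -1;
}

/* Runs the finalizers in order. Returns the index of the first one that
 * finds the repository gone, or -1 if all of them succeed. */
int toy_run_finalizers(const toy_finalizer_fn *fns, size_t n) {
    assert(fns != NULL);
    repo_live = true; /* state right after initialization */
    for (size_t i = 0; i < n; i++) {
        if (fns[i]() != 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Order from the report: repository first, frameworks last. */
int toy_reported_order(void) {
    const toy_finalizer_fn order[] = { toy_base_close, toy_framework_close_list };
    return toy_run_finalizers(order, 2);
}

/* Frameworks closed while the repository is still alive. */
int toy_swapped_order(void) {
    const toy_finalizer_fn order[] = { toy_framework_close_list, toy_base_close };
    return toy_run_finalizers(order, 2);
}
```

In this model toy_reported_order() returns 1 (the second finalizer fails) and toy_swapped_order() returns -1 (no failure).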
To confirm this, add the line ht->ht_table = NULL; in the function opal_hash_table_destruct, which changes the use-after-free into a segfault.
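A minimal sketch of the pointer-poisoning idea, using a hypothetical toy_hash_table_t rather than the real opal_hash_table_t. Unlike the actual lookup, which would dereference the NULL pointer and crash, the toy lookup checks for the poisoned pointer so that the stale access can be observed safely.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hash table with a heap-allocated bucket array. */
typedef struct {
    int *table;      /* set to NULL by the destructor */
    size_t capacity;
} toy_hash_table_t;

/* Allocates the bucket array. Returns 0 on success, -1 on allocation failure. */
int toy_table_init(toy_hash_table_t *ht, size_t capacity) {
    assert(ht != NULL);
    ht->table = calloc(capacity, sizeof(int));
    ht->capacity = (ht->table != NULL) ? capacity : 0;
    return (ht->table != NULL) ? 0 : -1;
}

/* Frees the bucket array and poisons the pointer, so a later access
 * fails deterministically instead of silently reading freed memory. */
void toy_table_destruct(toy_hash_table_t *ht) {
    assert(ht != NULL);
    free(ht->table);
    ht->table = NULL;
    ht->capacity = 0;
}

/* Returns a pointer to the slot for key, or NULL if the table has
 * already been destructed (or was never initialized successfully). */
int *toy_table_get(toy_hash_table_t *ht, size_t key) {
    assert(ht != NULL);
    if (ht->table == NULL || ht->capacity == 0) {
        return NULL;
    }
    return &ht->table[key % ht->capacity];
}
```

A lookup before toy_table_destruct returns a valid slot. The same lookup afterwards returns NULL, which makes the stale access visible without relying on valgrind.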
My understanding of the component system is not sufficient to suggest a solution to the problem. The following comment in opal_installdirs_base_open is perhaps related:
/* NTH: Is it ok not to close the components? If not we can add a flag
to mca_base_framework_components_close to indicate not to deregister
variable groups */