
Conversation

@hjelmn
Member

@hjelmn hjelmn commented Apr 7, 2015

This commit fixes the following bugs:

  • opal_output_finalize did not properly set internal state. This
    caused problems when calling the sequence opal_output_init (),
    opal_output_finalize (), opal_output_init ().
  • opal_info support called mca_base_open () but never called the
    matching mca_base_close (). mca_base_open () and mca_base_close ()
    have been updated to use an open count instead of an open flag so
    that mca_base_open () can be called through multiple paths (as may
    be the case when MPI_T is in use).
  • orte_info support did not register opal variables. This could
    cause orte-info to fail to return opal variables.
  • opal_info, orte_info, and ompi_info support have been updated to
    use a register count.
  • When opening the dl framework, a reference-count increment was
    added to keep the framework from being unloaded. The premature
    closing of the framework was a bug in the MCA base that has since
    been corrected, so the increment (and associated decrement) have
    been removed.
  • dl/dlopen did not set the value of
    mca_dl_dlopen_component.filename_suffixes_mca_storage on each call
    to register. Instead, the value was set only in the component
    structure, so it was lost when the component was re-loaded. Fixed
    by setting the default value in register.
  • Reset shmem framework state on close to avoid returning a stale
    component after reloading opal/shmem.
  • MCA base parameters were not properly deregistered when the MCA
    base was closed.

This commit may fix #374.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
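The open-count change described in the second bullet can be sketched as follows. This is a minimal illustration of the pattern, not the actual Open MPI code: the names `example_base_open`, `example_base_close`, and `base_open_count` are hypothetical stand-ins for `mca_base_open ()`, `mca_base_close ()`, and their internal counter.

```c
/* Sketch: reference-counted open/close instead of a boolean open flag.
 * With a flag, a second opener (e.g. via MPI_T) is a no-op and the
 * first close tears everything down underneath it. With a count, only
 * the first open initializes and only the last close tears down. */
#include <assert.h>

static int base_open_count = 0;   /* stand-in for the internal counter */

int example_base_open(void)
{
    if (base_open_count++ > 0) {
        return 0;                 /* already open: just record the caller */
    }
    /* ... one-time initialization would go here ... */
    return 0;
}

int example_base_close(void)
{
    assert(base_open_count > 0);  /* close without a matching open is a bug */
    if (--base_open_count > 0) {
        return 0;                 /* other openers remain; keep state alive */
    }
    /* ... final teardown (deregister parameters, etc.) would go here ... */
    return 0;
}
```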

@hjelmn
Member Author

hjelmn commented Apr 7, 2015

@rolfv Can you verify if this fixes #374?

@jsquyres Please review.

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/418/

Build Log
last 50 lines

[...truncated 20213 lines...]
++ unset COV_HOME
+ return 0
+ set -eE
+ '[' -n 513 -a -f /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_file_418.txt ']'
++ cat /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_file_418.txt
+ gh pr 513 --comment '
* Coverity found 910 errors for all: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_418/output/errors/index.html

* Coverity found 5 errors for oshmem: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_418/output/errors/index.html

* Coverity found 2 errors for fca: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/fca_418/output/errors/index.html'

module.js:340
    throw err;
          ^
Error: Cannot find module '../lib/cmds/help'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/hpc/local/lib/node_modules/gh/bin/gh.js:23:12)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
Build step 'Execute shell' marked build as failure
TAP Reports Processing: START
Looking for TAP results report in workspace using pattern: **/*.tap
Saving reports...
Processing '/var/lib/jenkins/jobs/gh-ompi-master-pr/builds/418/tap-master-files/cov_stat.tap'
Parsing TAP test result [/var/lib/jenkins/jobs/gh-ompi-master-pr/builds/418/tap-master-files/cov_stat.tap].
not ok - coverity detected 910 failures in all_418 # SKIP http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_418/output/errors/index.html
not ok - coverity detected 5 failures in oshmem_418 # TODO http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_418/output/errors/index.html
ok - coverity found no issues for yalla_418
ok - coverity found no issues for mxm_418
not ok - coverity detected 2 failures in fca_418 # TODO http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/fca_418/output/errors/index.html
ok - coverity found no issues for hcoll_418

TAP Reports Processing: FINISH
coverity_for_all    http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_418/output/errors/index.html
coverity_for_oshmem http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_418/output/errors/index.html
coverity_for_fca    http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/fca_418/output/errors/index.html
[copy-to-slave] The build is taking place on the master node, no copy back to the master will take place.
Setting commit status on GitHub for https://api.github.com/repos/open-mpi/ompi/commit/cf2367acc080ef018998e38626d6d40a3e05bca4
[BFA] Scanning build for known causes...

[BFA] Done. 0s
Setting status of c646f2f5dbcfdb287aa4e27c5260d6500f551086 to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/418/ and message: Merged build finished.

Test FAILed.

@hjelmn
Member Author

hjelmn commented Apr 7, 2015

?? Weird error.

bot:retest

@hjelmn
Member Author

hjelmn commented Apr 7, 2015

bot:retest

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/420/

Build Log
last 50 lines

[...truncated 24154 lines...]
              MCA sshmem: parameter "sshmem_base_verbose" (current value: "0", data source: default, level: 8 dev/detail, type: int)
                          Verbosity level for the sshmem framework (0 = no verbosity)
              MCA sshmem: parameter "sshmem_base_start_address" (current value: "4278190080", data source: default, level: 9 dev/all, type: unsigned_long_long, synonyms: memheap_base_start_address)
                          Specify base address for shared memory region
              MCA sshmem: parameter "sshmem_base_backing_file_dir" (current value: "/dev/shm", data source: default, level: 9 dev/all, type: string)
                          Specifies where backing files will be created when mmap is used and shmem_mmap_anonymous set to 0.
              MCA sshmem: parameter "sshmem_sysv_priority" (current value: "30", data source: default, level: 3 user/all, type: int)
                          Priority for the sshmem sysv component (default: 30)
              MCA sshmem: parameter "sshmem_sysv_use_hp" (current value: "-1", data source: default, level: 4 tuner/basic, type: int)
                          Huge pages usage [0 - off, 1 - on, -1 - auto] (default: -1)
              MCA sshmem: parameter "sshmem_mmap_priority" (current value: "20", data source: default, level: 3 user/all, type: int)
                          Priority for sshmem mmap component (default: 20)
              MCA sshmem: parameter "sshmem_mmap_anonymous" (current value: "1", data source: default, level: 4 tuner/basic, type: int)
                          Select whether anonymous sshmem is used for mmap component (default: 1)
              MCA sshmem: parameter "sshmem_mmap_fixed" (current value: "1", data source: default, level: 4 tuner/basic, type: int)
                          Select whether fixed start address is used for shmem (default: 1)
              MCA sshmem: parameter "sshmem_verbs_priority" (current value: "40", data source: default, level: 3 user/all, type: int)
                          Priority for sshmem verbs component (default: 40)
              MCA sshmem: parameter "sshmem_verbs_hca_name" (current value: "", data source: default, level: 3 user/all, type: string, synonyms: memheap_base_hca_name)
                          Preferred hca (default: the first)
              MCA sshmem: parameter "sshmem_verbs_mr_interleave_factor" (current value: "2", data source: default, level: 3 user/all, type: int, synonyms: memheap_base_mr_interleave_factor)
                          try to give at least N Gbytes spaces between mapped memheaps of other PEs that are local to me (default: 2)
              MCA sshmem: parameter "sshmem_verbs_shared_mr" (current value: "-1", data source: default, level: 4 tuner/basic, type: int)
                          Shared memory region usage [0 - off, 1 - on, -1 - auto] (default: -1)
oshmem_info: runtime/opal_info_support.c:322: opal_info_close_components: Assertion `opal_info_registered' failed.
./jenkins_scripts/jenkins/ompi/ompi_jenkins.sh: line 188:  1836 Segmentation fault      (core dumped) oshmem_info -a -l 9
Build step 'Execute shell' marked build as failure
TAP Reports Processing: START
Looking for TAP results report in workspace using pattern: **/*.tap
Saving reports...
Processing '/var/lib/jenkins/jobs/gh-ompi-master-pr/builds/420/tap-master-files/cov_stat.tap'
Parsing TAP test result [/var/lib/jenkins/jobs/gh-ompi-master-pr/builds/420/tap-master-files/cov_stat.tap].
not ok - coverity detected 910 failures in all_420 # SKIP http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_420/output/errors/index.html
not ok - coverity detected 5 failures in oshmem_420 # TODO http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_420/output/errors/index.html
ok - coverity found no issues for yalla_420
ok - coverity found no issues for mxm_420
not ok - coverity detected 2 failures in fca_420 # TODO http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/fca_420/output/errors/index.html
ok - coverity found no issues for hcoll_420

TAP Reports Processing: FINISH
coverity_for_all    http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_420/output/errors/index.html
coverity_for_oshmem http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_420/output/errors/index.html
coverity_for_fca    http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/fca_420/output/errors/index.html
[copy-to-slave] The build is taking place on the master node, no copy back to the master will take place.
Setting commit status on GitHub for https://api.github.com/repos/open-mpi/ompi/commit/fbf7c860b5064e293f1ceff7a175bbd177650ef3
[BFA] Scanning build for known causes...

[BFA] Done. 0s
Setting status of c646f2f5dbcfdb287aa4e27c5260d6500f551086 to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/420/ and message: Merged build finished.

Test FAILed.

@hjelmn
Member Author

hjelmn commented Apr 8, 2015

Looks like I have to correct something in oshmem as well.
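The `opal_info_registered` assertion in the log above suggests close ran without a matching register. A minimal sketch of the register-count idea behind the fix, with hypothetical names (`info_registered`, `example_info_register_components`, `example_info_close_components`) standing in for the real code in `opal/runtime/opal_info_support.c`:

```c
/* Sketch: components are registered once per opener and only
 * deregistered when the last opener closes. The assert on close
 * mirrors the failed assertion in the oshmem_info log above. */
#include <assert.h>

static int info_registered = 0;   /* stand-in for opal_info_registered */

void example_info_register_components(void)
{
    if (info_registered++ > 0) {
        return;                   /* already registered by another path */
    }
    /* ... register framework components once ... */
}

void example_info_close_components(void)
{
    assert(info_registered > 0);  /* closing without registering is a bug */
    if (--info_registered > 0) {
        return;                   /* other users remain */
    }
    /* ... close and deregister components ... */
}
```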

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/421/
Test PASSed.

@rolfv

rolfv commented Apr 8, 2015

@hjelmn Just tested this and it does indeed fix the issue reported in #374.

@hjelmn
Member Author

hjelmn commented Apr 8, 2015

I am going to go ahead and merge this and ask Jeff to review the 1.8 PR.

hjelmn added a commit that referenced this pull request Apr 8, 2015
opal: fix multiple bugs in MCA and opal
@hjelmn hjelmn merged commit eb56117 into open-mpi:master Apr 8, 2015
jsquyres added a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
@hjelmn hjelmn deleted the mca_bug_fixes branch May 23, 2016 17:44
markalle pushed a commit to markalle/ompi that referenced this pull request Sep 12, 2020


Development

Successfully merging this pull request may close these issues.

MPI_T_finalize() gets SEGV when called
