OPFLOW with RAJA/HIOP sparse GPU solvers #8

wperkins · 2023-09-19T13:09:03Z

This adds a new OPFLOW solver (HIOPSPARSEGPU) and model (PBPOLRAJAHIOPSPARSE) to utilize sparse, GPU-based solvers available in HiOp.

wperkins · 2023-09-19T14:56:32Z

With the current state of this branch (ad00d28d), @abhyshr has coded on-host assembly of the Jacobian and Hessian in PETSc matrices. The contents of these matrices are stripped into triplet (row, column, value) format and copied to the GPU. This way, the sparse linear solver can be exercised while the GPU-based Jacobian/Hessian assembly is still being coded.
I've been testing this with the 9-bus case (datafiles/case9/case9mod.m) on deception. The optimization completes and the computed objective appears correct.
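
As a rough illustration of the "stripped into triplet format" step, the sketch below flattens a compressed-sparse-row matrix into (row, column, value) arrays. It is not the code in this branch: `Triplets` and `csr_to_triplets` are hypothetical names, the input is plain CSR arrays rather than a PETSc matrix, and the host-to-GPU copy is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for a matrix in triplet (COO) form.
struct Triplets {
  std::vector<int> rows;
  std::vector<int> cols;
  std::vector<double> vals;
};

// Flatten CSR arrays into triplets. row_ptr has nrows + 1 entries;
// col_idx and values hold the nonzeros of each row back to back.
Triplets csr_to_triplets(const std::vector<int> &row_ptr,
                         const std::vector<int> &col_idx,
                         const std::vector<double> &values) {
  assert(!row_ptr.empty());
  assert(col_idx.size() == values.size());
  assert(static_cast<std::size_t>(row_ptr.back()) == values.size());

  Triplets t;
  t.rows.reserve(values.size());
  t.cols.reserve(values.size());
  t.vals.reserve(values.size());

  const int nrows = static_cast<int>(row_ptr.size()) - 1;
  for (int i = 0; i < nrows; ++i) {
    // Every nonzero stored for row i becomes one (i, column, value) entry.
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
      t.rows.push_back(i);
      t.cols.push_back(col_idx[k]);
      t.vals.push_back(values[k]);
    }
  }
  return t;
}
```

The three output arrays are contiguous, so in a setup like the one described each could be handed to a single host-to-device copy.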

However, OPFLOW crashes in the solution callback as described here:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaad0a8daf9 in __memcpy_ssse3_back () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64 libibverbs-50mlnx1-1.50218.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-50mlnx1-1.50218.x86_64 numactl-libs-2.0.12-5.el7.x86_64 ucx-1.8.0-1.50218.x86_64 ucx-cma-1.8.0-1.50218.x86_64 ucx-ib-1.8.0-1.50218.x86_64 ucx-knem-1.8.0-1.50218.x86_64 ucx-rdmacm-1.8.0-1.50218.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) up
#1  0x00000000005e5b58 in OPFLOWHIOPSPARSEGPUInterface::solution_callback (
    this=0x1da07970, status=hiop::Solve_Success, n=24, xsol=0x2aab64c08e00, 
    z_L=0x2aab64c09e00, z_U=0x2aab64c0a000, m=36, gsol=0x2aab64c14200, 
    lamsol=0x2aab64c17e00, obj_value=4144.4507489105636)
    at /people/d3g096/ExaGo/src/exago/src/opflow/solver/hiop/opflow_hiopsparsegpu.cpp:294
294       memcpy(x, xsol, opflow->nx * sizeof(double));

The SEGV is caused by trying to access the contents of xsol. Accessing the other arrays passed (gsol, lamsol, etc.) also causes a SEGV.

Any thoughts @abhyshr?

abhyshr · 2023-09-23T16:01:38Z

I believe you have already fixed this, right?

Can you squash all the commits?

wperkins · 2023-09-26T14:17:47Z

> > [...]
> > The SEGV is caused by trying to access the contents of xsol. Accessing the other arrays passed (gsol, lamsol, etc.) also causes a SEGV.
> > Any thoughts @abhyshr?
>
> I believe you have already fixed this, right?

Yes, this is fixed.

> Can you squash all the commits?

Done.

wperkins · 2023-09-26T15:19:18Z

I've tested this branch with the 9-bus (datafiles/case9/case9mod.m) and 200-bus (datafiles/case_ACTIVSg200.m) cases. The current sparse GPU-based solver/model produces objectives, bus voltages, and generator power that are very similar to those from the IPOPT solver. This PR should be ready to go. Another PR will be started for on-GPU Jacobian and Hessian assembly.
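
One way to make "very similar" concrete is a relative-difference check between the two solvers' outputs. This is only a sketch: `max_rel_diff` is a hypothetical helper, and nothing in this thread says it is the criterion the ExaGO tests use.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise relative difference between two result vectors
// (for example, bus voltages from two solvers). The denominator is floored
// at 1.0 so that values near zero are compared on an absolute scale.
double max_rel_diff(const std::vector<double> &a,
                    const std::vector<double> &b) {
  assert(a.size() == b.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double scale = std::max({1.0, std::fabs(a[i]), std::fabs(b[i])});
    worst = std::max(worst, std::fabs(a[i] - b[i]) / scale);
  }
  return worst;
}
```

A comparison could then be gated on something like `max_rel_diff(v_gpu, v_ipopt) < tol`, where both vector names and the tolerance are placeholders to be chosen per quantity.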

wperkins · 2023-09-29T14:00:33Z

I have removed the changes to yaml files that I somehow rebased in. I think the branch is now good.

wperkins · 2023-09-29T14:01:27Z

Converted back to draft. A unit test is needed for this solver/model combination.

wperkins · 2023-10-03T16:15:01Z

I've added the FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE unit test. This is just a copy of the FUNCTIONALITY_TEST_OPFLOW_HIOPSPARSE_TOML_TESTSUITE. It runs the 9- and 200-bus cases but skips the 118-bus case, which seems to crash consistently. I have documented the crash in #25.

wperkins · 2023-10-03T16:16:36Z

I don't think this should block this PR. I think it's ready to go.

abhyshr

Looks good to me. Let's have this change in as the first phase and then replace the Jacobian and Hessian with their "true" GPU implementations in the second phase.

abhyshr · 2023-10-10T17:11:59Z

@cameronrutherford: This PR is ready to merge. Can I go ahead and merge it?

* OPFLOW: initial implementation of RAJA/HiOp sparse GPU-based solver
  - WIP - HIOP Sparse solver with GPU model
  - OPFLOW: Started work on support for HIOP sparse solver interface for GPUs. Added a copy of hiop sparse solver interface.
  - OPFLOW: Added model skeleton for GPU sparse version (copying from pbpolrajahiop)
  - Fixed build
  - Did some copy paste to add a test for HIOPSPARSE. This test is not actually functional yet.
  - Started updating the hiopsparse model and solver code.
  - More work on updating the solver and model
  - Added scalar and vector unit tests for model to be used with HIOP sparse solver on GPU
  - Apply cmake lint
  - Fix unit tests.
  - Set the size of array when using Umpire memset.
  - Code formatting
  - Some minor changes to get PBPOLRAJAHIOPSPARSE model code to compile
  - Separate BUS/LINE/GEN/.../Param structs into reusable module
  - Minor edit
  - Rename files
  - Fix typo
  - Use BUS/LINE/GEN/.../Param structs in Raja HiOp Sparse model (compiles)
  - Updating HIOP sparse solver GPU API
  - Completed bounds kernels
  - Completed scalar and vector functions
  - Jacobian and Hessian for sparse model (CPU --> GPU copy)
  - Use correct array lengths in Eq. Jacobian
  - Fix bug in Jacobian.
  - Fix unused variable/parameter errors
  - OPFLOW: rework solution callback for RAJA/HIOP GPU-based solver
  - Formatting changes
* Add unit test for RAJA/HiOp Sparse GPU model (9-bus only)
* Apply pre-commmit fixes
* Add test for 200-bus case
* Apply pre-commmit fixes

Co-authored-by: Abhyankar, Shrirang G <shrirang.abhyankar@pnnl.gov>

wperkins self-assigned this Sep 19, 2023

wperkins marked this pull request as draft September 19, 2023 13:09

wperkins requested a review from pelesh September 19, 2023 13:10

wperkins added the enhancement New feature or request label Sep 19, 2023

wperkins force-pushed the perk/opflow/hiop-sparse-gpu-rebase branch from 1698c37 to ad00d28 Compare September 19, 2023 13:25

wperkins force-pushed the perk/opflow/hiop-sparse-gpu-rebase branch 2 times, most recently from c79a13a to e931694 Compare September 26, 2023 14:16

wperkins requested a review from abhyshr September 26, 2023 15:19

wperkins changed the title ~~DRAFT: OPFLOW with RAJA/HIOP sparse GPU solvers~~ OPFLOW with RAJA/HIOP sparse GPU solvers Sep 26, 2023

wperkins marked this pull request as ready for review September 26, 2023 15:21

wperkins force-pushed the perk/opflow/hiop-sparse-gpu-rebase branch from e931694 to 5a09c02 Compare September 29, 2023 13:52

wperkins marked this pull request as draft September 29, 2023 13:58

wperkins force-pushed the perk/opflow/hiop-sparse-gpu-rebase branch from 000974d to 8da362e Compare October 3, 2023 15:47

wperkins mentioned this pull request Oct 3, 2023

OPFLOW: HIOPSPARSEGPU/PBPOLRAJAHIOPSPARSE crashes for 118-bus case #25

Closed

wperkins marked this pull request as ready for review October 3, 2023 16:15

wperkins force-pushed the perk/opflow/hiop-sparse-gpu-rebase branch from 8da362e to 9002c30 Compare October 6, 2023 15:39

abhyshr approved these changes Oct 6, 2023

Abhyankar, Shrirang G and others added 5 commits October 10, 2023 07:57

Add unit test for RAJA/HiOp Sparse GPU model (9-bus only)

61d95d6

Apply pre-commmit fixes

b4b2a04

Add test for 200-bus case

26d537c

Apply pre-commmit fixes

bb078ce

wperkins force-pushed the perk/opflow/hiop-sparse-gpu-rebase branch from 9002c30 to bb078ce Compare October 10, 2023 14:57

cameronrutherford merged commit 828db06 into develop Oct 10, 2023
10 checks passed

cameronrutherford added this to the 1.6.0 Release milestone Oct 11, 2023

wperkins mentioned this pull request Nov 28, 2023

Sparse GPU OPFLOW Model: Assemble Jacobian and Hessian on device #90

Draft

1 task
