oneapi-src · IgorOchocki · May 17, 2023 · May 3, 2023
diff --git a/Libraries/oneMKL/batched_linear_solver/License.txt b/Libraries/oneMKL/batched_linear_solver/License.txt
@@ -0,0 +1,7 @@
+Copyright Intel Corporation
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/Libraries/oneMKL/batched_linear_solver/README.md b/Libraries/oneMKL/batched_linear_solver/README.md
@@ -0,0 +1,92 @@
+# Batched Linear Solver Sample
+Solving a batch of linear systems is a common operation in engineering and scientific computing.
+Consequently, the oneMKL LAPACK implementation provides batched LU solver functions that are optimized
+for Intel processors.
+
+For more information on oneMKL, and complete documentation of all oneMKL routines, see https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html.
+
+| Optimized for       | Description
+|:---                 |:---
+| OS                  | Linux* Ubuntu* 18.04
+| Hardware            | Intel&reg; Skylake with Gen9 or newer
+| Software            | Intel&reg; oneMKL, Intel&reg; Fortran Compiler
+| What you will learn | How to optimize host-device data transfer when using OpenMP target offload
+| Time to complete    | 45 minutes
+
+## Purpose
+This sample shows how to solve a batch of linear systems using the batched solver functions
+([``getrf_batch_strided``](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-1/getrf-batch-strided.html) and [``getrs_batch_strided``](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-1/getrs-batch-strided.html)), how to offload these functions to an accelerator using the OpenMP target directives, and how to minimize host-device data transfer to achieve better performance. The following article provides more detailed descriptions of the sample codes: [Solving Linear Systems Using oneMKL and OpenMP Target Offloading](https://www.intel.com/content/www/us/en/developer/articles/technical/solve-linear-systems-onemkl-openmp-target-offload.html).
+
+## Key Implementation Details
+In general, the factored matrices and pivots can be discarded after the linear systems are solved. Only the solutions
+need to be transferred back to the host. Two sample codes are provided: `lu_solve_omp_offload.F90` and
+`lu_solve_omp_offload_optimized.F90`. The first shows a straightforward way to
+dispatch the LU factorization and solver functions using two OpenMP target regions. The code gives correct results
+but performs some unnecessary host-device data transfer. The second sample code fuses the two OpenMP regions to
+improve performance by minimizing data transfer.
+
+## License
+Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.
+
+Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).
+
+## Building and Running the Batched Linear Solver Sample
+> **Note**: If you have not already done so, set up your CLI
+> environment by sourcing  the `setvars` script located in
+> the root of your oneAPI installation.
+>
+> Linux Sudo: . /opt/intel/oneapi/setvars.sh
+>
+> Linux User: . ~/intel/oneapi/setvars.sh
+>
+>For more information on environment variables, see [Use the setvars Script for Linux or macOS](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html).
+
+### Running Samples on the DevCloud
+When running a sample in the Intel DevCloud, remember that you must specify the compute node (CPU, GPU, FPGA) as well whether to run in batch or interactive mode. For more information see the Intel® oneAPI Base Toolkit Get Started Guide (https://devcloud.intel.com/oneapi/get-started/base-toolkit/).
+
+Run `make` to build and run the sample. Three programs are generated: 
+
+1. `lu_solve`: CPU-only, OpenMP disabled
+2. `lu_solve_omp_offload`: two OpenMP target regions with accelerator offload enabled
+3. `lu_solve_omp_offload_optimized`: one OpenMP target offload region to minimize host-device data transfer
+
+Note that the makefile only runs small tests to verify that the executables are working correctly. The problem sizes are too small to justify accelerator offload. Use the following command-line options to run the tests shown in [Solving Linear Systems Using oneMKL and OpenMP Target Offloading](https://www.intel.com/content/www/us/en/developer/articles/technical/solve-linear-systems-onemkl-openmp-target-offload.html): `-n 16000 -b 8 -r 1 -c 5`.
+
+You can remove all generated files with `make clean`.
+
+### Example of Output
+If everything is working correctly, the output should be similar to this:
+```
+u172874@s001-n157:~/oneAPI-samples/Libraries/oneMKL/batched_linear_solver$ make
+ifx lu_solve_omp_offload.F90 -o lu_solve -i8 -free -qmkl
+ifx lu_solve_omp_offload.F90 -o lu_solve_omp_offload -i8 -free -qmkl -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl -L/glob/development-tools/versions/oneapi/2023.0.1/oneapi/mkl/2023.0.0/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
+ifx lu_solve_omp_offload_optimized.F90 -o lu_solve_omp_offload_optimized -i8 -free -qmkl -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl -L/glob/development-tools/versions/oneapi/2023.0.1/oneapi/mkl/2023.0.0/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
+./lu_solve -n 64 -b 8 -r 1 -c 2
+ Matrix dimensions:                    64
+ Batch size:                     8
+ Number of RHS:                     1
+ Number of test cycles:                     2
+ Computation completed successfully  2.849000000000000E-002 seconds
+ Computation completed successfully  4.600000000000000E-005 seconds
+ Total time:  2.853600000000000E-002 seconds
+./lu_solve_omp_offload -n 64 -b 8 -r 1 -c 2
+ Matrix dimensions:                    64
+ Batch size:                     8
+ Number of RHS:                     1
+ Number of test cycles:                     2
+ Computation completed successfully   1.52985400000000      seconds
+ Computation completed successfully  1.212700000000000E-002 seconds
+ Total time:   1.54198100000000      seconds
+./lu_solve_omp_offload_optimized -n 64 -b 8 -r 1 -c 2
+ Matrix dimensions:                    64
+ Batch size:                     8
+ Number of RHS:                     1
+ Number of test cycles:                     2
+ Computation completed successfully   1.54803900000000      seconds
+ Computation completed successfully  1.202200000000000E-002 seconds
+ Total time:   1.56006100000000      seconds
+```
+
+### Troubleshooting
+If an error occurs, troubleshoot the problem using the Diagnostics Utility for Intel® oneAPI Toolkits
+[Learn more](https://www.intel.com/content/www/us/en/docs/oneapi/user-guide-diagnostic-utility/2023-1/overview.html)
diff --git a/Libraries/oneMKL/batched_linear_solver/lu_solve_omp_offload.F90 b/Libraries/oneMKL/batched_linear_solver/lu_solve_omp_offload.F90
@@ -0,0 +1,194 @@
+!===============================================================================
+! Copyright 2021-2022 Intel Corporation.
+!
+! This software and the related documents are Intel copyrighted  materials,  and
+! your use of  them is  governed by the  express license  under which  they were
+! provided to you (License).  Unless the License provides otherwise, you may not
+! use, modify, copy, publish, distribute,  disclose or transmit this software or
+! the related documents without Intel's prior written permission.
+!
+! This software and the related documents  are provided as  is,  with no express
+! or implied  warranties,  other  than those  that are  expressly stated  in the
+! License.
+!===============================================================================
+!
+!  Content:
+!      Intel(R) oneAPI Math Kernel Library (oneMKL)
+!      FORTRAN OpenMP offload examples for solving batched linear systems.
+!
+! Compile for CPU:
+!     ifx -i8 -qmkl -free \
+!         lu_solve_omp_offload_ex1_timer.F90 -o lu_solve_ex1_timer
+!
+! Compile for GPU:
+!     ifx -i8 -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl -free \
+!         lu_solve_omp_offload_ex1_timer.F90 -o lu_solve_ex1_omp_timer \
+!         -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 \
+!         -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
+!
+! Compile with -DSP to use single precision instead of double precision.
+!
+!*******************************************************************************
+
+!$ include "mkl_omp_offload.f90"
+
+program solve_batched_linear_systems
+
+! Decide whether to use 32- or 64-bit integer type
+#if defined(MKL_ILP64)
+!$  use onemkl_lapack_omp_offload_ilp64   ! 64-bit
+#else
+!$  use onemkl_lapack_omp_offload_lp64    ! 32-bit
+#endif
+
+    implicit none
+
+    integer              :: n = 64, batch_size = 4096, nrhs = 1, cycles = 5
+    integer              :: lda, stride_a, stride_ipiv
+    integer              :: ldb, stride_b
+    integer, allocatable :: ipiv(:,:), info(:)
+
+#if defined(SP)
+    real (kind=4), allocatable :: a(:,:), b(:,:), a_orig(:,:), b_orig(:,:), x(:)
+    real (kind=4)              :: residual, threshold = 1.0e-5
+#else
+    real (kind=8), allocatable :: a(:,:), b(:,:), a_orig(:,:), b_orig(:,:), x(:)
+    real (kind=8)              :: residual, threshold = 1.0d-9
+#endif
+
+    integer (kind=8) :: start_time, end_time, clock_precision
+    real    (kind=8) :: cycle_time, total_time = 0.0d0
+
+    integer               :: i, j, c, allocstat, stat
+    character (len = 132) :: allocmsg
+    character (len =  32) :: arg1, arg2
+
+    ! Simple command-line parser with no error checks
+    do i = 1, command_argument_count(), 2
+        call get_command_argument(i, arg1)
+        select case (arg1)
+            case ('-n')
+                call get_command_argument(i+1, arg2)
+                read(arg2, *, iostat=stat) n
+            case ('-b')
+                call get_command_argument(i+1, arg2)
+                read(arg2, *, iostat=stat) batch_size
+            case ('-r')
+                call get_command_argument(i+1, arg2)
+                read(arg2, *, iostat=stat) nrhs
+            case ('-c')
+                call get_command_argument(i+1, arg2)
+                read(arg2, *, iostat=stat) cycles
+            case default
+                print *, 'Unrecognized command-line option:', arg1
+                stop
+        end select
+    enddo
+    print *, 'Matrix dimensions:', n
+    print *, 'Batch size:', batch_size
+    print *, 'Number of RHS:', nrhs
+    print *, 'Number of test cycles:', cycles
+
+    lda         = n
+    stride_a    = n * lda
+    stride_ipiv = n
+    ldb         = n
+    stride_b    = n * nrhs
+
+    ! Allocate memory for linear algebra computations
+    allocate (a(stride_a, batch_size), b(n, batch_size*nrhs),  &
+              ipiv(stride_ipiv, batch_size), info(batch_size), &
+              stat = allocstat, errmsg = allocmsg)
+    if (allocstat > 0) stop trim(allocmsg)
+
+    ! Allocate memory for error checking
+    allocate (a_orig(stride_a, batch_size), b_orig(n, batch_size*nrhs), x(n), &
+              stat = allocstat, errmsg = allocmsg)
+    if (allocstat > 0) stop trim(allocmsg)
+
+    call system_clock(count_rate = clock_precision)
+    call random_seed()
+
+    do c = 1, cycles
+        ! Initialize the matrices with a random number in the interval (-0.5, 0.5)
+        call random_number(a)
+        a = 0.5 - a
+        ! Make diagonal band values larger to ensure well-conditioned matrices
+        do i = 1, n
+            a(i+(i-1)*lda,:) = a(i+(i-1)*lda,:) + 50.0
+            if (i .ne. 1) a(i+(i-1)*lda-1,:) = a(i+(i-1)*lda-1,:) + 20.0
+            if (i .ne. n) a(i+(i-1)*lda+1,:) = a(i+(i-1)*lda+1,:) + 20.0
+        enddo
+
+        ! Initialize the RHS with a random number in the interval (-2.5, 2.5)
+        call random_number(b)
+        b = 2.5 - (5.0 * b)
+        a_orig = a
+        b_orig = b
+
+        call system_clock(start_time)   ! Start timer
+
+        ! Compute the LU factorizations and solve the linear systems using OpenMP offload.
+        ! On entry, "a" contains the input matrices. On exit, it contains the factored matrices.
+        !$omp target data map(tofrom:a) map(from:ipiv) map(from:info)
+            !$omp dispatch
+#if defined(SP)
+            call sgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, batch_size, info)
+#else
+            call dgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, batch_size, info)
+#endif
+        !$omp end target data
+
+        if (any(info .ne. 0)) then
+            print *, 'Error: getrf_batch_strided returned with errors.'
+            stop
+        else
+            ! Solving the linear systems. On exit, the solutions are stored in b.
+            !$omp target data map(to:a) map(to:ipiv) map(tofrom: b) map(from:info)
+                !$omp dispatch
+#if defined(SP)
+                call sgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, ipiv, stride_ipiv, &
+                                                        b, ldb, stride_b, batch_size, info)
+#else
+                call dgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, ipiv, stride_ipiv, &
+                                                        b, ldb, stride_b, batch_size, info)
+#endif
+            !$omp end target data
+
+            call system_clock(end_time)   ! Stop timer
+
+            if (any(info .ne. 0)) then
+                print *, 'Error: getrs_batch_strided returned with errors.'
+                stop
+            else
+
+                ! Compute a_orig*b and compare result to saved RHS
+                do i = 1, batch_size
+                    do j = 1, nrhs
+                        x = 0.0
+#if defined(SP)
+                        call sgemv('N', n, n, 1.0, a_orig(:,i), lda, b(:,(i-1)*nrhs+j), 1, 0.0, x, 1)
+#else
+                        call dgemv('N', n, n, 1.0d0, a_orig(:,i), lda, b(:,(i-1)*nrhs+j), 1, 0.0d0, x, 1)
+#endif
+
+                        ! Check relative residual
+                        residual = norm2(b_orig(:,(i-1)*nrhs+j) - x(:)) / norm2(b_orig(:,(i-1)*nrhs+j))
+                        if (residual > threshold) then
+                            print *, 'Warning: relative residual of ', residual
+                        endif
+                    enddo
+                enddo
+
+                cycle_time = dble(end_time - start_time) / dble(clock_precision)
+                print *, 'Computation completed successfully', cycle_time, 'seconds'
+                total_time = total_time + cycle_time
+            endif
+        endif
+    enddo
+
+    print *, 'Total time:', total_time, 'seconds'
+
+    ! Clean up
+    deallocate (a, b, a_orig, b_orig, x, ipiv, info)
+end program solve_batched_linear_systems