Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions Libraries/oneMKL/batched_linear_solver/License.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
Copyright Intel Corporation

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
92 changes: 92 additions & 0 deletions Libraries/oneMKL/batched_linear_solver/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
# Batched Linear Solver Sample
Solving a batch of linear systems is a common operation in engineering and scientific computing.
Consequently, the oneMKL LAPACK implementation provides batched LU solver functions that are optimized
for Intel processors.

For more information on oneMKL, and complete documentation of all oneMKL routines, see https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html.

| Optimized for | Description
|:--- |:---
| OS | Linux* Ubuntu* 18.04
| Hardware | Intel® Skylake with Gen9 or newer
| Software | Intel® oneMKL, Intel® Fortran Compiler
| What you will learn | How to optimize host-device data transfer when using OpenMP target offload
| Time to complete | 45 minutes

## Purpose
This sample shows how to solve a batch of linear systems using the batched solver functions
([``getrf_batch_strided``](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-1/getrf-batch-strided.html) and [``getrs_batch_strided``](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-1/getrs-batch-strided.html)), how to offload these functions to an accelerator using the OpenMP target directives, and how to minimize host-device data transfer to achieve better performance. The following article provides more detailed descriptions of the sample codes: [Solving Linear Systems Using oneMKL and OpenMP Target Offloading](https://www.intel.com/content/www/us/en/developer/articles/technical/solve-linear-systems-onemkl-openmp-target-offload.html).

## Key Implementation Details
In general, the factored matrices and pivots can be discarded after the linear systems are solved. Only the solutions
need to be transferred back to the host. Two sample codes are provided: `lu_solve_omp_offload.F90` and
`lu_solve_omp_offload_optimized.F90`. The first shows a straightforward way to
dispatch the LU factorization and solver functions using two OpenMP target regions. The code gives correct results
but performs some unnecessary host-device data transfer. The second sample code fuses the two OpenMP regions to
improve performance by minimizing data transfer.

## License
Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.

Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).

## Building and Running the Batched Linear Solver Sample
> **Note**: If you have not already done so, set up your CLI
> environment by sourcing the `setvars` script located in
> the root of your oneAPI installation.
>
> Linux Sudo: . /opt/intel/oneapi/setvars.sh
>
> Linux User: . ~/intel/oneapi/setvars.sh
>
>For more information on environment variables, see [Use the setvars Script for Linux or macOS](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html).

### Running Samples on the DevCloud
When running a sample in the Intel DevCloud, remember that you must specify the compute node (CPU, GPU, FPGA) as well whether to run in batch or interactive mode. For more information see the Intel® oneAPI Base Toolkit Get Started Guide (https://devcloud.intel.com/oneapi/get-started/base-toolkit/).

Run `make` to build and run the sample. Three programs are generated:

1. `lu_solve`: CPU-only, OpenMP disabled
2. `lu_solve_omp_offload`: two OpenMP target regions with accelerator offload enabled
3. `lu_solve_omp_offload_optimized`: one OpenMP target offload region to minimize host-device data transfer

Note that the makefile only runs small tests to verify that the executables are working correctly. The problem sizes are too small to justify accelerator offload. Use the following command-line options to run the tests shown in [Solving Linear Systems Using oneMKL and OpenMP Target Offloading](https://www.intel.com/content/www/us/en/developer/articles/technical/solve-linear-systems-onemkl-openmp-target-offload.html): `-n 16000 -b 8 -r 1 -c 5`.

You can remove all generated files with `make clean`.

### Example of Output
If everything is working correctly, the output should be similar to this:
```
u172874@s001-n157:~/oneAPI-samples/Libraries/oneMKL/batched_linear_solver$ make
ifx lu_solve_omp_offload.F90 -o lu_solve -i8 -free -qmkl
ifx lu_solve_omp_offload.F90 -o lu_solve_omp_offload -i8 -free -qmkl -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl -L/glob/development-tools/versions/oneapi/2023.0.1/oneapi/mkl/2023.0.0/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
ifx lu_solve_omp_offload_optimized.F90 -o lu_solve_omp_offload_optimized -i8 -free -qmkl -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl -L/glob/development-tools/versions/oneapi/2023.0.1/oneapi/mkl/2023.0.0/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
./lu_solve -n 64 -b 8 -r 1 -c 2
Matrix dimensions: 64
Batch size: 8
Number of RHS: 1
Number of test cycles: 2
Computation completed successfully 2.849000000000000E-002 seconds
Computation completed successfully 4.600000000000000E-005 seconds
Total time: 2.853600000000000E-002 seconds
./lu_solve_omp_offload -n 64 -b 8 -r 1 -c 2
Matrix dimensions: 64
Batch size: 8
Number of RHS: 1
Number of test cycles: 2
Computation completed successfully 1.52985400000000 seconds
Computation completed successfully 1.212700000000000E-002 seconds
Total time: 1.54198100000000 seconds
./lu_solve_omp_offload_optimized -n 64 -b 8 -r 1 -c 2
Matrix dimensions: 64
Batch size: 8
Number of RHS: 1
Number of test cycles: 2
Computation completed successfully 1.54803900000000 seconds
Computation completed successfully 1.202200000000000E-002 seconds
Total time: 1.56006100000000 seconds
```

### Troubleshooting
If an error occurs, troubleshoot the problem using the Diagnostics Utility for Intel® oneAPI Toolkits
[Learn more](https://www.intel.com/content/www/us/en/docs/oneapi/user-guide-diagnostic-utility/2023-1/overview.html)
194 changes: 194 additions & 0 deletions Libraries/oneMKL/batched_linear_solver/lu_solve_omp_offload.F90
Original file line number Diff line number Diff line change
@@ -0,0 +1,194 @@
!===============================================================================
! Copyright 2021-2022 Intel Corporation.
!
! This software and the related documents are Intel copyrighted materials, and
! your use of them is governed by the express license under which they were
! provided to you (License). Unless the License provides otherwise, you may not
! use, modify, copy, publish, distribute, disclose or transmit this software or
! the related documents without Intel's prior written permission.
!
! This software and the related documents are provided as is, with no express
! or implied warranties, other than those that are expressly stated in the
! License.
!===============================================================================
!
! Content:
! Intel(R) oneAPI Math Kernel Library (oneMKL)
! FORTRAN OpenMP offload examples for solving batched linear systems.
!
! Compile for CPU:
! ifx -i8 -qmkl -free \
! lu_solve_omp_offload_ex1_timer.F90 -o lu_solve_ex1_timer
!
! Compile for GPU:
! ifx -i8 -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl -free \
! lu_solve_omp_offload_ex1_timer.F90 -o lu_solve_ex1_omp_timer \
! -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 \
! -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
!
! Compile with -DSP to use single precision instead of double precision.
!
!*******************************************************************************

!$ include "mkl_omp_offload.f90"

program solve_batched_linear_systems

! Decide whether to use 32- or 64-bit integer type
#if defined(MKL_ILP64)
!$ use onemkl_lapack_omp_offload_ilp64 ! 64-bit
#else
!$ use onemkl_lapack_omp_offload_lp64 ! 32-bit
#endif

implicit none

integer :: n = 64, batch_size = 4096, nrhs = 1, cycles = 5
integer :: lda, stride_a, stride_ipiv
integer :: ldb, stride_b
integer, allocatable :: ipiv(:,:), info(:)

#if defined(SP)
real (kind=4), allocatable :: a(:,:), b(:,:), a_orig(:,:), b_orig(:,:), x(:)
real (kind=4) :: residual, threshold = 1.0e-5
#else
real (kind=8), allocatable :: a(:,:), b(:,:), a_orig(:,:), b_orig(:,:), x(:)
real (kind=8) :: residual, threshold = 1.0d-9
#endif

integer (kind=8) :: start_time, end_time, clock_precision
real (kind=8) :: cycle_time, total_time = 0.0d0

integer :: i, j, c, allocstat, stat
character (len = 132) :: allocmsg
character (len = 32) :: arg1, arg2

! Simple command-line parser with no error checks
do i = 1, command_argument_count(), 2
call get_command_argument(i, arg1)
select case (arg1)
case ('-n')
call get_command_argument(i+1, arg2)
read(arg2, *, iostat=stat) n
case ('-b')
call get_command_argument(i+1, arg2)
read(arg2, *, iostat=stat) batch_size
case ('-r')
call get_command_argument(i+1, arg2)
read(arg2, *, iostat=stat) nrhs
case ('-c')
call get_command_argument(i+1, arg2)
read(arg2, *, iostat=stat) cycles
case default
print *, 'Unrecognized command-line option:', arg1
stop
end select
enddo
print *, 'Matrix dimensions:', n
print *, 'Batch size:', batch_size
print *, 'Number of RHS:', nrhs
print *, 'Number of test cycles:', cycles

lda = n
stride_a = n * lda
stride_ipiv = n
ldb = n
stride_b = n * nrhs

! Allocate memory for linear algebra computations
allocate (a(stride_a, batch_size), b(n, batch_size*nrhs), &
ipiv(stride_ipiv, batch_size), info(batch_size), &
stat = allocstat, errmsg = allocmsg)
if (allocstat > 0) stop trim(allocmsg)

! Allocate memory for error checking
allocate (a_orig(stride_a, batch_size), b_orig(n, batch_size*nrhs), x(n), &
stat = allocstat, errmsg = allocmsg)
if (allocstat > 0) stop trim(allocmsg)

call system_clock(count_rate = clock_precision)
call random_seed()

do c = 1, cycles
! Initialize the matrices with a random number in the interval (-0.5, 0.5)
call random_number(a)
a = 0.5 - a
! Make diagonal band values larger to ensure well-conditioned matrices
do i = 1, n
a(i+(i-1)*lda,:) = a(i+(i-1)*lda,:) + 50.0
if (i .ne. 1) a(i+(i-1)*lda-1,:) = a(i+(i-1)*lda-1,:) + 20.0
if (i .ne. n) a(i+(i-1)*lda+1,:) = a(i+(i-1)*lda+1,:) + 20.0
enddo

! Initialize the RHS with a random number in the interval (-2.5, 2.5)
call random_number(b)
b = 2.5 - (5.0 * b)
a_orig = a
b_orig = b

call system_clock(start_time) ! Start timer

! Compute the LU factorizations and solve the linear systems using OpenMP offload.
! On entry, "a" contains the input matrices. On exit, it contains the factored matrices.
!$omp target data map(tofrom:a) map(from:ipiv) map(from:info)
!$omp dispatch
#if defined(SP)
call sgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, batch_size, info)
#else
call dgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, batch_size, info)
#endif
!$omp end target data

if (any(info .ne. 0)) then
print *, 'Error: getrf_batch_strided returned with errors.'
stop
else
! Solving the linear systems. On exit, the solutions are stored in b.
!$omp target data map(to:a) map(to:ipiv) map(tofrom: b) map(from:info)
!$omp dispatch
#if defined(SP)
call sgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, ipiv, stride_ipiv, &
b, ldb, stride_b, batch_size, info)
#else
call dgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, ipiv, stride_ipiv, &
b, ldb, stride_b, batch_size, info)
#endif
!$omp end target data

call system_clock(end_time) ! Stop timer

if (any(info .ne. 0)) then
print *, 'Error: getrs_batch_strided returned with errors.'
stop
else

! Compute a_orig*b and compare result to saved RHS
do i = 1, batch_size
do j = 1, nrhs
x = 0.0
#if defined(SP)
call sgemv('N', n, n, 1.0, a_orig(:,i), lda, b(:,(i-1)*nrhs+j), 1, 0.0, x, 1)
#else
call dgemv('N', n, n, 1.0d0, a_orig(:,i), lda, b(:,(i-1)*nrhs+j), 1, 0.0d0, x, 1)
#endif

! Check relative residual
residual = norm2(b_orig(:,(i-1)*nrhs+j) - x(:)) / norm2(b_orig(:,(i-1)*nrhs+j))
if (residual > threshold) then
print *, 'Warning: relative residual of ', residual
endif
enddo
enddo

cycle_time = dble(end_time - start_time) / dble(clock_precision)
print *, 'Computation completed successfully', cycle_time, 'seconds'
total_time = total_time + cycle_time
endif
endif
enddo

print *, 'Total time:', total_time, 'seconds'

! Clean up
deallocate (a, b, a_orig, b_orig, x, ipiv, info)
end program solve_batched_linear_systems
Loading