This repo contains examples demonstrating our SMT-based ArrayIndexAnalysis for PSyclone. The analysis itself is defined in array_index_analysis.py, which contains a fairly detailed description in the comments.
Recursively clone the PSyclone branch containing the analysis:
$ git clone --recursive \
-b mn416-smt-array-index-analysis \
https://github.com/stfc/PSyclone
Apply a patch to the fparser submodule to support Fortran 2008 bit-shifting intrinsics, which are useful in some of the examples:
$ (cd PSyclone/external/fparser/ && git apply ../../../fparser.patch)
Enter a docker shell providing all the necessary dependencies:
$ ./docker-shell.sh
The reverse.f90 example contains the following code.
subroutine reverse(arr)
integer, intent(inout) :: arr(:)
integer :: i, n, tmp
n = size(arr)
do i = 1, n/2
tmp = arr(i)
arr(i) = arr(n+1-i)
arr(n+1-i) = tmp
end do
end subroutineEach iteration i writes to elements i and n+1-i at opposite ends of the
array but i never passes the midpoint n/2, hence the loop is conflict free
(with respect to array accesses).
The program can be analysed with the command
$ psyclone -s analyse.py -o /dev/null examples/reverse.f90
which yields the following output.
Routine: reverse
Loop i: conflict free
If we change the upper bound of i from n/2 to n we get:
Routine: reverse
Loop i: conflicts
The routine in odd_even_trans.f90, from Knuth's Odd/Even Transposition Sorter, is as follows.
subroutine odd_even_transposition(arr, start)
integer , intent(inout) :: arr(:)
integer, intent(in) :: start
integer :: i, tmp
do i = start, size(arr)-1, 2
if (arr(i) > arr(i+1)) then
tmp = arr(i+1)
arr(i+1) = arr(i)
arr(i) = tmp
end if
end do
end subroutineIt swaps adjacent array elements from a given starting point, but the loop has a step size of 2 making it conflict free.
The analysis gives:
Routine: odd_even_transposition
Loop i: conflict free
The routine in matmul.f90 is an implementation of matrix multplication that has been tiled using PSyclone's LoopTilingTrans.
subroutine my_matmul(a, b, c)
integer, dimension(:,:), intent(in) :: a
integer, dimension(:,:), intent(in) :: b
integer, dimension(:,:), intent(out) :: c
integer :: x, y, k, k_tile, x_tile, y_tile
c(:,:) = 0
do y_tile = 1, size(a, 2), 8
do x_tile = 1, size(b, 1), 8
do k_tile = 1, size(a, 1), 8
do y = y_tile, min(y_tile + (8 - 1), size(a, 2)), 1
do x = x_tile, min(x_tile + (8 - 1), size(b, 1)), 1
do k = k_tile, min(k_tile + (8 - 1), size(a, 1)), 1
c(x,y) = c(x,y) + a(k,y) * b(x,k)
enddo
enddo
enddo
enddo
enddo
enddo
end subroutine my_matmulUnfortunately, after tiling, PSyclone is unable to detect the loops that can be parallelised. It gives the error:
Error: The write access to 'c(x,y)' and the read access to 'c(x,y)' are dependent and cannot be parallelised.
Handling this example was one of the main motivations for our analysis, which gives the following output.
Routine: my_matmul
Loop y_out_var: conflict free
Loop x_out_var: conflict free
Loop k_out_var: conflicts
Loop y: conflict free
Loop x: conflict free
Loop k: conflicts
Handling tiled/chunked loops was one of the motivations for this work, so it is
worth exploring this area in more detail. Unlike in the matmul.f90 example,
the chunk size in chunking.f90 is not statically
known:
module chunking_example
contains
subroutine chunking(arr, chunk_size)
integer, dimension(:), intent(inout) :: arr
integer, intent(in) :: chunk_size
integer :: n, chunk_begin, chunk_end
n = size(arr)
do chunk_begin = 1, n, chunk_size
chunk_end = min(chunk_begin+chunk_size-1, n)
call modify(arr(chunk_begin:chunk_end))
end do
end subroutine
pure subroutine modify(a)
integer, intent(inout) :: a(:)
end subroutine
end moduleIn this example, the i loop passes a different slice to the pure modify
routine on each iteration, and is conflict free:
$ psyclone -s analyse.py -o /dev/null examples/chunking.f90
Routine chunking:
Loop chunk_begin: conflict free
Routine sub:
If we change change the value of chunk_end to
chunk_end = min(chunk_begin+chunk_size, n)then the slices become overlapping between iterations, and the analysis reports a conflict.
The flatten.f90 example contains the following routine.
subroutine flatten(mat, arr)
real, intent(in) :: mat(0:,0:)
real, intent(out) :: arr(0:)
integer :: x, y
integer :: nx, ny
nx = size(mat, 1)
ny = size(mat, 2)
do y = 0, ny-1
do x = 0, nx-1
arr(nx * y + x) = mat(x, y)
end do
end do
end subroutineThe analysis reports:
Routine: flatten
Loop y: conflict free
Loop x: conflict free
Note that loop y can actually conflict in practice, if integer overflow
leads to wrap-around. However, integer overflow is undefined behaviour in
Fortran, so the analysis ignores it.
The routine in gauss_jordan.f90 is taken from a tutorial on numerical methods.
subroutine gauss_jordan(a, n)
real, intent(inout) :: a(:,:)
integer, intent(in) :: n
integer :: i, j, k
real :: ratio
do k = 1, n
do i = 1, n
if (i /= k) then
ratio = a(i,k) / a(k,k)
do j=1,n+1
a(i,j) = a(i,j) - a(k,j)*ratio
end do
end if
end do
end do
end subroutineThe interesting question here is whether the i loop is parallelisable. Each
iteration writes a(i,j) (for all j) and reads a(k,j) (for all j and a
fixed k) but, due to the if condition i /= k, these accesses are
non-overlapping. The analysis infers this:
$ psyclone -s analyse.py -o /dev/null examples/gauss_jordan.f90
Routine: gauss_jordan
Loop k: conflicts
Loop i: conflict free
Loop j: conflict free
As a more complex example, oem_sort.f90 contains Batcher's Odd/Even Merge Sort transcribed to Fortran from Wikepedia's Pseudocode.
subroutine odd_even_merge_sort(arr, log_n)
integer, intent(inout) :: arr(0:)
integer, intent(in) :: log_n
integer :: p, k, j, i, idx1, idx2, log_p, log_k, tmp, n
n = shiftl(1, log_n)
do log_p = 0, log_n-1
p = shiftl(1, log_p)
do log_k = log_p, 0, -1
k = shiftl(1, log_k)
do j = mod(k, p), n-1-k, 2*k
do i = 0, min(k-1, n-j-k-1)
idx1 = i+j
idx2 = i+j+k
if (idx1 / (p*2) == idx2 / (p*2)) then
if (arr(idx1) > arr(idx2)) then
tmp = arr(idx1)
arr(idx1) = arr(idx2)
arr(idx2) = tmp
end if
end if
end do
end do
end do
end do
end subroutineInside the j loop, each array element idx1 is considered for swapping with
its partner element idx2. The partner pairs are unique and hence
non-conflicting. The arthmetic to compute a partner pairs is quite complex
but the analysis can handle it:
$ psyclone -s analyse.py -o /dev/null examples/oem_sort.f90
Routine: odd_even_merge_sort
Loop log_p: conflicts
Loop log_k: conflicts
Loop j: conflict free
Loop i: conflict freeThe routine in bitonic_sort.f90 is similar to the above example and has been ported to Fortran from C code in a tutorial on sorting.
subroutine bitonic_sort(arr, log_n)
integer, intent(inout) :: arr(0:)
integer, intent(in) :: log_n
integer :: i, j, k, l, log_j, log_k, tmp
do log_k = 1, log_n
k = shiftl(1, log_k)
do log_j = log_k-1, 0, -1
j = shiftl(1, log_j)
do i = 0, shiftl(1, log_n) - 1
l = ieor(i, j)
if (l > i) then
if ((iand(i, k) == 0 .and. arr(i) > arr(l)) .or. &
(iand(i, k) /= 0 .and. arr(i) < arr(l))) then
tmp = arr(i)
arr(i) = arr(l)
arr(l) = tmp
end if
end if
end do
end do
end do
end subroutineThe analysis can also handle this example:
$ psyclone -s analyse.py -o /dev/null examples/bitonic_sort.f90
Routine: bitonic_sort
Loop log_k: conflicts
Loop log_j: conflicts
Loop i: conflict free
The routine in parallel_prefix.f90 computes the parallel prefix sum of an array.
subroutine parallel_prefix_sum(arr, chunk_size)
integer, intent(inout) :: arr(0:)
integer, intent(in) :: chunk_size
integer :: inc_by(0:chunk_size-1)
integer :: n, chunk_begin, chunk_end, chunk_id, i, acc
n = size(arr)
do chunk_begin = 0, n-1, chunk_size
chunk_end = min(chunk_begin + chunk_size - 1, n-1)
acc = 0
do i = chunk_begin, chunk_end
acc = acc + arr(i)
arr(i) = acc
end do
end do
acc = 0
do chunk_begin = 0, n-1, chunk_size
chunk_end = min(chunk_begin + chunk_size - 1, n-1)
chunk_id = chunk_begin / chunk_size
inc_by(chunk_id) = acc
acc = acc + arr(chunk_end)
end do
do chunk_begin = chunk_size, n-1, chunk_size
chunk_end = min(chunk_begin + chunk_size - 1, n-1)
chunk_id = chunk_begin / chunk_size
do i = chunk_begin, chunk_end
arr(i) = arr(i) + inc_by(chunk_id)
end do
end do
end subroutineThe idea is to set chunk_size to the number of available threads, and to
parallelise the first and final chunk_begin loops, which the analysis finds
are conflict free:
$ psyclone -s analyse.py -o /dev/null examples/parallel_prefix.f90
Routine: parallel_prefix_sum
Loop chunk_begin: conflict free
Loop i: conflict free
Loop chunk_begin: conflict free
Loop chunk_begin: conflict free
Loop i: conflict free
Note that analyse.py only considers array accesses. The first i loop and
the first chunk_begin loop both contain scalar conflicts, which are handled
by the existing DependencyTools. We can use parallelise.py instead of
analyse.py to explore DependencyTools and ArrayIndexAnalysis in
combination:
$ USE_SMT=yes psyclone -s parallelise.py -o /dev/null examples/parallel_prefix.f90
Routine parallel_prefix_sum:
Loop chunk_begin: conflict free
Loop i: conflicts
Loop chunk_begin: conflicts
Loop chunk_begin: conflict free
Loop i: conflict free
The ArrayIndexAnalysis can run in various modes:
-
Integer: Fortran integers are interepreted as arbitrary precision integers by the SMT solver. -
32 Bit: Fortran integers are interepreted as 32-bit bit vectors by the solver, and bit-vector overflow is explicitly ignored. -
8 Bit: Fortran integers are interepreted as 8-bit bit vectors by the solver, and bit-vector overflow is explicitly ignored. -
Default: same asIntegermode unless the routine enclosing the loop uses bit vector operations (shift/bitwise intrinsics), in which case it is the same as32 Bitmode.
The following table summarises the results of the ArrayIndexAnalysis in the
various modes. We use "yes" to mean that the solver succeeds in finding all
the non-conflicting loops, and a blank box to mean that it doesn't.
| Default | Integer | 32 Bit | 8 Bit | |
|---|---|---|---|---|
reverse |
yes | yes | yes | yes |
odd_even_trans |
yes | yes | yes | yes |
matmul |
yes | yes | yes | yes |
chunking |
yes | yes | yes | |
flatten |
yes | yes | yes | |
gauss_jordan |
yes | yes | yes | yes |
oem_sort |
yes | yes | yes | |
bitonic_sort |
yes | yes | yes | |
parallel_prefix |
yes | yes | yes |
In addition to these examples, there are a number of simple examples in
simple.f90 in which the existing analysis fails but
ArrayIndexAnalysis succeeds:
$ USE_SMT=yes psyclone -s parallelise.py -o /dev/null examples/simple.f90
Routine copy_array:
Loop i: conflict free
Routine injective_index:
Loop i: conflict free
Routine double_inner_loop:
Loop i: conflict free
Loop j: conflict free
Loop k: conflict free
Routine last_iteration:
Loop i: conflict free
Routine invariant_if:
Loop i: conflict free
Routine one_elem_slice:
Loop i: conflict free
Routine extend_array:
Loop i: conflict free
Routine main:
Loop i: conflict free
Routine modify:
Routine main:
Loop i: conflict free
Routine modify:
Routine triangular_loop:
Loop i: conflicts
Loop j: conflict free
Routine main:
Loop i: conflict free
Routine inc:
All of the examples thus far contain one or more parallelisable loops that existing analysis in PSyclone does not parallelise.
Finally, beware.f90 contains some examples that
illustrate bugs in the existing DepdendencyTools analysis. These have been
reported on the PSyclone issue tracker.