Squashed commit of the following:
commit 6654c774fb0d2d6fac760b911a547b4e66b23127
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Wed Apr 27 00:44:53 2022 +0000

    Merged PR 2522: Generalize array indexing in tensorized GEMM

    This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.

    There are also a number of tests.

    This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.

    Related work items: #3676
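
The innermost-loop pattern such a pass recognizes is the canonical GEMM accumulation, C[i, j] += A[i, k] * B[k, j]. A minimal plain-Python sketch of that form (illustrative only; Accera kernels are written with its nest/schedule API, not raw loops):

```python
# Canonical GEMM inner-loop form that a tensorization pass typically
# pattern-matches: C[i][j] += A[i][k] * B[k][j].
# Plain-Python illustration, not Accera's actual kernel syntax.

def gemm(A, B, C):
    M, K = len(A), len(A[0])
    N = len(B[0])
    for i in range(M):
        for j in range(N):
            for k in range(K):  # innermost accumulation loop
                C[i][j] += A[i][k] * B[k][j]
    return C
```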

commit 4d030709101f3653712b805bd8f3698e0e293bd3
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 17:50:18 2022 +0000

    Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool

    * The Linux1ESPool is created to support internal builds of LLVM

    * Fix regression in pipeline due to overzealous .dockerignore

commit 9b9d6b4b77c46b12788665412b9d0d1c2ff62d18
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 10:43:28 2022 +0000

    Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote

    In preparation for merge from ADO to GitHub for Case Studies publishing

commit c1298946d18fb785788c556ea2959b9438f9c6b7
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 08:10:47 2022 +0000

    Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third party containers

    Updating Dockerfile references

commit 0c7a3610ba082e82e554297bdadbf9579b094745
Author: Denny Sun <dennys@microsoft.com>
Date:   Tue Apr 26 04:40:05 2022 +0000

    Merged PR 2548: Add README file for case studies

    The README file contains a table linking each case study to its external repository.

commit edbc50edd00efe8f12a675735d7e52371e43f7b1
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Apr 25 23:49:15 2022 +0000

    Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for development

    Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)

    No plans to release pip packages until there is CI support

    Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10

commit 166e333a3d10b77c804dc3edc1c71bfc5716c768
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Mon Apr 25 17:50:22 2022 +0000

    Merged PR 2543: Add precomputed offset map optimization for tensorization (no caching)

    - Add flag to tensorize() to enable optimization (off by default)
    - Optimization only affects load/store of accumulator (C) argument
    - Supports all 4 mfma shapes

    Related work items: #3671
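
As a rough sketch of the idea (names and shapes here are illustrative assumptions, not Accera's actual MFMA layout): the flat offsets of the accumulator (C) elements can be computed once per tile and then reused for every load/store, instead of re-evaluating the index expression each time.

```python
# Hedged sketch of a precomputed-offset-map optimization for the
# accumulator argument: build the offset table once, reuse it for every
# load/store. Shapes and names are illustrative.

def make_offset_map(rows, cols, row_stride):
    # Flat offsets of a rows x cols tile in a row-major buffer.
    return [r * row_stride + c for r in range(rows) for c in range(cols)]

def accumulate_tile(C_flat, offsets, contributions):
    # Apply one accumulation step using the precomputed offsets.
    for off, v in zip(offsets, contributions):
        C_flat[off] += v
```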

commit e11c4d4e87bbae87f7cb9035eff8e6af650c9d1a
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Sun Apr 24 01:00:41 2022 +0000

    Merged PR 2542: An assortment of minor fixes

    This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.

    The specific things are:
    - Add 2 new target models to `Targets.py` (that correspond to my local dev boxes)
    - Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
    - Print a warning message if `check_correctness` skips a correctness check because no hat file was generated
    - Add a "minimum version" constraint to `requirements.txt` for `hatlib`

commit 8da7903ac9b6d8612711593308e49a7a3e82678d
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Apr 23 23:59:53 2022 +0000

    Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for Package.Format

    Related work items: #3679

commit fe2c40fa8f1c28dcf47e1533223457fd3e6bf195
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Apr 23 23:17:43 2022 +0000

    Merged PR 2544: [nfc] Removes now unnecessary ldebug output

commit 32090d786ce13299bb77a6675c3478b3d7cdf48c
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Apr 22 21:31:01 2022 +0000

    Merged PR 2527: Enable vectorized shared memory write

    - This adds the mod simplification support needed for vectorizing
      shared memory writes
    - Also refactors some of the affine simplification code slightly to
      share common code between the floordiv and mod simplifications

    Related work items: #3586, #3661, #3689
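
The mod simplification can be checked exhaustively for the index shapes used in the new tests: with a < 16, b < 16 and c < 4, the a-term of (a*128 + b*4 + c) mod 64 is a multiple of 64, and the remaining terms never reach 64, so the mod disappears entirely. A brute-force check in plain Python:

```python
# Brute-force verification of the mod simplification exercised by the
# new affine_simplification.mlir tests:
#   (a*128 + b*4 + c) mod 64 == b*4 + c  for a < 16, b < 16, c < 4
# (a*128 is a multiple of 64, and b*4 + c <= 63 < 64).
for a in range(16):
    for b in range(16):
        for c in range(4):
            assert (a * 128 + b * 4 + c) % 64 == b * 4 + c
```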

commit 0eb698af118b94bf3f4d4862a142c86055f8b7bb
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Apr 22 19:13:27 2022 +0000

    Merged PR 2526: Enable GPU global read vectorization

    - Implements a floordiv simplification that enables better recognition
      of vectorizable loads and stores

    Related work items: #3661, #3690
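
The floordiv simplification can likewise be checked exhaustively for the shapes used in the new tests: with a < 16, b < 16 and c < 4, the b and c terms of (a*128 + b*4 + c) floordiv 64 sum to at most 63 and never affect the quotient, so the expression reduces to a*2.

```python
# Brute-force verification of the floordiv simplification exercised by
# the affine_simplification.mlir tests:
#   (a*128 + b*4 + c) floordiv 64 == a*2  for a < 16, b < 16, c < 4
# (the b and c terms sum to at most 63, below the divisor).
for a in range(16):
    for b in range(16):
        for c in range(4):
            assert (a * 128 + b * 4 + c) // 64 == a * 2
```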

commit df849f066ff6c2c82c796d9b48e3bea6390c7877
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Fri Apr 22 06:03:27 2022 +0000

    Merged PR 2541: Fix a few issues with GEMM benchmarking script

    This PR fixes a couple of errors:
    - there was a bug in the GEMM kernel
    - sometimes hatlib would fail to return a compiled function without throwing an exception; these cases are now flagged as "uncompilable"

    It makes a couple of other tweaks:
    - it fails if the `alpha` and `beta` parameters aren't `1.0` and `0.0`
    - it culls some variants with known-uncompilable tensorization parameters before trying to compile them
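
A hedged sketch of this kind of pre-compile culling (the function and field names are assumptions for illustration, not the benchmarking script's actual API):

```python
# Illustrative sketch of culling benchmark variants before compiling:
# reject unsupported alpha/beta combinations up front rather than
# spending time compiling them. Field names are assumptions.

def cull_variants(variants):
    kept = []
    for v in variants:
        if v.get("alpha") != 1.0 or v.get("beta") != 0.0:
            continue  # the script only supports alpha=1.0, beta=0.0
        kept.append(v)
    return kept
```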

commit 339253767ae4bb4f7e5c323f77fc938ba1a4ab92
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri Apr 22 01:26:53 2022 +0000

    Merged PR 2538: Fix std::pair unpacking issue in TensorizeAffineForOpConversion

    In debug builds, we were getting garbage values for warpSizeX and warpSizeY, resulting in division-by-zero errors in the emitted .cu files.

commit 075c83247d34bfd9fb291e4ea6b9df059a94993a
Author: Denny Sun <dennys@microsoft.com>
Date:   Fri Apr 22 00:26:56 2022 +0000

    Merged PR 2536: Parameter supports most of the arithmetic/binary/unary operations defined in operator lib

    Parameters support the basic arithmetic operations (+, -, *, //, %); for example, the user can write the following code:

    fma_unit_count, vector_size = acc.create_parameters(2)
    jjj = schedule.split(jj, fma_unit_count * vector_size)
    jjjj = schedule.split(jjj, vector_size)

    Related work items: #3692
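
The effect of the two splits can be sanity-checked in plain Python with concrete values standing in for the parameters: iterating the split loop nest visits exactly the same indices, in the same order, as the original flat loop.

```python
# Plain-Python analogue of the parameterized double split, with concrete
# stand-ins for the fma_unit_count and vector_size parameters. The split
# loop nest covers the same index space as the original loop.
fma_unit_count, vector_size = 2, 4
N = 32  # loop extent, divisible by fma_unit_count * vector_size

original = list(range(N))
split = []
for jj in range(0, N, fma_unit_count * vector_size):            # outer tile
    for jjj in range(jj, jj + fma_unit_count * vector_size, vector_size):
        for jjjj in range(jjj, jjj + vector_size):              # innermost
            split.append(jjjj)

assert split == original
```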

commit 6d5e71899c6fb606e32ec46ee871ae1af25d3cd6
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Apr 21 18:22:12 2022 +0000

    Merged PR 2539: [nfc][docs] Merging commits from Github/main

    commit ee28126a338d905eb5931038d3c5daba6ead3811
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Wed Apr 20 21:35:20 2022 +0800

        Update arrow label positions (#35)

        * [nfc] [doc] Update arrow label positions

        * make arrowhead more visible

        * nfc

    commit ddcecaaffd9dd0861999a6d29443dc7c37d79665
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Wed Apr 20 21:34:40 2022 +0800

        demo fixes for hatlib 0.0.11 (#36)
Lisa Ong committed Apr 27, 2022
1 parent 89850bf commit 5b0f142
Showing 56 changed files with 3,337 additions and 522 deletions.
1 change: 1 addition & 0 deletions .azure/llvm-canary.yml
@@ -13,6 +13,7 @@ pool:
container:
# Container with the latest available vcpkg LLVM port + patches
image: $(CONTAINER_REGISTRY)/accera-llvm-ubuntu:latest
endpoint: acceracontainers

steps:
- script: |
5 changes: 4 additions & 1 deletion .azure/manylinux/Dockerfile
@@ -4,7 +4,10 @@
# Usage: call docker build from the root of this repository
# docker build -f .azure\manylinux\Dockerfile . -t registry_name/accera-llvm-manylinux2014:latest
####################################################################################################
FROM quay.io/pypa/manylinux2014_x86_64:latest

# cf: quay.io/pypa/manylinux2014_x86_64:2022-04-24-d28e73e
# cf. https://quay.io/repository/pypa/manylinux2010_x86_64?tab=tags
FROM acceracontainers.azurecr.io/pypa/manylinux2014_x86_64:2022-04-24-d28e73e

ADD .azure/manylinux/scripts /tmp/scripts
ADD requirements.txt /tmp/scripts/requirements.txt
2 changes: 1 addition & 1 deletion .azure/manylinux/manylinux-llvm.yml
@@ -6,7 +6,7 @@ trigger:
include:
- external/llvm

pool: LinuxScaleSetAgentPool
pool: Linux1ESPool

steps:
- script: |
4 changes: 3 additions & 1 deletion .azure/rocm/Dockerfile
@@ -7,7 +7,9 @@

# https://docs.docker.com/engine/reference/builder/#understand-how-arg-and-from-interact
ARG ROCMVER=5.1.1-ub20
FROM amddcgpuce/rocm:${ROCMVER}

# cf: amddcgpuce/rocm:${ROCMVER}
FROM acceracontainers.azurecr.io/rocm:${ROCMVER}

ARG ROCMVER
RUN echo "ROCm Version: " ${ROCMVER}
3 changes: 2 additions & 1 deletion .dockerignore
@@ -1,3 +1,4 @@
build/
external/
external/vcpkg/downloads
external/vcpkg/buildtrees
*.egg-info
15 changes: 15 additions & 0 deletions CMake/BuildTargetSetup.cmake
@@ -0,0 +1,15 @@
####################################################################################################
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See LICENSE in the project root for license information.
####################################################################################################

if(APPLE)
# cf. https://discourse.cmake.org/t/how-to-determine-which-architectures-are-available-apple-m1/2401/10
# on macOS "uname -m" returns the architecture (x86_64 or arm64)
execute_process(
COMMAND uname -m
RESULT_VARIABLE result
OUTPUT_VARIABLE OSX_NATIVE_ARCH
OUTPUT_STRIP_TRAILING_WHITESPACE
)
endif()
9 changes: 7 additions & 2 deletions CMakeLists.txt
@@ -88,6 +88,8 @@ set(PACKAGE_ROOT ${ACCERA_EXTERNAL_DIR})
# Set up install location in build directory
set(CMAKE_INSTALL_PREFIX ${CMAKE_BINARY_DIR}/install)

include(BuildTargetSetup)

if(USE_MKL)
include(MKLSetup)
else()
@@ -174,11 +176,14 @@ else()
else() # GCC
add_compile_options(-Wno-ignored-attributes)
add_compile_options(-fdiagnostics-color=always)
# Set options for Control Flow Integrity
add_compile_options(-fcf-protection)
add_compile_options(-Wl,dynamicbase)
# Enable Shadow Stack mitigation
add_compile_options(-Wshadow)

if(NOT ${OSX_NATIVE_ARCH} STREQUAL "arm64")
# Set options for Control Flow Integrity (not supported on macos/arm64)
add_compile_options(-fcf-protection)
endif()
endif()
endif()

138 changes: 138 additions & 0 deletions accera/acc-opt/test/affine_simplification.mlir
@@ -0,0 +1,138 @@
// RUN: acc-opt --verify-each=false --acc-affine-simplify %s | FileCheck %s

module @test_accera_affine_simplification {
accv.module "test_accera_affine_simplification" {

// FloorDiv simplification tests

// CHECK-LABEL accv.func nested @test_simplify_floordiv_no_terms_strides
accv.func nested @test_simplify_floordiv_no_terms_strides(%arg0: memref<32xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<32xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 4 {
// CHECK: %1 = affine.load %arg0[(%arg1 * 64 + %arg2 * 33 + %arg3 * 31) floordiv 32] : memref<32xf32>
%1 = affine.load %arg0[(%arg1 * 64 + %arg2 * 33 + %arg3 * 31) floordiv 32] : memref<32xf32>
// CHECK: affine.store %1, %0[(%arg1 * 64 + %arg2 * 33 + %arg3 * 31) floordiv 32] : memref<32xf32>
affine.store %1, %0[(%arg1 * 64 + %arg2 * 33 + %arg3 * 31) floordiv 32] : memref<32xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}

// CHECK-LABEL accv.func nested @test_simplify_floordiv_no_terms_range
accv.func nested @test_simplify_floordiv_no_terms_range(%arg0: memref<32xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<32xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 5 { // This range being 5 will prevent the simplification from removing this term
// CHECK: %1 = affine.load %arg0[(%arg1 * 64 + %arg2 * 4 + %arg3) floordiv 32] : memref<32xf32>
%1 = affine.load %arg0[(%arg1 * 64 + %arg2 * 4 + %arg3) floordiv 32] : memref<32xf32>
// CHECK: affine.store %1, %0[(%arg1 * 64 + %arg2 * 4 + %arg3) floordiv 32] : memref<32xf32>
affine.store %1, %0[(%arg1 * 64 + %arg2 * 4 + %arg3) floordiv 32] : memref<32xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}

// CHECK-LABEL accv.func nested @test_simplify_floordiv_one_term
accv.func nested @test_simplify_floordiv_one_term(%arg0: memref<32xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<32xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 4 {
// CHECK: %1 = affine.load %arg0[%arg1 * 2 + (%arg2 * 48) floordiv 32] : memref<32xf32>
%1 = affine.load %arg0[(%arg1 * 64 + %arg2 * 48 + %arg3) floordiv 32] : memref<32xf32>
// CHECK: affine.store %1, %0[%arg1 * 2 + (%arg2 * 48) floordiv 32] : memref<32xf32>
affine.store %1, %0[(%arg1 * 64 + %arg2 * 48 + %arg3) floordiv 32] : memref<32xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}

// CHECK-LABEL accv.func nested @test_simplify_floordiv_two_terms
accv.func nested @test_simplify_floordiv_two_terms(%arg0: memref<32xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<32xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 4 {
// CHECK: %1 = affine.load %arg0[%arg1 * 2] : memref<32xf32>
%1 = affine.load %arg0[(%arg1 * 128 + %arg2 * 4 + %arg3) floordiv 64] : memref<32xf32>
// CHECK: affine.store %1, %0[%arg1 * 2] : memref<32xf32>
affine.store %1, %0[(%arg1 * 128 + %arg2 * 4 + %arg3) floordiv 64] : memref<32xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}

// Mod simplification tests

// CHECK-LABEL accv.func nested @test_simplify_mod_no_terms_strides
accv.func nested @test_simplify_mod_no_terms_strides(%arg0: memref<32xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<32xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 4 {
// CHECK: %1 = affine.load %arg0[(%arg1 * 68 + %arg2 * 33 + %arg3 * 31) mod 32] : memref<32xf32>
%1 = affine.load %arg0[(%arg1 * 68 + %arg2 * 33 + %arg3 * 31) mod 32] : memref<32xf32>
// CHECK: affine.store %1, %0[(%arg1 * 68 + %arg2 * 33 + %arg3 * 31) mod 32] : memref<32xf32>
affine.store %1, %0[(%arg1 * 68 + %arg2 * 33 + %arg3 * 31) mod 32] : memref<32xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}

// CHECK-LABEL accv.func nested @test_simplify_mod_no_terms_range
accv.func nested @test_simplify_mod_no_terms_range(%arg0: memref<32xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<32xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 5 { // This range being 5 will prevent the simplification from removing this term
// CHECK: %1 = affine.load %arg0[(%arg1 * 64 + %arg2 * 4 + %arg3) mod 32] : memref<32xf32>
%1 = affine.load %arg0[(%arg1 * 64 + %arg2 * 4 + %arg3) mod 32] : memref<32xf32>
// CHECK: affine.store %1, %0[(%arg1 * 64 + %arg2 * 4 + %arg3) mod 32] : memref<32xf32>
affine.store %1, %0[(%arg1 * 64 + %arg2 * 4 + %arg3) mod 32] : memref<32xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}

// CHECK-LABEL accv.func nested @test_simplify_mod_one_term
accv.func nested @test_simplify_mod_one_term(%arg0: memref<32xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<32xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 4 {
// CHECK: %1 = affine.load %arg0[%arg3 + (%arg1 * 68 + %arg2 * 48) mod 32] : memref<32xf32>
%1 = affine.load %arg0[(%arg1 * 68 + %arg2 * 48 + %arg3) mod 32] : memref<32xf32>
// CHECK: affine.store %1, %0[%arg3 + (%arg1 * 68 + %arg2 * 48) mod 32] : memref<32xf32>
affine.store %1, %0[(%arg1 * 68 + %arg2 * 48 + %arg3) mod 32] : memref<32xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}

// CHECK-LABEL accv.func nested @test_simplify_mod_all_terms
accv.func nested @test_simplify_mod_all_terms(%arg0: memref<64xf32>) attributes {exec_target = 0 : i64} {
%0 = memref.alloc() : memref<64xf32>
affine.for %arg1 = 0 to 16 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 4 {
// CHECK: %1 = affine.load %arg0[%arg3 + %arg2 * 4] : memref<64xf32>
%1 = affine.load %arg0[(%arg1 * 128 + %arg2 * 4 + %arg3) mod 64] : memref<64xf32>
// CHECK: affine.store %1, %0[%arg3 + %arg2 * 4] : memref<64xf32>
affine.store %1, %0[(%arg1 * 128 + %arg2 * 4 + %arg3) mod 64] : memref<64xf32>
} {begin = 0 : i64, end = 4 : i64}
} {begin = 0 : i64, end = 16 : i64}
} {begin = 0 : i64, end = 16 : i64}
accv.return
}
}
}
40 changes: 20 additions & 20 deletions accera/acc-opt/test/barrier_opt_tests/barrier_opt_test_generator.py
@@ -10,7 +10,7 @@
def build_package(plan, args, name):
package = acc.Package()
package.add(plan, args=args, base_name=name)
package.build(name, format=acc.Package.Format.MLIR_VERBOSE | acc.Package.Format.CUDA, output_dir="build")
package.build(name, format=acc.Package.Format.MLIR_VERBOSE | acc.Package.Format.DEFAULT, output_dir="build")


def barrier():
@@ -31,7 +31,7 @@ def barrier_trivial_test_1():
@nest.iteration_logic
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

shA[i] = A[i]
A[i] *= 2.0
B[i] = shA[i]
@@ -55,7 +55,7 @@ def barrier_single_warp_test_1():
@nest.iteration_logic
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([N]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

barrier()
shA[i] = A[i]
barrier()
@@ -85,7 +85,7 @@ def barrier_single_warp_test_2():
@nest.iteration_logic
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([N]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

barrier()
shA[i] = A[i]
barrier()
@@ -117,7 +117,7 @@ def barrier_single_warp_test_3():
@nest.iteration_logic
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([N]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

barrier()
shA[i] = A[i]
barrier()
@@ -150,7 +150,7 @@ def barrier_multi_warp_test_1():
@nest.iteration_logic
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([N]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

barrier()
shA[i] = A[i]
barrier()
@@ -182,9 +182,9 @@ def barrier_seq_test_1():

@nest.iteration_logic
def _():
# Performs excessive barriers.
# Performs excessive barriers.
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

barrier()
shA[i] = A[i]
barrier()
@@ -214,10 +214,10 @@ def barrier_seq_test_2():

@nest.iteration_logic
def _():
# Performs excessive barriers.
# Performs excessive barriers.
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))
shB = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

barrier()
shA[i] = A[i]
barrier()
@@ -247,10 +247,10 @@ def barrier_seq_test_3():

@nest.iteration_logic
def _():
# Performs excessive barriers.
# Performs excessive barriers.
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))
shB = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

shB[i] = A[i]
barrier()
shA[i] = A[i]
@@ -317,7 +317,7 @@ def barrier_if_test_2():
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))
shB = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

def if_block():
barrier()
shB[i] = A[i]
@@ -416,9 +416,9 @@ def else_block():
barrier()
shB[i] = B[i]
barrier()

acc._lang_python._lang._If(i < acc._lang_python._lang.as_index(N), if_block).Else(else_block)

barrier()
shA[i] = A[i]
barrier()
@@ -448,7 +448,7 @@ def barrier_loop_test_1():
@nest.iteration_logic
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

start = acc.Scalar(0)
stop = acc.Scalar(32)
step = acc.Scalar(1)
@@ -490,7 +490,7 @@ def barrier_loop_test_2():
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))
shB = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

start = acc.Scalar(0)
stop = acc.Scalar(32)
step = acc.Scalar(1)
@@ -533,7 +533,7 @@ def barrier_loop_test_3():
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))
shB = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

start = acc.Scalar(0)
stop = acc.Scalar(32)
step = acc.Scalar(1)
@@ -594,7 +594,7 @@ def barrier_loop_test_4():
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))
shB = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

start = acc.Scalar(0)
stop = acc.Scalar(32)
step = acc.Scalar(1)
@@ -650,7 +650,7 @@ def barrier_loop_test_5():
@nest.iteration_logic
def _():
shA = acc.NativeArray(acc.Allocate(type=acc.ScalarType.float32, layout=acc._lang_python._MemoryLayout([blocksize]).set_memory_space(acc._lang_python._lang._MemorySpace.SHARED)))

start = acc.Scalar(0)
stop = acc.Scalar(32)
step = acc.Scalar(1)
