## cse5441 - parallel computing




# vectorizing compilers



not all compilers are created equally ...



© Copyright. Jeffrey S. Jones, 2018. May be used for educational purposes without written permission but with a citation to this source. May not be distributed electronically without written permission.


## array programming

array programming languages (such as Fortran 90) provide operators which generalize scalar functions to higher dimensions

#### Fortran 77

    do i = 1, n
      do j = 1, n
        C(j,i) = A(j,i) + B(j,i)
      enddo
    enddo

#### Fortran 90

    C = A + B

### scalar arithmetic / logic unit (ALU)




### vector arithmetic / logic unit




### unordered data distribution

```
for (iband = 0; iband < nbands; iband++) {
    for (idir = 0; idir < ndir; idir++) {
        for (icell = 0; icell < ncells; icell++) {
            for (iface = 0; iface < nfcell[icell]; iface++) {
                if (bface[currf] == 0) {
                    // interior face
                    do_stuff(interior);
                } else {
                    // boundary face
                    if (bctype[ibface] == ADIA)
                        do_stuff(ADIA);
                    else if (bctype[ibface] == ISOT)
                        do_stuff(ISOT);
                } // end -- if interior or boundary case
            } // end -- loop over cell faces
        } // end -- cell loop
    } // end -- dir loop
} // end -- band loop
```

#### INPUT DATA (stylized)

INTER INTER ADIA ISOT PERM INTER PERM ADIZ ADIZ INTER PERM ISOT PERM ADIA ISOT XPR ISOT PERM ADIA INTER

assume all inputs are of same type, and in these application categories

# OSU CSE 5441

# SIMD adaptation

### UNORDERED INPUT DATA

INTER INTER ADIA ISOT PERM INTER PERM ADIZ ADIZ INTER PERM ISOT PERM ADIA ISOT XPR ISOT PERM ADIA INTER

### PARTITIONED INPUT DATA

ADIA ADIA INTER INTER INTER INTER INTER

## SIMD adaptation

```
for (iband = 0; iband < nbands; iband++) {
    for (idir = 0; idir < ndir; idir++) {
        for (icell = 0; icell < ncells; icell++) {
            for (iface = 0; iface < nfcell[icell]; iface++) {
                if (bface[currf] == 0) {
                    // interior face
                    do_stuff(interior);
                } else {
                    // boundary face
                    if (bctype[ibface] == ADIA)
                        do_stuff(ADIA);
                    else if (bctype[ibface] == ISOT)
                        do_stuff(ISOT);
                } // end -- if interior or boundary case
            } // end -- loop over cell faces
        } // end -- cell loop
    } // end -- dir loop
} // end -- band loop
```

```
for (iband = 0; iband < nbands; iband++) {
    for (idir = 0; idir < ndir; idir++) {
        for (iface = 0; iface < nf_max; iface++) {
            // process interior cell faces
            for (indx = 0; indx < num_if_cells[iface]; indx++)
                do_stuff(interior);
            // process ISOT cell faces
            for (indx = ISOT_offset;
                 indx < ISOT_offset + num_isot_cells[iface];
                 indx++)
                do_stuff(ISOT);
            // process ADIA cell faces
            for (indx = ADIA_offset;
                 indx < ADIA_offset + num_adia_cells[iface];
                 indx++)
                do_stuff(ADIA);
        } // end -- face loop
    } // end -- dir loop
} // end -- band loop
```
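The partitioning step that makes the second loop nest possible can be sketched as a counting sort over face categories. This is a minimal illustration, not the course's actual code: `enum face_type`, `partition_faces`, `gather`, and `process_partitioned` are hypothetical names, and the per-category arithmetic is only a placeholder for `do_stuff`.

```c
#include <stddef.h>

/* Hypothetical face categories; names mirror the labels used above. */
enum face_type { INTER = 0, ISOT = 1, ADIA = 2, NUM_TYPES = 3 };

/*
 * Counting sort of face indices by category.
 * On return, order[offset[t] .. offset[t+1]-1] holds the indices of all
 * faces of category t, so each category occupies one contiguous range.
 * offset must have room for NUM_TYPES + 1 entries.
 */
void partition_faces(const enum face_type *type, size_t n,
                     size_t *order, size_t *offset)
{
    size_t count[NUM_TYPES] = {0};
    size_t next[NUM_TYPES];

    for (size_t i = 0; i < n; i++)
        count[type[i]]++;

    offset[0] = 0;
    for (int t = 0; t < NUM_TYPES; t++) {
        offset[t + 1] = offset[t] + count[t];
        next[t] = offset[t];
    }

    for (size_t i = 0; i < n; i++)
        order[next[type[i]]++] = i;
}

/* Copy the data into partitioned order so each category is contiguous. */
void gather(const double *src, const size_t *order, size_t n, double *dst)
{
    for (size_t k = 0; k < n; k++)
        dst[k] = src[order[k]];
}

/* One branch-free, stride-1 loop per category (placeholder arithmetic). */
void process_partitioned(double *data, const size_t *offset)
{
    for (size_t k = offset[INTER]; k < offset[INTER + 1]; k++)
        data[k] += 1.0;                 /* stands in for do_stuff(interior) */
    for (size_t k = offset[ISOT]; k < offset[ISOT + 1]; k++)
        data[k] *= 2.0;                 /* stands in for do_stuff(ISOT) */
    for (size_t k = offset[ADIA]; k < offset[ADIA + 1]; k++)
        data[k] = 0.0;                  /* stands in for do_stuff(ADIA) */
}
```

The sort is paid once, before the band and direction loops; every later sweep then runs homogeneous loops with no data-dependent branch in the body.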

# C/C++ vectorizing compilers

```
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
```

compilers automatically handle the simple cases

```
for (int i = 0; i < n; i++)
{
    sum = 0.0;
    for (int j = 0; j < n; j++)
    {
        sum += A[j][i];
    }
    B[i] = sum;
}
```

what makes this loop more challenging?

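- no stride-1 access: the inner loop walks down a column of `A`, while C stores arrays row by row
- `sum` creates a loop-carried dependence for i: every iteration of the i loop reuses the same scalar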

# C/C++ vectorizing compilers

example

```
for (int i = 0; i < n; i++)
{
    sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += A[j][i];
    B[i] = sum;
}
```

```
for (int i = 0; i < n; i++)
{
    sum[i] = 0.0;
    for (int j = 0; j < n; j++)
        sum[i] += A[j][i];
    B[i] = sum[i];
}
```

scalar expansion: eliminates the loop-carried dependence on `sum`

```
for (int i = 0; i < n; i++)
    sum[i] = 0.0;
for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        sum[i] += A[j][i];
for (int i = 0; i < n; i++)
    B[i] = sum[i];
```

plus loop reordering and distribution: provides stride-1 access
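A small self-contained check that the transformed loops compute the same column sums as the original. This is a sketch under assumptions: the fixed size `N` and the function names `colsum_scalar` and `colsum_expanded` are made up for illustration.

```c
#include <stddef.h>

#define N 4  /* hypothetical fixed size, kept small for illustration */

/* Original form: one scalar accumulator, column-wise (stride-N) access. */
void colsum_scalar(double A[N][N], double B[N])
{
    for (size_t i = 0; i < N; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            sum += A[j][i];
        B[i] = sum;
    }
}

/* Scalar expansion, then loop distribution and interchange:
 * the innermost loop now runs over i, so A[j][i] is read with stride 1. */
void colsum_expanded(double A[N][N], double B[N])
{
    double sum[N];

    for (size_t i = 0; i < N; i++)
        sum[i] = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum[i] += A[j][i];
    for (size_t i = 0; i < N; i++)
        B[i] = sum[i];
}
```

Each column is still accumulated in increasing `j` order in both versions, so the floating-point results are identical, not merely close.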

# C/C++ vectorizing compilers

example

```
for (int i = 0; i < n; i++)
{
    sum = 0.0;
    for (int j = 0; j < n; j++)
    {
        sum += A[j][i];
    }
    B[i] = sum;
}
```

#### Intel Nehalem

loop not vectorized

#### **IBM Power 7**

loop not vectorized

```
for (int i = 0; i < n; i++)
{
    sum[i] = 0.0;
    for (int j = 0; j < n; j++)
    {
        sum[i] += A[j][i];
    }
    B[i] = sum[i];
}
```

#### Intel Nehalem

loop vectorized; speedup: 2.6 (62% faster); relative run-time: 0.6

#### **IBM Power 7**

loop interchanged and vectorized; speedup: 2.0 (50% faster); relative run-time: 0.2


## stripmine

step 1 of 2

#### **ORIGINAL**

```
for (int i = 0; i < n; i++)
{
    A[i] = B[i] + 1.0;   // S1
    C[i] = A[i] + 2.0;   // S2
}
```

#### **STRIPMINE**

```
for (int i = 0; i < n; i += stripsize)
{
    // the inner bound also checks n, in case n is not a multiple of stripsize
    for (int j = i; j < i + stripsize && j < n; j++)
    {
        A[j] = B[j] + 1.0;
        C[j] = A[j] + 2.0;
    }
}
```

### stripmine - distribute

step 2 of 2


#### **DISTRIBUTE**

```
for (int i = 0; i < n; i += stripsize)
{
    for (int j = i; j < i + stripsize && j < n; j++)
    {
        A[j] = B[j] + 1.0;
    }
    for (int j = i; j < i + stripsize && j < n; j++)
    {
        C[j] = A[j] + 2.0;
    }
}
```

#### **VECTORIZED**

```
// pseudo-code: X[i:i+q-1] denotes the q-element section X[i] .. X[i+q-1]
for (int i = k; i < n; i += q)
{
    A[i:i+q-1] = B[i:i+q-1] + 1.0;
}
for (int i = k; i < n; i += q)
{
    C[i:i+q-1] = A[i:i+q-1] + 2.0;
}
```
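The two steps can be checked against the original loop with a short sketch. `STRIPSIZE` and the function names are assumptions chosen for illustration. The `end` bound handles the case where `n` is not a multiple of the strip size.

```c
#include <stddef.h>

#define STRIPSIZE 4  /* hypothetical strip length, e.g. elements per vector register */

/* Original loop: S1 and S2 in the same iteration. */
void kernel_original(const double *B, double *A, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        A[i] = B[i] + 1.0;   /* S1 */
        C[i] = A[i] + 2.0;   /* S2 */
    }
}

/* Stripmined and distributed: within each strip, all S1 then all S2. */
void kernel_strip_distributed(const double *B, double *A, double *C, size_t n)
{
    for (size_t i = 0; i < n; i += STRIPSIZE) {
        /* clamp the last strip when n is not a multiple of STRIPSIZE */
        size_t end = (i + STRIPSIZE < n) ? i + STRIPSIZE : n;

        for (size_t j = i; j < end; j++)
            A[j] = B[j] + 1.0;          /* candidate for one vector add */
        for (size_t j = i; j < end; j++)
            C[j] = A[j] + 2.0;          /* candidate for one vector add */
    }
}
```

Each inner loop now contains a single statement over a short contiguous range, which is the shape a vector instruction can replace directly.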

### vectorization result

#### scalar:

    load   r1  &operand1
    load   r2  &operand2
    add    r1, r2
    store  r3  &result

    load   r1  &operand1a
    load   r2  &operand2a
    add    r1, r2
    store  r3  &result

    load   r1  &operand1b
    load   r2  &operand2b
    add    r1, r2
    store  r3  &result

    load   r1  &operand1c
    load   r2  &operand2c
    add    r1, r2
    store  r3  &result

#### vector:

    loadv  vr1  &operand1
    loadv  vr2  &operand2
    addv   vr1, vr2
    storev vr3  &result
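In C, the same contrast can be written with compiler intrinsics. The sketch below assumes an x86 target with SSE (four `float` lanes per register) and falls back to the scalar loop elsewhere. The function names `add_scalar` and `add_vector` are hypothetical, while `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` are standard SSE intrinsics.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar form: one load/load/add/store sequence per element. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Vector form: one load/load/add/store sequence per four elements. */
void add_vector(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* loadv  vr1 &operand1 */
        __m128 vb = _mm_loadu_ps(b + i);   /* loadv  vr2 &operand2 */
        __m128 vc = _mm_add_ps(va, vb);    /* addv   vr1, vr2      */
        _mm_storeu_ps(c + i, vc);          /* storev vr3 &result   */
    }
#endif
    /* remainder elements, or everything when SSE is unavailable */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

A vectorizing compiler aims to generate the second form from the first automatically; writing intrinsics by hand is the fallback when it cannot.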


### how well do compilers vectorize?

| Compiler        | XLC  | ICC  | GCC  |
|-----------------|------|------|------|
| Total           | 159  | 159  | 159  |
| Vectorized      | 74   | 75   | 32   |
| Not vectorized  | 85   | 84   | 127  |
| Average speedup | 1.73 | 1.85 | 1.30 |

| Compiler   | XLC but not ICC | ICC but not XLC |
|------------|-----------------|-----------------|
| Vectorized | 25              | 26              |



adding manual vectorization hints increased the average speedup for IBM XLC from 1.73 to 3.78
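The slides do not say which hints were used in that measurement. As a generic illustration only, two common source-level hints in C are the C99 `restrict` qualifier (a promise that the pointers do not alias) and the OpenMP 4.0 `#pragma omp simd` directive (an assertion that the loop is safe to vectorize). The function name `axpy_hinted` is hypothetical.

```c
#include <stddef.h>

/*
 * y[i] = alpha * x[i] + y[i]
 * restrict: the caller guarantees x and y do not overlap.
 * omp simd: honored when OpenMP SIMD support is enabled
 *           (for example -fopenmp-simd with GCC or Clang);
 *           otherwise the pragma is ignored and the loop stays correct.
 */
void axpy_hinted(size_t n, float alpha,
                 const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Both hints shift responsibility to the programmer: if `x` and `y` do overlap, the compiler is entitled to produce wrong results.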

### acyclic dependence graphs

forward dependencies

```
for (int i = 0; i < max; i++)
{
    a[i] = b[i] + c[i];   // S1
    d[i] = a[i] + 1;      // S2
}
```





can we group all the S1<sub>n</sub> and follow with S2<sub>n</sub>?

### forward dependencies

example

```
for (int i = 0; i < max; i++)
{
    a[i] = b[i] + c[i];   // S1
    d[i] = a[i] + 1;      // S2
}
```



XLC: vectorized, speedup 2.0

ICC: vectorized, speedup 1.6
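Why the forward case is safe can be seen by distributing the loop by hand: S2 at iteration i only needs the value S1 produced at the same iteration, so running every S1 first changes nothing. A minimal sketch, with hypothetical function names:

```c
#include <stddef.h>

/* Original loop: S1 and S2 interleaved in every iteration. */
void forward_fused(const double *b, const double *c,
                   double *a, double *d, size_t max)
{
    for (size_t i = 0; i < max; i++) {
        a[i] = b[i] + c[i];   /* S1 */
        d[i] = a[i] + 1;      /* S2 reads the a[i] that S1 just wrote */
    }
}

/* Distributed: every S1 instance first, then every S2 instance. */
void forward_distributed(const double *b, const double *c,
                         double *a, double *d, size_t max)
{
    for (size_t i = 0; i < max; i++)
        a[i] = b[i] + c[i];   /* all S1 */
    for (size_t i = 0; i < max; i++)
        d[i] = a[i] + 1;      /* all S2 */
}
```

Each of the two distributed loops has independent iterations, so each can be vectorized on its own.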


### acyclic dependence graphs

backward dependencies

```
for (int i = 0; i < max; i++)
{
    a[i] = b[i] + c[i];     // S1
    d[i] = a[i+1] + 1;      // S2
}
```





can we group all the S2<sub>n</sub> and follow with S1<sub>n</sub>?

### acyclic dependence graphs

backward dependencies

```
for (int i = 0; i < max; i++)
{
    d[i] = a[i+1] + 1;      // S2
    a[i] = b[i] + c[i];     // S1
}
```





can we group all the S2<sub>n</sub> and follow with S1<sub>n</sub>?

### backward dependencies

example

```
// original
for (int i = 0; i < max; i++)
{
    a[i] = b[i] + c[i];
    d[i] = a[i+1] + 1;
}

// re-ordered
for (int i = 0; i < max; i++)
{
    d[i] = a[i+1] + 1;
    a[i] = b[i] + c[i];
}
```

original: ICC time 12.6, non-vectorized; XLC time 0.6, vectorized

re-ordered: ICC time 9.4, vectorized; XLC time 0.6, vectorized
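The order matters here because S2 at iteration i reads `a[i+1]` before S1 overwrites it at iteration i+1. The sketch below, with hypothetical function names, distributes the loop both ways: running all S2 first preserves the original result, while running all S1 first does not. It assumes `a` has `max + 1` elements so that `a[i+1]` stays in bounds.

```c
#include <stddef.h>

/* Original loop. S2 sees the old value of a[i+1]. */
void backward_original(const double *b, const double *c,
                       double *a, double *d, size_t max)
{
    for (size_t i = 0; i < max; i++) {
        a[i] = b[i] + c[i];     /* S1 */
        d[i] = a[i + 1] + 1;    /* S2 */
    }
}

/* All S2 first, then all S1: S2 still sees only old values of a. */
void backward_s2_then_s1(const double *b, const double *c,
                         double *a, double *d, size_t max)
{
    for (size_t i = 0; i < max; i++)
        d[i] = a[i + 1] + 1;    /* all S2 */
    for (size_t i = 0; i < max; i++)
        a[i] = b[i] + c[i];     /* all S1 */
}

/* All S1 first, then all S2: S2 now sees new values of a (incorrect). */
void backward_s1_then_s2(const double *b, const double *c,
                         double *a, double *d, size_t max)
{
    for (size_t i = 0; i < max; i++)
        a[i] = b[i] + c[i];     /* all S1 */
    for (size_t i = 0; i < max; i++)
        d[i] = a[i + 1] + 1;    /* all S2 */
}
```

Re-ordering the statements in the source, as in the example above, hands the compiler the safe order directly.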


## cyclic dependence graphs



```
for (int i = 0; i < max; i++)
{
    a[i] = a[i+1] + b[i];
}
```

ICC: vectorized

XLC: vectorized

$S_1 \; \delta^f \; S_1$

```
for (int i = 0; i < max; i++)
{
    a[i] = a[i-1] + b[i];
}
```

ICC: non-vectorized

XLC: non-vectorized

$S_1 \; \delta^f \; S_1$

```
for (int i = 0; i < max; i++)
{
    a[i] = a[i-4] + b[i];
}
```

ICC: vectorized

XLC: vectorized
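The difference between a dependence distance of 1 and a distance of 4 can be demonstrated by emulating a vector unit in plain C. In this sketch the vector length `VL = 4` and the function names are assumptions. Unlike the slide loops, iteration starts at `dist` so that `a[i - dist]` stays in bounds. The emulation completes all `VL` loads before any store, which is how a vector instruction behaves.

```c
#include <stddef.h>

#define VL 4  /* hypothetical vector length in elements */

/* Reference: strictly sequential recurrence a[i] = a[i-dist] + b[i]. */
void recurrence_scalar(double *a, const double *b, size_t n, size_t dist)
{
    for (size_t i = dist; i < n; i++)
        a[i] = a[i - dist] + b[i];
}

/* Emulated VL-wide vector execution: all loads happen before any store. */
void recurrence_vector_emulated(double *a, const double *b,
                                size_t n, size_t dist)
{
    size_t i = dist;
    for (; i + VL <= n; i += VL) {
        double tmp[VL];
        for (size_t k = 0; k < VL; k++)
            tmp[k] = a[i + k - dist] + b[i + k];   /* loadv + addv */
        for (size_t k = 0; k < VL; k++)
            a[i + k] = tmp[k];                     /* storev */
    }
    for (; i < n; i++)                             /* scalar remainder */
        a[i] = a[i - dist] + b[i];
}
```

With `dist = 4`, every value a strip reads was finished by an earlier strip, so the emulated vector version matches the scalar one. With `dist = 1`, each element needs its immediate predecessor from the same strip, and the two versions diverge.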

### trade-offs

a rapidly changing landscape ...



### re-factoring on the horizon ???







### more on vectorizing compilers ...

#### references:

- "Intel® Cilk™ Plus Support," [online] software.intel.com/en-us/articles/intel-cilk-plus-support
- "A Guide to Auto-vectorization with Intel® C++ Compilers," [online] software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers
- "Quick-Reference Guide to Optimization with Intel® Compilers version 13," [online] software.intel.com/sites/default/files/Compiler\_QRG\_2013.pdf
- "Intel® C++ Intrinsics Reference," [online] software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler\_c/intref\_cls/common/intref\_bk\_intro.htm
- "Intel Advanced Vector Extensions (AVX)," [online] software.intel.com/en-us/avx
- "Compiler Prefetching for the Intel® Xeon Phi™ coprocessor," [online] software.intel.com/sites/default/files/managed/5d/f3/5.3-prefetching-on-mic-4.pdf
