





Multithreading for Visual Effects 9:00am-12:15pm August 12, 2015







The 42nd International Conference and Exhibition on Computer Graphics and Interactive Techniques



Multithreading for Visual Effects
Multithreading Introduction and Overview
James Reinders
Intel

# Agenda

| 9:00am  | Start                                              |                                   |
|---------|----------------------------------------------------|-----------------------------------|
| 9:00am  | Introduction                                       | James Reinders, Intel             |
| 9:05am  | Multithreading Introduction and Overview           | James Reinders, Intel             |
| 9:45am  | Parallelism in Houdini - practical lessons learned | Jeff Lait, Side Effects Software  |
| 10:30am | Break                                              |                                   |
| 10:45am | GPU Rigid Body Simulation Using OpenCL             | Erwin Coumans, Google             |
| 11:10am | Asynchronous Computation Engine for Animation      | George ElKoura, Pixar             |
| 11:40am | Parallel Evaluation of Character Rigs Using TBB    | Martin Watt, Dreamworks Animation |
| 12:15pm | Done                                               |                                   |



### @multithreadvfx



www.multithreadingandvfx.org



"Multithreading for Visual Effects"



Taylor & Francis

Booth 528



### @multithreadvfx



www.multithreadingandvfx.org



"Multithreading for Visual Effects"

# Agenda

| 9:00am  | Start                                              |                                   |
|---------|----------------------------------------------------|-----------------------------------|
| 9:00am  | Introduction                                       | James Reinders, Intel             |
| 9:05am  | Multithreading Introduction and Overview           | James Reinders, Intel             |
| 9:45am  | Parallelism in Houdini - practical lessons learned | Jeff Lait, Side Effects Software  |
| 10:30am | Break                                              |                                   |
| 10:45am | <b>GPU Rigid Body Simulation Using OpenCL</b>      | Erwin Coumans, Google             |
| 11:10am | Asynchronous Computation Engine for Animation      | George ElKoura, Pixar             |
| 11:40am | Parallel Evaluation of Character Rigs Using TBB    | Martin Watt, Dreamworks Animation |
| 12:15pm | Done                                               |                                   |

### "Think Parallel"

### "Think Parallel"

- Parallelism is almost never effective to "stick in" at the last minute in a program
- Think about everything in terms of how to do in parallel as cleanly as possible

# **Motivation**







Multithread: grow die area small % for addition hardware thread(s) sharing resources.

Multicore/Many Core: 100% die area for additional hardware thread without sharing,

Data parallelism: handling more data at once,
multibyte, multiword, many words.

© 2015, James Reinders, used with permission. http://lotsofcores.com



© 2015, James Reinders, used with permission. http://lotsofcores.com

### Task

- Key thing to know:
  - Program in TASKS not THREADS.

http://tinyurl.com/threadsYUCK

### Task

- Key thing to know:
  - Program in TASKS not THREADS.

#### This means:

- Programmer's job: identify (lots) of tasks to do
- Runtime's job: map tasks onto threads at runtime

# James' BIG 3 REASONS to avoid programming to specific hardware mechanisms

- 1. Portability is impaired severely when coding "close to the hardware"
- 2. Nested parallelism is IMPORTANT (nearly impossible to manage well using "mandatory parallelism" methods such as threads)
- 3. Other parallelism mechanisms (e.g., vectorization) needs to be considered and balanced.

## What makes a good abstraction?

- Hardware agnostic
- -Performance
- "Scale forward" (preserve investments)
- Reliable and predictable
- Effective use of scarce resource: us

# **Parallel Programming**

- No widely used (popular) programming language was designed for expressing parallel programming.
  - Not Fortran, C, C++, Java, C#, Perl, Python, Ruby

- This creates many challenges
- Fundamental question of all programming languages: level of abstraction

# **Parallel Programming Models**

- Sequential Semantics?
- A fundamental design choice;
- most abstract solutions being "retrofit" into existing programming languages tend to choose to preserve sequential semantics.
- TEST: if ignoring the parallel keywords/directives/calls would result in equivalent functionality (expected to be slower), then I'd expect that sequential semantics are at play.

# **Programming Model Ideal Goals**

- Performance:
  - achievable, scalable, predictable, tunable, portable
- Productivity:
  - expressive, composable, debugable, maintainable
- Portability
  - functionality & performance across
     operating systems, compilers, targets

### Level of Abstraction: Parallelism

- There is no "perfect" answer (one size fits all)
- Higher level programming (more abstract):
  - Desired benefits:
     More portable, more performance portable, better investment preservation over time.
- Lower level programming (less abstract):
  - Desired benefits:
     More control for the programmer.

### Level of Abstraction: Parallelism

- There is no "perfect" answer (one size fits all)
- Higher level programming (more abstract):
  - Desired benefits:

     More portable, more performance portable, better investment preservation over time.
- Lower level programming (less abstract):
  - Desired benefits: OpenCL\*, CUDA\*
     More control for the programmer.

<sup>\*</sup> Third party marks may be claimed as the property of others.

# Advancing C and C++

# www.threadingbuildingblocks.org



- Most popular C++ abstraction
- ✓ Windows\*
- ✓ Linux\*
- ✓ Mac OS\* X
- ✓ Xbox 360
- ✓ Solaris\*
- ✓ FreeBSD\*
- ✓ Intel processors
- ✓ AMD processors
- ✓ SPARC processors
- ✓ IBM processors
- ✓ open source
- ✓ standard committee submissions

The most used method to parallelize C++ programs

 $<sup>^</sup>st$  Other names and brands may be claimed as the property of others.

#### Easier to maintain, scales better, easier to debug (get correct)

#### hand-coded

```
Thread Setup and Initialization
CRITICAL SECTION MyMutex, MyMutex2, MyMutex3;
InitializeCriticalSection (SMoNutex2)
Parallel Task Scheduling and Execution
const int MINPATCH = 150;
work queue entry t *work queue head = NULL
work queue entry t *work queue tail = NULL
```

```
work queue head = q;
bool schedule thread work (patch &pch)
   if (q != NULL) (
       if (scene.displaymode = RT DISPLAY ENABLED) (
```

```
Thread Setup and Initialization
#include "tbb/task scheduler init.h"
#include "tbb/spin_mutex.h"
tbb::task_scheduler_init_init;
tbb::spin_mutex MyMutex, MyMutex2;
Parallel Task Scheduling and Execution
#include "tbb/parallel for.h"
#include "tbb/blocked range2d.h"
class parallel task (
public:
   void operator() (const tbb::blocked range2dCint> &r) const (
                 render one pixel (x, y);
            tbb::spin_mutex::scoped_lock_lock_(MyMutex2);
                 GraphicsDrawRow(startx-1, y-1, totalx, (unsigned
char *) Eglobal buffer[(y-starty)*totalx*3]);
    parallel task () ()
parallel for (tbb::blocked range2dkint> (starty, stopy + 1,
grain size, startx, stopx + 1, grain size), parallel task ());
```

# **Work Stealing**









- 1. Take youngest task from my deque
- 2. Steal task advertised in mailbox
- 3. Steal oldest task from random victim

Locality

**Cache Affinity** 

**Load balance** 

# Work Depth First; Steal Breadth First



victim thread

### **Scale Forward**





Configuration Info - SW Versions: Intel® C++ Intel® 64 Compiler, Version 12.1, Intel® Threading Building Blocks 4.0; Hardware: 4 \* Intel® Xeon® CPU E7- 4860 @ 2.27GHz (40 cores), 256GB Main Memory: Operating System: Linux, Red Hat\* Enterprise Server\* release 5.4, kernel 2.6.18-194.11.4.el5; Benchmark Source: Intel Corp.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to <a href="https://www.intel.com/performance/resources/benchmark limitations.htm">www.intel.com/performance/resources/benchmark limitations.htm</a>. \* Other brands and names are the property of their respective owners

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

### **TBB Components**

Threads and synchronization

Memory allocation and task scheduling

### Generic Parallel Algorithms

Efficient scalable way to exploit the power of multi-core without having to start from scratch.

#### Flow Graph

A set of classes to express parallelism as a graph of compute dependencies and/or data flow

#### **Concurrent Containers**

Concurrent access, and a scalable alternative to serial containers with external locking

#### **Synchronization Primitives**

Atomic operations, a variety of mutexes with different properties, condition variables

#### **Task Scheduler**

Sophisticated work scheduling engine that empowers parallel algorithms and the flow graph

| Thread Local Storage                       | Threads            | Miscellaneous                                  |
|--------------------------------------------|--------------------|------------------------------------------------|
| Unlimited number of thread-local variables | OS API<br>wrappers | Thread-safe timers<br>and exception<br>classes |

#### **Memory Allocation**

Scalable memory manager and false-sharing free allocators

**Course Notes ONLY material** 

# tbb::parallel\_for

#### Has several forms.

```
Execute functor(i) for all i \in [lower, upper)
```

```
parallel_for( lower, upper, functor );
```

```
Execute functor(i) for all i \in \{lower, lower+stride, lower+2*stride, ...\}
```

```
parallel_for( lower, upper, stride, functor );
```

Execute functor(subrange) for all subrange in range

```
parallel_for( range, functor );
```

# Optional partitioner Argument

Recurse all the way down range.

```
tbb::parallel_for( range, functor, tbb::simple_partitioner() );
```

Choose recursion depth heuristically.

```
tbb::parallel_for( range, functor, tbb::auto_partitioner() );
```

Replay with cache optimization.

```
tbb::parallel_for( range, functor, affinity_partitioner );
```

## without lambda, code has to go in a class

```
class ApplyABC {
public:
    float *a;
    float *b;
    float *c;
   ApplyABC(float *a ,float *b ,float *c ):a(a ), b(b ), c(c ) {}
    void operator()(const blocked range<size t>& r) const {
        for(size t i=r.begin(); i!=r.end(); ++i)
            a[i] = b[i] + c[i];
void ParallelFoo( float* a, float *b, float *c, int n ) {
    parallel for(blocked range<size t>(0,n,10000),ApplyABC(a,b,c));
```

## doing with lambdas support is more natural

Course Notes ONLY material

# Intel® TBB Class Graph: Components Interesting "new" aspect since TBB 4.0 Release (2011)

#### Graph object

- Contains a pointer to the root task
- Owns tasks created on behalf of the graph
- Users can wait for the completion of all tasks of the graph

#### Graph nodes

- Implement sender and/or receiver interfaces
- Nodes manage messages and/or execute function objects

#### Edges

Connect predecessors to successors



## tbb\_thread

- Make # of software threads higher than # of hardware threads
- Blocking tasks support
- Task scheduler permits blocking tasks (use this special tbb\_thread)
- Needed for GUI, AI, I/O, and network events
- Doesn't interfere with computational work



# affinity partitioner

Portable affinity mechanism gets more performance from chained parallel operations.

More efficient cache use.



# **Mutex Summary**

- "Scalable": does no worse than serializing
- Fair: preserves first-come first-serve (guarantees no starvation)
- Reentrant: thread can hold multiple locks on the same mutex
- Wait method: how threads wait for the lock

|                  | "Scalable"   | Reentrant | Fair         | Long Wait | Size     |
|------------------|--------------|-----------|--------------|-----------|----------|
| mutex            | OS dependent | No        | OS dependent | Block     | ≥3 words |
| recursive_mutex  | OS dependent | Yes       | OS dependent | Block     | ≥3 words |
| spin_mutex       | No           | No        | No           | Yield     | 1 byte   |
| queuing_mutex    | Yes          | No        | Yes          | Yield     | 1 word   |
| spin_rw_mutex    | No           | No        | No           | Yield     | 1 word   |
| queuing_rw_mutex | Yes          | No        | Yes          | Yield     | 1 word   |
| null_mutex       | -            | Yes       | Yes          | -         | empty    |
| null_rw_mutex    | -            | Yes       | Yes          | -         | empty    |

# Intel<sup>®</sup> Cilk™ Plus

Open specification at cilkplus.org

## Scale Efficiently Intel® Cilk™ Plus, three keywords to go parallel

```
cilk_for (int i=0; i<n; ++i) {
  foo(a[i]);
}</pre>
```

Open specification at cilkplus.org

## Scale Efficiently Intel® Cilk™ Plus, three keywords to go parallel

```
int fib(int n)
                                      int fib(int n)
    if (n <= 2)
                                          if (n <= 2)
                                              return n;
        return n;
    else {
                                          else {
        int x, y;
                                              int x, y;
        x = fib(n-1);
                                              x = cilk spawn fib(n-1);
        y = fib(n-2);
                                              y = fib(n-2);
                                              cilk sync;
        return x+y;
                                              return x+y;
```

## **TBB**

threadingbuildingblocks.org

- Reasons it might matter:
  - Everywhere (ports, open source)
  - Forever (commercial and open source)
  - Performance Portable
- Problems:
  - C++ not C

## Cilk™ Plus

- Reasons it might matter:
  - Space/time guarantees
  - Performance Portable
  - C++ and C
  - Keywords bring compiler into the "know"
  - "Parent stealing"
  - Vectorization help too (array notations, elem. func, simd)
- Problems:
  - Requires compiler changes
  - Not feature rich
  - Only in Intel and gcc compilers (and some in Clang/LLVM)
  - Standards adoption still "future"

# OpenMP\*

- Reasons it might matter:
  - Everywhere (all major compilers)
  - Solutions for Tasking, vectorization, offload
  - Performance Portable
- Problems:
  - C and Fortran, not so much C++
  - Not composable
  - Not always in-sync with language standards

<sup>\*</sup> Third party marks may be claimed as the property of others.

# OpenCL\*

- Reasons it might matter:
  - Explicit heterogeneous controls
  - Everywhere (ports)
  - Non-proprietary
  - Underpinning for tools and libraries
- Problems:
  - Low level
  - Not performance portable

<sup>\*</sup> Third party marks may be claimed as the property of others.

## **Choice is Good**

 Our favorite programming languages were NOT DESIGNED for parallelism.

They need HELP.

Multiple approaches and options are NEEDED.

- - early key features "register" keyword out of use– "volatile" fading in usage

  - added: stronger typing (ANSI C, 1989)
  - -C11

  - OpenMP\* (1996)– Cilk™ Plus (2010)



- - Objected oriented
  - Intel® Threading Building Blocks (2006)
  - -C++11
  - Cilk™ Plus (2010)



## C++11 (some applies to C11 also)

#### Core language runtime performance enhancements

- Rvalue references and move constructors
- Generalized constant expressions
- Modification to the definition of plain old data

#### Core language build time performance enhancements

Extern template

#### Core language usability enhancements

- Initializer lists
- Uniform initialization
- Type inference
- Range-based for-loop
- Lambda functions and expressions
- Alternative function syntax
- Object construction improvement
- Explicit overrides and final
- Null pointer constant
- Strongly typed enumerations
- Right angle bracket
- Explicit conversion operators
- Alias templates
- Unrestricted unions

#### Core language functionality improvements

- Variadic templates
- New string literals
- User-defined literals
- Multithreading memory model
- Thread-local storage
- Explicitly defaulted and deleted special member functions
- Type long long int
- Static assertions
- Allow size of to work on members of classes without an explicit object
- Control and guery object alignment
- Allow garbage collected implementations

#### C++ standard library changes

- Upgrades to standard library components
- Threading facilities
- Tuple types
- Hash tables
- Regular expressions
- General-purpose smart pointers
- Extensible random number facility
- Wrapper reference
- Polymorphic wrappers for function objects
- Type traits for metaprogramming
- Uniform method for computing the return type of function objects

# C++11 (some applies to C11 also)

#### Core language runtime performance enhancements

- Rvalue references and move constructors
- Generalized constant expressions
- Modification to the definition of plain old data

### Core language build time performance enhancements

Extern template

#### Core language usability enhancements

- Initializer lists
- Uniform initialization
- Type inference
- Range based for loop
- Lambda functions and expressions

#### Alternative function syntax

- Object construction improvement
- Explicit overrides and final
- Null pointer constant
- Strongly typed enumerations
- Right angle bracket
- Explicit conversion operators
- Alias templates
- Unrestricted unions

## anonymous functions

#### Core language functionality improvements

- Variadic templates
- New string literals
- User-defined literals
- Multithreading memory model
- Thread-local storage
- Explicitly defaulted and deleted special member functions
- Type long long int
- Static assertions
- Allow size of to work on members of classes without an explicit object
- Control and query object alignment
- Allow garbage collected implementations

#### C++ standard library changes

- Upgrades to standard library components
- Threading facilities
- Tuple types
- Hash tables
- Regular expressions
- General-purpose smart pointers
- Extensible random number facility
- Wrapper reference
- Polymorphic wrappers for function objects
- Type traits for metaprogramming
- Uniform method for computing the return type of function objects

# C++11 (some applies to C11 also)

#### Core language runtime performance enhancements

- Rvalue references and move constructors
- Generalized constant expressions
- Modification to the definition of plain old data

### Core language build time performance enhancements

Extern template

#### Core language usability enhancements

- Initializer lists
- Uniform initialization
- Type inference
- Range based for loop
- Lambda functions and expressions

#### Alternative function syntax

- Object construction improvement
- Explicit overrides and final
- Null pointer constant
- Strongly typed enumerations
- Right angle bracket
- Explicit conversion operators
- Alias templates
- Unrestricted unions

### Core language functionality improvements Variadic templates defining visibility

- Variadic templatesNew string literals
- defining visibility of stores
- ... denned literals
- Multithreading memory model
- Micauriocarotorage
- Explicitly defaulted and deleted special member functions
- Type long long int
- Static assertions
- Allow size of to work on members of classes without an explicit object
- Control and query object alignment
- Allow garbage collected implementations

#### C++ standard library changes

- Upgrades to standard library components
- Threading facilities
- Tuple types
- Hash tables
- Regular expressions
- General-purpose smart pointers
- Extensible random number facility
- Wrapper reference
- Polymorphic wrappers for function objects
- Type traits for metaprogramming
- Uniform method for computing the return type of function objects

anonymous

functions

## C++11 (some applies to C11 also)

#### Core language runtime performance enhancements

- Rvalue references and move constructors
- Generalized constant expressions
- Modification to the definition of plain old data

### Core language build time performance enhancements

Extern template

#### Core language usability enhancements

- Initializer lists
- Uniform initialization
- Type inference
- · Range-based for loop
- Lambda functions and expressions

#### Alternative function syntax

- Object construction improvement
- Explicit overrides and final
- Null pointer constant
- Strongly typed enumerations
- Right angle bracket
- Explicit conversion operators
- Alias templates
- Unrestricted unions

#### sions

anonymous functions

#### Core language functionality improvements

- Variadic templatesNew string literals
- defining visibility of stores
- denneu illerais
- Multithreading memory model
- I MEau-local Storage
- Explicitly defaulted and deleted special member functions
- Type long long int
- Static assertions
- Allow size of to work on members of classes without an explicit object
- Control and query object alignment
- Allow garbage collected implementations

#### C++ standard library changes

- Ungrades to standard I'll
- Threading facilities
- Turk types
- Hash tables

#### futures & promises, async

- Regular expressions
- General-purpose smart pointers
- Extensible random number facility
- Wrapper reference
- Polymorphic wrappers for function objects
- Type traits for metaprogramming
- Uniform method for computing the return type of function objects

future: think of as a consumer end of a 1-element produce/consumer queue

- A future can be created only from an existing promise object.
- Producer computes the value: calls set\_value() on the promise.
- Consumer needs the future value: it calls **get()** on the future.
- Consumer blocks waiting on the producer if producer has not yet set\_value().
- Futures can be used via the async() member function.

```
double foo(double arg); // consider normal function
    // You can execute foo(x) asynchronously by calling
std::future<double> result = std::async(foo, x);
...
double val = result.get();
```

The problems with the future/async model are both linguistic and performance-related.

The key flaw is that the whole notion of scalability with using futures was soundly refuted in the seminal 1993 paper:

Space-efficient scheduling of multithreaded computations by Blumofe and Leiserson.

This is the paper that motivated the development of Cilk in the first place.

The linguistic problems are more subtle.

The following two statements that do roughly the same thing:

```
std::future<double> result = std::async(foo, x);
double result = cilk_spawn foo(x);
```

The first statement looks like a call to async(). The second statement looks like a call to foo().

Semantically, consider the following:
std::string s("hello");
int bar(const std::string& s);
std::future<int> result = std::async(bar, s + " world");

The above statement is intended to pass "hello world" to bar and run it asynchronously.

Semantically, consider the following:

```
std::string s("hello");
int bar(const std::string& s);

std::future<int> result = std::async(bar, s + " world");
```

The above statement is intended to pass "hello world" to bar and run it asynchronously.

The problem is that s + "world" is a *temporary* object that gets destroyed as soon as the statement completes.

```
std::string s("hello");
int bar(const std::string& s);
std::future<int> result = std::async(bar, s + " world");
```

Boosters of std::async will counter that all you need is to add a lambda:

```
std::future<int> result = std::async([&]{ bar(s + " world"); });
```

Without the lambda - it is a *race condition* that should **not** exist in a linguistically sound parallel construct, but it is pretty much unavoidable in a library-only specification.

## Vectorization: Who, What, Why, When

## A moment of clarity

task parallelism
doesn't matter
without data parallelism

(James' observation about what Amdahl and Gustafson were ultimately pointing out)

## Parallel first

Vectorize second



Multithreading is more powerful than vectorization

- by simple math:

16 way from vectorization

fron

244 waythread parallelism





## $16 \times 244 = 3904$





```
void v add (float *c,
              float *a,
              float *b)
    for (int i=0; i<= MAX; i++)</pre>
       c[i]=a[i]+b[i];
```

## Loop:

- 1. LOAD a[i] -> Ra
- 2. LOAD b[i] -> Rb
- 3. ADD Ra, Rb -> Rc
- 4. STORE Rc -> c[i]
- 5. ADD i + 1 -> i

```
void v_add (float *c,
```

## Loop:

- 1. LOADv4 a[i:i+3] -> Rva
- 2. LOADv4 b[i:i+3] -> Rvb
- 3. ADDv4 Rva, Rvb -> Rvc
- 4. STOREv4 Rvc -> c[i:i+3]
- 5. ADDi + 4 -> i

## Loop:

- 1. LOAD a[i] -> Ra
- 2. LOAD b[i] -> Rb
- 3. ADD Ra, Rb -> Rc
- 4. STORE Rc -> c[i]
- 5. ADD i + 1 -> i

## vector data operations:

# We call this "vectorization"

void v add (Iloat \*c,

## Loop:

- 1. LOADv4 a[i:i+3] -> Rva
- 2. LOADv4 b[i:i+3] -> Rvb
- 3. ADDv4 Rva, Rvb -> Rvc
- 4. STOREv4 Rvc -> c[i:i+3]
- 5. ADDi + 4 -> i

## Loop:

- 1. LOAD a[i] -> Ra
- 2. LOAD b[i] -> Rb
- 3. ADD Ra, Rb -> Rc
- 4. STORE Rc -> c[i]
- 5. ADD i + 1 -> i

```
void v_add (float *c, float *a, float *b)
{
    for (int i=0; i<= MAX; i++)
        c[i]=a[i]+b[i];
}</pre>
```

#### PROBLEM:

This LOOP is NOT LEGAL to (automatically) VECTORIZE in C / C++ (without more information).

- Arrays not really in the language
- Pointers are, evil pointers!

## vectorization solutions

- 1. auto-vectorization (let the compiler do it when legal)
  - sequential languages and practices gets in the way
- 2. give the compiler hints
  - C99 "restrict" keyword (implied in FORTRAN since 1956)
  - Traditional pragmas like "#pragma IVDEP"
  - Cilk Plus #pragma simd
  - OpenMP 4.0 #pragma omp simd
- 3. code differently
  - simd instruction intrinsics
  - Cilk Plus array notations
  - Cilk Plus \_\_declspec (vector)
  - OpenMP 4.0 #pragma omp declare simd
  - OpenCL / CUDA kernel functions

## vectorization solutions

- 1. auto-vectorization (let the compiler do it *when legal*)
  - sequential languages and practices gets in the way
- 2. give the compiler hints
  - C99 "restrict" keyword (implied in FORTRAN since 1956)
  - Traditional pragmas like "#pragma IVDEP"
  - Cilk Plus #pragma simd
  - OpenMP 4.0 #pragma omp simd
- 3. code differently
  - simd instruction intrinsics
  - Cilk Plus array notations
  - Cilk Plus \_\_declspec (vector)
  - OpenMP 4.0 #pragma omp declare simd
  - OpenCL / CUDA kernel functions

In all cases, studying "vectorization reports" can become a way of life.

# C99 "restrict" keyword

#### 2. give the compiler hints

C99 "restrict" keyword

Traditional pragmas like "#pragma IVDEP"

Cilk Plus #pragma simd

OpenMP 4.0 #pragma omp simd

#### 3. code differently

simd instruction intrinsics
Cilk Plus array notations
Cilk Plus \_\_declspec (vector)
OpenMP 4.0 #pragma omp declare simd
OpenCL / CUDA kernel functions

## **IVDEP**

#### 2. give the compiler hints

C99 "restrict" keyword

Traditional pragmas like "#pragma IVDEP"

Cilk Plus #pragma simd

OpenMP 4.0 #pragma omp simd

#### 3. code differently

simd instruction intrinsics

Cilk Plus array notations

Cilk Plus \_\_declspec (vector)

OpenMP 4.0 #pragma omp declare simd

OpenCL / CUDA kernel functions

## simd

### 2. give the compiler hints

C99 "restrict" keyword
Traditional pragmas like "#pragma IVDEP"
Cilk Plus #pragma simd
OpenMP 4.0 #pragma omp simd

### 3. code differently

## omp simd

#### 2. give the compiler hints

C99 "restrict" keyword
Traditional pragmas like "#pragma IVDEP"
Cilk Plus #pragma simd

OpenMP 4.0 #pragma omp simd

### 3. code differently

## simd instruction intrinsics

```
void v add (float *c,
                     float *a,
                     float *b)
                                                          2. give the compiler hints
                                                               C99 "restrict" keyword
        m128* pSrc1 = (m128*)
                                                               Traditional pragmas like "#pragma IVDEP"
                                                               Cilk Plus #pragma simd
        m128* pSrc2 = ( m128*) b;
                                                               OpenMP 4.0 #pragma omp simd
        m128* pDest = ( m128*) c;
                                                          3. code differently
     for (int i=0; i<= MAX/4; i++)
                                                               simd instruction intrinsics
                                                               Cilk Plus array notations
            *pDest++ = mm add ps(*pSrc
                                                                   'pSrc2++)
                                                               openivir 4.0 "prubina omp acciare simd
                                                               OpenCL / CUDA kernel functions
```

# array operations

Challenge: long vector slices can cause cache issues; fix is to keep vector slices short.

#### 2. give the compiler hints

C99 "restrict" keyword

Traditional pragmas like "#pragma IVDEP"

Cilk Plus #pragma simd

OpenMP 4.0 #pragma omp simd

### 3. code differently

simd instruction intrinsics

Cilk Plus array notations

Cilk Plus \_\_declspec (vector)

OpenMP 4.0 #pragma omp declare simd

OpenCL / CUDA kernel functions

# \_\_declspec(vector)

#### 2. give the compiler hints

C99 "restrict" keyword
Traditional pragmas like "#pragma IVDEP"
Cilk Plus #pragma simd
OpenMP 4.0 #pragma omp simd

### 3. code differently

## #pragma omp declare simd

#### 2. give the compiler hints

C99 "restrict" keyword
Traditional pragmas like "#pragma IVDEP"
Cilk Plus #pragma simd

OpenMP 4.0 #pragma omp simd

### 3. code differently

simd instruction intrinsics

Cilk Plus array notations

Cilk Plus \_\_declspec (vector)

OpenMP 4.0 #pragma omp declare simd

OpenCL / CUDA kernel functions

### kernel

```
kernel void v add (global const float *c,
                      global const float *a,
                      global const float *b)
                                                            2. give the compiler hints
                                                                 C99 "restrict" keyword
         int id = get global id(0);
                                                                 Traditional pragmas like "#pragma IVDEP"
                                                                 Cilk Plus #pragma simd
         c[id]=a[id]+b[id];
                                                                 OpenMP 4.0 #pragma omp simd
                                                            3. code differently
                                                                 simd instruction intrinsics
                                                                 Cilk Plus array notations
                                                                 Cilk Plus declspec (vector)
                                                                 OpenMP 4.0 #pragma omp declare simd
                                                                 OpenCL / CUDA kernel functions
```

### Many choices... I recommend:

# omp simd

#### 2. give the compiler hints

C99 "restrict" keyword

Traditional pragmas like "#pragma IVDEP"

Cilk Plus #pragma simd

OpenMP 4.0 #pragma omp simd

### 3. code differently

Course Notes ONLY material

## Memory

optimizing the sharing & movement of data

is very often more important than

optimizing calculations

TREND: "very often" heads toward "always"

## Data Layout: AoS vs. SoA





Structure of arrays (SoA) can be easily aligned to cache boundaries and is vectorizable.

Array of structures (AoS) tends to cause cache alignment problems, and is hard to vectorize.

## Data Layout: Alignment



**Course Notes ONLY material** 

## sharing

decompose data to minimize sharing between tasks (threads)

beware: false sharing (it's very real)

### Think Parallel

 Parallelism does NOT work well if "added" at the last minute in a program

- think about every thing "how to do in parallel cleanly"

### **Think Parallel**

Abstract parallelism is best – can be taught without learning computer architecture

 Contradicting the desire to ignore computer architecture as a driver for programming...

Computer architecture basics matter – most of all: movement of data needs to be minimalized



http://www.multithreadingandvfx.org/ - downloads include material from today 2014: Multithreading for Visual Effects

Expansion of material covered in this tutorial plus some extras.

English

### http://threadingbuildingblocks.com

2007: Intel Threading Building Blocks Remains an excellent introduction to TBB; newer features are not covered English, Chinese, Japanese, Korean









http://parallelbook.com website has free teaching materials

2012: Structured Parallel Programming Excellent text for essentials of parallel programming

for C/C++ English, Japanese





### http://lotsofcores.com





High Performance Parallelism Pearls

High Performance Parallelism 2015 ... Pearls Volume Two 2014 ... Pearls Volume One

English



2013: Intel® Xeon Phi™ Coprocessor High Performance Programming English, Japanese, Chinese



# Thank you!



@multithreadvfx



www.multithreadingandvfx.org



"Multithreading for Visual Effects"

# Agenda

| 9:00am  | Start                                              |                                   |
|---------|----------------------------------------------------|-----------------------------------|
| 9:00am  | Introduction                                       | James Reinders, Intel             |
| 9:05am  | Multithreading Introduction and Overview           | James Reinders, Intel             |
| 9:45am  | Parallelism in Houdini - practical lessons learned | Jeff Lait, Side Effects Software  |
| 10:30am | Break                                              |                                   |
| 10:45am | <b>GPU Rigid Body Simulation Using OpenCL</b>      | Erwin Coumans, Google             |
| 11:10am | Asynchronous Computation Engine for Animation      | George ElKoura, Pixar             |
| 11:40am | Parallel Evaluation of Character Rigs Using TBB    | Martin Watt, Dreamworks Animation |
| 12:15pm | Done                                               |                                   |

### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Copyright © 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. \*Other names and brands may be claimed as the property of others.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804