


Building parallel programs is easy with Thrust's power tools like parallel maps, sorts, and reductions.

$ git clone git://github.com/jaredhoberock/thrust-workshop
$ cd thrust-workshop/fun_with_points
$ scons
$ ./exercise
$ ./spoilers

(Requires scons and CUDA.)

Fun with Points!

Have you ever wondered how to build programs that run on parallel processors like GPUs? It turns out it's easy if you think parallel.

In this post, we'll become familiar with algorithms such as transform, sort, and reduce to implement common parallel operations such as map and histogram construction. And they'll run on the GPU.

In exercise.cu, we have a C++ program which generates some random two-dimensional points, finds the centroid of those points, and then identifies which quadrant of the square centered at the centroid each point lies in.

We're dying to find out how many of these points are in each of the four quadrants.

At a high level, the program looks like this:

int main()
{
  const size_t num_points = 10000000;

  std::vector<float2> points(num_points);
  generate_random_points(points);

  float2 centroid = compute_centroid(points);

  std::vector<int> quadrants(points.size());
  classify_points_by_quadrant(points, centroid, quadrants);

  std::vector<int> counts_per_quadrant(4);
  count_points_in_quadrants(points, quadrants, counts_per_quadrant);

  std::cout << "Per-quadrant counts:" << std::endl;
  std::cout << "  Bottom-left : " << counts_per_quadrant[0] << " points" << std::endl;
  std::cout << "  Bottom-right: " << counts_per_quadrant[1] << " points" << std::endl;
  std::cout << "  Top-left    : " << counts_per_quadrant[2] << " points" << std::endl;
  std::cout << "  Top-right   : " << counts_per_quadrant[3] << " points" << std::endl;
  std::cout << std::endl;

  return 0;
}

Let's port this C++ program to run on the GPU using the Thrust parallel algorithms library. Since the program is already broken down into a high level description using functions with names like generate_random_points and count_points_in_quadrants that operate on large collections of data, it'll be a breeze.

To follow along with this post in exercise.cu, type scons exercise at your command line to build the program, and ./exercise to run it. You'll need CUDA installed (for NVIDIA's compiler, which compiles .cu files) as well as SCons, the build system we'll use.

If you get stuck, you can peek at the full solution in spoilers.cu.

Plan of Attack

To port this program to run on the GPU, let's attack it in a series of steps. Since the program is already nicely factored into a series of function calls, all we need to do is:

  1. Parallelize each call individually
  2. Point each parallel call at the GPU

Generating the Points

Let's start out with taking a look at the inside of generate_random_points:

void generate_random_points(std::vector<float2> &points)
{
  // sequentially generate some random 2D points in the unit square
  for(int i = 0; i < points.size(); ++i)
  {
    float x = float(rand()) / RAND_MAX;
    float y = float(rand()) / RAND_MAX;

    points[i] = make_float2(x,y);
  }
}

It seems like there's a big opportunity for parallelism here because we're performing the same operation a huge number of times across a large collection of data, but there are some obstacles to overcome first.

This operation is basically a sequential for loop that calls rand a bunch of times. That means that each iteration of this loop gets executed one at a time, in order.

To make matters worse, we know that the reason you get a different number each time you call rand is because there's some implicit shared, mutable state inside that gets updated with each call. That means each iteration of our loop depends on the previous.

To parallelize this operation, we'll need to rethink our algorithm to come up with a computation we can perform independently of anything else going on. We'd best avoid state.

Besides the stateful method used by rand, it turns out another reasonable way to generate pseudorandom numbers is with a stateless integer hash, like this one right here:

__host__ __device__
unsigned int hash(unsigned int x)
{
  x = (x+0x7ed55d16) + (x<<12);
  x = (x^0xc761c23c) ^ (x>>19);
  x = (x+0x165667b1) + (x<<5);
  x = (x+0xd3a2646c) ^ (x<<9);
  x = (x+0xfd7046c5) + (x<<3);
  x = (x^0xb55a4f09) ^ (x>>16);
  return x;
}

No state here -- to get a number, all we need to do is stick an integer in. But what's this __host__ __device__ business? That's there to let the CUDA C++ compiler know that the hash function can be called from either the __host__ (the CPU) or the __device__ (the GPU). Without it, our program won't compile.
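To convince ourselves the hash really is stateless, here's a host-only sketch (the same bit-mixing steps with the CUDA qualifiers dropped; the name hash32 is ours): calling it twice with the same input always yields the same output, and dividing by the maximum 32-bit value lands the result in [0, 1].

```cpp
#include <cstdint>

// A host-only sketch of the same stateless hash (CUDA qualifiers dropped;
// the name hash32 is ours). No hidden state: the output depends only on x.
std::uint32_t hash32(std::uint32_t x)
{
    x = (x + 0x7ed55d16u) + (x << 12);
    x = (x ^ 0xc761c23cu) ^ (x >> 19);
    x = (x + 0x165667b1u) + (x << 5);
    x = (x + 0xd3a2646cu) ^ (x << 9);
    x = (x + 0xfd7046c5u) + (x << 3);
    x = (x ^ 0xb55a4f09u) ^ (x >> 16);
    return x;
}
```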

But what about that sequential for loop? That's where Thrust comes in. Thrust has a large suite of algorithms for solving parallel problems like this one.

In particular, we can use tabulate to call our hash function for each point in our set. Let's see how to hook it up:

struct random_point
{
  __host__ __device__
  float2 operator()(unsigned int x)
  {
    return make_float2(float(hash(x)) / UINT_MAX, float(hash(2 * x)) / UINT_MAX);
  }
};

void generate_random_points(std::vector<float2> &points)
{
  thrust::tabulate(points.begin(), points.end(), random_point());
}

tabulate fills all the points from begin to end with a point created by random_point. Each time it calls the random_point function object, it passes the index of the element in question. The whole thing happens in parallel -- we have no idea in which order the points will be created. This gives Thrust a lot of flexibility in choosing how to execute the algorithm.
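If you're wondering exactly what tabulate computes, here's a sequential sketch (the function name is ours, not Thrust's): element i of the output becomes f(i), and no iteration reads any other element -- which is precisely why Thrust is free to run the calls in any order.

```cpp
#include <vector>
#include <cstddef>

// A sequential sketch of tabulate's semantics (the name is ours):
// element i of the output becomes f(i). No iteration depends on any
// other, so the calls could happen in any order.
template <typename T, typename Function>
void tabulate_sketch(std::vector<T> &out, Function f)
{
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = f(i);
}
```

For example, tabulating a vector of four ints with i * i fills it with 0, 1, 4, 9.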

Finding the Centroid

The next thing we need to do is take our points and find their centroid (average). The sequential code looks like this:

float2 compute_centroid(const std::vector<float2> &points)
{
  float2 sum = make_float2(0,0);

  // compute the mean
  for(int i = 0; i < points.size(); ++i)
  {
    sum = sum + points[i];
  }

  return make_float2(sum.x / points.size(), sum.y / points.size());
}

Here, we're just summing up all the points and then dividing by the number at the end. Parallel programmers call this kind of operation a reduction because we've taken a collection of things (points) and reduced them to a single thing (the centroid).

With Thrust, reductions are easy -- we just call reduce:

float2 compute_centroid(const std::vector<float2> &points)
{
  float2 init = make_float2(0,0);
  float2 sum = thrust::reduce(points.begin(), points.end(), init);
  return make_float2(sum.x / points.size(), sum.y / points.size());
}

We start by choosing an initial value for the reduction -- init -- which initializes our sum to zero. Then we tell Thrust we want to reduce all the points from begin to end using init as the initial value of the sum. (If points was empty, reduce would simply return init.)

By default, reduce assumes we want to compute a mathematical sum, but it's actually a higher order function like tabulate. If we wanted to compute some other kind of reduction besides a sum, we could pass an associative function object that defines what it means to combine two points together.
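A sequential picture of this, sketched with std::accumulate (function names here are ours): the default reduction folds everything into init with a sum, and an extra associative operation slots in the same way reduce's optional functor argument does.

```cpp
#include <vector>
#include <numeric>

// A sequential sketch of reduce's semantics (function names are ours).
int sum_reduce(const std::vector<int> &v, int init)
{
    // like thrust::reduce(first, last, init): fold the range into init with +
    return std::accumulate(v.begin(), v.end(), init);
}

int max_reduce(const std::vector<int> &v, int init)
{
    // like passing a custom associative functor as reduce's extra argument
    return std::accumulate(v.begin(), v.end(), init,
                           [](int a, int b) { return a > b ? a : b; });
}
```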

Classifying each Point

Before we can count how many points are in each of the four quadrants, we need to figure out which quadrant each point is in.

An easy way to do this is to compare each point to the centroid. The sequential code looks like this:

void classify_points_by_quadrant(const std::vector<float2> &points, float2 centroid, std::vector<int> &quadrants)
{
  // classify each point relative to the centroid
  for(int i = 0; i < points.size(); ++i)
  {
    float x = points[i].x;
    float y = points[i].y;

    // bottom-left:  0
    // bottom-right: 1
    // top-left:     2
    // top-right:    3

    quadrants[i] = (x <= centroid.x ? 0 : 1) | (y <= centroid.y ? 0 : 2);
  }
}

We compare each point's x and y coordinates to the centroid's, and compute a number between 0 and 3 using some leet bitwise manipulation with the | operator.

In this example, the important thing to realize is that unlike our sequential for loop from the last example, none of the iterations of this for loop have any dependency on any other iteration.

Sometimes we call these kinds of operations embarrassingly parallel, because parallelizing them is embarrassingly easy. Another term for this operation is a parallel map, because each thing (point) in our collection gets mapped to another thing (a number).

With Thrust, we can compute parallel map operations using transform (map means something else in C++):

struct classify_point
{
  float2 center;

  __host__ __device__
  classify_point(float2 c)
  {
    center = c;
  }

  __host__ __device__
  unsigned int operator()(float2 p)
  {
    return (p.x <= center.x ? 0 : 1) | (p.y <= center.y ? 0 : 2);
  }
};

void classify_points_by_quadrant(const std::vector<float2> &points, float2 center, std::vector<int> &quadrants)
{
  thrust::transform(points.begin(), points.end(), quadrants.begin(), classify_point(center));
}

transform works kind of like tabulate, but instead of automatically generating a series of integer indices for us, transform passes each point from begin to end to our classify_point function object.

Each call to classify_point performs the same operation as each iteration of our sequential for loop. However, instead of assigning the result to quadrants[i], classify_point returns it. Since we passed quadrants.begin() to transform, it knows where each result should go.
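Here's a sequential sketch of what transform computes with our classifier (a plain struct stands in for float2, and the names are ours): out[i] = f(in[i]), with no cross-iteration dependencies.

```cpp
#include <vector>
#include <cstddef>

// A sequential sketch of transform with the quadrant classifier
// (a plain struct stands in for float2; names are ours).
struct pt { float x, y; };

int classify(pt p, pt center)
{
    // same bit trick as classify_point's operator()
    return (p.x <= center.x ? 0 : 1) | (p.y <= center.y ? 0 : 2);
}

std::vector<int> transform_sketch(const std::vector<pt> &points, pt center)
{
    std::vector<int> quadrants(points.size());
    for (std::size_t i = 0; i < points.size(); ++i)
        quadrants[i] = classify(points[i], center); // out[i] = f(in[i])
    return quadrants;
}
```

With the centroid at (0.5, 0.5), the four corners of the unit square land in quadrants 0 through 3.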

Tallying Up Each Quadrant

We're almost done. The last thing we need to do is answer our original burning question: how many of our points are in each quadrant?

When we have a set of buckets (quadrants) and we want to count how many things (points) fall in each, we sometimes say we want to compute a histogram.

The sequential code counts up the points by looping over them and incrementing a counter in each bucket:

void count_points_in_quadrants(std::vector<float2> &points, std::vector<int> &quadrants,
                               std::vector<int> &counts_per_quadrant)
{
  // sequentially compute a histogram
  for(int i = 0; i < quadrants.size(); ++i)
  {
    int q = quadrants[i];

    // increment the number of points in this quadrant
    ++counts_per_quadrant[q];
  }
}

Hmm... there's that shared state problem again. The iterations of this loop depend on each other, because we expect many points to fall into the same quadrant and contend to increment the same counter. If we tried to perform all these increments at once in parallel with an algorithm like transform, they would all run into each other! We'll have to figure out a different way to build this histogram.

You may notice that our count_points_in_quadrants function takes points as a parameter, but it doesn't do anything with them. That's a hint that to parallelize this operation, we'll need to think about reorganizing our input data to make our job easier.

This problem of counting up a collection of items might remind you of our earlier use of reduce. Like before, we have a collection of things (points in a quadrant) and we want to reduce them to a single thing (the number of points in each quadrant).

But it doesn't make sense to just call reduce again -- it seems like reduce could only count up the number of points in a single quadrant. So we'd have to call reduce several times -- once for each quadrant. But how do we find the points associated with a particular quadrant and pick them out for reduce?

It turns out if we're willing to sort our data, we can bring all the points from a particular quadrant together so that we can reduce them all at once. We call this kind of operation sorting by key because we're sorting a collection of things (points) by a key (quadrant number) associated with each of them.

This is super easy with the sort_by_key algorithm:

thrust::sort_by_key(quadrants.begin(), quadrants.end(), points.begin());

Here, we're telling Thrust to consider all the quadrant numbers from begin to end as sorting keys for the points beginning at points.begin(). Afterwards, both collections will be sorted according to the keys.
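Here's a sequential sketch of sort_by_key's effect (the function name is ours): both arrays get reordered so the keys come out ascending, with each value still paired with its original key.

```cpp
#include <vector>
#include <algorithm>
#include <numeric>
#include <cstddef>

// A sequential sketch of sort_by_key's effect (the name is ours):
// reorder both arrays so the keys end up ascending and each value
// still travels with its original key.
void sort_by_key_sketch(std::vector<int> &keys, std::vector<char> &values)
{
    // sort a permutation of indices by key, then apply it to both arrays
    std::vector<std::size_t> perm(keys.size());
    std::iota(perm.begin(), perm.end(), std::size_t(0));
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    std::vector<int>  sorted_keys(keys.size());
    std::vector<char> sorted_values(values.size());
    for (std::size_t i = 0; i < perm.size(); ++i)
    {
        sorted_keys[i]   = keys[perm[i]];
        sorted_values[i] = values[perm[i]];
    }
    keys.swap(sorted_keys);
    values.swap(sorted_values);
}
```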

So now our points are sorted, but so what? How does sorting help us count them? How do we use reduce?

Since by sorting we've brought all of the things (points) with the same key (quadrant number) together, we can do a special kind of reducing by key operation to count them all at once:

thrust::reduce_by_key(quadrants.begin(), quadrants.end(),
                      thrust::constant_iterator<int>(1),
                      thrust::discard_iterator<>(),
                      counts_per_quadrant.begin());

Whoa... what just happened?

Let's break it down piece by piece:

Like sort_by_key, reduce_by_key takes a collection of keys:

quadrants.begin(), quadrants.end() // The key is the quadrant number again.

And a collection of values:

thrust::constant_iterator<int>(1) // The endlessly repeating sequence 1, 1, 1, ...

And reduces each span of contiguous values with the same key. For each span, it returns the key:

thrust::discard_iterator<>() // We're not interested in retaining the key; just drop it on the floor.

And the reduced value:

counts_per_quadrant.begin() // The result we're actually interested in.

Just like reduce, reduce_by_key's default reduction is a sum. So for each key, we're summing up the value 1, which comes from that constant_iterator thing.
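Putting it together, here's a sequential sketch of what reduce_by_key computes over sorted keys with constant values of 1 (the name count_runs is ours): each run of equal keys collapses to its length, which is exactly our histogram.

```cpp
#include <vector>
#include <utility>

// A sequential sketch of reduce_by_key over sorted keys with constant
// values of 1 (the name count_runs is ours): each contiguous run of
// equal keys collapses to (key, run length) -- a histogram, once the
// keys are sorted.
std::vector<std::pair<int, int>> count_runs(const std::vector<int> &keys)
{
    std::vector<std::pair<int, int>> runs;
    for (int k : keys)
    {
        if (runs.empty() || runs.back().first != k)
            runs.push_back({k, 1});  // a new key starts a new run
        else
            ++runs.back().second;    // same key: sum in another 1
    }
    return runs;
}
```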

Iterators are like pointers. They're how Thrust knows where to find the inputs and outputs to each algorithm. The .begin() and .end() thingies we've used are examples but you can also get fancy with iterators like constant_iterator and discard_iterator to generate data on the fly.

Here's the whole function:

void count_points_in_quadrants(std::vector<float2> &points, std::vector<int> &quadrants,
                               std::vector<int> &counts_per_quadrant)
{
  // sort points by quadrant
  thrust::sort_by_key(quadrants.begin(), quadrants.end(), points.begin());

  // count points in each quadrant
  thrust::reduce_by_key(quadrants.begin(), quadrants.end(),
                        thrust::constant_iterator<int>(1),
                        thrust::discard_iterator<>(),
                        counts_per_quadrant.begin());
}

Pointing it at the GPU

So we're all done, right? Not quite. Our plan was to attack our porting problem in two parts: first by reorganizing our code into high-level parallel operations, and then by pointing those parallel operations at the GPU.

By default, whenever the inputs to Thrust algorithms come from things like std::vector, Thrust executes those algorithms sequentially on the CPU.

Remember how I said that describing our program as a composition of high-level parallel algorithms gives Thrust a lot of flexibility in deciding how to execute? Here's where we take advantage of that flexibility.

Porting our program to run on the GPU is the easiest part. To point Thrust at the GPU, all we need to do is s/std::vector/thrust::device_vector/ and we're set. device_vector is a special kind of vector container that sticks its data in memory that's easy for the GPU to access. Whenever a Thrust algorithm gets its input and output from a device_vector, that algorithm will execute on the GPU in parallel.

(If you're uninterested in GPUs, that's okay too -- it's easy to target Thrust at multicore CPUs.)


Now it's time to find out if all our hard work was worth it. performance.cu provides an instrumented version of the solution we can use for measuring its performance. Since we've built our solution using thrust::device_vector, it's easy to switch between building a program which targets the CPU or the GPU on the command line:

# build the cpu solution
$ scons cpu_performance
$ ./cpu_performance 
Warming up...


20.5157 megapoints generated and counted per second.

# build the gpu solution
$ scons gpu_performance
$ ./gpu_performance 
Warming up...


433.961 megapoints generated and counted per second.

For the GPU (an NVIDIA Tesla K20c) version, that's over 20 times the performance of the sequential version of the solution running on the CPU (an Intel Core i7 860). Not too bad!

It's not really the fairest of comparisons -- the GPU is fairly recent and beefy while the CPU is getting a little long in the tooth -- but I hope the outcome has persuaded you that the general strategy of building parallel programs out of high-level, reusable building blocks goes a long way towards ensuring parallel applications stay performance portable and scalable across different architectures.

It's true that a lower-level program which targeted individual threads, optimized for hardware-specific cache sizes, or programmed CUDA kernels directly might net us higher performance for either solution, but the cost of pursuing such a path would be significant. Moreover, it's unlikely that the hard-won performance of a lower-level solution would be portable across generations of hardware within the same processor family, much less the vastly different architectures of the CPU and GPU.

Since we've built our program out of abstract parallel primitives which are agnostic to processor-specific idiosyncrasies, we can be comfortable that the performance of our program will continue to scale as parallel processor architectures grow wider.

Wrapping Up

In working through this example, we saw how hazards like shared mutable state hidden in sequential programs require us to think differently when going parallel. One way to think parallel is to adopt the mindset of building programs composed of abstract, high-level algorithms like thrust::sort operating on collections like thrust::device_vector. Operating at a higher level of abstraction allows us to stay portable and retarget our applications at different parallel platforms with ease.

This is often easier said than done. In our example, we had the luxury of building a simple parallel application from scratch. In real world legacy codebases, any high-level parallelism inherent to the problem is often obscured by low-level sequential constructs. Teasing the implicit parallelism out of an existing sequential program can be tricky.

Since parallel processors are here to stay, it makes sense to first identify the high-level parallel structure of a problem whenever embarking on a new programming adventure. Thrust is a great way to build parallel C++ programs while remaining performance portable across different architectures.

If you're interested in learning more about Thrust, check out the quick start guide or browse the extensive collection of example programs to discover the parallelism hidden in everyday computations.