# <span id="chap_parallel"></span> Parallel processing for network science

So far so good. We've shown how we can analyse complex networks and their processes analytically, how we can simulate them, and how we can use the interactive capabilities of IPython to control computations, gather the results, produce graphs, and then make the whole lot available on the web.

If you've taken the opportunity to look at any of the research papers mentioned in the text so far, however, you may have noticed something that makes you slightly uncomfortable. Simulations need to be run several times for different parameter values to even out the stochastic nature of the network/process interactions. Complex network effects are often defined "in the limit" and so often only appear on large networks: in theory as $N \rightarrow \infty$, in practice with "lots" of nodes and edges. Both these factors combine to generate a *lot* of computation: $10^3$ repeated simulations for each of 30 parameter values over networks of $10^5$ nodes would not be especially unusual. Performing a simulation like this on a typical desktop workstation or laptop is clearly not going to work: even if the individual simulation runs only take a minute, a thousand of them is nearly seventeen hours of compute time, and if we do that for 30 different parameter values &ndash; well, you see the problem.

This has been a persistent narrative in computer science, of course, and it's usually tackled by appealing to [Moore's law](https://en.wikipedia.org/wiki/Moore%27s_law), the notion that the amount of computing power available at a given price doubles every eighteen months. However, the programs we commonly write are sequential, meaning that they make use of a single processor doing a single sequence of things &ndash; and the Moore's law curve is flattening out for individual processor cores, reducing the speed-up available to sequential programs over time. Extra speed-up comes, not from faster cores, but from having more cores available, either on one machine or on a collection of them. Taking advantage of this means moving from sequential to *parallel* processing with multiple "threads" of activity happening simultaneously.

Doing research-grade simulation of complex networks and complex processes, then, requires that we tackle high-performance parallel computing. While the general techniques of parallel programming are quite arcane, there are specific techniques that are easy to use both conceptually and technically &ndash; and fortunately most of network science simulation fits into these special cases.

In this part of the book we explore the approaches we need to work with large networks. We start with the [concepts](#sec_thinking_parallel) of parallel systems as far as we need to understand them for our purposes. We'll then talk about using [IPython's parallel tools](parallel-ipython.ipynb), which turn out to be well-suited to the sorts of computations we want to perform, and which mean we can utilise high-performance computing from within the same interactive environment we've been using up to now.

To use IPython's parallel tools we set up a "compute cluster", which is just a set of IPython processes dedicated to performing calculations. There are a lot of ways we can do this. We'll start by looking at [how to set up a simple cluster](parallel-simple.ipynb) that we can access either by logging-in to the machine it's running on or, more flexibly, from a [remote client](parallel-client.ipynb) such as a laptop so that you can disconnect from an on-going simulation and come back later to collect the results. Then we'll look at some [more complex cluster set-ups](parallel-complex.ipynb) such as those on a network of machines or those running in the cloud, and at [asynchronous working](parallel-async.ipynb) so we're not tied to a long-running simulation. Finally we'll return to the subject of [reproducibility](parallel-recomputation.ipynb) and how it affects, and is affected by, working at scale.

It's inevitable that this part of the book will dive very deeply into issues of system set-up and configuration that are a long way from network simulation (and even farther from the mathematics of their analysis). There are a lot of elements to get right in order to get cluster computing to work: but once we've done it, we get much shorter compute times and the ability to address larger problems more convincingly with only slight changes in the code we need to write. Hopefully that's an acceptable trade-off.

In what follows we'll assume that the machines involved are all running Linux or something similar. (Mac OS X-based clients are fine too.) Windows is less well-set-up for this sort of work, although it's perfectly possible to set things up on Windows machines as well. We'll also assume you have a moderate familiarity with the Linux command line and its basic tools. 

## <span id="sec_thinking_parallel"></span> Thinking parallel

Parallel computing simply means that a *single* program does *several* things *simultaneously*. (That's the easy part over with: it gets harder from here.)

Writing programs is a process of writing algorithms, either from scratch or by mashing-up existing libraries to provide the functionality you want. Most algorithms &ndash; and most programs built from them &ndash; are designed to run *sequentially*. When designed around arrays, for example, they might start with the first element and process it, then move to the second, and so forth: one element at a time in a particular order. This is an easy, straightforward, and above all easily-comprehensible way to describe an algorithm, and it's taught to all programmers from when they first start programming. But it's not the only way to express computation.

This "reductionist" style of programming is sometimes referred to as the *von Neumann style*. [John von Neumann](https://en.wikipedia.org/wiki/John_von_Neumann) was another giant in the history of mathematics and computing, and his design for computers &ndash; the *von Neumann architecture* consisting of a single central processing unit connected to memory, disc, and other peripherals &ndash; has influenced all the computers ever built. Although modern machines don't strictly follow von Neumann's design at the hardware level, they typically take great pains to *behave* as though they do, and it's the mental model programmers typically have of the way their computer is organised. The problem is that this forces algorithms to run sequentially because that's how they're explicitly written. This limitation has been called the *von Neumann bottleneck* &ndash; although that's unfair to von Neumann, who intended his architecture as a reference model rather than as a design for real machines that would persist for sixty-odd years.

There is another style of programming, however, that focuses on describing how a program deals with data structures as a whole. Rather than write loops that traverse arrays one element at a time, for example, this style provides functions that (for example) apply the *same* operation to *all* elements of the array in one go, or reduce all the elements through repeated application of some binary operator. The programmer writes her program to manipulate the entire array in one go using these "bulk" operators. Internally the bulk operations might be written as loops, as in the von Neumann style, but &ndash; critically &ndash; they *might* be written to work in parallel. Algorithms written in this style therefore aren't inherently sequential (although they might be realised that way): they have the *opportunity* for parallelism.
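As a minimal sketch of the difference, the following compares an element-at-a-time loop with the two bulk operations just described, using Python's built-in `map` and `functools.reduce`. The function `square` and the sample data are made up for illustration. Python's own `map` and `reduce` happen to run sequentially, which is rather the point: the bulk form of the program doesn't commit to an order of evaluation, so a parallel implementation of `map` could be substituted without rewriting it.

```python
from functools import reduce
from operator import add

def square(x):
    """Illustrative per-element operation (an assumption, not from the text)."""
    return x * x

data = [1, 2, 3, 4, 5]

# von Neumann style: visit the elements one at a time, in a fixed order
squares_loop = []
for x in data:
    squares_loop.append(square(x))

# Bulk style: apply the same operation to all the elements "in one go"
squares_bulk = list(map(square, data))

# Bulk style: reduce all the elements by repeatedly applying a binary operator
total = reduce(add, squares_bulk)
```

Both approaches compute the same list of squares; only the loop version spells out the order in which the elements are visited.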

(This style is often associated with functional programming, but that's misleading. Functional programming is concerned with lack of side effects, amongst other concerns, which do make the "bulk" style easier to work with, but aren't necessary for it. It's perfectly possible to apply bulk operations in traditional or object-oriented languages with side effects. There are plenty of reasons to adopt functional languages and a functional style, but parallelism isn't really a very good one: you can get the same benefits in a sufficiently rich imperative language, and keep access to a better range of libraries.)

The point about bulk operators is that they identify things that can happen simultaneously without interacting with each other, like applying a function independently to each element of an array. There will be other parts of a program that have to happen in a particular order. In order to apply our function in parallel to an array, for example, we first have to create the array, and afterwards we might have to print out a result. We can't apply the function until we've created the array; we can't print the result until we've done the calculation. So we have a sequential program (create array, compute, print results) that embeds a parallel compute stage within which we do as much as we can in parallel. If we have a multicore machine, the parallel compute stage might make use of all the cores in the machine, with each core performing an independent part of the computation. If we have 16 cores we can do 16 parts together and, all things being equal, the computation will only take 1/16th as long as it would take as a sequential algorithm running on a single core. The more of the program we can push into bulk operations executed as a parallel stage, the more opportunities we have for parallelism and therefore the more opportunities for speeding-up the program overall. This approach is a form of *explicit* parallelism, in the sense that we'll be identifying those computations we want to happen in parallel and coding them slightly differently to the way we code the sequential parts of programs.
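The 1/16th figure assumes that the whole computation sits inside the parallel stage. A standard way to estimate what happens when only part of it does is Amdahl's law (not named in the text above, and brought in here purely as an illustration): if a fraction $p$ of the sequential run time can be spread over $n$ cores, the best achievable speed-up is $1 / ((1 - p) + p/n)$. A minimal sketch, in which the function name `ideal_speedup` is our own:

```python
def ideal_speedup(parallel_fraction, cores):
    """Best-case speed-up under Amdahl's law.

    parallel_fraction: share of the sequential run time spent in the
        parallel stage (between 0 and 1).
    cores: number of cores available to the parallel stage.
    """
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / cores)

# Everything in the parallel stage: 16 cores give the full factor of 16
print(ideal_speedup(1.0, 16))

# 90% in the parallel stage: the sequential 10% limits the benefit
print(ideal_speedup(0.9, 16))
```

With 90% of the work in the parallel stage, 16 cores give a speed-up of roughly 6.4 rather than 16, which is why it pays to push as much of the program as possible into bulk operations.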

This is the way we'll think about parallel programming in this book: as essentially sequential programs that contain computational stages consisting of applying the *same* basic computation to *multiple* pieces of data in parallel on as many cores as we have available. In most cases the computations that interest us will be network simulations, and we'll get parallelism by performing *multiple runs* of the *same* simulation code across *different* networks. Each network can be simulated independently; potentially each repetition could be simulated simultaneously; and we'll look at how to get as much speed-up as we can from these parallel opportunities.
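The shape of such a program can be sketched as follows. Everything here is a stand-in: `simulate` is a toy stochastic function rather than a real network process, and `run_repetitions` is a hypothetical helper. The parallel stage is a single bulk `map`, so the sequential built-in can be swapped for a different implementation (here a thread pool from the standard library, purely to show the substitution) without touching the rest of the program.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """Toy stand-in for one simulation run: the fraction of 1000 coin
    flips that come up heads, using a private generator seeded per run."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(1000)) / 1000

def run_repetitions(seeds, mapper=map):
    """Sequential set-up, one bulk stage, sequential summary."""
    seeds = list(seeds)                      # sequential: create the inputs
    results = list(mapper(simulate, seeds))  # bulk stage: independent runs
    return sum(results) / len(results)       # sequential: combine the results

# The same program, with two ways of executing the bulk stage
sequential_mean = run_repetitions(range(20))
with ThreadPoolExecutor(max_workers=4) as pool:
    pooled_mean = run_repetitions(range(20), mapper=pool.map)
```

Because each run depends only on its own seed, both versions give the same answer. In CPython a thread pool won't actually speed up CPU-bound code like this, so for real work the mapper would come from something that uses separate processes, such as the IPython cluster views used in this part of the book.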

## <span id="sec_parallel_concepts"></span> Parallel concepts and issues

Modern machines are typically *multicore*, where the single central processor is composed of several independent "cores" that can run independent programs. Getting all these cores working simultaneously is the key to getting parallel speed-up. You can keep cores busy by running several different programs together, or by having parallel sections to programs that can each take a core. It's important to remember that you, as a user or as a programmer, never have to worry about assigning programs to cores. The operating system does that automatically. Your job, in order to get parallel speed-up, is to make jobs available to be scheduled onto the different cores.

Speedup

Data movement

Distribution

TBC

**----- END OF CHAPTER -----**

## <span id="sec_remote_cluster"></span> Setting up a remote cluster

A local cluster will only get you so far, however. Once you start trying to perform research-grade simulations, the sizes of the networks and the number of repetitions needed simply overwhelm a single machine, and even if you upgrade you'll soon be wanting more power. There are two solutions:

1. Use a proper computing cluster
1. Use a cloud computing provider

In this section we'll deal with running on a cluster, and [then](#sec_cloud_computing) look at running in the cloud. Both have their advantages, and the techniques used are similar.

### <span id="sec_cluster_computing"></span> Cluster computing

Computing clusters come in two flavours: *dedicated* clusters built for the purpose, and *workstation* clusters that make use of "spare" computing power available from machines on people's desks or in labs when they're not using them. They're actually not all that different from each other: dedicated clusters tend to all be the same kind of machine and typically have a shared file store which makes them somewhat easier to use. We'll focus on them first.

[DIAGRAM OF TYPICAL DEDICATED/NoW SETUPS]

Dedicated clusters typically run Linux: either a standard distro, or something more specialised like [Rocks](http://www.rocksclusters.org/) that provides some more advanced management tools. Again, the details aren't typically all that important, and we'll focus on a "likely" set-up while highlighting the differences you may encounter.

### Planning a remote cluster

The nice thing about using a system like IPython is that it isolates you, as far as possible, from the underlying mechanics of cluster-building. From IPython's perspective, you simply attach to a cluster controller (an instance of `ipcontroller` running somewhere) and then fire compute jobs at it. It doesn't matter whether it's a local compute cluster or a remote one. What could possibly go wrong?

Quite a lot, as it turns out. There are essentially three problems we need to deal with.

Firstly, we have to set up the controller and engines so that they all talk to each other within the cluster. "Talking" in this case means that both the controller and the engines can route packets to each other using an interconnect, and that they have permission to interact: a communications issue and a security issue coupled together, in other words. There are lots of ways a cluster's interconnect can be set up, but a common configuration has the engines and controller on the same local-area network (LAN) and able to communicate directly. The client may be on the same LAN, or may be on another that's accessible, or the controller may be multihomed and sit on both the client's and the engines' networks simultaneously using two different network interfaces.

Secondly, we have to get software and data to the engines. In dedicated clusters the engines (and the controller) often share a common file system, so all files are visible (with the same names) at all machines. In networks of workstations, the machines will often all have independent file systems and we'll have to handle synchronising the software and data on them.

Thirdly, we have to let the IPython client talk to the cluster controller. This may be equally easy, but is often more difficult, since client and cluster may live on different networks. Even if they do happen to be directly visible to each other, you might decide to go on the road outside your LAN &ndash; to work from home, for example &ndash; and have to access the cluster across a firewall. The cluster may be local to itself, but sits on the other side of a firewall from your client machine at least some of the time.

### <span id="sec_ssh"></span> The proper use of `ssh`

Now log-out and try to `ssh` back into `cluster`. You *should* be allowed straight in without a password, with `ssh` using the keypair to automate your log-in.

To let programs fully use `ssh` there's one more step to be performed, which is to set up an `ssh` *agent*: a program that manages log-ins on behalf of other programs. The agent maintains a "keychain" that lets you specify what key (or keys) should be automatically made available to programs when they try to build `ssh` sessions. Setting up the agent involves issuing two commands:

<tt>
ssh-agent bash<br>
ssh-add
</tt>

The first command (`ssh-agent bash`) sets up the agent and wraps it around a new instance of the `bash` shell. Within that shell, we then use `ssh-add` to add our default private key to the keychain being managed by the agent.

This may seem like quite a lot of work, but what it does is to establish the keypairs and keychains needed to securely log-in to machines without using passwords. Most people who do a lot of this sort of thing use keypairs all the time when using `ssh`, as (once it's set up) it's fundamentally less hassle than remembering and typing passwords all the time.

where *cluster* is the name of the profile, which seems fairly sensible in general but you might want to name your profile after the name of the cluster it describes. (In St Andrews I sometimes use the *blob* cluster, so I have a profile called *blob*. If I'm using a cloud computing set-up on Microsoft Azure, I use a profile called `azure`.) This command creates all the configuration files we need, with the `--parallel` option including the files necessary for cluster computing. We do, however, need to edit things slightly.

We start with the file `.ipython/profile_cluster/ipcontroller_config.py` (assuming our profile was called *cluster*). Opening this file in an editor will reveal a load of commented-out Python code that sets various parameters that control the behaviour of the controller. Most of these can be left as they are for the vast majority of installations, but we need to make the following modifications by uncommenting variables and setting their values:

1. Set `c.IPControllerApp.ssh_server` to the name of the machine that hosts the `ssh` server we'll use to access the cluster from the client. Typically this is the public name of the cluster head *as seen from outside*.
2. Set `c.IPControllerApp.reuse_files` to `True`. We'll explain this later. 
3. Set `c.HubFactory.ip` to  `u'*'`. This will cause the controller to listen on all network interfaces for engines.
4. Set `c.IPControllerApp.location` to the name of the cluster head *as seen by engines*.

(It's completely acceptable to add these definitions at the end of the file and leave everything else as it is.) What do we mean by the names of machines as seen by the client and by engines? In part ??? of the figure [above](#sec_cluster_computing) the worker nodes in a cluster are on their own interconnect, without full access to the internet and with a special name set up for the cluster head. The cluster head node itself, by contrast, is typically accessible more or less publicly, and so will have a public name. In simpler set-ups (like part ??? of the figure [above](#sec_cluster_computing)) the two names will be the same. In my case the worker nodes think the cluster head is called `blob-cs.local` while the outside world thinks it's called `blob.cs.st-andrews.ac.uk`, the machine I can log-in to. The machine is actually multi-homed with two separate network interface cards. These two machine names would be assigned to `c.IPControllerApp.location` and `c.IPControllerApp.ssh_server` respectively.
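Put together, the additions to `ipcontroller_config.py` might look like the fragment below. It uses the four option names listed above and the two host names from the *blob* example; substitute your own, and treat it as a sketch, since option names may differ between releases and are worth checking against the commented-out entries in your generated file.

```python
# Appended to ~/.ipython/profile_cluster/ipcontroller_config.py
# `c` is the configuration object already defined at the top of the
# generated file. Host names are the examples from the text, not defaults.

# 1. ssh server the client connects through: the head's public name
c.IPControllerApp.ssh_server = u'blob.cs.st-andrews.ac.uk'

# 2. Keep the same connection files across controller restarts
c.IPControllerApp.reuse_files = True

# 3. Listen for engines on all network interfaces
c.HubFactory.ip = u'*'

# 4. The head's name as seen by the engines
c.IPControllerApp.location = u'blob-cs.local'
```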

That's it as far as the cluster is concerned.

### Starting the cluster

We're now ready to start the cluster. When we did this earlier we simply started everything from the IPython notebook using the `ipcluster` shell command. We can't do this for a remote cluster, as a typical workflow is to start the cluster and leave it running, connecting from outside as required. The cluster topology is also often more complicated, in that the controller talks to multiple engines, each living on a different machine and needing to be started up: this is after all the point, to get access to lots of machines' computing power simultaneously.

The mechanism is very simple. We first start a controller on the head node, then start engines on the worker nodes, and finally ship some security information to the client to let it connect.

Starting the controller is simple. On the head node just run:

<tt>
(ipcontroller --profile=cluster &)
</tt>

(again assuming *cluster* is the name of your profile). This puts the controller into the background: you'll see some status information appearing in the terminal, which can usually just be taken as reassurance that something is happening.

How you start the engines depends on the cluster management software, and can range from having to log-in to all the machines individually to running some more automated process. In the manual case, suppose you had a cluster consisting of three machines called `tom`, `dick`, and `harry`. You would need to log-in to each machine and start an engine:

<tt>
ssh tom '(ipengine --profile=cluster &)' <br>
ssh dick '(ipengine --profile=cluster &)' <br>
ssh harry '(ipengine --profile=cluster &)'
</tt>

In order for this to work, `tom`, `dick`, and `harry` will all need to have your `ssh` public key installed as explained [above](#sec_ssh), and you'll need to run the commands from a shell that's being managed by an `ssh` agent. 
By passing the profile name to `ipengine` we cause it to find and connect to the controller, using the configuration we set up earlier. If your cluster runs a cluster-friendly Linux distro like Rocks, starting the engines is simpler as there is support for running a command on all hosts:

<tt>
rocks run host command='(ipengine --profile=cluster &)'
</tt>

Much status information will appear, again hopefully just confirming things are working. At this stage we hopefully have a cluster controller running on the head node, connected to engines sitting ready for work on the worker nodes.

[ADD SOME SCREENSHOTS OF ALL THIS]

[TROUBLESHOOTING]

### Connecting to the cluster

Finally we need to configure the client machine to connect to the controller from IPython. We need a final step to make the security work: IPython will not let *anyone* run jobs on a cluster, even if they can log-in to the cluster head. In order to use a cluster you need a security capabilities file that describes the machine names and connection protocols, and also provides a shared secret that your code presents to the cluster controller.

When the controller starts, it stores connection metadata in `.ipython/profile_cluster/security/ipcontroller-client.json`. We copy this file from the cluster head to the client. We don't need to create a complete IPython profile client-side as we did before (although it does no harm to do so): we just need the security file to be accessible. If we open a connection to the cluster, IPython will use the profile description to find the cluster head, log-in using `ssh`, authenticate itself to the controller, and create the necessary connection ready to use the engines.

Phew!

You may be happy to learn that we don't have to go through this process every time we start the cluster. The reason for this is the slightly mysterious line in the cluster's profile configuration in step 2 [above](#sec_configure_remote_cluster) where we did `c.IPControllerApp.reuse_files = True`. If we kill the cluster (or it dies for some reason), when we re-start the controller it will re-use the same key file, which means that authentication from existing clients will still work. However, if you edit the cluster's configuration for any reason, you'll need to remove the key files, let the controller create new ones, and then copy these down to the client. Once the system is set up and known to be working, though, this very seldom happens. 

### <span id="sec_remote_cluster_connect"></span> An example of how to connect

Let's assume things go according to plan. We've downloaded the `.ipython/profile_cluster/security/ipcontroller-client.json` file locally so it sits alongside our IPython notebook. We can fire-up a connection to the cluster using code like the following:

In [None]:
from ipyparallel import Client

cluster = Client(url_file = 'ipcontroller-client.json')

Compare this to what we had [before](#sec_parallel_ipython_programming) and you'll see that the only change is to provide a link to the security file, which contains all the information we need to get the connection going under most circumstances. We can now use the same code we used [earlier](#sec_parallel_ipython_programming) to get the engines working:

In [9]:
with cluster[:].sync_imports():
    import numpy
view = cluster.load_balanced_view()

view.map_sync(factorial, ns)

importing numpy on engine(s)


[2,
 720,
 3628800,
 2432902008176640000,
 265252859812191058636308480000000L,
 815915283247897734345611269596115894272000000000L]

So parallel IPython code runs unchanged on a small local cluster or a potentially much larger remote one. The view will handle allocating jobs to engines, and the more engines are available the more jobs will work in parallel and the more speed-up we will potentially achieve. However, designing complex code and getting decent speed-up requires a little care, as we'll explore [in a moment](#sec_ipython_parallal_programming_practice) after we've dealt with some common complications to cluster set-up.

### Some more complicated set-ups

The description above will get you started with a cluster on the most common kind of dedicated cluster, where all the machines sit on the same interconnect with a common file system and you manage everything from your own user account. There are various ways in which real-world installations diverge from this norm: too many to explain in detail, so we'll settle for explaining some of the most common and suggest that you find a local Linux systems guru to help out if you get confused.

**Non-shared file systems.** In dedicated clusters, the workers and cluster head machines typically share a networked file system. If, however, you're making use of a network of workstations &ndash; a set of machines lying around a lab that you can use for computation &ndash; this often isn't the case, and you'll need to copy files around.

We have to deal with two sets of data:

* the IPython profile that describes the cluster set-up; and
* the `ssh` keys needed for log-ins.

Fortunately these are easily managed: just copy `~/.ipython/profile_cluster` (assuming your cluster profile is called *cluster* as before) from the cluster head to all the workers. Likewise, copy `~/.ssh` to copy the `ssh` keypairs. (You can use `scp` to copy these directories, with the `-r` flag to make it work recursively.) If you change any of the cluster configuration you'll need to re-copy everything, and there are few things more frustrating than trying to debug a fault caused by files not being the same everywhere, so be careful!

**Different users.** It may be that your username on your cluster is different to your username on your client workstation. In that case you need to do two things. Firstly, you need to make sure you install the IPython profile and `ssh` credentials into the right user's home directory. Secondly, you need to tell IPython that it needs to log-in as a different user on the cluster head than you're using on the client. The simplest way to do this is to edit the `ipcontroller-client.json` file you copy down from the cluster head. If you open this file in an editor you'll see a line of the form:

<tt>
   "ssh": "cluster",
</tt>

(where `cluster` is the name of your cluster head machine.) If your username on the cluster is `john`, change this line to:

<tt>
   "ssh": "john@cluster",
</tt>

and IPython should use the correct user. This situation occurs fairly commonly in cloud computing, by the way, when you often have less choice about usernames.
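If you'd rather not edit the file by hand, the same change can be scripted. The sketch below assumes only what's described above, namely that the connection file is JSON with an `"ssh"` entry holding the name of the cluster head; the helper name `set_ssh_user` is our own.

```python
import json

def set_ssh_user(url_file, username):
    """Rewrite the "ssh" entry of a connection file as user@host."""
    with open(url_file) as f:
        info = json.load(f)

    # Drop any user already present, keeping just the host name
    host = info["ssh"].split("@")[-1]
    info["ssh"] = "{}@{}".format(username, host)

    with open(url_file, "w") as f:
        json.dump(info, f, indent=2)
    return info["ssh"]
```

Calling `set_ssh_user('ipcontroller-client.json', 'john')` on a file whose entry is `"cluster"` would rewrite it as `"john@cluster"`, leaving every other entry in the file untouched.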

**Dedicated `ssh` keys.** If you're a properly paranoid programmer you'll have been worried above that you had to put your private `ssh` key on the cluster head, rather than keeping it really safe on your own machine. Since anyone (and any program) who has your private key can log-in to any machine you can yourself, good practice really demands that you isolate your own identity from that of your network science simulation set-up. This is easily accomplished by creating a dedicated `ssh` keypair that you use only for this purpose. There are two stages to this. Firstly, we create a new keypair:

<tt>
ssh-keygen -t rsa -b 2048 -f network_rsa -N ''
</tt>

This creates two files `network_rsa` and `network_rsa.pub` for the keypair. You can use any name you like for the keypair, but convention says it should end `_rsa`. Copy both to the cluster head and the public key to the engines as before, while keeping a copy of `network_rsa` on the client alongside the `ipcontroller-client.json` file. (You can put it anywhere you like, but this is convenient.) You end up with the following key distribution:

* Private key on the client
* Public and private key on the controller
* Public key on the engines

You will also need to run `ipcontroller` inside an `ssh-agent` to manage the keys. You need to add the private key to the agent's keychain:

<tt>
ssh-add network_rsa
</tt>

All this leads to a set of keys and processes as shown in the figure. While this looks complicated, you just need to remember the core principles of `ssh`:

* Private key wherever you log-in *from*
* Public key wherever you log-in *to*
* `ssh` agent wherever you're using programs to connect rather than logging-in by hand

<div class=figure id=fig_ipython_ssh_keys>
<div class=figurebody>
<img alt="Distribution of ssh keys" src="ipython-dedicated-ssh-keys.svg">
<br>
<span class=caption>The distribution of <tt>ssh</tt> keys across cluster and client.</span>
</div>
</div>

Now make a small change to the IPython notebook code you use to [connect to the cluster](#sec_remote_cluster_connect), supplying the name of the private key file to the `Client` object: 

In [None]:
from ipyparallel import Client

cluster = Client(url_file = 'ipcontroller-client.json',
                 sshkey = 'network_rsa')

The `Client` object will now connect using the dedicated keypair. Anyone who hacks your cluster machines can steal your private `ssh` key, but the only thing they will be able to do with it is connect back to the cluster they've already hacked: they won't be able to access any machines that you could access using your own private key, because that was never put onto the cluster in the first place. (If you *did* use your own keypair for experimentation, now might be a good time to delete it. Leave the public key, though, so you can still log-in as yourself: it's only the private key you need to be paranoid about.)

## <span id="sec_ipython_parallal_programming_practice"></span> Programming for a remote cluster

The incantations above will hopefully set up a remote cluster that has exactly the same interface as the local cluster we set up earlier. While this is a huge benefit &ndash; we don't have to commit to using a particular set-up when writing our code &ndash; things aren't quite that simple. 

There are essentially two problems we need to deal with. Firstly, the purpose of using a cluster is to get performance, and we only get maximum performance if all the engines are kept busy computing all the time. Even if we step back a little from this ideal position, we need to keep the engines *as busy as possible doing useful computing* if we're to get the performance boost that a cluster is able to give us. We need to structure code to this end, which means thinking about how to keep engines evenly supplied with work.

Secondly, the remote cluster is, well, *remote*, in the sense that there is a network between the client and the cluster head, and potentially between the cluster head and the worker nodes. Modern networks are fast, but modern simulations can be large, and even a fast network can start to collapse under the pressure of moving lots of data around. Sending lots of data can slow things down substantially, especially as an engine that's exchanging data isn't doing useful computation. So our thoughts about code structure need to deal with two related phenomena: keeping engines fed with work, and keeping them doing that work rather than talking over the network. We'll deal with these two questions in the next two sections.

### <span id="sec_parallel_work"></span> Keeping engines supplied with work

Keeping engines supplied with work comes in two phases. IPython provides half, but the programmer needs to provide the other.

IPython's half is provided by the view we take of the cluster. When we first discussed IPython parallelism [above](#sec_ipython_parallel) we used both *direct* and *load-balanced* views. A direct view lets the programmer send particular jobs to particular engines in the cluster, which is useful if you want each engine to do something different. However, often we want the engines doing the *same* thing but over different data, and this is where a load-balanced view comes in. A load-balanced view takes a set of jobs and allocates them to engines *as they become free*. Suppose we do the following: 

In [None]:
rc = view.map(f, range(10000000))

What does this do? If `view` is a direct view, the 10000000 values in the range are divided up in advance into one chunk per engine, and each engine applies `f` to its own chunk. That works, but the allocation is fixed before any work starts: an engine whose chunk turns out to be quick finishes early and then sits idle while the others catch up.

But what if `view` is load-balancing? Suppose we have a reasonably-sized cluster with 128 engines. The load-balanced view will take `f` and apply it in parallel to the first 128 elements of the range, computing `f(0)`, `f(1)`, and so on up to `f(127)`. These calculations might take radically different amounts of time. When an engine finishes and returns a result, the view gives it the next calculation &ndash; `f(128)` in this case &ndash; to do, and similarly as other engines complete. This all happens transparently to the programmer, so as far as you're concerned you map `f` over the list and get speed-up from parallel execution.

There is an implication here, though. If an engine finishes its job, it can only keep busy &ndash; keep contributing useful work &ndash; if it can receive another job. The load-balanced view handles the mechanics of this, but the programmer has to supply the jobs. In some cases jobs all take roughly the same time; in others jobs make radically different computational demands. The ideal is where every engine is kept 100% busy and they all finish their final job at roughly the same time; in reality, we can keep the engines busy until the last jobs have been submitted to an engine, and work will trail off as these jobs are finished and no new ones are allocated.
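The scheduling behaviour described above can be imitated without a cluster. The following is a minimal sketch, not IPython's actual scheduler: the function `load_balanced_makespan` and the randomly chosen job durations are assumptions made purely for illustration. It hands each job, in order, to whichever engine becomes free first, and reports how busy the engines were overall.

```python
import heapq
import random

def load_balanced_makespan(durations, engines):
    """Give each job to whichever engine becomes free first.

    Returns (time at which the last job finishes, busy time per engine)."""
    # Each heap entry is (time at which the engine becomes free, engine id).
    free_at = [(0.0, e) for e in range(engines)]
    heapq.heapify(free_at)
    busy = [0.0] * engines
    for d in durations:
        t, e = heapq.heappop(free_at)      # the next engine to become free
        busy[e] += d
        heapq.heappush(free_at, (t + d, e))
    makespan = max(t for t, _ in free_at)
    return makespan, busy

# ASSUMPTION: job durations drawn uniformly between 1 and 10 time units.
rng = random.Random(42)
durations = [rng.uniform(1.0, 10.0) for _ in range(10000)]

engines = 128
makespan, busy = load_balanced_makespan(durations, engines)
utilisation = sum(busy) / (engines * makespan)
print(f"makespan {makespan:.1f}, utilisation {utilisation:.1%}")
```

Re-running the sketch with far fewer jobs than this, or with a few very long jobs at the end of the list, tends to lower the utilisation figure: that is the trailing-off effect at the end of the computation.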

From a programming perspective, this suggests structuring programs to have a large number of small jobs: a large number so there is always work available, and small to allow more efficient scheduling. One might, for example, decide to run network simulations where a job is a single run &ndash; a network with a process to run over it &ndash; and perform hundreds (or thousands) of repetitions for each set of parameters being tested.

While this is good in principle from this perspective, in practice it has some disadvantages that we'll now consider.

### <span id="sec_parallel_chunking"></span> Locality and chunking

Suppose we have the following scenario:

* We want to simulate a process on a network under two parameters (say $a$ and $b$)
* We want to test $a$ and $b$ over a range of ten data points each, so over all pairs of $(a_0, b_0), (a_0, b_1), \ldots , (a_0, b_9), (a_1, b_0), \ldots , (a_9, b_9)$
* Because the process is stochastic, we decide to perform 10000 repetitions of the experiment at each parameter pair, generating a different network for each
* To make things reliable, we'll use a network of 100000 nodes for each test

Using what we learned above, we could proceed as follows. We define a function for the process dynamics for each pair of parameters. Then, for each pair of parameters, we build a list consisting of 10000 instances of 100000-node networks with the appropriate degree distribution and edges. Then we map the process function across the networks using a load-balanced view. Looking at the numbers, we end up with 100 simulation functions, each mapped across a list of 10000 elements, each of which is a 100000-node network: $10^6$ compute jobs.

Does this meet the criteria we set? Well, yes: lots of jobs, each of which, while not small, is about as small as we can make it under the circumstances (a simulation across a single network). So is this a good solution to the problem that we could code-up?

No.

Why not? Basically for three reasons:

1. the memory of the client;
1. the volume of traffic over the network; and
1. the management overhead.

Let's deal with the problems first, starting with the client's memory, and then move to some possible solutions. To run the simulations for a given pair of parameters, we have to build 10000 instances of a 100000-node network. That's going to use a considerable amount of memory, since it's $10^9$ nodes with their associated edges. A slightly different programming approach, and we'd store all the networks for *all* the parameter combinations: $10^{11}$ nodes, with edges. That's a *lot* of storage.
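The numbers in this scenario are easy to check with a few lines of arithmetic. This is only a back-of-envelope sketch: the variable names are mine, and the figure of 8 bytes per node is an assumption chosen as a deliberately optimistic lower bound, not a measurement of any real graph library.

```python
points_per_parameter = 10       # values tested for each of a and b
repetitions = 10000             # repetitions per parameter pair
nodes_per_network = 100000

parameter_pairs = points_per_parameter ** 2          # 100 pairs
jobs = parameter_pairs * repetitions                 # 10**6 jobs
nodes_per_pair = repetitions * nodes_per_network     # 10**9 nodes
nodes_overall = parameter_pairs * nodes_per_pair     # 10**11 nodes

# ASSUMPTION: 8 bytes per node and nothing at all for edges.
bytes_per_node = 8
gigabytes_per_pair = nodes_per_pair * bytes_per_node / 10**9

print(f"{jobs} jobs, {nodes_per_pair} nodes per parameter pair")
print(f"at least {gigabytes_per_pair} GB per parameter pair under this assumption")
```

A real network representation needs considerably more than this once edges and any node attributes are counted, so the true memory demand on the client would be larger still.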

Suppose our client machine can cope: OK, we now have to spin-up the simulation jobs, which means passing the network data over the network to the controller, and then to an engine. The controller has to acquire and store the data, and then retain it until it can be passed to an engine. Both hops, from client to controller and from controller to engine, require communicating a 100000-node network between two machines. That's a *lot* of data movement.
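To get a feel for the difference in scale, we can compare the serialised size of a network with the size of a short description from which the same network could be rebuilt. This is a rough sketch under stated assumptions: `pickle` is used only as a proxy for whatever has to travel between machines, `random_edge_list` is a hypothetical stand-in for a real network generator, and the network is a tenth of the scenario's size to keep things quick.

```python
import pickle
import random

def random_edge_list(n, mean_degree, rng):
    """Hypothetical generator: n * mean_degree / 2 random edges.

    Self-loops and duplicate edges are not filtered out, since only the
    size of the resulting data matters for this comparison."""
    return [(rng.randrange(n), rng.randrange(n))
            for _ in range(n * mean_degree // 2)]

n = 10000                                   # the scenario uses 100000
network = random_edge_list(n, 4, random.Random(0))
description = (0.3, 0.7, n, 0)              # (a, b, number of nodes, random seed)

network_bytes = len(pickle.dumps(network))
description_bytes = len(pickle.dumps(description))
print(f"network: {network_bytes} bytes, description: {description_bytes} bytes")
```

The description stays the same size however large the network gets, whereas the serialised network grows with the number of edges.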

While this is happening the client, controller, and engines all have to maintain and exchange information in the background to keep things working. We have to keep the work flowing, possibly monitoring when jobs complete and sending a new job to the now-free engine. Finally, we have to return whatever metrics we derive at the end of the simulation back to the client for processing.

We can argue about whether this particular scenario would work on a particular hardware set-up, and the notion of something being "too big" will change over time, but I think it's safe to say that we have too much of a good thing here: too many jobs, too much network traffic, too much overhead, too much memory being used in one place. Each of these "too much"es is a recipe for a failure that stops things working and loses data.

But didn't we do everything right? Yes and no. Conceptually we did indeed do the right thing. We split up our initial problem into a load of independent computations that could be performed in parallel and then aggregated. But we then simply implemented that solution na&iuml;vely, without considering the reality of the situation. We were on the right track, however: having the right conceptual framework is half-way to a solution. What we need to do is realise this framework in a more realistic manner.

There are two techniques we can use to improve our implementation. Firstly, we can think about *where computation occurs*, and specifically about where we build the networks we're going to use: do they *have* to be passed between machines, or can we build them where they'll be used? Secondly, we can "chunk" jobs together and schedule a smaller number of larger jobs: still enough to keep the engines busy, but not enough to cause problems.

Let's use these two techniques in our scenario. Conceptually we have a simulation function and a network it runs over. We could have the simulation function build the network itself and then work on it. That sounds equivalent, and indeed it is, at one level: that's really the point. But implementationally, in the first case we created functions and networks at the client and then communicated them to an engine *via* the controller; in the second case we created functions at the client and communicated them, and they then built their own network locally at the engine: no communication needed. We've immediately reduced the client's memory requirement and the amount of data communicated &ndash; and, incidentally, speeded things up by allowing the non-trivial task of creating the large networks to happen in parallel at the engines that will then use them.
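A minimal sketch of the second arrangement might look like the following. Everything here is illustrative rather than definitive: `build_network` and `simulate` are hypothetical names, the random-graph generator and the toy spreading process are placeholders for whatever network model and dynamics are actually being studied, and the built-in `map` stands in for `view.map` so that the sketch runs without a cluster.

```python
import random

def build_network(n, mean_degree, rng):
    """Placeholder generator: a random graph stored as a list of neighbour sets."""
    m = min(n * mean_degree // 2, n * (n - 1) // 2)   # never ask for too many edges
    adj = [set() for _ in range(n)]
    edges = 0
    while edges < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v and v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            edges += 1
    return adj

def simulate(job):
    """One simulation run. Only the small tuple `job` has to travel to the engine."""
    a, b, n, seed = job
    rng = random.Random(seed)
    g = build_network(n, 4, rng)          # the network is built where it is used
    # Toy process (an ASSUMPTION, for illustration only): each node starts
    # infected with probability b, and infection crosses an edge with probability a.
    infected = {u for u in range(n) if rng.random() < b}
    frontier = list(infected)
    while frontier:
        u = frontier.pop()
        for v in g[u]:
            if v not in infected and rng.random() < a:
                infected.add(v)
                frontier.append(v)
    return len(infected) / n              # fraction of nodes reached

# 1000 nodes here to keep the sketch quick; the scenario uses 100000.
jobs = [(0.3, 0.05, 1000, seed) for seed in range(5)]
results = list(map(simulate, jobs))       # built-in map standing in for view.map
print(results)
```

Passing an explicit seed in each job keeps runs reproducible while still giving every repetition a different network. On a real cluster the engines would also need `random` and `build_network` to be available in their own namespaces.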

We've still got a lot of jobs. Currently our simulation function performs a single simulation. We could re-code it to perform all 10000 repetitions for a given pair of parameters, building a new random network locally for each repetition. Now instead of $10^6$ jobs (10000 repetitions each of 100 parameter pairs), we have 100 jobs each of which performs 10000 repetitions. Moreover, since the reason we're doing the repetitions is probably to compute average values for the various metrics, we could do *that* computation at the engine too and just ship back the average values, rather than the values for each repetition (since we don't care about them). Again, we've reduced communications, *and* speeded things up by performing parallel calculations, *and* saved computation at the client since the code returns averages not raw data needing further processing. 
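Chunking in this way could be sketched as follows. The names `single_run` and `chunked_job` are hypothetical, and `single_run` is a deliberately trivial placeholder that returns a noisy number where a real version would build a network and run the process over it. The built-in `map` again stands in for `view.map`.

```python
import random

def single_run(a, b, rng):
    """Placeholder for one simulation run (ASSUMPTION: a noisy metric near a * b)."""
    return a * b + rng.gauss(0.0, 0.1)

def chunked_job(job):
    """Perform all the repetitions for one parameter pair and return only the mean."""
    a, b, repetitions, seed = job
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        total += single_run(a, b, rng)
    return (a, b, total / repetitions)    # three numbers, not 10000 raw results

values = [i / 10 for i in range(10)]      # ten data points for each parameter
pairs = [(a, b) for a in values for b in values]
jobs = [(a, b, 10000, seed) for seed, (a, b) in enumerate(pairs)]

summaries = list(map(chunked_job, jobs))  # built-in map standing in for view.map
print(len(jobs), "jobs;", "last summary:", summaries[-1])
```

If the spread of the results matters as well as the mean, the job could return a variance or a few quantiles too, which is still far less data than the raw values.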

But we need to pause a moment: 100 jobs does not sound like very many. If we had a cluster with 128 engines &ndash; not unreasonable at the time of writing &ndash; then we only have enough work for 100 of them: we leave around 20% of the engines idle. On the other hand, if we only have 64 engines, we can keep them all busy for longer. If we had 128 engines, maybe we'd want to build smaller chunks &ndash; two jobs doing 5000 repetitions each for each parameter pair, perhaps? &ndash; to give ourselves 200 jobs to spread across the available machines. Alternatively we could increase the number of parameters, maybe explore $a$ and $b$ over 20 points each, yielding 400 jobs each of 10000 repetitions. (If that sounds a bit gratuitous, in reality we often let problems expand to fill the computers we have available: a sort of computational version of [Parkinson's law](https://en.wikipedia.org/wiki/Parkinson%27s_law).)
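The choice of chunk size can be made mechanically from the number of engines. The following is one possible sketch, not a recommendation: `plan_jobs` and its `jobs_per_engine` parameter are hypothetical, and the policy of aiming for at least that many jobs per engine is an assumption.

```python
def plan_jobs(parameter_pairs, repetitions, engines, jobs_per_engine=1):
    """Split each pair's repetitions into chunks so that there are at least
    engines * jobs_per_engine jobs (as far as the repetitions allow).

    Returns a list of (a, b, repetitions_in_this_chunk) tuples."""
    target = engines * jobs_per_engine
    chunks_per_pair = -(-target // len(parameter_pairs))      # ceiling division
    chunks_per_pair = max(1, min(chunks_per_pair, repetitions))
    base, extra = divmod(repetitions, chunks_per_pair)
    jobs = []
    for (a, b) in parameter_pairs:
        for c in range(chunks_per_pair):
            # The first `extra` chunks take one more repetition each.
            jobs.append((a, b, base + (1 if c < extra else 0)))
    return jobs

pairs = [(a, b) for a in range(10) for b in range(10)]        # 100 parameter pairs

for engines in (64, 128):
    jobs = plan_jobs(pairs, 10000, engines)
    print(engines, "engines:", len(jobs), "jobs of", jobs[0][2], "repetitions")
```

When a parameter pair is split across several jobs, the per-chunk averages have to be combined at the client, weighting each by its number of repetitions.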

Let's recapitulate the journey we've just made. We started with a scientific problem to solve. We built a conceptual model that divided-up the problem into lots of independent jobs to solve independently and then combine together. We then tweaked this model to make it realistic, taking account of the limitations of real computers and their interconnects, and arrived at a model that is tailored to fit the actual computational environment we have available. We can now code this up and run it. It's important to note that the steps we've taken haven't really tied us down too much: we'd need to re-code the simulation slightly and provide appropriate values for the number of repetitions and the like, but fundamentally the final code is very much like our initial formulation. Later we'll see how to do this in practice.