# DirectView as multiplexer

In [2]:
from IPython.display import display
from ipyparallel import Client
rc = Client()

A DirectView is best understood as an engine multiplexer:
it does the same thing on all of its engines.

The only difference between running code on a single remote engine
and running code in parallel is how many engines the DirectView is
instructed to use.

You create DirectViews by indexing the Client. The resulting DirectView
targets the engines obtained by applying the same index (or slice)
to the Client's `ids` list.

In [3]:
e0 = rc[0]
eall = rc[:]
even = rc[::2]
odd = rc[1::2]

# this is the one we are going to use:
dview = eall
dview.block = True

Now the only difference from single-engine remote execution is that the code runs on all of the engines in a given view:

In [5]:
import os
for view in (e0, eall, even, odd):
    print(view, view.apply_sync(os.getpid))

<DirectView 0> 18094
<DirectView [0, 1, 2, 3]> [18094, 18093, 18095, 18092]
<DirectView [0, 2]> [18094, 18095]
<DirectView [1, 3]> [18093, 18092]


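When a view spans several engines (such as `eall`, `even`, or `odd`), the result of multiplexed execution is a list with one entry per engine. A view created with a single integer index, like `e0`, returns the bare result instead.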

In [6]:
dview['a'] = 5
dview['a']

[5, 5, 5, 5]

# Scatter and Gather

Lots of parallel computations involve partitioning data onto processes.
DirectViews have `scatter()` and `gather()` methods to help with this.
Pass any sequence or numpy array: `scatter()` partitions it across the engines,
and `gather()` reconstructs the full object in the Client.
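To make the partitioning concrete, here is a pure-Python sketch (no cluster needed) of the blocked split that the outputs in this notebook suggest: contiguous chunks whose sizes differ by at most one, with any leftover items going to the first engines. `split_blocks` and `rejoin` are hypothetical helpers for illustration, not ipyparallel functions, and this is not the library's own code.

```python
def split_blocks(seq, n_engines):
    """Split seq into n_engines contiguous chunks; sizes differ by at most
    one, and the first engines receive the extra items."""
    base, extra = divmod(len(seq), n_engines)
    chunks, start = [], 0
    for i in range(n_engines):
        size = base + (1 if i < extra else 0)
        chunks.append(seq[start:start + size])
        start += size
    return chunks


def rejoin(chunks):
    """Reassemble the chunks in engine order (flattening one level)."""
    return [item for chunk in chunks for item in chunk]


data = list(range(16))
chunks = split_blocks(data, 4)
print(chunks)                  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(rejoin(chunks) == data)  # True
print(split_blocks(['one.txt', 'two.txt', 'three.txt'], 4))
# [['one.txt'], ['two.txt'], ['three.txt'], []]
```

The second call mirrors what happens later when three file names are scattered over four engines: the last engine simply gets an empty list.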

In [8]:
import numpy as np
dview.scatter('a',np.arange(16))
a = len(dview['a'])

print('a in the engines:',dview['a'])
print('a here:', a)
print('len of whole thing:', len(dview.gather('a')))
print(dview.gather('a'))

a in the engines: [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8,  9, 10, 11]), array([12, 13, 14, 15])]
a here: 4
len of whole thing: 16
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]


On a single-engine view, `scatter` sends the whole sequence to that one engine:

In [9]:
e0.block = True
e0.scatter('a',np.arange(16))
e0['a']

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

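Engine 0 now holds all 16 values, while engines 1 to 3 still hold their 4-element chunks from the earlier scatter, so gathering yields 28 elements:

In [10]: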
dview.gather('a')

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,  4,
        5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

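`execute` runs a string of code on every engine. Gathering a scalar yields one value per engine; note that engine 0 sums all 16 values it now holds:

In [11]: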
dview.execute("asum = sum(a)")
dview.gather('asum')

[120, 22, 38, 54]

The cell magic `%%px` is equivalent to calling `dview.execute()` on an entire cell, with a more convenient syntax:

In [12]:
%%px
# This entire cell will be executed in all the engines as if we'd called
# dview.execute("...") with the contents below.
asum2 = 2*sum(a)
import numpy as np
b = np.random.rand(4)

We can now gather the results:

In [13]:
print('asum2:\n', dview.gather('asum2'))
print('b    :\n', dview.gather('b'))

asum2:
 [240, 44, 76, 108]
b    :
 [ 0.67769157  0.07856507  0.7548284   0.76355569  0.2633911   0.05192797
  0.74228314  0.63888739  0.63031973  0.56405279  0.74885016  0.56197146
  0.30082453  0.11919063  0.50652661  0.3188103 ]


With `scatter` and `%%px` we can conveniently split work across multiple engines, for example a set of data files that need processing:

In [14]:
files = ['one.txt', 'two.txt', 'three.txt']
dview.scatter('files', files)
dview['files']

[['one.txt'], ['two.txt'], ['three.txt'], []]

Note how, when we run code with `%%px`, IPython automatically captures the print output from every engine and labels it by engine id:

In [16]:
%%px
for file in files:
    print('filename:', file)

[stdout:0] filename: one.txt
[stdout:1] filename: two.txt
[stdout:2] filename: three.txt


Passing `flatten=True` tells `scatter` that engines receiving only one item
should get the item itself rather than a one-element sublist:

In [17]:
dview.scatter('id',rc.ids)
dview['id']

[[0], [1], [2], [3]]

In [18]:
dview.scatter('id',rc.ids, flatten=True)
dview['id']

[0, 1, 2, 3]
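A pure-Python sketch of the `flatten` rule, assuming it is applied separately on each engine (so an engine that receives several items still gets a list). `scatter_chunks` is a hypothetical helper that only models what each engine ends up holding; it does not call ipyparallel.

```python
def scatter_chunks(seq, n_engines, flatten=False):
    """Model of what each engine receives from scatter(..., flatten=flatten)."""
    base, extra = divmod(len(seq), n_engines)
    chunks, start = [], 0
    for i in range(n_engines):
        size = base + (1 if i < extra else 0)
        chunk = seq[start:start + size]
        start += size
        # With flatten=True, a one-item chunk is replaced by the item itself;
        # chunks of any other length stay lists.
        if flatten and len(chunk) == 1:
            chunk = chunk[0]
        chunks.append(chunk)
    return chunks


print(scatter_chunks([0, 1, 2, 3], 4))                   # [[0], [1], [2], [3]]
print(scatter_chunks([0, 1, 2, 3], 4, flatten=True))     # [0, 1, 2, 3]
print(scatter_chunks([0, 1, 2, 3, 4], 4, flatten=True))  # [[0, 1], 2, 3, 4]
```

The last call shows why `flatten` is safest when every engine gets exactly one item: otherwise the engines end up with differently shaped values.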

Scatter and gather also work with numpy arrays, which are split along the first axis (rows):

In [19]:
A = np.random.randint(1,10,(16,4))
B = np.random.randint(1,10,(4,16))
display(A)
display(B)

array([[9, 7, 6, 5],
       [3, 3, 8, 7],
       [9, 5, 2, 4],
       [3, 9, 8, 3],
       [8, 3, 7, 7],
       [2, 9, 4, 8],
       [7, 1, 1, 4],
       [1, 3, 5, 5],
       [1, 9, 7, 8],
       [6, 1, 4, 7],
       [5, 6, 4, 1],
       [7, 2, 4, 2],
       [8, 8, 9, 7],
       [2, 9, 2, 9],
       [8, 3, 9, 4],
       [2, 6, 9, 9]])

array([[9, 4, 2, 8, 6, 5, 4, 4, 4, 7, 3, 9, 6, 9, 6, 5],
       [6, 2, 6, 1, 9, 1, 2, 7, 1, 1, 7, 4, 5, 9, 3, 9],
       [6, 7, 2, 4, 7, 8, 1, 3, 5, 7, 2, 8, 6, 6, 5, 8],
       [7, 8, 2, 5, 2, 3, 4, 7, 5, 1, 2, 5, 3, 9, 5, 6]])

In [20]:
dview.scatter('A', A)
dview.scatter('B', B)
display(e0['A'])
display(e0['B'])

array([[9, 7, 6, 5],
       [3, 3, 8, 7],
       [9, 5, 2, 4],
       [3, 9, 8, 3]])

array([[9, 4, 2, 8, 6, 5, 4, 4, 4, 7, 3, 9, 6, 9, 6, 5]])

# Example: Parallel Matrix Multiply

With what we have seen so far, we can write our own (completely terrible!) parallel matrix multiply.

* Remember: multiply the rows of one matrix by the columns of the other.
* The easiest implementation uses one each of: push, scatter, execute, gather.

In [25]:
%load soln/matmul.py

Let's run this, and validate the result against a local computation.

In [26]:
C_ref = A.dot(B)
C1 = pdot(dview, A, B)
# validation:
(C1==C_ref).all()

True
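The contents of `soln/matmul.py` aren't reproduced here. As one possible shape for `pdot`, this sketch uses exactly one each of `push`, `scatter`, `execute`, and `gather`, assuming nested Python lists rather than numpy arrays (so it returns a list of rows, not an array). `LocalView` is a hypothetical in-process stand-in for a DirectView, so the logic can be tried without engines; on a real cluster you would pass `dview` instead.

```python
def pdot(view, A, B):
    """Row-partitioned product C = A @ B, sketched for nested lists."""
    view.push({'B': B}, block=True)   # every engine gets all of B
    view.scatter('A', A, block=True)  # each engine gets a block of rows of A
    # Each engine multiplies its rows of A by the columns of B.
    view.execute(
        "C = [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]"
        " for row in A]",
        block=True,
    )
    return view.gather('C', block=True)  # stitch the row blocks back together


class LocalView:
    """Hypothetical stand-in for a DirectView: each 'engine' is a dict."""

    def __init__(self, n_engines):
        self.namespaces = [{} for _ in range(n_engines)]

    def push(self, ns, block=None):
        for engine_ns in self.namespaces:
            engine_ns.update(ns)

    def scatter(self, key, seq, block=None):
        # Contiguous blocks; the first engines get any leftover rows.
        base, extra = divmod(len(seq), len(self.namespaces))
        start = 0
        for i, engine_ns in enumerate(self.namespaces):
            size = base + (1 if i < extra else 0)
            engine_ns[key] = seq[start:start + size]
            start += size

    def execute(self, code, block=None):
        for engine_ns in self.namespaces:
            exec(code, engine_ns)

    def gather(self, key, block=None):
        return [row for engine_ns in self.namespaces for row in engine_ns[key]]


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, 3]]
print(pdot(LocalView(4), A, B))  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```

Pushing all of B while scattering only A is the design choice that makes one `execute` sufficient: every engine has the full set of columns it needs for its own rows.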

# Map

DirectViews have a map method, which behaves just like the builtin map,
but computed in parallel.

In [27]:
dview.block = True

# the builtin map is lazy in Python 3, so materialize it before comparing
serial_result   =  list(map(lambda x: x**10, range(32)))
parallel_result = dview.map(lambda x: x**10, range(32))

serial_result == parallel_result

True

`DirectView.map` partitions the input sequences across the engines
and then calls `map` remotely on each chunk, so it always creates
exactly one task per engine, regardless of the input length.

In [28]:
amr = dview.map_async(lambda x:x**10, range(32))
amr.msg_ids

['f9d72f87-d53e-4cae-ae06-2c300faa962a',
 '67064a5f-5e22-4442-b2cb-79c2e4b874b9',
 'a207f618-0736-43c2-8464-4be900acc34c',
 'd36bcee7-0d25-48fb-8f4d-3f459df4633a']

Mapping over a larger input range confirms this: we still get the same number of message ids (one per task, and so one per engine):

In [29]:
amr = dview.map_async(lambda x:x**10, range(64))
amr.msg_ids

['ba62e62f-85d6-49b2-b3c5-61d9e300693b',
 '0aeb22a7-0a40-4630-9ee0-1ad5aabbd853',
 '7c4d4a46-9ccd-403c-b9dc-0ae72286c157',
 '3c1ed667-e534-408b-88d6-b45ee7d5ffc9']
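A local, pure-Python simulation of that strategy makes the task count explicit. `simulate_direct_map` is a hypothetical name and does not call ipyparallel; it just cuts the input into one contiguous chunk per engine, maps each chunk with the builtin `map`, and concatenates the results in order.

```python
def simulate_direct_map(func, seq, n_engines):
    """Return (number of tasks, combined result) for a DirectView-style map."""
    seq = list(seq)
    base, extra = divmod(len(seq), n_engines)
    tasks, start = [], 0
    for i in range(n_engines):
        size = base + (1 if i < extra else 0)
        tasks.append(seq[start:start + size])  # one chunk, hence one task, per engine
        start += size
    per_engine = [list(map(func, chunk)) for chunk in tasks]
    result = [y for part in per_engine for y in part]
    return len(tasks), result


n_tasks, result = simulate_direct_map(lambda x: x**10, range(32), 4)
print(n_tasks)                               # 4
print(result == [x**10 for x in range(32)])  # True

n_tasks, _ = simulate_direct_map(lambda x: x**10, range(64), 4)
print(n_tasks)                               # still 4
```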

### Example: Pi via simple Monte Carlo

![Monte Carlo Pi](http://docs.picloud.com/_images/basic_example_monte.png)
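The estimator behind `brute_pi` below: a point drawn uniformly from the unit square lands inside the quarter disc of radius 1 with probability equal to that quarter disc's area, so the hit fraction $m/n$ estimates $\pi/4$. The last line is the standard binomial error estimate, which shows why spreading more samples across engines pays off: the error shrinks like $1/\sqrt{n}$.

```latex
(x, y) \sim \mathcal{U}\left([0,1]^2\right), \qquad
p = P\left(x^2 + y^2 \le 1\right) = \frac{\pi \cdot 1^2 / 4}{1 \cdot 1} = \frac{\pi}{4}

\hat{\pi} = 4\,\frac{m}{n}, \qquad
\operatorname{SD}\left(\hat{\pi}\right) = 4\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{4\pi - \pi^2}{n}} \approx \frac{1.64}{\sqrt{n}}
```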

In [34]:
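def sample_circle(n):
    # Import inside the function so it is self-contained when run on an engine.
    import numpy as np
    m = 0
    for i in range(int(n)):
        p = np.random.rand(2)   # a random point in the unit square
        if sum(p**2.) <= 1.:    # inside the quarter circle of radius 1?
            m += 1
    return m

def brute_pi(n):
    # The fraction of hits, m/n, estimates pi/4.
    m = sample_circle(n)
    return 4.* m/n

def err(npi):
    # Relative error of the estimate, in percent.
    return 100*abs(np.pi-npi)/np.pi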

In [35]:
n = 5e5

In [36]:
%time bpi = brute_pi(n)
print("\nError: %.2f%%" % err(bpi))

CPU times: user 2.3 s, sys: 27.3 ms, total: 2.32 s
Wall time: 2.37 s

Error: 0.03%


**Your homework**

Write a function `cluster_pi` that uses the cluster to run the computation in parallel (use `brute_pi` as inspiration).

A timed run of one parallel solution (its code is not shown here) reported:

CPU times: user 13.1 ms, sys: 2.74 ms, total: 15.8 ms
Wall time: 1.49 s
