# DirectView as multiplexer

In [1]:
from IPython.display import display
from ipyparallel import Client
rc = Client()

A DirectView can be readily understood as an engine multiplexer:
it does the same thing on all of its engines.

The only difference between running code on a single remote engine
and running code in parallel is how many engines the DirectView is
instructed to use.

You can create DirectViews by indexing the Client.  The index (or slice)
is applied to the Client's `ids` list, and the resulting DirectView uses
the engines it selects.
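As a rough local illustration, the same indexes and slices can be applied to a plain list standing in for `rc.ids`. The four-engine list below is an assumption; the real list comes from the running cluster.

```python
# Stand-in for rc.ids on a four-engine cluster (an assumption; the real
# list is reported by the running cluster).
ids = [0, 1, 2, 3]

# Each key mirrors an index or slice that could be passed to the Client;
# each value is the engine id (or list of ids) it would select.
targets = {
    "rc[0]": ids[0],
    "rc[:]": ids[:],
    "rc[::2]": ids[::2],
    "rc[1::2]": ids[1::2],
}

for expr, engines in targets.items():
    print(expr, "->", engines)
```

This only mimics the selection step; it does not create any views.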

In [2]:
e0 = rc[0]
eall = rc[:]
even = rc[::2]
odd = rc[1::2]

# this is the one we are going to use:
dview = eall
dview.block = True

Now, the only difference from single-engine remote execution is that the code we run happens on all of the engines of a given view:

In [3]:
import os
for view in (e0, eall, even, odd):
    print(view, view.apply_sync(os.getpid))

<DirectView 0> 11797
<DirectView [0, 1, 2, 3]> [11797, 11798, 11796, 11799]
<DirectView [0, 2]> [11797, 11796]
<DirectView [1, 3]> [11798, 11799]


The result of multiplexed execution is always a list with one entry per engine.

In [4]:
dview['a'] = 5
dview['a']

[5, 5, 5, 5]

# Scatter and Gather

Many parallel computations involve partitioning data across processes.
DirectViews have `scatter()` and `gather()` methods to help with this.
Pass any container or numpy array, and IPython will partition the object onto the engines with `scatter()`,
or reconstruct the full object in the Client with `gather()`.
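The partitioning idea can be sketched locally without a cluster. The helpers `scatter_local` and `gather_local` below are hypothetical stand-ins, not ipyparallel functions, and they assume contiguous chunks of near-equal length, which is consistent with the outputs shown in this notebook.

```python
def scatter_local(seq, n_engines):
    """Split seq into n_engines contiguous chunks of near-equal length.

    Hypothetical local stand-in for scattering; not the ipyparallel code.
    """
    base, extra = divmod(len(seq), n_engines)
    chunks = []
    start = 0
    for engine in range(n_engines):
        # The first `extra` engines receive one additional item.
        stop = start + base + (1 if engine < extra else 0)
        chunks.append(list(seq[start:stop]))
        start = stop
    return chunks


def gather_local(chunks):
    """Concatenate the per-engine chunks back into a single list."""
    return [item for chunk in chunks for item in chunk]


chunks = scatter_local(range(16), 4)
print(chunks)
print(gather_local(chunks) == list(range(16)))
```

Under the same assumption, three items over four engines leave the last engine with an empty chunk.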

In [5]:
import numpy as np
dview.scatter('a',np.arange(16))
a = len(dview['a'])

print('a in the engines:',dview['a'])
print('a here:', a)
print('len of whole thing:', len(dview.gather('a')))
print(dview.gather('a'))

a in the engines: [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8,  9, 10, 11]), array([12, 13, 14, 15])]
a here: 4
len of whole thing: 16
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]


In [6]:
e0.block = True
e0.scatter('a',np.arange(16))
e0['a']

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [7]:
dview.gather('a')

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,  4,
        5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [8]:
dview.execute("asum = sum(a)")
dview.gather('asum')

[120, 22, 38, 54]

The cell magic `%%px` is equivalent to calling `dview.execute()` on an entire cell, with a more convenient syntax:

In [9]:
%%px
# This entire cell will be executed in all the engines as if we'd called
# dview.execute("...") with the contents below.
asum2 = 2*sum(a)
import numpy as np
b = np.random.rand(4)

We can now gather the results:

In [10]:
print('asum2:\n', dview.gather('asum2'))
print('b    :\n', dview.gather('b'))

asum2:
 [240, 44, 76, 108]
b    :
 [0.16180269 0.21380825 0.69635711 0.07673496 0.54969269 0.72831087
 0.70668113 0.67625155 0.06017366 0.37962166 0.38871766 0.23981086
 0.27313408 0.81904247 0.58196427 0.03409342]


With scatter and `%%px` we can conveniently break up a computation across multiple engines, for example a set of data files that need processing:

In [11]:
files = ['one.txt', 'two.txt', 'three.txt']
dview.scatter('files', files)
dview['files']

[['one.txt'], ['two.txt'], ['three.txt'], []]

Note how, when we run code with `%%px`, IPython automatically captures and summarizes the print output from all engines:

In [12]:
%%px
for file in files:
    print('filename:', file)

[stdout:1] filename: two.txt


[stdout:0] filename: one.txt


[stdout:2] filename: three.txt


We can pass the `flatten` keyword
so that an engine receiving only one item of the list
gets the item itself, rather than a one-element sublist:

In [13]:
dview.scatter('id',rc.ids)
dview['id']

[[0], [1], [2], [3]]

In [14]:
dview.scatter('id',rc.ids, flatten=True)
dview['id']

[0, 1, 2, 3]
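The effect of `flatten` can be mimicked locally. The function `scatter_local` here is a hypothetical stand-in written for illustration, assuming contiguous near-equal chunks; it is not how ipyparallel is implemented.

```python
def scatter_local(seq, n_engines, flatten=False):
    """Local stand-in for scatter, with an optional flatten step."""
    base, extra = divmod(len(seq), n_engines)
    chunks = []
    start = 0
    for engine in range(n_engines):
        stop = start + base + (1 if engine < extra else 0)
        chunk = list(seq[start:stop])
        # With flatten, a one-element chunk is replaced by its element.
        if flatten and len(chunk) == 1:
            chunk = chunk[0]
        chunks.append(chunk)
        start = stop
    return chunks


ids = [0, 1, 2, 3]
nested = scatter_local(ids, 4)
flat = scatter_local(ids, 4, flatten=True)
print(nested)
print(flat)
```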

Scatter and gather also work with numpy arrays:

In [15]:
A = np.random.randint(1,10,(16,4))
B = np.random.randint(1,10,(4,16))
display(A)
display(B)

array([[7, 6, 1, 4],
       [9, 5, 9, 6],
       [6, 6, 1, 5],
       [7, 8, 9, 1],
       [8, 4, 4, 1],
       [8, 8, 5, 8],
       [3, 9, 6, 7],
       [6, 9, 6, 5],
       [4, 6, 2, 4],
       [8, 7, 4, 9],
       [7, 9, 6, 2],
       [3, 9, 3, 8],
       [9, 4, 3, 7],
       [3, 1, 4, 1],
       [7, 8, 3, 4],
       [9, 2, 9, 8]])

array([[9, 1, 7, 7, 3, 4, 5, 3, 8, 4, 9, 7, 3, 8, 8, 9],
       [4, 6, 2, 7, 6, 1, 5, 7, 8, 2, 8, 8, 5, 1, 6, 6],
       [6, 7, 8, 8, 1, 1, 9, 2, 6, 2, 6, 2, 7, 6, 1, 2],
       [5, 2, 7, 1, 9, 6, 7, 8, 2, 6, 3, 6, 6, 7, 4, 9]])

In [16]:
dview.scatter('A', A)
dview.scatter('B', B)
display(e0['A'])
display(e0['B'])

array([[7, 6, 1, 4],
       [9, 5, 9, 6],
       [6, 6, 1, 5],
       [7, 8, 9, 1]])

array([[9, 1, 7, 7, 3, 4, 5, 3, 8, 4, 9, 7, 3, 8, 8, 9]])

# Example: Parallel Matrix Multiply

With what we have seen so far, we can write our own (completely terrible!) parallel matrix multiply.

* Remember: multiply the rows of one matrix by the columns of the other.
* The easiest implementation involves one each of: push, scatter, execute, gather.
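The row-block idea behind these bullets can be checked locally before involving any engines. The sketch below uses plain lists and a small hypothetical `matmul` helper rather than numpy or a cluster; the matrices and the two-way split are made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))  # columns of Y
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in X]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 x 2
B = [[1, 0, 2], [0, 1, 3]]            # 2 x 3

# "scatter": split the rows of A into two blocks, as two engines would hold them
blocks = [A[:2], A[2:]]
# "push" + "execute": every block is multiplied by the full B
partial = [matmul(block, B) for block in blocks]
# "gather": stack the partial results back together in order
C = [row for part in partial for row in part]

print(C == matmul(A, B))
```

Because each row of the product depends only on the matching row of `A` and all of `B`, the stacked partial results agree with the full product.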

In [19]:
# %load soln/matmul.py
def pdot(v, A, B):
    v['B'] = B  # push B to every engine
    v.scatter('A', A)  # scatter the rows of A
    v.execute('C = A.dot(B)')  # each engine multiplies its rows of A by B
    return v.gather('C', block=True)  # gather the resulting sub-arrays


Let's run this, and validate the result against a local computation.

In [20]:
C_ref = A.dot(B)
C1 = pdot(dview, A, B)
# validation:
(C1==C_ref).all()

True

# Map

DirectViews have a `map` method, which behaves like the builtin `map`
but is computed in parallel.

In [21]:
dview.block = True

# The builtin map is lazy in Python 3, so materialize it before comparing.
serial_result   =  list(map(lambda x:x**10, range(32)))
parallel_result = dview.map(lambda x:x**10, range(32))

serial_result==parallel_result

True

`DirectView.map` partitions the sequences onto each engine,
and then calls `map` remotely.  The result is always a single
IPython task per engine.
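A local sketch of this partition-then-map behaviour is given below. The helpers `split` and `chunked_map` are hypothetical names, and the contiguous near-equal split is an assumption; each chunk stands in for one remote task.

```python
def split(seq, n):
    """Split a list into n contiguous chunks of near-equal length."""
    base, extra = divmod(len(seq), n)
    out = []
    start = 0
    for i in range(n):
        stop = start + base + (1 if i < extra else 0)
        out.append(seq[start:stop])
        start = stop
    return out


def chunked_map(func, seq, n_engines):
    """Map func over seq one chunk at a time; return (results, task count)."""
    pieces = split(list(seq), n_engines)  # one piece per engine
    # Each piece is processed as a single unit, standing in for one task.
    results = [list(map(func, piece)) for piece in pieces]
    flat = [value for res in results for value in res]
    return flat, len(pieces)


result, n_tasks = chunked_map(lambda x: x**10, range(32), 4)
print(n_tasks)
print(result == [x**10 for x in range(32)])
```

In this sketch the number of chunks depends only on the number of engines, not on the length of the input.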

In [22]:
amr = dview.map_async(lambda x:x**10, range(32))
amr.msg_ids

['5d787271-9c5160456c6bb41f49141b0d_122',
 '5d787271-9c5160456c6bb41f49141b0d_123',
 '5d787271-9c5160456c6bb41f49141b0d_124',
 '5d787271-9c5160456c6bb41f49141b0d_125']

We can see this by mapping over a larger input range: we still get the same number of message ids, one per task created:

In [23]:
amr = dview.map_async(lambda x:x**10, range(64))
amr.msg_ids

['5d787271-9c5160456c6bb41f49141b0d_126',
 '5d787271-9c5160456c6bb41f49141b0d_127',
 '5d787271-9c5160456c6bb41f49141b0d_128',
 '5d787271-9c5160456c6bb41f49141b0d_129']

### Example: Pi via simple Monte Carlo

![Monte Carlo Pi](http://docs.picloud.com/_images/basic_example_monte.png)

In [24]:
def sample_circle(n):
    import numpy as np
    m = 0
    for i in range(int(n)):
        p = np.random.rand(2)
        if sum(p**2.) <= 1.:
            m += 1
    return m

def brute_pi(n):
    m = sample_circle(n)
    return 4.* m/n

def err(npi):
    return 100*abs(np.pi-npi)/np.pi
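Because `sample_circle` returns a raw hit count, counts from independent runs can be added together before converting to an estimate. The following is a minimal local sketch of that combination step, using the standard `random` module instead of numpy; `sample_circle_local`, the chunk count, and the seeds are assumptions made for illustration.

```python
import random


def sample_circle_local(n, rng):
    """Count how many of n random points in the unit square fall inside
    the quarter circle of radius 1."""
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


n_total = 40000
n_chunks = 4  # stands in for the number of workers
per_chunk = n_total // n_chunks

# Each chunk uses its own seeded generator so the runs are independent
# and the sketch is reproducible.
counts = [sample_circle_local(per_chunk, random.Random(seed))
          for seed in range(n_chunks)]

# Adding the counts and dividing by the total sample size gives one estimate.
pi_estimate = 4.0 * sum(counts) / (per_chunk * n_chunks)
print(pi_estimate)
```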

In [25]:
n = 5e5

In [26]:
%time bpi = brute_pi(n)
print("\nError: %.2f%%" % err(bpi))

CPU times: user 2.41 s, sys: 46.7 ms, total: 2.46 s
Wall time: 2.65 s

Error: 0.06%


**Your homework**

Write a function `cluster_pi` that uses the cluster to run the computation in parallel (use `brute_pi` as inspiration).

CPU times: user 13.1 ms, sys: 2.74 ms, total: 15.8 ms
Wall time: 1.49 s
