<img src="https://www.mines.edu/webcentral/wp-content/uploads/sites/267/2019/02/horizontallightbackground.jpg" width="100%"> 
### CSCI250 Python Computing: Building a Sensor System
<hr style="height:5px" width="100%" align="left">

# Parallel code execution

# Objective
* introduce parallel code execution
* use Python threads
* use Python logging

# Resources
* [Python threading](https://docs.python.org/3/library/threading.html)
* [Python logging](https://docs.python.org/3/library/logging.html)

# Definition

**Parallelism** means that 
* multiple tasks are executed simultaneously
* the tasks divide a large calculation into smaller chunks

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import time, threading, logging
import numpy.random as rng

# `threading` module

Implements a system that enables execution in concurrent threads.

**Threads** are code sequences executed independently.

`threading` functions provide access to thread information.

`threading.active_count()`

 `threading.main_thread()`
 
 `threading.current_thread()`
 
 `threading.get_ident()`
 
 `threading.enumerate()`
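A minimal sketch of what these functions return when called from a script's main thread. No extra threads are started here; in a Jupyter kernel the thread count is typically larger than 1 because the kernel runs helper threads of its own.

```python
import threading

main = threading.main_thread()        # Thread object of the main thread
current = threading.current_thread()  # Thread object of the calling thread
ident = threading.get_ident()         # integer identifier of the calling thread
alive = threading.enumerate()         # list of all Thread objects currently alive
count = threading.active_count()      # number of Thread objects currently alive

print("main thread    :", main.name)
print("current thread :", current.name)
print("thread ident   :", ident)
print("active threads :", count)
```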

## `Thread` class
Defines threads and thread operations.

`threading.Thread(target=..., name=..., args=...)`

`Thread.start()`

`Thread.is_alive()`
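A short sketch of the life cycle of one thread. The `worker` function, its delay and the thread name are illustrative assumptions; `Thread.join()` is described later in this notebook.

```python
import threading
import time

def worker(delay):
    # hypothetical task: wait for `delay` seconds
    time.sleep(delay)

# pass the arguments by keyword; args must be a tuple
t = threading.Thread(target=worker, name="worker-0", args=(0.2,))

before = t.is_alive()   # thread is defined but not started yet
t.start()               # begin executing worker(0.2) in the new thread
during = t.is_alive()   # worker is still sleeping
t.join()                # wait for the worker to finish
after = t.is_alive()    # thread has terminated

print(t.name, before, during, after)
```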

<img src="https://www.dropbox.com/s/u628vjn2uc5h3ua/notebook.png?raw=1" width="10%" align="right">

See the [concurrency notebook](./s_NpConcurrency.ipynb) for more information.

# Serial vs. parallel execution

Consider a set of tasks that we can pass to a function.

We can execute all the tasks using 
* one thread (serial execution)
* multiple threads (parallel execution)

In [None]:
def myFunc(i,t):
    print("%2d >>      %5.2f"%(i,t))
    time.sleep(t)
    print("   <<%2d         "%(i  ))

In [None]:
myFunc(0,2.0)

In [None]:
nTASKS = 6

TASKS = [iTASK for iTASK in range(nTASKS)] # job indices
TIMES = rng.uniform(0.1,1.0,nTASKS)        # time delays

for iTASK in range(nTASKS):
    print( format(TASKS[iTASK],'2d'), format(TIMES[iTASK],'5.2f') )

### Serial execution

The function is called sequentially for different inputs.

In [None]:
for iTASK in TASKS:
    myFunc( iTASK, TIMES[iTASK] )

### Parallel execution

The function is called on all inputs at once, each call running in its own thread.

In [None]:
for iTASK in TASKS:
    # define thread
    t = threading.Thread(target = myFunc, args = (iTASK, TIMES[iTASK]))
    
    # start thread
    t.start()                                  

# `logging` module
Implements a flexible event logging system, useful for monitoring parallel execution.

Can be used with multiple threads executed in parallel.

Logging can be configured for different levels.

The logging message can also be customized.

# `logging.basicConfig()`

There are several levels of logging (`WARNING` is default):
* `DEBUG`: detailed information for diagnosing problems
* `INFO`: confirmation that all works as expected
* `WARNING`: something happened - software still working
* `ERROR`: something happened - software not completed
* `CRITICAL`: a serious error - software may be unable to run
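The effect of the level threshold can be sketched with a separate logger that writes into memory. The logger name `levels_demo` and the messages are assumptions made for illustration; only messages at or above the configured level are kept.

```python
import io
import logging

stream = io.StringIO()                        # capture log output in memory
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))

log = logging.getLogger("levels_demo")        # hypothetical logger name
log.addHandler(handler)
log.propagate = False                         # keep the demo apart from the root logger
log.setLevel(logging.WARNING)                 # same threshold as the default

log.debug("detailed diagnostics")             # below the threshold: dropped
log.info("all works as expected")             # below the threshold: dropped
log.warning("something unexpected happened")  # at the threshold: kept
log.error("a task could not complete")        # above the threshold: kept

captured = stream.getvalue()
print(captured)
```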

In [None]:
logging.basicConfig(level = logging.DEBUG,
                    format='[%(levelname)s] (%(threadName)-10s) %(message)s',
                    )

# `logging.debug()`

Logging messages indicate when we enter and leave a function:

In [None]:
def myFunc(i,t):
    logging.debug("%2d >>      %5.2f"%(i,t))
    time.sleep(t)
    logging.debug("   <<%2d         "%(i  ))

In [None]:
myFunc(0,2.0)

### Serial execution

The function is called sequentially for different inputs.

In [None]:
for iTASK in TASKS:
    myFunc( iTASK, TIMES[iTASK] )

### Parallel execution

The function is called on all inputs at once, each call running in its own thread.

In [None]:
for iTASK in TASKS:
    # define thread
    t = threading.Thread(target = myFunc, args = (iTASK, TIMES[iTASK]))
    
    # start thread
    t.start()                           

Sending tasks to multiple threads at once could be problematic because we may not have enough cores on the computer. 

We should work with as many threads as the number of cores:
* form a group of threads equal to the number of cores;
* start all threads in a group at once;
* wait until this group of tasks completes;
* form and start similar groups of the remaining tasks.

## `Thread.join()`
The function blocks the calling thread until the thread whose `join` method is called terminates. 
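A small timing sketch of this blocking behavior, assuming a hypothetical `worker` that only sleeps: `start()` returns almost immediately, while `join()` returns only after the worker has finished.

```python
import threading
import time

def worker(delay):
    time.sleep(delay)                 # hypothetical task

t = threading.Thread(target=worker, args=(0.3,))

tick = time.perf_counter()
t.start()                             # returns without waiting for worker()
t_start = time.perf_counter() - tick
t.join()                              # blocks until worker() returns
t_join = time.perf_counter() - tick

print("after start: %.2f s, after join: %.2f s" % (t_start, t_join))
```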

*** 

This can be used to synchronize execution of a group of threads. 

For example, we can initiate a limited number of threads equal to the actual number of compute cores available, instead of initiating a large number of threads for all tasks to be executed.

In [None]:
nCORES = 3                # number of available cores

tGROUP = [None] * nCORES  # list of threads in a group

In [None]:
iTASK = 0
while iTASK < nTASKS:                              # loop over tasks
    iCORE = iTASK%nCORES      
    
    tGROUP[iCORE] = threading.Thread(target = myFunc, args=(iTASK, 1.0) ) 
    tGROUP[iCORE].start()                          # start current thread
    
    if(iCORE == nCORES-1 or                        # completed the group
       iTASK == nTASKS-1):                         # ran out of tasks
        
        for t in tGROUP:     
            if t is not None:
                t.join()                           # wait to complete group
            
        tGROUP = [None] * nCORES                   # reset thread group
    iTASK += 1

We can monitor the number of active threads at various moments.

In [None]:
MONITOR = []                                        # init thread count monitor

iTASK = 0                                           # loop over tasks
while iTASK < nTASKS:
    iCORE = iTASK%nCORES      
    
    tGROUP[iCORE] = threading.Thread(target = myFunc, args=(iTASK, 1.0) ) 
    tGROUP[iCORE].start()                           # start current thread
    MONITOR.append(threading.active_count())        # monitor the thread count
    
    if(iCORE == nCORES-1 or                         # completed the group
       iTASK == nTASKS-1):                          # ran out of tasks
        
        for t in tGROUP:     
            if t is not None:
                t.join()                            # wait to complete group
                
        MONITOR.append(threading.active_count())    # monitor the thread count        
        tGROUP = [None] * nCORES                    # reset thread group  
    iTASK += 1

print(MONITOR)                                      # show thread count monitor

<img src="https://www.dropbox.com/s/7vd3ezqkyhdxmap/demo.png?raw=1" width="10%" align="left">

# Demo
Imagine that you would like to simulate a 2D function defined by

$g(x,y) = \dfrac{1}{\sqrt{2\pi}\sigma_x} 
e^{ -\dfrac{1}{2} 
\left( \dfrac{x-c_x}{\sigma_x} \right)^2
}
$

where 

$c_x(y) = y$

and 

$\sigma_x(y) = 0.05(1+|y|)$

We want to take advantage of multi-threading and generate the function $g(x,y)$ by parallelizing its calculation over $y$.
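For a fixed $y$, the formula is a normalized Gaussian in $x$, so its integral over $x$ should be close to 1 whenever the curve fits inside the $x$ range. A quick numerical check, assuming a hypothetical helper `gaussian_row` that is not part of the notebook and a grid from -1 to 1 in steps of 0.01:

```python
import numpy as np

def gaussian_row(x, y):
    # hypothetical helper: evaluates g(x, y) for one value of y
    cx = y                                  # Gaussian center
    sx = 0.05 * (1 + np.abs(y))             # Gaussian standard deviation
    return 1 / (np.sqrt(2 * np.pi) * sx) * np.exp(-0.5 * ((x - cx) / sx) ** 2)

dx = 0.01
x = -1.0 + dx * np.arange(201)              # x from -1 to 1 in steps of dx

g0 = gaussian_row(x, 0.0)
area = np.sum(g0) * dx                      # Riemann-sum estimate of the integral over x
peak = g0.max()                             # largest sample, located at x = c_x

print("area = %.4f, peak = %.4f" % (area, peak))
```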

Define the variables $x$ and $y$ over the computation space:

In [None]:
nx,ox,dx = 201,-1.0,0.01
ny,oy,dy = 201,-1.0,0.01

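xx = np.linspace(ox, ox+(nx-1)*dx, nx)   # last sample at ox+(nx-1)*dx keeps the step equal to dx
yy = np.linspace(oy, oy+(ny-1)*dy, ny)   # last sample at oy+(ny-1)*dy keeps the step equal to dy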

X, Y = np.meshgrid(xx, yy)

Define a function that computes a 1D Gaussian for a given $y$ at all values of $x$. Slow down the function execution using `sleep` to emphasize the difference between serial and parallel execution:

In [None]:
def makeGaussian(g, y, xx):
    
    cx = y                       # Gaussian center
    sx = 0.05 * (1 + np.abs(y))  # Gaussian standard deviation
    
    i = int(round((y-oy)/dy))    # row index; round to avoid floating-point truncation
    
    logging.debug("%2dG >>     %6.3f"%(i,cx))
        
    g[:] = 1/( np.sqrt(2*np.pi) * sx) * np.exp( -0.5*( (xx-cx)/sx )**2 )
    
    time.sleep(1e-2)             # simulate longer function execution
        
    logging.debug("    <<%2dG "%(i))

In [None]:
gONE = np.zeros(nx, dtype=float) # allocate output space

makeGaussian(gONE, 0.0, xx)      # call Gaussian function

In [None]:
plt.figure(figsize=(15,5))
plt.plot(xx,gONE)                # plot Gaussian function
plt.xlabel('x')
plt.ylabel('g')

plt.show();

For a function of both $x$ and $y$, we can allocate 2D `numpy` arrays:

In [None]:
gSRL = np.zeros( [ny,nx], dtype=float) # for   serial execution
gPAR = np.zeros( [ny,nx], dtype=float) # for parallel execution

The serial code is simply a series of function calls inside a `while` loop: 

In [None]:
# serial code

tick = time.time()       # start the clock

iy = 0
while iy < ny:
    y = oy + iy*dy
    makeGaussian(gSRL[iy,:], y, xx)

    iy += 1
    
tock = time.time()       #  stop the clock
dtSRL = (tock-tick)*1e3  # time difference 

In [None]:
print('elapsed time =',int(dtSRL),'(ms)')

The parallel code uses the `threading` module to define, start and join threads. All threads share the same array, but work on different parts of it to avoid **race conditions** (i.e. a situation when multiple threads try to change the same data at the same time).

In [None]:
nCORES = 4                                          # number of available cores
tGROUP = [None] * nCORES                            # list of threads in a group

tick = time.time()                                  # start the clock

iy = 0
while iy < ny:                                      # loop over tasks
    y = oy + iy*dy
        
    iCORE = iy%nCORES      
    
    tGROUP[iCORE] = threading.Thread( target = makeGaussian, args = (gPAR[iy,:], y, xx))    
    tGROUP[iCORE].start()                           # start current thread
    
    if(iCORE == nCORES-1 or                         # completed the group
          iy == ny-1):                              # ran out of tasks
        
        for t in tGROUP:     
            if t is not None:
                t.join()
            
        tGROUP = [None] * nCORES                    # reset thread group
    
    iy += 1
    
tock = time.time()                                      # stop the clock
dtPAR = (tock-tick)*1e3                                 # time difference 

In [None]:
print('elapsed time =',int(dtPAR),'(ms)')

The serial and parallel calculations produce identical output.
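Besides comparing plots, the agreement can be checked numerically with `np.array_equal`. The sketch below is a self-contained stand-in for the notebook arrays: the hypothetical `fill_row` task writes only to the row it receives, so threads never touch the same data.

```python
import threading
import numpy as np

def fill_row(row, value):
    # hypothetical task: each call writes only to the row (array view) it receives
    row[:] = value

n_rows, n_cols = 8, 5
serial = np.zeros((n_rows, n_cols))
parallel = np.zeros((n_rows, n_cols))

for i in range(n_rows):                     # serial reference
    fill_row(serial[i, :], i)

threads = [threading.Thread(target=fill_row, args=(parallel[i, :], i))
           for i in range(n_rows)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                # wait before reading the result

same = np.array_equal(serial, parallel)
print("identical:", same)
```

In the notebook, the equivalent check would be `np.array_equal(gSRL, gPAR)`.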

In [None]:
plt.figure(figsize=(14,7))

plt.subplot(1,2,1)
plt.contourf(X, Y, gSRL, 50);
plt.xlabel('x');
plt.ylabel('y');
plt.axis('equal');
plt.axis('tight');

plt.subplot(1,2,2)
plt.contourf(X, Y, gPAR, 50);
plt.xlabel('x');
plt.ylabel('y');
plt.axis('equal');
plt.axis('tight');

The speed-up $t_{SRL}/t_{PAR}$ is:

In [None]:
print('speed-up = %4.2f'%(dtSRL/dtPAR) )

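Ideally, the speed-up is close to the number of threads running at once (here, one per core). In this demo the gain comes mainly from overlapping the `sleep` calls; in CPython, threads running CPU-bound Python code are limited by the global interpreter lock and speed up far less.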

In [None]:
print('cores used =',nCORES)
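A rough, self-contained way to measure such a speed-up, assuming tasks that only wait (the hypothetical `wait_task` below). The task count and delay are illustrative values, and measured timings vary from run to run.

```python
import threading
import time

def wait_task(delay):
    time.sleep(delay)                       # stands in for a task dominated by waiting

n_tasks, delay = 4, 0.1                     # illustrative values

tick = time.perf_counter()
for _ in range(n_tasks):                    # serial: the delays add up
    wait_task(delay)
dt_serial = time.perf_counter() - tick

tick = time.perf_counter()
threads = [threading.Thread(target=wait_task, args=(delay,))
           for _ in range(n_tasks)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                # parallel: the delays overlap
dt_parallel = time.perf_counter() - tick

speedup = dt_serial / dt_parallel
print("speed-up = %4.2f with %d threads" % (speedup, n_tasks))
```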