# Programming 2
## Master Data Science for Life Sciences
### (c) Martijn Herber / Stichting Hanzehogeschool Groningen 2020,2021
### License: Creative Commons BY-NC-SA 4.0 

# Introduction
* You'll learn how to distribute data and calculations over multiple CPUs and computers
* Practical approach, aims to be useful in your project straightaway
* If you pass all assignments, you pass the course :-)

## 3 paradigms
* That's Silicon Valley-speak for "strategies"
* -> How much do you want to do yourself?
* "Manual" -> write data splitting & data processing functions yourself, distribute them yourself
* "Library" -> The data is split up for you, jobs are distributed for you
* "GPU" -> Not a lot of different operations, but those that are there are unbeatably fast


## Parallel processing, Distributed computing
* Modern processors (4-8 cores in the PC)
=> Parallel computing
* Multiple computers (nodes) spread over the network (maybe multiple CPUs per node)
=> Distributed computing
* Thousands of nodes in datacentres not your own
=> Cloud distributed computing

## Process-based distributed computing
* Processes are the basic unit of computation the OS provides
* Strict separation of memory spaces!
* OS takes care of "scheduling" => spreading processes over the available cores
* Due to technical limitations Python needs to use processes to truly run parallel jobs (Global Interpreter Lock)
* This is where the multiprocessing built-in lib comes in!
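
The "strict separation of memory spaces" can be seen in a minimal sketch like the one below. The names `counter` and `increment` are made up for this illustration. The child process changes its own copy of the global, and the parent's copy stays untouched; the only way the child's value gets back is through an explicit channel (here a `multiprocessing.Queue`). Save it as a script and run it from the command line.

```python
import multiprocessing as mp

counter = 0  # module-level variable; every process works on its own copy


def increment(result_queue):
    """Change the global inside the child and report the child's value."""
    global counter
    counter += 100
    result_queue.put(counter)


if __name__ == "__main__":
    result_queue = mp.Queue()
    child = mp.Process(target=increment, args=(result_queue,))
    child.start()
    child_value = result_queue.get()  # read the message before joining
    child.join()

    print("counter in child: ", child_value)  # 100
    print("counter in parent:", counter)      # still 0
```

Sharing data between processes therefore always takes an explicit step (a queue, a pipe, shared memory), unlike threads, which see the same variables.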

In [7]:
import multiprocessing as mp

# How many Processes to start?
* A good rule of thumb is to start as many as you have cores (see below)
* _Except_ if you're mostly doing I/O (esp. network); then you can start many more
* Also take a look at the "asyncio" module in that case!
* The precise effect of the number of processes can be observed _experimentally_
* A good upper bound is 2x the number of cores
* Check out the cores using `multiprocessing`:

In [8]:
mp.cpu_count()

12
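
Observing the effect of the number of processes _experimentally_ could look roughly like this. It is only a sketch: `busy` is a made-up CPU-bound dummy task, and the task size and the process counts tried are arbitrary choices, so the timings will differ per machine. Run it as a script, not from the notebook.

```python
import multiprocessing as mp
import time


def busy(n):
    """CPU-bound dummy task: sum of the squares below n."""
    return sum(i * i for i in range(n))


def time_pool(n_procs, tasks):
    """Run busy() over all tasks with n_procs workers; return (seconds, results)."""
    start = time.perf_counter()
    with mp.Pool(n_procs) as pool:
        results = pool.map(busy, tasks)
    return time.perf_counter() - start, results


if __name__ == "__main__":
    tasks = [200_000] * 32          # 32 equally sized chunks of work
    cores = mp.cpu_count()
    # Try 1 process, half the cores, all cores, and the 2x upper bound
    for n_procs in sorted({1, max(1, cores // 2), cores, 2 * cores}):
        seconds, _ = time_pool(n_procs, tasks)
        print(f"{n_procs:3d} processes: {seconds:.2f} s")
```

For CPU-bound work like this, the time usually stops improving around the number of cores; for I/O-bound work the picture can look quite different.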

# An easy way to quickly parallelize a function's processing
## The multiprocessing.Pool class
* Sets up a "Pool" of available processes and uses functional constructs like apply(), map(), starmap() etc. to distribute data automatically
* This is quite like the "automagic" of Spark or Dask!
* We'll use this construct for assignment 1 to get used to the library and other functions etc.
* But it does give you less _control_ over the precise processing flow!
* NB: Don't run these from the Jupyter notebook! It will crash!
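
The difference between `apply()`, `map()` and `starmap()` is easiest to see side by side. Below is a small sketch, assuming a made-up two-argument function `power` and a pool of 2 processes; save it as a script and run it from the command line.

```python
import multiprocessing as mp


def power(base, exponent):
    """Toy function with two arguments, to show how starmap() unpacks tuples."""
    return base ** exponent


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # apply(): a single call in one worker, blocks until the result is back
        single = pool.apply(power, (2, 10))
        # map(): one call per item, each item is the single argument
        lengths = pool.map(len, ["spark", "dask", "pool"])
        # starmap(): one call per tuple, each tuple is unpacked into arguments
        powers = pool.starmap(power, [(2, 3), (3, 2), (10, 0)])

    print(single)   # 1024
    print(lengths)  # [5, 4, 4]
    print(powers)   # [8, 9, 1]
```

`map()` and `starmap()` return the results in the order of the input, no matter which worker finished first.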

In [12]:
def echo(x):
    print(x, "received!")
    return x

def prime(x):
    #print("working on:",x)
    if x <= 0:
        print("Primality is undefined for 0 or less")
        return False
    if x == 1:
        print("No, 1 is not prime!")
        return False
    if x == 2:
        print("Yes, 2 is prime!")
        return True
    for i in range(2, x-1):                 # This could also be (x // 2)+1
        if x % i == 0:
            #print(x, "is not prime!")
            return False
        
    #print(x, "is prime!")
    return True

if __name__ == "__main__":
    cpus = mp.cpu_count()
    with mp.Pool(cpus) as pool:
        results = pool.map(prime, range(1,49))

    print(results)


# "Processes" are more flexible
## ... but harder to get started with
- Processes are great at responding to _flexible_ demands
- This is "event based programming", quite a chore to organize well!
- But essential if the workload isn't known in advance
- Also if you do lots of I/O this is a better approach

## We'll use this from assignment 2 onwards


In [9]:
import multiprocessing as mp
from time import sleep
from random import randint

def todo(p):
    sleeptime = randint(1,5)
    sleep(sleeptime)
    print("hi! from process", p, "slept nicely for", sleeptime, "seconds.")

if __name__ == "__main__":
    processes = []

    for p in range(5):
        temP = mp.Process(target=todo, args=(p,))
        processes.append(temP)
        temP.start()
    # <- here we create a number of processes and start them

    for p in processes:
        p.join()   # <- here we "join" each process once it has finished

    # <- everything is done!


In [10]:
import multiprocessing as mp
from time import sleep
from random import randint

def todo(p, outputQ):
    sleeptime = randint(1,5)
    sleep(sleeptime)
    outputQ.put("hi! process %s slept for %s sec" % (p , sleeptime))

if __name__ == "__main__":
    processes = []
    outputQ = mp.Queue()

    for p in range(5):
        temP = mp.Process(target=todo, args=(p,outputQ))
        processes.append(temP)
        temP.start()

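    # Read exactly one message per process *before* joining:
    # Queue.empty() is not reliable, and joining a process that
    # still has data waiting in a queue can block.
    for _ in processes:
        msg = outputQ.get()
        print(msg)

    for p in processes:
        p.join()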
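
When the workload isn't known in advance, a common pattern is to keep a fixed set of worker processes alive and feed them through a task queue. The sketch below is one possible way to organize that, not a prescribed solution for the assignments: the names `worker`, `square` and the `STOP` sentinel are invented for this example. Run it as a script.

```python
import multiprocessing as mp

STOP = None  # sentinel value (an assumed convention): "no more work"


def square(x):
    """Stand-in for the real work done on one task."""
    return x * x


def worker(task_queue, result_queue):
    """Keep taking tasks from the queue until the STOP sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is STOP:
            break
        result_queue.put((task, square(task)))


if __name__ == "__main__":
    n_workers = 3
    tasks = list(range(10))
    task_queue, result_queue = mp.Queue(), mp.Queue()

    workers = [mp.Process(target=worker, args=(task_queue, result_queue))
               for _ in range(n_workers)]
    for w in workers:
        w.start()

    for task in tasks:            # tasks could also arrive over time
        task_queue.put(task)
    for _ in workers:             # one STOP per worker
        task_queue.put(STOP)

    # Collect one result per task before joining the workers
    results = dict(result_queue.get() for _ in tasks)
    for w in workers:
        w.join()

    print(sorted(results.items()))
```

Results come back in whatever order the workers finish, which is why each result carries its task along with it.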