# Concurrency and Parallelism in Python

## pyconil 2019
### by Guy Doulberg

# Talk Metadata 

# Who am I?


* My name is Guy Doulberg
* I work at Satellogic
* I have been developing in Python for the last 3 years



# Why am I doing this talk?



* Because it is important to every Python developer
* Because I would have been happy to hear this talk 2 years ago, when I first needed concurrency


# What am I going to talk about?





* What concurrency and parallelism are
* How to implement software that uses concurrency and parallelism with the Python standard library
* How to implement concurrency and parallelism across hosts (a cluster)


# Which python?


* python 3.7
* I don't know enough about python 2.7

# Theoretical background

# Concurrency

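Suppose you have two or more tasks that need to make progress during the same period of time. With concurrency, the tasks can start, run, and complete in overlapping time periods, but they do not necessarily execute at the same instant.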



# Parallelism

Suppose you have two or more tasks that run at exactly the same time (on parallel paths, for example on different CPU cores).

![image](http://koliber.ir/wp-content/uploads/2018/07/cvp-300x300.jpeg)
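A minimal sketch of concurrency using only the standard library: two tasks that mostly wait (`time.sleep` stands in for waiting on a network or a disk) overlap when run in threads, so the total time is close to the longest task rather than the sum. The function name `wait_task` and the durations are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_task(duration):
    """Stand-in for a task that spends its time waiting (hypothetical)."""
    time.sleep(duration)
    return duration


durations = [0.2, 0.2]

# Sequential: the second task starts only after the first one finishes.
start_time = time.perf_counter()
for d in durations:
    wait_task(d)
sequential_time = time.perf_counter() - start_time

# Concurrent: both tasks are in progress during the same period.
start_time = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as e:
    results = list(e.map(wait_task, durations))
concurrent_time = time.perf_counter() - start_time

print("Sequential: %.2fs, concurrent: %.2fs" % (sequential_time, concurrent_time))
```

Whether the two tasks also run in parallel depends on the hardware and the runtime; the sketch only shows that their lifetimes overlap.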

# IO bound vs CPU bound

When programming, you should consider running your code concurrently or in parallel when:






* Your program can run faster if it uses more computing units (CPUs) - **CPU bound**


* Your program can run faster if it uses more bandwidth, or reads from / writes to several sources at once - **IO bound**

It is important to identify the nature of your program, because a remedy for one kind of bound does not help the other:



* Adding more compute units to an IO-bound program will not help (and might make things worse)

* Running tasks that can use only one CPU will not help a CPU-bound program either.


A program can also be bound by the available memory of the machine.
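One rough way to tell the two cases apart is to compare wall-clock time with CPU time: a task that mostly waits uses far less CPU time than wall-clock time, while a task that computes uses roughly the same amount of both. The sketch below assumes `time.sleep` as a stand-in for IO and a pure-Python loop as a stand-in for computation; the helper names are hypothetical.

```python
import time


def measure(task):
    """Return (wall-clock seconds, CPU seconds) spent running task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def waiting_task():
    # Stand-in for IO: the process is idle while it waits.
    time.sleep(0.3)


def computing_task():
    # Stand-in for CPU work: a pure-Python loop that keeps one core busy.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


io_wall, io_cpu = measure(waiting_task)
cpu_wall, cpu_cpu = measure(computing_task)

print("waiting:   wall=%.2fs cpu=%.2fs" % (io_wall, io_cpu))
print("computing: wall=%.2fs cpu=%.2fs" % (cpu_wall, cpu_cpu))
```

This is only a heuristic for a single process; a real profiler gives a more reliable picture.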



# Dealing with IO-bound tasks


* Break up the code that is bound by IO and run it *concurrently*. (scale out)
    * On a single machine you can't use more than your available disk I/O
    * On a single machine you can't use more than your available network I/O

* Scale up your disk or network card

In most of the scenarios I have dealt with, using the network card and the disk concurrently was more than enough.

Let's do it in python

First I will create an example of an I/O-bound task:

A Flask server that runs the following code:

```python
import time

from flask import Flask

app = Flask(__name__)


@app.route("/")
def sleep_well():
    # Simulate a slow, IO-like operation.
    sleep_duration = 2
    time.sleep(sleep_duration)
    return "Hello World!"
```

I am running this Flask server behind gunicorn, in order to control the number of requests it can handle at once:

```shell
gunicorn --bind 0.0.0.0:5000  -w 10 wsgi
```

# The naive implementation is **sequential**

In [1]:
import time
import requests

host = "http://localhost:5000"
start_time = time.time()

for i in range(10):
    body = requests.get(host)
    assert body.text == "Hello World!"
    
end_time = time.time()
total = end_time - start_time

print("Total time of execution: %s" % (total))

Total time of execution: 20.104703664779663


# Using concurrent.futures.ThreadPoolExecutor


In [2]:
from concurrent.futures import ThreadPoolExecutor, as_completed 
import time
import requests

host = "http://localhost:5000"

def io_bounded_task():
    body = requests.get(host)
    return body
    
start_time = time.time()
with ThreadPoolExecutor() as e:
    futures = {e.submit(io_bounded_task) for i in range(10)}
    for future in as_completed(futures):
        assert future.result().text == "Hello World!"
    
end_time = time.time()
total = end_time - start_time

print("Total time of execution: %s" % (total))

Total time of execution: 2.036428213119507


# What happens when we reach the limit?

* It depends on what is blocking us
* If we simply exceed the available bandwidth (of the server, the client's ISP, or the network card), we just won't be able to run more concurrent tasks
* We could also be blocked by the server we are trying to reach

Therefore we can throttle our requests:


In [3]:
start_time = time.time()
with ThreadPoolExecutor() as e:
    futures = {e.submit(io_bounded_task) for i in range(20)}
    for future in as_completed(futures):
        assert future.result().text == "Hello World!"
    
end_time = time.time()
total = end_time - start_time

# The server throttles us (it has only 10 workers), so nothing breaks.
# It is not healthy to rely on the server for this; we should limit ourselves.
print("Total time of execution: %s" % (total))

Total time of execution: 4.042229175567627


In [4]:
start_time = time.time()
with ThreadPoolExecutor(max_workers=10) as e:
    futures = {e.submit(io_bounded_task) for i in range(20)}
    for future in as_completed(futures):
        assert future.result().text == "Hello World!"
    
end_time = time.time()
total = end_time - start_time

print("Total time of execution: %s" % (total))

Total time of execution: 4.106059551239014
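To see that `max_workers` really caps the number of tasks in flight, here is a small sketch that needs no server. The function `fake_request` is a hypothetical stand-in for `io_bounded_task` that sleeps instead of making an HTTP call, and the counters exist only to record how many tasks ran at the same time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

max_workers = 3                       # our own limit on concurrent tasks
stats = {"in_flight": 0, "peak": 0}   # tasks running now / highest seen
lock = threading.Lock()


def fake_request():
    """Stand-in for io_bounded_task; sleeps instead of calling a server."""
    with lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.05)
    with lock:
        stats["in_flight"] -= 1
    return "Hello World!"


with ThreadPoolExecutor(max_workers=max_workers) as e:
    futures = {e.submit(fake_request) for i in range(12)}
    responses = [future.result() for future in as_completed(futures)]

print("Peak concurrent tasks: %s" % stats["peak"])
```

The remaining tasks wait in the executor's queue until a worker thread is free, so the client never sends more than `max_workers` requests at once.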


# asyncio
## When dealing with IO-bound tasks we can also use the asyncio module



In [7]:
import asyncio

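# Declaring a coroutine function.
# Note: requests.get is a blocking call, so it does not yield to the event loop;
# several of these coroutines would still run one after another.
async def async_io_bounded_task():
    body = requests.get(host)
    assert body.text == "Hello World!"
    
start_time = time.time()
# A coroutine that runs several coroutines; here it is given only one.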
async def main():
    await asyncio.gather(async_io_bounded_task())
    
# I am running in a Jupyter notebook, so an event loop is already running:
await main()

# If you don't have a running event loop, use:
# asyncio.run(main())
end_time = time.time()
total = end_time - start_time

print("Total time of execution: %s" % (total))

Total time of execution: 2.009946823120117
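For coroutines to overlap, each one has to yield to the event loop while it waits. The sketch below shows two ways to get there, under stated assumptions: `asyncio.sleep` stands in for a truly non-blocking IO call, and `blocking_call` is a hypothetical stand-in for a blocking library call such as `requests.get`, handed to the loop's default thread pool with `run_in_executor`.

```python
import asyncio
import time


def blocking_call():
    """Stand-in for a blocking library call such as requests.get."""
    time.sleep(0.2)
    return "Hello World!"


async def non_blocking_task():
    # asyncio.sleep yields to the event loop while it waits.
    await asyncio.sleep(0.2)
    return "Hello World!"


async def wrapped_blocking_task():
    # Hand the blocking call to the loop's default thread pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_call)


async def main():
    tasks = [non_blocking_task() for i in range(5)]
    tasks += [wrapped_blocking_task() for i in range(5)]
    return await asyncio.gather(*tasks)


start_time = time.perf_counter()
# In a notebook with a running event loop, use `await main()` instead.
results = asyncio.run(main())
total = time.perf_counter() - start_time

print("Total time of execution: %s" % (total))
```

Ten tasks of 0.2 seconds each would take about 2 seconds one after another; when they overlap, the total should stay well below that.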


# Which module to choose?

I think it is a matter of taste.

Maybe I am old-fashioned, but I prefer the Executor classes of `concurrent.futures` because it is easier for me to follow the code path.



# CPU-bound jobs


## For the purpose of the talk I am going to produce a big array of random floats





In [1]:
import numpy
random_numbers = numpy.random.random(300000000)

* I am using numpy to speed things up

* A numpy array is also an object that is easier to share across processes; I will use this feature later

# Let's try to get the maximum value of the array

# First Attempt: sequential approach

In [2]:
import time
start_time = time.time()
print("Max values is %s " % max(random_numbers))

end_time = time.time()
total = end_time - start_time

print("Total time of execution: %s" % (total))


Max values is 0.9999999999065105 
Total time of execution: 15.630356311798096


## CPU utilization when running the sequential max:

![image](https://guydou.github.io/pycon2019_con_para//images/max_seq.png)

# Let's use threads


### From a description of the C/C++ pthread library:

```
... It is most effective on multi-processor or multi-core systems where the process flow can be scheduled to run on another processor thus gaining speed through parallel or distributed processing...
```

### Proposed implementation:

1. Split the list into chunks
2. Find the max value in each chunk
3. Find the maximum among the chunk maxima
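The three steps can be sketched without threads first, to check the logic on its own. The helper `chunk_bounds` is a hypothetical name; unlike a fixed chunk size, it spreads any remainder over the first chunks so that no element is skipped when the length is not divisible by the number of chunks. It assumes there are at least as many elements as chunks.

```python
import random


def chunk_bounds(length, num_chunks):
    """Yield (start, end) index pairs that cover range(length) completely."""
    size, remainder = divmod(length, num_chunks)
    start = 0
    for i in range(num_chunks):
        # The first `remainder` chunks get one extra element.
        end = start + size + (1 if i < remainder else 0)
        yield start, end
        start = end


numbers = [random.random() for _ in range(1003)]
bounds = list(chunk_bounds(len(numbers), 10))

# Steps 1 and 2: the maximum of every chunk.
chunk_maxima = [max(numbers[start:end]) for start, end in bounds]
# Step 3: the maximum of the maxima.
overall_max = max(chunk_maxima)

print("Max value is %s" % overall_max)
```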

In [3]:
from concurrent.futures import ThreadPoolExecutor, as_completed 
start_time = time.time()

num_chunks = 100


with ThreadPoolExecutor() as e:
    offset = int(len(random_numbers)/num_chunks)
    futures = {e.submit(max, random_numbers[i*offset:(i+1)*offset]) for i in range(num_chunks)}
    max_values = [future.result() for future in as_completed(futures)]

print("Max values is %s " % max(max_values))
        
end_time = time.time()
total = end_time - start_time
print("Total time of execution: %s" % (total))

Max values is 0.9999999999065105 
Total time of execution: 16.24165678024292


![oops](https://thumbs.dreamstime.com/z/vector-oops-symbol-over-white-29840798.jpg)

# Say hello to the GIL, or Global Interpreter Lock

## The GIL is a mutex (lock) that prevents Python objects from being accessed by different threads at the same time.

### In our case we need access from different threads, but in practice the GIL makes the threads' code run sequentially, as if on a single thread

## CPU utilization when running multithreaded code:

![image](https://guydou.github.io/pycon2019_con_para/images/threads_max.png)
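The effect can be reproduced without numpy. The sketch below times a pure-Python countdown run twice in a row and then in two threads. On a standard CPython build with the GIL, the two timings are typically close to each other; the exact numbers depend on the machine and the interpreter, so none are shown here. The function name `count_down` and the loop size are arbitrary choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """Pure-Python CPU-bound loop; it never waits on IO."""
    while n > 0:
        n -= 1
    return n


n = 2_000_000

# Two runs, one after the other.
start_time = time.perf_counter()
sequential_results = [count_down(n), count_down(n)]
sequential_time = time.perf_counter() - start_time

# The same two runs, each in its own thread.
start_time = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as e:
    threaded_results = list(e.map(count_down, [n, n]))
threaded_time = time.perf_counter() - start_time

print("Sequential: %.2fs, two threads: %.2fs" % (sequential_time, threaded_time))
```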

## Let's try the same thing using processes

In [5]:
from concurrent.futures import ProcessPoolExecutor

start_time = time.time()
    

with ProcessPoolExecutor() as e:
    offset = int(len(random_numbers)/num_chunks)
    futures = {e.submit(max, random_numbers[i*offset:(i+1)*offset]) for i in range(num_chunks)}
    max_values = [future.result() for future in as_completed(futures)]

print("Max values is %s " % max(max_values))

end_time = time.time()
total = end_time - start_time
              
print("Total time of execution: %s" % (total))

Max values is 0.9999999999065105 
Total time of execution: 8.585535049438477


## Why is it not faster?

* Spawning processes is a heavy task
* Marshaling/unmarshaling each chunk of data is heavy

## What to do?

* Initialize the processes in advance and reuse them
* Use shared memory and pass a pointer to it instead of the data
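To get a feel for the marshaling cost, compare what has to be serialized in each approach. The sketch below uses `pickle` directly, on the assumption that it is a fair proxy for what a process pool does with task arguments. The key `"shm://example"` and the index range are illustrative values, not a real shared memory segment.

```python
import pickle
import random

chunk = [random.random() for _ in range(100_000)]

# What travels to the worker if we submit the chunk itself.
data_payload = pickle.dumps(chunk)

# What travels if we submit only a description of where the chunk lives:
# a key for the shared memory segment plus a start and an end index.
reference_payload = pickle.dumps(("shm://example", 0, 100_000))

print("Chunk payload:     %d bytes" % len(data_payload))
print("Reference payload: %d bytes" % len(reference_payload))
```

The data payload grows with the chunk, while the reference stays a few dozen bytes no matter how large the array is.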

In [11]:
from concurrent.futures import ProcessPoolExecutor
import SharedArray as sa


array_key = "shm://test"

array = sa.create(array_key, random_numbers.shape, random_numbers.dtype)
array[:] = random_numbers[:]

def max_shared_array(array_key, start, end):
    # Attach to the shared array by key instead of receiving a pickled copy.
    array = sa.attach(array_key)
    return max(array[start:end])


start_time = time.time()

with ProcessPoolExecutor() as e:
    offset = int(len(random_numbers)/num_chunks)
    futures = {e.submit(max_shared_array,array_key, i*offset, (i+1)*offset) for i in range(num_chunks)}
    max_values = [future.result() for future in as_completed(futures)]

print("Max values is %s " % max(max_values))

end_time = time.time()
total = end_time - start_time

sa.delete(array_key)

              
print("Total time of execution: %s" % (total))

Max values is 0.9999999999065105 
Total time of execution: 6.842570781707764


## CPU utilization when running multiprocess code:


![image](https://guydou.github.io/pycon2019_con_para/images/processes_max.png)

## Now let's check what happens when running with NumPy

In [77]:
start_time = time.time()
    

print("Max values is %s " % numpy.max(random_numbers))

end_time = time.time()
total = end_time - start_time
              
print("Total time of execution: %s" % (total))

Max values is 0.9999999993369062 
Total time of execution: 0.14561033248901367


# CPU bound conclusion:

1. Don't forget the GIL
2. Multiprocessing is your friend
3. Use native C/C++ libraries such as numpy and pandas.

# Dask


* **Dynamic task scheduling optimized for computation**
* **“Big Data” collections**


![architecture](https://docs.dask.org/en/latest/_images/collections-schedulers.png)

* In other words, you can run a complex execution graph on many data types, using the same code on your laptop and on a multi-host cluster

* A way to solve memory-bound problems

### I will demo only the **dask.distributed** library, which can help you run your code across CPUs and hosts


In [86]:
import dask

from dask.distributed import Client

client = Client()  # set up a local cluster on your laptop
# client = Client("<cluster host>")  # or connect to an existing cluster

def array_chunks(array, num_chunks):
    
    for i in range(num_chunks):
        offset = int(len(array)/num_chunks)
        yield array[i*offset:(i+1)*offset]

max_values = client.map(max, array_chunks(random_numbers, num_chunks))
max_value = client.submit(max, max_values)
print(max_value.result())



TypeError: can't pickle generator objects
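The message comes from `pickle`: generator objects cannot be serialized, and work sent to other processes usually has to be. The sketch below reproduces the error with the standard library alone and shows that a list of the same chunks pickles fine. Whether turning the generator into a list is the right fix for the dask call above is an assumption that has not been verified here.

```python
import pickle


def array_chunks(array, num_chunks):
    """Chunking generator, repeated here to keep the sketch self-contained."""
    offset = len(array) // num_chunks
    for i in range(num_chunks):
        yield array[i * offset:(i + 1) * offset]


numbers = list(range(100))

# A generator object cannot be pickled.
try:
    pickle.dumps(array_chunks(numbers, 4))
    generator_pickled = True
except TypeError as error:
    generator_pickled = False
    print("TypeError: %s" % error)

# A list of the same chunks pickles without a problem.
chunks = list(array_chunks(numbers, 4))
restored = pickle.loads(pickle.dumps(chunks))
```

The exact wording of the error message differs between Python versions, but the exception type is `TypeError` in each case.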