# I40 : Consider Coroutines to Run Many Functions Concurrently

- More about : https://www.geeksforgeeks.org/coroutine-in-python/

- Threads give Python programmers a way to run multiple functions seemingly at the same time. But there are three big problems with threads:

- They require special tools to coordinate with each other safely. This makes code that uses threads harder to reason about than procedural, single-threaded code. This complexity makes threaded code more difficult to extend and maintain over time.

- Threads require a lot of memory, about 8 MB per executing thread. On many computers, that amount of memory doesn't matter for a dozen threads or so. But what if you want your program to run tens of thousands of functions "simultaneously"? These functions may correspond to user requests to a server, pixels on a screen, particles in a simulation, etc. Running a thread per unique activity just won't work.

- Threads are costly to start. If you want to constantly be creating new concurrent functions and finishing them, the overhead of using threads becomes large and slows everything down.

- Python can work around all these issues with *coroutines*. Coroutines let you have many seemingly simultaneous functions in your Python programs. They're implemented as an extension to generators. The cost of starting a generator coroutine is a function call. Once active, they each use less than 1 KB of memory until they're exhausted.

- Coroutines work by enabling the code consuming a generator to send a value back into the generator function after each yield expression. The generator function receives the value passed to the send function as the result of the corresponding yield expression.

In [1]:
def my_coroutine():
    while True:
        received = yield
#         value = yield received
        print('Received:', received)
        
it = my_coroutine()
next(it)
it.send('First')
it.send('Second')
print(it.send('yield'))

Received: First
Received: Second
Received: yield
None


- The initial call to next is required to prepare the generator for receiving the first send by advancing it to the first yield expression. Together, yield and send provide generators with a standard way to vary their next yielded value in response to external input.
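A minimal sketch of why the priming call matters, assuming standard CPython generator behavior. The `echo` coroutine and the `log` list are hypothetical names used only for illustration. Sending a non-None value to a generator that has not yet reached its first yield raises a TypeError, and `send(None)` has the same effect as calling `next`.

```python
def echo(log):
    # Hypothetical coroutine: records every value it is sent.
    while True:
        received = yield
        log.append(received)

log = []
it = echo(log)

try:
    it.send('too early')   # Not primed: no yield expression is waiting yet
except TypeError as exc:
    primed_error = exc
else:
    primed_error = None

it.send(None)              # Same effect as next(it): run to the first yield
it.send('First')
it.send('Second')
print(log)
```

The failed send does not consume or close the generator, so it can still be primed and used afterwards.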

- For example, say you want to implement a generator coroutine that yields the minimum value it's been sent so far. Here, the bare yield prepares the coroutine with the initial minimum value sent in from the outside. Then the generator repeatedly yields the new minimum in exchange for the next value to consider.

In [30]:
def minimize():
    current = yield
    while True:
        value = yield current
        current = min(value, current)
        
it = minimize()
next(it)
print(it.send(10))
print(it.send(4))
print(it.send(22))
print(it.send(-1))

10
4
4
-1


- The generator function will seemingly run forever, making forward progress with each new call to send. Like threads, coroutines are independent functions that can consume inputs from their environment and produce resulting outputs. The difference is that coroutines pause at each yield expression in the generator function and resume after each call to send from the outside. This is the magical mechanism of coroutines.

In [29]:
def print_name(prefix):
    print("Searching prefix:{}".format(prefix))
    while True:
        name = (yield)
        if prefix in name:
            print(name)
 
# calling coroutine, nothing will happen
corou = print_name("Dear")
 
# This starts execution of the coroutine, prints the
# first line ("Searching prefix:..."), and advances
# execution to the first yield expression
corou.__next__()
 
# sending inputs
corou.send("Atul")
corou.send("Dear Atul")


Searching prefix:Dear
Dear Atul


- This behavior allows the code consuming the generator to take action after each yield expression in the coroutine. The consuming code can use the generator's output values to call other functions and update data structures. Most importantly, it can advance other generator functions until their next yield expressions. By advancing many separate generators in lockstep, they will all seem to be running simultaneously, mimicking the concurrent behavior of Python threads.
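A minimal sketch of the lockstep idea, not taken from the source: the `countdown` generator and the `run_lockstep` driver are hypothetical names. The driver advances each generator by one step per round, so their outputs interleave as if they were running at the same time.

```python
def countdown(name, n):
    # Hypothetical worker: yields (name, n) and pauses, until n reaches zero.
    while n > 0:
        yield (name, n)
        n -= 1

def run_lockstep(generators):
    # Advance every generator one step per round; drop the exhausted ones.
    log = []
    active = list(generators)
    while active:
        still_running = []
        for gen in active:
            try:
                log.append(next(gen))
            except StopIteration:
                continue
            still_running.append(gen)
        active = still_running
    return log

log = run_lockstep([countdown('a', 2), countdown('b', 3)])
print(log)
```

Each generator only makes progress when the driver resumes it, which is why no locks are needed to coordinate them.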

## The game of life

- Let me demonstrate the simultaneous behavior of coroutines with an example. Say you want to use coroutines to implement Conway's Game of Life. The rules of the game are simple. You have a two-dimensional grid of an arbitrary size. Each cell in the grid can either be alive or empty.

In [2]:
ALIVE = '*'
EMPTY = '-'

- The game progresses one tick of the clock at a time. At each tick, each cell counts how many of its neighboring eight cells are still alive. Based on its neighbor count, each cell decides if it will keep living, die, or regenerate. Here's an example of a 5 x 5 Game of Life grid after four generations with time going to the right. I'll explain the specific rules further below.

- I can model this game by representing each cell as a generator coroutine running in lockstep with all the others.

- To implement this, first I need a way to retrieve the status of neighboring cells. I can do this with a coroutine named count\_neighbors that works by yielding Query objects. The Query class I define myself. Its purpose is to provide the generator coroutine with a way to ask its surrounding environment for information.

In [5]:
from collections import namedtuple
Query = namedtuple('Query', ('y', 'x'))

- The coroutine yields a Query for each neighbor. The result of each yield expression will be the value ALIVE or EMPTY. That's the interface contract I've defined between the coroutine and its consuming code. The count\_neighbors generator sees the neighbors' states and returns the count of living neighbors.

In [9]:
def count_neighbors(y, x):
    n_ = yield Query(y + 1, x + 0)  # North
    ne = yield Query(y + 1, x + 1)  # Northeast
    e_ = yield Query(y + 0, x + 1)  # East
    se = yield Query(y - 1, x + 1)  # Southeast
    s_ = yield Query(y - 1, x + 0)  # South
    sw = yield Query(y - 1, x - 1)  # Southwest
    w_ = yield Query(y + 0, x - 1)  # West
    nw = yield Query(y + 1, x - 1)  # Northwest
    
    neighbor_states = [n_, ne, e_, se, s_, sw, w_, nw]
    count = 0
    for state in neighbor_states:
        if state == ALIVE:
            count += 1
    return count

- I can drive the count\_neighbors coroutine with fake data to test it. Here, I show how Query objects will be yielded for each neighbor. count\_neighbors expects to receive cell states corresponding to each Query through the coroutine's send method. The final count is returned in the StopIteration exception that is raised when the generator is exhausted by the return statement.

In [10]:
it = count_neighbors(10, 5)
q1 = next(it)
print('First yield:  ', q1)
q2 = it.send(ALIVE)
print('Second yield:', q2)
q3 = it.send(ALIVE)
print('Third yield:', q3)
q4 = it.send(ALIVE)
print('Fourth yield:', q4)
q5 = it.send(ALIVE)
print('Fifth yield:', q5)
q6 = it.send(ALIVE)
print('Sixth yield:', q6)
q7 = it.send(ALIVE)
print('Seventh yield:', q7)
q8 = it.send(ALIVE)
print('Eighth yield:', q8)
try:
    count = it.send(EMPTY)
except StopIteration as e:
    print('Count: ', e.value)


First yield:   Query(y=11, x=5)
Second yield: Query(y=11, x=6)
Third yield: Query(y=10, x=6)
Fourth yield: Query(y=9, x=6)
Fifth yield: Query(y=9, x=5)
Sixth yield: Query(y=9, x=4)
Seventh yield: Query(y=10, x=4)
Eighth yield: Query(y=11, x=4)
Count:  7


- Now I need the ability to indicate that a cell will transition to a new state in response to the neighbor count that it found from count\_neighbors. To do this, I define another coroutine called step\_cell. This generator will indicate transitions in a cell's state by yielding Transition objects. This is another class that I define, just like the Query class.

- The step\_cell coroutine receives its coordinates in the grid as arguments. It yields a Query to get the initial state of those coordinates. It runs count\_neighbors to inspect the cells around it. It runs the game logic to determine what state the cell should have for the next clock tick. Finally, it yields a Transition object to tell the environment the cell's next state.

In [12]:
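# Transition tells the environment which state a cell takes on next.
# Its fields match how the values are used later (item.y, item.x, item.state).
Transition = namedtuple('Transition', ('y', 'x', 'state'))

def game_logic(state, neighbors):
    # Placeholder; the real rules are defined in the next cell.
    pass

def step_cell(y, x):
    state = yield Query(y, x)                     # Ask for this cell's current state
    neighbors = yield from count_neighbors(y, x)  # Delegate the eight neighbor queries
    next_state = game_logic(state, neighbors)
    yield Transition(y, x, next_state)            # Report the cell's next state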

- Importantly, the call to count\_neighbors uses the yield from expression. This expression allows Python to compose generator coroutines together, making it easy to reuse smaller pieces of functionality and build complex coroutines from simpler ones. When count\_neighbors is exhausted, the final value it returns will be passed to step\_cell as the result of the yield from expression.
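A small sketch of how yield from behaves, using hypothetical `inner` and `outer` generators that are not part of the game. Values yielded by the inner generator pass straight through to the caller, values sent by the caller reach the inner generator, and the inner generator's return value becomes the result of the yield from expression.

```python
def inner():
    # Hypothetical sub-coroutine: asks two questions, returns the sum of the answers.
    a = yield 'first question'
    b = yield 'second question'
    return a + b

def outer():
    # The return value of inner() becomes the value of the yield from expression.
    total = yield from inner()
    yield 'total is {}'.format(total)

it = outer()
q1 = next(it)        # Comes from inner(), passed through outer()
q2 = it.send(1)      # 1 is delivered to inner() as the value of a
result = it.send(2)  # inner() returns 3; outer() resumes and yields the summary
print(q1, q2, result, sep='\n')
```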

In [13]:
def game_logic(state, neighbors):
    if state == ALIVE:
        if neighbors < 2:
            return EMPTY 
        elif neighbors > 3:
            return EMPTY
    else:
        if neighbors == 3:
            return ALIVE
    return state
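As a quick check of the rules, the following sketch repeats the definitions from the cells above so that it runs on its own, then tabulates the outcome for every neighbor count. The `outcomes` dictionary is a hypothetical name introduced for illustration.

```python
ALIVE = '*'
EMPTY = '-'

def game_logic(state, neighbors):
    if state == ALIVE:
        if neighbors < 2:
            return EMPTY      # Die: too few neighbors
        elif neighbors > 3:
            return EMPTY      # Die: too many neighbors
    else:
        if neighbors == 3:
            return ALIVE      # Regenerate
    return state

# Tabulate the next state for every (state, neighbor count) combination.
outcomes = {(state, n): game_logic(state, n)
            for state in (ALIVE, EMPTY)
            for n in range(9)}

for n in range(9):
    print(n, outcomes[(ALIVE, n)], outcomes[(EMPTY, n)])
```

A living cell survives only with two or three living neighbors, and an empty cell comes alive only with exactly three.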

- I can drive the step\_cell coroutine with fake data to test it.

In [31]:
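it = step_cell(10, 5)
q0 = next(it)            # Initial query for the cell's own state
print('Me:      ', q0)
q1 = it.send(ALIVE)      # Send my state; receive the first neighbor query
print('Q1:      ', q1)
for _ in range(7):       # Answer the first seven neighbor queries
    it.send(EMPTY)
t1 = it.send(EMPTY)      # The eighth answer produces the Transition
print('Outcome: ', t1)

Me:       Query(y=10, x=5)
Q1:       Query(y=11, x=5)
Outcome:  Transition(y=10, x=5, state='-')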

- The goal of the game is to run this logic for a whole grid of cells in lockstep. To do this, I can further compose the step\_cell coroutine into a simulate coroutine. This coroutine progresses the grid of cells forward by yielding from step\_cell many times. After progressing every coordinate, it yields a TICK object to indicate that the current generation of cells have all transitioned.

In [32]:
TICK = object()

def simulate(height, width):
    while True:
        for y in range(height):
            for x in range(width):
                yield from step_cell(y, x)
        yield TICK

- What's impressive about simulate is that it's completely disconnected from the surrounding environment. I still haven't defined how the grid is represented in Python objects, how Query, Transition, and TICK values are handled on the outside, nor how the game gets its initial state. But the logic is clear. Each cell will transition by running step\_cell. Then the game clock will tick. This will continue forever, as long as the simulate coroutine is advanced.

- This is the beauty of coroutines. They help you focus on the logic of what you're trying to accomplish. They decouple your code's instructions for the environment from the implementation that carries out your wishes. This enables you to run coroutines seemingly in parallel. This also allows you to improve the implementation of following those instructions over time without changing the coroutines.
- Now, I want to run simulate in a real environment. To do that, I need to represent the state of each cell in the grid. Here, I define a class to contain the grid:

In [33]:
class Grid(object):
    def __init__(self, height, width):
        self.height = height
        self.width = width
        self.rows = []
        for _ in range(self.height):
            self.rows.append([EMPTY] * self.width)
        
    def __str__(self):
        #...
        pass

- The grid allows you to get and set the value of any coordinate. Coordinates that are out of bounds will wrap around, making the grid act like an infinite looping space.

In [34]:
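# These two methods belong in the body of the Grid class defined above.
def query(self, y, x):
    return self.rows[y % self.height][x % self.width]

def assign(self, y, x, state):
    self.rows[y % self.height][x % self.width] = state

# Attach them to Grid so that this cell also works when run on its own.
Grid.query = query
Grid.assign = assign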
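The wrap-around relies on the fact that Python's % operator returns a result with the sign of the divisor, so negative coordinates map back into range. A small sketch, with `wrap` as a hypothetical helper that mirrors the index arithmetic in query and assign:

```python
height, width = 5, 9

def wrap(y, x):
    # Hypothetical helper: the same index arithmetic used by query and assign.
    return (y % height, x % width)

above_top_row = wrap(-1, 0)      # One row above the top wraps to the bottom row
past_bottom_right = wrap(5, 9)   # One step past the last row and column wraps to the origin
left_of_column_0 = wrap(2, -3)   # Three columns left of column 0
print(above_top_row, past_bottom_right, left_of_column_0)
```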

- At last, I can define the function that interprets the values yielded from simulate and all of its interior coroutines. This function turns the instructions from the coroutines into interactions with the surrounding environment. It progresses the whole grid of cells forward a single step and then returns a new grid containing the next state.

In [35]:
def live_a_generation(grid, sim):
    progeny = Grid(grid.height, grid.width)
    item = next(sim)
    while item is not TICK:
        if isinstance(item, Query):
            state = grid.query(item.y, item.x)
            item = sim.send(state)
        else:
            progeny.assign(item.y, item.x, item.state)
            item = next(sim)
    return progeny

- To see this function in action, I need to create a grid and set its initial state. Here, I make a classic shape called a glider.

In [36]:
grid = Grid(5, 9)
grid.assign(0, 3, ALIVE)
grid.assign(1, 4, ALIVE)
grid.assign(2, 2, ALIVE)
grid.assign(2, 3, ALIVE)
grid.assign(2, 4, ALIVE)
print(grid)

- Now I can progress this grid forward one generation at a time. You can see how the glider moves down and to the right on the grid based on the simple rules from the game\_logic function.

In [None]:
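class ColumnPrinter(object):
    # ... (body elided; it needs an append() method and a __str__()
    # that lays out the collected grids side by side)
    pass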

columns = ColumnPrinter()
sim = simulate(grid.height, grid.width)
for i in range(5):
    columns.append(str(grid))
    grid = live_a_generation(grid, sim)

print(columns)
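Because several pieces above are elided, here is a self-contained sketch that assembles the whole game so it can be run end to end. It is one possible assembly under stated assumptions, not the original implementation: the `Grid.__str__` rendering (one text line per row), the offset table inside count_neighbors, the `generations` list, and printing the generations one below the other instead of using ColumnPrinter are all choices made for this illustration.

```python
from collections import namedtuple

ALIVE, EMPTY = '*', '-'
Query = namedtuple('Query', ('y', 'x'))
Transition = namedtuple('Transition', ('y', 'x', 'state'))
TICK = object()

def count_neighbors(y, x):
    # Same eight queries as above, written as a table of offsets (assumption).
    offsets = [(1, 0), (1, 1), (0, 1), (-1, 1),
               (-1, 0), (-1, -1), (0, -1), (1, -1)]
    count = 0
    for dy, dx in offsets:
        state = yield Query(y + dy, x + dx)
        if state == ALIVE:
            count += 1
    return count

def game_logic(state, neighbors):
    if state == ALIVE:
        if neighbors < 2 or neighbors > 3:
            return EMPTY
    elif neighbors == 3:
        return ALIVE
    return state

def step_cell(y, x):
    state = yield Query(y, x)
    neighbors = yield from count_neighbors(y, x)
    yield Transition(y, x, game_logic(state, neighbors))

def simulate(height, width):
    while True:
        for y in range(height):
            for x in range(width):
                yield from step_cell(y, x)
        yield TICK

class Grid:
    def __init__(self, height, width):
        self.height, self.width = height, width
        self.rows = [[EMPTY] * width for _ in range(height)]

    def __str__(self):
        # Assumed rendering: one line of text per row.
        return '\n'.join(''.join(row) for row in self.rows)

    def query(self, y, x):
        return self.rows[y % self.height][x % self.width]

    def assign(self, y, x, state):
        self.rows[y % self.height][x % self.width] = state

def live_a_generation(grid, sim):
    progeny = Grid(grid.height, grid.width)
    item = next(sim)
    while item is not TICK:
        if isinstance(item, Query):
            item = sim.send(grid.query(item.y, item.x))
        else:  # Must be a Transition
            progeny.assign(item.y, item.x, item.state)
            item = next(sim)
    return progeny

grid = Grid(5, 9)
for y, x in [(0, 3), (1, 4), (2, 2), (2, 3), (2, 4)]:  # Glider
    grid.assign(y, x, ALIVE)

sim = simulate(grid.height, grid.width)
generations = [str(grid)]
for _ in range(4):
    grid = live_a_generation(grid, sim)
    generations.append(str(grid))
print('\n\n'.join(generations))
```

After four generations the glider has the same shape as at the start, shifted one row down and one column to the right.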

- The best part about this approach is that I can change the game\_logic function without having to update the code that surrounds it. I can change the rules or add larger spheres of influence with the existing mechanics of Query, Transition, and TICK. This demonstrates how coroutines enable the separation of concerns, which is an important design principle.

## Things to Remember

- Coroutines provide an efficient way to run tens of thousands of functions seemingly at the same time.
- Within a generator, the value of the yield expression will be whatever value was passed to the generator's send method from the exterior code.
- Coroutines give you a powerful tool for separating the core logic of your program from its interaction with the surrounding environment.

# I41 : Consider concurrent.futures for True Parallelism

- At some point in writing Python programs, you may hit the performance wall. Even after optimizing your code, your program's execution may still be too slow for your needs. On modern computers that have an increasing number of CPU cores, it's reasonable to assume that one solution would be parallelism. What if you could split your code's computation into independent pieces of work that run simultaneously across multiple CPU cores?

- Unfortunately, Python's global interpreter lock (GIL) prevents true parallelism in threads, so that option is out. Another common suggestion is to rewrite your most performance-critical code as an extension module using the C language. C gets you closer to the bare metal and can run faster than Python, eliminating the need for parallelism. C-extensions can also start native threads that run in parallel and utilize multiple CPU cores. Python's API for C-extensions is well documented and a good choice for an escape hatch.

- But rewriting your code in C has a high cost. Code that is short and understandable in Python can become verbose and complicated in C. Such a port requires extensive testing to ensure that the functionality is equivalent to the original Python code and that no bugs have been introduced. Sometimes it's worth it, which explains the large ecosystem of C-extension modules in the Python community that speed up things like text parsing, image compositing, and matrix math. There are even open source tools such as Cython and Numba that can ease the transition to C.

- The problem is that moving one piece of your program to C isn't sufficient most of the time. Optimized Python programs usually don't have one major source of slowness, but rather, there are often many significant contributors. To get the benefits of C's bare metal and threads, you'd need to port large parts of your program, drastically increasing testing needs and risk. There must be a better way to preserve your investment in Python to solve difficult computational problems.

- The multiprocessing built-in module, easily accessed via the concurrent.futures built-in module, may be exactly what you need. It enables Python to utilize multiple CPU cores in parallel by running additional interpreters as child processes. These child processes are separate from the main interpreter, so their global interpreter locks are also separate. Each child can fully utilize one CPU core. Each child has a link to the main process where it receives instructions to do computation and returns results.

- For example, say you want to do something computationally intensive with Python and utilize multiple CPU cores. I'll use an implementation of finding the greatest common divisor of two numbers as a proxy for a more computationally intense algorithm, like simulating fluid dynamics with the Navier-Stokes equation.

In [37]:
def gcd(pair):
    a, b = pair
    low = min(a, b)
    for i in range(low, 0, -1):
        if a % i == 0 and b % i == 0:
            return i

In [40]:
from time import time
numbers = [(1963309, 2265973), (2030677, 3814172), 
           (1551645, 2229620), (2039045, 2020802)]
start = time()
results = list(map(gcd, numbers))
end = time()
print('Took %.3f seconds' % (end - start))

Took 0.465 seconds


- Running this code on multiple Python threads will yield no speed improvement because the GIL prevents Python from using multiple CPU cores in parallel. Here, I do the same computation as above using the concurrent.futures module with its ThreadPoolExecutor class and two worker threads:

In [45]:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
start = time()
pool = ThreadPoolExecutor(max_workers=2)
results = list(pool.map(gcd, numbers))
end = time()
print('Took %.3f seconds' % (end - start))

Took 0.461 seconds


- The time is essentially unchanged in this run, and it can even be slower because of the overhead of starting and communicating with the pool of threads.

- Now for the surprising part: By changing a single line of code, something magical happens. If I replace the ThreadPoolExecutor with the ProcessPoolExecutor from the concurrent.futures module, everything speeds up.

In [46]:
start = time()
pool = ProcessPoolExecutor(max_workers=2)
results = list(pool.map(gcd, numbers))
end = time()
print('Took %.3f seconds' % (end - start))

Took 0.386 seconds


- Running on my dual-core machine, it's significantly faster! How is this possible? Here's what the ProcessPoolExecutor class actually does:

1. It takes each item from the numbers input data to map.
2. It serializes it into binary data using the pickle module.
3. It copies the serialized data from the main interpreter process to a child interpreter process over a local socket.
4. Next, it deserializes the data back into Python objects using pickle in the child process.
5. It then imports the Python module containing the gcd function.
6. It runs the function on the input data in parallel with other child processes.
7. It serializes the result back into bytes.
8. It copies those bytes back through the socket.
9. It deserializes the bytes back into Python objects in the parent process.
10. Finally, it merges the results from multiple children into a single list to return.
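The serialization steps in this list can be imitated inside a single process with the pickle module. This is only a sketch of the data round trip: no child process or socket is involved, the small input pair is chosen so the example runs instantly, and the variable names are hypothetical.

```python
import pickle

def gcd(pair):
    a, b = pair
    low = min(a, b)
    for i in range(low, 0, -1):
        if a % i == 0 and b % i == 0:
            return i

pair = (12, 18)

request_bytes = pickle.dumps(pair)           # Step 2: serialize the input item
received_pair = pickle.loads(request_bytes)  # Step 4: deserialize (done in the child)
result = gcd(received_pair)                  # Step 6: run the function
response_bytes = pickle.dumps(result)        # Step 7: serialize the result
final = pickle.loads(response_bytes)         # Step 9: deserialize (done in the parent)

print(len(request_bytes), len(response_bytes), final)
```

Only a few bytes travel in each direction here, while the amount of computation inside gcd grows with the size of the numbers.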

- Although it looks simple to the programmer, the multiprocessing module and ProcessPoolExecutor class do a huge amount of work to make parallelism possible. In most other languages, the only touch point you need to coordinate two threads is a single lock or atomic operation. The overhead of using multiprocessing is high because of all of the serialization and deserialization that must happen between the parent and child processes.

- This scheme is well suited to certain types of isolated, high-leverage tasks. By isolated, I mean functions that don't need to share state with other parts of the program. By high-leverage, I mean situations in which only a small amount of data must be transferred between the parent and child processes to enable a large amount of computation. The greatest common divisor algorithm is one example of this, but many other mathematical algorithms work similarly.

- If your computation doesn't have these characteristics, then the overhead of multiprocessing may prevent it from speeding up your program through parallelization. When that happens, multiprocessing provides more advanced facilities for shared memory, cross-process locks, queues, and proxies. But all of these features are very complex. It's hard enough to reason about such tools in the memory space of a single process shared between Python threads. Extending that complexity to other processes and involving sockets makes this much more difficult to understand.


- I suggest avoiding all parts of multiprocessing and using these features via the simpler concurrent.futures module. You can start by using the ThreadPoolExecutor class to run isolated, high-leverage functions in threads. Later, you can move to the ProcessPoolExecutor to get a speedup. Finally, once you've completely exhausted the other options, you can consider using the multiprocessing module directly.
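One way to follow this progression is to keep the executor class as a parameter, since ThreadPoolExecutor and ProcessPoolExecutor share the same Executor interface. The sketch below is an illustration under that assumption: `run_with` is a hypothetical helper, the inputs are small so it finishes instantly, and only the thread-based variant is actually executed here.

```python
from concurrent.futures import ThreadPoolExecutor

def gcd(pair):
    a, b = pair
    low = min(a, b)
    for i in range(low, 0, -1):
        if a % i == 0 and b % i == 0:
            return i

def run_with(executor_class, pairs, max_workers=2):
    # Hypothetical helper: accepts any Executor subclass from concurrent.futures.
    # The with block shuts the pool down once all results are collected.
    with executor_class(max_workers=max_workers) as pool:
        return list(pool.map(gcd, pairs))

pairs = [(12, 18), (100, 75), (17, 5)]
results = run_with(ThreadPoolExecutor, pairs)
print(results)
```

Passing ProcessPoolExecutor instead is the one-line change described above. In a script, that call should sit under an `if __name__ == '__main__':` guard, and gcd must be defined at module top level so the child processes can import it.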

## Things to Remember
- Moving CPU bottlenecks to C-extension modules can be an effective way to improve performance while maximizing your investment in Python code. However, the cost of doing so is high and may introduce bugs.
- The multiprocessing module provides powerful tools that can parallelize certain types of Python computation with minimal effort.
- The power of multiprocessing is best accessed through the concurrent.futures built-in module and its simple ProcessPoolExecutor class.
- The advanced parts of the multiprocessing module should be avoided because they are so complex.