## Introduction to Python Concurrency
- Concurrency is about **dealing** with lots of things at once.
- Parallelism is about **doing** lots of things at once.

We will study simple examples to introduce and compare Python’s <span style="color:skyblue">*core packages for concurrent programming: `threading`, `multiprocessing`, and `asyncio`*</span>, which illustrate <span style="color:lightgreen">*Python’s three approaches to concurrency: threads, processes, and native coroutines*</span>.

We will also give high-level overviews of third-party <span style="color:skyblue">*tools, libraries, application servers, and distributed task queues, all of which can enhance the performance and scalability of Python applications*</span>.

## The Big Picture

Why is concurrent programming hard?
- <span style="color:orange">*Starting threads or processes is easy enough, but how do you keep track of them?*</span>. When we start a thread or process, we don’t automatically know when it’s done, and getting back results or errors requires setting up some communication channel, such as a <span style="color:lightgreen">*message queue*</span>
- <span style="color:orange">*Starting a thread or a process is not cheap*</span>, so we don't want to start one just to perform a single computation and quit. Often we want to amortize the startup cost by <span style="color:lightgreen">*making each thread or process into a “worker” that enters a loop and stands by for inputs to work on, instead of starting and shutting down a new one for every job*</span>.
- <span style="color:orange">*How do you make a worker quit when you don't need it anymore, without interrupting a job partway and leaving half-processed data / unreleased resources (e.g. open files)?*</span> -> <span style="color:lightgreen">*message queue*</span>
- <span style="color:skyblue">*A coroutine is cheap to start. If you start a coroutine using the `await` keyword, it’s easy to get a value returned by it, it can be safely cancelled, and you have a clear site to catch exceptions*</span>. <span style="color:orange">*But coroutines are often started by the asynchronous framework, and that can make them as hard to monitor as threads or processes. Finally, Python coroutines and threads are not suitable for CPU-intensive tasks, as we’ll see.*</span>
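The “worker” and “message queue” ideas above can be sketched with threads. This is a minimal illustration under assumptions, not the chapter’s code. Each worker loops on a `queue.SimpleQueue` and quits only when it receives a sentinel, so it never abandons a job partway. The names `worker`, `run_jobs`, and `SENTINEL`, and the squaring “work”, are all hypothetical.

```python
import queue
import threading

SENTINEL = None  # hypothetical "poison pill": tells a worker to quit

def worker(jobs: queue.SimpleQueue, results: queue.SimpleQueue) -> None:
    # Stand by for inputs instead of starting a new thread per job
    while (item := jobs.get()) is not SENTINEL:
        results.put(item * item)  # the "work": square the number
    # The worker checks for the sentinel only between jobs, so no job is left half-processed

def run_jobs(numbers: list[int], num_workers: int = 3) -> list[int]:
    jobs: queue.SimpleQueue = queue.SimpleQueue()
    results: queue.SimpleQueue = queue.SimpleQueue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for n in numbers:
        jobs.put(n)
    for _ in threads:
        jobs.put(SENTINEL)  # one sentinel per worker
    for t in threads:
        t.join()  # every worker has drained its share of jobs and returned
    return sorted(results.get() for _ in range(len(numbers)))

if __name__ == '__main__':
    print(run_jobs([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

The same sentinel pattern reappears with processes in the homegrown process pool later in these notes.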

## Concurrency Terms

- <span style="color:skyblue">**Concurrency**</span>: <span style="color:skyblue">*The ability to handle multiple pending tasks, making progress one at a time or in parallel (if possible) so that each of them eventually succeeds or fails*</span>. A single-core CPU is capable of concurrency by interleaving the execution of pending tasks.
- <span style="color:skyblue">**Parallelism**</span>: <span style="color:skyblue">*The ability to execute multiple computations at the same time*</span>. This requires a multicore CPU, multiple CPUs, a GPU, or multiple computers in a cluster.
- <span style="color:skyblue">**Execution unit**</span>: <span style="color:skyblue">*Objects that execute code concurrently, each with independent state and call stack*</span>. *Python natively supports three kinds of execution units: processes, threads, and coroutines*.
- <span style="color:skyblue">**Process**</span>: <span style="color:skyblue">*An instance of a computer program while it is running, using memory and a slice of the CPU time*</span>. Processes communicate via pipes, sockets, or memory-mapped files, all of which can only carry raw bytes. <span style="color:orange">*Python objects must be serialized (converted) into raw bytes to pass from one process to another*</span>. Processes allow *preemptive multitasking*: the OS scheduler preempts (i.e., suspends) each running process periodically to allow other processes to run.
- <span style="color:skyblue">**Thread**</span>: <span style="color:skyblue">*An execution unit within a single process. When a process starts, it uses a single thread: the main thread*</span>. A process can create more threads to operate concurrently by calling operating system APIs. Threads within a process share the same memory space, which holds live Python objects -> easy data sharing, but can lead to corrupted data. Like processes, threads also enable *preemptive multitasking* under the supervision of the OS scheduler.
- <span style="color:skyblue">**Coroutine**</span>: <span style="color:skyblue">*A coroutine is a function that can suspend itself and resume later*</span>. In Python, classic coroutines are built from generator functions, and native coroutines are defined with `async def`. Python coroutines usually run within a single thread under the supervision of an event loop, also in the same thread. Coroutines support *cooperative multitasking*: each coroutine must explicitly cede control with the yield or await keyword, so that another may proceed concurrently (but not in parallel).
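- <span style="color:skyblue">**Queue**</span>: <span style="color:skyblue">*A data structure that lets us put and get items, usually in FIFO order: first in, first out*</span>. The `queue` module in Python’s standard library provides queue classes to support threads (the FIFO `Queue`, plus `LifoQueue`, `PriorityQueue`, and `SimpleQueue`), while the `multiprocessing` and `asyncio` packages implement their own queue classes.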
- <span style="color:skyblue">**Lock**</span>: <span style="color:skyblue">*An object that execution units can use to synchronize their actions and avoid corrupting data*</span>. While updating a shared data structure, the running code should hold an associated lock. This signals other parts of the program to wait until the lock is released before accessing the same data structure.
- <span style="color:skyblue">**Contention**</span>: Resource contention happens when multiple execution units try to access a shared resource—such as a lock or storage. CPU contention happens when compute-intensive processes or threads must wait for the OS scheduler to give them a share of the CPU time.
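The **Lock** entry can be made concrete with a small sketch in which several threads update a shared counter. This is an illustration, not code from the chapter. `counter += 1` is a read-modify-write sequence, so each update holds a `threading.Lock`. The names `counter`, `increment`, and `run` are hypothetical.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def increment(times: int) -> None:
    global counter
    for _ in range(times):
        # `counter += 1` reads, adds, and writes back: hold the lock so
        # no other thread can interleave between those steps
        with counter_lock:
            counter += 1

def run(num_threads: int = 4, times: int = 10_000) -> int:
    global counter
    counter = 0
    threads = [threading.Thread(target=increment, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

if __name__ == '__main__':
    print(run())  # 40000
```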

## Processes, Threads, and Python’s Infamous GIL

1. *Each instance of the Python interpreter is a process*. You can start additional Python processes using the `multiprocessing` or `concurrent.futures` libraries.
2. *The Python interpreter uses a single thread to run the user’s program and the memory garbage collector*. You can start additional Python threads using the `threading` or `concurrent.futures` libraries.
3. *Access to object reference counts and other internal interpreter state is controlled by a lock, the Global Interpreter Lock (GIL). Only one Python thread can hold the GIL at any time. This means that only one thread can execute Python code at any time, regardless of the number of CPU cores.*
4. To prevent a Python thread from holding the GIL indefinitely, *Python’s bytecode interpreter pauses the current Python thread every 5ms by default, releasing the GIL*.
5. When we write Python code, *we have no control over the GIL*.
6. Every Python standard library function that makes a `syscall` releases the GIL. *syscall is a call from user code to a function of the operating system kernel*. This includes all functions that perform disk I/O, network I/O, and `time.sleep()`
7. Extensions that integrate at the Python/C API level can also launch other non-Python threads that are not affected by the GIL.
8. *The effect of the GIL on network programming with Python threads is relatively small, because the I/O functions release the GIL*, and reading or writing to the network always implies high latency compared to reading and writing to memory.
9. *Contention over the GIL slows down compute-intensive Python threads*. Sequential, single-threaded code is simpler and faster for such tasks.
10. *To run CPU-intensive Python code on multiple cores, you must use multiple Python processes*.
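The interval from point 4 is exposed by the standard `sys` module as the *switch interval*, in seconds. The short sketch below reads it, changes it, and restores it. Changing it only alters how often the running thread is asked to release the GIL. It does not remove the GIL.

```python
import sys

# Seconds between requests for the running thread to release the GIL
interval = sys.getswitchinterval()
print(f'switch interval: {interval * 1000:.1f} ms')  # 5.0 ms unless changed

# It can be tuned for experiments; this does not let threads run Python code in parallel
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())  # 0.01
sys.setswitchinterval(interval)  # restore the previous value
```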

## A Concurrent Python Hello World

***Idea***: We will make 2 functions, `spin` and `slow` that run concurrently. *The main thread — the only thread when the program starts — will start a new thread to run `spin` and then call `slow`*.  
`slow` simulates a function that blocks for 3 seconds, while `spin` animates characters in the terminal (cycling through `\|/-`) to let the user know that the program is “thinking” and not stalled.

### Spinner with Threads

We use `threading.Event` to coordinate threads. It has an internal `flag` that starts out as `False`.
- Calling `Event.set()` sets the flag to `True`.
- While the `flag` is false, if a thread calls `Event.wait()`, it is blocked until another thread calls `Event.set()`, at which time `Event.wait()` returns `True`.
- If a timeout in seconds is given to `Event.wait(s)`, this call returns `False` when the timeout elapses, or returns `True` as soon as `Event.set()` is called by another thread
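The three rules above can be checked with a short standalone sketch. It is separate from the spinner, and the helper `setter` is hypothetical. `wait` with a timeout returns `False` while nobody sets the flag, and returns `True` as soon as another thread calls `set()`.

```python
import threading
import time

done = threading.Event()
print(done.is_set())  # False: the internal flag starts out False

# No other thread sets the flag, so wait() times out and returns False
print(done.wait(0.05))  # False

def setter() -> None:
    time.sleep(0.05)
    done.set()  # flips the flag to True and wakes up any waiting threads

threading.Thread(target=setter).start()
# Returns True as soon as setter() calls set(), well before the 1s timeout
print(done.wait(1.0))  # True
```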

In [20]:
import itertools
import time
from threading import Thread, Event

def spin(msg: str, done: Event) -> None:  
    """
    This function will run in a separate thread. 

    Args:
        msg (str): the message to be printed
        done (threading.Event): a simple object to synchronize threads
    """
    for char in itertools.cycle(r'\|/-'):  # infinite loop: itertools.cycle yields one character at a time, cycling through the string forever
        status = f'\r{char} {msg} '  # \r moves the cursor back to the start of the line
        print(status, end='', flush=True)
        if done.wait(.1):  # wait up to 0.1s; returns True once the main thread sets the event
            break  # exit the loop

    blanks = ' ' * len(status)  # spaces as wide as the status line
    print(f'\r{blanks}\r', end='')  # overwrite the status line with blanks to clear it


def slow(num_sec: int = 3) -> int:
    for i in range(num_sec):
        time.sleep(1)
        print(f"also doing work in the main thread for {i+1} (sec)")
    return 42

In [21]:
def supervisor() -> int:  # supervisor will return the result of slow
    done = Event()  # the key object to coordinate the activities of the main thread and the spinner thread
    spinner = Thread(target=spin, args=('spinning in the secondary thread!', done))  # the spinner thread
    print(f'spinner object: {spinner}')  # outputs something like <Thread(Thread-1 (spin), initial)>
    spinner.start()  # Start the spinner thread
    result = slow()  # Call slow, which runs in and blocks the main thread
                     # Meanwhile, the secondary thread is running the spinner animation
    done.set()  # Set the Event flag to True to terminate the for loop inside the spin function
    spinner.join()  # Wait until the spinner thread finishes
    return result

def main() -> None:
    result = supervisor()
    print(f'Answer: {result}')

main()

spinner object: <Thread(Thread-5 (spin), initial)>
| spinning in the secondary thread! also doing work in the main thread for 1 (sec)
- spinning in the secondary thread! also doing work in the main thread for 2 (sec)
| spinning in the secondary thread! also doing work in the main thread for 3 (sec)
Answer: 42                           


### Spinner with Processes
<span style="color:skyblue">*The `multiprocessing` package supports running concurrent tasks in separate Python processes instead of threads. When you create a `multiprocessing.Process` instance, a whole new Python interpreter is started as a child process in the background*</span>. Since each Python process has its own GIL, this allows your program to use all available CPU cores and run truly in parallel.

In [16]:
import itertools
import time
from multiprocessing import Process, Event
from multiprocessing import synchronize

def spin(msg: str, done: synchronize.Event) -> None:  # compared to the threading version, only the type hint of `done` changes
    """
    This function will run in a child process.

    Args:
        msg (str): the message to be printed
        done (multiprocessing.synchronize.Event): a simple object to synchronize processes
    """
    for char in itertools.cycle(r'\|/-'):  # infinite loop: itertools.cycle yields one character at a time, cycling through the string forever
        status = f'\r{char} {msg} '  # \r moves the cursor back to the start of the line
        print(status, end='', flush=True)
        if done.wait(.1):  # wait up to 0.1s; returns True once the parent process sets the event
            break  # exit the loop

    blanks = ' ' * len(status)  # spaces as wide as the status line
    print(f'\r{blanks}\r', end='')  # overwrite the status line with blanks to clear it


def slow(num_sec: int = 3) -> int:
    for i in range(num_sec):
        time.sleep(1)
        print(f"-- also doing work in the main thread of the main process for {i+1} (sec)")
    return 42

In [17]:
def supervisor() -> int:
    done = Event()
    spinner = Process(target=spin, args=('spinning in a child process!', done))  # Basic usage of the `Process` class is similar to `Thread`
    print(f'spinner object: {spinner}')  # should output something like <Process name='Process-1' parent=237132 initial>
    spinner.start()
    result = slow()
    done.set()
    spinner.join()
    return result

def main() -> None:
    result = supervisor()
    print(f'Answer: {result}')

main()

spinner object: <Process name='Process-2' parent=67180 initial>
- spinning in a child process! -- also doing work in the main thread of the main process for 1 (sec)
| spinning in a child process! -- also doing work in the main thread of the main process for 2 (sec)
-- also doing work in the main thread of the main process for 3 (sec)
Answer: 42


The basic APIs of `threading` and `multiprocessing` are similar, but their implementations are very different, and `multiprocessing` has a much larger API to handle the added complexity of multiprocess programming. For example, one challenge when converting from threads to processes is communication. Processes are isolated by the operating system and can’t share Python objects, so objects must be serialized on one side and deserialized on the other.
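`multiprocessing` pickles objects it sends through its queues and pipes, so Python’s standard `pickle` module shows what “serialize and deserialize” means in practice. The sketch below is illustrative only, and the `payload` dict is a hypothetical message. The receiving side gets an equal but distinct copy, and some objects, such as lambdas, cannot be pickled at all.

```python
import pickle

payload = {'n': 7, 'prime': True, 'factors': [7]}  # hypothetical message between processes
data = pickle.dumps(payload)    # Python object -> raw bytes
restored = pickle.loads(data)   # raw bytes -> an equal, but distinct, object
print(type(data).__name__, restored == payload, restored is payload)  # bytes True False

# Not everything can be serialized: functions are pickled by name, so a lambda fails
lambda_picklable = True
try:
    pickle.dumps(lambda x: x)
except (pickle.PicklingError, AttributeError) as exc:
    lambda_picklable = False
    print('cannot pickle a lambda:', type(exc).__name__)
```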

### Spinner with Coroutines

<span style="color:skyblue">*It is the job of OS schedulers to allocate CPU time to drive threads and processes. In contrast, coroutines are driven by an application-level event loop*</span> that manages a queue of pending coroutines, drives them one by one, monitors events triggered by I/O operations initiated by coroutines, and passes control back to the corresponding coroutine when each event happens. <span style="color:skyblue">*The event loop and the library coroutines and the user coroutines all execute in a **single thread***</span>. Therefore, any time spent in a coroutine slows down the event loop—and all other coroutines.

The coroutine version lives in a separate script, `spinner_async.py`, listed below and run from the notebook with `!python`. It can't run directly in a notebook cell because Jupyter already has a running event loop, and `asyncio.run()` cannot be called from a running event loop.

```python
import asyncio
import itertools
import time

async def spin(msg: str) -> None:  # We don’t need the Event argument that was used to signal that
                                   # `slow` had completed its job like in thread or process versions
    for char in itertools.cycle(r'\|/-'):
        status = f'\r{char} {msg}'
        print(status, flush=True, end='')
        try:
            await asyncio.sleep(.1)  # Use `await asyncio.sleep(.1)` instead of `time.sleep(.1)`, to pause without blocking other coroutines
        except asyncio.CancelledError:  # `asyncio.CancelledError` is raised when the cancel method is called on the `Task` controlling this coroutine
            break
    blanks = ' ' * len(status)
    print(f'\r{blanks}\r', end='')

async def slow() -> int:
    await asyncio.sleep(3)  # also uses `await asyncio.sleep` instead of `time.sleep` to pause without blocking other coroutines
    # time.sleep(3)  # experiment: with time.sleep, we will not see the spin at all
    return 42

async def supervisor() -> int:  # Native coroutines are defined with async def
    spinner = asyncio.create_task(spin('thinking!'))  # `asyncio.create_task` schedules the eventual execution
                                                      # of `spin`, immediately returning an instance of `asyncio.Task`
    print(f'spinner object: {spinner}')  # should print an `asyncio.Task` object
    result = await slow()  # The `await` keyword calls `slow`, blocking supervisor until `slow` returns
    spinner.cancel()  # The `Task.cancel` method raises a `CancelledError` exception inside the `spin` coroutine
    return result

def main() -> None:  # a regular function: `asyncio.run` below starts the event loop
    result = asyncio.run(supervisor())
    # The `asyncio.run` function starts the event loop to drive 
    # the coroutine that will eventually set the other coroutines 
    # in motion. The main function will stay blocked until supervisor returns.
    # `supervisor`'s return value will be the return value of `asyncio.run`.
    # `asyncio.run` raises RuntimeError if another event loop (e.g. Jupyter's)
    # is already running in this thread, which is why this runs as a script.
    print(f'-- Answer: {result}')

if __name__ == '__main__':
    main()
```

In [5]:
!python ./spinner_async.py

spinner object: <Task pending name='Task-2' coro=<spin() running at /home/dk/Desktop/projects/programming/python/fluent-python/part04-control-flow/19-python-concurrency-models/./spinner_async.py:5>>
-- Answer: 42


Above we see the 3 main ways of running a coroutine:
- `asyncio.run(coro())`: Called from a regular function to drive a coroutine object that usually is the entry point for all the asynchronous code in the program, like the `supervisor` in this example. This call blocks until the body of `coro` returns.
- `asyncio.create_task(coro())`: Called from a coroutine to schedule another coroutine to execute eventually. This call does not suspend the current coroutine.
- `await coro()`: Called from a coroutine to transfer control to the coroutine object returned by `coro()`.
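The three ways of running a coroutine listed above can be seen together in one minimal sketch. The coroutines `child` and `entry_point` and the `log` list are hypothetical. `create_task` only schedules the task, `await` runs the awaited coroutine until its first suspension, and `asyncio.run` drives everything from regular code.

```python
import asyncio

async def child(label: str, log: list[str]) -> str:
    log.append(f'{label} started')
    await asyncio.sleep(0.01)  # suspend, letting the event loop run other tasks
    log.append(f'{label} finished')
    return label.upper()

async def entry_point() -> tuple[str, str, list[str]]:
    log: list[str] = []
    # create_task: schedule `child` to run eventually; does not suspend entry_point
    task = asyncio.create_task(child('task', log))
    log.append('task scheduled')
    # await: transfer control to the coroutine object until it returns
    awaited = await child('awaited', log)
    task_result = await task  # collect the result of the scheduled task
    return awaited, task_result, log

# asyncio.run: called from regular code; drives the entry-point coroutine
# and blocks until it returns
awaited, task_result, log = asyncio.run(entry_point())
print(awaited, task_result)  # AWAITED TASK
print(log[:3])  # ['task scheduled', 'awaited started', 'task started']
```

Note that `'task started'` appears only after `child('awaited', …)` reaches its first `await`. A scheduled task runs only when the current coroutine suspends.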

The execution flow of `spinner_async.py` is as follows:

1. When `main()` is called, it calls `asyncio.run(supervisor())`
2. `supervisor()` starts by creating a task for the `spin` coroutine and `awaits` the result of the `slow` coroutine.
3. While `slow` is waiting (using `await asyncio.sleep(3)`), `spin` continues to run concurrently, printing the spinning animation.
4. After 3 seconds, `slow` completes, and its `result` (`42`) is returned. Then, `spinner.cancel()` is called to cancel the spinning animation.
5. Finally, the result is printed in `main()`

<span style="color:lightgreen">*By default, `asyncio` has a single flow of execution, so only one coroutine can run at any given moment. When `asyncio.sleep()` is used instead of `time.sleep()`, the event loop can switch between tasks, allowing both coroutines (`spin` and `slow`) to make progress concurrently without blocking each other. If we use `time.sleep`, it blocks the event loop in the main thread, so the loop never gets to switch to the `spin` coroutine, and we do not see the spinner at all*</span>
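The effect described above can be measured without a terminal animation. In this sketch, a hypothetical `ticker` coroutine stands in for `spin` and counts how often it gets to run. The `supervisor` here is a simplified stand-in that waits for 0.2s with either `time.sleep` or `await asyncio.sleep`.

```python
import asyncio
import time

async def ticker(ticks: list[int]) -> None:
    # Stand-in for `spin`: records each time it gets to run
    while True:
        ticks.append(1)
        await asyncio.sleep(0.01)

async def supervisor(blocking: bool) -> int:
    ticks: list[int] = []
    task = asyncio.create_task(ticker(ticks))
    if blocking:
        time.sleep(0.2)           # blocks the whole event loop: ticker never starts
    else:
        await asyncio.sleep(0.2)  # yields to the event loop: ticker keeps running
    task.cancel()
    return len(ticks)

print('asyncio.sleep ticks:', asyncio.run(supervisor(blocking=False)))  # typically around 20
print('time.sleep ticks:   ', asyncio.run(supervisor(blocking=True)))   # 0
```

With `time.sleep`, the task is cancelled before the event loop ever gives it a turn, so its body never executes.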

### Greenlet and Gevent
- **Greenlet**: The `greenlet` package supports cooperative multitasking through lightweight coroutines, named greenlets, that don’t require any special syntax such as `yield` or `await`, and are therefore easier to integrate into existing, sequential codebases.
- **Gevent**: The `gevent` networking library monkey patches Python’s standard socket module making it nonblocking by replacing some of its code with greenlets.

### Threaded vs Asyncio `supervisor`

#### Threaded Version
```python
def supervisor() -> int:
    done = Event()
    spinner = Thread(target=spin,
                     args=('thinking!', done))
    print('spinner object:', spinner)
    spinner.start()
    result = slow()
    done.set()
    spinner.join()
    return result
```

#### Async Version
```python
async def supervisor() -> int:
    spinner = asyncio.create_task(spin('thinking!'))
    print('spinner object:', spinner)
    result = await slow()
    spinner.cancel()
    return result
```

- An `asyncio.Task` is roughly the equivalent of a `threading.Thread`
- A `Task` drives a coroutine object, and a `Thread` invokes a callable.
- A coroutine `yields` control explicitly with the `await` keyword.
- You don’t instantiate `Task` objects yourself, you get them by passing a coroutine to `asyncio.create_task(…)`.
- When `asyncio.create_task(…)` returns a `Task` object, it is already scheduled to run, but a `Thread` instance must be explicitly told to run by calling its `start` method
- In the threaded `supervisor`, `slow` is a plain function and is directly invoked by the main thread. In the asynchronous `supervisor`, `slow` is a coroutine driven by `await`.
- There’s no API to terminate a thread from the outside; instead, you must send a signal—like setting the `done` `Event` object. For tasks, there is the `Task.cancel()` instance method, which raises `CancelledError` at the `await` expression where the coroutine body is currently suspended.
- The `supervisor` coroutine must be started with `asyncio.run` in the `main` function.


One final point related to threads versus coroutines: 
- If you’ve done any nontrivial programming with threads, you know how challenging it is to reason about the program because the scheduler can interrupt a thread at any time. You <span style="color:skyblue">*must remember to hold locks to protect the critical sections of your program, to avoid getting interrupted in the middle of a multistep operation—which could leave data in an invalid state*</span>.
- With coroutines, your code is protected against interruption by default. You must explicitly `await` to let the rest of the program run. <span style="color:skyblue">*Instead of holding locks to synchronize the operations of multiple threads, coroutines are “synchronized” by definition: only one of them is running at any time. When you want to give up control, you use `await` to yield control back to the scheduler*</span>. That’s why it is possible to safely cancel a coroutine: by definition, a coroutine can only be cancelled when it’s suspended at an `await` expression, so you can perform cleanup by handling the `CancelledError` exception.
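Cancellation with cleanup can be sketched as follows. This is a hypothetical variant of `spin`, not the chapter’s code. Cleanup runs in the `except asyncio.CancelledError` block, and the exception is then re-raised, as the asyncio documentation recommends, so the task ends up marked as cancelled. The original `spin` uses `break` instead, which also stops the loop but does not mark its task as cancelled.

```python
import asyncio

async def spin_with_cleanup(events: list[str]) -> None:
    try:
        while True:
            await asyncio.sleep(0.01)  # the only points where cancellation can land
    except asyncio.CancelledError:
        events.append('cleanup done')  # e.g. erase the spinner line, close files
        raise  # re-raise so the Task is marked as cancelled

async def supervisor() -> tuple[bool, list[str]]:
    events: list[str] = []
    task = asyncio.create_task(spin_with_cleanup(events))
    await asyncio.sleep(0.05)
    task.cancel()  # CancelledError is raised at the await where the task is suspended
    try:
        await task  # wait for the coroutine to finish its cleanup
    except asyncio.CancelledError:
        pass
    return task.cancelled(), events

print(asyncio.run(supervisor()))  # (True, ['cleanup done'])
```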

## The Real Impact of the GIL

**For IO Intensive Code:**
- We can replace `time.sleep` in the threaded `slow` function with an HTTP client request since <span style="color:skyblue">*a well-designed network library will release the GIL while waiting for the network*</span>. 
- You can also replace the `await asyncio.sleep(3)` expression in the `slow` coroutine with an `await` for a response from <span style="color:skyblue">*a well-designed asynchronous network library, because such libraries provide coroutines that yield control back to the event loop while waiting for the network*</span>. Meanwhile, the spinner will keep spinning.
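A rough way to see that blocking I/O waits do not serialize threads is to time several threads that each block for a while. In this sketch, `time.sleep` stands in for a network call (an assumption: both block in a syscall that releases the GIL). The helpers `fake_io` and `run_threads` are hypothetical.

```python
import threading
import time

def fake_io(delay: float) -> None:
    # time.sleep stands in for waiting on the network: the GIL is
    # released while the thread is blocked in the syscall
    time.sleep(delay)

def run_threads(n: int = 5, delay: float = 0.2) -> float:
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

elapsed = run_threads()
print(f'5 threads waiting 0.2s each took {elapsed:.2f}s')  # close to 0.2s, not 1.0s
```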

**With CPU-intensive code, the story is different.** Consider a CPU-intensive function such as `is_prime` below:

In [7]:
import math

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    root = math.isqrt(n)
    for i in range(3, root + 1, 2):
        if n % i == 0:
            return False
    return True

is_prime(5_000_111_000_222_021)

True

Now suppose we replace `time.sleep(3)` (or `await asyncio.sleep(3)`) in `slow` with a call to `is_prime(5_000_111_000_222_021)`. What happens?
- **For the spinner with processes version**: Nothing changes. The spinner is controlled by a child process, so it keeps spinning while the primality test is computed by the parent process.
- **For the threading spinner version**: The spinner is controlled by a secondary thread, so it continues spinning while the primality test is computed by the main thread. In this particular example, the spinner keeps spinning because Python suspends the running thread every 5ms (by default), making the GIL available to other pending threads. Therefore, <span style="color:skyblue">*the main thread running `is_prime` is interrupted every 5ms, allowing the secondary thread to wake up and iterate once through the for loop, until it calls the `wait` method of the `done` event, at which time it will release the GIL. The main thread will then grab the GIL, and the `is_prime` computation will proceed for another 5ms.
This does not have a visible impact on the running time of this specific example,
because the `spin` function quickly iterates once and releases the GIL as it waits for the `done` event, so there is not much contention for the GIL. The main thread running `is_prime` will have the GIL most of the time.*</span>
- **For the asyncio spinner version**: If you call `is_prime` in the `slow` coroutine of the `spinner_async.py` example, the spinner will never appear (similar to when we replaced `await asyncio.sleep(3)` with `time.sleep(3)`).   
<span style="color:lightgreen">*One way to keep the spinner alive is to rewrite `is_prime` as a coroutine, and periodically call `asyncio.sleep(0)` in an `await` expression to yield control back to the event loop so it can run the spinner. However, this will slow down the event loop and hence the whole program with it. Using `await asyncio.sleep(0)` should be considered a stopgap measure before you refactor your asynchronous code to delegate CPU-intensive computations to another process*</span>
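The `asyncio.sleep(0)` stopgap described above might look like this. It is a sketch under assumptions, not the chapter’s code. `is_prime_async` and its `yield_every` parameter are hypothetical, as are `count_ticks` (a stand-in for `spin`) and `check_with_spinner`. The smaller prime `1_000_000_007` keeps the demo fast.

```python
import asyncio
import math

async def is_prime_async(n: int, yield_every: int = 50_000) -> bool:
    # Same trial division as is_prime, but it awaits asyncio.sleep(0)
    # every `yield_every` iterations (a hypothetical tuning knob)
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    root = math.isqrt(n)
    for count, i in enumerate(range(3, root + 1, 2)):
        if n % i == 0:
            return False
        if count % yield_every == 0:
            await asyncio.sleep(0)  # hand control back to the event loop for one round
    return True

async def count_ticks(ticks: list[int]) -> None:
    # Stand-in for `spin`: records each frame it gets to draw
    while True:
        ticks.append(1)
        await asyncio.sleep(.1)

async def check_with_spinner(n: int) -> tuple[bool, int]:
    ticks: list[int] = []
    spinner = asyncio.create_task(count_ticks(ticks))
    prime = await is_prime_async(n)
    spinner.cancel()
    return prime, len(ticks)

prime, frames = asyncio.run(check_with_spinner(1_000_000_007))
print(f'prime={prime}, spinner frames drawn={frames}')
```

Every `await asyncio.sleep(0)` costs a trip through the event loop, so the right value of `yield_every` is a trade-off between spinner smoothness and total running time. Moving the computation to another process avoids that trade-off entirely.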

In [12]:
!python ./spinner_async.py

spinner object: <Task pending name='Task-2' coro=<spin() running at /home/dk/Desktop/projects/programming/python/fluent-python/part04-control-flow/19-python-concurrency-models/./spinner_async.py:6>>
\ spinning!

\ spinning!
5000111000222021 is a prime
-- Answer: 42


## A Homegrown Process Pool
<span style="color:lightgreen">*This chapter shows the use of multiple processes for CPU-intensive tasks, and the common pattern of using queues to distribute tasks and collect results*</span>. Chapter 20 will show a simpler way of distributing tasks to processes: a `ProcessPoolExecutor` from the `concurrent.futures` package, which uses queues internally.

Example program: check the primality of a sample of 20 integers from 2 to 10^16 - 1.

### Sequential Solution

In [23]:
from time import perf_counter
from typing import NamedTuple
import math

PRIME_FIXTURE = [
    (2, True),
    (142702110479723, True),
    (299593572317531, True),
    (3333333333333301, True),
    (3333333333333333, False),
    (3333335652092209, False),
    (4444444444444423, True),
    (4444444444444444, False),
    (4444444488888889, False),
    (5555553133149889, False),
    (5555555555555503, True),
    (5555555555555555, False),
    (6666666666666666, False),
    (6666666666666719, True),
    (6666667141414921, False),
    (7777777536340681, False),
    (7777777777777753, True),
    (7777777777777777, False),
    (9999999999999917, True),
    (9999999999999999, False),
]

NUMBERS = [n for n, _ in PRIME_FIXTURE]

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    root = math.isqrt(n)
    for i in range(3, root + 1, 2):
        if n % i == 0:
            return False
    return True

class Result(NamedTuple):
    prime: bool
    elapsed: float

def check(n: int) -> Result:
    """
    calls is_prime(n) and computes the elapsed time to return a Result.
    """
    t0 = perf_counter()
    prime = is_prime(n)
    return Result(prime, perf_counter() - t0)

def main() -> None:
    print(f'Checking {len(NUMBERS)} numbers sequentially:')
    t0 = perf_counter()
    for n in NUMBERS:
        prime, elapsed = check(n)
        label = 'P' if prime else ' '
        print(f'{n:16} {label} {elapsed:9.6f}s')
    elapsed = perf_counter() - t0
    print(f'Total time: {elapsed:.2f}s')

main()

Checking 20 numbers sequentially:
               2 P  0.000001s
 142702110479723 P  0.316831s
 299593572317531 P  0.574896s
3333333333333301 P  1.754009s
3333333333333333    0.000005s
3333335652092209    1.808153s
4444444444444423 P  2.064743s
4444444444444444    0.000001s
4444444488888889    1.878283s
5555553133149889    2.423506s
5555555555555503 P  2.550794s
5555555555555555    0.000007s
6666666666666666    0.000000s
6666666666666719 P  2.357989s
6666667141414921    2.567975s
7777777536340681    2.926401s
7777777777777753 P  2.740157s
7777777777777777    0.000006s
9999999999999917 P  2.949505s
9999999999999999    0.000005s
Total time: 26.91s


### Process-Based Solution

In [28]:
from time import perf_counter
from typing import NamedTuple
from multiprocessing import Process, SimpleQueue, cpu_count
from multiprocessing import queues

class PrimeResult(NamedTuple):
    """
    Includes the number checked for primality
    """
    n: int
    prime: bool
    elapsed: float

JobQueue = queues.SimpleQueue[int]  # type alias for a `SimpleQueue` that the `main` function
                                    # will use to send numbers to the processes that will do the work
ResultQueue = queues.SimpleQueue[PrimeResult]  # Type alias for a second `SimpleQueue` 
                                               # that will collect the results in `main`.

def check(n: int) -> PrimeResult:
    """
    Check prime and return PrimeResult for a number `n`
    """
    t0 = perf_counter()
    res = is_prime(n)
    return PrimeResult(n, res, perf_counter() - t0)

def worker(jobs: JobQueue, results: ResultQueue) -> None:
    """
    Get the number to be checked from the `jobs` queue, check it
    using the `check(n)` function, then put the PrimeResult in the
    `results` queue. Stops only when it gets the sentinel value
    (the number 0).
    """
    while n := jobs.get():  # if n == 0, then signal for the worker to finish
                            # otherwise loop indefinitely, taking items from
                            # the job queue and process each item with the actual
                            # function that does the work
        results.put(check(n))  # Invoke the primality check and enqueue PrimeResult
    results.put(PrimeResult(0, False, 0.0))  # send the number 0 to the main loop to signal that the worker is done

def start_jobs(
    num_processes: int, jobs: JobQueue, results: ResultQueue
) -> None:
    """
    First, put all the numbers to be checked in the `jobs` queue.
    Then launch `num_processes` child processes, each of which
    runs the worker function.
    """
    for n in NUMBERS:
        jobs.put(n)  # Enqueue the numbers to be checked in jobs
    for _ in range(num_processes):
        # Below, we fork a child process for each worker. Each child will run
        # the loop inside its own instance of the worker function, 
        # until it fetches a 0 from the jobs queue.
        process = Process(target=worker, args=(jobs, results))
        process.start()  # Start each child process
        jobs.put(0)  # Enqueue one 0 for each process, to terminate them

def main() -> None:
    procs = max(1, cpu_count() - 4)  # use (total cores - 4) processes, but always at least 1

    print(f'Checking {len(NUMBERS)} numbers with {procs} processes:')
    t0 = perf_counter()
    jobs: JobQueue = SimpleQueue()
    results: ResultQueue = SimpleQueue()
    start_jobs(procs, jobs, results)  # Start proc processes to consume jobs and post results
    checked = report(procs, results)  # Retrieve the results and display them
    elapsed = perf_counter() - t0
    print(f'{checked} checks in {elapsed:.2f}s')  # Display how many numbers were checked and the total elapsed time

def report(procs: int, results: ResultQueue) -> int:
    checked = 0
    procs_done = 0
    while procs_done < procs:  # Loop until all processes are done.
        n, prime, elapsed = results.get()  # Get one PrimeResult. Calling `.get()` on a queue blocks until there is an item in the queue
        if n == 0:  # If n is zero, then one process exited; increment the procs_done count
            procs_done += 1
        else:
            # Otherwise, increment the checked count (to keep track of the 
            # numbers checked) and display the results
            checked += 1
            label = 'P' if prime else ' '
            print(f'{n:16}  {label} {elapsed:9.6f}s')
    return checked

main()

Checking 20 numbers with 8 processes:
               2  P  0.000006s
3333333333333333     0.000014s
4444444444444444     0.000005s
 142702110479723  P  0.575875s
5555555555555555     0.000007s
6666666666666666     0.000001s
 299593572317531  P  0.727143s
3333333333333301  P  2.448605s
4444444444444423  P  2.678654s
5555555555555503  P  3.219040s
7777777777777777     0.000007s
3333335652092209     3.364439s
9999999999999999     0.000007s
4444444488888889     3.776165s
6666667141414921     3.272632s
5555553133149889     4.183812s
6666666666666719  P  3.970167s
7777777777777753  P  2.936584s
7777777536340681     3.261300s
9999999999999917  P  3.258965s
20 checks in 6.51s


### Thread-Based Nonsolution

The code below shows the thread-based version. It is very similar to the process-based solution; however, <span style="color:orange">*due to the GIL and the compute-intensive nature of `is_prime`, the threaded version is slower than the sequential version, and it gets slower as the number of threads increases, because of CPU contention and the cost of context switching. To switch to a new thread, the OS needs to save CPU registers and update the program counter and stack pointer, triggering expensive side effects like invalidating CPU caches and possibly even swapping memory pages*</span>.

import os
from queue import SimpleQueue
from time import perf_counter
from typing import NamedTuple
from threading import Thread

# `is_prime` and `NUMBERS` are assumed to be defined in an earlier cell,
# as in the process-based version.


class PrimeResult(NamedTuple):
    n: int
    prime: bool
    elapsed: float

JobQueue = SimpleQueue[int]  # queue of numbers to check
ResultQueue = SimpleQueue[PrimeResult]  # queue of results posted by the workers

def check(n: int) -> PrimeResult:  # time a single primality check
    t0 = perf_counter()
    res = is_prime(n)
    return PrimeResult(n, res, perf_counter() - t0)

def worker(jobs: JobQueue, results: ResultQueue) -> None:
    while n := jobs.get():  # keep taking jobs until the 0 sentinel arrives
        results.put(check(n))  # post the result of each check
    results.put(PrimeResult(0, False, 0.0))  # signal that this worker is done

def start_jobs(workers: int, jobs: JobQueue, results: ResultQueue) -> None:
    for n in NUMBERS:  # enqueue all the numbers to check
        jobs.put(n)
    for _ in range(workers):
        thread = Thread(target=worker, args=(jobs, results))  # a thread instead of a process
        thread.start()
        jobs.put(0)  # one 0 sentinel per worker, so each one eventually stops

def report(workers: int, results: ResultQueue) -> int:
    checked = 0
    workers_done = 0
    while workers_done < workers:
        n, prime, elapsed = results.get()
        if n == 0:
            workers_done += 1
        else:
            checked += 1
            label = 'P' if prime else ' '
            print(f'{n:16}  {label} {elapsed:9.6f}s')
    return checked

def main() -> None:
    workers = max(1, (os.cpu_count() or 1) - 4)  # os.cpu_count() may return None; keep at least 1 thread

    print(f'Checking {len(NUMBERS)} numbers with {workers} threads:')
    t0 = perf_counter()
    jobs: JobQueue = SimpleQueue()
    results: ResultQueue = SimpleQueue()
    start_jobs(workers, jobs, results)
    checked = report(workers, results)
    elapsed = perf_counter() - t0
    print(f'{checked} checks in {elapsed:.2f}s')

main()

Checking 20 numbers with 8 threads:
               2  P  0.000002s
3333333333333333     0.000006s
4444444444444444     0.000002s
 299593572317531  P  1.932373s
5555555555555555     0.000006s
6666666666666666     0.000000s
 142702110479723  P  2.136972s
3333333333333301  P 12.423121s
4444444444444423  P 12.386578s
5555555555555503  P 14.281875s
7777777777777777     0.000008s
3333335652092209    16.243054s
9999999999999999     0.000006s
6666666666666719  P 15.059455s
5555553133149889    16.386158s
6666667141414921    15.597601s
4444444488888889    18.118407s
7777777536340681    10.222358s
7777777777777753  P 10.575917s
9999999999999917  P  8.389822s
20 checks in 23.45s
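To isolate the effect described above from the prime-checking code, here is a minimal sketch that times the same CPU-bound work run sequentially and split across threads. `count_down`, `run_sequential`, and `run_threaded` are hypothetical names introduced for illustration, and `count_down` is only a stand-in for `is_prime`. On a standard CPython build with the GIL, expect the threaded time to be no better than the sequential one; the exact figures depend on the machine.

```python
import threading
from time import perf_counter


def count_down(n: int) -> int:
    """Hypothetical CPU-bound stand-in for is_prime.

    A pure-Python loop: it holds the GIL while running, and CPython only
    hands the GIL to another thread at its periodic switch interval.
    """
    total = 0
    while n > 0:
        total += n % 7
        n -= 1
    return total


def run_sequential(chunks: list[int]) -> tuple[list[int], float]:
    t0 = perf_counter()
    results = [count_down(n) for n in chunks]  # one chunk after another
    return results, perf_counter() - t0


def run_threaded(chunks: list[int]) -> tuple[list[int], float]:
    results = [0] * len(chunks)

    def task(i: int, n: int) -> None:
        results[i] = count_down(n)  # each thread writes only its own slot

    threads = [threading.Thread(target=task, args=(i, n))
               for i, n in enumerate(chunks)]
    t0 = perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread before stopping the clock
    return results, perf_counter() - t0


if __name__ == '__main__':
    chunks = [1_000_000] * 4
    seq_results, seq_time = run_sequential(chunks)
    thr_results, thr_time = run_threaded(chunks)
    assert seq_results == thr_results  # same work, same answers
    print(f'sequential: {seq_time:.2f}s  threaded: {thr_time:.2f}s')
```

Joining every thread before reading the clock matters: without `join()`, the timer would stop while the threads are still running, and the threaded version would look misleadingly fast.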


## Python in the Multicore World

Python’s story started in the early 1990s, when CPUs were still getting exponentially faster at sequential code execution. There was no talk about multicore CPUs except in supercomputers back then. At the time, the decision to have a GIL was a no-brainer. The GIL makes the interpreter faster when running on a single core, and its implementation simpler. The GIL also makes it easier to write simple extensions through the Python/C API.  
Despite the GIL, Python is thriving in applications that require concurrent or parallel execution, thanks to libraries and software architectures that work around the limitations of CPython.

- **System Administration**: Python is widely used to manage large fleets of servers, routers, load balancers, and network-attached storage (NAS).
- **Data Science**: tools such as JupyterLab and Jupyter Notebook, TensorFlow, PyTorch, and Dask.
- **Server-Side Web/Mobile Development**: Python is widely used in web applications and for the backend APIs supporting mobile applications.
- **WSGI Application Servers**: WSGI—the Web Server Gateway Interface—is a standard API for a Python framework or application to receive requests from an HTTP server and send responses to it. WSGI application servers manage one or more processes running your application, maximizing the use of the available CPUs. 
<img src="../images/WSGI.png" style="width: 80%;">  
Clients connect to an HTTP server that delivers static files and routes other requests to the application server, which forks child processes to run the application code, leveraging multiple CPU cores. The WSGI API is the glue between the application server and the Python application code. The main point: WSGI application servers such as Gunicorn, uWSGI, and mod_wsgi can use all CPU cores on the server by forking multiple Python processes to run traditional web apps written in good old sequential code in Django, Flask, Pyramid, etc. This explains why it’s been possible to earn a living as a Python web developer without ever studying the `threading`, `multiprocessing`, or `asyncio` modules: the application server handles concurrency transparently.
- **ASGI — Asynchronous Server Gateway Interface**: WSGI is a synchronous API. It doesn’t support coroutines with `async/await` — the most efficient way to implement `WebSockets` or HTTP long polling in Python. The ASGI specification is a successor to WSGI, designed for asynchronous Python web frameworks such as aiohttp, Sanic, FastAPI, etc., as well as Django and Flask, which are gradually adding asynchronous functionality.
- **Distributed Task Queues**: When the application server delivers a request to one of the Python processes running your code, your app needs to respond quickly: you want the process to be available to handle the next request as soon as possible. However, some requests demand actions that may take longer, for example, sending email or generating a PDF. That’s the problem that distributed task queues are designed to solve: the application posts the slow job to a queue and responds right away, while separate worker processes, possibly on other machines, take jobs from the queue and run them. Celery and RQ are the best-known open source task queues with Python APIs.
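To make the WSGI/ASGI contrast above concrete, here is a minimal sketch of the same "hello" endpoint written as a WSGI callable and as an ASGI callable, with no framework. The WSGI shape (`environ`, `start_response`, an iterable of bytes) follows PEP 3333, and the ASGI shape (`scope`, `receive`, `send`, with `http.response.start` and `http.response.body` messages) follows the ASGI specification. The names `wsgi_app`, `asgi_app`, `call_wsgi`, and `call_asgi` are hypothetical; the two `call_*` helpers are tiny in-memory drivers standing in for a real application server such as Gunicorn (WSGI) or Uvicorn (ASGI). Run it as a script: inside Jupyter, where an event loop is already running, `asyncio.run` raises `RuntimeError`.

```python
import asyncio
from wsgiref.util import setup_testing_defaults


# WSGI: a plain synchronous callable. The server calls it once per request,
# in one of its worker processes or threads, and iterates over the result.
def wsgi_app(environ, start_response):
    body = f"Hello from {environ['PATH_INFO']}".encode()
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]


# ASGI: a coroutine. It talks to the server by awaiting receive() and send(),
# so the event loop can serve other connections while this one waits.
async def asgi_app(scope, receive, send):
    assert scope['type'] == 'http'  # this sketch handles plain HTTP only
    body = f"Hello from {scope['path']}".encode()
    await send({'type': 'http.response.start', 'status': 200,
                'headers': [(b'content-type', b'text/plain'),
                            (b'content-length', str(len(body)).encode())]})
    await send({'type': 'http.response.body', 'body': body})


# Hypothetical in-memory drivers standing in for a real application server.
def call_wsgi(path: str) -> tuple[str, bytes]:
    environ = {'PATH_INFO': path}
    setup_testing_defaults(environ)  # fill in the other standard WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    body = b''.join(wsgi_app(environ, start_response))
    return captured['status'], body


def call_asgi(path: str) -> tuple[int, bytes]:
    sent = []

    async def receive():
        return {'type': 'http.request', 'body': b'', 'more_body': False}

    async def send(message):
        sent.append(message)  # record each event the app emits

    scope = {'type': 'http', 'method': 'GET', 'path': path}
    asyncio.run(asgi_app(scope, receive, send))
    return sent[0]['status'], sent[1]['body']


print(call_wsgi('/hi'))  # ('200 OK', b'Hello from /hi')
print(call_asgi('/hi'))  # (200, b'Hello from /hi')
```

The design difference is visible in the signatures: the WSGI callable returns only after the whole response is built, so a worker is tied up for the full request, while the ASGI coroutine can suspend at each `await`, which is what lets a single process keep many slow connections (such as WebSockets) open at once.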

## Conclusion

In this chapter, we studied Python’s three native concurrent programming models:
- Threads, with the `threading` package
- Processes, with `multiprocessing`
- Asynchronous coroutines with `asyncio`

We then explored the real impact of the GIL with an experiment: changing the spinner examples to compute the primality of a large integer and observing the resulting behavior. This demonstrated graphically that: 
1. <span style="color:green">*CPU-intensive functions must be avoided in `asyncio`, as they block the event loop.*</span>
2. The threaded version of the experiment worked, despite the GIL, because Python periodically interrupts threads, and the example used only two threads: one doing compute-intensive work, and the other driving the animation only 10 times per second.
3. The `multiprocessing` variant worked around the GIL by starting a new process just for the animation, while the main process did the primality check.
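Point 1 can be checked without the spinner. Below is a minimal sketch, assuming a hypothetical CPU-bound function `busy` (a stand-in for `is_prime`) and a `ticker` coroutine standing in for the spinner: `ticks_during` counts how often the ticker runs while the main coroutine either awaits `asyncio.sleep` or calls `busy` directly. All three names are illustrative. Run it as a script; in Jupyter, where an event loop is already running, use a top-level `await ticks_during(...)` instead of `asyncio.run`.

```python
import asyncio
from time import perf_counter


def busy(seconds: float) -> int:
    """Hypothetical CPU-bound stand-in: spins for `seconds` without ever awaiting."""
    deadline = perf_counter() + seconds
    spins = 0
    while perf_counter() < deadline:
        spins += 1
    return spins


async def ticker(ticks: list[float], interval: float) -> None:
    """Stand-in for the spinner: records a timestamp every `interval` seconds."""
    while True:
        ticks.append(perf_counter())
        await asyncio.sleep(interval)


async def ticks_during(work_seconds: float, blocking: bool) -> int:
    ticks: list[float] = []
    spinner = asyncio.create_task(ticker(ticks, 0.01))
    await asyncio.sleep(0)  # yield once so the ticker can start
    before = len(ticks)
    if blocking:
        busy(work_seconds)  # runs on the event-loop thread: no other task can run
    else:
        await asyncio.sleep(work_seconds)  # suspends, so the loop keeps driving the ticker
    after = len(ticks)
    spinner.cancel()
    try:
        await spinner
    except asyncio.CancelledError:
        pass  # expected: we just cancelled it
    return after - before


if __name__ == '__main__':
    print('awaiting sleep:', asyncio.run(ticks_during(0.3, blocking=False)))
    print('CPU-bound call:', asyncio.run(ticks_during(0.3, blocking=True)))  # always 0
```

The blocking case always reports 0 ticks: between reading `before` and `after` there is no `await`, so the event loop never gets control back to run the ticker, which is exactly why the spinner froze in the `asyncio` version of the experiment.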

The next example, computing several primes, highlighted the difference between `multiprocessing` and `threading`, showing that <span style="color:green">*of these three models, only processes allow Python to benefit from multicore CPUs for CPU-bound work. Python’s GIL makes threads worse than sequential code for heavy computations*</span>.