# I39 : Use Queue to Coordinate Work Between Threads

- Python programs that do many things concurrently often need to coordinate their work. One of the most useful arrangements for concurrent work is a pipeline of functions.

- A pipeline works like an assembly line used in manufacturing. Pipelines have many phases in serial, with a specific function for each phase. New pieces of work are constantly added to the beginning of the pipeline. Each function can operate concurrently on the piece of work in its phase. The work moves forward as each function completes until there are no phases remaining. This approach is especially good for work that includes blocking I/O or subprocesses, activities that can easily be parallelized using Python.

- For example, say you want to build a system that will take a constant stream of images from your digital camera, resize them, and then add them to a photo gallery online. Such a program could be split into three phases of a pipeline. New images are retrieved in the first phase. The downloaded images are passed through the resize function in the second phase. The resized images are consumed by the upload function in the final phase.

- Imagine you had already written Python functions that execute the phases: download, resize, upload. How do you assemble a pipeline to do the work concurrently?

- The first thing you need is a way to hand off work between the pipeline phases. This can be modeled as a thread-safe producer-consumer queue.

In [1]:
from collections import deque
from threading import Lock

class MyQueue(object):
    def __init__(self):
        self.items = deque()
        self.lock = Lock()
        
    def put(self, item):
        with self.lock:
            self.items.append(item)
            
    def get(self):
        with self.lock:
            return self.items.popleft()

- Here, I represent each phase of the pipeline as a Python thread that takes work from one queue like this, runs a function on it, and puts the result on another queue. I also track how many times the worker has checked for new input and how much work it's completed.

In [2]:
from threading import Thread
from time import sleep

class Worker(Thread):
    def __init__(self, func, in_queue, out_queue):
        super().__init__()
        self.func = func
        self.in_queue = in_queue
        self.out_queue = out_queue
        self.polled_count = 0
        self.work_done = 0
        
    def run(self):
        while True:
            self.polled_count += 1
            try:
                item = self.in_queue.get()
            except IndexError:
                sleep(0.01) 
            else:
                result = self.func(item)
                self.out_queue.put(result)
                self.work_done += 1
        

- The trickiest part is that the worker thread must properly handle the case where the input queue is empty because the previous phase hasn't completed its work yet. This happens where I catch the IndexError exception in the run method above. You can think of this as a holdup in the assembly line.
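
- As a minimal sketch of the empty-queue case, the snippet below repeats the MyQueue class so it runs on its own and shows that get raises IndexError when nothing is pending. The `try_get` helper and the `'photo.jpg'` item are hypothetical names used only for illustration.

```python
from collections import deque
from threading import Lock


class MyQueue:
    """Same shape as the MyQueue class above, repeated so this runs alone."""

    def __init__(self):
        self.items = deque()
        self.lock = Lock()

    def put(self, item):
        with self.lock:
            self.items.append(item)

    def get(self):
        with self.lock:
            return self.items.popleft()


def try_get(q):
    """Hypothetical helper: return (True, item), or (False, None) if empty."""
    try:
        return True, q.get()
    except IndexError:  # deque.popleft() raises this on an empty deque
        return False, None


q = MyQueue()
empty_result = try_get(q)   # Nothing has been put yet
q.put('photo.jpg')
full_result = try_get(q)    # Now one item is pending
print(empty_result, full_result)
```

- The worker's run method follows the same pattern, except that it sleeps briefly and retries instead of returning a flag.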

- Now I can connect the three phases together by creating the queues for their coordination points and the corresponding worker threads.

In [3]:
def download(item):
    # Placeholder: a real implementation would fetch the image
    return item

def resize(item):
    # Placeholder: a real implementation would resize the image
    return item

def upload(item):
    # Placeholder: a real implementation would upload the image
    return item

In [4]:
download_queue = MyQueue()
resize_queue = MyQueue()
upload_queue = MyQueue()
done_queue = MyQueue()
threads = [
    Worker(download, download_queue, resize_queue),
    Worker(resize, resize_queue, upload_queue),
    Worker(upload, upload_queue, done_queue)
]

- I can start the threads and then inject a bunch of work into the first phase of the pipeline. Here, I use a plain object instance as a proxy for the real data required by the download function:

In [5]:
for thread in threads:
    thread.start()
for _ in range(1000):
    download_queue.put(object())



In [6]:
while len(done_queue.items) < 1000:
    # Do something useful while waiting
    # ...
    pass

- This runs properly, but there's an interesting side effect caused by the threads polling their input queues for new work. The tricky part, where I catch IndexError exceptions in the run methods, executes a large number of times.

In [7]:
processed = len(done_queue.items)
polled = sum(t.polled_count for t in threads)
print('Processed', processed, 'items after polling',
    polled, 'times')



- When the worker functions vary in speed, an earlier phase can prevent progress in later phases, backing up the pipeline. This causes later phases to starve and constantly check their input queues for new work in a tight loop. The outcome is that worker threads waste CPU time doing nothing useful.

- But that's just the beginning of what's wrong with this implementation. There are three more problems that you should also avoid. First, determining that all of the input work is complete requires yet another busy wait on the done\_queue. Second, in Worker the run method will execute forever in its busy loop. There's no way to signal to a worker thread that it's time to exit.

- Third, and worst of all, a backup in the pipeline can cause the program to crash arbitrarily. If the first phase makes rapid progress but the second phase makes slow progress, then the queue connecting the first phase to the second phase will constantly increase in size. The second phase won't be able to keep up. Given enough time and input data, the program will eventually run out of memory and die.
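
- The backlog growth can be illustrated with a toy, single-threaded simulation. This is only a sketch: the `simulate_backlog` function and the per-tick rates (three items produced, one consumed) are made-up assumptions, not measurements from the pipeline above.

```python
from collections import deque


def simulate_backlog(ticks, produced_per_tick, consumed_per_tick):
    """Return the pending-queue length after each tick of a toy pipeline."""
    pending = deque()
    sizes = []
    for _ in range(ticks):
        # The fast first phase hands off several items per tick
        for _ in range(produced_per_tick):
            pending.append(object())
        # The slow second phase only keeps up with a few of them
        for _ in range(min(consumed_per_tick, len(pending))):
            pending.popleft()
        sizes.append(len(pending))
    return sizes


sizes = simulate_backlog(ticks=5, produced_per_tick=3, consumed_per_tick=1)
print(sizes)  # [2, 4, 6, 8, 10]
```

- With nothing limiting the size of the queue, the backlog grows by the difference between the two rates on every tick.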

- The lesson isn't that pipelines are bad; it's that it's hard to build a good producer-consumer queue yourself.

> Queue to the Rescue

- The Queue class from the queue built-in module provides all of the functionality you need to solve these problems.

- Queue eliminates the busy waiting in the worker by making the get method block until new data is available. For example, here I start a thread that waits for some input data on a queue:

In [11]:
from queue import Queue
queue = Queue()

def consumer():
    print("Consumer waiting")
    queue.get()
    print("Consumer done")
    
thread = Thread(target=consumer)
thread.start()

print('Producer putting')
queue.put(object())
thread.join()
print('Producer done')

Consumer waiting
Producer putting
Consumer done
Producer done


- Even though the thread is running first, it won't finish until an item is put on the Queue instance and the get method has something to return.
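
- A blocking get can also be given a time limit. The sketch below relies on the `timeout` parameter of `Queue.get` and the `queue.Empty` exception from the standard library. The `get_or_default` helper and the item names are hypothetical, chosen only for illustration.

```python
from queue import Empty, Queue


def get_or_default(q, timeout, default=None):
    """Hypothetical helper: wait up to `timeout` seconds for an item."""
    try:
        return q.get(timeout=timeout)
    except Empty:  # Raised when no item arrives before the timeout
        return default


work_queue = Queue()
missing = get_or_default(work_queue, timeout=0.05, default='nothing yet')
work_queue.put('image-1')
found = get_or_default(work_queue, timeout=0.05)
print(missing, found)
```

- Unlike the polling loop in Worker, the waiting here happens inside the queue rather than in a sleep-and-retry loop written by hand.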

- To solve the pipeline backup issue, the Queue class lets you specify the maximum amount of pending work you'll allow between two phases. This buffer size causes calls to put to block when the queue is already full. For example, here I define a thread that waits for a while before consuming a queue:

In [15]:
import time 

queue = Queue(1)

def consumer():
    time.sleep(0.1)
    print('start')
    queue.get()
    print('Consumer got 1')
    queue.get()
    print('Consumer got 2')
    
thread = Thread(target=consumer)
thread.start()

queue.put(object())
print('Producer put 1')
queue.put(object())
print('Producer put 2')
thread.join()
print('Producer done')

Producer put 1
start
Consumer got 1
Producer put 2
Consumer got 2
Producer done

- The Queue class can also track the progress of work using the task\_done method. This lets you wait for a phase's input queue to drain with join, which removes the need to poll the done\_queue at the end of the pipeline.


In [17]:
in_queue = Queue()

def consumer():
    print('Consumer waiting')
    work = in_queue.get()
    print('Consumer working')
    # Doing work
    print('Consumer done')
    in_queue.task_done()

Thread(target=consumer).start()

in_queue.put(object())

print('Producer waiting')

in_queue.join()

print('Producer done')

Producer waiting
Consumer waiting
Consumer working
Consumer done
Producer done
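
- The same idea extends to several items: join returns only after task\_done has been called once for every item that was put. The sketch below also shows that an extra task\_done call raises ValueError, as the standard library documents. The names `jobs`, `results`, and `extra_call_rejected` are illustrative.

```python
from queue import Queue
from threading import Thread

jobs = Queue()
results = []


def consumer():
    for _ in range(3):
        item = jobs.get()
        results.append(item * 2)
        jobs.task_done()  # One call per item taken off the queue


worker = Thread(target=consumer)
worker.start()
for n in (1, 2, 3):
    jobs.put(n)

jobs.join()    # Returns only after three task_done() calls
worker.join()
print(results)

extra_call_rejected = False
try:
    jobs.task_done()  # No unfinished items remain
except ValueError:
    extra_call_rejected = True
print(extra_call_rejected)
```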


In [20]:
class ClosableQueue(Queue):
    SENTINEL = object()

    def close(self):
        # Signal that no more items will follow
        self.put(self.SENTINEL)

    def __iter__(self):
        while True:
            item = self.get()
            try:
                if item is self.SENTINEL:
                    return  # Stop iteration so the consuming thread exits
                yield item
            finally:
                self.task_done()

- Here, ClosableQueue adds a close method that puts a special sentinel item on the queue to indicate that no more input items will follow it. The \_\_iter\_\_ method looks for this special object and stops iteration when it's found. It also calls task\_done at appropriate times, letting me track the progress of work on the queue.
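
- A short usage sketch of the close-and-iterate behavior follows. The class is repeated so the snippet runs on its own; the `drain` function, the `letters` queue, and the sample data are hypothetical.

```python
from queue import Queue
from threading import Thread


class ClosableQueue(Queue):
    SENTINEL = object()

    def close(self):
        self.put(self.SENTINEL)

    def __iter__(self):
        while True:
            item = self.get()
            try:
                if item is self.SENTINEL:
                    return  # Ends the for loop in the consumer
                yield item
            finally:
                self.task_done()


seen = []
letters = ClosableQueue()


def drain():
    # The loop finishes once the sentinel put by close() is reached
    for letter in letters:
        seen.append(letter.upper())


reader = Thread(target=drain)
reader.start()
for letter in 'abc':
    letters.put(letter)
letters.close()
letters.join()   # Every item, including the sentinel, is marked done
reader.join()
print(seen)
```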

In [22]:
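class StoppableWorker(Thread):
    def __init__(self, func, in_queue, out_queue):
        super().__init__()  # Required before the thread can be started
        self.func = func
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        # The loop ends, and the thread exits, once the input queue is closed
        for item in self.in_queue:
            result = self.func(item)
            self.out_queue.put(result)

- Here, I re-create the set of worker threads using the new worker class:

In [23]:
download_queue = ClosableQueue()
resize_queue = ClosableQueue()
upload_queue = ClosableQueue()
done_queue = ClosableQueue()
threads = [
    StoppableWorker(download, download_queue, resize_queue),
    StoppableWorker(resize, resize_queue, upload_queue),
    StoppableWorker(upload, upload_queue, done_queue),
]

In [24]:
for thread in threads:
    thread.start()
for _ in range(1000):
    download_queue.put(object())
download_queue.close()  # No more input for the first phase

In [26]:
# Wait for each phase to drain, then tell the next phase to stop
download_queue.join()
resize_queue.close()
resize_queue.join()
upload_queue.close()
upload_queue.join()
print(done_queue.qsize(), 'items finished')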
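
- The pieces can also be combined with a buffer size so that a fast phase cannot run far ahead of a slow one. The following is a self-contained sketch, not the definitive pipeline: the three stand-in functions, the `maxsize=10` limit, and the 100-item workload are assumptions chosen for illustration.

```python
from queue import Queue
from threading import Thread


class ClosableQueue(Queue):
    SENTINEL = object()

    def close(self):
        self.put(self.SENTINEL)

    def __iter__(self):
        while True:
            item = self.get()
            try:
                if item is self.SENTINEL:
                    return
                yield item
            finally:
                self.task_done()


class StoppableWorker(Thread):
    def __init__(self, func, in_queue, out_queue):
        super().__init__()
        self.func = func
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        for item in self.in_queue:
            self.out_queue.put(self.func(item))


# Hypothetical stand-ins for the real phase functions
def download(n):
    return n


def resize(n):
    return n * 2


def upload(n):
    return n + 1


# Bounded queues between phases; put() blocks while a queue is full
download_queue = ClosableQueue(maxsize=10)
resize_queue = ClosableQueue(maxsize=10)
upload_queue = ClosableQueue(maxsize=10)
done_queue = ClosableQueue()  # Unbounded so the last phase never blocks

threads = [
    StoppableWorker(download, download_queue, resize_queue),
    StoppableWorker(resize, resize_queue, upload_queue),
    StoppableWorker(upload, upload_queue, done_queue),
]
for thread in threads:
    thread.start()

for n in range(100):
    download_queue.put(n)
download_queue.close()

# Drain each phase in order, closing the next phase's input as we go
download_queue.join()
resize_queue.close()
resize_queue.join()
upload_queue.close()
upload_queue.join()
for thread in threads:
    thread.join()

finished = [done_queue.get() for _ in range(done_queue.qsize())]
print(len(finished), 'items finished')
```

- Because each phase has a single worker and the queues are first-in, first-out, the results arrive in the order the inputs were submitted.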

## Things to Remember

- Pipelines are a great way to organize sequences of work that run concurrently using multiple Python threads.
- Be aware of the many problems in building concurrent pipelines: busy waiting, stopping workers, and memory explosion.
- The Queue class has all of the facilities you need to build robust pipelines: blocking operations, buffer sizes, and joining.