Hawkloon (/'hɔːkluːn/) is a Python framework to build synchronous workers running on scalable infrastructures.
To understand how it works, you should understand the concepts of Worker and Job first.
A Worker is a piece of code that runs and completes a task, kinda like a school kid doing his homework.
A Job is a resource, a piece of data given to the Worker. This is the homework I said before.
- You write a list of Jobs, and this is given to every Worker.
- A Worker can split the tasks between some threads.
- When a Job ends, the thread tells to the others, so they can skip it.
Hawkloon uses Redis (website) to keep track of the jobs' states between threads & workers (so you can start a worker on two different machines).
To install it, clone the repo and use the setup.py
file.
git clone https://github.com/heizelnut/hawkloon
cd hawkloon/src
python3 setup.py install
Let's write a dead simple program that downloads cat images from the internet.
First, write a list of links (jobs) the Worker will have to consume.
jobs = (
"https://i.imgur.com/uvFEcJN.jpg",
"https://i.imgur.com/6qL2HSN.jpg",
"https://i.imgur.com/dRxnay8.jpg",
"https://i.imgur.com/aAuTHLe.jpg",
"https://i.imgur.com/SpCbHBI.jpg"
)
Now you can choose two ways to declare workers: with Class Inheritance or Decorators.
Inherit from the Worker
class and choose the number of threads.
from hawk import Worker
class CatWorker(Worker):
THREADS = 3 # Default is 2
pass
Alright, now overwrite the Worker.consume
method by adding the ability to actually download some images.
from hawk import Worker
import random, requests
class CatWorker(Worker):
THREADS = 3
# Requests the image and saves it in binary form
def consume(self, job):
img = requests.get(job, allow_redirects=True).content
with open(f"{random.randint(0, 10000)}.jpg", "wb") as f:
f.write(img)
Then, instiantiate the worker and connect it to a Redis server.
worker = CatWorker(jobs)
worker.connect("redis://localhost:6379")
After that, start it!
worker.start()
Import the Worker
class and instantiate it.
from hawk import Worker
cat_worker = Worker(jobs)
Then, create a function that downloads cat images, and decorate it with the new instantiated object passing the amount of threads to use.
@cat_worker(threads=3)
def download(job):
img = requests.get(job, allow_redirects=True).content
with open(f"{random.randint(0, 10000)}.jpg", "wb") as f:
f.write(img)
The decorator function must have only one argument.
After that, connect and start crunching data.
worker.connect("redis://localhost:6379")
worker.start()
If you'd like to contribute, you're free to do so! Fork my project and then pull request me.