b2luigi - bringing batch 2 luigi!

b2luigi is a helper package for luigi for scheduling large luigi workflows on a batch system. It is as simple as:

    import b2luigi


    class MyTask(b2luigi.Task):
        def output(self):
            return b2luigi.LocalTarget("output_file.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("This is a test\n")


    if __name__ == "__main__":
        b2luigi.process(MyTask(), batch=True)

Jump right into it with our quick start.
If you have never worked with luigi before, you may want to have a look at the luigi documentation. But you can also learn most of the nice features from this documentation!
Attention
The API of b2luigi is still under construction. Please keep this in mind when using the package in production!
Luigi already contains a large set of tasks for scheduling and monitoring batch jobs [1]. But with thousands of tasks in very large projects that use different task-defining libraries, you run into some problems:
- You want to run many (many, many!) batch jobs in parallel. In other luigi batch implementations, every running batch job also needs a running task that monitors it. On most systems, the maximum number of processes per user is limited, so you cannot run more batch jobs than that limit. But what do you do if you have thousands of tasks?
- You already have a large set of luigi tasks in your project. In other implementations, you either have to override a work function (and are not allowed to touch the run function), or the tasks can only run an external command, which you need to define. The first approach does not play well when mixing non-batch and batch task libraries, and the second has problems when you need to pass complex arguments to the external command (via the command line).
- You do not know which batch system you will run on. Currently, batch tasks are mostly defined for a specific batch system. But what if you want to switch from AWS to Azure? From LSF to SGE?
Enter b2luigi, which tries to solve all of this (and was heavily inspired by the previous implementations):
- You can run as many tasks as your batch system can handle in parallel! There will only be a single process running on your submission machine.
- No need to rewrite your tasks! Just call them with b2luigi.process(.., batch=True) or with python file.py --batch and you are ready to go!
- Switching the batch system is just a single change in a config file or one line in python. In the future, there will even be an automatic discovery of the batch system to use.
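
To make the first and last points more concrete, here is a minimal, standard-library-only sketch of the general idea: a single process submits all jobs and then polls them in one loop, and the backend is picked by a single name. This is not b2luigi's actual implementation. `FakeBatchSystem`, `BACKENDS`, `JobStatus` and `run_all` are hypothetical names made up for this illustration, and the fake backend simply reports a job as finished after a fixed number of status polls.

```python
import itertools
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCESSFUL = "successful"


class FakeBatchSystem:
    """Hypothetical stand-in for a real batch system (LSF, HTCondor, ...)."""

    def __init__(self, polls_until_done=3):
        self._polls_until_done = polls_until_done
        self._remaining = {}  # job id -> polls left before the job finishes
        self._ids = itertools.count(1)

    def submit(self, task_name):
        # A real backend would call the batch system's submit command here.
        job_id = next(self._ids)
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def status(self, job_id):
        # A real backend would query the batch system's status command here.
        self._remaining[job_id] -= 1
        if self._remaining[job_id] <= 0:
            return JobStatus.SUCCESSFUL
        return JobStatus.RUNNING


# Assumption for this sketch: one name selects the backend,
# so switching the batch system is a one-line change.
BACKENDS = {"fake": FakeBatchSystem}


def run_all(task_names, batch_system="fake"):
    """Submit every task, then poll all jobs from this single process."""
    backend = BACKENDS[batch_system]()
    pending = {backend.submit(name): name for name in task_names}
    finished = []
    while pending:
        for job_id in list(pending):
            if backend.status(job_id) is JobStatus.SUCCESSFUL:
                finished.append(pending.pop(job_id))
        # A real scheduler would sleep between polling rounds.
    return finished


done = run_all([f"task_{i}" for i in range(1000)])
print(len(done))
```

The number of tracked jobs is limited only by the size of the `pending` dictionary, not by the per-user process limit, because no extra monitoring process is started per job.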
As b2luigi should help you with large luigi projects, we have also included some helper functionality for luigi tasks and task handling. The b2luigi task is a super-hero version of the luigi task, with simpler handling for output and input files. We also give you working examples and best practices for better data management and for accomplishing your goals, which we have learned over time.
Have a look at the quick start.
You can also start reading the API documentation or the code on GitHub.
If you find any bugs or want to improve the documentation, please send me a pull request.
This project is in beta. Please be extra cautious when using it in production. You can help me by working on one of the todo items described in the development section.
b2luigi stands for multiple things at the same time:
- It brings batch to (2) luigi.
- It helps you with the bread-and-butter work in luigi (e.g. proper data management).
- It was developed for the Belle II experiment.
- Main developer: Michael Eliachevitch (meliache)
- Original author: Nils Braun (nils-braun)
- Features, fixing, help and testing:
  - Felix Metzner (FelixMetzner)
  - Patrick Ecker (eckerpatrick)
  - Jochen Gemmler
  - Maximilian Welsch (welschma)
  - Kilian Lieret (klieret)
  - Sviatoslav Bilokin (bilokin)
  - Phil Grace (philiptgrace)
- Stolen ideas