A workflow for space plasma physics on Apache Airflow. The workflow is implemented as a DAG and can be run in Apache Airflow on a Kubernetes cluster.
We provide a DAG to execute Particle-in-Cell (PIC) simulations using the SputniPIC space plasma simulation software.
The PIC DAG, as presented in the Apache Airflow UI
The main DAG is contained in `pic.py`. We also provide the following folders:
- `docker/` contains the Dockerfile, along with bash scripts which are included in the image;
- `misc/` contains various configuration files, used for testing and development;
- `plot/` contains Python scripts to create plots.
- Kubernetes: a PersistentVolume
- Kubernetes: a PersistentVolumeClaim with name `pv-pic`
- Docker: a Docker image available in a public registry with name `example/sputniPIC:latest`
- Apache Airflow: `pic.py` is in Apache Airflow's DAG folder.
- DAG: in `pic.py`, `PVC_NAME` is set
- DAG: in `pic.py`, `IMAGE_NAME` is set
- DAG: one or more `.inp` simulation configuration files stored in the root of the PersistentVolume.
- A Kubernetes cluster
- A working Apache Airflow setup: in particular, Apache Airflow must be configured to be able to run tasks on the Kubernetes cluster.
The workflow relies on a specific PersistentVolumeClaim being present on the Kubernetes cluster to store files during execution. In this step, we describe how to create a PersistentVolume, and a PersistentVolumeClaim attached to this volume.
PersistentVolume.
Your Kubernetes cluster administrator may provide you with the name of the PersistentVolume you need to use.
However, if you manage your own Kubernetes cluster, you need to create a PersistentVolume yourself; we provide an example in `misc/pv.yaml`.
In this example, and in the rest of this tutorial, the PersistentVolume is named `pv-local` and uses local storage as a backend.
Please refer to the Kubernetes documentation to learn more about PersistentVolumes.
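For reference, a minimal local-storage PersistentVolume manifest could look like the following sketch. This is not the actual `misc/pv.yaml`: the capacity, access mode, and host path are placeholder assumptions to adapt to your cluster (a hostPath backend is used here for simplicity).

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local
spec:
  capacity:
    storage: 10Gi          # placeholder size
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data/pic    # placeholder path on the node
```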
It can be deployed using:
kubectl create -f misc/pv.yaml
Note
Any storage backend other than local storage can be used for the PersistentVolume, depending on your Cloud provider.
PersistentVolumeClaim.
Once you know the name of the PersistentVolume (in this example `pv-local`), you need to create a PersistentVolumeClaim, containing information on the storage size.
An example is provided in `misc/pvclaim.yaml`; the PersistentVolumeClaim can be created using:
kubectl create -f misc/pvc.yaml -n airflow
Warning
It is crucial that the namespace used for the PVC is the same as the one under which Apache Airflow is deployed; here we use `airflow`.
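A claim bound to this volume might look like the following sketch (not the actual `misc/pvclaim.yaml`; the claim name `pvc-pic` and the storage request are assumptions to match against your own files and your volume's capacity):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-pic            # assumed claim name; must match PVC_NAME in pic.py
  namespace: airflow       # must match the Airflow deployment namespace
spec:
  volumeName: pv-local     # bind to the PersistentVolume created above
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi        # placeholder size, at most the volume's capacity
```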
A Dockerfile is provided, along with scripts that will be included in the image, in the `docker` folder.
To build and publish the image:
cd docker/
docker build -t gabinsc/sputnipic:latest .
docker push gabinsc/sputnipic:latest
Note
The image must be published to a public Docker registry, or a registry which is accessible from the Apache Airflow setup. Please refer to Docker documentation for more details on building an image, and publishing it.
In order for the DAG to be executed in your specific environment, some adjustments are required.
- Place the `pic.py` file in the DAG folder of your Apache Airflow setup.
- Adjust the following constants in `pic.py`:
  - `IMAGE_NAME`: name of the image that will be used for the containers.
  - `PVC_NAME`: name of the PersistentVolumeClaim created in step 1, `pvc-pic`.
- Validate that you can see the DAG under the name `pic` in the Apache Airflow UI. If not, DAG import errors are reported at the top of the UI.
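As a sketch, the constants near the top of `pic.py` would be set along these lines (the values shown are the tutorial's examples, not universal defaults; replace them with your own image and claim names):

```python
# Constants in pic.py to adapt to your environment.
IMAGE_NAME = "gabinsc/sputnipic:latest"  # container image built in step 2
PVC_NAME = "pvc-pic"                     # PersistentVolumeClaim from step 1
```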
Before you run the DAG, place the various configuration files for the simulation, in .inp
format, in the root of the PersistentVolume defined in the Kubernetes cluster.
Note
Input files are available in SputniPIC's repository: examples.
Click on "Trigger DAG" in the Apache Airflow UI to start the DAG with the default parameters. You can customize the DAG parameters to your needs by clicking "Trigger DAG w/ config":
- `inputlist`: list of experiments; each experiment is the name of the corresponding configuration file, without the `.inp` extension.
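For example, a configuration payload for "Trigger DAG w/ config" could look like the following (the experiment names are hypothetical placeholders; use the base names of your own `.inp` files):

```json
{
  "inputlist": ["experiment_a", "experiment_b"]
}
```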
We provide Python scripts to create readable Gantt charts based on the workflow execution. Note that a Gantt chart can be found for each DAG execution in the Apache Airflow UI; however, this chart offers limited interactivity and can be hard to read for complex or long-running DAGs.
Requirements:
- `python` (≥ 3.9)
- Python libraries: `plotly`, `requests`, `pandas`
For a specific DAG run, the `plot/plot_gantt.py` script creates two Gantt charts in SVG format:
- Resource view: each line in the chart represents a slot in a pool (note that multi-slot tasks are not supported)
- Task view: each line represents a task
- (REMOVED) Multi-execution resource view: several DAG runs can be presented on the same Gantt chart, each run has its own color.
Before running this script, some information needs to be set in `plot/constants.py`:
:
- `BASE_URL`: base URL to access the Airflow API
- `SESSION_COOKIE`: session cookie, which can typically be obtained from the Network section of your browser's DevTools when logged in to the Apache Airflow UI
- `POOL_ALIAS`: alias names for the various pools, shown in the legend
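A sketch of `plot/constants.py` with placeholder values (the URL, cookie, and pool names below are assumptions; replace them with your own deployment's values):

```python
# plot/constants.py -- all values below are placeholders, not defaults.
BASE_URL = "http://localhost:8080"  # base URL of the Airflow webserver
SESSION_COOKIE = "session=PASTE-COOKIE-VALUE-HERE"  # from browser DevTools
POOL_ALIAS = {
    "default_pool": "Default",
    "simulation_pool": "PIC simulations",  # hypothetical pool name
}
```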
The identifier of the DAG and the identifier of the specific DAG run are given as command-line arguments. The script can also plot data from a JSON file, a sample is provided in the samples/
directory.
When running the script, figures will be written to the figures/
folder.