Scalable pipeline processing and infrastructure management.
- Scalable pipeline processing using datapackage pipelines.
- Infrastructure management using Kubernetes.
Please refer to the datapackage-pipelines documentation for full details about the pipelines.
This project contains a sample pipeline called `noise`, which generates some noise.
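As an illustration of the kind of row transform such a processor performs, here is a library-free, hypothetical sketch (the function name, the `scale` parameter, and the `noise` field are assumptions for this example, not the actual code in noise.py):

```python
import random

def add_noise(rows, scale=1.0):
    """Yield each input row with an added 'noise' field drawn from [-scale, scale].

    Hypothetical stand-in for a noise-generating row transform;
    the real processor in noise.py may differ.
    """
    for row in rows:
        out = dict(row)
        out["noise"] = random.uniform(-scale, scale)
        yield out

rows = [{"id": 1}, {"id": 2}]
noisy = list(add_noise(rows, scale=0.5))
```

Processors in this style stream rows lazily, so they can handle datasets larger than memory.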
The pipelines are defined in the `pipeline-spec.yaml` file. Each step's `run` attribute can point to a local Python file implementing the datapackage-pipelines processor interface; see `noise.py` for an example.
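As a sketch only (the step parameters and output path here are hypothetical, not taken from this repo), a minimal `pipeline-spec.yaml` entry wiring a local processor might look like:

```yaml
noise:
  pipeline:
    - run: noise          # resolves to the local noise.py processor
      parameters:
        sample-size: 100  # hypothetical parameter passed to the processor
    - run: dump.to_path   # standard datapackage-pipelines processor
      parameters:
        out-path: data
```

The top-level key (`noise`) becomes the pipeline ID used by `dpp run`.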
Install some dependencies (the following should work on recent versions of Ubuntu / Debian):

```
sudo apt-get install -y python3.6 python3-pip python3.6-dev libleveldb-dev libleveldb1v5
sudo pip3 install pipenv
```

Install the app dependencies:

```
pipenv install
```

Activate the virtualenv:

```
pipenv shell
```

Get the list of available pipelines:

```
dpp
```

Run a pipeline:

```
dpp run <PIPELINE_ID>
```
Run the pipelines using Kubernetes jobs.
- Terminal with the `kubectl` command authenticated to a Kubernetes cluster. You should have some running nodes (verify using `kubectl get nodes`).
The following example job configurations are available; you can use them and modify them according to your requirements:

- `k8s-job.yaml` - a simple job, running once to completion
- `k8s-scheduled-job.yaml` - a scheduled job, running daily; before each run it syncs the latest data generated by the job defined in `k8s-job.yaml`
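As a hedged example of what such a job configuration might contain (the job name, container image, and command below are placeholders, not the repo's actual `k8s-job.yaml`), a minimal Kubernetes Job that runs a pipeline once to completion could look like:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: noise-pipeline                # placeholder job name
spec:
  template:
    spec:
      containers:
        - name: pipelines
          image: example/pipelines    # placeholder image with dpp installed
          command: ["dpp", "run", "./noise"]
      restartPolicy: Never            # let the Job controller handle retries
  backoffLimit: 2                     # retry a failed pipeline run up to twice
```

The scheduled variant wraps a similar pod template in a `batch/v1` `CronJob` with a `schedule` field (e.g. a daily cron expression).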
Run the job:

```
kubectl apply -f k8s-job.yaml
```
To modify the job and re-run it, delete the old job first:

```
kubectl delete job <JOB_NAME>
kubectl delete cronjob <CRON_JOB_NAME>
```