
[incubator/airflow] Airflow Helm Chart #3959

Merged · 72 commits · Dec 4, 2018

Commits
5805b72
Naive import from gh:gsemet/kube-airflow
gsemet Mar 2, 2018
7e9d644
update chart with recent fixes
gsemet Mar 3, 2018
4c5cd56
Update the templates from kube-airflow
gsemet Mar 10, 2018
2d5a822
Update after review
gsemet Apr 5, 2018
b04e1b3
Sync with gsemet/kube-airflow
gsemet Apr 5, 2018
283dd00
Move to secrets
gsemet Apr 5, 2018
cc9e1cb
Fixed some (find/replace?) errors.
rolanddb Apr 5, 2018
6cffbdc
fix env vars
rolanddb Apr 9, 2018
7cb7bbb
apply base64 encoding to secrets
rolanddb Apr 9, 2018
e5e4e33
Remove trailing whitespace
rolanddb Apr 9, 2018
8814bd9
Fix templates API versions
ese Apr 12, 2018
3540069
Update instructions for embedded DAGs
rolanddb Apr 13, 2018
14154d3
Add configurable serviceAccountName to workers
rolanddb Apr 11, 2018
ce5ceb1
Rename worker/celery vars
rolanddb Apr 13, 2018
801fbfe
Fix NOTES.txt
rolanddb Apr 13, 2018
27d7828
Remove 'v' from chart version
rolanddb Apr 13, 2018
d2a5ca6
Fix a templating typo
amodig Apr 27, 2018
7cdfaf2
fix celery replica config
rolanddb May 2, 2018
84f4ab2
Fix field serviceAccountName
rolanddb Apr 16, 2018
f41de34
Add quotes around the numbers for celery workers
rolanddb May 9, 2018
fc984a8
Parametrize cpu and mem req/limits for worker statefulset
rolanddb May 9, 2018
38085a1
Fix 'trailing spaces' lint issue
rolanddb May 28, 2018
fa19e4a
added the ability to toggle the scheduler -p option
AdamUnger Jun 4, 2018
48c27e2
missed the airflow namespace
AdamUnger Jun 4, 2018
5fe409d
moved command arguments to separate lines
AdamUnger Jun 4, 2018
6672ee5
Headless service name should match statefulset serviceName
The-Fonz Jun 6, 2018
37554ab
compatibility with kube 2.7
gsemet Jun 8, 2018
ce0d3b8
Configure persistance
gsemet Jun 11, 2018
b4adf6c
Update configuration description in README
gsemet Jun 11, 2018
b950c67
Configure persistance
gsemet Jun 11, 2018
9cd3db2
Support initcontainer git synchro
gsemet Jun 11, 2018
3f4a023
Adding documentation for persistence.existingClaim
Jdban Jun 14, 2018
0dc4a58
airflow.scheduler_do_pickle => dags.donot_pickle
gsemet Jun 15, 2018
5f6fcc4
Injecting ~/.local/bin in PATH
gsemet Jun 15, 2018
cb867cf
Force disable xcom pickling
gsemet Jun 15, 2018
d8eb34e
mkdir ~/.local/bin
gsemet Jun 15, 2018
5a0ba8b
Use real use as base URL
gsemet Jun 15, 2018
f34e715
more docs examples
gsemet Jun 15, 2018
2b17f95
Typo fixes
gsemet Jun 15, 2018
1d00983
export local PATH before pip install
gsemet Jun 15, 2018
ca0ba41
Hardcode local path
gsemet Jun 15, 2018
514e556
revert BASE_URL
gsemet Jun 15, 2018
2d6b5cc
bad order or install in scheduler
gsemet Jun 15, 2018
5dd6257
increase initial delay for webui to 6 min
gsemet Jun 15, 2018
4b9fe6c
Reference proper enabled parameter for values.dags.init_container
Jdban Jun 15, 2018
6d78242
Fixed some more spots for Values.dags.init_container.enabled
Jdban Jun 15, 2018
515ad61
Doc for donot_pickle
gsemet Jun 15, 2018
248fd16
notes about num_runs
gsemet Jun 18, 2018
40e5eb9
remove load_examples from example
gsemet Jul 3, 2018
cf59ee8
remove load_examples from example
gsemet Jul 3, 2018
be1255e
airflow: Add support for ingress TLS termination
Aug 10, 2018
fcd4726
airflow: Allow customisation of worker Pod annotations
Aug 13, 2018
6740649
feature/secrets: add secrets as volume mount
ameier38 Oct 18, 2018
27a45bf
feature/secrets: change variable to camel case; add worker secrets; a…
ameier38 Oct 22, 2018
4de5d5e
feature/secrets: change variables to camel case
ameier38 Oct 22, 2018
a41e3be
feature/secrets: update array spacing
ameier38 Oct 23, 2018
1a948c0
feature/secrets: update README
ameier38 Oct 23, 2018
d4563f6
feature/secrets: add newline to ingresses
ameier38 Oct 23, 2018
fdf13f6
feature/secrets: update values
ameier38 Oct 23, 2018
3bc6e0b
Fix apps/v1beta1
maver1ck Nov 10, 2018
5897408
Change minikube values
maver1ck Nov 10, 2018
d17932c
Update airflow chart with:
kppullin Nov 7, 2018
b4361ff
Bump versions
kppullin Nov 7, 2018
68c8013
'existingClaim' must not be declared for the pvc template to work
kppullin Nov 8, 2018
5dd7316
Support setting a custom postgresHost
kppullin Nov 8, 2018
bd7edb4
Fix indendation
maver1ck Nov 10, 2018
e5fed8e
OWNERS file
maver1ck Nov 11, 2018
4caa4d3
chore/cleanup: fix spelling
ameier38 Nov 12, 2018
9c44d79
Move to stable
gsemet Nov 30, 2018
18965c7
add std labels and component label
gsemet Nov 30, 2018
6260ce7
use template for chart label
gsemet Dec 3, 2018
d08d798
Fix reference to incubator
unguiculus Dec 4, 2018
1 change: 1 addition & 0 deletions stable/airflow/.helmignore
@@ -0,0 +1 @@
.git
14 changes: 14 additions & 0 deletions stable/airflow/Chart.yaml
@@ -0,0 +1,14 @@
description: Airflow is a platform to programmatically author, schedule and monitor workflows
name: airflow
version: 0.9.0
appVersion: 1.10.0
icon: https://airflow.apache.org/_images/pin_large.png
home: https://airflow.apache.org/
maintainers:
- name: gsemet
  email: gaetan@xeberon.net
sources:
- https://airflow.apache.org/
keywords:
- workflow
- dag
6 changes: 6 additions & 0 deletions stable/airflow/OWNERS
@@ -0,0 +1,6 @@
approvers:
- gsemet
- maver1ck
reviewers:
- gsemet
- maver1ck
264 changes: 264 additions & 0 deletions stable/airflow/README.md
@@ -0,0 +1,264 @@
# Airflow / Celery

[Airflow](https://airflow.apache.org/) is a platform to programmatically author, schedule and
monitor workflows.


## Install Chart

To install the Airflow Chart into your Kubernetes cluster:

```bash
helm install --namespace "airflow" --name "airflow" stable/airflow
```

After the installation succeeds, you can get the status of the Chart:

```bash
helm status "airflow"
```

If you want to delete your Chart, use this command:

```bash
helm delete --purge "airflow"
```

### Helm ingresses

The Chart provides ingress configuration that allows you to customize the installation by adapting
`values.yaml` to your setup.
Please read the comments in the `values.yaml` file for more details on how to configure your reverse
proxy or load balancer.

### Chart Prefix

This Helm chart automatically prefixes all names with the release name to avoid collisions.

### URL prefix

This chart exposes two endpoints:

- Airflow Web UI
- Flower, a debug UI for Celery

Both can be placed either at the root of a domain or at a sub path, for example:

```
http://mycompany.com/airflow/
http://mycompany.com/airflow/flower
```

NOTE: Mounting the Airflow UI under a subpath requires an Airflow version >= 2.0.x. For the moment
(June 2018) this is **not** available in an official package; you will have to use an image where
Airflow has been updated to its current HEAD. You can use the following image:
`stibbons31/docker-airflow-dev:2.0dev`. It is rebased regularly on top of the `puckel/docker-airflow`
image.

Please also note that the Airflow UI and Flower do not behave the same way:

- The Airflow Web UI behaves transparently: to configure it, one just needs to specify the
  `ingress.web.path` value.
- Flower cannot handle this scheme directly and requires a URL rewrite mechanism in front
  of it. In short, it is able to generate the right URLs in the returned HTML file but cannot
  respond to these URLs. This is common in software that wasn't intended to work under
  anything other than a root URL or a localhost port. To use it, see `values.yaml` for how
  to configure your ingress controller to rewrite the URL (or "strip" the prefix path).

Note: the unreleased version of Flower (as of June 2018) no longer needs the prefix-strip feature;
it is integrated in the `docker-airflow-dev:2.0dev` image.
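
As an illustration, here is a sketch of the relevant `values.yaml` settings, modelled on the
minikube example shipped with this chart; the Traefik annotations and the `mycompany.com` host are
placeholders for your own ingress controller and domain:

```yaml
ingress:
  enabled: true
  web:
    path: "/airflow"
    host: "mycompany.com"
    annotations:
      # the web UI handles the prefix itself
      traefik.frontend.rule.type: PathPrefix
      kubernetes.io/ingress.class: traefik
  flower:
    path: "/airflow/flower"
    host: "mycompany.com"
    annotations:
      # strip the prefix so Flower receives requests at the root URL
      traefik.frontend.rule.type: PathPrefixStrip
      kubernetes.io/ingress.class: traefik
```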

### Airflow configuration

`airflow.cfg` configuration can be changed by defining environment variables in the following form:
`AIRFLOW__<section>__<key>`.

See the
[Airflow documentation](http://airflow.readthedocs.io/en/latest/configuration.html?highlight=__CORE__#setting-configuration-options)
for more information.

This helm chart allows you to add these additional settings with the value key `airflow.config`.
You can also add generic environment variables such as proxy settings or a private PyPI index:

```yaml
airflow:
  config:
    AIRFLOW__CORE__EXPOSE_CONFIG: True
    PIP_INDEX_URL: http://pypi.mycompany.com/
    PIP_TRUSTED_HOST: pypi.mycompany.com
    HTTP_PROXY: http://proxy.mycompany.com:1234
    HTTPS_PROXY: http://proxy.mycompany.com:1234
```

If you are using a private image for your DAGs (see [Embedded DAGs](#embedded-dags))
or for use with the KubernetesPodOperator (available since version 1.10.0), then add
an image pull secret to the Airflow config:
```yaml
airflow:
  image:
    pullSecret: my-docker-repo-secret
```

### Worker Statefulset

Celery workers use a StatefulSet.
It is used to freeze their DNS names using a Kubernetes headless service, allowing the webserver to
request the logs from each worker individually.
This requires exposing a port (8793) and ensuring the pod DNS is accessible to the web server pod,
which is what the StatefulSet provides.
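
As a sketch of the related values (all keys are listed in the configuration table below), a
two-worker setup running two parallel Celery tasks each could look like:

```yaml
workers:
  enabled: true
  # number of worker pods managed by the StatefulSet
  replicas: 2
  celery:
    # number of parallel celery tasks per worker
    instances: 2
```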

#### Worker secrets

You can add Kubernetes secrets, which will be mounted as volumes on the worker pods
at `secretsDir/<secret name>`.
```yaml
workers:
  secretsDir: /var/airflow/secrets
  secrets:
    - redshift-user
    - redshift-password
    - elasticsearch-user
    - elasticsearch-password
```

With the above configuration, you could read the `redshift-user` secret
from within a DAG or other function using:
```python
from pathlib import Path

def get_secret(secret_name):
    secrets_dir = Path('/var/airflow/secrets')
    secret_path = secrets_dir / secret_name
    assert secret_path.exists(), f'could not find {secret_name} at {secret_path}'
    return secret_path.read_text().strip()

redshift_user = get_secret('redshift-user')
```

To create a secret, you can use:
```bash
$ kubectl create secret generic redshift-user --from-file=redshift-user="$HOME/secrets/redshift-user.txt"
```
Where `redshift-user.txt` contains the user secret as a single text string.

### Local binaries

Please note that a folder `~/.local/bin` will be automatically created and added to the PATH, so
that Bash operators can use command-line tools installed by `pip install --user`, for instance.

## DAGs Deployment

Several options are provided for synchronizing your Airflow DAGs.


### Mount a Shared Persistent Volume

You can store your DAG files on an external volume, and mount this volume into the relevant Pods
(scheduler, web, worker). In this scenario, your CI/CD pipeline should update the DAG files in the
PV.

Since all Pods should have the same collection of DAG files, it is recommended to create just one PV
that is shared. This ensures that the Pods are always in sync about the DagBag.

This is controlled by setting `persistence.enabled=true`. You will have to ensure yourself that the
PVC is shared properly between your pods:
- If you are on AWS, you can use [Elastic File System (EFS)](https://aws.amazon.com/efs/).
- If you are on Azure, you can use
[Azure File Storage (AFS)](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv).

To share a PV with multiple Pods, the PV needs to have accessMode 'ReadOnlyMany' or 'ReadWriteMany'.
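
For example, a minimal sketch reusing a pre-created shared claim (the claim name `airflow-dags` is
a placeholder):

```yaml
persistence:
  enabled: true
  # name of a PVC created beforehand, e.g. backed by EFS or AFS
  existingClaim: airflow-dags
  accessMode: ReadOnlyMany
```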

### Use init-container

If you set `dags.initContainer.enabled=true`, the pods will try upon startup to fetch the
git repository defined by `dags.git.url`, at ref `dags.git.ref`, and use it as the DAG folder.

You can also add a `requirements.txt` file at the root of your DAG project to have other
Python dependencies installed.

This is the easiest way of deploying your DAGs to Airflow.
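
A minimal sketch, assuming your DAGs live in a hypothetical `mycompany/airflow-dags` git repository:

```yaml
dags:
  initContainer:
    enabled: true
    # install the requirements.txt found at the root of the repository
    installRequirements: true
  git:
    url: https://github.com/mycompany/airflow-dags.git
    ref: master
```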

### Embedded DAGs

If you want more control over the way you deploy your DAGs, you can use embedded DAGs, where DAGs
are burned inside the Docker container deployed as the Scheduler and Workers.

Be aware that this requires more tooling than using a shared PVC or an init-container:

- your CI/CD should be able to build a new Docker image each time your DAGs are updated;
- your CI/CD should be able to control the deployment of this new image in your Kubernetes cluster.

Example procedure:

- Fork the [puckel/docker-airflow](https://github.com/puckel/docker-airflow) repository
- Place your DAGs inside the `dags` folder of the repository, and ensure your Python dependencies
  are installed (for example, by consuming a `requirements.txt` in your `Dockerfile`)
- Update the value of `airflow.image` in your `values.yaml` (as sketched below) and deploy to your
  Kubernetes cluster
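
A hedged sketch of the corresponding values, assuming a hypothetical `mycompany/docker-airflow`
fork and tag:

```yaml
airflow:
  image:
    # your fork of puckel/docker-airflow with the DAGs baked in (placeholder name)
    repository: mycompany/docker-airflow
    tag: 1.10.0-mydags
    pullPolicy: IfNotPresent
    # only needed if the image lives in a private registry
    pullSecret: my-docker-repo-secret
```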

## Helm chart Configuration

The following table lists the configurable parameters of the Airflow chart and their default values.

| Parameter | Description | Default |
|------------------------------------------|---------------------------------------------------------|---------------------------|
| `airflow.fernetKey` | Fernet key (see `values.yaml` for example) | (auto generated) |
| `airflow.service.type` | services type | `ClusterIP` |
| `airflow.executor` | the executor to run | `Celery` |
| `airflow.initRetryLoop` | max number of retries during container init | |
| `airflow.image.repository` | Airflow docker image | `puckel/docker-airflow` |
| `airflow.image.tag` | Airflow docker tag | `1.10.0-4` |
| `airflow.image.pullPolicy` | Image pull policy | `IfNotPresent` |
| `airflow.image.pullSecret` | Image pull secret | |
| `airflow.schedulerNumRuns` | -1 to loop indefinitely, 1 to restart after each exec | |
| `airflow.webReplicas` | how many replicas for web server | `1` |
| `airflow.config` | custom airflow configuration env variables | `{}` |
| `airflow.podDisruptionBudget` | control pod disruption budget | `{'maxUnavailable': 1}` |
| `workers.enabled` | enable workers | `true` |
| `workers.replicas` | number of workers pods to launch | `1` |
| `workers.resources` | custom resource configuration for worker pod | `{}` |
| `workers.celery.instances` | number of parallel celery tasks per worker | `1` |
| `workers.pod.annotations` | annotations for the worker pods | `{}` |
| `workers.secretsDir` | directory in which to mount secrets on worker pods | /var/airflow/secrets |
| `workers.secrets` | secrets to mount as volumes on worker pods | [] |
| `ingress.enabled` | enable ingress | `false` |
| `ingress.web.host` | hostname for the webserver ui | "" |
| `ingress.web.path` | path of the webserver ui (read `values.yaml`) | `` |
| `ingress.web.annotations` | annotations for the web ui ingress | `{}` |
| `ingress.web.tls.enabled` | enables TLS termination at the ingress | `false` |
| `ingress.web.tls.secretName` | name of the secret containing the TLS certificate & key | `` |
| `ingress.flower.host` | hostname for the flower ui | "" |
| `ingress.flower.path` | path of the flower ui (read `values.yaml`) | `` |
| `ingress.flower.livenessPath` | path to the liveness probe (read `values.yaml`) | `/` |
| `ingress.flower.annotations` | annotations for the web ui ingress | `{}` |
| `ingress.flower.tls.enabled` | enables TLS termination at the ingress | `false` |
| `ingress.flower.tls.secretName` | name of the secret containing the TLS certificate & key | `` |
| `persistence.enabled` | enable persistent storage for DAGs | `false` |
| `persistence.existingClaim` | if using an existing claim, specify the name here | `nil` |
| `persistence.storageClass` | Persistent Volume Storage Class | (undefined) |
| `persistence.accessMode` | PVC access mode | `ReadWriteOnce` |
| `persistence.size` | Persistent storage size request | `1Gi` |
| `dags.doNotPickle` | should the scheduler disable DAG pickling | `false` |
| `dags.path` | mount path for persistent volume | `/usr/local/airflow/dags` |
| `dags.initContainer.enabled` | Fetch the source code when the pod starts | `false` |
| `dags.initContainer.installRequirements` | auto install requirements.txt deps | `true` |
| `dags.git.url` | url to clone the git repository | nil |
| `dags.git.ref` | branch name, tag or sha1 to reset to | `master` |
| `rbac.create` | create RBAC resources | `true` |
| `serviceAccount.create` | create a service account | `true` |
| `serviceAccount.name` | the service account name | `` |
| `postgres.enabled` | create a postgres server | `true` |
| `postgres.uri` | full URL to custom postgres setup | (undefined) |
| `postgres.postgresHost` | PostgreSQL Hostname | (undefined) |
| `postgres.postgresUser` | PostgreSQL User | `postgres` |
| `postgres.postgresPassword` | PostgreSQL Password | `airflow` |
| `postgres.postgresDatabase` | PostgreSQL Database name | `airflow` |
| `postgres.persistence.enabled` | Enable Postgres PVC | `true` |
| `postgres.persistence.storageClass` | Persistent storage class | (undefined) |
| `postgres.persistence.accessMode` | Access mode | `ReadWriteOnce` |
| `redis.enabled` | Create a Redis cluster | `true` |
| `redis.password` | Redis password | `airflow` |
| `redis.master.persistence.enabled` | Enable Redis PVC | `false` |
| `redis.cluster.enabled` | enable master-slave cluster | `false` |
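
For instance, a hedged sketch pointing the chart at an external PostgreSQL server instead of the
bundled one (hostname and credentials are placeholders):

```yaml
postgres:
  # do not deploy the bundled postgres server
  enabled: false
  postgresHost: mydb.mycompany.com
  postgresUser: airflow
  postgresPassword: airflow
  postgresDatabase: airflow
```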

Full and up-to-date documentation can be found in the comments of the `values.yaml` file.
36 changes: 36 additions & 0 deletions stable/airflow/examples/minikube-values.yaml
@@ -0,0 +1,36 @@
airflow:
  image:
    repository: puckel/docker-airflow
    tag: 1.10.0-4
    pullPolicy: IfNotPresent
  service:
    type: NodePort
  webReplicas: 1
  config:
    AIRFLOW__CORE__LOGGING_LEVEL: DEBUG
    AIRFLOW__CORE__LOAD_EXAMPLES: True

workers:
  replicas: 1
  celery:
    instances: 1

ingress:
  enabled: true
  web:
    path: "/airflow"
    host: "minikube"
    annotations:
      traefik.frontend.rule.type: PathPrefix
      kubernetes.io/ingress.class: traefik
  flower:
    path: "/airflow/flower"
    host: "minikube"
    annotations:
      traefik.frontend.rule.type: PathPrefixStrip
      kubernetes.io/ingress.class: traefik

persistence:
  enabled: true
  accessMode: ReadWriteOnce
  size: 1Gi
9 changes: 9 additions & 0 deletions stable/airflow/requirements.yaml
@@ -0,0 +1,9 @@
dependencies:
- name: postgresql
  version: 0.13.1
  repository: https://kubernetes-charts.storage.googleapis.com/
  condition: postgresql.enabled
- name: redis
  version: 3.3.5
  repository: https://kubernetes-charts.storage.googleapis.com/
  condition: redis.enabled
30 changes: 30 additions & 0 deletions stable/airflow/templates/NOTES.txt
@@ -0,0 +1,30 @@
Congratulations. You have just deployed Apache Airflow

{{- if .Values.ingress.enabled }}
URL to Airflow and Flower:

- Web UI: http://{{ .Values.ingress.web.host }}{{ .Values.ingress.web.path }}/
- Flower: http://{{ .Values.ingress.flower.host }}{{ .Values.ingress.flower.path }}/

{{- else if contains "NodePort" .Values.airflow.service.type }}

1. Get the Airflow URL by running these commands:

  export NODE_PORT=$(kubectl get --namespace {{ .Release.Namespace }} -o jsonpath="{.spec.ports[0].nodePort}" services {{ template "airflow.fullname" . }})
  export NODE_IP=$(kubectl get nodes --namespace {{ .Release.Namespace }} -o jsonpath="{.items[0].status.addresses[0].address}")
  echo http://$NODE_IP:$NODE_PORT/

{{- else if contains "LoadBalancer" .Values.airflow.service.type }}

NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status of the service by running 'kubectl get svc -w {{ template "airflow.fullname" . }}'
  export SERVICE_IP=$(kubectl get svc --namespace {{ .Release.Namespace }} {{ template "airflow.fullname" . }} -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  echo http://$SERVICE_IP/

{{- else if contains "ClusterIP" .Values.airflow.service.type }}
  export POD_NAME=$(kubectl get pods --namespace {{ .Release.Namespace }} -l "component={{ .Values.airflow.name }}" -o jsonpath="{.items[0].metadata.name}")
  echo http://127.0.0.1:{{ .Values.airflow.externalPortHttp }}
  kubectl port-forward --namespace {{ .Release.Namespace }} $POD_NAME {{ .Values.airflow.externalPortHttp }}:{{ .Values.airflow.internalPortHttp }}

2. Open Airflow in your web browser
{{- end }}