
Update Airflow AWS MWAA deployment docs #3860

Merged
merged 15 commits on May 20, 2024
1 change: 1 addition & 0 deletions RELEASE.md
@@ -9,6 +9,7 @@

## Documentation changes
* Improved documentation for custom starters
* Added a new section on deploying a Kedro project on AWS Airflow MWAA

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
98 changes: 87 additions & 11 deletions docs/source/deployment/airflow.md
@@ -2,17 +2,23 @@

Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro, because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

## How to run a Kedro pipeline on Apache Airflow with Astronomer
## Introduction and strategy

The following tutorial shows how to deploy an example [Spaceflights Kedro project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) on [Apache Airflow](https://airflow.apache.org/) with [Astro CLI](https://docs.astronomer.io/astro/cli/overview), a command-line tool created by [Astronomer](https://www.astronomer.io/) that streamlines the creation of local Airflow projects. You will deploy it locally first, and then transition to Astro Cloud.
The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an [Airflow task](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) while the whole pipeline is converted to an [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html). This approach mirrors the principles of [running Kedro in a distributed environment](distributed.md).


[Astronomer](https://docs.astronomer.io/astro/install-cli) is a managed Airflow platform which allows users to spin up and run an Airflow cluster in production. It also provides a set of tools to help users get started with Airflow locally.
Each node will be executed within a new Kedro session, which implies that `MemoryDataset`s cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the [`DataCatalog`](https://docs.kedro.org/en/stable/data/index.html) and stored in persistent storage. This approach enables nodes to access the results from preceding nodes.
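
To make this concrete, the sketch below shows one way to wrap Kedro nodes as Airflow tasks: an operator that runs a single node in its own Kedro session, plus a DAG that creates one task per node. The `kedro-airflow` plugin used later in this guide generates a very similar file for you, so treat this as an illustration only; class names, node names, and paths are examples rather than exact generated output.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import BaseOperator
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


class KedroOperator(BaseOperator):
    """Run a single Kedro node as one Airflow task."""

    def __init__(self, package_name, pipeline_name, node_name, project_path, env, **kwargs):
        super().__init__(**kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context):
        # Each task starts a fresh Kedro session, so intermediate results must be
        # persisted through the DataCatalog rather than kept in MemoryDatasets.
        configure_project(self.package_name)
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])


with DAG(dag_id="new-kedro-project", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # One task per Kedro node; the node names below come from the spaceflights example.
    split, train = (
        KedroOperator(
            task_id=name,
            package_name="new_kedro_project",
            pipeline_name="__default__",
            node_name=name,
            project_path="/usr/local/airflow",  # project location inside the container (example)
            env="airflow",
        )
        for name in ("split_data_node", "train_model_node")
    )
    # Task dependencies mirror the dependencies between the Kedro nodes.
    split >> train
```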

### Strategy
This guide provides instructions for running a Kedro pipeline on different Airflow platforms. Use the links below to jump to the section that explains how to run a Kedro pipeline on:

The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an [Airflow task](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) while the whole pipeline is converted to an [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html). This approach mirrors the principles of [running Kedro in a distributed environment](distributed.md).
- [Apache Airflow with Astronomer](#how-to-run-a-kedro-pipeline-on-apache-airflow-with-astronomer)
- [Amazon AWS Managed Workflows for Apache Airflow (MWAA)](#how-to-run-a-kedro-pipeline-on-amazon-aws-managed-workflows-for-apache-airflow-mwaa)
- [Apache Airflow using a Kubernetes cluster](#how-to-run-a-kedro-pipeline-on-apache-airflow-using-a-kubernetes-cluster)

Each node will be executed within a new Kedro session, which implies that `MemoryDataset`s cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the [`DataCatalog`](https://docs.kedro.org/en/stable/data/index.html) and stored in persistent storage. This approach enables nodes to access the results from preceding nodes.
## How to run a Kedro pipeline on Apache Airflow with Astronomer

The following tutorial shows how to deploy an example [Spaceflights Kedro project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) on [Apache Airflow](https://airflow.apache.org/) with [Astro CLI](https://docs.astronomer.io/astro/cli/overview), a command-line tool created by [Astronomer](https://www.astronomer.io/) that streamlines the creation of local Airflow projects. You will deploy it locally first, and then transition to Astro Cloud.


[Astronomer](https://docs.astronomer.io/astro/install-cli) is a managed Airflow platform which allows users to spin up and run an Airflow cluster in production. It also provides a set of tools to help users get started with Airflow locally.


### Prerequisites

@@ -44,7 +50,7 @@
3. Open `conf/airflow/catalog.yml` to see the list of datasets used in the project. Note that additional intermediate datasets (`X_train`, `X_test`, `y_train`, `y_test`) are stored only in memory. You can locate these in the pipeline description under `/src/new_kedro_project/pipelines/data_science/pipeline.py`. To ensure these datasets are preserved and accessible across different tasks in Airflow, we need to include them in our `DataCatalog`. Instead of repeating similar code for each dataset, you can use [Dataset Factories](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), a special syntax that allows defining a catch-all pattern to overwrite the default `MemoryDataset` creation. Add this code to the end of the file:

```yaml
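# Catch-all pattern: any dataset not listed explicitly in this catalog is saved as a CSV
# under data/02_intermediate/ instead of defaulting to an in-memory dataset.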
{base_dataset}:
"{base_dataset}":
type: pandas.CSVDataset
filepath: data/02_intermediate/{base_dataset}.csv
```
@@ -74,7 +80,7 @@
kedro airflow create --target-dir=dags/ --env=airflow
```

This step should produce a `.py` file called `new_kedro_project_dag.py` located at `dags/`.
This step should produce a `.py` file called `new_kedro_project_airflow_dag.py` located at `dags/`.

### Deployment process with Astro CLI

@@ -102,15 +108,15 @@
cp -r new-kedro-project/conf kedro-airflow-spaceflights/conf
mkdir -p kedro-airflow-spaceflights/dist/
cp new-kedro-project/dist/new_kedro_project-0.1-py3-none-any.whl kedro-airflow-spaceflights/dist/
cp new-kedro-project/dags/new_kedro_project_dag.py kedro-airflow-spaceflights/dags/
cp new-kedro-project/dags/new_kedro_project_airflow_dag.py kedro-airflow-spaceflights/dags/
```

Feel free to copy `new-kedro-project` into `kedro-airflow-spaceflights` in its entirety if your project requires frequent updates, DAG recreation, and repackaging. This approach lets you work with the Kedro and Astro projects in a single folder, eliminating the need to copy Kedro files for each development iteration. However, be aware that both projects will share common files such as `requirements.txt`, `README.md`, and `.gitignore`.

4. Add a few lines to the `Dockerfile` located in the `kedro-airflow-spaceflights` folder to set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml` to enable custom logging in Kedro and to install the .whl file of our prepared Kedro project into the Airflow container:
4. Add a few lines to the `Dockerfile` located in the `kedro-airflow-spaceflights` folder. These set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml`, which enables custom logging in Kedro (from Kedro 0.19.6 onwards this is unnecessary, because Kedro uses `conf/logging.yml` by default), and install the `.whl` file of the prepared Kedro project into the Airflow container:


```Dockerfile
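# Point Kedro at the custom logging configuration and install the packaged Kedro project into the Airflow image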
ENV KEDRO_LOGGING_CONFIG="conf/logging.yml"
ENV KEDRO_LOGGING_CONFIG="conf/logging.yml" # This line is not needed from Kedro 0.19.6

RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
```
@@ -166,6 +172,76 @@

![](../meta/images/astronomer_cloud_deployment.png)

## How to run a Kedro pipeline on Amazon AWS Managed Workflows for Apache Airflow (MWAA)


### Kedro project preparation
MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA is similar to deploying with Astronomer, but with some key differences: you need to store your project data in an AWS S3 bucket and make the necessary changes to your `DataCatalog`. You must also configure how you upload your Kedro configuration, install your Kedro package, and set up the necessary environment variables.

1. Complete steps 1-4 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section.
2. Your project's data should not reside in the working directory of the Airflow container. Instead, [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [upload your data folder from the new-kedro-project folder to your S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html).
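
If you prefer to script the upload rather than click through the AWS console, a minimal `boto3` sketch could look like the one below. The bucket name is a placeholder, and the snippet assumes `boto3` is installed and your AWS credentials are configured locally:

```python
from pathlib import Path

import boto3  # assumes boto3 is installed and AWS credentials are configured

BUCKET = "your_S3_bucket"  # placeholder: replace with your bucket name
DATA_DIR = Path("new-kedro-project/data")

s3 = boto3.client("s3")
for path in DATA_DIR.rglob("*"):
    if path.is_file():
        # Mirror the local folder structure under the data/ prefix in the bucket,
        # matching the s3://<your_S3_bucket>/data/... paths used in the catalog.
        key = f"data/{path.relative_to(DATA_DIR).as_posix()}"
        s3.upload_file(str(path), BUCKET, key)
```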

3. Modify the `DataCatalog` to reference data in your S3 bucket by updating the `filepath` and adding a `credentials` line for each dataset in `new-kedro-project/conf/airflow/catalog.yml`. Add the S3 prefix to the `filepath` as shown below:

```yaml
companies:
type: pandas.CSVDataset
filepath: s3://<your_S3_bucket>/data/01_raw/companies.csv
credentials: dev_s3
```
4. [Set up AWS credentials](https://docs.aws.amazon.com/keyspaces/latest/devguide/access.credentials.html) to provide read and write access to your S3 bucket. Update `new-kedro-project/conf/local/credentials.yml` with your `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` and copy it to the `new-kedro-project/conf/airflow/` folder:
```yaml
dev_s3:
client_kwargs:
aws_access_key_id: *********************
aws_secret_access_key: ******************************************
```
**Contributor** commented:

I haven't tried this yet, but do the credentials in the local/credentials.yml work on the cloud platform? This file ideally shouldn't be uploaded to the cloud at all. Maybe setting up env variables?

**Contributor Author** replied:

I tried that approach with the local environment, and it works. We can change the environment from local to prod, but am I correct in understanding that there isn't a big difference?

For setting environment variables, I am using an env_var.sh file that is located in the same S3 bucket as the conf folder, which is currently packed into plugins.zip. Therefore, I don't think it makes much sense to use environment variables for security reasons because, ultimately, they will be stored in the same S3 bucket. However, this S3 bucket is secure and does not have public access, so I think it's not a problem.

5. Add `s3fs` to your project’s `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. Some libraries could cause dependency conflicts in the Airflow environment, so make sure to minimise the requirements list and avoid using `kedro-viz` and `pytest`.

```shell
s3fs
```

6. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG.
7. Update the DAG file `new_kedro_project_airflow_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf-new_kedro_project.tar.gz"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. This change is necessary because your Kedro configuration archive will be stored in the `plugins/` folder, not the root directory:
```python
def execute(self, context):
configure_project(self.package_name)
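        # conf_source points to the packaged Kedro configuration inside plugins.zip,
        # which is unpacked into the container's plugins/ folder.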
with KedroSession.create(project_path=self.project_path,
env=self.env, conf_source="plugins/conf-new_kedro_project.tar.gz") as session:
session.run(self.pipeline_name, node_names=[self.node_name])
```

### Deployment on MWAA

1. Archive three files into `plugins.zip` and upload it to `s3://your_S3_bucket`: `new_kedro_project-0.1-py3-none-any.whl` and `conf-new_kedro_project.tar.gz`, both located in `new-kedro-project/dist`, and `logging.yml`, located in `new-kedro-project/conf/`.
```shell
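# -j (junk paths) stores just the file names, so all three files sit at the top level of plugins.zip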
zip -j plugins.zip dist/new_kedro_project-0.1-py3-none-any.whl dist/conf-new_kedro_project.tar.gz conf/logging.yml
```
This archive will later be unpacked into the `plugins/` folder in the working directory of the Airflow container.

2. Create a new `requirements.txt` file, add the path where your Kedro project will be unpacked in the Airflow container, and upload `requirements.txt` to `s3://your_S3_bucket`:
```shell
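# Path of the packaged Kedro wheel after MWAA unpacks plugins.zip into the container's plugins/ folder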
./plugins/new_kedro_project-0.1-py3-none-any.whl
```
Libraries from `requirements.txt` will be installed during container initialisation.

3. Upload `new_kedro_project_airflow_dag.py` from the `new-kedro-project/dags` folder to `s3://your_S3_bucket/dags`.
4. Create an empty `startup.sh` file for container startup commands. Set an environment variable for custom Kedro logging:
```shell
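# startup.sh runs as each Airflow component starts; point Kedro at the logging config shipped in plugins.zip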
export KEDRO_LOGGING_CONFIG="plugins/logging.yml"
```
5. Set up a new [AWS MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html) using the following settings:
```shell
S3 Bucket:
  s3://your_S3_bucket
DAGs folder:
  s3://your_S3_bucket/dags
Plugins file - optional:
  s3://your_S3_bucket/plugins.zip
Requirements file - optional:
  s3://your_S3_bucket/requirements.txt
Startup script file - optional:
  s3://your_S3_bucket/startup.sh
```
On the next page, set the `Public network (Internet accessible)` option in the `Web server access` section if you want to access your Airflow UI from the internet. Continue with the default options on the subsequent pages.


6. Once the environment is created, use the `Open Airflow UI` button to access the standard Airflow interface, where you can manage your DAG.

## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster

The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a Docker image for pipeline execution. At present, the plugin is available only for versions of Kedro < 0.18.