diff --git a/RELEASE.md b/RELEASE.md
index 620abccf2c..0658fec658 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -9,6 +9,7 @@
## Documentation changes
* Improved documentation for custom starters
+* Added a new section on deploying a Kedro project on Amazon AWS Managed Workflows for Apache Airflow (MWAA)
## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md
index 3d1964186d..219626a3c5 100644
--- a/docs/source/deployment/airflow.md
+++ b/docs/source/deployment/airflow.md
@@ -2,17 +2,23 @@
Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro, because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
-## How to run a Kedro pipeline on Apache Airflow with Astronomer
+## Introduction and strategy
-The following tutorial shows how to deploy an example [Spaceflights Kedro project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) on [Apache Airflow](https://airflow.apache.org/) with [Astro CLI](https://docs.astronomer.io/astro/cli/overview), a command-line tool created by [Astronomer](https://www.astronomer.io/) that streamlines the creation of local Airflow projects. You will deploy it locally first, and then transition to Astro Cloud.
+The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an [Airflow task](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) while the whole pipeline is converted to an [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html). This approach mirrors the principles of [running Kedro in a distributed environment](distributed.md).
-[Astronomer](https://docs.astronomer.io/astro/install-cli) is a managed Airflow platform which allows users to spin up and run an Airflow cluster in production. Additionally, it also provides a set of tools to help users get started with Airflow locally in the easiest way possible.
+Each node will be executed within a new Kedro session, which implies that `MemoryDataset`s cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the [`DataCatalog`](https://docs.kedro.org/en/stable/data/index.html) and stored in persistent storage. This approach enables nodes to access the results from preceding nodes.
-### Strategy
+This guide provides instructions for running a Kedro pipeline on different Airflow platforms. Use the links below to jump to the section that shows how to run a Kedro pipeline on:
-The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an [Airflow task](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) while the whole pipeline is converted to an [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html). This approach mirrors the principles of [running Kedro in a distributed environment](distributed.md).
+- [Apache Airflow with Astronomer](#how-to-run-a-kedro-pipeline-on-apache-airflow-with-astronomer)
+- [Amazon AWS Managed Workflows for Apache Airflow (MWAA)](#how-to-run-a-kedro-pipeline-on-amazon-aws-managed-workflows-for-apache-airflow-mwaa)
+- [Apache Airflow using a Kubernetes cluster](#how-to-run-a-kedro-pipeline-on-apache-airflow-using-a-kubernetes-cluster)
-Each node will be executed within a new Kedro session, which implies that `MemoryDataset`s cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the [`DataCatalog`](https://docs.kedro.org/en/stable/data/index.html) and stored in persistent storage. This approach enables nodes to access the results from preceding nodes.
+## How to run a Kedro pipeline on Apache Airflow with Astronomer
+
+The following tutorial shows how to deploy an example [Spaceflights Kedro project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) on [Apache Airflow](https://airflow.apache.org/) with [Astro CLI](https://docs.astronomer.io/astro/cli/overview), a command-line tool created by [Astronomer](https://www.astronomer.io/) that streamlines the creation of local Airflow projects. You will deploy it locally first, and then transition to Astro Cloud.
+
+[Astronomer](https://docs.astronomer.io/astro/install-cli) is a managed Airflow platform which allows users to spin up and run an Airflow cluster in production. Additionally, it also provides a set of tools to help users get started with Airflow locally in the easiest way possible.
### Prerequisites
@@ -44,7 +50,7 @@ In this section, you will create a new Kedro project equipped with an example pi
3. Open `conf/airflow/catalog.yml` to see the list of datasets used in the project. Note that additional intermediate datasets (`X_train`, `X_test`, `y_train`, `y_test`) are stored only in memory. You can locate these in the pipeline description under `/src/new_kedro_project/pipelines/data_science/pipeline.py`. To ensure these datasets are preserved and accessible across different tasks in Airflow, we need to include them in our `DataCatalog`. Instead of repeating similar code for each dataset, you can use [Dataset Factories](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), a special syntax that allows defining a catch-all pattern to overwrite the default `MemoryDataset` creation. Add this code to the end of the file:
```yaml
-{base_dataset}:
+"{base_dataset}":
  type: pandas.CSVDataset
  filepath: data/02_intermediate/{base_dataset}.csv
```
@@ -74,7 +80,7 @@ This step should produce a wheel file called `new_kedro_project-0.1-py3-none-any.
kedro airflow create --target-dir=dags/ --env=airflow
```
-This step should produce a `.py` file called `new_kedro_project_dag.py` located at `dags/`.
+This step should produce a `.py` file called `new_kedro_project_airflow_dag.py` located at `dags/`.
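+Before moving on, it can help to know roughly what `kedro airflow create` puts into that file. The exact code depends on your `kedro-airflow` version, but conceptually the DAG wraps each Kedro node in an operator that runs it in its own Kedro session, as in the simplified, illustrative sketch below (the node names are a small subset from the Spaceflights pipeline, not the literal generated code):
+```python
+# Illustrative sketch of a kedro-airflow style DAG - not the exact generated file.
+from datetime import datetime
+from pathlib import Path
+
+from airflow.models import DAG, BaseOperator
+from kedro.framework.project import configure_project
+from kedro.framework.session import KedroSession
+
+
+class KedroOperator(BaseOperator):
+    """Runs a single Kedro node in its own Kedro session."""
+
+    def __init__(self, *, package_name, pipeline_name, node_name, project_path, env, **kwargs):
+        super().__init__(**kwargs)
+        self.package_name = package_name
+        self.pipeline_name = pipeline_name
+        self.node_name = node_name
+        self.project_path = project_path
+        self.env = env
+
+    def execute(self, context):
+        configure_project(self.package_name)
+        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
+            session.run(self.pipeline_name, node_names=[self.node_name])
+
+
+# Assumes Airflow 2.4+ (the `schedule` argument); older versions use `schedule_interval`.
+with DAG(dag_id="new-kedro-project", start_date=datetime(2023, 1, 1), schedule=None, catchup=False):
+    tasks = {
+        name: KedroOperator(
+            task_id=name.replace(".", "-"),
+            package_name="new_kedro_project",
+            pipeline_name="__default__",
+            node_name=name,
+            project_path=Path.cwd(),
+            env="airflow",
+        )
+        for name in ("split_data_node", "train_model_node", "evaluate_model_node")
+    }
+
+    # Task ordering follows the Kedro pipeline's node dependencies, for example:
+    tasks["split_data_node"] >> tasks["train_model_node"] >> tasks["evaluate_model_node"]
+```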
### Deployment process with Astro CLI
@@ -102,15 +108,15 @@ In this section, you will start by setting up a new blank Airflow project using
cp -r new-kedro-project/conf kedro-airflow-spaceflights/conf
mkdir -p kedro-airflow-spaceflights/dist/
cp new-kedro-project/dist/new_kedro_project-0.1-py3-none-any.whl kedro-airflow-spaceflights/dist/
- cp new-kedro-project/dags/new_kedro_project_dag.py kedro-airflow-spaceflights/dags/
+ cp new-kedro-project/dags/new_kedro_project_airflow_dag.py kedro-airflow-spaceflights/dags/
```
Feel free to completely copy `new-kedro-project` into `kedro-airflow-spaceflights` if your project requires frequent updates, DAG recreation, and repackaging. This approach allows you to work with kedro and astro projects in a single folder, eliminating the need to copy kedro files for each development iteration. However, be aware that both projects will share common files such as `requirements.txt`, `README.md`, and `.gitignore`.
-4. Add a few lines to the `Dockerfile` located in the `kedro-airflow-spaceflights` folder to set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml` to enable custom logging in Kedro and to install the .whl file of our prepared Kedro project into the Airflow container:
+4. Add a few lines to the `Dockerfile` located in the `kedro-airflow-spaceflights` folder to set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml`, which enables custom logging in Kedro (from Kedro 0.19.6 onwards this step is unnecessary, because Kedro uses the `conf/logging.yml` file by default), and to install the `.whl` file of our prepared Kedro project into the Airflow container:
```Dockerfile
-ENV KEDRO_LOGGING_CONFIG="conf/logging.yml"
+# The following line is not needed from Kedro 0.19.6
+ENV KEDRO_LOGGING_CONFIG="conf/logging.yml"
RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
```
@@ -166,6 +172,76 @@ astro deploy
![](../meta/images/astronomer_cloud_deployment.png)
+## How to run a Kedro pipeline on Amazon AWS Managed Workflows for Apache Airflow (MWAA)
+
+### Kedro project preparation
+MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA is similar to deploying with Astronomer, but there are some key differences: you need to store your project data in an AWS S3 bucket and make the necessary changes to your `DataCatalog`. Additionally, you must configure how you upload your Kedro configuration, install your Kedro package, and set up the necessary environment variables.
+1. Complete steps 1-4 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section.
+2. Your project's data should not reside in the working directory of the Airflow container. Instead, [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [upload your data folder from the new-kedro-project folder to your S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html).
+3. Modify the `DataCatalog` to reference data in your S3 bucket by updating the filepath and adding a credentials line for each dataset in `new-kedro-project/conf/airflow/catalog.yml`. Add the S3 prefix to the filepath as shown below; the catch-all pattern added earlier can be updated in the same way (see the example after this snippet):
+```yaml
+companies:
+  type: pandas.CSVDataset
+  filepath: s3://your_S3_bucket/data/01_raw/companies.csv
+  credentials: dev_s3
+```
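+For instance, the catch-all dataset factory from `conf/airflow/catalog.yml` can point at the same bucket, so that the intermediate datasets (`X_train`, `X_test`, `y_train`, `y_test`) are also persisted on S3. This is only a sketch: it assumes the bucket name `your_S3_bucket` and the `dev_s3` credentials entry described in the next step, so adjust both to your setup:
+```yaml
+# Sketch of a catch-all entry: every otherwise-undefined dataset is stored on S3
+"{base_dataset}":
+  type: pandas.CSVDataset
+  filepath: s3://your_S3_bucket/data/02_intermediate/{base_dataset}.csv
+  credentials: dev_s3
+```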
+4. [Set up AWS credentials](https://docs.aws.amazon.com/keyspaces/latest/devguide/access.credentials.html) to provide read and write access to your S3 bucket. Update `new-kedro-project/conf/local/credentials.yml` with your `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` and copy it to the `new-kedro-project/conf/airflow/` folder:
+```yaml
+dev_s3:
+  client_kwargs:
+    aws_access_key_id: *********************
+    aws_secret_access_key: ******************************************
+```
+5. Add `s3fs` to your project's `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. Some libraries could cause dependency conflicts in the Airflow environment, so make sure to minimise the requirements list and avoid using `kedro-viz` and `pytest`.
+```text
+s3fs
+```
+
+6. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG.
+7. Update the DAG file `new_kedro_project_airflow_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf-new_kedro_project.tar.gz"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. This change is necessary because your Kedro configuration archive will be stored in the `plugins/` folder, not the root directory:
+```python
+    def execute(self, context):
+        configure_project(self.package_name)
+        with KedroSession.create(project_path=self.project_path,
+                                 env=self.env, conf_source="plugins/conf-new_kedro_project.tar.gz") as session:
+            session.run(self.pipeline_name, node_names=[self.node_name])
+```
+
+### Deployment on MWAA
+1. Archive three files into `plugins.zip`: `new_kedro_project-0.1-py3-none-any.whl` and `conf-new_kedro_project.tar.gz`, both located in `new-kedro-project/dist`, and `logging.yml`, located in `new-kedro-project/conf/`. Then upload `plugins.zip` to `s3://your_S3_bucket`.
+```shell
+zip -j plugins.zip dist/new_kedro_project-0.1-py3-none-any.whl dist/conf-new_kedro_project.tar.gz conf/logging.yml
+```
+This archive will later be unpacked to the `/plugins` folder in the working directory of the Airflow container.
+
+2. Create a new `requirements.txt` file, add the path where your Kedro project will be unpacked in the Airflow container, and upload `requirements.txt` to `s3://your_S3_bucket`:
+```text
+./plugins/new_kedro_project-0.1-py3-none-any.whl
+```
+Libraries from `requirements.txt` will be installed during container initialisation.
+
+3. Upload `new_kedro_project_airflow_dag.py` from the `new-kedro-project/dags` folder to `s3://your_S3_bucket/dags`.
+4. Create a `startup.sh` file for container startup commands, set an environment variable in it to enable custom Kedro logging, and upload the file to `s3://your_S3_bucket`:
+```shell
+export KEDRO_LOGGING_CONFIG="plugins/logging.yml"
+```
+5. Set up a new [AWS MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html) using the following settings:
+```text
+S3 Bucket:
+  s3://your_S3_bucket
+DAGs folder:
+  s3://your_S3_bucket/dags
+Plugins file - optional:
+  s3://your_S3_bucket/plugins.zip
+Requirements file - optional:
+  s3://your_S3_bucket/requirements.txt
+Startup script file - optional:
+  s3://your_S3_bucket/startup.sh
+```
+On the next page, set the `Public network (Internet accessible)` option in the `Web server access` section if you want to access your Airflow UI from the internet. Continue with the default options on the subsequent pages. (If you prefer to script the environment creation with the AWS CLI instead of the console, see the sketch below.)
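+The console settings above can also be expressed as a single AWS CLI call. Treat the command below as a sketch rather than a drop-in command: the account ID, execution role, subnets and security groups are placeholders, and some flags (for example the startup script option) are only available in recent AWS CLI versions, so check `aws mwaa create-environment help` for your installation:
+```shell
+# Sketch only - the ARNs, subnet IDs and security group IDs below are placeholders.
+aws mwaa create-environment \
+  --name new-kedro-project-airflow \
+  --source-bucket-arn "arn:aws:s3:::your_S3_bucket" \
+  --dag-s3-path "dags" \
+  --plugins-s3-path "plugins.zip" \
+  --requirements-s3-path "requirements.txt" \
+  --startup-script-s3-path "startup.sh" \
+  --execution-role-arn "arn:aws:iam::111122223333:role/your-mwaa-execution-role" \
+  --network-configuration "SubnetIds=subnet-aaaa1111,subnet-bbbb2222,SecurityGroupIds=sg-cccc3333" \
+  --webserver-access-mode PUBLIC_ONLY
+```
+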
+6. Once the environment is created, use the `Open Airflow UI` button to access the standard Airflow interface, where you can manage your DAG.
+
## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster

The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only.