From 272c51ea9375632f56bd456b9215049c8b8e1528 Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Wed, 8 May 2024 16:55:33 +0100 Subject: [PATCH 01/11] Update airflow MWAA deployment docs Signed-off-by: Dmitry Sorokin --- docs/source/deployment/airflow.md | 55 +++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index 3d1964186d..4122ec1986 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -166,6 +166,61 @@ astro deploy ![](../meta/images/astronomer_cloud_deployment.png) +## How to run a Kedro pipeline on Amazon AWS Managed Workflows for Apache Airflow (MWAA) + +### Kedro project preparation +MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA is similar to Astronomer, but there are some key differences: you need to store your project data in an AWS S3 bucket and make necessary changes to your Data Catalog. Additionally, you must configure how you upload your Kedro configuration, install your Kedro package, and set up the necessary environment variables. +1. Complete steps 1-4 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section. +2. Your project's data should not reside in the working directory of the Airflow container. Instead, [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [upload your data folder from the new-kedro-project folder to your S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html). +3. Modify the Data Catalog to reference data in your S3 bucket by updating the filepath and add credentials line for each Dataset in the `new-kedro-project/conf/airflow/catalog.yml`. Add the S3 prefix to the filepath as shown below: +```shell +companies: + type: pandas.CSVDataset + filepath: s3://your_S3_bucket/data/01_raw/companies.csv + credentials: dev_s3 +``` +4. [Set up AWS credentials](https://docs.aws.amazon.com/keyspaces/latest/devguide/access.credentials.html) to provide read and write access to your S3 bucket. Update `new-kedro-project/conf/local/credentials.yml` with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: +```shell +dev_s3: + client_kwargs: + aws_access_key_id: ********************* + aws_secret_access_key: ****************************************** +``` +5. Add s3fs to your project’s `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. +```shell +s3fs +``` + +6. Archive your `conf` folder into a `conf.zip` file and upload it to `s3://your_S3_bucket` for later use in the Airflow container. +7. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section + +### Deployment on AWAA +1. Upload your `new_kedro_project-0.1-py3-none-any.whl` from `new-kedro-project/dist` to a new S3 bucket and [provide public access to that file](https://repost.aws/knowledge-center/read-access-objects-s3-bucket). Use the `Copy URL` button in AWS Console to retrieve the public URL for file access, it will look like `https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl`. +2. 
Create a new `requirements.txt` file listing dependencies required for the Airflow container, including a link to the Kedro wheel file from Step 1, and upload it to `s3://your_S3_bucket`. +```shell +new_kedro_project @ https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl +``` +3. Upload `new_kedro_project_dag.py` from the `new-kedro-project/dags` to `s3://your_S3_bucket/dags`. +4. Create an empty `startup.sh` file for container startup commands. Set an environment variable for custom Kedro logging: +```shell +export KEDRO_LOGGING_CONFIG="plugins/conf/logging.yml" +``` +5. Set up a new [AWS MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html) using the following settings: +```shell +S3 Bucket: + s3://your_S3_bucket +DAGs folder + s3://your_S3_bucket/dags +Plugins file - optional + s3://your_S3_bucket/conf.zip +Requirements file - optional + s3://your_S3_bucket/requrements.txt +Startup script file - optional + s3://your_S3_bucket/startup.sh +``` +Continue with the default options on subsequent pages. +6. Once the environment is created, use the `Open Airflow UI` button to access the standard Airflow interface, where you can manage your DAG. + ## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only. From 9fa15c438943b7bec2cf949595010e580c2a3678 Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Wed, 8 May 2024 17:01:59 +0100 Subject: [PATCH 02/11] Linting Signed-off-by: Dmitry Sorokin --- docs/source/deployment/airflow.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index 4122ec1986..233efa4977 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -170,7 +170,7 @@ astro deploy ### Kedro project preparation MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA is similar to Astronomer, but there are some key differences: you need to store your project data in an AWS S3 bucket and make necessary changes to your Data Catalog. Additionally, you must configure how you upload your Kedro configuration, install your Kedro package, and set up the necessary environment variables. -1. Complete steps 1-4 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section. +1. Complete steps 1-4 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section. 2. Your project's data should not reside in the working directory of the Airflow container. Instead, [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [upload your data folder from the new-kedro-project folder to your S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html). 3. Modify the Data Catalog to reference data in your S3 bucket by updating the filepath and add credentials line for each Dataset in the `new-kedro-project/conf/airflow/catalog.yml`. 
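For the upload in step 2, the AWS CLI is one convenient route. A minimal sketch, assuming the CLI is installed and configured with credentials for your account, and using the `your_S3_bucket` placeholder from this section:
```shell
# Copy the project's data folder to S3, preserving the data/01_raw, data/02_intermediate, ... layout
aws s3 cp new-kedro-project/data s3://your_S3_bucket/data --recursive
```
The S3 console upload linked above works just as well; the only requirement is that the bucket layout matches the catalog entries that follow.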
Add the S3 prefix to the filepath as shown below: ```shell @@ -195,7 +195,7 @@ s3fs 7. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section ### Deployment on AWAA -1. Upload your `new_kedro_project-0.1-py3-none-any.whl` from `new-kedro-project/dist` to a new S3 bucket and [provide public access to that file](https://repost.aws/knowledge-center/read-access-objects-s3-bucket). Use the `Copy URL` button in AWS Console to retrieve the public URL for file access, it will look like `https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl`. +1. Upload your `new_kedro_project-0.1-py3-none-any.whl` from `new-kedro-project/dist` to a new S3 bucket and [provide public access to that file](https://repost.aws/knowledge-center/read-access-objects-s3-bucket). Use the `Copy URL` button in AWS Console to retrieve the public URL for file access, it will look like `https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl`. 2. Create a new `requirements.txt` file listing dependencies required for the Airflow container, including a link to the Kedro wheel file from Step 1, and upload it to `s3://your_S3_bucket`. ```shell new_kedro_project @ https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl @@ -204,7 +204,7 @@ new_kedro_project @ https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com 4. Create an empty `startup.sh` file for container startup commands. Set an environment variable for custom Kedro logging: ```shell export KEDRO_LOGGING_CONFIG="plugins/conf/logging.yml" -``` +``` 5. Set up a new [AWS MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html) using the following settings: ```shell S3 Bucket: From aeb01105518bbda90032689136819ace6779721c Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Wed, 8 May 2024 19:34:04 +0100 Subject: [PATCH 03/11] DAG Modification Signed-off-by: Dmitry Sorokin --- docs/source/deployment/airflow.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index 233efa4977..5f9ae37d14 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -191,8 +191,16 @@ dev_s3: s3fs ``` -6. Archive your `conf` folder into a `conf.zip` file and upload it to `s3://your_S3_bucket` for later use in the Airflow container. -7. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section +6. Archive your `conf` folder into a `conf.zip` file and upload it to `s3://your_S3_bucket` for later use in the Airflow container. This file will be unzipped into the `plugins` folder within the MWAA Airflow container. +7. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG. +8. Modify the DAG file `new_kedro_project_dag.py` located in the `dags/` folder by adding `, conf_source="plugins/conf"` to the Kedro session creation in the Kedro operator execution function. 
This change is necessary because your Kedro configuration folder will be stored in the `plugins/conf` folder, not the root directory:
+```shell
+    def execute(self, context):
+        configure_project(self.package_name)
+        with KedroSession.create(project_path=self.project_path,
+                                 env=self.env, conf_source="plugins/conf") as session:
+            session.run(self.pipeline_name, node_names=[self.node_name])
+```
 
 ### Deployment on AWAA
 1. Upload your `new_kedro_project-0.1-py3-none-any.whl` from `new-kedro-project/dist` to a new S3 bucket and [provide public access to that file](https://repost.aws/knowledge-center/read-access-objects-s3-bucket). Use the `Copy URL` button in AWS Console to retrieve the public URL for file access, it will look like `https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl`.

From 3b3b198a8b8be9e7d60ec28042f952bfade17135 Mon Sep 17 00:00:00 2001
From: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Date: Mon, 13 May 2024 10:51:01 +0100
Subject: [PATCH 04/11] Apply suggestions from code review

Co-authored-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
---
 docs/source/deployment/airflow.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md
index 5f9ae37d14..f1b1f6d81d 100644
--- a/docs/source/deployment/airflow.md
+++ b/docs/source/deployment/airflow.md
@@ -176,7 +176,7 @@ MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it e
 ```shell
 companies:
   type: pandas.CSVDataset
-  filepath: s3://your_S3_bucket/data/01_raw/companies.csv
+  filepath: s3://<your_S3_bucket>/data/01_raw/companies.csv
   credentials: dev_s3
 ```
 4. [Set up AWS credentials](https://docs.aws.amazon.com/keyspaces/latest/devguide/access.credentials.html) to provide read and write access to your S3 bucket. Update `new-kedro-project/conf/local/credentials.yml` with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
@@ -193,7 +193,7 @@ s3fs
 
 6. Archive your `conf` folder into a `conf.zip` file and upload it to `s3://your_S3_bucket` for later use in the Airflow container. This file will be unzipped into the `plugins` folder within the MWAA Airflow container.
 7. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG.
-8. Modify the DAG file `new_kedro_project_dag.py` located in the `dags/` folder by adding `, conf_source="plugins/conf"` to the Kedro session creation in the Kedro operator execution function. This change is necessary because your Kedro configuration folder will be stored in the `plugins/conf` folder, not the root directory:
+8. Update the DAG file `new_kedro_project_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. 
This change is necessary because your Kedro configuration folder will be stored in the `plugins/conf` folder, not the root directory: ```shell def execute(self, context): configure_project(self.package_name) From 62c917fb3c2e758d075440e502719e0da0a289a5 Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Wed, 15 May 2024 16:58:00 +0100 Subject: [PATCH 05/11] PR comments Signed-off-by: Dmitry Sorokin --- docs/source/deployment/airflow.md | 34 +++++++++++++++---------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index f1b1f6d81d..99e8a582c0 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -44,7 +44,7 @@ In this section, you will create a new Kedro project equipped with an example pi 3. Open `conf/airflow/catalog.yml` to see the list of datasets used in the project. Note that additional intermediate datasets (`X_train`, `X_test`, `y_train`, `y_test`) are stored only in memory. You can locate these in the pipeline description under `/src/new_kedro_project/pipelines/data_science/pipeline.py`. To ensure these datasets are preserved and accessible across different tasks in Airflow, we need to include them in our `DataCatalog`. Instead of repeating similar code for each dataset, you can use [Dataset Factories](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), a special syntax that allows defining a catch-all pattern to overwrite the default `MemoryDataset` creation. Add this code to the end of the file: ```yaml -{base_dataset}: +"{base_dataset}": type: pandas.CSVDataset filepath: data/02_intermediate/{base_dataset}.csv ``` @@ -74,7 +74,7 @@ This step should produce a wheel file called `new_kedro_project-0.1-py3-none-any kedro airflow create --target-dir=dags/ --env=airflow ``` -This step should produce a `.py` file called `new_kedro_project_dag.py` located at `dags/`. +This step should produce a `.py` file called `new_kedro_project_airflow_dag.py` located at `dags/`. ### Deployment process with Astro CLI @@ -102,15 +102,15 @@ In this section, you will start by setting up a new blank Airflow project using cp -r new-kedro-project/conf kedro-airflow-spaceflights/conf mkdir -p kedro-airflow-spaceflights/dist/ cp new-kedro-project/dist/new_kedro_project-0.1-py3-none-any.whl kedro-airflow-spaceflights/dist/ - cp new-kedro-project/dags/new_kedro_project_dag.py kedro-airflow-spaceflights/dags/ + cp new-kedro-project/dags/new_kedro_project_airflow_dag.py kedro-airflow-spaceflights/dags/ ``` Feel free to completely copy `new-kedro-project` into `kedro-airflow-spaceflights` if your project requires frequent updates, DAG recreation, and repackaging. This approach allows you to work with kedro and astro projects in a single folder, eliminating the need to copy kedro files for each development iteration. However, be aware that both projects will share common files such as `requirements.txt`, `README.md`, and `.gitignore`. -4. Add a few lines to the `Dockerfile` located in the `kedro-airflow-spaceflights` folder to set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml` to enable custom logging in Kedro and to install the .whl file of our prepared Kedro project into the Airflow container: +4. 
Add a few lines to the `Dockerfile` located in the `kedro-airflow-spaceflights` folder to set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml` to enable custom logging in Kedro (note that from Kedro 0.19.6 onwards, this step is unnecessary because Kedro uses the `conf/logging.yml` file by default) and to install the .whl file of our prepared Kedro project into the Airflow container: ```Dockerfile -ENV KEDRO_LOGGING_CONFIG="conf/logging.yml" +ENV KEDRO_LOGGING_CONFIG="conf/logging.yml" # This line is not needed from Kedro 0.19.6 RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl ``` @@ -186,32 +186,31 @@ dev_s3: aws_access_key_id: ********************* aws_secret_access_key: ****************************************** ``` -5. Add s3fs to your project’s `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. +5. Add `s3fs` to your project’s `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. Some libraries could cause dependency conflicts in the Airflow environment, so make sure to minimise the list and avoid using `kedro-viz` and `pytest`. ```shell s3fs ``` -6. Archive your `conf` folder into a `conf.zip` file and upload it to `s3://your_S3_bucket` for later use in the Airflow container. This file will be unzipped into the `plugins` folder within the MWAA Airflow container. -7. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG. -8. Update the DAG file `new_kedro_project_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. This change is necessary because your Kedro configuration folder will be stored in the `plugins/conf` folder, not the root directory: +6. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG. +7. Update the DAG file `new_kedro_project_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf-new_kedro_project.tar.gz"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. This change is necessary because your Kedro configuration archive will be stored in the `plugins/` folder, not the root directory: ```shell def execute(self, context): configure_project(self.package_name) with KedroSession.create(project_path=self.project_path, - env=self.env, conf_source="plugins/conf") as session: + env=self.env, conf_source="plugins/conf-new_kedro_project.tar.gz") as session: session.run(self.pipeline_name, node_names=[self.node_name]) ``` ### Deployment on AWAA -1. Upload your `new_kedro_project-0.1-py3-none-any.whl` from `new-kedro-project/dist` to a new S3 bucket and [provide public access to that file](https://repost.aws/knowledge-center/read-access-objects-s3-bucket). Use the `Copy URL` button in AWS Console to retrieve the public URL for file access, it will look like `https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl`. -2. Create a new `requirements.txt` file listing dependencies required for the Airflow container, including a link to the Kedro wheel file from Step 1, and upload it to `s3://your_S3_bucket`. +1. 
Archive your three files: `new_kedro_project-0.1-py3-none-any.wh`l and `conf-new_kedro_project.tar.gz` located in `new-kedro-project/dist`, and `logging.yml` located in `new-kedro-project/conf/` into a file called `plugins.zip` and upload it to `s3://your_S3_bucket`. This archive will be later unpacked to the `/plugins` folder in the working directory of the Airflow container. +2. Create a new `requirements.txt` file, add the command to install your Kedro project archived in the previous step, and upload it to `s3://your_S3_bucket`: ```shell -new_kedro_project @ https://your_new_public_s3_bucket.s3.eu-west-1.amazonaws.com/new_kedro_project-0.1-py3-none-any.whl +./plugins/new_kedro_project-0.1-py3-none-any.whl ``` -3. Upload `new_kedro_project_dag.py` from the `new-kedro-project/dags` to `s3://your_S3_bucket/dags`. +3. Upload `new_kedro_project_airflow_dag.py` from the `new-kedro-project/dags` to `s3://your_S3_bucket/dags`. 4. Create an empty `startup.sh` file for container startup commands. Set an environment variable for custom Kedro logging: ```shell -export KEDRO_LOGGING_CONFIG="plugins/conf/logging.yml" +export KEDRO_LOGGING_CONFIG="plugins/logging.yml" ``` 5. Set up a new [AWS MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html) using the following settings: ```shell @@ -220,13 +219,14 @@ S3 Bucket: DAGs folder s3://your_S3_bucket/dags Plugins file - optional - s3://your_S3_bucket/conf.zip + s3://your_S3_bucket/plugins.zip Requirements file - optional s3://your_S3_bucket/requrements.txt Startup script file - optional s3://your_S3_bucket/startup.sh ``` -Continue with the default options on subsequent pages. +On the next page, set the `Public network (Internet accessible)` option in the `Web server access` section if you want to access your Airflow UI from the internet. Continue with the default options on the subsequent pages. + 6. Once the environment is created, use the `Open Airflow UI` button to access the standard Airflow interface, where you can manage your DAG. ## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster From 36a64a6b6fbc9b28be1b5554a0284136d122d12c Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Thu, 16 May 2024 10:10:58 +0100 Subject: [PATCH 06/11] Fix credentials Signed-off-by: Dmitry Sorokin --- docs/source/deployment/airflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index 99e8a582c0..10134a7e8a 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -179,7 +179,7 @@ companies: filepath: s3:///data/01_raw/companies.csv credentials: dev_s3 ``` -4. [Set up AWS credentials](https://docs.aws.amazon.com/keyspaces/latest/devguide/access.credentials.html) to provide read and write access to your S3 bucket. Update `new-kedro-project/conf/local/credentials.yml` with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: +4. [Set up AWS credentials](https://docs.aws.amazon.com/keyspaces/latest/devguide/access.credentials.html) to provide read and write access to your S3 bucket. Update `new-kedro-project/conf/local/credentials.yml` with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and copy it to the `new-kedro-project/conf/airflow/` folder: ```shell dev_s3: client_kwargs: @@ -192,7 +192,7 @@ s3fs ``` 6. 
Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG. -7. Update the DAG file `new_kedro_project_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf-new_kedro_project.tar.gz"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. This change is necessary because your Kedro configuration archive will be stored in the `plugins/` folder, not the root directory: +7. Update the DAG file `new_kedro_project_airflow_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf-new_kedro_project.tar.gz"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. This change is necessary because your Kedro configuration archive will be stored in the `plugins/` folder, not the root directory: ```shell def execute(self, context): configure_project(self.package_name) From 65c27710d9beb769f4351d076f8cfae9096d9996 Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com> Date: Fri, 17 May 2024 10:54:01 +0100 Subject: [PATCH 07/11] Apply suggestions from code review Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com> --- docs/source/deployment/airflow.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index 10134a7e8a..24eb681275 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -169,10 +169,10 @@ astro deploy ## How to run a Kedro pipeline on Amazon AWS Managed Workflows for Apache Airflow (MWAA) ### Kedro project preparation -MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA is similar to Astronomer, but there are some key differences: you need to store your project data in an AWS S3 bucket and make necessary changes to your Data Catalog. Additionally, you must configure how you upload your Kedro configuration, install your Kedro package, and set up the necessary environment variables. +MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA is similar to Astronomer, but there are some key differences: you need to store your project data in an AWS S3 bucket and make necessary changes to your `DataCatalog`. Additionally, you must configure how you upload your Kedro configuration, install your Kedro package, and set up the necessary environment variables. 1. Complete steps 1-4 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section. 2. Your project's data should not reside in the working directory of the Airflow container. Instead, [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [upload your data folder from the new-kedro-project folder to your S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html). -3. Modify the Data Catalog to reference data in your S3 bucket by updating the filepath and add credentials line for each Dataset in the `new-kedro-project/conf/airflow/catalog.yml`. 
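Before updating the catalog, it can be worth confirming that the upload from step 2 landed where the entries will point. A quick check — assuming the AWS CLI and the same `your_S3_bucket` placeholder used throughout this section:
```shell
# List the raw data layer that the catalog entries below will reference
aws s3 ls s3://your_S3_bucket/data/01_raw/
```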
Add the S3 prefix to the filepath as shown below: +3. Modify the `DataCatalog` to reference data in your S3 bucket by updating the filepath and add credentials line for each dataset in `new-kedro-project/conf/airflow/catalog.yml`. Add the S3 prefix to the filepath as shown below: ```shell companies: type: pandas.CSVDataset @@ -186,7 +186,7 @@ dev_s3: aws_access_key_id: ********************* aws_secret_access_key: ****************************************** ``` -5. Add `s3fs` to your project’s `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. Some libraries could cause dependency conflicts in the Airflow environment, so make sure to minimise the list and avoid using `kedro-viz` and `pytest`. +5. Add `s3fs` to your project’s `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. Some libraries could cause dependency conflicts in the Airflow environment, so make sure to minimise the requirements list and avoid using `kedro-viz` and `pytest`. ```shell s3fs ``` @@ -202,7 +202,7 @@ s3fs ``` ### Deployment on AWAA -1. Archive your three files: `new_kedro_project-0.1-py3-none-any.wh`l and `conf-new_kedro_project.tar.gz` located in `new-kedro-project/dist`, and `logging.yml` located in `new-kedro-project/conf/` into a file called `plugins.zip` and upload it to `s3://your_S3_bucket`. This archive will be later unpacked to the `/plugins` folder in the working directory of the Airflow container. +1. Archive your three files: `new_kedro_project-0.1-py3-none-any.whl` and `conf-new_kedro_project.tar.gz` located in `new-kedro-project/dist`, and `logging.yml` located in `new-kedro-project/conf/` into a file called `plugins.zip` and upload it to `s3://your_S3_bucket`. This archive will be later unpacked to the `/plugins` folder in the working directory of the Airflow container. 2. Create a new `requirements.txt` file, add the command to install your Kedro project archived in the previous step, and upload it to `s3://your_S3_bucket`: ```shell ./plugins/new_kedro_project-0.1-py3-none-any.whl From 4593bdde90fa69adc73d3a3a5f0baf80e6df4f83 Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Fri, 17 May 2024 11:15:55 +0100 Subject: [PATCH 08/11] Fix comments Signed-off-by: Dmitry Sorokin --- docs/source/deployment/airflow.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index 24eb681275..7eb764bbcb 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -202,11 +202,18 @@ s3fs ``` ### Deployment on AWAA -1. Archive your three files: `new_kedro_project-0.1-py3-none-any.whl` and `conf-new_kedro_project.tar.gz` located in `new-kedro-project/dist`, and `logging.yml` located in `new-kedro-project/conf/` into a file called `plugins.zip` and upload it to `s3://your_S3_bucket`. This archive will be later unpacked to the `/plugins` folder in the working directory of the Airflow container. -2. Create a new `requirements.txt` file, add the command to install your Kedro project archived in the previous step, and upload it to `s3://your_S3_bucket`: +1. Archive your three files: `new_kedro_project-0.1-py3-none-any.whl` and `conf-new_kedro_project.tar.gz` located in `new-kedro-project/dist`, and `logging.yml` located in `new-kedro-project/conf/` into a file called `plugins.zip` and upload it to `s3://your_S3_bucket`. 
+```shell +zip -j plugins.zip dist/new_kedro_project-0.1-py3-none-any.whl dist/conf-new_kedro_project.tar.gz conf/logging.yml +``` +This archive will be later unpacked to the `/plugins` folder in the working directory of the Airflow container. + +2. Create a new `requirements.txt` file, add the path where your Kedro project will be unpacked in the Airflow container, and upload `requirements.txt` to `s3://your_S3_bucket`: ```shell ./plugins/new_kedro_project-0.1-py3-none-any.whl ``` +Libraries from `requirements.txt` will be installed during container initialization. + 3. Upload `new_kedro_project_airflow_dag.py` from the `new-kedro-project/dags` to `s3://your_S3_bucket/dags`. 4. Create an empty `startup.sh` file for container startup commands. Set an environment variable for custom Kedro logging: ```shell From a3e4f5c875d1990565d8d1739a93fe40fba37b9e Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Fri, 17 May 2024 11:38:38 +0100 Subject: [PATCH 09/11] Add introduction Signed-off-by: Dmitry Sorokin --- docs/source/deployment/airflow.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index 7eb764bbcb..d4c9f8c3a1 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -2,17 +2,23 @@ Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro, because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph). -## How to run a Kedro pipeline on Apache Airflow with Astronomer +## Introduction and strategy -The following tutorial shows how to deploy an example [Spaceflights Kedro project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) on [Apache Airflow](https://airflow.apache.org/) with [Astro CLI](https://docs.astronomer.io/astro/cli/overview), a command-line tool created by [Astronomer](https://www.astronomer.io/) that streamlines the creation of local Airflow projects. You will deploy it locally first, and then transition to Astro Cloud. +The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an [Airflow task](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) while the whole pipeline is converted to an [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html). This approach mirrors the principles of [running Kedro in a distributed environment](distributed.md). -[Astronomer](https://docs.astronomer.io/astro/install-cli) is a managed Airflow platform which allows users to spin up and run an Airflow cluster in production. Additionally, it also provides a set of tools to help users get started with Airflow locally in the easiest way possible. +Each node will be executed within a new Kedro session, which implies that `MemoryDataset`s cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the [`DataCatalog`](https://docs.kedro.org/en/stable/data/index.html) and stored in persistent storage. This approach enables nodes to access the results from preceding nodes. -### Strategy +This guide provides instructions on running a Kedro pipeline on different Airflow platforms. 
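Whichever platform you choose, the conversion itself rests on the same two steps used throughout this guide: packaging the Kedro project and generating an Airflow DAG from it. A minimal sketch of those commands, run from the project root — the exact flags and the `airflow` configuration environment are introduced in the sections below, and `kedro airflow create` assumes the `kedro-airflow` plugin is installed:
```shell
# Build the project wheel (and a packaged conf archive) in dist/, then convert the pipeline to an Airflow DAG
kedro package
kedro airflow create --target-dir=dags/ --env=airflow
```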
You can jump to the specific sections by clicking the links below, how to run a Kedro pipeline on: -The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an [Airflow task](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) while the whole pipeline is converted to an [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html). This approach mirrors the principles of [running Kedro in a distributed environment](distributed.md). +- [Apache Airflow with Astronomer](#how-to-run-a-kedro-pipeline-on-apache-airflow-with-astronomer) +- [Amazon AWS Managed Workflows for Apache Airflow (MWAA)](#how-to-run-a-kedro-pipeline-on-amazon-aws-managed-workflows-for-apache-airflow-mwaa) +- [Apache Airflow using a Kubernetes cluster](#how-to-run-a-kedro-pipeline-on-apache-airflow-using-a-kubernetes-cluster) -Each node will be executed within a new Kedro session, which implies that `MemoryDataset`s cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the [`DataCatalog`](https://docs.kedro.org/en/stable/data/index.html) and stored in persistent storage. This approach enables nodes to access the results from preceding nodes. +## How to run a Kedro pipeline on Apache Airflow with Astronomer + +The following tutorial shows how to deploy an example [Spaceflights Kedro project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) on [Apache Airflow](https://airflow.apache.org/) with [Astro CLI](https://docs.astronomer.io/astro/cli/overview), a command-line tool created by [Astronomer](https://www.astronomer.io/) that streamlines the creation of local Airflow projects. You will deploy it locally first, and then transition to Astro Cloud. + +[Astronomer](https://docs.astronomer.io/astro/install-cli) is a managed Airflow platform which allows users to spin up and run an Airflow cluster in production. Additionally, it also provides a set of tools to help users get started with Airflow locally in the easiest way possible. ### Prerequisites From bc2d40ed4003dc4486563d5b7b7672c37ddda636 Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com> Date: Fri, 17 May 2024 14:06:16 +0100 Subject: [PATCH 10/11] Update docs/source/deployment/airflow.md Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com> --- docs/source/deployment/airflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md index d4c9f8c3a1..219626a3c5 100644 --- a/docs/source/deployment/airflow.md +++ b/docs/source/deployment/airflow.md @@ -218,7 +218,7 @@ This archive will be later unpacked to the `/plugins` folder in the working dire ```shell ./plugins/new_kedro_project-0.1-py3-none-any.whl ``` -Libraries from `requirements.txt` will be installed during container initialization. +Libraries from `requirements.txt` will be installed during container initialisation. 3. Upload `new_kedro_project_airflow_dag.py` from the `new-kedro-project/dags` to `s3://your_S3_bucket/dags`. 4. Create an empty `startup.sh` file for container startup commands. 
Set an environment variable for custom Kedro logging: From e3d38f1fbe69e9a85f5101d0eb4e29902582e7ca Mon Sep 17 00:00:00 2001 From: Dmitry Sorokin Date: Mon, 20 May 2024 16:56:58 +0100 Subject: [PATCH 11/11] Update README.md Signed-off-by: Dmitry Sorokin --- RELEASE.md | 1 + 1 file changed, 1 insertion(+) diff --git a/RELEASE.md b/RELEASE.md index 620abccf2c..0658fec658 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -9,6 +9,7 @@ ## Documentation changes * Improved documentation for custom starters +* Added a new section on deploying Kedro project on AWS Airflow MWAA ## Community contributions Many thanks to the following Kedroids for contributing PRs to this release: