Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Airflow Astro deployment docs #3792

Merged
merged 29 commits into from
Apr 19, 2024

Conversation

DimedS
Copy link
Contributor

@DimedS DimedS commented Apr 8, 2024

Description

This PR addresses issues 1, 2, 3, 5, and 6 from 605:

  1. The architecture of the Astro deployment manual has been restructured into two stages: preparation of the Kedro project and Astro deployment. Redundancies have been eliminated, and the manual is now updated and fully operational.
  2. The kedro-airflow-k8s plugin has been relocated to the end of the document because it is compatible only with Kedro versions earlier than 0.17.
  3. The astro-airflow-iris starter has been replaced with the standard spaceflights-pandas example pipeline.
  4. Customisation of logging has been introduced.
  5. Instructions for transferring results back from the container have been added.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS linked an issue Apr 8, 2024 that may be closed by this pull request
@DimedS DimedS requested a review from ankatiyar April 8, 2024 15:50
DimedS and others added 2 commits April 8, 2024 18:54
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fantastic start @DimedS ⭐ I have left some comments inline, but in general I think it would be useful to add some more context and explanations about why certain steps are needed. Also a minor note on referencing Kedro class names, when talking about a class in Kedro e.g. DataCatalog the best practice would be to use the exact class name or the regular english name for it e.g. data catalog, but not a hybrid like Data Catalog.

docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved

2. Create a new Kedro project using the `astro-airflow-iris` starter. You can use the default value in the project creation process:
1. To create a new Kedro project, select the `example=yes` option to include example code. Additionally, to implement custom logging, select `tools=log`. Proceed with the default project name, but feel free to add any other tools as desired:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this requires a bit more of an intro to explain why the user needs the select the example project and the custom logging. Perhaps instead of immediately starting with the steps, give a bit of an intro explaining what will happen in the subsequent steps and what the expected end result is: e.g. a Kedro spaceflights project with logging setup that can be run on airflow.

docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
* **Step 2.3**: Modify the `Dockerfile` to have the following content:
This step should produce a .py file called `new_kedro_project_dag.py` located at `dags/`.

### Deployment process with Astro Airflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would again add a bit of an introduction paragraph before diving into the steps to explain what we'll be doing in this part of the tutorial.

astro dev init
```

2. The folder `kedro-airflow-spaceflights` will be executed within the Airflow container. To run our Kedro project there, we need to copy several items from the previous sections into it: the `/data` folder from Step 1, the `/conf` folder from Steps 2-4, the `.whl` file from Step 5, and the Airflow DAG from Step 6:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth explaining why these items need to be copied over, so that users will have a better understanding of how things work when they're running a different project than spaceflights on airflow.

docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
DimedS and others added 9 commits April 9, 2024 11:34
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS
Copy link
Contributor Author

DimedS commented Apr 9, 2024

This is a fantastic start @DimedS ⭐ I have left some comments inline, but in general I think it would be useful to add some more context and explanations about why certain steps are needed. Also a minor note on referencing Kedro class names, when talking about a class in Kedro e.g. DataCatalog the best practice would be to use the exact class name or the regular english name for it e.g. data catalog, but not a hybrid like Data Catalog.

Thank you, @merelcht ! I agree and hope I've addressed your comments.

@DimedS DimedS requested a review from merelcht April 9, 2024 15:28
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow_astronomer.md Outdated Show resolved Hide resolved

In this section, we'll start by setting up a new blank Airflow project using Astro. We'll then copy the files prepared in the previous section from our Kedro project. Next, we'll customize the Dockerfile to enhance logging capabilities and manage the installation of our Kedro package. Finally, we will run and explore the Airflow cluster.

1. [Initialise an Airflow project with Astro](https://docs.astronomer.io/astro/cli/develop-project) in a new folder outside of your Kedro project. Let's call it `kedro-airflow-spaceflights`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed through the tutorial and got stuck at this step because I didn't have the astro CLI installed. It would be good to mention it needs to be installed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merelcht , we have a section Prerequisites, where that is mentioned, do you think that it's not enough? Also when you will click the link and try to install Astro CLI it will be mentioned there that you need Docker for that:

Prerequisites
To follow this tutorial, ensure you have the following:

The Astro CLI installed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I read over the Astro CLI part. I think it won't hurt to mention again at this point in the tutorial you need the Astro CLI and Docker. I would personally also mention Docker explicitly in the prereqs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've included an additional note about Docker and the Astro CLI at the start of section 2 and also mentioned Docker in the prerequisites.

cp new-kedro-project/dags/new_kedro_project_dag.py kedro-airflow-spaceflights/dags/
```

3. Add a few lines to the `Dockerfile` to set the environment variable `KEDRO_LOGGING_CONFIG` to point to `conf/logging.yml` to enable custom logging in Kedro and to install the .whl file of our prepared Kedro project into the Airflow container:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should specify which Dockerfile you mean, because at this point we've created new-kedro-project and kedro-airflow-spaceflights

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, fixed


RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
```

#### Step 3. Convert the Kedro pipeline into an Airflow DAG with `kedro airflow`
4. Navigate to `kedro-airflow-spaceflights` folder and launch the local Airflow cluster with Astronomer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got stuck here as well, because I didn't have docker installed. So again it would be good to give some guidance or mention beforehand that users need to have astro and docker installed.


RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
```

#### Step 3. Convert the Kedro pipeline into an Airflow DAG with `kedro airflow`
4. Navigate to `kedro-airflow-spaceflights` folder and launch the local Airflow cluster with Astronomer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got stuck here as well, because I didn't have docker installed. So again it would be good to give some guidance or mention beforehand that users need to have astro and docker installed.

DimedS and others added 7 commits April 10, 2024 11:29
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS marked this pull request as ready for review April 11, 2024 11:52
@DimedS DimedS requested a review from yetudada as a code owner April 11, 2024 11:52
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a more thorough review of the prose and the workflow. Fantastic job @DimedS ! Left a few comments

@@ -0,0 +1,181 @@
# Apache Airflow

Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the future: we should have an entry for DAGs in our Glossary! https://docs.kedro.org/en/stable/resources/glossary.html

docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow.md Show resolved Hide resolved
docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
astro dev init
```

2. The folder `kedro-airflow-spaceflights` will be executed within the Airflow container. To run our Kedro project there, we need to copy several items from the previous section into it:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. The folder `kedro-airflow-spaceflights` will be executed within the Airflow container. To run our Kedro project there, we need to copy several items from the previous section into it:
2. The folder `kedro-airflow-spaceflights` will be executed within the Airflow container. To run the Kedro project there, you need to copy several items from the previous section into it:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if there's a simpler workflow that doesn't involve copying files around. Asking because, during development, I understand that the user would want to update the configuration several times, and this involves repeating the generating of the wheel and the copying process every time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a really good question. Previously, the manual described fully moving the Kedro project into the Astro project, causing them to share the same folder. This seems a bit risky to me because the two projects begin to share the same files, such as the README, requirements, and .gitignore. And it wasn't clear exactly what needed to be transferred from the Kedro project to enable it to run in Astro, so I clarified that. However, it makes sense to keep them together for the reasons you mentioned—it will be easier to work with, update Kedro, repackage, and rerun Astro from the same folder. But in this case, we should be cautious about managing these common files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's best to keep the manual as it is to clearly illustrate how Kedro and Astro are integrated, enhancing a user's understanding. However, we could add a recommendation at the end: "Feel free to combine them in the same folder if your project requires frequent updates, DAG recreation, and repackaging." What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added


## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster

The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that GiD was going to deprecate this plugin as the same could be already achieved (not as easily unfortunately) with the official plugin. @Lasica

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue the conversation in kedro-org/kedro-plugins#652 then, to see to what extent the official plugin can bridge the gap.

DimedS and others added 6 commits April 12, 2024 10:14
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS
Copy link
Contributor Author

DimedS commented Apr 15, 2024

I believe I've addressed most of the comments. Could you please do a final check, @ankatiyar , @merelcht , @astrojuanlu ?

@astrojuanlu
Copy link
Member

@sbrugman By any chance do you have a moment to give this a look?

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of nitpicks but other than that this looks good to me!

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a small nitpicks as well, but other than that very happy with these docs! Great work @DimedS 👍

Considering this is quite a substantial re-write, I'd also add it to the release notes.

docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the default dataset factory is the only thing that needs to be updated!

Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

## How to run a Kedro pipeline on Apache Airflow with Astronomer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For users who wish to deploy on using Atronomer this tutorial is clear. For users that have a different Airflow provider, it should be mentioned which steps are not relevant and that Kedro Airflow is not specific to astronomer.

In the logical flow (reading top to bottom), this information should be placed above Astronomer specific instructions

Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was following along the tutorial and came across v minor things

docs/source/deployment/airflow.md Outdated Show resolved Hide resolved
docs/source/deployment/airflow.md Show resolved Hide resolved
ankatiyar and others added 4 commits April 19, 2024 12:12
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@ankatiyar ankatiyar enabled auto-merge (squash) April 19, 2024 11:49
@ankatiyar ankatiyar merged commit 144323a into main Apr 19, 2024
10 checks passed
@ankatiyar ankatiyar deleted the 605-kedro-airflow-revise-airflow-deployment-manual branch April 19, 2024 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kedro-airflow: revise Airflow deployment manual
6 participants