Initialize the environment as described in the repo root README.
List the container images built during environment initialization to confirm availability:
# Using local setup
docker images | grep extract_load_trips_from_tlc_to_gs
# Using cloud setup
gcloud artifacts docker images list $GCP_CONTAINER_REGISTRY_URL | grep extract_load_trips_from_tlc_to_gs
Set the container image on the orchestrator instance for the "extract_load_trips_from_tlc_to_gs" DAGs:
# Using local setup
IMAGE=extract_load_trips_from_tlc_to_gs:latest
airflow variables set docker_image_extract_load_trips_from_tlc_to_gs $IMAGE
# Using cloud setup
IMAGE=us-central1-docker.pkg.dev/dtc-de-project-383119/airflow-docker-operators/extract_load_trips_from_tlc_to_gs:latest
gcloud composer environments update "$COMPOSER_ENV_NAME" \
--location "$COMPOSER_ENV_LOCATION" \
--update-env-variables=AIRFLOW_VAR_DOCKER_IMAGE_EXTRACT_LOAD_TRIPS_FROM_TLC_TO_GS=$IMAGE
Example DAG configuration for a single lightweight task:
{
  "cloud_run_jobs_parent": null,
  "data_bucket_name": null,
  "vehicle_types": ["green"],
  "years": [2023]
}
Example DAG configuration for eight dynamic tasks:
{
  "cloud_run_jobs_parent": null,
  "data_bucket_name": null,
  "vehicle_types": ["green", "yellow"],
  "years": [2019, 2020, 2021, 2022]
}
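The number of dynamic tasks is the size of the cross product of vehicle_types and years, so the configurations above yield 1 and 8 mapped tasks respectively. A minimal sketch of that expansion (the expand_tasks helper is hypothetical, for illustration only):

```python
from itertools import product

def expand_tasks(conf):
    # One mapped task per (vehicle_type, year) pair, mirroring the DAG's
    # dynamic task expansion over the configuration lists.
    return [
        {"vehicle_type": v, "year": y}
        for v, y in product(conf["vehicle_types"], conf["years"])
    ]

print(len(expand_tasks({"vehicle_types": ["green"], "years": [2023]})))  # 1
print(len(expand_tasks({"vehicle_types": ["green", "yellow"],
                        "years": [2019, 2020, 2021, 2022]})))            # 8
```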
Notes on environment variables:
cloud_run_jobs_parent and data_bucket_name are parameters defaulting to Airflow variables set as environment variables on the orchestration instance, in both the local and cloud setups. You can check the AIRFLOW_VAR_CLOUD_RUN_JOBS_PARENT and AIRFLOW_VAR_DATA_BUCKET_NAME values in the environment file generated by the initialization script; see the repo root README. The values of these Airflow variables may be overridden in ad-hoc/manual DagRuns as DAG parameters, but they cannot be modified from the Airflow UI, because Airflow variables defined through environment variables are not visible there. Read https://airflow.apache.org/docs/apache-airflow/stable/howto/variable.html#storing-variables-in-environment-variables
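The fallback behaviour can be sketched as follows; resolve_param is a hypothetical helper mirroring how Airflow resolves AIRFLOW_VAR_-prefixed environment variables, and the bucket value is a made-up placeholder:

```python
import os

def resolve_param(conf_value, var_name):
    # A DagRun parameter, when provided, overrides the default; otherwise
    # fall back to the AIRFLOW_VAR_-prefixed environment variable,
    # mirroring Airflow's environment-variable lookup for variables.
    if conf_value is not None:
        return conf_value
    return os.environ[f"AIRFLOW_VAR_{var_name.upper()}"]

# Placeholder value; in practice the initialization script sets this.
os.environ["AIRFLOW_VAR_DATA_BUCKET_NAME"] = "example-bucket"

print(resolve_param(None, "data_bucket_name"))              # example-bucket
print(resolve_param("override-bucket", "data_bucket_name")) # override-bucket
```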
DAG execution isolation is implemented using a Cloud Run Job deployed through a virtual environment Airflow operator that requires the google-cloud-run package.
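A minimal sketch of such a task callable, assuming the Cloud Run Jobs v2 client; the function name and job resource path are illustrative, and the imports sit inside the callable because the virtualenv operator serializes the function and installs its requirements (here, google-cloud-run) into an isolated environment:

```python
def run_extract_load_job(job_name: str):
    # Imports must live inside the callable: the virtualenv operator executes
    # the function in a separate environment where google-cloud-run is installed.
    from google.cloud import run_v2

    client = run_v2.JobsClient()
    # job_name format: projects/{project}/locations/{location}/jobs/{job}
    operation = client.run_job(name=job_name)
    return operation.result()  # blocks until the job execution finishes
```

In the DAG this callable would be wrapped in a PythonVirtualenvOperator with requirements=["google-cloud-run"], keeping the Google client libraries out of the orchestrator's own environment.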
TODO: investigate why splitting the steps into three separate tasks raises an error. Use the single-task DAG as a temporary workaround.
2023-05-18: added support for Cloud Batch, similar to Cloud Run Job