Clash is a simple Python library for running jobs on Google Compute Engine. Typical use cases are batch jobs that require very specific hardware configurations at runtime (e.g. multiple GPUs for model training). The library offers the following features:
- Automatic management of compute resources (allocation and deallocation)
- Definition of jobs via custom Docker images
- Fine-grained cost-control by optionally using preemptible VMs
- An easy-to-use Python API including operators for Apache Airflow
There are several ways to run jobs on the Google Cloud Platform (GCP), each with its own pros and cons. For example, Google's ML Engine can now run dockerized jobs as well and, because of its excellent integration with the GCP ecosystem, should definitely be considered before using Clash. In fact, the development of Clash started at a time when the options for running jobs on GCP were very limited.
On the other hand, Clash can still drastically reduce costs by offering the option to use preemptible VMs. In practice, jobs can usually be made robust against sudden preemptions, for example by continuously storing checkpoints. Clash automatically attempts to restart preempted machines, so jobs can load their latest checkpoint and continue where they were interrupted. This way, we save up to 80% of our compute costs.
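As a rough illustration, a preemption-tolerant job might be structured like the following sketch. The checkpoint path and work loop are hypothetical and not part of Clash; the pattern only shows how a restarted container can pick up where it left off:
```python
import os
import pickle

# Hypothetical checkpoint location, e.g. a mounted volume or a path
# synced to a GCS bucket; not provided by Clash itself.
CHECKPOINT_PATH = "/checkpoints/state.pkl"

def load_checkpoint():
    """Resume from the latest checkpoint if the VM was preempted and restarted."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def save_checkpoint(state):
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(state, f)

state = load_checkpoint()
for step in range(state["step"], 10_000):
    # ... one unit of work, e.g. a single training step ...
    state["step"] = step + 1
    if step % 100 == 0:
        save_checkpoint(state)  # frequent checkpoints make preemption harmless
```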
Clash has the following requirements:
- Python >= 3.7
Because Clash uses the Google Cloud SDK, you first have to set up your local environment to access GCP; please refer to the gcloud documentation for details. In addition, Clash requires the following IAM roles to run correctly:
- roles/pubsub.editor # Clash uses PubSub to communicate with its VMs
- roles/compute.instanceAdmin.* # Clash needs to be able to create custom VMs on the project
If Clash is supposed to create VMs that run as a service account, you must also grant the roles/iam.serviceAccountUser role.
Note that Clash VMs also need permission to delete themselves and to delete/publish to PubSub topics in order to work correctly.
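As an example, such roles can be granted with gcloud. The project ID and service account e-mail below are placeholders, and roles/compute.instanceAdmin.v1 is one concrete choice from the instanceAdmin family:
```bash
# Placeholders: replace the project ID and service account with your own.
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:clash-runner@my-gcp-project.iam.gserviceaccount.com" \
    --role="roles/pubsub.editor"
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:clash-runner@my-gcp-project.iam.gserviceaccount.com" \
    --role="roles/compute.instanceAdmin.v1"
```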
Clash can be installed via pip:
```bash
$ pip install pyclash
```
The following example runs a simple command on a preemptible n1-standard-1 instance:
```python
from pyclash.clash import JobConfigBuilder, Job

JOB_CONFIG = (
    JobConfigBuilder()
    .project_id("my-gcp-project")
    .image("google/cloud-sdk:latest")
    .machine_type("n1-standard-1")
    .subnetwork("default")
    .preemptible(True)
    .build()
)

result = Job(job_config=JOB_CONFIG, name_prefix="myjob").run(
    ["echo", "hello world"], wait_for_result=True
)

if result["status"] != 0:
    raise ValueError(f"The command failed with status code {result['status']}")
```
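Since jobs are defined via Docker images, the same API works with your own images. The image name and command below are placeholders; it is assumed the image has been pushed to a registry your project can pull from (e.g. GCR):
```python
from pyclash.clash import JobConfigBuilder, Job

# Placeholder image: any image your project can pull, e.g. one pushed to GCR.
TRAINING_CONFIG = (
    JobConfigBuilder()
    .project_id("my-gcp-project")
    .image("eu.gcr.io/my-gcp-project/trainer:latest")
    .machine_type("n1-standard-8")
    .subnetwork("default")
    .preemptible(True)
    .build()
)

# Hypothetical training entrypoint inside the image.
Job(job_config=TRAINING_CONFIG, name_prefix="training").run(
    ["python", "train.py", "--epochs", "10"], wait_for_result=True
)
```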
By default, Clash runs VMs with the Compute Engine default service account.

Clash can also be used in Cloud Composer. To deploy the Airflow operators, run:
```bash
COMPOSER_ENVIRONMENT="mycomposer-env" \
COMPOSER_LOCATION="europe-west1" ./run.sh deploy-airflow-plugin
```
Note that the pyclash package must be available in the Composer environment in order to use the Clash operators (the official documentation describes how to install custom packages from PyPI). The following example shows how to run jobs using Clash's ComputeEngineJobOperator:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators import ComputeEngineJobOperator
from pyclash.clash import JobConfigBuilder

JOB_CONFIG = (
    JobConfigBuilder()
    .project_id("my-gcp-project")
    .image("google/cloud-sdk:latest")
    .machine_type("n1-standard-32")
    .subnetwork("default")
    .build()
)

with DAG(
    "dag_id",
    start_date=datetime(2018, 10, 1),
    catchup=False,
) as dag:
    task_run_script = ComputeEngineJobOperator(
        cmd="echo hello",
        job_config=JOB_CONFIG,
        name_prefix="myjob",
        task_id="run_script_task",
    )
```
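The operator behaves like any other Airflow task, so it can be wired into a larger DAG as usual. The second task below is a hypothetical follow-up step, defined inside the same with DAG block as above:
```python
    # Hypothetical downstream task inside the same `with DAG` block.
    task_evaluate = ComputeEngineJobOperator(
        cmd="echo evaluate",
        job_config=JOB_CONFIG,
        name_prefix="myjob-eval",
        task_id="evaluate_task",
    )

    # Standard Airflow dependency: run the script first, then evaluate.
    task_run_script >> task_evaluate
```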
The best way to start working on Clash is to first install all the dependencies and run the tests:
```bash
pipenv install --dev
./run.sh unit-test
GCP_PROJECT_ID='your-project-id' ./run.sh integration-test
```