# README (Ignore if you are running on Mac/Linux)

If you are running on Windows, make sure you have started the Jupyter Notebook in a Bash shell.
Moreover, all the requirements below must be installed in this Bash (compatible) shell.

This can be achieved as follows:

1. Enable and install WSL(2) for Windows 10/11, see the [official documentation](https://docs.microsoft.com/en-us/windows/wsl/install).
    * On newer builds of Windows 10/11 you can install WSL by running the following command in an *administrator* PowerShell terminal. By default, this installs an Ubuntu instance of WSL.
    ```bash
    wsl --install
    ```
2. Start the Ubuntu Bash shell by searching for `Bash` under Start, or by running `bash` in a (normal) PowerShell terminal.

Using the Bash terminal started in step 2 above, you can install the requirements described below as if you were running Linux (Ubuntu/Debian).
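
To double-check which environment your shell is running in before installing anything, a small helper such as the one below can be used. This is a minimal sketch: the function name `detect_environment` is our own, and it assumes that WSL kernels mention `microsoft` in `/proc/version`, which is commonly the case but not guaranteed.

```bash
# Report which kind of environment the current shell runs in.
detect_environment() {
  local kernel
  kernel="$(uname -s)"
  case "$kernel" in
    Linux)
      # Assumption: WSL kernels mention "microsoft" in /proc/version.
      if grep -qi microsoft /proc/version 2>/dev/null; then
        echo "wsl"
      else
        echo "linux"
      fi
      ;;
    Darwin) echo "macos" ;;
    *)      echo "unsupported: $kernel" ;;
  esac
}

detect_environment
```

If this prints `unsupported`, you are most likely not in a Bash-compatible shell as described above.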

## Requirements
These requirements may also be installed on Windows; however, development has only been tested on Linux/macOS.

Before we get started, make sure to install all the required tools. We provide two lists below: one for setting up the testbed, and one for developing code to use with the testbed. Feel free to skip the second list for now and return to it at a later point in time.


### Deployment

 > ⚠️ All dependencies must be installed in a Bash-compatible shell. Windows users, also see [above](#read-me).

Make sure to install a recent version of each of the dependencies.


 * (Windows only) Install every dependency in a Windows Subsystem for Linux (WSL) Bash shell (see also the README above).
 * GCloud SDK
    - Follow the installation instructions [here](https://cloud.google.com/sdk/docs/install). Use either the Linux installation instructions or the instructions specific to your OS/distribution.
    - Initialize the SDK with `gcloud init`. If prompted, you may skip setting/creating a default/first project.
    - ⚠️ Run the command `gcloud auth application-default login`
        - ℹ️ We need to run this command to utilize your login credentials programmatically with terraform. This is needed as we will use these to impersonate a service account during the creation and setup of the Kubernetes cluster.
    - ⚠️ Run the command `gcloud components install beta`
        - ℹ️ We need to run this command to list the billing account IDs and enable billing. Currently, these features fall under beta access.
    - ⚠️ Run the command `gcloud components install gke-gcloud-auth-plugin`
        - ℹ️ We need to run this command to retrieve cluster configurations (to be used by `kubectl` and `helm`)
    - ⚠️ Run the command `gcloud auth configure-docker`
        - ℹ️ We need to run this command to push container images with docker to your project's container registry
 * Kubectl (>= 1.22.0)
 * Helm (>= 3.9.4)
 * Terraform (>= 1.2.8)
 * Python 3.9 or 3.10
   * jupyter, ipython, bash_kernel
```bash
pip3 install -r requirements-jupyter.txt
python3 -m bash_kernel.install
```
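
To verify that the tools listed above are available in your shell, something along the following lines can help. This is only a sketch: the helper names `version_ge` and `check_tools` are our own, and `version_ge` assumes a `sort` that supports version sorting via `-V` (as GNU coreutils does).

```bash
# Return 0 when version $1 is greater than or equal to version $2.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print every tool from the argument list that is missing from PATH.
# Returns non-zero when at least one tool is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools gcloud kubectl helm terraform python3 \
  || echo "Install the missing tools before continuing."

# Example comparison against the minimum Terraform version listed above.
version_ge "1.2.9" "1.2.8" && echo "1.2.9 satisfies >= 1.2.8"
```

The installed version of each tool can be obtained with its own version command and passed as the first argument to `version_ge`.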

### Development
For development, the following tools are needed/recommended:

 * Docker (>= 18.09).
    - If you don't have experience with using Docker, we recommend following [this](https://docs.docker.com/get-started/) tutorial.
 * Python3.9
 * pip3
 * JetBrains PyCharm

# Preparation

To make sure we can request resources on Google Cloud Platform (GCP), perform the following:

1. Create a GCP account on [https://cloud.google.com](https://cloud.google.com), using a Google account.
2. Redeem your academic coupon on GCP (see Brightspace for information on obtaining the \\$50 academic coupon), or use the free \\$300 credits for new users provided by Google.


3. Make sure to use the `Bash` kernel, not a Python or other kernel. On Windows machines, make sure to launch the `jupyter notebook` server from a Bash-compatible command line. We recommend Windows Subsystem for Linux.

⚠️ Make sure to run this Notebook within a cloned repository, not standalone/downloaded from GitHub.


# Deployment

⚠️ This notebook assumes that commands are executed in order. Executing the provided commands multiple times should not result in issues. However, re-running cells with `cd` commands, or altering cells (other than variables as instructed) may result in unexpected behaviour.

## Getting started

First, we will set a few variables used **throughout** the project. We set them in this notebook for convenience, but they are also set to some example default values in configuration files for the project. If you change any of these, make sure to change the corresponding variables as well in:

* [`../terraform/terraform-gke/variables.tf`](../terraform/terraform-gke/variables.tf)
* [`../terraform/terraform-dependencies/variables.tf`](../terraform/terraform-dependencies/variables.tf)


> ⚠️ If you change the `PROJECT_ID` parameter to a unique project name, also change the `project_id` variable in the files listed above. This allows you to run `terraform apply` without having to override the default value for the project.

> ℹ️ Any variable changed here can also be provided to `terraform` using the `-var` flag, i.e.  `-var terraform_variable=$BASH_VARIABLE`. An example for setting the `project_id` variable is also provided later.
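
Besides the `-var` flag, Terraform also reads input variables from environment variables prefixed with `TF_VAR_`. The sketch below shows this alternative for `project_id`; the fallback value is only an example and is used when `PROJECT_ID` has not been set yet.

```bash
# Fall back to an example value when the variables cell has not been run yet.
PROJECT_ID="${PROJECT_ID:-fltk-group-11}"

# Terraform reads any environment variable named TF_VAR_<name>
# as the value of the input variable <name>.
export TF_VAR_project_id="$PROJECT_ID"

# List the Terraform variables currently provided through the environment.
env | grep '^TF_VAR_'
```

A `-var` flag on the command line still takes precedence over a `TF_VAR_` environment variable, so the commands in this notebook behave the same either way.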

In [1]:
# VARIABLES THAT NEED TO BE SET

##################
### CHANGE ME! ###
##################
PROJECT_ID="fltk-group-11"

# DEFAULT VARIABLES
ACCOUNT_ID="terraform-iam-service-account"
PRIVILEGED_ACCOUNT_ID="${ACCOUNT_ID}@${PROJECT_ID}.iam.gserviceaccount.com"
CLUSTER_NAME="fltk-testbed-cluster"
DEFAULT_POOL="default-node-pool"
EXPERIMENT_POOL="medium-fltk-pool-1"
REGION="us-central1-c"

TERRAFORM_GKE_DIR="../terraform/terraform-gke"
TERRAFORM_DEPENDENCIES_DIR="../terraform/terraform-dependencies"
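
Before creating the project, it can save a round trip to check that the chosen `PROJECT_ID` is well-formed. GCP project IDs must be 6 to 30 characters long, consist of lowercase letters, digits and hyphens, start with a letter, and not end with a hyphen. The helper name `is_valid_project_id` is our own, and the check is only a local sketch: it cannot tell whether the ID is still globally available.

```bash
# Return 0 when $1 satisfies the format rules for GCP project IDs.
is_valid_project_id() {
  # 1 leading letter + 4..28 middle characters + 1 trailing letter/digit = 6..30.
  local re='^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
  [[ "$1" =~ $re ]]
}

# Fall back to the example value when PROJECT_ID has not been set yet.
candidate="${PROJECT_ID:-fltk-group-11}"
if is_valid_project_id "$candidate"; then
  echo "'$candidate' is a well-formed project ID"
else
  echo "'$candidate' is not a well-formed project ID"
fi
```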

## Project creation

Next, we create a project using the `PROJECT_ID` variable and get all the billing account information.

⁉️ (Ignore if using a pre-existing GCP Project) If the command below does not complete successfully, make sure to change the `PROJECT_ID` variable in the previous cell and re-run it.

In [None]:
gcloud projects create $PROJECT_ID --set-as-default
gcloud beta billing accounts list # Copy the Account ID of the account

Copy the billing account identifier, e.g. `015594-41687F-092941`, and assign it to the variable in the cell below.

In [2]:
##################
### CHANGE ME! ###
##################
BILLING_ACCOUNT="0197FE-691924-DEDC82"
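
A copy-paste mistake in the billing account identifier is easy to make. The sketch below checks the value against the pattern seen in the examples above, three groups of six uppercase hexadecimal characters separated by hyphens. That pattern is an assumption derived from those examples, and the helper name `is_billing_account_id` is our own.

```bash
# Return 0 when $1 looks like a billing account identifier.
# Assumption: identifiers follow the XXXXXX-XXXXXX-XXXXXX pattern shown above.
is_billing_account_id() {
  local re='^[0-9A-F]{6}-[0-9A-F]{6}-[0-9A-F]{6}$'
  [[ "$1" =~ $re ]]
}

# Fall back to the example identifier when BILLING_ACCOUNT has not been set yet.
candidate="${BILLING_ACCOUNT:-015594-41687F-092941}"
if is_billing_account_id "$candidate"; then
  echo "'$candidate' looks like a billing account identifier"
else
  echo "'$candidate' does not look like a billing account identifier"
fi
```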

Set up billing and enable services. This will allow us to create a GKE cluster (Google managed Kubernetes cluster), and push and pull containers to our private container repository.

In [4]:
# Link the billing account to the project
gcloud beta billing projects link $PROJECT_ID --billing-account $BILLING_ACCOUNT
# Enable services now that billing is enabled
gcloud services enable compute container --project $PROJECT_ID

billingAccountName: billingAccounts/0197FE-691924-DEDC82
billingEnabled: true
name: projects/fltk-group-11/billingInfo
projectId: fltk-group-11
Operation "operations/acat.p2-528711584803-5af05b0e-ffd4-4bc1-b362-83b11ae75d42" finished successfully.


## Creating a service-account

Create a service account that has the minimum set of permissions for creating and managing a cluster. This service account
will be used to create the cluster and deploy the dependencies that we use.

During the deployment we will make use of impersonation to let *your* account utilize the service account. For more information about this practice, see also [this](https://cloud.google.com/blog/topics/developers-practitioners/using-google-cloud-service-account-impersonation-your-terraform-code) blog post by Google.

In [None]:
# Helper function to quickly enable gcp roles, assumes $PRIVILEGED_ACCOUNT_ID and $PROJECT_ID to be set.
function enable_gcp_role () {
  ROLE=$1
  gcloud projects add-iam-policy-binding \
    $PROJECT_ID \
    --member="serviceAccount:$PRIVILEGED_ACCOUNT_ID" \
    --role="roles/$ROLE"
}

# Create service-account
gcloud iam service-accounts create $ACCOUNT_ID --display-name="Terraform service account" --project ${PROJECT_ID}

# Allow the service account to use the set of roles below.
enable_gcp_role "compute.viewer"                # Allow the service account to see active resources
enable_gcp_role "storage.objectViewer"          # Allow the service account/managed resources to pull from gcr.io (your code)
enable_gcp_role "compute.networkAdmin"          # Needed for setting up private network
enable_gcp_role "compute.securityAdmin"         # Needed for GKE
enable_gcp_role "container.clusterViewer"       # Needed for GKE
enable_gcp_role "container.clusterAdmin"        # Needed for GKE
enable_gcp_role "container.developer"           # Needed for GKE
enable_gcp_role "iam.serviceAccountAdmin"       # Needed for GKE
enable_gcp_role "iam.serviceAccountUser"        # Needed for GKE
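
The same set of roles can also be kept in an array and granted in a loop, which makes it easier to review what will be granted. The sketch below is an alternative formulation of the cell above, not a replacement for it. The `DRY_RUN` variable is our own addition: when it is set, the `gcloud` commands are printed instead of executed. The fallback values are only examples.

```bash
# Fall back to example values when the variables cell has not been run yet.
PROJECT_ID="${PROJECT_ID:-fltk-group-11}"
PRIVILEGED_ACCOUNT_ID="${PRIVILEGED_ACCOUNT_ID:-terraform-iam-service-account@${PROJECT_ID}.iam.gserviceaccount.com}"

# Roles granted to the service account, taken from the cell above.
ROLES=(
  compute.viewer
  storage.objectViewer
  compute.networkAdmin
  compute.securityAdmin
  container.clusterViewer
  container.clusterAdmin
  container.developer
  iam.serviceAccountAdmin
  iam.serviceAccountUser
)

# Grant every role in ROLES. With DRY_RUN set, the commands are only printed.
enable_gcp_roles() {
  local role
  for role in "${ROLES[@]}"; do
    ${DRY_RUN:+echo} gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="serviceAccount:$PRIVILEGED_ACCOUNT_ID" \
      --role="roles/$role"
  done
}

# Preview the commands without changing any IAM policy.
DRY_RUN=1 enable_gcp_roles
```

Calling `enable_gcp_roles` without `DRY_RUN` executes the bindings.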


## Enable impersonation
With the service account created, we must enable impersonation, to allow the main account of the project to make use of the service account. For more information see also the [`add-iam-policy-binding`](https://cloud.google.com/sdk/gcloud/reference/iam/service-accounts/add-iam-policy-binding) reference.

Assign your `google_account` mail to the `OWNER_MAIL` variable, and run the command box below.

In [5]:
##################
### CHANGE ME! ###
##################
OWNER_MAIL="hugojpvandijk@gmail.com"

gcloud iam service-accounts add-iam-policy-binding $PRIVILEGED_ACCOUNT_ID \
 --member="user:$OWNER_MAIL" \
 --role=roles/iam.serviceAccountTokenCreator \
 --project $PROJECT_ID

Updated IAM policy for serviceAccount [terraform-iam-service-account@fltk-group-11.iam.gserviceaccount.com].
bindings:
- members:
  - user:HugoJPvanDijk@gmail.com
  role: roles/iam.serviceAccountTokenCreator
etag: BwXqISrITBQ=
version: 1


To enable using your account's credentials, run the command below. This will open a new browser tab, or display a link that you can open manually. Afterwards you can use your own credentials to impersonate the service account.

In this way you can, for example, also allow other Google users (such as project members) to work with your cluster.

In [1]:
gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=42wr4ORz9xfUXWO8pYgKOfuErxhPpR&access_type=offline&code_challenge=Sf8A4u6ntKKgaKOWYWLjU6OhkOv4-o4YqmMIaOauKdM&code_challenge_method=S256


Credentials saved to file: [/home/hugo/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "fltk-group-11" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resourc
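
As mentioned above, other Google users can be allowed to impersonate the service account as well. A minimal sketch of how this might look for several project members is given below. The e-mail addresses are hypothetical placeholders, and `DRY_RUN` is our own addition that prints the commands instead of executing them. Each member would still need to run `gcloud auth application-default login` with their own account.

```bash
# Fall back to example values when the variables cell has not been run yet.
PROJECT_ID="${PROJECT_ID:-fltk-group-11}"
PRIVILEGED_ACCOUNT_ID="${PRIVILEGED_ACCOUNT_ID:-terraform-iam-service-account@${PROJECT_ID}.iam.gserviceaccount.com}"

# Hypothetical addresses of project members.
MEMBERS=("alice@example.com" "bob@example.com")

# Allow every given user to create tokens for (impersonate) the service account.
grant_impersonation() {
  local mail
  for mail in "$@"; do
    ${DRY_RUN:+echo} gcloud iam service-accounts add-iam-policy-binding "$PRIVILEGED_ACCOUNT_ID" \
      --member="user:$mail" \
      --role=roles/iam.serviceAccountTokenCreator \
      --project "$PROJECT_ID"
  done
}

# Preview the commands without changing any IAM policy.
DRY_RUN=1 grant_impersonation "${MEMBERS[@]}"
```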

## Creating a Google managed cluster (GKE)
To create the cluster, we run Terraform in the `terraform-gke` directory. The commands below use the `-chdir` flag for this, so you do not need to change the active directory yourself.

⚠️ Creating a cluster will incur billing costs on your project. By default the cluster is small to minimize costs during this tutorial. Forgetting to `destroy` or scale down the cluster may quickly use up your academic coupon.

First, run `init` to initialize the Terraform module.

In [7]:
terraform -chdir=$TERRAFORM_GKE_DIR init 

[0m[1mInitializing modules...[0m

[0m[1mInitializing the backend...[0m

[0m[1mInitializing provider plugins...[0m
- Reusing previous version of hashicorp/random from the dependency lock file
- Reusing previous version of hashicorp/google from the dependency lock file
- Reusing previous version of hashicorp/kubernetes from the dependency lock file
- Reusing previous version of hashicorp/google-beta from the dependency lock file
- Using previously-installed hashicorp/random v3.4.3
- Using previously-installed hashicorp/google v4.35.0
- Using previously-installed hashicorp/kubernetes v2.13.1
- Using previously-installed hashicorp/google-beta v4.35.0

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun 

Next, we can check whether we can create a cluster. No warnings or errors should occur during this process. It may take a while to complete.

> ⚠️ We pass the `project_id` variable of `terraform/terraform-gke` manually using `-var`. Alternatively, you can change its default value in `variables.tf`.

⁉️ If the command below does not complete successfully, e.g. after raising a `403` error, make sure that you have successfully created the project with `gcloud` earlier.


In [8]:
terraform -chdir=$TERRAFORM_GKE_DIR plan -var project_id=$PROJECT_ID

[0m[1mmodule.gke.random_string.cluster_service_account_suffix: Refreshing state... [id=oaql][0m
[0m[1mdata.google_service_account_access_token.default: Reading...[0m[0m
[0m[1mdata.google_service_account_access_token.default: Read complete after 0s [id=projects/-/serviceAccounts/terraform-iam-service-account@fltk-group-11.iam.gserviceaccount.com][0m
[0m[1mdata.google_client_config.default: Reading...[0m[0m
[0m[1mmodule.gke.data.google_compute_zones.available: Reading...[0m[0m
[0m[1mmodule.gke.data.google_container_engine_versions.region: Reading...[0m[0m
[0m[1mdata.google_client_config.default: Read complete after 0s [id=projects/fltk-group-11/regions//zones/][0m
[0m[1mmodule.gcp-network.module.vpc.google_compute_network.network: Refreshing state... [id=projects/fltk-group-11/global/networks/gcp-private-network][0m
[0m[1mmodule.gke.data.google_container_engine_versions.region: Read complete after 0s [id=2022-10-03 13:18:22.905344604 +0000 UTC][0m
[0m[1mm

            ]
          [32m+[0m [0m[1m[0mmax_pods_per_node[0m[0m           = 110
          [32m+[0m [0m[1m[0mname[0m[0m                        = "medium-fltk-pool-1"
          [32m+[0m [0m[1m[0mnode_count[0m[0m                  = 0
          [32m+[0m [0m[1m[0mnode_locations[0m[0m              = [
              [32m+[0m [0m"us-central1-c",
            ]
          [32m+[0m [0m[1m[0mversion[0m[0m                     = "1.21.14-gke.5300"

          [32m+[0m [0mautoscaling {
              [32m+[0m [0m[1m[0mmax_node_count[0m[0m = 100
              [32m+[0m [0m[1m[0mmin_node_count[0m[0m = 0
            }

          [32m+[0m [0mmanagement {
              [32m+[0m [0m[1m[0mauto_repair[0m[0m  = true
              [32m+[0m [0m[1m[0mauto_upgrade[0m[0m = true
            }

          [32m+[0m [0mnode_config {
              [32m+[0m [0m[1m[0mdisk_size_gb[0m[0m      = 64
              [32m+[0m [0m[1m[0mdisk_type[

      [32m+[0m [0m[1m[0moperation[0m[0m                   = (known after apply)
      [33m~[0m [0m[1m[0mversion[0m[0m                     = "1.21.14-gke.5300" [33m->[0m [0m(known after apply)
        [90m# (4 unchanged attributes hidden)[0m[0m

      [33m~[0m [0mnode_config {
          [32m+[0m [0m[1m[0mmin_cpu_platform[0m[0m  = (known after apply)
          [33m~[0m [0m[1m[0moauth_scopes[0m[0m      = [ [31m# forces replacement[0m[0m
              [31m-[0m [0m"https://www.googleapis.com/auth/logging.write",
              [31m-[0m [0m"https://www.googleapis.com/auth/monitoring",
                [90m# (1 unchanged element hidden)[0m[0m
            ]
            [1m[0mtags[0m[0m              = [
                "gke-fltk-testbed-cluster",
                "gke-fltk-testbed-cluster-default-node-pool",
                "default-node-pool",
            ]
          [33m~[0m [0m[1m[0mtaint[0m[0m             = [] [33m->[0m [0m(known a

When the previous command completes successfully, we can start the deployment. Depending on any changes you may have done, this might take a while.

By default, this will create a private zonal cluster consisting of two node pools.

> ⚠️ A regional cluster (multi-zonal) will incur an additional fee of \\$ 0.10 /hour per managed (GKE) cluster. The **first** zonal cluster is free of this charge.

> ⚠️ By default, spot/preemptible nodes are disabled. You can experiment by setting `spot` to true in the `tf` files. Note, however, that the default implementations provided in the testbed do not allow for recovery from getting spun down and rescheduled. Moreover, this may result in poor availability during busy hours in the region in which you deploy your cluster.


In [3]:
terraform -chdir=$TERRAFORM_GKE_DIR apply -auto-approve -var project_id=$PROJECT_ID

[0m[1mmodule.gke.random_string.cluster_service_account_suffix: Refreshing state... [id=oaql][0m
[0m[1mdata.google_service_account_access_token.default: Reading...[0m[0m
[0m[1mdata.google_service_account_access_token.default: Read complete after 1s [id=projects/-/serviceAccounts/terraform-iam-service-account@fltk-group-11.iam.gserviceaccount.com][0m
[0m[1mmodule.gke.data.google_compute_zones.available: Reading...[0m[0m
[0m[1mmodule.gke.data.google_container_engine_versions.region: Reading...[0m[0m
[0m[1mdata.google_client_config.default: Reading...[0m[0m
[0m[1mmodule.gcp-network.module.vpc.google_compute_network.network: Refreshing state... [id=projects/fltk-group-11/global/networks/gcp-private-network][0m
[0m[1mdata.google_client_config.default: Read complete after 0s [id=projects/fltk-group-11/regions//zones/][0m
[0m[1mmodule.gcp-network.module.subnets.google_compute_subnetwork.subnetwork["us-central1/gcp-private-subnetwork"]: Refreshing state... [id=proj

Next, we add cluster credentials (so you can interact with the cluster through `kubectl` and `helm`).

In [None]:
# Add credentials for interacting with cluster via kubectl
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID

⚠️ By default, the node pools of the cluster do not contain any nodes, as the `initial_node_count` is set to 0.

Lastly, we therefore need to scale up the cluster.

In [None]:
###
### ! CHANGE ME
###
MAX_NUM_NODES=2

gcloud container clusters update $CLUSTER_NAME --node-pool $DEFAULT_POOL \
    --no-enable-autoscaling --region $REGION --quiet
    
# The high-performance node pool will scale up automatically whenever workloads are deployed
gcloud container clusters update $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \
    --enable-autoscaling --min-nodes=0 --max-nodes=$MAX_NUM_NODES --region $REGION --quiet

gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \
    --num-nodes 1 --region $REGION --quiet


### Changing deployment

To save cost, or run different experiments, you might want to change the configuration of your cluster. This can be achieved by modifying the cluster configuration in the [`terraform-gke/main.tf`](../terraform/terraform-gke/main.tf) configuration file. You can change the default node-pools, create additional node pools with taints (to allow for scheduling on specific nodes/pools) and much more.

After finishing your changes, run the following commands:

```bash
# Use `plan` to check your configuration
terraform plan
# Check to see if your changes are as expected, terraform will show what will be created/removed.

# If the changes are as you expect, apply the changes.
terraform apply #-auto-approve
```

Depending on the number of changes, this may take some time.
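
When changing the cluster configuration, it can be useful to write the plan to a file and apply exactly that file, so that what is applied is what was reviewed. The sketch below uses Terraform's `-out` option for this. The plan file name `gke.tfplan` and the `DRY_RUN` variable (which prints the commands instead of executing them) are our own additions, and the fallback values are only examples.

```bash
# Fall back to example values when the variables cell has not been run yet.
TERRAFORM_GKE_DIR="${TERRAFORM_GKE_DIR:-../terraform/terraform-gke}"
PROJECT_ID="${PROJECT_ID:-fltk-group-11}"
PLAN_FILE="gke.tfplan"  # hypothetical file name, relative to the -chdir directory

plan_and_apply() {
  # Write the plan to a file so that it can be reviewed before applying.
  ${DRY_RUN:+echo} terraform -chdir="$TERRAFORM_GKE_DIR" plan \
    -var "project_id=$PROJECT_ID" -out="$PLAN_FILE" || return 1
  # Apply the saved plan; variable values are already stored in the plan file.
  ${DRY_RUN:+echo} terraform -chdir="$TERRAFORM_GKE_DIR" apply "$PLAN_FILE"
}

# Preview the commands without touching the cluster.
DRY_RUN=1 plan_and_apply
```

Note that applying a saved plan does not ask for confirmation, so review the output of `plan` before running `apply`.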

## Installing dependencies
Lastly, we need to install the dependencies on our cluster. We run the `init`, `plan` and `apply` commands again, as we did for creating the GKE cluster, this time in the `terraform-dependencies` directory.

First, run `init` to initialize the Terraform module.

In [1]:
terraform -chdir=$TERRAFORM_DEPENDENCIES_DIR init -reconfigure

Invalid -chdir option: must include an equals sign followed by a directory path, like -chdir=example
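
The `Invalid -chdir option` message shown above is typically what Terraform prints when `-chdir=` receives an empty value, for example when `TERRAFORM_DEPENDENCIES_DIR` is no longer set after a kernel restart. A small guard such as the sketch below can catch this before running Terraform. The helper name `require_vars` is our own.

```bash
# Return non-zero (and report on stderr) when any named variable is empty or unset.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is Bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "Variable $name is empty or unset" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_vars PROJECT_ID TERRAFORM_GKE_DIR TERRAFORM_DEPENDENCIES_DIR \
  || echo "Re-run the variables cell under 'Getting started' first."
```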



Check to see if we can plan the deployment. This will set up the following:

* Kubeflow training operator (used to deploy and manage PyTorchTrainJobs programmatically)
* NFS-provisioner (used to enable logging on a persistent `ReadWriteMany` PVC in the cluster)


In [12]:
terraform -chdir=$TERRAFORM_DEPENDENCIES_DIR plan -var project_id=$PROJECT_ID

[0m[1mhelm_release.nfs_client_provisioner: Refreshing state... [id=nfs-server][0m
[0m[1mdata.kustomization_build.training_operator: Reading...[0m[0m
[0m[1mdata.google_service_account_access_token.default: Reading...[0m[0m
[0m[1mdata.google_service_account_access_token.default: Read complete after 1s [id=projects/-/serviceAccounts/terraform-iam-service-account@fltk-group-11.iam.gserviceaccount.com][0m
[0m[1mdata.kustomization_build.training_operator: Read complete after 3s [id=a294ea9a3d4f626ec1ec55aac66b4a486f682fe5dbec2eadf58d30baee14a8f66a3ec2674c0e2d9a40e7bc191f878f010393849f4581c6a4189bc41761abab89][0m
[0m[1mkustomization_resource.training_operator["_/Namespace/_/kubeflow"]: Refreshing state... [id=a0784de8-1cfc-4afc-81db-7b270cc0eabc][0m
[0m[1mkustomization_resource.training_operator["rbac.authorization.k8s.io/ClusterRoleBinding/_/training-operator"]: Refreshing state... [id=4ed4df0f-39dd-4765-994d-84689e6f5199][0m
[0m[1mkustomization_resource.training_ope

When the previous command completes successfully, we can start the deployment. This will install the NFS provisioner and Kubeflow Training Operator dependencies.


In [5]:
terraform -chdir=$TERRAFORM_DEPENDENCIES_DIR apply -auto-approve -var project_id=$PROJECT_ID

[0m[1mhelm_release.nfs_client_provisioner: Refreshing state... [id=nfs-server][0m
[0m[1mdata.kustomization_build.training_operator: Reading...[0m[0m
[0m[1mdata.google_service_account_access_token.default: Reading...[0m[0m
[0m[1mdata.google_service_account_access_token.default: Read complete after 0s [id=projects/-/serviceAccounts/terraform-iam-service-account@fltk-group-11.iam.gserviceaccount.com][0m
[0m[1mdata.google_client_config.default: Reading...[0m[0m
[0m[1mdata.google_client_config.default: Read complete after 0s [id=projects/fltk-group-11/regions//zones/][0m
[0m[1mdata.google_container_cluster.testbed_cluster: Reading...[0m[0m
[0m[1mdata.kustomization_build.training_operator: Read complete after 3s [id=a294ea9a3d4f626ec1ec55aac66b4a486f682fe5dbec2eadf58d30baee14a8f66a3ec2674c0e2d9a40e7bc191f878f010393849f4581c6a4189bc41761abab89][0m
[0m[1mkustomization_resource.training_operator["_/Namespace/_/kubeflow"]: Refreshing state... [id=a0784de8-1cfc-4afc-


## Deploying extractor

Lastly, we deploy the extractor pod, which also provides PVCs that can be used for artifact retrieval.

Once the extractor is running, artifacts can be retrieved by running:

```bash
EXTRACTOR_POD_NAME=$(kubectl get pods -n test -l "app.kubernetes.io/name=fltk.extractor" -o jsonpath="{.items[0].metadata.name}")
kubectl cp -n test $EXTRACTOR_POD_NAME:/opt/federation-lab/logging ./logging
```

This copies the extractor's `/opt/federation-lab/logging` directory to a local directory named `logging`.

Before deploying the extractor, build the Docker container, following the instructions of the [readme](https://github.com/JMGaljaard/fltk-testbed#creating-and-uploading-docker-container).


N.B. Make sure to have set up a working authentication provider for Docker, such that you can push to your repository.

Run this in a terminal in the content-root directory (so `fltk-testbed` if the project name was not altered).
```bash
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements-cpu.txt
python3 -m fltk extractor configs/example_cloud_experiment.json
```

Make sure to have run `gcloud auth configure-docker` in an external terminal.

Make sure that Docker can build/push/run without `sudo`, see the [authentication documentation](https://cloud.google.com/artifact-registry/docs/docker/authentication).


In [None]:
# Build the Docker container with BuildKit. Make sure you have Docker Desktop running on Windows/macOS.
DOCKER_BUILDKIT=1 docker build --platform linux/amd64 ../ --tag gcr.io/$PROJECT_ID/fltk
# Push the image to the container registry of your project
docker push gcr.io/$PROJECT_ID/fltk

In [None]:
# Install the extractor, and set the projectName to $PROJECT_ID.
# In case you get a warning regarding the namespace test, this means that the dependencies have not been properly installed.
# Make sure to check whether you have enough resources available, and re-run the installation of dependencies. (see above).

# Deploy extractor, in test namespace with updated image reference (--set overwrites values from `fltk-values.yaml`).
helm install extractor ../charts/extractor -f ../charts/fltk-values.yaml --namespace test --set provider.projectName=$PROJECT_ID

## Testing the deployment

To make sure that the deployment succeeded, we can run the following command to test whether we can use the PyTorch training operator.

This creates a small (1 master, 1 client) MNIST training job on your cluster, based on a Kubeflow PyTorch example job. You can follow the deployment by navigating to your cluster on [cloud.google.com](https://cloud.google.com).

In [None]:
# This cell is optional, but the next cell should show that a PyTorch training job is created.
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

In [None]:
# List all PyTorchJob custom resources managed by the Kubeflow training operator.
kubectl get pytorchjobs.kubeflow.org --all-namespaces

# Alternatively, we can remove all jobs. This will remove all information and logs as well.
kubectl delete pytorchjobs.kubeflow.org --all-namespaces --all
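
Instead of re-running the listing command by hand, you could poll until the example job has finished. The sketch below defines a generic polling helper, `wait_until`, together with a check for the job status. Both function names are our own. The job name `pytorch-simple` and the namespace `kubeflow` are assumptions about the example manifest, so adjust them to what `kubectl get pytorchjobs.kubeflow.org --all-namespaces` reports.

```bash
# Run a command repeatedly until it succeeds or the timeout (in seconds) expires.
# Usage: wait_until <timeout> <interval> <command> [args...]
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "Timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Succeeds once the example job reports a Succeeded condition with status True.
# Assumption: the job is called pytorch-simple and lives in the kubeflow namespace.
job_succeeded() {
  kubectl get pytorchjobs.kubeflow.org pytorch-simple -n kubeflow \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}' 2>/dev/null \
    | grep -q True
}

# Example usage: wait at most 10 minutes, checking every 10 seconds.
# wait_until 600 10 job_succeeded && echo "Job finished"
```

The usage line is commented out so that defining the helpers does not block the notebook.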

# Cleaning up

## Scaling down the cluster

This is the preferred way to scale down.
Scale node pools down to prevent idle resource utilization.

In [None]:
gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \
     --num-nodes 0 --region $REGION --quiet

gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \
    --num-nodes 0 --region $REGION --quiet

## Destroying the cluster

> ⚠️ THIS WILL REMOVE YOUR CLUSTER AND DATA STORED ON IT. For this tutorial's purpose destroying your cluster is not an issue. For testing/developing, we recommend manually scaling your cluster up and down instead.

To clean up/remove the cluster, we will use the `terraform destroy` command.

 * Running it in `terraform-dependencies` WILL REMOVE the Kubeflow Training-Operator from your cluster.
 * Running it in `terraform-gke` WILL REMOVE YOUR ENTIRE CLUSTER.

Run the cell below to remove the cluster, or run the commands in a terminal in the [`../terraform/terraform-gke`](../terraform/terraform-gke) directory.

> ⚠️ It is recommended to scale down the cluster/node pools rather than destroying them, refer to the code block under "Scaling down the cluster".

In [None]:
# THIS WILL REMOVE/TEARDOWN YOUR CLUSTER, ONLY RECOMMENDED FOR TESTING THE DEPLOYMENT

terraform -chdir=$TERRAFORM_DEPENDENCIES_DIR destroy -auto-approve -var project_id=$PROJECT_ID

terraform -chdir=$TERRAFORM_GKE_DIR destroy -auto-approve -var project_id=$PROJECT_ID
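
Because `destroy -auto-approve` does not ask for confirmation, you may want an explicit guard against running the teardown by accident. The sketch below wraps the two commands above in a function that refuses to run unless `CONFIRM_DESTROY=yes` is set. Both `CONFIRM_DESTROY` and `DRY_RUN` (which prints the commands instead of executing them) are hypothetical variables introduced for this example, and the fallback values are only examples.

```bash
# Fall back to example values when the variables cell has not been run yet.
TERRAFORM_GKE_DIR="${TERRAFORM_GKE_DIR:-../terraform/terraform-gke}"
TERRAFORM_DEPENDENCIES_DIR="${TERRAFORM_DEPENDENCIES_DIR:-../terraform/terraform-dependencies}"
PROJECT_ID="${PROJECT_ID:-fltk-group-11}"

# Destroy the dependencies first, then the cluster itself.
destroy_cluster() {
  if [ "${CONFIRM_DESTROY:-}" != "yes" ]; then
    echo "Refusing to destroy: set CONFIRM_DESTROY=yes to proceed." >&2
    return 1
  fi
  local dir
  for dir in "$TERRAFORM_DEPENDENCIES_DIR" "$TERRAFORM_GKE_DIR"; do
    ${DRY_RUN:+echo} terraform -chdir="$dir" destroy -auto-approve \
      -var "project_id=$PROJECT_ID" || return 1
  done
}

# Preview the commands without destroying anything.
DRY_RUN=1 CONFIRM_DESTROY=yes destroy_cluster
```

Running `CONFIRM_DESTROY=yes destroy_cluster` without `DRY_RUN` performs the actual teardown.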