google/cloud-tpu-monitoring-debugging

Cloud TPU Monitoring Debugging

Overview

The Cloud TPU Monitoring Debugging repository contains the infrastructure and logic required to monitor and debug jobs running on Cloud TPU.

Terraform is used to deploy resources in a Google Cloud project. Terraform is an open-source tool for setting up and managing Google Cloud infrastructure from configuration files. This repository lets you deploy various Google Cloud resources via script, without manual effort.

The cloud-tpu-diagnostics PyPI package contains the logic to monitor, debug, and profile jobs running on Cloud TPU.

Getting Started with Terraform

  • Follow this link to install Terraform on your desktop.
  • Run terraform init to initialize the Google Cloud Terraform provider. This command adds the necessary plugins and builds the .terraform directory.
  • If the Terraform Google Cloud provider version has been updated, run terraform init --upgrade to pick up the update.
  • You can also run terraform plan to validate resource declarations and catch syntax errors or version mismatches before deploying the resources.
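
The init and plan steps above can be strung together in a small wrapper. This is a minimal sketch, not part of the repository: the function name tf_prepare and the TF_BIN variable are hypothetical, added so that the sequence can be dry-run by setting TF_BIN=echo.

```shell
# TF_BIN is a hypothetical override (not a Terraform feature). It defaults to
# the real CLI and can be set to "echo" to print the commands instead.
TF_BIN="${TF_BIN:-terraform}"

# tf_prepare DIR: initialize the provider plugins in DIR, then preview changes.
tf_prepare() {
  (
    cd "$1" || exit 1
    # Add -upgrade here after a provider version bump.
    "$TF_BIN" init || exit 1
    # plan validates the configuration and previews changes; it deploys nothing.
    "$TF_BIN" plan
  )
}
```

For example, tf_prepare gcp_resources/gce would run both commands inside the GCE directory. Terraform may still prompt for input variables and the backend bucket.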

Configure Terraform to store state in Cloud Storage

By default, Terraform stores state locally in a file named terraform.tfstate. This default makes Terraform difficult to use in teams, especially when many users run Terraform at the same time and each machine has its own view of the current infrastructure. To help avoid such issues, this section configures remote state that points to a Google Cloud Storage (GCS) bucket.

  1. In Cloud Shell, create the GCS bucket:

     gsutil mb gs://${GCS_BUCKET_NAME}
    
  2. Enable Object Versioning to keep the history of your deployments. Enabling Object Versioning increases storage costs, which you can mitigate by configuring Object Lifecycle Management to delete old state versions.

     gsutil versioning set on gs://${GCS_BUCKET_NAME}
    
  3. Enter the name of the GCS bucket created above when terraform init prompts for it:

     Initializing the backend...
     bucket
       The name of the Google Cloud Storage bucket
    
       Enter a value: <GCS_BUCKET_NAME>
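
The three steps above can also be scripted. The sketch below is an illustration under stated assumptions, not the repository's own tooling. It uses Terraform's standard -backend-config option to supply the bucket name instead of typing it at the prompt, and it adds a lifecycle rule that deletes a noncurrent state version once 10 newer versions exist. The threshold of 10, the file name lifecycle.json, the helper names, and the RUN dry-run variable are all hypothetical.

```shell
# Assumes GCS_BUCKET_NAME is already set, as in the steps above.
# RUN is a hypothetical dry-run hook: leave it empty to execute the commands,
# or set RUN=echo to print them instead.
RUN="${RUN:-}"

# Write a lifecycle rule that deletes a noncurrent object version once
# 10 newer versions of the same object exist.
write_lifecycle_json() {
  cat > "$1" <<'EOF'
{"rule": [{"action": {"type": "Delete"},
           "condition": {"numNewerVersions": 10, "isLive": false}}]}
EOF
}

setup_state_bucket() {
  $RUN gsutil mb "gs://${GCS_BUCKET_NAME}" &&
  $RUN gsutil versioning set on "gs://${GCS_BUCKET_NAME}" &&
  write_lifecycle_json lifecycle.json &&
  $RUN gsutil lifecycle set lifecycle.json "gs://${GCS_BUCKET_NAME}" &&
  # Supplies the backend bucket without the interactive prompt.
  $RUN terraform init -backend-config="bucket=${GCS_BUCKET_NAME}"
}
```

Running RUN=echo first is a cheap way to review the exact commands before anything is created.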
    

Deploy GCP Resources

The following resources are managed in this directory:

  1. Monitoring Dashboard: This is an outlier dashboard that displays statistics and outlier mode for TPU metrics.
  2. Debugging Dashboard: This dashboard displays the stack traces collected in Cloud Logging for the process running on TPU VMs.
  3. Logging Storage: This is a user-defined log bucket to store stack traces. Creating a new log bucket is optional. If you choose not to create a separate log bucket, the stack traces will be collected in the _Default log bucket.

Deploy Resources for Workloads on GCE

Run terraform init && terraform apply inside the gcp_resources/gce directory to deploy all the resources mentioned above for TPU workloads running on GCE. You will be prompted to provide values for some input variables. After you confirm the action, all the resources are deployed automatically in your GCP project.

Deploy Resources for Workloads on GKE

Run terraform init && terraform apply inside the gcp_resources/gke directory to deploy all the resources mentioned above for TPU workloads running on GKE. You will be prompted to provide values for some input variables. After you confirm the action, all the resources are deployed automatically in your GCP project.

NOTE: See the guide below for more details about GCE/GKE-specific resources and prerequisites.

Follow the guide below to deploy the resources individually:

Monitoring Dashboard

GCE

Run terraform init && terraform apply inside gcp_resources/gce/resources/dashboard/monitoring_dashboard/ to deploy only the monitoring dashboard for GCE in your GCP project.

If the node_prefix parameter is not specified in the input variable var.monitoring_dashboard_config or is set to an empty string, the metrics on the dashboard will plot the data points for all TPU VMs in your GCP project.

For instance, if you provide {"node_prefix": "test"} as the input value for the input variable var.monitoring_dashboard_config, then the metrics on the monitoring dashboard will only show the data points for the TPU VMs with node names that start with test. Refer to this doc for more information on node prefix for TPUs in multislice.
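
To avoid the interactive prompt, the same value can be passed on the command line with Terraform's standard -var option. This is a sketch that assumes var.monitoring_dashboard_config is declared with an object type, so that Terraform parses the value shown above as an object. The helper name dashboard_var is hypothetical, and any other variables the module requires will still be prompted for.

```shell
# dashboard_var PREFIX: build the -var argument for var.monitoring_dashboard_config.
dashboard_var() {
  printf 'monitoring_dashboard_config={"node_prefix":"%s"}' "$1"
}

# Example, run inside gcp_resources/gce/resources/dashboard/monitoring_dashboard/:
#   terraform apply -var "$(dashboard_var test)"
```

Passing an empty prefix, as in dashboard_var "", corresponds to the case described above where data points for all TPU VMs are plotted.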

GKE

Run terraform init && terraform apply inside gcp_resources/gke/resources/dashboard/monitoring_dashboard/ to deploy only the monitoring dashboard for GKE in your GCP project.

Debugging Dashboard

GCE

Run terraform init && terraform apply inside gcp_resources/gce/resources/dashboard/logging_dashboard/ to deploy only the debugging dashboard for GCE in your GCP project.

GKE

Run terraform init && terraform apply inside gcp_resources/gke/resources/dashboard/logging_dashboard/ to deploy only the debugging dashboard for GKE in your GCP project.

To view traces in the debugging dashboard, add a sidecar container to your TPU workload running on GKE. The name of the sidecar container must match the regex [a-z-0-9]*stacktrace[a-z-0-9]*. Here is an example of the sidecar container to add:

containers:
- name: stacktrace-log-collector
  image: busybox:1.28
  resources:
    limits:
      cpu: 100m
      memory: 200Mi
  args: [/bin/sh, -c, "while [ ! -d /tmp/debugging ]; do sleep 60; done; while [ ! -e /tmp/debugging/* ]; do sleep 60; done; tail -n+1 -f /tmp/debugging/*"]
  volumeMounts:
  - name: tpu-debug-logs
    readOnly: true
    mountPath: /tmp/debugging
- name: <main_container>
.....
.....
volumes:
- name: tpu-debug-logs
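
A container name can be checked against the naming rule before the workload is deployed. The sketch below is an assumption-laden illustration: the function name is hypothetical, the whole name is assumed to have to match (hence the anchors), and the character class is written as [a-z0-9-], which denotes the same characters as [a-z-0-9] but is handled portably by POSIX regex tools.

```shell
# is_stacktrace_sidecar NAME: succeed if NAME fits [a-z-0-9]*stacktrace[a-z-0-9]*.
# Assumption: the entire name must match, so the pattern is anchored with ^ and $.
is_stacktrace_sidecar() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9-]*stacktrace[a-z0-9-]*$'
}

is_stacktrace_sidecar stacktrace-log-collector && echo "ok"
is_stacktrace_sidecar StackTrace_Collector || echo "rejected"
```

The first call accepts the name used in the example above. The second is rejected because it contains uppercase letters and an underscore.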

Log Storage

GCE

Run terraform init && terraform apply inside gcp_resources/gce/resources/log_storage/ to deploy a separate log bucket to store stack traces for GCE. You will be prompted to provide the name of your GCP project and the bucket configuration. You can also set the retention period for the bucket.

GKE

Run terraform init && terraform apply inside gcp_resources/gke/resources/log_storage/ to deploy a separate log bucket to store stack traces for GKE. You will be prompted to provide the name of your GCP project and the bucket configuration. You can also set the retention period for the bucket. Make sure that the sidecar container is running in your GKE cluster, as described in the Debugging Dashboard section for GKE.
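
The per-resource directories listed in this guide follow a regular layout, which a small helper can capture. This is only a convenience sketch. The function name resource_dir is hypothetical, and the paths are the ones quoted in the sections above.

```shell
# resource_dir PLATFORM RESOURCE: print the Terraform directory for one resource.
# PLATFORM is gce or gke.
# RESOURCE is monitoring_dashboard, logging_dashboard, or log_storage.
resource_dir() {
  case "$1" in
    gce|gke) ;;
    *) echo "unknown platform: $1" >&2; return 1 ;;
  esac
  case "$2" in
    monitoring_dashboard|logging_dashboard)
      echo "gcp_resources/$1/resources/dashboard/$2/" ;;
    log_storage)
      echo "gcp_resources/$1/resources/log_storage/" ;;
    *) echo "unknown resource: $2" >&2; return 1 ;;
  esac
}

# Example:
#   (cd "$(resource_dir gke logging_dashboard)" && terraform init && terraform apply)
```

Keeping the mapping in one place makes it harder to run terraform apply in the wrong directory when deploying resources one at a time.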