# Vertex AI Troubleshooting


| | |
|-|-|
| Author(s) | [Keeyana Jones](https://github.com/keeyanajones/) |

## **A Notebook for MLOps Vertex AI Troubleshooting**

## I. Logs: The "What Happened"

Logs provide detailed records of events that occurred during the execution of your ML workloads. In GCP, these are primarily managed by Cloud Logging.

### Where to Find Logs in Vertex AI:

- Vertex AI Training Jobs (Custom Jobs, Hyperparameter Tuning Jobs):
    - Navigate to Vertex AI -> Training.
    - Select your job. In the "Job details" page, you'll see a "Logs" tab. This provides the stdout and stderr output from your training container.
    - **Direct Cloud Logging Link:** For more advanced filtering and context, click "VIEW LOGS", which takes you directly to Cloud Logging (Logs Explorer) with the relevant filters pre-applied (e.g., `resource.type="ml_job"`, `resource.labels.job_id="your-job-id"`).
    
    - Common issues:
       - **Permission Denied:** Check if your training job's service account has necessary permissions (e.g., to read data from GCS, write to BigQuery, log to Cloud Logging).
       - **Module Not Found:** Ensure all dependencies are correctly installed in your Docker container or included in your Python package.
       - **Resource Exhaustion:** Look for OOM (Out of Memory) errors or messages indicating GPU/CPU limits were hit.
       - **Code Errors:** Python tracebacks or specific error messages from your training script will appear here.
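
The pre-applied filter described above can also be assembled by hand, which is handy when you want to narrow it to errors only. The following is a minimal sketch: the helper name `training_job_log_filter` is hypothetical, and it only builds the filter string (it does not call any Google Cloud API).

```python
# Severity names accepted by the Cloud Logging query language.
SEVERITIES = (
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
)


def training_job_log_filter(job_id: str, min_severity: str = "ERROR") -> str:
    """Build a Logs Explorer filter for one training job (hypothetical helper).

    The resource type and label mirror the pre-applied filter mentioned in
    the text; the severity clause is an extra narrowing step.
    """
    if min_severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {min_severity!r}")
    clauses = [
        'resource.type="ml_job"',
        f'resource.labels.job_id="{job_id}"',
        f"severity>={min_severity}",
    ]
    # Clauses are joined with an explicit AND so the filter fits on one line.
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(training_job_log_filter("your-job-id"))
```

The resulting string can be pasted into Logs Explorer or passed to `gcloud logging read`.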

- Vertex AI Prediction Endpoints (Online Prediction):
    - **Container Logs:** The stdout and stderr from your deployed model's container. Crucial for debugging issues within your prediction code (e.g., model loading errors, inference logic bugs). Enabled by default for v1 endpoints.
    - **Access Logs:** Provide information about each prediction request, including timestamp, latency, HTTP status codes, and source IP. Essential for understanding traffic patterns and errors. (Disabled by default; must be enabled when deploying/mutating the model.)
    - **Request-Response Logging:** Logs a sample of the actual prediction requests and responses to a BigQuery table. Invaluable for debugging specific inference failures or understanding typical input/output. (Disabled by default; must be enabled.)

    - Where to find:
        - Vertex AI -> Endpoints. Select your endpoint.
        - For basic container logs, there's a "Logs" section on the endpoint details page.
        - For detailed access logs, go to Cloud Logging (Logs Explorer) and filter by `resource.type="aiplatform.googleapis.com/Endpoint"` and `resource.labels.endpoint_id="your-endpoint-id"`. Request-response logs are in the BigQuery table you configured.

    - Common issues:
        - **HTTP 5xx Errors:** Indicate server-side issues (e.g., your model code crashed, OOM, timeouts).
        - **HTTP 4xx Errors:** Indicate client-side issues (e.g., malformed request, authentication failure).
        - **Latency Spikes:** Check access logs for increased `total_latency_duration` and container logs for bottlenecks in your model.
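
To make the 4xx/5xx/latency triage above concrete, here is a small sketch that summarizes a batch of access-log records. The record shape (a dict with `status` and `latency_ms` keys) is an assumption made for illustration; real access-log entries use their own field names, so map them into this shape first.

```python
import math
from collections import Counter


def summarize_access_logs(records):
    """Summarize access-log records given in a hypothetical simplified shape.

    Each record is a dict with an integer `status` and a float `latency_ms`.
    Returns error rates per status class and the nearest-rank p95 latency.
    """
    classes = Counter()
    latencies = []
    for rec in records:
        status = int(rec["status"])
        if 500 <= status <= 599:
            classes["5xx"] += 1  # server-side: crashes, OOM, timeouts
        elif 400 <= status <= 499:
            classes["4xx"] += 1  # client-side: bad request, auth failure
        else:
            classes["other"] += 1
        latencies.append(float(rec["latency_ms"]))

    total = len(latencies)
    if total == 0:
        return {"total": 0, "client_error_rate": 0.0,
                "server_error_rate": 0.0, "p95_latency_ms": None}

    latencies.sort()
    rank = max(1, math.ceil(0.95 * total))  # nearest-rank percentile
    return {
        "total": total,
        "client_error_rate": classes["4xx"] / total,
        "server_error_rate": classes["5xx"] / total,
        "p95_latency_ms": latencies[rank - 1],
    }
```

A rising `server_error_rate` points you to the container logs, while a rising `client_error_rate` usually points to the caller.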

- Vertex AI Pipelines:
    - Navigate to Vertex AI -> Pipelines. Select a pipeline run.
    - The visual graph of the pipeline run shows each step. Click on a specific step.
    - In the "Pipeline run analysis" pane, you'll see details for that step, including a "View logs" button. This takes you to the Cloud Logging entries for that specific pipeline component's execution.
    
    - Common issues:
        - **Component Failures:** Logs show why a specific component (e.g., data preprocessing, training, evaluation) failed.
        - **Input/Output Issues:** Errors related to reading/writing artifacts from GCS or passing parameters between components.
        - **Permissions:** Again, check the service account permissions for the pipeline itself and individual components.
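
The recurring failure categories in the lists above (permissions, missing modules, resource exhaustion, code errors) can be spotted with a first-pass text match over exported log lines. This is a rough sketch: the function name `triage` and the regular expressions are assumptions chosen for illustration, not an exhaustive or official set of patterns.

```python
import re

# Ordered (category, pattern) pairs; the first match wins.
# The patterns are illustrative heuristics, not an official list.
TRIAGE_PATTERNS = [
    ("permission", re.compile(r"permission[ _]?denied", re.IGNORECASE)),
    ("missing_module", re.compile(r"ModuleNotFoundError|No module named")),
    ("resource_exhaustion",
     re.compile(r"out of memory|\bOOM\b|MemoryError", re.IGNORECASE)),
    ("code_error", re.compile(r"Traceback \(most recent call last\)")),
]


def triage(log_line: str) -> str:
    """Return a coarse failure category for a single log line."""
    for category, pattern in TRIAGE_PATTERNS:
        if pattern.search(log_line):
            return category
    return "unknown"


if __name__ == "__main__":
    for line in (
        "PERMISSION_DENIED: caller lacks storage.objects.get",
        "ModuleNotFoundError: No module named 'xgboost'",
        "CUDA out of memory. Tried to allocate 2.00 GiB",
    ):
        print(triage(line), "<-", line)
```

Treat the result as a hint for where to look next; the full log entry remains the source of truth.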

- Vertex AI Workbench (Managed Notebooks):
    - **System Logs:** Cover the underlying VM instance (e.g., startup errors, disk issues). Found in Cloud Logging under `resource.type="gce_instance"`.
    - **JupyterLab Logs:** Logs from JupyterLab server processes.
    - **User Notebook Output:** stdout and stderr from executed notebook cells appear directly in the notebook interface, or in Cloud Logging if you configure custom logging from your notebook.
    
    - How to access:
        - Vertex AI -> Workbench. Select your instance.
        - On the instance details page, look for "Logs" or "Monitoring" tabs which often link to relevant Cloud Logging views.
        - Enable "Install Cloud Monitoring agent" and "Report custom metrics to Cloud Monitoring" when creating the instance for more detailed system and JupyterLab metrics.

- Cloud Logging (Logs Explorer) Features for Troubleshooting:

    - **Filters:** Filter by resource type, log severity (ERROR, WARNING, INFO, DEBUG), time range, text search, and labels.
    - **Log Analytics:** Use BigQuery-like SQL queries on your logs for advanced analysis.
    - **Log Sinks:** Export logs to BigQuery, Cloud Storage, or Pub/Sub for long-term archival or further analysis with external tools.
    - **Alerting:** Create log-based alerts to notify you when specific error patterns or high-severity events occur.

## II. Metrics: The "How Well is it Doing?"

Metrics provide quantitative data about the performance, resource utilization, and health of your ML systems. In GCP, these are primarily handled by Cloud Monitoring.

### Where to Find Metrics in Vertex AI:

- Vertex AI Training Jobs:
    - On the job details page in Vertex AI, there's often a "Resource usage" tab or section showing CPU, GPU, and memory utilization over time.
    - **Cloud Monitoring:** For more detailed and customizable metrics, go to Cloud Monitoring -> Metrics Explorer.
        - **Metric Filter:** Search for `aiplatform` or `ml.googleapis.com/training`. You'll find metrics like:
            - `training/cpu_utilization`, `training/gpu_utilization`, `training/memory_utilization`
            - `training/disk_utilization`
            - `training/network_bytes_sent`, `training/network_bytes_received`
        - Filter by `resource.labels.job_id` to focus on a specific training run.
    - **Vertex AI Experiments:** If you're using Vertex AI Experiments (often with `aiplatform.start_run()`), you can log custom metrics (e.g., loss, accuracy, precision, recall) directly from your training script. These metrics are visualized in the Vertex AI Experiments UI and can also be viewed in TensorBoard (if integrated).
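
Logging custom metrics from a training script, as described above, might look like the sketch below. It assumes the `google-cloud-aiplatform` package is installed and that your credentials can write to the experiment. The helper names (`summarize_history`, `log_run_to_vertex`) and the per-epoch `history` dict are hypothetical; the SDK calls used are `aiplatform.init`, `start_run`, `log_params`, `log_metrics`, and `end_run`.

```python
def summarize_history(history):
    """Reduce per-epoch metric lists to their final values.

    `history` maps a metric name to a list of per-epoch values
    (a hypothetical shape, similar to what many training loops collect).
    """
    return {name: float(values[-1]) for name, values in history.items() if values}


def log_run_to_vertex(project, location, experiment, run_name, params, history):
    """Record parameters and final metrics as one Vertex AI Experiments run."""
    # Imported lazily so summarize_history works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location, experiment=experiment)
    aiplatform.start_run(run=run_name)
    try:
        aiplatform.log_params(params)
        aiplatform.log_metrics(summarize_history(history))
    finally:
        # Close the run even if logging fails, so it is not left open.
        aiplatform.end_run()
```

A call such as `log_run_to_vertex("my-project", "us-central1", "my-experiment", "run-1", {"lr": 0.01}, history)` uses placeholder values; substitute your own project, region, and experiment names.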

- Vertex AI Prediction Endpoints:
    - On the endpoint details page in Vertex AI, there are built-in charts for key metrics:
        - Predictions per second (QPS)
        - Prediction error percentage
        - Latency (Model latency, Overhead latency, Total latency duration)
        - CPU/GPU Utilization
        - Memory Utilization
    - Cloud Monitoring (Metrics Explorer):
        - **Metric Filter:** Search for `aiplatform.googleapis.com/endpoint`.
        - **Key metrics:** `prediction_count`, `online_prediction_request_count`, `online_prediction_error_count`, `online_prediction_latency`, `deployed_model_cpu_utilization`, `deployed_model_memory_utilization`.
        - You can filter by `resource.labels.endpoint_id` or `resource.labels.deployed_model_id`.

- Vertex AI Pipelines:
    - **Vertex AI Pipelines UI:** Provides a visual overview of run status (Success/Failure/Running) and duration for each step. You can compare runs.
    - Cloud Monitoring (Metrics Explorer):
        - **Metric Filter:** Search for `aiplatform.googleapis.com/pipelinejob`.
        - **Key metrics:** `pipeline_job_duration`, `completed_pipeline_tasks`, `executing_pipeline_jobs`, `executing_pipeline_tasks`.
        - Filter by `pipeline_job_id` or `run_state`.

- Vertex AI Model Monitoring:
    - This is a specialized service for detecting training-serving skew and prediction drift in the input features of deployed tabular models.
    - **Dashboard:** Vertex AI Model Monitoring provides dedicated dashboards to visualize feature distributions, distance scores (L-infinity, Jensen-Shannon divergence), and compare baseline data with production data.
    - **Alerting:** Configure alerts when drift thresholds are exceeded, notifying you via Pub/Sub or email.
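
The two distance scores named above can be illustrated with their textbook definitions. The sketch below compares a baseline and a production histogram given as `{bucket: count}` dicts. It does not reproduce how Vertex AI bins features or computes its scores internally; the function names and the input shape are assumptions for illustration.

```python
import math


def _normalize(counts):
    """Turn a {bucket: count} histogram into probabilities."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("histogram must contain at least one observation")
    return {key: value / total for key, value in counts.items()}


def l_infinity_distance(baseline, production):
    """Largest absolute difference in probability across all buckets."""
    p, q = _normalize(baseline), _normalize(production)
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def jensen_shannon_divergence(baseline, production):
    """Jensen-Shannon divergence with log base 2, so the result is in [0, 1]."""
    p, q = _normalize(baseline), _normalize(production)
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); buckets with zero probability contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    baseline = {"web": 50, "mobile": 50}
    production = {"web": 80, "mobile": 20}
    print(l_infinity_distance(baseline, production))
    print(jensen_shannon_divergence(baseline, production))
```

A score of 0 means the two distributions match; larger scores mean production data has moved further from the baseline, which is what a drift threshold is compared against.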

- Generative AI on Vertex AI (Gemini, etc.):
    - **Model Observability Dashboard:** For Google-managed foundation models (MaaS models), Vertex AI provides a prebuilt observability dashboard in Cloud Monitoring.
    - **Metrics:** QPS, token throughput, first token latencies, API error rates.
    - **Cloud Monitoring:** You can also track token consumption for Gemini API calls using the CountTokens API and the `aiplatform.googleapis.com/token_count` metric in Cloud Monitoring.

- Cloud Monitoring Features for Troubleshooting:

    - **Metrics Explorer:** Query and visualize any metric from your GCP services.
    - **Custom Dashboards:** Create custom dashboards that combine metrics from various Vertex AI components, Cloud Storage, BigQuery, etc., providing a holistic view of your MLOps system.
    - **Alerting:** Set up alerts based on metric thresholds (e.g., alert if CPU utilization exceeds 80% for 5 minutes, or if prediction error rate goes above 1%).
    - **Uptime Checks:** Monitor the availability of your prediction endpoints.
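
The alerting example above ("CPU utilization exceeds 80% for 5 minutes") describes a sustained-threshold condition. Cloud Monitoring evaluates such conditions for you; the sketch below only illustrates the semantics on a local list of samples, and the function name and `(timestamp, value)` sample shape are assumptions.

```python
def breaches_threshold(samples, threshold, window_seconds):
    """Return True if values stay above `threshold` for `window_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    Any sample at or below the threshold resets the breach window.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            breach_start = None  # dipped below: start over
    return False


if __name__ == "__main__":
    # One CPU utilization sample per minute (fractions of 1.0).
    cpu = [(0, 0.70), (60, 0.85), (120, 0.90), (180, 0.88),
           (240, 0.91), (300, 0.86), (360, 0.83)]
    print(breaches_threshold(cpu, threshold=0.80, window_seconds=300))
```

Requiring the condition to hold for a full window is what keeps a single short spike from triggering an alert.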

## III. Monitoring Dashboards: The "At-a-Glance" View

Dashboards consolidate logs and metrics into actionable visualizations, providing a quick overview of system health and performance.

### Types of Dashboards:

Built-in Vertex AI Dashboards:
   - **Training:** Overview of job status, resource usage.
   - **Endpoints:** Real-time metrics for QPS, latency, error rates, resource utilization.
   - **Pipelines:** Visual graph of pipeline runs, step status.
   - **Model Monitoring:** Specialized dashboards for drift detection.
   - **Generative AI Model Observability:** For Google-managed LLMs.

Custom Cloud Monitoring Dashboards:
   - Recommended for comprehensive MLOps monitoring. They let you combine metrics and logs from all relevant GCP services that your MLOps pipeline uses:
      - Vertex AI (training, prediction, pipelines, feature store)
      - Cloud Storage (data transfer, bucket sizes)
      - BigQuery (query performance, data processing)
      - Dataflow/Dataproc (data engineering job health)
      - Cloud Functions/Cloud Run (serverless components)
      - Network (egress costs, latency)

   - **Templates:** Start with Google-provided dashboard templates or create your own from scratch.
    
   - Key Monitoring Areas for MLOps Dashboards:
      - **Pipeline Health:** Success/failure rates of pipeline runs, average run duration.
      - **Model Performance:** Production accuracy, precision, recall (if ground truth available), drift scores.
      - **Serving Performance:** QPS, latency, error rates, resource utilization of prediction endpoints.
      - **Data Quality:** Metrics from data validation steps in pipelines or feature stores.
      - **Resource Utilization:** Overall project-level CPU, GPU, memory, disk usage to monitor costs and capacity.

### Best Practices for Troubleshooting with Dashboards:

   - **Start Broad, Then Drill Down:** Begin with high-level dashboards to identify anomalies. If an issue is detected (e.g., high error rate), drill down into specific logs and detailed metrics for the affected component.
   - **Set Baselines:** Understand normal operational patterns for your metrics so you can quickly identify deviations.
   - **Correlate Events:** Look for correlations between different metrics and log entries. A spike in latency might correlate with an increase in QPS or a specific error message in logs.
   - **Regular Review:** Regularly review your dashboards, even when systems are healthy, to build familiarity and detect subtle changes over time.
   - **Automate Alerts:** Configure alerts for critical thresholds or patterns on your dashboards so you're proactively notified of potential issues.
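
The "Correlate Events" practice above can be checked numerically once you have exported two aligned metric series, for example QPS and latency sampled at the same timestamps. The sketch below computes a plain Pearson correlation; the function name and the sample values are made up for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    qps = [10, 12, 30, 45, 44, 15]              # illustrative values
    latency_ms = [80, 85, 140, 210, 205, 90]    # illustrative values
    print(round(pearson_correlation(qps, latency_ms), 3))
```

A value near 1 suggests latency rises with load, which points toward capacity or autoscaling rather than a code regression. A value near 0 suggests looking at logs for a different cause. Correlation alone does not establish which metric drives the other.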

By effectively utilizing Cloud Logging, Cloud Monitoring, and well-designed dashboards, you can gain deep insights into your Vertex AI workloads, quickly pinpoint problems, and ensure the reliability and performance of your MLOps systems.