# AIOS Block Health Check Tutorial

This notebook shows how to check the health of your AIOS blocks and how to create custom health check policies.

## Overview

The AIOS platform includes a health checking system that reports the status of your blocks and their instances in real time. Monitoring this status helps keep your AI services reliable and available.

In this tutorial, we will cover:

1.  **Basic Health Checks**: How to use the AIOS API to perform basic health checks on your blocks.
2.  **Interpreting Health Status**: Understanding the JSON response from the health check API.
3.  **Custom Health Check Policies**: A step-by-step guide to creating your own custom health check policies for more advanced monitoring scenarios.

## 1. Basic Health Checks

You can easily check the health of any block by sending a GET request to the AIOS metrics service. The endpoint is structured as follows:

`http://<metrics-service-ip>:<port>/block/health/<block-id>`
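As a minimal sketch, the URL can be assembled from its three parts with a small helper. The function name `build_health_url` and its parameters are illustrative and not part of the AIOS API; the host and port shown are the ones used in this notebook and will differ in other deployments.

```python
from urllib.parse import quote


def build_health_url(host: str, port: int, block_id: str) -> str:
    """Assemble the block health endpoint URL from its parts (hypothetical helper)."""
    # quote() percent-encodes characters that are unsafe in a URL path segment;
    # safe="" also encodes "/" so a block ID cannot introduce extra path segments.
    return f"http://{host}:{port}/block/health/{quote(block_id, safe='')}"


print(build_health_url("MANAGEMENTMASTER", 30201, "gemma3-27b-block"))
```

Keeping the host and port as parameters makes it easier to point the same notebook at a different metrics service.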

Let's try this with a live example for the `gemma3-27b-block`.

```python
import json

import requests

# Define the health check URL for your block
health_check_url = "http://MANAGEMENTMASTER:30201/block/health/gemma3-27b-block"

try:
    # Send the GET request; the timeout keeps the cell from hanging
    # indefinitely if the metrics service is unreachable
    response = requests.get(health_check_url, timeout=10)
    response.raise_for_status()  # Raise an exception for 4xx/5xx status codes

    # Parse and pretty-print the JSON response
    health_data = response.json()
    print(json.dumps(health_data, indent=2))

except requests.exceptions.RequestException as e:
    # Covers connection errors, timeouts, and HTTP error statuses
    print(f"Error fetching health status: {e}")
```
Output:

```json
{
  "block_id": "gemma3-27b-block",
  "healthy": true,
  "instances": [
    {
      "healthy": true,
      "instanceId": "executor",
      "reason": "executor instance"
    },
    {
      "healthy": true,
      "instanceId": "in-vkkv",
      "lastMetrics": "12.959992408752441s ago"
    }
  ],
  "success": true
}
```


## 2. Interpreting the Health Status

The JSON response from the health check API provides a detailed breakdown of the block's health. Let's look at the key fields:

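-   **`block_id`**: The ID of the block being checked.
-   **`healthy`**: A boolean value indicating the overall health of the block. This is `true` only if all instances are healthy.
-   **`instances`**: An array of objects, each representing an instance of the block.
    -   **`healthy`**: A boolean value indicating the health of the specific instance.
    -   **`instanceId`**: The ID of the instance (e.g., `executor`, `in-vkkv`).
    -   **`reason`** or **`lastMetrics`**: Provides additional context about the instance's status. In the example response, the `executor` entry carries a `reason` (`executor instance`), while the `in-vkkv` entry carries `lastMetrics`, which indicates how long ago metrics were last reported (`12.959992408752441s ago`).
-   **`success`**: A top-level boolean that is `true` in the example response. It likely signals that the health query itself completed, as distinct from whether the block is healthy.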

This detailed breakdown is useful for quickly identifying which specific instance might be causing issues within a block.
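The fields described above can be inspected programmatically. The following is a minimal sketch that works on a copy of the sample response; the helper names `unhealthy_instances` and `metrics_age_seconds` are illustrative, and the `lastMetrics` format (`<seconds>s ago`) is assumed from that single sample, so other formats simply yield `None`.

```python
import re

# Sample response copied from the health check output shown earlier.
sample = {
    "block_id": "gemma3-27b-block",
    "healthy": True,
    "instances": [
        {"healthy": True, "instanceId": "executor", "reason": "executor instance"},
        {"healthy": True, "instanceId": "in-vkkv", "lastMetrics": "12.959992408752441s ago"},
    ],
    "success": True,
}


def unhealthy_instances(health_data: dict) -> list:
    """Return the IDs of instances whose `healthy` flag is missing or false."""
    return [
        inst.get("instanceId", "<unknown>")
        for inst in health_data.get("instances", [])
        if not inst.get("healthy", False)
    ]


def metrics_age_seconds(instance: dict):
    """Parse a `lastMetrics` value such as "12.9s ago" into seconds, else None."""
    # Assumed format: a decimal number followed by "s ago".
    match = re.fullmatch(r"(\d+(?:\.\d+)?)s ago", instance.get("lastMetrics", ""))
    return float(match.group(1)) if match else None


print("Unhealthy instances:", unhealthy_instances(sample))
for inst in sample["instances"]:
    print(inst["instanceId"], "metrics age (s):", metrics_age_seconds(inst))
```

For the sample response, the list of unhealthy instances is empty, and only the `in-vkkv` entry yields a metrics age because the `executor` entry has no `lastMetrics` field.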

## 3. Custom Health Check Policies

For more advanced scenarios, you can create custom health check policies. This allows you to define your own logic for determining whether a block is healthy. For example, you could create a policy that checks for:

-   GPU temperature
-   Specific log messages
-   The age of the last processed request
-   Any other custom metric you expose
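
To make the idea concrete, here is a rough sketch of the kind of decision logic such a policy might encode, written as plain Python. It is not the AIOS policy interface: the function name `evaluate_instance`, the metric keys `gpu_temp_c` and `seconds_since_last_request`, and the threshold values are all hypothetical and chosen only for illustration. The actual policy structure is defined in the official documentation.

```python
def evaluate_instance(metrics: dict,
                      max_gpu_temp_c: float = 85.0,
                      max_idle_s: float = 60.0) -> dict:
    """Apply two illustrative health rules to one instance's metrics.

    All metric keys and thresholds here are hypothetical placeholders.
    """
    reasons = []

    # Rule 1: flag the instance if the GPU is hotter than the threshold.
    gpu_temp = metrics.get("gpu_temp_c")
    if gpu_temp is not None and gpu_temp > max_gpu_temp_c:
        reasons.append(f"GPU temperature {gpu_temp} C exceeds {max_gpu_temp_c} C")

    # Rule 2: flag the instance if the last processed request is too old.
    idle = metrics.get("seconds_since_last_request")
    if idle is not None and idle > max_idle_s:
        reasons.append(f"last request {idle}s ago exceeds {max_idle_s}s")

    # The instance is healthy only if no rule produced a reason.
    return {"healthy": not reasons, "reasons": reasons}


print(evaluate_instance({"gpu_temp_c": 70.0, "seconds_since_last_request": 5.0}))
print(evaluate_instance({"gpu_temp_c": 91.5, "seconds_since_last_request": 120.0}))
```

Returning the reasons alongside the boolean mirrors the `healthy` and `reason` fields seen in the health check response, which makes a failing instance easier to diagnose.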

### Creating a Custom Health Check Policy

Here is a step-by-step guide to creating a custom health check policy, based on the [official documentation](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md#block-health-checker).
