## 🧠 Cluster, Node, Block Onboarding - Policies of Control - PART-1

- Author: Shridhar Kini (Profile)
- To Run Securely: Use `jupyter notebook password` to generate a one-time password for secure access
- To Run: `jupyter notebook --allow-root  --port 9999 --ip=0.0.0.0`
- To Clear Outputs: Use `jupyter nbconvert --clear-output --inplace cluster_node_block.ipynb`

This notebook provides an overview of the **Cluster Controller Gateway**, **Cluster Controller**, **Cluster**, **Nodes** and **Blocks** features in the AIOSv1 platform. These features are designed to help users onboard, manage and monitor their clusters, nodes and blocks effectively, and to define policies of control on each of these units.

### ⛩️**I. Cluster Controller Gateway** ([Doc](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/cluster-controller-gateway/cluster-cotroller-gateway.md))
- Multiple clusters can exist in the network.
- The gateway is the interface between the clusters in the network and the users and parsers ([Parser](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/parser/parser.md)).
- Multiple cluster-level functions are executed through this gateway.
- The cluster controller gateway runs as part of the AIOS management services by default, so managing it is the responsibility of the network owner.
- User-level functionalities:
    - Query Cluster Metrics(Cluster ID)/Node metrics(Cluster ID, Node ID)
    - Update/Delete the details of the cluster.
    - Allocate/De-allocate the cluster infrastructure components by providing the cluster Kubernetes config file ID.
    - Add/Remove nodes to the cluster.
    - Create/Delete vDAG controller infrastructure components.
    - Execute block level management commands (like parameter updates, updating policy parameters or SDK level parameters) (block ID needed).
    - Execute cluster controller level management commands (cluster ID needed).
    - Manually scale the block’s instances (block ID and cluster ID needed).

#### **Cluster controller gateway components:**
- API server for internal (parser, search server) and external (user) clients
- Pre-check policy rule executor:
    - pre-check is a policy-rule that will determine whether an action can be allowed or not
    - network admin who manages the management cluster can implement these pre-check rules
        - Actions like `add_node`, `remove_node`, `add_cluster`, `remove_cluster`,  `add_block`, `remove_block`, `update_cluster`, `scale_block`, `block_mgmt`, `cluster_mgmt` etc.
- Block allocator, dry run modules:
    - used internally from the parser
    - needs block’s resource allocation policy rule and cluster DB API client for filtering the cluster.
    - needs cluster controller client APIs for allocating blocks.
    - asynchronous operations: taskID will be returned to the user. (Query task DB API to get the status of the task)
- Cluster infra management module:
    - creating / removing the cluster controller infrastructure
        - metrics DBs
        - Creates the metrics reader and writer services
        - metrics collector daemon-set in all nodes(for HW Metrics of the Node)
        - Creates the cluster controller
        - Configures the services, ingresses and the ports mapping
        - Updates the cluster entry in the DB with the URL details of the cluster controller and the metrics services, so it can be used by other services
- Cluster nodes add/remove module:
    - add/remove nodes to the existing cluster - node onboarding script
    - node onboarding script will be executed in the node to be added
- vDAG controller module:
    - creation, removal and scaling of the vDAG controller on cluster
    - simply forwards the request to the cluster controller on the cluster
- Cluster DB, Cluster metrics DB, Cluster controller client modules:
    - internal client modules that are used by other modules for interacting with cluster controller service, cluster metrics service of the target cluster and the Global cluster DB
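
The pre-check step listed above can be pictured as a function that receives an action name and its input and returns an allow/deny decision. The following is a minimal sketch only: the function name `pre_check`, the `blocked_actions` argument and the shape of the returned dict are hypothetical and do not represent the AIOS policy interface. Treating unlisted actions as denied is also an assumption, since the source list of actions ends with "etc.".

```python
# Hypothetical pre-check rule; NOT the actual AIOS policy interface.
# Action names are copied from the list above (which is not exhaustive).
KNOWN_ACTIONS = {
    "add_node", "remove_node", "add_cluster", "remove_cluster",
    "add_block", "remove_block", "update_cluster", "scale_block",
    "block_mgmt", "cluster_mgmt",
}


def pre_check(action, payload, blocked_actions=frozenset()):
    """Return a decision dict saying whether the action may proceed."""
    if action not in KNOWN_ACTIONS:
        # Assumption: actions the rule does not know about are denied
        return {"allowed": False, "reason": f"unknown action: {action}"}
    if action in blocked_actions:
        return {"allowed": False, "reason": f"action blocked by admin: {action}"}
    return {"allowed": True, "reason": "ok"}


decision = pre_check(
    "remove_cluster",
    {"cluster_id": "gcp-cluster-2"},
    blocked_actions={"remove_cluster"},
)
print(decision)
```

A rule of this kind keeps the decision logic separate from the API handlers, so the network admin could change what is allowed without touching the gateway code.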

#### **Sample JSON**
- **[Cluster Data](sample_jsons/cluster_data.json)**
- **[Cluster Metrics](sample_jsons/cluster_metrics.json)**
- **[Node Data](sample_jsons/node_data.json)**
- **[Block Data](sample_jsons/block_data.json)**
- **[Scale Data](sample_jsons/scale_data.json)**

#### Cluster Controller Gateway APIs
- Cluster entry read, update and delete APIs
    - Cluster read API
    - Update cluster API
    - Delete cluster entry API
- Cluster metrics query APIs
    - Get metrics of cluster
    - Get metrics of a node
- Upscale/downscale instances of a running block
    - Upscale
    - Downscale
- Cluster Infra APIs:
    - Create Your cluster with Master nodes and Worker nodes
        - Follow **[cluster_setup_guide](cluster_setup/cluster_setup_guide.ipynb)**
    - Cluster Registration in AIOS:
        - Once Cluster is created, register the cluster in AIOS DB by using `Parser API`
    - Upon registration, create the cluster infra by providing the kube config file for the operations below
        - Create cluster infrastructure API
        - Remove cluster infrastructure API
- vDAG controller actions:
    - The gateway acts as a proxy for vDAG controller management actions such as create, remove and scale, and for the vDAG controller management API.
    - Sample API:
    - `curl -X POST "http://$GATEWAY_URL/vdag-controller/$CLUSTER_ID" -H "Content-Type: application/json" -d '{ "action": "<action_value>", "payload": <payload_data>}'`
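
The sample call above can also be assembled from Python. This is a minimal sketch: the helper name `build_vdag_controller_request` is hypothetical, and the action value is left as the same placeholder used in the curl command because the valid action names are not listed here.

```python
import json


def build_vdag_controller_request(gateway_url, cluster_id, action, payload):
    """Return (url, body) for the vDAG controller proxy endpoint."""
    url = f"http://{gateway_url}/vdag-controller/{cluster_id}"
    body = json.dumps({"action": action, "payload": payload})
    return url, body


# "<action_value>" is the placeholder from the curl sample, not a real action
url, body = build_vdag_controller_request(
    "MANAGEMENTMASTER:30600", "gcp-cluster-2", "<action_value>", {}
)
print(url)
print(body)
```

The returned URL and body can then be sent with any HTTP client using a `Content-Type: application/json` header, as in the curl sample.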


### **🎮II. Cluster Controller**([Doc](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/cluster-controller/cluster-controller.md#management-command-executor-apis))
- A suite of services deployed by default on each cluster.
- Orchestrates and manages the lifecycle of blocks/vDAGs running within the cluster.
- Handles the onboarding and deboarding of nodes.
- Monitors the health of both nodes and pods, and ensures high availability through automated failure recovery mechanisms.

**To get the port mapping for each service** ([Doc](https://docs.aigr.id/installation/installation/#deploying-registry-services))

In [None]:
# 🔧 Configuration Setup - Run this cell first to set up shared variables
import os

# Set configuration variables that will be available across all cells
GATEWAY_URL = "MANAGEMENTMASTER:30600"
CLUSTER_ID = "gcp-cluster-2"
GLOBAL_CLUSTER_METRICS_DB = "MANAGEMENTMASTER:30202"
GLOBAL_BLOCK_METRICS_DB = "MANAGEMENTMASTER:30201"
PARSER_URL = "MANAGEMENTMASTER:30501"
GLOBAL_CLUSTER_DB = "MANAGEMENTMASTER:30101"
GLOBAL_TASK_DB_SERVICE = "MANAGEMENTMASTER:30108"
COMPONENT_REGISTRY_SERVICE = "MANAGEMENTMASTER:30112"
GLOBAL_BLOCKDB_SERVICE = "MANAGEMENTMASTER:30100"
#SERVER_URL = "10.10.10.10:5000"  # For other API calls

# Set environment variables for bash cells
os.environ['GATEWAY_URL'] = GATEWAY_URL
os.environ['CLUSTER_ID'] = CLUSTER_ID
os.environ['GLOBAL_CLUSTER_METRICS_DB'] = GLOBAL_CLUSTER_METRICS_DB
os.environ['GLOBAL_BLOCK_METRICS_DB'] = GLOBAL_BLOCK_METRICS_DB
os.environ['PARSER_URL'] = PARSER_URL
os.environ['GLOBAL_CLUSTER_DB'] = GLOBAL_CLUSTER_DB
os.environ['GLOBAL_TASK_DB_SERVICE'] = GLOBAL_TASK_DB_SERVICE
os.environ['COMPONENT_REGISTRY_SERVICE'] = COMPONENT_REGISTRY_SERVICE
os.environ['GLOBAL_BLOCKDB_SERVICE'] = GLOBAL_BLOCKDB_SERVICE
#os.environ['SERVER_URL'] = SERVER_URL

print("✅ Configuration variables set:")
print(f"   • GATEWAY_URL: {GATEWAY_URL}")
print(f"   • CLUSTER_ID: {CLUSTER_ID}")
print(f"   • GLOBAL_CLUSTER_METRICS_DB: {GLOBAL_CLUSTER_METRICS_DB}")
print("\n📝 These variables are now available in both Python and bash cells!")
print("   - In Python: use GATEWAY_URL, CLUSTER_ID, GLOBAL_CLUSTER_METRICS_DB, PARSER_URL, GLOBAL_CLUSTER_DB, GLOBAL_TASK_DB_SERVICE")
print("   - In bash: use $GATEWAY_URL, $CLUSTER_ID, $GLOBAL_CLUSTER_METRICS_DB, $PARSER_URL, $GLOBAL_CLUSTER_DB, $GLOBAL_TASK_DB_SERVICE")
os.system('echo $GATEWAY_URL')
os.system('echo $CLUSTER_ID')
os.system('echo $GLOBAL_CLUSTER_METRICS_DB')
os.system('echo $PARSER_URL')
os.system('echo $GLOBAL_CLUSTER_DB')
os.system('echo $GLOBAL_TASK_DB_SERVICE')
os.system('echo $COMPONENT_REGISTRY_SERVICE')
os.system('echo $GLOBAL_BLOCKDB_SERVICE')
os.system('echo $GLOBAL_BLOCK_METRICS_DB')

**Cluster read API**

In [None]:
%%bash 
echo $GATEWAY_URL
echo $CLUSTER_ID
curl -X GET http://$GATEWAY_URL/clusters/read/$CLUSTER_ID | json_pp

**Update cluster API**

In [None]:
%%bash
echo $GATEWAY_URL
echo $CLUSTER_ID
curl -X PATCH http://$GATEWAY_URL/clusters/update/$CLUSTER_ID -H "Content-Type: application/json" -d '{"tags": ["gpu","production","ml","vision","us-central-1","dummy"] }' | json_pp


**Delete cluster entry API**

In [None]:
%%bash
#curl -X DELETE http://$GATEWAY_URL/clusters/delete/<cluster-id> | json_pp

**Get metrics of cluster**

In [None]:
%%bash 
curl -X GET http://$GATEWAY_URL/cluster-metrics/$CLUSTER_ID -H "Content-Type: application/json" | json_pp

In [None]:
%%bash 
curl -X GET http://$GLOBAL_CLUSTER_METRICS_DB/cluster/gcp-cluster-2

**Get metrics of a node**

In [None]:
%%bash
curl -X GET http://$GATEWAY_URL/cluster-metrics/node/$CLUSTER_ID/wc-gpu-node3 -H "Content-Type: application/json" | json_pp


**Upscale**

In [None]:
%%bash
#check details of a block
curl -X GET http://$GLOBAL_BLOCKDB_SERVICE/blocks/magistral-small-2506-llama-cpp-block \
    -H "Content-Type: application/json" | json_pp

In [None]:
%%bash
curl -X GET http://$GLOBAL_BLOCK_METRICS_DB/block/demo-magistral-block-llama-cpp | json_pp

In [None]:
%%bash
# create 2 more instances of the block
curl -X POST "http://$GATEWAY_URL/controller/block-scaling/gcp-cluster-2" \
     -H "Content-Type: application/json" \
     -d '{
           "operation": "scale",
           "block_id": "magistral-small-2506-llama-cpp-block",
           "instances_count": 2,
           "allocation_data": [
               {
                   "node_id": "wc-gpu-node2",
                   "gpu_ids": [0, 1]
               },
               {
                   "node_id": "wc-gpu-node1",
                   "gpu_ids": [0]
               }
           ]
         }' | json_pp
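
The upscale body above pairs `instances_count` with a list of per-instance allocations. A small builder can keep the two in step. This is a sketch under the assumption that `instances_count` should equal the number of `allocation_data` entries, which matches the example but is not confirmed here; the helper name `build_scale_payload` is hypothetical.

```python
import json


def build_scale_payload(block_id, allocation_data):
    """Build an upscale body with one allocation entry per new instance."""
    for entry in allocation_data:
        if "node_id" not in entry or "gpu_ids" not in entry:
            raise ValueError("each allocation entry needs node_id and gpu_ids")
    return {
        "operation": "scale",
        "block_id": block_id,
        # Assumption: the count always matches the number of allocations
        "instances_count": len(allocation_data),
        "allocation_data": allocation_data,
    }


payload = build_scale_payload(
    "magistral-small-2506-llama-cpp-block",
    [
        {"node_id": "wc-gpu-node2", "gpu_ids": [0, 1]},
        {"node_id": "wc-gpu-node1", "gpu_ids": [0]},
    ],
)
print(json.dumps(payload, indent=2))
```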

**DownScale**

In [None]:
%%bash
curl -X POST "http://$GATEWAY_URL/controller/block-scaling/gcp-cluster-2" \
     -H "Content-Type: application/json" \
     -d '{
           "operation": "downscale",
           "block_id": "magistral-small-2506-llama-cpp-block",
           "instances_list": ["in-zr2p","in-1cbr"]
         }' | json_pp

### **Cluster Controller Creation**

**Register Cluster With Parser API**

In [None]:
%%bash 
curl -X POST http://$PARSER_URL/api/createCluster \
  -H "Content-Type: application/json" \
  -d @cluster_onboarding_to_aios/cluster-1.json

**To Query Cluster**

In [None]:
%%bash
curl http://$GLOBAL_CLUSTER_DB/clusters/gcp-demo-cluster | json_pp

**To Update Any Cluster Info**

In [None]:
%%bash
curl -X PUT http://$GLOBAL_CLUSTER_DB/clusters/gcp-demo-cluster -H "Content-Type: application/json" -d '{
    "config.publicHostname":  "DEMOMASTERNODE",
    "config.urlMap" : {         
         "blocksQuery" : "http://DEMOMASTERNODE:32302",
         "controllerService" : "http://DEMOMASTERNODE:32300",
         "membershipServer" : "http://DEMOMASTERNODE:30501",
         "metricsService" : "http://DEMOMASTERNODE:32301",
         "parameterUpdater" : "http://DEMOMASTERNODE:32303",
         "publicGateway" : [
            "DEMOMASTERNODE:32000"
         ]
      }
}'
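
Every URL in the update body above shares one hostname, so the body can be generated from the hostname and a port table. A minimal sketch: the port numbers are copied from the request above, while the helper name `build_cluster_update` is hypothetical.

```python
import json

# Port numbers copied from the update request above
SERVICE_PORTS = {
    "blocksQuery": 32302,
    "controllerService": 32300,
    "membershipServer": 30501,
    "metricsService": 32301,
    "parameterUpdater": 32303,
}
PUBLIC_GATEWAY_PORT = 32000


def build_cluster_update(hostname):
    """Return the update body for a cluster whose services share one host."""
    url_map = {
        name: f"http://{hostname}:{port}" for name, port in SERVICE_PORTS.items()
    }
    # publicGateway is a list of host:port strings without a scheme
    url_map["publicGateway"] = [f"{hostname}:{PUBLIC_GATEWAY_PORT}"]
    return {"config.publicHostname": hostname, "config.urlMap": url_map}


update = build_cluster_update("DEMOMASTERNODE")
print(json.dumps(update, indent=2))
```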

### Get Kube Config Data for Cluster Controller Gateway APIs and Create Cluster Infra


In [None]:
import yaml
import requests

DEFAULT_KUBE_CONFIG_PATH = "cluster_setup/master_node/config.yaml"
API_URL = "http://MANAGEMENTMASTER:30600/create-cluster-infra"


def create_cluster_infra(cluster_id, kube_config_data=None):
  # Fall back to the default kubeconfig file when no data is passed in
  if kube_config_data is None:
    with open(DEFAULT_KUBE_CONFIG_PATH, "r") as f:
      kube_config_data = yaml.safe_load(f)
  payload = {
    "cluster_id": cluster_id,
    "kube_config_data": kube_config_data
  }
  response = requests.post(API_URL, json=payload)
  print(response.text, response.status_code)
  if response.status_code == 200:
    print("Cluster infra creation scheduled successfully:", response.json())
  else:
    # The error body may not be JSON, so print the raw text
    print("Error creating cluster infra:", response.text)

create_cluster_infra("gcp-demo-cluster", kube_config_data=None)

**On the Master Node: Check Pods**

- `kubectl get pods -n controllers`
- `kubectl get pods -n metrics`

**Create cluster infrastructure API**

In [None]:
%%bash
curl -X POST "http://$GATEWAY_URL/create-cluster-infra" \
    -H "Content-Type: application/json" \
    -d '{
    "cluster_id": "gcp-demo-cluster",
    "kube_config_data": "c3BlY2lhbC1lbmNvZGVkLWt1YmUtY29uZmlnLWRhdGE="
    }' | json_pp
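
The `kube_config_data` value in the request above is a Base64 string that decodes to the placeholder text `special-encoded-kube-config-data`. Whether the gateway expects Base64 text or the parsed YAML object used in the Python cell is not stated here, so the following is only a sketch of producing and checking such a string; the helper names are hypothetical.

```python
import base64


def encode_kube_config(raw_bytes):
    """Base64-encode raw kubeconfig bytes into an ASCII string."""
    return base64.b64encode(raw_bytes).decode("ascii")


def decode_kube_config(encoded):
    """Reverse of encode_kube_config; returns the original bytes."""
    return base64.b64decode(encoded)


# Placeholder content; a real call would read the kubeconfig file as bytes
sample = b"special-encoded-kube-config-data"
encoded = encode_kube_config(sample)
print(encoded)  # c3BlY2lhbC1lbmNvZGVkLWt1YmUtY29uZmlnLWRhdGE=
```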

**Remove cluster infrastructure API**

In [None]:
%%bash
curl -X POST "http://$GATEWAY_URL/remove-cluster-infra" \
-H "Content-Type: application/json" \
-d '{
        "cluster_id": "gcp-demo-cluster",
        "kube_config_data": {}
    }' | json_pp

In [None]:
import yaml
import requests

DEFAULT_KUBE_CONFIG_PATH = "cluster_setup/master_node/config.yaml"
API_URL = "http://MANAGEMENTMASTER:30600/remove-cluster-infra"


def remove_cluster_infra(cluster_id, kube_config_data=None):
  # Fall back to the default kubeconfig file when no data is passed in
  if kube_config_data is None:
    with open(DEFAULT_KUBE_CONFIG_PATH, "r") as f:
      kube_config_data = yaml.safe_load(f)
  payload = {
    "cluster_id": cluster_id,
    "kube_config_data": kube_config_data
  }
  response = requests.post(API_URL, json=payload)
  print(response.text, response.status_code)
  if response.status_code == 200:
    print("Cluster infra removal scheduled successfully:", response.json())
  else:
    # The error body may not be JSON, so print the raw text
    print("Error removing cluster infra:", response.text)

remove_cluster_infra("gcp-demo-cluster", kube_config_data=None)

**Query With TaskID**

In [None]:
%%bash
curl -X GET http://$GLOBAL_TASK_DB_SERVICE/task/72656caa-6d84-4a79-a6ee-4804406e1d2f
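
Because infra creation and removal are asynchronous, the task query above usually has to be repeated until the task finishes. The loop below is a minimal sketch of that pattern. The fetch step is passed in as a function so the loop stays independent of the HTTP client, and the `status` field with its `pending`/`complete` values is hypothetical, since the task record format is not shown here.

```python
import time


def poll_task(fetch_task, is_finished, interval_seconds=2.0, max_attempts=30):
    """Call fetch_task() until is_finished(task) is true or attempts run out."""
    for attempt in range(max_attempts):
        task = fetch_task()
        if is_finished(task):
            return task
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(f"task not finished after {max_attempts} attempts")


# Demo with a fake task source; field name and values are hypothetical
fake_responses = iter(
    [{"status": "pending"}, {"status": "pending"}, {"status": "complete"}]
)
result = poll_task(
    lambda: next(fake_responses),
    lambda task: task["status"] == "complete",
    interval_seconds=0,
)
print(result)
```

In real use, `fetch_task` would wrap a GET request to the task DB endpoint shown above and return the parsed response.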

**Remove Cluster Entry in DB**

In [None]:
%%bash
#curl -X DELETE http://$GLOBAL_CLUSTER_DB/clusters/gcp-demo-cluster
 

## **Create Blocks in Cluster(With Basic Policies)**

#### **Register Components**

In [None]:
%%bash 
curl -X POST http://$COMPONENT_REGISTRY_SERVICE/api/registerComponent \
  -H "Content-Type: application/json" \
  -d @./demo_block/component_demo_magistral.json | json_pp

#### **Unregister Components**

In [None]:
%%bash
curl -X POST http://$COMPONENT_REGISTRY_SERVICE/api/unregisterComponent \
  -H "Content-Type: application/json" \
  -d '{"uri":"model.demo-magistral-llama_cpp:1.0.0-stable"}' | json_pp

#### **Deploy Block**

In [None]:
%%bash
curl -X POST http://$PARSER_URL/api/createBlock \
 -H "Content-Type: application/json" \
 -d @./demo_block/allocation-demo-mistral.json | json_pp

#### **Get Block Details**

In [None]:
%%bash
curl -X GET http://$GLOBAL_BLOCKDB_SERVICE/blocks/demo-magistral-block-llama-cpp \
    -H "Content-Type: application/json" | json_pp

#### **Create Inference Server For The Cluster**
- Run these commands on a cluster node (for example, the master node)

- `kubectl create namespace inference-server`
- `kubectl create -f inference_server/inference_server.yaml`

#### **Do the Inference**

In [None]:
%%bash
curl -X POST  http://DEMOMASTERNODE:31504/v1/infer \
  -H "Content-Type: application/json" \
  -d '{
  "model": "demo-magistral-block-llama-cpp",
  "session_id": "session-2",
  "seq_no": 16,
  "data": {
    "mode": "chat",
    "gen_params": {
      "temperature": 0.1,
      "top_p": 0.95,
      "max_tokens": 4096
    },
    "message": "Give me code for adding two integers list element wise in python",
    "system_message": "You are a helpful assistant that provides code examples."
  },
  "graph": {},
  "selection_query": {
    
  }
}'
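
The inference body above can be assembled with a small helper, which is convenient when sending several messages in one session. A sketch only: the helper name `build_infer_request` is hypothetical, the defaults mirror the values in the request above, and incrementing `seq_no` by one per message is an assumption rather than a documented requirement.

```python
import json


def build_infer_request(model, session_id, seq_no, message, system_message="",
                        temperature=0.1, top_p=0.95, max_tokens=4096):
    """Assemble a JSON body with the same fields as the /v1/infer call above."""
    return {
        "model": model,
        "session_id": session_id,
        "seq_no": seq_no,
        "data": {
            "mode": "chat",
            "gen_params": {
                "temperature": temperature,
                "top_p": top_p,
                "max_tokens": max_tokens,
            },
            "message": message,
            "system_message": system_message,
        },
        "graph": {},
        "selection_query": {},
    }


# Assumption: seq_no increases by one for each message in a session
requests_to_send = [
    build_infer_request("demo-magistral-block-llama-cpp", "session-2", seq, text)
    for seq, text in enumerate(["First question", "Second question"], start=16)
]
print(json.dumps(requests_to_send[0], indent=2))
```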

#### **Reassigning a Deployed instance**

In [None]:
%%bash
curl -X GET http://$GLOBAL_BLOCK_METRICS_DB/block/demo-magistral-block-llama-cpp | json_pp

In [None]:
%%bash 

curl -X POST "http://$GATEWAY_URL/controller/reassign-instance/gcp-demo-cluster" \
     -H "Content-Type: application/json" \
     -d '{
           "blockId": "demo-magistral-block-llama-cpp",
           "instanceId": "in-zavw",
           "extra_data": {
            "allocation_data":{
                "node_id": "demo-worker-node2",
                "gpu_ids": [0,1]
               }
           }
         }'
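
Note that the reassign body above uses camelCase keys (`blockId`, `instanceId`) while the scaling and remove calls use `block_id`. A helper makes that difference harder to get wrong. This is a minimal sketch with a hypothetical helper name; the key names are copied from the request above.

```python
import json


def build_reassign_payload(block_id, instance_id, node_id, gpu_ids):
    """Build the reassign-instance body; top-level keys are camelCase."""
    return {
        "blockId": block_id,
        "instanceId": instance_id,
        "extra_data": {
            "allocation_data": {
                "node_id": node_id,
                "gpu_ids": list(gpu_ids),
            }
        },
    }


reassign = build_reassign_payload(
    "demo-magistral-block-llama-cpp", "in-zavw", "demo-worker-node2", [0, 1]
)
print(json.dumps(reassign, indent=2))
```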

#### **Remove Block**

In [None]:
%%bash
curl -X POST http://$GATEWAY_URL/controller/removeBlock/gcp-demo-cluster \
    -H "Content-Type: application/json" \
    -d '{"block_id": "demo-magistral-block-llama-cpp"}'