## 🧠 Model Splitting Across Nodes in AIOS Using PyTorch & Transformers


- Author: Shridhar Kini (Profile)
- To Securely Run: Use `jupyter notebook password` first to set a login password for the notebook server
- To Run: `jupyter notebook --allow-root  --port 9999 --ip=0.0.0.0 part-1_model_splitting_using_pytorch.ipynb`
- To Clear Outputs: Use `jupyter nbconvert --clear-output --inplace part-1_model_splitting_using_pytorch.ipynb`

This notebook gives an overview of model splitting across nodes in AIOS using library support from PyTorch and Transformers. The demo is for users whose models are too large to fit on a single GPU or node. Models are split with the helper code in **split-sdk**.

### ⛩️**I. Model Splitting (Parallelism) Strategies**

#### Tensor Parallelism (TP)
- Splits a single operation (such as a matmul) across multiple devices
    - A large matrix multiplication is split across GPUs
        - Typical targets are the feed-forward network and attention blocks
        - Example: a weight matrix W can be split by columns into [W_1, W_2, W_3, W_4] across 4 GPUs
            - Each GPU multiplies the input X by its W_i to produce Y_i
            - A high-speed communication collective (such as all-gather) collects the Y_i
            - The gathered Y_i are concatenated to form the final output
            - All GPUs work on the same layer (W) at the same time
    - Pros ✅:
        - High GPU Utilization: Keeps all participating GPUs busy, avoiding the idle "bubbles" seen in pipeline parallelism.
        - Low Latency: Excellent for inference, as it speeds up the forward pass of individual large layers.
    - Cons ❌:
        - High Communication Overhead: Requires very frequent, high-bandwidth communication between GPUs. It is almost exclusively used within a single node connected by ultra-fast interconnects like NVIDIA's NVLink.
        - Limited Scalability: Does not scale well beyond a small number of GPUs (typically 8 or 16) on a single server.
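
The column-split example above can be sketched on the CPU with plain Python lists. This is only an illustration under simplifying assumptions: each "GPU" is a list entry, the helper names (`matmul`, `split_columns`, `all_gather_columns`) are made up for this sketch, and no real collective communication takes place.

```python
# Column-parallel matmul simulated on the CPU (no GPUs, no real collectives).

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def split_columns(w, parts):
    """Split w into `parts` contiguous column blocks of equal width."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly in this sketch"
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

def all_gather_columns(shards):
    """Concatenate the per-device outputs Y_i along the column axis."""
    return [sum((shard[i] for shard in shards), []) for i in range(len(shards[0]))]

X = [[1, 2], [3, 4]]                       # input, 2 x 2
W = [[1, 0, 2, 1, 0, 3, 1, 1],
     [0, 1, 1, 2, 3, 0, 1, 2]]             # weights, 2 x 8

W_shards = split_columns(W, parts=4)               # W_1..W_4, one per simulated GPU
Y_shards = [matmul(X, w_i) for w_i in W_shards]    # Y_i = X @ W_i on each device
Y = all_gather_columns(Y_shards)                   # gather and concatenate

# The gathered result matches the unsplit multiplication.
assert Y == matmul(X, W)
print(Y)
```

Every simulated device needs the full input X but only a quarter of W, which is why the per-device memory for weights drops while communication is needed after each split layer.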

#### Pipeline Parallelism (PP)
- Splits the entire model into sequential chunks of layers (stages) and places each chunk on a different GPU.
    - Similar to Assembly Line
        - Each GPU processes a different micro-batch of data at the same time
        - Output of one stage (x number of layers) from one GPU is the input to the next stage
        - When one GPU finishes processing its micro-batch, it sends the output to the next GPU and starts processing the next micro-batch
    - Pros ✅:
        - Lower Communication Volume: Only needs to pass activations between adjacent GPUs, which is less data-intensive than the collective operations TP runs inside every split layer.
        - Scales Across Nodes: Can work effectively across multiple servers connected by slower networking like Ethernet, making it ideal for scaling to a huge number of GPUs.
    - Cons ❌:
        - The "Pipeline Bubble": It's hard to keep the pipeline perfectly full. The first GPU will be idle at the end of a batch, and the last GPU is idle at the beginning, leading to wasted compute cycles.
        - Load Balancing is Hard: The layers must be carefully divided so that each GPU has roughly the same amount of work to do.
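
The assembly-line behaviour and the bubble can be made concrete with a small simulation. This is a rough sketch, assuming a forward-only pipeline in which every stage takes exactly one time step per micro-batch; the function names are illustrative and not part of split-sdk.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return {time_step: [(stage, microbatch), ...]} for a forward-only pipeline."""
    schedule = {}
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            # Stage s can start micro-batch m only after stage s-1 has finished it,
            # so the pair (s, m) runs at time step s + m.
            schedule.setdefault(stage + mb, []).append((stage, mb))
    return schedule

def bubble_fraction(num_stages, num_microbatches):
    """Share of (stage, time_step) slots in which a stage sits idle."""
    total_steps = num_stages + num_microbatches - 1
    busy_slots = num_stages * num_microbatches
    return 1 - busy_slots / (num_stages * total_steps)

schedule = forward_schedule(num_stages=4, num_microbatches=8)
for step in sorted(schedule):
    print(step, schedule[step])

print(f"idle fraction with 8 micro-batches: {bubble_fraction(4, 8):.3f}")
print(f"idle fraction with 1 micro-batch:   {bubble_fraction(4, 1):.3f}")
```

Under these assumptions the idle share equals (stages - 1) / (stages + micro-batches - 1), so feeding more micro-batches through the same pipeline shrinks the bubble.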

#### Fully Sharded Data Parallelism (FSDP)
- Mainly used for training, since it shards the model weights, gradients, and optimizer state across GPUs
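
The sharding idea can be illustrated with a toy sketch: a flat list of parameters is padded, cut into equal per-rank shards, and rebuilt with an all-gather when the full tensor is needed. The helpers `shard` and `all_gather` are hypothetical names for this illustration; real FSDP performs these steps on GPU tensors through PyTorch's distributed collectives.

```python
def shard(flat_params, world_size):
    """Pad the flat parameter list and cut it into equal per-rank shards."""
    shard_len = -(-len(flat_params) // world_size)      # ceiling division
    padding = shard_len * world_size - len(flat_params)
    padded = flat_params + [0.0] * padding
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

def all_gather(shards, original_len):
    """Rebuild the full parameter list from every rank's shard, dropping padding."""
    full = [value for s in shards for value in s]
    return full[:original_len]

params = [0.1 * i for i in range(10)]       # 10 parameters
shards = shard(params, world_size=4)        # each rank stores 3 values

print([len(s) for s in shards])
assert all_gather(shards, len(params)) == params
```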

### Concept of RANK, LOCAL_RANK, WORLD_SIZE

Think of your multi-node setup as a team of workers assigned to a large project. Each worker (GPU) has a specific role (rank) and works on a portion of the task (data parallelism).
- `World Size`: This is the total number of workers (processes/GPUs) in your team. 
    - For example, if you have 2 nodes with 8 GPUs each, your World Size is 2 times 8 = 16.
- `Rank`: This is the unique identifier for each worker in the team. 
    - It helps in distinguishing between different workers. 
    - For instance, in a setup with 16 GPUs, their ranks would be from 0 to 15.
- `Local Rank`: Rank of a worker within its local node.
    - If each node has 8 GPUs, the local ranks within a single node run from 0 to 7.
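
The relationship between the three values can be written out for the 2-node, 8-GPU example. This sketch assumes ranks are numbered node by node, which is a common launcher convention; in the deployment request later in this notebook, ranks are assigned explicitly per entry instead. The function `global_rank` is an illustrative name.

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank when ranks are numbered node by node (an assumed convention)."""
    return node_rank * gpus_per_node + local_rank

num_nodes, gpus_per_node = 2, 8
world_size = num_nodes * gpus_per_node

all_ranks = []
for node_rank in range(num_nodes):
    for local_rank in range(gpus_per_node):
        rank = global_rank(node_rank, local_rank, gpus_per_node)
        all_ranks.append(rank)
        print(f"node={node_rank} local_rank={local_rank} rank={rank} world_size={world_size}")
```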

**To get the port mapping for each service** ([Doc](https://docs.aigr.id/installation/installation/#deploying-registry-services))

In [None]:
# 🔧 Configuration Setup - Run this cell first to set up shared variables
import os

# Set configuration variables that will be available across all cells
GATEWAY_URL = "MANAGEMENTMASTER:30600"
CLUSTER_ID = "gcp-cluster-2"
GLOBAL_CLUSTER_METRICS_DB = "MANAGEMENTMASTER:30202"
GLOBAL_BLOCK_METRICS_DB = "MANAGEMENTMASTER:30201"
PARSER_URL = "MANAGEMENTMASTER:30501"
GLOBAL_CLUSTER_DB = "MANAGEMENTMASTER:30101"
GLOBAL_TASK_DB_SERVICE = "MANAGEMENTMASTER:30108"
COMPONENT_REGISTRY_SERVICE = "MANAGEMENTMASTER:30112"
GLOBAL_BLOCKDB_SERVICE = "MANAGEMENTMASTER:30100"
GCP_CLUSTER_2_INFERENCE_SERVER = "CLUSTER2_MASTER_IP:31504"
#SERVER_URL = "10.10.10.10:5000"  # For other API calls

# Set environment variables for bash cells
os.environ['GATEWAY_URL'] = GATEWAY_URL
os.environ['CLUSTER_ID'] = CLUSTER_ID
os.environ['GLOBAL_CLUSTER_METRICS_DB'] = GLOBAL_CLUSTER_METRICS_DB
os.environ['GLOBAL_BLOCK_METRICS_DB'] = GLOBAL_BLOCK_METRICS_DB
os.environ['PARSER_URL'] = PARSER_URL
os.environ['GLOBAL_CLUSTER_DB'] = GLOBAL_CLUSTER_DB
os.environ['GLOBAL_TASK_DB_SERVICE'] = GLOBAL_TASK_DB_SERVICE
os.environ['COMPONENT_REGISTRY_SERVICE'] = COMPONENT_REGISTRY_SERVICE
os.environ['GLOBAL_BLOCKDB_SERVICE'] = GLOBAL_BLOCKDB_SERVICE
os.environ['GCP_CLUSTER_2_INFERENCE_SERVER'] = GCP_CLUSTER_2_INFERENCE_SERVER
#os.environ['SERVER_URL'] = SERVER_URL

print("✅ Configuration variables set:")
print(f"   • GATEWAY_URL: {GATEWAY_URL}")
print(f"   • CLUSTER_ID: {CLUSTER_ID}")
print(f"   • GLOBAL_CLUSTER_METRICS_DB: {GLOBAL_CLUSTER_METRICS_DB}")
print("\n📝 These variables are now available in both Python and bash cells!")
print("   - In Python: use GATEWAY_URL, CLUSTER_ID, GLOBAL_CLUSTER_METRICS_DB, PARSER_URL, GLOBAL_CLUSTER_DB, GLOBAL_TASK_DB_SERVICE")
print("   - In bash: use $GATEWAY_URL, $CLUSTER_ID, $GLOBAL_CLUSTER_METRICS_DB, $PARSER_URL, $GLOBAL_CLUSTER_DB, $GLOBAL_TASK_DB_SERVICE")
os.system('echo $GATEWAY_URL')
os.system('echo $CLUSTER_ID')
os.system('echo $GLOBAL_CLUSTER_METRICS_DB')
os.system('echo $PARSER_URL')
os.system('echo $GLOBAL_CLUSTER_DB')
os.system('echo $GLOBAL_TASK_DB_SERVICE')
os.system('echo $COMPONENT_REGISTRY_SERVICE')
os.system('echo $GLOBAL_BLOCKDB_SERVICE')
os.system('echo $GLOBAL_BLOCK_METRICS_DB')
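
Because the later bash cells depend on these variables, it can help to check that each one is set and looks like `host:port` before running them. The helper below is a hypothetical convenience, not part of AIOS; it only checks the shape of the value and does not contact any service.

```python
import os

def check_endpoints(names, env=None):
    """Return {name: value} for variables that are unset or not of the form host:port."""
    env = os.environ if env is None else env
    problems = {}
    for name in names:
        value = env.get(name, "")
        host, sep, port = value.rpartition(":")
        if not (sep and host and port.isdigit()):
            problems[name] = value
    return problems

# Demonstration with a sample mapping; in the notebook, omit `env` to read os.environ.
sample_env = {"GATEWAY_URL": "MANAGEMENTMASTER:30600", "PARSER_URL": "MANAGEMENTMASTER"}
print(check_endpoints(["GATEWAY_URL", "PARSER_URL", "GLOBAL_CLUSTER_DB"], env=sample_env))
```

An empty dictionary means every listed variable passed the shape check.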

### Code Sample for Splitting and Generating Response ([Code](model_splitting/split-sdk))

##### Main Code Files
- aios_transformers/apis.py - Exposes the API for model inferencing
- aios_transformers/sdk.py - Contains the core logic for model splitting and distributed inference
- aios_transformers/metrics.py - Helper for metrics

##### To Build the Model Splitter Docker Image

In [None]:
%%bash
bash Part-1/split-sdk/build_docker.bash

##### Push the Built Docker Image to the Registry

In [None]:
%%bash
docker push MANAGEMENTMASTER:31280/example/split-runner:demo

##### **Deploy the Model Splitter for inferencing**
**To learn more about these APIs, see the source** ([Link](https://github.com/opencyber-space/AIGr.id/blob/main/services/applications/model_splits/core/apis.py))

In [None]:
%%bash
curl -X POST http://MANAGEMENTMASTER:30160/splits/create \
-H "Content-Type: application/json" \
-d '{
  "rank_0_cluster_id": "gcp-cluster-2",
  "cluster_id": [
    "gcp-cluster-2"
  ],
  "deployment_name": "phi-128k-2",
  "nnodes": 4,
  "common_params": {
    "model_name": "microsoft/Phi-3-mini-128k-instruct",
    "image": "MANAGEMENTMASTER:31280/example/split-runner:demo",
    "master_port": 3000
  },
  "per_rank_params": [
    {
      "rank": 0,
      "node_id": "wc-gpu-node2",
      "nccl_socket_ifname": "eth0",
      "nvidia_visible_devices": "0",
      "cluster_id": "gcp-cluster-2",
      "cuda_visible_devices": "0"
    },
    {
      "rank": 1,
      "node_id": "wc-gpu-node2",
      "nccl_socket_ifname": "eth0",
      "nvidia_visible_devices": "0,1",
      "cuda_visible_devices": "1",
      "cluster_id": "gcp-cluster-2"
    },
    {
      "rank": 2,
      "node_id": "wc-gpu-node3",
      "nccl_socket_ifname": "eth0",
      "nvidia_visible_devices": "0",
      "cuda_visible_devices": "0",
      "cluster_id": "gcp-cluster-2"
    },
    {
      "rank": 3,
      "node_id": "wc-gpu-node3",
      "nccl_socket_ifname": "eth0",
      "nvidia_visible_devices": "1",
      "cuda_visible_devices": "0,1",
      "cluster_id": "gcp-cluster-2"
    }
  ],
  "multi_cluster": false,
  "platform": "torch"
}'
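
A request of this size is easy to get wrong by hand, so a few sanity checks before sending it may save a failed deployment. The checker below is a sketch: the field names come from the request above, but the rules it applies (rank numbering, entry count equal to `nnodes`, matching cluster IDs) are assumptions inferred from that example, not the service's documented validation.

```python
def check_split_request(payload):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    ranks = payload.get("per_rank_params", [])

    if len(ranks) != payload.get("nnodes"):
        problems.append("nnodes does not match the number of per_rank_params entries")

    if sorted(r.get("rank", -1) for r in ranks) != list(range(len(ranks))):
        problems.append("ranks must be 0..N-1 with no gaps or duplicates")

    known_clusters = set(payload.get("cluster_id", []))
    for r in ranks:
        if r.get("cluster_id") not in known_clusters:
            problems.append(f"rank {r.get('rank')} uses an unlisted cluster_id")

    rank0 = next((r for r in ranks if r.get("rank") == 0), None)
    if rank0 is not None and rank0.get("cluster_id") != payload.get("rank_0_cluster_id"):
        problems.append("rank 0 is not placed in rank_0_cluster_id")

    return problems

# Trimmed two-rank payload that reuses the field names from the request above.
payload = {
    "rank_0_cluster_id": "gcp-cluster-2",
    "cluster_id": ["gcp-cluster-2"],
    "nnodes": 2,
    "per_rank_params": [
        {"rank": 0, "node_id": "wc-gpu-node2", "cluster_id": "gcp-cluster-2"},
        {"rank": 1, "node_id": "wc-gpu-node3", "cluster_id": "gcp-cluster-2"},
    ],
}
print(check_split_request(payload))
```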

##### To check the k8s pods (run on the target cluster's control plane)

In [None]:
%%bash
kubectl get pods -n splits

#### **Create Proxy-AIOS-Block**

- The proxy block runs inference against rank 0 of the split model
- The proxy block can hold logic such as chat history
- The proxy block reaches the model through a cluster-internal, namespace-based URL. If that is not possible, the model split pod's Service has to be exposed as a NodePort instead of a ClusterIP
    - f"{deploymentNameOfSplittingService}-rank-master.svc.cluster.local:8080/generate"

##### Build Docker Image

In [None]:
%%bash 
bash Part-1/block/block-client/build_docker.bash

##### Push the Docker Image to registry

In [None]:
%%bash
docker push MANAGEMENTMASTER:31280/pytorch-split-client:latest

##### Register the component

In [None]:
%%bash
curl -X POST http://$COMPONENT_REGISTRY_SERVICE/api/registerComponent \
 -d @Part-1/block/component.json \
 -H "Content-Type: application/json" | json_pp
  

##### Unregister Component

In [None]:
%%bash
curl -X POST http://$COMPONENT_REGISTRY_SERVICE/api/unregisterComponent \
  -H "Content-Type: application/json" \
  -d '{"uri":"model.pytorch-runner:1.0.0-stable"}' | json_pp

##### Deploy the Block

In [None]:
%%bash
curl -X POST -d @Part-1/block/block.json \
 -H "Content-Type: application/json" \
  http://$PARSER_URL/api/createBlock | json_pp

##### To Get the Block Logs

In [None]:
%%bash
kubectl logs -f pytorch-block-3-in-v66d-d57fc6d5c-v9x8j -n blocks instance

##### **Create Inference Server For The Cluster**
- Run these commands on a node in your cluster (for example, the master node)
    - `kubectl create namespace inference-server`
    - `kubectl create -f inference_server/inference_server.yaml`

##### **Do the Inference**

In [None]:
%%bash
curl -X POST  http://CLUSTER2_MASTER_IP:31504/v1/infer \
  -H "Content-Type: application/json" \
  -d '{
  "model": "pytorch-block-3",
  "session_id": "session-3",
  "seq_no": 20,
  "data": {
    "mode": "completions",
    "generation_config": {
                "max_new_tokens": 1024,
                "do_sample": false,
                "top_k": 50,
                "top_p": 0.95,
                "temperature": 1.0
            },
    "message": "Give me code for adding two integers list element wise in python",
    "system_message": "You are a helpful assistant that provides code examples."
  },
  "graph": {},
  "selection_query": {
    
  }
}' | json_pp
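
The same request can be prepared from Python when the notebook needs to vary the prompt or the sequence number. This is a sketch using only the standard library: `build_infer_request` is a hypothetical helper, the field names are copied from the curl call above, and the request is built but not sent.

```python
import json
import urllib.request

def build_infer_request(server, model, session_id, seq_no, message, system_message):
    """Build (but do not send) a POST request for the /v1/infer endpoint."""
    payload = {
        "model": model,
        "session_id": session_id,
        "seq_no": seq_no,
        "data": {
            "mode": "completions",
            "generation_config": {
                "max_new_tokens": 1024,
                "do_sample": False,
                "top_k": 50,
                "top_p": 0.95,
                "temperature": 1.0,
            },
            "message": message,
            "system_message": system_message,
        },
        "graph": {},
        "selection_query": {},
    }
    return urllib.request.Request(
        f"http://{server}/v1/infer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_infer_request(
    server="CLUSTER2_MASTER_IP:31504",
    model="pytorch-block-3",
    session_id="session-3",
    seq_no=20,
    message="Give me code for adding two integers list element wise in python",
    system_message="You are a helpful assistant that provides code examples.",
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=300).read()
```

If the server forwards `generation_config` to Transformers' `generate` unchanged (an assumption), `do_sample: false` selects greedy decoding, and the `top_k`, `top_p`, and `temperature` values are not used.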

#### Clean Up

##### Delete Model Split

In [None]:
%%bash
curl -X DELETE http://MANAGEMENTMASTER:30160/splits/delete/phi-128k-2

##### Delete the Block

In [None]:
%%bash
curl -X POST http://$GATEWAY_URL/controller/removeBlock/gcp-cluster-2 \
    -H "Content-Type: application/json" \
    -d '{"block_id": "pytorch-block-3"}'
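
Both clean-up calls can also be prepared together so they run in a fixed order. The sketch below uses only the standard library and mirrors the two curl commands above; `build_cleanup_requests` is a hypothetical helper, and the requests are built without being sent.

```python
import json
import urllib.request

def build_cleanup_requests(splits_host, deployment_name, gateway_url, cluster_id, block_id):
    """Return [delete-split request, remove-block request], in the order used above."""
    delete_split = urllib.request.Request(
        f"http://{splits_host}/splits/delete/{deployment_name}",
        method="DELETE",
    )
    remove_block = urllib.request.Request(
        f"http://{gateway_url}/controller/removeBlock/{cluster_id}",
        data=json.dumps({"block_id": block_id}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return [delete_split, remove_block]

requests_to_send = build_cleanup_requests(
    splits_host="MANAGEMENTMASTER:30160",
    deployment_name="phi-128k-2",
    gateway_url="MANAGEMENTMASTER:30600",
    cluster_id="gcp-cluster-2",
    block_id="pytorch-block-3",
)
for r in requests_to_send:
    print(r.get_method(), r.full_url)
# To send them: for r in requests_to_send: urllib.request.urlopen(r, timeout=60)
```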