# A Developer's Guide to Model Onboarding in AIOS

Welcome to this guide on onboarding a model to the OpenOS/AIGr.id platform. This notebook serves as a standalone tutorial that walks you through every step of the process, from registering your model as a digital asset to running inference and managing its lifecycle. We will use the `deepseek_r1_distill_70b` model as our primary example.

## The AIOS Ecosystem: A Brief Overview

Before we dive in, it's helpful to understand the key components of the AIOS ecosystem. AIOS is designed as a decentralized, modular, and extensible platform for AI development and deployment. Its architecture consists of several core services that work together to manage the lifecycle of AI models and applications. These include:

- **Asset DB Registry**: A central catalog for discovering and managing versioned, runnable software components (Assets).
- **Resource Allocator**: Responsible for scheduling and allocating resources for running Assets on the network of clusters.
- **Block Controller**: Manages the lifecycle of a Block, which is a running instance of an Asset.
- **Metrics System**: A comprehensive system for monitoring the health, status, and performance of Blocks.

Architecture diagram: [AIOS Ecosystem Architecture](https://docs.aigr.id/assets/aios-all-arch.drawio.png)

This guide will touch on each of these components as we walk through the model onboarding process.

## Onboarding Process Overview

In this guide, we will cover the following steps in detail:

1. **Registering an Asset**: We will create a `component.json` file that describes our model and register it with the AIOS Component Registry.
2. **Allocating a Block**: We will define the block's configuration in an `allocation.json` file and allocate a block using the AIOS API.
3. **Checking Block Status & Metrics**: We will use `curl` commands to check the block's status, health, and metrics.
4. **Performing Inference**: We will write a Python script to send an inference request to the block via gRPC.
5. **Chat UI with streaming**: We will launch an interactive chat interface to demonstrate streaming capabilities.
6. **Cleaning Up**: We will deallocate the block and deregister the asset when done.

The following animation provides a high-level overview of the entire model onboarding process.

![Model Onboarding Process](block_onboarding.gif)

Let's get started!

## 1. Registering an Asset

First, we need to define our model as an **Asset**. An asset is a static, versioned, and runnable software component that is registered in the AIOS **Asset DB Registry**. The registry acts as a central catalog, allowing developers to discover, share, and reuse assets across the ecosystem. By registering an asset, you are making it available for deployment on the AIOS network.

For a deeper dive into how the asset registry works, you can refer to the official documentation:
- [Asset DB Registry Concepts](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/assets-db-registry/assets-db-registry.md)

### A note on using prebuilt Docker images
The prebuilt Docker images listed below offer a quick way to test the features of AIOS. To onboard a custom model instead, refer to [Custom Model Onboarding](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/llm-docs/llm-method-1.md).

- `MANAGEMENTMASTER:31280/llama4-scout-17b:v1`
- `MANAGEMENTMASTER:31280/gemma3-27b:v1`
- `MANAGEMENTMASTER:31280/deepseek-r1-distill-70b:v1`
- `MANAGEMENTMASTER:31280/magistral-small-2506-llama-cpp:v1`
- `MANAGEMENTMASTER:31280/qwen-3-32b-llama-cpp:v1`

Let's look at the `component.json` file for `deepseek_r1_distill_70b`. This file is the asset's manifest, containing all the essential information AIOS needs. It defines:
- **`componentId`**: The unique name, version, and release tag for the asset.
- **`componentType`**: Specifies that this is a `model` asset.
- **`containerRegistryInfo`**: Points to the container image (`deepseek-r1-distill-70b:v1`) and includes metadata like the author and a description.
- **`componentMetadata`**: A rich set of details including the model's use case (`chat-completion`), hardware requirements (`gpu`), and performance benchmarks.
- **`componentInitData`**: Specifies the default model files to be loaded.

This file effectively serves as a comprehensive "passport" for the model within the AIOS ecosystem.
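
Before registering, it can help to confirm that the manifest contains the five top-level sections listed above. The following is a minimal sketch, not part of AIOS: the helper names `missing_manifest_keys` and `load_manifest` are hypothetical, and the check covers only the top-level keys named in this guide, not the nested schema.

```python
import json

# Top-level sections this guide describes for component.json.
REQUIRED_KEYS = (
    "componentId",
    "componentType",
    "containerRegistryInfo",
    "componentMetadata",
    "componentInitData",
)


def missing_manifest_keys(manifest: dict) -> list:
    """Return the documented top-level keys that the manifest lacks."""
    return [key for key in REQUIRED_KEYS if key not in manifest]


def load_manifest(path: str) -> dict:
    """Read a manifest file such as component.json from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical, deliberately incomplete manifest used only to exercise the check.
example = {"componentId": {}, "componentType": "model"}
print(missing_manifest_keys(example))
```

In the notebook, `missing_manifest_keys(load_manifest("component.json"))` should return an empty list for a complete manifest.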

In [None]:
!cat component.json

Now, let's register this asset with the AIOS Component Registry using a `curl` command.

In [None]:
!curl -X POST http://MANAGEMENTMASTER:30112/api/registerComponent -H "Content-Type: application/json" -d @./component.json | json_pp
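
The same registration call can also be made from Python, which is convenient when the manifest is generated in code. This sketch uses only the standard library and mirrors the `curl` command above; the function names are hypothetical, and it assumes the endpoint returns JSON, as the `json_pp` pipe above suggests.

```python
import json
import urllib.request

# Same endpoint as the curl command in this guide.
REGISTRY_URL = "http://MANAGEMENTMASTER:30112/api/registerComponent"


def build_register_request(manifest: dict, url: str = REGISTRY_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the manifest."""
    body = json.dumps(manifest).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_component(manifest: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_register_request(manifest)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the payload inspectable before anything reaches the registry.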

## 2. Allocating a Block

Now that the asset is registered, we can create a running instance of it. In AIOS, a running instance of an asset is called a **Block**. A Block is the fundamental unit of execution in AIOS, responsible for serving inference requests or running any computational workload defined by an Asset. When you allocate a Block, the AIOS **Resource Allocator** finds a suitable cluster and schedules the Block for execution.

For more details, you can refer to the official documentation:
- [Block Concepts](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md)

To allocate a block, we must provide an `allocation.json` file. This file is a request to the Resource Allocator, specifying the desired configuration for the block. Let's examine the key fields in our `allocation.json` for the `deepseek_r1_distill_70b` model. This file tells AIOS:

- **`blockId`**: We are requesting a block named `deepseek-r1-distill-70b-block`.
- **`blockComponentURI`**: The block should be an instance of the `model.deepseek-r1-distill-70b:1.0.0-stable` asset we registered earlier.
- **`blockInitData`**: This section provides initial data to the block, such as the `model_name`.
- **`initSettings`**: These are settings for the block's runtime, including `tensor_parallel`, `device` (`cuda`), and `quantization_type`.
- **`policyRulesSpec`**: This is a crucial section that defines the chain of policies governing the block's behavior:
    - **`clusterAllocator`**: Selects a specific cluster, in this case, `gcp-cluster-2`.
    - **`resourceAllocator`**: Assigns the block to a specific node (`wc-gpu-node1`) and GPU (`0`).
    - **`loadBalancer`**: Manages how requests are distributed, configured here to cache sessions.
    - **`stabilityChecker`**: Defines a health check mechanism to ensure the block is running correctly.
    - **`autoscaler`**: Enables autoscaling for the block.
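
The `blockComponentURI` value ties the block back to the registered asset. Judging only from the single example in this guide, the URI appears to follow a `<type>.<name>:<version>-<releaseTag>` layout; that layout is an assumption, not a documented grammar. Under that assumption, a small hypothetical parser can catch typos before the allocation request is sent:

```python
from typing import NamedTuple


class ComponentURI(NamedTuple):
    component_type: str
    name: str
    version: str
    release_tag: str


def parse_component_uri(uri: str) -> ComponentURI:
    """Split '<type>.<name>:<version>-<tag>' (assumed layout) into its parts."""
    type_and_name, colon, version_and_tag = uri.rpartition(":")
    if not colon:
        raise ValueError(f"missing ':' in component URI: {uri!r}")
    component_type, dot, name = type_and_name.partition(".")
    version, dash, release_tag = version_and_tag.rpartition("-")
    if not (dot and dash and component_type and name and version and release_tag):
        raise ValueError(f"unexpected component URI layout: {uri!r}")
    return ComponentURI(component_type, name, version, release_tag)


parsed = parse_component_uri("model.deepseek-r1-distill-70b:1.0.0-stable")
print(parsed)
```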

In [None]:
!cat allocation.json

Now, let's allocate the block using `curl`.

In [None]:
!curl -X POST -d @./allocation.json -H "Content-Type: application/json" http://MANAGEMENTMASTER:30501/api/createBlock

## 3. Checking Block Status and Metrics

After allocating the block, it's important to check its status to ensure it's running correctly. AIOS provides a comprehensive **Metrics System** that exposes endpoints for health, status, and resource consumption, giving you a complete picture of your block's operational state.

For more information on the metrics system, see the documentation:
- [Metrics System Overview](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/metrics-system/metrics-system.md)

Let's check the status, health, and metrics of our block.

### Check Block Health

In [None]:
# Check Block Health - This command verifies the health of the running service within the block.
!curl -X GET http://MANAGEMENTMASTER:30201/block/health/deepseek-r1-distill-70b-block
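
A large model can take a while to load, so a single health check right after allocation may fail. The sketch below polls until a check succeeds or the attempts run out. It treats any HTTP 200 from the health endpoint as "healthy", which is an assumption, since the response body format is not described here; `http_ok` and `wait_until` are hypothetical helper names.

```python
import time
import urllib.request
from typing import Callable

# Same health endpoint as the curl command in this guide.
HEALTH_URL = "http://MANAGEMENTMASTER:30201/block/health/deepseek-r1-distill-70b-block"


def http_ok(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True when the endpoint answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError, connection failures, and socket timeouts.
        return False


def wait_until(
    check: Callable[[], bool],
    attempts: int = 30,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In the notebook, `wait_until(http_ok)` would block for up to roughly five minutes with the default values before giving up.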

### View Block Metrics
We can also check the block's metrics, such as CPU and memory usage, by querying the metrics endpoint.

- [Block Metrics API](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md#block-services)


In [None]:
# Check Block Metrics (before inference)
!curl -X GET http://MANAGEMENTMASTER:30201/block/deepseek-r1-distill-70b-block | json_pp

#### Core Metric Categories

##### 🔧 **Runtime Metrics**
- **CPU Usage**: Real-time CPU utilization percentage for the block container
- **Memory Usage**: Current memory consumption and peak memory usage
- **GPU Utilization**: GPU usage percentage and VRAM consumption (for GPU-enabled blocks)

##### 📊 **Performance Metrics**
- **Request Latency**: Average response times for inference requests
- **Throughput**: Requests per second (RPS) and tokens per second (TPS)
- **Queue Length**: Number of pending requests in the processing queue

##### 🔄 **Operational Metrics**
- **Request Count**: Total number of requests processed over time
- **Error Rate**: Percentage of failed requests and error types

##### 🏥 **Health Metrics**
- **Uptime**: Total running time since last restart
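
To make the performance metrics concrete, here is how average latency, RPS, and TPS relate for a handful of requests observed over a fixed window. This is purely illustrative arithmetic with made-up numbers; it is not how the AIOS Metrics System computes or names its values.

```python
from statistics import mean


def summarize(latencies_s, tokens_out, window_s):
    """Compute average latency, requests/second, and tokens/second for one window."""
    return {
        "avg_latency_s": mean(latencies_s),
        "rps": len(latencies_s) / window_s,
        "tps": sum(tokens_out) / window_s,
    }


# Hypothetical sample: three requests completed within a 10-second window.
summary = summarize(
    latencies_s=[0.8, 1.2, 1.0],
    tokens_out=[120, 180, 150],
    window_s=10.0,
)
print(summary)
```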

## 4. Performing Inference

Now that the block is running, let's perform an inference task. We'll use a Python script to send a request to the block via gRPC.

In [None]:
# First, let's install the necessary libraries for our gRPC client and import them.
# We also need to add the `inference_client` directory to our Python path to import the generated gRPC files.
!pip install grpcio grpcio-tools protobuf

import sys
sys.path.append('../utils/inference_client')

import grpc
import json
import time

import service_pb2
import service_pb2_grpc

Now, let's define a function to send an inference request to our block. This function will connect to the gRPC server, construct a request packet, and print the response.

In [None]:
def run_inference(block_id, session_id, seq_no, message, generation_config, image_url=None):
    SERVER_ADDRESS = "CLUSTER1MASTER:31500"
    
    # Connect to the gRPC server
    channel = grpc.insecure_channel(SERVER_ADDRESS)
    stub = service_pb2_grpc.BlockInferenceServiceStub(channel)

    if image_url:
        data = {
            "mode": "chat",
            "gen_params": generation_config,
            "messages": [{
                "content": [
                    {"type": "text", "text": message},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }]
        }
    else:
        data = {
            "mode": "chat",
            "message": message,
            "gen_params": generation_config
        }

    # Create the BlockInferencePacket request
    request = service_pb2.BlockInferencePacket(
        block_id=block_id,
        session_id=session_id,
        seq_no=seq_no,
        data=json.dumps(data),
        ts=time.time()
    )

    try:
        st = time.time()
        # Make the gRPC call
        response = stub.infer(request)
        et = time.time()

        print(f"Latency: {et - st}s")
        print(f"Session ID: {response.session_id}")
        print(f"Sequence No: {response.seq_no}")
        
        # Parse JSON response data
        try:
            response_data = json.loads(response.data)
            print("Data:")
            print(json.dumps(response_data, indent=2))
        except (json.JSONDecodeError, TypeError):
            print(f"Data: {response.data}")

        print(f"Timestamp: {response.ts}")

    except grpc.RpcError as e:
        print(f"gRPC Error: {e.code()} - {e.details()}")
    finally:
        # Release the channel's network resources once the call completes
        channel.close()

Finally, let's call our function to get a response from the `deepseek-r1-distill-70b-block`.

In [None]:
generation_config = {
    "temperature": 0.1,
    "top_p": 0.95,
    "max_tokens": 512
}

run_inference(
    block_id="deepseek-r1-distill-70b-block",
    session_id="session_notebook_chat-1",
    seq_no=1,
    message="Explain the architectural differences between chat and reasoning models.",
    generation_config=generation_config
)
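
Because each request carries a `session_id` and a `seq_no`, a multi-turn exchange can reuse one session and increment the sequence number per message. The sketch below shows that bookkeeping only. `run_conversation` is a hypothetical helper, and whether the block actually retains earlier turns depends on the model and session configuration, which this guide does not specify.

```python
from typing import Callable, Sequence


def run_conversation(
    send: Callable[..., None],
    block_id: str,
    session_id: str,
    prompts: Sequence[str],
    generation_config: dict,
    start_seq: int = 1,
) -> int:
    """Send prompts in order within one session; return the next unused seq_no."""
    seq_no = start_seq
    for prompt in prompts:
        send(
            block_id=block_id,
            session_id=session_id,
            seq_no=seq_no,
            message=prompt,
            generation_config=generation_config,
        )
        seq_no += 1
    return seq_no
```

In the notebook, passing `run_inference` as `send` would issue one gRPC call per prompt, for example `run_conversation(run_inference, "deepseek-r1-distill-70b-block", "session_notebook_chat-1", ["First question", "Follow-up"], generation_config)`.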

## 4.a Block Metrics After Inference

In [None]:
# Check Block Metrics after running inference
!curl -X GET http://MANAGEMENTMASTER:30201/block/deepseek-r1-distill-70b-block | json_pp
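
Comparing the two metric dumps by eye is tedious. Since the exact metrics schema is not shown in this guide, the sketch below makes no assumptions about field names: it flattens any nested JSON structure and reports the numeric values that changed between two snapshots. The helper names and the sample snapshots are hypothetical.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {'a.b.0': value} pairs."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        # Drop the trailing separator added by the parent level.
        items[prefix[:-1] if prefix else ""] = obj
    return items


def is_number(value) -> bool:
    """True for int or float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def numeric_changes(before, after):
    """Return {path: after - before} for numeric values that differ."""
    flat_before, flat_after = flatten(before), flatten(after)
    changes = {}
    for path, new in flat_after.items():
        old = flat_before.get(path)
        if is_number(old) and is_number(new) and new != old:
            changes[path] = new - old
    return changes


# Hypothetical snapshots; real field names will differ.
before = {"requests": {"count": 0}, "cpu": 1.5, "healthy": True}
after = {"requests": {"count": 1}, "cpu": 1.5, "healthy": True}
print(numeric_changes(before, after))
```

To use it with real data, save the JSON output of the two metrics calls and load each with `json.loads` before passing them in.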

## 5. Interactive Chat with Streamlit

To provide an interactive way to test the model, we will launch a Streamlit application. The following cell will import the necessary function and launch the app, providing a public URL for you to access.

In [None]:
!pip install streamlit pyngrok nest_asyncio websockets > /dev/null

import sys
import os

# Add the directory two levels up and the Streamlit helper directory to the import path
sys.path.append(os.path.abspath('../..'))
sys.path.append(os.path.abspath('../utils/streamlit_app'))

from utils import run_streamlit_direct

# Define the block_id and grpc_server_address
BLOCK_ID = "deepseek-r1-distill-70b-block"
GRPC_SERVER_ADDRESS = "CLUSTER1MASTER:31500"
streamlit_url = run_streamlit_direct(BLOCK_ID, GRPC_SERVER_ADDRESS, port=8501)
print(f"Streamlit App URL: {streamlit_url}")

## 6. Cleaning Up

After you are finished with the block, it is important to deallocate it to free up resources. You can also deregister the asset if you no longer need it.

### Deallocate the Block
This command will stop the running block and release all associated resources.

The Kubernetes dashboard is available at https://CLUSTER1MASTER:32319/#/login.

In [None]:
!curl -X POST http://MANAGEMENTMASTER:30600/controller/removeBlock/gcp-cluster-2 -H "Content-Type: application/json" -d '{"block_id": "deepseek-r1-distill-70b-block"}'

### Deregister the Asset
If you no longer need the asset in the registry, you can remove it.

In [None]:
!curl -X POST http://MANAGEMENTMASTER:30112/api/unregisterComponent -H "Content-Type: application/json" -d '{"uri": "model.deepseek-r1-distill-70b:1.0.0-stable"}' | json_pp
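
When the whole workflow is scripted, it is easy to leave a GPU block running if an inference step raises an exception. One way to guard against that is a context manager that always runs the deallocation step. This is a sketch of the pattern only: `allocated_block` is a hypothetical helper, and the `allocate` and `deallocate` callables stand in for wrappers around the `createBlock` and `removeBlock` requests shown in this guide.

```python
from contextlib import contextmanager
from typing import Callable


@contextmanager
def allocated_block(allocate: Callable[[], None], deallocate: Callable[[], None]):
    """Allocate on entry and always deallocate on exit, even after an error."""
    allocate()
    try:
        yield
    finally:
        deallocate()


# Stub callables that only record the order of events.
events = []
with allocated_block(lambda: events.append("allocate"), lambda: events.append("deallocate")):
    events.append("inference")
print(events)
```

If `allocate` itself fails, nothing was created and `deallocate` is not called, which matches the usual expectation for this pattern.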