# A Developer's Guide to Model Onboarding in AIOS - Qwen3 32B

Welcome to this guide on onboarding the **Qwen3 32B** model to the OpenOS/AIGr.id platform. This notebook serves as a standalone tutorial that walks you through every step of the process, from registering your model as a digital asset to running inference and managing its lifecycle. We will use the `qwen3_32b_llama_cpp` model as our primary example.

## The AIOS Ecosystem: A Brief Overview
- [AIOS Ecosystem Architecture](https://docs.aigr.id/assets/aios-all-arch.drawio.png)

Before we dive in, it helps to understand the key components of the AIOS ecosystem. AIOS is designed as a decentralized, modular, and extensible platform for AI development and deployment. Its architecture consists of several core services that work together to manage the lifecycle of AI models and applications. These include:

- **Asset DB Registry**: A central catalog for discovering and managing versioned, runnable software components (Assets).
- **Resource Allocator**: Responsible for scheduling and allocating resources for running Assets on the network of clusters.
- **Cluster Controller**: Manages the lifecycle of a Block, which is a running instance of an Asset.
- **Metrics System**: A comprehensive system for monitoring the health, status, and performance of Blocks.

This guide will touch on each of these components as we walk through the model onboarding process.

## Onboarding Process Overview

In this guide, we will cover the following steps in detail:

1. **Registering an Asset**: We will create a `component.json` file that describes our model and register it with the AIOS Component Registry.
2. **Allocating a Block**: We will define the block's configuration in an `allocation.json` file and allocate a block using the AIOS API.
3. **Checking Block Status & Metrics**: We will use `curl` commands to check the block's status, health, and metrics.
4. **Performing Inference**: We will write a Python script to send an inference request to the block via gRPC.
5. **Chat UI with streaming**: We will launch an interactive chat interface to demonstrate streaming capabilities.
6. **Cleaning Up**: We will deallocate the block and deregister the asset when done.

Let's get started!

The following image provides a high-level overview of the entire model onboarding process, from defining the asset to monitoring the running service.

<img src="../02_Part1_onboard_gemma3_llama_cpp/onboarding.png" alt="Model Onboarding Process" width="1000" height="1000">

### A note on using prebuilt docker images
The prebuilt Docker images below offer a quick way to test the features of AIOS:
- `MANAGEMENTMASTER:31280/llama4-scout-17b:v1`
- `MANAGEMENTMASTER:31280/gemma3-27b:v1`
- `MANAGEMENTMASTER:31280/deepseek-r1-distill-70b:v1`
- `MANAGEMENTMASTER:31280/magistral-small-2506-llama-cpp:v1`
- `MANAGEMENTMASTER:31280/qwen-3-32b-llama-cpp:v1`

To onboard a custom model instead, refer to:
- [Custom Model Onboarding](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/llm-docs/llm-method-1.md)

## 1. Registering an Asset

First, we need to define our model as an **Asset**. An asset is a static, versioned, and runnable software component that is registered in the AIOS **Asset DB Registry**. The registry acts as a central catalog, allowing developers to discover, share, and reuse assets across the ecosystem. By registering an asset, you are making it available for deployment on the AIOS network.

For a deeper dive into how the asset registry works, you can refer to the official documentation:
- [Asset DB Registry Concepts](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/assets-db-registry/assets-db-registry.md)

Let's look at the `component.json` file for `qwen3_32b_llama_cpp`. This file is the asset's manifest, containing all the essential information AIOS needs. It defines:
- **`componentId`**: The unique name, version, and release tag for the asset (`qwen3-32b-llama_cpp2`).
- **`componentType`**: Specifies that this is a `model` asset.
- **`containerRegistryInfo`**: Points to the container image and includes metadata like the author and a description.
- **`componentMetadata`**: A rich set of details including the model's use case (`chat-completion`), hardware requirements (`gpu`), large context length (128K tokens), performance benchmarks, and known limitations.
- **`componentInitData`**: Specifies the default model files to be loaded, such as the GGUF file for the model (`Qwen3-32B-Q8_0.gguf`).

This file effectively serves as a comprehensive "passport" for the model within the AIOS ecosystem.
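
The identity fields described above can be combined into the URI that later steps use to refer to the asset. The sketch below assumes the URI format `<componentType>.<name>:<version>-<releaseTag>`, which is inferred from the URI `model.qwen3-32b-llama_cpp2:1.0.0-stable` used later in this guide and is not confirmed against the registry documentation. The helper name `component_uri` is hypothetical.

```python
def component_uri(manifest: dict) -> str:
    """Build an asset URI from a component manifest.

    Assumed format (inferred from this guide, not from the registry docs):
    <componentType>.<name>:<version>-<releaseTag>
    """
    cid = manifest["componentId"]
    return f'{manifest["componentType"]}.{cid["name"]}:{cid["version"]}-{cid["releaseTag"]}'


# Only the identity fields of component.json are reproduced here.
manifest = {
    "componentId": {
        "name": "qwen3-32b-llama_cpp2",
        "version": "1.0.0",
        "releaseTag": "stable",
    },
    "componentType": "model",
}

print(component_uri(manifest))  # model.qwen3-32b-llama_cpp2:1.0.0-stable
```

Deriving the URI from the manifest, rather than typing it by hand, helps keep `component.json` and `allocation.json` in agreement.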

In [None]:
!cat component.json

In [1]:
# Register the asset with the AIOS Component Registry
!curl -X POST http://MANAGEMENTMASTER:30112/api/registerComponent -H "Content-Type: application/json" -d @./component.json | json_pp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21935  100  7666  100 14269   585k  1089k --:--:-- --:--:-- --:--:-- 1785k
{
   "error" : false,
   "payload" : {
      "__v" : 0,
      "_id" : "689230146e3a468c79698c8e",
      "componentId" : {
         "name" : "qwen3-32b-llama_cpp2",
         "releaseTag" : "stable",
         "version" : "1.0.0"
      },
      "componentInitData" : {
         "model_name" : "unsloth/Qwen3-32B-GGUF/Qwen3-32B-Q8_0.gguf",
         "system_message" : "You are a helpfull assistant"
      },
      "componentInitParametersProtocol" : {
         "max_tokens" : {
            "description" : "Maximum token to generate (Should be less than n_ctx)",
            "max" : 128000,
            "min" : 1,
            "type" : "number"
         },
         "min_p" : {
            "description" : "Min Probability",
            "max" : 1,
            "min" : 

## 2. Allocating a Block

Now that the asset is registered, we can create a running instance of it. In AIOS, a running instance of an asset is called a **Block**. A Block is the fundamental unit of execution in AIOS, responsible for serving inference requests or running any computational workload defined by an Asset. When you allocate a Block, the AIOS **Resource Allocator** finds a suitable cluster and schedules the Block for execution.

For more details, you can refer to the official documentation:
- [Block Concepts](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md)
- [Block APIs/Services](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md#block-services)

To allocate a block, we must provide an `allocation.json` file. This file is a request to the Resource Allocator, specifying the desired configuration for the block. Let's examine the key fields in our `allocation.json` for the `qwen3_32b_llama_cpp` model. This file tells AIOS:

- **`blockId`**: We are requesting a block named `qwen3-32b-2`.
- **`blockComponentURI`**: The block should be an instance of the `model.qwen3-32b-llama_cpp2:1.0.0-stable` asset we registered earlier.
- **`blockInitData`**: This section provides initial data to the block. Here we specify the `model_name` (`Qwen3-32B-Q8_0.gguf`) and the `system_message`.
- **`initSettings`**: These are settings for the block's runtime. We have settings for GPU usage, device configuration, a context window of 8192 tokens (smaller than the 128K maximum listed in the asset metadata), and model parameters optimized for the Qwen3 architecture.
- **`policyRulesSpec`**: This is a crucial section that defines the chain of policies governing the block's behavior:
    - **`clusterAllocator`**: Selects a specific cluster, in this case, `gcp-cluster-2`.
    - **`resourceAllocator`**: Assigns the block to a specific node (`wc-gpu-node1`) and GPU (`0`).
    - **`loadBalancer`**: Manages how requests are distributed, configured here to cache sessions.
    - **`stabilityChecker`**: Defines a health check mechanism to ensure the block is running correctly.
    - **`autoscaler`**: Enables autoscaling for the block based on GPU utilization.
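
Before submitting the file, it can be worth checking that the fields listed above are actually present. The following is a minimal sketch: it searches the whole document for each field name rather than assuming where a field is nested, because the exact layout of `allocation.json` is not shown in this text. The helper names and the trimmed `sample` (including its `body` wrapper) are hypothetical.

```python
EXPECTED_FIELDS = (
    "blockId",
    "blockComponentURI",
    "blockInitData",
    "initSettings",
    "policyRulesSpec",
)


def contains_key(node, key):
    """Return True if `key` appears anywhere in a nested dict/list structure."""
    if isinstance(node, dict):
        return key in node or any(contains_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(contains_key(v, key) for v in node)
    return False


def missing_fields(allocation):
    """List the expected field names that do not appear anywhere in the document."""
    return [f for f in EXPECTED_FIELDS if not contains_key(allocation, f)]


# Hypothetical, heavily trimmed stand-in for allocation.json.
sample = {
    "body": {  # hypothetical wrapper, used only to show the nested search
        "blockId": "qwen3-32b-2",
        "blockComponentURI": "model.qwen3-32b-llama_cpp2:1.0.0-stable",
        "blockInitData": {"model_name": "Qwen3-32B-Q8_0.gguf"},
    }
}

print(missing_fields(sample))  # ['initSettings', 'policyRulesSpec']
```

To check the real file, load it with `json.load(open("allocation.json"))` and pass the result to `missing_fields`.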

In [None]:
!cat allocation.json

In [2]:
# Now, let's allocate the block using `curl`.
!curl -X POST -d @./allocation.json -H "Content-Type: application/json" http://MANAGEMENTMASTER:30501/api/createBlock

{
  "result": {
    "data": {
      "message": "task scheduled in background",
      "task_id": "c46aa9d1-2146-419f-93f4-7f394a455248"
    },
    "success": true
  },
  "success": true,
  "task_id": ""
}
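
Note that the response carries two `task_id` fields: the top-level one is empty, and the scheduled task's ID sits under `result.data`. A small sketch for extracting it, assuming the response shape shown above (the function name `scheduled_task_id` is hypothetical):

```python
import json


def scheduled_task_id(response_text):
    """Return the background task ID from a createBlock response.

    Assumes the shape shown in this guide: the ID lives under
    result.data.task_id, while the top-level task_id is empty.
    """
    body = json.loads(response_text)
    result = body.get("result") or {}
    if not body.get("success") or not result.get("success"):
        raise RuntimeError(f"createBlock did not report success: {body}")
    return (result.get("data") or {}).get("task_id")


# Response copied from the output above.
response_text = """
{
  "result": {
    "data": {
      "message": "task scheduled in background",
      "task_id": "c46aa9d1-2146-419f-93f4-7f394a455248"
    },
    "success": true
  },
  "success": true,
  "task_id": ""
}
"""

print(scheduled_task_id(response_text))  # c46aa9d1-2146-419f-93f4-7f394a455248
```

Because the task runs in the background, a successful response only means the allocation was scheduled. The health check in the next section shows whether the block actually came up.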


## 3. Checking Block Status and Metrics

After allocating the block, it's important to check its status to ensure it's running correctly. AIOS provides a comprehensive **Metrics System** that exposes endpoints for health status and resource consumption, giving you a complete picture of your block's operational state. The Metrics System is designed to be extensible, allowing you to define and expose custom metrics for your applications.

For more information on the metrics system, see the documentation:
- [Metrics System Overview](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/metrics-system/metrics-system.md#metrics-system) (covered in a later tutorial)
- [Custom Metrics](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/metrics-system/metrics-system.md#custom-metrics) (covered in a later tutorial)

Let's check the health status and metrics of our block.

### Check Block Health

In [6]:
# Check Block Health - This command verifies the health of the running service within the block.
!curl -X GET http://MANAGEMENTMASTER:30201/block/health/qwen3-32b-2

{
  "block_id": "qwen3-32b-2",
  "healthy": true,
  "instances": [
    {
      "healthy": true,
      "instanceId": "executor",
      "reason": "executor instance"
    },
    {
      "healthy": true,
      "instanceId": "in-bwep",
      "lastMetrics": "10.151591777801514s ago"
    }
  ],
  "success": true
}
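
Loading a 32B model can take a while, so the block may not report healthy on the first check. The sketch below polls until the block is ready, assuming the response shape shown above. The helper names are hypothetical, and the `starting` payload is an assumed example of a not-yet-healthy response, not one captured from AIOS.

```python
import json
import time
import urllib.request


def unhealthy_instances(health):
    """Return the instanceIds that are reported as not healthy."""
    return [i.get("instanceId") for i in health.get("instances", []) if not i.get("healthy")]


def is_ready(health):
    """Ready means the call succeeded and the block and all its instances are healthy."""
    return bool(health.get("success")) and bool(health.get("healthy")) and not unhealthy_instances(health)


def fetch_health(block_id, base_url="http://MANAGEMENTMASTER:30201"):
    """Fetch the health document from the same endpoint as the curl command above."""
    with urllib.request.urlopen(f"{base_url}/block/health/{block_id}", timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(fetch, attempts=30, delay_s=10.0, sleep=time.sleep):
    """Call `fetch()` up to `attempts` times, pausing between tries; True once ready."""
    for attempt in range(attempts):
        if is_ready(fetch()):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Offline demonstration with canned responses (no network access needed).
starting = {  # assumed shape for a block that is still loading
    "block_id": "qwen3-32b-2",
    "healthy": False,
    "instances": [{"healthy": False, "instanceId": "executor"}],
    "success": True,
}
running = {  # trimmed copy of the response shown above
    "block_id": "qwen3-32b-2",
    "healthy": True,
    "instances": [
        {"healthy": True, "instanceId": "executor"},
        {"healthy": True, "instanceId": "in-bwep"},
    ],
    "success": True,
}
canned = iter([starting, running])

print(unhealthy_instances(starting))  # ['executor']
print(wait_until_ready(lambda: next(canned), attempts=2, sleep=lambda s: None))  # True
```

Against a live deployment, the call would be `wait_until_ready(lambda: fetch_health("qwen3-32b-2"))`.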


### Check Block Metrics (Before Inference)

#### Core Metric Categories

##### 🔧 **Runtime Metrics**
- **CPU Usage**: Real-time CPU utilization percentage for the block container
- **Memory Usage**: Current memory consumption and peak memory usage
- **GPU Utilization**: GPU usage percentage and VRAM consumption (for GPU-enabled blocks)

##### 📊 **Performance Metrics**
- **Request Latency**: Average response times for inference requests
- **Throughput**: Requests per second (RPS) and tokens per second (TPS)
- **Queue Length**: Number of pending requests in the processing queue

##### 🔄 **Operational Metrics**
- **Request Count**: Total number of requests processed over time
- **Error Rate**: Percentage of failed requests and error types

##### 🏥 **Health Metrics**
- **Uptime**: Total running time since last restart
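
Several of these categories surface as fields on the `executor` instance in the metrics payload. The sketch below pulls out the request count, latency, and uptime, assuming the payload shape that the metrics endpoint returns in this notebook (`data[].instances[]` with `instanceId`, `tasks_processed`, `latency`, and `uptime`). The helper names are hypothetical, and the sample values are copied from this notebook's outputs before and after inference.

```python
def executor_stats(metrics):
    """Return request count, latency, and uptime for the 'executor' instance, or None."""
    for block in metrics.get("data", []):
        for inst in block.get("instances", []):
            if inst.get("instanceId") == "executor":
                return {
                    "tasks_processed_total": inst.get("tasks_processed", {}).get("tasks_processed_total"),
                    "latency": inst.get("latency", {}).get("latency"),
                    "uptime_s": inst.get("uptime"),
                }
    return None


def sample_payload(tasks, latency, uptime):
    """Build a trimmed payload containing only the fields read above."""
    return {
        "data": [
            {
                "blockId": "qwen3-32b-2",
                "instances": [
                    {
                        "instanceId": "executor",
                        "latency": {"latency": latency},
                        "tasks_processed": {"tasks_processed_total": tasks},
                        "uptime": uptime,
                    }
                ],
            }
        ]
    }


before = executor_stats(sample_payload(0, 0, 70.0237765312195))
after = executor_stats(sample_payload(2, 0.000131607055664062, 260.059189558029))

# Number of requests handled between the two snapshots.
print(after["tasks_processed_total"] - before["tasks_processed_total"])  # 2
```

Comparing two snapshots this way is a quick check that inference requests actually reached the block.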

In [7]:
# Check Block Metrics before launching the app
!curl -X GET http://MANAGEMENTMASTER:30201/block/qwen3-32b-2 | json_pp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2589  100  2589    0     0   111k      0 --:--:-- --:--:-- --:--:--  114k
{
   "data" : [
      {
         "blockId" : "qwen3-32b-2",
         "instances" : [
            {
               "blockId" : "qwen3-32b-2",
               "instanceId" : "executor",
               "latency" : {
                  "latency" : 0
               },
               "nodeKey" : "executor__qwen3-32b-2",
               "tasks_processed" : {
                  "tasks_processed_created" : 1754411057.62712,
                  "tasks_processed_total" : 0
               },
               "type" : "app",
               "uptime" : 70.0237765312195,
               "uptime_hours" : 0.0194510490364499,
               "uptime_minutes" : 1.16706294218699
            },
            {
               "blockId" : "qwen3-32b-2",
               "hardware" : {
     

## 4. Performing Inference

Now that the block is running, let's perform an inference task. We'll use a Python script to send a request to the block via gRPC.

In [None]:
# First, let's install the necessary libraries for our gRPC client and import them.
# We also need to add the `inference_client` directory to our Python path to import the generated gRPC files.
!pip install grpcio grpcio-tools protobuf

import sys
sys.path.append('../utils/inference_client')

import grpc
import json
import time

import service_pb2
import service_pb2_grpc

Now, let's define a function to send an inference request to our block. This function will connect to the gRPC server, construct a request packet, and print the response.

In [9]:
def run_inference(block_id, session_id, seq_no, message, generation_config):
    SERVER_ADDRESS = "CLUSTER1MASTER:31500"
    
    # Connect to the gRPC server
    channel = grpc.insecure_channel(SERVER_ADDRESS)
    stub = service_pb2_grpc.BlockInferenceServiceStub(channel)

    data = {
        "mode": "chat",
        "message": message,
        "gen_params": generation_config
    }

    # Create the BlockInferencePacket request
    request = service_pb2.BlockInferencePacket(
        block_id=block_id,
        session_id=session_id,
        seq_no=seq_no,
        data=json.dumps(data),
        ts=time.time()
    )

    try:
        st = time.time()
        # Make the gRPC call
        response = stub.infer(request)
        et = time.time()

        print(f"Latency: {et - st}s")
        print(f"Session ID: {response.session_id}")
        print(f"Sequence No: {response.seq_no}")
        
        # Parse JSON response data
        try:
            response_data = json.loads(response.data)
            print("Data:")
            print(json.dumps(response_data, indent=2))
        except (json.JSONDecodeError, TypeError):
             print(f"Data: {response.data}")

        print(f"Timestamp: {response.ts}")

    except grpc.RpcError as e:
        print(f"gRPC Error: {e.code()} - {e.details()}")
    finally:
        # Release the connection once the call has completed or failed
        channel.close()
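
Before sending a request, the generation parameters can be checked against the bounds the asset declares. This sketch uses only the bounds visible in the (truncated) `componentInitParametersProtocol` output from the registration step: `max_tokens` between 1 and 128000, and `min_p` at most 1. Other parameters are passed through unchecked because their bounds are not shown. The helper name is hypothetical, and whether the block itself enforces these limits is not confirmed here.

```python
# Bounds copied from the registration response earlier in this guide.
# Parameters whose bounds are not visible there are deliberately left out.
PARAM_BOUNDS = {
    "max_tokens": {"min": 1, "max": 128000},
    "min_p": {"max": 1},
}


def check_gen_params(gen_params):
    """Return a list of human-readable problems; an empty list means no known bound is violated."""
    problems = []
    for name, value in gen_params.items():
        bounds = PARAM_BOUNDS.get(name)
        if bounds is None:
            continue  # no known bounds for this parameter
        if "min" in bounds and value < bounds["min"]:
            problems.append(f"{name}={value} is below the minimum {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            problems.append(f"{name}={value} is above the maximum {bounds['max']}")
    return problems


print(check_gen_params({"temperature": 0.1, "top_p": 0.95, "max_tokens": 512}))  # []
print(check_gen_params({"max_tokens": 0}))  # ['max_tokens=0 is below the minimum 1']
```

The registration output also notes that `max_tokens` should be less than `n_ctx`, so the effective upper limit depends on the context size configured for the block.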

In [10]:
generation_config = {
    "temperature": 0.1,
    "top_p": 0.95,
    "max_tokens": 512
}

run_inference(
    block_id="qwen3-32b-2",
    session_id="session_notebook_qwen3-1",
    seq_no=1,
    message="What are the key features and capabilities of the Qwen3 32B model? How does its 128K context window benefit users?",
    generation_config=generation_config
)

Latency: 74.24909567832947s
Session ID: session_notebook_qwen3-1
Sequence No: 1
Data:
{
  "reply": "<think>\nOkay, the user is asking about the key features and capabilities of the Qwen3 32B model and how its 128K context window benefits users. Let me start by recalling what I know about Qwen3. It's a large language model developed by Alibaba, and the 32B refers to the number of parameters, which is 32 billion. That's a significant size, so I should mention that it's a large-scale model with strong language understanding and generation capabilities.\n\nFirst, the key features. I need to list them out. The user probably wants to know what makes Qwen3 stand out. Let me think: parameter count, training data, multilingual support, reasoning abilities, dialogue understanding, code generation, and maybe some specific optimizations. Also, the 128K context window is a big part of the question, so I should highlight that as a key feature.\n\nFor the 128K context window, the benefits would inclu

Let's try a longer, multi-paragraph prompt:

In [11]:
# A somewhat longer message
long_context_message = """
Given the following research paper abstract and methodology section, please analyze and summarize the key findings:

Abstract: Large language models have revolutionized natural language processing, but their computational requirements remain a significant barrier to widespread deployment. Recent advances in model compression, including quantization, pruning, and knowledge distillation, have shown promise in reducing model size while maintaining performance. However, the trade-offs between compression ratio, inference speed, and task performance are not well understood across different model architectures and sizes.

Methodology: We conducted a comprehensive evaluation of compression techniques on models ranging from 7B to 70B parameters across multiple benchmark tasks including reading comprehension, mathematical reasoning, code generation, and multilingual understanding. Our experiments included 8-bit and 4-bit quantization, structured and unstructured pruning at various sparsity levels, and teacher-student distillation with different temperature settings.

Please provide a detailed analysis of what this research likely discovered and its implications for practical AI deployment.
"""

run_inference(
    block_id="qwen3-32b-2",
    session_id="session_notebook_qwen3-2",
    seq_no=1,
    message=long_context_message,
    generation_config=generation_config
)

Latency: 17.778135776519775s
Session ID: session_notebook_qwen3-2
Sequence No: 1
Data:
{
  "reply": "<think>\nOkay, let's see. The user wants me to analyze and summarize the key findings from the given research paper abstract and methodology. The abstract talks about large language models and the challenges with their computational requirements. They mention model compression techniques like quantization, pruning, and knowledge distillation. The methodology section says they evaluated these techniques on models from 7B to 70B parameters across various tasks.\n\nFirst, I need to figure out what the research likely discovered. The abstract mentions that the trade-offs between compression ratio, inference speed, and task performance aren't well understood. So the study probably looked into how different compression methods affect these factors across different model sizes and architectures.\n\nThe methodology includes 8-bit and 4-bit quantization. I know that lower bit quantization reduce

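Both replies above begin with a `<think>` block, in which Qwen3 writes out its reasoning before the final answer. When only the answer is needed, the two parts can be separated. The sketch below assumes the reasoning is wrapped in `<think>...</think>` and treats a reply with no closing tag (for example, one cut off by `max_tokens`) as reasoning only. The function name `split_reasoning` is hypothetical.

```python
def split_reasoning(reply):
    """Split a model reply into (reasoning, answer).

    Assumes reasoning is wrapped in <think>...</think>. If the closing tag
    is missing, everything after <think> is treated as reasoning.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = reply.find(open_tag)
    if start == -1:
        return "", reply.strip()  # no reasoning block at all
    end = reply.find(close_tag, start)
    if end == -1:
        return reply[start + len(open_tag):].strip(), ""  # truncated reply
    reasoning = reply[start + len(open_tag):end].strip()
    answer = (reply[:start] + reply[end + len(close_tag):]).strip()
    return reasoning, answer


print(split_reasoning("<think>\nplan the answer\n</think>\n\nQwen3 32B is ..."))
# ('plan the answer', 'Qwen3 32B is ...')
```

In `run_inference`, this could be applied to `response_data["reply"]` after the JSON is parsed.
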
## 4.a Block Metrics After Inference

In [12]:
# Check Block Metrics after inference
!curl -X GET http://MANAGEMENTMASTER:30201/block/qwen3-32b-2 | json_pp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5897  100  5897    0     0   159k      0 --:--:-- --:--:-- --:--:--  164k
{
   "data" : [
      {
         "blockId" : "qwen3-32b-2",
         "instances" : [
            {
               "blockId" : "qwen3-32b-2",
               "instanceId" : "executor",
               "latency" : {
                  "latency" : 0.000131607055664062
               },
               "nodeKey" : "executor__qwen3-32b-2",
               "tasks_processed" : {
                  "tasks_processed_created" : 1754411057.62712,
                  "tasks_processed_total" : 2
               },
               "type" : "app",
               "uptime" : 260.059189558029,
               "uptime_hours" : 0.0722386637661192,
               "uptime_minutes" : 4.33431982596715
            },
            {
               "blockId" : "qwen3-32b-2",
               "

## 5. Interactive Chat with Streamlit

To provide an interactive way to test the model, we will launch a Streamlit application. The following cell will import the necessary function and launch the app, providing a public URL for you to access.

In [None]:
!pip install streamlit pyngrok nest_asyncio websockets > /dev/null

import sys
import os

# Add the parent directory of 'utils' to the path to find the inference_client
sys.path.append(os.path.abspath('../..'))
sys.path.append(os.path.abspath('../utils/streamlit_app'))

from utils import run_streamlit_direct

# Define the block_id and grpc_server_address
BLOCK_ID = "qwen3-32b-2"
GRPC_SERVER_ADDRESS = "CLUSTER1MASTER:31500"
streamlit_url = run_streamlit_direct(BLOCK_ID, GRPC_SERVER_ADDRESS, port=8501)
print(f"Streamlit App URL: {streamlit_url}")

You can now open the URL above in your browser to interact with the model. Once you are done, you can proceed to the cleanup steps.

## 6. Cleaning Up

After you are finished with the block, it is important to deallocate it to free up resources. You can also deregister the asset if you no longer need it.

### Deallocate the Block
This command will stop the running block and release all associated resources.

The Kubernetes dashboard is available at https://CLUSTER1MASTER:32319/#/login, where you can inspect the block before and after removal.

In [14]:
# Remove the block; check the K8s dashboard before and after to confirm the deletion
!curl -X POST http://MANAGEMENTMASTER:30600/controller/removeBlock/gcp-cluster-2 -H "Content-Type: application/json" -d '{"block_id": "qwen3-32b-2"}' | json_pp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    94  100    67  100    27     12      5  0:00:05  0:00:05 --:--:--    16
{
   "data" : {
      "data" : "Action performed",
      "success" : true
   },
   "success" : true
}


### Deregister the Asset
If you no longer need the asset in the registry, you can remove it.

In [None]:
!curl -X POST http://MANAGEMENTMASTER:30112/api/unregisterComponent -H "Content-Type: application/json" -d '{"uri": "model.qwen3-32b-llama_cpp2:1.0.0-stable"}' | json_pp
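
The two cleanup calls can also be prepared from Python. The sketch below mirrors the `curl` commands above using only the standard library, and builds the requests without sending them so it can be inspected offline. It assumes that the last path segment of the `removeBlock` URL is the ID of the cluster hosting the block (here `gcp-cluster-2`). The helper names are hypothetical.

```python
import json
import urllib.request


def json_post(url, payload):
    """Prepare (but do not send) a POST request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cleanup_requests(cluster_id, block_id, component_uri):
    """Return the removeBlock and unregisterComponent requests, in the order to send them."""
    return [
        json_post(
            f"http://MANAGEMENTMASTER:30600/controller/removeBlock/{cluster_id}",
            {"block_id": block_id},
        ),
        json_post(
            "http://MANAGEMENTMASTER:30112/api/unregisterComponent",
            {"uri": component_uri},
        ),
    ]


pending = cleanup_requests(
    "gcp-cluster-2", "qwen3-32b-2", "model.qwen3-32b-llama_cpp2:1.0.0-stable"
)
for req in pending:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Each prepared request can then be sent with `urllib.request.urlopen(req)`. Removing the block first, then the asset, matches the order used in this section.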

## More sample notebooks for other models can be found [here](https://github.com/OpenCyberspace/AIOS_AI_Blueprints/tree/main/video_tutorial_series/02_more_models_llama_cpp)