# A Developer's Guide to Block Load Balancing in AIOS

Welcome to this guide on the **Block Load Balancer** feature in AIOS. The load balancer is a critical policy that distributes incoming inference requests across multiple block replicas. This optimizes performance, ensures high availability, and makes efficient use of your computational resources.

Below is a visual overview of the load balancing process, where incoming requests are intelligently distributed across available model replicas.

<img src="loadbalancer.png" alt="Custom LoadBalancer" width="800" height="800"> 

For a deep dive into the concepts, you can refer to the official documentation:
- [Block Concepts](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md)
- [Block Load Balancer Concepts](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md#block-load-balancer-executor)

The process involves:
1.  **Building a Load Balancer Policy**: Defining the logic for request distribution in a Python class.
2.  **Testing the Policy**: Running unit tests to ensure the distribution logic is correct.
3.  **Deploying the Policy**: Uploading and registering the policy with AIOS.
4.  **Triggering and Observing Distribution**: Simulating a workload and monitoring the system to see the load balancer in action.

## The AIOS Policy System

Before we build our load balancer, it's important to understand the **AIOS Policy System**. Policies are pluggable, versioned, and reusable modules that control the runtime behavior of Blocks. They allow you to customize how AIOS manages resources, routes traffic, and ensures stability without modifying the core application code.

AIOS uses a chain of policies to manage a Block's lifecycle. Common policy types include:
- **`clusterAllocator`**: Selects a cluster to run the block.
- **`resourceAllocator`**: Assigns specific nodes and hardware.
- **`loadBalancer`**: Distributes incoming requests among replicas.
- **`stabilityChecker`**: Monitors the health of a block.
- **`autoscaler`**: Automatically adjusts the number of block replicas.

In this tutorial, we will focus on creating a custom `loadBalancer` policy. For more information on the policy system, see the documentation:
- [Policies System Overview](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/policies-system/policies-system.md)

## 1. Building a Token-Based Load Balancer Policy

Our load balancer will use a **token-based distribution** strategy. This is particularly effective for LLMs, where the number of tokens in a request is a good proxy for the computational cost. The policy will score each available replica based on its recent token workload and route new requests to the replica with the lowest score (i.e., the least busy).

### The Logic of Token-Based Distribution
The core idea is to:
- **Calculate a "load score"** for each replica as a weighted sum of its input and output tokens per minute.
- **Assign a higher weight to output tokens**, as generation is typically more computationally intensive than processing the input prompt.
- **Select the replica with the lowest score** for the next incoming request, as it is presumed to be the least loaded.
- **Implement session stickiness**, ensuring that all requests within the same session are routed to the same replica to maintain context.

This approach intelligently distributes the load based on the actual work each replica is doing, leading to better overall throughput and latency.
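
As a quick worked example of the scoring idea, the sketch below applies the tutorial's default weights (0.1 for input tokens, 0.9 for output tokens) to three replicas. The function name `load_score` and the token rates are illustrative assumptions, chosen to match the figures used later in the unit test.

```python
# Worked example of the load score, using the tutorial's default weights.
INPUT_TOKEN_WEIGHT = 0.1
OUTPUT_TOKEN_WEIGHT = 0.9


def load_score(input_tokens_per_min: float, output_tokens_per_min: float) -> float:
    """Weighted sum of recent input and output token rates (hypothetical helper)."""
    return (INPUT_TOKEN_WEIGHT * input_tokens_per_min
            + OUTPUT_TOKEN_WEIGHT * output_tokens_per_min)


# Illustrative per-replica rates: (input tokens/min, output tokens/min).
replicas = {
    "instance-1": (100, 1000),
    "instance-2": (50, 500),
    "instance-3": (200, 1500),
}

scores = {name: load_score(i, o) for name, (i, o) in replicas.items()}
least_loaded = min(scores, key=scores.get)

print(scores)        # roughly 910, 455 and 1370 for the three replicas
print(least_loaded)  # instance-2
```

Because output tokens carry nine times the weight of input tokens here, a replica that is busy generating long responses scores much higher than one that is only ingesting long prompts.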

The policy is defined within a class named `AIOSv1PolicyRule`. We will build this class step-by-step, explaining each method.

In [1]:
import random
import logging
from typing import Dict, Any
import inspect
import textwrap


### The `AIOSv1PolicyRule` Class and `__init__` Method

This is the main class for our policy. The `__init__` method initializes the policy's state. It sets up logging and retrieves parameters like token weights and the averaging period from the policy's configuration. It also gets a reference to the `get_metrics` function, which is crucial for fetching the data needed to make decisions.

In [2]:
class AIOSv1PolicyRule:
    def __init__(self, rule_id, settings, parameters):
        self.rule_id = rule_id
        self.settings = settings
        self.parameters = parameters
        self.logger = logging.getLogger(f"TokenBasedDistributionPolicy-{self.rule_id}")

        self.output_token_weight = self.parameters.get("output_token_weight", 0.9)
        self.input_token_weight = self.parameters.get("input_token_weight", 0.1)
        self.averaging_period = self.parameters.get("averaging_period", "average_1m")
        self.allow_random_fallback = self.parameters.get("allow_random_fallback", True)

        self.metrics_function = self.settings.get("get_metrics")
        self.block_data = self.settings.get("block_data")
        self.cluster_data = self.settings.get("cluster_data")
        
        self.session_ids_cache = {}
        self.current_instances = []

### The `_calculate_weighted_tokens` Method

This helper method calculates a score for a given block instance based on its recent token usage. It combines the input and output tokens, applying the weights defined in our parameters, to produce a single score representing the instance's current load.

In [9]:
class AIOSv1PolicyRule(AIOSv1PolicyRule): 
    def _calculate_weighted_tokens(self, instance_metrics: Dict[str, Any]) -> float:
        input_tokens_rolling = instance_metrics.get("llm_input_tokens_per_minute_rolling", {})
        output_tokens_rolling = instance_metrics.get("llm_output_tokens_per_minute_rolling", {})
    
        input_tokens = input_tokens_rolling.get(self.averaging_period, 0)
        output_tokens = output_tokens_rolling.get(self.averaging_period, 0)
    
        return (self.input_token_weight * input_tokens) + (self.output_token_weight * output_tokens)
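
One consequence of the `.get(..., 0)` defaults is worth seeing in isolation: a metrics record that carries no rolling token data scores zero, so it looks idle. The standalone sketch below reproduces the calculation outside the class; `weighted_tokens` and the sample records are illustrative assumptions, with the record shape taken from the unit test later in this guide.

```python
from typing import Any, Dict


def weighted_tokens(metrics: Dict[str, Any], period: str = "average_1m",
                    w_in: float = 0.1, w_out: float = 0.9) -> float:
    """Standalone copy of the scoring rule, for experimentation only."""
    rolling_in = metrics.get("llm_input_tokens_per_minute_rolling", {})
    rolling_out = metrics.get("llm_output_tokens_per_minute_rolling", {})
    # Missing windows fall back to 0, exactly as in the policy method.
    return w_in * rolling_in.get(period, 0) + w_out * rolling_out.get(period, 0)


# A replica that reports both rolling windows.
reporting = {
    "instanceId": "instance-1",
    "llm_input_tokens_per_minute_rolling": {"average_1m": 100},
    "llm_output_tokens_per_minute_rolling": {"average_1m": 1000},
}

# A hypothetical replica whose record has no token windows yet.
silent = {"instanceId": "instance-4"}

print(weighted_tokens(reporting))
print(weighted_tokens(silent))  # 0, so this replica would look the least loaded
```

If freshly started replicas report empty windows in your deployment, they will attract new traffic first. Whether that is desirable depends on how quickly a replica becomes ready to serve.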

### The `_select_instance` Method

This method is responsible for choosing the best instance to handle a new request. It fetches the latest metrics for all available instances, uses `_calculate_weighted_tokens` to score each one, and then selects the instance with the *lowest* score. If no metrics are available, it can fall back to selecting an instance at random.

In [10]:
class AIOSv1PolicyRule(AIOSv1PolicyRule): 
    def _select_instance(self) -> str:
        if not self.current_instances:
            self.logger.warning("No instances available for routing.")
            return None
    
        current_metrics = self.metrics_function()
        block_metrics = current_metrics.get("block_metrics", [])
        
        instance_scores = {}
        if block_metrics:
            self.logger.info(f"Calculating scores for instances with metrics: {self.current_instances}")
            for instance_metric in block_metrics:
                instance_id = instance_metric.get("instanceId")
                if instance_id and instance_id in self.current_instances:
                    score = self._calculate_weighted_tokens(instance_metric)
                    instance_scores[instance_id] = score
                    self.logger.info(f"  - Instance '{instance_id}': score = {score:.2f}")
    
        if not instance_scores:
            self.logger.warning("No instance scores calculated from metrics.")
            if self.allow_random_fallback and self.current_instances:
                chosen_instance = random.choice(self.current_instances)
                self.logger.info(f"Randomly selected instance: '{chosen_instance}' as fallback is enabled.")
                return chosen_instance
            else:
                self.logger.warning("Cannot select an instance: No scores and random fallback is disabled or no instances available.")
                return None
        else:
            chosen_instance = min(instance_scores, key=instance_scores.get)
            self.logger.info(f"Final scores: {instance_scores}. Chosen instance with lowest score: '{chosen_instance}'")
            return chosen_instance
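
The selection rule can be exercised without AIOS by reducing it to a pure function. The sketch below is a simplified stand-in, not the policy itself: `pick_instance` is a hypothetical name, and it takes precomputed scores instead of calling `get_metrics`.

```python
import random
from typing import Dict, List, Optional


def pick_instance(scores: Dict[str, float], instances: List[str],
                  allow_random_fallback: bool = True) -> Optional[str]:
    """Lowest score wins; with no usable scores, optionally pick at random."""
    # Only scores that belong to currently known instances are eligible.
    eligible = {iid: score for iid, score in scores.items() if iid in instances}
    if eligible:
        return min(eligible, key=eligible.get)
    if allow_random_fallback and instances:
        return random.choice(instances)
    return None


live = ["instance-1", "instance-2", "instance-3"]

# instance-3 has no score, so it is never considered while others have one.
print(pick_instance({"instance-1": 910.0, "instance-2": 455.0}, live))  # instance-2

# No scores at all: the fallback decides.
print(pick_instance({}, live, allow_random_fallback=False))  # None
```

Note the asymmetry this exposes: an instance that is in the instance list but absent from the metrics response receives no new sessions as long as at least one other instance has a score.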

### The `eval` Method

This is the main entry point for the policy's logic. It's called by the AIOS Load Balancer for each incoming request. It first checks for "session stickiness"—if the request is part of an existing session, it routes it to the same instance as before. If it's a new request, it calls `_select_instance` to find the best destination. Finally, it caches the decision for new sessions and returns the chosen instance ID.

In [11]:
class AIOSv1PolicyRule(AIOSv1PolicyRule):
    def eval(self, parameters: Dict[str, Any], input_data: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        self.logger.info(f"Starting token-based distribution evaluation. Input data keys: {list(input_data.keys())}")
        try:
            if not callable(self.metrics_function):
                self.logger.error("'get_metrics' function not found. Cannot determine instances or metrics.")
                return {"instance_id": None, "reason": "Metrics function not configured."}
            
            latest_instances = input_data.get("instances", [])
    
            if set(latest_instances) != set(self.current_instances):
                self.logger.info(f"Instance list changed from {self.current_instances} to {latest_instances}. Updating session cache.")
                
                stale_sessions = [
                    session_id for session_id, instance_id in self.session_ids_cache.items()
                    if instance_id not in latest_instances
                ]
                
                if stale_sessions:
                    self.logger.info(f"Removing stale sessions from cache: {stale_sessions}")
                    for session_id in stale_sessions:
                        del self.session_ids_cache[session_id]
    
                self.current_instances = latest_instances
    
            if not self.current_instances:
                self.logger.warning("No instances available for routing.")
                return {"instance_id": None, "reason": "No available instances."}
    
            packet = input_data.get("packet")
            if packet:
                session_id = getattr(packet, "session_id", None)
                if session_id and session_id in self.session_ids_cache:
                    cached_instance = self.session_ids_cache[session_id]
                    if cached_instance in self.current_instances:
                        self.logger.info(f"Session '{session_id}' found in cache. Routing to instance '{cached_instance}'.")
                        return {"instance_id": cached_instance}
                    else:
                        self.logger.info(f"Cached instance '{cached_instance}' for session '{session_id}' is no longer active. Removing from cache.")
                        del self.session_ids_cache[session_id]
    
            chosen_instance = self._select_instance()
    
            if not chosen_instance:
                self.logger.error("Failed to select an instance.")
                return {"instance_id": None, "reason": "Instance selection failed."}
    
            if packet and getattr(packet, "session_id", None):
                session_id = packet.session_id
                self.logger.info(f"Caching session '{session_id}' to instance '{chosen_instance}'.")
                self.session_ids_cache[session_id] = chosen_instance
    
            return {"instance_id": chosen_instance}
    
        except Exception as e:
            self.logger.exception(f"An unexpected error occurred during evaluation: {e}")
            if self.current_instances and self.allow_random_fallback:
                chosen_instance = random.choice(self.current_instances)
                self.logger.warning(f"Error occurred. Falling back to random instance: {chosen_instance}")
                return {"instance_id": chosen_instance}
            
            return {"instance_id": None, "reason": "An error occurred and random fallback is disabled."}
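
The session-stickiness part of `eval` is easier to reason about when separated from metrics and logging. The following is a minimal sketch of the same idea under simplifying assumptions: `StickySessions` is a hypothetical class, and the `choose` callback stands in for `_select_instance`.

```python
from typing import Callable, Dict, List, Optional


class StickySessions:
    """Minimal session-to-instance cache (illustrative, not the AIOS policy)."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}

    def prune(self, live_instances: List[str]) -> None:
        # Drop every session pinned to an instance that no longer exists.
        live = set(live_instances)
        self.cache = {s: i for s, i in self.cache.items() if i in live}

    def route(self, session_id: Optional[str], live_instances: List[str],
              choose: Callable[[], Optional[str]]) -> Optional[str]:
        self.prune(live_instances)
        if session_id and session_id in self.cache:
            return self.cache[session_id]
        chosen = choose()
        if chosen and session_id:
            self.cache[session_id] = chosen
        return chosen


sticky = StickySessions()
both = ["instance-1", "instance-2"]

first = sticky.route("session-123", both, lambda: "instance-2")
second = sticky.route("session-123", both, lambda: "instance-1")
after_scale_down = sticky.route("session-123", ["instance-1"], lambda: "instance-1")

print(first, second, after_scale_down)  # instance-2 instance-2 instance-1
```

The second call ignores the selector entirely because the session is already pinned. The third call shows the trade-off: when the pinned replica disappears, the session is re-routed and any context held on the old replica is lost.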

### The `management` Method

This method handles administrative tasks and special commands that aren't part of the regular request evaluation flow. For example, it can respond to health checks or pre-assign an instance for a long-lived streaming session before the first request even arrives.

In [12]:
class AIOSv1PolicyRule(AIOSv1PolicyRule):
    def management(self, action: str, data: dict) -> dict:
        self.logger.info(f"Management action received: {action} with data: {data}")
        if action == "health_check":
            return {"instances": self.current_instances, "status": "healthy"}
        elif action == "get_current_mapping":
            return {"mapping": self.session_ids_cache}
        elif action == "assign_streaming":
            session_id = data.get("session_id")
            latest_instances = data.get("instances", [])
            if set(latest_instances) != set(self.current_instances):
                self.logger.info(f"Instance list changed from {self.current_instances} to {latest_instances}. Updating session cache.")
                
                # Use separate names here so the requested `session_id`
                # read from `data` above is not overwritten by the loop.
                stale_sessions = [
                    cached_id for cached_id, instance_id in self.session_ids_cache.items()
                    if instance_id not in latest_instances
                ]
                
                if stale_sessions:
                    self.logger.info(f"Removing stale sessions from cache: {stale_sessions}")
                    for stale_id in stale_sessions:
                        del self.session_ids_cache[stale_id]
    
                self.current_instances = latest_instances
    
            if not session_id:
                self.logger.error("'session_id' not provided for 'assign_streaming' action.")
                return {"status": "error", "reason": "session_id is required."}
    
            if session_id in self.session_ids_cache:
                cached_instance = self.session_ids_cache[session_id]
                if cached_instance in self.current_instances:
                    self.logger.info(f"Session '{session_id}' already assigned to '{cached_instance}'.")
                    return {"instance_id": cached_instance, "status": "ok"}
                else:
                    self.logger.info(f"Cached instance '{cached_instance}' for session '{session_id}' is no longer active. Re-assigning.")
                    del self.session_ids_cache[session_id]
    
            chosen_instance = self._select_instance()
    
            if chosen_instance:
                self.logger.info(f"Pre-allocating instance '{chosen_instance}' for streaming session '{session_id}'.")
                self.session_ids_cache[session_id] = chosen_instance
                return {"instance_id": chosen_instance, "status": "ok"}
            else:
                self.logger.error(f"Failed to select an instance for session '{session_id}'.")
                return {"instance_id": None, "status": "error", "reason": "Instance selection failed."}
    
        self.logger.warning(f"Unknown management action received: {action}")
        return {"status": "unknown_action", "reason": f"Action '{action}' is not supported."}

## 2. Testing the Policy

We'll write a unit test to ensure our policy correctly scores blocks based on their token load.

In [22]:
import sys
import unittest
from unittest.mock import MagicMock, Mock

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    stream=sys.stdout  # Ensure logs go to the notebook output
)

class TestTokenBasedLoadBalancer(unittest.TestCase):

    def setUp(self):
        self.rule_id = "test-lb-rule"
        self.settings = {}
        self.parameters = {
            "output_token_weight": 0.9,
            "input_token_weight": 0.1,
            "averaging_period": "average_1m",
            "allow_random_fallback": True
        }

    def test_select_instance_with_lowest_score(self):
        # Mock metrics collector
        mock_metrics_collector = MagicMock(return_value={
            "block_metrics": [
                {"instanceId": "instance-1", "llm_input_tokens_per_minute_rolling": {"average_1m": 100}, "llm_output_tokens_per_minute_rolling": {"average_1m": 1000}}, # Score: 910
                {"instanceId": "instance-2", "llm_input_tokens_per_minute_rolling": {"average_1m": 50}, "llm_output_tokens_per_minute_rolling": {"average_1m": 500}},   # Score: 455
                {"instanceId": "instance-3", "llm_input_tokens_per_minute_rolling": {"average_1m": 200}, "llm_output_tokens_per_minute_rolling": {"average_1m": 1500}}  # Score: 1370
            ]
        })
        self.settings["get_metrics"] = mock_metrics_collector
        
        policy = AIOSv1PolicyRule(self.rule_id, self.settings, self.parameters)
        input_data = {"instances": ["instance-1", "instance-2", "instance-3"], "packet": None}
        
        result = policy.eval({}, input_data, {})
        self.assertEqual(result["instance_id"], "instance-2")

    def test_session_stickiness(self):
        # Mock metrics collector
        mock_metrics_collector = MagicMock(return_value={
            "block_metrics": [
                {"instanceId": "instance-1", "llm_input_tokens_per_minute_rolling": {"average_1m": 100}, "llm_output_tokens_per_minute_rolling": {"average_1m": 1000}},
                {"instanceId": "instance-2", "llm_input_tokens_per_minute_rolling": {"average_1m": 50}, "llm_output_tokens_per_minute_rolling": {"average_1m": 500}}
            ]
        })
        self.settings["get_metrics"] = mock_metrics_collector
        
        policy = AIOSv1PolicyRule(self.rule_id, self.settings, self.parameters)
        
        # First call, should select instance-2 (lowest score) and cache it
        packet1 = Mock()
        packet1.session_id = "session-123"
        input_data1 = {"instances": ["instance-1", "instance-2"], "packet": packet1}
        result1 = policy.eval({}, input_data1, {})
        self.assertEqual(result1["instance_id"], "instance-2")
        
        # Second call with the same session_id, should return the cached instance-2
        # even if metrics change to make instance-1 look better
        mock_metrics_collector.return_value["block_metrics"][1]["llm_output_tokens_per_minute_rolling"]["average_1m"] = 2000
        
        packet2 = Mock()
        packet2.session_id = "session-123"
        input_data2 = {"instances": ["instance-1", "instance-2"], "packet": packet2}
        result2 = policy.eval({}, input_data2, {})
        self.assertEqual(result2["instance_id"], "instance-2")

# Run the tests
suite = unittest.TestSuite()
suite.addTest(unittest.TestLoader().loadTestsFromTestCase(TestTokenBasedLoadBalancer))
runner = unittest.TextTestRunner(stream=sys.stdout, verbosity=2)
runner.run(suite)

test_select_instance_with_lowest_score (__main__.TestTokenBasedLoadBalancer.test_select_instance_with_lowest_score) ... 2025-07-30 09:33:51,097 - INFO - Starting token-based distribution evaluation. Input data keys: ['instances', 'packet']
2025-07-30 09:33:51,097 - INFO - Instance list changed from [] to ['instance-1', 'instance-2', 'instance-3']. Updating session cache.
2025-07-30 09:33:51,098 - INFO - Calculating scores for instances with metrics: ['instance-1', 'instance-2', 'instance-3']
2025-07-30 09:33:51,098 - INFO -   - Instance 'instance-1': score = 910.00
2025-07-30 09:33:51,098 - INFO -   - Instance 'instance-2': score = 455.00
2025-07-30 09:33:51,099 - INFO -   - Instance 'instance-3': score = 1370.00
2025-07-30 09:33:51,099 - INFO - Final scores: {'instance-1': 910.0, 'instance-2': 455.0, 'instance-3': 1370.0}. Chosen instance with lowest score: 'instance-2'
ok
test_session_stickiness (__main__.TestTokenBasedLoadBalancer.test_session_stickiness) ... 2025-07-30 09:33:51,101


<unittest.runner.TextTestResult run=2 errors=0 failures=0>

## 3. Deploying the Policy

With the policy tested, we can now deploy it to AIOS. This involves three steps: assembling the policy package, uploading it (`tokenloadbalancer.zip`), and registering it with the AIOS Policy System.

### a. Assembling the Policy for Deployment

Now that we have defined all the methods of our `AIOSv1PolicyRule` class, we need to assemble them into a single script and package it correctly for deployment. The policy must be contained within a `code` directory, which includes the `function.py` file and a `requirements.txt` file.

In [17]:
# Create a requirements.txt file
with open("code/requirements.txt", "w") as f:
    f.write("requests")  # The policy code itself imports only the standard library; "requests" is the single entry written here

print("code/function.py and code/requirements.txt created successfully.")
!cat code/function.py

code/function.py and code/requirements.txt created successfully.
import random
import logging
from typing import Dict, Any

class AIOSv1PolicyRule:
    def __init__(self, rule_id, settings, parameters):
        """
        Initializes the Token-Based Distribution Policy.

        Args:
            rule_id (str): Unique identifier for the rule.
            settings (dict): Configuration settings for the rule.
            parameters (dict): Parameters defining the rule's behavior.
        """
        self.rule_id = rule_id
        self.settings = settings
        self.parameters = parameters
        self.logger = logging.getLogger(f"TokenBasedDistributionPolicy-{self.rule_id}")

        # Weights for token calculation
        self.output_token_weight = self.parameters.get("output_token_weight", 0.9)
        self.input_token_weight = self.parameters.get("input_token_weight", 0.1)
        self.averaging_period = self.parameters.get("averaging_period", "average_1m")
        self.allow_rando

In [18]:
!zip -r tokenloadbalancer.zip code

  adding: code/ (stored 0%)
  adding: code/function.py (deflated 78%)
  adding: code/requirements.txt (stored 0%)
  adding: code/.ipynb_checkpoints/ (stored 0%)
  adding: code/.ipynb_checkpoints/function-checkpoint.py (deflated 78%)
  adding: code/.ipynb_checkpoints/requirements-checkpoint.txt (stored 0%)
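
The `zip` output above shows that Jupyter's `.ipynb_checkpoints` folder was swept into the archive. If you prefer a package that contains only the policy files, the archive can be built from Python instead. This is a sketch using the standard-library `zipfile` module; `build_policy_zip` is a hypothetical helper, and it assumes AIOS only needs the `code/` directory with `function.py` and `requirements.txt`, as described above.

```python
import zipfile
from pathlib import Path
from typing import List


def build_policy_zip(code_dir: str, zip_path: str) -> List[str]:
    """Zip `code_dir`, skipping Jupyter checkpoint folders; return archived names."""
    root = Path(code_dir)
    names: List[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if ".ipynb_checkpoints" in path.parts or not path.is_file():
                continue
            # Keep the top-level directory name (e.g. "code/") inside the archive.
            arcname = path.relative_to(root.parent).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names


# Example call, run from the directory that contains `code/`:
# build_policy_zip("code", "tokenloadbalancer.zip")
```

Write the archive outside the `code` directory, otherwise a rerun would pick up the previous zip file as input.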


### b. Upload the Policy Package

Next, we upload the `.zip` file containing our policy code to a location accessible by AIOS. The following `curl` command sends the file to an upload server, which makes it available via a URL.

In [None]:
!curl -X POST http://POLICYSTORESERVER:30186/upload -F "file=@./tokenloadbalancer.zip" -F "path=."

### c. Register the Policy

Next, we register the policy with AIOS. This `curl` command sends a JSON payload to the policy registry endpoint. The payload contains metadata about our policy, including its name, version, and the URL where the code can be found (`"code": "http://MANAGEMENTMASTER:32555/tokenloadbalancer.zip"`). This tells AIOS how to find and execute our policy when it's attached to a block.

In [None]:
!curl -X POST http://MANAGEMENTMASTER:30102/policy \
     -H "Content-Type: application/json" \
     -d '{ \
           "name": "load_balancer", \
           "version": "1.0", \
           "release_tag": "stable", \
           "metadata": {"author": "admin", "category": "load_balancing"}, \
           "tags": "load_balancing,ai", \
           "code": "http://MANAGEMENTMASTER:32555/tokenloadbalancer.zip", \
           "code_type": "tar.xz", \
           "type": "policy", \
           "policy_input_schema": {"type": "object", "properties": {"blocks": {"type": "array"}}}, \
           "policy_output_schema": {"type": "object", "properties": {"scored_blocks": {"type": "array"}}}, \
           "policy_settings_schema": {}, \
           "policy_parameters_schema": {}, \
           "policy_settings": {}, \
           "policy_parameters": {}, \
           "description": "A policy for token-based load balancing.", \
           "functionality_data": {"strategy": "scoring-based"}, \
           "resource_estimates": {} \
         }'
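
If you would rather build the registration request in Python than hand-edit JSON inside a shell string, the same payload can be assembled as a dictionary. This is a sketch using only the standard library: the field values are copied from the `curl` command above, while the helper names `build_registration_payload` and `register_policy` are assumptions, and the shape of the registry's response is not covered by this guide.

```python
import json
import urllib.request


def build_registration_payload(code_url: str) -> dict:
    """Return the registration body used in this tutorial (values from the curl call)."""
    return {
        "name": "load_balancer",
        "version": "1.0",
        "release_tag": "stable",
        "metadata": {"author": "admin", "category": "load_balancing"},
        "tags": "load_balancing,ai",
        "code": code_url,
        "code_type": "tar.xz",
        "type": "policy",
        "policy_input_schema": {"type": "object", "properties": {"blocks": {"type": "array"}}},
        "policy_output_schema": {"type": "object", "properties": {"scored_blocks": {"type": "array"}}},
        "policy_settings_schema": {},
        "policy_parameters_schema": {},
        "policy_settings": {},
        "policy_parameters": {},
        "description": "A policy for token-based load balancing.",
        "functionality_data": {"strategy": "scoring-based"},
        "resource_estimates": {},
    }


def register_policy(registry_url: str, payload: dict) -> str:
    """POST the payload as JSON and return the raw response body."""
    request = urllib.request.Request(
        registry_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


payload = build_registration_payload("http://MANAGEMENTMASTER:32555/tokenloadbalancer.zip")
print(json.dumps(payload, indent=2))
# To send it: register_policy("http://MANAGEMENTMASTER:30102/policy", payload)
```

Building the body with `json.dumps` avoids quoting mistakes that are easy to make in a multi-line shell command.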

Before we can send any traffic, we need a block to balance. The `allocation.json` file below describes the block to allocate, including its ID (`gemma3-27b-block`) and its minimum and maximum number of instances (1 and 3).

In [19]:
!cat allocation.json

{
    "head": {
        "templateUri": "Parser/V1",
        "parameters": {}
    },
    "body": {
        "spec": {
            "values": {
                "mode": "allocate",
                "blockId": "gemma3-27b-block",
                "blockComponentURI": "model.gemma3-27b:1.0.0-stable",
                "minInstances": 1,
                "maxInstances": 3,
                "blockInitData": {
                    "model_name": "gemma-3-27b-it-UD-Q8_K_XL/gemma-3-27b-it-UD-Q8_K_XL.gguf",
                    "clip_model_name": "gemma-3-27b-it-UD-Q8_K_XL/mmproj-F16.gguf"
                },
                "initSettings": {
                    "tensor_parallel": true,
                    "device": "cuda",
                    "quantization_type": "fp16",
                    "cleanup_enabled":  true,
                    "cleanup_check_interval": 60,
                    "cleanup_session_timeout": 1800,
                    "gen_params": {
                        "max_new_tokens": 2048,
       

## 4. Triggering and Observing Distribution

To test the load balancer, we need multiple replicas of a block and a client that sends requests. The load balancer, guided by our policy, will distribute these requests.

In [31]:
# First, create the block; it should then appear in the K8s dashboard
!curl -X POST -d @./allocation.json -H "Content-Type: application/json" http://MANAGEMENTMASTER:30501/api/createBlock

{
  "error": "block with same ID already exists"
}


In [39]:
!python3 client_test_token_loadbalancer.py  --input-tokens=1600  --max-output-tokens=50 --num-requests=2

Starting test with 2 concurrent requests.
Each request will have 1600 input tokens and ask for max 50 output tokens.
Session session-6f5162af-2650-4f1f-8911-1cb3b85a947e: Latency: 4.61s
Session session-8672805d-bb75-4b7f-8d16-41eb041a8ec7: Latency: 8.41s
Test finished.


In [None]:
!python3 client_test_token_loadbalancer.py  --max-output-tokens=100 --num-requests=15

Starting test with 15 concurrent requests.
Each request will have 100 input tokens and ask for max 100 output tokens.


### Observing the Results

You can monitor the distribution of requests in the logs.


**In K8s Logs:**
- The K8s dashboard is available at https://CLUSTER1MASTER:32319/#/login
- The load balancer service logs its decisions, including the score it calculated for each replica and which replica it chose for each request.
- You can also inspect the logs of the individual block replicas to see which requests they received.
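
To turn raw log lines into a distribution summary, you can count the policy's own "Chosen instance" messages. The sketch below assumes the deployed load balancer emits the same message text as the local unit-test run shown earlier; `tally_choices` is a hypothetical helper, and how you export the logs (dashboard download, `kubectl logs`, and so on) is up to your setup.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the message logged by `_select_instance` when a score-based choice is made.
CHOSEN_RE = re.compile(r"Chosen instance with lowest score: '([^']+)'")


def tally_choices(log_lines: Iterable[str]) -> Counter:
    """Count how often each instance was picked by score."""
    counts: Counter = Counter()
    for line in log_lines:
        match = CHOSEN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "INFO - Final scores: {'instance-1': 910.0, 'instance-2': 455.0}. "
    "Chosen instance with lowest score: 'instance-2'",
    "INFO - Session 'session-123' found in cache. Routing to instance 'instance-2'.",
    "INFO - Final scores: {'instance-1': 10.0, 'instance-2': 455.0}. "
    "Chosen instance with lowest score: 'instance-1'",
]

print(tally_choices(sample))
```

Requests served from the session cache are deliberately not counted here, since they reflect an earlier decision and not a fresh score comparison.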