# A Developer's Guide to Block Autoscaling in AIOS

Welcome to this guide on the **Block Auto-Scaler** feature in AIOS. The auto-scaler automatically adjusts the number of block replicas based on real-time demand, ensuring optimal resource utilization and performance. This is a critical feature for building robust, cost-effective, and scalable AI applications.

Below is a visual overview of the autoscaling process, where the number of replicas dynamically adjusts based on the incoming workload.

<img src="tokenautoscaler.png" alt="Custom Autoscaler" width="600" height="400">

For a deep dive into the concepts, you can refer to the official documentation:
- [Block Auto-Scaler Concepts](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/block/block.md#block-auto-scaler)

The process involves:
1.  **Building an Autoscaler Policy**: Defining the logic for scaling in a Python class.
2.  **Testing the Policy**: Running unit tests to ensure the logic is correct.
3.  **Deploying the Policy**: Uploading and registering the policy with AIOS.
4.  **Triggering and Observing Scaling**: Simulating a workload and monitoring the system to see the autoscaler in action.

## The AIOS Policy System

Before we build our autoscaler, it's important to understand the **AIOS Policy System**. Policies are pluggable, versioned, and reusable modules that control the runtime behavior of Blocks. They allow you to customize how AIOS manages resources, routes traffic, and ensures stability without modifying the core application code.

AIOS uses a chain of policies to manage a Block's lifecycle. Common policy types include:
- **`clusterAllocator`**: Selects a cluster to run the block.
- **`resourceAllocator`**: Assigns specific nodes and hardware.
- **`loadBalancer`**: Distributes incoming requests among replicas.
- **`stabilityChecker`**: Monitors the health of a block.
- **`autoscaler`**: Automatically adjusts the number of block replicas.

In this tutorial, we will focus on creating a custom `autoscaler` policy. For more information on the policy system, see the documentation:
- [Policies System Overview](https://github.com/OpenCyberspace/OpenOS.AI-Documentation/blob/main/policies-system/policies-system.md)

## 1. Building a Token-Based Autoscaler Policy

Our autoscaler will use a **token-based scaling** strategy. This is particularly effective for Large Language Models (LLMs), where the number of input and output tokens is a direct measure of the workload. The policy will monitor the rate of tokens processed by a block and make scaling decisions based on predefined thresholds.

### The Logic of Token-Based Autoscaling
The core idea is simple:
- **If the rate of tokens per minute exceeds a certain "high" threshold**, it indicates that the current number of replicas is struggling to keep up with demand. To prevent performance degradation, the policy will trigger a **scale-up** event, adding more replicas to handle the load.
- **If the token rate drops below a "low" threshold**, it means we have more resources than necessary. To save costs, the policy will trigger a **scale-down** event, removing idle replicas.
- **A cooldown period** is used to prevent "flapping"—scaling up and down too frequently in response to short-lived spikes in traffic.

This approach ensures that the number of running instances is always aligned with the actual workload, providing a balance between performance and cost.
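The threshold rules above can be sketched as a small pure function. This is a minimal illustration rather than the policy itself: the function name `decide` and the default threshold values are assumptions chosen for the example, and the cooldown is deliberately left out.

```python
from typing import Literal

Decision = Literal["upscale", "downscale", "none"]


def decide(
    avg_input: float,
    avg_output: float,
    input_up: float = 500,
    output_up: float = 300,
    input_down: float = 100,
    output_down: float = 50,
) -> Decision:
    """Hypothetical helper: map average token rates to a scaling decision."""
    # Scale up when EITHER rate is above its high threshold.
    if avg_input > input_up or avg_output > output_up:
        return "upscale"
    # Scale down only when BOTH rates are below their low thresholds.
    if avg_input < input_down and avg_output < output_down:
        return "downscale"
    # Anything in between is left alone.
    return "none"


print(decide(600, 200))  # input rate above the high threshold
print(decide(45, 15))    # both rates below the low thresholds
print(decide(200, 100))  # in the band between the thresholds
```

Using "or" for scale-up and "and" for scale-down makes the rule asymmetric on purpose: one saturated dimension is enough to add capacity, while capacity is removed only when both dimensions are quiet.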

The policy is defined within a class named `AIOSv1PolicyRule`. We will build this class step-by-step, explaining each method. The class contains three main methods:
- **`__init__(self, rule_id, settings, parameters)`**: Initializes the policy, setting up thresholds, cooldown periods, and other parameters from the policy registration.
- **`eval(self, parameters, input_data, context)`**: The core evaluation logic. It retrieves metrics, calculates average token rates, and decides whether to scale up, scale down, or do nothing.
- **`management(self, action, data)`**: Handles administrative commands, such as manual overrides or resets.

In [5]:
import logging
import time
from typing import Dict, Any
import inspect
import textwrap

logging.basicConfig(level=logging.INFO)

### The `AIOSv1PolicyRule` Class and `__init__` Method

This is the main class for our policy. The `__init__` method initializes the policy's state. It sets up logging and retrieves parameters like scaling thresholds and cooldown periods from the policy's configuration.

In [6]:
class AIOSv1PolicyRule:
    def __init__(self, rule_id: str, settings: Dict[str, Any], parameters: Dict[str, Any]):
        """
        Initializes the Tokens Autoscaler Policy.
        """
        self.rule_id = rule_id
        self.settings = settings
        self.parameters = parameters
        self.logger = logging.getLogger(f"TokensAutoscalerPolicy-{self.rule_id}")
        self.logger.info("Initializing Tokens Autoscaler Policy")
        
        # Scaling thresholds based on input and output tokens
        self.input_up_threshold = self.parameters.get("input_tokens_up_threshold", 500)
        self.output_up_threshold = self.parameters.get("output_tokens_up_threshold", 300)
        self.input_down_threshold = self.parameters.get("input_tokens_down_threshold", 100)
        self.output_down_threshold = self.parameters.get("output_tokens_down_threshold", 50)
        
        # Minimum number of replicas to maintain
        self.min_replicas = self.parameters.get("min_replicas", 1)
        # The time window for averaging token metrics (e.g., 'average_1m', 'average_5m')
        self.averaging_period = self.parameters.get("averaging_period", "average_1m")
        # Cooldown period in seconds to avoid flapping (scaling up and down too quickly)
        self.cooldown_seconds = self.parameters.get("cooldown_seconds", 120)
        self.last_action_ts = None

### The `_cooldown_ok` Method

This is a simple helper method to prevent "flapping"—scaling up and down too rapidly. It checks if enough time has passed since the last scaling action before allowing another one.

In [7]:
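# Notebook convenience: re-declare the class as a subclass of itself so that
# each cell can add one more method to the class defined in the previous cell.
class AIOSv1PolicyRule(AIOSv1PolicyRule):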
    def _cooldown_ok(self, now: float) -> bool:
        """
        Checks if the cooldown period has elapsed since the last action.
        """
        if self.last_action_ts is None:
            return True
        return (now - self.last_action_ts) > self.cooldown_seconds
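
To see how the cooldown suppresses flapping, the same check can be exercised with hand-picked timestamps instead of the real clock. The `CooldownGate` class below is a hypothetical stand-in that mirrors `_cooldown_ok`; it is a sketch for experimentation, not part of the policy.

```python
from typing import List, Optional


class CooldownGate:
    """Hypothetical stand-in that mirrors the policy's cooldown check."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self.last_action_ts: Optional[float] = None

    def ok(self, now: float) -> bool:
        # The first action is always allowed.
        if self.last_action_ts is None:
            return True
        # Strictly more than cooldown_seconds must have elapsed.
        return (now - self.last_action_ts) > self.cooldown_seconds

    def record(self, now: float) -> None:
        self.last_action_ts = now


gate = CooldownGate(cooldown_seconds=120)
allowed: List[float] = []
# Pretend a scaling decision is requested at each of these timestamps (seconds).
for now in (0.0, 30.0, 120.0, 121.0, 200.0):
    if gate.ok(now):
        gate.record(now)
        allowed.append(now)

print(allowed)  # [0.0, 121.0]
```

Because the comparison is strict, a request arriving exactly 120 seconds after the last action is still rejected; the next one is accepted at 121 seconds and restarts the timer.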

### The `eval` Method

This is the core logic of our autoscaler. It's called periodically by AIOS. The method fetches the latest token metrics, calculates the average input and output token rates across all block replicas, and compares them against the thresholds. Based on this comparison, it decides whether to scale up, scale down, or do nothing.

In [10]:
class AIOSv1PolicyRule(AIOSv1PolicyRule):
    def eval(self, parameters: Dict[str, Any], input_data: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """
        Evaluates the policy to determine if a scaling action is needed.
        """
        try:
            self.logger.info(f"Evaluating policy with parameters: {self.parameters}")
    
            metrics_collector = self.settings.get('get_metrics')
            if not callable(metrics_collector):
                self.logger.error("get_metrics function not found in settings")
                return {"skip": True, "reason": "Metrics collector not configured."}
    
            metrics = metrics_collector()
            self.logger.info(f"Metrics received: {metrics}")
    
            current_instances = input_data.get("current_instances")
            if not current_instances:
                self.logger.warning("No current instances provided. Skipping evaluation.")
                return {"skip": True, "reason": "No current instances provided."}
    
            self.logger.info(f"Processing metrics for instances: {current_instances}")
            
            block_metrics = metrics.get("block_metrics", [])
            now = time.time()
    
            instances_to_process = [
                inst for inst in block_metrics
                if inst.get('instanceId') in current_instances
            ]
    
            if not instances_to_process:
                self.logger.warning("No metrics found for the specified current_instances.")
                return {"skip": True, "reason": "No metrics found for the specified instances."}
    
            total_input_tokens = 0
            total_output_tokens = 0
            
            for instance in instances_to_process:
                input_tokens_metrics = instance.get('llm_input_tokens_per_minute_rolling', {})
                output_tokens_metrics = instance.get('llm_output_tokens_per_minute_rolling', {})
                
                total_input_tokens += input_tokens_metrics.get(self.averaging_period, 0)
                total_output_tokens += output_tokens_metrics.get(self.averaging_period, 0)
    
            instance_count = len(instances_to_process)
            avg_input_tokens = total_input_tokens / instance_count
            avg_output_tokens = total_output_tokens / instance_count
            
            self.logger.info(f"Average Input Tokens: {avg_input_tokens}, Average Output Tokens: {avg_output_tokens}")
    
            if not self._cooldown_ok(now):
                self.logger.info(f"Cooldown active. Skipping evaluation. Last action at {self.last_action_ts}")
                return {"skip": True, "reason": "Cooldown active."}
    
            # Upscale logic
            if avg_input_tokens > self.input_up_threshold or avg_output_tokens > self.output_up_threshold:
                self.last_action_ts = now
                reason = f"Token rate exceeded threshold (Input: {avg_input_tokens:.2f}, Output: {avg_output_tokens:.2f})"
                self.logger.info(f"{reason}. Scaling up.")
                return {
                    "skip": False,
                    "operation": "upscale",
                    "instances_count": 1,
                    "reason": reason
                }
            
            # Downscale logic
            if avg_input_tokens < self.input_down_threshold and avg_output_tokens < self.output_down_threshold:
                if instance_count > self.min_replicas:
                    self.last_action_ts = now
                    
                    instances_to_process.sort(key=lambda x: 
                        x.get('llm_input_tokens_per_minute_rolling', {}).get(self.averaging_period, 0) +
                        x.get('llm_output_tokens_per_minute_rolling', {}).get(self.averaging_period, 0)
                    )
                    
                    instance_to_remove = instances_to_process[0].get('instanceId') if instances_to_process else None
                    
                    if instance_to_remove:
                        reason = f"Token rate below threshold (Input: {avg_input_tokens:.2f}, Output: {avg_output_tokens:.2f})"
                        self.logger.info(f"{reason}. Scaling down. Removing instance {instance_to_remove}")
                        return {
                            "skip": False,
                            "operation": "downscale",
                            "instances_list": [instance_to_remove],
                            "reason": reason
                        }
    
            self.logger.info("No scaling action required at this time.")
            return {"skip": True, "reason": "No scaling action required."}
    
        except Exception as e:
            self.logger.exception(f"An unexpected error occurred during evaluation: {e}")
            return {"skip": True, "reason": f"An error occurred: {e}"}
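
The aggregation step inside `eval` can be tried in isolation. The sketch below assumes the metrics payload has the same shape as the one used in this guide's unit tests; the helper name `summarize` and the sample values are illustrative only.

```python
from typing import Any, Dict, List, Tuple

INPUT_KEY = "llm_input_tokens_per_minute_rolling"
OUTPUT_KEY = "llm_output_tokens_per_minute_rolling"


def summarize(
    block_metrics: List[Dict[str, Any]],
    current_instances: List[str],
    period: str = "average_1m",
) -> Tuple[float, float, str]:
    """Return (avg input rate, avg output rate, least-loaded instance id)."""
    # Keep only the instances the caller asked about.
    selected = [m for m in block_metrics if m.get("instanceId") in current_instances]
    if not selected:
        raise ValueError("no metrics found for the requested instances")

    def rate(metric: Dict[str, Any], key: str) -> float:
        # Missing metrics count as zero, as in the policy.
        return metric.get(key, {}).get(period, 0)

    avg_in = sum(rate(m, INPUT_KEY) for m in selected) / len(selected)
    avg_out = sum(rate(m, OUTPUT_KEY) for m in selected) / len(selected)
    # The scale-down candidate is the instance with the lowest combined rate.
    least = min(selected, key=lambda m: rate(m, INPUT_KEY) + rate(m, OUTPUT_KEY))
    return avg_in, avg_out, least["instanceId"]


sample = [
    {"instanceId": "instance-1", INPUT_KEY: {"average_1m": 50}, OUTPUT_KEY: {"average_1m": 20}},
    {"instanceId": "instance-2", INPUT_KEY: {"average_1m": 40}, OUTPUT_KEY: {"average_1m": 10}},
    # Not listed in current_instances, so it is ignored.
    {"instanceId": "instance-3", INPUT_KEY: {"average_1m": 900}, OUTPUT_KEY: {"average_1m": 900}},
]

print(summarize(sample, ["instance-1", "instance-2"]))  # (45.0, 15.0, 'instance-2')
```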

### The `management` Method

This method handles administrative commands. Our policy implements a single action, `reset_cooldown`, which clears the cooldown timer. It could be extended to support other actions, such as manually triggering a scaling event.

In [11]:
class AIOSv1PolicyRule(AIOSv1PolicyRule):
    def management(self, action: str, data: dict) -> dict:
        """
        Executes a custom management command.
        """
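        try:
            if action == "reset_cooldown":
                self.last_action_ts = None
                self.logger.info("Cooldown timer reset")
                return {"status": "ok", "message": "Cooldown timer reset successfully"}
            # Unknown actions get an explicit error instead of an implicit None
            return {"status": "error", "reason": f"Unsupported management action: {action}"}
        except Exception as e:
            return {"status": "error", "reason": f"Management action failed: {e}"}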
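
If more management actions are needed, one possible structure is a dispatch table that maps action names to handlers. The sketch below is an assumption about how such an extension might look: the `ManagementSketch` class and the `set_cooldown` action are hypothetical and not part of AIOS or of the policy above.

```python
from typing import Any, Callable, Dict, Optional


class ManagementSketch:
    """Hypothetical stand-in showing a dispatch-table style management method."""

    def __init__(self) -> None:
        self.last_action_ts: Optional[float] = 1_700_000_000.0
        self.cooldown_seconds: float = 120
        # Map each supported action name to its handler.
        self._actions: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
            "reset_cooldown": self._reset_cooldown,
            "set_cooldown": self._set_cooldown,  # hypothetical extra action
        }

    def _reset_cooldown(self, data: Dict[str, Any]) -> Dict[str, Any]:
        self.last_action_ts = None
        return {"status": "ok", "message": "Cooldown timer reset successfully"}

    def _set_cooldown(self, data: Dict[str, Any]) -> Dict[str, Any]:
        seconds = float(data["cooldown_seconds"])
        if seconds < 0:
            raise ValueError("cooldown_seconds must be non-negative")
        self.cooldown_seconds = seconds
        return {"status": "ok", "message": f"Cooldown set to {seconds} seconds"}

    def management(self, action: str, data: Dict[str, Any]) -> Dict[str, Any]:
        handler = self._actions.get(action)
        if handler is None:
            return {"status": "error", "reason": f"Unsupported management action: {action}"}
        try:
            return handler(data)
        except Exception as e:
            # Bad input is reported to the caller rather than raised.
            return {"status": "error", "reason": f"Management action failed: {e}"}


policy = ManagementSketch()
print(policy.management("reset_cooldown", {}))
print(policy.management("set_cooldown", {"cooldown_seconds": 60}))
print(policy.management("does_not_exist", {}))
```

Every path returns a dictionary with a `status` field, so a caller never has to handle a `None` response.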

## 2. Testing the Policy

Before deploying, it's crucial to test the policy logic. We can write a simple unit test to verify that the policy makes the correct scaling decisions under different conditions.

In [14]:
import unittest
import sys
import time
from unittest.mock import MagicMock

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    stream=sys.stdout  # Ensure logs go to the notebook output
)


class TestTokenBasedAutoscaler(unittest.TestCase):

    def setUp(self):
        self.rule_id = "test-rule"
        self.settings = {}
        self.parameters = {
            "input_tokens_up_threshold": 500,
            "output_tokens_up_threshold": 300,
            "input_tokens_down_threshold": 100,
            "output_tokens_down_threshold": 50,
            "min_replicas": 1,
            "averaging_period": "average_1m",
            "cooldown_seconds": 120
        }

    def test_scale_up(self):
        # Mock metrics collector to return high token rates
        mock_metrics_collector = MagicMock(return_value={
            "block_metrics": [
                {"instanceId": "instance-1", 
                 "llm_input_tokens_per_minute_rolling": {"average_1m": 600},
                 "llm_output_tokens_per_minute_rolling": {"average_1m": 200}
                }
            ]
        })
        self.settings["get_metrics"] = mock_metrics_collector
        
        policy = AIOSv1PolicyRule(self.rule_id, self.settings, self.parameters)
        input_data = {"current_instances": ["instance-1"]}
        
        result = policy.eval({}, input_data, {})
        self.assertFalse(result.get("skip"))
        self.assertEqual(result["operation"], "upscale")
        self.assertEqual(result["instances_count"], 1)

    def test_scale_down(self):
        # Mock metrics collector for low token rates
        mock_metrics_collector = MagicMock(return_value={
            "block_metrics": [
                {"instanceId": "instance-1", 
                 "llm_input_tokens_per_minute_rolling": {"average_1m": 50},
                 "llm_output_tokens_per_minute_rolling": {"average_1m": 20}
                },
                {"instanceId": "instance-2", 
                 "llm_input_tokens_per_minute_rolling": {"average_1m": 40},
                 "llm_output_tokens_per_minute_rolling": {"average_1m": 10}
                }
            ]
        })
        self.settings["get_metrics"] = mock_metrics_collector
        
        policy = AIOSv1PolicyRule(self.rule_id, self.settings, self.parameters)
        policy.min_replicas = 1 # ensure we can scale down
        input_data = {"current_instances": ["instance-1", "instance-2"]}
        
        result = policy.eval({}, input_data, {})
        self.assertFalse(result.get("skip"))
        self.assertEqual(result["operation"], "downscale")
        self.assertEqual(result["instances_list"], ["instance-2"])

    def test_no_action(self):
        # Mock metrics for moderate token rates
        mock_metrics_collector = MagicMock(return_value={
            "block_metrics": [
                {"instanceId": "instance-1", 
                 "llm_input_tokens_per_minute_rolling": {"average_1m": 200},
                 "llm_output_tokens_per_minute_rolling": {"average_1m": 100}
                }
            ]
        })
        self.settings["get_metrics"] = mock_metrics_collector
        
        policy = AIOSv1PolicyRule(self.rule_id, self.settings, self.parameters)
        input_data = {"current_instances": ["instance-1"]}
        
        result = policy.eval({}, input_data, {})
        self.assertTrue(result["skip"])
        self.assertEqual(result["reason"], "No scaling action required.")

# Run the tests
suite = unittest.TestSuite()
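# unittest.makeSuite is deprecated (and removed in Python 3.13); use the default loader instead
suite.addTest(unittest.defaultTestLoader.loadTestsFromTestCase(TestTokenBasedAutoscaler))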
runner = unittest.TextTestRunner(stream=sys.stdout, verbosity=2)
runner.run(suite)

test_no_action (__main__.TestTokenBasedAutoscaler.test_no_action) ... 

INFO:TokensAutoscalerPolicy-test-rule:Initializing Tokens Autoscaler Policy
INFO:TokensAutoscalerPolicy-test-rule:Evaluating policy with parameters: {'input_tokens_up_threshold': 500, 'output_tokens_up_threshold': 300, 'input_tokens_down_threshold': 100, 'output_tokens_down_threshold': 50, 'min_replicas': 1, 'averaging_period': 'average_1m', 'cooldown_seconds': 120}
INFO:TokensAutoscalerPolicy-test-rule:Metrics received: {'block_metrics': [{'instanceId': 'instance-1', 'llm_input_tokens_per_minute_rolling': {'average_1m': 200}, 'llm_output_tokens_per_minute_rolling': {'average_1m': 100}}]}
INFO:TokensAutoscalerPolicy-test-rule:Processing metrics for instances: ['instance-1']
INFO:TokensAutoscalerPolicy-test-rule:Average Input Tokens: 200.0, Average Output Tokens: 100.0
INFO:TokensAutoscalerPolicy-test-rule:No scaling action required at this time.


ok
test_scale_down (__main__.TestTokenBasedAutoscaler.test_scale_down) ... 

INFO:TokensAutoscalerPolicy-test-rule:Initializing Tokens Autoscaler Policy
INFO:TokensAutoscalerPolicy-test-rule:Evaluating policy with parameters: {'input_tokens_up_threshold': 500, 'output_tokens_up_threshold': 300, 'input_tokens_down_threshold': 100, 'output_tokens_down_threshold': 50, 'min_replicas': 1, 'averaging_period': 'average_1m', 'cooldown_seconds': 120}
INFO:TokensAutoscalerPolicy-test-rule:Metrics received: {'block_metrics': [{'instanceId': 'instance-1', 'llm_input_tokens_per_minute_rolling': {'average_1m': 50}, 'llm_output_tokens_per_minute_rolling': {'average_1m': 20}}, {'instanceId': 'instance-2', 'llm_input_tokens_per_minute_rolling': {'average_1m': 40}, 'llm_output_tokens_per_minute_rolling': {'average_1m': 10}}]}
INFO:TokensAutoscalerPolicy-test-rule:Processing metrics for instances: ['instance-1', 'instance-2']
INFO:TokensAutoscalerPolicy-test-rule:Average Input Tokens: 45.0, Average Output Tokens: 15.0
INFO:TokensAutoscalerPolicy-test-rule:Token rate below thresho

ok
test_scale_up (__main__.TestTokenBasedAutoscaler.test_scale_up) ... 

INFO:TokensAutoscalerPolicy-test-rule:Initializing Tokens Autoscaler Policy
INFO:TokensAutoscalerPolicy-test-rule:Evaluating policy with parameters: {'input_tokens_up_threshold': 500, 'output_tokens_up_threshold': 300, 'input_tokens_down_threshold': 100, 'output_tokens_down_threshold': 50, 'min_replicas': 1, 'averaging_period': 'average_1m', 'cooldown_seconds': 120}
INFO:TokensAutoscalerPolicy-test-rule:Metrics received: {'block_metrics': [{'instanceId': 'instance-1', 'llm_input_tokens_per_minute_rolling': {'average_1m': 600}, 'llm_output_tokens_per_minute_rolling': {'average_1m': 200}}]}
INFO:TokensAutoscalerPolicy-test-rule:Processing metrics for instances: ['instance-1']
INFO:TokensAutoscalerPolicy-test-rule:Average Input Tokens: 600.0, Average Output Tokens: 200.0
INFO:TokensAutoscalerPolicy-test-rule:Token rate exceeded threshold (Input: 600.00, Output: 200.00). Scaling up.


ok

----------------------------------------------------------------------
Ran 3 tests in 0.011s

OK


<unittest.runner.TextTestResult run=3 errors=0 failures=0>

## 3. Deploying the Policy

With the policy tested, we can now deploy it to AIOS. This involves three steps: assembling the policy, uploading the policy package (`tokenautoscaler.zip`), and registering it with the AIOS Policy System.

### a. Assembling the Policy for Deployment

Now that we have defined all the methods of our `AIOSv1PolicyRule` class, we need to assemble them into a single script and package it correctly for deployment. The policy must be contained within a `code` directory, which includes the `function.py` file and a `requirements.txt` file.

In [12]:
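# Create the code directory (if it does not exist yet) and a requirements.txt file
import os
os.makedirs("code", exist_ok=True)
with open("code/requirements.txt", "w") as f:
    f.write("requests\n")  # the policy code itself imports only the standard library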

print("code/function.py and code/requirements.txt created successfully.")
!cat code/function.py

code/function.py and code/requirements.txt created successfully.
#!/usr/bin/env python3
"""
Input/Output Tokens Autoscaler Helper Policy

Scales up/down based on rolling average input/output tokens per minute.
"""
import logging
import time
from typing import Dict, Any

# Configure logger
logging.basicConfig(level=logging.INFO)

class AIOSv1PolicyRule:
    def __init__(self, rule_id, settings, parameters):
        self.rule_id = rule_id
        self.settings = settings
        self.parameters = parameters
        self.logger = logging.getLogger(f"TokensAutoscalerPolicy-{self.rule_id}")
        self.logger.info("Initializing Tokens Autoscaler Policy")
        
        self.input_up_threshold = self.parameters.get("input_tokens_up_threshold", 500)
        self.output_up_threshold = self.parameters.get("output_tokens_up_threshold", 300)
        self.input_down_threshold = self.parameters.get("input_tokens_down_threshold", 100)
        self.output_down_threshold = self.parameters.get("outp

Now, let's package our policy into a `tokenautoscaler.zip` file for deployment. The zip file must contain the `code` directory.

In [None]:
!zip -r tokenautoscaler.zip code
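
Where the `zip` command-line tool is not available, the same layout can be produced and checked from Python. This is a sketch under assumptions: the helper name `build_policy_zip` is hypothetical, the `function.py` content is a placeholder, and the example works in a temporary directory so it does not touch your real `code` folder.

```python
import tempfile
import zipfile
from pathlib import Path


def build_policy_zip(workdir: Path, function_source: str, requirements: str = "") -> Path:
    """Write code/function.py and code/requirements.txt, then zip the code directory."""
    code_dir = workdir / "code"
    code_dir.mkdir(parents=True, exist_ok=True)
    (code_dir / "function.py").write_text(function_source)
    (code_dir / "requirements.txt").write_text(requirements)

    archive = workdir / "tokenautoscaler.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(code_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to workdir so entries start with "code/".
                zf.write(path, path.relative_to(workdir).as_posix())
    return archive


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder source; in practice this is the assembled policy class.
    archive = build_policy_zip(Path(tmp), "class AIOSv1PolicyRule:\n    pass\n")
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(names)  # ['code/function.py', 'code/requirements.txt']
```

Listing the archive before uploading is a cheap way to confirm that the `code` directory sits at the top level of the zip file.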

### b. Upload the Policy Package

First, we upload the `.zip` file containing our `function.py` to a location accessible by AIOS. The following `curl` command sends the file to an upload server, which makes it available via a URL.

In [None]:
!curl -X POST http://POLICYSTORESERVER:30186/upload -F "file=@./tokenautoscaler.zip" -F "path=."

### c. Register the Policy

Next, we register the policy with AIOS. This `curl` command sends a JSON payload to the policy registry endpoint. The payload contains metadata about our policy, including its name, version, and the URL where the code can be found (`"code": "http://MANAGEMENTMASTER:32555/autoscaler.zip"`). This tells AIOS how to find and execute our policy. The URL must resolve to the archive uploaded in the previous step, so adjust the host, port, and file name to match your deployment.

In [None]:
!curl -X POST http://MANAGEMENTMASTER:30102/policy \
     -H "Content-Type: application/json" \
     -d '{ \
           "name": "autoscaler", \
           "version": "2.0", \
           "release_tag": "stable", \
           "metadata": {"author": "admin", "category": "analytics"}, \
           "tags": "analytics,ai", \
           "code": "http://MANAGEMENTMASTER:32555/autoscaler.zip", \
           "code_type": "tar.xz", \
           "type": "policy", \
           "policy_input_schema": {"type": "object", "properties": {"input": {"type": "string"}}}, \
           "policy_output_schema": {"type": "object", "properties": {"output": {"type": "string"}}}, \
           "policy_settings_schema": {}, \
           "policy_parameters_schema": {}, \
           "policy_settings": {}, \
           "policy_parameters": {}, \
           "description": "A policy for token based autoscaling.", \
           "functionality_data": {"strategy": "ML-based"}, \
           "resource_estimates": {} \
         }'
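
The registration above leaves `policy_parameters` empty, so the policy falls back to the defaults in `__init__`. The sketch below shows one way the payload might be built in Python with explicit parameters. The parameter names and values are the defaults read by `AIOSv1PolicyRule.__init__`; all other fields are copied from the `curl` command. It assumes that AIOS passes `policy_parameters` to the policy's `parameters` argument, and the helper name `build_register_request` is hypothetical. The request is only constructed, not sent.

```python
import json
import urllib.request
from typing import Any, Dict

# Names and defaults taken from AIOSv1PolicyRule.__init__ in this guide.
policy_parameters: Dict[str, Any] = {
    "input_tokens_up_threshold": 500,     # scale up above this input rate
    "output_tokens_up_threshold": 300,    # ... or above this output rate
    "input_tokens_down_threshold": 100,   # scale down below this input rate
    "output_tokens_down_threshold": 50,   # ... and below this output rate
    "min_replicas": 1,                    # never scale below this count
    "averaging_period": "average_1m",     # rolling window used for the rates
    "cooldown_seconds": 120,              # minimum gap between scaling actions
}

# All other fields are copied verbatim from the curl example.
payload: Dict[str, Any] = {
    "name": "autoscaler",
    "version": "2.0",
    "release_tag": "stable",
    "metadata": {"author": "admin", "category": "analytics"},
    "tags": "analytics,ai",
    "code": "http://MANAGEMENTMASTER:32555/autoscaler.zip",
    "code_type": "tar.xz",
    "type": "policy",
    "policy_input_schema": {"type": "object", "properties": {"input": {"type": "string"}}},
    "policy_output_schema": {"type": "object", "properties": {"output": {"type": "string"}}},
    "policy_settings_schema": {},
    "policy_parameters_schema": {},
    "policy_settings": {},
    "policy_parameters": policy_parameters,
    "description": "A policy for token based autoscaling.",
    "functionality_data": {"strategy": "ML-based"},
    "resource_estimates": {},
}


def build_register_request(base_url: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) the POST request for the policy registry."""
    return urllib.request.Request(
        url=f"{base_url}/policy",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_register_request("http://MANAGEMENTMASTER:30102", payload)
print(request.get_method(), request.full_url)
print(json.dumps(json.loads(request.data)["policy_parameters"], indent=2))
# To actually register the policy: urllib.request.urlopen(request)
```

Serializing with `json.dumps` also avoids the quoting problems that come with hand-written JSON inside a shell command.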

## 4. Triggering and Observing the Autoscaler in Action

Now that our policy is deployed and registered, we can see it in action by generating a load on a block that is configured to use it. We will simulate both a high load to trigger a **scale-up** and a low load to trigger a **scale-down**.

In [None]:
# First, create the Block; it should then appear in the Kubernetes dashboard
!curl -X POST -d @./allocation.json -H "Content-Type: application/json" http://MANAGEMENTMASTER:30501/api/createBlock

### a. Triggering a Scale-Up Event

The following command executes a Python script (`client_test_token_autoscaler.py`) that sends inference requests with a high input token count. In the invocation below, each request carries 1500 input tokens, which is intended to push the average token rate above our defined `input_tokens_up_threshold`. Increase `--num-requests` (for example, to 20) to simulate a heavier concurrent load.
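
A quick back-of-the-envelope check can tell you whether a planned load should cross the threshold. The sketch below assumes that the rolling metric behaves like a plain tokens-per-minute average, that all requests land inside one averaging window, and that load is spread evenly across replicas; the real AIOS metric may be computed differently. The helper name is hypothetical.

```python
def expected_avg_tokens_per_minute(
    num_requests: int,
    tokens_per_request: int,
    window_minutes: float = 1.0,
    replicas: int = 1,
) -> float:
    """Rough per-replica token rate for a burst of identical requests."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    total_tokens = num_requests * tokens_per_request
    return total_tokens / window_minutes / replicas


INPUT_UP_THRESHOLD = 500  # default input_tokens_up_threshold from the policy

# Two requests of 1500 input tokens each on a single replica.
rate = expected_avg_tokens_per_minute(num_requests=2, tokens_per_request=1500)
print(rate, rate > INPUT_UP_THRESHOLD)  # 3000.0 True

# The same burst spread over two replicas halves the per-replica rate.
print(expected_avg_tokens_per_minute(2, 1500, replicas=2))  # 1500.0
```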


In [15]:
!python3 client_test_token_autoscaler.py  --input-tokens=1500  --max-output-tokens=50 --num-requests=2 

Starting test with 1 concurrent requests.
Each request will have 100 input tokens and ask for max 50 output tokens.
Session session-66a0c5b6-380c-4c39-ad64-faeb6c3fac3c: Latency: 1.33s
Test finished.


In [None]:
!python3 client_test_token_loadbalancer.py  --max-output-tokens=100 --num-requests=5

### b. Triggering a Scale-Down Event
Once the requests above have completed and the token rate falls below the low thresholds, a scale-down event will be triggered after the cooldown period has elapsed.

### c. Observing the Results

After triggering the policy, you can observe the scaling events in the Kubernetes dashboard and Grafana.

**In Kubernetes:**
- The Kubernetes dashboard is available at https://CLUSTER1MASTER:32319/#/login. Use it to watch for new pods being created (scale-up) or terminated (scale-down).