# A Developer's Guide to Ad-hoc Inference in AIOS

Welcome to this guide on the **Ad-hoc Inference Server**, a powerful feature that combines the dynamic block selection of the Router with direct inference execution. This allows a client to send a single, unified request that both finds the best block for a task and immediately runs inference on it.

Below is a visual overview of the ad-hoc inference process, where a single request triggers both routing and execution.

<img src="adhoc_inference_serving.png" alt="Adhoc inference serving" width="800" height="800">

For more details, refer to the official documentation:
- [Ad-hoc Inference Server Concepts](https://docs.aigr.id/adhoc-inference-server/adhoc-inference-server/)

## The Ad-hoc Server in the AIOS Ecosystem

The Ad-hoc Inference Server streamlines the process of dynamic inference. While the **Router** is excellent for discovering and selecting blocks, it only returns the *ID* of the best block. The client would then need to make a second call to that specific block to run the actual inference.

The Ad-hoc Inference Server eliminates this two-step process. It acts as a higher-level service that:
1.  Receives a single request containing both a `selection_query` (for the Router) and an inference `data` payload.
2.  Internally calls the Router to find the best block based on the `selection_query`.
3.  Automatically forwards the `data` payload to the selected block.
4.  Returns the final inference result to the client.

This provides a seamless experience for applications that need to dynamically select and execute models in a single, atomic operation.
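
The four steps above can be sketched as a small orchestration function. This is only an illustration of the flow, not the server's actual implementation: `handle_adhoc_request`, `select_block`, and `run_inference` are hypothetical names, with the two callables standing in for the Router call and the block inference call.

```python
from typing import Any, Callable, Dict


def handle_adhoc_request(
    request: Dict[str, Any],
    select_block: Callable[[Dict[str, Any]], str],
    run_inference: Callable[[str, Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Route, then infer, in a single call (illustrative only)."""
    # Step 1: the single request carries both parts.
    selection_query = request["selection_query"]
    data = request["data"]

    # Step 2: ask the Router (hypothetical callable) for the best block ID.
    block_id = select_block(selection_query)

    # Step 3: forward the inference payload to the selected block.
    result = run_inference(block_id, data)

    # Step 4: return the inference result, noting which block served it.
    return {"block_id": block_id, "result": result}
```

Passing the two calls in as parameters keeps the sketch runnable without a cluster: stub functions can replace the Router and the block during testing.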

## 1. Deploying the Policy
With the policy tested, we can now deploy it to AIOS. This involves three steps: assembling the policy, uploading the policy package (`attributebasedrouter.zip`), and registering it with the AIOS Policy System.

### a. Assembling the Policy for Deployment

Now that we have defined all the methods of our `AIOSv1PolicyRule` class, we need to assemble them into a single script and package it correctly for deployment. The policy must be contained within a `code` directory, which includes the `function.py` file and a `requirements.txt` file.

In [None]:
# Create a requirements.txt file
with open("code/requirements.txt", "w") as f:
    f.write("requests\n")  # requests is the only external dependency declared for this policy

print("code/function.py and code/requirements.txt created successfully.")
!cat code/function.py

Now, let's package our policy into an `attributebasedrouter.zip` file for deployment. The zip file must contain the `code` directory.

In [None]:
!zip -r attributebasedrouter.zip code
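
Before uploading, it can help to confirm that the archive has the expected layout. The check below is a minimal sketch using Python's standard `zipfile` module. The helper name `missing_policy_files` is illustrative, and the list of required entries is taken from the `code` directory layout described above.

```python
import zipfile
from typing import List

# Entries the policy package is expected to contain (per the layout above).
REQUIRED_FILES = ("code/function.py", "code/requirements.txt")


def missing_policy_files(zip_path: str) -> List[str]:
    """Return the required entries that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return [name for name in REQUIRED_FILES if name not in names]
```

Calling `missing_policy_files("attributebasedrouter.zip")` returns an empty list when both files are present.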

### b. Upload the Policy Package

First, we upload the `.zip` file containing our `function.py` to a location accessible by AIOS. The following `curl` command sends the file to an upload server, which makes it available via a URL.

In [None]:
!curl -X POST http://POLICYSTORESERVER:30186/upload -F "file=@./attributebasedrouter.zip" -F "path=."

### c. Register the Policy

Next, we register the policy with AIOS. The `requests.post` call below sends a JSON payload to the policy registry endpoint. The payload contains metadata about our policy, including its name, version, and the URL where the code can be found (`"code": "http://MANAGEMENTMASTER:32555/attributebasedrouter.zip"`). This tells AIOS how to find and execute our policy.

In [None]:
import requests

policy_registration_payload = {
    "name": "attributebasedrouter",
    "version": "2.0",
    "release_tag": "stable",
    "metadata": {"author": "admin", "category": "analytics"},
    "tags": "analytics,ai",
    "code": "http://MANAGEMENTMASTER:32555/attributebasedrouter.zip",
    "code_type": "tar.xz",
    "type": "policy",
    "policy_input_schema": {
        "type": "object",
        "properties": {
            "input": {"type": "string"}
        }
    },
    "policy_output_schema": {
        "type": "object",
        "properties": {
            "output": {"type": "string"}
        }
    },
    "policy_settings_schema": {
        "METRICS_BASE_URL": {
            "type": "string",
            "description": "Base URL for fetching block metrics",
            "pattern": "^https?://.*"
        },
        "APP_MAP": {
            "type": "object",
            "description": "Mapping of application types to use cases",
            "additionalProperties": {"type": "string"}
        }
    },
    "policy_parameters_schema": {
        "METRICS_BASE_URL": {
            "type": "string",
            "description": "Override for metrics base URL",
            "pattern": "^https?://.*"
        },
        "selection_query": {
            "type": "object",
            "description": "Query for selecting and optimizing block candidates",
            "properties": {
                "application_type": {"type": "string"},
                "optimization_goals": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "weight": {"type": "number"}
                        },
                        "required": ["name"]
                    }
                }
            }
        }
    },
    "policy_settings": {
        "METRICS_BASE_URL": "http://metrics-service:30201",
        "APP_MAP": {
            "RAG": "chat-completion",
            "Code": "code-generation",
            "Vision": "multi-modal",
            "Chat": "chat-completion"
        }
    },
    "policy_parameters": {
        "METRICS_BASE_URL": "http://metrics-service:30201"
    },
    "description": "A policy for dynamic block selection based on metrics and application type.",
    "functionality_data": {"strategy": "ML-based"},
    "resource_estimates": {}
}



In [None]:
response = requests.post(
    "http://MANAGEMENTMASTER:30102/policy",
    json=policy_registration_payload,
    headers={"Content-Type": "application/json"}
)

print("Status Code:", response.status_code)
print("Response:", response.text)
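
The selection query in the next section refers to this policy as `attributebasedrouter:2.0-stable`, which appears to combine the `name`, `version`, and `release_tag` fields registered above. Assuming that `name:version-release_tag` pattern holds (it is inferred from this example, not taken from the AIOS reference), a small helper can derive the URI from the registration payload so the two stay in sync. The function name `policy_rule_uri` is hypothetical.

```python
from typing import Any, Dict


def policy_rule_uri(registration: Dict[str, Any]) -> str:
    """Build 'name:version-release_tag' from a registration payload.

    Assumption: the URI format is inferred from the example in this guide.
    """
    name = registration["name"]
    version = registration["version"]
    release_tag = registration["release_tag"]
    return f"{name}:{version}-{release_tag}"
```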

## 2. The Ad-hoc Inference Request Structure

The payload below is sent to the `/v1/infer` endpoint. The request is a hybrid: it contains both an inference payload (the `data` section) and a block selection query (the `selection_query` section).

In [None]:
payload = {
    "model": "",
    "session_id": "session-1257",
    "seq_no": 11,
    "data": {
        "mode": "chat",
        "gen_params": {
            "temperature": 0.1,
            "top_p": 0.95,
            "max_tokens": 256
        },
        "message": "Explain about the Deepseek LLM Model"
    },
    "graph": {},
    "selection_query": {
        "header": {
            "templateUri": "Parser/V1"
        },
        "body": {
            "values": {
                "matchType": "block",
                "rankingPolicyRule": {
                    "values": {
                        "policyRuleURI": "attributebasedrouter:2.0-stable",
                        "parameters": {
                            "filterRule": {
                                "matchType": "block",
                                "filter": {
                                    "blockQuery": {
                                        "logicalOperator": "AND",
                                        "conditions": [
                                            {
                                                "variable": "component.componentMetadata.usecase",
                                                "operator": "LIKE",
                                                "value": "*chat-completion*"
                                            },
                                            {
                                                "variable": "component.componentMetadata.capabilities.supportsStreaming",
                                                "operator": "==",
                                                "value": True
                                            },
                                            {
                                                "logicalOperator": "OR",
                                                "conditions": [
                                                    {
                                                        "variable": "component.componentMetadata.evaluation.benchmarks.MMLU.value",
                                                        "operator": ">=",
                                                        "value": 80
                                                    },
                                                    {
                                                        "variable": "component.componentMetadata.evaluation.benchmarks.ARC-Challenge.value",
                                                        "operator": ">=",
                                                        "value": 90
                                                    }
                                                ]
                                            }
                                        ]
                                    },
                                    "blockMetricsQuery": {
                                        "logicalOperator": "OR",
                                        "conditions": [
                                            {
                                                "aggOperator": "avg",
                                                "target": "instances.uptime_hours",
                                                "operator": ">",
                                                "value": 10
                                            },
                                            {
                                                "aggOperator": "avg",
                                                "target": "instances.llm_tokens_per_second",
                                                "operator": ">",
                                                "value": 10
                                            }
                                        ]
                                    }
                                }
                            },
                            "user_query": "Find a reliable and high-performing chat model",
                            "METRICS_BASE_URL": "http://MANAGEMENTMASTER:30201/block/",
                            "selection_query": {
                                "optimization_goals": [
                                    {
                                        "name": "Fast_Generator",
                                        "weight": 0.9
                                    },
                                    {
                                        "name": "Free_Block",
                                        "weight": 0.1
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}


## 3. Understanding the Flow

Here’s what happens when AIOS receives this request:

1.  **Selection First**: The Ad-hoc Inference Server first takes the `selection_query` part and sends it to the **Router**. The Router uses the specified policy (`attributebasedrouter:2.0-stable`) to find the best block. This involves:
    - **Filtering**: Applying the `blockQuery` (for static metadata) and `blockMetricsQuery` (for real-time metrics) to find all suitable candidates.
    - **Scoring & Ranking**: The `attributebasedrouter` policy then scores the candidates based on the `optimization_goals` to create a ranked list.
    - **Selection**: The Router selects the top-ranked block.

2.  **Inference Next**: Once the Router returns the ID of the selected block, the Ad-hoc Inference Server takes the `data` part of the original request and forwards it to that specific block for processing.
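
The policy's actual scoring logic is not shown in this guide. As a rough illustration of how weighted `optimization_goals` could produce a ranked list, the sketch below assumes each candidate already has a normalized score per goal and ranks by the weighted sum. The function name, the `goal_scores` structure, and the default weight of `1.0` are all assumptions made for this example.

```python
from typing import Any, Dict, List


def rank_candidates(
    goal_scores: Dict[str, Dict[str, float]],
    optimization_goals: List[Dict[str, Any]],
) -> List[str]:
    """Rank block IDs by the weighted sum of their per-goal scores, best first."""

    def total(block_id: str) -> float:
        scores = goal_scores[block_id]
        # A goal without an explicit weight defaults to 1.0 (assumption);
        # a block without a score for a goal contributes 0.0 for that goal.
        return sum(
            goal.get("weight", 1.0) * scores.get(goal["name"], 0.0)
            for goal in optimization_goals
        )

    return sorted(goal_scores, key=total, reverse=True)
```

With the goals from the request (`Fast_Generator` at `0.9`, `Free_Block` at `0.1`), a block that generates quickly outranks a free but slower one under this scheme, because generation speed carries nine times the weight.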

### The Filters in This Request

-   **`blockQuery` (Metadata)**: We're looking for a streaming-capable chat-completion block with strong benchmark results. All of the following must hold (`AND`):
    -   `usecase` matches `*chat-completion*`.
    -   `capabilities.supportsStreaming` is `true`.
    -   Either the `MMLU` benchmark value is at least `80` or the `ARC-Challenge` benchmark value is at least `90`.
-   **`blockMetricsQuery` (Metrics)**: We want a block that is stable or performing well. At least one of the following must hold (`OR`):
    -   Average `uptime_hours` across instances is greater than `10`.
    -   Average `llm_tokens_per_second` across instances is greater than `10`.
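
To make the nested `AND`/`OR` structure of the `blockQuery` in the section 2 request concrete, here is a minimal sketch of a recursive evaluator. It is not the Router's implementation: the function names are hypothetical, `LIKE` is assumed to use shell-style `*` wildcards, and only the operators that appear in the request are handled. Metric conditions (`aggOperator`, `target`) are out of scope for this sketch.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict

_MISSING = object()  # Sentinel for "path not present in the block metadata"


def lookup(record: Dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as 'a.b.c' through nested dicts."""
    current: Any = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def matches(node: Dict[str, Any], record: Dict[str, Any]) -> bool:
    """Evaluate a condition node (leaf or AND/OR group) against one block."""
    if "conditions" in node:
        results = [matches(child, record) for child in node["conditions"]]
        if node["logicalOperator"] == "AND":
            return all(results)
        if node["logicalOperator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {node['logicalOperator']}")

    actual = lookup(record, node["variable"])
    if actual is _MISSING:
        return False  # A block lacking the field cannot satisfy the condition
    expected, op = node["value"], node["operator"]
    if op == "LIKE":
        return fnmatchcase(str(actual), str(expected))  # assumed wildcard semantics
    if op == "==":
        return actual == expected
    if op == ">=":
        return actual >= expected
    if op == ">":
        return actual > expected
    raise ValueError(f"Unsupported operator: {op}")
```

Passing the request's `blockQuery` dictionary and a block's metadata to `matches` returns `True` only when the use case, streaming, and benchmark conditions are all satisfied.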

## 4. Running the Inference

Executing the `requests.post` call below triggers the entire flow. The system finds the best chat-completion block and asks it to answer the provided message, all in a single call.

In [None]:
import requests
import json

url = "http://CLUSTER1MASTER:31504/v1/infer"
headers = {"Content-Type": "application/json"}
try:
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    print(f"Status Code: {response.status_code}")
    print(f"Response Headers: {dict(response.headers)}")
    
    if response.headers.get('content-type', '').startswith('application/json'):
        print(f"Response JSON: {json.dumps(response.json(), indent=2)}")
    else:
        print(f"Response Text: {response.text}")
        
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

## 5. Observing in Logs

To see the whole process, you can check the logs of two services:

1.  **Router (Parser) Logs**: These logs show the details of the selection process, including the list of candidate blocks, the results of the filters, the final ranked list, and the block that was ultimately selected. You can view them in the `Management Server` at http://MANAGEMENTMASTER:32199/explore, under the label `container` with the value `executor-executor-001`.
2.  **Ad-hoc Inference Server Logs**: The logs of the block are available in the `K8s dashboard` at https://CLUSTER1MASTER:32319/.

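## 6. LLM-Based Matchers

A policy can also delegate the ranking decision to an LLM. The `eval` method below builds a prompt from the user's query string and a summary of each candidate, asks an LLM block to choose one, and then extracts the chosen `containerImage` from the reply.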

In [2]:
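import json
import logging
import re
from typing import Dict, List, Any, Optional

logger = logging.getLogger(__name__)

# Method of the policy class, shown on its own; it relies on self.parameters,
# self.current_time, and the helper methods called below.
def eval(self, parameters: Dict, input_data: Dict, context: Dict) -> Dict: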
        """
        Main evaluation function that processes input data and returns the selected block.
        """
        try:
            merged_params = {**self.parameters, **parameters}
            
            candidates = input_data.get('candidates', [])
            if not candidates:
                return self._create_error_response("No candidates provided")
            
            query_string = merged_params.get('query_string', '').strip()
            if not query_string:
                return self._create_error_response("Query string is required")

            llm_block_id = merged_params.get('llm_block_id')
            llm_server_address = merged_params.get('llm_server_address')

            if not llm_block_id or not llm_server_address:
                return self._create_error_response("LLM block ID and server address are required")

            # Extract relevant fields for the prompt
            summarized_candidates = [self._summarize_candidate(c) for c in candidates]

            prompt = f"""
            Based on the following user query, select the best block from the candidates provided.
            User Query: "{query_string}"
            
            Candidates:
            {json.dumps(summarized_candidates, indent=2)}
            
            Respond with only the containerImage of the best candidate in a JSON object with the key "containerImage".
            """
        
            llm_response = self._call_llm_inference(prompt, llm_block_id, llm_server_address)
            
            if not llm_response or 'reply' not in llm_response:
                return self._create_error_response("Invalid or empty response from LLM")
            
            reply_text = llm_response['reply'].strip()
            
            selected_container_image = None
            
            # 1. Look for a JSON code block
            json_match = re.search(r"```json\n(.*?)\n```", reply_text, re.DOTALL)
            if json_match:
                try:
                    json_data = json.loads(json_match.group(1))
                    selected_container_image = json_data.get("containerImage")
                except json.JSONDecodeError:
                    logger.warning(f"Found JSON block but failed to decode: {json_match.group(1)}")

            # 2. If no JSON block found, try to parse the whole reply
            if not selected_container_image:
                try:
                    json_reply = json.loads(reply_text)
                    selected_container_image = json_reply.get("containerImage")
                except json.JSONDecodeError:
                    pass # Not a JSON object

            # 3. If still nothing, assume the last line is the container image
            #    (an empty reply has no lines, so guard against IndexError)
            if not selected_container_image:
                selected_container_image = reply_text.splitlines()[-1].strip() if reply_text else None


            if not selected_container_image:
                 return self._create_error_response(f"Could not extract containerImage from LLM response: {reply_text}")

            # Find the selected block in the candidates
            selected_block = next((c for c in candidates if c.get("containerRegistryInfo", {}).get("containerImage") == selected_container_image), None)

            if not selected_block:
                return self._create_error_response(f"LLM selected a containerImage not in the candidate list: {selected_container_image}")

            return {
                "allowed": True,
                "input_data": {
                    "selected_block": selected_block,
                    "timestamp": self.current_time
                }
            }
            
        except Exception as e:
            logger.error(f"Error during evaluation: {e}", exc_info=True)
            return self._create_error_response(f"Error during evaluation: {str(e)}")
