# <font color="#418FDE" size="6.5" uppercase>**Serving Models**</font>

>Last update: 20260130.
    
By the end of this lecture, you will be able to:
- Wrap a PyTorch model in a simple inference function that handles preprocessing and postprocessing. 
- Integrate the inference function into a lightweight REST API or batch inference script. 
- Evaluate latency and throughput of the serving setup and identify simple optimizations. 


## **1. Designing Inference Functions**

### **1.1. Robust Input Checks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_01_01.jpg?v=1769778782" width="250">



>* Inference functions must guard against bad inputs
>* Validate types, structure, and fields; give clear errors

>* Check input shapes, sizes, and channels carefully
>* Align all validation rules with model training data

>* Catch errors, give clear, contextual messages
>* Log invalid inputs to monitor and improve



In [None]:
#@title Python Code - Robust Input Checks

# This script shows robust input checks.
# We wrap a tiny model with validation.
# Focus on safe preprocessing before inference.

# !pip install tensorflow==2.20.0

# Import standard libraries for typing.
from typing import List, Tuple, Union

# Import numpy for simple array handling.
import numpy as np

# Import tensorflow for the model and image resizing.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define expected image height.
EXPECTED_HEIGHT: int = 28

# Define expected image width.
EXPECTED_WIDTH: int = 28

# Define expected number of channels.
EXPECTED_CHANNELS: int = 1

# Create a tiny dummy model for demonstration.
model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(
            shape=(EXPECTED_HEIGHT, EXPECTED_WIDTH, EXPECTED_CHANNELS)
        ),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ]
)

# Compile the model with simple settings.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Create a small dummy training batch.
dummy_images = np.random.rand(4, EXPECTED_HEIGHT, EXPECTED_WIDTH, EXPECTED_CHANNELS)

# Create small dummy labels for training.
dummy_labels = np.random.randint(0, 3, size=(4,))

# Train briefly with silent verbose setting.
model.fit(dummy_images, dummy_labels, epochs=1, verbose=0)

# Define a custom exception for input errors.
class InputValidationError(Exception):
    pass

# Validate a single image like object carefully.
def validate_single_image(image: np.ndarray) -> np.ndarray:
    # Check that input is a numpy array.
    if not isinstance(image, np.ndarray):
        raise InputValidationError("Input must be a numpy array.")

    # Check that array has three dimensions (height, width, channels).
    if image.ndim != 3:
        raise InputValidationError(
            f"Image must have three dimensions (H, W, C), got {image.ndim}."
        )

    # Unpack height width and channels.
    height, width, channels = image.shape

    # Check that channels match expectation.
    if channels != EXPECTED_CHANNELS:
        raise InputValidationError("Image must have one channel only.")

    # Check that height and width are positive.
    if height <= 0 or width <= 0:
        raise InputValidationError("Image height and width must be positive.")

    # Resize if shape differs from expected.
    if height != EXPECTED_HEIGHT or width != EXPECTED_WIDTH:
        image = tf.image.resize(
            image, size=(EXPECTED_HEIGHT, EXPECTED_WIDTH)
        ).numpy()

    # Normalize pixel values to zero one.
    image = image.astype("float32") / 255.0

    # Return validated and normalized image.
    return image

# Inference function with robust input checks.
def predict_single_image(image: Union[np.ndarray, None]) -> Tuple[int, float]:
    # Check that something was actually provided.
    if image is None:
        raise InputValidationError("No image was provided to function.")

    # Validate and preprocess the image.
    processed = validate_single_image(image)

    # Add batch dimension for model call.
    batch = np.expand_dims(processed, axis=0)

    # Run model prediction inside try block.
    try:
        probs = model.predict(batch, verbose=0)
    except Exception as exc:
        raise RuntimeError("Model inference failed.") from exc

    # Extract predicted class and confidence.
    class_index = int(np.argmax(probs[0]))

    # Extract confidence as float value.
    confidence = float(np.max(probs[0]))

    # Return prediction and confidence tuple.
    return class_index, confidence

# Helper to run a scenario and print result.
def run_scenario(name: str, image_candidate) -> None:
    # Print scenario name for clarity.
    print("Scenario:", name)

    # Try running prediction and catch errors.
    try:
        pred_class, conf = predict_single_image(image_candidate)
        print(" Prediction:", pred_class, "Confidence:", round(conf, 3))
    except InputValidationError as ive:
        print(" InputValidationError:", str(ive))
    except Exception as exc:
        print(" Other error:", type(exc).__name__)

# Create a valid dummy grayscale image.
# The upper bound of randint is exclusive, so 256 covers the full uint8 range.
valid_image = np.random.randint(
    0, 256, size=(28, 28, 1), dtype="uint8"
)

# Create an image with wrong number of channels.
wrong_channels_image = np.random.randint(
    0, 256, size=(28, 28, 3), dtype="uint8"
)

# Create an image with wrong spatial size.
wrong_size_image = np.random.randint(
    0, 256, size=(40, 20, 1), dtype="uint8"
)

# Run scenario with valid image input.
run_scenario("valid image", valid_image)

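# Run scenario with wrong channels image.
run_scenario("wrong channels", wrong_channels_image)

# Run scenario with wrong spatial size, which is resized before inference.
run_scenario("wrong size", wrong_size_image)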

# Run scenario with None as invalid input.
run_scenario("none input", None)
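
The bullets above also ask for contextual error messages and for logging invalid inputs, which the script does not show. The following is a minimal sketch of that idea using only the standard library. The names `check_shape`, `rejection_counts`, the logger name, and the request IDs are hypothetical and chosen for illustration only.

```python
import logging
from collections import Counter

logger = logging.getLogger("inference.validation")  # hypothetical logger name

# Hypothetical counter of rejection reasons, useful for monitoring.
rejection_counts = Counter()

EXPECTED_CHANNELS = 1  # assumed to match the training data


class InputValidationError(Exception):
    """Raised when a request cannot be passed to the model."""


def check_shape(shape, request_id):
    """Validate an (H, W, C) shape tuple and log rejections with context."""
    if len(shape) != 3:
        reason = "wrong_rank"
        message = f"expected 3 dimensions (H, W, C), got {len(shape)}"
    elif shape[2] != EXPECTED_CHANNELS:
        reason = "wrong_channels"
        message = f"expected {EXPECTED_CHANNELS} channel(s), got {shape[2]}"
    else:
        return True
    # Count and log the rejection before raising, so monitoring sees it.
    rejection_counts[reason] += 1
    logger.warning("rejected request %s: %s", request_id, message)
    raise InputValidationError(f"request {request_id}: {message}")


# Try one valid shape and two invalid ones.
for rid, shape in [("r1", (28, 28, 1)), ("r2", (28, 28, 3)), ("r3", (28, 28))]:
    try:
        check_shape(shape, rid)
    except InputValidationError as err:
        print(err)

print(dict(rejection_counts))
```

Counting rejections by reason is one simple way to notice recurring client mistakes. A production service would more likely export such counts to its metrics system.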




### **1.2. Input to Tensor Conversion**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_01_02.jpg?v=1769778858" width="250">



>* Convert varied inputs into standardized model tensors
>* Centralize conversion to avoid shape and type bugs

>* Handle single items and batches consistently
>* Standardize shapes, precision, and device for performance

>* Reuse training-time preprocessing during tensor conversion
>* Centralized transforms keep serving consistent and reliable
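
One way to reuse training-time preprocessing is to keep the constants in a single object that both the training and serving code import. The sketch below assumes a hypothetical `PreprocessConfig` with illustrative `scale`, `mean`, and `std` values. None of these names or numbers come from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical container for values fixed at training time."""
    scale: float = 255.0
    mean: float = 0.5
    std: float = 0.5


def preprocess_pixels(pixels, config):
    """Scale raw 0-255 values, then standardize with the training statistics."""
    return [((p / config.scale) - config.mean) / config.std for p in pixels]


# A single shared config means training and serving cannot drift apart.
CONFIG = PreprocessConfig()

train_features = preprocess_pixels([0, 255], CONFIG)
serve_features = preprocess_pixels([0, 255], CONFIG)
print(train_features, serve_features)
```

Because the dataclass is frozen, the serving code cannot modify the constants by accident.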



In [None]:
#@title Python Code - Input to Tensor Conversion

# This script shows input tensor conversion.
# We focus on simple image like arrays.
# All steps are small and beginner friendly.

# Required library is tensorflow for tensor handling.
# !pip install tensorflow==2.20.0

# Import standard and numeric libraries for seeding and arrays.
import random
import numpy as np

# Import tensorflow for tensor operations and models.
import tensorflow as tf

# Set deterministic seeds for reproducible behavior.
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

# Print tensorflow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define expected image height and width values.
IMG_HEIGHT = 28
IMG_WIDTH = 28
IMG_CHANNELS = 1

# Define expected tensor dtype and device placement.
EXPECTED_DTYPE = tf.float32
DEVICE = "/CPU:0"

# Create a tiny dummy model for demonstration.
with tf.device(DEVICE):
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Input(
                shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS)
            ),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )

# Compile model with simple configuration.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Create a small random batch for quick training.
train_images = np.random.randint(
    0,
    256,
    size=(16, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS),
    dtype=np.uint8,
)

train_labels = np.random.randint(0, 10, size=(16,), dtype=np.int32)

# Train briefly on inputs scaled to the zero one range, matching serving.
model.fit(train_images.astype("float32") / 255.0, train_labels, epochs=1, verbose=0)

# Define a function to normalize uint8 image arrays.
def normalize_image_array(image_array: np.ndarray) -> np.ndarray:
    # Ensure array has correct number of dimensions.
    if image_array.ndim not in (2, 3):
        raise ValueError("Image array must be 2D or 3D.")

    # If grayscale without channel, add channel dimension.
    if image_array.ndim == 2:
        image_array = image_array[..., np.newaxis]

    # Validate channel count against the expected value.
    h, w, c = image_array.shape
    if c != IMG_CHANNELS:
        raise ValueError("Unexpected channel count in image array.")

    # Convert to float32 and scale to zero one range.
    image_array = image_array.astype("float32") / 255.0
    return image_array

# Define function to convert various inputs into tensor.
def to_model_tensor(input_data) -> tf.Tensor:
    # Handle single numpy array input case.
    if isinstance(input_data, np.ndarray):
        image_array = normalize_image_array(input_data)
        batch_array = np.expand_dims(image_array, axis=0)

    # Handle list of numpy arrays as batch input.
    elif isinstance(input_data, list):
        if not input_data:
            raise ValueError("Input list must not be empty.")
        normalized_list = []
        for idx, item in enumerate(input_data):
            if not isinstance(item, np.ndarray):
                raise TypeError(f"List item {idx} must be a numpy array.")
            norm_item = normalize_image_array(item)
            normalized_list.append(norm_item)
        batch_array = np.stack(normalized_list, axis=0)

    # Raise error for unsupported input types.
    else:
        raise TypeError("Input must be numpy array or list of arrays.")

    # Convert numpy batch into tensorflow tensor.
    tensor = tf.convert_to_tensor(batch_array, dtype=EXPECTED_DTYPE)

    # Validate final tensor shape before returning.
    if tensor.shape.rank != 4:
        raise ValueError("Tensor must be rank four for model.")
    if tensor.shape[1:] != (IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS):
        raise ValueError("Tensor spatial shape does not match.")

    # Move tensor to desired device context.
    with tf.device(DEVICE):
        tensor_on_device = tf.identity(tensor)
    return tensor_on_device

# Define simple inference function using conversion helper.
def run_inference(input_data):
    # Convert raw input into clean model tensor.
    input_tensor = to_model_tensor(input_data)

    # Run model prediction inside device context.
    with tf.device(DEVICE):
        probs = model(input_tensor, training=False)

    # Convert probabilities to predicted class indices.
    predicted_classes = tf.argmax(probs, axis=1)
    return probs.numpy(), predicted_classes.numpy()

# Create a fake single image as height width array.
single_image = np.random.randint(
    0,
    256,
    size=(IMG_HEIGHT, IMG_WIDTH),
    dtype=np.uint8,
)

# Create a small batch list of fake images.
batch_images = [
    np.random.randint(
        0,
        256,
        size=(IMG_HEIGHT, IMG_WIDTH),
        dtype=np.uint8,
    )
    for _ in range(3)
]

# Run inference on single image input.
probs_single, preds_single = run_inference(single_image)

# Run inference on batch list input.
probs_batch, preds_batch = run_inference(batch_images)

# Print output shapes to show consistent tensor conversion.
print("Single input probabilities shape:", probs_single.shape)
print("Single input predicted class:", preds_single)
print("Batch input probabilities shape:", probs_batch.shape)
print("Batch input predicted classes:", preds_batch)
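
The single-versus-batch handling above relies on numpy's `ndim`. When inputs arrive as plain nested lists (for example from a JSON body), the same idea can be sketched without numpy. The helper names `nesting_depth` and `ensure_batch` are hypothetical.

```python
def nesting_depth(value):
    """Count list nesting levels, following the first element at each level."""
    depth = 0
    while isinstance(value, list):
        if not value:
            raise ValueError("empty lists are not valid inputs")
        depth += 1
        value = value[0]
    return depth


def ensure_batch(data, item_depth):
    """Return (batch, was_single) so downstream code always sees a batch."""
    depth = nesting_depth(data)
    if depth == item_depth:
        return [data], True
    if depth == item_depth + 1:
        return data, False
    raise ValueError(
        f"expected nesting depth {item_depth} or {item_depth + 1}, got {depth}"
    )


single = [0.1, 0.2, 0.3]
batch = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(ensure_batch(single, item_depth=1))
print(ensure_batch(batch, item_depth=1))
```

Returning the `was_single` flag lets the caller unwrap the result again, so a single request gets a single prediction back.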




### **1.3. Interpreting Model Outputs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_01_03.jpg?v=1769778930" width="250">



>* Inference function turns raw tensors into results
>* Centralized logic ensures consistent, structured model outputs

>* Convert numeric model scores into domain meanings
>* Bridge tensors to labels, decisions, and UI outputs

>* Return predictions plus helpful contextual details
>* Use stable structured outputs for easy integration



In [None]:
#@title Python Code - Interpreting Model Outputs

# This script explains interpreting model outputs.
# We simulate a tiny classifier prediction pipeline.
# Focus is on postprocessing logits into labels.

# No external installation is required; only the standard library is used.

# Import required standard libraries.
import random
import math

# Set deterministic random seeds for reproducibility.
random.seed(42)

# Define a tiny label vocabulary for our classifier.
LABELS = ["negative", "neutral", "positive"]

# Create a fake model function returning raw logits.
def fake_model_logits(text_batch):
    # Validate input type and basic structure.
    if not isinstance(text_batch, list):
        raise TypeError("text_batch must be a list")

    # Build deterministic logits using text length features.
    logits_batch = []
    for text in text_batch:
        # Ensure each element is a string input.
        if not isinstance(text, str):
            raise TypeError("each item must be a string")

        # Use simple handcrafted scoring heuristics.
        length_score = len(text) / 20.0
        exclam_score = text.count("!") * 0.5
        # Compose three class logits from features.
        neg_logit = 0.5 - length_score
        neu_logit = 0.1
        pos_logit = length_score + exclam_score

        # Collect logits for this single example.
        logits_batch.append([neg_logit, neu_logit, pos_logit])

    # Return list of lists representing logits.
    return logits_batch

# Convert logits into probabilities using softmax.
def softmax(logits):
    # Subtract max for numerical stability.
    max_logit = max(logits)
    shifted = [x - max_logit for x in logits]

    # Exponentiate shifted logits safely.
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)

    # Avoid division by zero with fallback.
    if total == 0.0:
        return [1.0 / len(logits)] * len(logits)

    # Normalize to obtain probabilities.
    return [x / total for x in exps]

# Interpret logits into structured prediction dictionaries.
def interpret_logits(logits_batch, top_k=2):
    # Validate label vocabulary and top_k before processing.
    if len(LABELS) == 0:
        raise ValueError("LABELS list is empty")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Ensure each logits vector matches labels length.
    for logits in logits_batch:
        if len(logits) != len(LABELS):
            raise ValueError("logits length mismatch")

    # Build structured outputs for each example.
    results = []
    for logits in logits_batch:
        # Convert logits to probabilities via softmax.
        probs = softmax(logits)

        # Pair labels with probabilities together.
        label_probs = list(zip(LABELS, probs))
        # Sort by probability descending order.
        label_probs.sort(key=lambda x: x[1], reverse=True)

        # Select top_k predictions for transparency.
        top = label_probs[:top_k]
        primary_label, primary_prob = top[0]

        # Package prediction and auxiliary information.
        result = {
            "label": primary_label,
            "confidence": round(primary_prob, 3),
            "top_k": [
                {"label": lab, "prob": round(p, 3)}
                for lab, p in top
            ],
            "raw_logits": [round(x, 3) for x in logits],
        }

        # Append structured result to batch list.
        results.append(result)

    # Return list of structured prediction dictionaries.
    return results

# High level inference function wrapping full pipeline.
def run_inference(text_batch):
    # Get raw logits from our fake model.
    logits_batch = fake_model_logits(text_batch)

    # Interpret logits into semantic outputs.
    interpreted = interpret_logits(logits_batch, top_k=3)

    # Return final structured predictions to caller.
    return interpreted

# Prepare a tiny batch of example input texts.
example_texts = [
    "I love this course!",
    "The video was okay.",
    "This is really disappointing",
]

# Run inference and obtain structured outputs.
predictions = run_inference(example_texts)

# Print a concise summary for each example.
for text, pred in zip(example_texts, predictions):
    print("Text:", text)
    print("Primary:", pred["label"], pred["confidence"])
    print("Top k:", pred["top_k"])
    print("Raw logits:", pred["raw_logits"])
    print("---")
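
The dictionaries above carry labels and probabilities. To illustrate the "decisions" and "stable structured outputs" points, the sketch below wraps a prediction in a frozen dataclass and maps low confidence to a review decision. The `Prediction` class, the `needs_review` decision, and the 0.6 threshold are assumptions made for this example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Prediction:
    """Hypothetical fixed response schema for one prediction."""
    label: str
    confidence: float
    decision: str


def to_prediction(label_probs, min_confidence=0.6):
    """Pick the top label and map low confidence to a 'needs_review' decision."""
    label, prob = max(label_probs.items(), key=lambda kv: kv[1])
    decision = "accept" if prob >= min_confidence else "needs_review"
    return Prediction(label=label, confidence=round(prob, 3), decision=decision)


sure = to_prediction({"negative": 0.05, "neutral": 0.15, "positive": 0.80})
unsure = to_prediction({"negative": 0.30, "neutral": 0.36, "positive": 0.34})

# asdict() gives a plain dictionary that serializes cleanly to JSON.
print(asdict(sure))
print(asdict(unsure))
```

A fixed schema like this keeps client code simple: every response has the same fields regardless of how confident the model is.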




## **2. APIs and Batch Serving**

### **2.1. Building a REST API**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_02_01.jpg?v=1769778995" width="250">



>* Expose notebook models through a simple web service
>* REST API wraps inference, handling requests and responses

>* Define endpoint method, request body, and response
>* Connect inference function and return predictions with metadata

>* Optimize model loading, concurrency, and request validation
>* Design API as reliable, production-ready prediction bridge
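
A common first optimization is to load the model once rather than on every request. The sketch below shows that pattern with `functools.lru_cache`. The loader is a stand-in lambda, and `LOAD_CALLS` exists only to show how often the load runs.

```python
import functools

LOAD_CALLS = 0  # counts how often the expensive load actually runs


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and reuse it for later requests."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Stand-in for an expensive load such as reading weights from disk.
    return lambda x: 0.3 * x + 0.7


def handle_request(x):
    """Serve one request using the cached model."""
    return get_model()(x)


results = [handle_request(v) for v in (0.0, 1.0, 2.0)]
print("requests served:", len(results), "model loads:", LOAD_CALLS)
```

With concurrent first requests the cached function may still run more than once, so many services load the model explicitly at startup instead.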



In [None]:
#@title Python Code - Building a REST API

# This script shows a tiny REST style API example.
# We simulate serving a model with a simple function.
# Focus is on clear structure and beginner friendly code.

# This simulation uses only the standard library.
# A real service would need a web framework, for example:
# !pip install flask

# Import standard libraries for JSON, randomness, and timing.
import json
import random
import time

# Set a deterministic seed for reproducible behavior.
random.seed(42)

# Define a tiny pretend model using a simple function.
def tiny_model_predict(x_values):
    # Validate that input is a list of numbers.
    if not isinstance(x_values, list):
        raise TypeError("input must be list of numbers")

    # Check each element type for safety.
    for value in x_values:
        if not isinstance(value, (int, float)):
            raise TypeError("all items must be numeric")

    # Apply a simple linear transformation as prediction.
    predictions = [0.3 * v + 0.7 for v in x_values]
    return predictions

# Wrap preprocessing, model call, and postprocessing.
def inference_pipeline(payload):
    # Ensure payload is a dictionary like a JSON body.
    if not isinstance(payload, dict):
        raise TypeError("payload must be a dictionary")

    # Extract feature list with a safe default.
    features = payload.get("features", [])

    # Validate that features list is not empty.
    if len(features) == 0:
        raise ValueError("features list must not be empty")

    # Record start time for simple latency measurement.
    # perf_counter is monotonic and better suited to short intervals.
    start_time = time.perf_counter()

    # Call the tiny model prediction function.
    raw_predictions = tiny_model_predict(features)

    # Convert raw predictions into rounded scores.
    scores = [round(p, 3) for p in raw_predictions]

    # Build a simple label based on a threshold.
    labels = ["high" if s > 1.0 else "low" for s in scores]

    # Compute elapsed time in milliseconds.
    latency_ms = (time.perf_counter() - start_time) * 1000.0

    # Build a structured response dictionary.
    response = {
        "scores": scores,
        "labels": labels,
        "latency_ms": round(latency_ms, 3),
    }
    return response

# Simulate a minimal REST style handler function.
def handle_predict_request(json_body):
    # Parse JSON string into a Python dictionary.
    try:
        payload = json.loads(json_body)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid json: {exc}"}

    # Call inference pipeline and catch possible errors.
    try:
        result = inference_pipeline(payload)
    except Exception as exc:
        return {"error": str(exc)}

    # Return successful result as a dictionary.
    return {"result": result}

# Build a small helper to measure throughput.
def measure_throughput(example_body, num_requests):
    # Ensure request count is a positive integer.
    if not isinstance(num_requests, int) or num_requests <= 0:
        raise ValueError("num_requests must be positive integer")

    # Record start time before the loop.
    start = time.perf_counter()

    # Send repeated requests through the handler.
    for _ in range(num_requests):
        _ = handle_predict_request(example_body)

    # Compute total elapsed time in seconds.
    total_time = time.perf_counter() - start

    # Derive requests per second as throughput.
    rps = num_requests / total_time if total_time > 0 else 0.0

    # Return both total time and throughput.
    return total_time, rps

# Create a small example request body for testing.
example_request = {"features": [0.0, 1.0, 2.0, 3.5]}

# Convert example request to a JSON string.
example_json = json.dumps(example_request)

# Call the handler once to see a single response.
single_response = handle_predict_request(example_json)

# Measure throughput for a small batch of requests.
num_test_requests = 50

# Use the helper to compute timing statistics.
elapsed, throughput = measure_throughput(example_json, num_test_requests)

# Print a short summary of the simulated API behavior.
print("Tiny REST style API simulation summary:")
print("Single response keys:", list(single_response.keys()))
print("Single result sample:", single_response.get("result"))
print("Requests tested:", num_test_requests)
print("Total time seconds:", round(elapsed, 4))
print("Requests per second:", round(throughput, 2))
print("Average latency ms:", round(1000.0 * elapsed / num_test_requests, 3))
print("This demonstrates wrapping inference in an API handler.")
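
The handler above returns dictionaries but leaves out the HTTP method, path, and status code that a REST endpoint also defines. The sketch below adds those as a pure function, so it can be tested without a web server. The `/predict` route, the status codes chosen (400, 404, 405, 422), and the `model_version` field are illustrative assumptions, not something the lecture prescribes.

```python
import json


def predict_endpoint(method, path, body):
    """Return (status_code, response_dict) for a hypothetical POST /predict route."""
    if path != "/predict":
        return 404, {"error": "unknown path"}
    if method != "POST":
        return 405, {"error": "use POST"}
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, {"error": "body is not valid JSON"}
    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list) or not features:
        return 422, {"error": "'features' must be a non-empty list"}
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in features):
        return 422, {"error": "'features' must contain only numbers"}
    # Same toy linear model as in the script above.
    scores = [round(0.3 * v + 0.7, 3) for v in features]
    return 200, {"scores": scores, "model_version": "demo-0"}


print(predict_endpoint("POST", "/predict", '{"features": [1.0, 2.0]}'))
print(predict_endpoint("GET", "/predict", ""))
print(predict_endpoint("POST", "/predict", "{not json"))
print(predict_endpoint("POST", "/predict", '{"features": []}'))
```

A web framework would call a function like this from its route handler and translate the tuple into an actual HTTP response.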




### **2.2. Batch Inference Workflows**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_02_02.jpg?v=1769779060" width="250">



>* Run models offline on large data batches
>* Scripts handle I/O, preprocessing, and storage integration

>* Load data, batch it, run model inference
>* Store predictions, log progress, handle partial failures

>* Schedule and scale batch jobs using parallel workers
>* Ensure workflows are fault tolerant, traceable, auditable
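
Scaling a batch job with parallel workers can be sketched with `concurrent.futures` from the standard library. Here `score_chunk` is a toy stand-in for model inference. The chunk size and worker count are arbitrary illustrative values.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_chunk(chunk):
    """Toy stand-in for model inference on one chunk of rows."""
    return [round(sum(row) / len(row), 3) for row in chunk]


rows = [
    [0.1, 0.2, 0.3],
    [0.9, 0.8, 0.7],
    [0.4, 0.3, 0.2],
    [0.7, 0.6, 0.9],
    [0.5, 0.5, 0.5],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # map() yields results in submission order, so scores stay aligned with rows.
    chunk_scores = list(pool.map(score_chunk, chunked(rows, 2)))

scores = [s for chunk in chunk_scores for s in chunk]
print(scores)
```

Threads suit work that releases the GIL, such as I/O or native tensor operations. CPU-bound pure-Python scoring would more likely use separate processes or separate job instances.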



In [None]:
#@title Python Code - Batch Inference Workflows

# This script shows simple batch inference workflows.
# We simulate a PyTorch like model using TensorFlow.
# Focus is on offline batch scoring from a file.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import csv
import random

# Import TensorFlow as lightweight model backend.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version for environment clarity.
print("TensorFlow version:", tf.__version__)

# Define a tiny dense model for numeric features.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile model with simple optimizer and loss.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Create a tiny synthetic training dataset.
train_features = tf.constant([
    [0.1, 0.2, 0.3],
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.4],
    [0.8, 0.9, 0.6],
], dtype=tf.float32)

# Create matching binary labels for training.
train_labels = tf.constant([[0.0], [1.0], [0.0], [1.0]], dtype=tf.float32)

# Train briefly with silent output for speed.
model.fit(train_features, train_labels, epochs=20, verbose=0)

# Define a simple preprocessing function for rows.
def preprocess_row(row):
    # Convert numeric strings to float features.
    features = [float(row["f1"]), float(row["f2"]), float(row["f3"])]
    return tf.constant(features, dtype=tf.float32)


# Define a postprocessing function for model outputs.
def postprocess_scores(scores):
    # Convert probabilities to labels with threshold.
    labels = [1 if float(s) >= 0.5 else 0 for s in scores]
    return labels


# Define a batch inference function for many rows.
def run_batch_inference(rows, batch_size=2):
    # Collect preprocessed feature vectors.
    features = [preprocess_row(r) for r in rows]
    feature_tensor = tf.stack(features, axis=0)

    # Validate feature tensor shape before prediction.
    if feature_tensor.shape[1] != 3:
        raise ValueError("Expected three features per row")

    # Run predictions in small batches to bound memory use.
    predictions = []
    for start in range(0, len(rows), batch_size):
        end = start + batch_size
        batch = feature_tensor[start:end]
        # Call the model directly; predict() adds overhead on tiny batches.
        batch_scores = model(batch, training=False).numpy()
        predictions.extend(batch_scores[:, 0].tolist())

    # Postprocess raw scores into integer labels.
    labels = postprocess_scores(predictions)
    return predictions, labels


# Create a small CSV file representing offline data.
input_path = "batch_inputs.csv"
fieldnames = ["id", "f1", "f2", "f3"]

# Write a few synthetic rows to the CSV file.
with open(input_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({"id": "u1", "f1": 0.1, "f2": 0.2, "f3": 0.3})
    writer.writerow({"id": "u2", "f1": 0.9, "f2": 0.8, "f3": 0.7})
    writer.writerow({"id": "u3", "f1": 0.4, "f2": 0.3, "f3": 0.2})
    writer.writerow({"id": "u4", "f1": 0.7, "f2": 0.6, "f3": 0.9})


# Read rows from the CSV file into memory.
rows = []
with open(input_path, newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        rows.append(row)


# Run batch inference on all loaded rows.
raw_scores, labels = run_batch_inference(rows, batch_size=2)

# Pair each input id with its prediction results.
results = []
for row, score, label in zip(rows, raw_scores, labels):
    results.append({
        "id": row["id"],
        "score": round(float(score), 4),
        "label": int(label),
    })


# Print a short summary of batch inference results.
print("Total rows processed:", len(results))
for r in results:
    print("ID:", r["id"], "Score:", r["score"], "Label:", r["label"])
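
The script above stops at the first malformed row. A minimal sketch of handling partial failures follows: bad rows are collected for later inspection while valid rows are still scored. The `score` function is a toy stand-in for the model, and `run_tolerant_batch` is a hypothetical name.

```python
def parse_row(row):
    """Convert one CSV-style row dict into three floats; raises on bad data."""
    return [float(row["f1"]), float(row["f2"]), float(row["f3"])]


def score(features):
    """Toy stand-in for the model: mean of the features."""
    return round(sum(features) / len(features), 3)


def run_tolerant_batch(rows):
    """Score every valid row and collect failures instead of aborting."""
    results, failures = [], []
    for index, row in enumerate(rows):
        try:
            results.append({"id": row.get("id"), "score": score(parse_row(row))})
        except (KeyError, ValueError, TypeError) as exc:
            # Keep enough context to trace the bad record later.
            failures.append({"index": index, "id": row.get("id"), "error": repr(exc)})
    return results, failures


rows = [
    {"id": "u1", "f1": "0.1", "f2": "0.2", "f3": "0.3"},
    {"id": "u2", "f1": "oops", "f2": "0.8", "f3": "0.7"},  # non-numeric value
    {"id": "u3", "f1": "0.4", "f2": "0.3"},  # missing f3
]

results, failures = run_tolerant_batch(rows)
print(len(results), "scored;", len(failures), "failed")
for failure in failures:
    print(failure)
```

Writing the failures to their own file, next to the predictions, keeps the run auditable and lets the bad records be fixed and re-run separately.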




### **2.3. Robust Error Handling**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_02_03.jpg?v=1769779145" width="250">



>* APIs need strong validation to prevent failures
>* Return clear error messages instead of wrong predictions

>* Separate client and server errors, respond appropriately
>* Log server issues, isolate bad records in batches

>* Log failures and recurring issues for observability
>* Use clear errors, logging, and fallbacks for resilience



In [None]:
#@title Python Code - Robust Error Handling

# This script shows robust error handling concepts.
# We simulate a tiny model serving API example.
# Focus is on validating inputs and handling failures.

# Example uses only standard library modules.
# No external installation is required here.
# Uncomment the install below if you extend this script.
# !pip install fastapi uvicorn

# Import required standard library modules.
import json
import time
import random

# Set deterministic seed for reproducibility.
random.seed(42)

# Define a tiny dummy model inference function.
def dummy_model_predict(features):
    # Simulate simple numeric prediction logic.
    score = sum(features) / max(len(features), 1)
    return float(round(score, 3))

# Validate that payload is a dictionary.
def validate_is_dict(payload):
    # Raise error if payload is not a dictionary.
    if not isinstance(payload, dict):
        raise TypeError("Payload must be a dictionary.")

# Validate required keys and types.
def validate_required_fields(payload):
    # Check that 'features' key exists and is a list.
    if "features" not in payload:
        raise KeyError("Missing required 'features' field.")

    # Ensure features is a list of numbers.
    features = payload["features"]
    if not isinstance(features, list):
        raise TypeError("'features' must be a list.")

    # Limit list length to avoid huge payloads.
    if len(features) == 0 or len(features) > 1000:
        raise ValueError("'features' length must be between one and 1000.")

    # Ensure each element is numeric.
    for value in features:
        if not isinstance(value, (int, float)):
            raise TypeError("All feature values must be numeric.")

    return features

# Simulate latency measurement for model call.
def timed_model_call(features):
    # Record start time before prediction.
    start = time.perf_counter()
    prediction = dummy_model_predict(features)
    duration_ms = (time.perf_counter() - start) * 1000.0
    return prediction, duration_ms

# Build a robust inference handler function.
def handle_inference_request(raw_body):
    # Prepare base response structure.
    response = {"ok": False, "prediction": None, "error": None}

    try:
        # Try to parse JSON body safely.
        payload = json.loads(raw_body)

        # Validate payload structure and content.
        validate_is_dict(payload)
        features = validate_required_fields(payload)

        # Simulate occasional server side timeout.
        if len(features) > 10 and random.random() < 0.1:
            raise TimeoutError("Model backend timeout occurred.")

        # Run model prediction with timing.
        prediction, latency_ms = timed_model_call(features)

        # Build successful response body.
        response["ok"] = True
        response["prediction"] = prediction
        response["latency_ms"] = round(latency_ms, 3)

    except json.JSONDecodeError as exc:
        # Handle invalid JSON from client.
        response["error"] = {
            "type": "client_error",
            "message": "Invalid JSON payload.",
        }

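    except (TypeError, KeyError, ValueError) as exc:
        # Handle validation related client errors.
        # str(KeyError) wraps the message in quotes, so read args directly.
        message = exc.args[0] if exc.args else str(exc)
        response["error"] = {
            "type": "client_error",
            "message": str(message),
        }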

    except TimeoutError as exc:
        # Handle simulated server side timeout.
        response["error"] = {
            "type": "server_error",
            "message": "Temporary model timeout, retry later.",
        }

    except Exception as exc:
        # Catch unexpected server side failures.
        response["error"] = {
            "type": "server_error",
            "message": "Unexpected internal error occurred.",
        }

    # Return structured response dictionary.
    return response

# Prepare several example request bodies.
example_requests = [
    # Valid small request should succeed.
    json.dumps({"features": [1.0, 2.0, 3.0]}),
    # Invalid JSON string should trigger parse error.
    "{features: [1, 2, 3]}",
    # Missing features field should trigger key error.
    json.dumps({"values": [1, 2, 3]}),
    # Non numeric feature should trigger type error.
    json.dumps({"features": [1, "bad", 3]}),
]

# Process each example request and print results.
for idx, body in enumerate(example_requests, start=1):
    # Call handler and capture structured response.
    result = handle_inference_request(body)

    # Print concise summary for each example.
    print(f"Example {idx} -> ok={result['ok']}, error={result['error']}")
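
The handler above reports a timeout but does not retry or fall back. The sketch below shows one simple way to do both, assuming the failure is transient. The helper `call_with_retry`, the backoff delays, and the two fake backends are hypothetical.

```python
import time


def call_with_retry(fn, retries=2, delay_s=0.01, fallback=None):
    """Retry fn on TimeoutError; return the fallback if every attempt fails."""
    for attempt in range(retries + 1):
        try:
            return {"ok": True, "value": fn(), "attempts": attempt + 1}
        except TimeoutError:
            if attempt < retries:
                time.sleep(delay_s * (2 ** attempt))  # simple exponential backoff
    return {"ok": False, "value": fallback, "attempts": retries + 1}


calls = {"n": 0}


def flaky_model():
    """Hypothetical backend that times out on its first two calls."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("backend busy")
    return 0.42


def always_down():
    """Hypothetical backend that never recovers."""
    raise TimeoutError("backend down")


first = call_with_retry(flaky_model)
second = call_with_retry(always_down, retries=1, fallback=0.0)
print(first)
print(second)
```

Only the timeout is retried here. Client errors such as invalid input should not be retried, because the same request would fail again.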




## **3. Serving Performance Metrics**

### **3.1. Latency measurement**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_03_01.jpg?v=1769779224" width="250">



>* Latency is total request round-trip time
>* Covers all pipeline stages where delays appear

>* Define and log client and server latency
>* Analyze averages and percentiles to spot slow requests

>* Analyze latency patterns, variability, and external factors
>* Use insights to apply simple performance optimizations
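
To make the point about averages versus percentiles concrete, the sketch below summarizes a synthetic list of latencies with the standard `statistics` module. The latency values are made up for illustration, and `summarize_latencies` is a hypothetical helper.

```python
import statistics


def summarize_latencies(latencies_ms):
    """Return mean, median, and tail percentiles for a list of latencies."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


# Synthetic data: 95 fast requests and 5 slow ones.
latencies_ms = [10.0] * 95 + [200.0] * 5
summary = summarize_latencies(latencies_ms)
print({name: round(value, 2) for name, value in summary.items()})
```

In this synthetic data the median matches the fast requests, while the mean and the tail percentiles are pulled upward by the five slow ones. This is why serving dashboards usually report percentiles alongside the average.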



In [None]:
#@title Python Code - Latency measurement

# This script demonstrates simple latency measurement.
# We simulate a tiny model and measure request times.
# Focus on end to end timing for beginners.

# Required install for TensorFlow in some environments.
# !pip install tensorflow==2.20.0

# Import standard timing and math utilities.
import time
import statistics
import random

# Import TensorFlow for a tiny example model.
import tensorflow as tf

# Set deterministic random seeds for reproducibility.
random.seed(0)
tf.random.set_seed(0)

# Print TensorFlow version in one concise line.
print("TensorFlow version:", tf.__version__)

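# Hide GPUs so timing runs on the CPU for predictable results.
# TensorFlow raises RuntimeError if devices were already initialized
# (for example, by an earlier cell), so we fall back gracefully.
try:
    tf.config.set_visible_devices([], "GPU")
except RuntimeError:
    print("Devices already initialized; keeping current device setup.")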

# Define a tiny dense model for demonstration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile model with simple optimizer and loss.
model.compile(optimizer="adam", loss="mse")

# Create a tiny deterministic training dataset.
train_x = tf.constant([[0.1, 0.2, 0.3, 0.4]], dtype=tf.float32)
train_y = tf.constant([[0.0]], dtype=tf.float32)

# Train briefly with silent output to initialize weights.
model.fit(train_x, train_y, epochs=3, verbose=0)

# Define a simple preprocessing function for inputs.
def preprocess(raw_features):
    # Convert list to a batch tensor of shape (1, 4).
    tensor = tf.convert_to_tensor([raw_features], dtype=tf.float32)
    # Dividing by 1.0 is a placeholder for real feature scaling.
    return tensor / 1.0


# Define a simple postprocessing function for outputs.
def postprocess(raw_output):
    # Convert tensor to float and round value.
    value = float(raw_output.numpy()[0][0])
    return round(value, 4)


# Define an end to end inference function wrapper.
def run_inference(raw_features):
    # Validate input length for safety here.
    if len(raw_features) != 4:
        raise ValueError("Expected exactly four features.")
    # Apply preprocessing then model prediction.
    inputs = preprocess(raw_features)
    outputs = model(inputs, training=False)
    # Apply postprocessing to raw model outputs.
    return postprocess(outputs)


# Measure latency for a single synthetic request.
def measure_single_latency(features):
    # Record start time just before sending request.
    start = time.perf_counter()
    _ = run_inference(features)
    # Record end time right after receiving response.
    end = time.perf_counter()
    return (end - start) * 1000.0


# Measure latency for many requests and collect stats.
def measure_many_latencies(num_requests):
    # Store individual latencies in a simple list.
    latencies = []
    for _ in range(num_requests):
        # Use small random features for each request.
        features = [random.random() for _ in range(4)]
        latency_ms = measure_single_latency(features)
        latencies.append(latency_ms)
    # Validate we collected the expected number.
    if len(latencies) != num_requests:
        raise RuntimeError("Latency collection size mismatch.")
    return latencies


# Compute basic latency statistics including percentiles.
def summarize_latencies(latencies):
    # Sort a copy to compute percentile values.
    sorted_lat = sorted(latencies)
    count = len(sorted_lat)
    # Guard against empty input list here.
    if count == 0:
        raise ValueError("Latency list must not be empty.")
    # Helper to compute simple percentile index.
    def percentile(p):
        rank = int(round((p / 100.0) * (count - 1)))
        return sorted_lat[rank]
    # Build a dictionary of summary statistics.
    summary = {
        "min_ms": min(sorted_lat),
        "max_ms": max(sorted_lat),
        "mean_ms": statistics.mean(sorted_lat),
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
    }
    return summary


# Run latency experiment with a small number of requests.
num_requests = 30
latencies = measure_many_latencies(num_requests)

# Compute throughput in requests per second for sequential requests.
total_time_sec = sum(latencies) / 1000.0
throughput_rps = num_requests / total_time_sec if total_time_sec > 0 else 0.0

# Summarize latency distribution using helper function.
summary = summarize_latencies(latencies)

# Print concise latency statistics for interpretation.
print("Requests:", num_requests)
print("Min latency ms:", round(summary["min_ms"], 4))
print("P50 latency ms:", round(summary["p50_ms"], 4))
print("P95 latency ms:", round(summary["p95_ms"], 4))
print("P99 latency ms:", round(summary["p99_ms"], 4))
print("Max latency ms:", round(summary["max_ms"], 4))
print("Mean latency ms:", round(summary["mean_ms"], 4))
print("Throughput req_per_sec:", round(throughput_rps, 2))



### **3.2. Throughput estimation basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_03_02.jpg?v=1769779304" width="250">



>* Throughput counts how many requests are handled
>* Must consider full pipeline under concurrent load

>* Throughput rises with concurrency by using hardware parallelism
>* Beyond saturation, queues grow and latency explodes

>* Run realistic load tests, measure requests per second
>* Tune batch size, workers, resources for balanced throughput
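
The saturation effect described above can be sketched with a small load test. This is a simulation under stated assumptions, not a real server: `SERVICE_TIME_S`, `CAPACITY`, `fake_handler`, and `load_test` are hypothetical names, a semaphore stands in for a fixed number of parallel hardware slots, and `time.sleep` stands in for inference. Exact numbers will vary by machine.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed values for illustration only.
SERVICE_TIME_S = 0.01   # hypothetical time to serve one request
CAPACITY = 4            # hypothetical number of parallel hardware slots
_slots = threading.Semaphore(CAPACITY)


def fake_handler() -> float:
    """Serve one simulated request and return its latency in seconds."""
    start = time.perf_counter()
    with _slots:                    # wait here if all slots are busy
        time.sleep(SERVICE_TIME_S)  # simulated inference work
    return time.perf_counter() - start


def load_test(concurrency: int, num_requests: int = 64) -> dict:
    """Send num_requests using `concurrency` client threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: fake_handler(), range(num_requests)))
    elapsed = time.perf_counter() - start
    return {
        "concurrency": concurrency,
        "throughput_rps": num_requests / elapsed,
        "mean_latency_ms": 1000.0 * sum(latencies) / len(latencies),
    }


# Below, at, and above the assumed capacity of four slots.
load_results = [load_test(c) for c in (1, 4, 16)]
for r in load_results:
    print(
        f"concurrency={r['concurrency']:>2} "
        f"throughput={r['throughput_rps']:.0f} rps "
        f"mean_latency={r['mean_latency_ms']:.1f} ms"
    )
```

In this sketch, throughput should rise from concurrency 1 to 4 and then flatten, while mean latency at concurrency 16 grows because extra requests wait for a free slot.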



In [None]:
#@title Python Code - Throughput estimation basics

# This script explores basic throughput estimation concepts.
# It simulates a tiny inference server handling fake requests.
# We measure latency and throughput under different batch sizes.

# Required install for TensorFlow if missing in environment.
# !pip install tensorflow==2.20.0 --quiet

# Import standard libraries for timing and randomness.
import time
import random

# Import TensorFlow to simulate a lightweight model.
import tensorflow as tf

# Set deterministic seeds for reproducible behavior.
random.seed(0)

# Print TensorFlow version in a single concise line.
print("TensorFlow version:", tf.__version__)

# Choose CPU or GPU device based on availability.
physical_gpus = tf.config.list_physical_devices("GPU")

# Select device string for placing the fake model.
device_name = "/GPU:0" if physical_gpus else "/CPU:0"

# Create a tiny dense model to mimic inference cost.
with tf.device(device_name):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])

# Build model by running one dummy forward pass.
dummy_input = tf.zeros((1, 16), dtype=tf.float32)

# Run once to ensure any lazy initialization completes.
_ = model(dummy_input)

# Define a simple inference function with preprocessing.
def run_inference(batch_inputs: tf.Tensor) -> tf.Tensor:
    # Validate input rank and feature dimension.
    if batch_inputs.ndim != 2 or batch_inputs.shape[1] != 16:
        raise ValueError("Input must be [batch,16] tensor.")

    # Clip inputs to the range [0, 1] to mimic a preprocessing step.
    normalized = tf.clip_by_value(batch_inputs, 0.0, 1.0)

    # Run model forward pass to obtain predictions.
    outputs = model(normalized, training=False)

    # Return probabilities as final postprocessed outputs.
    return outputs

# Helper function to measure latency and throughput.
def benchmark_batch(batch_size: int, num_batches: int) -> dict:
    # Create random inputs for the entire benchmark.
    total_samples = batch_size * num_batches

    # Build tensor of random floats in a small range.
    inputs = tf.random.uniform((total_samples, 16), 0.0, 1.0)

    # Warm up once to avoid cold start effects.
    _ = run_inference(inputs[:batch_size])

    # Start timing the full benchmark loop.
    start = time.perf_counter()

    # Process inputs in fixed size batches.
    for i in range(num_batches):
        batch = inputs[i * batch_size : (i + 1) * batch_size]
        _ = run_inference(batch)

    # Stop timing and compute elapsed seconds.
    elapsed = time.perf_counter() - start

    # Avoid division by zero in degenerate cases.
    if elapsed <= 0.0:
        elapsed = 1e-6

    # Compute average latency per batch in milliseconds.
    batch_latency_ms = (elapsed / num_batches) * 1000.0

    # Compute average latency per sample in milliseconds.
    sample_latency_ms = (elapsed / total_samples) * 1000.0

    # Compute throughput as samples processed per second.
    throughput_sps = total_samples / elapsed

    # Return metrics in a small dictionary structure.
    return {
        "batch_size": batch_size,
        "num_batches": num_batches,
        "batch_latency_ms": batch_latency_ms,
        "sample_latency_ms": sample_latency_ms,
        "throughput_sps": throughput_sps,
    }

# Define different batch sizes to compare throughput behavior.
batch_sizes = [1, 4, 16, 64]

# Use a small number of batches to keep runtime short.
num_batches = 40

# Run benchmarks for each batch size and collect metrics.
results = []
for b in batch_sizes:
    metrics = benchmark_batch(b, num_batches)
    results.append(metrics)

# Print a compact header explaining the upcoming metrics.
print("Batch,AvgBatchMs,AvgSampleMs,ThroughputSamplesPerSec")

# Loop through results and print rounded metric values.
for m in results:
    line = (
        f"{m['batch_size']},"
        f"{m['batch_latency_ms']:.2f},"
        f"{m['sample_latency_ms']:.4f},"
        f"{m['throughput_sps']:.1f}"
    )
    print(line)




### **3.3. Caching for Faster Serving**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_09/Lecture_B/image_03_03.jpg?v=1769779380" width="250">



>* Cache saves previous results to answer instantly
>* Cold and warm cache responses affect latency

>* Caching happens at application, infrastructure, server levels
>* Stacked caches affect latency; measure best and worst

>* Balance cache speed with memory, freshness, fairness
>* Profile hit rates, latency, and tune policies
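
The trade-off between cache memory and hit rate can be sketched with Python's built-in `functools.lru_cache`, which bounds the cache with `maxsize` and reports hits and misses through `cache_info()`. This is a minimal illustration: `make_cached_model`, `hit_rate_for`, the tiny workload, and the fake model formula are hypothetical choices for the example.

```python
import math
from functools import lru_cache


def make_cached_model(maxsize: int):
    """Build a fake model wrapped in a bounded LRU cache."""

    @lru_cache(maxsize=maxsize)
    def cached_model(x_value: float) -> float:
        # Stand-in for an expensive model call.
        return math.tanh(x_value) * math.sin(x_value)

    return cached_model


def hit_rate_for(maxsize: int, workload: list) -> tuple:
    """Replay the workload and return (hit rate, entries kept)."""
    model_fn = make_cached_model(maxsize)
    for x_value in workload:
        model_fn(x_value)
    info = model_fn.cache_info()
    return info.hits / (info.hits + info.misses), info.currsize


# Three distinct inputs, requested in an interleaved order.
workload = [0.5, 1.0, 0.5, 2.0, 1.0, 0.5]

for size in (2, 4):
    rate, kept = hit_rate_for(size, workload)
    print(f"maxsize={size} hit_rate={rate:.2f} entries={kept}")
```

With this workload, a cache of two entries keeps evicting values just before they are requested again, so only one of six requests hits. A cache of four entries holds all three distinct inputs, and half of the requests hit. Profiling hit rate against cache size in this way is one simple input to tuning the policy.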



In [None]:
#@title Python Code - Caching for Faster Serving

# This script demonstrates simple caching concepts.
# We simulate model serving with and without caching.
# Focus on latency and throughput style measurements.

# No external installs are required.
# This script uses only the Python standard library.

# Import standard timing and random modules.
import time
import random
import math

# Set deterministic random seed for reproducibility.
random.seed(42)

# Define a fake expensive model inference function.
def fake_model_inference(x_value: float) -> float:
    # Simulate heavy computation using sleep.
    time.sleep(0.01)
    # Return a deterministic nonlinear transformation.
    return math.tanh(x_value) * math.sin(x_value)

# Define a simple cache dictionary for results.
cache_store: dict[float, float] = {}

# Define an inference function that uses caching.
def cached_inference(x_value: float) -> float:
    # Check if value already exists in cache.
    if x_value in cache_store:
        return cache_store[x_value]
    # Compute result when cache miss occurs.
    result_value = fake_model_inference(x_value)
    # Store result in cache for future reuse.
    cache_store[x_value] = result_value
    return result_value

# Generate a small workload with repeated values.
workload_values: list[float] = []

# Fill workload with many repeated and few unique values.
for index_value in range(50):
    # Use few popular values and some random ones.
    if index_value % 5 == 0:
        workload_values.append(0.5)
    elif index_value % 5 == 1:
        workload_values.append(1.0)
    else:
        workload_values.append(round(random.uniform(0.0, 2.0), 2))

# Validate workload size before running experiments.
if len(workload_values) != 50:
    raise ValueError("Workload size must equal fifty entries")

# Measure latency without any caching enabled.
start_no_cache = time.perf_counter()

# Run fake model directly for each workload value.
results_no_cache: list[float] = []
for value_item in workload_values:
    results_no_cache.append(fake_model_inference(value_item))

# Compute total time for no cache scenario.
end_no_cache = time.perf_counter()
no_cache_seconds = end_no_cache - start_no_cache

# Measure latency with caching enabled for same workload.
start_with_cache = time.perf_counter()

# Run cached inference for each workload value.
results_with_cache: list[float] = []
for value_item in workload_values:
    results_with_cache.append(cached_inference(value_item))

# Compute total time for cached scenario.
end_with_cache = time.perf_counter()
with_cache_seconds = end_with_cache - start_with_cache

# Verify both result lists have the same length before comparing.
if len(results_no_cache) != len(results_with_cache):
    raise ValueError("Result lengths must match for comparison")

# Compute maximum absolute difference between results.
max_difference = max(
    abs(a_value - b_value)
    for a_value, b_value in zip(results_no_cache, results_with_cache)
)

# Estimate simple throughput as requests per second.
throughput_no_cache = len(workload_values) / no_cache_seconds
throughput_with_cache = len(workload_values) / with_cache_seconds

# Count how many cache entries were actually stored.
cache_entries = len(cache_store)

# Print concise summary of performance measurements.
print("Requests processed:", len(workload_values))
print("Unique cached entries:", cache_entries)
print("Time without cache seconds:", round(no_cache_seconds, 4))
print("Time with cache seconds:", round(with_cache_seconds, 4))
print("Throughput without cache rps:", round(throughput_no_cache, 1))
print("Throughput with cache rps:", round(throughput_with_cache, 1))
print("Maximum result difference:", max_difference)
print("Cache speedup factor:", round(no_cache_seconds / with_cache_seconds, 2))




# <font color="#418FDE" size="6.5" uppercase>**Serving Models**</font>


In this lecture, you learned to:
- Wrap a PyTorch model in a simple inference function that handles preprocessing and postprocessing. 
- Integrate the inference function into a lightweight REST API or batch inference script. 
- Evaluate latency and throughput of the serving setup and identify simple optimizations. 

In the next Module (Module 10), we will go over 'Capstone and Best Practices'.