# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>

>Last update: 20260128.
    
By the end of this lecture, you will be able to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 


## **1. TensorFlow Serving Basics**

### **1.1. Dockerized TF Serving**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_01.jpg?v=1769620985" width="250">



>* Run TensorFlow Serving inside a prebuilt container
>* Get portable, reproducible deployments across all environments

>* Mount SavedModel, container exposes prediction endpoint
>* Separates ML work from ops, enables safe updates

>* Orchestration tools scale and manage serving containers
>* Containerization plus orchestration ensures robust, maintainable serving
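
The mount-and-expose workflow above can be sketched as a small helper that assembles a `docker run` command. This is a minimal sketch, assuming the public `tensorflow/serving` image, its default REST port 8501, and its `MODEL_NAME` environment variable. The host path and the model name `demo_model` are hypothetical placeholders, and the command is only printed, never executed.

```python
import shlex


def build_serving_command(model_dir, model_name, rest_port=8501):
    """Assemble a `docker run` argv list for the tensorflow/serving image.

    model_dir is expected to hold numbered version subdirectories (1/, 2/, ...).
    """
    mount = f"type=bind,source={model_dir},target=/models/{model_name}"
    return [
        "docker", "run", "--rm",
        "-p", f"{rest_port}:8501",         # host port -> container REST port
        "--mount", mount,                   # make the SavedModel visible inside
        "-e", f"MODEL_NAME={model_name}",   # tells the server which model to load
        "tensorflow/serving",
    ]


# Hypothetical path and model name, for illustration only.
command = build_serving_command("/tmp/models/demo_model", "demo_model")
print(shlex.join(command))
```

Keeping the command as a list rather than one string avoids shell-quoting surprises if it is later passed to `subprocess.run`.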



### **1.2. Managing Model Versions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_02.jpg?v=1769620999" width="250">



>* Models stored as numbered versioned directories
>* Server switches versions while keeping client interface stable

>* Configure which model versions are loaded and which is the default
>* Use policies for canaries, A/B tests, rollbacks

>* Plan storage, memory, and version retention carefully
>* Monitor, automate promotion, and retire unused versions
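
The numbered-directory layout and the retention idea above can be illustrated with a short script. It is a sketch under stated assumptions: the helpers `list_versions` and `versions_to_retire` are hypothetical, the directory tree is created in a temporary folder, and the "keep the newest N" rule is just one possible retention policy. TensorFlow Serving serves the highest-numbered version by default, which is what `latest` mirrors here.

```python
import tempfile
from pathlib import Path


def list_versions(model_base_path):
    """Return the numeric version directories under a model base path, sorted."""
    versions = []
    for child in Path(model_base_path).iterdir():
        # Only purely numeric directory names count as versions.
        if child.is_dir() and child.name.isdigit():
            versions.append(int(child.name))
    return sorted(versions)


def versions_to_retire(versions, keep_latest=2):
    """Pick old versions to delete, keeping the newest `keep_latest`."""
    if keep_latest < 1:
        raise ValueError("keep_latest must be at least 1")
    return versions[:-keep_latest]


# Throwaway layout: demo_model/1, demo_model/2, demo_model/10, plus a stray folder.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "demo_model"
    for name in ("1", "2", "10", "notes"):
        (base / name).mkdir(parents=True)
    found = list_versions(base)
    latest = found[-1]
    retire = versions_to_retire(found, keep_latest=2)

print("versions:", found)
print("latest:", latest)
print("retire:", retire)
```

Sorting numerically rather than as strings matters: version `10` must rank above version `2`.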



### **1.3. Serving API Protocols**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_03.jpg?v=1769621021" width="250">



>* Serving protocols define request, response, error formats
>* They connect clients to models via REST or gRPC

>* Protocols support many input types and shapes
>* Strict schemas ensure correct mapping and interoperability

>* Protocols define batching, metadata, and shared tooling
>* Design stable APIs that evolve without breaking clients



In [None]:
#@title Python Code - Serving API Protocols

# This script demonstrates serving API protocols basics.
# It simulates REST and gRPC style prediction requests.
# Focus is on request shapes not real networking.

# !pip install tensorflow-serving-api

# Import required standard libraries.
import json
import random
import numpy as np

# Set deterministic random seeds.
random.seed(7)
np.random.seed(7)

# Define a tiny dummy prediction function.
def dummy_model_predict(features_batch):
    features_array = np.asarray(features_batch, dtype=float)
    if features_array.ndim != 2:
        raise ValueError("features_batch must be 2D array")
    weights = np.array([[0.3], [0.7]], dtype=float)
    scores = features_array @ weights
    probs = 1.0 / (1.0 + np.exp(-scores))
    return probs.squeeze(axis=-1)

# Build a fake REST JSON request body.
def build_rest_request(user_ids, item_ids, scores):
    if not (len(user_ids) == len(item_ids) == len(scores)):
        raise ValueError("All feature lists must share length")
    instances = []
    for u, i, s in zip(user_ids, item_ids, scores):
        instance = {"user_id": int(u), "item_id": int(i), "score": float(s)}
        instances.append(instance)
    request_body = {"instances": instances}
    return json.dumps(request_body)

# Parse REST JSON and map to model inputs.
def parse_rest_request(json_body):
    parsed = json.loads(json_body)
    if "instances" not in parsed:
        raise KeyError("Request must contain 'instances' field")
    instances = parsed["instances"]
    features_batch = []
    for inst in instances:
        for key in ("user_id", "item_id", "score"):
            if key not in inst:
                raise KeyError(f"Missing required field: {key}")
        features = [inst["user_id"], inst["item_id"]]
        features_batch.append(features)
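    # Build the batch array explicitly so an empty request yields shape (0, 2).
    features_array = np.array(features_batch, dtype=float).reshape(-1, 2)
    # Return the last instance (empty if there were none) and the full batch.
    return features_array[-1:].ravel(), features_array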

# Build a fake gRPC style request dictionary.
def build_grpc_request(user_ids, item_ids, scores):
    if not (len(user_ids) == len(item_ids) == len(scores)):
        raise ValueError("All feature lists must share length")
    tensor = {
        "user_id": {"dtype": "INT32", "values": list(map(int, user_ids))},
        "item_id": {"dtype": "INT32", "values": list(map(int, item_ids))},
        "score": {"dtype": "FLOAT", "values": list(map(float, scores))},
    }
    request = {"model_name": "demo_recommender", "inputs": tensor}
    return request

# Parse gRPC style request into model inputs.
def parse_grpc_request(request):
    if "inputs" not in request:
        raise KeyError("Request must contain 'inputs' field")
    inputs = request["inputs"]
    for key in ("user_id", "item_id", "score"):
        if key not in inputs:
            raise KeyError(f"Missing required tensor: {key}")
    user_ids = np.array(inputs["user_id"]["values"], dtype=float)
    item_ids = np.array(inputs["item_id"]["values"], dtype=float)
    if user_ids.shape != item_ids.shape:
        raise ValueError("user_id and item_id shapes must match")
    features_batch = np.stack([user_ids, item_ids], axis=1)
    return features_batch

# Create a tiny batch of example features.
user_ids = [1, 2, 3]
item_ids = [10, 20, 30]
scores = [0.1, 0.5, 0.9]

# Build and parse REST style request.
rest_body = build_rest_request(user_ids, item_ids, scores)
last_instance, rest_features = parse_rest_request(rest_body)

# Build and parse gRPC style request.
grpc_request = build_grpc_request(user_ids, item_ids, scores)
grpc_features = parse_grpc_request(grpc_request)

# Run dummy predictions for both protocols.
rest_predictions = dummy_model_predict(rest_features)
grpc_predictions = dummy_model_predict(grpc_features)

# Print concise comparison of protocol behaviors.
print("REST request JSON:")
print(rest_body)
print("\nREST predictions:", rest_predictions.tolist())
print("\ngRPC style request keys:", list(grpc_request.keys()))
print("gRPC predictions:", grpc_predictions.tolist())
print("\nBoth protocols map to same model inputs.")
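
The simulation above never touches the network. As a complement, the sketch below shows how a request for TensorFlow Serving's REST predict endpoint is typically shaped: a URL of the form `/v1/models/<name>[/versions/<n>]:predict` and a JSON body in either row format (`instances`) or columnar format (`inputs`). The helper names, the host, and the model name `demo_recommender` are assumptions for illustration, and nothing is actually sent.

```python
import json


def predict_url(host, model_name, version=None, port=8501):
    """Build a TF Serving style REST predict URL."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{int(version)}"
    return f"http://{host}:{port}{path}:predict"


def row_payload(rows, signature_name="serving_default"):
    """Encode a batch in row format: one JSON object per instance."""
    return json.dumps({"signature_name": signature_name, "instances": rows})


def columnar_payload(rows):
    """Encode the same batch in columnar format: one list per named input."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    return json.dumps({"inputs": columns})


rows = [{"user_id": 1, "item_id": 10}, {"user_id": 2, "item_id": 20}]
url = predict_url("localhost", "demo_recommender", version=3)
print(url)
print(row_payload(rows))
print(columnar_payload(rows))
```

Row format is usually easier for clients to build one record at a time, while columnar format maps more directly onto named input tensors.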




## **2. Designing Prediction APIs**

### **2.1. JSON Request Design**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_01.jpg?v=1769621105" width="250">



>* Define a clear, stable contract for requests
>* Use descriptive names and unambiguous data types

>* Map client-friendly instances to model features
>* Service layer converts JSON objects into tensors

>* Include optional parameters, versions, and configuration cleanly
>* Separate options from features for stability, reliability



In [None]:
#@title Python Code - JSON Request Design

# This script shows JSON request design basics.
# We simulate a tiny prediction API service.
# Focus is on clear JSON request structures.

# Requires TensorFlow, which is assumed to be installed already.
# No extra installations are needed here.

# Import standard libraries and TensorFlow.
import json
import random
import tensorflow as tf

# Set deterministic seeds for reproducibility.
random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define a simple toy model for demonstration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model with basic configuration.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Create tiny dummy training data for the model.
train_x = tf.constant([[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]], dtype=tf.float32)
train_y = tf.constant([[0.0], [1.0]], dtype=tf.float32)

# Train briefly with silent output for speed.
model.fit(train_x, train_y, epochs=5, verbose=0)

# Define a function to build a JSON request body.
def build_request(instances, options=None):
    # Create base request with instances list.
    request = {"instances": instances}

    # Attach optional config section if provided.
    if options is not None:
        request["config"] = options

    # Return JSON string with sorted keys.
    return json.dumps(request, sort_keys=True)

# Define a function to parse and validate JSON.
def parse_request(json_body):
    # Parse JSON string into Python dictionary.
    data = json.loads(json_body)

    # Ensure instances key exists and is a list.
    if "instances" not in data or not isinstance(
        data["instances"], list
    ):
        raise ValueError("Request must contain list field 'instances'.")

    # Extract instances and optional config safely.
    instances = data["instances"]
    config = data.get("config", {})

    # Validate each instance has required fields.
    features = []
    for item in instances:
        if not isinstance(item, dict):
            raise ValueError("Each instance must be a JSON object.")
        if not {"f1", "f2", "f3"}.issubset(item.keys()):
            raise ValueError("Each instance needs f1, f2, f3 fields.")
        features.append([item["f1"], item["f2"], item["f3"]])

    # Convert features list into float32 tensor.
    features_tensor = tf.convert_to_tensor(features, dtype=tf.float32)

    # Return tensor and config dictionary.
    return features_tensor, config

# Define a function that simulates prediction endpoint.
def predict_endpoint(json_body):
    # Parse and validate incoming JSON request.
    features_tensor, config = parse_request(json_body)

    # Optionally limit batch size from config.
    max_batch = int(config.get("max_batch", 16))
    if features_tensor.shape[0] > max_batch:
        features_tensor = features_tensor[:max_batch]

    # Run model prediction on validated tensor.
    preds = model.predict(features_tensor, verbose=0)

    # Build response with predictions as simple list.
    response = {"predictions": preds[:, 0].tolist()}

    # Return JSON string response to the caller.
    return json.dumps(response, sort_keys=True)

# Build a client friendly JSON request example.
client_instances = [
    {"f1": 0.2, "f2": 0.1, "f3": 0.4, "user_id": "u1"},
    {"f1": 0.9, "f2": 0.7, "f3": 0.3, "user_id": "u2"},
]

# Define optional configuration for the request.
client_config = {"max_batch": 4, "explain": False}

# Create JSON body using the helper function.
json_request = build_request(client_instances, client_config)

# Send JSON body into simulated prediction endpoint.
json_response = predict_endpoint(json_request)

# Print request and response to show clear contract.
print("Request JSON:", json_request)
print("Response JSON:", json_response)



### **2.2. Service Layer Preprocessing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_02.jpg?v=1769621220" width="250">



>* Service layer cleans and transforms raw client requests
>* It validates, handles errors, and stabilizes model inputs

>* Service layer repeats training-time feature transformations consistently
>* Builds final feature vectors using shared routines

>* Service layer enforces limits, privacy, and defaults
>* Adds logging, validation, and keeps responsibilities separated



In [None]:
#@title Python Code - Service Layer Preprocessing

# This script shows service layer preprocessing.
# We simulate a simple JSON prediction request.
# Focus on cleaning inputs before model prediction.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import json
import random
import numpy as np

# Import TensorFlow for a tiny demo model.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
random.seed(7)
np.random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define a tiny numeric regression model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Compile model with simple optimizer and loss.
model.compile(optimizer="adam", loss="mse")

# Create tiny synthetic training data.
x_train = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
], dtype="float32")

# Create simple target values for training.
y_train = np.array([[0.0], [1.0], [1.0], [1.0], [3.0]], dtype="float32")

# Train briefly with silent verbose setting.
model.fit(x_train, y_train, epochs=50, verbose=0)

# Define expected feature configuration for service.
EXPECTED_FEATURES = {
    "age": {"min": 0, "max": 120},
    "income": {"min": 0, "max": 1_000_000},
    "is_premium": {"min": 0, "max": 1},
}

# Define simple normalization helper function.
def normalize_feature(value, min_value, max_value):
    # Clip value into allowed range.
    clipped = max(min_value, min(value, max_value))
    # Avoid division by zero safely.
    if max_value == min_value:
        return 0.0
    # Scale value into zero one range.
    return (clipped - min_value) / (max_value - min_value)

# Define service layer preprocessing function.
def preprocess_request(json_payload):
    # Parse JSON string into Python dictionary.
    try:
        data = json.loads(json_payload)
    except json.JSONDecodeError as exc:
        raise ValueError("Invalid JSON payload format.") from exc

    # Validate that payload is a dictionary.
    if not isinstance(data, dict):
        raise ValueError("Top level JSON must be object.")

    # Prepare list for normalized feature values.
    features = []

    # Iterate over expected features configuration.
    for name, cfg in EXPECTED_FEATURES.items():
        # Get raw value or default fallback.
        raw_value = data.get(name, None)
        if raw_value is None:
            # Use safe default when missing.
            raw_value = cfg["min"]
        # Ensure numeric type for model.
        try:
            numeric = float(raw_value)
        except (TypeError, ValueError):
            raise ValueError(f"Feature {name} must be numeric.")
        # Normalize using helper function.
        norm = normalize_feature(numeric, cfg["min"], cfg["max"])
        features.append(norm)

    # Convert list into proper tensor shape.
    features_array = np.array(features, dtype="float32").reshape(1, -1)

    # Validate final shape before prediction.
    if features_array.shape[1] != 3:
        raise ValueError("Preprocessed features must have length three.")

    # Return model ready tensor for prediction.
    return features_array

# Define service layer prediction wrapper.
def predict_from_service(json_payload):
    # Preprocess raw JSON into numeric tensor.
    inputs = preprocess_request(json_payload)
    # Run model prediction without extra logs.
    outputs = model.predict(inputs, verbose=0)
    # Convert prediction to plain Python float.
    score = float(outputs[0, 0])
    # Build clean response dictionary.
    response = {"prediction": round(score, 3)}
    return response

# Create example client JSON with mixed issues.
example_request = json.dumps({
    "age": 35,
    "income": 50_000,
    "is_premium": "1",
})

# Run service layer prediction on example.
response = predict_from_service(example_request)

# Print original JSON and cleaned prediction.
print("Raw request JSON:", example_request)
print("Service response:", response)




### **2.3. Interpreting Model Responses**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_03.jpg?v=1769621323" width="250">



>* Convert raw model tensors into readable fields
>* Structure responses consistently for easy client integration

>* Choose what predictions, alternatives, and probabilities to expose
>* Design responses to show uncertainty and support oversight

>* Clearly encode technical and model-level issues
>* Include context flags to guide safe decisions
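
The points above can be made concrete with a small response builder. This is a minimal sketch, not a prescribed schema: the field names (`status`, `prediction`, `confidence`, `alternatives`, `low_confidence`), the labels, and the 0.6 confidence threshold are all hypothetical choices made for illustration.

```python
import json
import math


def build_response(probabilities, labels, top_k=2, min_confidence=0.6):
    """Turn one row of class probabilities into a client-facing response dict."""
    # Technical problems are reported explicitly instead of raising to the client.
    if not probabilities or len(probabilities) != len(labels):
        return {"status": "error", "error": "label/probability length mismatch"}
    if any(math.isnan(p) for p in probabilities):
        return {"status": "error", "error": "model returned NaN"}

    # Rank classes from most to least likely.
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranked[0]
    return {
        "status": "ok",
        "prediction": best_label,
        "confidence": round(best_prob, 4),
        # Expose a few runner-up classes so clients can see how close the call was.
        "alternatives": [
            {"label": label, "probability": round(prob, 4)}
            for label, prob in ranked[1:top_k + 1]
        ],
        # Context flag that a caller can use to route the case to a human.
        "low_confidence": best_prob < min_confidence,
    }


response = build_response([0.15, 0.55, 0.30], ["cat", "dog", "bird"])
print(json.dumps(response, indent=2))
```

Returning the same top-level keys for every successful call keeps client parsing simple, and the separate `status` field lets clients distinguish a technical failure from an uncertain prediction.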



## **3. Serving Performance Basics**

### **3.1. Efficient Request Batching**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_01.jpg?v=1769621341" width="250">



>* Batching groups many requests into one pass
>* Makes better use of accelerators, boosting throughput without extra hardware

>* Batch size trades off throughput against user latency
>* Use max batch size and wait time limits

>* Tune batch size using real traffic patterns
>* Use adaptive batching and monitor latency, throughput



In [None]:
#@title Python Code - Efficient Request Batching

# This script demonstrates efficient request batching concepts.
# It simulates batched versus unbatched prediction latency simply.
# Use it to understand latency throughput tradeoffs clearly.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import time
import random
import statistics

# Import tensorflow and check version.
import tensorflow as tf

# Set deterministic random seeds for reproducibility.
random.seed(42)
tf.random.set_seed(42)

# Print tensorflow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define a simple dense model for demonstration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile model with minimal configuration.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Create small dummy input batch for warmup.
warmup_inputs = tf.random.uniform(shape=(8, 16))

# Run one warmup prediction to initialize model.
_ = model.predict(warmup_inputs, verbose=0)

# Define a helper to simulate single request prediction.
def predict_single(sample_tensor: tf.Tensor) -> tf.Tensor:
    # Ensure input shape is correct for single sample.
    assert sample_tensor.shape == (1, 16)
    return model.predict(sample_tensor, verbose=0)


# Define a helper to simulate batched prediction.
def predict_batch(batch_tensor: tf.Tensor) -> tf.Tensor:
    # Ensure batch has correct feature dimension.
    assert batch_tensor.shape[1] == 16
    return model.predict(batch_tensor, verbose=0)


# Define a function to measure latency for many calls.
def measure_latency(num_requests: int, batch_size: int) -> float:
    # Collect random inputs for all synthetic requests.
    inputs = [
        tf.random.uniform(shape=(1, 16)) for _ in range(num_requests)
    ]

    # Start timing for the whole scenario.
    start = time.perf_counter()

    # If batch size is one, call predict_single repeatedly.
    if batch_size == 1:
        for sample in inputs:
            _ = predict_single(sample)

    # Otherwise group requests into batches for prediction.
    else:
        current_batch = []
        for sample in inputs:
            current_batch.append(sample)
            if len(current_batch) == batch_size:
                batch_tensor = tf.concat(current_batch, axis=0)
                _ = predict_batch(batch_tensor)
                current_batch = []

        # Handle any remaining partial batch at the end.
        if current_batch:
            batch_tensor = tf.concat(current_batch, axis=0)
            _ = predict_batch(batch_tensor)

    # Compute average (amortized) time per request in milliseconds.
    total_time = time.perf_counter() - start
    avg_latency_ms = (total_time / num_requests) * 1000.0
    return avg_latency_ms


# Configure experiment parameters for synthetic traffic.
num_requests = 64
batch_sizes = [1, 4, 16, 32]

# Measure latency for each batch size configuration.
results = []
for bsize in batch_sizes:
    avg_ms = measure_latency(num_requests=num_requests, batch_size=bsize)
    # avg_ms is already per request, so requests per second is 1000 / avg_ms.
    throughput = 1000.0 / avg_ms
    results.append((bsize, avg_ms, throughput))

# Print concise summary of latency and throughput tradeoffs.
print("BatchSize  AvgLatency(ms)  RequestsPerSecond")
for bsize, avg_ms, rps in results:
    print(f"{bsize:8d}  {avg_ms:14.3f}  {rps:17.1f}")
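
The timing script above groups requests that are all available up front. A live server also has to decide how long to wait for a batch to fill, which is where the maximum batch size and wait-time limits come in. The sketch below is a simplified, deterministic model of that policy, assuming arrival times in milliseconds and ignoring model execution time; `form_batches`, `queue_delays`, and their parameter names are hypothetical.

```python
def form_batches(arrival_times_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches.

    A batch closes when it is full, or when the next request arrives after
    the first request in the batch has already waited max_wait_ms.
    """
    batches = []
    current = []
    for t in arrival_times_ms:
        # The oldest queued request hit its deadline before this arrival.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


def queue_delays(batches, max_batch_size, max_wait_ms):
    """Time each request waits before its batch is dispatched."""
    delays = []
    for batch in batches:
        full = len(batch) == max_batch_size
        # Full batches leave on the last arrival; partial ones at the deadline.
        dispatch = batch[-1] if full else batch[0] + max_wait_ms
        delays.extend(dispatch - t for t in batch)
    return delays


arrivals = [0, 1, 2, 3, 20, 21, 50]
batches = form_batches(arrivals, max_batch_size=4, max_wait_ms=5)
delays = queue_delays(batches, max_batch_size=4, max_wait_ms=5)
print("batches:", batches)
print("max queueing delay (ms):", max(delays))
```

In this model the wait limit caps the extra latency any request pays for batching, while the size limit caps how much work a single pass can hold.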




### **3.2. Latency and Throughput Metrics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_02.jpg?v=1769621440" width="250">



>* Latency measures one request’s end-to-end time
>* Throughput measures total requests handled under load

>* Check full latency distribution, especially tail percentiles
>* Measure latency under realistic, end-to-end production conditions

>* Test throughput as concurrent traffic steadily increases
>* Use results to set limits and scaling strategies



In [None]:
#@title Python Code - Latency and Throughput Metrics

# This script explores serving latency and throughput.
# It uses a tiny TensorFlow model and dummy requests.
# Focus is on simple timing not advanced deployment.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import time
import statistics

# Import TensorFlow and print version.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds.
# Note: PYTHONHASHSEED only affects hashing if set before the interpreter starts.
os.environ["PYTHONHASHSEED"] = "0"
tf.random.set_seed(0)

# Create a tiny Sequential model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Compile the model with simple settings.
model.compile(optimizer="adam", loss="mse")

# Create a small dummy dataset.
inputs = tf.random.uniform(shape=(32, 4))
labels = tf.reduce_sum(inputs, axis=1, keepdims=True)

# Train briefly with silent output.
model.fit(inputs, labels, epochs=3, verbose=0)

# Prepare a single dummy request tensor.
request_example = tf.random.uniform(shape=(1, 4))

# Warm up the model before timing.
_ = model(request_example, training=False)

# Define a helper to measure latency.
def measure_latency(model, sample, runs):
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        _ = model(sample, training=False)
        end = time.perf_counter()
        latencies.append((end - start) * 1000.0)
    return latencies

# Measure latency over multiple runs.
latency_ms = measure_latency(model, request_example, runs=50)

# Compute basic latency statistics.
median_latency = statistics.median(latency_ms)
mean_latency = statistics.mean(latency_ms)

# Compute simple percentile estimates.
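sorted_lat = sorted(latency_ms)
# Index of the sample at or above each percentile, clamped to the last element.
index_95 = min(int(0.95 * len(sorted_lat)), len(sorted_lat) - 1)
index_99 = min(int(0.99 * len(sorted_lat)), len(sorted_lat) - 1)
p95_latency = sorted_lat[index_95]
p99_latency = sorted_lat[index_99]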

# Estimate throughput from median latency.
throughput_rps = 1000.0 / median_latency if median_latency > 0 else 0.0

# Print latency metrics in milliseconds.
print("Median latency ms:", round(median_latency, 4))
print("Mean latency ms:", round(mean_latency, 4))
print("P95 latency ms:", round(p95_latency, 4))
print("P99 latency ms:", round(p99_latency, 4))

# Print simple throughput estimate.
print("Estimated single worker throughput rps:", round(throughput_rps, 2))

# Simulate batched requests for throughput.
batch_size = 16
batched_example = tf.random.uniform(shape=(batch_size, 4))

# Time one batched inference call.
start_batch = time.perf_counter()
_ = model(batched_example, training=False)
end_batch = time.perf_counter()

# Compute batched latency and throughput.
batch_latency_ms = (end_batch - start_batch) * 1000.0
per_request_ms = batch_latency_ms / float(batch_size)
throughput_batched = 1000.0 * float(batch_size) / batch_latency_ms

# Print batched performance metrics.
print("Batch latency ms:", round(batch_latency_ms, 4))
print("Per request latency ms:", round(per_request_ms, 4))
print("Estimated batched throughput rps:", round(throughput_batched, 2))
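
The measurements above use one caller at a time. To see how throughput responds as concurrent traffic increases, a stepped load test can be sketched with a thread pool. This is an illustration only: `fake_predict` is a hypothetical stand-in that sleeps for about 2 ms instead of calling a real server, and the recorded latency covers service time only, not time spent waiting for a free worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict(_):
    """Stand-in for a served model call; sleeps to mimic about 2 ms of work."""
    start = time.perf_counter()
    time.sleep(0.002)
    return (time.perf_counter() - start) * 1000.0


def run_load(concurrency, num_requests=40):
    """Send num_requests through `concurrency` workers; return (rps, latencies)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(fake_predict, range(num_requests)))
    elapsed = time.perf_counter() - start
    return num_requests / elapsed, latencies


# Step the concurrency level upward and record throughput at each step.
load_results = {}
for concurrency in (1, 2, 4, 8):
    rps, latencies = run_load(concurrency)
    load_results[concurrency] = (rps, max(latencies))
    print(f"concurrency={concurrency:2d}  rps={rps:8.1f}  max_ms={max(latencies):.2f}")
```

Against a real endpoint, throughput usually stops improving at some concurrency level while latency keeps rising; that knee is a natural place to set request limits.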



### **3.3. Scaling Model Replicas**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_03.jpg?v=1769621533" width="250">



>* Run many identical servers to share load
>* More replicas improve latency under load and overall throughput

>* More replicas increase throughput and reduce queuing
>* They don’t speed single requests; monitor per-replica latency

>* Continuously measure latency and throughput while scaling
>* Find bottlenecks and balance the entire pipeline
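
A rough replica count can be estimated from measured per-replica throughput. The sketch below assumes throughput scales linearly with replicas, which the bullets above warn may not hold once another part of the pipeline becomes the bottleneck. The function name, the 70% target utilization, and the traffic numbers are hypothetical.

```python
import math


def replicas_needed(target_rps, per_replica_rps, target_utilization=0.7):
    """Estimate replicas so each runs at or below target_utilization."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Headroom: plan for each replica to carry only a fraction of its peak.
    usable_rps = per_replica_rps * target_utilization
    return max(1, math.ceil(target_rps / usable_rps))


# Hypothetical measurement: one replica sustains about 120 requests per second.
for target in (50, 300, 1000):
    count = replicas_needed(target, per_replica_rps=120)
    print(f"target={target:5d} rps -> {count} replica(s)")
```

Leaving headroom below full utilization keeps queues short during traffic spikes; the estimate should be rechecked against measured latency after each scaling step.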



# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>


In this lecture, you learned to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 

In the next module (Module 10), we will go over 'Advanced Topics'.