# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>

>Last update: 20260121.
    
By the end of this lecture, you will be able to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 


## **1. TensorFlow Serving Essentials**

### **1.1. Dockerized TF Serving**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_01.jpg?v=1768987301" width="250">



>* Run TensorFlow Serving inside reusable Docker containers
>* Ensures consistent, conflict-free environments across all machines

>* Container mounts model directory and serves requests
>* Isolation controls resources and prevents team interference

>* Containers integrate with Kubernetes for scaling, reliability
>* Enable easy rolling updates and high-volume production serving
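
As a rough illustration of the points above, the sketch below assembles a `docker run` command for the public `tensorflow/serving` image. The model name `my_model` and the host path are hypothetical placeholders. The snippet only builds and prints the command; it does not start Docker.

```python
import shlex


def build_serving_command(model_dir, model_name, rest_port=8501, grpc_port=8500):
    """Return the argument list for running TensorFlow Serving in Docker."""
    return [
        "docker", "run", "--rm",
        # Map host ports to the container's REST (8501) and gRPC (8500) ports.
        "-p", f"{rest_port}:8501",
        "-p", f"{grpc_port}:8500",
        # Mount the host model directory where the image looks for models.
        "--mount", f"type=bind,source={model_dir},target=/models/{model_name}",
        # Tell the container which model under /models to serve.
        "-e", f"MODEL_NAME={model_name}",
        "tensorflow/serving",
    ]


# Hypothetical host path and model name, used only for illustration.
command = build_serving_command("/tmp/exported/my_model", "my_model")
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, which makes it straightforward to pass to `subprocess.run` later if you decide to launch the container from Python.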



### **1.2. Model Versioning Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_02.jpg?v=1768987348" width="250">



>* Production needs multiple clearly labeled model versions
>* TensorFlow Serving organizes SavedModel versions in subdirectories

>* Versioning enables safe rollouts and instant rollbacks
>* Supports A/B tests, canaries, and reliable baselines

>* Versioning links models to data, code, config
>* Enables traceability, compliance, and continuous production improvement
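
The directory layout described above can be sketched with the standard library alone. TensorFlow Serving expects each version in a numbered subdirectory under the model's base path and, by default, serves the highest number. The directory names and the helper `list_versions` below are hypothetical, and empty folders stand in for real SavedModel exports.

```python
import tempfile
from pathlib import Path


def list_versions(model_base_dir):
    """Return the numeric version subdirectories in ascending order."""
    base = Path(model_base_dir)
    return sorted(
        int(entry.name)
        for entry in base.iterdir()
        if entry.is_dir() and entry.name.isdigit()
    )


with tempfile.TemporaryDirectory() as tmp:
    base_dir = Path(tmp) / "my_model"
    # Empty folders stand in for exported SavedModel versions.
    for version in (1, 2, 10):
        (base_dir / str(version)).mkdir(parents=True)
    # Non-numeric folders are ignored by the helper.
    (base_dir / "notes").mkdir()

    versions = list_versions(base_dir)
    latest_version = versions[-1]
    rollback_version = versions[-2]

print("Available versions:", versions)
print("Latest version:", latest_version)
print("Rollback candidate:", rollback_version)
```

Sorting by integer value rather than by name matters here: as strings, `"10"` would sort before `"2"`.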



### **1.3. Configuring REST and gRPC**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_03.jpg?v=1768987371" width="250">



>* Choose REST or gRPC for client communication
>* Both expose same predictions; differ in usability, performance

>* Match ports, protocols, and routing to clients
>* Document model I/O clearly to prevent integration errors

>* Use gateways, security, health checks, and monitoring
>* Build scalable, reliable endpoints for many clients



In [None]:
#@title Python Code - Configuring REST and gRPC

# This script contrasts REST style and gRPC style communication using local stand-ins.
# It sends a JSON request to a small local HTTP server and packs a binary payload to mimic gRPC.
# It does not run TensorFlow Serving or the gRPC library; it only illustrates payload structure.

# No extra packages are required; only the Python standard library is used.

# Import required standard library modules for networking demonstration.
import http.server as http_server
import socketserver as socket_server
import threading as threading_module

# Import json module for encoding and decoding REST style payloads.
import json as json_module

# Import time module for simple latency measurement calculations.
import time as time_module

# Define a simple prediction function simulating TensorFlow Serving behavior.
def simple_predict_function(input_value):
    squared_value = input_value * input_value
    return squared_value

# Define a custom HTTP handler class for REST style JSON prediction requests.
class RestPredictionHandler(http_server.BaseHTTPRequestHandler):

    # Handle POST requests representing REST prediction calls.
    def do_POST(self):
        content_length = int(self.headers.get("Content-Length", 0))
        request_body = self.rfile.read(content_length)
        parsed_body = json_module.loads(request_body.decode("utf-8"))

        input_value = float(parsed_body.get("input_value", 0.0))
        prediction_value = simple_predict_function(input_value)
        response_body = {"prediction_value": prediction_value}

        response_bytes = json_module.dumps(response_body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response_bytes)))
        self.end_headers()
        self.wfile.write(response_bytes)

    # Suppress default logging for cleaner beginner friendly output.
    def log_message(self, format, *args):
        return

# Define a function that starts a REST server on a dedicated port.
def start_rest_server(server_port, server_ready_event):
    handler_class = RestPredictionHandler
    # Allow rebinding so a recently closed socket in TIME_WAIT does not block startup.
    socket_server.TCPServer.allow_reuse_address = True
    httpd_server = socket_server.TCPServer(("localhost", server_port), handler_class)
    server_ready_event.set()
    httpd_server.serve_forever()

# Define a function that simulates a REST client sending JSON payloads.
def rest_client_request(server_port, input_value):
    import urllib.request as urllib_request
    url_string = f"http://localhost:{server_port}/predict"
    request_body = {"input_value": float(input_value)}

    encoded_body = json_module.dumps(request_body).encode("utf-8")
    request_object = urllib_request.Request(url_string, data=encoded_body)
    request_object.add_header("Content-Type", "application/json")
    start_time = time_module.time()

    with urllib_request.urlopen(request_object) as response_object:
        response_data = response_object.read().decode("utf-8")
        elapsed_time = time_module.time() - start_time
        parsed_response = json_module.loads(response_data)

    return parsed_response["prediction_value"], elapsed_time

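# Define a function that simulates gRPC style binary payload handling.
# It packs a float into bytes locally; it does not use the gRPC library or the network.
def grpc_style_call_simulation(input_value):
    import struct as struct_module

    # Start timing before packing so the measurement includes payload encoding and decoding.
    # There is no network hop here, so this number is not directly comparable to the REST latency.
    start_time = time_module.perf_counter()
    packed_bytes = struct_module.pack("!f", float(input_value))
    unpacked_value = struct_module.unpack("!f", packed_bytes)[0]
    prediction_value = simple_predict_function(unpacked_value)
    elapsed_time = time_module.perf_counter() - start_time
    return prediction_value, elapsed_time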

# Define the main demonstration function orchestrating both communication styles.
def main_demonstration_function():
    rest_port = 8501
    server_ready_event = threading_module.Event()
    server_thread = threading_module.Thread(target=start_rest_server, args=(rest_port, server_ready_event))

    server_thread.daemon = True
    server_thread.start()
    server_ready_event.wait(timeout=5.0)

    test_input_value = 3.0
    rest_prediction, rest_latency = rest_client_request(rest_port, test_input_value)
    grpc_prediction, grpc_latency = grpc_style_call_simulation(test_input_value)

    print("REST endpoint configured on port", rest_port)
    print("REST request input value", test_input_value)
    print("REST response prediction value", rest_prediction)
    print("REST measured latency seconds", round(rest_latency, 6))

    print("gRPC style simulated call used binary payloads")
    print("gRPC style input value", test_input_value)
    print("gRPC style prediction value", grpc_prediction)
    print("gRPC style latency seconds", round(grpc_latency, 6))

    print("Both interfaces share prediction logic and configuration concepts")

# Execute the main demonstration function when script is run directly.
main_demonstration_function()
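
The simulation above uses a made-up `/predict` path and payload. For comparison, the sketch below builds the URL and JSON body in the shape that TensorFlow Serving's REST API expects for a predict call. The host, the model name `my_model`, and the helper names are hypothetical; nothing is sent over the network.

```python
import json


def build_predict_url(host, model_name, port=8501, version=None):
    """Build a TensorFlow Serving REST predict URL, optionally pinned to a version."""
    version_part = f"/versions/{version}" if version is not None else ""
    return f"http://{host}:{port}/v1/models/{model_name}{version_part}:predict"


def build_predict_body(instances):
    """Wrap input rows in the row-oriented 'instances' request format."""
    return json.dumps({"instances": instances})


# Hypothetical host, model name, and inputs for illustration only.
url = build_predict_url("localhost", "my_model", version=2)
body = build_predict_body([[1.0, 2.0], [3.0, 4.0]])

print(url)
print(body)
```

A successful response from the server carries the outputs under a `predictions` key, one entry per input row.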



## **2. Designing Prediction APIs**

### **2.1. JSON Request Design**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_01.jpg?v=1768987430" width="250">



>* Define clear, stable JSON fields and structure
>* Separate metadata from features to reduce integration errors

>* Clearly represent single or batched JSON inputs
>* Consistently encode types, missing data, and nesting

>* Design JSON for future changes and versions
>* Use optional fields, versioning, and clear documentation
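
One possible way to apply these guidelines is sketched below: a request keeps metadata apart from features, always carries a list of instances (so single and batched inputs look the same), and states an explicit API version. All field names and the `validate_request` helper are hypothetical choices for this example, not a standard.

```python
# Hypothetical request layout: version, metadata, and a batch of feature records.
example_request = {
    "api_version": "v1",
    "metadata": {"request_id": "req-001", "client": "demo"},
    "instances": [
        {"age_years": 30, "vehicle_type": "truck"},
        {"age_years": 45, "vehicle_type": None},  # null marks a missing value
    ],
}

# Allowed Python types per feature after JSON decoding.
REQUIRED_FEATURES = {
    "age_years": (int, float),
    "vehicle_type": (str, type(None)),
}


def validate_request(payload):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if payload.get("api_version") != "v1":
        problems.append("unsupported api_version")
    instances = payload.get("instances")
    if not isinstance(instances, list) or not instances:
        return problems + ["instances must be a non-empty list"]
    for i, instance in enumerate(instances):
        if not isinstance(instance, dict):
            problems.append(f"instances[{i}] must be an object")
            continue
        for name, allowed_types in REQUIRED_FEATURES.items():
            if name not in instance:
                problems.append(f"instances[{i}] missing {name}")
            elif not isinstance(instance[name], allowed_types):
                problems.append(f"instances[{i}].{name} has wrong type")
    return problems


print(validate_request(example_request))
print(validate_request({"api_version": "v2", "instances": [{"age_years": "30"}]}))
```

Returning every problem at once, rather than failing on the first, tends to make integration errors easier for client developers to fix.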



### **2.2. Service Layer Preprocessing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_02.jpg?v=1768987452" width="250">



>* Service layer cleans and validates incoming JSON
>* Centralized preprocessing shields model and standardizes inputs

>* Service layer converts client JSON into model features
>* Mirrors training preprocessing to prevent distribution shifts

>* Service layer enforces rules, privacy, and enrichment
>* Creates canonical, auditable interface while isolating model



In [None]:
#@title Python Code - Service Layer Preprocessing

# This script demonstrates simple service layer preprocessing for JSON prediction requests.
# It shows validation, default handling, and conversion into model ready tensors.
# It mimics a small API layer sitting between clients and a TensorFlow model.

# !pip install tensorflow

# Import required standard libraries and TensorFlow framework for tensor handling.
import json, os, random, numpy as np, tensorflow as tf

# Set deterministic random seeds for reproducible behavior across different runtime sessions.
random.seed(42); np.random.seed(42); tf.random.set_seed(42)

# Print TensorFlow version information for clarity about the used deep learning framework.
print("TensorFlow version:", tf.__version__)

# Define a simple fake model function that expects numeric tensors as model inputs.
def fake_model_predict(feature_tensor: tf.Tensor) -> tf.Tensor:
    weights = tf.constant([[0.5], [1.0], [-0.25], [0.1]], dtype=tf.float32)
    bias = tf.constant([0.2], dtype=tf.float32)
    logits = tf.matmul(feature_tensor, weights) + bias
    return tf.sigmoid(logits)

# Define a preprocessing function that validates and transforms incoming JSON like payloads.
def preprocess_request(json_payload: dict) -> tf.Tensor:
    required_fields = ["age_years", "miles_driven_weekly", "vehicle_type", "temperature_fahrenheit"]
    for field in required_fields:
        if field not in json_payload:
            raise ValueError(f"Missing required field: {field}")

    age = float(json_payload.get("age_years", 0.0))
    miles = float(json_payload.get("miles_driven_weekly", 0.0))
    temp_f = float(json_payload.get("temperature_fahrenheit", 70.0))

    vehicle_raw = str(json_payload.get("vehicle_type", "car")).lower()
    vehicle_map = {"car": 0, "truck": 1, "motorcycle": 2}
    vehicle_index = vehicle_map.get(vehicle_raw, 0)

    if age < 16 or age > 100:
        raise ValueError("Age outside allowed driving range bounds")

    miles_clipped = max(0.0, min(miles, 3000.0))
    temp_celsius = (temp_f - 32.0) * (5.0 / 9.0)

    age_scaled = age / 100.0
    miles_scaled = miles_clipped / 3000.0
    temp_scaled = (temp_celsius + 40.0) / 80.0

    feature_list = [age_scaled, miles_scaled, float(vehicle_index), temp_scaled]
    feature_array = np.array([feature_list], dtype=np.float32)

    if feature_array.shape != (1, 4):
        raise ValueError(f"Unexpected feature shape: {feature_array.shape}")

    return tf.convert_to_tensor(feature_array, dtype=tf.float32)

# Define a helper function that simulates a prediction API endpoint handler behavior.
def handle_prediction_request(raw_json_string: str) -> float:
    parsed = json.loads(raw_json_string)
    features_tensor = preprocess_request(parsed)
    prediction_tensor = fake_model_predict(features_tensor)
    prediction_value = float(prediction_tensor.numpy()[0][0])
    return prediction_value

# Create two example JSON payloads representing realistic but simple driving related requests.
example_request_valid = json.dumps({"age_years": 30, "miles_driven_weekly": 250, "vehicle_type": "truck", "temperature_fahrenheit": 86})

# Create another payload with slightly different values to show preprocessing transformations.
example_request_second = json.dumps({"age_years": 45, "miles_driven_weekly": 1200, "vehicle_type": "car", "temperature_fahrenheit": 50})

# Preprocess the first request to inspect the features the model will receive.
features_one = preprocess_request(json.loads(example_request_valid))

# Preprocess the second request the same way.
features_two = preprocess_request(json.loads(example_request_second))

# Run the full handler (parse, preprocess, predict) on both requests.
prediction_one = handle_prediction_request(example_request_valid)
prediction_two = handle_prediction_request(example_request_second)

# Print the preprocessed feature tensors for both requests.
print("First request features:", features_one.numpy())
print("Second request features:", features_two.numpy())

# Print both prediction probabilities produced after preprocessing.
print("Prediction one probability:", round(prediction_one, 4))
print("Prediction two probability:", round(prediction_two, 4))
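
The preprocessing code above raises `ValueError` on bad input. A service layer usually turns such failures into structured error responses instead of letting them crash the request. The sketch below shows one way to do that; the status codes (400 and 422) and the helper names `safe_handle` and `toy_predict` are assumptions made for illustration.

```python
import json


def safe_handle(raw_json_string, predict_from_payload):
    """Map parsing and validation failures to (status_code, body) responses."""
    try:
        payload = json.loads(raw_json_string)
    except json.JSONDecodeError:
        return 400, {"error": "Request body is not valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "Request body must be a JSON object"}
    try:
        prediction = predict_from_payload(payload)
    except ValueError as exc:
        # Validation errors become a client-facing message, not a server crash.
        return 422, {"error": str(exc)}
    return 200, {"prediction": prediction}


def toy_predict(payload):
    """Hypothetical stand-in for preprocessing plus model inference."""
    if "age_years" not in payload:
        raise ValueError("Missing required field: age_years")
    return float(payload["age_years"]) / 100.0


print(safe_handle('{"age_years": 30}', toy_predict))
print(safe_handle('{"miles": 5}', toy_predict))
print(safe_handle("not json", toy_predict))
```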




### **2.3. Postprocessing Model Outputs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_03.jpg?v=1768987501" width="250">



>* Convert raw model outputs into human-friendly results
>* Service layer maps outputs to client-ready responses

>* Postprocessing applies business, policy, and privacy rules
>* Centralized rules simplify clients and improve API safety

>* Design postprocessing for speed, clarity, and scale
>* Make steps observable, testable, and pipeline-first



In [None]:
#@title Python Code - Postprocessing Model Outputs

# This script demonstrates simple model output postprocessing for a prediction API response.
# It converts raw logits into probabilities and human readable labels for clients.
# It shows how a service layer shapes safe structured JSON style outputs.

# !pip install tensorflow==2.20.0

# Import required standard libraries and TensorFlow framework.
import os
import json
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible random behavior in this example.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version information for environment transparency and debugging.
print("TensorFlow version:", tf.__version__)

# Define fixed label names that a classification service might expose to clients.
label_names = ["cat", "dog", "car", "tree"]

# Validate that label names length matches expected number of classes here.
num_classes = 4
assert len(label_names) == num_classes

# Simulate raw model logits output for a single prediction request.
logits = tf.constant([[1.2, 0.3, -0.5, 2.0]], dtype=tf.float32)

# Convert logits into probabilities using softmax for interpretable confidences.
probabilities = tf.nn.softmax(logits, axis=-1)

# Ensure probabilities shape matches batch size and class count expectations.
assert probabilities.shape == (1, num_classes)

# Convert probabilities tensor into a flat NumPy array for easier processing.
prob_array = probabilities.numpy()[0]

# Select top K predictions indices sorted by probability descending order.
top_k = 2
sorted_indices = np.argsort(prob_array)[::-1][:top_k]

# Build structured prediction entries with labels and confidence scores.
predictions_list = []
for index in sorted_indices:
    label = label_names[index]
    confidence = float(prob_array[index])
    predictions_list.append({"label": label, "confidence": round(confidence, 4)})

# Create a mock postprocessed response payload similar to JSON API output.
response_payload = {
    "model_name": "toy_classifier_v1",
    "model_version": "1.0.0",
    "predictions": predictions_list,
}

# Serialize payload to JSON string for logging or HTTP response body usage.
response_json = json.dumps(response_payload, indent=2)

# Print a short explanation header and the final postprocessed JSON payload.
print("\nPostprocessed prediction response payload:")
print(response_json)
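
The example above covers the conversion from logits to labels. A business or policy rule can be layered on top in the same service layer. The sketch below applies a minimum-confidence rule and falls back to an `unknown` label; the threshold values, the `needs_review` flag, and the sample predictions are hypothetical.

```python
def apply_confidence_policy(predictions, min_confidence=0.5, fallback_label="unknown"):
    """Return the top prediction, or a fallback when confidence is too low."""
    top = max(predictions, key=lambda entry: entry["confidence"])
    if top["confidence"] < min_confidence:
        # Hide the low-confidence label and flag the result for follow-up.
        return {
            "label": fallback_label,
            "confidence": top["confidence"],
            "needs_review": True,
        }
    return {
        "label": top["label"],
        "confidence": top["confidence"],
        "needs_review": False,
    }


# Hypothetical top-2 predictions in the same shape as the payload above.
sample_predictions = [
    {"label": "tree", "confidence": 0.58},
    {"label": "cat", "confidence": 0.26},
]

print(apply_confidence_policy(sample_predictions, min_confidence=0.5))
print(apply_confidence_policy(sample_predictions, min_confidence=0.7))
```

Keeping the rule in one function makes it easy to unit test and to change the threshold without touching client code.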



## **3. Serving Performance Essentials**

### **3.1. Efficient Request Batching**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_01.jpg?v=1768987541" width="250">



>* Batching groups many requests for one inference
>* Larger batches boost hardware use and throughput

>* Batching trades higher throughput for added waiting time
>* Tune batch size and wait by use case

>* Use dynamic batching with size and time limits
>* Monitor metrics and tune queues for traffic



In [None]:
#@title Python Code - Efficient Request Batching

# This script compares batched and unbatched inference latency clearly.
# It simulates a simple model server using TensorFlow dense layers.
# It measures throughput and average latency for different batch sizes.

# !pip install tensorflow

# Import required modules including TensorFlow and time measurement.
import tensorflow as tf
import numpy as np
import time
import os

# Set deterministic random seeds for reproducible behavior always.
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version information for educational clarity here.
print("Using TensorFlow version:", tf.__version__)

# Define simple dense model representing a served prediction model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Compile model with mean squared error loss and adam optimizer.
model.compile(optimizer="adam", loss="mse")

# Create tiny random training data for quick dummy training.
train_features = np.random.rand(256, 32).astype("float32")
train_targets = np.random.rand(256, 1).astype("float32")

# Train model briefly to initialize weights realistically and deterministically.
model.fit(train_features, train_targets, epochs=2, batch_size=32, verbose=0)

# Define function measuring latency and throughput for given batch size.
def measure_performance(batch_size, total_requests, warmup_batches):
    input_shape = (batch_size, 32)
    dummy_input = np.random.rand(*input_shape).astype("float32")
    assert dummy_input.shape[1] == 32

    for _ in range(warmup_batches):
        _ = model.predict(dummy_input, verbose=0)

    batches_needed = total_requests // batch_size
    remaining_requests = total_requests % batch_size

    latencies = []
    processed_requests = 0

    # Use perf_counter, a monotonic high-resolution clock suited to short timings.
    for _ in range(batches_needed):
        start_time = time.perf_counter()
        _ = model.predict(dummy_input, verbose=0)
        end_time = time.perf_counter()
        latencies.append(end_time - start_time)
        processed_requests += batch_size

    # Process any leftover requests as one smaller final batch.
    if remaining_requests > 0:
        small_input = dummy_input[:remaining_requests]
        start_time = time.perf_counter()
        _ = model.predict(small_input, verbose=0)
        end_time = time.perf_counter()
        latencies.append(end_time - start_time)
        processed_requests += remaining_requests

    total_time = sum(latencies)
    average_latency = total_time / len(latencies)
    throughput = processed_requests / total_time

    return average_latency, throughput

# Configure experiment parameters for total requests and warmup batches.
total_requests = 1024
warmup_batches = 3

# Define batch sizes list including unbatched and larger batched cases.
batch_sizes = [1, 8, 32, 128]

# Print header describing columns; latency is the average time per batch call, not per request.
print("BatchSize  AvgBatchLatencySeconds  ThroughputRequestsPerSecond")

# Loop over batch sizes and measure performance metrics for each.
for batch_size in batch_sizes:
    avg_latency, throughput = measure_performance(batch_size, total_requests, warmup_batches)
    print(batch_size, round(avg_latency, 5), round(throughput, 2))
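
The benchmark above uses fixed batch sizes. Dynamic batching instead closes a batch when it is full or when the oldest waiting request has waited too long. The sketch below simulates that rule on a list of arrival timestamps, with no threads or real model involved. The function name `form_batches` and its parameters are illustrative, not a serving library API.

```python
def form_batches(arrival_times, max_batch_size, max_wait_s):
    """Group sorted arrival times into batches closed by size or by timeout."""
    batches = []
    current = []
    for arrival in arrival_times:
        # Close the open batch if this request arrives after its wait limit.
        if current and arrival - current[0] > max_wait_s:
            batches.append(current)
            current = []
        current.append(arrival)
        # Close the batch as soon as it reaches the size limit.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times in seconds: a burst, a pair, then a lone request.
arrivals = [0.000, 0.001, 0.002, 0.003, 0.020, 0.021, 0.050]
batches = form_batches(arrivals, max_batch_size=3, max_wait_s=0.005)
batch_sizes_formed = [len(batch) for batch in batches]

print("Batch sizes:", batch_sizes_formed)
```

During the burst the size limit closes the batch; during quiet periods the time limit does, which bounds the extra waiting that batching adds.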




### **3.2. Latency and Throughput Metrics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_02.jpg?v=1768987616" width="250">



>* Latency and throughput define speed and capacity
>* These metrics guide UX, hardware, and scaling decisions

>* Latency is a distribution; percentiles reveal tails
>* Different applications care about different latency percentiles

>* Throughput is sustained request rate under latency limits
>* Load testing reveals safe capacity and scaling needs



In [None]:
#@title Python Code - Latency and Throughput Metrics

import time
import statistics
import random

# Define a function to simulate request latency measurements.
# This function returns a list of simulated latencies.
# Latencies are measured in milliseconds units.
# We use random variations for realism.
# Seeded randomness ensures reproducible behavior.

def simulate_latencies(request_count, base_latency_ms, jitter_ms):
    latencies = []
    for _ in range(request_count):
        jitter = random.uniform(-jitter_ms, jitter_ms)
        latency = max(1.0, base_latency_ms + jitter)
        latencies.append(latency)
    return latencies

# Define a function to compute latency distribution metrics.
# We compute average and median latencies.
# We also compute p95 and p99 tail latencies.
# Percentiles help understand slowest requests.
# We return metrics inside a dictionary.

def compute_latency_metrics(latencies):
    sorted_latencies = sorted(latencies)
    count = len(sorted_latencies)
    average_latency = statistics.mean(sorted_latencies)
    median_latency = statistics.median(sorted_latencies)
    index_p95 = int(0.95 * (count - 1))
    index_p99 = int(0.99 * (count - 1))
    p95_latency = sorted_latencies[index_p95]
    p99_latency = sorted_latencies[index_p99]
    metrics = {
        "average_ms": average_latency,
        "median_ms": median_latency,
        "p95_ms": p95_latency,
        "p99_ms": p99_latency,
    }
    return metrics

# Define a function to evaluate latency and throughput for several request counts.
# The latencies are simulated, so throughput is derived from them rather than from
# wall-clock time, which would only measure how fast random numbers are generated.
# We assume a single worker that handles requests one at a time.
# We also track latency metrics for each scenario.

def evaluate_throughput_scenarios():
    random.seed(42)
    base_latency_ms = 80.0
    jitter_ms = 40.0
    request_scenarios = [50, 200, 800, 2000]
    results = []
    for request_count in request_scenarios:
        latencies = simulate_latencies(request_count, base_latency_ms, jitter_ms)
        # Total simulated busy time for one sequential worker, in seconds.
        simulated_seconds = max(1e-6, sum(latencies) / 1000.0)
        throughput_rps = request_count / simulated_seconds
        metrics = compute_latency_metrics(latencies)
        scenario_result = {
            "requests": request_count,
            "throughput_rps": throughput_rps,
            "average_ms": metrics["average_ms"],
            "median_ms": metrics["median_ms"],
            "p95_ms": metrics["p95_ms"],
            "p99_ms": metrics["p99_ms"],
        }
        results.append(scenario_result)
    return results

# Run the evaluation and print concise results.
# This demonstrates how the latency distribution and its tails look.
# Throughput stays near 1000 / average_ms because one worker serves requests in sequence.
# Percentile estimates become more stable as the sample size grows.
# Values are rounded for readability.

if __name__ == "__main__":
    scenarios = evaluate_throughput_scenarios()
    print("Requests  Throughput_rps  Avg_ms  Median_ms  P95_ms  P99_ms")
    for scenario in scenarios:
        print(
            f"{scenario['requests']:7d}  "
            f"{scenario['throughput_rps']:13.1f}  "
            f"{scenario['average_ms']:6.1f}  "
            f"{scenario['median_ms']:9.1f}  "
            f"{scenario['p95_ms']:6.1f}  "
            f"{scenario['p99_ms']:6.1f}"
        )



### **3.3. Scaling Model Replicas**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_03.jpg?v=1768987653" width="250">



>* Run multiple replicas behind a load balancer
>* More replicas increase capacity and keep latency low

>* Adding replicas eventually hits hardware and data bottlenecks
>* Continuously measure latency and throughput when scaling

>* Autoscaling adjusts replicas using real-time performance signals
>* Tune replica counts to balance latency and cost
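
A back-of-the-envelope sizing rule can make the trade-off above concrete: divide the target request rate by the capacity of one replica, leaving some headroom. The sketch below assumes throughput scales linearly with replicas, which real systems only approximate. The function name, the 70% utilization target, and the traffic figures are hypothetical.

```python
import math


def replicas_needed(target_rps, per_replica_rps, max_utilization=0.7):
    """Estimate replica count, assuming throughput scales linearly with replicas."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    # Headroom keeps each replica below its measured peak to protect tail latency.
    usable_rps = per_replica_rps * max_utilization
    return max(1, math.ceil(target_rps / usable_rps))


# Hypothetical load-test result: one replica sustains 120 requests per second.
peak_replicas = replicas_needed(target_rps=500, per_replica_rps=120)
quiet_replicas = replicas_needed(target_rps=50, per_replica_rps=120)

print("Replicas at peak traffic:", peak_replicas)
print("Replicas at quiet traffic:", quiet_replicas)
```

An autoscaler performs a similar calculation continuously from live metrics; the estimate here is only a starting point to confirm with load tests.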



# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>


In this lecture, you learned to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 

In the next module (Module 10), we will cover 'Advanced Topics'.