# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>

>Last update: 20260128.
    
By the end of this lecture, you will be able to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 


## **1. TensorFlow Serving Basics**

### **1.1. Dockerized TF Serving**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_01.jpg?v=1769605095" width="250">



>* Run TensorFlow Serving inside a Docker container
>* Get simple, consistent deployments across all environments

>* Containers isolate TensorFlow Serving from host changes
>* Multiple models run with separate versions and resources

>* Container images integrate with registries and orchestrators
>* Enable scalable, reliable model serving across environments

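As a rough sketch of the first point, the snippet below assembles a `docker run` command for the public `tensorflow/serving` image. The host path `./exported/demo_model` and the helper name `build_serving_command` are assumptions made for illustration. The code only builds and prints the command; it does not start a container.

```python
import shlex
from pathlib import Path


def build_serving_command(model_dir, model_name, rest_port=8501):
    """Return a docker argv list that would serve one model over REST."""
    # Resolve to an absolute path because bind mounts need one.
    source = Path(model_dir).resolve()
    return [
        "docker", "run", "--rm",
        # Publish the container's REST port (8501) on the chosen host port.
        "-p", f"{rest_port}:8501",
        # Mount the host directory that holds numbered version subfolders.
        "--mount", f"type=bind,source={source},target=/models/{model_name}",
        # Tell the image's entrypoint which model to load.
        "-e", f"MODEL_NAME={model_name}",
        "tensorflow/serving",
    ]


cmd = build_serving_command("./exported/demo_model", "demo_model")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later, or to review before anything is executed.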


### **1.2. Managing Model Versions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_02.jpg?v=1769605115" width="250">



>* Treat models as numbered versioned snapshots over time
>* Directory structure enables updates, rollback, parallel versions

>* New model versions load safely alongside old
>* Blue–green style updates reduce risk and downtime

>* Versioning enables A/B tests and canary rollouts
>* Supports auditability, rollback, and reproducible predictions

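The directory idea can be sketched in a few lines. This example builds a throwaway folder tree with numbered version subfolders, then picks the highest number, which is the version TensorFlow Serving loads under its default policy. The helper names `list_versions` and `latest_version` are hypothetical, and the folders are empty placeholders rather than real SavedModels.

```python
import tempfile
from pathlib import Path


def list_versions(model_root):
    """Return numeric version folder names under model_root, sorted ascending."""
    root = Path(model_root)
    return sorted(int(p.name) for p in root.iterdir()
                  if p.is_dir() and p.name.isdigit())


def latest_version(model_root):
    """Return the highest version number, or None when no version exists."""
    versions = list_versions(model_root)
    return versions[-1] if versions else None


with tempfile.TemporaryDirectory() as tmp:
    model_root = Path(tmp) / "demo_model"
    # Each numbered folder stands in for one exported SavedModel snapshot.
    for version in (1, 2, 10):
        (model_root / str(version)).mkdir(parents=True)
    # A non-numeric folder is ignored by this sketch.
    (model_root / "notes").mkdir()

    found_versions = list_versions(model_root)
    newest = latest_version(model_root)
    # Rolling back is modeled here as choosing the previous number.
    rollback_target = found_versions[-2]

print("versions:", found_versions)
print("latest:", newest)
print("rollback target:", rollback_target)
```

Versions are compared as integers, so `10` sorts after `2`; a plain string sort would get this wrong.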


### **1.3. REST and gRPC Setup**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_03.jpg?v=1769605131" width="250">



>* REST exposes models via simple HTTP JSON requests
>* Easy integration, debugging, and testing across many apps

>* gRPC uses binary, schema-based messages over HTTP/2
>* Contracts enable fast, consistent, multi-language service communication

>* Choose REST or gRPC based on client needs
>* Plan endpoints, reliability, monitoring, and security carefully



In [None]:
#@title Python Code - REST and gRPC Setup

# This script shows simple REST and gRPC setup.
# It simulates TensorFlow Serving style prediction calls.
# Focus is on beginner friendly client side examples.

# Required external libraries would be installed like this.
# !pip install tensorflow-serving-api

# Import standard libraries for JSON handling and seeding.
import json
import random

# Import TensorFlow to show version and basic tensor usage.
import tensorflow as tf

# Set deterministic seeds for reproducible random behavior.
random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one concise line.
print("TensorFlow version:", tf.__version__)

# Define a tiny helper to pretty print JSON safely.
def pretty_json(data_dict):
    return json.dumps(data_dict, indent=2, sort_keys=True)

# Name the model and version that the example requests target.
MODEL_NAME = "demo_model"
MODEL_VERSION = "1"

# Create a small example input vector for predictions.
example_features = [0.1, 0.5, 0.9]

# Validate the example input length before using it.
if len(example_features) != 3:
    raise ValueError("example_features must have length three")

# Build a fake REST request body similar to TensorFlow Serving.
rest_request_body = {
    "signature_name": "serving_default",
    "instances": [example_features],
}

# Show the REST URL pattern for a prediction endpoint.
rest_url = (
    "http://localhost:8501/v1/models/" + MODEL_NAME + ":predict"
)

# Print a short explanation and the REST request example.
print("\nREST endpoint URL example:")
print(rest_url)

# Print the JSON body that a REST client would send.
print("\nREST JSON request body example:")
print(pretty_json(rest_request_body))

# Simulate a simple model prediction using TensorFlow operations.
weights = tf.constant([[0.2], [0.4], [0.6]], dtype=tf.float32)

# Convert example features into a TensorFlow tensor.
features_tensor = tf.constant([example_features], dtype=tf.float32)

# Validate tensor shape before matrix multiplication.
if features_tensor.shape[1] != weights.shape[0]:
    raise ValueError("Feature size and weight size must match")

# Compute a fake prediction using matrix multiplication.
prediction_tensor = tf.matmul(features_tensor, weights)

# Convert prediction tensor to a plain Python value.
prediction_value = float(prediction_tensor.numpy()[0][0])

# Build a fake REST style prediction response body.
# One entry per instance; each entry mirrors the model's output shape.
rest_response_body = {"predictions": [[prediction_value]]}

# Print the simulated REST response JSON body.
print("\nSimulated REST JSON response body:")
print(pretty_json(rest_response_body))

# Define a minimal gRPC style request dictionary for teaching.
grpc_request = {
    "model_spec": {"name": MODEL_NAME, "version": MODEL_VERSION},
    "inputs": {"features": example_features},
}

# Print a short explanation of the gRPC style request.
print("\nSimplified gRPC style request example:")
print(pretty_json(grpc_request))

# Build a matching gRPC style response dictionary.
grpc_response = {
    "model_spec": {"name": MODEL_NAME, "version": MODEL_VERSION},
    "outputs": {"score": prediction_value},
}

# Print the simulated gRPC style response body.
print("\nSimulated gRPC style response example:")
print(pretty_json(grpc_response))

# Final confirmation line summarizing what was demonstrated.
print("\nDemo complete: compared REST and gRPC style payloads.")

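For a real client, the request can be built with the standard library alone. The sketch below constructs, but does not send, a POST request for TensorFlow Serving's REST predict endpoint, optionally pinning a version through the `/versions/<n>` path segment. The host `localhost:8501`, the model name `demo_model`, and the helper `build_predict_request` are assumptions for illustration.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, version=None):
    """Build (but do not send) a POST request for the predict endpoint."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Pin a specific version instead of the server's default.
        path += f"/versions/{version}"
    url = f"http://{host}{path}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    "localhost:8501", "demo_model", [[0.1, 0.5, 0.9]], version=1
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With a server running, the call could be sent like this:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.loads(response.read()))
```

Separating request construction from sending keeps the payload easy to inspect and unit test without a live server.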


## **2. Designing Prediction APIs**

### **2.1. JSON Request Design**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_01.jpg?v=1769605178" width="250">



>* Prediction request defines client–server data contract
>* Use clear fields to avoid missing, misaligned data

>* Support both single and batched prediction requests
>* Handle text, numeric, and binary data safely

>* Separate required and optional fields with defaults
>* Design schema for versioning and future extensions

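One possible way to express these design points in code is a small parser that accepts single or batched requests, applies defaults to optional fields, and checks a schema version. The field names `instance`, `instances`, `schema_version`, and `return_probabilities` are hypothetical choices for this sketch, not a fixed standard.

```python
import json
from dataclasses import dataclass

SUPPORTED_SCHEMA_VERSIONS = {"1"}


@dataclass
class PredictionRequest:
    instances: list                     # required: one dict per example
    schema_version: str = "1"           # optional: lets the contract evolve
    return_probabilities: bool = False  # optional flag with a default


def parse_request(raw_json):
    """Parse a JSON string into a PredictionRequest, or raise ValueError."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    # Accept a single example under "instance" or a batch under "instances".
    if "instances" in payload:
        instances = payload["instances"]
    elif "instance" in payload:
        instances = [payload["instance"]]
    else:
        raise ValueError("request needs 'instance' or 'instances'")
    if not isinstance(instances, list) or not instances:
        raise ValueError("'instances' must be a non-empty list")
    version = str(payload.get("schema_version", "1"))
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version}")
    return PredictionRequest(
        instances=instances,
        schema_version=version,
        return_probabilities=bool(payload.get("return_probabilities", False)),
    )


single = parse_request('{"instance": {"age": 35, "income": 55000}}')
batch = parse_request(
    '{"schema_version": "1", "instances": [{"age": 35}, {"age": 52}]}'
)
print(len(single.instances), single.schema_version)
print(len(batch.instances), batch.return_probabilities)
```

Normalizing a single example into a one-item batch means the rest of the service only has to handle one shape.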


### **2.2. Service Layer Preprocessing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_02.jpg?v=1769605196" width="250">



>* Service layer cleans and converts JSON inputs
>* Ensures consistent tensors so models focus on inference

>* Service preprocessing adds flexibility across model versions
>* Must exactly match training transforms to avoid drift

>* Preprocessing defends against bad or unusual inputs
>* It enriches data and boosts API robustness



In [None]:
#@title Python Code - Service Layer Preprocessing

# This script shows service layer preprocessing.
# We simulate a tiny JSON prediction request.
# Then we convert it into model ready tensors.

# Required TensorFlow is usually preinstalled in Colab.
# Uncomment next line if TensorFlow is unexpectedly missing.
# !pip install tensorflow==2.20.0

# Import standard libraries for JSON parsing and randomness.
import json
import random

# Import TensorFlow for tensor handling and models.
import tensorflow as tf

# Set deterministic seeds for reproducible behavior.
random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in a compact single line.
print("TensorFlow version:", tf.__version__)

# Define a tiny feature specification for our service.
FEATURE_KEYS = ["age", "income", "employment_status"]

# Define allowed employment categories for simple encoding.
EMPLOYMENT_CATEGORIES = ["unemployed", "part_time", "full_time"]

# Define numeric scaling constants for preprocessing logic.
AGE_MAX = 100.0

# Define income scaling constant to keep values small.
INCOME_SCALE = 100000.0

# Build a simple Keras model that expects three features.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model with minimal configuration options.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Create a tiny dummy batch for quick warmup training.
dummy_x = tf.constant([[0.2, 0.3, 1.0]], dtype=tf.float32)

# Create dummy labels so the model can be trained.
dummy_y = tf.constant([[1.0]], dtype=tf.float32)

# Train briefly with verbose zero to avoid log spam.
model.fit(dummy_x, dummy_y, epochs=3, verbose=0)

# Define a function that validates incoming JSON payloads.
def validate_request(payload: dict) -> None:
    # Ensure all required feature keys are present.
    missing = [k for k in FEATURE_KEYS if k not in payload]
    # Raise a clear error if any keys are missing.
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    # Validate age is within a safe numeric range.
    age = payload["age"]
    # Reject booleans (a subclass of int), then check type and bounds.
    if (isinstance(age, bool) or not isinstance(age, (int, float))
            or not (0 <= age <= 120)):
        raise ValueError("Invalid age value in request")
    # Validate income is non negative and numeric.
    income = payload["income"]
    # Check income type and lower bound (no upper bound is enforced).
    if (isinstance(income, bool) or not isinstance(income, (int, float))
            or income < 0):
        raise ValueError("Invalid income value in request")
    # Validate employment status is a known category.
    status = payload["employment_status"]
    # Ensure status is a string and in allowed list.
    if not isinstance(status, str) or status not in EMPLOYMENT_CATEGORIES:
        raise ValueError("Unknown employment_status category")


# Define a function that converts JSON into model tensors.
def preprocess_request(payload: dict) -> tf.Tensor:
    # First validate the payload using helper function.
    validate_request(payload)
    # Normalize age into zero one range using constant.
    age_raw = float(payload["age"])
    # Clip age to avoid extreme unexpected values.
    age_clipped = max(0.0, min(age_raw, AGE_MAX))
    # Scale age by maximum to obtain normalized feature.
    age_scaled = age_clipped / AGE_MAX
    # Scale income into a smaller numeric magnitude.
    income_raw = float(payload["income"])
    # Use simple division to keep values numerically stable.
    income_scaled = income_raw / INCOME_SCALE
    # Convert employment status into an ordinal index (not one-hot).
    status = payload["employment_status"]
    # Map status string to index using predefined list.
    status_index = EMPLOYMENT_CATEGORIES.index(status)
    # Represent status as a single numeric index feature.
    status_feature = float(status_index)
    # Assemble final feature vector in correct order.
    features = [age_scaled, income_scaled, status_feature]
    # Convert list into a batch tensor for the model.
    tensor = tf.convert_to_tensor([features], dtype=tf.float32)
    # Validate tensor shape before returning to caller.
    if tensor.shape != (1, 3):
        raise ValueError(f"Unexpected tensor shape {tensor.shape}")
    # Return the prepared tensor ready for inference.
    return tensor


# Define a simple function that simulates an API call.
def predict_from_json(json_string: str) -> float:
    # Parse the incoming JSON string into a dictionary.
    payload = json.loads(json_string)
    # Preprocess payload into a numeric tensor batch.
    inputs = preprocess_request(payload)
    # Run the model prediction on prepared inputs.
    preds = model.predict(inputs, verbose=0)
    # Extract scalar prediction value from returned array.
    score = float(preds[0, 0])
    # Return the prediction score as a Python float.
    return score


# Create a sample valid JSON request for demonstration.
sample_request = json.dumps({
    "age": 35,
    "income": 55000,
    "employment_status": "full_time",
})

# Run the simulated prediction and capture the result.
result = predict_from_json(sample_request)

# Print the original JSON and the numeric prediction.
print("Request JSON:", sample_request)

# Print the final prediction score with limited decimals.
print("Predicted approval score:", round(result, 4))




### **2.3. Interpreting Model Responses**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_03.jpg?v=1769605313" width="250">



>* Convert raw model numbers into meaningful outputs
>* Return clear, consistent, self-describing prediction objects

>* Choose which probabilities and signals to expose
>* Balance detail, clarity, and safety in responses

>* Clearly separate successes, low-confidence results, and errors
>* Log response types to refine models and clients

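A minimal sketch of such a response builder is shown below, assuming a binary classifier that returns a sigmoid score in [0, 1]. The labels `approved` and `declined`, the 0.1 low-confidence margin, and the response field names are illustrative assumptions, not part of any serving API.

```python
import math

LOW_CONFIDENCE_MARGIN = 0.1  # assumed: scores within 0.1 of the threshold are flagged


def build_response(raw_score, model_version="1", threshold=0.5):
    """Wrap a raw sigmoid score in a self-describing response dict."""
    # Treat non-finite or out-of-range scores as errors, not predictions.
    if not isinstance(raw_score, (int, float)) or not math.isfinite(raw_score):
        return {"status": "error", "error": "model returned a non-finite score"}
    if not 0.0 <= raw_score <= 1.0:
        return {"status": "error", "error": "score outside [0, 1]"}
    label = "approved" if raw_score >= threshold else "declined"
    # Scores close to the threshold are marked so clients can route them.
    close_call = abs(raw_score - threshold) < LOW_CONFIDENCE_MARGIN
    return {
        "status": "low_confidence" if close_call else "ok",
        "label": label,
        "score": round(raw_score, 4),
        "model_version": model_version,
    }


for score in (0.93, 0.55, float("nan")):
    print(build_response(score))
```

Returning a `status` field in every response lets clients branch on success, low confidence, or error without parsing free text, and gives logs a simple field to aggregate.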


## **3. Serving Performance Basics**

### **3.1. Batching For Higher Throughput**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_01.jpg?v=1769605333" width="250">



>* Batching groups many requests into one pass
>* This boosts throughput, often with only modest added latency

>* Batching balances higher throughput against added waiting time
>* Traffic patterns and latency needs guide batch settings

>* Tune batch size and wait time carefully
>* Monitor latency and throughput, adjust or use dynamic batching



In [None]:
#@title Python Code - Batching For Higher Throughput

# This script demonstrates batching for throughput.
# We compare single and batched inference latency.
# Focus is on simple TensorFlow serving concepts.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import time
import random

# Import TensorFlow and numpy.
import tensorflow as tf
import numpy as np

# Set deterministic seeds for reproducibility.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Define a small dense model for demonstration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Compile the model with simple configuration.
model.compile(optimizer="adam", loss="mse")

# Create tiny random training data.
train_x = np.random.randn(256, 32).astype("float32")
train_y = np.random.randn(256, 1).astype("float32")

# Train briefly with silent output.
model.fit(train_x, train_y, epochs=2, batch_size=32, verbose=0)

# Create a small test set for timing.
test_x = np.random.randn(128, 32).astype("float32")

# Validate test shape before inference.
assert test_x.shape == (128, 32)

# Define helper to time repeated predictions.
def time_predictions(inputs, batch_size, repeats):
    # Warm up once to avoid cold start.
    _ = model.predict(inputs, batch_size=batch_size, verbose=0)

    # Measure total wall time with a monotonic high-resolution clock.
    start = time.perf_counter()
    for _ in range(repeats):
        _ = model.predict(inputs, batch_size=batch_size, verbose=0)
    end = time.perf_counter()

    # Compute average time per full pass over the inputs.
    total_time = end - start
    avg_latency = total_time / repeats
    return total_time, avg_latency

# Configure experiment parameters.
repeats = 20
single_batch_size = 1
large_batch_size = 32

# Time single example style inference.
single_total, single_avg = time_predictions(
    test_x, batch_size=single_batch_size, repeats=repeats
)

# Time batched inference with larger batch.
large_total, large_avg = time_predictions(
    test_x, batch_size=large_batch_size, repeats=repeats
)

# Compute approximate throughput values.
num_examples = test_x.shape[0]
single_throughput = num_examples / single_avg
large_throughput = num_examples / large_avg

# Print concise comparison results.
print("Batch size 1, time per pass (s):", round(single_avg, 5))
print("Batch size 32, time per pass (s):", round(large_avg, 5))
print("Batch size 1 throughput:", int(single_throughput), "examples/s")
print("Batch size 32 throughput:", int(large_throughput), "examples/s")
print("Total time, batch size 1:", round(single_total, 3), "s")
print("Total time, batch size 32:", round(large_total, 3), "s")
# Report the measured ratio instead of assuming batching always wins.
print("Throughput ratio (32 vs 1):", round(large_throughput / single_throughput, 2))

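The trade-off between batch size and wait time can also be explored without a model. The following sketch simulates a simple dynamic batcher: a batch is dispatched when it is full or when a maximum wait has passed since its first request. The function `form_batches` and the arrival times are hypothetical, and real serving stacks implement this with queues and timers rather than a replay over timestamps.

```python
def form_batches(arrival_times_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches; return (batches, waits).

    A batch closes when it is full or when max_wait_ms has passed since
    its first request arrived, whichever comes first.
    """
    batches, waits = [], []
    current = []
    for t in arrival_times_ms:
        # Close the open batch if this arrival is past its deadline.
        if current and t - current[0] > max_wait_ms:
            dispatch = current[0] + max_wait_ms
            waits.extend(dispatch - a for a in current)
            batches.append(current)
            current = []
        current.append(t)
        # Close immediately once the batch is full.
        if len(current) == max_batch_size:
            waits.extend(t - a for a in current)
            batches.append(current)
            current = []
    if current:
        # Flush the final partial batch at its deadline.
        dispatch = current[0] + max_wait_ms
        waits.extend(dispatch - a for a in current)
        batches.append(current)
    return batches, waits


arrivals = [0, 1, 2, 3, 20, 21, 50]
batches, waits = form_batches(arrivals, max_batch_size=4, max_wait_ms=5)
print("batches:", batches)
print("max queue wait (ms):", max(waits))
```

Under bursty traffic the batch fills quickly and waits stay short; under sparse traffic every request pays close to the full `max_wait_ms`, which is why the wait limit should follow the latency budget.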


### **3.2. Measuring Latency and Throughput**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_02.jpg?v=1769605383" width="250">



>* Latency is how fast one prediction returns
>* Throughput is how many predictions complete per unit of time

>* Simulate realistic traffic and record request latencies
>* Summarize latency percentiles, track throughput versus load

>* Relate latency and throughput to application needs
>* Use patterns to guide tuning and capacity decisions



In [None]:
#@title Python Code - Measuring Latency and Throughput

# This script measures simple serving latency basics.
# It simulates a tiny model prediction service.
# We focus on latency and throughput concepts.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import time
import statistics
import random

# Import tensorflow for a tiny model.
import tensorflow as tf

# Set deterministic random seeds.
random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version once.
print("TensorFlow version:", tf.__version__)

# Build a tiny dense model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Compile model with simple settings.
model.compile(optimizer="adam", loss="mse")

# Create small synthetic training data.
x_train = tf.random.uniform((64, 4), minval=0.0, maxval=1.0)

# Create matching target values.
y_train = tf.reduce_sum(x_train, axis=1, keepdims=True)

# Train briefly with silent output.
model.fit(x_train, y_train, epochs=3, verbose=0)

# Define a simple predict function.
def predict_once(batch_size: int) -> tf.Tensor:
    # Create random input batch.
    inputs = tf.random.uniform((batch_size, 4))
    # Run model prediction.
    outputs = model(inputs, training=False)
    return outputs

# Warm up once so one-time setup cost is not measured.
_ = predict_once(batch_size=1)

# Measure latency for single requests.
latencies_single = []

# Run several single predictions.
for _ in range(20):
    # Record start time.
    start = time.perf_counter()
    # Call prediction with batch size one.
    _ = predict_once(batch_size=1)
    # Record end time.
    end = time.perf_counter()
    # Store latency in milliseconds.
    latencies_single.append((end - start) * 1000.0)

# Warm up once at this batch size before timing.
_ = predict_once(batch_size=16)

# Measure latency for batched requests.
latencies_batch = []

# Run several batched predictions.
for _ in range(20):
    # Record start time.
    start = time.perf_counter()
    # Call prediction with batch size sixteen.
    _ = predict_once(batch_size=16)
    # Record end time.
    end = time.perf_counter()
    # Store latency in milliseconds.
    latencies_batch.append((end - start) * 1000.0)

# Helper to compute simple statistics.
def summarize_latencies(values):
    # Reject an empty list so callers can always unpack three values.
    if not values:
        raise ValueError("values must not be empty")
    # Sort values for percentiles.
    sorted_vals = sorted(values)
    # Compute median value (averages the two middle values for even counts).
    median = statistics.median(sorted_vals)
    # Compute p95 index with a simple floor-based rank.
    p95_index = int(0.95 * (len(sorted_vals) - 1))
    # Compute p95 value.
    p95 = sorted_vals[p95_index]
    # Compute mean value.
    mean = statistics.fmean(sorted_vals)
    return mean, median, p95

# Summarize single request latencies.
mean_s, median_s, p95_s = summarize_latencies(latencies_single)

# Summarize batched request latencies.
mean_b, median_b, p95_b = summarize_latencies(latencies_batch)

# Estimate throughput from batched timings.
total_requests = 20 * 16

# Total time seconds for batched runs.
total_time_sec = sum(latencies_batch) / 1000.0

# Compute requests per second.
throughput_rps = total_requests / total_time_sec

# Print concise latency summary.
print("Single request mean ms:", round(mean_s, 3))

# Print median latency for singles.
print("Single request median ms:", round(median_s, 3))

# Print p95 latency for singles.
print("Single request p95 ms:", round(p95_s, 3))

# Print concise batched latency summary.
print("Batch(16) mean ms:", round(mean_b, 3))

# Print median latency for batches.
print("Batch(16) median ms:", round(median_b, 3))

# Print p95 latency for batches.
print("Batch(16) p95 ms:", round(p95_b, 3))

# Print approximate throughput estimate.
print("Approx throughput rps:", round(throughput_rps, 2))

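To look at throughput versus load, requests can be issued concurrently. The sketch below is a small closed-loop load test against a stand-in function that sleeps for about 2 ms instead of calling a real server. `fake_predict`, `run_load`, and the request counts are assumptions for illustration, and measured numbers will vary by machine.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict(_request_id):
    """Stand-in for a network call to a model server (assumed ~2 ms)."""
    time.sleep(0.002)
    return 1.0


def timed_call(request_id):
    """Return the latency of one call in milliseconds."""
    start = time.perf_counter()
    fake_predict(request_id)
    return (time.perf_counter() - start) * 1000.0


def run_load(num_requests, concurrency):
    """Send num_requests with a fixed number of workers; return a summary."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(num_requests)))
    wall_seconds = time.perf_counter() - wall_start
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "count": len(latencies),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": num_requests / wall_seconds,
    }


for workers in (1, 4):
    summary = run_load(num_requests=40, concurrency=workers)
    print(workers, {k: round(v, 2) for k, v in summary.items()})
```

Replacing `fake_predict` with a real HTTP or gRPC call turns this into a basic load test; sweeping `concurrency` shows where throughput stops rising and tail latency starts to climb.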



### **3.3. Scaling Model Replicas**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_03.jpg?v=1769605503" width="250">



>* Run multiple identical model servers behind load balancing
>* Handle higher traffic and avoid single-instance failures

>* More replicas boost throughput and can sometimes reduce latency
>* Poor balancing or autoscaling can hurt tail latency

>* Balance horizontal, vertical scaling with batching, hardware
>* Watch for saturation where replicas stop improving throughput

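A back-of-the-envelope capacity estimate can make these points concrete. The sketch below assumes each replica sustains a known request rate, that replicas should run below full utilization to leave headroom, and that some shared resource (for example a load balancer or feature store) caps total throughput. All numbers and function names are hypothetical.

```python
import math


def replicas_needed(target_rps, per_replica_rps, target_utilization=0.7):
    """Estimate replica count, leaving headroom below full utilization."""
    if per_replica_rps <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("per_replica_rps and utilization must be positive")
    return math.ceil(target_rps / (per_replica_rps * target_utilization))


def cluster_throughput(replicas, per_replica_rps, shared_limit_rps):
    """Throughput grows with replicas until a shared bottleneck caps it."""
    return min(replicas * per_replica_rps, shared_limit_rps)


print("replicas for 1000 rps:", replicas_needed(1000, per_replica_rps=180))
for n in (1, 2, 4, 8):
    print(n, "replicas ->", cluster_throughput(n, 180, shared_limit_rps=600), "rps")
```

In this toy model, going from 4 to 8 replicas adds nothing because the shared limit is already reached, which is the saturation pattern to watch for in real measurements.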


# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>


In this lecture, you learned to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 

In the next module (Module 10), we will go over 'Advanced Topics'.