# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>

>Last update: 20260127.
    
By the end of this lecture, you will be able to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 


## **1. TensorFlow Serving Essentials**

### **1.1. Dockerized TF Serving**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_01.jpg?v=1769540523" width="250">



>* Prebuilt containers run models consistently across environments
>* They load SavedModels and expose simple network endpoints

>* Containers simplify scaling and managing model servers
>* Reuse existing orchestration tools, improving reliability and collaboration

>* Containers cleanly separate models, serving config, and applications
>* Teams can work independently against a predictable runtime
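
As a rough illustration of the points above, the sketch below composes a `docker run` command for the public `tensorflow/serving` image and prints it without executing it. The host path `/tmp/models/demo` and the model name `demo` are assumptions made for this example; the host directory is assumed to contain numbered version subfolders.

```python
import shlex


def build_serving_command(model_dir: str, model_name: str, rest_port: int = 8501) -> list[str]:
    """Compose a docker run command for the tensorflow/serving image."""
    return [
        "docker", "run", "--rm",
        # Publish the container's REST port (8501) on the chosen host port.
        "-p", f"{rest_port}:8501",
        # Mount the host model directory where the image looks for models.
        "--mount", f"type=bind,source={model_dir},target=/models/{model_name}",
        # Tell the image's entrypoint which model to load.
        "-e", f"MODEL_NAME={model_name}",
        "-t", "tensorflow/serving",
    ]


# Hypothetical path and model name, used only for illustration.
command = build_serving_command("/tmp/models/demo", "demo")
print(shlex.join(command))
```

Building the command as a list keeps each argument explicit, which makes it easier to review or to pass to `subprocess.run` once Docker is available.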



### **1.2. Model Versioning Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_02.jpg?v=1769540538" width="250">



>* Versioning production models ensures stability and safety
>* Numbered directories let Serving switch versions transparently

>* Each version folder stores a full SavedModel
>* Supports safe upgrades, rollbacks, and validation runs

>* Versioning enables A/B tests and gradual rollouts
>* Policies ensure traceability, compliance, and reproducibility
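
The numbered-directory layout can be sketched with the standard library alone. The example below creates a temporary model folder with a few version subdirectories and picks the highest number, which mirrors how a latest-version policy would choose what to serve. The folder names are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def list_versions(model_base: Path) -> list[int]:
    """Return the numeric version directories under model_base, ascending."""
    return sorted(
        int(child.name)
        for child in model_base.iterdir()
        if child.is_dir() and child.name.isdigit()
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "demo"
    # "notes" is not a number, so it is ignored as a version.
    for name in ["1", "2", "10", "notes"]:
        (base / name).mkdir(parents=True)
    versions = list_versions(base)
    latest = versions[-1]

print("Versions found:", versions)
print("Version chosen by a latest-only policy:", latest)
```

Versions are compared as integers, so `10` ranks above `2`; a plain string sort would get this wrong.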



### **1.3. REST and gRPC Setup**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_01_03.jpg?v=1769540567" width="250">



>* TensorFlow Serving exposes models via REST or gRPC
>* REST favors simplicity; gRPC favors performance efficiency

>* REST is easiest for early testing and collaboration
>* JSON overhead can hurt performance at scale

>* gRPC adds complexity but provides strong, efficient contracts
>* Start with REST; move to gRPC when latency or payload size matters



In [None]:
#@title Python Code - REST and gRPC Setup

# This script shows simple REST and gRPC ideas.
# We simulate TensorFlow Serving style prediction calls.
# Focus is on beginner friendly client side usage.

# Required external libraries would be installed like this.
# !pip install tensorflow==2.20.0

# Import standard libraries for JSON handling and seeding.
import json
import random

# Import TensorFlow to show version and simple model.
import tensorflow as tf

# Set deterministic seeds for reproducible behavior.
random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one concise line.
print("TensorFlow version:", tf.__version__)

# Build a tiny example model for demonstration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(2, activation="linear"),
])

# Compile the model with simple configuration.
model.compile(optimizer="sgd", loss="mse")

# Create a tiny deterministic training dataset.
x_train = tf.constant([[0.0, 1.0, 2.0]], dtype=tf.float32)
y_train = tf.constant([[1.0, 0.0]], dtype=tf.float32)

# Train briefly with silent output to avoid logs.
model.fit(x_train, y_train, epochs=5, verbose=0)

# Export a SavedModel into a numbered version folder,
# which is the layout TensorFlow Serving expects.
saved_path = "demo_saved_model/1"
model.export(saved_path)

# Prepare a small example input for prediction.
example_input = [0.5, 1.5, 2.5]

# Build a JSON body similar to REST predict requests.
rest_body = {
    "signature_name": "serving_default",
    "instances": [example_input],
}

# Serialize the REST body to a JSON string.
rest_json = json.dumps(rest_body)

# Show how a REST request might look conceptually.
print("REST POST /v1/models/demo:predict body:")
print(rest_json)

# Run a local prediction to compare with server output.
local_pred = model.predict(
    tf.constant([example_input], dtype=tf.float32), verbose=0
)

# Convert prediction to a regular Python list.
local_pred_list = local_pred.tolist()

# Show the local prediction that REST or gRPC would return.
print("Local prediction result (simulating server):")
print(local_pred_list)

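# Build a simple dictionary similar to a gRPC PredictRequest message.
# The input key must match the exported signature's input name;
# "input_1" is a placeholder, so inspect the SavedModel to confirm it.
# The value keeps the batch dimension, just like the REST "instances".
grpc_request = {
    "model_spec": {"name": "demo", "signature_name": "serving_default"},
    "inputs": {"input_1": [example_input]},
}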

# Show the conceptual gRPC style request payload.
print("gRPC style request dictionary (conceptual):")
print(grpc_request)
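
To connect the printed REST body to an actual call, here is a minimal client sketch using only the standard library. It assumes a TensorFlow Serving container is reachable at `localhost:8501` and serves a model named `demo`; both are assumptions for this example. The request is built and inspected here, and `send_predict_request` is defined but not called, so the snippet runs without a server.

```python
import json
import urllib.request


def build_predict_request(host: str, model_name: str, instances: list) -> urllib.request.Request:
    """Build a POST request for the REST predict endpoint."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"signature_name": "serving_default", "instances": instances})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_predict_request(req: urllib.request.Request, timeout: float = 5.0) -> list:
    """Send the request and return the 'predictions' field (needs a live server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["predictions"]


# Hypothetical host and model name; nothing is sent over the network here.
req = build_predict_request("localhost:8501", "demo", [[0.5, 1.5, 2.5]])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With a server running, `send_predict_request(req)` should return a list comparable to the local prediction printed earlier.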



## **2. Designing Prediction APIs**

### **2.1. JSON Request Design**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_01.jpg?v=1769540628" width="250">



>* Define clear, stable JSON fields for features
>* Consistent structure reduces ambiguity and integration bugs

>* Support clear single and batched prediction requests
>* Separate optional from required fields and validate inputs

>* Design extensible requests with versions and metadata
>* Keep structure simple, consistent, and language-agnostic



In [None]:
#@title Python Code - JSON Request Design

# This script teaches JSON request design basics.
# We simulate a tiny prediction API using Flask.
# Focus is on clear request structure and validation.

# !pip install flask

# Import standard libraries for JSON handling.
import json
import random
import textwrap

# Import Flask for a simple local style API.
from flask import Flask, request, jsonify

# Create a Flask app instance for demonstration.
app = Flask(__name__)

# Set a deterministic seed for reproducible behavior.
random.seed(42)

# Define required feature names for our fake model.
REQUIRED_FEATURES = ["transaction_amount", "merchant_category", "timestamp"]

# Define optional feature names for extra context.
OPTIONAL_FEATURES = ["user_id", "device_type"]

# Define supported API version identifiers for requests.
SUPPORTED_VERSIONS = ["v1"]

# Build a helper function to validate incoming JSON.
def validate_request(payload: dict) -> tuple[bool, str]:
    # Check that payload is a dictionary like JSON object.
    if not isinstance(payload, dict):
        return False, "Payload must be a JSON object."

    # Extract version field with a safe default value.
    version = payload.get("version", "v1")

    # Ensure the version is supported by this service.
    if version not in SUPPORTED_VERSIONS:
        return False, "Unsupported version in request payload."

    # Extract records list which holds feature dictionaries.
    records = payload.get("records")

    # Ensure records is a non empty list of items.
    if not isinstance(records, list) or len(records) == 0:
        return False, "Field 'records' must be a non empty list."

    # Validate each record for required feature presence.
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            return False, f"Record {index} must be an object."
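        for feature in REQUIRED_FEATURES:
            if feature not in record:
                return False, f"Record {index} is missing required feature: {feature}."
        # Reject non-numeric amounts here so the model code never sees them.
        amount = record["transaction_amount"]
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            return False, f"Record {index}: transaction_amount must be a number."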

    # Return success flag and empty message when valid.
    return True, ""

# Define a fake model prediction using simple logic.
def fake_model_predict(records: list[dict]) -> list[float]:
    # Compute a simple score using transaction amount feature.
    scores = []
    for record in records:
        amount = float(record["transaction_amount"])
        base_score = min(amount / 100.0, 1.0)
        noise = 0.01 * random.random()
        scores.append(round(base_score + noise, 3))
    return scores

# Define the prediction endpoint to handle JSON requests.
@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Parse JSON payload safely from the incoming request.
    payload = request.get_json(silent=True)

    # Validate payload structure and required fields.
    is_valid, message = validate_request(payload)

    # Return error response when validation fails early.
    if not is_valid:
        return jsonify({"error": message}), 400

    # Extract records and optional metadata from payload.
    records = payload["records"]
    metadata = payload.get("metadata", {})

    # Run fake model prediction on provided records.
    scores = fake_model_predict(records)

    # Build response with predictions and echoed metadata.
    response = {"predictions": scores, "metadata": metadata}
    return jsonify(response), 200

# Build a helper to pretty print example JSON payloads.
def show_example_request() -> None:
    # Create a single record with required and optional fields.
    example_record = {
        "transaction_amount": 42.5,
        "merchant_category": "grocery",
        "timestamp": "2024-01-01T12:00:00Z",
        "user_id": "user_123",
        "device_type": "mobile",
    }

    # Wrap record in a batched style request structure.
    example_request = {
        "version": "v1",
        "metadata": {"request_id": "demo_001", "locale": "en_US"},
        "records": [example_record],
    }

    # Convert dictionary to a formatted JSON string.
    json_text = json.dumps(example_request, indent=2)

    # Indent the JSON text so it stands out when printed.
    wrapped = textwrap.indent(json_text, prefix="  ")

    # Print a short explanation and the JSON example.
    print("Example JSON request body for /predict endpoint:")
    print(wrapped)

# Only run demonstration code when executed directly.
if __name__ == "__main__":
    # Print a short description of the API contract.
    print("This demo shows a clear JSON request design.")

    # Show the example request structure for beginners.
    show_example_request()
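
One way to support both single and batched requests, as the bullets above suggest, is to normalize them into one internal shape before validation. The sketch below is only an illustration: the `record` field for single requests is a hypothetical addition, not part of the endpoint defined above.

```python
def normalize_records(payload: dict) -> list[dict]:
    """Return a list of records from either a single or a batched request."""
    has_single = "record" in payload
    has_batch = "records" in payload
    # Exactly one of the two fields must be present.
    if has_single == has_batch:
        raise ValueError("Provide exactly one of 'record' or 'records'.")
    records = [payload["record"]] if has_single else payload["records"]
    if not isinstance(records, list) or not records:
        raise ValueError("Field 'records' must be a non-empty list.")
    if not all(isinstance(item, dict) for item in records):
        raise ValueError("Every record must be a JSON object.")
    return records


single = normalize_records({"record": {"transaction_amount": 42.5}})
batch = normalize_records({"records": [{"transaction_amount": 1.0}, {"transaction_amount": 2.0}]})
print(len(single), len(batch))
```

After this step the rest of the service only ever deals with a list, so validation and prediction code need a single code path.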



### **2.2. Service Layer Preprocessing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_02.jpg?v=1769540659" width="250">



>* Service layer cleans and standardizes messy inputs
>* It protects the model from malformed, inconsistent data

>* Service layer converts raw JSON into features
>* Reapplies training transformations to match model expectations

>* Service layer adds business rules and context
>* Handles logging, rate limits, routing, and evolution



In [None]:
#@title Python Code - Service Layer Preprocessing

# This script shows simple service layer preprocessing.
# We simulate a JSON request and clean its fields.
# Focus on mapping raw inputs to model features.

# !pip install tensorflow==2.20.0

# Import standard libraries and NumPy.
import json
import random
import numpy as np

# Import tensorflow for tensor creation.
import tensorflow as tf

# Set deterministic random seeds.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version once.
print("TensorFlow version:", tf.__version__)

# Define a simple feature specification dictionary.
FEATURE_SPEC = {
    "income": {"required": True, "default": 0.0},
    "employment_years": {"required": True, "default": 0.0},
    "requested_amount": {"required": True, "default": 0.0},
}

# Define a simple currency conversion mapping.
CURRENCY_RATES = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}

# Define a helper to safely get numeric values.
def _safe_float(value, default):
    try:
        return float(value)
    except (TypeError, ValueError):
        return float(default)

# Define a function to normalize numeric features.
def _normalize(value, mean, std):
    if std == 0.0:
        return 0.0
    return (value - mean) / std

# Define training time statistics for normalization.
TRAIN_STATS = {
    "income": {"mean": 50000.0, "std": 20000.0},
    "employment_years": {"mean": 5.0, "std": 3.0},
    "requested_amount": {"mean": 10000.0, "std": 5000.0},
}

# Define the core service layer preprocessing function.
def preprocess_request(raw_json):
    if isinstance(raw_json, str):
        parsed = json.loads(raw_json)
    else:
        parsed = dict(raw_json)
    features = {}
    # Look up the currency rate once. Unknown currencies fall back to the
    # USD rate here; a real service might reject them instead.
    currency = parsed.get("currency", "USD")
    rate = CURRENCY_RATES.get(currency, CURRENCY_RATES["USD"])
    for name, spec in FEATURE_SPEC.items():
        raw_value = parsed.get(name, spec["default"])
        numeric = _safe_float(raw_value, spec["default"])
        # Convert the requested amount to USD before normalizing it.
        if name == "requested_amount":
            numeric *= rate
        stats = TRAIN_STATS[name]
        features[name] = _normalize(numeric, stats["mean"], stats["std"])
    feature_order = ["income", "employment_years", "requested_amount"]
    feature_vector = [features[name] for name in feature_order]
    tensor = tf.convert_to_tensor([feature_vector], dtype=tf.float32)
    if tensor.shape != (1, 3):
        raise ValueError("Unexpected tensor shape in preprocessing.")
    return tensor

# Define a dummy model that expects three input features.
class DummyLoanModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs):
        return self.dense(inputs)


# Instantiate the dummy model once.
model = DummyLoanModel()

# Build the model by calling it with a sample tensor.
sample_input = tf.zeros((1, 3), dtype=tf.float32)
_ = model(sample_input)

# Define a function that simulates the prediction endpoint.
def predict_from_json(request_json):
    features_tensor = preprocess_request(request_json)
    prediction = model(features_tensor)
    return float(prediction.numpy()[0, 0])

# Create a few example JSON like requests.
example_requests = [
    {
        "income": 60000,
        "employment_years": 4,
        "requested_amount": 8000,
        "currency": "USD",
    },
    {
        "income": 45000,
        "employment_years": 2,
        "requested_amount": 7000,
        "currency": "EUR",
    },
]

# Loop through examples and show preprocessing and prediction.
for idx, req in enumerate(example_requests, start=1):
    tensor = preprocess_request(req)
    pred = predict_from_json(req)
    print("Request", idx, "features tensor:", tensor.numpy())
    print("Request", idx, "predicted approval score:", round(pred, 4))

# Print a short message summarizing the lesson.
print("Service layer converted JSON into stable model inputs.")
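
The feature specification above marks fields as required, but the preprocessing code falls back to defaults instead of enforcing that flag. A service layer could check for missing required fields before building features. The following is a small sketch of such a check; the optional `savings` field is hypothetical and exists only to show the contrast.

```python
FEATURE_SPEC = {
    "income": {"required": True, "default": 0.0},
    "employment_years": {"required": True, "default": 0.0},
    "requested_amount": {"required": True, "default": 0.0},
    # Hypothetical optional field, added only for this illustration.
    "savings": {"required": False, "default": 0.0},
}


def find_missing_required(parsed: dict, spec: dict) -> list[str]:
    """List required features that are absent or null in the request."""
    return [
        name
        for name, rule in spec.items()
        if rule["required"] and parsed.get(name) is None
    ]


incomplete_request = {"income": 52000, "requested_amount": 9000}
missing = find_missing_required(incomplete_request, FEATURE_SPEC)
if missing:
    print("Reject with HTTP 400, missing:", missing)
```

Rejecting early is usually safer than silently substituting a default, because a default of `0.0` for a required field produces a confident-looking but meaningless score.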



### **2.3. Interpreting Model Responses**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_02_03.jpg?v=1769540736" width="250">



>* Convert raw model numbers into structured information
>* Include the prediction, alternatives, and metadata for context

>* Expose uncertainty with probabilities and risk levels
>* Balance detail with client needs and usability

>* Separate model scores from business rule decisions
>* Enable auditing, reproducibility, and flexible downstream logic



In [None]:
#@title Python Code - Interpreting Model Responses

# This script shows interpreting model responses.
# We simulate a served classification prediction response.
# Focus is on turning scores into clear decisions.

# Required TensorFlow version is available by default.
# !pip install tensorflow==2.20.0

# Import standard libraries for typing and math.
import json
import math
import random

# Import TensorFlow to align with course context.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one concise line.
print("TensorFlow version:", tf.__version__)

# Define class labels for a simple classifier.
CLASS_LABELS = ["cat", "dog", "rabbit"]

# Define a helper to simulate raw model scores.
def simulate_model_scores(num_classes: int) -> list[float]:
    # Create small positive scores using random values.
    scores = [random.random() + 0.1 for _ in range(num_classes)]
    # Validate that we have at least one score.
    if len(scores) == 0:
        raise ValueError("No scores generated for prediction")
    return scores

# Define a softmax function to convert scores to probabilities.
def softmax(scores: list[float]) -> list[float]:
    # Subtract max for numerical stability in exponent.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    # Compute denominator and avoid division by zero.
    denom = sum(exps)
    if denom <= 0.0:
        raise ValueError("Softmax denominator is not positive")
    return [v / denom for v in exps]

# Define a function to build an interpretable API style response.
def build_prediction_response(probabilities: list[float]) -> dict:
    # Validate probability vector length matches labels.
    if len(probabilities) != len(CLASS_LABELS):
        raise ValueError("Probability length does not match labels")
    # Pair labels with probabilities for ranking.
    label_probs = list(zip(CLASS_LABELS, probabilities))
    # Sort by probability descending for top predictions.
    label_probs.sort(key=lambda x: x[1], reverse=True)

    # Extract top prediction and confidence score.
    top_label, top_prob = label_probs[0]
    # Build a ranked list of alternative predictions.
    alternatives = [
        {"label": label, "probability": round(prob, 4)}
        for label, prob in label_probs
    ]

    # Map probability into a simple risk style bucket.
    if top_prob >= 0.8:
        confidence_level = "high"
    elif top_prob >= 0.5:
        confidence_level = "medium"
    else:
        confidence_level = "low"

    # Build metadata to give context for the decision.
    metadata = {
        "model_version": "v1.0.0",
        "request_id": "req_12345",
        "confidence_level": confidence_level,
    }

    # Build final response separating model and service fields.
    response = {
        "top_prediction": {
            "label": top_label,
            "probability": round(top_prob, 4),
        },
        "alternatives": alternatives,
        "metadata": metadata,
    }
    return response

# Simulate a client request body with simple JSON.
request_body = {"image_id": "img_001", "features": "dummy"}

# Simulate raw model scores as a serving system might return.
raw_scores = simulate_model_scores(num_classes=len(CLASS_LABELS))

# Convert raw scores into probabilities using softmax.
probabilities = softmax(raw_scores)

# Build an interpretable response for the API client.
api_response = build_prediction_response(probabilities=probabilities)

# Pretty print the request and interpreted response as JSON.
print("Request JSON:")
print(json.dumps(request_body, indent=2))
print("Interpreted response JSON:")
print(json.dumps(api_response, indent=2))
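
To illustrate keeping model scores separate from business decisions, the sketch below wraps a model output in a response that records the decision and the thresholds that produced it. The thresholds and action names are assumptions chosen for the example, not values prescribed by the lecture.

```python
def apply_decision_policy(model_output: dict, accept_at: float = 0.8, review_at: float = 0.5) -> dict:
    """Attach a business decision to a model output without altering the score."""
    probability = model_output["probability"]
    if probability >= accept_at:
        action = "auto_accept"
    elif probability >= review_at:
        action = "manual_review"
    else:
        action = "reject"
    return {
        # The model section is passed through untouched for auditing.
        "model": model_output,
        # The decision section records the rule that was applied.
        "decision": {
            "action": action,
            "policy": {"accept_at": accept_at, "review_at": review_at},
        },
    }


result = apply_decision_policy({"label": "dog", "probability": 0.62})
print(result["decision"]["action"])
```

Because the thresholds travel with the response, a later audit can tell whether a changed outcome came from the model or from a policy update.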



## **3. Serving Performance Basics**

### **3.1. Batching for Higher Throughput**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_01.jpg?v=1769540770" width="250">



>* Batching groups many requests for parallel processing
>* Improves hardware use, throughput, and cost efficiency

>* Batching boosts throughput but can increase latency
>* Tune batch size and wait time to balance the two

>* Batching benefits depend on model and workload
>* Use dynamic batching and monitoring to balance tradeoffs



In [None]:
#@title Python Code - Batching for Higher Throughput

# This script demonstrates batching for higher throughput.
# We simulate a served model and measure simple performance.
# Focus is on latency and throughput with batching.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import time
import random

# Import tensorflow and check version.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device string based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")

# Choose CPU if no GPU is available.
if physical_gpus:
    device_name = "/GPU:0"
else:
    device_name = "/CPU:0"

# Define a tiny dense model to simulate serving.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Build the model by calling once on dummy data.
dummy_input = tf.zeros((1, 32), dtype=tf.float32)

# Run a single forward pass to initialize weights.
_ = model(dummy_input)

# Function to create random request tensors.
def make_requests(num_requests, feature_size):
    # Create random normal features for each request.
    data = tf.random.normal((num_requests, feature_size))
    return data

# Function to measure latency and throughput.
def measure_performance(batch_size, total_requests):
    # Validate that total_requests is divisible by batch_size.
    if total_requests % batch_size != 0:
        raise ValueError("total_requests must be divisible by batch_size")

    # Prepare all synthetic requests once.
    all_requests = make_requests(total_requests, 32)

    # Ensure tensor has expected shape before batching.
    if all_requests.shape[1] != 32:
        raise ValueError("Unexpected feature size for requests")

    # Compute number of batches to process.
    num_batches = total_requests // batch_size

    # Warm up model with one batch to stabilize.
    _ = model(all_requests[:batch_size])

    # Measure wall clock time for all batches.
    start = time.perf_counter()

    # Loop over batches and run predictions.
    for i in range(num_batches):
        # Slice the current batch from all_requests.
        batch = all_requests[i * batch_size:(i + 1) * batch_size]
        # Run model prediction inside selected device.
        with tf.device(device_name):
            _ = model(batch, training=False)

    # Compute total elapsed time in seconds.
    elapsed = time.perf_counter() - start

    # Compute average time per request in milliseconds, amortized over
    # all batches. This is not the latency a single request observes.
    avg_latency_ms = (elapsed / total_requests) * 1000.0

    # Compute throughput as requests per second.
    throughput_rps = total_requests / elapsed

    # Return metrics as a simple dictionary.
    return {
        "batch_size": batch_size,
        "total_requests": total_requests,
        "elapsed_sec": elapsed,
        "avg_latency_ms": avg_latency_ms,
        "throughput_rps": throughput_rps,
    }

# Define total number of synthetic requests.
TOTAL_REQUESTS = 512

# Define batch sizes to compare for throughput.
batch_sizes = [1, 8, 32, 128]

# Collect results for each batch size configuration.
results = []

# Measure performance for each batch size choice.
for b in batch_sizes:
    metrics = measure_performance(b, TOTAL_REQUESTS)
    results.append(metrics)

# Print a short header explaining the metrics.
print("Batch size, avg time per request (ms), throughput (req/s)")

# Print one summary line per batch size configuration.
for m in results:
    print(
        f"{m['batch_size']:>9}, "
        f"{m['avg_latency_ms']:.3f}, "
        f"{m['throughput_rps']:.1f}"
    )

# Final line prints a brief interpretation hint.
print("Larger batches usually raise throughput, but each request may wait longer for its batch.")
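
The trade-off between batch size and wait time can be estimated with simple arithmetic. The sketch below assumes requests arrive at a steady rate and that a batch is dispatched when it is full or when its first request has waited for the timeout, whichever comes first. This is a simplified model for intuition only; the function and parameter names are hypothetical.

```python
import math


def batching_estimate(arrival_rps: float, max_batch_size: int, max_wait_ms: float) -> tuple[float, int]:
    """Estimate the first request's queue wait (ms) and the typical batch size."""
    # Time until enough later requests arrive to fill the batch.
    fill_time_ms = (max_batch_size - 1) / arrival_rps * 1000.0
    if fill_time_ms <= max_wait_ms:
        # Traffic is high enough to fill the batch before the timeout.
        return fill_time_ms, max_batch_size
    # Otherwise the timeout fires with a partially filled batch.
    arrived_during_wait = math.floor(arrival_rps * max_wait_ms / 1000.0)
    return max_wait_ms, 1 + arrived_during_wait


low_traffic = batching_estimate(arrival_rps=200, max_batch_size=32, max_wait_ms=10.0)
high_traffic = batching_estimate(arrival_rps=2000, max_batch_size=32, max_wait_ms=50.0)
print("Low traffic (wait ms, batch size):", low_traffic)
print("High traffic (wait ms, batch size):", high_traffic)
```

Under light traffic the timeout dominates and batches stay small, so a long wait adds latency with little throughput gain. Under heavy traffic batches fill quickly and the timeout rarely matters.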



### **3.2. Measuring Latency and Throughput**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_02.jpg?v=1769540803" width="250">



>* Latency and throughput describe speed and capacity
>* They impact user experience and guide scaling decisions

>* Simulate realistic concurrent traffic and record timestamps
>* Compute latency percentiles and throughput as load increases

>* Consider full request path and all dependencies
>* Measure metrics, correlate with resources, locate bottlenecks



In [None]:
#@title Python Code - Measuring Latency and Throughput

# This script demonstrates measuring latency and throughput.
# It uses a tiny TensorFlow model for predictions.
# Focus is on simple timing not advanced serving.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import time
import statistics
import random

# Import TensorFlow and print version.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds.
random.seed(42)
tf.random.set_seed(42)

# Define a small dense model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Compile the model with mean squared error.
model.compile(optimizer="adam", loss="mse")

# Create tiny synthetic training data.
x_train = tf.random.normal(shape=(64, 4))
y_train = tf.reduce_sum(x_train, axis=1, keepdims=True)

# Train briefly with silent output.
model.fit(x_train, y_train, epochs=3, verbose=0)

# Create a single dummy input example.
sample_input = tf.random.normal(shape=(1, 4))

# Validate input shape before timing.
if sample_input.shape != (1, 4):
    raise ValueError("Unexpected input shape for sample_input")

# Define a helper function for one prediction.
def run_single_prediction(model_obj, input_tensor):
    # Run one prediction and return elapsed seconds.
    start = time.perf_counter()
    _ = model_obj(input_tensor, training=False)
    end = time.perf_counter()
    return end - start

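# Warm up once so one-time initialization does not skew the timings.
_ = run_single_prediction(model, sample_input)

# Measure latency for many single requests.
num_requests = 50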
latencies = []
for _ in range(num_requests):
    elapsed = run_single_prediction(model, sample_input)
    latencies.append(elapsed)

# Compute basic latency statistics.
avg_latency = statistics.mean(latencies)
median_latency = statistics.median(latencies)
max_latency = max(latencies)

# Compute simple throughput estimate.
throughput_rps = num_requests / sum(latencies)

# Print latency and throughput summary.
print("Requests:", num_requests)
print("Average latency seconds:", round(avg_latency, 6))
print("Median latency seconds:", round(median_latency, 6))
print("Max latency seconds:", round(max_latency, 6))
print("Throughput requests_per_second:", round(throughput_rps, 2))
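
The bullets above mention concurrent traffic and latency percentiles, which the sequential timing loop does not cover. The sketch below simulates concurrent requests with a thread pool and reports p50, p95, and p99. The handler is a `time.sleep` stand-in for a real network call, and the delay range and concurrency level are assumptions for the example.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(delay_s: float) -> float:
    """Simulate one request and return the time spent inside the handler."""
    start = time.perf_counter()
    time.sleep(delay_s)  # Stand-in for an HTTP or gRPC call to the server.
    return time.perf_counter() - start


def run_load_test(num_requests: int, concurrency: int, seed: int = 42) -> dict:
    """Run simulated requests concurrently and summarize latency and throughput."""
    rng = random.Random(seed)
    delays = [rng.uniform(0.001, 0.005) for _ in range(num_requests)]
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, delays))
    wall_elapsed = time.perf_counter() - wall_start
    # With n=100 the result holds 99 cut points: index 49 is p50, 94 is p95.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49] * 1000.0,
        "p95_ms": cuts[94] * 1000.0,
        "p99_ms": cuts[98] * 1000.0,
        "throughput_rps": num_requests / wall_elapsed,
    }


summary = run_load_test(num_requests=40, concurrency=4)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Throughput here uses wall-clock time for the whole run, so it reflects concurrency. The latencies cover handler time only and exclude time spent queued in the pool, which a fuller load test would also record.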




### **3.3. Scaling Model Replicas**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_09/Lecture_B/image_03_03.jpg?v=1769540828" width="250">



>* Multiple replicas behind load balancer boost throughput
>* Benefits depend on actual system bottlenecks and resources

>* Per-replica latency stays similar as replicas increase
>* More replicas boost throughput and improve tail latency

>* More replicas need good load balancing and coordination
>* Monitor full pipeline metrics to guide scaling
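
A back-of-the-envelope estimate can help decide how many replicas to start with. The sketch below computes a replica count from a target request rate and a measured per-replica throughput, leaving headroom through a utilization target, and then shows how a round-robin balancer would spread requests. All numbers are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import cycle


def replicas_needed(target_rps: float, per_replica_rps: float, target_utilization: float = 0.7) -> int:
    """Estimate the replica count while keeping each replica below full load."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1].")
    usable_rps = per_replica_rps * target_utilization
    return math.ceil(target_rps / usable_rps)


def round_robin_counts(num_requests: int, num_replicas: int) -> Counter:
    """Count how many requests each replica receives under round-robin."""
    replicas = cycle(range(num_replicas))
    return Counter(next(replicas) for _ in range(num_requests))


needed = replicas_needed(target_rps=900, per_replica_rps=150, target_utilization=0.75)
counts = round_robin_counts(num_requests=10, num_replicas=3)
print("Replicas needed:", needed)
print("Requests per replica:", dict(counts))
```

The estimate assumes throughput scales linearly with replicas. Shared bottlenecks such as a database or a feature store can break that assumption, which is why the bullets recommend monitoring the full pipeline.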



# <font color="#418FDE" size="6.5" uppercase>**Serving and APIs**</font>


In this lecture, you learned to:
- Deploy a SavedModel using TensorFlow Serving or a similar serving stack. 
- Expose a prediction API endpoint that accepts JSON inputs and returns model outputs. 
- Evaluate basic performance characteristics of a served model, including latency and throughput. 

In the next module (Module 10), we will cover 'Advanced Topics'.