# <font color="#418FDE" size="6.5" uppercase>**Distribution Basics**</font>

>Last update: 20260126.
    
By the end of this lecture, you will be able to:
- Describe the main tf.distribute strategies available in TensorFlow 2.20.0 and their typical use cases. 
- Explain how data parallelism and replica synchronization work in distributed training. 
- Identify the changes required in model and input code to support distribution strategies. 


## **1. Core Distribution Strategies**

### **1.1. Single Host MirroredStrategy**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_01_01.jpg?v=1769446326" width="250">



>* Copies one model across all GPUs, synchronizing gradients
>* Enables easy multi-GPU training on one machine

>* Use multiple GPUs to speed up training
>* Shared host memory keeps data handling simple

>* Fast on one machine, easy to integrate
>* Limited by single host resources, scales modestly



In [None]:
#@title Python Code - Single Host MirroredStrategy

# This script demonstrates Single Host MirroredStrategy basics.
# It uses MirroredStrategy when several GPUs exist, else the default strategy.
# It keeps output short while showing key distribution behavior.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and distribution utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Detect available GPUs for potential mirroring.
physical_gpus = tf.config.list_physical_devices("GPU")
num_gpus = len(physical_gpus)
print("Detected GPUs:", num_gpus)

# Decide whether to use MirroredStrategy or default.
if num_gpus > 1:
    strategy = tf.distribute.MirroredStrategy()
    strategy_name = "MirroredStrategy"
else:
    strategy = tf.distribute.get_strategy()
    strategy_name = "DefaultStrategy"

# Print chosen strategy and replica count.
print("Using strategy:", strategy_name, "replicas:", strategy.num_replicas_in_sync)

# Load MNIST dataset from Keras utilities.
(x_train, y_train), _ = keras.datasets.mnist.load_data()

# Reduce dataset size for quick demonstration.
x_train = x_train[:6000]
y_train = y_train[:6000]

# Normalize images to float32 in range zero one.
x_train = x_train.astype("float32") / 255.0

# Add channel dimension for convolutional layers.
x_train = np.expand_dims(x_train, axis=-1)

# Validate shapes before building dataset.
print("Train shape:", x_train.shape, "Labels shape:", y_train.shape)

# Create tf.data dataset; under a strategy, model.fit treats this as the global batch size.
batch_size = 128
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(6000, seed=seed_value)
train_ds = train_ds.batch(batch_size)

# Define a simple CNN model builder function.
def create_model():
    inputs = keras.Input(shape=(28, 28, 1))
    x = layers.Conv2D(16, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

# Build and compile model inside strategy scope.
with strategy.scope():
    model = create_model()
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

# Show a brief model summary line manually.
print("Model parameters:", model.count_params())

# Train for a few epochs with silent verbose setting.
history = model.fit(
    train_ds,
    epochs=2,
    verbose=0,
)

# Extract final loss and accuracy from history.
final_loss = history.history["loss"][-1]
final_acc = history.history["accuracy"][-1]

# Print concise training results for comparison.
print("Final loss:", round(float(final_loss), 4))
print("Final accuracy:", round(float(final_acc), 4))

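# Show global and per replica batch sizes under the strategy.
# model.fit splits each dataset batch across replicas, so batch_size is already global.
per_replica_batch = batch_size // strategy.num_replicas_in_sync
print("Global batch size:", batch_size, "per replica:", per_replica_batch)

# Report how many trainable variables the strategy mirrors across replicas.
print("Trainable variables count:", len(model.trainable_variables))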




### **1.2. Multiworker Mirrored Strategy**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_01_02.jpg?v=1769446365" width="250">



>* Scales synchronous training from one machine to clusters
>* Replicates model on workers, aggregates gradients to synchronize

>* Best for clusters of networked GPU machines
>* Speeds training while hiding cross-worker coordination; recovering from worker failures still needs checkpointing

>* More workers increase global batch size, adjust hyperparameters
>* Needs fast, balanced cluster to scale large workloads
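
On a real cluster, every worker process receives the same cluster description but a different task index through the `TF_CONFIG` environment variable. The sketch below only builds those JSON strings so the pattern is visible; `make_tf_config` is a hypothetical helper and the host names are placeholders, not addresses from this lecture.

```python
import json

def make_tf_config(worker_hosts, task_index):
    """Return the TF_CONFIG JSON string for one worker in the cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must point at an entry in worker_hosts")
    config = {
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    }
    return json.dumps(config)

# Hypothetical host:port pairs; replace them with your own machines.
worker_hosts = ["host-a.example:12345", "host-b.example:12345"]

for index in range(len(worker_hosts)):
    # Each worker process would export this string as os.environ["TF_CONFIG"]
    # before creating tf.distribute.MultiWorkerMirroredStrategy().
    print(f"worker {index}:", make_tf_config(worker_hosts, index))
```

The cluster section is identical on all workers; only the task index changes, which is how each process learns its own role.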



In [None]:
#@title Python Code - Multiworker Mirrored Strategy

# This script introduces MultiWorkerMirroredStrategy basics.
# It runs a one-worker cluster inside a single local process.
# It keeps training tiny and output very compact.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import json
import random

# Import TensorFlow and check version.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
random.seed(7)
os.environ["PYTHONHASHSEED"] = "7"

# Set TensorFlow random seed deterministically.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Explain that the cluster below contains a single local worker.
print("Running a one-worker cluster in a single process.")

# Build a simple JSON cluster specification.
cluster_spec = {"worker": ["localhost:12345"]}

# Convert cluster specification to JSON string.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster_spec,
    "task": {"type": "worker", "index": 0}
})

# Create the MultiWorkerMirroredStrategy instance.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Print how many replicas are in sync.
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Prepare a tiny synthetic dataset for training.
features = tf.random.normal(shape=(64, 4))

# Create small integer labels for classification.
labels = tf.random.uniform(shape=(64,), maxval=3, dtype=tf.int32)

# Zip features and labels into a dataset.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Shuffle and batch the dataset for training.
base_ds = base_ds.shuffle(64, seed=7).batch(8)

# Distribute the dataset across strategy replicas.
dist_ds = strategy.experimental_distribute_dataset(base_ds)

# Define a simple model building function.
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    return model

# Create optimizer and loss objects for training.
with strategy.scope():
    model = build_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction="none"  # keep per-example losses; they are averaged manually below
    )

# Define a function to compute the scaled per replica loss.
def compute_loss(labels, predictions):
    per_example_loss = loss_obj(labels, predictions)
    # The dataset is batched with 8 examples, so 8 is the global batch size.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=8
    )

# Define one distributed training step function.
@tf.function
def distributed_train_step(dist_inputs):
    def replica_step(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            preds = model(x, training=True)
            loss = compute_loss(y, preds)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_losses = strategy.run(replica_step, args=(dist_inputs,))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None
    )

# Run a very small training loop for demonstration.
for epoch in range(2):
    total_loss = 0.0
    num_batches = 0
    for batch in dist_ds:
        loss = distributed_train_step(batch)
        total_loss += loss.numpy()
        num_batches += 1
    avg_loss = total_loss / float(num_batches)
    print("Epoch", epoch, "average loss:", round(avg_loss, 4))

# Evaluate the model briefly on a few samples.
small_batch = features[:8]

# Get predictions from the trained model.
preds = model(small_batch, training=False)

# Print predicted class indices for quick inspection.
print("Predicted classes:", tf.argmax(preds, axis=1).numpy())



### **1.3. TPU Strategy Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_01_03.jpg?v=1769453669" width="250">



>* Uses TPUs to train huge models efficiently
>* Automatically replicates models and hides hardware details

>* TPUs run via managed cloud, preconfigured environments
>* Strategy handles cores, suits large matrix-heavy models

>* Best for large, high-throughput, TPU-shaped workloads
>* Requires TPU-friendly batches; runtime manages orchestration



In [None]:
#@title Python Code - TPU Strategy Basics

# This script introduces basic TPU distribution strategy.
# It runs safely even without real TPU hardware.
# Focus on concepts using a tiny Keras example.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Set deterministic seeds for reproducibility.
random.seed(7)
np.random.seed(7)

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Try to detect and initialize a TPU if available.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
    device_type = "TPU"
except Exception:
    strategy = tf.distribute.get_strategy()
    device_type = "CPU_or_GPU"

# Print which distribution strategy is being used.
print("Using strategy:", type(strategy).__name__, "on", device_type)

# Load a tiny subset of MNIST for quick training.
(x_train, y_train), _ = keras.datasets.mnist.load_data()

# Keep only a small number of samples.
x_train = x_train[:2048]
y_train = y_train[:2048]

# Normalize and add channel dimension.
x_train = x_train.astype("float32") / 255.0
x_train = np.expand_dims(x_train, axis=-1)

# Validate shapes before building the model.
print("Train shape:", x_train.shape, "Labels:", y_train.shape)

# Create a simple model building function.
def create_model(input_shape, num_classes):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        keras.layers.Conv2D(16, (3, 3), activation="relu"),
        keras.layers.Flatten(),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Define batch size and epochs for quick runs.
GLOBAL_BATCH_SIZE = 128
EPOCHS = 2

# Build and compile the model inside strategy scope.
with strategy.scope():
    model = create_model(input_shape=x_train.shape[1:],
                         num_classes=10)

# Train silently to avoid long logs.
history = model.fit(
    x_train,
    y_train,
    batch_size=GLOBAL_BATCH_SIZE,
    epochs=EPOCHS,
    verbose=0,
)

# Fetch final loss and accuracy from history.
final_loss = history.history["loss"][-1]
final_acc = history.history["accuracy"][-1]

# Print a short summary of training results.
print("Final loss:", round(float(final_loss), 4))
print("Final accuracy:", round(float(final_acc), 4))

# Show how many replicas the strategy is using.
print("Number of replicas in sync:", strategy.num_replicas_in_sync)




## **2. Data Parallelism Basics**

### **2.1. TensorFlow Replicas Explained**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_02_01.jpg?v=1769453741" width="250">



>* Replicas are identical model copies running in parallel
>* Each replica handles different data, enabling scalable training

>* Replicas run the same step on different data
>* Strategy coordinates, syncs results as one big model

>* Each replica sees only its own mini-batch
>* Together replicas approximate large-batch training, hiding details
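
The claim that replicas together approximate large-batch training rests on one piece of arithmetic: each replica divides its summed loss by the global batch size, and the per-replica results are then added. Here is a minimal sketch in plain Python with made-up loss values; `scaled_replica_loss` is a hypothetical helper, not a TensorFlow function.

```python
def scaled_replica_loss(per_example_losses, global_batch_size):
    """Sum one replica's per-example losses and divide by the GLOBAL batch size."""
    return sum(per_example_losses) / global_batch_size

# Per-example losses for one global batch of 8 examples (made-up numbers).
losses = [0.5, 1.5, 2.0, 4.0, 1.0, 3.0, 0.25, 3.75]
global_batch_size = len(losses)

# Two replicas each receive half of the global batch.
replica_0, replica_1 = losses[:4], losses[4:]

# Adding the scaled per-replica losses recovers the global mean.
summed = (scaled_replica_loss(replica_0, global_batch_size)
          + scaled_replica_loss(replica_1, global_batch_size))
global_mean = sum(losses) / global_batch_size

# Dividing by the per-replica batch size instead overstates the loss
# by a factor equal to the number of replicas.
wrong = sum(replica_0) / len(replica_0) + sum(replica_1) / len(replica_1)

print("sum of scaled replica losses:", summed)
print("global mean loss:", global_mean)
print("sum of per-replica means:", wrong)
```

This is the reason a custom training loop pairs a global-batch-scaled loss with a sum reduction across replicas.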



In [None]:
#@title Python Code - TensorFlow Replicas Explained

# This script explains TensorFlow replicas conceptually.
# It uses MirroredStrategy with a tiny example.
# Focus is on data parallelism and synchronization.

# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow.
import tensorflow as tf

# Set a deterministic random seed value.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Detect available logical GPUs for demonstration.
logical_gpus = tf.config.list_logical_devices("GPU")

# Decide number of devices used by strategy.
num_devices = max(len(logical_gpus), 1)

# Print how many devices will host replicas.
print("Number of devices for replicas:", num_devices)

# Create a simple MirroredStrategy for data parallelism.
strategy = tf.distribute.MirroredStrategy()

# Print how many replicas the strategy will manage.
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

# Define global batch size based on replica count.
global_batch_size = 8 * strategy.num_replicas_in_sync

# Create a tiny synthetic dataset for demonstration.
features = tf.range(0, global_batch_size, dtype=tf.float32)

# Create simple labels as double the features.
labels = features * 2.0

# Validate shapes to avoid broadcasting surprises.
assert features.shape == labels.shape

# Build a tf.data.Dataset from tensors.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Batch the dataset using the global batch size.
base_ds = base_ds.batch(global_batch_size)

# Distribute the dataset across replicas automatically.
dist_ds = strategy.experimental_distribute_dataset(base_ds)

# Define a very small linear model inside strategy scope.
with strategy.scope():
    # Create a simple Dense model for regression.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(1)
    ])

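# Create the loss and optimizer inside the scope so optimizer variables are distributed too.
with strategy.scope():
    # Keep per-example losses; they are averaged over the global batch below.
    loss_obj = tf.keras.losses.MeanSquaredError(reduction="none")

    # Create an optimizer with a small learning rate.
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)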

# Define per replica loss scaled by global batch size.
def compute_loss(labels_replica, preds_replica):
    # Compute unreduced loss for each example.
    per_example_loss = loss_obj(labels_replica, preds_replica)
    # Scale loss by global batch size value.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=global_batch_size
    )

# Define one training step run on each replica.
@tf.function
def train_step(dist_inputs):
    # Unpack distributed features and labels.
    dist_x, dist_y = dist_inputs

    # Define replica computation using strategy.run.
    def replica_step(x_replica, y_replica):
        # Reshape inputs to match model expectations.
        x_replica = tf.reshape(x_replica, (-1, 1))
        y_replica = tf.reshape(y_replica, (-1, 1))
        # Forward pass to compute predictions.
        with tf.GradientTape() as tape:
            preds = model(x_replica, training=True)
            loss = compute_loss(y_replica, preds)
        # Compute gradients for model variables.
        grads = tape.gradient(loss, model.trainable_variables)
        # Apply gradients to update shared weights.
        optimizer.apply_gradients(
            zip(grads, model.trainable_variables)
        )
        # Return the per replica loss value.
        return loss

    # Run replica_step on each replica in parallel.
    per_replica_losses = strategy.run(
        replica_step, args=(dist_x, dist_y)
    )

    # Reduce losses to get a single scalar value.
    mean_loss = strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None
    )

    # Return the synchronized mean loss.
    return mean_loss

# Take one batch from the distributed dataset.
for batch in iter(dist_ds):
    # Run a single distributed training step.
    initial_loss = float(train_step(batch))
    break

# Print the loss computed in this step, which uses the weights before the update.
print("Loss in first distributed step:", round(initial_loss, 4))

# Show current model weight to illustrate shared parameters.
current_weight = float(model.layers[0].kernel[0, 0])

# Print the shared weight value used by all replicas.
print("Shared model weight after update:", round(current_weight, 4))



### **2.2. All Reduce Gradients**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_02_02.jpg?v=1769453845" width="250">



>* All-reduce gathers and combines gradients from replicas
>* Shared gradients keep all model copies synchronized

>* All-reduce combines gradients then shares results everywhere
>* Different algorithms trade communication cost and performance

>* All-reduce makes replicas act like one model
>* Gradients are averaged so weights match after updates
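
To see why combined gradients keep the copies identical, the toy sketch below simulates two replicas of the model y = w·x in plain Python. `local_gradient` and `all_reduce_sum` are illustrative names; real all-reduce implementations use collective communication between devices rather than a Python `sum`.

```python
def local_gradient(w, xs, ys, global_batch_size):
    """Gradient of sum((w*x - y)^2) / global_batch_size on one replica's shard."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / global_batch_size

def all_reduce_sum(values):
    """Toy all-reduce: every replica ends up holding the same total."""
    total = sum(values)
    return [total for _ in values]

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
global_batch_size, learning_rate = len(xs), 0.01
replica_weights = [0.0, 0.0]                    # identical copies of w
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]   # each replica sees half the batch

# Each replica computes a gradient from its own shard only.
local_grads = [local_gradient(w, sx, sy, global_batch_size)
               for w, (sx, sy) in zip(replica_weights, shards)]

# After the all-reduce, every replica applies the same combined gradient.
reduced = all_reduce_sum(local_grads)
replica_weights = [w - learning_rate * g for w, g in zip(replica_weights, reduced)]

full_batch_grad = local_gradient(0.0, xs, ys, global_batch_size)
print("local gradients:", local_grads)
print("all-reduced gradient:", reduced[0], "full-batch gradient:", full_batch_grad)
print("weights after update:", replica_weights)
```

Because the loss is scaled by the global batch size, summing the local gradients gives the same value as one device processing the whole batch, so the weights stay equal after the update.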



In [None]:
#@title Python Code - All Reduce Gradients

# This script illustrates all reduce gradient behavior.
# It uses tf.distribute.MirroredStrategy with simple data.
# Focus is on synchronized gradient averaging across replicas.

# !pip install tensorflow==2.20.0

# Import TensorFlow and basic utilities.
import tensorflow as tf
import numpy as np
import os

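# Set deterministic seeds for reproducibility.
# Recent TensorFlow releases favor tf.config.experimental.enable_op_determinism() over this variable.
os.environ["TF_DETERMINISTIC_OPS"] = "1"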
np.random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a MirroredStrategy for data parallel training.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Define global batch size and per replica batch size.
global_batch_size = 4
per_replica_batch = global_batch_size // max(strategy.num_replicas_in_sync, 1)

# Create tiny synthetic features and labels.
features = tf.constant([[1.0], [2.0], [3.0], [4.0]], dtype=tf.float32)
labels = tf.constant([[2.0], [4.0], [6.0], [8.0]], dtype=tf.float32)

# Validate shapes before building dataset.
assert features.shape[0] == labels.shape[0]
assert features.shape[0] == global_batch_size

# Build a small dataset and batch it once.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.batch(global_batch_size, drop_remainder=True)

# Distribute the dataset across replicas.
dist_dataset = strategy.experimental_distribute_dataset(dataset)

# Define a simple linear model inside strategy scope.
with strategy.scope():
    # Single dense layer represents y = wx + b.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(1, use_bias=True)
    ])

    # Use mean squared error loss and keep per-example values.
    loss_obj = tf.keras.losses.MeanSquaredError(
        reduction="none"
    )

    # Use a simple SGD optimizer.
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Define a function to compute per example loss.
def compute_loss(labels, predictions):
    per_example_loss = loss_obj(labels, predictions)
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=global_batch_size
    )

# Define one training step run on each replica.
@tf.function
def distributed_train_step(dist_inputs):
    # Run the replica step on each replica.
    per_replica_losses, per_replica_grads = strategy.run(
        replica_step, args=(dist_inputs,)
    )

    # Reduce losses across replicas for reporting.
    mean_loss = strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None
    )

    # The returned gradients are each replica's local values;
    # optimizer.apply_gradients combines them across replicas internally.
    return mean_loss, per_replica_grads

# Define the replica computation including gradient calculation.
def replica_step(inputs):
    features_replica, labels_replica = inputs
    with tf.GradientTape() as tape:
        predictions = model(features_replica, training=True)
        loss = compute_loss(labels_replica, predictions)

    # Compute gradients of loss with respect to trainable variables.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, grads

# Take one batch from the distributed dataset.
for batch in dist_dataset:
    first_batch = batch
    break

# Run a single distributed training step.
mean_loss_value, per_replica_grads = distributed_train_step(first_batch)

# Collect gradients from each replica for the kernel weight.
kernel_grads = strategy.experimental_local_results(per_replica_grads[0])

# Print mean loss after one synchronized step.
print("Mean loss after one step:", float(mean_loss_value))

# Print per replica gradient values for the kernel.
for idx, g in enumerate(kernel_grads):
    print("Replica", idx, "kernel grad:", float(g.numpy()[0][0]))

# Print final shared kernel weight to show synchronization.
print("Shared kernel weight:", float(model.trainable_variables[0].numpy()[0][0]))



### **2.3. Sync and Async Training**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_02_03.jpg?v=1769453881" width="250">



>* Synchronous training updates shared weights in lockstep
>* All replicas match weights; speed limited by stragglers

>* Asynchronous training lets replicas update shared weights independently, improving utilization
>* Stale weights add noise, slowing and complicating convergence

>* Synchronous training dominates on tightly connected accelerators
>* Asynchronous suits loosely coupled systems, trading stability for throughput
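
The effect of stale weights can be illustrated with a deliberately tiny simulation. The sketch below minimizes 0.5·(w − 3)² in plain Python: the synchronous loop always computes gradients on the current weights, while the asynchronous loop lets each replica push a gradient computed on the weights it read before the other replica's latest update. It is a toy model under these assumptions, not how TensorFlow schedules asynchronous training.

```python
def grad(w, target=3.0):
    """Gradient of 0.5 * (w - target)**2."""
    return w - target

def sync_training(steps, lr, num_replicas=2):
    w = 0.0
    for _ in range(steps):
        # All replicas read the same weights, then gradients are averaged.
        grads = [grad(w) for _ in range(num_replicas)]
        w -= lr * sum(grads) / num_replicas
    return w

def async_training(steps, lr, num_replicas=2):
    w = 0.0
    snapshots = [w] * num_replicas  # weights each replica last read
    for step in range(steps):
        r = step % num_replicas
        # Replica r pushes a gradient computed on its possibly stale snapshot...
        w -= lr * grad(snapshots[r])
        # ...and only then refreshes its copy of the shared weights.
        snapshots[r] = w
    return w

sync_w = sync_training(steps=20, lr=0.5)
async_w = async_training(steps=20, lr=0.5)
print("sync  distance from optimum:", abs(sync_w - 3.0))
print("async distance from optimum:", abs(async_w - 3.0))
```

With the same number of weight updates, the asynchronous run ends further from the optimum in this setup because each update is based on slightly outdated weights.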



## **3. Adjusting Code for Distribution**

### **3.1. Strategy Scope Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_03_01.jpg?v=1769453896" width="250">



>* Strategy scope controls variable placement and execution
>* Inside scope is distributed; outside remains regular

>* Create models and variables inside strategy scope
>* Keep local utilities outside to avoid sync bugs

>* Put variable and gradient logic inside scope
>* Keep replica-aware metrics and steps under strategy



In [None]:
#@title Python Code - Strategy Scope Basics

# This script shows basic strategy scope usage.
# It compares non distributed and distributed model creation.
# Focus on where we enter and exit strategy scope.

# Uncomment next line if TensorFlow is not installed.
# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow.
import tensorflow as tf

# Set a deterministic global random seed.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define a simple function that builds a small model.
def build_simple_model(input_shape, num_classes):
    # Create a sequential model with two dense layers.
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(shape=input_shape),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

    # Compile the model with optimizer and loss.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Prepare a tiny dummy dataset for quick demonstration.
num_samples, num_features, num_classes = 32, 8, 3

# Create random features and integer labels.
features = tf.random.normal(shape=(num_samples, num_features))
labels = tf.random.uniform(
    shape=(num_samples,), minval=0, maxval=num_classes, dtype=tf.int32
)

# Validate shapes before using the tensors.
assert features.shape == (num_samples, num_features)
assert labels.shape == (num_samples,)

# Build and train a model without any distribution strategy.
plain_model = build_simple_model((num_features,), num_classes)

# Train briefly with verbose set to zero.
plain_model.fit(features, labels, epochs=1, batch_size=8, verbose=0)

# Evaluate once and print a compact result line.
plain_loss, plain_acc = plain_model.evaluate(
    features, labels, verbose=0
)
print("Plain model accuracy:", round(float(plain_acc), 4))

# Create a MirroredStrategy for simple data parallel training.
strategy = tf.distribute.MirroredStrategy()

# Enter the strategy scope for distributed variable creation.
with strategy.scope():
    dist_model = build_simple_model((num_features,), num_classes)

# Train the distributed model using the same data.
dist_model.fit(features, labels, epochs=1, batch_size=8, verbose=0)

# Evaluate the distributed model silently.
dist_loss, dist_acc = dist_model.evaluate(
    features, labels, verbose=0
)

# Print a short comparison of both accuracies.
print("Distributed model accuracy:", round(float(dist_acc), 4))

# Show that both models share the same input data shape.
print("Input feature shape:", features.shape)



### **3.2. Distributed input pipelines**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_03_02.jpg?v=1769453931" width="250">



>* Distributed training requires a shared, central dataset pipeline
>* Strategy splits global batches and routes data automatically

>* Set a global batch, split across replicas
>* Build and transform dataset once before distribution

>* Use shared datasets with deterministic sharding per worker
>* Centralize randomness to avoid inconsistent, leaky training
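
The idea of splitting one logical dataset first per worker and then per replica can be pictured without TensorFlow. The sketch below is a simplified picture under assumed sizes (2 workers with 2 replicas each); `shard` mirrors the every-n-th-element behavior of `tf.data.Dataset.shard`, and `batches` is a hypothetical helper. It is not the exact algorithm tf.distribute uses internally.

```python
def shard(examples, num_shards, index):
    """Keep every num_shards-th example starting at index."""
    return examples[index::num_shards]

def batches(examples, batch_size):
    """Yield consecutive full batches, dropping any remainder."""
    for start in range(0, len(examples) - batch_size + 1, batch_size):
        yield examples[start:start + batch_size]

examples = list(range(16))
num_workers, replicas_per_worker, global_batch_size = 2, 2, 8
per_worker_batch = global_batch_size // num_workers
per_replica_batch = per_worker_batch // replicas_per_worker

for worker in range(num_workers):
    # Each worker keeps a disjoint slice of the dataset.
    worker_examples = shard(examples, num_workers, worker)
    for batch in batches(worker_examples, per_worker_batch):
        # The worker's batch is split again across its local replicas.
        replica_parts = [batch[i:i + per_replica_batch]
                         for i in range(0, len(batch), per_replica_batch)]
        print(f"worker {worker}: {replica_parts}")
```

Every example appears on exactly one worker, and each training step still consumes one global batch of 8 examples in total.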



In [None]:
#@title Python Code - Distributed input pipelines

# This script shows distributed input pipeline basics.
# It compares non distributed and distributed dataset usage.
# Focus is on tf.distribute and batching behavior.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and check version.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Detect available GPUs for potential distribution.
physical_gpus = tf.config.list_physical_devices("GPU")
num_gpus = len(physical_gpus)
print("Number of GPUs detected:", num_gpus)

# Create a simple synthetic dataset using NumPy.
num_samples = 32
features = np.arange(num_samples, dtype=np.float32).reshape((-1, 1))
labels = (features * 2.0).astype(np.float32)

# Wrap NumPy arrays into a tf.data Dataset.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Define a small global batch size for demonstration.
global_batch_size = 8

# Build the input pipeline with shuffle and batch.
train_ds = (base_ds.shuffle(buffer_size=num_samples, seed=seed_value)
            .batch(global_batch_size)
            .prefetch(tf.data.AUTOTUNE))

# Show one batch shape in the non distributed case.
for batch_x, batch_y in train_ds.take(1):
    print("Single device batch shape:", batch_x.shape)

# Choose a distribution strategy based on available GPUs.
if num_gpus > 1:
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")

# Print the number of replicas in the strategy.
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Create a distributed dataset from the base dataset.
# Building the pipeline needs no strategy scope; only variable creation does.
dist_ds = (base_ds.shuffle(buffer_size=num_samples, seed=seed_value)
           .batch(global_batch_size)
           .prefetch(tf.data.AUTOTUNE))
dist_ds = strategy.experimental_distribute_dataset(dist_ds)

# Inspect one distributed batch and its per replica pieces.
# A distributed dataset has no take(), so pull one batch from an iterator.
dist_batch_x, dist_batch_y = next(iter(dist_ds))

# experimental_local_results returns one tensor per local replica.
replica_parts = strategy.experimental_local_results(dist_batch_x)
per_replica_shapes = [tuple(part.shape) for part in replica_parts]

# The global batch is the concatenation of the per replica pieces.
global_rows = sum(shape[0] for shape in per_replica_shapes)
print("Global batch shape:", (global_rows,) + per_replica_shapes[0][1:])
print("Per replica shapes:", per_replica_shapes)

# Build a tiny model inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer="sgd", loss="mse")

# Train briefly to show the pipeline works with strategy.
history = model.fit(dist_ds, epochs=1, steps_per_epoch=2, verbose=0)

# Print final loss from the short distributed training.
print("Final distributed training loss:", float(history.history["loss"][-1]))



### **3.3. Batch Size Semantics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_08/Lecture_A/image_03_03.jpg?v=1769454005" width="250">



>* Understand batch size meaning changes with distribution
>* Distinguish global and per-replica batch sizes

>* Set a global batch; strategy splits automatically
>* Optimizer updates per global batch keep training consistent

>* Larger global batches change optimization and generalization behavior
>* Balance device memory, throughput, and training dynamics carefully
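
The relationship between the two batch sizes, and one common way to adjust the learning rate as the global batch grows, can be written as two small functions. This is a sketch under assumptions: `split_global_batch` and `scaled_learning_rate` are hypothetical helpers, the even-split check is a choice made here for clarity, and linear learning-rate scaling is a widely used heuristic rather than a rule this lecture prescribes.

```python
def split_global_batch(global_batch_size, num_replicas):
    """Return the per-replica batch size, insisting on an even split."""
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size should divide evenly across replicas")
    return global_batch_size // num_replicas

def scaled_learning_rate(base_lr, base_batch_size, global_batch_size):
    """Linear scaling heuristic: grow the learning rate with the global batch."""
    return base_lr * global_batch_size / base_batch_size

per_replica = 32  # assumed number of examples one device can hold
for num_replicas in (1, 2, 4, 8):
    global_batch = per_replica * num_replicas
    lr = scaled_learning_rate(0.001, per_replica, global_batch)
    print(f"replicas={num_replicas} global_batch={global_batch} "
          f"per_replica={split_global_batch(global_batch, num_replicas)} lr={lr:.4f}")
```

Treat the scaled learning rate as a starting point for tuning; large global batches can still change convergence and generalization.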



In [None]:
#@title Python Code - Batch Size Semantics

# This script illustrates batch size semantics simply.
# It compares global and per replica batch sizes clearly.
# It uses MirroredStrategy with a tiny synthetic dataset.

# !pip install tensorflow==2.20.0

# Import required modules for TensorFlow and numpy.
import os
import numpy as np
import tensorflow as tf

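# Set deterministic seeds for reproducible behavior.
# Recent TensorFlow releases favor tf.config.experimental.enable_op_determinism() over this variable.
os.environ["TF_DETERMINISTIC_OPS"] = "1"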
np.random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one concise line.
print("TensorFlow version:", tf.__version__)

# Detect available GPUs and print a short summary.
physical_gpus = tf.config.list_physical_devices("GPU")
print("Num GPUs detected:", len(physical_gpus))

# Choose strategy based on GPU availability for clarity.
if len(physical_gpus) > 0:
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = tf.distribute.OneDeviceStrategy("/cpu:0")

# Print the number of replicas used by the strategy.
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Define a small global batch size for demonstration.
global_batch_size = 8
print("Global batch size:", global_batch_size)

# Compute per replica batch size from global and replicas.
per_replica_batch = global_batch_size // strategy.num_replicas_in_sync
print("Per replica batch size:", per_replica_batch)

# Create a tiny synthetic dataset of simple features.
num_examples = 32
features = np.arange(num_examples, dtype="float32").reshape(-1, 1)
labels = (features * 0.5 + 1.0).astype("float32")

# Validate shapes before building the TensorFlow dataset.
assert features.shape[0] == labels.shape[0]
assert features.shape[1] == 1 and labels.shape[1] == 1

# Build a tf.data.Dataset and batch with global size.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))
base_ds = base_ds.batch(global_batch_size, drop_remainder=True)

# Distribute the dataset using the chosen strategy.
dist_ds = strategy.experimental_distribute_dataset(base_ds)

# Build a minimal model inside the strategy scope.
with strategy.scope():
    # Create a minimal dense model for demonstration.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(1)
    ])

# Compile the model with a basic optimizer and loss.
with strategy.scope():
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss="mse",
        run_eagerly=False
    )

# Train for a single epoch with silent logging.
history = model.fit(
    dist_ds,
    epochs=1,
    verbose=0
)

# Take one distributed batch and inspect per replica shapes.
for batch_features, batch_labels in iter(dist_ds):
    per_replica_features = batch_features
    per_replica_labels = batch_labels
    break

# Convert per replica tensors to a list for inspection.
feature_parts = strategy.experimental_local_results(per_replica_features)
label_parts = strategy.experimental_local_results(per_replica_labels)

# Print how the global batch is split across replicas.
print("Number of feature parts:", len(feature_parts))
print("Shape of first part:", feature_parts[0].shape)
global_rows = sum(int(part.shape[0]) for part in feature_parts)
print("Shape of global batch:", (global_rows,) + tuple(feature_parts[0].shape[1:]))

# Print a small sample to confirm correct slicing behavior.
print("First replica feature sample:", feature_parts[0][0].numpy())



# <font color="#418FDE" size="6.5" uppercase>**Distribution Basics**</font>


In this lecture, you learned to:
- Describe the main tf.distribute strategies available in TensorFlow 2.20.0 and their typical use cases. 
- Explain how data parallelism and replica synchronization work in distributed training. 
- Identify the changes required in model and input code to support distribution strategies. 

In the next lecture (Lecture B), we will cover "Implementing Strategies".