# <font color="#418FDE" size="6.5" uppercase>**DDP Fundamentals**</font>

>Last update: 20260130.
    
By the end of this lecture, you will be able to:
- Describe the data‑parallel training paradigm and how DDP synchronizes gradients across processes. 
- Configure a basic DDP training script using torchrun, process groups, and DistributedDataParallel wrappers. 
- Use DistributedSampler with DataLoader to ensure each process sees a unique subset of data. 


## **1. Data Parallel Training**

### **1.1. Model Replication Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_01_01.jpg?v=1769766636" width="250">



>* Same model copied to many devices, data split
>* Replicas train in parallel, scaling without redesigning model

>* All replicas start from identical initial weights
>* Each replica computes local gradients, then combines updates

>* Full model copies need significant device memory
>* Syncing updates keeps all replicas on one consistent global model
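
The memory cost of keeping a full copy per device can be estimated with simple arithmetic. The sketch below is a rough rule of thumb, not a measurement: it assumes fp32 values (4 bytes each) and an Adam-style optimizer that keeps two extra tensors per parameter, and it ignores activations and framework overhead. The function name `replica_memory_gib` and the 1-billion-parameter model are hypothetical.

```python
def replica_memory_gib(num_params: int,
                       bytes_per_value: int = 4,
                       optimizer_states_per_param: int = 2) -> float:
    """Estimate per-replica memory for weights, gradients, and optimizer state.

    Activations and framework overhead are deliberately ignored.
    """
    # One copy for weights, one for gradients, plus the optimizer's state tensors.
    copies = 1 + 1 + optimizer_states_per_param
    return num_params * bytes_per_value * copies / 1024**3


# Hypothetical 1-billion-parameter model trained in fp32 with Adam.
per_replica = replica_memory_gib(1_000_000_000)
print(f"Per-replica estimate: {per_replica:.1f} GiB")
```

Because every DDP process holds a full replica, this estimate applies to each device separately; adding devices speeds up training but does not shrink the per-device footprint.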



### **1.2. Gradient Sync Mechanics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_01_02.jpg?v=1769766650" width="250">



>* Each process computes gradients on its own data
>* Collective ops average gradients, mimicking one big batch

>* Gradients are all-reduced bucket by bucket
>* Averaged gradients mimic one large batch update

>* Autograd hooks trigger automatic, overlapping gradient all-reduce
>* Training loop stays single-device-like while scaling
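
To make "bucket by bucket" concrete, here is a simplified sketch of how gradients might be grouped before each all-reduce. It assumes parameters are visited in reverse of their forward order (gradients for the last layers are ready first) and uses a 25 MB cap, which mirrors DDP's documented default `bucket_cap_mb`. The helper `assign_buckets` and the layer sizes are hypothetical; the real bucketing logic has more detail than this.

```python
from typing import List, Tuple


def assign_buckets(param_sizes_mb: List[Tuple[str, float]],
                   cap_mb: float = 25.0) -> List[List[str]]:
    """Group parameter gradients into buckets, walking parameters in reverse."""
    buckets: List[List[str]] = []
    current: List[str] = []
    current_mb = 0.0
    for name, size_mb in reversed(param_sizes_mb):
        # Close the current bucket when this gradient would push it past the cap.
        if current and current_mb + size_mb > cap_mb:
            buckets.append(current)
            current, current_mb = [], 0.0
        current.append(name)
        current_mb += size_mb
    if current:
        buckets.append(current)
    return buckets


# Hypothetical layer sizes in MB, listed in forward order.
layers = [("embed", 20.0), ("block1", 12.0), ("block2", 12.0), ("head", 4.0)]
buckets = assign_buckets(layers)
for i, names in enumerate(buckets):
    print(f"bucket {i}: {names}")
```

Each bucket can be all-reduced as soon as its last gradient is ready, which is what lets communication overlap with the rest of the backward pass.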



In [None]:
#@title Python Code - Gradient Sync Mechanics

# This script illustrates gradient synchronization basics.
# We simulate two workers averaging simple scalar gradients.
# Focus on clear prints instead of heavy computation.
# Required standard library import for this demo.
import random

# Set deterministic random seed for reproducibility.
random.seed(42)
# Define a tiny linear model with one weight.
class TinyLinear:
    # Initialize model with a single weight parameter.
    def __init__(self, initial_weight: float = 0.0):
        self.weight = float(initial_weight)

    # Forward pass computing prediction from input value.
    def forward(self, x_value: float) -> float:
        return self.weight * float(x_value)

    # Compute gradient of mean squared error loss.
    def grad_mse(self, x_value: float, y_target: float) -> float:
        prediction = self.forward(x_value)
        error = prediction - float(y_target)
        return 2.0 * error * float(x_value)


# Simulate one worker computing gradient on local data.
def worker_compute_grad(worker_id: int, model_weight: float) -> float:
    # Create local model replica with shared initial weight.
    model = TinyLinear(initial_weight=model_weight)
    # Define tiny local dataset for this worker.
    if worker_id == 0:
        x_values = [1.0, 2.0]
        y_values = [2.0, 4.0]
    else:
        x_values = [3.0, 4.0]
        y_values = [6.0, 8.0]

    # Validate dataset lengths before computing gradients.
    assert len(x_values) == len(y_values)

    # Accumulate gradient over local mini batch.
    total_grad = 0.0
    for x_value, y_target in zip(x_values, y_values):
        total_grad += model.grad_mse(x_value, y_target)

    # Return average gradient over local mini batch.
    return total_grad / float(len(x_values))


# Simulate all reduce averaging across two workers.
def all_reduce_mean(gradients):
    # Validate there is at least one gradient value.
    assert len(gradients) > 0
    # Compute summed gradients across all workers.
    grad_sum = sum(gradients)
    # Return averaged gradient representing synchronized value.
    return grad_sum / float(len(gradients))


# Main demonstration function for gradient synchronization.
def main():
    # Print simple header describing the demonstration purpose.
    print("Gradient sync demo with two tiny workers.")

    # Initialize shared starting weight for both workers.
    initial_weight = 0.5
    print("Initial shared weight:", initial_weight)

    # Each worker computes its local gradient independently.
    grad_worker0 = worker_compute_grad(0, initial_weight)
    grad_worker1 = worker_compute_grad(1, initial_weight)

    # Print local gradients before synchronization step.
    print("Worker0 local gradient:", round(grad_worker0, 3))
    print("Worker1 local gradient:", round(grad_worker1, 3))

    # Perform all reduce mean to synchronize gradients.
    synced_grad = all_reduce_mean([grad_worker0, grad_worker1])

    # Print synchronized gradient shared by both workers.
    print("Synchronized averaged gradient:", round(synced_grad, 3))

    # Apply one small gradient descent update step.
    learning_rate = 0.1
    updated_weight = initial_weight - learning_rate * synced_grad

    # Show that both workers would apply same update.
    print("Updated shared weight after sync:", round(updated_weight, 3))


# Execute main demonstration when script is run.
main()




### **1.3. Communication Overhead**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_01_03.jpg?v=1769766707" width="250">



>* DDP adds coordination and gradient sharing costs
>* Network transfers and synchronization can bottleneck training speed

>* All-reduce gradient syncing dominates communication cost
>* Cost grows with model size and process count, and falls with network bandwidth

>* Communication overhead limits scaling when adding GPUs
>* Mitigated by larger per-GPU batches, gradient accumulation, and faster interconnects
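
A back-of-the-envelope model shows how this cost scales. The sketch below assumes a ring all-reduce, in which each rank sends roughly `2 * (N - 1) / N` times the gradient size, and it counts bandwidth only (no latency, no overlap with compute). The gradient size and link bandwidth are hypothetical numbers, and real backends may pick a different algorithm.

```python
def ring_allreduce_seconds(grad_bytes: float, world_size: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce over all gradients."""
    if world_size < 2:
        return 0.0  # a single process has nothing to synchronize
    # Each rank sends 2 * (N - 1) / N of the gradient bytes in a ring all-reduce.
    traffic = 2.0 * (world_size - 1) / world_size * grad_bytes
    return traffic / bandwidth_bytes_per_s


grad_bytes = 400e6   # hypothetical: 100M fp32 parameters
bandwidth = 10e9     # hypothetical: 10 GB/s per link
for n in (2, 4, 8):
    ms = ring_allreduce_seconds(grad_bytes, n, bandwidth) * 1e3
    print(f"{n} ranks: ~{ms:.0f} ms per all-reduce")
```

Under these assumptions the per-step traffic approaches twice the gradient size as ranks are added, so the communication time stays roughly flat while the compute per rank shrinks, which is why overhead takes a growing share of each step.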



## **2. Setting Up DDP**

### **2.1. Launching DDP with torchrun**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_02_01.jpg?v=1769766725" width="250">



>* torchrun launches and manages multi‑GPU training processes
>* It coordinates process identities and shared environment state

>* torchrun defines process counts, ranks, and devices
>* Scripts read env vars to init distributed backend

>* torchrun replaces manual multi‑GPU process management
>* Same script scales easily from workstation to cluster
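
As a minimal sketch of the script side, the snippet below reads the identity variables that torchrun sets for each process (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The helper name `read_torchrun_env` and the file name `train.py` are hypothetical, and the defaults are an assumption that lets the same code run as a plain single process.

```python
# Launch from a shell, assuming the script is saved as train.py (hypothetical name):
#   torchrun --standalone --nproc_per_node=4 train.py
import os
from typing import Dict


def read_torchrun_env() -> Dict[str, int]:
    """Read the identity variables torchrun sets, with single-process defaults."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }


info = read_torchrun_env()
print(f"rank {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

Run directly with `python`, this prints rank 0 of 1; under torchrun each launched process prints its own rank.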



In [None]:
#@title Python Code - Launching DDP with torchrun

# This script introduces launching DDP with torchrun.
# It runs safely on a CPU-only Colab runtime.
# It simulates ranks instead of using real multi-GPU processes.

# Required install for real PyTorch distributed training.
# !pip install torch torchvision torchaudio

# Import standard library modules for environment handling.
import os
import sys
import random

# Import typing tools for clearer function signatures.
from typing import Dict, Any, List

# Set a deterministic random seed for reproducibility.
random.seed(42)

# Define a small helper to simulate environment variables.


def build_fake_torchrun_env(world_size: int) -> List[Dict[str, Any]]:
    """Build fake env dicts for each simulated process."""
    envs: List[Dict[str, Any]] = []
    for rank in range(world_size):
        env: Dict[str, Any] = {}
        env["RANK"] = str(rank)
        env["WORLD_SIZE"] = str(world_size)
        env["LOCAL_RANK"] = str(rank)
        env["MASTER_ADDR"] = "127.0.0.1"
        env["MASTER_PORT"] = "29500"
        envs.append(env)
    return envs

# Define a function that prints key distributed settings.


def describe_process_from_env(env: Dict[str, Any]) -> str:
    """Return a short description string for one process."""
    rank = int(env.get("RANK", "0"))
    world = int(env.get("WORLD_SIZE", "1"))
    local = int(env.get("LOCAL_RANK", "0"))
    addr = env.get("MASTER_ADDR", "unknown")
    port = env.get("MASTER_PORT", "0")
    desc = (
        f"Rank {rank} of {world}, local rank {local}, "
        f"master {addr}:{port}"
    )
    return desc

# Define a tiny fake training step for demonstration.


def fake_training_step(rank: int, world_size: int) -> str:
    """Simulate a tiny gradient sync description."""
    base_grad = 0.1 * (rank + 1)
    avg_grad = sum(0.1 * (r + 1) for r in range(world_size))
    avg_grad /= float(world_size)
    msg = (
        f"Process {rank} computed grad {base_grad:.3f}, "
        f"synced average {avg_grad:.3f}"
    )
    return msg

# Define a function that simulates one DDP process body.


def simulated_ddp_process(env: Dict[str, Any]) -> List[str]:
    """Simulate what a DDP process might report."""
    rank = int(env["RANK"])
    world = int(env["WORLD_SIZE"])
    lines: List[str] = []
    lines.append(describe_process_from_env(env))
    lines.append(fake_training_step(rank, world))
    return lines

# Define a helper that prints a short header for the demo.


def print_header() -> None:
    """Print a short explanation header."""
    print("Simulating how torchrun configures each process.")
    print("Each line represents one launched process.")

# Define the main function that orchestrates the simulation.


def main() -> None:
    """Main entry that mimics torchrun launching."""
    world_size = 4
    envs = build_fake_torchrun_env(world_size)
    print_header()
    for env in envs:
        lines = simulated_ddp_process(env)
        for line in lines:
            print(line)

# Execute the main function when the script runs.
if __name__ == "__main__":
    main()




### **2.2. Initializing Process Groups**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_02_02.jpg?v=1769766787" width="250">



>* Process groups create a shared communication backbone
>* World size and rank enable coordinated collective operations

>* Choose backend and rendezvous to connect processes
>* All processes must match settings or initialization fails

>* Process groups can form flexible subgroups for communication
>* Initialize early and shut down cleanly for reliability
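
Because mismatched settings typically show up as a hang or a timeout rather than a clear error, a pre-flight comparison can help. The sketch below is a hypothetical helper, not part of PyTorch: it takes one environment dictionary per process and reports shared settings that disagree or ranks that are not exactly `0..N-1`.

```python
from typing import Dict, List

# Settings every process must agree on for rendezvous to succeed.
SHARED_KEYS = ("WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def find_rendezvous_problems(envs: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means the settings agree."""
    problems: List[str] = []
    for key in SHARED_KEYS:
        values = sorted({str(env.get(key)) for env in envs})
        if len(values) != 1:
            problems.append(f"{key} differs across processes: {values}")
    # Ranks must be unique and cover 0..N-1.
    ranks = sorted(int(env["RANK"]) for env in envs)
    if ranks != list(range(len(envs))):
        problems.append(f"ranks should be 0..{len(envs) - 1}, got {ranks}")
    return problems


envs = [
    {"RANK": "0", "WORLD_SIZE": "2", "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"},
    {"RANK": "1", "WORLD_SIZE": "2", "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29501"},
]
for problem in find_rendezvous_problems(envs):
    print(problem)
```

Here the second process points at a different port, so the two would never meet at the same rendezvous address.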



In [None]:
#@title Python Code - Initializing Process Groups

# This script introduces basic process group initialization.
# It uses simple prints to explain distributed concepts.
# Run it with torchrun to launch multiple processes.

# Install PyTorch if not already available in the environment.
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Import standard library modules for environment handling.
import os
import random
import socket

# Import torch and distributed utilities if available.
try:
    import torch
    import torch.distributed as dist
except ImportError:
    torch = None
    dist = None

# Set deterministic seeds for reproducible behavior.
random.seed(0)
os.environ["PYTHONHASHSEED"] = "0"


# Define a helper function to detect single process mode.
def is_single_process_mode() -> bool:
    world_env = os.getenv("WORLD_SIZE", "1")
    return world_env == "1"


# Define a helper to choose a default backend safely.
def choose_backend() -> str:
    if torch is None:
        return "gloo"
    if torch.cuda.is_available():
        return "nccl"
    return "gloo"


# Define a function to print a short header line.
def print_header(title: str) -> None:
    line = "=" * len(title)
    print(line)
    print(title)
    print(line)


# Define the main demonstration function for process groups.
def main() -> None:
    print_header("Initializing a simple process group")

    # Read rank and world size from environment variables.
    rank = int(os.getenv("RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    # Show basic identity information for this process.
    print(f"Process rank: {rank}, world size: {world_size}")

    # Handle the case where torch or dist is not available.
    if torch is None or dist is None:
        print("PyTorch not available, showing conceptual explanation only.")
        return

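    # Choose backend based on hardware availability.
    backend = choose_backend()
    print(f"Selected backend: {backend}")

    # Bind this process to its own GPU before using NCCL.
    if backend == "nccl":
        torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))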

    # Prepare rendezvous information for initialization.
    master_addr = os.getenv("MASTER_ADDR", "127.0.0.1")
    master_port = os.getenv("MASTER_PORT", "29500")

    # Show rendezvous configuration for clarity.
    print(f"Rendezvous at {master_addr}:{master_port}")

    # Skip initialization when running in single process mode.
    if is_single_process_mode():
        print("Single process mode, process group not initialized.")
        return

    # Initialize the default process group for all processes.
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )

    # Verify that the process group is initialized correctly.
    if dist.is_initialized():
        print("Process group successfully initialized.")
    else:
        print("Process group initialization failed.")

    # Cleanly destroy the process group before exiting.
    dist.destroy_process_group()


# Execute the main function when the script runs.
main()




### **2.3. DDP Model Wrapping**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_02_03.jpg?v=1769766842" width="250">



>* Move each model replica to its device
>* Wrap model so gradients sync across processes automatically

>* Wrapper syncs gradients via bucketed all-reduce
>* All replicas share averaged updates, scaling training

>* Register all trainable parameters before wrapping DDP
>* Keep identical models per process to synchronize updates
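
For comparison with a simulated version, here is a minimal sketch of the real wrapper in use. It assumes the script is launched by torchrun (so `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are set), uses a tiny `nn.Linear` as a stand-in model, and feeds random data in place of a real shard. The function name `run_one_step` is hypothetical.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_one_step() -> None:
    """One DDP training step; expects torchrun to have set RANK and friends."""
    use_cuda = torch.cuda.is_available()
    local_rank = int(os.environ["LOCAL_RANK"])
    if use_cuda:
        torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")

    # With no init_method given, rendezvous settings are read from the environment.
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    try:
        # Build and move the model first, then wrap it.
        model = nn.Linear(4, 2).to(device)
        ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

        # Random data stands in for this rank's shard of the dataset.
        inputs = torch.randn(8, 4, device=device)
        targets = torch.randn(8, 2, device=device)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()  # gradients are all-reduced during the backward pass
        optimizer.step()
        print(f"rank {dist.get_rank()} loss {loss.item():.4f}")
    finally:
        dist.destroy_process_group()


# Only run under torchrun, which sets RANK for every process.
if "RANK" in os.environ:
    run_one_step()
```

Note that the optimizer is built from `ddp_model.parameters()` after wrapping, and the training step itself looks the same as single-device code.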



In [None]:
#@title Python Code - DDP Model Wrapping

# This script illustrates a simple DDP-style concept.
# We simulate wrapping a model and synchronizing gradients.
# The focus is on beginner-friendly conceptual clarity.

# Required install for PyTorch if not already available.
# !pip install torch torchvision torchaudio --quiet

# Import the standard library module for randomness.
import random

# Import torch modules for tensors and neural networks.
import torch
import torch.nn as nn

# Set deterministic seeds for reproducible tiny experiment.
random.seed(0)
torch.manual_seed(0)

# Detect device but keep everything small and simple.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print torch version and chosen device once only.
print("Torch version:", torch.__version__, "Device:", device)

# Define a tiny linear model for demonstration purposes.
class TinyModel(nn.Module):

    # Initialize with a single linear layer only.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.layer = nn.Linear(in_features, out_features)

    # Forward pass applies the linear layer to inputs.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer(x)

# Create a helper to build the base model on cpu.
def build_base_model() -> TinyModel:
    # Use fixed input and output feature sizes.
    model = TinyModel(in_features=4, out_features=2)
    return model

# Create two model replicas to mimic two DDP processes.
base_model = build_base_model()
replica_one = build_base_model()

# Copy base weights so replicas start identically.
replica_one.load_state_dict(base_model.state_dict())

# Move both replicas to the selected device.
base_model = base_model.to(device)
replica_one = replica_one.to(device)

# Define a simple mean squared error loss function.
criterion = nn.MSELoss()

# Create a different tiny fake batch for each simulated process.
inputs_zero = torch.randn(2, 4, device=device)
inputs_one = torch.randn(2, 4, device=device)
targets_zero = torch.randn(2, 2, device=device)
targets_one = torch.randn(2, 2, device=device)

# Validate shapes to avoid silent broadcasting mistakes.
assert inputs_zero.shape == inputs_one.shape == (2, 4)
assert targets_zero.shape == targets_one.shape == (2, 2)

# Perform forward pass on each replica with its own batch.
outputs_zero = base_model(inputs_zero)
outputs_one = replica_one(inputs_one)

# Compute local losses for each simulated process.
loss_zero = criterion(outputs_zero, targets_zero)
loss_one = criterion(outputs_one, targets_one)

# Backward to compute local gradients on each replica.
loss_zero.backward()
loss_one.backward()

# Collect gradients from both replicas into simple lists.
grads_zero = [p.grad.detach().clone() for p in base_model.parameters()]
grads_one = [p.grad.detach().clone() for p in replica_one.parameters()]

# Simulate DDP all reduce by averaging gradients manually.
averaged_grads = []
for g0, g1 in zip(grads_zero, grads_one):
    avg = (g0 + g1) / 2.0
    averaged_grads.append(avg)

# Apply averaged gradients back to both model replicas.
with torch.no_grad():
    for param, avg_grad in zip(base_model.parameters(), averaged_grads):
        param.grad.copy_(avg_grad)
    for param, avg_grad in zip(replica_one.parameters(), averaged_grads):
        param.grad.copy_(avg_grad)

# Define a tiny optimizer for each model replica.
optimizer_zero = torch.optim.SGD(base_model.parameters(), lr=0.1)
optimizer_one = torch.optim.SGD(replica_one.parameters(), lr=0.1)

# Step both optimizers so parameters update identically.
optimizer_zero.step()
optimizer_one.step()

# Compare parameters to confirm synchronized updates.
max_difference = 0.0
for p0, p1 in zip(base_model.parameters(), replica_one.parameters()):
    diff = (p0.detach() - p1.detach()).abs().max().item()
    max_difference = max(max_difference, diff)

# Print a few key values to summarize the behavior.
print("Loss process zero:", float(loss_zero))
print("Loss process one:", float(loss_one))
print("Max parameter difference after sync:", max_difference)




## **3. Distributed Data Samplers**

### **3.1. Using DistributedSampler**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_03_01.jpg?v=1769766916" width="250">



>* DistributedSampler splits dataset indices across processes
>* Ensures full coverage; duplicates appear only as padding

>* Sampler uses world size and rank to partition
>* DataLoader then serves each process unique batches

>* Sampler pads with repeated samples so every rank gets equal counts
>* Equal counts keep ranks in step, at the cost of a few duplicates
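
The partitioning and padding can be seen directly with the real `DistributedSampler`. This sketch passes `num_replicas` and `rank` explicitly, so it runs in a single process without a process group, and it turns shuffling off to keep the pattern easy to read. The 10-item list and the world size of 3 are toy values chosen so that padding is needed.

```python
from torch.utils.data import DataLoader, DistributedSampler

dataset = list(range(10))   # toy dataset: sample i is just the integer i
world_size = 3              # hypothetical number of processes

per_rank = {}
for rank in range(world_size):
    # Explicit num_replicas and rank avoid the need for an initialized process group.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    per_rank[rank] = list(sampler)
    print(f"rank {rank}: {per_rank[rank]}")

# In a training script the sampler is handed to the DataLoader;
# leave the DataLoader's own shuffle unset because the sampler owns the ordering.
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
```

With 10 samples and 3 ranks, each rank receives 4 indices, so two indices from the start of the list are repeated as padding. Passing `drop_last=True` to the sampler would instead drop the tail so that no sample is repeated.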



In [None]:
#@title Python Code - Using DistributedSampler

# This script demonstrates DistributedSampler usage conceptually.
# It runs in a single process but mimics multiple ranks.
# Focus on how indices are partitioned across fake processes.

# No extra installs are required for this simple example.
# All used modules are from the Python standard library.

# Import random module for deterministic shuffling.
import random

# Set a deterministic seed for reproducible shuffling.
random.seed(42)

# Define a tiny toy dataset as a list of indices.
full_dataset_indices = list(range(12))

# Define world size representing total parallel processes.
world_size = 3

# Validate that world size is a positive integer.
assert isinstance(world_size, int) and world_size > 0

# Define a helper that mimics DistributedSampler behavior.
def get_indices_for_rank(indices, world_size, rank):
    # Validate rank is within the correct range.
    assert 0 <= rank < world_size

    # Shuffle a copy with a fixed local seed so every rank
    # slices the same permutation.
    shuffled = indices.copy()
    random.Random(42).shuffle(shuffled)

    # Compute padded length divisible by world size.
    remainder = len(shuffled) % world_size

    # If needed, pad with initial indices to balance.
    if remainder != 0:
        pad_size = world_size - remainder

        # Extend shuffled list with repeated indices.
        shuffled.extend(shuffled[:pad_size])

    # Compute per rank slice size after padding.
    per_rank = len(shuffled) // world_size

    # Compute start and end positions for this rank.
    start = rank * per_rank
    end = start + per_rank

    # Return the slice assigned to this rank.
    return shuffled[start:end]

# Collect assigned indices for each fake rank.
assigned = {}

# Loop over ranks and compute their index subsets.
for rank in range(world_size):
    # Get indices for this rank using helper.
    rank_indices = get_indices_for_rank(
        full_dataset_indices,
        world_size,
        rank,
    )

    # Store the indices for later inspection.
    assigned[rank] = rank_indices

# Print framework style header for clarity.
print("Simulated DistributedSampler index partitioning:")

# Show full dataset indices once for reference.
print("Full dataset indices:", full_dataset_indices)

# Print how many samples each rank receives.
print("World size:", world_size, "ranks in total")

# Loop again to display each rank assignment.
for rank in range(world_size):
    # Print rank specific subset of indices.
    print("Rank", rank, "sees indices:", assigned[rank])

# Confirm that union of unique indices covers dataset.
print("Unique indices covered:", sorted(set(sum(assigned.values(), []))))




### **3.2. Epoch Shuffling Behavior**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_03_02.jpg?v=1769766970" width="250">



>* Sampler creates one global shuffled index order
>* Each process gets a unique, deterministic slice

>* Sampler seed depends on provided epoch number
>* Updating epoch reshuffles data while keeping synchronization

>* Sampler pads and shuffles to balance samples
>* Careful reshuffling ensures fairness, stability, reproducibility
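
The role of the epoch number can be checked with the real sampler's `set_epoch` method. The sketch below again builds one sampler per simulated rank in a single process; the helper `epoch_assignment` and the 12-item dataset are hypothetical. All ranks share the same `seed`, so for a given epoch they slice the same permutation.

```python
from typing import List

from torch.utils.data import DistributedSampler

dataset = list(range(12))   # toy dataset that divides evenly across ranks
world_size = 3              # hypothetical number of processes


def epoch_assignment(epoch: int) -> List[List[int]]:
    """Return the index list each rank would see for the given epoch."""
    chunks = []
    for rank in range(world_size):
        sampler = DistributedSampler(
            dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=0
        )
        # The shuffle order is derived from seed + epoch, so set it before iterating.
        sampler.set_epoch(epoch)
        chunks.append(list(sampler))
    return chunks


for epoch in range(2):
    print(f"epoch {epoch}: {epoch_assignment(epoch)}")
```

In a training loop, the usual pattern is to call `sampler.set_epoch(epoch)` at the start of every epoch, before iterating the DataLoader; if it is never called, every epoch reuses the same order.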



In [None]:
#@title Python Code - Epoch Shuffling Behavior

# This script illustrates distributed epoch shuffling behavior.
# We simulate ranks and a sampler without external libraries.
# Focus is on how epochs change sample order deterministically.

# Example pip install for real distributed training environments.
# !pip install torch

# Import standard random and math utilities.
import random
import math

# Set a global seed for deterministic behavior.
random.seed(42)

# Define a tiny synthetic dataset of integer sample ids.
dataset_indices = list(range(10))

# Configure world size and ensure it divides padded length.
world_size = 3

# Validate dataset size is positive and world size reasonable.
assert len(dataset_indices) > 0 and world_size > 0

# Compute padded length so each rank gets equal samples.
padded_len = int(math.ceil(len(dataset_indices) / world_size)) * world_size

# Brief function to build a padded index list.
def build_padded_indices(indices, padded_len):
    # Repeat indices then truncate to padded length.
    repeated = (indices * ((padded_len // len(indices)) + 1))[:padded_len]
    return repeated

# Build the padded index list once for all epochs.
base_indices = build_padded_indices(dataset_indices, padded_len)

# Function to simulate one epoch shuffling for all ranks.
def simulate_epoch(epoch, world_size, base_indices):
    # Create a local random generator seeded by epoch.
    rng = random.Random(1000 + epoch)

    # Copy and shuffle indices deterministically for this epoch.
    shuffled = list(base_indices)
    rng.shuffle(shuffled)

    # Partition shuffled indices into equal rank chunks.
    chunk_size = len(shuffled) // world_size
    rank_chunks = []

    # Slice shuffled list so each rank gets unique segment.
    for rank in range(world_size):
        start = rank * chunk_size
        end = start + chunk_size
        rank_chunks.append(shuffled[start:end])

    # Return the per rank index assignment for this epoch.
    return rank_chunks

# Print a short header explaining the demonstration.
print("Simulating epoch shuffling for", world_size, "ranks")

# Choose a few epochs to visualize behavior changes.
epochs_to_show = [0, 1, 2]

# Loop over epochs and display rank assignments.
for epoch in epochs_to_show:
    # Simulate shuffling and partitioning for this epoch.
    rank_chunks = simulate_epoch(epoch, world_size, base_indices)

    # Print epoch number and per rank index lists.
    print("\nEpoch", epoch, "rank index assignments:")
    for rank, chunk in enumerate(rank_chunks):
        print("  Rank", rank, "sees indices:", chunk)



### **3.3. Debugging Rank Mismatches**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_08/Lecture_A/image_03_03.jpg?v=1769767010" width="250">



>* Rank mismatches cause inconsistent data slices per process
>* Leads to hangs, wasted compute, and bad metrics

>* Spot symptoms like uneven epochs or batches
>* Log ranks, world size, dataset stats to compare

>* Standardize configs, launches, environments, and data views
>* Verify sampler epoch updates and each rank’s dataset slice
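
One way to act on the logging advice is to have every rank produce a small report and then compare the reports field by field. The sketch below is a hypothetical helper working on plain dictionaries; in a real job the reports could be collected on one rank with `torch.distributed.all_gather_object`, which is assumed here rather than shown.

```python
from typing import Any, Dict, List, Sequence


def find_mismatched_fields(
    reports: List[Dict[str, Any]],
    fields: Sequence[str] = ("world_size", "dataset_len", "num_batches"),
) -> Dict[str, Dict[int, Any]]:
    """Return the fields whose values differ across ranks, with per-rank values."""
    mismatches: Dict[str, Dict[int, Any]] = {}
    for field in fields:
        by_rank = {report["rank"]: report[field] for report in reports}
        if len(set(by_rank.values())) > 1:
            mismatches[field] = by_rank
    return mismatches


# Hypothetical reports: rank 1 was launched with the wrong world size.
reports = [
    {"rank": 0, "world_size": 2, "dataset_len": 12, "num_batches": 3},
    {"rank": 1, "world_size": 3, "dataset_len": 12, "num_batches": 2},
]
for field, values in find_mismatched_fields(reports).items():
    print(f"{field} differs by rank: {values}")
```

A differing `num_batches` is the value most likely to cause a hang, because the rank with fewer batches stops joining collective operations while the others wait.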



In [None]:
#@title Python Code - Debugging Rank Mismatches

# This script illustrates debugging rank mismatches simply.
# We simulate two ranks and compare sampler index assignments.
# Focus on understanding slice sizes and mismatched expectations.

# Required external install for torch in some environments.
# !pip install torch torchvision torchaudio --quiet

# Import standard random and os modules.
import os
import random
import math

# Import torch and set deterministic behavior.
import torch

# Set deterministic seeds for reproducibility.
random.seed(0)

torch.manual_seed(0)

# Define a tiny dummy dataset length.
DATASET_LEN = 12

# Define a helper to build per rank index slices.
def build_indices(dataset_len, world_size, rank):
    # Compute base samples per rank using ceiling division.
    per_rank = math.ceil(dataset_len / world_size)
    # Compute start and end positions for this rank.
    start = rank * per_rank
    end = min(start + per_rank, dataset_len)
    # Return list of indices for this rank.
    return list(range(start, end))

# Simulate correct configuration with world size two.
correct_world_size = 2

# Build indices for rank zero with correct world size.
correct_rank0 = build_indices(DATASET_LEN, correct_world_size, 0)

# Build indices for rank one with correct world size.
correct_rank1 = build_indices(DATASET_LEN, correct_world_size, 1)

# Simulate a buggy configuration for rank one.
buggy_world_size_rank1 = 3

# Build buggy indices for rank one using wrong world size.
buggy_rank1 = build_indices(DATASET_LEN, buggy_world_size_rank1, 1)

# Print framework version in one concise line.
print("Torch version:", torch.__version__)

# Print correct per rank index assignments for reference.
print("Correct rank0 indices:", correct_rank0)

# Print correct rank one indices for comparison.
print("Correct rank1 indices:", correct_rank1)

# Print buggy rank one indices showing mismatch clearly.
print("Buggy rank1 indices:", buggy_rank1)

# Compute union of indices for correct configuration.
correct_union = sorted(set(correct_rank0 + correct_rank1))

# Compute union of indices when rank one is buggy.
buggy_union = sorted(set(correct_rank0 + buggy_rank1))

# Print union sizes to highlight coverage differences.
print("Correct union size:", len(correct_union))

# Print buggy union size and note potential missing samples.
print("Buggy union size:", len(buggy_union))

# Compute intersection to show overlapping duplicated work.
intersection = sorted(set(correct_rank0).intersection(buggy_rank1))

# Final print summarizing overlap and mismatch situation.
print("Overlapping indices between rank0 and buggy rank1:", intersection)




# <font color="#418FDE" size="6.5" uppercase>**DDP Fundamentals**</font>


In this lecture, you learned to:
- Describe the data‑parallel training paradigm and how DDP synchronizes gradients across processes. 
- Configure a basic DDP training script using torchrun, process groups, and DistributedDataParallel wrappers. 
- Use DistributedSampler with DataLoader to ensure each process sees a unique subset of data. 

In the next lecture (Lecture B), we will go over 'Scaling Practices'.