# PyTorch DDP Fashion MNIST Training Example run locally with Docker

This example demonstrates how to use Kubeflow Trainer locally with Docker. It gives you an experience similar to distributed training on Kubernetes, but on your local machine.

The notebook demonstrates how to train a convolutional neural network (CNN) to classify images using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset and [PyTorch Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). 


## Install the Kubeflow SDK

You need to install the Kubeflow SDK with the docker extra to interact with Kubeflow Trainer APIs:

In [1]:
!pip install -U kubeflow[docker]

## Define the Training Function

The first step is to create a function that trains a CNN model on the Fashion MNIST data.

In [2]:
def train_fashion_mnist():
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch import nn
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms

    # Define the PyTorch CNN model to be trained
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 20, 5, 1)
            self.conv2 = nn.Conv2d(20, 50, 5, 1)
            self.fc1 = nn.Linear(4 * 4 * 50, 500)
            self.fc2 = nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # Use NCCL if a GPU is available, otherwise use Gloo as communication backend.
    device, backend = ("cuda", "nccl") if torch.cuda.is_available() else ("cpu", "gloo")
    print(f"Using Device: {device}, Backend: {backend}")

    # Setup PyTorch distributed.
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    dist.init_process_group(backend=backend)
    rank = dist.get_rank()
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            rank,
            local_rank,
        )
    )

    # Create the model and load it into the device.
    device = torch.device(f"{device}:{local_rank}")
    model = nn.parallel.DistributedDataParallel(Net().to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)


    # Use a rank-specific dataset directory to avoid concurrent writes to a shared mount
    data_dir = f"/tmp/fashion-mnist-{rank}"
    os.makedirs(data_dir, exist_ok=True)
    dataset = datasets.FashionMNIST(
        data_dir,
        train=True,
        download=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )


    # Shard the dataset across workers.
    train_loader = DataLoader(
        dataset,
        batch_size=100,
        sampler=DistributedSampler(dataset)
    )

    # TODO(astefanutti): add parameters to the training function
    dist.barrier()
    for epoch in range(1, 3):
        model.train()

        # Iterate over mini-batches from the training set
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            # Copy the data to the GPU device if available
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward pass
            outputs = model(inputs)
            loss = F.nll_loss(outputs, labels)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                        epoch,
                        batch_idx * len(inputs),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )

    # Wait for the distributed training to complete
    dist.barrier()
    if dist.get_rank() == 0:
        print("Training is finished")

    # Finally clean up PyTorch distributed
    dist.destroy_process_group()
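
The `4 * 4 * 50` input size of `fc1` in the model above follows from the layer arithmetic on a 28x28 Fashion MNIST image. The following is a minimal sketch of that calculation in plain Python. The helper names `conv_out` and `pool_out` are illustrative only and are not part of PyTorch.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a square convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int, stride: int) -> int:
    """Output width/height of a square max-pooling layer without padding."""
    return (size - kernel) // stride + 1


size = 28                      # Fashion MNIST images are 28x28 with one channel
size = conv_out(size, 5)       # conv1: 5x5 kernel, stride 1 -> 24
size = pool_out(size, 2, 2)    # max_pool2d(x, 2, 2)         -> 12
size = conv_out(size, 5)       # conv2: 5x5 kernel, stride 1 -> 8
size = pool_out(size, 2, 2)    # max_pool2d(x, 2, 2)         -> 4

flattened = size * size * 50   # conv2 produces 50 channels
print(size, flattened)         # 4 800
```

If you change a kernel size or add padding, the same arithmetic tells you what the first argument of `fc1` and the `x.view(...)` call need to become.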

## Run PyTorch DDP with Kubeflow TrainJob

You can use `TrainerClient()` from the Kubeflow SDK to communicate with Kubeflow Trainer APIs and scale your training function across multiple PyTorch training nodes.

`TrainerClient(backend_config=LocalDockerBackendConfig())` verifies that you have the required access to a local Docker client.

Kubeflow Trainer creates a `TrainJob` resource and automatically sets the environment variables needed to run PyTorch in a distributed environment. In this context, "distributed" means multiple containers on a local Docker instance that communicate over a Docker network.
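
As a rough illustration of the environment variables involved: with its default `env://` initialization, `dist.init_process_group()` reads `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE` from the environment, and launchers such as `torchrun` also set `LOCAL_RANK`. The sketch below only reads those standard PyTorch variables. Exactly which variables the local Docker backend sets is an assumption here, `read_dist_env` is a hypothetical helper, and the example values are made up.

```python
import os


def read_dist_env(environ=os.environ) -> dict:
    """Collect the standard PyTorch distributed variables, if present."""
    names = ["MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"]
    return {name: environ.get(name) for name in names}


# Made-up values for the second of two containers (assumption, not real output).
example = {
    "MASTER_ADDR": "node-0",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
    "RANK": "1",
    "LOCAL_RANK": "0",
}
for name, value in read_dist_env(example).items():
    print(f"{name}={value}")
```

Calling `read_dist_env()` with no argument at the top of the training function is one way to check what a container actually received.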



In [3]:
from kubeflow.trainer import CustomTrainer, TrainerClient, LocalDockerBackendConfig
import os

backend_config = LocalDockerBackendConfig()

# By default, the SDK looks for the Docker socket in the default location, for example /var/run/docker.sock.
# If the socket is elsewhere, for example when using Colima on macOS, you can specify its path explicitly:
# backend_config = LocalDockerBackendConfig(
#     docker_host=f"unix://{os.path.expanduser('~')}/.colima/default/docker.sock"
# )

client = TrainerClient(backend_config=backend_config)

## List the Training Runtimes

You can get the list of available Training Runtimes to start your TrainJob.

The output may also show the accelerator type and the number of available devices.

In [4]:
for runtime in client.list_runtimes():
    print(runtime)
    if runtime.name == "torch-distributed":
        torch_runtime = runtime

Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)
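
The loop above leaves `torch_runtime` undefined if no runtime named `torch-distributed` is returned. A slightly more defensive lookup could look like the sketch below. `find_runtime` is a hypothetical helper, not an SDK function. It relies only on each runtime having a `name` attribute, as in the output above, and uses stand-in objects so that it runs without the SDK.

```python
from types import SimpleNamespace


def find_runtime(runtimes, name):
    """Return the first runtime whose .name matches, or raise a clear error."""
    runtimes = list(runtimes)
    for runtime in runtimes:
        if runtime.name == name:
            return runtime
    available = ", ".join(r.name for r in runtimes) or "none"
    raise LookupError(f"Runtime {name!r} not found (available: {available})")


# Stand-ins for the objects returned by client.list_runtimes().
runtimes = [SimpleNamespace(name="torch-distributed")]
torch_runtime = find_runtime(runtimes, "torch-distributed")
print(torch_runtime.name)
```

In the notebook, the equivalent call would be `find_runtime(client.list_runtimes(), "torch-distributed")`.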


## Run the Distributed TrainJob

The Kubeflow TrainJob will train the above model on 2 PyTorch nodes.

In [5]:
job_name = client.train(
    trainer=CustomTrainer(
        func=train_fashion_mnist,
        # Set how many PyTorch nodes you want to use for distributed training. 
        # num_nodes will equal the number of local containers running
        num_nodes=2, 
    ),
    runtime=torch_runtime,
)
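
To make the effect of `num_nodes=2` concrete: `DistributedSampler` pads the index list to a multiple of the world size and gives each rank every `world_size`-th index. The sketch below mimics that split in plain Python without shuffling, so it is a simplified model of the sampler rather than its implementation. `shard_indices` is a hypothetical name.

```python
import math


def shard_indices(dataset_len: int, world_size: int, rank: int) -> list:
    """Simplified, unshuffled model of DistributedSampler (drop_last=False)."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by cycling through the indices so every rank gets the same count.
    padded = [i % dataset_len for i in range(total)]
    return padded[rank:total:world_size]


# The Fashion MNIST training split has 60,000 images.
for rank in range(2):
    shard = shard_indices(60_000, world_size=2, rank=rank)
    batches = math.ceil(len(shard) / 100)  # batch_size=100 in train_loader
    print(f"rank {rank}: {len(shard)} samples, {batches} batches per epoch")
```

Under these assumptions, each of the two containers processes 30,000 samples in 300 mini-batches per epoch, while DDP averages the gradients across both.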

## Check the TrainJob steps

You can check the components of the TrainJob that was created.

Since the TrainJob performs distributed training across 2 nodes, it generates 2 steps: `node-0` and `node-1`.

You can get the individual status for each of these steps.

In [6]:
# Wait for the running status.
client.wait_for_job_status(name=job_name, status={"Running"})

TrainJob(name='db70754a1731', creation_timestamp=datetime.datetime(2025, 9, 30, 12, 52, 52, 87272), runtime=Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None), steps=[Step(name='node-0', status='Running', pod_name='db70754a1731-node-0', device='Unknown', device_count='Unknown'), Step(name='node-1', status='Running', pod_name='db70754a1731-node-1', device='Unknown', device_count='Unknown')], num_nodes=2, status='Running')

In [7]:
for c in client.get_job(name=job_name).steps:
    print(f"Step: {c.name}, Status: {c.status}, Devices: {c.device} x {c.device_count}\n")

Step: node-0, Status: Running, Devices: Unknown x Unknown

Step: node-1, Status: Running, Devices: Unknown x Unknown
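
When a job has more nodes, a per-status count can be easier to read than one line per step. The following sketch uses stand-in objects with the same `name` and `status` attributes as the steps printed above. `summarize_steps` is a hypothetical helper, not part of the SDK.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_steps(steps) -> Counter:
    """Count steps by their reported status."""
    return Counter(step.status for step in steps)


# Stand-ins for client.get_job(name=job_name).steps.
steps = [
    SimpleNamespace(name="node-0", status="Running"),
    SimpleNamespace(name="node-1", status="Running"),
]
summary = summarize_steps(steps)
print(dict(summary))
```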



## Watch the TrainJob logs

We can use the `get_job_logs()` API to get the TrainJob logs.

In [8]:
for logline in client.get_job_logs(job_name, follow=True):
    print(logline)

Using Device: cpu, Backend: gloo

Distributed Training for WORLD_SIZE: 2, RANK: 0, LOCAL_RANK: 0

100%|██████████| 26.4M/26.4M [00:04<00:00, 5.45MB/s]

100%|██████████| 29.5k/29.5k [00:00<00:00, 810kB/s]

100%|██████████| 4.42M/4.42M [00:00<00:00, 4.76MB/s]

100%|██████████| 5.15k/5.15k [00:00<00:00, 10.8MB/s]

Training is finished

Using Device: cpu, Backend: gloo

Distributed Training for WORLD_SIZE: 2, RANK: 1, LOCAL_RANK: 0

100%|██████████| 26.4M/26.4M [00:05<00:00, 5.22MB/s]

100%|██████████| 29.5k/29.5k [00:00<00:00, 958kB/s]

100%|██████████| 4.42M/4.42M [00:00<00:00, 5.38MB/s]

100%|██████████| 5.15k/5.15k [00:00<00:00, 13.6MB/s]
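
Rank 0 prints progress lines in the format defined in `train_fashion_mnist`. If you want to track the loss programmatically, those lines can be parsed as they stream in. The sketch below assumes exactly that format. `parse_progress` is a hypothetical helper, and the sample line is generated with made-up values rather than taken from a real run.

```python
import re

# Matches progress lines of the form printed by the training function:
# "Train Epoch: <epoch> [<seen>/<total> (<pct>%)]<tab>Loss: <loss>"
PROGRESS = re.compile(
    r"Train Epoch: (?P<epoch>\d+) \[(?P<seen>\d+)/(?P<total>\d+) \(\d+%\)\]"
    r"\s+Loss: (?P<loss>\d+\.\d+)"
)


def parse_progress(line: str):
    """Return (epoch, samples_seen, loss) for a progress line, else None."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return int(match["epoch"]), int(match["seen"]), float(match["loss"])


# Build a sample line with the same format string as the training function.
sample = "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
    1, 1000, 60000, 100.0 * 10 / 300, 0.5
)
print(parse_progress(sample))
print(parse_progress("Training is finished"))
```

In the notebook, this would be applied to each `logline` yielded by `client.get_job_logs(job_name, follow=True)`, skipping the lines for which it returns `None`.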



## Optional: Examine Docker resources

- Containers for this training job

```bash
docker ps --filter label=trainer.kubeflow.org/trainjob-name
```

Example:
```text
CONTAINER ID   IMAGE                                           NAMES
f6a786574f73   pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime   ydb5bf3c10c4-node-1
c36274db6eb9   pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime   ydb5bf3c10c4-node-0
```

- Network created for this training job

```bash
docker network ls --filter label=trainer.kubeflow.org/trainjob-name
```

Example:
```text
NETWORK ID     NAME               DRIVER    SCOPE
2cded187f9e7   b69f13d3f8dc-net   bridge    local
```

## Delete the TrainJob

When the TrainJob is finished, you can delete the resource.


In [None]:
# client.delete_job(job_name)
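
If you script this workflow outside a notebook, a `try`/`finally` pattern helps ensure that the job is deleted even when a later step fails. The sketch below uses a stand-in client so that it runs on its own. `cleanup_job` and `FakeClient` are hypothetical names, and `delete_job` is the only client method the helper calls.

```python
from contextlib import contextmanager


@contextmanager
def cleanup_job(client, job_name):
    """Yield the job name and delete the job on exit, even after an error."""
    try:
        yield job_name
    finally:
        client.delete_job(job_name)


class FakeClient:
    """Stand-in that records deletions instead of talking to Docker."""

    def __init__(self):
        self.deleted = []

    def delete_job(self, name):
        self.deleted.append(name)


fake = FakeClient()
with cleanup_job(fake, "db70754a1731") as name:
    print(f"working with job {name}")
print(fake.deleted)
```

With the real SDK, you would pass the `TrainerClient` created earlier and the name returned by `client.train(...)` instead of the stand-ins.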