This notebook assumes there are 2 GPUs on a single node.

It can run in a Kaggle Notebook.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
!nvidia-smi

Tue Aug  5 03:53:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                      

In [3]:
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
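
The next cell exercises `all_gather` and `all_reduce` on real GPUs. As a mental model for what each collective leaves on each rank, here is a minimal pure-Python sketch in which index `i` of a list plays the role of rank `i`. The helper functions are hypothetical teaching aids written for this note; they are not part of `torch.distributed` and perform no communication.

```python
# Pure-Python model of collective semantics: position i in a list stands for rank i.
# These helpers are illustrative only and are NOT torch.distributed APIs.

def broadcast(values, src=0):
    """Every rank ends up holding the value that rank `src` started with."""
    return [values[src] for _ in values]

def reduce_sum(values, dst=0):
    """Only rank `dst` receives the sum; None marks ranks that get no result."""
    total = sum(values)
    return [total if r == dst else None for r in range(len(values))]

def all_reduce_sum(values):
    """Every rank ends up holding the sum over all ranks."""
    total = sum(values)
    return [total for _ in values]

def gather(values, dst=0):
    """Only rank `dst` receives the list of all per-rank values."""
    return [list(values) if r == dst else None for r in range(len(values))]

def all_gather(values):
    """Every rank receives the list of all per-rank values."""
    return [list(values) for _ in values]

def scatter(chunks):
    """Rank i receives chunks[i] from the list held by the source rank."""
    return list(chunks)

per_rank = [10, 20, 30, 40]          # one value per rank, world_size = 4
print(broadcast(per_rank, src=0))    # [10, 10, 10, 10]
print(reduce_sum(per_rank, dst=0))   # [100, None, None, None]
print(all_reduce_sum(per_rank))      # [100, 100, 100, 100]
print(gather(per_rank, dst=0)[0])    # [10, 20, 30, 40]
print(all_gather(per_rank)[2])       # [10, 20, 30, 40]
print(scatter(["a", "b", "c", "d"])) # ['a', 'b', 'c', 'd']
```

The `all_*` variants differ from their plain counterparts only in who receives the result: every rank instead of a single destination rank.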

In [4]:
# @title Process for collective communication

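# Broadcast: sends data from one source process to all other processes. In torch.distributed,
# broadcast(tensor, src=0) sends the data held by the rank 0 process to every other process.
# Broadcasting guarantees that all processes hold the same data, which suits sharing model
# parameters or initial weights. For example, during the initialization phase of distributed
# training it broadcasts the main process's model parameters to all other processes, so that
# training starts from the same initial parameters everywhere.
#
# Reduce and All-Reduce: a reduction combines data from multiple processes with a computation
# such as sum or max. There are two common variants. reduce(): one process (usually the main
# process) collects and combines the data from all processes. all_reduce(): every process
# receives the combined result. For example, all_reduce(tensor, op=ReduceOp.SUM) sums across
# all processes and stores the result in each process's tensor. Reductions can effectively
# lower communication overhead and suit large-scale gradient aggregation or model weight
# updates. In distributed training, all_reduce is commonly used to sum gradients so that they
# stay consistent across processes and updates are synchronized.
#
# Gather and All-Gather: a gather collects data from multiple processes onto one or more
# processes. gather(): collects the data of all processes onto a single process.
# all_gather(): every process receives the data of all processes. For example,
# all_gather(gathered_tensors, tensor) collects the tensor from every process into each
# process's gathered_tensors list. Gathering makes it easy to analyze or post-process data
# from all processes; during evaluation, for instance, all_gather can collect the
# intermediate results of each process.
#
# Scatter: scatter() distributes one process's data across multiple processes. For example,
# if the rank 0 process holds a list of sub-tensors, scatter() hands one sub-tensor from the
# list to each process. It suits data distribution, such as splitting a large dataset or
# model weights across processes so that each process works on a different chunk.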

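def init_process(rank, world_size, backend="nccl"):
  device = f"cuda:{rank}"
  print(f"Starting process with {rank=}, {world_size=} {device=}")

  # Bind this process to its own GPU before any NCCL communication.
  torch.cuda.set_device(device)

  # NCCL is the backend for GPU tensors. Pass backend="gloo" for CPU-based runs
  # (the tensors below would then also have to live on the CPU).
  dist.init_process_group(backend=backend, world_size=world_size, rank=rank)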

  assert rank == dist.get_rank()
  assert world_size == dist.get_world_size()
  dist.barrier()
  print(f"{rank=} Finished init!!!")

  # Task 1 - all gather
  # Every process contributes a short info string and receives the strings of all processes.
  if rank == 0:
    print("\nTask 1 - all gather")
  process_info = (
      f"Process {rank} Information..."
  )
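  # all_gather needs a tensor of the same shape on every rank, so the string is
  # packed byte by byte into a zero-padded buffer of max_len elements.
  max_len = 100
  process_info_tensor = torch.zeros(max_len, dtype=torch.int32, device=device)
  process_info_bytes = process_info.encode('utf-8')
  process_info_tensor[:len(process_info_bytes)] = torch.tensor(list(process_info_bytes), dtype=torch.int32)

  # One receive buffer per rank; all_gather fills slot i with rank i's tensor.
  gathered_tensors = [torch.zeros(max_len, dtype=torch.int32, device=device) for _ in range(world_size)]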

  dist.all_gather(gathered_tensors, process_info_tensor)

  if rank == 0:
    for t in gathered_tensors:
      info_bytes = t.to('cpu').numpy().astype('uint8').tobytes()
      info_str = info_bytes.decode('utf-8', 'ignore').strip('\x00')
      print(info_str)
  dist.barrier()
  print(f"{rank=} Finished step 1!!!")

  # Task 2 - all reduce (sum)
  if rank == 0:
    print("\nTask 2 - all reduce")
  tensor = torch.ones((4,)).to(device)
  dist.all_reduce(tensor)
  print(f"All reduce for all processes: in rank {rank}, tensor = {tensor}")
  dist.barrier()

  # Task 3 - all reduce (sum) in a sub-group.
  if rank == 0:
    print("\nTask 3 - all reduce for sub-group")
  # new_group() must be called by every process, including ranks that are not members.
  sub_group_ranks = list(range(1, world_size, 2))
  sub_group = dist.new_group(ranks=sub_group_ranks)
  if rank in sub_group_ranks:
    sub_group_tensor = torch.ones((4,)).to(device)
    dist.all_reduce(sub_group_tensor, group=sub_group)
    print(f"Sub group all reduce: in rank {rank}, tensor = {sub_group_tensor}")
  dist.barrier()

  # Task 4 - all reduce (sum) in a sub-group, then sync results to the entire group.
  if rank == 0:
    print("\nTask 4 - all reduce (sum) in a sub-group, then sync results to the entire group.")
  group_1_sum = torch.tensor([1, 1, 1, 1]).to(device)
  group_2_sum = torch.tensor([1.5] * 4).to(device)
  group_1_ranks = list(range(world_size // 2))
  group_2_ranks = list(range(world_size // 2, world_size))
  group_1 = dist.new_group(ranks=group_1_ranks)
  group_2 = dist.new_group(ranks=group_2_ranks)
  if rank in group_1_ranks:
    dist.all_reduce(group_1_sum, group=group_1)
  else:
    dist.all_reduce(group_2_sum, group=group_2)
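  # Communicate the sub-group sums to the entire group. Ranks outside a sub-group still hold
  # that group's initial value, which is no larger than the reduced sum, so a MAX all-reduce
  # over all ranks leaves every rank holding the sub-group sum.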
  dist.all_reduce(group_1_sum, op=dist.ReduceOp.MAX)
  dist.all_reduce(group_2_sum, op=dist.ReduceOp.MAX)
  print(f"In rank {rank}, {group_1_sum.to('cpu')=}, {group_2_sum.to('cpu')=}")

  # Finish
  print(f"\nFinishing process with {rank=}, {world_size=}")
  dist.destroy_process_group()

# Run the distributed processing

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12459' # You can choose a different port if 12459 is in use

world_size = 2

processes = []
for rank in range(world_size):
  p = mp.Process(target=init_process, args=(rank, world_size))
  p.start()
  processes.append(p)

for p in processes:
  p.join()

Starting process with rank=0, world_size=2 device='cuda:0'
Starting process with rank=1, world_size=2 device='cuda:1'


[W805 03:54:05.917920006 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W805 03:54:05.924203798 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W805 03:54:13.919544309 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W805 03:54:21.921284104 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[rank1]:[W805 03:54:21.273145008 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]:[W805 03:54:21.273141696 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially

{rank=} Finished init!!!
{rank=} Finished init!!!

Task 1 - all gather
Process 0 Information...
Process 1 Information...
{rank=} Finished step 1!!!
{rank=} Finished step 1!!!


Task 2 - all reduce
All reduce for all processes: in rank 0, tensor = tensor([2., 2., 2., 2.], device='cuda:0')
All reduce for all processes: in rank 1, tensor = tensor([2., 2., 2., 2.], device='cuda:1')


Task 3 - all reduce for sub-group
Sub group all reduce: in rank 1, tensor = tensor([1., 1., 1., 1.], device='cuda:1')

Rank 4 - all reduce (sum) in a sub-group, then sync results to the entire group.
In rank 1, group_1_sum.to('cpu')=tensor([1, 1, 1, 1]), group_2_sum.to('cpu')=tensor([1.5000, 1.5000, 1.5000, 1.5000])
In rank 0, group_1_sum.to('cpu')=tensor([1, 1, 1, 1]), group_2_sum.to('cpu')=tensor([1.5000, 1.5000, 1.5000, 1.5000])


Finishing process with rank=1, world_size=2
Finishing process with rank=0, world_size=2
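
With only 2 GPUs, each sub-group in Task 4 contains a single rank, so the sub-group sums equal the initial values and the MAX step has nothing visible to do. The following pure-Python sketch replays the same logic for an assumed `world_size` of 4, using one scalar per rank instead of a 4-element tensor. The helper functions are hypothetical stand-ins for `dist.all_reduce`; they only model what each rank would hold.

```python
# Simulation of Task 4 for an assumed world_size of 4 (no GPUs or torch needed).
# Position i in each list is the value held by rank i.

def sub_group_all_reduce_sum(values, ranks):
    """Sum over the member ranks only; non-member ranks keep their old value."""
    total = sum(values[r] for r in ranks)
    return [total if r in ranks else v for r, v in enumerate(values)]

def all_reduce_max(values):
    """Every rank ends up holding the maximum over all ranks."""
    top = max(values)
    return [top for _ in values]

world_size = 4
group_1_ranks = list(range(world_size // 2))              # [0, 1]
group_2_ranks = list(range(world_size // 2, world_size))  # [2, 3]

group_1_sum = [1.0] * world_size   # every rank starts with 1.0
group_2_sum = [1.5] * world_size   # every rank starts with 1.5

# Step 1: each sub-group sums among its own members.
group_1_sum = sub_group_all_reduce_sum(group_1_sum, group_1_ranks)
group_2_sum = sub_group_all_reduce_sum(group_2_sum, group_2_ranks)
print(group_1_sum)  # [2.0, 2.0, 1.0, 1.0]
print(group_2_sum)  # [1.5, 1.5, 3.0, 3.0]

# Step 2: a MAX all-reduce over all ranks spreads each sub-group sum everywhere.
group_1_sum = all_reduce_max(group_1_sum)
group_2_sum = all_reduce_max(group_2_sum)
print(group_1_sum)  # [2.0, 2.0, 2.0, 2.0]
print(group_2_sum)  # [3.0, 3.0, 3.0, 3.0]
```

The MAX step relies on the reduced sum being at least as large as the stale value on non-member ranks, which holds here because every contribution is positive. With negative contributions, a broadcast from one member rank would be the safer way to share the result.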

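
Task 1 sends a string through `all_gather` by packing its UTF-8 bytes into a zero-padded, fixed-length buffer and stripping the padding after the exchange. The sketch below isolates that packing step with NumPy on the CPU. The names `encode_fixed` and `decode_fixed` are hypothetical, and the buffer uses `uint8` (one element per byte) where the notebook uses `int32`; the resulting array could be turned into a tensor with `torch.from_numpy` before being moved to a device.

```python
import numpy as np

def encode_fixed(text, max_len=100):
    """Pack the UTF-8 bytes of `text` into a zero-padded uint8 array of length max_len."""
    data = text.encode("utf-8")
    if len(data) > max_len:
        raise ValueError(f"text needs {len(data)} bytes, buffer holds {max_len}")
    buf = np.zeros(max_len, dtype=np.uint8)
    buf[:len(data)] = np.frombuffer(data, dtype=np.uint8)
    return buf

def decode_fixed(buf):
    """Drop the trailing zero padding and decode the remaining bytes as UTF-8."""
    return buf.tobytes().rstrip(b"\x00").decode("utf-8")

packed = encode_fixed("Process 0 Information...")
print(packed.shape)          # (100,)
print(decode_fixed(packed))  # Process 0 Information...
```

Every rank must use the same `max_len`, because the collective expects identically shaped buffers. Raising on overlong input avoids silently truncating a multi-byte character.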
