# Message Passing and Mixed Precision
---

This notebook briefly introduces the concepts of message passing and mixed precision, and includes implementations to aid your understanding.

## Message Passing

The `torch.distributed` package enables parallelism across processes and clusters of machines. It relies on message-passing semantics, which let each process communicate data to other processes; the processes can use different communication backends and need not run on the same machine. Message passing can be exercised by running multiple processes simultaneously across machines or by using a single machine to spawn multiple processes. There are two communication approaches to consider: point-to-point and collective communication.


#### Point-to-Point Communication

Point-to-point communication transfers data from one process to another through the `send` and `recv` functions or their *immediate* (non-blocking) counterparts, `isend` and `irecv`.

<img src="images/send_recv.png" width="500px" height="500px" alt-text="p2p"/>

The communication pattern for `send`/`recv` can be blocking or non-blocking. It is blocking when both processes stop until the communication completes. It is non-blocking when they continue executing and each holds a distributed request object on which we can later call `req.wait()`. Point-to-point communication is helpful when we want fine-grained control over how our processes communicate, and it can be used to implement sophisticated algorithms, such as those in Baidu’s [DeepSpeech](https://github.com/baidu-research/baidu-allreduce) or Facebook’s [large-scale experiments](https://research.facebook.com/publications/accurate-large-minibatch-sgd-training-imagenet-in-1-hour/).

In the blocking point-to-point snippet below, both processes start with a zero tensor. `process 0` then increments its tensor and sends it to `process 1`, so that both end up holding 1.0. Note that `process 1` must allocate memory to store the data it receives.


**Blocking point-to-point communication**
```python
"""source: https://pytorch.org/tutorials/intermediate/dist_tuto.html"""
import torch
import torch.distributed as dist

def run(rank, size):
    """Blocking point-to-point communication between rank 0 and rank 1."""
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])
```
**Non-blocking point-to-point communication**

With the non-blocking version below, `process 1` may still hold `0.0` right after it has started receiving, because `irecv` returns before the data has arrived. Once `req.wait()` has returned, however, the communication is guaranteed to have taken place and the value stored in `tensor[0]` is 1.0.

```python
"""source: https://pytorch.org/tutorials/intermediate/dist_tuto.html"""
import torch
import torch.distributed as dist

def run(rank, size):
    """Non-blocking point-to-point communication between rank 0 and rank 1."""
    tensor = torch.zeros(1)
    req = None
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        req = dist.isend(tensor=tensor, dst=1)
        print('Rank 0 started sending')
    else:
        # Receive tensor from process 0
        req = dist.irecv(tensor=tensor, src=0)
        print('Rank 1 started receiving')
    # Block until the transfer has completed
    req.wait()
    print('Rank ', rank, ' has data ', tensor[0])
```


#### Collective Communication

Collective communication allows communication patterns across all processes in a **group**. A group is a subset of all processes. To create a group, we can pass a list of ranks to `dist.new_group(ranks)`. By default, collectives are executed on all processes, also known as the **world**. For example, to obtain the sum of all tensors on all processes, use the `dist.all_reduce(tensor, op, group)` collective. An all-reduce example is given below:

```python
""" source: https://pytorch.org/tutorials/intermediate/dist_tuto.html"""
import torch
import torch.distributed as dist

def run_all_reduce(rank, size):
    """ Simple collective communication. """
    group = dist.new_group([0, 1])
    tensor = torch.ones(1)
    # ``dist.ReduceOp`` replaces the deprecated ``dist.reduce_op`` alias.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])
```

To sum all tensors in the group, use `dist.ReduceOp.SUM` as the reduce operator (`dist.reduce_op` is a deprecated alias of `dist.ReduceOp`). Generally, any commutative mathematical operation can be used as an operator. PyTorch ships several built-in operators, including the following four, all of which are applied element-wise:

* `dist.ReduceOp.SUM`,
* `dist.ReduceOp.PRODUCT`,
* `dist.ReduceOp.MAX`,
* `dist.ReduceOp.MIN`.
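
To make the element-wise semantics concrete, here is a minimal sketch in plain Python that simulates what every rank would hold after an all-reduce with SUM, PRODUCT, MAX, or MIN. It does not use `torch.distributed`; the helper `simulate_all_reduce`, the `REDUCE_OPS` table, and the per-rank lists are hypothetical and exist only for illustration.

```python
from functools import reduce
import operator

# Hypothetical stand-ins for the four reduce operators listed above.
REDUCE_OPS = {
    "SUM": operator.add,
    "PRODUCT": operator.mul,
    "MAX": max,
    "MIN": min,
}

def simulate_all_reduce(per_rank_tensors, op):
    """Return the tensors each rank would hold after an all-reduce.

    per_rank_tensors: one equal-length list of numbers per rank.
    op: one of the keys in REDUCE_OPS.
    """
    combine = REDUCE_OPS[op]
    # zip(*...) groups the i-th element of every rank, so the operator
    # is applied element-wise across ranks.
    reduced = [reduce(combine, column) for column in zip(*per_rank_tensors)]
    # Unlike a plain reduce, all-reduce leaves the result on every rank.
    return [list(reduced) for _ in per_rank_tensors]

ranks = [[1.0, 5.0], [2.0, 3.0], [4.0, 0.5]]
for name in REDUCE_OPS:
    print(name, simulate_all_reduce(ranks, name)[0])
# SUM [7.0, 8.5]
# PRODUCT [8.0, 7.5]
# MAX [4.0, 5.0]
# MIN [1.0, 0.5]
```

Because these operators are commutative, the order in which the ranks' contributions are combined does not change the mathematical result (up to floating-point rounding).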

In addition to `dist.all_reduce(tensor, op, group)`, PyTorch implements other collectives, including *broadcast, reduce, scatter, gather, and all_gather*. They are discussed in detail, with diagrams, in the system topology [notebook](system-topology.ipynb).

Let's execute the [message passing send-recv script](../source_code/send_receive.py), which spawns two processes, sets up the distributed environment, initializes the process group (`dist.init_process_group`), and finally executes the given `run` function. The `init_processes` function ensures that every process can coordinate through a master, using the same IP address and port.
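
The script itself is not reproduced here, but a launcher of this kind can be sketched as follows, modeled on the PyTorch distributed tutorial. The `gloo` backend, the address `127.0.0.1`, the port `29500`, and the helper name `launch` are assumptions for illustration; the actual `send_receive.py` may differ.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, size):
    """Blocking send/recv: rank 0 sends a tensor holding 1.0 to rank 1."""
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        dist.send(tensor=tensor, dst=1)
    else:
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])


def init_processes(rank, size, fn, backend="gloo"):
    """Join the process group, then run ``fn`` on this rank."""
    # Every process must agree on the same master address and port.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # assumed address
    os.environ["MASTER_PORT"] = "29500"      # assumed free port
    dist.init_process_group(backend, rank=rank, world_size=size)
    try:
        fn(rank, size)
    finally:
        dist.destroy_process_group()


def launch(fn, world_size=2):
    """Start ``world_size`` processes and wait for all of them to finish."""
    # mp.spawn passes the process index (used here as the rank) as the
    # first argument, followed by the entries of ``args``.
    mp.spawn(init_processes, args=(world_size, fn), nprocs=world_size, join=True)
```

To try it, call `launch(run)` from under an `if __name__ == "__main__":` guard, which the `spawn` start method requires. With the `nccl` backend, each rank would additionally need to place its tensor on a CUDA device.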

Let's run the `Blocking point-to-point` communication part of our script with two GPUs by executing the cell below. 

In [None]:
!cd ../source_code && srun -p gpu -N 1 --gres=gpu:2 python3 send_receive.py

**Expected Output:**

```text
Rank 0 started sending
Rank  0  has data  tensor(1.)
Rank 1 started receiving
Rank  1  has data  tensor(1.)
```

## Overview of Mixed Precision 

[Mixed precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) is the combined use of different numerical formats (single- and half-precision computation) in the training of a deep neural network. Single precision is referred to as FP32 (`float32`), and half precision as FP16 (`float16`). Mixed precision training offers a significant computational speedup by performing operations in half precision while keeping a minimal amount of data in single precision, so that as much information as possible is retained in critical parts of the network. Since the introduction of [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/) in NVIDIA GPU architectures, switching to mixed precision has yielded significant training speedups. Using mixed precision training requires two steps: porting the model to use the FP16 data type where appropriate, and adding loss scaling to preserve small gradient values.

#### Benefits of Mixed Precision Training

- Speeds up math-intensive operations, such as linear and convolution layers, by using [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/).
- Speeds up memory-limited operations by accessing half the bytes compared to single-precision.
- Reduces memory requirements for training models, enabling larger models or larger minibatches.
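
As a rough illustration of the "half the bytes" point, the arithmetic below compares the storage needed for the same number of elements in FP32 and FP16. The element count is an arbitrary example value and `tensor_mib` is a hypothetical helper.

```python
import struct

BYTES_FP32 = struct.calcsize("f")  # IEEE 754 single precision: 4 bytes
BYTES_FP16 = struct.calcsize("e")  # IEEE 754 half precision: 2 bytes

def tensor_mib(num_elements, bytes_per_element):
    """Storage for a dense tensor, in MiB."""
    return num_elements * bytes_per_element / 2**20

num_elements = 1024 * 1024  # arbitrary example: a 1024 x 1024 activation
print(f"FP32: {tensor_mib(num_elements, BYTES_FP32):.1f} MiB")  # FP32: 4.0 MiB
print(f"FP16: {tensor_mib(num_elements, BYTES_FP16):.1f} MiB")  # FP16: 2.0 MiB
```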

#### FP16 Porting

- The FP16 dynamic range is sufficient for training; however, gradients may need to be scaled up into the FP16-representable range so that they do not become zero. The scaling must not be so large that it causes overflow.

  <img src="images/mixed-precision.png" width="450px" height="450px" alt-text="mp"/>
  
- Loss Scaling: The purpose of loss scaling is to preserve small gradient magnitudes. 
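
To see why small gradients need scaling, the sketch below rounds a value to FP16 using Python's standard `struct` module (format `"e"` is IEEE 754 half precision). The gradient value `1e-8`, the scaling factor `S = 1024`, and the helper `to_fp16` are arbitrary choices for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8   # example gradient, below the smallest positive FP16 value
S = 1024.0    # example scaling factor (a power of two)

print(to_fp16(grad))          # 0.0: the unscaled gradient underflows in FP16
scaled = to_fp16(grad * S)    # scaling first keeps the value representable
print(scaled > 0.0)           # True
print(scaled / S)             # unscaled in higher precision, close to 1e-8
```

The unscaled result is not exactly `1e-8`, because the scaled value is still rounded to FP16, but it is no longer lost entirely.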

*Training procedure:*
```text
1. Maintain a primary copy of weights in FP32
2. For each iteration:
     Make an FP16 copy of the weights
     Forward propagation (FP16 weights and activations)
     Multiply the resulting loss with the scaling factor S
     Backward propagation (FP16 weights, activations, and their gradients)
     Multiply the weight gradient with 1/S
     Complete the weight update (including gradient clipping)
```
- Procedure for choosing a scaling factor

```text
1. Maintain a primary copy of weights in FP32.
2. Initialize S to a large value.
3. For each iteration:
     Make an FP16 copy of the weights.
     Forward propagation (FP16 weights and activations).
     Multiply the resulting loss with the scaling factor S.
     Backward propagation (FP16 weights, activations, and their gradients).
     If there is an Inf or NaN in weight gradients:
         Reduce S.
         Skip the weight update and move to the next iteration.
     Multiply the weight gradient with 1/S.
     Complete the weight update (including gradient clipping, etc.).
     If there hasn’t been an Inf or NaN in the last N iterations, increase S.

```
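
The scaling-factor procedure above can be sketched in plain Python as follows. The class name `DynamicLossScaler`, the halving and doubling factors, and the default values are illustrative assumptions; the procedure itself only says to reduce or increase S.

```python
import math

class DynamicLossScaler:
    """Hypothetical helper that tracks the scaling factor S."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval  # N in the procedure above
        self._good_steps = 0

    def update(self, grads):
        """Return True if the weight update should be applied for these gradients."""
        if any(not math.isfinite(g) for g in grads):
            # Inf/NaN found: reduce S and skip this weight update.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # No Inf/NaN in the last N iterations: increase S.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
print(scaler.update([0.5, float("inf")]), scaler.scale)  # False 512.0
print(scaler.update([0.5, 0.25]), scaler.scale)          # True 512.0
print(scaler.update([0.1, 0.2]), scaler.scale)           # True 1024.0
```

In a real training loop, the caller would multiply the gradients by `1/S` and apply the optimizer step only when `update` returns `True`.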
**Summary of Mixed Precision Training**
- Use the FP16 format where appropriate so that operations can run on Tensor Cores
- Run the forward pass of the model
- Scale the loss and backpropagate the scaled gradients
- Unscale the gradients and let the optimizer perform the weight update

**Automatic Mixed Precision (AMP)**

[Automatic Mixed Precision (AMP)](https://developer.nvidia.com/automatic-mixed-precision) makes mixed precision training with FP16 easy in deep learning frameworks by automating the process. It converts matrix multiplications and convolutions to 16-bit precision for Tensor Core acceleration.

 <img src="images/amp.png" width="550px" height="550px" alt-text="amp"/>

### PyTorch Automatic Mixed Precision

In PyTorch, automatic mixed precision training means training with `torch.autocast` and `torch.amp.GradScaler` together. Instances of `torch.autocast` enable autocasting for chosen regions; autocasting automatically chooses the precision for each operation to improve performance while maintaining accuracy. Instances of `torch.amp.GradScaler` make the gradient-scaling steps convenient to perform. Gradient scaling improves convergence for networks with `float16` gradients (the default on CUDA and XPU) by minimizing gradient underflow, as explained [here](https://pytorch.org/docs/stable/amp.html#gradient-scaling).

Below, we build up the AMP training loop step by step. Without AMP (no `torch.autocast` or `GradScaler`), all ops execute in the default precision (`torch.float32`):

```python
...

for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Default precision:")
...
```
**Adding torch.autocast**

Instances of `torch.autocast` serve as context managers that allow regions of the script to run in mixed precision. In these regions, CUDA ops run in a `dtype` chosen by `autocast` to improve performance while maintaining accuracy. For details on the precision `autocast` chooses for each op, see the [autocast op reference](https://pytorch.org/docs/stable/amp.html#autocast-op-reference).

```python
for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under ``autocast``.
        with torch.autocast(device_type=device, dtype=torch.float16):
            output = net(input)
            # output is float16 because linear layers ``autocast`` to float16.
            assert output.dtype is torch.float16

            loss = loss_fn(output, target)
            # loss is float32 because ``mse_loss`` layers ``autocast`` to float32.
            assert loss.dtype is torch.float32

        # Exits ``autocast`` before backward(). Backward passes under ``autocast`` are not recommended.
        # Backward ops run in the same ``dtype`` ``autocast`` chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
```
**Adding GradScaler**

Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.

```python
# Constructs a ``scaler`` once, at the beginning of the convergence run, using default arguments.
scaler = torch.amp.GradScaler("cuda")

for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16):
            output = net(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls ``backward()`` on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # ``scaler.step()`` first unscales the gradients of the optimizer's assigned parameters.
        # If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called;
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
```

**Complete Flow for Automatic Mixed Precision**

In the example below, the optional argument `enabled` is added. If it is set to `False`, the calls to `autocast` and `GradScaler` become no-ops. This allows switching between default and mixed precision without if/else statements.
```python
use_amp = True
...
scaler = torch.amp.GradScaler("cuda", enabled=use_amp)

for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")

```
You can learn more about modifying gradients (e.g., clipping), saving/resuming AMP-enabled runs with bitwise accuracy, and inference/evaluation in the [PyTorch AMP recipe](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html#inspecting-modifying-gradients-e-g-clipping).

Please run the cell below to execute a [sample DDP](../source_code/ddp_mixed_precision.py) with the application of AMP using 2 GPUs.

In [None]:
!cd ../source_code && srun -p gpu -N 1 --gres=gpu:2 torchrun --nnodes 1 --nproc_per_node 2 ddp_mixed_precision.py

**Likely Output:**

```text
W0227 13:26:29.388000 2903532 site-packages/torch/distributed/run.py:792] *****************************************
W0227 13:26:29.388000 2903532 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0227 13:26:29.388000 2903532 site-packages/torch/distributed/run.py:792] *****************************************
[rank0]:[W227 13:26:32.794582922 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. 
...
Start training...
[Epoch 1/4] loss: 2.834
[Epoch 2/4] loss: 2.134
[Epoch 3/4] loss: 1.973
[Epoch 4/4] loss: 1.866
Finished Training
...
```

We have now reached the end of the distributed training strategy material. Proceed to the next notebook to learn how to profile, trace bottlenecks, and improve performance using NVIDIA Nsight Systems. Please click the [Next Link](nsys-introduction.ipynb).

---
## References

- https://pytorch.org/tutorials/intermediate/dist_tuto.html
- https://github.com/programmah/hpdl/blob/multi_gpu_pytorch/Pytorch_Distributed_Deep_Learning/workspace/jupyter_notebook/07-Message_Passing.ipynb
- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
- https://developer.nvidia.com/automatic-mixed-precision


## Licensing 

Copyright © 2025 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.