# Distributed Model Training
---



### Overview

The Distributed Model Training Bootcamp takes a real-world perspective on using GPUs efficiently to train models in a distributed manner. Attendees walk through the system topology to learn how GPUs are connected within a node and across nodes. Using the PyTorch framework, they learn state-of-the-art training strategies, including Distributed Data Parallelism (DDP), Fully Sharded Data Parallelism (FSDP), model parallelism, pipeline parallelism, and tensor parallelism. Attendees also learn to profile code and analyze performance with NVIDIA® Nsight™ Systems, a tool that helps identify optimization opportunities and improve the performance of applications running on systems with multiple CPUs and GPUs.


### Why Distributed Training?

Training deep learning models often takes a long time because the process typically requires substantial storage and compute capacity: during training, intermediate results must be calculated and held in memory. Dividing one large task into subtasks and running them in parallel makes the whole process much more time efficient and makes it feasible to complete complex tasks with large datasets. Distributed training pools the storage and compute power of multiple devices, which reduces training time and enables the training of very large models through the strategies listed in the Overview section.
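
As a rough illustration of dividing one task into subtasks, the sketch below simulates data parallelism in plain Python, with no GPUs and no PyTorch. It assumes a toy one-parameter model `y_hat = w * x` with a mean squared error loss and equal-sized shards; the helper names `mse_gradient`, `shard`, and `data_parallel_gradient` are hypothetical and not part of the lab material. Each simulated worker computes a gradient on its own shard of the batch, and averaging the local gradients stands in for the all-reduce step that a real framework would perform.

```python
# Simulate data parallelism on one machine with plain Python.
# Assumed toy model: y_hat = w * x, loss = mean squared error over the batch.

def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def shard(items, num_workers):
    """Split a list into num_workers equal, contiguous shards."""
    assert len(items) % num_workers == 0, "this sketch assumes equal shard sizes"
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]

def data_parallel_gradient(w, xs, ys, num_workers):
    """Each simulated worker computes a local gradient on its shard.

    Averaging the local gradients stands in for an all-reduce.
    """
    local_grads = [
        mse_gradient(w, x_shard, y_shard)
        for x_shard, y_shard in zip(shard(xs, num_workers), shard(ys, num_workers))
    ]
    return sum(local_grads) / num_workers

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # synthetic targets for illustration only
w = 0.5

full_grad = mse_gradient(w, xs, ys)
parallel_grad = data_parallel_gradient(w, xs, ys, num_workers=4)

# With equal-sized shards, the averaged gradient matches the full-batch gradient.
print(abs(full_grad - parallel_grad) < 1e-9)
```

With equal shard sizes, the averaged gradient matches the full-batch gradient, which is why each worker can process only a fraction of the data while all workers still apply the same update. In this simulation the workers run one after another; on real hardware they would run concurrently on separate GPUs.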
 

The labs listed in the table of contents below walk you through the topology of a multi-GPU/multi-node environment, the main distributed model training strategies, and performance analysis.

### Table of Contents

The following content will be covered:

- Lab 1: [System Topology](jupyter_notebook/system-topology.ipynb)
- Lab 2: Distributed Training Strategy
    1. [Data Parallelism](jupyter_notebook/data-parallelism.ipynb)
    2. [Fully Sharded Data Parallelism (FSDP)](jupyter_notebook/fsdp.ipynb)
    3. [Model Parallelism](jupyter_notebook/model-parallelism.ipynb)
    4. [Pipeline Parallelism](jupyter_notebook/pipeline-parallelism.ipynb)
    5. [Tensor Parallelism](jupyter_notebook/tensor-parallelism.ipynb)
    6. [Message Passing and Mixed Precision](jupyter_notebook/other-topics.ipynb)
- Lab 3: Performance Overview
    1. [Profiling DDP with Nsight Systems](jupyter_notebook/nsys-introduction.ipynb)


### Tutorial Duration

The material will be presented over a total of 8 hours.

### Content Level

Advanced

### Target Audience and Prerequisites

The target audience for these labs is researchers, graduate students, and developers interested in scaling their model training across GPUs. Attendees are expected to have background knowledge of Python programming and the PyTorch framework.

---
## Licensing 

Copyright © 2025 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.