# Fully Sharded Data Parallelism (FSDP2)

## FSDP is Data Parallelism...

Discussions about FSDP can get confusing, because it implements a lot of different techniques (some of them quite advanced). But the first thing to know is that it is data parallelism, just as we saw in [the DDP notebook](4_Distributed_data_parallel.ipynb) and in the diagrams there:

![Overview of data parallelism](images/data-par-1.png)

So each instance is responsible for training the entire model on a separate batch of data; you need something like `DistributedSampler` in your data loader, and so on. It's data parallelism.
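To make the data-loading point concrete, here is a minimal sketch of how `DistributedSampler` splits a dataset across replicas. The toy dataset and the `world_size` of 4 are assumptions for illustration; passing `num_replicas` and `rank` explicitly lets the sketch run in a single process, whereas a real job would let each process pick them up from its initialized process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 16 samples; each sample is simply its own index.
dataset = TensorDataset(torch.arange(16))

world_size = 4  # assumed number of data-parallel replicas


def indices_for_rank(rank):
    # Giving num_replicas and rank explicitly means no process group is needed,
    # so we can inspect every rank's share from one process.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    return list(sampler)


per_rank = [indices_for_rank(r) for r in range(world_size)]
for rank, idx in enumerate(per_rank):
    print(f"rank {rank} sees sample indices {idx}")

# In a real job, each process builds only its own sampler and passes it to
# the DataLoader. Here we pretend to be rank 0.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
batches = [batch[0].tolist() for batch in loader]
```

Each rank gets a disjoint quarter of the indices, which is all the "separate batch of data" requirement amounts to. When shuffling is enabled, you would normally also call `sampler.set_epoch(epoch)` at the start of each epoch so that the shuffle differs between epochs.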

## ... and FSDP also uses Model Parallelism Techniques (Amongst Others) To Reduce Memory Usage

FSDP differs by implementing a number of techniques to reduce memory usage, so that **FSDP can work even if the entire model won't fit on a single GPU**.  

The signature method, sharding, means that each replica only persistently stores a shard of the entire model, and state is materialized in place only when needed:

![Sharded data parallelism](images/sharded-data-par-2.png)

As a result, each GPU can be training a replica of a model which is, in principle, significantly larger than the memory of the GPU.
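The "store a shard, materialize when needed" idea can be sketched in a single process. This is only a toy simulation, not how FSDP is implemented: the list `shards` stands in for what each of four hypothetical ranks would hold, `torch.cat` stands in for the all-gather collective, and splitting along dimension 0 is simply a convenient choice for the illustration.

```python
import torch

world_size = 4  # assumed number of replicas, each holding one shard
torch.manual_seed(0)

# The full ("unsharded") weight of a small linear layer: 8 outputs, 3 inputs.
full_weight = torch.randn(8, 3)

# Persistent state: each simulated rank keeps only a 2x3 slice of the weight.
# (torch.chunk returns views, so this toy saves no memory; in a real run each
# shard lives on a different GPU.)
shards = list(torch.chunk(full_weight, world_size, dim=0))

# Just before the layer is used, the shards are gathered into the full weight.
# In a real run this would be an all-gather across GPUs.
materialized = torch.cat(shards, dim=0)

x = torch.randn(5, 3)      # a batch of 5 inputs
out = x @ materialized.T   # the layer's forward computation uses the full weight

# Once the layer is done, the full weight is dropped; only the shard persists.
del materialized
```

The key point is that the computation itself still uses the whole weight; only the storage between uses is split up.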

My clumsy diagrams above probably make it look like it's only the model parameters which are sharded, but in fact GPU memory is required for parameters, gradients, and the potentially quite large optimizer state; all of those can be sharded (or at least not persisted):

![Diagram showing a memory-use graph demonstrating sharding of parameters, gradients, and optimizer state, from the ZeRO paper](images/ZeRO.png)

The figure above is from the paper "[ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)" which described and implemented these approaches; PyTorch's implementation of these methods is FSDP. The math to the side sketches out the memory requirements; if there are $x$ parameters, and we're using FP16 (2 bytes per parameter), the sizes of the different components are:

* Parameters - $2x$
* Gradients - $2x$
* Optimizer State (for, say, Adam) - $12x$
  * Parameter copy $4x$ (4 bytes for float32)
  * Momentum $4x$
  * Variance $4x$
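The arithmetic above can be turned into a small calculator. This is a sketch under the byte counts listed (2 + 2 + 12 bytes per parameter); the function name `per_gpu_bytes`, the 7.5 billion parameter model, and the 64 GPUs are assumptions chosen for illustration, and activations and temporary buffers are ignored.

```python
def per_gpu_bytes(num_params, num_gpus, shard_optimizer=False,
                  shard_gradients=False, shard_parameters=False):
    """Per-GPU bytes for model state, using the byte counts listed above."""
    params = 2 * num_params   # FP16 parameters
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 parameter copy + momentum + variance
    if shard_optimizer:
        optim /= num_gpus
    if shard_gradients:
        grads /= num_gpus
    if shard_parameters:
        params /= num_gpus
    return params + grads + optim


x = 7.5e9   # assumed model size: 7.5 billion parameters
n = 64      # assumed number of GPUs
GB = 1e9

no_sharding = per_gpu_bytes(x, n)
optimizer_only = per_gpu_bytes(x, n, shard_optimizer=True)
optimizer_and_grads = per_gpu_bytes(x, n, shard_optimizer=True, shard_gradients=True)
everything = per_gpu_bytes(x, n, shard_optimizer=True, shard_gradients=True,
                           shard_parameters=True)

for label, value in [
    ("no sharding", no_sharding),
    ("optimizer state sharded", optimizer_only),
    ("optimizer state + gradients sharded", optimizer_and_grads),
    ("everything sharded", everything),
]:
    print(f"{label:40s} {value / GB:7.1f} GB per GPU")
```

With these assumed numbers, the unsharded model state needs 120 GB per GPU, while sharding everything brings it down to under 2 GB per GPU, which is why the optimizer state is usually the first thing worth sharding.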

FSDP uses [DTensors](https://docs.pytorch.org/docs/stable/distributed.tensor.html) and [Device Meshes](https://docs.pytorch.org/docs/stable/distributed.html#torch.distributed.device_mesh.DeviceMesh), from the tensor parallelism framework, to handle the sharding. This isn't tensor parallelism, though; computation isn't parallelized over pieces of tensors. Each of the replicas trains the entire model over its subset of batches; it's the _persistent storage_ of shards of tensors which is distributed.


## The FSDP2 workflow



![FSDP combines tensor and pipeline parallelism; from the FSDP paper](images/fsdp.png)