<a href="https://colab.research.google.com/github/learn2Pro/rl_learning/blob/master/llm/data_parallelism.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# summary

- Data parallelism vs. model parallelism
  - Data parallelism: the model is copied to each device, and the data (the batch) is split into chunks
    - the module is replicated on each device, and each replica handles a portion of the input
    - during the backward pass, the gradients from each replica are summed into the original module
  - Model parallelism: the data is copied to each device, and the model is split into chunks (used when the model does not fit on a single GPU)
- DP => DDP
  - DP: nn.DataParallel
    - https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html
  - DDP: nn.parallel.DistributedDataParallel
  - The PyTorch documentation recommends nn.parallel.DistributedDataParallel over multiprocessing or nn.DataParallel for multi-GPU training.
- References
  - https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
  - https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html
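
The data-parallel idea summarized above (replicate the model, split the batch, sum the gradients) can be checked on a CPU without any GPU. The sketch below is only an illustration, not how nn.DataParallel is implemented: `num_replicas` is a hypothetical device count, `copy.deepcopy` stands in for replication across devices, and a sum-reduced loss is assumed so that the per-chunk gradients add up exactly to the full-batch gradient.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(5, 2)
batch = torch.randn(30, 5)

# Simulated data parallelism: each replica sees one chunk of the batch.
num_replicas = 2  # hypothetical device count, chosen for illustration
chunks = torch.chunk(batch, num_replicas, dim=0)
replicas = [copy.deepcopy(model) for _ in chunks]
for replica, chunk in zip(replicas, chunks):
    replica.zero_grad()
    replica(chunk).sum().backward()

# Reference: gradient of the same sum-reduced loss on the full batch.
model.zero_grad()
model(batch).sum().backward()
full_grad = model.weight.grad.clone()

# Summing the per-replica gradients reproduces the full-batch gradient.
summed_grad = sum(r.weight.grad for r in replicas)
grads_match = torch.allclose(full_grad, summed_grad, atol=1e-5)
print([c.shape[0] for c in chunks], grads_match)
```

With a mean-reduced loss the per-chunk gradients would have to be weighted by chunk size before they match the full-batch gradient, which is why the sketch uses a sum.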

# imports and parameters

In [1]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [3]:
class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        # random data of shape (length, size) = (100, 5)
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        # one sample of shape (size,) = (5,)
        return self.data[index]

    def __len__(self):
        # number of samples = 100
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size,
                         shuffle=True)

In [4]:
next(iter(rand_loader)).shape

torch.Size([30, 5])

## simple model

In [6]:
class Model(nn.Module):
  # a single linear layer: input_size -> output_size
  def __init__(self, input_size, output_size):
    super(Model, self).__init__()
    self.fc = nn.Linear(input_size, output_size)

  def forward(self, input):
    output = self.fc(input)
    # print the sizes seen inside the model (per replica under DataParallel)
    print(f"\tIn Model: input size:{input.size()}, output size:{output.size()}")
    return output

# DataParallel
- https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html
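  - device_ids=None
    - which GPUs take part in training, e.g. device_ids=gpus
  - output_device=None
    - the GPU on which the outputs are gathered (defaults to device_ids[0]), e.g. output_device=gpus[0]
  - dim=0
    - the dimension along which the input is split (the batch dimension)
- The parallelized module must have its parameters and buffers on device_ids[0] before running (forward/backward) this DataParallel module.
  - model.to('cuda:0')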
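
As a minimal sketch of the arguments listed above, the snippet below passes `device_ids`, `output_device` and `dim` explicitly. The `gpus` list and the `nn.Linear` stand-in model are assumptions made for illustration. On a machine without CUDA the wrapper is skipped so the code still runs.

```python
import torch
import torch.nn as nn

model = nn.Linear(5, 2)  # stand-in for the model being wrapped

if torch.cuda.is_available():
    gpus = list(range(torch.cuda.device_count()))  # all visible GPUs
    # Parameters and buffers must be on device_ids[0] before the first forward.
    model = model.to(f"cuda:{gpus[0]}")
    model = nn.DataParallel(model, device_ids=gpus, output_device=gpus[0], dim=0)
    batch = torch.randn(30, 5, device=f"cuda:{gpus[0]}")
else:
    # No GPU available: keep the plain module so the sketch still runs.
    batch = torch.randn(30, 5)

output = model(batch)
print(output.shape, output.device)
```

Leaving `device_ids` and `output_device` at `None` selects all visible GPUs and gathers on `device_ids[0]`, so the explicit form is mainly useful when only a subset of GPUs should be used.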

In [8]:
input_size,output_size

(5, 2)

In [12]:
# (5, 2)
model = Model(input_size, output_size)
if torch.cuda.device_count()>0:
  print(f"Let's use {torch.cuda.device_count()} GPUS!")
  model = nn.DataParallel(model)

Let's use 1 GPUS!


In [13]:
model.to(device)

DataParallel(
  (module): Model(
    (fc): Linear(in_features=5, out_features=2, bias=True)
  )
)

In [18]:
a = torch.randn((3,4))
print(f'a.is_cuda {a.is_cuda}')
b = a.to(device)
print(f'a.is_cuda {a.is_cuda}')
print(f'b.is_cuda {b.is_cuda}')

a.is_cuda False
a.is_cuda False
b.is_cuda True


## model to device

In [20]:
a = Model(3,4)
print(next(a.parameters()).is_cuda)
b = a.to(device)
print(next(a.parameters()).is_cuda)
print(next(b.parameters()).is_cuda)

False
True
True
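
The two experiments above differ because `Tensor.to` returns a new tensor when a device change is needed, while `nn.Module.to` moves the module's parameters in place and returns the module itself. A small sketch of that difference follows; the variable names are chosen for illustration only.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Module.to modifies the module in place and returns the same object.
net = nn.Linear(3, 4)
moved = net.to(device)
same_object = moved is net
print(f'module returned by .to() is the original: {same_object}')

# Tensor.to leaves the original tensor where it is; the result must be kept.
t = torch.randn(3, 4)
t_moved = t.to(device)
print(f't.device {t.device}, t_moved.device {t_moved.device}')
```

In practice this means `model.to(device)` can be called without reassignment, whereas tensors need `x = x.to(device)`. When the tensor is already on the target device, `Tensor.to` returns the tensor itself rather than a copy.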


## run the model (forward)

In [21]:
for data in rand_loader:
  input = data.to(device)
  output = model(input)
  print(f'Outside: input size {input.size()} output size {output.size()}')

	In Model: input size:torch.Size([30, 5]), output size:torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output size torch.Size([30, 2])
	In Model: input size:torch.Size([30, 5]), output size:torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output size torch.Size([30, 2])
	In Model: input size:torch.Size([30, 5]), output size:torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output size torch.Size([30, 2])
	In Model: input size:torch.Size([10, 5]), output size:torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output size torch.Size([10, 2])
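
With a single GPU the model sees the whole batch, so the sizes inside and outside the model are identical, and the last batch has 10 samples because 100 = 3 * 30 + 10. The sketch below estimates what each replica would see with more GPUs. It assumes that DataParallel splits the batch the way `torch.chunk` does along dim 0; the helper `chunk_sizes` and the GPU counts are hypothetical and used only for illustration.

```python
import math

import torch

data_size, batch_size = 100, 30

# Number of batches per epoch and the size of the final, smaller batch.
num_batches = math.ceil(data_size / batch_size)
last_batch = data_size - (num_batches - 1) * batch_size


def chunk_sizes(n, num_gpus):
    """Per-replica batch sizes if a batch of n rows is chunked along dim 0."""
    return [c.shape[0] for c in torch.chunk(torch.empty(n, 5), num_gpus, dim=0)]


print(f'batches per epoch: {num_batches}, last batch size: {last_batch}')
for num_gpus in (1, 2, 3):
    print(num_gpus, chunk_sizes(batch_size, num_gpus), chunk_sizes(last_batch, num_gpus))
```

Under this assumption the "In Model" line would be printed once per replica with the chunk size, while the "outside" line would still report the full batch size after the outputs are gathered.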
