Distributed Training

Description

A pre-trained transformer model (vit_base_patch32_224) from timm was trained on the CIFAR10 dataset using 4 nodes, each with a single NVIDIA T4 GPU, and Distributed Data Parallel (DDP). Refer to this link to understand how gradients are applied in the backward pass, and read about the all-reduce algorithm here.
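For reference, the DDP pattern looks roughly like the sketch below: every process holds a full model replica, and autograd hooks all-reduce (average) the gradients during backward() so all replicas take the same optimizer step. The model name and per-node batch size mirror this experiment, but everything else (learning rate, data path, loader setup) is illustrative and is not taken from this repo's src/train.py.

import os
import torch
import torch.distributed as dist
import timm
import torchvision
import torchvision.transforms as T
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    # One process per GPU (e.g. launched via torchrun); RANK, LOCAL_RANK,
    # WORLD_SIZE, MASTER_ADDR and MASTER_PORT come from the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    transform = T.Compose([T.Resize(224), T.ToTensor()])
    dataset = torchvision.datasets.CIFAR10("data/", train=True, download=True, transform=transform)
    # DistributedSampler gives every rank a disjoint shard of the dataset.
    loader = DataLoader(dataset, batch_size=320, sampler=DistributedSampler(dataset))

    model = timm.create_model("vit_base_patch32_224", pretrained=True, num_classes=10).cuda()
    # DDP registers autograd hooks: during backward(), gradient buckets are
    # all-reduced (summed, then averaged) across all ranks, so every replica
    # steps with identical gradients.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative lr

    for images, labels in loader:
        images, labels = images.cuda(), labels.cuda()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()          # gradient all-reduce happens inside this call
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    main()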

How to run

# On master node
MASTER_PORT=29500 MASTER_ADDR=172.31.10.234 WORLD_SIZE=4 NODE_RANK=0 python src/train.py experiment=cifar10_ddp

# On worker node
MASTER_PORT=29500 MASTER_ADDR=172.31.10.234 WORLD_SIZE=4 NODE_RANK=1 python src/train.py experiment=cifar10_ddp

# MASTER_PORT - required; must be a free port on the machine with NODE_RANK 0
# MASTER_ADDR - required (except for NODE_RANK 0); address of the NODE_RANK 0 node
# WORLD_SIZE  - required; number of nodes in the cluster
# NODE_RANK   - required; ID of this node in the cluster
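These environment variables map onto the multi-node settings of the Lightning Trainer that src/train.py presumably builds from the cifar10_ddp Hydra experiment. The snippet below is only a hedged sketch of the equivalent Python call, not the repo's actual configuration:

from pytorch_lightning import Trainer

# Assumed equivalent of the cifar10_ddp trainer settings:
# one GPU per node, four nodes, plain DDP.
trainer = Trainer(
    accelerator="gpu",
    devices=1,       # single T4 per node
    num_nodes=4,     # must match WORLD_SIZE
    strategy="ddp",
    max_epochs=25,
)
# Lightning reads MASTER_ADDR, MASTER_PORT and NODE_RANK from the environment
# to rendezvous the per-node processes into one process group.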

Results

The best model checkpoint for the above training can be found here.

  • Model: vit_base_patch32_224
  • Dataset: CIFAR10
  • Epochs: 25
  • Effective batch size: 1280 (= 320 per-GPU batch × 4 nodes × 1 GPU per node)
  • Test Accuracy: 90.13 (link)
  • Test Loss: 0.36529
  • Training time: 1 hour 22 minutes

[Screenshots: forwarding, GPU utilization]

Sharded Training using FSDP

Sharded training still uses data-parallel training under the hood, except that optimizer states and gradients are sharded across GPUs. This lowers the memory overhead per GPU, as each GPU only has to maintain a partition of the optimizer state and gradients. Refer to the link.
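Below is a minimal sketch of the same idea using PyTorch's native FullyShardedDataParallel wrapper (illustrative only, assuming a recent torch version; this is not the repo's actual cifar10_fsdp code): each rank keeps only a shard of the parameters, gradients and optimizer state, and full parameters are gathered just-in-time around each forward/backward pass.

import os
import torch
import torch.distributed as dist
import timm
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")  # same MASTER_ADDR/MASTER_PORT rendezvous as DDP
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = timm.create_model("vit_base_patch32_224", pretrained=True, num_classes=10)
# Unlike DDP, parameters, gradients and optimizer state are sharded across ranks;
# full parameters are materialized only around each forward/backward pass and
# freed again afterwards, trading extra communication for lower peak memory per GPU.
model = FSDP(model, device_id=local_rank)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # built after wrapping, so it sees sharded params

With Lightning, the same launch commands apply; only the trainer strategy changes from plain DDP to an FSDP-based one, which is presumably what the cifar10_fsdp experiment configures.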

How to run

# On master node
MASTER_PORT=29500 MASTER_ADDR=172.31.10.234 WORLD_SIZE=4 NODE_RANK=0 python src/train.py experiment=cifar10_fsdp

# On worker node
MASTER_PORT=29500 MASTER_ADDR=172.31.10.234 WORLD_SIZE=4 NODE_RANK=1 python src/train.py experiment=cifar10_fsdp

Results

The best model checkpoint for the above training can be found here.

  • Model: vit_base_patch32_224
  • Dataset: CIFAR10
  • Epochs: 25
  • Effective batch size: 1280 (= 320 per-GPU batch × 4 nodes × 1 GPU per node)
  • Test Accuracy: 83.6799 (link)
  • Test Loss: 0.52269
  • Training time: 35 minutes

[Screenshots: forwarding, GPU utilization]

Training over the internet using Hivemind

Hivemind provides decentralized training over the internet. Refer to the link.
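For context, here is a hedged sketch of the hivemind primitives involved (the run_id and peer bootstrapping are placeholders, not this repo's actual values): one peer starts a DHT, other peers join it via its multiaddresses, and hivemind.Optimizer averages updates across peers once a global target batch size has been reached.

import torch
import timm
import hivemind

model = timm.create_model("vit_base_patch32_224", pretrained=True, num_classes=10)
base_opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative lr

# The first peer starts a DHT; other peers join via its multiaddresses.
dht = hivemind.DHT(start=True)  # on other nodes: hivemind.DHT(initial_peers=[...], start=True)
print(dht.get_visible_maddrs())  # share these addresses with the other peers

# hivemind.Optimizer accumulates gradients across peers over the internet and
# averages them once target_batch_size samples have been processed globally.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="cifar10_hivemind",   # placeholder experiment id
    batch_size_per_step=320,
    target_batch_size=640,
    optimizer=base_opt,
    use_local_updates=False,
    verbose=True,
)
# opt is then used like a regular optimizer in the training loop (opt.step(), opt.zero_grad()).

Within Lightning this is typically exposed through a Hivemind strategy, which is presumably what the cifar10_hivemind experiment configures.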

How to run

# On master node
python src/train.py experiment=cifar10_hivemind

# On any other node, provide the initial peers as an environment variable
# (or pass them during instantiation) and run the same command

Results

The best model checkpoint for the above training can be found here.

  • Model: vit_base_patch32_224
  • Dataset: CIFAR10
  • Epochs: 25
  • Effective batch size: 640 (= 320 per-GPU batch × 2 nodes × 1 GPU per node)
  • Test Accuracy: 95.4299 (link)
  • Test Loss: 0.22328
  • Training time: 3 hours 50 minutes
  • TensorBoard link: click here