# pytorch-transformer-distributed

Distributed training of an attention model. Forked from: [hkproj/pytorch-transformer](https://github.com/hkproj/pytorch-transformer)

## Instructions for Paperspace

### Machines

1. 1 private network: assign both machines to it when you create them
2. 2 nodes of type `P4000x2` (multi-GPU) with `ML-in-a-Box` as the operating system
3. 1 network drive (250 GB)

### Setup

1. `sudo apt-get update`
2. `sudo apt-get install net-tools`
3. If you get an error about `seahorse` while installing `net-tools`, do the following:
    1. `sudo rm /var/lib/dpkg/info/seahorse.list`
    2. `sudo apt-get install seahorse --reinstall`
4. Get each machine's private IP address using `ifconfig`
5. Add the IP address and hostname of each slave node to the `/etc/hosts` file of the master node (see the example entry after this list)
6. Mount the network drive
    1. `sudo apt-get install smbclient`
    2. `sudo apt-get install cifs-utils`
    3. `sudo mkdir /mnt/training-data`
    4. Replace the following values in the command below:
        1. `NETWORK_DRIVE_IP` with the IP address of the network drive
        2. `NETWORK_SHARE_NAME` with the name of the network share
        3. `NETWORK_DRIVE_USERNAME` with the username of the network drive
    5. `sudo mount -t cifs //NETWORK_DRIVE_IP/NETWORK_SHARE_NAME /mnt/training-data -o uid=1000,gid=1000,rw,user,username=NETWORK_DRIVE_USERNAME`
        1. Type the drive's password when prompted
7. `git clone https://github.com/hkproj/pytorch-transformer-distributed`
8. `cd pytorch-transformer-distributed`
9. `pip install -r requirements.txt`
10. Log in to Weights & Biases
    1. `wandb login`
    2. Copy the API key from the browser and paste it into the terminal
11. Run the appropriate training command from the sections below
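
An example `/etc/hosts` entry on the master node, assuming a slave node with hostname `slave-node-1` and private IP `10.0.0.3` (both values are hypothetical placeholders; use the addresses reported by `ifconfig`):

```
10.0.0.3    slave-node-1
```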

### Local training

`torchrun --nproc_per_node=2 --nnodes=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"`
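
`torchrun` launches one `train.py` process per GPU (`--nproc_per_node=2`) and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each process. As a rough sketch of the setup such a script performs (illustrative only, not the exact code in `train.py`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets these environment variables for every process it spawns
local_rank = int(os.environ["LOCAL_RANK"])

# Join the process group; NCCL is the usual backend for CUDA GPUs
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

# Stand-in model: the real script builds the transformer instead
model = torch.nn.Linear(512, 512).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# ... training loop goes here; each process works on its own shard of the data,
# typically served through torch.utils.data.distributed.DistributedSampler

dist.destroy_process_group()
```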

### Distributed training

Run the following command on each machine (replace `IP_ADDR_MASTER_NODE` with the IP address of the master node):

`torchrun --nproc_per_node=2 --nnodes=2 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=IP_ADDR_MASTER_NODE:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"`
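
The rendezvous happens at `IP_ADDR_MASTER_NODE:48123`, so that port must be reachable from the other machine over the private network. If startup hangs, a quick reachability check (assuming `netcat` is installed) is `nc -zv IP_ADDR_MASTER_NODE 48123`.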

### Monitoring

Log in to Weights & Biases to monitor the training progress: https://app.wandb.ai/