
Commit b6421d5: added instructions for paperspace (1 parent: b82c2b6)

README.md: 49 additions, 0 deletions

# pytorch-transformer-distributed

Distributed training of an attention model. Forked from: [hkproj/pytorch-transformer](https://github.com/hkproj/pytorch-transformer)

## Instructions for Paperspace

### Machines

1. 1 private network; assign both machines to it when creating them
2. 2x nodes of `P4000x2` (multi-GPU) with `ML-in-a-Box` as the operating system
3. 1 network drive (250 GB)

### Setup

1. `sudo apt-get update`
2. `sudo apt-get install net-tools`
3. If you get an error about `seahorse` while installing `net-tools`, do the following:
    1. `sudo rm /var/lib/dpkg/info/seahorse.list`
    2. `sudo apt-get install --reinstall seahorse`
4. Get each machine's private IP address using `ifconfig`
5. Add an IP and hostname mapping for every slave node to the `/etc/hosts` file of the master node (see the example after this list)
6. Mount the network drive:
    1. `sudo apt-get install smbclient`
    2. `sudo apt-get install cifs-utils`
    3. `sudo mkdir /mnt/training-data`
    4. Replace the following values in the command below:
        1. `NETWORK_DRIVE_IP` with the IP address of the network drive
        2. `NETWORK_SHARE_NAME` with the name of the network share
        3. `NETWORK_DRIVE_USERNAME` with the username of the network drive
    5. `sudo mount -t cifs //NETWORK_DRIVE_IP/NETWORK_SHARE_NAME /mnt/training-data -o uid=1000,gid=1000,rw,user,username=NETWORK_DRIVE_USERNAME`
        1. Type the drive's password when prompted
7. `git clone https://github.com/hkproj/pytorch-transformer-distributed`
8. `cd pytorch-transformer-distributed`
9. `pip install -r requirements.txt`
10. Log in to Weights & Biases:
    1. `wandb login`
    2. Copy the API key from the browser and paste it into the terminal
11. Run the appropriate training command from the sections below
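
For step 5, assuming a hypothetical slave node with private IP `10.0.0.3` and hostname `ml-worker-1` (substitute the values obtained from `ifconfig` in step 4), the mapping can be appended like this:

```bash
# Append a hypothetical slave-node mapping to the master's /etc/hosts
echo "10.0.0.3  ml-worker-1" | sudo tee -a /etc/hosts

# Optional sanity checks: the hostname resolves (step 5) and the
# network drive is mounted (step 6)
ping -c 1 ml-worker-1
df -h /mnt/training-data
```
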
### Local training
`torchrun --nproc_per_node=2 --nnodes=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"`
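
On recent PyTorch versions, `torchrun` also offers a `--standalone` flag for single-node runs, which sets up the local rendezvous automatically so no `--rdzv_endpoint` has to be given; a sketch equivalent to the command above:

```bash
# Single-node shortcut: --standalone starts a local c10d rendezvous,
# so the endpoint does not need to be specified by hand
torchrun --standalone --nproc_per_node=2 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"
```
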
### Distributed training
Run the following command on each machine (replace `IP_ADDR_MASTER_NODE` with the IP address of the master node):
`torchrun --nproc_per_node=2 --nnodes=2 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=IP_ADDR_MASTER_NODE:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"`
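
If the rendezvous hangs, it can help to first verify that each slave node can reach the master's rendezvous port (48123 in the command above). A quick check with netcat, assuming it is available (install it with `sudo apt-get install netcat` otherwise):

```bash
# Run from each slave node: a "succeeded" / "open" result means the
# master's c10d rendezvous port is reachable over the private network
nc -zv IP_ADDR_MASTER_NODE 48123
```
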
### Monitoring
Log in to Weights & Biases to monitor the training progress: https://app.wandb.ai/
