Run PaddlePaddle with virtualenv in a SLURM cluster

  1. Log in to the dedicated cluster machine.

  2. Create a new virtualenv environment (skip this step if one already exists).

    virtualenv ~/paddle-train
  3. Enter the environment (use deactivate to exit the virtualenv).

    source ~/paddle-train/bin/activate
  4. Install PaddlePaddle and its dependencies into the virtualenv (skip this step if they are already installed).

    • The cluster uses a mirrored PyPI index that is missing some of the packages PaddlePaddle depends on, so we download those manually.
    • pip checks the architecture of a *.whl file by its filename. It is unclear why the cluster does not accept the manylinux platform tag, so we rename the *.whl files to work around this; the same trick is sketched as a loop after the commands below.
    wget https://pypi.python.org/packages/b2/30/ab593c6ae73b45a5ef0b0af24908e8aec27f79efcda2e64a3df7af0b92a2/protobuf-3.1.0-py2.py3-none-any.whl#md5=f02742e46128f1e0655b44c33d8c9718
    pip install protobuf-3.1.0-py2.py3-none-any.whl
    wget https://pypi.python.org/packages/cc/87/76e691bbf1759ad6af5831649aae6a8d2fa184a1bcc71018ca6300399e5f/nltk-3.2.5.tar.gz#md5=73a33f58da26a18e8d40ef630a40b599
    pip install nltk-3.2.5.tar.gz
    wget https://pypi.python.org/packages/8b/e7/229a428b8eb9a7f925ef16ff09ab25856efe789410d661f10157919f2ae2/requests-2.9.2-py2.py3-none-any.whl#md5=afecc76f13f3ae5e5dab18ae64c73c84
    pip install requests-2.9.2-py2.py3-none-any.whl
    wget https://pypi.python.org/packages/eb/7e/27b3b9e26cb64e081799546a756059baf285eb886a771e9d26743876ccbb/scipy-0.19.0-cp27-cp27mu-manylinux1_x86_64.whl#md5=adfa1f5127a789165dfe9ff140ec0d6e
    mv scipy-0.19.0-cp27-cp27mu-manylinux1_x86_64.whl scipy-0.19.0-cp27-none-any.whl
    pip install scipy-0.19.0-cp27-none-any.whl
    wget https://pypi.python.org/packages/d6/82/98063eed7cb9c169c24831539fcf286799368cd89f6fd46d4de0430d1fce/recordio-0.1.4-cp27-cp27mu-manylinux1_x86_64.whl#md5=d2ced7eb6e6215fe1972891f10a0b5cb
    mv recordio-0.1.4-cp27-cp27mu-manylinux1_x86_64.whl recordio-0.1.4-cp27-none-any.whl  
    pip install recordio-0.1.4-cp27-none-any.whl 
    wget https://pypi.python.org/packages/5f/d2/9fa0201944933afd6d059f1e32aa6bdb203b23ab62fc823d3adf36295b9a/numpy-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl#md5=de272621d41b7856e1580307be9d1fba
    mv numpy-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl numpy-1.13.1-cp27-none-any.whl
    pip install numpy-1.13.1-cp27-none-any.whl
    wget https://pypi.python.org/packages/96/90/4e8328119e5fed3145c737beba63567d5557b1a20ffad453391aba95fbe4/opencv_python-3.3.0.10-cp27-cp27mu-manylinux1_x86_64.whl#md5=8c9d2f8bb89f5000142042c303779f82
    mv opencv_python-3.3.0.10-cp27-cp27mu-manylinux1_x86_64.whl opencv_python-3.3.0.10-cp27-none-any.whl 
    pip install opencv_python-3.3.0.10-cp27-none-any.whl
    wget https://pypi.python.org/packages/af/0c/dbe68bb52de57432dfd857ac089be1f5652783859750adfbb0c301ec59d4/paddlepaddle_gpu-0.10.5-cp27-cp27mu-manylinux1_x86_64.whl#md5=3925d29f42c43924f67beaf50fb1dde6
    mv paddlepaddle_gpu-0.10.5-cp27-cp27mu-manylinux1_x86_64.whl  paddlepaddle_gpu-0.10.5-cp27-none-any.whl
    pip install paddlepaddle_gpu-0.10.5-cp27-none-any.whl
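
    Since the same download/rename/install pattern repeats for every manylinux wheel, it can also be written as a loop. This is only a sketch under the assumption stated above (the wheels' binaries do run on the cluster; only pip's filename check is in the way); fill in the manylinux wheel URLs from the list above.

    set -e
    wheels=(
        # the scipy, recordio, numpy, opencv and paddlepaddle_gpu URLs above
    )
    for url in "${wheels[@]}"; do
        wget "$url"                          # wget saves the file without the #md5= fragment
        whl=$(basename "${url%%#*}")         # local wheel filename
        fixed=${whl/cp27mu-manylinux1_x86_64/none-any}
        mv "$whl" "$fixed"                   # rewrite the platform tag pip checks
        pip install "$fixed"
    done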
  5. Download and run the test script.

    wget https://raw.githubusercontent.com/PaddlePaddle/book/develop/01.fit_a_line/train.py
    LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH WITH_GPU=1 python train.py
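
    Because the CUDA/cuDNN paths have to be set on every run, it may be convenient to wrap the invocation in a small script. A minimal sketch (run_gpu.sh is a made-up name; the /tools paths are the ones used above):

    #!/bin/bash
    # Usage: ./run_gpu.sh train.py
    # Puts the cluster's CUDA 8.0 and cuDNN 5.0 libraries on the loader
    # path, enables GPU mode, and runs the given Python script.
    export LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH
    export WITH_GPU=1
    exec python "$@"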

Run in the SLURM cluster

Check Available Resources

helin@svail-3:~$ sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
K40x4_Paddle        up 8-04:00:00      4    mix svail-[4-7]

From TIMELIMIT, we can see that an allocation will be released after 8 days and 4 hours. Make sure to save your model if your training runs longer than that limit! A checkpointing sketch follows.
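
The book's train.py uses the paddle.v2 trainer API, so one way to guard against the time limit is to save parameters at the end of every pass. The handler below is a sketch modeled on the book examples, not part of train.py itself; it assumes `parameters` is the object train.py creates with paddle.parameters.create(cost).

    import paddle.v2 as paddle

    def event_handler(event):
        # Checkpoint after every pass, so at most one pass of work is
        # lost when SLURM revokes the allocation at the time limit.
        if isinstance(event, paddle.event.EndPass):
            with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
                parameters.to_tar(f)

    # Hook it up when training:
    # trainer.train(reader=..., num_passes=..., event_handler=event_handler)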

Run

  1. ssh to the interactive node.

    $ ssh helin@svail-3.xxx
    
  2. Allocate the computing nodes (from the interactive node).

    The argument to -N is the number of nodes; we can allocate at most 4 nodes. (A batch-script alternative is sketched below.)

    helin@svail-3:~$ screen # Start a screen session so the allocation is not released accidentally if the ssh connection breaks.
    helin@svail-3:~$ salloc --partition=K40x4_Paddle --reservation=Paddle-Paddle -N 1
    salloc: Granted job allocation 1031619
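
    For non-interactive runs, the same allocation can be submitted as a batch job, which survives disconnects without screen. A sketch, assuming the partition and reservation above (submit with sbatch train.slurm; train.slurm is a made-up name):

    #!/bin/bash
    #SBATCH --partition=K40x4_Paddle
    #SBATCH --reservation=Paddle-Paddle
    #SBATCH -N 1
    export LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH
    export WITH_GPU=1
    srun python train.py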
    
  3. Check which computing nodes have been allocated.

    helin@svail-3:~$ squeue|grep helin
               1031624 K40x4_Pad     bash    helin  R       0:07      1 svail-4
    
  4. ssh to the computing node and do the work (an srun alternative is sketched below).

    $ ssh helin@svail-4.xxx
    helin@svail-4:~$ wget https://raw.githubusercontent.com/PaddlePaddle/book/develop/01.fit_a_line/train.py
    helin@svail-4:~$ LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH WITH_GPU=1 python train.py
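
    Alternatively, srun can launch the command on the allocated node directly from the salloc shell on the interactive node; srun propagates the current environment by default. A sketch under the same allocation:

    helin@svail-3:~$ export LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH
    helin@svail-3:~$ export WITH_GPU=1
    helin@svail-3:~$ srun python train.py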
    
  5. Release the allocation (from the interactive node, not the computing node).

    helin@svail-3:~$ exit
    salloc: Relinquishing job allocation 1031619
    salloc: Job allocation 1031619 has been revoked.
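
If the salloc shell is no longer reachable (for example, the screen session died), the allocation can also be released by job ID with scancel:

    helin@svail-3:~$ scancel 1031619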