Run PaddlePaddle with virtualenv in a SLURM cluster

  1. Log in to the dedicated cluster machine.

  2. Create a new virtualenv environment (skip this step if one already exists).

    virtualenv ~/paddle-train
  3. Enter the environment (use deactivate to exit the virtualenv).

    source ~/paddle-train/bin/activate
  4. Install PaddlePaddle and its dependencies into the virtualenv (skip this step if they are already installed).

    • The cluster uses a mirrored PyPI index that is missing some of the packages PaddlePaddle depends on, so we download those manually.
    • pip checks the architecture of a *.whl file by its filename. It is unclear why the cluster does not accept the manylinux platform tag, so we rename the *.whl files to work around this; the same trick is sketched as a loop after the commands below.
    wget https://pypi.python.org/packages/b2/30/ab593c6ae73b45a5ef0b0af24908e8aec27f79efcda2e64a3df7af0b92a2/protobuf-3.1.0-py2.py3-none-any.whl#md5=f02742e46128f1e0655b44c33d8c9718
    pip install protobuf-3.1.0-py2.py3-none-any.whl
    wget https://pypi.python.org/packages/cc/87/76e691bbf1759ad6af5831649aae6a8d2fa184a1bcc71018ca6300399e5f/nltk-3.2.5.tar.gz#md5=73a33f58da26a18e8d40ef630a40b599
    pip install nltk-3.2.5.tar.gz
    wget https://pypi.python.org/packages/8b/e7/229a428b8eb9a7f925ef16ff09ab25856efe789410d661f10157919f2ae2/requests-2.9.2-py2.py3-none-any.whl#md5=afecc76f13f3ae5e5dab18ae64c73c84
    pip install requests-2.9.2-py2.py3-none-any.whl
    wget https://pypi.python.org/packages/eb/7e/27b3b9e26cb64e081799546a756059baf285eb886a771e9d26743876ccbb/scipy-0.19.0-cp27-cp27mu-manylinux1_x86_64.whl#md5=adfa1f5127a789165dfe9ff140ec0d6e
    mv scipy-0.19.0-cp27-cp27mu-manylinux1_x86_64.whl scipy-0.19.0-cp27-none-any.whl
    pip install scipy-0.19.0-cp27-none-any.whl
    wget https://pypi.python.org/packages/d6/82/98063eed7cb9c169c24831539fcf286799368cd89f6fd46d4de0430d1fce/recordio-0.1.4-cp27-cp27mu-manylinux1_x86_64.whl#md5=d2ced7eb6e6215fe1972891f10a0b5cb
    mv recordio-0.1.4-cp27-cp27mu-manylinux1_x86_64.whl recordio-0.1.4-cp27-none-any.whl  
    pip install recordio-0.1.4-cp27-none-any.whl 
    wget https://pypi.python.org/packages/5f/d2/9fa0201944933afd6d059f1e32aa6bdb203b23ab62fc823d3adf36295b9a/numpy-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl#md5=de272621d41b7856e1580307be9d1fba
    mv numpy-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl numpy-1.13.1-cp27-none-any.whl
    pip install numpy-1.13.1-cp27-none-any.whl
    wget https://pypi.python.org/packages/96/90/4e8328119e5fed3145c737beba63567d5557b1a20ffad453391aba95fbe4/opencv_python-3.3.0.10-cp27-cp27mu-manylinux1_x86_64.whl#md5=8c9d2f8bb89f5000142042c303779f82
    mv opencv_python-3.3.0.10-cp27-cp27mu-manylinux1_x86_64.whl opencv_python-3.3.0.10-cp27-none-any.whl 
    pip install opencv_python-3.3.0.10-cp27-none-any.whl
    wget https://pypi.python.org/packages/af/0c/dbe68bb52de57432dfd857ac089be1f5652783859750adfbb0c301ec59d4/paddlepaddle_gpu-0.10.5-cp27-cp27mu-manylinux1_x86_64.whl#md5=3925d29f42c43924f67beaf50fb1dde6
    mv paddlepaddle_gpu-0.10.5-cp27-cp27mu-manylinux1_x86_64.whl  paddlepaddle_gpu-0.10.5-cp27-none-any.whl
    pip install paddlepaddle_gpu-0.10.5-cp27-none-any.whl
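
    Since the same download/rename/install pattern repeats for every manylinux wheel, it can also be written as a loop. This is only a sketch under the assumption stated above (the wheels' binaries do run on the cluster; only pip's filename check is in the way); fill in the manylinux wheel URLs from the list above.

    set -e
    wheels=(
        # the scipy, recordio, numpy, opencv and paddlepaddle_gpu URLs above
    )
    for url in "${wheels[@]}"; do
        wget "$url"                          # wget saves the file without the #md5= fragment
        whl=$(basename "${url%%#*}")         # local wheel filename
        fixed=${whl/cp27mu-manylinux1_x86_64/none-any}
        mv "$whl" "$fixed"                   # rewrite the platform tag pip checks
        pip install "$fixed"
    done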
  5. Download and run the test script.

    wget https://raw.githubusercontent.com/PaddlePaddle/book/develop/01.fit_a_line/train.py
    LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH WITH_GPU=1 python train.py
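
    Because the CUDA/cuDNN paths have to be set on every run, it may be convenient to wrap the invocation in a small script. A minimal sketch (run_gpu.sh is a made-up name; the /tools paths are the ones used above):

    #!/bin/bash
    # Usage: ./run_gpu.sh train.py
    # Puts the cluster's CUDA 8.0 and cuDNN 5.0 libraries on the loader
    # path, enables GPU mode, and runs the given Python script.
    export LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH
    export WITH_GPU=1
    exec python "$@"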

Run in the SLURM cluster

Check Available Resources

helin@svail-3:~$ sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
K40x4_Paddle        up 8-04:00:00      4    mix svail-[4-7]

From TIMELIMIT, we can see that an allocation will be released after 8 days and 4 hours. Make sure to save your model if your training runs longer than that limit! A checkpointing sketch follows.
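
The book's train.py uses the paddle.v2 trainer API, so one way to guard against the time limit is to save parameters at the end of every pass. The handler below is a sketch modeled on the book examples, not part of train.py itself; it assumes `parameters` is the object train.py creates with paddle.parameters.create(cost).

    import paddle.v2 as paddle

    def event_handler(event):
        # Checkpoint after every pass, so at most one pass of work is
        # lost when SLURM revokes the allocation at the time limit.
        if isinstance(event, paddle.event.EndPass):
            with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
                parameters.to_tar(f)

    # Hook it up when training:
    # trainer.train(reader=..., num_passes=..., event_handler=event_handler)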

Run

  1. ssh to the interactive node.

    $ ssh helin@svail-3.xxx
    
  2. Allocate the computing nodes (from the interactive node).

    The argument to -N is the number of nodes; we can allocate at most 4 nodes. (A batch-script alternative is sketched below.)

    helin@svail-3:~$ screen # Start a screen session so the allocation is not released accidentally if the ssh connection breaks.
    helin@svail-3:~$ salloc --partition=K40x4_Paddle --reservation=Paddle-Paddle -N 1
    salloc: Granted job allocation 1031619
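
    For non-interactive runs, the same allocation can be submitted as a batch job, which survives disconnects without screen. A sketch, assuming the partition and reservation above (submit with sbatch train.slurm; train.slurm is a made-up name):

    #!/bin/bash
    #SBATCH --partition=K40x4_Paddle
    #SBATCH --reservation=Paddle-Paddle
    #SBATCH -N 1
    export LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH
    export WITH_GPU=1
    srun python train.py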
    
  3. Check which computing nodes have been allocated.

    helin@svail-3:~$ squeue|grep helin
               1031624 K40x4_Pad     bash    helin  R       0:07      1 svail-4
    
  4. ssh to the computing node and do the work (an srun alternative is sketched below).

    $ ssh helin@svail-4.xxx
    helin@svail-4:~$ wget https://raw.githubusercontent.com/PaddlePaddle/book/develop/01.fit_a_line/train.py
    helin@svail-4:~$ LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH WITH_GPU=1 python train.py
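
    Alternatively, srun can launch the command on the allocated node directly from the salloc shell on the interactive node; srun propagates the current environment by default. A sketch under the same allocation:

    helin@svail-3:~$ export LD_LIBRARY_PATH=/tools/cudnn-8.0-linux-x64-v5.0-ga/lib64:/tools/cuda-8.0/lib64:$LD_LIBRARY_PATH
    helin@svail-3:~$ export WITH_GPU=1
    helin@svail-3:~$ srun python train.py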
    
  5. Release the allocation (from the interactive node, not the computing node).

    helin@svail-3:~$ exit
    salloc: Relinquishing job allocation 1031619
    salloc: Job allocation 1031619 has been revoked.
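
If the salloc shell is no longer reachable (for example, the screen session died), the allocation can also be released by job ID with scancel:

    helin@svail-3:~$ scancel 1031619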