
PLAI cluster


PLAI cluster docs

This doc details the usage of the PLAI group's GPUs and CPUs. Both sets of resources are managed by a Torque/Maui scheduler, with the headnode at submit.cs.ubc.ca.

Quick Start

  • You can access the headnode through remote.cs.ubc.ca (ssh username@remote.cs.ubc.ca) and then ssh submit from there.
  • Your stuff lives at /ubc/cs/research/plai-scratch/[username], which is mounted on these machines
  • On submit, qstat -q to see available queues
  • qsub -q [queue] -I to grab an interactive session
  • qsub [PBS_script] to submit a batch job (see the example script after this list)
  • qstat to see who else has jobs running at the moment
  • ssh to a machine by name only to check usage (nvidia-smi / htop)
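
For batch jobs, a minimal PBS script might look like the sketch below. The job name, queue, walltime, log paths, and the python command are placeholders; adjust them to your own setup.

#!/bin/bash
#PBS -N my-experiment                 # job name (placeholder)
#PBS -q gpu                           # queue to submit to (see qstat -q)
#PBS -l walltime=12:00:00             # requested wall-clock time
#PBS -o /ubc/cs/research/plai-scratch/[username]/logs/my-experiment.out
#PBS -e /ubc/cs/research/plai-scratch/[username]/logs/my-experiment.err

cd $PBS_O_WORKDIR                     # directory the job was submitted from
python train.py                       # replace with your actual command

Submit it with qsub my-experiment.pbs and check on it with qstat.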

Best Practices

If you want to ssh directly to the headnode (for scp-ing or laziness), add this to your .ssh/config:

Host remote.cs.ubc.ca
    User [username]
Host submit
    User [username]
    ProxyCommand ssh -W %h:%p remote.cs.ubc.ca
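
With that config in place, ssh submit works in one step from your local machine, and scp goes through the proxy as well. The file name below is just an example:

$ ssh submit
$ scp results.tar.gz submit:/ubc/cs/research/plai-scratch/[username]/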

The best way to run things here is through conda environments. See https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html

You can set up different environments for different experimental setups, or for different builds of packages on the GPU versus CPU machines. To install things correctly, grab an interactive session and activate the environment on the machine of choice.
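
For example, a minimal sketch (the environment name and package are just examples; older conda installs may need source activate instead of conda activate):

$ qsub -I -q gpu
$ conda create -n my-experiment python=3.6
$ conda activate my-experiment
$ conda install numpy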

Don't run anything directly on the machines (i.e. do not ssh nodea or ssh chicago and run anything besides usage checks).

UBC GPU cluster

The GPU queues are 'desktop', 'gpu', and 'gpu-week'. They'll have all the stuff you'd expect - nvidia-smi, nvidia-docker.

Get an interactive session with:

$ ssh username@remote.cs.ubc.ca
$ ssh submit
$ qsub -I -q [desktop/gpu/gpu-week]
$ [some fancy command]

The queues differ most notably in their time limits - 1 hour, 1 day, and 1 week respectively. Also, gpu-week grabs one of the machines with TitanX GPUs, so you can use it for long-running experiments.

You can potentially run things with all the GPUs at once, e.g. qsub -q [queue] -l nodes=chicago+sanjose+..., but it's unclear whether this works. If you figure it out before I get a chance, please update this doc.

UBC CPU cluster

There are four 40-core machines called node(a-d) and a fifth 60-core machine called nodee. The four 40-core machines have MPI installed. Do not ever run a job that grabs nodee together with any of the other nodes.

The queues for the CPUs are laura/nice/parallel.

laura is used for node0 only. It has 400 GB of memory available and can run jobs for up to a month. For the other nodes, use nice or parallel: nice is low memory with a 10-day job limit, and parallel has significantly higher memory. Again, it is possible to do:

qsub -q [nice/parallel] -l nodes=nodea+nodeb+nodec+noded
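
For an MPI batch job across the four 40-core nodes, a PBS script might look something like the sketch below. The ppn counts, walltime, and my_mpi_program are assumptions; check the actual core counts and your MPI installation before relying on it.

#!/bin/bash
#PBS -N mpi-example
#PBS -q parallel
#PBS -l nodes=nodea:ppn=40+nodeb:ppn=40+nodec:ppn=40+noded:ppn=40
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists one entry per allocated processor
mpirun -np 160 -machinefile $PBS_NODEFILE ./my_mpi_program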

Docker images

If you want to run something with missing dependencies, ask Frank or Michael about adding it to the cluster. Alternatively, you can run things in a convoluted way through Docker:

username@gpu-machine:user-scratch$ nvidia-docker run --user some-docker-user --rm -it -v $PWD:/workspace username/some-docker-image:some-tag bash
some-docker-user@7891fd18edc2:/workspace# [some fancy command]

Some important tips:

  1. If you need to persist things, like experimental run data or training artifacts, you should mount plai-scratch or some other directory into the container. The way the cluster machines are set up, you will need a user in the docker image that mirrors your own user on the UBC network. As your own user on the headnode, run id -u; this is the user id associated with your account. In the docker image, add a user with that same user id:
# SOME DOCKERFILE
FROM ubuntu:16.04
RUN [random docker stuff]
...
...
# replace 99999 with the output of id -u and mjteng with your own username
RUN useradd -u 99999 mjteng

Now, use docker with the created user as normal and things should work, i.e. nvidia-docker run --user [user you just created] ...
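
For example, to persist results to plai-scratch (the image name, user, and command here are placeholders; substitute your own):

username@gpu-machine:~$ nvidia-docker run --user mjteng --rm -it -v /ubc/cs/research/plai-scratch/[username]:/workspace username/some-docker-image:some-tag bash
mjteng@7891fd18edc2:/workspace$ python train.py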
