NYU_cluster

Greene HPC cluster

You can find NYU's documentation here. Click here to request an account.

Accessing the cluster

If you are connected to the NYU network, you can SSH into the cluster directly:

ssh <NYU_NetID>@greene.hpc.nyu.edu

Otherwise, you first need to go through the gateway server:

ssh <NYU_NetID>@gw.hpc.nyu.edu
ssh <NYU_NetID>@greene.hpc.nyu.edu

You can set up an SSH tunnel by following this documentation.
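
If your local OpenSSH client supports the -J (ProxyJump) option, you can also chain the two hops in a single command, using the same host names as above:

ssh -J <NYU_NetID>@gw.hpc.nyu.edu <NYU_NetID>@greene.hpc.nyu.edu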

Set up conda

You can download and install Miniconda with:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

When the installer asks whether to add Miniconda to your .bashrc, answer yes:

Do you wish the installer to prepend the Miniconda install location
to PATH in your /home/<netid>/.bashrc ? [yes|no]

Then source your .bashrc and Conda should be available. You can check with conda list.

You can learn how to manage Conda environments here. On the Prince cluster, I noticed that conda activate myenv needs to be replaced by source activate myenv in sbatch files. I still need to check whether this is also the case on Greene.
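
As a quick sketch of environment management (the environment name myenv and the Python version are just placeholders), you could create and activate an environment like this:

conda create -n myenv python=3.8
conda activate myenv     # interactive shells
source activate myenv    # older syntax, which was needed in sbatch scripts on Prince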

Launch jobs

  • Submitting jobs with sbatch; a minimal example script is shown after the Dask example below.
  • If you only want to access a node: srun --pty /bin/bash or srun --gres gpu:1 --pty /bin/bash.
  • You can see your jobs with watch squeue -u <netid>
  • Dask can be a very useful tool if you want to run a lot of similar jobs. Here is an example:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
import itertools
from mycode import main_exp

learning_rate = [1e-3, 1e-4]
batch_size = [10, 100, 1000]
n_layers = [1, 2, 3, 4]

# Build one argument dictionary per hyperparameter combination.
# itertools.product yields tuples in the same order as its arguments.
parallel_args = []
for (lr, bs, layer) in itertools.product(learning_rate, batch_size, n_layers):
    args = {"lr": lr, "bs": bs, "layer": layer}
    parallel_args.append(args)

# Commands run by each worker job before it starts, e.g. activating the Conda environment.
env_extra = ['source activate myenv']

if __name__ == '__main__':
    cluster = SLURMCluster(job_extra=['--cpus-per-task=1', '--ntasks-per-node=1'],
                           cores=1, processes=1,
                           memory='16GB',
                           walltime='96:00:00',
                           interface='ib0',
                           env_extra=env_extra,
                           log_directory='log_dask',
                           local_directory='log_dask')

    # Ask Slurm for 4 worker jobs and connect a client to the cluster.
    n_workers = 4
    cluster.scale(n_workers)
    client = Client(cluster)
    print(client.cluster)

    # Submit one task per hyperparameter setting and gather the results.
    results = [client.submit(main_exp, args) for args in parallel_args]
    print(results)
    print(client.gather(results))
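
For the sbatch route mentioned above, here is a minimal script sketch for a single-GPU job; the job name, resource requests, environment name, and main.py are placeholders to adapt:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16GB
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
#SBATCH --output=slurm_%j.out

source activate myenv
python main.py

Submit it with sbatch myjob.sbatch and monitor it with watch squeue -u <netid> as above.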

Run solo12 demo

  • Link to solo12 Demo