A simple job scheduler to monitor GPU usage across multiple nodes and submit jobs to available resources.
```bash
git clone https://github.com/noamteyssier/qgpu
cd qgpu
cargo build --release && cargo install --path .
```
Describes the JSON configuration of the nodes you'd like to queue to. It can be provided with `-i <node_pool>.json`, or will be read from `~/.qgpu_node_config.json` if that path exists.
- `name`
  - name of the node as specified in your `~/.ssh/config` file
  - requires that passwordless SSH is set up
  - requires that ProxyJumps are set up
- `env`
  - optional configuration
  - name of the CUDA environment to start on the node
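Since each node is addressed by its SSH alias, every `name` in the pool must resolve through your `~/.ssh/config`. A minimal entry might look like the following sketch (the host names, user, and jump host are placeholders, not values qgpu requires):

```
Host kgpu1
    HostName kgpu1.internal.example.edu
    User myuser
    ProxyJump bastion.example.edu
    IdentityFile ~/.ssh/id_ed25519
```

With an entry like this in place, `ssh kgpu1` should succeed without a password prompt before you add the node to the pool.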
```json
[
  {
    "name": "kgpu1",
    "env": "ifpe_cuda10"
  },
  {
    "name": "kgpu2"
  },
  {
    "name": "kgpu10",
    "env": "ifpe_cuda11"
  }
]
```
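qgpu parses this file itself, but it can be handy to sanity-check a pool definition before queueing to it. A minimal sketch, assuming the pool file is a JSON array of node objects with the fields described above (`check_node_pool` is a hypothetical helper, not part of qgpu):

```python
import json


def check_node_pool(path):
    """Verify every node entry has a 'name' and only known keys.

    Assumes the pool file is a JSON array of objects with the
    fields described above ('name' required, 'env' optional).
    """
    with open(path) as fh:
        nodes = json.load(fh)
    for node in nodes:
        # 'name' is required: it is the SSH alias for the node
        assert "name" in node, f"node entry missing 'name': {node}"
        # 'env' is the only optional field in the schema above
        unknown = set(node) - {"name", "env"}
        assert not unknown, f"unknown keys for {node['name']}: {unknown}"
    return [node["name"] for node in nodes]
```

Running this before `qgpu stat -i node_pool.json` catches typos like a misspelled key early.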
Describes the configuration of the jobs you'd like to submit across the node pool.
- `command`
  - command used to start the job (must be recognized by the shell environment)
  - if you are submitting a bash script, use `"bash"`
  - if you are submitting a python script, use `"python"`
- `args`
  - optional configuration
  - list of arguments to provide to the command
- `relative_path`
  - optional configuration
  - the relative path from `$HOME` where you'd like the command to be run; under the hood this will `cd <path>`
- `env`
  - optional configuration
  - name of the CUDA environment to start for the job
  - takes priority over the node environment if both are provided
- `n_submission`
  - optional configuration
  - number of times to submit this job
  - not implemented yet, but useful for when the same script needs to be run multiple times
```json
[
  {
    "command": "bash",
    "args": ["tune_dataset.sh", "../_cache/splits/my_dataset1.tab", "cml", "hparams_cml_data2"],
    "relative_path": "~/projects/project_dir/"
  },
  {
    "command": "bash",
    "args": ["tune_dataset.sh", "../_cache/splits/my_dataset2.tab", "cml", "hparams_cml_data2"],
    "relative_path": "~/projects/project_dir/"
  }
]
```
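When jobs differ only by one argument, as in the two entries above, generating the job file programmatically beats editing it by hand. A short sketch under the job schema described above (the dataset names, script, and paths are placeholders from the example, and the file is assumed to be a JSON array of job objects):

```python
import json

# Placeholder dataset names matching the example entries above
datasets = ["my_dataset1", "my_dataset2"]

# Build one job object per dataset, varying only the split path
jobs = [
    {
        "command": "bash",
        "args": [
            "tune_dataset.sh",
            f"../_cache/splits/{name}.tab",
            "cml",
            "hparams_cml_data2",
        ],
        "relative_path": "~/projects/project_dir/",
    }
    for name in datasets
]

with open("jobs.json", "w") as fh:
    json.dump(jobs, fh, indent=2)
```

The resulting `jobs.json` can then be passed to `qgpu sub -j jobs.json`.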
qgpu queries which resources are available given free-memory and usage thresholds, and sets environment variables for jobs submitted across the nodes that meet those thresholds. You use these by accessing the environment variables in your scripts, the same way you would with any job scheduler.
qgpu sets these environment variables:

- `QG_NODE_ID`
  - the zero-indexed `node_id` across the node pool
- `QG_GPU_ID`
  - the zero-indexed `gpu_id` within a given node
- `QG_LOG_PATH`
  - a path to redirect stdout/stderr to
  - protip: `your_script.sh |& tee output.txt` will redirect both stdout and stderr
  - format of the log path: `node<QG_NODE_ID>_gpu<QG_GPU_ID>.log.txt`
```bash
#!/usr/bin/env bash
# tune_dataset.sh: run a training script on the GPU assigned by qgpu
dataset=$1
model=$2
output=$3

python3 my_machine_learning_run.py \
    -i "${dataset}" \
    -m "${model}" \
    -o "${output}" \
    -f 4096 \
    -n 12 \
    -g "${QG_GPU_ID}" \
    -M 100 \
    -t 25 |& tee "${QG_LOG_PATH}"
```
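If your script cannot take a GPU id as a flag, another option is to pin the process through the standard `CUDA_VISIBLE_DEVICES` variable instead. A sketch of that pattern; the fallback defaults here are hypothetical, only so the script also runs outside the scheduler (qgpu sets the real values at submission time):

```bash
#!/usr/bin/env bash
# Fallback values are placeholders for running outside qgpu;
# under the scheduler these variables are already set.
: "${QG_NODE_ID:=0}"
: "${QG_GPU_ID:=0}"
: "${QG_LOG_PATH:=node0_gpu0.log.txt}"

# Restrict CUDA to the GPU qgpu selected for this job
export CUDA_VISIBLE_DEVICES="${QG_GPU_ID}"

echo "node ${QG_NODE_ID}, gpu ${QG_GPU_ID}, logging to ${QG_LOG_PATH}"
```

With `CUDA_VISIBLE_DEVICES` exported, the job sees only its assigned GPU as device 0.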
```bash
# access the help menu
qgpu --help

# access the help menu for stat
qgpu stat --help

# access the help menu for sub
qgpu sub --help

# query the usage statistics on the node pool
# (expects ~/.qgpu_node_config.json to exist)
qgpu stat

# query the usage statistics with a custom node pool
qgpu stat -i node_pool.json

# submit a pool of jobs across the available nodes on the node pool
qgpu sub -j jobs.json

# submit a pool of jobs across a custom node pool
qgpu sub -i node_pool.json -j jobs.json

# submit with very stringent resource availability
qgpu sub -i node_pool.json -j jobs.json -f 100 -F 100

# submit with flexible resource availability
qgpu sub -i node_pool.json -j jobs.json -f 50 -F 50

# submit with stringent utilization and flexible memory utilization
qgpu sub -i node_pool.json -j jobs.json -f 100 -F 20

# submit with flexible utilization and stringent memory utilization
qgpu sub -i node_pool.json -j jobs.json -f 20 -F 100
```