A simple job scheduler to monitor GPU usage across multiple nodes and submit jobs to available resources.
```bash
git clone https://github.com/noamteyssier/qgpu
cd qgpu
cargo build --release && cargo install --path .
```
Describes the JSON configuration of the nodes you'd like to queue to. It can be provided with `-i <node_pool>.json`, or will be read from `~/.qgpu_node_config.json` if that path exists.
- `name`
  - name of the node as specified in your `~/.ssh/config` file
  - requires that passwordless SSH is set up
  - requires that ProxyJumps are set up
- `env`
  - optional configuration
  - name of the CUDA environment to start on the node
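Since each node is addressed by its SSH alias, every `name` in the pool must resolve through your `~/.ssh/config`. A minimal entry might look like the following sketch (the host names, user, and jump host are placeholders, not values qgpu requires):

```
Host kgpu1
    HostName kgpu1.internal.example.edu
    User myuser
    ProxyJump bastion.example.edu
    IdentityFile ~/.ssh/id_ed25519
```

With an entry like this in place, `ssh kgpu1` should succeed without a password prompt before you add the node to the pool.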
```json
[
  {
    "name": "kgpu1",
    "env": "ifpe_cuda10"
  },
  {
    "name": "kgpu2"
  },
  {
    "name": "kgpu10",
    "env": "ifpe_cuda11"
  }
]
```
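qgpu parses this file itself, but it can be handy to sanity-check a pool definition before queueing to it. A minimal sketch, assuming the pool file is a JSON array of node objects with the fields described above (`check_node_pool` is a hypothetical helper, not part of qgpu):

```python
import json


def check_node_pool(path):
    """Verify every node entry has a 'name' and only known keys.

    Assumes the pool file is a JSON array of objects with the
    fields described above ('name' required, 'env' optional).
    """
    with open(path) as fh:
        nodes = json.load(fh)
    for node in nodes:
        # 'name' is required: it is the SSH alias for the node
        assert "name" in node, f"node entry missing 'name': {node}"
        # 'env' is the only optional field in the schema above
        unknown = set(node) - {"name", "env"}
        assert not unknown, f"unknown keys for {node['name']}: {unknown}"
    return [node["name"] for node in nodes]
```

Running this before `qgpu stat -i node_pool.json` catches typos like a misspelled key early.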
Describes the configuration of the jobs you'd like to submit across the node pool.
- `command`
  - command used to start the job (must be recognized by the shell environment)
  - if you are submitting a bash script, use `"bash"`
  - if you are submitting a python script, use `"python"`
- `args`
  - optional configuration
  - list of arguments to provide to the command
- `relative_path`
  - optional configuration
  - the relative path from `$HOME` where you'd like the command to be run; under the hood this will `cd <path>`
- `env`
  - optional configuration
  - name of the CUDA environment to start for the job
  - takes priority over the node environment if both are provided
- `n_submission`
  - optional configuration
  - number of times to submit this job
  - not implemented yet, but useful for when the same script needs to be run multiple times
```json
[
  {
    "command": "bash",
    "args": ["tune_dataset.sh", "../_cache/splits/my_dataset1.tab", "cml", "hparams_cml_data2"],
    "relative_path": "~/projects/project_dir/"
  },
  {
    "command": "bash",
    "args": ["tune_dataset.sh", "../_cache/splits/my_dataset2.tab", "cml", "hparams_cml_data2"],
    "relative_path": "~/projects/project_dir/"
  }
]
```
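When jobs differ only by one argument, as in the two entries above, generating the job file programmatically beats editing it by hand. A short sketch under the job schema described above (the dataset names, script, and paths are placeholders from the example, and the file is assumed to be a JSON array of job objects):

```python
import json

# Placeholder dataset names matching the example entries above
datasets = ["my_dataset1", "my_dataset2"]

# Build one job object per dataset, varying only the split path
jobs = [
    {
        "command": "bash",
        "args": [
            "tune_dataset.sh",
            f"../_cache/splits/{name}.tab",
            "cml",
            "hparams_cml_data2",
        ],
        "relative_path": "~/projects/project_dir/",
    }
    for name in datasets
]

with open("jobs.json", "w") as fh:
    json.dump(jobs, fh, indent=2)
```

The resulting `jobs.json` can then be passed to `qgpu sub -j jobs.json`.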
qgpu queries which resources are available given free-memory and usage thresholds, and sets environment variables for jobs submitted across the nodes that meet those thresholds. You use these by accessing the environment variables in your scripts, the same way you would with any job scheduler.
qgpu sets these environment variables:

- `QG_NODE_ID`
  - the zero-indexed `node_id` across the node pool
- `QG_GPU_ID`
  - the zero-indexed `gpu_id` within a given node
- `QG_LOG_PATH`
  - a path to redirect stdout/stderr to
  - protip: `your_script.sh |& tee output.txt` will redirect both stdout and stderr
  - format of the log path: `node<QG_NODE_ID>_gpu<QG_GPU_ID>.log.txt`
```bash
#!/usr/bin/env bash
# tune_dataset.sh: run a training script on the GPU assigned by qgpu
dataset=$1
model=$2
output=$3

python3 my_machine_learning_run.py \
    -i "${dataset}" \
    -m "${model}" \
    -o "${output}" \
    -f 4096 \
    -n 12 \
    -g "${QG_GPU_ID}" \
    -M 100 \
    -t 25 |& tee "${QG_LOG_PATH}"
```
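If your script cannot take a GPU id as a flag, another option is to pin the process through the standard `CUDA_VISIBLE_DEVICES` variable instead. A sketch of that pattern; the fallback defaults here are hypothetical, only so the script also runs outside the scheduler (qgpu sets the real values at submission time):

```bash
#!/usr/bin/env bash
# Fallback values are placeholders for running outside qgpu;
# under the scheduler these variables are already set.
: "${QG_NODE_ID:=0}"
: "${QG_GPU_ID:=0}"
: "${QG_LOG_PATH:=node0_gpu0.log.txt}"

# Restrict CUDA to the GPU qgpu selected for this job
export CUDA_VISIBLE_DEVICES="${QG_GPU_ID}"

echo "node ${QG_NODE_ID}, gpu ${QG_GPU_ID}, logging to ${QG_LOG_PATH}"
```

With `CUDA_VISIBLE_DEVICES` exported, the job sees only its assigned GPU as device 0.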
```bash
# access the help menu
qgpu --help

# access the help menu for stat
qgpu stat --help

# access the help menu for sub
qgpu sub --help

# query the usage statistics on the node pool
# (expects ~/.qgpu_node_config.json to exist)
qgpu stat

# query the usage statistics with a custom node pool
qgpu stat -i node_pool.json

# submit a pool of jobs across the available nodes on the node pool
qgpu sub -j jobs.json

# submit a pool of jobs across a custom node pool
qgpu sub -i node_pool.json -j jobs.json

# submit with very stringent resource availability
qgpu sub -i node_pool.json -j jobs.json -f 100 -F 100

# submit with flexible resource availability
qgpu sub -i node_pool.json -j jobs.json -f 50 -F 50

# submit with stringent utilization and flexible memory utilization
qgpu sub -i node_pool.json -j jobs.json -f 100 -F 20

# submit with flexible utilization and stringent memory utilization
qgpu sub -i node_pool.json -j jobs.json -f 20 -F 100
```