A lightweight web-based resource monitor for GPU clusters.
- Visualize per-job GPU usage on the web
- Visualize CPU usage and other system information
- Responsive web design implemented in plain CSS and JavaScript, which keeps it lightweight
- Python-based with few dependencies
- Easy deployment without root privilege or Internet access
- Works for both centralized and distributed clusters
- Quick access bar designed for large clusters
- Already deployed at labs in top universities such as CMU, UC Berkeley, UT Austin and RWTH Aachen
The only requirement is that all nodes should be SSH-able from one of the nodes.
Jinja2, Flask, psutil, and nvidia-ml-py (Python 2) or nvidia-ml-py3 (Python 3).
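As a quick sanity check before deploying, the sketch below reports which of the listed dependencies are not importable on a node. It assumes Python 3 and the usual import names for these packages (`jinja2`, `flask`, `psutil`, and `pynvml`, which is the module installed by both nvidia-ml-py and nvidia-ml-py3); `missing_modules` is a hypothetical helper, not part of this repo.

```python
import importlib.util

# Import names for the dependencies listed above. nvidia-ml-py and
# nvidia-ml-py3 both install a module named `pynvml`.
REQUIRED_MODULES = ["jinja2", "flask", "psutil", "pynvml"]


def missing_modules(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

Run it with the same interpreter that will run the monitor, since the check only covers that interpreter's environment.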
- Clone this repo and install the dependencies on all the nodes (or just the head node if all nodes share the same software environment through NFS).
- Pick one of the nodes as the main node to collect stats from other nodes and host the website. In a centralized cluster, this is typically the head node. From this point on, all steps should take place on the main node.
- Update your list of nodes in `node-list.txt`, and make sure you can ssh into all the listed nodes.
- (Optionally) tweak the design of the web page by editing `index-design.html`.
- Generate a web page template by running `python design2template.py` (the final web page will be rendered dynamically based on this template).
- Run `run_monitor.sh` to dispatch monitors onto the listed nodes.
- Run `python webserver.py <port>` to start the web server on the specified port.
- Go to `<hostname>:<port>/` and voilà!
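The "make sure you can ssh into all the listed nodes" step can be checked in one pass before dispatching the monitors. The following is a minimal sketch, assuming Python 3, that `node-list.txt` holds one hostname per line (the file format is an assumption here), and that key-based SSH login is set up. The helper names `read_nodes`, `ssh_command`, `can_ssh`, and `check_nodes` are hypothetical and not part of this repo.

```python
import subprocess


def read_nodes(text):
    """Parse node names, assuming one hostname per line; blank lines are skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def ssh_command(node, timeout=5):
    """Build an ssh invocation that runs `true` on the node and never prompts."""
    # BatchMode=yes makes ssh fail instead of asking for a password.
    return ["ssh", "-o", "BatchMode=yes",
            "-o", "ConnectTimeout=%d" % timeout, node, "true"]


def can_ssh(node, timeout=5):
    """Return True if a non-interactive ssh login to the node succeeds."""
    result = subprocess.run(ssh_command(node, timeout),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def check_nodes(path="node-list.txt"):
    """Map each listed node to whether it is reachable over ssh."""
    with open(path) as f:
        nodes = read_nodes(f.read())
    return {node: can_ssh(node) for node in nodes}
```

Calling `check_nodes()` from the repo directory on the main node returns a dictionary of node names to `True` or `False`; any node mapped to `False` is worth fixing before running `run_monitor.sh`.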
To restart the monitors, run `./kill_monitor.sh && ./run_monitor.sh` on the main node.
Find the error log here: `/tmp/monitor-<node name>.log`
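When a node misbehaves, it often helps to look at only the end of its log. The sketch below builds the log path from the pattern above and returns the last few lines. It assumes Python 3 and that it runs on whichever machine actually holds the log file; `log_path`, `last_lines`, and `read_log_tail` are hypothetical helpers, not part of this repo.

```python
from collections import deque


def log_path(node):
    """Build the log path for a node from the /tmp/monitor-<node name>.log pattern."""
    return "/tmp/monitor-%s.log" % node


def last_lines(lines, n=20):
    """Return the last n lines of an iterable of lines, without trailing newlines."""
    # A bounded deque keeps only the most recent n items while iterating.
    return [line.rstrip("\n") for line in deque(lines, maxlen=n)]


def read_log_tail(node, n=20):
    """Read the last n lines of the monitor log for the given node."""
    with open(log_path(node), errors="replace") as f:
        return last_lines(f, n)
```

For example, `print("\n".join(read_log_tail("node1")))` would print the last 20 lines of the log for a node named `node1` (a placeholder name).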
Currently, there are two themes:
- Dark blue (check out the `autobot` branch for this)
The initial prototype of this project was developed by my awesome labmate Rohit Girdhar.