MTCMon

A lightweight, web-based resource monitor for GPU clusters.

Visualizes per-job GPU usage on the web
Visualizes CPU usage and other system information
Responsive web design implemented in plain CSS and JavaScript, which keeps it lightweight
Python-based with few dependencies
Easy deployment without root privileges or Internet access
Works for both centralized and distributed clusters
Quick access bar designed for large clusters
Already deployed at labs in top universities such as CMU, UC Berkeley, UT Austin, and RWTH Aachen

System requirement

The only requirement is that all nodes should be SSH-able from one of the nodes.
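
One way to check this requirement before deploying is to probe every node over SSH from the node you plan to use as the main node. The following is a minimal sketch, not part of MTCMon: it assumes Python 3, that node-list.txt holds one host name per line, and that blank lines and lines starting with # can be skipped. The function names are hypothetical.

```python
import subprocess


def read_nodes(path="node-list.txt"):
    """Return host names from the node list, assuming one host per line."""
    with open(path) as handle:
        lines = (line.strip() for line in handle)
        # Skipping blank lines and '#' comments is an assumption of this sketch.
        return [line for line in lines if line and not line.startswith("#")]


def ssh_probe_command(host, timeout=5):
    """Build a non-interactive ssh command that runs `true` on the host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        "-o", "ConnectTimeout={}".format(timeout),
        host,
        "true",
    ]


def unreachable_nodes(hosts, timeout=5):
    """Return the hosts whose ssh probe exits with a non-zero status."""
    failed = []
    for host in hosts:
        result = subprocess.run(
            ssh_probe_command(host, timeout),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Calling unreachable_nodes(read_nodes()) from the repository directory should return an empty list when every listed node accepts key-based SSH logins. BatchMode is used so that a node that would ask for a password is reported as a failure rather than blocking the check.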

Python dependency

Jinja2, Flask, psutil, and either nvidia-ml-py (Python 2) or nvidia-ml-py3 (Python 3).
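
To confirm that the dependencies are visible to the interpreter on a given node, a short check like the one below can help. This is a sketch that assumes Python 3; the function name is hypothetical. It looks up import names rather than package names: both nvidia-ml-py and nvidia-ml-py3 install a module named pynvml.

```python
import importlib.util

# Import names, not package names: nvidia-ml-py and nvidia-ml-py3 both
# install a module called `pynvml`.
REQUIRED_MODULES = ["jinja2", "flask", "psutil", "pynvml"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from missing_modules() means all four modules can be located. The check only finds the modules; it does not import them, so it will not detect a missing NVIDIA driver.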

Deployment

Clone this repo and install the dependencies on all the nodes (or just the head node if all nodes share the same software environment through NFS).
Pick one of the nodes as the main node to collect stats from other nodes and host the website. In a centralized cluster, this is typically the head node. From this point on, all steps should take place on the main node.
Update your list of nodes in node-list.txt, and make sure you can ssh into all the listed nodes.
(Optionally) tweak the design of the web page by editing index-design.html.
Generate a web page template by running python design2template.py (the final web page will be rendered dynamically based on this template).
Run run_monitor.sh to dispatch monitors onto the listed nodes.
Run python webserver.py <port> to start the web server on the specified port.
Go to <hostname>:<port>/ and voilà!
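
The command-line part of the steps above can be written down as a small plan to review before running anything. This is only an illustrative sketch: the function names are hypothetical, the port is whatever you choose, and ./run_monitor.sh is assumed to be invoked from the repository directory on the main node.

```python
import shlex


def deployment_commands(port):
    """Return the commands from the steps above, in order, as argument lists."""
    return [
        ["python", "design2template.py"],       # generate the web page template
        ["./run_monitor.sh"],                   # dispatch monitors to the listed nodes
        ["python", "webserver.py", str(port)],  # start the web server (keeps running)
    ]


def format_plan(port):
    """Render the plan as shell-style lines, one command per line."""
    return [
        " ".join(shlex.quote(arg) for arg in command)
        for command in deployment_commands(port)
    ]
```

Printing the lines returned by format_plan with your chosen port gives the three commands to run on the main node, in order. The last command does not return while the web server is running, so it is kept at the end of the plan.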

Restart monitor daemons

Run ./kill_monitor.sh && ./run_monitor.sh on the main node.

Troubleshooting

Find the error log here: /tmp/monitor-<node name>.log
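
When several nodes are involved, it can be handy to print the tail of each log. The sketch below assumes Python 3 and that the log file is readable on the machine where the script runs; which machine holds each log is not stated here. The function names are hypothetical.

```python
from collections import deque


def monitor_log_path(node):
    """Return the log path for a node, following /tmp/monitor-<node name>.log."""
    return "/tmp/monitor-{}.log".format(node)


def tail(path, count=20):
    """Return the last `count` lines of a text file without trailing newlines."""
    with open(path, errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

For example, tail(monitor_log_path(node)) returns up to the last 20 lines of the log for the given node name.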

Themes

Currently, there are two themes:

Yellow (default)

Dark blue (check out the autobot branch for this)

Acknowledgement

The initial prototype of this project was developed by my awesome labmate Rohit Girdhar.
