PyTorch + Lightning + W&B + TensorBoard Reference Implementation

It is finally time for me to jump into the PyTorch pool. My office was closed for two weeks over the 2023/2024 holidays, so this was a perfect time to learn some new (to me) tools.

The key objective was to build a reference implementation for a simple CNN for MNIST. From here, the plan is to implement more interesting things after getting the rust knocked off:

Diffusion models
Very small GPT-like model (e.g., nanoGPT)
Low-bit quantization in PyTorch
Graph neural networks

Components of the Reference Implementation

A few approaches that are new to me since the last time I got my hands dirty:

PyTorch Lightning: Get some feeling for what I like and what I don't
YAML configuration (Hydra): Move away from CLI arguments to YAML configuration
Weights & Biases: See what a slick experiment tracking tool can do
TensorBoard: Still seems to be the go-to for basic training monitoring
Dev Containers: Move from PyCharm to VS Code and give dev containers a try
TODO: Add type hinting to make my code less readable

Setup

Create a VM on the cloud provider of your choice

Some notes:

Be sure to enable HTTP traffic (for TensorBoard)
Open port 6006 (both directions, also for TensorBoard)
Make sure the GPU driver is installed
Confirm Docker is installed
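
The last two checks can be scripted. Below is a minimal sketch, assuming an NVIDIA GPU whose driver ships the nvidia-smi utility. The helper name check_tools is hypothetical and not part of this repo. Finding a binary on PATH only shows that it is installed, not that the driver or the Docker daemon is actually working.

```python
import shutil


def check_tools(tools=("docker", "nvidia-smi")):
    """Return a dict mapping each tool name to True if it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    # Print one line per tool so missing pieces are easy to spot.
    for tool, found in check_tools().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```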

Build container

Copy Dockerfile to your VM, or just clone this repo.

git clone --depth 1 https://github.com/pat-coady/pytorch-sandbox.git

Build it. Run command from directory containing the Dockerfile:

docker build -t mnist0 .

Run container

Put your Weights & Biases API Key in your environment.

export WANDB_API_KEY=<your API key>

Run the container detached. We'll connect to the container in the next step using docker exec to start training and TensorBoard. Yes, I know, containers aren't really meant to be used interactively like this. But it is pretty handy for having a fixed development environment that you can replicate almost anywhere.

docker run --gpus all -p 6006:6006 -e WANDB_API_KEY -td mnist0
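
Because -e WANDB_API_KEY is given without a value, Docker forwards whatever the host shell has set. If the variable is unset on the host, the container gets no key. Here is a small sketch for checking this before launching. The function name wandb_key_is_set is hypothetical.

```python
import os


def wandb_key_is_set(env=None):
    """Return True if WANDB_API_KEY is present and non-empty in the given mapping."""
    env = os.environ if env is None else env
    return bool(env.get("WANDB_API_KEY", "").strip())


if __name__ == "__main__":
    if not wandb_key_is_set():
        print("WANDB_API_KEY is not set; export it before calling docker run")
```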

Train a model and monitor results via TensorBoard

Start tmux so you can launch training and start the TensorBoard server. It is also an easy way to prevent your job from dying if your SSH session disconnects (which can happen very easily if your computer goes to sleep or your network glitches).

tmux

Here is an example hyperparameter sweep using the Hydra interface:

cd pytorch-sandbox

python3 train.py -m data.num_workers=4 trainer.max_epochs=20 \
model.l1_chan=4,8 model.l2_chan=8,16 model.l3_chan=16,32 \
model.optimizer_params.lr=0.02,0.1 model.optimizer_params.momentum=0.8,0.9 \
model.optimizer_params.nesterov=0,1
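
To see how many runs this command launches: in multirun mode (-m), Hydra starts one job per combination of the comma-separated values. Six parameters with two values each gives 2^6 = 64 jobs. The sketch below reproduces that count in plain Python. It only illustrates the grid expansion and is not how Hydra is implemented. The expand_grid helper is hypothetical.

```python
from itertools import product

# Swept values copied from the command above. Single-valued overrides
# (data.num_workers, trainer.max_epochs) do not multiply the run count.
sweep = {
    "model.l1_chan": [4, 8],
    "model.l2_chan": [8, 16],
    "model.l3_chan": [16, 32],
    "model.optimizer_params.lr": [0.02, 0.1],
    "model.optimizer_params.momentum": [0.8, 0.9],
    "model.optimizer_params.nesterov": [0, 1],
}


def expand_grid(sweep):
    """Return one override dict per combination of swept values."""
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*sweep.values())]


runs = expand_grid(sweep)
print(len(runs))  # 64
print(runs[0])    # the first combination: the first value of every parameter
```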

Open another window in tmux to launch TensorBoard and monitor training: Ctrl-b c. Here is a nice tmux tutorial if you aren't familiar.

Launch TensorBoard (--bind_all to serve externally):

tensorboard --bind_all --logdir .
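
If the TensorBoard page does not load in your browser, it can help to check whether port 6006 is reachable at all. Here is a minimal sketch using only the standard library. The port_is_open helper is hypothetical, and the host you pass should be your VM's external address when testing from your own machine.

```python
import socket


def port_is_open(host, port=6006, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


# Run on the VM itself: checks that something is listening locally on 6006.
print(port_is_open("localhost"))
```

A True result on the VM but False from your laptop usually points at the firewall or port settings rather than TensorBoard itself.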

To detach from tmux: Ctrl-b d. You can now walk away from things, close your SSH session, and everything should keep on running.

Results

I've made the Weights & Biases workspace for this project public if you want to look around:

https://wandb.ai/pcoady00/mnist/workspace

This contains the results of the 64-run hyperparameter sweep I ran above. Here is a quick (not narrated) video that explores the results.

And here are some representative screenshots.

Figure 1. Results sorted by validation accuracy.

Figure 2. Training curves: validation loss.

Figure 3. Parameter importance and positive (GREEN) or negative (RED) correlation.

Figure 4. "Parallel Coordinates" view, somewhat useful when used interactively (see video).

Figure 5. Parameter histograms, which can be somewhat informative for debugging training and seeing where learning is occurring.

Figure 6. Also useful for monitoring training.

Figure 7. Good to check ... for instance, would gradient clipping or batch normalization be useful?

Acknowledgements

I'd like to thank Tao Yu and Zsolt Majzik on my team for giving me some really helpful pointers as I got started.
