In this notebook, we use the trained noise model to guide the training of a VAE for denoising.

In [1]:
import sys
import os

import torch
import numpy as np
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger

sys.path.append("../")
from noise_model.PixelCNN import PixelCNN
from HDN.models.lvae import LadderVAE
from utils.dataloaders import create_dn_loader

Load noisy measurements.
These should be NumPy ndarrays of shape `[Number, 1, Width]` or `[Number, Width]`.

In [2]:
at_particle_location = "./sample_data/Particle.npy"
at_particle = np.load(at_particle_location)

# In our data, the scattering signal is the second channel
at_particle = at_particle[:, 1]
# Reshape to PyTorch's conventional [batch, channel, width] input layout
at_particle = at_particle.reshape((4000, 1, 1000))
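The reshape above hard-codes 4000 signals of width 1000, which only fits the sample data. As a minimal sketch, assuming your own measurements arrive in one of the two accepted shapes, the channel axis can be added without knowing the sizes in advance. The helper name `to_batch_channel_width` is hypothetical and not part of this repository.

```python
import numpy as np


def to_batch_channel_width(signals: np.ndarray) -> np.ndarray:
    """Return `signals` with shape [Number, 1, Width]."""
    if signals.ndim == 2:
        # [Number, Width] -> insert a singleton channel axis
        return signals[:, np.newaxis, :]
    if signals.ndim == 3 and signals.shape[1] == 1:
        # Already [Number, 1, Width]
        return signals
    raise ValueError(
        f"expected [Number, Width] or [Number, 1, Width], got {signals.shape}"
    )


# Placeholder array with the same size as the sample data after channel selection
demo = np.zeros((4000, 1000), dtype=np.float32)
print(to_batch_channel_width(demo).shape)
```

Raising on any other shape makes a wrong channel selection fail early instead of silently training on mis-shaped data.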

Load trained noise model and disable gradients


In [3]:
noise_model_location = "../nm_checkpoint/final_params.ckpt"
noise_model = PixelCNN.load_from_checkpoint(noise_model_location).eval()

for param in noise_model.parameters():
    param.requires_grad = False
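To confirm that a module is really frozen, one option is to count its trainable parameters. The sketch below uses a small stand-in network instead of the actual `PixelCNN`, since any `nn.Module` behaves the same way here; `stand_in`, `n_trainable` and `n_total` are illustrative names.

```python
import torch
from torch import nn

# Stand-in for the loaded noise model (NOT the real PixelCNN architecture)
stand_in = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(4, 1, kernel_size=1),
)

stand_in.eval()                 # switch layers such as dropout to inference mode
stand_in.requires_grad_(False)  # same effect as the per-parameter loop above

n_trainable = sum(p.numel() for p in stand_in.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in stand_in.parameters())
print(f"trainable: {n_trainable} / total: {n_total}")
```

Applied to `noise_model`, the trainable count should be zero, which is consistent with the noise model's parameters being listed as non-trainable in the training summary further down.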

Create data loaders and get the shape, mean and standard deviation of the noisy images.

In [4]:
dn_train_loader, dn_val_loader, img_width, data_mean, data_std = create_dn_loader(
    at_particle, batch_size=32, split=0.9
)
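As a quick sanity check on these settings, the number of batches per epoch can be worked out by hand. This sketch assumes that `split=0.9` is the fraction of signals used for training and that the last, partially filled batch is kept; both are assumptions about `create_dn_loader`, not confirmed behaviour.

```python
import math

n_signals = 4000   # number of signals in the sample data
split = 0.9        # assumed: fraction of signals used for training
batch_size = 32

n_train = round(n_signals * split)
n_val = n_signals - n_train

# Assumed: the final partial batch is kept, hence the ceiling
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

print(n_train, n_val, train_batches, val_batches)
```

Under these assumptions there are 3600 training signals in 113 batches, which agrees with the 113 steps per epoch reported in the training log.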

Set denoiser checkpoint directory


In [5]:
dn_checkpoint_path = "../dn_checkpoint"

Initialise the trainer and the denoising VAE.


The default hyperparameters should work for most cases. If training takes too long or an out-of-memory error is encountered, `num_latents` can be decreased to `6` to reduce the size of the network while still getting good results. Alternatively, better performance could be achieved by increasing `num_latents` to `10` and `z_dims` to `[64] * num_latents`.

Increasing `dropout` to `0.1` or `0.2` can sometimes help when working with a limited amount of training data.

Note that here we train for a maximum of 100 epochs to get adequate results in about an hour. Change `max_epochs` to 1000 to train the model fully.
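The alternatives described above can be collected as presets. This is only an illustrative sketch: `ladder_config` and the preset names are hypothetical, and the values are the ones suggested in the text.

```python
def ladder_config(num_latents: int = 8, z_dim: int = 32, dropout: float = 0.0) -> dict:
    """Bundle the hyperparameters discussed above into one dictionary."""
    return {
        "num_latents": num_latents,
        "z_dims": [z_dim] * num_latents,  # one latent width per hierarchy level
        "dropout": dropout,
    }


presets = {
    "small": ladder_config(num_latents=6),             # faster, lower memory
    "default": ladder_config(),                        # values used in this notebook
    "large": ladder_config(num_latents=10, z_dim=64),  # potentially better results
    "limited_data": ladder_config(dropout=0.1),        # extra regularisation
}

for name, cfg in presets.items():
    print(name, len(cfg["z_dims"]), cfg["z_dims"][0], cfg["dropout"])
```

The `z_dims` and `dropout` entries of the chosen preset would then be passed to `LadderVAE` in place of the values set in the next cell.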

In [6]:
use_cuda = torch.cuda.is_available()
trainer = pl.Trainer(
    default_root_dir=dn_checkpoint_path,
    accelerator="gpu" if use_cuda else "cpu",
    devices=1,
    max_epochs=100,
    logger=TensorBoardLogger(dn_checkpoint_path),
    log_every_n_steps=len(dn_train_loader),
    callbacks=[LearningRateMonitor(logging_interval="epoch"),
               EarlyStopping(monitor='val/elbo', patience=50)],
)

num_latents = 8
z_dims = [32] * num_latents
vae = LadderVAE(
    z_dims=z_dims,
    noise_model=noise_model,
    img_width=img_width,
    dropout=0.0,
    data_mean=data_mean,
    data_std=data_std,
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/ben/miniforge3/envs/dnm/lib/python3.12/site-packages/pytorch_lightning/utilities/parsing.py:199: Attribute 'noise_model' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['noise_model'])`.


Train and save final parameters.

Training logs can be monitored in TensorBoard. Open a terminal, activate the dnm environment (which has TensorBoard installed) and enter `tensorboard --logdir path/to/autonoise/dn_checkpoint/`, then open a browser and go to `localhost:6006`.

The main metric to monitor here is the validation reconstruction loss, `val/reconstruction_loss`. This should go down sharply at first and then level off. The KL divergence, `kl_loss`, may go either up or down. The evidence lower bound, `elbo`, is the sum of these two losses, and training should stop once both of them have plateaued.
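"Plateaued" can be made concrete with a simple check on logged values. The sketch below is one possible criterion, not what `EarlyStopping` does internally: it compares the mean of the most recent epochs with the mean of the epochs just before them. The function names, `window` and `rel_tol` are illustrative choices.

```python
def elbo(reconstruction_loss: float, kl_loss: float) -> float:
    """ELBO as described above: the sum of the two loss terms."""
    return reconstruction_loss + kl_loss


def has_plateaued(values, window: int = 10, rel_tol: float = 1e-3) -> bool:
    """True when the mean of the last `window` values differs from the mean
    of the preceding `window` values by less than `rel_tol` (relative)."""
    if len(values) < 2 * window:
        return False  # not enough history to judge
    recent = sum(values[-window:]) / window
    previous = sum(values[-2 * window:-window]) / window
    scale = max(abs(previous), 1e-12)  # avoid division by zero
    return abs(recent - previous) / scale < rel_tol


# Synthetic curves for illustration only, not real training results
still_falling = [float(v) for v in range(20, 0, -1)]
flat = [1.0] * 20
print(has_plateaued(still_falling), has_plateaued(flat))
```

In practice the per-epoch values would be read from the TensorBoard logs, and the check applied to the reconstruction loss and the KL term separately.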

In [7]:
trainer.fit(vae, dn_train_loader, dn_val_loader)
trainer.save_checkpoint(os.path.join(dn_checkpoint_path, "final_params.ckpt"))

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name             | Type       | Params
------------------------------------------------
0 | noise_model      | PixelCNN   | 8.9 K 
1 | first_bottom_up  | Sequential | 25.3 K
2 | top_down_layers  | ModuleList | 1.2 M 
3 | bottom_up_layers | ModuleList | 365 K 
4 | final_top_down   | Sequential | 33.5 K
------------------------------------------------
1.6 M     Trainable params
8.9 K     Non-trainable params
1.6 M     Total params
6.457     Total estimated model params size (MB)


Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/home/ben/miniforge3/envs/dnm/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=35` in the `DataLoader` to improve performance.


                                                                           

/home/ben/miniforge3/envs/dnm/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=35` in the `DataLoader` to improve performance.


Epoch 0:   2%|▏         | 2/113 [00:00<00:32,  3.40it/s, v_num=1]

  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass


Epoch 99: 100%|██████████| 113/113 [00:14<00:00,  7.75it/s, v_num=1]

`Trainer.fit` stopped: `max_epochs=100` reached.


Epoch 99: 100%|██████████| 113/113 [00:14<00:00,  7.66it/s, v_num=1]
