# Introduction to PyTorch Lightning

In this notebook, I will walk you step by step through writing PyTorch Lightning code that you can run on the MNIST dataset in the Jupyter notebook you have launched.

## Simplest example: only a training loop in the Lightning model
This is about as simple as a Lightning model gets: it defines only a training_step (no validation, no testing).

In [1]:
import os
import torch
import pytorch_lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

BATCH_SIZE = 128 if torch.cuda.is_available() else 32

train_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()),
    batch_size = BATCH_SIZE)

trainer = pl.Trainer(
    accelerator="auto",
    max_epochs=5
)

model = LitModel()
trainer.fit(model, train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7.9 K 
--------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params
0.031     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.


Training: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.
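
To make clear what `trainer.fit` automated in the cell above, here is a minimal plain-PyTorch sketch of the same computation. It is an illustration only, not Lightning's actual implementation: the random tensors stand in for MNIST batches, and the names `layer`, `losses`, and the 20-step loop are assumptions made for this example.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

# Stand-in data: 64 fake "images" with random labels (not real MNIST)
x = torch.rand(64, 1, 28, 28)
y = torch.randint(0, 10, (64,))

layer = nn.Linear(28 * 28, 10)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.02)

losses = []
for step in range(20):
    # Same computation as forward() + training_step() in LitModel
    y_hat = torch.relu(layer(x.view(x.size(0), -1)))
    loss = F.cross_entropy(y_hat, y)

    # The part Lightning takes care of once training_step returns the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With Lightning, only the two lines that compute `y_hat` and `loss` remain in your model; the optimizer calls, device placement, and epoch loop move into the Trainer.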


## Adding a validation loop in the Lightning model
* A validation_step is added to the Lightning model, and a validation dataloader is passed to trainer.fit.

In [2]:
import os
import torch
import pytorch_lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F

#from torchmetrics import Accuracy

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

BATCH_SIZE = 128 if torch.cuda.is_available() else 32
#train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()))

train_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4, 
    batch_size = BATCH_SIZE
)

val_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4,
    batch_size = BATCH_SIZE
)

#mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
#mnist_train, mnist_val = random_split(mnist_full, [55000, 5000])


trainer = pl.Trainer(
    accelerator="auto",
    max_epochs = 5
)
model = LitModel()
trainer.fit(model, train_loader, val_loader)


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7.9 K 
--------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params
0.031     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.
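
The validation passes reported above roughly correspond to the following plain-PyTorch loop. This is a sketch under assumptions: the fake batches replace `val_loader`, and `model` and `val_losses` are names chosen for this example. The point is that validation runs in evaluation mode and without gradient tracking, which Lightning arranges for you around `validation_step`.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Three fake validation batches of 8 samples each (stand-in for val_loader)
val_batches = [
    (torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))) for _ in range(3)
]

model.eval()  # switch layers such as Dropout to inference behaviour
val_losses = []
with torch.no_grad():  # no gradients are tracked during validation
    for x, y in val_batches:
        loss = F.cross_entropy(model(x), y)
        val_losses.append(loss)
model.train()  # back to training mode for the next epoch

print([round(l.item(), 4) for l in val_losses])
```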


## Adding Lightning logging
* If you log your loss with self.log and prog_bar=True in the training or validation step, the loss is shown in the progress bar.
* Here is a code snippet:
```
class LitModel(pl.LightningModule):
    ...
    def validation_step(self, batch, batch_idx):
        ...

        # log the validation loss and show it in the progress bar
        self.log("val_loss", loss, prog_bar=True)
        return loss
```

In [3]:
import os
import torch
import pytorch_lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        
        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)

        # Calling self.log will surface up scalars for you in TensorBoard
        self.log("val_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

BATCH_SIZE = 128 if torch.cuda.is_available() else 32
#train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()))

train_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4, 
    batch_size = BATCH_SIZE
)

val_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4,
    batch_size = BATCH_SIZE
)

#mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
#mnist_train, mnist_val = random_split(mnist_full, [55000, 5000])


trainer = pl.Trainer(
    accelerator="auto",
    max_epochs = 5
)
model = LitModel()
trainer.fit(model, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7.9 K 
--------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params
0.031     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.
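
The training_step above logs `train_loss` with both `on_step=True` and `on_epoch=True`, in which case Lightning reports two series, `train_loss_step` and `train_loss_epoch`. The sketch below illustrates the difference with made-up numbers. It assumes the epoch value is a mean of the step values weighted by batch size, which is my understanding of the default reduction; the losses and batch sizes are hypothetical.

```python
# Hypothetical per-step losses and batch sizes (the last batch is smaller)
step_losses = [0.9, 0.7, 0.5]
batch_sizes = [32, 32, 16]

# on_step=True: every value is reported as soon as it is logged
for step, loss in enumerate(step_losses):
    print(f"step {step}: train_loss_step = {loss}")

# on_epoch=True: one aggregated value per epoch (mean weighted by batch size)
weighted_sum = sum(l * n for l, n in zip(step_losses, batch_sizes))
train_loss_epoch = weighted_sum / sum(batch_sizes)
print(f"train_loss_epoch = {train_loss_epoch:.4f}")
```

In validation_step the defaults are the other way round (epoch-level only), which is why `self.log("val_loss", loss, prog_bar=True)` shows one value per validation run.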


## Adding a validation_epoch_end
* You may want to report a metric such as accuracy at the end of every validation epoch.
* <span style="color:blue; font-weight: bold">Note that support for validation_epoch_end was removed in v2.0.0, so the code in this section only runs with a 1.x release.</span>
* Here is a code snippet.
```
from torchmetrics import Accuracy

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_accuracy = Accuracy(task="multiclass", num_classes=10)

    def validation_step(self, batch, batch_idx):
        ...
        self.val_accuracy.update(preds, y)

        # Calling self.log will surface up scalars for you in TensorBoard
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", self.val_accuracy, prog_bar=True)
        # return a dict so validation_epoch_end can read x['val_loss']
        return {'val_loss': loss}

    # Support for `validation_epoch_end` has been removed in v2.0.0.
    def validation_epoch_end(self, validation_step_outputs):
        avg_loss = torch.stack([x['val_loss'] for x in validation_step_outputs]).mean()
```
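
The `update` call in the snippet above accumulates state across batches so that one accuracy can be computed for the whole epoch. The following plain-PyTorch sketch mimics that update / compute / reset cycle for the overall fraction of correct predictions. It is a stand-in for illustration, not the torchmetrics implementation; `correct`, `total`, and the random batches are assumptions of this example.

```python
import torch

torch.manual_seed(0)

correct = 0
total = 0

# Three fake batches of logits (4 samples, 10 classes) and labels
for _ in range(3):
    y_hat = torch.randn(4, 10)
    y = torch.randint(0, 10, (4,))
    preds = torch.argmax(y_hat, dim=1)

    # "update": accumulate counts instead of averaging per-batch accuracies
    correct += (preds == y).sum().item()
    total += y.numel()

# "compute": one accuracy over everything seen in the epoch
val_acc = correct / total
print(f"val_acc = {val_acc:.3f} ({correct}/{total})")

# "reset": start from zero for the next epoch
correct, total = 0, 0
```

Accumulating counts rather than averaging batch accuracies keeps the result exact even when the last batch is smaller than the others.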

In [4]:
import os
import torch
import pytorch_lightning as pl
#import lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F

from torchmetrics import Accuracy

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=10)
        

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        preds = torch.argmax(y_hat, dim=1)
        self.val_accuracy.update(preds, y)

        # Calling self.log will surface up scalars for you in TensorBoard
        self.log("val_loss", loss, prog_bar=True, on_step=True, on_epoch=True )
        self.log("val_acc", self.val_accuracy, prog_bar=True, on_step=True, on_epoch=True )
        return {'val_loss': loss}
    
    #Support for `validation_epoch_end` has been removed in v2.0.0.
    def validation_epoch_end(self, validation_step_outputs): 
        avg_loss = torch.stack([x['val_loss'] for x in validation_step_outputs]).mean()
        self.log("avg_val_loss", avg_loss, prog_bar=True)
        return {'avg_val_loss': avg_loss}
    
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

BATCH_SIZE = 128 if torch.cuda.is_available() else 32
#train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()))

train_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4, 
    batch_size = BATCH_SIZE
)

val_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4,
    batch_size = BATCH_SIZE
)

#mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
#mnist_train, mnist_val = random_split(mnist_full, [55000, 5000])


trainer = pl.Trainer(
    accelerator="auto",
    max_epochs = 5
)
model = LitModel()
trainer.fit(model, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | l1           | Linear             | 7.9 K 
1 | val_accuracy | MulticlassAccuracy | 0     
----------------------------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params
0.031     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.


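## Replacing validation_epoch_end with on_validation_epoch_end
* If you have Lightning v2.0.x installed, you need to use on_validation_epoch_end instead.
* Unlike validation_epoch_end, it receives no step outputs, so you collect them yourself in validation_step.
* Here is a code snippet.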
```
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.validation_step_outputs = []
        ...
    def validation_step(self, batch, batch_idx):
        ...
        self.validation_step_outputs.append(loss)
        ...
    def on_validation_epoch_end(self):
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        ...
        self.validation_step_outputs.clear()  # start the next epoch with an empty list
```
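
The collect, stack, and clear pattern from the snippet can be tried in isolation. The sketch below uses three made-up scalar losses in place of real validation batches; only `torch.stack`, `mean`, and `list.clear` are exercised.

```python
import torch

validation_step_outputs = []

# validation_step: one scalar (0-dim) loss tensor per batch; fake values here
for value in (0.9, 0.7, 0.5):
    validation_step_outputs.append(torch.tensor(value))

# on_validation_epoch_end: stack the 0-dim tensors into shape (3,), then average
avg_loss = torch.stack(validation_step_outputs).mean()
print(avg_loss)

# Empty the list so the next epoch does not mix in old losses
validation_step_outputs.clear()
```

Note that this is an unweighted mean over batches, so it can differ slightly from a per-sample average when the last batch is smaller than the rest.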

In [5]:
import os
import torch
import pytorch_lightning as pl
#import lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F

from torchmetrics import Accuracy

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=10)
        self.validation_step_outputs = []
        

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        preds = torch.argmax(y_hat, dim=1)
        self.val_accuracy.update(preds, y)
        self.validation_step_outputs.append(loss)

        # Calling self.log will surface up scalars for you in TensorBoard
        self.log("val_loss", loss, prog_bar=True, on_step=True, on_epoch=True )
        self.log("val_acc", self.val_accuracy, prog_bar=True, on_step=True, on_epoch=True )
        return {'val_loss': loss}
   
    def on_validation_epoch_end(self): 
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        #print("avg_loss: ", avg_loss)
        self.log("avg_val_loss", avg_loss, prog_bar=True)
        self.validation_step_outputs.clear()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

BATCH_SIZE = 128 if torch.cuda.is_available() else 32
#train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()))

train_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4, 
    batch_size = BATCH_SIZE
)

val_loader = DataLoader(
    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4,
    batch_size = BATCH_SIZE
)

trainer = pl.Trainer(
    accelerator="auto",
    max_epochs = 5
)
model = LitModel()
trainer.fit(model, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | l1           | Linear             | 7.9 K 
1 | val_accuracy | MulticlassAccuracy | 0     
----------------------------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params
0.031     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.


## MNIST dataset split with train and val data (55000 vs. 5000)
* So far, the same 60000-image MNIST training set has been used for both training and validation.
* You may want to split the MNIST dataset so that validation runs on data the model has not trained on.
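
Before applying the split to MNIST, here is a small sketch of `random_split` on a toy dataset. The 60-sample `TensorDataset` and the seed 42 are assumptions for illustration; passing a seeded `generator` is optional and simply makes the split reproducible between runs.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Tiny stand-in for MNIST: 60 samples instead of 60000
full = TensorDataset(torch.arange(60).float().unsqueeze(1), torch.arange(60) % 10)

# Fixing the generator seed makes the split reproducible (42 is arbitrary)
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(full, [55, 5], generator=generator)

print(len(train_set), len(val_set))

# The two subsets share no sample indices
assert set(train_set.indices).isdisjoint(val_set.indices)
```

The lengths passed to `random_split` must add up to the size of the dataset, which is why the cell below uses 55000 and 5000.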

In [6]:
import os
import torch
import pytorch_lightning as pl
#import lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F

from torchmetrics import Accuracy
from torch.utils.data import random_split

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=10)
        self.validation_step_outputs = []
        

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        #logits = self(x)
        #loss = F.nll_loss(logits, y)
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        #preds = torch.argmax(logits, dim=1)
        preds = torch.argmax(y_hat, dim=1)
        self.val_accuracy.update(preds, y)
        #self.validation_step_outputs.append(pred)
        self.validation_step_outputs.append(loss)

        # Calling self.log will surface up scalars for you in TensorBoard
        self.log("val_loss", loss, prog_bar=True, on_step=True, on_epoch=True )
        self.log("val_acc", self.val_accuracy, prog_bar=True, on_step=True, on_epoch=True )
        return {'val_loss': loss}
    
    ##Support for `validation_epoch_end` has been removed in v2.0.0. 
    #def validation_epoch_end(self, validation_step_outputs): 
    #    avg_loss = torch.stack([x['val_loss'] for x in validation_step_outputs]).mean()
    #    print("avg_loss: ", avg_loss)
    #    self.log("avg_val_loss", avg_loss, prog_bar=True)
    #    return {'avg_val_loss': avg_loss}
   
    def on_validation_epoch_end(self): 
        #avg_loss = torch.stack([x['val_loss'] for x in validation_step_outputs]).mean()
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        #print("avg_loss: ", avg_loss)
        self.log("avg_val_loss", avg_loss, prog_bar=True)
        self.validation_step_outputs.clear()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

BATCH_SIZE = 128 if torch.cuda.is_available() else 32
#train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()))

#train_loader = DataLoader(
#    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4, 
#    batch_size = BATCH_SIZE
#)

#val_loader = DataLoader(
#    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4,
#    batch_size = BATCH_SIZE
#)

mnist_full = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
mnist_train, mnist_val = random_split(mnist_full, [55000, 5000])

train_loader = DataLoader(
    mnist_train, num_workers=4, batch_size = BATCH_SIZE
)

val_loader = DataLoader(
    mnist_val, num_workers=4, batch_size = BATCH_SIZE
)


trainer = pl.Trainer(
    accelerator="auto",
    max_epochs = 5
)
model = LitModel()
trainer.fit(model, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | l1           | Linear             | 7.9 K 
1 | val_accuracy | MulticlassAccuracy | 0     
----------------------------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params
0.031     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.


## Adding data-related hooks inside the Lightning model

In [7]:
import os
import torch
import pytorch_lightning as pl
#import lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F

from torchmetrics import Accuracy
from torch.utils.data import random_split

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 128 if torch.cuda.is_available() else 32

class LitModel(pl.LightningModule):
    def __init__(self, data_dir=PATH_DATASETS, hidden_size=64, learning_rate=2e-4):
        super().__init__()
        
        # Set our init args as class attributes
        self.data_dir = data_dir
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        
        #self.l1 = nn.Linear(28 * 28, 10)
        # Hardcode some dataset specific attributes
        self.num_classes = 10
        self.dims = (1, 28, 28)
        channels, width, height = self.dims
        self.transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
            ]
        )

        # Define PyTorch model
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * width * height, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, self.num_classes),
        )
        
        self.val_accuracy = Accuracy(task="multiclass", num_classes=10)
        self.validation_step_outputs = []

    def forward(self, x):
        #return torch.relu(self.l1(x.view(x.size(0), -1)))
        logit = self.model(x)
        return logit

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        #logits = self(x)
        #loss = F.nll_loss(logits, y)
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        #preds = torch.argmax(logits, dim=1)
        preds = torch.argmax(logits, dim=1)
        self.val_accuracy.update(preds, y)
        #self.validation_step_outputs.append(pred)
        self.validation_step_outputs.append(loss)

        # Calling self.log will surface up scalars for you in TensorBoard
        self.log("val_loss", loss, prog_bar=True, on_step=True, on_epoch=True )
        self.log("val_acc", self.val_accuracy, prog_bar=True, on_step=True, on_epoch=True )
        return {'val_loss': loss}
   
    def on_validation_epoch_end(self): 
        #avg_loss = torch.stack([x['val_loss'] for x in validation_step_outputs]).mean()
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        #print("avg_loss: ", avg_loss)
        self.log("avg_val_loss", avg_loss, prog_bar=True)
        self.validation_step_outputs.clear()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        # use the learning_rate passed to __init__ instead of a hard-coded value
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    
    ####################
    # DATA RELATED HOOKS
    ####################
    
    def prepare_data(self):
        # download
        MNIST(self.data_dir, train=True, download=True)
        #MNIST(PATH_DATASETS, train=False, download=True)
        
    def setup(self, stage=None):
        # Assign train/val datasets for use in dataloaders
        if stage == "fit" or stage is None:
            mnist_full = MNIST(self.data_dir, train=True, download=True, transform=transforms.ToTensor())
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])
        
    def train_dataloader(self):
        # use the datasets created in setup(), not module-level globals
        return DataLoader(self.mnist_train, num_workers=4, batch_size = BATCH_SIZE)
    
    def val_dataloader(self):
        return DataLoader(self.mnist_val, num_workers=4, batch_size = BATCH_SIZE)

#BATCH_SIZE = 128 if torch.cuda.is_available() else 32
#train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()))

#train_loader = DataLoader(
#    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4, 
#    batch_size = BATCH_SIZE
#)

#val_loader = DataLoader(
#    MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), num_workers=4,
#    batch_size = BATCH_SIZE
#)

#mnist_full = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
#mnist_train, mnist_val = random_split(mnist_full, [55000, 5000])

#train_loader = DataLoader(
#    mnist_train, num_workers=4, batch_size = BATCH_SIZE
#)

#val_loader = DataLoader(
#    mnist_val, num_workers=4, batch_size = BATCH_SIZE
#)

trainer = pl.Trainer(
    accelerator="auto",
    max_epochs = 5
)
model = LitModel()
#trainer.fit(model, train_loader, val_loader)
trainer.fit(model)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | model        | Sequential         | 55.1 K
1 | val_accuracy | MulticlassAccuracy | 0     
----------------------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.


## Adding a test loop in the lightning model

In [8]:
import os
import torch
import pytorch_lightning as pl
#import lightning as pl
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
from torch.nn import functional as F

from torchmetrics import Accuracy
from torch.utils.data import random_split

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 128 if torch.cuda.is_available() else 32

class LitModel(pl.LightningModule):
    def __init__(self, data_dir=PATH_DATASETS, hidden_size=64, learning_rate=2e-4):
        super().__init__()
    
        #self.l1 = nn.Linear(28 * 28, 10)
        # Set our init args as class attributes
        self.data_dir = data_dir
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        
        #self.l1 = nn.Linear(28 * 28, 10)
        # Hardcode some dataset specific attributes
        self.num_classes = 10
        self.dims = (1, 28, 28)
        channels, width, height = self.dims
        self.transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
            ]
        )

        # Define PyTorch model
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * width * height, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, self.num_classes),
        )
        
        self.val_accuracy = Accuracy(task="multiclass", num_classes=10)
        self.validation_step_outputs = []
        
        self.test_accuracy = Accuracy(task="multiclass", num_classes=10) 

    def forward(self, x):
        #return torch.relu(self.l1(x.view(x.size(0), -1)))
        logits = self.model(x)
        return logits

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        preds = torch.argmax(logits, dim=1)
        # Accumulate accuracy statistics across the batches of this epoch
        self.val_accuracy.update(preds, y)
        # Keep the batch loss so on_validation_epoch_end can average it
        self.validation_step_outputs.append(loss)

        # self.log surfaces these scalars in the progress bar and in TensorBoard
        self.log("val_loss", loss, prog_bar=True, on_step=True, on_epoch=True)
        self.log("val_acc", self.val_accuracy, prog_bar=True, on_step=True, on_epoch=True)
        return {'val_loss': loss}
    
    # Support for `validation_epoch_end` has been removed in v2.0.0; use `on_validation_epoch_end` instead.
    #def validation_epoch_end(self, validation_step_outputs): 
    #    avg_loss = torch.stack([x['val_loss'] for x in validation_step_outputs]).mean()
    #    print("avg_loss: ", avg_loss)
    #    self.log("avg_val_loss", avg_loss, prog_bar=True)
    #    return {'avg_val_loss': avg_loss}
   
    def on_validation_epoch_end(self):
        # Unweighted mean of the per-batch losses collected in validation_step
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        self.log("avg_val_loss", avg_loss, prog_bar=True)
        # Empty the list so the next epoch starts fresh; the hook's return value is not used
        self.validation_step_outputs.clear()
    
    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        preds = torch.argmax(logits, dim=1)
        # Accumulate accuracy statistics across the test batches
        self.test_accuracy.update(preds, y)

        # self.log surfaces these scalars in the progress bar and in TensorBoard
        self.log("test_loss", loss, prog_bar=True, on_step=True, on_epoch=True)
        self.log("test_acc", self.test_accuracy, prog_bar=True, on_step=True, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    
    def prepare_data(self):
        # Download only; datasets are assigned to attributes later, in setup()
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)
            
    def setup(self, stage=None):
        # Assign train/val datasets for use in dataloaders
        if stage == "fit" or stage is None:
            mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
            # Split the 60,000 training images into 55,000 for training and 5,000 for validation
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

        # Assign test dataset for use in dataloader(s)
        if stage == "test" or stage is None:
            self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.mnist_train, num_workers=4, batch_size=BATCH_SIZE)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, num_workers=4, batch_size=BATCH_SIZE)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, num_workers=4, batch_size=BATCH_SIZE)

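trainer = pl.Trainer(
    accelerator="auto",
    max_epochs=5,
)
model = LitModel()
# No dataloaders are passed here: the model supplies them through its *_dataloader hooks
trainer.fit(model)

print()
print("Trainer Testing Starting...")
# Without arguments, test() reuses the fitted model and restores its weights from the saved checkpoint
trainer.test()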

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type               | Params
-----------------------------------------------------
0 | model         | Sequential         | 55.1 K
1 | val_accuracy  | MulticlassAccuracy | 0     
2 | test_accuracy | MulticlassAccuracy | 0     
-----------------------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
SLURM auto-re
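
As a sanity check on the summary table, the 55.1 K figure can be reproduced by hand. The sketch below counts weights and biases for the three `nn.Linear` layers of the model. Note that `hidden_size=64` is an assumption on my part (the constructor default is not visible in this cell); I picked it because it matches the reported total. The helper names are hypothetical and exist only for this illustration.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_params(input_size: int, hidden_size: int, num_classes: int) -> int:
    """Parameter count of the three-layer MLP (ReLU, Dropout, Flatten add none)."""
    return (
        linear_params(input_size, hidden_size)
        + linear_params(hidden_size, hidden_size)
        + linear_params(hidden_size, num_classes)
    )


# Assumption: hidden_size=64, inferred from the 55.1 K total in the summary.
total = mlp_params(input_size=1 * 28 * 28, hidden_size=64, num_classes=10)
size_mb = total * 4 / 1e6  # float32 stores each parameter in 4 bytes

print(total)             # 55050
print(f"{size_mb:.3f}")  # 0.220
```

Under that assumption, both the parameter count and the estimated size in MB agree with the table.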

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.
  rank_zero_warn(
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Restoring states from the checkpoint path at /scratch/qualis/lightning/lightning_logs/version_202342/checkpoints/epoch=4-step=2150.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at /scratch/qualis/lightning/lightning_logs/version_202342/checkpoints/epoch=4-step=2150.ckpt
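
The `step=2150` in the checkpoint file name is the number of optimizer steps taken during training. Here is a minimal sketch of that arithmetic, assuming `BATCH_SIZE` is 128 (the GPU branch from the first cell), the default `drop_last=False`, and no gradient accumulation. `optimizer_steps` is a hypothetical helper, not a Lightning function.

```python
import math


def optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Total training steps when the last, partial batch is kept."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


# 55,000 training images remain after the 55,000 / 5,000 split
print(optimizer_steps(55_000, 128, 5))  # 2150
# On a CPU-only machine BATCH_SIZE would be 32 instead
print(optimizer_steps(55_000, 32, 5))   # 8595
```

With 128 samples per batch, each epoch takes 430 steps, so five epochs end at step 2150, which matches the `epoch=4` (zero-based) checkpoint name.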



Trainer Testing Starting...


Testing: 0it [00:00, ?it/s]

[{'test_loss_epoch': 0.16354569792747498, 'test_acc_epoch': 0.951200008392334}]
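
One subtlety in the cell above: `on_validation_epoch_end` takes a plain mean of the per-batch losses, so every batch counts equally, even though the last validation batch is much smaller than the rest. As far as I understand, the epoch-level value that `self.log(..., on_epoch=True)` produces is weighted by batch size instead, so `avg_val_loss` and `val_loss_epoch` can differ slightly. The sketch below shows the effect in plain Python. The loss values are made up for illustration, and a batch size of 128 is assumed.

```python
def batch_sizes(num_samples: int, batch_size: int) -> list[int]:
    """Sizes of consecutive batches when the last, partial batch is kept."""
    full, rest = divmod(num_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(5_000, 128)  # 39 batches of 128 and one batch of 8

# Hypothetical per-batch mean losses: 0.15 everywhere, 0.90 on the small last batch
losses = [0.15] * (len(sizes) - 1) + [0.90]

# Plain mean of batch means, as computed in on_validation_epoch_end
unweighted = sum(losses) / len(losses)
# Mean over all samples: each batch mean weighted by its batch size
weighted = sum(loss * n for loss, n in zip(losses, sizes)) / sum(sizes)

print(f"{unweighted:.3f}")  # 0.169
print(f"{weighted:.3f}")    # 0.151
```

The gap is exaggerated here on purpose. With realistic losses the two numbers are usually close, but the sample-weighted mean is the one that corresponds to the average loss over the whole validation set.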