## Install dependencies

In [1]:
!pip install -qqq wandb
!pip install -qqq pytorch-lightning

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lightning 2022.9.30 requires tensorboard>=2.9.1, but you have tensorboard 2.2.0 which is incompatible.

## Import required modules

In [2]:
# Weights & Biases
import wandb
from pytorch_lightning.loggers import WandbLogger

import pytorch_lightning.metrics
import pytorch_lightning.callbacks as pt_callbacks
# Pytorch modules
import torch
from torch.nn import functional as F
from torch import nn
from torch.optim import Adam
from torch.utils.data import DataLoader, random_split
import torchvision
# Pytorch-Lightning
from pytorch_lightning import LightningDataModule, LightningModule, Trainer
import pytorch_lightning as pl

# Dataset
from torchvision.datasets import CIFAR10
from torchvision import transforms
from torchmetrics.functional import accuracy

## Defining a model

In PyTorch Lightning, models are built with `LightningModule`, a subclass of `torch.nn.Module` with added functionality to simplify training.

Models are defined with:
* `__init__` for model parameters
* `forward` for inference
* `training_step` to return a loss from a single batch
* `configure_optimizers` to define the training optimizer

Additional methods can be defined such as:
* `validation_step` and `test_step` for logging metrics when working with validation & test data sets
* methods such as `training_step_end` and `training_epoch_end` for more complex loops
* other custom hooks for more flexibility

In [3]:
def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

In [4]:
class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()
        self.model = create_model()
        self.lr = lr
        self.accuracy = pl.metrics.Accuracy()
        self.save_hyperparameters()
        

    def forward(self, x):
        # x has shape (batch_size, channels, height, width)
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        '''defines model optimizer'''
        return Adam(self.parameters(), lr=self.lr)

*Note: in this particular model, `validation_step` and `test_step` already share their code through `evaluate`; `training_step` repeats the same forward pass and loss computation and could be refactored to reuse it.*
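
The pairing of `F.log_softmax` in `forward` with `F.nll_loss` in the step methods amounts to the usual cross-entropy loss. The sketch below reproduces that computation for a single sample in plain Python; the three-class scores, the target index and the helper names are made-up illustration values, not taken from the notebook.

```python
import math


def log_softmax(scores):
    """Return log-probabilities for a list of raw class scores."""
    peak = max(scores)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]


# Hypothetical raw scores for a 3-class problem and a hypothetical label.
scores = [2.0, 0.5, -1.0]
target = 0

log_probs = log_softmax(scores)
loss = nll_loss(log_probs, target)

# Cross-entropy computed directly from softmax probabilities, for comparison.
denominator = sum(math.exp(s) for s in scores)
probs = [math.exp(s) / denominator for s in scores]
cross_entropy = -math.log(probs[target])

print(f"nll(log_softmax) = {loss:.6f}, cross-entropy = {cross_entropy:.6f}")
```

Both printed values agree, which is why the model can return log-probabilities from `forward` and still be trained with a cross-entropy objective.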

## Loading data

Data pipelines can be created with:
* PyTorch `DataLoaders`
* LightningModule `DataLoaders`
* `DataModules`

Using `DataModules` is recommended whenever possible, as their structured definition allows for additional automated optimization such as workload distribution between CPU & GPU.

`DataModules` are defined with:
* `prepare_data` (optional) which is called only once and on 1 GPU
* `setup` which is called on each GPU separately and accepts `stage` to define if we are at `fit` or `test` step
* `train_dataloader`, `val_dataloader` and `test_dataloader` to load respectively training, validation and test datasets

In [5]:
class CIFAR10DataModule(LightningDataModule):

    def __init__(self, data_dir='./', batch_size=256):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def prepare_data(self):
        '''called only once and on 1 GPU'''
        # download data
        CIFAR10(self.data_dir, train=True, download=True)
        CIFAR10(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        '''called on each GPU separately - stage defines if we are at fit or test step'''
        # we set up only the relevant datasets when stage is specified (set automatically by PyTorch Lightning)
        if stage == 'fit' or stage is None:
            cifar_train = CIFAR10(self.data_dir, train=True, transform=self.transform)
            self.cifar_train, self.cifar_val = random_split(cifar_train, [45000, 5000])
        if stage == 'test' or stage is None:
            self.cifar_test = CIFAR10(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        '''returns training dataloader'''
        cifar_train = DataLoader(self.cifar_train, batch_size=self.batch_size)
        return cifar_train

    def val_dataloader(self):
        '''returns validation dataloader'''
        cifar_val = DataLoader(self.cifar_val, batch_size=self.batch_size)
        return cifar_val

    def test_dataloader(self):
        '''returns test dataloader'''
        cifar_test = DataLoader(self.cifar_test, batch_size=self.batch_size)
        return cifar_test
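
As a quick check of the sizes involved, the sketch below works out how many batches each dataloader yields per epoch. It assumes the standard CIFAR-10 sizes (50,000 training and 10,000 test images) together with the 45,000/5,000 split and `batch_size=256` used above; since `drop_last` is not set on the dataloaders, the final partial batch is assumed to be kept.

```python
import math

BATCH_SIZE = 256  # default batch_size of CIFAR10DataModule above

# Dataset sizes: the 45,000/5,000 split from setup(), plus the
# standard 10,000-image CIFAR-10 test set (assumption stated above).
split_sizes = {"train": 45_000, "val": 5_000, "test": 10_000}


def batches_per_epoch(num_samples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)


num_batches = {
    name: batches_per_epoch(size, BATCH_SIZE)
    for name, size in split_sizes.items()
}

for name, count in num_batches.items():
    print(f"{name}: {count} batches per epoch")
```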

## Setting up Weights & Biases

We log in to W&B (required only once per machine):
* in bash, `wandb login`
* in notebooks, `wandb.login()`

In [6]:
wandb.login()

wandb: Currently logged in as: markllmark. Use `wandb login --relogin` to force relogin


True

Logging to W&B is automated by `WandbLogger`. Refer to [the documentation](https://docs.wandb.com/library/integrations/lightning) for custom options.



In [7]:
wandb_logger = WandbLogger(project='2022320001_김병준_pytoch lightning Cifar10')

## Training the model

We set up our data and model.

In [8]:
# setup data
cifar10 = CIFAR10DataModule()
 
# setup model - choose different hyperparameters per experiment
model = LitResnet(lr=0.05)

  f"The parameter '{pretrained_param}' is deprecated since 0.13 and will be removed in 0.15, "


In [9]:
!pip install https://github.com/PyTorchLightning/pytorch-lightning/archive/master.zip

Collecting https://github.com/PyTorchLightning/pytorch-lightning/archive/master.zip
  Using cached https://github.com/PyTorchLightning/pytorch-lightning/archive/master.zip
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tensorboard>=2.9.1
  Using cached tensorboard-2.10.1-py3-none-any.whl (5.9 MB)


Installing collected packages: tensorboard
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.2.0
    Uninstalling tensorboard-2.2.0:
      Successfully uninstalled tensorboard-2.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytorch-lightning 0.9.0 requires tensorboard==2.2.0, but you have tensorboard 2.10.1 which is incompatible.
Successfully installed tensorboard-2.10.1


We can then set up our trainer and customize several options, such as gradient accumulation, half-precision training and distributed computing.

In [10]:
trainer = Trainer(
    logger=wandb_logger,    # W&B integration
    gpus=-1,                # use all GPUs
    max_epochs=20           # number of epochs
    )

  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
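
Gradient accumulation, one of the trainer options mentioned above, combines gradients from several small batches before taking an optimizer step, so the update matches what a larger batch would have produced. The plain-Python sketch below illustrates the idea on a one-parameter least-squares model; the data, the starting weight and the helper names are hypothetical, and in Lightning the feature is requested through the `accumulate_grad_batches` argument of `Trainer` rather than written by hand.

```python
def grad_mse(weight, xs, ys):
    """Gradient of the mean squared error of y = weight * x w.r.t. weight."""
    n = len(xs)
    return sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data and starting weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
weight = 0.5

# Gradient from one large batch of 8 samples.
full_batch_grad = grad_mse(weight, xs, ys)

# Same gradient obtained by accumulating over 4 micro-batches of 2 samples.
micro_batch_size = 2
accumulate_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for step in range(accumulate_steps):
    start = step * micro_batch_size
    stop = start + micro_batch_size
    # Each micro-batch gradient is scaled by 1/accumulate_steps so that
    # the running sum equals the mean over the full batch.
    micro_grad = grad_mse(weight, xs[start:stop], ys[start:stop])
    accumulated_grad += micro_grad / accumulate_steps

print(f"full batch: {full_batch_grad:.6f}, accumulated: {accumulated_grad:.6f}")
```

The two gradients agree up to floating-point error, which is what lets a small per-step batch stand in for a larger effective batch when memory is limited.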


Training just requires a call to the `fit` method.

In [11]:
trainer.fit(model, cifar10)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./cifar-10-python.tar.gz


  0%|          | 0/170498071 [00:00<?, ?it/s]

Extracting ./cifar-10-python.tar.gz to ./
Files already downloaded and verified


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type     | Params
--------------------------------------
0 | model    | ResNet   | 11.2 M
1 | accuracy | Accuracy | 0     
--------------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.696    Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]



Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


`Trainer.fit` stopped: `max_epochs=20` reached.
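
Two figures from this run can be checked by hand. The sketch below assumes 4 bytes per parameter (float32) for the model-size estimate and one optimizer step per training batch for the step count; the 45,000 training samples, the batch size of 256 and `max_epochs=20` come from the cells above.

```python
import math

# Model size: the summary reports 44.696 MB for the parameters.
# Assumption: each parameter is stored as a 4-byte float32.
reported_size_mb = 44.696
bytes_per_param = 4
approx_params = reported_size_mb * 1e6 / bytes_per_param
print(f"approx. parameters: {approx_params / 1e6:.1f} M")

# Step count: one optimizer step per training batch (no accumulation).
train_samples = 45_000
batch_size = 256
max_epochs = 20
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * max_epochs
print(f"{steps_per_epoch} steps/epoch x {max_epochs} epochs = {total_steps} steps")
```

The parameter estimate is consistent with the 11.2 M shown in the model summary, and the step count matches the `trainer/global_step` value in the W&B run summary at the end of the notebook.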


When a test set is available, we just need to call the `test` method.

In [12]:
trainer.test(model, datamodule=cifar10)

Files already downloaded and verified
Files already downloaded and verified


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

[{'test_loss': 1.8210960626602173, 'test_acc': 0.711899995803833}]
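
The `test_acc` value is the fraction of test images whose highest-scoring class matches the label. A minimal plain-Python sketch of that computation follows; the four rows of scores and the labels are made-up values for illustration, and `argmax` is a local helper rather than a library function.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical per-class scores for four samples (three classes each).
scores = [
    [0.1, 2.3, -0.5],
    [1.7, 0.2, 0.4],
    [-1.0, 0.0, 3.2],
    [0.9, 1.1, 0.3],
]
labels = [1, 0, 2, 0]  # hypothetical ground-truth classes

# Predicted class = index of the highest score, as torch.argmax does per row.
preds = [argmax(row) for row in scores]
correct = sum(int(p == y) for p, y in zip(preds, labels))
acc = correct / len(labels)

print(f"predictions: {preds}, accuracy: {acc:.2f}")
```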

When we want to close our W&B run, we can call `wandb.finish()` (mainly useful in notebooks, called automatically in scripts).

In [13]:
wandb.finish()

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇████ |
| test_acc | ▁ |
| test_loss | ▁ |
| train_loss | ██▇▆▅▅▅▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| trainer/global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████ |
| val_acc | ▁▄▄▅▅▇▇▇▇█▇█▇███▇███ |
| val_loss | █▄▄▂▂▁▁▂▃▁▂▂▃▂▂▄▅▃▃▅ |

| Metric | Value |
|---|---|
| epoch | 20.0 |
| test_acc | 0.7119 |
| test_loss | 1.8211 |
| train_loss | 0.06786 |
| trainer/global_step | 3520.0 |
| val_acc | 0.721 |
| val_loss | 1.81783 |
