[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juansensio/blog/blob/master/072_pytorch_ngc/072_pytorch_ngc.ipynb)

# PyTorch NGC

In previous posts we have seen many tricks for optimizing our `PyTorch` code, but we have not yet paid attention to how `PyTorch` itself is installed. In this post we will learn how to use an optimized build of `PyTorch` that can sometimes give us a small performance boost. To do so we will use `Docker`.

- Image with `PyTorch`: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
- Install `Docker`: https://docs.docker.com/engine/install/ubuntu/
- Install `nvidia-docker`: https://github.com/NVIDIA/nvidia-docker
- Optionally, install `docker-compose`: https://docs.docker.com/compose/install/

## Docker

> Explaining in detail what `Docker` is or what it is used for is beyond the scope of this post. There are many resources online that you can use to learn more about this technology.

## Notebooks in Docker

```
# general form: mount a local directory into the container
docker run --gpus all --ipc=host --rm -v local_dir:container_dir nvcr.io/nvidia/pytorch:xx.xx-py3

# run a single command inside the container
docker run --gpus all --ipc=host --rm nvcr.io/nvidia/pytorch:21.06-py3 echo "hola"

# print the container's working directory
docker run --gpus all --ipc=host --rm nvcr.io/nvidia/pytorch:21.06-py3 pwd

# start a Jupyter notebook server on port 8888 with this post's folder mounted at /workspace
docker run --gpus all --ipc=host --rm -v $PWD/072_pytorch_ngc:/workspace -p 8888:8888 nvcr.io/nvidia/pytorch:21.06-py3 jupyter notebook --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=abc123

# same as above, additionally mounting the data folder at /workspace/data
docker run --gpus all --ipc=host --rm -v $PWD/072_pytorch_ngc:/workspace -v $PWD/072_pytorch_ngc/data:/workspace/data -p 8888:8888 nvcr.io/nvidia/pytorch:21.06-py3 jupyter notebook --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=abc123
```
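The commands above all share the same flags: `--gpus all` exposes every GPU to the container, `--ipc=host` shares the host's IPC namespace (the shared memory that `DataLoader` workers rely on), `--rm` removes the container when it exits, `-v` mounts a host directory and `-p` publishes a port. As a minimal sketch, the same command line could be assembled from Python. The helper name `build_docker_run` and its parameters are hypothetical, not part of Docker or of the original post.

```python
import shlex
from pathlib import Path


def build_docker_run(image, command=(), volumes=(), ports=(), gpus="all"):
    """Assemble a `docker run` argument list (hypothetical helper for illustration)."""
    args = ["docker", "run", "--gpus", gpus, "--ipc=host", "--rm"]
    for host_dir, container_dir in volumes:
        # -v host_dir:container_dir mounts a host directory inside the container
        args += ["-v", f"{host_dir}:{container_dir}"]
    for host_port, container_port in ports:
        # -p host_port:container_port publishes a container port on the host
        args += ["-p", f"{host_port}:{container_port}"]
    args.append(image)
    args += list(command)
    return args


cmd = build_docker_run(
    "nvcr.io/nvidia/pytorch:21.06-py3",
    command=["jupyter", "notebook", "--allow-root", "--ip=0.0.0.0",
             "--no-browser", "--NotebookApp.token=abc123"],
    volumes=[(Path.cwd() / "072_pytorch_ngc", "/workspace")],  # mirrors $PWD/072_pytorch_ngc
    ports=[(8888, 8888)],
)

# Print the command instead of running it; nothing is launched here.
print(shlex.join(cmd))
```

The resulting list could then be handed to `subprocess.run(cmd, check=True)` on a machine where `Docker` and `nvidia-docker` are installed.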

In [1]:
import torch 

torch.__version__

'1.9.0a0+c3d40fd'
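The version string reported inside the container does not look like a regular release number. In PEP 440 terms, `a0` marks an alpha pre-release and everything after `+` is a local version label. That the label here identifies the source commit the image was built from is an assumption, not something the post states. A minimal sketch of how the two parts can be separated, using a hypothetical helper `split_version`:

```python
def split_version(version):
    """Split a version string into its public part and its local label (PEP 440)."""
    public, _, local = version.partition("+")
    return public, local


public, local = split_version("1.9.0a0+c3d40fd")
print(public)  # 1.9.0a0
print(local)   # c3d40fd
```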

In [2]:
torch.cuda.is_available()

True

In [3]:
import os
from sklearn.model_selection import train_test_split

def setup(path='./data', test_size=0.2, random_state=42):

    classes = sorted(os.listdir(path))

    print("Generating images and labels ...")
    images, encoded = [], []
    for ix, label in enumerate(classes):
        _images = os.listdir(f'{path}/{label}')
        images += [f'{path}/{label}/{img}' for img in _images]
        encoded += [ix]*len(_images)
    print(f'Number of images: {len(images)}')

    # train / val split
    print("Generating train / val splits ...")
    train_images, val_images, train_labels, val_labels = train_test_split(
        images,
        encoded,
        stratify=encoded,
        test_size=test_size,
        random_state=random_state
    )

    print("Training samples: ", len(train_labels))
    print("Validation samples: ", len(val_labels))
    
    return classes, train_images, train_labels, val_images, val_labels

classes, train_images, train_labels, val_images, val_labels = setup('./data')

Generating images and labels ...
Number of images: 27000
Generating train / val splits ...
Training samples:  21600
Validation samples:  5400


In [4]:
import torch
from skimage import io 

class Dataset(torch.utils.data.Dataset):
    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, ix):
        img = io.imread(self.images[ix])[...,(3,2,1)]
        img = torch.tensor(img / 4000, dtype=torch.float).clip(0,1).permute(2,0,1)  
        label = torch.tensor(self.labels[ix], dtype=torch.long)        
        return img, label
    
ds = {
    'train': Dataset(train_images, train_labels),
    'val': Dataset(val_images, val_labels)
}

batch_size = 1024
dl = {
    'train': torch.utils.data.DataLoader(ds['train'], batch_size=batch_size, shuffle=True, num_workers=20, pin_memory=True),
    'val': torch.utils.data.DataLoader(ds['val'], batch_size=batch_size, shuffle=False, num_workers=20, pin_memory=True)
}
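The `__getitem__` method above picks channels 3, 2 and 1 (in that order) from each image, divides the values by 4000 and clips the result to the range [0, 1]. The sketch below reproduces that arithmetic for a single pixel in plain Python. The helper `preprocess_pixel` and the pixel values are made up for illustration, and the final `permute(2,0,1)` (which only moves the channel axis first) is not shown.

```python
def preprocess_pixel(bands, channels=(3, 2, 1), scale=4000):
    """Pick `channels` from one pixel's band values, then scale and clip to [0, 1]."""
    selected = [bands[c] for c in channels]
    return [min(max(v / scale, 0.0), 1.0) for v in selected]


# Hypothetical 5-band pixel; the values are invented for this example.
pixel = [500, 1000, 2000, 3000, 6000]
print(preprocess_pixel(pixel))  # [0.75, 0.5, 0.25]
```

Values above 4000 saturate at 1.0 and negative values are clipped to 0.0.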

In [5]:
import torch.nn.functional as F
import timm

class Model(torch.nn.Module):

    def __init__(self, n_outputs=10, use_amp=True):
        super().__init__()
        self.model = timm.create_model('tf_efficientnet_b5', pretrained=True, num_classes=n_outputs)
        self.use_amp = use_amp

    def forward(self, x, log=False):
        if log:
            print(x.shape)
        with torch.cuda.amp.autocast(enabled=self.use_amp):
            return self.model(x)

In [6]:
from tqdm import tqdm
import numpy as np

def step(model, batch, device):
    x, y = batch
    x, y = x.to(device), y.to(device)
    y_hat = model(x)
    loss = F.cross_entropy(y_hat, y)
    acc = (torch.argmax(y_hat, axis=1) == y).sum().item() / y.size(0)
    return loss, acc

def train_amp(model, dl, optimizer, epochs=10, device="cpu", use_amp = True, prof=None, end=0):
    model.to(device)
    hist = {'loss': [], 'acc': [], 'val_loss': [], 'val_acc': []}
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    for e in range(1, epochs+1):
        # train
        model.train()
        l, a = [], []
        bar = tqdm(dl['train'])
        stop=False
        for batch_idx, batch in enumerate(bar):
            optimizer.zero_grad()
            
            # AMP
            with torch.cuda.amp.autocast(enabled=use_amp):
                loss, acc = step(model, batch, device)
            scaler.scale(loss).backward()
            # gradient clipping 
            #torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
            scaler.step(optimizer)
            scaler.update()
            
            l.append(loss.item())
            a.append(acc)
            bar.set_description(f"training... loss {np.mean(l):.4f} acc {np.mean(a):.4f}")
            # profiling
            if prof:
                if batch_idx >= end:
                    stop = True
                    break
                prof.step()  
        hist['loss'].append(np.mean(l))
        hist['acc'].append(np.mean(a))
        if stop:
            break
        # eval
        model.eval()
        l, a = [], []
        bar = tqdm(dl['val'])
        with torch.no_grad():
            for batch in bar:
                loss, acc = step(model, batch, device)
                l.append(loss.item())
                a.append(acc)
                bar.set_description(f"evluating... loss {np.mean(l):.4f} acc {np.mean(a):.4f}")
        hist['val_loss'].append(np.mean(l))
        hist['val_acc'].append(np.mean(a))
        # log
        log = f'Epoch {e}/{epochs}'
        for k, v in hist.items():
            log += f' {k} {v[-1]:.4f}'
        print(log)
        
    return hist
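The training loop above wraps the backward pass with `GradScaler`, which multiplies the loss by a scale factor before `backward()` and divides the gradients by it again before the optimizer step. The reason is that very small gradient values cannot be represented in float16 and become zero. The sketch below illustrates this with the standard library only, using `struct`'s half-precision format; the gradient value is hypothetical, and 65536 is the default initial scale of `torch.cuda.amp.GradScaler`.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest half-precision value and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


scale = 65536.0   # default initial scale of torch.cuda.amp.GradScaler
tiny_grad = 1e-8  # hypothetical gradient value, chosen so that it underflows

unscaled = to_fp16(tiny_grad)        # too small for float16: rounds to zero
scaled = to_fp16(tiny_grad * scale)  # the scaled value is representable
recovered = scaled / scale           # unscaling happens in higher precision

print(unscaled)   # 0.0
print(recovered)  # close to 1e-08, up to float16 rounding
```

This is only an illustration of the underflow problem; the real scaler also skips optimizer steps with non-finite gradients and adjusts the scale over time.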

In [7]:
model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
hist = train_amp(model, dl, optimizer, epochs=3, device="cuda")

Downloading: "https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/tf_efficientnet_b5_ra-9a3e5369.pth" to /root/.cache/torch/hub/checkpoints/tf_efficientnet_b5_ra-9a3e5369.pth
training... loss 1.7557 acc 0.6718: 100%|██████████| 22/22 [00:14<00:00,  1.49it/s]
evluating... loss 24.6224 acc 0.4783: 100%|██████████| 6/6 [00:02<00:00,  2.52it/s]

Epoch 1/3 loss 1.7557 acc 0.6718 val_loss 24.6224 val_acc 0.4783


training... loss 0.2032 acc 0.9370: 100%|██████████| 22/22 [00:11<00:00,  1.88it/s]
evluating... loss 0.4083 acc 0.8991: 100%|██████████| 6/6 [00:02<00:00,  2.62it/s]

Epoch 2/3 loss 0.2032 acc 0.9370 val_loss 0.4083 val_acc 0.8991


training... loss 0.0604 acc 0.9807: 100%|██████████| 22/22 [00:11<00:00,  1.94it/s]
evluating... loss 0.1867 acc 0.9491: 100%|██████████| 6/6 [00:02<00:00,  2.63it/s]

Epoch 3/3 loss 0.0604 acc 0.9807 val_loss 0.1867 val_acc 0.9491
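The `22/22` and `6/6` counters in the progress bars follow directly from the split sizes and the batch size: with the default `drop_last=False`, a `DataLoader` also yields the final, partial batch. A short sketch of that arithmetic, where `batches_per_epoch` is a hypothetical helper:

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)


print(batches_per_epoch(21600, 1024))  # 22 training batches
print(batches_per_epoch(5400, 1024))   # 6 validation batches
```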





In [8]:
model = Model()
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  model = torch.nn.DataParallel(model)

Let's use 2 GPUs!


In [9]:
model.cuda()

# each GPU receives half of the batch!
output = model(torch.randn(32, 3, 32, 32).cuda(), log=True)

output.size()

torch.Size([16, 3, 32, 32])
torch.Size([16, 3, 32, 32])



torch.Size([32, 10])
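The shapes printed above show how `DataParallel` divides the work: the batch of 32 is split along the first dimension, each of the two GPUs processes 16 samples, and the outputs are gathered back into a single tensor of 32 rows. The sketch below mimics only the size computation, assuming ceil-sized chunks; `split_batch` is a hypothetical helper, not the library's implementation.

```python
import math


def split_batch(batch_size, n_devices):
    """Chunk sizes when a batch is split along dim 0 into ceil-sized chunks."""
    chunk = math.ceil(batch_size / n_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes


print(split_batch(32, 2))    # [16, 16]
print(split_batch(2048, 2))  # [1024, 1024]
```

Under this assumption, a global batch of 2048 on two GPUs gives each device the same 1024 samples per step that a single GPU handled before.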

In [None]:
batch_size = 2048
dl = {
    'train': torch.utils.data.DataLoader(ds['train'], batch_size=batch_size, shuffle=True, num_workers=20, pin_memory=True),
    'val': torch.utils.data.DataLoader(ds['val'], batch_size=batch_size, shuffle=False, num_workers=20, pin_memory=True)
}

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
hist = train_amp(model, dl, optimizer, epochs=3, device="cuda")

training... loss 2.1215 acc 0.6086: 100%|██████████| 11/11 [00:08<00:00,  1.29it/s]
evluating... loss 19.9792 acc 0.1761: 100%|██████████| 3/3 [00:02<00:00,  1.06it/s]

Epoch 1/3 loss 2.1215 acc 0.6086 val_loss 19.9792 val_acc 0.1761


training... loss 0.2685 acc 0.9155: 100%|██████████| 11/11 [00:07<00:00,  1.41it/s]
evluating... loss 7.6862 acc 0.3792: 100%|██████████| 3/3 [00:02<00:00,  1.06it/s]

Epoch 2/3 loss 0.2685 acc 0.9155 val_loss 7.6862 val_acc 0.3792


training... loss 0.0880 acc 0.9704: 100%|██████████| 11/11 [00:07<00:00,  1.40it/s]
  0%|          | 0/3 [00:00<?, ?it/s]
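The elapsed times in the progress bars allow a rough comparison of training throughput between the two runs: the third training epoch took about 11 seconds on one GPU and about 7 seconds on two. The sketch below turns those figures into images per second. The times are rounded to whole seconds by `tqdm`, so the result should be read as an estimate only; `images_per_second` is a hypothetical helper.

```python
def images_per_second(n_samples, seconds):
    """Rough throughput from samples processed and elapsed wall-clock seconds."""
    return n_samples / seconds


single_gpu = images_per_second(21600, 11)  # epoch 3, batch size 1024, one GPU
two_gpus = images_per_second(21600, 7)     # epoch 3, batch size 2048, two GPUs

print(f"single GPU: {single_gpu:.0f} img/s")
print(f"two GPUs:   {two_gpus:.0f} img/s")
print(f"speedup:    {two_gpus / single_gpu:.2f}x")
```

With these numbers the second GPU gives a speedup of roughly 1.6x rather than 2x, which is in line with the overhead of scattering inputs and gathering outputs on every step.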

docker-compose up

docker-compose up -d

docker-compose down

## Scripts in Docker

## Summary

Using an optimized build of `PyTorch`, such as the ones `NVIDIA` provides through its `NGC` service, can give us a performance boost in some cases, since the code has been tuned by experts, whereas a generic installable version may make some compromises to avoid conflicts with particular hardware or operating systems. In this post we have seen how to run our `Python` notebooks and scripts with `Docker` and `docker-compose`.