# Week 5: Fast pipelines, profiling

### Seminar outline
- Automatic mixed precision and tensor cores
- Efficient batching
- Efficient pipelines with NVIDIA DALI
- HuggingFace streaming dataset
- Image decoders benchmarks
- General purpose Python profiling with `py-spy`
- Deep Learning profiling with `torch.utils.bottleneck` (run as `python -m torch.utils.bottleneck`)
- Profiling with `nvprof`

## Automatic mixed precision

In [1]:
!nvidia-smi

Fri Feb 11 14:55:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   50C    P0    70W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   40C    P0    45W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   

In [2]:
# HuggingFace datasets
# !pip install datasets

# NVIDIA DALI
# Make sure to choose the package built for the CUDA version shown in the nvidia-smi output
# !pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda102

In [3]:
import gc
from time import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from tqdm.auto import trange, tqdm

In [4]:
transform = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ]
)

mnist_train = torchvision.datasets.MNIST(
    "./mnist/", 
    train=True, 
    download=True, 
    transform=transform
) 
mnist_val = torchvision.datasets.MNIST(
    "./mnist/",
    train=False, 
    download=True,
    transform=transform
)


train_dataloader = torch.utils.data.DataLoader(mnist_train, batch_size=1024, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(mnist_val, batch_size=1024, shuffle=False)

In [5]:
def train(model, loss_fn, optimizer, n_epochs=3, device="cuda:0", precision="full"):
    if precision == "half":
        model.half()
    model.to(device)
    
    for epoch in range(n_epochs):
        model.train()
        for x_train, y_train in tqdm(train_dataloader, desc=f"Epoch {epoch}: "):
            if precision == "half":
                x_train = x_train.half()
            x_train, y_train = x_train.to(device), y_train.to(device)
            y_pred = model(x_train)
            loss = loss_fn(y_pred.float(), y_train)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        if epoch % 2 == 0 or epoch == n_epochs - 1:
            print("Starting validation...")
            model.eval()
            val_loss = torch.empty(len(val_dataloader))
            val_accuracy = torch.empty(len(val_dataloader))
            
            with torch.no_grad():
                for i, (x_val, y_val) in enumerate(val_dataloader):
                    if precision == "half":
                        x_val = x_val.half()
                    x_val, y_val = x_val.to(device), y_val.to(device)
                    y_pred = model(x_val)
                    loss = loss_fn(y_pred.float(), y_val)
                    val_loss[i] = loss
                    val_accuracy[i] = (torch.argmax(y_pred, dim=-1) == y_val).float().mean()

            print(
                f"Epoch: {epoch}, loss: {val_loss.mean().detach().cpu()}, "
                f"accuracy: {val_accuracy.mean().detach().cpu()}"
            )
    model.eval()

In [6]:
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=20, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4),
    nn.Conv2d(in_channels=20, out_channels=10, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(2*2*10, 128),
    nn.ReLU(),
    nn.Linear(128, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss = nn.CrossEntropyLoss()

train(model.to("cuda:0"), loss, optimizer)

Epoch 0:   0%|          | 0/59 [00:00<?, ?it/s]

Starting validation...
Epoch: 0, loss: 0.10469746589660645, accuracy: 0.9667630195617676


Epoch 1:   0%|          | 0/59 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/59 [00:00<?, ?it/s]

Starting validation...
Epoch: 2, loss: 0.06188463047146797, accuracy: 0.9779775738716125


In [8]:
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=20, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4),
    nn.Conv2d(in_channels=20, out_channels=10, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(2*2*10, 128),
    nn.ReLU(),
    nn.Linear(128, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss = nn.CrossEntropyLoss()

train(model.to("cuda:0"), loss, optimizer, precision="half")

Epoch 0:   0%|          | 0/59 [00:00<?, ?it/s]

Starting validation...
Epoch: 0, loss: nan, accuracy: 0.09818439185619354


Epoch 1:   0%|          | 0/59 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/59 [00:00<?, ?it/s]

Starting validation...
Epoch: 2, loss: nan, accuracy: 0.09818439185619354
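
The NaN loss and chance-level accuracy above are what naive pure-float16 training often looks like. One plausible contributor (an assumption, not verified against this exact run) is the narrow float16 range: large values overflow to `inf`, and tiny constants such as Adam's default `eps=1e-8` round to zero. The sketch below only checks these two facts on CPU; it does not reproduce the training run.

```python
import math

import torch

# Largest finite float16 value.
fp16_max = torch.finfo(torch.float16).max

# A value above the float16 maximum overflows to infinity when cast.
overflowed = torch.tensor(70000.0).half().item()

# Adam's default eps is far below the smallest positive float16 value,
# so it is rounded to exactly zero when stored in float16.
eps_in_fp16 = torch.tensor(1e-8).half().item()

print(f"float16 max: {fp16_max}")
print(f"70000.0 as float16: {overflowed}")
print(f"1e-8 as float16: {eps_in_fp16}")
```

Automatic mixed precision avoids this by keeping the weights and the optimizer state in float32 and casting only selected operations to float16.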


In [9]:
# Timing utilities
start_time = None
def start_timer():
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    # reset_max_memory_allocated() is deprecated in favor of reset_peak_memory_stats()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start_time = time()

def end_timer_and_print(local_msg):
    torch.cuda.synchronize()
    end_time = time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))

In [10]:
def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers))

In [11]:
batch_size = 512 # Try, for example, 128, 256, 513
in_size = 4096 + 2048
out_size = 4096 + 2048
num_layers = 3
num_batches = 50
epochs = 3

# Creates data in default precision.
# The same data is used for both default and mixed precision trials below.
# You don't need to manually change inputs' dtype when enabling mixed precision.
data = [torch.randn(batch_size, in_size, device="cuda:0") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda:0") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().to("cuda:0")

In [12]:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
net.to("cuda:0")

start_timer()
for epoch in trange(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Default precision:")



  0%|          | 0/3 [00:00<?, ?it/s]


Default precision:
Total execution time = 4.182 sec
Max memory used by tensors = 2379551744 bytes


In [13]:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
net.half()
net.to("cuda:0")

start_timer()
for epoch in trange(epochs):
    for input, target in zip(data, targets):
        output = net(input.half())
        loss = loss_fn(output, target.half())
        loss.backward()
        opt.step()
        opt.zero_grad()
end_timer_and_print("Half precision:")

  0%|          | 0/3 [00:00<?, ?it/s]


Half precision:
Total execution time = 0.977 sec
Max memory used by tensors = 2429897728 bytes


In [14]:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
net.to("cuda:0")

start_timer()
for epoch in trange(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=True):
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16
            
            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32
            
        loss.backward()
        opt.step()
        opt.zero_grad()
end_timer_and_print("Mixed precision without scaling:")

  0%|          | 0/3 [00:00<?, ?it/s]


Mixed precision without scaling:
Total execution time = 1.493 sec
Max memory used by tensors = 2499077120 bytes


In [15]:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=True)
net.to("cuda:0")

start_timer()
for epoch in trange(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=True):
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16
            
            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32
            
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
end_timer_and_print("Mixed precision with autocast and scaling:")

  0%|          | 0/3 [00:00<?, ?it/s]


Mixed precision with autocast and scaling:
Total execution time = 1.693 sec
Max memory used by tensors = 2895539200 bytes
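
`GradScaler` multiplies the loss by a large factor before `backward()`, so that small gradient values do not flush to zero in float16, and divides the gradients by the same factor before the optimizer step. A minimal CPU sketch of that arithmetic follows. The gradient value `1e-8` is a made-up illustration, and `65536.0` is the documented default `init_scale` of `GradScaler`.

```python
import torch

true_grad = 1e-8   # hypothetical tiny gradient value
scale = 65536.0    # 2**16, the default init_scale of GradScaler

# Without scaling, the value is lost when stored in float16.
unscaled_fp16 = torch.tensor(true_grad, dtype=torch.float32).half()

# With scaling, the stored float16 value is non-zero...
scaled_fp16 = torch.tensor(true_grad * scale, dtype=torch.float32).half()

# ...and dividing by the scale in float32 recovers the gradient approximately.
recovered = scaled_fp16.float() / scale

print(f"unscaled in float16:    {unscaled_fp16.item()}")
print(f"scaled in float16:      {scaled_fp16.item()}")
print(f"recovered in float32:   {recovered.item()}")
```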


### Gradients modification

In [16]:
for epoch in range(0):  # range(0): the body never runs, the cell only illustrates the pattern
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(opt)

        # Since the gradients of optimizer's assigned params are now unscaled, clips as usual
        # You may use the same value for max_norm here as you would without gradient scaling
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)

        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
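
Why unscale before clipping: the gradients left by `scaler.scale(loss).backward()` are multiplied by the scale factor, so their norm is multiplied too, and a `max_norm` chosen for true gradients would clip far too aggressively. The CPU sketch below imitates both orders by hand. The helper `clipped_grad_norm` and the scale `1024.0` are hypothetical, used only for illustration.

```python
import torch


def clipped_grad_norm(grad, scale, unscale_first, max_norm=0.1):
    """Return the final (unscaled) gradient norm for either order of operations."""
    param = torch.nn.Parameter(torch.zeros_like(grad))
    # Gradients as left by backward() on a scaled loss.
    param.grad = grad.clone() * scale
    if unscale_first:
        # Conceptually what scaler.unscale_(opt) does.
        param.grad /= scale
    torch.nn.utils.clip_grad_norm_([param], max_norm=max_norm)
    if not unscale_first:
        # Unscaling only after clipping.
        param.grad /= scale
    return param.grad.norm().item()


grad = torch.tensor([3.0, 4.0])  # true gradient with norm 5
right = clipped_grad_norm(grad, scale=1024.0, unscale_first=True)
wrong = clipped_grad_norm(grad, scale=1024.0, unscale_first=False)

print(f"unscale, then clip: {right:.6f}")
print(f"clip, then unscale: {wrong:.6f}")
```

With unscaling first, the final norm lands at about `max_norm`; in the wrong order it ends up smaller by roughly the scale factor.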

### Tensor cores

In [17]:
data_512 = [torch.randn(512, in_size, device="cuda:0") for _ in range(num_batches)]
targets_512 = [torch.randn(512, out_size, device="cuda:0") for _ in range(num_batches)]

data_513 = [torch.randn(513, in_size, device="cuda:0") for _ in range(num_batches)]
targets_513 = [torch.randn(513, out_size, device="cuda:0") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().to("cuda:0")

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
net.to("cuda:0")

start_timer()
for epoch in trange(epochs):
    for input, target in zip(data_512, targets_512):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()
end_timer_and_print("Default precision, batch_size 512:")

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
net.to("cuda:0")

start_timer()
for epoch in trange(epochs):
    for input, target in zip(data_513, targets_513):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()
end_timer_and_print("Default precision, batch_size 513:")

  0%|          | 0/3 [00:00<?, ?it/s]


Default precision, batch_size 512:
Total execution time = 4.177 sec
Max memory used by tensors = 5188050944 bytes


  0%|          | 0/3 [00:00<?, ?it/s]


Default precision, batch_size 513:
Total execution time = 4.576 sec
Max memory used by tensors = 5188149248 bytes
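
The tensor cores on V100 operate on float16 inputs, and NVIDIA's performance guidance recommends keeping matrix dimensions (batch size, layer widths) at multiples of 8 for float16. The cell above runs in default precision, so it may be worth repeating the 512 vs 513 comparison under `autocast`. A small helper for rounding a dimension up, where the name `round_up` and the default multiple are assumptions for illustration:

```python
def round_up(n: int, multiple: int = 8) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


for size in (512, 513, 6144):
    print(size, "->", round_up(size))
```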


## Batching

The standard batching approach is simply to stack the tensors acquired with `__getitem__`. This breaks down when samples differ in shape, for example:

- Sequences of varying length
- Labels of different dimensions

In [18]:
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

brain: pad everything to a fixed max_length

big brain: pad only in the collate_fn

ultra duper big brain: presort data so that each batch holds sequences of roughly the same length
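
A minimal pure-Python sketch of the third option, assuming sequence lengths are known in advance. Indices are shuffled, split into large buckets, sorted by length inside each bucket, and then cut into batches. The names `bucketed_batches` and `padding_tokens` are hypothetical, and the result could be passed to a `DataLoader` as `batch_sampler`.

```python
import random


def bucketed_batches(lengths, batch_size, bucket_size, seed=0):
    """Group indices into batches of similar length while keeping some randomness."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    batches = []
    for start in range(0, len(indices), bucket_size):
        # Sort only inside a bucket, so the order differs between seeds.
        bucket = sorted(indices[start:start + bucket_size], key=lambda i: lengths[i])
        for b in range(0, len(bucket), batch_size):
            batches.append(bucket[b:b + batch_size])
    rng.shuffle(batches)  # avoid feeding batches in order of length
    return batches


def padding_tokens(batches, lengths):
    """Count pad positions needed when each batch is padded to its longest item."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


lengths = [1, 10, 1, 10]
naive = [[0, 1], [2, 3]]
smart = bucketed_batches(lengths, batch_size=2, bucket_size=4)
print("naive padding:", padding_tokens(naive, lengths))
print("bucketed padding:", padding_tokens(smart, lengths))
```

A larger `bucket_size` gives tighter batches but less randomness between epochs.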

In [19]:
# TODO: substitute with DataSet with __getitem__
lines = [
    "One thing I don't know why",
    "It doesn't even matter how hard you try",
    "Keep that in mind, I designed this rhyme",
    "To explain in due time",
    "All I know",
    "Time is a valuable thing",
    "Watch it fly by as the pendulum swings",
    "Watch it count down to the end of the day",
    "The clock ticks life away",
    "It's so unreal",
    "Didn't look out below",
    "Watch the time go right out the window",
    "Tryin' to hold on, did-didn't even know",
    "I wasted it all just to watch you go",
    "I kept everything inside and even though I tried",
    "It all fell apart",
    "What it meant to me will eventually",
    "Be a memory of a time when I tried so hard",
    "I tried so hard and got so far",
    "But in the end it doesn't even matter",
    "I had to fall to lose it all",
    "But in the end it doesn't even matter"
]
labels = torch.randint(2, (len(lines), ))
dataset = list(zip(lines, labels))
tokenizer = get_tokenizer("basic_english")


def yield_tokens(data_iter):
    for text, label in data_iter:
        yield tokenizer(text)

        
vocab = build_vocab_from_iterator(yield_tokens(iter(dataset)), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))


def collate_batch(batch: list[tuple[str, torch.Tensor]]) -> tuple[torch.Tensor, torch.Tensor]:
    text_list, label_list = [], []
    for _text, _label in batch:
        label_list.append(_label)
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
  
    # Note: index 0 is also "<unk>" in this vocab; a dedicated "<pad>" special token would be cleaner
    text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True, padding_value=0)
    label_list = torch.tensor(label_list, dtype=torch.int64)
  
    return text_list, label_list


dataloader = DataLoader(
    dataset, 
    batch_size=2, 
    collate_fn=collate_batch,
    shuffle=True
)

for x, _ in dataloader:
    print(f"Current batch:\n{x}\n")

Current batch:
tensor([[ 2, 58, 44, 54, 21,  7, 75,  2, 19],
        [67, 26,  2, 40,  3,  6, 17, 84,  0]])

Current batch:
tensor([[ 1,  8, 48, 28,  0,  0,  0,  0,  0,  0],
        [22,  9,  4, 15,  1, 14,  3,  6,  7, 18]])

Current batch:
tensor([[22,  9,  4, 15,  1, 14,  3,  6,  7, 18],
        [57, 73,  9, 65, 20,  2, 37, 74, 69,  0]])

Current batch:
tensor([[31, 13, 64, 24, 13, 11, 83,  2, 19, 10, 16],
        [39,  3,  6, 60, 25, 32,  0,  0,  0,  0,  0]])

Current batch:
tensor([[ 4, 34, 76, 59, 30],
        [ 5, 45,  9, 42, 11]])

Current batch:
tensor([[ 1, 14,  3,  6,  7, 18, 53, 16, 27, 77],
        [ 2, 19, 10, 16, 21, 50, 10, 47,  0,  0]])

Current batch:
tensor([[11, 55, 13, 80, 26,  0,  0,  0],
        [12,  4, 11, 23, 70, 25,  4, 86]])

Current batch:
tensor([[ 2, 51,  5, 46,  5, 61,  1,  8],
        [82,  1, 63,  5, 62, 85, 43,  0]])

Current batch:
tensor([[78,  3,  5, 52, 66, 20, 38,  3,  6,  7, 17],
        [ 1,  3, 71, 10, 79,  0,  0,  0,  0,  0,  0]])

Current bat

Also check out `transformers.DataCollatorWithPadding` at https://huggingface.co/docs/transformers/main_classes/data_collator
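
Padding alone does not tell the model which positions are real tokens. A minimal sketch of a collate function that also returns a boolean mask follows. The name `collate_with_mask` and the `pad_id=0` default are assumptions for illustration, and the function takes already-tokenized id lists rather than raw text.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_with_mask(sequences, pad_id=0):
    """Pad a list of token-id lists and return (padded, mask); mask is True on real tokens."""
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(
        [torch.tensor(seq, dtype=torch.int64) for seq in sequences],
        batch_first=True,
        padding_value=pad_id,
    )
    # Position j of row i is a real token if j < lengths[i].
    positions = torch.arange(padded.size(1)).unsqueeze(0)
    mask = positions < lengths.unsqueeze(1)
    return padded, mask


padded, mask = collate_with_mask([[5, 6, 7], [8]])
print(padded)
print(mask)
```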

## Data preprocessing with DALI
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html

Augmentations: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/image_processing/augmentation_gallery.html

TODO: select simple example 

For images check out source for ResNet50 on ImageNet: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/use_cases/pytorch/resnet50/pytorch-resnet50.ht

## Albumentations
TBD

## Streaming datasets

What if we do not want to wait until the dataset is downloaded, for example when it holds terabytes of data?

What if we do not have enough disk space?

https://huggingface.co/docs/datasets/dataset_streaming.html

In [20]:
from datasets import load_dataset

In [21]:
dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
print(next(iter(dataset)))

Downloading:   0%|          | 0.00/5.58k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/359k [00:00<?, ?B/s]

{'id': 0, 'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help.\nEstablished in honor of John & Lindy’s son, Christopher Blanchard, this particular program is very dear to the Blanchard family. Dana Blanchard, or Mama Dana as she is more commonly referred to at Mtendere, lived on site during the initial development, and she returns each summer to spend the season with her Malawian family. The heart of the program is to be His hands and feet by caring for the children at Mtendere, and meeting their spiritual, physical, academic, and emotional needs.\nMtendere Village is home to 134 children, living in 16 homes with a housemother and several brothers and sisters. This family environment is one that many of the children have never pre

In [22]:
# What about the things we are used to, such as shuffling?
# A streamed dataset is shuffled approximately, through a fixed-size buffer.
shuffled_dataset = dataset.shuffle(buffer_size=10_000, seed=42)

In [23]:
next(iter(shuffled_dataset))

{'id': 892,
 'text': 'In this role, she oversees the day-to-day operations of the agency’s motoring services divisions (Vehicle Titles & Registration, Motor Vehicles, Motor Carrier, Enforcement, Consumer Relations and the Automobile Burglary & Theft Prevention Authority) to ensure they are constantly improving and identifying opportunities to become more efficient and effective in service delivery.\nMellott came to the TxDMV from Alaska’s Division of Motor Vehicles where she most recently served as deputy executive director and acting executive director where she led a major initiative to modernize and improve the customer service experience. Previous positions at the Alaska DMV include oversight of all large field offices and leading the driver licensing program.\nMellott serves on the American Association of Motor Vehicle Administrators Unconventional Vehicle Working Group and has worked collaboratively with representatives from across the country to develop best practices for states
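
To see what `buffer_size` means, here is a pure-Python sketch of buffer-based shuffling over a stream: fill a buffer, then repeatedly yield a random element from it and replace that element with the next incoming item. This is an illustration of the idea only, not the `datasets` implementation; `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=0):
    """Approximately shuffle an iterable using O(buffer_size) memory."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and put the new item in its slot
    rng.shuffle(buffer)          # drain what is left at the end of the stream
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10))
print(shuffled[:10])
```

The first emitted element always comes from the first `buffer_size` items of the stream, which is why a small buffer gives only a local shuffle.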

In [24]:
print(dataset.n_shards)

670


In [25]:
shuffled_dataset.set_epoch(epoch) # seed -> seed + epoch