First, using standard PyTorch; then using Lightning:

1. Using dummy (random) data, load, say, 8GB into pinned system RAM.
2. Push into GPU memory at start of training loop.
3. Train batch

Time each step.  See how to make steps async.


## Results

Copying 4 GBytes from CPU memory to GPU memory:

```
  Unpinned: 3.6 s
    Pinned: 0.3 s
```
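
As a rough sanity check, these timings can be converted into an effective transfer bandwidth. The sketch below assumes the tensor is exactly 1000 x 1000 x 1000 float32 values and uses decimal gigabytes; the variable names are illustrative only.

```python
# Effective host-to-device bandwidth implied by the timings above.
n_bytes = 1000 * 1000 * 1000 * 4  # float32 = 4 bytes per element

timings_s = {"unpinned": 3.6, "pinned": 0.3}
bandwidth_gb_s = {name: n_bytes / t / 1e9 for name, t in timings_s.items()}

for name, bw in bandwidth_gb_s.items():
    print(f"{name:>8}: {bw:.1f} GB/s")

speedup = timings_s["unpinned"] / timings_s["pinned"]
print(f"pinned is roughly {speedup:.0f}x faster")
```

This works out to about 1.1 GB/s unpinned versus about 13.3 GB/s pinned, a factor of roughly 12.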

## Pure PyTorch

In [1]:
import torch

### Unpinned

In [93]:
%%time
tensor_cpu_unpinned = torch.randn(1000, 1000, 1000)  # 4 GBytes

CPU times: user 9.17 s, sys: 1.63 s, total: 10.8 s
Wall time: 10.8 s


In [5]:
%%time
tensor_cuda_from_unpinned = tensor_cpu_unpinned.cuda()

CPU times: user 2.51 s, sys: 1.5 s, total: 4.01 s
Wall time: 39.6 s


### Pinned

In [94]:
%%time
tensor_pinned = tensor_cpu_unpinned.pin_memory()

CPU times: user 833 ms, sys: 1.6 s, total: 2.43 s
Wall time: 1.79 s


In [95]:
tensor_pinned[0, 0, 0]

tensor(-0.1914)

In [96]:
tensor_cpu_unpinned[0, 0, 0]

tensor(-0.1914)

In [97]:
tensor_cpu_unpinned[0, 0, 0] = 1

In [98]:
tensor_pinned[0, 0, 0]

tensor(-0.1914)

In [100]:
pinned_numpy = tensor_pinned.numpy()

In [101]:
pinned_numpy[0, 0, 0]

-0.19137391

In [102]:
pinned_numpy[0, 0, 0] = 1

In [104]:
tensor_pinned[0, 0, 0]

tensor(1.)
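
The cells above suggest two different behaviours: `pin_memory()` on an unpinned tensor returns a copy (the write to `tensor_cpu_unpinned` did not show up in `tensor_pinned`), whereas `.numpy()` returns a view that shares memory with the tensor. A minimal sketch of the same check on a small tensor; the pinned part only runs when CUDA is available, because pinning needs a CUDA-capable setup.

```python
import torch

src = torch.zeros(4)

# .numpy() returns a view: writes through the array are visible in the tensor.
arr = src.numpy()
arr[0] = 1.0
shares_with_numpy = src[0].item() == 1.0

# pin_memory() on an unpinned tensor allocates page-locked memory and copies
# the data into it, so later writes to `src` do not reach `pinned`.
pinned_is_copy = None  # stays None when CUDA is unavailable
if torch.cuda.is_available():
    pinned = src.pin_memory()
    src[1] = 2.0
    pinned_is_copy = (
        pinned[1].item() == 0.0 and pinned.data_ptr() != src.data_ptr()
    )

print(shares_with_numpy, pinned_is_copy)
```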

In [4]:
%%time
tensor_cuda_from_pinned = tensor_pinned.cuda(non_blocking=True)

CPU times: user 2.57 ms, sys: 4.93 ms, total: 7.49 ms
Wall time: 6.24 ms
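
With `non_blocking=True` the call returns as soon as the copy has been queued, so `%%time` here most likely measures launch overhead rather than the transfer itself. One way to time the whole transfer is to call `torch.cuda.synchronize()` before stopping the clock. The helper name `time_h2d_copy` and the 16 MiB test tensor are assumptions made for this sketch; it returns `None` on machines without CUDA.

```python
import time
import torch

def time_h2d_copy(tensor, non_blocking=True):
    """Return (launch_seconds, total_seconds) for copying `tensor` to the GPU."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()  # drain any pending GPU work first
    start = time.perf_counter()
    on_gpu = tensor.cuda(non_blocking=non_blocking)
    launched = time.perf_counter()  # the call has returned
    torch.cuda.synchronize()        # the copy has actually finished
    finished = time.perf_counter()
    del on_gpu
    return launched - start, finished - start

small = torch.randn(64, 256, 256)  # about 16 MiB, a stand-in for the big tensor
if torch.cuda.is_available():
    small = small.pin_memory()

result = time_h2d_copy(small)
print(result)
```

If the second number is much larger than the first, the copy was running in the background after the call returned.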


In [120]:
# Taking a contiguous chunk keeps the pinned memory.
tensor_pinned[:10].is_pinned()

True

In [121]:
# Taking a random chunk stops the pinning.
tensor_pinned[[1, 5, 7, 10]].is_pinned()

False
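
A likely explanation for the two results above: basic slicing returns a view onto the same (pinned) storage, while indexing with a list of indices allocates a new tensor in ordinary memory and copies the selected rows into it. The view-versus-copy difference can be checked without a GPU; this sketch uses a tiny tensor for illustration.

```python
import torch

chunk = torch.arange(12.0).reshape(6, 2)

contiguous = chunk[:3]     # basic slicing: a view on chunk's storage
fancy = chunk[[1, 3, 5]]   # advanced indexing: a freshly allocated copy

slice_is_view = contiguous.data_ptr() == chunk.data_ptr()
fancy_is_copy = fancy.data_ptr() != chunk.data_ptr()

# Writes to the original are visible through the view but not through the copy.
chunk[0, 0] = 100.0
chunk[1, 0] = -1.0
view_sees_write = contiguous[0, 0].item() == 100.0
copy_unchanged = fancy[0, 0].item() == 2.0

print(slice_is_view, fancy_is_copy, view_sees_write, copy_unchanged)
```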

Next up: put this into a simple DataLoader, and then into a Lightning training loop (where the forward and train functions do nothing, so the only delay is loading memory).

## DataSet and DataLoader

Basic plan:

* Dataset loads a "big chunk" of data into (unpinned) CPU memory.
  - When we do this for real, this could be done in a separate process.
* Dataset then yields batches (sampled from the "big chunk" in memory).
* All DataLoader does is turn the numpy arrays into pinned tensors.  It doesn't do any sampling.  It gets a pre-made batch from Dataset.
* The training loop in Lightning then asynchronously loads that pinned data into GPU memory while the GPU is training.  The batch needs to be pinned to enable async copies into GPU RAM.
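
The last step of the plan (copying the next batch while the GPU works on the current one) could look roughly like the sketch below in plain PyTorch. This is not how Lightning does it internally; `prefetch_to_device` is a hypothetical helper that issues each copy on a separate CUDA stream, and it falls back to a plain synchronous `.to()` when running on CPU.

```python
import torch

def prefetch_to_device(batches, device):
    """Yield batches on `device`, starting the next copy before yielding the current one."""
    device = torch.device(device)
    use_cuda = device.type == "cuda"
    copy_stream = torch.cuda.Stream(device) if use_cuda else None

    def start_copy(batch):
        if not use_cuda:
            return batch.to(device)
        with torch.cuda.stream(copy_stream):
            # Only truly asynchronous if `batch` lives in pinned memory.
            return batch.to(device, non_blocking=True)

    iterator = iter(batches)
    try:
        in_flight = start_copy(next(iterator))
    except StopIteration:
        return
    while in_flight is not None:
        if use_cuda:
            compute_stream = torch.cuda.current_stream(device)
            # Make the compute stream wait for the copy of `in_flight`.
            compute_stream.wait_stream(copy_stream)
            # Tell the allocator this tensor is also used on the compute stream.
            in_flight.record_stream(compute_stream)
        ready = in_flight
        upcoming = next(iterator, None)
        in_flight = start_copy(upcoming) if upcoming is not None else None
        yield ready

device = "cuda" if torch.cuda.is_available() else "cpu"
batches = [torch.full((2, 3), float(i)) for i in range(4)]
if device == "cuda":
    batches = [b.pin_memory() for b in batches]

received = [b.cpu() for b in prefetch_to_device(batches, device)]
print([int(b[0, 0]) for b in received])
```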

Things that won't work:

* I don't think we can use multiple processes to load data into the GPU or into pinned memory.  How would we share that memory between processes?
* Pinning the "big chunk" probably isn't useful, because the sampled data isn't pinned **if we take random samples**.  But taking a contiguous chunk of pinned memory retains the pinning.
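
One possible workaround for the random-sample problem, not tried in this notebook: gather the random rows directly into a pre-allocated pinned buffer using the `out=` argument of `torch.index_select`, so the batch lands in pinned memory even though the indices are random. The shapes and names below are illustrative, and pinning is skipped when CUDA is unavailable.

```python
import torch

use_pinned = torch.cuda.is_available()  # pinning needs a CUDA-capable setup

chunk = torch.randn(100, 8, 8)  # small stand-in for the "big chunk"
if use_pinned:
    chunk = chunk.pin_memory()

batch_size = 16
# Allocate the batch buffer once; it is reused for every batch.
batch_buf = torch.empty(batch_size, 8, 8, pin_memory=use_pinned)

idx = torch.randint(high=chunk.shape[0], size=(batch_size,))
torch.index_select(chunk, 0, idx, out=batch_buf)  # gather rows into the buffer

print(batch_buf.is_pinned() if use_pinned else "pinning skipped (no CUDA)")
```

A caveat with this approach: if the buffer is copied to the GPU with `non_blocking=True`, it must not be overwritten with the next batch until that copy has finished, so in practice two or more buffers would be needed.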

In [1]:
import torch
import numpy as np
import gc

In [105]:
getattr?

[0;31mDocstring:[0m
getattr(object, name[, default]) -> value

Get a named attribute from an object; getattr(x, 'y') is equivalent to x.y.
When a default argument is given, it is returned when the attribute doesn't
exist; without it, an exception is raised in that case.
[0;31mType:[0m      builtin_function_or_method


In [88]:
#del model, trainer, dataloader, dataset
gc.collect()
torch.cuda.empty_cache()

In [89]:
class MyIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, batch_size=64, n_loads_from_disk=1, n_samples_from_mem=128):
        self.rng = np.random.default_rng()
        self.batch_size = batch_size
        self.n_loads_from_disk = n_loads_from_disk
        self.n_samples_from_mem = n_samples_from_mem
        
    def __iter__(self):
        print('start iter')
        # Fake loading lots of data from disk -
        # in reality this would be done in a separate process, which
        # outputs unpinned data in CPU RAM.
        data = self.rng.random(size=(1000, 256, 256), dtype=np.float32)
        
        # Pre-allocating pinned memory speeds things up a little.
        #pinned = torch.empty(1000, 256, 256, dtype=torch.float32, pin_memory=True)
        # Current variant: pre-allocate the destination buffer on the GPU instead.
        cuda_mem = torch.empty(1000, 256, 256, dtype=torch.float32, device='cuda')
        n = len(data)
        max_start_i = n - self.batch_size
        for _ in range(self.n_loads_from_disk):
            #data = torch.tensor(data, device='cuda')
            #pinned = torch.from_numpy(data).pin_memory()
            #pinned[:, :, :] = data
            #cuda_mem[:, :, :] = torch.from_numpy(data).pin_memory()
            cuda_mem.copy_(torch.from_numpy(data).pin_memory())
            #cuda_mem.copy_(data)
            for _ in range(self.n_samples_from_mem):
                idx = torch.randint(high=n, size=(self.batch_size, ))
                yield cuda_mem[idx]
                #yield self.rng.choice(data, self.batch_size)
                #start_i = self.rng.integers(0, max_start_i)
                #end_i = start_i + self.batch_size
                #yield pinned[start_i:end_i]

In [90]:
dataset = MyIterableDataset()

In [91]:
%%time
dataloader = torch.utils.data.DataLoader(
    dataset=dataset,
    collate_fn=lambda x: x[0],
    #collate_fn=lambda x: torch.tensor(x[0], device='cuda'),
    #collate_fn=lambda x: torch.tensor(x[0]),
    #pin_memory=True,
    #num_workers=8
)

CPU times: user 103 µs, sys: 11 µs, total: 114 µs
Wall time: 117 µs


In [92]:
%%time
i = 0
for d in dataloader:
    print(d.shape)
    print(d.is_pinned())
    print(d.device)
    print(d.dtype)
    if i > 3:
        break
    #d.cuda()
    i += 1

start iter
torch.Size([64, 256, 256])
False
cuda:0
torch.float32
torch.Size([64, 256, 256])
False
cuda:0
torch.float32
torch.Size([64, 256, 256])
False
cuda:0
torch.float32
torch.Size([64, 256, 256])
False
cuda:0
torch.float32
torch.Size([64, 256, 256])
False
cuda:0
torch.float32
CPU times: user 417 ms, sys: 36.6 ms, total: 454 ms
Wall time: 344 ms


* Copying NumPy array to CUDA tensor within DataLoader: 8.48s (but this won't allow us to load asynchronously into the GPU)
* Pinning batch by batch (pinning done by DataLoader): 8.63s
* No pinning: 8.71s
* Pinning the "big chunk", and then taking contiguous slices: 9.64s

## Lightning

In [6]:
import pytorch_lightning as pl
from torch import nn
import torch.nn.functional as F

In [7]:
torch.__version__

'1.7.0'

In [8]:
torch.cuda.is_available()

True

In [9]:
pl.__version__

'1.0.6'

In [14]:
class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(256 * 256, 64),
            nn.ReLU(),
            nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 256 * 256)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

In [81]:
model = LitAutoEncoder()

In [82]:
trainer = pl.Trainer(gpus=1, max_epochs=30)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


In [83]:
%%time
trainer.fit(model, dataloader)


  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 4 M   
1 | decoder | Sequential | 4 M   


Epoch 29: : 128it [00:01, 89.13it/s, loss=0.083, v_num=29] 
CPU times: user 56.6 s, sys: 6.7 s, total: 1min 3s
Wall time: 42.4 s


1

* load 'big chunk' into GPU in pre-allocated GPU memory: 42.2s - PROBABLY THE RIGHT SOLUTION.  IT'S FAST.  AND IT PROVIDES LOTS OF FLEXIBILITY IN TERMS OF SAMPLING EACH BATCH.  THE CODE IS A TINY BIT MORE COMPLEX, BUT NOT BY MUCH!
* 'big chunk' in pinned memory at start, then copied into pre-allocated GPU memory: 42.4s
* load 'big chunk' into GPU in one go: 43.5s
* 'big chunk' in pinned CPU memory, where we only allocate pinned memory once, and copy into the pinned memory: 45.8s
* 'big chunk' in pinned CPU memory, using `from_numpy(data).pin_memory()`: 46.0s; 45.7s  CLEAN CODE, BUT VERY LIMITING IN TERMS OF CONSTRUCTING EACH BATCH
* 'big chunk' in pinned CPU memory, where we only allocate pinned memory once, and copy into the pinned memory using `from_numpy`: 46.2s
* 'big chunk' in pinned CPU memory, then individual contiguous chunks are taken: 47.2s or 47.3s
  - is a little slower, but has the advantage that we have to do less GPU memory management, and can run on GPUs with small amounts of RAM, and it's more like standard PyTorch code
* 'big chunk' in unpinned CPU memory, random chunks are put into CUDA memory in the DataLoader: 1min 2s
* 'big chunk' in unpinned CPU memory, and then individual random chunks are pinned in DataLoader: 1min 5s; 1min 2s
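
The fastest variant above (copy the "big chunk" into pre-allocated GPU memory, then sample each batch on the GPU) can be sketched on its own as follows. Creating the random indices directly on the chunk's device is an addition in this sketch rather than something timed above; the sizes are shrunk and the code falls back to CPU when CUDA is unavailable.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocate the destination once, then refill it for every "load from disk".
gpu_chunk = torch.empty(1000, 32, 32, dtype=torch.float32, device=device)
host_data = torch.rand(1000, 32, 32)  # stand-in for data loaded from disk
gpu_chunk.copy_(host_data)

def sample_batch(chunk, batch_size):
    """Draw a random batch from `chunk` without leaving its device."""
    idx = torch.randint(
        high=chunk.shape[0], size=(batch_size,), device=chunk.device
    )
    return chunk[idx]

batch = sample_batch(gpu_chunk, 64)
print(batch.shape, batch.device)
```

Because sampling happens after the transfer, any sampling scheme (random, strided, weighted) can be used without affecting the host-to-device copy.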

* num_workers = 8:  wall time = 1min 9s
* num_workers = 1:  wall time = 26.1s
* load 'big chunk' into GPU in one go: 27.6s