First, using standard PyTorch; then using Lightning:

1. Using dummy (random) data, load, say, 8GB into pinned system RAM.
2. Push into GPU memory at start of training loop.
3. Train batch

Time each step, then work out how to make the steps asynchronous.


## Results

4 GBytes copy from CPU memory to GPU memory:

```
  Unpinned: 3.6 s
    Pinned: 0.3 s
```
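
As a rough sanity check, these timings can be converted into effective transfer rates. The sketch below assumes the 4 GBytes are a float32 tensor of shape (1000, 1000, 1000), i.e. 4 bytes per element; the helper name `bandwidth_gb_per_s` is only for illustration.

```python
# Effective host-to-GPU bandwidth implied by the timings above.
n_bytes = 1000 * 1000 * 1000 * 4  # float32 tensor of shape (1000, 1000, 1000)

def bandwidth_gb_per_s(n_bytes, seconds):
    """Return the transfer rate in GB/s (1 GB = 1e9 bytes)."""
    return n_bytes / seconds / 1e9

unpinned = bandwidth_gb_per_s(n_bytes, 3.6)
pinned = bandwidth_gb_per_s(n_bytes, 0.3)
print(f"Unpinned: {unpinned:.1f} GB/s")
print(f"  Pinned: {pinned:.1f} GB/s")
print(f"Speed-up: {pinned / unpinned:.0f}x")
```

That works out to roughly 1.1 GB/s unpinned versus 13 GB/s pinned, a speed-up of about 12x.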

## Pure PyTorch

In [1]:
import torch

### Unpinned

In [4]:
%%time
tensor_cpu_unpinned = torch.randn(1000, 1000, 1000)  # 4 GBytes

CPU times: user 9.34 s, sys: 1.67 s, total: 11 s
Wall time: 11.3 s


In [5]:
%%time
tensor_cuda_from_unpinned = tensor_cpu_unpinned.cuda()

CPU times: user 2.51 s, sys: 1.5 s, total: 4.01 s
Wall time: 39.6 s


### Pinned

In [3]:
%%time
tensor_pinned = tensor_cpu_unpinned.pin_memory()

CPU times: user 2.22 s, sys: 2.89 s, total: 5.1 s
Wall time: 4.58 s


In [4]:
%%time
tensor_cuda_from_pinned = tensor_pinned.cuda(non_blocking=True)

CPU times: user 2.57 ms, sys: 4.93 ms, total: 7.49 ms
Wall time: 6.24 ms
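
Note that with `non_blocking=True` the call returns as soon as the copy has been queued, so a wall time of a few milliseconds measures the launch rather than the transfer itself. A minimal sketch of timing the complete copy follows. The helper `timed_copy_to_gpu` is a hypothetical name, not part of PyTorch, and the code falls back to the CPU when no GPU is present:

```python
import time
import torch

def timed_copy_to_gpu(tensor_cpu):
    """Copy a CPU tensor to the GPU and return (tensor_on_device, seconds).

    torch.cuda.synchronize() blocks until all queued GPU work has finished,
    so the measured time includes the transfer, not just the launch.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure no earlier work is still running
    start = time.perf_counter()
    tensor_dev = tensor_cpu.to(device, non_blocking=True)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the asynchronous copy to complete
    return tensor_dev, time.perf_counter() - start

small = torch.randn(256, 256)  # small stand-in for the 4 GByte tensor
if torch.cuda.is_available():
    small = small.pin_memory()  # a pinned source is what allows an asynchronous copy
small_dev, seconds = timed_copy_to_gpu(small)
print(f"copy took {seconds * 1e3:.3f} ms on {small_dev.device}")
```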


In [120]:
# Taking a contiguous chunk keeps the pinned memory.
tensor_pinned[:10].is_pinned()

True

In [121]:
# Taking a random chunk (indexing with a list) copies the data, so the result is not pinned.
tensor_pinned[[1, 5, 7, 10]].is_pinned()

False

Next up: put this into a simple DataLoader, and then into a Lightning training loop (where the forward and training functions do nothing, so the only delay is loading memory).

## DataSet and DataLoader

Basic plan:

* Dataset loads a "big chunk" of data into (unpinned) CPU memory.
  - When we do this for real, this could be done in a separate process.
* Dataset then yields batches (sampled from the "big chunk" in memory).
* All DataLoader does is turn the numpy arrays into pinned tensors.  It doesn't do any sampling.  It gets a pre-made batch from Dataset.
* The training loop in Lightning then asynchronously loads that pinned data into GPU memory while the GPU is training.  The batch needs to be pinned to enable async copies into GPU RAM.
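
The overlap described in the last step can be sketched in pure PyTorch with a side CUDA stream. This is only an illustration of the idea, not a description of what Lightning does internally; `prefetch_to_device` and `dummy_batches` are hypothetical names, and on a machine without a GPU the generator simply passes batches through.

```python
import torch

def prefetch_to_device(loader, device):
    """Yield batches from `loader` on `device`, staying one batch ahead of the consumer."""
    use_cuda = device.type == "cuda"
    copy_stream = torch.cuda.Stream() if use_cuda else None
    batches = iter(loader)

    def start_next_copy():
        batch = next(batches, None)
        if batch is not None and use_cuda:
            # Queue the host-to-device copy on a side stream so it can overlap
            # with compute on the default stream. Needs a pinned source tensor.
            with torch.cuda.stream(copy_stream):
                batch = batch.to(device, non_blocking=True)
        return batch

    upcoming = start_next_copy()
    while upcoming is not None:
        if use_cuda:
            # Make the compute stream wait until the queued copy has finished.
            torch.cuda.current_stream().wait_stream(copy_stream)
            upcoming.record_stream(torch.cuda.current_stream())
        current = upcoming
        upcoming = start_next_copy()  # this copy overlaps with the caller's training step
        yield current

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dummy_batches = [torch.full((4, 8), float(i)) for i in range(3)]  # stand-in for pinned batches
received = [batch for batch in prefetch_to_device(dummy_batches, device)]
print(len(received), received[0].device)
```

The copy of batch *n + 1* is queued before batch *n* is handed to the training step, which is what lets the transfer and the compute run concurrently when the source tensor is pinned.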

Things that won't work:

* I don't think we can use multiple processes to load data into the GPU or into pinned memory.  How would we share that memory between processes?
* Pinning the "big chunk" probably isn't useful, because the sampled data isn't pinned **if we take random samples**.  But taking a contiguous chunk of pinned memory retains the pinning.
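
One possible workaround for the second point, untested here for speed, is to keep the "big chunk" unpinned and gather each random sample into a small pinned buffer that is allocated once. The sketch below uses `torch.index_select` with its `out=` argument; the sizes are small stand-ins, and pinning is skipped when CUDA is unavailable.

```python
import torch

use_pinning = torch.cuda.is_available()  # pinning needs a CUDA-enabled PyTorch build

big_chunk = torch.randn(100, 16, 16)  # small stand-in for the "big chunk"
batch_size = 8

# Allocate the (optionally pinned) batch buffer once and reuse it for every batch.
batch_buffer = torch.empty(
    (batch_size, 16, 16), dtype=big_chunk.dtype, pin_memory=use_pinning
)

idx = torch.randint(high=len(big_chunk), size=(batch_size,))
# Gather the randomly chosen rows directly into the preallocated buffer.
torch.index_select(big_chunk, 0, idx, out=batch_buffer)

buffer_is_pinned = use_pinning and batch_buffer.is_pinned()
print(batch_buffer.shape, buffer_is_pinned)
```

Because the buffer is reused, any asynchronous copy out of it must have finished before the next batch is gathered into it. Whether this beats pinning batch by batch would need measuring.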

In [2]:
import numpy as np

In [3]:
class MyIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, batch_size=64, n_loads_from_disk=2, n_samples_from_mem=4):
        self.rng = np.random.default_rng()
        self.batch_size = batch_size
        self.n_loads_from_disk = n_loads_from_disk
        self.n_samples_from_mem = n_samples_from_mem
        
    def __iter__(self):
        for _ in range(self.n_loads_from_disk):
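            # Stand-in for loading a "big chunk" from disk: 1000 random 256x256 float32 images.
            data = self.rng.random(size=(1000, 256, 256), dtype=np.float32)
            #n = len(data)
            # Alternative: pin the whole chunk up front.
            # data = torch.tensor(data).pin_memory()
            for i in range(self.n_samples_from_mem):
                #idx = torch.randint(high=n, size=(self.batch_size, ))
                #yield data[idx]
                # Sample batch_size images (with replacement) along the first axis.
                yield self.rng.choice(data, self.batch_size)
                # Alternative: a contiguous slice, which keeps the pinning of a pinned chunk.
                #yield data[i*self.batch_size:(i+1)*self.batch_size]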

In [4]:
dataset = MyIterableDataset()

In [5]:
%%time
dataloader = torch.utils.data.DataLoader(
    dataset=dataset,
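    # The dataset already yields whole batches, and DataLoader's default batch_size is 1,
    # so collate_fn receives a one-element list: unwrap it and convert it to a tensor.
    #collate_fn=lambda x: x[0],
    #collate_fn=lambda x: torch.tensor(x[0], device='cuda'),
    collate_fn=lambda x: torch.tensor(x[0]),
    pin_memory=True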
)

CPU times: user 65 µs, sys: 11 µs, total: 76 µs
Wall time: 78.4 µs
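
An alternative to unwrapping `x[0]` in a custom `collate_fn` is to pass `batch_size=None`, which disables automatic batching; the default collate function then just converts each numpy array to a tensor. A minimal sketch, where `PreBatchedDataset` is a hypothetical, smaller stand-in for `MyIterableDataset`:

```python
import numpy as np
import torch

class PreBatchedDataset(torch.utils.data.IterableDataset):
    """Hypothetical stand-in for MyIterableDataset: yields ready-made numpy batches."""

    def __init__(self, n_batches=3, batch_size=4):
        self.rng = np.random.default_rng(0)
        self.n_batches = n_batches
        self.batch_size = batch_size

    def __iter__(self):
        for _ in range(self.n_batches):
            yield self.rng.random(size=(self.batch_size, 8, 8), dtype=np.float32)

loader = torch.utils.data.DataLoader(
    PreBatchedDataset(),
    batch_size=None,  # disable automatic batching: each yielded item is already a batch
    pin_memory=torch.cuda.is_available(),  # pinning needs a CUDA-enabled build
)

batches = list(loader)
shapes = [tuple(batch.shape) for batch in batches]
print(shapes)
```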


In [48]:
%%time
for d in dataloader:
    print(d.shape)
    print(d.is_pinned())
    #d.cuda()

torch.Size([64, 256, 256])
True
torch.Size([64, 256, 256])
True
torch.Size([64, 256, 256])
True
torch.Size([64, 256, 256])
True
torch.Size([64, 256, 256])
True
torch.Size([64, 256, 256])
True
torch.Size([64, 256, 256])
True
torch.Size([64, 256, 256])
True
CPU times: user 1.14 s, sys: 103 ms, total: 1.24 s
Wall time: 531 ms


* Copying the numpy array to a CUDA tensor within the DataLoader: 8.48 s (but this won't allow us to load into the GPU asynchronously)
* Pinning batch by batch (pinning done by DataLoader): 8.63 s
* No pinning: 8.71 s
* Pinning the "big chunk", and then taking contiguous slices: 9.64 s

## Lightning

In [6]:
import pytorch_lightning as pl
from torch import nn
import torch.nn.functional as F

In [7]:
torch.__version__

'1.7.0'

In [8]:
torch.cuda.is_available()

True

In [9]:
pl.__version__

'1.0.5'

In [10]:
class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(256 * 256, 64),
            nn.ReLU(),
            nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 256 * 256)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

In [11]:
model = LitAutoEncoder()

In [12]:
trainer = pl.Trainer(gpus=1, max_epochs=30)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


In [13]:
%%time
trainer.fit(model, dataloader)


  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 4 M   
1 | decoder | Sequential | 4 M   


Epoch 0: : 0it [00:00, ?it/s]



Epoch 29: : 8it [00:00,  8.33it/s, loss=0.083, v_num=5]
CPU times: user 1min 4s, sys: 11.3 s, total: 1min 15s
Wall time: 31.6 s


1