# A small example to benchmark the performance of torchtuples vs regular PyTorch

In this notebook we illustrate the performance difference between the torchtuples Model object and regular PyTorch when working with small data sets in memory.

The main difference is that the DataLoader provided by PyTorch collects each batch in a for-loop, which is somewhat slow.
For larger networks this matters little, as the DataLoader is no longer the bottleneck.
However, for smaller networks we see a significant difference.

For the regular PyTorch implementation (see fit_torch below), we use a standard TensorDataset and DataLoader for batched iterations. The DataLoader is not targeted at small data sets in memory, so this comparison is somewhat unfair. It's not hard to drop the data loader and write something faster on your own. However, the implementation in fit_torch is quite common to find (e.g. in skorch https://skorch.readthedocs.io/en/latest/index.html).
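As a rough illustration of what "dropping the data loader" could look like, the sketch below slices mini-batches directly out of in-memory tensors, drawing one permutation per epoch. The helper name `iter_batches` is hypothetical (it is not part of PyTorch or torchtuples), and this is not necessarily how torchtuples batches data internally. Its shuffling order will also differ from the DataLoader's.

```python
import torch


def iter_batches(inp, tar, batch_size, shuffle=True):
    """Yield (x, y) mini-batches from two in-memory tensors of equal length."""
    n = inp.shape[0]
    # One permutation per epoch; each batch is then a single indexing operation.
    idx = torch.randperm(n) if shuffle else torch.arange(n)
    for start in range(0, n, batch_size):
        batch_idx = idx[start:start + batch_size]
        yield inp[batch_idx], tar[batch_idx]


# Small demonstration on random data (shapes chosen only for illustration).
x_demo = torch.randn(1000, 20)
y_demo = torch.randint(0, 2, (1000, 1)).float()
batch_shapes = [(x.shape, y.shape) for x, y in iter_batches(x_demo, y_demo, 256)]
```

In a training loop, `for x, y in iter_batches(inp, tar, batch_size):` would take the place of `for x, y in dataloader:`.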

The notebook is run on a 2016 MacBook Pro.

In [1]:
import numpy as np
import torch
import torchtuples
from torchtuples import Model, tuplefy
from torchtuples.practical import MLPVanilla

In [2]:
torch.__version__

'1.1.0'

In [3]:
torchtuples.__version__

'0.0.1'

In [4]:
from sklearn.datasets import make_classification # to create a data set

In [5]:
device = 'cpu' # change to run on gpu

In [6]:
loss_func = torch.nn.BCEWithLogitsLoss()

We make both numpy and torch versions of the data set, as Model also works with numpy arrays.

In [7]:
def make_dataset(n):
    inp, tar = make_classification(n)
    inp = inp.astype('float32')
    tar = tar.reshape(-1, 1).astype('float32')
    inp_tensor = torch.from_numpy(inp)
    tar_tensor = torch.from_numpy(tar)
    return inp, tar, inp_tensor, tar_tensor

We use a vanilla MLP with two hidden layers, each with 64 nodes. Larger networks would produce timing results that are closer to each other.
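To give a sense of how small this network is, here is a plain `torch.nn.Sequential` stand-in with the same layer widths. This is an assumption for illustration only: it is not MLPVanilla itself, which may add further layers (such as batch normalization or dropout), so the parameter count is only approximate for the real network. The input width of 20 is the default number of features produced by `make_classification`.

```python
import torch
from torch import nn

n_features = 20  # default number of features from sklearn's make_classification

# Assumption: a plain stand-in for the network, not MLPVanilla itself.
plain_mlp = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # one logit per sample, as expected by BCEWithLogitsLoss
)

# Total number of trainable weights and biases.
n_params = sum(p.numel() for p in plain_mlp.parameters())
```

With only a few thousand parameters, a forward and backward pass is cheap, which is why the cost of collecting batches becomes visible in the timings.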

### First with use of torchtuples.Model:

In [8]:
def fit_torchtuples(inp, tar, epochs, batch_size, num_workers):
    torch.manual_seed(0)
    net = MLPVanilla(inp.shape[1], [64, 64], 1)
    optimizer = torchtuples.optim.SGD(0.01)
    model = Model(net, loss_func, optimizer, device=device)
    model.fit(inp, tar, batch_size, epochs, verbose=False, num_workers=num_workers)
    return net

### Standard PyTorch implementation:

In [9]:
def fit_torch(inp, tar, epochs, batch_size, num_workers):
    """We use (more or less) the same setup for the regular
    PyTorch implementation as skorch does:
    https://skorch.readthedocs.io/en/latest/user/tutorials.html
    """
    torch.manual_seed(0)
    # Move the network to the same device as the batches (needed when device is not 'cpu').
    net = MLPVanilla(inp.shape[1], [64, 64], 1).to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
    dataset = torch.utils.data.TensorDataset(inp, tar)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size, True, num_workers=num_workers)
    for _ in range(epochs):
        for x, y in dataloader:
            optimizer.zero_grad()
            x, y = x.to(device), y.to(device)
            y_pred = net(x)
            loss = loss_func(y_pred, y)
            loss.backward()
            optimizer.step()
    return net

In [10]:
np.random.seed(123)
inp, tar, inp_tensor, tar_tensor = make_dataset(2000)
epochs = 50
batch_size = 256
num_workers = 0

### Check code

First we just verify that both implementations produce the same weights. 

In [11]:
net_tt = fit_torchtuples(inp, tar, epochs, batch_size, num_workers)

net_t = fit_torch(inp_tensor, tar_tensor, epochs, batch_size, num_workers)

assert all([(w_tt == w_t).all() for w_tt, w_t in zip(net_tt.parameters(), net_t.parameters())]), 'Not equal weights'
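The same check can be wrapped in a small helper. The sketch below is one possible version, with the hypothetical name `same_weights`. It also compares the number of parameter tensors, since `zip` silently stops at the shorter sequence, and it can optionally use `torch.allclose` when exact equality is too strict (for instance across devices).

```python
import torch


def same_weights(net_a, net_b, exact=True):
    """Return True if two networks have matching parameters, pairwise."""
    params_a, params_b = list(net_a.parameters()), list(net_b.parameters())
    if len(params_a) != len(params_b):
        return False
    # torch.equal checks shape and values; torch.allclose tolerates rounding noise.
    compare = torch.equal if exact else torch.allclose
    return all(compare(a, b) for a, b in zip(params_a, params_b))


# Identically seeded layers start from the same weights; a different seed does not.
torch.manual_seed(0)
layer_a = torch.nn.Linear(4, 2)
torch.manual_seed(0)
layer_b = torch.nn.Linear(4, 2)
torch.manual_seed(1)
layer_c = torch.nn.Linear(4, 2)
```

In this notebook the call would be `assert same_weights(net_tt, net_t), 'Not equal weights'`.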

## Timing training progress

We first illustrate that for small data sets, there is no point in spinning up multiple workers for the data loading.
This is because we close down the workers after every epoch, and hence the cost of starting them again is not worth it.

Note that because torch.utils.data.DataLoader uses a for-loop to collect a batch, it takes substantially longer.

In [12]:
num_workers = 0

In [13]:
%%time
_ = fit_torchtuples(inp, tar, epochs, batch_size, num_workers)

CPU times: user 3.42 s, sys: 428 ms, total: 3.85 s
Wall time: 1.99 s


In [14]:
%%time
_ = fit_torch(inp_tensor, tar_tensor, epochs, batch_size, num_workers)

CPU times: user 6.42 s, sys: 855 ms, total: 7.27 s
Wall time: 3.23 s


In [15]:
num_workers = 1

In [16]:
%%time
_ = fit_torchtuples(inp, tar, epochs, batch_size, num_workers)

CPU times: user 6.15 s, sys: 1.81 s, total: 7.96 s
Wall time: 3.53 s


In [17]:
%%time
_ = fit_torch(inp_tensor, tar_tensor, epochs, batch_size, num_workers)

CPU times: user 5.98 s, sys: 1.91 s, total: 7.89 s
Wall time: 4.48 s


In [18]:
num_workers = 2

In [19]:
%%time
_ = fit_torchtuples(inp, tar, epochs, batch_size, num_workers)

CPU times: user 5.81 s, sys: 2.17 s, total: 7.98 s
Wall time: 3.78 s


In [20]:
%%time
_ = fit_torch(inp_tensor, tar_tensor, epochs, batch_size, num_workers)

CPU times: user 5.46 s, sys: 2.26 s, total: 7.72 s
Wall time: 4.61 s
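To make the comparison easier to read, the snippet below computes wall-time ratios from the cell outputs above. The numbers are copied from this notebook's single runs, so the ratios are indicative only and will vary between machines and runs.

```python
# Wall times in seconds, copied from the cell outputs above (2,000 samples, 50 epochs).
wall_times = {
    0: {"torchtuples": 1.99, "torch": 3.23},
    1: {"torchtuples": 3.53, "torch": 4.48},
    2: {"torchtuples": 3.78, "torch": 4.61},
}

# A ratio above 1 means the plain PyTorch loop took longer than torchtuples.
ratios = {w: t["torch"] / t["torchtuples"] for w, t in wall_times.items()}
for workers, ratio in ratios.items():
    print(f"num_workers={workers}: torch / torchtuples = {ratio:.2f}")

# Number of workers with the lowest wall time for each implementation.
fastest = {
    name: min(wall_times, key=lambda w: wall_times[w][name])
    for name in ("torchtuples", "torch")
}
```

For this small data set, both implementations are fastest with `num_workers = 0`.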


### Larger data set

Next, for a larger data set, torch.utils.data.DataLoader benefits from dedicated workers: in the runs below it is faster with one or two workers than with none.
For the torchtuples data loader, however, the start-up cost of even one worker is too high, and it is fastest with no workers.

In [21]:
np.random.seed(123)
inp, tar, inp_tensor, tar_tensor = make_dataset(100000)
epochs = 10
batch_size = 256
num_workers = 0

In [22]:
%%time
_ = fit_torchtuples(inp, tar, epochs, batch_size, num_workers)

CPU times: user 33.6 s, sys: 3.5 s, total: 37.1 s
Wall time: 14.6 s


In [23]:
%%time
_ = fit_torch(inp_tensor, tar_tensor, epochs, batch_size, num_workers)

CPU times: user 1min 3s, sys: 7.85 s, total: 1min 11s
Wall time: 28.3 s


In [24]:
num_workers = 1

In [25]:
%%time
_ = fit_torchtuples(inp, tar, epochs, batch_size, num_workers)

CPU times: user 37.8 s, sys: 6.1 s, total: 43.9 s
Wall time: 20.2 s


In [26]:
%%time
_ = fit_torch(inp_tensor, tar_tensor, epochs, batch_size, num_workers)

CPU times: user 39.3 s, sys: 6.23 s, total: 45.5 s
Wall time: 21.4 s


In [27]:
num_workers = 2

In [28]:
%%time
_ = fit_torchtuples(inp, tar, epochs, batch_size, num_workers)

CPU times: user 37.7 s, sys: 5.73 s, total: 43.4 s
Wall time: 17.3 s


In [29]:
%%time
_ = fit_torch(inp_tensor, tar_tensor, epochs, batch_size, num_workers)

CPU times: user 36.3 s, sys: 5.99 s, total: 42.3 s
Wall time: 19 s


## Just data loaders

Finally, if we look only at the data loaders, creating an iterator and fetching the first batch is noticeably faster with the torchtuples data loader.

In [30]:
num_workers = 0

In [31]:
dl_tt = tuplefy(inp_tensor, tar_tensor).make_dataloader(batch_size, True, num_workers)

In [32]:
ds = torch.utils.data.TensorDataset(inp_tensor, tar_tensor)
dl_t = torch.utils.data.DataLoader(ds, batch_size, True, num_workers=num_workers)

In [33]:
%timeit next(iter(dl_tt))

4.67 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [34]:
%timeit next(iter(dl_t))

6.84 ms ± 78.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
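The `%timeit` cells above time `next(iter(...))`, which covers creating a fresh iterator and fetching a single batch. To time a complete pass over all batches instead, something like the following sketch could be used. The helper name `time_full_pass` is hypothetical, and it assumes the loader can be iterated more than once, as a DataLoader can.

```python
import time


def time_full_pass(loader, repeats=3):
    """Return (best_seconds, n_batches) for iterating over `loader` end to end."""
    best, n_batches = float("inf"), 0
    for _ in range(repeats):
        start = time.perf_counter()
        n_batches = sum(1 for _ in loader)  # consume every batch
        best = min(best, time.perf_counter() - start)
    return best, n_batches


# Works with any re-iterable object; a list of fake batches stands in here.
demo_seconds, demo_batches = time_full_pass([(0, 0)] * 4)
```

In this notebook, `time_full_pass(dl_tt)` and `time_full_pass(dl_t)` would give per-epoch loading times for the two data loaders.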
