# Batch size finder

Another important hyperparameter is the batch size. As long as the batch fits in memory, a larger one is generally preferable, for two reasons:
 * Training is faster, because each iteration processes more data, so fewer iterations are needed per epoch.
 * We will look at this in more depth later, but the larger the batch size, the more stable the training process is, so the search for the minimum tends to be faster.
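
To make the first point concrete, here is a small sketch that counts the iterations per epoch for a few batch sizes. The dataset size of 569 is the one used later in this notebook; the helper name `iterations_per_epoch` is made up for illustration.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    # The last batch may be incomplete, so round up
    return math.ceil(num_samples / batch_size)

num_samples = 569  # size of the cancer dataset used below
for batch_size in (1, 64, 512):
    print(batch_size, iterations_per_epoch(num_samples, batch_size))
```

Going from a batch size of 1 to 64 reduces an epoch from 569 iterations to 9.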

We implement it with the cancer dataset, as in the previous topic, where we searched for the best possible learning rate

In [1]:
from sklearn import datasets
cancer = datasets.load_breast_cancer()

In [2]:
import pandas as pd

cancer_df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
cancer_df['type'] = cancer['target']
cancer_df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,type
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


We create the dataset

In [3]:
import torch

class CancerDataset(torch.utils.data.Dataset):
    def __init__(self, dataframe):
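        # NOTE: the label column is named 'type', not 'target', so this filter removes nothing
        # and the label stays among the features (31 columns, hence the 31 inputs of the model below)
        cols = [col for col in dataframe.columns if col != 'target']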
        self.parameters = torch.from_numpy(dataframe[cols].values).type(torch.float32)
        self.targets = torch.from_numpy(dataframe['type'].values).type(torch.float32)
        self.targets = self.targets.reshape((len(self.targets), 1))

    def __len__(self):
        return len(self.parameters)

    def __getitem__(self, idx):
        parameters = self.parameters[idx]
        target = self.targets[idx]
        return parameters, target

ds = CancerDataset(cancer_df)

In this case the dataset is not split into training and validation sets: it has so few samples that the example needs all of them

In [4]:
train_ds = ds
len(train_ds)

569

The cancer dataset is so small, and the network we have used so far is so small too, that they are not enough to demonstrate a batch size finder. For this topic I therefore define an absurdly large network that takes up a lot of GPU memory, so that we can fill it

In [5]:
from torch import nn

class CancerNeuralNetwork(nn.Module):
    def __init__(self, num_inputs, num_outputs, hidden_layers=(20000, 5000, 2000)):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(num_inputs, hidden_layers[0]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[0], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[1]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[1], hidden_layers[2]),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_layers[2], num_outputs),
        )
        self.activation = torch.nn.Sigmoid()

    def forward(self, x):
        logits = self.network(x)
        probs = self.activation(logits)
        return logits, probs

We check whether a GPU is available

In [6]:
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device


We define a function to see the total, free and used GPU memory

In [7]:
import subprocess as sp

def get_gpu_memory():
    """Return (total, free, used) memory in MiB as reported by nvidia-smi, one list entry per GPU."""
    def query(field):
        command = f"nvidia-smi --query-gpu=memory.{field} --format=csv"
        # Drop the trailing empty line and the CSV header
        lines = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
        # Each remaining line looks like "4096 MiB"; keep only the number
        return [int(x.split()[0]) for x in lines]

    return query("total"), query("free"), query("used")

total, free, used = get_gpu_memory()
print(f"GPU memory: total: {total} MiB, free: {free} MiB, used: {used} MiB")

GPU memory: total: [4096] MiB, free: [3733] MiB, used: [170] MiB
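
The chain of `split` and slice calls in `get_gpu_memory` is easier to follow on a fixed string. The sketch below applies the same parsing steps to a hypothetical `nvidia-smi` output for a machine with two GPUs; the sample text and the name `parse_nvidia_smi_csv` are assumptions for illustration, not part of the notebook.

```python
def parse_nvidia_smi_csv(output):
    """Turn the CSV text printed by nvidia-smi into a list of integers (MiB)."""
    lines = output.split('\n')
    lines = lines[:-1]  # drop the empty string left by the trailing newline
    lines = lines[1:]   # drop the CSV header
    # Each line looks like "4096 MiB"; keep only the number
    return [int(line.split()[0]) for line in lines]

# Hypothetical output for a machine with two GPUs
sample = "memory.total [MiB]\n4096 MiB\n8192 MiB\n"
print(parse_nvidia_smi_csv(sample))
```

This is why the function returns lists such as `[4096]`: there is one entry per GPU.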


We can see that, since neither the model nor the data has been sent to the GPU yet, hardly any GPU memory is in use

We instantiate the network and send it to the GPU

In [8]:
model = CancerNeuralNetwork(31, 1)
model.to(device)
print(f"model to {device}")

model to cuda


In [9]:
total, free, used = get_gpu_memory()
print(f"GPU memory: total: {total} MiB, free: {free} MiB, used: {used} MiB")

GPU memory: total: [4096] MiB, free: [1802] MiB, used: [2101] MiB


We can see that GPU memory usage has gone up considerably
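
As a rough sanity check, the memory taken by the weights alone can be estimated from the layer sizes. The sketch below assumes `float32` parameters (4 bytes each) and the layer widths of the network defined above; the helper name `count_linear_parameters` is made up for illustration. The measured increase is larger than this estimate, most likely because it also includes the CUDA context and the allocator overhead.

```python
def count_linear_parameters(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# 31 inputs, 20000, eight activations of width 5000, 2000, and 1 output
layer_sizes = [31, 20000] + [5000] * 8 + [2000, 1]
num_parameters = count_linear_parameters(layer_sizes)
size_mib = num_parameters * 4 / 2**20  # 4 bytes per float32 parameter

print(f"{num_parameters} parameters, about {size_mib:.0f} MiB")
```

During training, gradients and activations are stored as well, so the footprint grows with the batch size.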

We create the loss function and the optimizer, using the learning rate obtained in the previous topic. It may no longer be the best one, though, because we have not split the dataset into training and validation sets and the network is now larger.

In [10]:
LR = 1e-2

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)

We create the training function

In [11]:
num_prints = 4

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # X and y to device
        X, y = X.to(device), y.to(device)

        # Compute prediction and loss
        logits, probs = model(X)
        loss = loss_fn(logits, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % (int(len(dataloader)/num_prints)+1) == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
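
The print condition in `train_loop` is not obvious at first sight, so here is a sketch that reproduces only that arithmetic, without any training. It assumes the `DataLoader` keeps the last incomplete batch (the default); the function name `printed_progress` is made up for illustration.

```python
import math

def printed_progress(num_samples, batch_size, num_prints=4):
    """Values of `current` that train_loop prints during one epoch."""
    num_batches = math.ceil(num_samples / batch_size)  # len(dataloader)
    step = int(num_batches / num_prints) + 1
    progress = []
    for batch in range(num_batches):
        # Size of this batch: the last one holds whatever is left
        current_batch_size = min(batch_size, num_samples - batch * batch_size)
        if batch % step == 0:
            progress.append(batch * current_batch_size)
    return progress

print(printed_progress(569, 64))
print(printed_progress(569, 512))
```

Note that `current` is computed from the size of the current batch, so for an incomplete last batch it does not equal the number of samples already processed.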

We obtain the candidate batch sizes. It is done this way because the dataset is so small (569 samples) that we cannot use a batch size of 1024, for example

In [12]:
def list_of_posible_batch_sizes(dataset):
    batch_sizes = []
    batch_size = 1
    while batch_size < len(dataset):
        batch_sizes.append(batch_size)
        batch_size *= 2
    batch_sizes.sort(reverse=True)
    return batch_sizes

BSs = list_of_posible_batch_sizes(train_ds)
BSs

[512, 256, 128, 64, 32, 16, 8, 4, 2, 1]

Finally, we create the loop that searches for the best batch size. We want it to be as large as possible without overflowing GPU memory, and to be a power of 2.

We start with the largest candidate batch size; if GPU memory overflows we try the next one, until it no longer overflows. That value is the best batch size.

In [13]:
from torch.utils.data import DataLoader

epochs = 2
for BS_train in BSs:
    print(f"batch size: {BS_train}")
    train_dl = DataLoader(train_ds, batch_size=BS_train, shuffle=True)
    out_of_memory = False
    for t in range(epochs):
        print(f"Epoch {t+1}\n-------------------------------")
        try:
            train_loop(train_dl, model, loss_fn, optimizer)
        except RuntimeError as e:
            # Only CUDA out-of-memory errors mean "batch too large"; re-raise anything else
            if "out of memory" not in str(e):
                raise
            print(f'Error: {e}')
            out_of_memory = True
            break
    if not out_of_memory:
        break
    # Release cached blocks before trying the next, smaller batch size
    torch.cuda.empty_cache()
    print()
print(f"Done! Batch size is {BS_train}")

batch size: 512
Epoch 1
-------------------------------
loss: 0.691913  [    0/  569]
loss: 0.692149  [   57/  569]
Epoch 2
-------------------------------
Error: CUDA out of memory. Tried to allocate 382.00 MiB (GPU 0; 3.81 GiB total capacity; 2.22 GiB already allocated; 112.38 MiB free; 2.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

batch size: 256
Epoch 1
-------------------------------
Error: CUDA out of memory. Tried to allocate 382.00 MiB (GPU 0; 3.81 GiB total capacity; 2.18 GiB already allocated; 112.38 MiB free; 2.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

batch size: 128
Epoch 1
-------------------------------
Error: CUDA out of memory. Tried to allocate 382.00 Mi

In [14]:
BS_train

64
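
The search itself does not depend on PyTorch. The following sketch isolates it, using a stand-in for the training step that raises `MemoryError` above an arbitrary limit; `find_largest_batch_size`, `fake_training_step` and the limit of 100 are all hypothetical names and values chosen for illustration.

```python
def find_largest_batch_size(candidates, try_batch_size):
    """Return the first candidate, in the given order, that runs without a memory error."""
    for batch_size in candidates:
        try:
            try_batch_size(batch_size)
        except MemoryError:
            continue  # too large, try the next one
        return batch_size
    raise RuntimeError("no candidate batch size fits in memory")

def fake_training_step(batch_size, limit=100):
    # Stand-in for a real training step: pretend anything above `limit` overflows the GPU
    if batch_size > limit:
        raise MemoryError(f"batch size {batch_size} does not fit")

candidates = [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]
best = find_largest_batch_size(candidates, fake_training_step)
print(best)
```

Unlike the loop above, this version raises an error when no candidate fits, instead of silently reporting the smallest one.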

We now have the largest batch size possible for our problem, so we move on to training the network

In [15]:
from torch.utils.data import DataLoader

train_dl = DataLoader(train_ds, batch_size=BS_train, shuffle=True)
epochs = 14
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    try:
        train_loop(train_dl, model, loss_fn, optimizer)
    except RuntimeError as e:
        # Re-raise anything that is not a CUDA out-of-memory error
        if "CUDA out of memory" not in str(e):
            raise
        position = str(e).index('CUDA out of memory')
        print(f"\t{str(e)[position:]}")
        break
print("Done!")

Epoch 1
-------------------------------
loss: 0.661637  [    0/  569]
loss: 0.650951  [  192/  569]
loss: 0.652181  [  384/  569]
Epoch 2
-------------------------------
loss: 0.617266  [    0/  569]
loss: 0.641129  [  192/  569]
loss: 0.645686  [  384/  569]
Epoch 3
-------------------------------
loss: 0.632995  [    0/  569]
loss: 0.632342  [  192/  569]
loss: 0.601855  [  384/  569]
Epoch 4
-------------------------------
loss: 0.651026  [    0/  569]
loss: 0.629167  [  192/  569]
loss: 0.570165  [  384/  569]
Epoch 5
-------------------------------
loss: 0.637134  [    0/  569]
loss: 0.672079  [  192/  569]
loss: 0.600126  [  384/  569]
Epoch 6
-------------------------------
loss: 0.599684  [    0/  569]
loss: 0.603658  [  192/  569]
loss: 0.650290  [  384/  569]
Epoch 7
-------------------------------
loss: 0.571652  [    0/  569]
loss: 0.573309  [  192/  569]
loss: 0.601443  [  384/  569]
Epoch 8
-------------------------------
loss: 0.672219  [    0/  569]
loss: 0.527147  [  1

If we now look at the state of the GPU memory

In [16]:
total, free, used = get_gpu_memory()
print(f"GPU memory: total: {total} MiB, free: {free} MiB, used: {used} MiB")

GPU memory: total: [4096] MiB, free: [110] MiB, used: [3793] MiB


We can see that the GPU memory is almost full