<a href="https://colab.research.google.com/github/jcmachicao/deep_learning_2025_curso/blob/main/S6__tecnicas_modernas_agilizacion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import wandb

In [2]:
# -------------------------
# 1. Setup and Dummy Data
# -------------------------
wandb.login()  # or wandb.init(anonymous="allow") for demo mode

torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X.sum(dim=1) > 0).long()
train_loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: gdmk to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


In [7]:
X

tensor([[-1.1258, -1.1524, -0.2506,  ..., -1.6959,  0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  ...,  0.1124,  0.6408,  0.4412],
        [-0.1023,  0.7924, -0.2897,  ...,  0.7440,  1.5210,  3.4105],
        ...,
        [ 2.1509, -0.4035, -0.3132,  ...,  1.3783,  0.2739, -0.1737],
        [-1.3889, -2.2144, -0.3373,  ..., -0.5062, -0.6107, -0.2559],
        [ 0.4154, -1.4043,  3.4601,  ...,  0.0167,  1.2206,  1.0346]])

In [8]:
y

tensor([0, 1, 1,  ..., 0, 0, 1])

`nn.BatchNorm1d` is PyTorch's batch normalization layer for 2D or 3D inputs: mini-batches of feature vectors shaped `(N, C)`, as produced by a fully connected layer, or of sequences shaped `(N, C, L)`.

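It normalizes the activations of the previous layer one feature at a time. During training it computes the mean and variance of each feature across the current mini-batch, standardizes the activations with them, and then scales and shifts the result using learnable parameters (gamma and beta, exposed as `weight` and `bias`). In evaluation mode it uses running averages of those statistics collected during training rather than the statistics of the current batch.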
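
As a minimal sketch of the computation described above, the snippet below reproduces the training-mode output of `nn.BatchNorm1d` by hand. The batch size of 64 and the 20 features are arbitrary choices for illustration, and the variable names are not from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 20)          # one mini-batch: 64 samples, 20 features

bn = nn.BatchNorm1d(20)          # gamma (weight) starts at 1, beta (bias) at 0
bn.train()                       # training mode: use this batch's statistics
out_layer = bn(x)

# Same computation by hand, with one mean and variance per feature (column).
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)   # the biased variance is used to normalize
x_hat = (x - mean) / torch.sqrt(var + bn.eps)
out_manual = bn.weight * x_hat + bn.bias

max_diff = (out_layer - out_manual).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two results should agree up to floating-point error, and each output feature should have a mean close to zero across the batch.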

The benefits of using batch normalization include:
- **Improved training stability:** Keeps the scale of each layer's inputs steady even as the weights of earlier layers change during training.
- **Faster convergence:** Often tolerates higher learning rates without diverging.
- **Regularization:** Can sometimes act as a mild regularizer, reducing the need for techniques like dropout.

While both batch normalization and dropout can act as regularizers, their mechanisms differ:

- **Dropout:** Randomly sets a fraction of activations to zero during training, so the network cannot rely on any single neuron. In effect, each training step updates a different, thinner sub-network.

- **Batch Normalization:** Introduces noise by normalizing with mini-batch statistics (mean and variance), which vary from batch to batch. The same data point therefore produces slightly different normalized activations depending on which batch it lands in. This noise can have a regularizing effect, making the network more robust.

So, while both can reduce overfitting, batch normalization does not force "all neurons to matter" the way dropout does by explicitly dropping them. Instead, it makes the network less sensitive to the exact values of individual activations. Because that yields a similar robustness to small changes in activations, it can sometimes reduce the *need* for dropout.
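
The contrast above can be observed directly. The following is an illustrative sketch (the tensors and variable names are made up for the demonstration): it passes the same sample through `nn.BatchNorm1d` inside two different batches, then shows what `nn.Dropout` does to a vector of ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sample = torch.randn(1, 20)                         # the data point we track
batch_a = torch.cat([sample, torch.randn(63, 20)])  # same sample, batch A
batch_b = torch.cat([sample, torch.randn(63, 20)])  # same sample, batch B

bn = nn.BatchNorm1d(20)
bn.train()
out_a = bn(batch_a)[0]            # normalized with batch A statistics
out_b = bn(batch_b)[0]            # normalized with batch B statistics
bn_train_gap = (out_a - out_b).abs().max().item()

bn.eval()                         # eval mode: fixed running statistics
eval_a = bn(batch_a)[0]
eval_b = bn(batch_b)[0]
bn_eval_gap = (eval_a - eval_b).abs().max().item()

drop = nn.Dropout(p=0.5)
drop.train()
dropped = drop(torch.ones(10000))  # survivors are rescaled by 1 / (1 - p)
zero_fraction = (dropped == 0).float().mean().item()

print(f"BatchNorm gap in train mode: {bn_train_gap:.4f}")
print(f"BatchNorm gap in eval mode:  {bn_eval_gap:.4f}")
print(f"Dropout zeroed fraction:     {zero_fraction:.3f}")
```

In training mode the tracked sample's output differs between the two batches; in evaluation mode the difference disappears because the layer no longer depends on the batch. Dropout zeroes roughly half the entries and scales the rest to 2.0.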

In [3]:
# -------------------------
# 2. Model Definitions
# -------------------------
class NormalizedMLP(nn.Module):
    def __init__(self, norm_type='batch'):
        super().__init__()
        if norm_type == 'batch':
            norm_layer = nn.BatchNorm1d(64)
        elif norm_type == 'layer':
            norm_layer = nn.LayerNorm(64)
        elif norm_type == 'group':
            norm_layer = nn.GroupNorm(4, 64)
        else:
            norm_layer = nn.Identity()

        self.net = nn.Sequential(
            nn.Linear(20, 64),
            norm_layer,
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, x):
        return self.net(x)
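
The three normalization layers selected above differ in which values they average over. As a rough sketch, the snippet below applies each one to a stand-in hidden activation (a random `(8, 64)` tensor, chosen only for illustration) and checks where the mean becomes zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h = torch.randn(8, 64) * 3 + 5      # stand-in for the hidden layer's output

batch_out = nn.BatchNorm1d(64).train()(h)  # per feature, across the batch
layer_out = nn.LayerNorm(64)(h)            # per sample, across all 64 features
group_out = nn.GroupNorm(4, 64)(h)         # per sample, within each group of 16

batch_feature_means = batch_out.mean(dim=0)           # shape (64,)
layer_sample_means = layer_out.mean(dim=1)            # shape (8,)
group_means = group_out.view(8, 4, 16).mean(dim=2)    # shape (8, 4)

print(batch_feature_means.abs().max().item())
print(layer_sample_means.abs().max().item())
print(group_means.abs().max().item())
```

All three printed values should be close to zero. Only `BatchNorm1d` depends on the other samples in the batch; `LayerNorm` and `GroupNorm` compute their statistics within each sample, so they behave identically in training and evaluation.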

In [5]:
# -------------------------
# 3. Training Function
# -------------------------
def train_model(norm_type='none', epochs=10):
    wandb.init(project="dl-training-techniques", name=f"{norm_type}_norm", reinit=True)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = NormalizedMLP(norm_type).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    total_steps = len(train_loader) * epochs
    warmup_steps = total_steps // 10
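    # T_max spans all steps, but the scheduler only advances after warmup,
    # so the cosine stops short of its minimum and the final LR is about 2e-5.
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)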
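    # Loss scaling only matters for float16 on GPU, so it is disabled on CPU.
    scaler = torch.amp.GradScaler(device, enabled=(device == 'cuda'))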

    model.train()
    for epoch in range(epochs):
        running_loss = 0.0
        for step, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)

            # Mixed precision context (only active on GPU)
            with torch.autocast(device_type=device, enabled=(device == 'cuda')):
                outputs = model(inputs)
                loss = criterion(outputs, targets)

            optimizer.zero_grad()
            scaler.scale(loss).backward()

            # Gradient clipping: unscale first so max_norm applies to the true gradients
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            scaler.step(optimizer)
            scaler.update()

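            # Learning rate warmup (linear), then cosine decay.
            # The LR set here applies to the *next* optimizer step, so the very
            # first update still runs at the optimizer's initial lr (1e-3).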
            current_step = epoch * len(train_loader) + step
            if current_step < warmup_steps:
                warmup_lr = 1e-3 * (current_step + 1) / warmup_steps
                for g in optimizer.param_groups:
                    g['lr'] = warmup_lr
            else:
                scheduler.step()

            running_loss += loss.item()

            # Log to W&B
            wandb.log({
                "loss": loss.item(),
                "lr": optimizer.param_groups[0]['lr'],
                "epoch": epoch,
            })

        avg_loss = running_loss / len(train_loader)
        print(f"[{norm_type}] Epoch {epoch+1}/{epochs} - Loss: {avg_loss:.4f}")

    wandb.finish()
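
To make the shape of the warmup-plus-cosine schedule easier to inspect, here is a closed-form sketch in plain Python. The helper `lr_at_step` is hypothetical and is a variant of the logic in the cell above, not a copy of it: it decays over the post-warmup steps only, so the rate ends near zero, whereas the cell above sets `T_max=total_steps` and stops at about 2e-5.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr=1e-3):
    """Linear warmup to base_lr, then cosine decay to zero (hypothetical helper)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 32 * 5              # 32 batches per epoch (2000 samples / 64), 5 epochs
warmup_steps = total_steps // 10  # 16 warmup steps, as in the training function
schedule = [lr_at_step(s, total_steps, warmup_steps) for s in range(total_steps)]

print(f"first step: {schedule[0]:.2e}")
print(f"peak:       {max(schedule):.2e}")
print(f"last step:  {schedule[-1]:.2e}")
```

Computing the schedule as a pure function of the step index keeps it easy to plot or test before wiring it into an optimizer.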

In [6]:
# -------------------------
# 4. Run Experiments
# -------------------------
for norm in ['none', 'batch', 'layer', 'group']:
    train_model(norm_type=norm, epochs=5)


[none] Epoch 1/5 - Loss: 0.6292
[none] Epoch 2/5 - Loss: 0.5011
[none] Epoch 3/5 - Loss: 0.3974
[none] Epoch 4/5 - Loss: 0.3391
[none] Epoch 5/5 - Loss: 0.3162


Run history:
epoch  ▁▁▁▁▁▁▁▁▃▃▃▃▃▃▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆████████
loss   ████▇▇▇▆▆▆▆▅▆▆▅▅▅▅▄▄▄▃▂▃▃▃▃▂▃▃▂▂▂▂▃▂▂▂▁▁
lr     ▁▄▅▇███████▇▇▇▆▆▆▆▆▆▅▅▅▅▄▄▃▃▃▃▃▂▂▂▂▁▁▁▁▁

Run summary:
epoch  4.0
loss   0.30785
lr     2e-05


[batch] Epoch 1/5 - Loss: 0.6304
[batch] Epoch 2/5 - Loss: 0.4601
[batch] Epoch 3/5 - Loss: 0.3553
[batch] Epoch 4/5 - Loss: 0.3054
[batch] Epoch 5/5 - Loss: 0.2850


Run history:
epoch  ▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆█████████
loss   ▇██▇▆▅▅▄▅▅▃▃▃▃▃▂▂▃▂▂▂▃▂▃▂▁▂▂▁▂▂▂▂▃▁▂▂▁▁▄
lr     ▂▃▆▇█████████▇▇▇▇▇▇▇▆▆▆▅▅▅▄▄▄▄▃▃▃▂▂▂▁▁▁▁

Run summary:
epoch  4.0
loss   0.44827
lr     2e-05


[layer] Epoch 1/5 - Loss: 0.6854
[layer] Epoch 2/5 - Loss: 0.4554
[layer] Epoch 3/5 - Loss: 0.3268
[layer] Epoch 4/5 - Loss: 0.2716
[layer] Epoch 5/5 - Loss: 0.2506


Run history:
epoch  ▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆████████
loss   ▇████▇▆▅▆▅▅▅▄▄▄▃▃▂▃▂▃▃▂▂▂▂▂▂▃▂▂▁▂▂▂▁▂▂▂▂
lr     ▁▂▃▄▇███████▇▇▇▆▆▆▆▆▆▅▅▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁

Run summary:
epoch  4.0
loss   0.23876
lr     2e-05


[group] Epoch 1/5 - Loss: 0.6614
[group] Epoch 2/5 - Loss: 0.4830
[group] Epoch 3/5 - Loss: 0.3622
[group] Epoch 4/5 - Loss: 0.2969
[group] Epoch 5/5 - Loss: 0.2685


Run history:
epoch  ▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▃▅▅▅▆▆▆▆▆▆▆████████
loss   █▇▇██▆▆▆▆▆▆▆▅▆▅▄▄▅▄▅▃▄▃▃▃▃▂▂▂▂▃▂▂▂▂▂▂▁▁▁
lr     ▂▄▆▆██████████▇▇▇▇▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▁▁▁▁▁

Run summary:
epoch  4.0
loss   0.19215
lr     2e-05
