# Deep learning in practice

What does doing deep learning in practice actually involve?

## Look at your data

For images, we can do things like:
- Sample a random subset of data.
- Look at the distribution of labels.
- Look at the smallest/largest file size.
- Look at common/rare labels. Look for outliers.
- Try to solve the task manually and see how you could do it.

For other media, the steps are pretty similar.
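
A few of the checks above can be sketched in plain Python. This is only a minimal illustration: the `labels` list is made up, and in a real project it would be collected from the dataset itself.

```python
from collections import Counter
import random

# Hypothetical labels for illustration; in practice, collect them from your dataset.
labels = [0, 0, 0, 1, 1, 2, 0, 1, 0, 3]

# Distribution of labels, most common first.
label_counts = Counter(labels)
most_common = label_counts.most_common()
print("Most common label (label, count):", most_common[0])
print("Rarest label (label, count):", most_common[-1])

# Random subset of sample indices to inspect by hand.
rng = random.Random(0)
sample_indices = rng.sample(range(len(labels)), k=3)
print("Indices to inspect:", sample_indices)
```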

## Feeding the data into the network

### Input normalization
For images, you need to resize the inputs to a consistent size as well as normalize the data. If the data isn't normalized, it's hard to make consistent changes in the network.

The solution is mean subtraction. Given $x_i$, apply the transformation $x^*_i = x_i - \mu_x$. For an image, we take the $(3 \times 64 \times 64)$ input and subtract $(\mu_\text{red}, \mu_\text{green}, \mu_\text{blue})$, the average red, green, and blue color values across all the images in the dataset.

We can also get very uneven training if the magnitudes of the inputs differ widely. For example, let's say that $|x_1| \ll |x_2|$. In this case, the gradient caused by $x_2$ will be much larger than the one caused by $x_1$, meaning that the model will attend mostly to $x_2$.

The solution to this then is to divide by the standard deviation as well. For images, this would be the SD of the red, green, and blue color values in the dataset respectively.

Therefore, our normalization becomes:
$$x^*_i =\frac{x_i - \mu_x}{\sigma_x}$$
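
A minimal sketch of this per-channel normalization, assuming a random tensor stands in for the dataset (the `images` tensor and its shape are made up for illustration):

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-in for a dataset: 16 RGB images of size 64x64 with values in [0, 1].
images = torch.rand(16, 3, 64, 64)

# Per-channel statistics over the batch and spatial dimensions.
mean = images.mean(dim=(0, 2, 3), keepdim=True)  # shape (1, 3, 1, 1)
std = images.std(dim=(0, 2, 3), keepdim=True)    # shape (1, 3, 1, 1)

normalized = (images - mean) / std

print(normalized.mean(dim=(0, 2, 3)))  # close to 0 for every channel
print(normalized.std(dim=(0, 2, 3)))   # close to 1 for every channel
```

In a torchvision pipeline, the same operation is usually expressed with `torchvision.transforms.Normalize(mean, std)`, using statistics computed once over the training set.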

## Model design

We can pick whichever model we want based on things like complexity and speed. Transformers are pretty popular nowadays, though, and are probably what should be used in practice.

### How do we initialize the network?
PyTorch does automatic initialization for us, but it's good to understand what is actually happening in that initialization.

#### Can we initialize with zeros?
We can't initialize every weight with zeros, since that leads to gradients of zero throughout the network, meaning that no backpropagation signal is ever sent. With zero weights, every hidden activation is zero, and any signal that does arrive at a layer is multiplied by that layer's weights in the backpropagation step. If the weights are 0, then a signal of 0 is sent back.



In [2]:
import torch
import torch.nn.functional as F

In [15]:
input_tensor = torch.tensor([1.0, 2.0, 3.0])
weights_layer1 = torch.zeros(3, 2, requires_grad=True)
weights_layer2 = torch.zeros(2, 1, requires_grad=True)
target_tensor = torch.tensor([0.5])

Let's simulate a forward pass and backpropagation

In [16]:
def simple_forward_pass(input_tensor):
    hidden_layer = torch.matmul(input_tensor, weights_layer1)
    hidden_layer_activated = F.relu(hidden_layer)
    output_layer = torch.matmul(hidden_layer_activated, weights_layer2)
    return output_layer

In [17]:
# Perform the forward pass
output = simple_forward_pass(input_tensor)

# Compute loss (MSE)
loss = F.mse_loss(output, target_tensor)

# Backpropagation
loss.backward()

In [18]:
# Print gradients to observe the effect of zero initialization
print("Gradients for Layer 1 Weights:\n", weights_layer1.grad)
print("Gradients for Layer 2 Weights:\n", weights_layer2.grad)

# Attempt to update weights (hypothetical learning rate of 0.01)
with torch.no_grad():
    weights_layer1 -= 0.01 * weights_layer1.grad
    weights_layer2 -= 0.01 * weights_layer2.grad

    # Zero the gradients after updating
    weights_layer1.grad.zero_()
    weights_layer2.grad.zero_()

# Print updated weights to observe the lack of effective update
print("Updated Weights for Layer 1:\n", weights_layer1)
print("Updated Weights for Layer 2:\n", weights_layer2)

Gradients for Layer 1 Weights:
 tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])
Gradients for Layer 2 Weights:
 tensor([[0.],
        [0.]])
Updated Weights for Layer 1:
 tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], requires_grad=True)
Updated Weights for Layer 2:
 tensor([[0.],
        [0.]], requires_grad=True)


When we backpropagate, we see that every gradient is zero. At the output we do get a nonzero signal: the prediction is 0, so the MSE loss is $0.5^2$ and its derivative with respect to the output is nonzero. The problem is that the gradient for the last layer's weights multiplies this signal by the hidden activations, which are all 0, and the signal passed back to the hidden layer is multiplied by the last layer's weights, which are also 0. Every earlier layer therefore receives a gradient of 0.

#### Can we initialize with a constant value?
We don't have a way to break symmetries. Let's review this with a motivating example.

In [3]:
# Define the input tensor
input_tensor = torch.tensor([1.0, 2.0, 3.0])

# 3 input neurons, 2 output neurons, initialized to a constant value of 0.5
weights_layer1 = torch.full((3, 2), 0.5)

# 2 input neurons, 1 output neuron, initialized to a constant value of 0.5
weights_layer2 = torch.full((2, 1), 0.5)

If we do this multiplication, we see that the output of the first layer will be completely symmetrical. This effectively means that the neurons in the first layer are all learning the same thing. Each column in the weight matrix, which is what multiplies against the input vector, is the exact same. The operation that's being done is:

$$
\left[
\begin{matrix}
1&2&3
\end{matrix}
\right]
\left[
\begin{matrix}
0.5&0.5\\
0.5&0.5\\
0.5&0.5
\end{matrix}
\right]
=
\left[
\begin{matrix}
3&3
\end{matrix}
\right]
$$

Each of the 2 neurons (here, each neuron's weights being a 3x1 column of the weight matrix) computes the exact same output and, during backpropagation, receives the exact same gradient. This is no good, since at that point we might as well only have one neuron.

In [9]:
# matrix multiply the input tensor with the weights of the first layer
layer1_output = torch.matmul(input_tensor, weights_layer1)
print(f"Input tensor: {input_tensor}\t Shape: {input_tensor.shape}")
print(f"Weights of the first layer: {weights_layer1}\t Shape: {weights_layer1.shape}")
print(f"Output of the first layer: {layer1_output}\t Shape: {layer1_output.shape}")

Input tensor: tensor([1., 2., 3.])	 Shape: torch.Size([3])
Weights of the first layer: tensor([[0.5000, 0.5000],
        [0.5000, 0.5000],
        [0.5000, 0.5000]])	 Shape: torch.Size([3, 2])
Output of the first layer: tensor([3., 3.])	 Shape: torch.Size([2])
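
The forward pass above shows identical outputs. The sketch below adds a backward pass to show that the gradients are identical too, so a gradient step cannot break the symmetry. The target value and the MSE loss are reused from the zero-initialization example and are otherwise arbitrary choices.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([0.5])

# Same constant initialization as above, but tracking gradients.
w1 = torch.full((3, 2), 0.5, requires_grad=True)
w2 = torch.full((2, 1), 0.5, requires_grad=True)

hidden = torch.relu(x @ w1)   # both hidden units output 3.0
output = hidden @ w2          # 0.5 * 3 + 0.5 * 3 = 3.0
loss = torch.nn.functional.mse_loss(output, target)
loss.backward()

# Each column of w1 holds the incoming weights of one hidden unit.
print(w1.grad)
print("Columns receive identical gradients:", torch.equal(w1.grad[:, 0], w1.grad[:, 1]))
```

Because both columns start equal and receive equal updates, they stay equal after every step.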


#### What should we initialize instead?

The best initialization that we know is some sort of a random initialization:
$$W \sim \mathcal{N}(\mu,\,\sigma^{2}I)$$

The popular ways to do this are:
- Xavier initialization
- Kaiming initialization

##### Kaiming initialization
Kaiming initialization is a way to initialize the weights of the network layers such that the variance of the outputs from a layer is about equal to the variance of its inputs. The idea here is that this will help maintain a stable gradient flow.

We sample some data and then pass them through a randomly initialized network and then tune the weights such that the standard deviation of the activations across the weights is about equal.

This is done normally to keep the magnitude of the activations in the network constant, but there is a variant that can be done where we keep the magnitude of the *gradients* constant instead. It's good to do one *or* the other.

Kaiming initialization is the default initialization in PyTorch.
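
As a rough check of the "constant magnitude" idea, the sketch below pushes random inputs through a stack of ReLU layers initialized with `torch.nn.init.kaiming_normal_` and compares the mean squared activation before and after. The width, depth, and batch size are arbitrary choices for illustration, and the closed-form scale that `kaiming_normal_` applies is used here rather than any data-driven tuning.

```python
import torch

torch.manual_seed(0)
width, depth = 512, 10
x = torch.randn(1024, width)
input_scale = x.pow(2).mean().item()

h = x
for _ in range(depth):
    # PyTorch weight layout is (out_features, in_features).
    w = torch.empty(width, width)
    torch.nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
    h = torch.relu(h @ w.T)

output_scale = h.pow(2).mean().item()
print(f"mean squared activation: input {input_scale:.3f}, after {depth} layers {output_scale:.3f}")
```

The two numbers should stay within the same order of magnitude. With a much smaller or larger weight scale, the activations would shrink or grow geometrically with depth.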

##### Xavier initialization
Xavier initialization is a "compromise" version of the Kaiming initialization. Lots of math involved.
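
For reference, and as a summary rather than a derivation, these are the weight variances usually quoted for the two schemes, with $n_\text{in}$ and $n_\text{out}$ denoting the number of inputs and outputs of the layer:

$$\text{Kaiming (ReLU, fan-in): } \operatorname{Var}(W) = \frac{2}{n_\text{in}} \qquad \text{Xavier: } \operatorname{Var}(W) = \frac{2}{n_\text{in} + n_\text{out}}$$

One way to read the "compromise": keeping activation magnitudes constant in the forward pass calls for a variance proportional to $1/n_\text{in}$, keeping gradient magnitudes constant in the backward pass calls for $1/n_\text{out}$, and Xavier averages the two fan sizes. The factor of 2 in the Kaiming formula accounts for ReLU zeroing out roughly half of the activations.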

##### Final layer initialization
It is OK to initialize the final layer to zeros, but only if you use SGD. If you use any other optimizer like Adam, using zeros is harmful.


In [20]:
# PyTorch stores weights as (out_features, in_features), so this tensor matches
# a layer with 50 input features and 100 output features (fan_in = 50)
kaiming_tensor = torch.empty(100, 50)
print(kaiming_tensor[0])
torch.nn.init.kaiming_uniform_(kaiming_tensor, mode='fan_in', nonlinearity='relu')
print(kaiming_tensor[0])


tensor([0.0000e+00, 0.0000e+00, 1.5274e-43, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        1.5274e-43, 0.0000e+00, 1.4013e-45, 1.4013e-45, 9.2754e-39, 0.0000e+00,
        2.4325e+38, 1.4013e-45, 2.4397e+38, 1.4013e-45, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.4013e-45, 0.0000e+00,
        1.4013e-45, 1.4013e-45, 9.2754e-39, 0.0000e+00, 2.4325e+38, 1.4013e-45,
        2.4397e+38, 1.4013e-45, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 1.4013e-45, 0.0000e+00, 1.4013e-45, 1.4013e-45,
        9.2754e-39, 0.0000e+00, 2.4325e+38, 1.4013e-45, 2.4397e+38, 1.4013e-45,
        0.0000e+00, 0.0000e+00])
tensor([-0.2868,  0.3174,  0.1450,  0.1464, -0.1350, -0.1187, -0.2127,  0.2188,
         0.0790, -0.0263, -0.2273, -0.1576,  0.1647, -0.2634, -0.3181, -0.1570,
        -0.1710, -0.3414,  0.2110, -0.1524,  0.1486, -0.3294,  0.0502, -0.2693,
        -0.1220, -0.3193, -0.2725,  0.1503,  0.1354, -0.1946, -0.1188,  0.1695,
       

## Principles of model training

### Revisiting SGD
SGD is a good default optimizer. However, there are pros and cons to using it:
- Pros:
    - SGD is simple to implement and understand.
    - SGD is efficient on large datasets since it takes small batches.
    - Great for online learning.
    - Converges to the global minimum for convex problems (given a suitable learning rate), and its noisy updates can help it escape local minima.
- Cons:
    - Requires manual tuning of the learning rate.
    - The updates are pretty stochastic, which can lead to variance in the optimization path.
    - The convergence is slower than other more sophisticated optimizers, especially during later stages of training.

These flaws are especially evident later in training: we might have wanted a large step size or learning rate to learn the initial signal, but this then leads to spikes in loss during the later stages, as the model can "overshoot" when it's getting close to a desired minimum.

### Alternatives to SGD

#### RMSProp
RMSProp attempts to normalize the gradients by their magnitude, which makes the updates more consistent throughout the network. It keeps a running average of the squared gradient for each weight across batches and divides each gradient update by the square root of that average (a momentum term can then be applied on top).

Frequently updated weights will have their effective learning rate decreased, since they accumulate more large gradients in the average (and are therefore scaled down more heavily during normalization), while infrequently updated weights will have a higher effective learning rate. This approach lets us use a relatively large global learning rate and still make stable progress, since each weight's step is rescaled anyway. It also helps mitigate vanishing and exploding gradients, since the gradients are being normalized.
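
The normalization can be sketched by hand. This is a simplified single step without momentum, not PyTorch's `torch.optim.RMSprop` implementation. The function name and the default values for `lr`, `alpha`, and `eps` are assumptions made for illustration.

```python
import torch

def rmsprop_step(param, grad, sq_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSProp-style update (no momentum); returns the new param and running average."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    param = param - lr * grad / (sq_avg.sqrt() + eps)
    return param, sq_avg

param = torch.zeros(2)
sq_avg = torch.zeros(2)
# Two weights whose gradients differ by four orders of magnitude.
grad = torch.tensor([100.0, 0.01])

new_param, sq_avg = rmsprop_step(param, grad, sq_avg)
print("plain SGD step:", 0.01 * grad)
print("RMSProp step:  ", param - new_param)
```

With plain SGD the two step sizes differ by the same four orders of magnitude as the gradients, while the normalized steps come out nearly equal.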

#### Adam
Adam is similar to RMSProp, but it also keeps a running average of the gradient itself (the momentum term) and combines it with the gradient normalization, so that all layers can make almost equal progress. Adam is widely used in practice (and its variant, AdamW, is even more popular). The biggest downside of Adam is the extra memory needed to store the normalization and momentum statistics for every parameter, but in practice it works very well.

The change in PyTorch is quite simple:

```python
optim = torch.optim.AdamW(model.parameters(), lr=lr)
```

### Setting the learning rate
How can we determine what learning rate to use?

We could try a bunch of learning rates and then see what is the highest that won't give us spikes in loss. However, a more effective way of doing it is setting learning rate schedules.

A common way to think about it is to start at a fairly high learning rate, learn until you make progress, and then reduce the learning rate.
- Step decay: train until you no longer make progress, then cut the learning rate (by, say, a factor of 2 or a factor of 10), then continue.
- Linear decay: decrease the learning rate linearly across training iterations
- Cosine decay: popular in industry. Base the learning rate schedule on a cosine decay curve.
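
The three schedules can be written as small functions of the training step. This is a sketch under simplifying assumptions: the step decay here drops at a fixed interval (`drop_every`) rather than detecting a plateau, and all function and parameter names are made up for illustration.

```python
import math

def step_decay(step, base_lr, drop_every, factor=0.1):
    """Cut the learning rate by `factor` every `drop_every` steps."""
    return base_lr * factor ** (step // drop_every)

def linear_decay(step, base_lr, total_steps):
    """Decrease linearly from base_lr to 0 over total_steps."""
    return base_lr * max(0.0, 1 - step / total_steps)

def cosine_decay(step, base_lr, total_steps):
    """Follow half a cosine period from base_lr down to 0."""
    progress = min(step / total_steps, 1.0)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100):
    print(
        step,
        step_decay(step, 0.1, drop_every=40),
        linear_decay(step, 0.1, total_steps=100),
        cosine_decay(step, 0.1, total_steps=100),
    )
```

PyTorch ships ready-made versions of these in `torch.optim.lr_scheduler` (for example `StepLR`, `LinearLR`, and `CosineAnnealingLR`).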

### Learning rate vs. batch size
Oftentimes, bigger batch sizes lead to faster convergence, and there are heuristics we can take advantage of. For example, if we double the batch size, we can often double the learning rate. This is due in large part to the fact that doubling the batch size cuts the variance, both within a batch and across batches, by a factor of two. It's normally this variance that limits the useful learning rate, so reducing the variance increases the possible learning rate magnitude. This doubling heuristic works decently well for small learning rates, but it stops helping once the learning rate gets large (i.e., starts to approach 1).
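
The variance argument can be checked with a toy simulation. The "gradients" below are just independent random numbers with unit variance, which is an assumption made for illustration, and `scaled_lr` is a hypothetical helper for the doubling heuristic.

```python
import torch

torch.manual_seed(0)

def batch_mean_variance(batch_size, num_batches=20000):
    """Variance of the batch-averaged 'gradient' across many batches."""
    # Hypothetical per-example gradients: independent draws with unit variance.
    per_example = torch.randn(num_batches, batch_size)
    return per_example.mean(dim=1).var().item()

def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size

var_32 = batch_mean_variance(32)
var_64 = batch_mean_variance(64)
print(f"variance ratio (batch 32 vs. batch 64): {var_32 / var_64:.2f}")
print("learning rate for batch size 64:", scaled_lr(0.001, 32, 64))
```

The ratio should come out close to 2, matching the claim that doubling the batch size halves the variance of the averaged gradient.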

### Learning rate warmup
It can help to gradually increase the learning rate at the start of training and then have it decrease. This is often combined with cosine decay: the learning rate is ramped up very early in training and then decays.
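
One way to combine warmup with cosine decay is a custom multiplier passed to `torch.optim.lr_scheduler.LambdaLR`. This is a sketch: the `warmup_cosine` helper, the tiny model, and the step counts are all assumptions chosen for illustration.

```python
import math
import torch

def warmup_cosine(step, warmup_steps, total_steps):
    """Multiplier on the base learning rate: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))

# Hypothetical model and step counts, chosen only for illustration.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: warmup_cosine(step, warmup_steps=10, total_steps=100)
)

lrs = []
for step in range(100):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()   # a real loop would compute a loss and call backward() first
    scheduler.step()

print(f"first: {lrs[0]:.5f}, peak: {max(lrs):.5f}, last: {lrs[-1]:.7f}")
```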


## Overfitting

### Splitting our data
We should split our data into training (60%-80%), validation (10%-20%), and a holdout test (10%-20%) set. Ideally we train our model on the training set, validate (and tweak/tune) on the validation set, and then at the very end, after we have a completed and finished model, test just once on the test set. The easiest way to split the data into these subsets is to do random splitting without replacement. But it's important to *check for correlations* in your data (for example, is the distribution of labels different in the training and validation sets?).

The validation set checks for overfitting on parameters $\theta$, since we can see if our model overfit its parameters on the training set. The test set checks for overfitting of hyperparameters since we tune and update those hyperparameters based on what we see in the validation set and it is possible to overfit on the results of the validation set.

Overfitting isn't always bad, however, since there's always going to be a degree of overfitting. It becomes bad when our performance on the validation set suffers because we overfit on the characteristics of the training set.
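
A random split without replacement, followed by a quick check of the label distribution in each subset, might look like the sketch below. The dataset is a made-up `TensorDataset`, and the 70/15/15 proportions are just one choice within the ranges above.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical dataset: 1000 samples with 8 features and 5 classes.
features = torch.randn(1000, 8)
labels = torch.randint(0, 5, (1000,))
dataset = TensorDataset(features, labels)

# 70% / 15% / 15% split without replacement; the fixed seed keeps it reproducible.
generator = torch.Generator().manual_seed(0)
train_set, valid_set, test_set = random_split(dataset, [700, 150, 150], generator=generator)

# Compare label distributions across the splits.
for name, subset in [("train", train_set), ("valid", valid_set), ("test", test_set)]:
    subset_labels = labels[subset.indices]
    print(name, torch.bincount(subset_labels, minlength=5) / len(subset))
```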

### Why do we overfit?

#### Sampling bias
- The model finds patterns that happen to exist only in the training set. In practice, this happens over the course of epochs, since the model begins to see the same samples multiple times; it's unlikely that the model will overfit on one pass of the data.
- Neural networks are really good at creating nonlinear separators of data because they represent the data in a higher-dimensional space in which there are cleaner linear separators. The problem is that it's highly unlikely that the representations of two sets of data, such as the training and test sets, are identical in a high-enough dimension.

### How do we prevent overfitting?

#### Early stopping
One strategy is to stop training when the validation accuracy peaks. Every few epochs, we can measure the validation accuracy and then save the model, and at the end, we just pick the model that had the highest validation accuracy. We don't necessarily manually stop; rather, we pick the best model after the fact.
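
Picking the best model after the fact can be sketched as follows. The model and the validation accuracies are placeholders. In a real run, each accuracy would be measured on the validation set after the corresponding epoch.

```python
import copy
import torch

model = torch.nn.Linear(4, 2)  # hypothetical model

# Hypothetical validation accuracies; in practice, measure them on the validation set.
validation_accuracies = [0.52, 0.61, 0.68, 0.66, 0.64]

best_accuracy, best_epoch, best_state = -1.0, None, None
for epoch, accuracy in enumerate(validation_accuracies):
    # ... one epoch of training would happen here ...
    if accuracy > best_accuracy:
        best_accuracy, best_epoch = accuracy, epoch
        # Deep copy so later training steps don't overwrite the saved weights.
        best_state = copy.deepcopy(model.state_dict())

# Restore the weights from the best epoch.
model.load_state_dict(best_state)
print(f"best epoch: {best_epoch}, validation accuracy: {best_accuracy}")
```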

#### Collect more data
We begin to overfit when we sample the same data over and over, so one approach is just collecting more data. This is commonly done these days with LLMs, where there's so much data being used that they don't have to see the same data more than once during training.

##### Data augmentation
If we can't collect more data, we can do data augmentation. This works especially well for images.

##### Transfer learning
If we still don't have enough data, we can use transfer learning. We can train a model on a large dataset (pre-training) and then take that model and train it on a target dataset (fine-tuning). The idea is that training on the large dataset learns the "general distribution" of a wider task (e.g., for an NLP model, we can use a large dataset to learn general "language comprehension"), and then we fine-tune it on a smaller dataset to refine its ability on a narrower task (e.g., comprehension of legal text).

In practice, we can download a pretrained model and run a few training iterations on the small dataset.

Transfer learning works pretty well because knowledge from one domain or task often transfers well to another and because using a pretrained model means that the weights are already well-initialized.

###### When should we use transfer learning?
WHENEVER POSSIBLE.

For most tasks, similar pretrained models already exist. In early experiments, using a pretrained model gives you a huge head start over creating a model from scratch, and it probably gets you most of the way there in whatever task you're working on.

### Why do models overfit (deeper dive)
The first layer of the model will overfit most closely to the underlying data since it directly touches the data. However, downstream layers magnify this overfitting effect because they learn to work with the correlations within the overfit upstream layers. This means that, oftentimes, deeper layers overfit more since they rely on already-overfit activations from previous layers.

#### Dropout
One way to prevent this is **dropout**, where, with a certain probability, the model randomly zeroes out activations during training. This helps with overfitting because downstream layers can't fully rely on the outputs of the upstream layers when making their decisions, so the neurons are pushed to learn more generalizable patterns.

*Note: During testing, we disable dropout. During evaluation, our model needs to be in evaluation mode (`model.eval()`) so that the dropout layers act as identity functions.*
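
The difference between the two modes is easy to see on a single `torch.nn.Dropout` layer. The drop probability and the input are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()
train_out = dropout(x)  # entries are zeroed at random; survivors are scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)   # identity: the input passes through unchanged

print("train:", train_out)
print("eval: ", eval_out)
```

The rescaling during training keeps the expected activation the same in both modes, which is why nothing needs to be adjusted at evaluation time.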

#### Where to add dropout?
In general, we should add dropout:
- Before any large fully-connected layer. These have a high number of parameters and connections, and they're prone to overfitting because they learn to rely on very specific configurations of inputs. Adding dropout to their input activations pushes each neuron to learn parameters that are less correlated with those learned by other neurons.
- Before some 1x1 convolutions. These can have a lot of parameters and are used normally for changing the number of channels. Doing it here can be beneficial due to the number of parameters.

We should *not* add dropout:
- Before general convolutions. Dropout before convolutions can hinder the learning of features by the kernels, plus within the receptive field, there is enough correlation among the different pixels (i.e., even if you dropout a certain pixel, the pixels around it are correlated) that dropout doesn't work well (this is the same principle that makes something like stride work OK since you don't 100% need to analyze every pixel; you can skip around).

#### Models can become large
If we make our models larger without any additional modifications, they can (possibly) overfit more. This idea comes from classical ML and holds less well in deep learning, but we still have to be mindful of it.

##### Using smaller models
In general, smaller models overfit less, so the principle of using a simpler model is good to be mindful of. However, smaller models also often fit worse and generalize worse, which is something to keep in mind.

##### Adding regularization
A great way to get the benefits of larger models while minimizing overfitting is through regularization.

We can do this by adding weight decay (e.g., L2 regularization), which keeps the weight magnitudes small. It also helps with exploding gradients by keeping the magnitudes controlled.

In practice, we can just add this to the optimizer (e.g., AdamW).

```python
torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
```

*In practice this doesn't actually really help overfitting, BUT it is good for making sure that model gradients don't blow up*

#### Models can become too complex
We can, instead of training a single complex model, try to train multiple small models on the dataset and then average the predictions of the multiple models. This ensemble model works well and the averaging has a good effect. Pre-deep learning, doing ensembling also required using different subsets of data, but in deep learning, there's enough randomness (e.g., by adding randomness to the dataloader or by having random weight initializations).

Each model overfits in its own way, so averaging often times leads to the "true signal" emerging since a model that overfits in a particular way is averaged out by the other models (and those models' overfitting is averaged out by the others).
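
The averaging step can be sketched as below. The three linear "models" are untrained stand-ins that differ only in their random initialization. In practice each would be trained separately on the dataset first.

```python
import torch

torch.manual_seed(0)
# Hypothetical ensemble: three small classifiers with different random initializations.
models = [torch.nn.Linear(16, 5) for _ in range(3)]
inputs = torch.randn(4, 16)

with torch.no_grad():
    # Average the predicted class probabilities across models.
    probabilities = torch.stack([m(inputs).softmax(dim=-1) for m in models])
    ensemble_probabilities = probabilities.mean(dim=0)

predictions = ensemble_probabilities.argmax(dim=-1)
print(ensemble_probabilities.shape, predictions)
```

Averaging probabilities rather than raw logits is one common choice. Either way, the cost grows linearly with the number of models.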

When to use ensembles?
- If you have the compute power - i.e., nearly infinite compute, because it is computationally more expensive
- When you need that last 1%-3% of accuracy. Ensembles are popular in competitions like Kaggle.

In practice, they're not that popular anymore.


## Making it work, in practice

We can go through an example to see what it would be like to implement these changes in practice.

Let's start with a previous ConvNet model.

In [23]:
class ConvNet(torch.nn.Module):
    """Convolutional neural network."""
    class CNNBlock(torch.nn.Module):
        def __init__(self, in_channels, out_channels, stride):
            super().__init__()
            kernel_size = 3
            padding = (kernel_size - 1) // 2

            self.c1 = torch.nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=kernel_size,
                stride=stride,
                padding=padding
            )
            self.n1 = torch.nn.GroupNorm(1, out_channels)
            self.c2 = torch.nn.Conv2d(
                out_channels,
                out_channels,
                kernel_size=kernel_size,
                stride=1,
                padding=padding
            )
            self.n2 = torch.nn.GroupNorm(1, out_channels)
            self.relu1 = torch.nn.ReLU()
            self.relu2 = torch.nn.ReLU()

            self.skip = torch.nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size=1,
                stride=stride
            ) if in_channels != out_channels else torch.nn.Identity()
        
        def forward(self, x0):
            x = self.relu1(self.n1(self.c1(x0)))
            x = self.relu2(self.n2(self.c2(x))) + self.skip(x0)
            return x


    def __init__(self, channels_l0 = 64, n_blocks = 4):
        super().__init__()
        kernel_size_l0 = 11
        padding_l0 = (kernel_size_l0 - 1) // 2
        stride_l0 = 2
        # initialize the first layer
        cnn_layers = [
            torch.nn.Conv2d(
                3,
                channels_l0,
                kernel_size=kernel_size_l0,
                stride=stride_l0,
                padding=padding_l0
            )
        ]
        c1 = channels_l0
        # blocks
        for _ in range(n_blocks):
            stride = 2
            c2 = c1 * stride # to make up for stride
            cnn_layers.append(self.CNNBlock(c1, c2, stride))
            c1 = c2
        
        # 1x1 convolution, to output the 102 values that are needed in the
        # flowers dataset classification.
        cnn_layers.append(torch.nn.Conv2d(c1, 102, kernel_size=1))
        self.network = torch.nn.Sequential(*cnn_layers)


    def forward(self, x):
        return self.network(x).mean(dim=-1).mean(dim=-1)


Let's now create some training code as well.

In [21]:
import torch
import torchvision
from torch.utils.tensorboard import SummaryWriter

In [None]:
# let's load our data
size = (128, 128)
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize(size),
    torchvision.transforms.ToTensor()
])
train_dataset = torchvision.datasets.Flowers102(
    "./flowers", "train", transform=transform, download=True
)
valid_dataset = torchvision.datasets.Flowers102(
    "./flowers", "val", transform=transform, download=True
)
test_dataset = torchvision.datasets.Flowers102(
    "./flowers", "test", transform=transform, download=True
)

In [26]:
net = ConvNet(channels_l0=32, n_blocks=4)
optim = torch.optim.AdamW(net.parameters(), lr=0.005)
num_epochs = 100

In [27]:
def get_device():
    if torch.cuda.is_available():
        print("CUDA backend available.")
        device = torch.device("cuda")
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        print("Arm mac GPU available, using GPU.")
        device = torch.device("mps") # for Arm Macs
    else:
        print("GPU not available, using CPU")
        device = torch.device("cpu")
    return device

In [29]:
device = get_device()
net.to(device)

Arm mac GPU available, using GPU.


ConvNet(
  (network): Sequential(
    (0): Conv2d(3, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5))
    (1): CNNBlock(
      (c1): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (n1): GroupNorm(1, 64, eps=1e-05, affine=True)
      (c2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (n2): GroupNorm(1, 64, eps=1e-05, affine=True)
      (relu): ReLU()
      (skip): Conv2d(32, 64, kernel_size=(1, 1), stride=(2, 2))
    )
    (2): CNNBlock(
      (c1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (n1): GroupNorm(1, 128, eps=1e-05, affine=True)
      (c2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (n2): GroupNorm(1, 128, eps=1e-05, affine=True)
      (relu): ReLU()
      (skip): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2))
    )
    (3): CNNBlock(
      (c1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (n1): GroupNorm(1, 256, eps=1e-05, 

In [None]:
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8
)

In [None]:
for epoch in range(num_epochs):
    epoch_loss = []
    train_accuracy = []
    for data, label in train_loader:
        data, label = data.to(device), label.to(device)
        output = net(data)
        loss = torch.nn.functional.cross_entropy(output, label)

        train_accuracy.extend(
            (output.argmax(1) == label).cpu().detach().numpy()
        )

        optim.zero_grad()
        loss.backward()
        optim.step()

        epoch_loss.append(loss.item())
    
    # note: would normally add validation loop here.
    
    # early stopping
    if epoch % 10 == 0:
        # save weights instead of model since saving the
        # model also pickles the model architecture,
        # which can be OK but is not necessary.
        torch.save(net.state_dict(), f"model_{epoch}.pth")
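
The validation loop mentioned in the comment above could be written as a small helper like the one below. This is a sketch: the function name `evaluate` is made up, and it assumes the loader yields `(data, label)` batches like `train_loader` does.

```python
import torch

def evaluate(model, loader, device):
    """Return the accuracy of `model` over all samples in `loader`."""
    model.eval()  # switch layers such as dropout to evaluation behavior
    correct, total = 0, 0
    with torch.no_grad():
        for data, label in loader:
            data, label = data.to(device), label.to(device)
            predictions = model(data).argmax(dim=1)
            correct += (predictions == label).sum().item()
            total += label.numel()
    model.train()  # restore training mode for the next epoch
    return correct / total
```

Inside the epoch loop, this could be called as `evaluate(net, valid_loader, device)`, where `valid_loader` is a hypothetical `DataLoader` built from `valid_dataset`. Storing the result next to each saved checkpoint makes it possible to pick the best model afterwards.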
