<a href="https://colab.research.google.com/github/gtbook/robotics/blob/main/S56_diffdrive_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install -q -U gtbook


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import plotly.graph_objects  # explicitly load the submodule used for plotting

# Deep Learning

> Stochastic gradient descent and the like.

<img src="Figures5/S56-Two-wheeled_Toy_Robot-09.jpg" alt="Splash image with steampunk differential drive robot" width="40%" align=center style="vertical-align:middle;margin:10px 0px">

## Supervised Learning Setup

> From data, learn concept.

In the **supervised learning** setup, we have a large number of examples of inputs $x$ and corresponding labels $y$.
We will often refer to the *training dataset* as $D$, consisting of pairs $(x,y)$. The nature of the output labels $y$ determines the type of learning problem we are dealing with:

1. **Classification**: If the labels $y$ are *discrete*, we talk about a supervised *classification* problem. The prototypical example is classifying images as portraying either a cat or a dog: here the images are the inputs $x$, and the output label $y \in \{\text{cat},\text{dog}\}$.

2. **Regression**: If the labels $y$ are *continuous*, this is called a supervised *regression* problem. For example, predicting the blue-book value of a second-hand car based on its make, model, year, and miles driven.

Whether we are talking about classification or regression, the supervised learning process normally follows these steps:

1. Define a model $f$ and its parameters $\theta$ that allow you to output a prediction $\hat{y}$ from the input features $x$:

$$
\hat{y} = f(x; \theta)
$$

2. Divide your data into *training*, *validation*, and *test* datasets. Typically, the largest portion of the data is used for training, and smaller portions are set aside for validation and testing.

3. Train the model using the training data $D_{\text{train}}$, while monitoring for "overfitting" on the validation dataset $D_{\text{val}}$. We train by adjusting the parameters $\theta$ to minimize a training loss; we look at both the loss and the minimization procedure in more detail below.

4. After we decide to stop the training process, we typically test the model on the held-out dataset $D_{\text{test}}$ that the training process has never seen, to get an independent assessment of how well the model will generalize to new, unseen data.
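
The data-splitting step can be sketched in a few lines of PyTorch. This is only an illustration under stated assumptions: the toy dataset, the 80/10/10 proportions, and all variable names are hypothetical and not part of the text above.

```python
import torch

torch.manual_seed(0)  # make the shuffle reproducible

# Hypothetical toy dataset: 100 scalar inputs x with labels y = 2x.
x = torch.arange(100, dtype=torch.float32)
y = 2 * x

# Shuffle the indices once, then cut them into three disjoint parts.
perm = torch.randperm(len(x))
n_train, n_val = 80, 10  # assumed 80/10/10 split
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
x_test, y_test = x[test_idx], y[test_idx]

print(len(x_train), len(x_val), len(x_test))
```

Shuffling before splitting matters whenever the data is stored in some systematic order, as it is in this toy example.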

Supervised learning is the staple of machine learning, and its use has exploded in recent years to encompass almost every area of human economic activity, from finance to healthcare and everything in between. Most recently, the success of large language models is also based on supervised learning: a "transformer"-based model is trained on very large textual datasets to predict the next word (or token) in a sequence. This paradigm is rapidly finding its way to other modalities as well, such as vision.

## Example: Interpolation in 1D

As an example, we formulate a simple regression problem that asks for interpolating functions in 1D. We will create a *differentiable* interpolation scheme that can be trained using samples from any function we want to interpolate, even functions with multi-dimensional outputs.

The `LineGrid` class below is designed for this purpose. It divides the 1D interval over which the function is defined into a number of *cells*, arranged in a 1D grid. It is initialized with two parameters:

- `n`: the number of cells in a 1D grid
- `d`: the dimensionality of the function we want to interpolate

In our case, we will focus on interpolating the grid values defined at the cell boundaries. The `forward` method of the `LineGrid` module performs the interpolation for any value inside the grid.

In [None]:
def interpolate(v0, v1, alpha):
    """Interpolate between v0 and v1 using alpha, using unsqueeze to properly handle batches."""
    return v0 * (1 - alpha.unsqueeze(-1)) + v1 * alpha.unsqueeze(-1)


class LineGrid(nn.Module):
    """A 1D grid with learnable values at the boundaries of n cells."""

    def __init__(self, n, d=1):
        super().__init__()  # Calling the superclass's __init__ method
        self.grid = nn.Parameter(nn.init.normal_(torch.empty(n + 1, d)))

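    def forward(self, x):
        # x holds coordinates in [0, n); x == n would index past the last grid value
        X = torch.floor(x).long()  # index of the cell containing each x
        a = x - X  # blending weights (same size as x)
        return interpolate(self.grid[X], self.grid[X + 1], a)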

Here is an example of how to initialize a `LineGrid` instance, and subsequently call its `forward` method:

In [None]:
grid_module = LineGrid(n=5, d=2)

x = torch.tensor([1.5, 2.7, 3.6])
print("Interpolated Output:", grid_module(x))

In the example above, the shape of the output is $3\times 2$ because we asked to interpolate a 2D function (`d` is 2) at three different locations (`x` has three elements). The output looks rather random, however, because the grid was initialized with random values in the constructor.
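
To see the interpolation at work without the randomness, here is a small sketch that applies the same blending formula to a hand-picked grid. The grid values and the query points are hypothetical, chosen only so the result is easy to check by hand.

```python
import torch

# Hypothetical grid with n=3 cells, so 4 boundary values, for a 2D function (d=2).
grid = torch.tensor([[0.0, 10.0],
                     [1.0, 20.0],
                     [4.0, 30.0],
                     [9.0, 40.0]])

x = torch.tensor([0.25, 1.5])
X = torch.floor(x).long()    # index of the cell each x falls in: [0, 1]
a = (x - X).unsqueeze(-1)    # blending weight within the cell: [[0.25], [0.5]]
y = grid[X] * (1 - a) + grid[X + 1] * a
print(y)
```

For $x=0.25$ the result is $0.75\cdot(0, 10) + 0.25\cdot(1, 20) = (0.25, 12.5)$, and for $x=1.5$ it is the midpoint of the second and third grid values, $(2.5, 25)$.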

Below we discuss often-used loss functions and the stochastic gradient descent method to train models, and show how to train the simple regression example above to drive all these concepts home.

### Exercise

For the example above with $n=5$ and $d=2$, give an $(x,y)$ pair. What is the model $f$? What are the model parameters $\theta$? How many parameters are there?

## Loss Functions

> A loss function for every occasion.

Different tasks require different loss functions, and a lot of creativity and research goes into crafting loss functions for complex tasks. For "vanilla" regression tasks, we typically use a **sum of squared differences** loss function, which we already encountered before:

$$L_{\text{SSD}}(\theta; D) \doteq \sum_{(x,y)\in D}|f(x;\theta)-y|^2$$

Above $f(x;\theta)$ is the continuous prediction function implemented by, say, a neural network, where $\theta$ represents the weights in all layers. Note that the formula above can be easily generalized to vector-valued labels $y$.
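
As a quick numerical illustration of the formula, the sketch below evaluates the loss for three made-up predictions and labels, and compares it with PyTorch's built-in `nn.MSELoss` using `reduction="sum"`. The numbers are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical predictions f(x; theta) and labels y for three examples.
y_hat = torch.tensor([1.0, 2.0, 4.0])
y = torch.tensor([1.5, 2.0, 2.0])

# Sum of squared differences, written out as in the formula.
loss_ssd = ((y_hat - y) ** 2).sum()

# PyTorch's built-in MSE loss with reduction="sum" computes the same quantity.
loss_builtin = nn.MSELoss(reduction="sum")(y_hat, y)

print(loss_ssd.item(), loss_builtin.item())
```

Both values equal $0.25 + 0 + 4 = 4.25$. With the default `reduction="mean"`, `nn.MSELoss` divides this sum by the number of elements instead, which rescales the loss but does not change where its minimum lies.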

For classification, the **cross entropy** loss function is very popular: it measures the disagreement between the predicted class probabilities and the ground truth labels:

$$L_{\text{CE}}(\theta; D) \doteq \sum_c \sum_{(x,y=c)\in D}\log\frac{1}{p_c(x;\theta)}$$

This formula may look complicated at first, but it is quite intuitive once you understand a few concepts.
In particular, in the multi-class classification problem we assume that the model outputs a probability $p_c(x;\theta)$ for every class $c\in[N]$, where $N$ is the number of classes. The quantity 

$$\log\frac{1}{p_c(x;\theta)}$$

is called the *surprise* that we should experience when seeing a label $y=c$. 
For example, if we see class $3$ and the probability output by the network is $p_3(x;\theta)=1$, we are not surprised at all, as $\log\frac{1}{1}=\log 1 = 0$. 
However, if the probability is only $0.01$, our surprise is $\log_{10}\frac{1}{0.01}=\log_{10} 100 = 2$ when using base-10 logarithms (with the natural logarithm it is $\ln 100 \approx 4.6$).
The lower the probability, the higher the surprise. Hence, the cross-entropy above measures the total surprise at seeing the labeled examples in the training data or, when divided by the number of examples, the *average surprise*. Training makes the model as unsurprised as possible by the training labels, which is why it is an intuitive loss function to minimize.
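
The surprise interpretation can be checked numerically. The sketch below makes two assumptions that are not stated in the text above: the class probabilities come from a softmax over raw scores (logits), and the natural logarithm is used, as in PyTorch's `F.cross_entropy`. The logits and labels are made up.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) for 2 examples and N=3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])  # ground-truth class c for each example

# Assumption: a softmax turns the scores into class probabilities p_c(x; theta).
probs = torch.softmax(logits, dim=1)

# Surprise log(1/p_c) of the true class, using the natural logarithm.
p_true = probs[torch.arange(len(labels)), labels]
surprise = torch.log(1.0 / p_true)
loss_manual = surprise.sum()

# The built-in cross-entropy takes logits directly; reduction="sum" matches the formula.
loss_builtin = F.cross_entropy(logits, labels, reduction="sum")

print(surprise, loss_manual.item(), loss_builtin.item())
```

The two losses agree up to floating-point error. The default `reduction="mean"` would report the average surprise per example instead of the sum.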

Note that training with cross-entropy does not guarantee that the outputs can be *truly* interpreted as probabilities: the recent field of "model calibration" has shown that neural networks in particular can severely over-estimate those probability values in attempting to minimize the loss. If this interpretation is important for the application at hand, several techniques now exist to "calibrate" models so that their outputs can be interpreted that way.

## Stochastic Gradient Descent

> Calculate gradient, reduce loss.

The output of a neural network, and of a CNN in particular, depends on the large set of continuous weights $W$ in its layers, such as its convolutional and fully connected layers. In other words, the neural network is the model $f(x;\theta)$ in the learning setup discussed above, and the weights $W$ are its parameters $\theta$.


When we train a neural network, we adjust its weights $W$ to perform better on the task at hand, be it classification or regression. To measure whether the model performs "better", we can use one of the loss functions defined above. To adjust the weights, we could calculate the gradient of the loss function with respect to each of the weights, and then move the weights a small step in the direction opposite to the gradient. That procedure is called **gradient descent**, but it is typically too expensive to calculate the gradient using the large datasets that are used in practice.

**Stochastic gradient descent** is an approximate gradient descent procedure designed to cope with the very large datasets typically thrown at supervised problems. Calculating the *exact* gradient requires looping over all the examples, which can run in the millions (or billions), and is therefore usually impractical. An easy approximation scheme is to *randomly sample* a small subset of the examples (a *mini-batch*), and calculate the gradient of the loss with respect to the weights using only those examples. The upside is that this is much faster, but the downside is that the result is only approximate. Hence, if we adjust the weights with this approximate gradient, any single step might or might not make progress on the task. This procedure is called stochastic gradient descent, and it works amazingly well in practice.
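
The difference between the exact and the stochastic gradient can be made concrete with a toy problem. The sketch below is a minimal illustration, not a recipe: the one-parameter model, the batch size of 64, the learning rate of 0.5, and the helper `loss_gradient` are all hypothetical choices.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: fit y = 3x with a single weight w.
x = torch.rand(10_000)
y = 3 * x
w = torch.zeros(1, requires_grad=True)


def loss_gradient(xb, yb):
    """Gradient of the mean squared error with respect to w on the batch (xb, yb)."""
    loss = ((w * xb - yb) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, w)
    return grad


# Exact gradient: uses every example in the dataset.
full_grad = loss_gradient(x, y)

# Stochastic estimate: uses a random mini-batch of 64 examples.
idx = torch.randint(0, len(x), (64,))
batch_grad = loss_gradient(x[idx], y[idx])
print(full_grad.item(), batch_grad.item())

# Stochastic gradient descent: a fresh random mini-batch for every step.
for _ in range(200):
    idx = torch.randint(0, len(x), (64,))
    g = loss_gradient(x[idx], y[idx])
    with torch.no_grad():
        w -= 0.5 * g  # assumed learning rate of 0.5

print(w.item())
```

The mini-batch gradient is a noisy version of the full gradient, yet each step costs only 64 examples instead of 10,000, and the weight still ends up close to the true value of 3.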

## Learning to Interpolate

To "learn" a function we need to provide training data, and minimize a loss function. As an example, maybe we can learn a sine and cosine function at the same time? Let us create some training data by creating 500 samples of these two functions:

In [None]:
n = 20 # we use a grid size of 20, allowing for x values between 0 and 20
num_samples = 500 # we use 500 samples to train our model
x_samples = torch.rand((num_samples,)) * n
y_samples = torch.stack([torch.sin(x_samples * 2 * torch.pi / n) + 0.1 * torch.randn((num_samples, )), 
                         torch.cos(x_samples * 2 * torch.pi / n) + 0.1 * torch.randn((num_samples, ))], dim=1)
print(y_samples.shape)

The training code below is a standard way of training a neural network using PyTorch, which we will abuse here to optimize the parameters of a `LineGrid` instead. That is possible because all the operations inside our `LineGrid` class are differentiable, so stochastic gradient descent (SGD) will just work. Note that, for simplicity, the code passes all 500 samples through the model in every step rather than a random subset. The loss we will minimize is the **Mean-Squared Error** (MSE) loss function, which measures the average squared difference between the predicted values and the training data values. This is the standard loss function for so-called *regression* problems, where we are trying to fit a continuous function. 

Inside the training loop below, you'll find the typical sequence of operations: zeroing gradients, performing a forward pass to get predictions, computing the loss, doing a backward pass to compute the gradients, and taking an optimizer step to update the model's parameters. Try to understand the code, as this same training loop is at the core of most deep learning architectures. Now, let's take a closer look at the code itself, which is extensively documented for clarity:

In [None]:
def train(model, x_samples, y_samples, learning_rate=0.3, num_epochs=601, checkpoint_freq=100):
    # Initialize Stochastic Gradient Descent optimizer
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    
    # Initialize the built-in Mean-Squared Error loss function
    mse = nn.MSELoss()

    # Loop over the dataset multiple times (each loop is an epoch)
    for epoch in range(num_epochs):
        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass: Compute predicted y by passing x_samples through the model
        output = model(x_samples)

        # Compute loss using built-in MSE loss function
        loss = mse(output, y_samples)

        # Backward pass: Compute gradient of the loss with respect to model parameters
        loss.backward()

        # Update model parameters using optimizer
        optimizer.step()

        # Print loss at specified checkpoint frequencies
        if epoch % checkpoint_freq == 0:
            print(f'Loss at epoch {epoch}: {loss.item()}')

Using this code is rather trivial: we initialize a grid with random values, and call `train`:

In [None]:
# Initialize model
model = LineGrid(n=n, d=2)  # d=2 as we are regressing both sin and cos

# Run the training loop
train(model, x_samples, y_samples)

We can then evaluate the resulting functions and plot the result against the training data, and we see that we get decent approximations of sin and cos, even with noisy training data:

In [None]:
y_pred = model(x_samples).detach().numpy()

fig = plotly.graph_objects.Figure()
fig.add_scatter(x=x_samples, y=y_samples[:, 0], mode='markers', name='sin')
fig.add_scatter(x=x_samples, y=y_samples[:, 1], mode='markers', name='cos')
fig.add_scatter(x=x_samples, y=y_pred[:, 0], mode='markers', name='predicted sin')
fig.add_scatter(x=x_samples, y=y_pred[:, 1], mode='markers', name='predicted cos')
fig.show()
