# LeNet

In this section, we introduce LeNet, one of the earliest convolutional neural networks (CNNs) to gain significant attention for its effectiveness in computer vision. Developed by Yann LeCun during his time at AT&T Bell Labs, LeNet was specifically designed for recognizing handwritten digits (LeCun et al., 1998). The model represented the culmination of nearly a decade of work in CNN research; in fact, LeCun and his collaborators were the first to demonstrate successful training of CNNs using backpropagation (LeCun et al., 1989).

When it was introduced, LeNet delivered remarkable results: with an error rate below 1% per digit, it rivaled the accuracy of support vector machines, the leading supervised learning method of the time. The architecture was later deployed in practical applications, such as digit recognition for processing ATM deposits, and some ATMs reportedly still run the original code written by Yann LeCun and Léon Bottou in the 1990s.

In [None]:
import torch
from torch import nn
from d2l import torch as d2l

## Architecture

At a broad level, LeNet-5 is organized into two main components: (i) a convolutional encoder made up of two convolutional layers, and (ii) a fully connected block comprising three dense layers.

<center>
<img src="../images/15_Image_1.png" width="800">

Figure 1: Data flow in LeNet. The input is a handwritten digit, the output is a probability over 10 possible outcomes.
</center>

In LeNet, each convolutional block is composed of a convolutional layer, a sigmoid activation function, and an average pooling step. ReLUs and max-pooling, which are standard in modern CNNs, had not yet been introduced at the time. Both convolutional layers use 5 × 5 kernels to transform their inputs into two-dimensional feature maps, increasing the number of channels: the first convolutional layer produces 6 output channels, while the second produces 16. Each 2 × 2 pooling operation with stride 2 halves both the height and the width, shrinking the spatial size by a factor of four. The output of the convolutional block thus has the shape (batch size, channels, height, width).

To connect this output to the dense block, the four-dimensional data must be flattened into a two-dimensional format suitable for fully connected layers, with one dimension indexing the examples in the batch and the other holding the flattened features of each example. The dense block then processes this representation through three fully connected layers with 120, 84, and finally 10 outputs, where the last layer corresponds to the 10 possible output classes.
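
The two shape-changing steps described above, 2 × 2 average pooling with stride 2 and flattening, can be sketched in plain Python on nested lists. This is only an illustration under simplifying assumptions: the helper names `avg_pool_2x2` and `flatten` are hypothetical and not part of `d2l` or PyTorch, and the tiny 4 × 4 feature maps are made-up data.

```python
# Minimal pure-Python sketch (no framework); all names here are illustrative.

def avg_pool_2x2(fmap):
    """Average-pool one 2D feature map with a 2x2 window and stride 2."""
    height, width = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4
            for j in range(0, width - 1, 2)
        ]
        for i in range(0, height - 1, 2)
    ]


def flatten(batch):
    """Turn a (batch, channels, height, width) nested list into (batch, features)."""
    return [
        [value for channel in example for row in channel for value in row]
        for example in batch
    ]


# One example with two 4x4 channels, i.e. shape (1, 2, 4, 4).
channel_a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
channel_b = [[0, 0, 4, 4], [0, 0, 4, 4], [8, 8, 2, 2], [8, 8, 2, 2]]
batch = [[channel_a, channel_b]]

# Pooling halves height and width: (1, 2, 4, 4) -> (1, 2, 2, 2).
pooled = [[avg_pool_2x2(channel) for channel in example] for example in batch]

# Flattening keeps the batch axis and merges the rest: (1, 2, 2, 2) -> (1, 8).
flat = flatten(pooled)
print(pooled)
print(flat)
```

Each 4 × 4 map (16 values) becomes a 2 × 2 map (4 values), which is the factor-of-four reduction, and flattening leaves one row of features per example.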

Although fully grasping the inner workings of LeNet may require some effort, the following code example demonstrates how straightforward it is to implement such models using modern deep learning frameworks.

In [None]:
def init_cnn(module):  #@save
    """Initialize weights for CNNs."""
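    # Lazy layers turn into nn.Linear / nn.Conv2d once a forward pass has
    # inferred their shapes, so this matches only layers that have weights.
    if type(module) in (nn.Linear, nn.Conv2d):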
        nn.init.xavier_uniform_(module.weight)

class LeNet(d2l.Classifier):  #@save
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(num_classes))

Let’s examine the network’s internal operations. By feeding a single-channel (grayscale) image into the model and printing the output shape at each stage, we can verify that the transformations match the structure shown in Fig. 2.

<center>
<img src="../images/15_Image_2.png" width="150">

Figure 2: Compressed notation for LeNet-5.
</center>



In [None]:
@d2l.add_to_class(d2l.Classifier)  #@save
def layer_summary(self, X_shape):
    X = torch.randn(*X_shape)
    for layer in self.net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)

model = LeNet()
model.layer_summary((1, 1, 28, 28))

In LeNet, the spatial dimensions (height and width) shrink as data passes through the convolutional block, while the number of channels grows. The first convolutional layer uses two pixels of padding (`padding=2`), which offsets the four-pixel reduction that a 5 × 5 kernel would otherwise cause, so its output keeps the 28 × 28 input size. The second convolutional layer uses no padding, so it reduces both height and width by four pixels. Each pooling layer then halves the height and width. Meanwhile, the number of channels increases from 1 in the input to 6 after the first convolution and 16 after the second. Finally, the fully connected layers reduce the dimensionality step by step until it matches the number of output classes. As a historical aside, MNIST images were trimmed from 32 × 32 to 28 × 28 to save storage space.
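
These shape changes can be checked by hand with the standard output-size formula for convolution and pooling, ⌊(n + 2p − k) / s⌋ + 1, where n is the input size, p the padding, k the window size, and s the stride. The following framework-free sketch applies it to the layer settings used in the `LeNet` class above. The helper `out_size` and the `layers` table are illustrative names, and a 28 × 28 single-channel input is assumed.

```python
def out_size(size, window, padding=0, stride=1):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - window) // stride + 1


# (name, output channels or None to keep them, window, padding, stride)
layers = [
    ("conv1", 6, 5, 2, 1),
    ("pool1", None, 2, 0, 2),
    ("conv2", 16, 5, 0, 1),
    ("pool2", None, 2, 0, 2),
]

channels, size = 1, 28  # assumed input: one grayscale 28x28 image
shapes = [("input", channels, size, size)]
for name, out_channels, window, padding, stride in layers:
    if out_channels is not None:  # pooling leaves the channel count unchanged
        channels = out_channels
    size = out_size(size, window, padding, stride)
    shapes.append((name, channels, size, size))

flat_features = channels * size * size  # what nn.Flatten passes to the dense block

for name, c, h, w in shapes:
    print(f"{name}: channels={c}, height={h}, width={w}")
print("flattened features:", flat_features)
```

Under these assumptions the height and width follow 28 → 28 → 14 → 10 → 5, so the first fully connected layer receives 16 × 5 × 5 = 400 features per example.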

## Training

After implementing LeNet-5, we test it on the Fashion-MNIST dataset. Although CNNs have fewer parameters than comparable MLPs, they can still be more computationally intensive, because each convolutional parameter takes part in many more multiplications; a GPU therefore helps speed up training. The `d2l.Trainer` class simplifies the process by handling device setup, and `apply_init` initializes the parameters with `init_cnn` after a forward pass on one minibatch. As with MLPs, training employs cross-entropy loss minimized through minibatch stochastic gradient descent.
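
To make the trade-off between parameters and computation concrete, the following back-of-the-envelope sketch counts parameters and multiply-accumulate operations per example for the layers defined above. It is a rough estimate under stated assumptions: a 28 × 28 single-channel input, only convolutional and dense layers counted (pooling and activations ignored), and hypothetical helper names `conv_cost` and `dense_cost`.

```python
def conv_cost(in_channels, out_channels, kernel, out_size):
    """Parameters and multiply-accumulates of one convolutional layer."""
    weights_per_filter = in_channels * kernel * kernel
    params = out_channels * (weights_per_filter + 1)  # +1 for each bias
    # Every filter is applied once per output position.
    macs = out_size * out_size * out_channels * weights_per_filter
    return params, macs


def dense_cost(in_features, out_features):
    """Parameters and multiply-accumulates of one fully connected layer."""
    params = out_features * (in_features + 1)  # +1 for each bias
    macs = in_features * out_features  # each weight is used exactly once
    return params, macs


costs = {
    "conv1": conv_cost(1, 6, kernel=5, out_size=28),
    "conv2": conv_cost(6, 16, kernel=5, out_size=10),
    "fc1": dense_cost(16 * 5 * 5, 120),
    "fc2": dense_cost(120, 84),
    "fc3": dense_cost(84, 10),
}

conv_params = sum(costs[name][0] for name in ("conv1", "conv2"))
conv_macs = sum(costs[name][1] for name in ("conv1", "conv2"))
total_params = sum(params for params, _ in costs.values())
total_macs = sum(macs for _, macs in costs.values())

for name, (params, macs) in costs.items():
    print(f"{name}: {params:>6} parameters, {macs:>7} multiply-accumulates")
print(f"conv share of parameters: {conv_params / total_params:.1%}")
print(f"conv share of multiply-accumulates: {conv_macs / total_macs:.1%}")
```

Under these assumptions, the two convolutional layers hold only 2,572 of the 61,706 parameters, yet they account for 357,600 of the 416,520 multiply-accumulates per example, because each kernel weight is reused at every output position.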

In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], init_cnn)
trainer.fit(model, data)

## Exercises

1. Let’s modernize LeNet. Implement and test the following changes:
- Replace average pooling with max-pooling.
- Replace the sigmoid activations with ReLU.

2. In addition to using max-pooling and ReLU, try changing the size of the LeNet-style network to improve its accuracy.
- Adjust the convolution window size.
- Adjust the number of output channels.
- Adjust the number of convolution layers.
- Adjust the number of fully connected layers.
- Adjust the learning rates and other training details (e.g., initialization and number of epochs).

3. Display the activations of the first and second layers of LeNet for different inputs (e.g., sweaters and coats).