# Sequential Models in Deep Learning

## Introduction

Sequential models are the simplest way to build neural networks. They consist of a linear stack of layers: data flows from the input layer through each subsequent layer to the output, with no branching, merging, or other complex topology.

## Mathematical Foundation

A sequential model with $L$ layers can be represented mathematically as a composition of functions:

$$f(x) = (f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1)(x)$$

Where each function $f_i$ represents the operation performed by the $i$-th layer. If we denote the input as $x^{(0)}$ and the output of each layer $i$ as $x^{(i)}$, then:

$$x^{(i)} = f_i(x^{(i-1)})$$

For a typical fully connected layer with weights $W_i$ and biases $b_i$, followed by an activation function $\sigma_i$, the operation is:

$$x^{(i)} = \sigma_i(W_i x^{(i-1)} + b_i)$$

The sequential architecture simply chains these operations together.
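The recurrence $x^{(i)} = \sigma_i(W_i x^{(i-1)} + b_i)$ can be sketched directly as a loop over layers. The layer sizes (4, 3, 2), the choice of `tanh` for every $\sigma_i$, and the helper name `forward` are assumptions made only for this illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical layer sizes for illustration: 4 -> 3 -> 2
sizes = [4, 3, 2]
weights = [torch.randn(out_f, in_f) for in_f, out_f in zip(sizes[:-1], sizes[1:])]
biases = [torch.randn(out_f) for out_f in sizes[1:]]

def forward(x, weights, biases, activation=torch.tanh):
    """Apply x^(i) = sigma(W_i x^(i-1) + b_i) for each layer in turn."""
    for W, b in zip(weights, biases):
        x = activation(W @ x + b)
    return x

x0 = torch.randn(4)
x_out = forward(x0, weights, biases)

# The same computation written as an explicit composition f_2(f_1(x))
f1 = lambda v: torch.tanh(weights[0] @ v + biases[0])
f2 = lambda v: torch.tanh(weights[1] @ v + biases[1])
composed = f2(f1(x0))

print(torch.allclose(x_out, composed))
```

The loop and the nested call compute the same value, which is all that "chaining" means here.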

## Visual Representation

A sequential model can be visualized as a linear chain of operations:

```mermaid
flowchart LR
    Input --> Layer1[Layer 1] --> Layer2[Layer 2] --> DotDot[...] --> LayerL[Layer L] --> Output
    style Input fill:#f9f9f9,stroke:#333,stroke-width:1px,color:black
    style Output fill:#f9f9f9,stroke:#333,stroke-width:1px,color:black
    style Layer1 fill:#bbdefb,stroke:#333,stroke-width:1px,color:black
    style Layer2 fill:#bbdefb,stroke:#333,stroke-width:1px,color:black
    style DotDot fill:none,stroke:none,color:black
    style LayerL fill:#bbdefb,stroke:#333,stroke-width:1px,color:black
```

A more detailed diagram shows the mathematical operations in each layer:

```mermaid
flowchart LR
    Input(("Input x")) --> L1["Layer 1\nW₁x + b₁\n + σ₁"] --> L2["Layer 2\nW₂x + b₂\n + σ₂"] --> Dots["..."] --> LL["Layer L\nWₗx + bₗ\n + σₗ"] --> Output(("Output"))
    style Input fill:#f5f5f5,stroke:#333,stroke-width:1px,color:black
    style Output fill:#f5f5f5,stroke:#333,stroke-width:1px,color:black
    style L1 fill:#bbdefb,stroke:#333,stroke-width:1px,color:black
    style L2 fill:#bbdefb,stroke:#333,stroke-width:1px,color:black
    style Dots fill:none,stroke:none,color:black
    style LL fill:#bbdefb,stroke:#333,stroke-width:1px,color:black
```

Where $\sigma_i$ represents the activation function at layer $i$.

In [27]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

## Implementation with PyTorch

PyTorch provides the `nn.Sequential` container to create sequential models easily:

In [28]:
# Define a simple sequential model
model = nn.Sequential(
    nn.Linear(784, 128),  # Input layer: 784 features (e.g., MNIST images) -> 128 hidden units
    nn.ReLU(),            # Activation function
    nn.Dropout(0.2),      # Regularization: 20% dropout
    nn.Linear(128, 64),   # Hidden layer: 128 -> 64 units
    nn.ReLU(),            # Activation function
    nn.Linear(64, 10)     # Output layer: 64 -> 10 classes
)

print(model)

Sequential(
  (0): Linear(in_features=784, out_features=128, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.2, inplace=False)
  (3): Linear(in_features=128, out_features=64, bias=True)
  (4): ReLU()
  (5): Linear(in_features=64, out_features=10, bias=True)
)
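The printed structure also determines the number of trainable parameters: a `Linear(in, out)` layer holds `in * out` weights plus `out` biases, while `ReLU` and `Dropout` hold none. A minimal sketch that counts them, using a separate model named `mlp` (a name chosen here) with the same layer sizes:

```python
import torch.nn as nn

# Same architecture as above, under a different name to keep the example self-contained
mlp = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Per-layer parameter counts (layers without parameters report 0)
per_layer = []
for idx, layer in enumerate(mlp):
    n_params = sum(p.numel() for p in layer.parameters())
    per_layer.append(n_params)
    print(f"({idx}) {layer.__class__.__name__}: {n_params}")

total = sum(p.numel() for p in mlp.parameters())
print(f"Total trainable parameters: {total}")
```

By hand: $784 \cdot 128 + 128 = 100480$, $128 \cdot 64 + 64 = 8256$, and $64 \cdot 10 + 10 = 650$, for a total of 109386.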


### Forward Pass Calculation

Let's trace how data flows through this sequential model:

In [29]:
# Create a random input tensor (batch_size=1, features=784)
x = torch.randn(1, 784)

# Switch to evaluation mode so Dropout is disabled; in training mode it draws a
# new random mask on every call, and the two passes below would not match
model.eval()

# Manual forward pass through each layer to show the sequential nature
print(f"Input shape: {x.shape}")

# Layer 1: Linear + ReLU
z1 = model[0](x)  # Linear
print(f"After first linear layer: {z1.shape}")
a1 = model[1](z1)  # ReLU
print(f"After first activation: {a1.shape}")

# Layer 2: Dropout + Linear + ReLU
d1 = model[2](a1)  # Dropout (identity in evaluation mode)
z2 = model[3](d1)  # Linear
print(f"After second linear layer: {z2.shape}")
a2 = model[4](z2)  # ReLU
print(f"After second activation: {a2.shape}")

# Output layer: Linear
output = model[5](a2)  # Linear
print(f"Final output shape: {output.shape}")

# Verify that this matches the direct forward pass
direct_output = model(x)
print("\nDirect model output matches manual calculation:",
      torch.allclose(output, direct_output))

Input shape: torch.Size([1, 784])
After first linear layer: torch.Size([1, 128])
After first activation: torch.Size([1, 128])
After second linear layer: torch.Size([1, 64])
After second activation: torch.Size([1, 64])
Final output shape: torch.Size([1, 10])

Direct model output matches manual calculation: True
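A layer-by-layer pass and a direct call `model(x)` only agree when dropout is inactive. A newly constructed module starts in training mode, where `nn.Dropout` draws a fresh random mask on every call, so two forward passes over the same input generally differ. The sketch below isolates that behavior on a standalone dropout layer; the tensor size of 1000 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1, 1000)

drop.train()                      # training mode: elements are zeroed at random
train_a, train_b = drop(x), drop(x)

drop.eval()                       # evaluation mode: dropout is the identity
eval_a, eval_b = drop(x), drop(x)

print("train passes equal:", torch.equal(train_a, train_b))
print("eval passes equal: ", torch.equal(eval_a, eval_b))

# Surviving elements are rescaled by 1 / (1 - p) during training
print("surviving train values:", train_a[train_a > 0].unique())
```

Calling `model.eval()` before comparing outputs (and `model.train()` again before further training) avoids this source of mismatch.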


## Alternative Ways to Define Sequential Models in PyTorch

In [30]:
# Method 1: Using nn.Sequential with ordered dictionary
from collections import OrderedDict

model_ordered = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(784, 128)),
    ('relu1', nn.ReLU()),
    ('dropout', nn.Dropout(0.2)),
    ('fc2', nn.Linear(128, 64)),
    ('relu2', nn.ReLU()),
    ('fc3', nn.Linear(64, 10))
]))

print("Model with OrderedDict:")
print(model_ordered)

# Method 2: Using a class definition
class SimpleSequentialNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.sequential = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )
        
    def forward(self, x):
        return self.sequential(x)

model_class = SimpleSequentialNet()
print("\nModel as a class:")
print(model_class)

Model with OrderedDict:
Sequential(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (relu1): ReLU()
  (dropout): Dropout(p=0.2, inplace=False)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (relu2): ReLU()
  (fc3): Linear(in_features=64, out_features=10, bias=True)
)

Model as a class:
SimpleSequentialNet(
  (sequential): Sequential(
    (0): Linear(in_features=784, out_features=128, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.2, inplace=False)
    (3): Linear(in_features=128, out_features=64, bias=True)
    (4): ReLU()
    (5): Linear(in_features=64, out_features=10, bias=True)
  )
)
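One practical benefit of the `OrderedDict` form is that layers can be reached by name as well as by position, and the names carry over into parameter names. A short sketch with a reduced three-layer model (the variable name `named` and the layer sizes are chosen here for illustration):

```python
from collections import OrderedDict

import torch.nn as nn

named = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(784, 128)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(128, 10)),
]))

# Named layers are reachable as attributes and still by integer index
first_by_name = named.fc1
first_by_index = named[0]
print(first_by_name is first_by_index)

# The names also prefix the parameter names
param_names = [name for name, _ in named.named_parameters()]
print(param_names)
```

Descriptive names tend to make checkpoints and per-layer inspection easier to read than numeric indices such as `0.weight`.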


## Advantages and Limitations of Sequential Models

### Advantages
- **Simplicity**: Easy to define and understand
- **Maintainability**: Clear structure makes code maintenance easier
- **Efficiency**: Straightforward optimization for simple architectures

### Limitations
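- **Restricted Topology**: Only supports linear layer stacking; each module receives exactly the output of the previous one
- **No Branching**: The container itself cannot express skip connections or parallel paths; these require a custom `nn.Module`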
- **Limited Layer Reuse**: Difficult to reuse layers or weights
- **No Conditional Logic**: Cannot incorporate dynamic behavior based on inputs
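A skip connection needs the layer input twice (once for the inner layers and once for the addition), which a plain chain of layers does not provide. One common workaround is to write the branching part as a small custom module. The class name `ResidualBlock` and the sizes below are hypothetical and serve only as an illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hypothetical block computing x + F(x); the skip path bypasses the inner layers."""

    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The input is used twice: once by self.body and once in the addition
        return x + self.body(x)

# The block can then serve as a single "layer" inside a larger sequential stack
net = nn.Sequential(
    nn.Linear(784, 64),
    ResidualBlock(64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

out = net(torch.randn(8, 784))
print(out.shape)
```

The branching lives in the block's `forward` method, while the outer model keeps its simple sequential form.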

## Training a Sequential Model

Below is a complete example of training a sequential model on the MNIST dataset:

In [31]:
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Define transformations for MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset (if available)
try:
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST('./data', train=False, transform=transform)
    
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=1000)
    
    # Create sequential model
    model = nn.Sequential(
        nn.Flatten(),  # Flatten 28x28 images to 784 features
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 10)
    )
    
    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    def train(model, train_loader, criterion, optimizer, epochs=3):
        model.train()
        for epoch in range(epochs):
            running_loss = 0.0
            for batch_idx, (data, target) in enumerate(train_loader):
                # Zero the parameter gradients
                optimizer.zero_grad()
                
                # Forward pass
                output = model(data)
                loss = criterion(output, target)
                
                # Backward pass and optimize
                loss.backward()
                optimizer.step()
                
                # Print statistics
                running_loss += loss.item()
                if batch_idx % 100 == 99:    # Print every 100 mini-batches
                    print(f'Epoch {epoch+1}, Batch {batch_idx+1}: Loss = {running_loss/100:.4f}')
                    running_loss = 0.0
                    
    def test(model, test_loader):
        model.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for data, target in test_loader:
                output = model(data)
                # criterion returns the batch mean, so weight it by the batch size
                test_loss += criterion(output, target).item() * data.size(0)
                pred = output.argmax(dim=1, keepdim=True)  # Index of the largest logit
                correct += pred.eq(target.view_as(pred)).sum().item()

        test_loss /= len(test_loader.dataset)
        accuracy = 100. * correct / len(test_loader.dataset)
        print(f'\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({accuracy:.2f}%)\n')
        
    # Train the model (uncomment to run)
    # train(model, train_loader, criterion, optimizer, epochs=3)
    # test(model, test_loader)
    
    print("Model and data ready for training. Uncomment the train() and test() calls to run training.")
    
except Exception as e:
    print(f"Couldn't load MNIST dataset: {e}\nThis is expected if running offline without dataset.")
    print("The code structure is preserved for reference.")

Model and data ready for training. Uncomment the train() and test() calls to run training.
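The same loop structure can be exercised without downloading MNIST. The sketch below uses random tensors as a stand-in dataset (an assumption made purely for a quick smoke test; it says nothing about accuracy on real digits), and every `toy_*` name is introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for MNIST: random "images" with random labels
toy_images = torch.randn(256, 1, 28, 28)
toy_labels = torch.randint(0, 10, (256,))
toy_loader = DataLoader(TensorDataset(toy_images, toy_labels),
                        batch_size=64, shuffle=True)

toy_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.001)

epoch_losses = []
toy_model.train()
for epoch in range(10):
    total = 0.0
    for data, target in toy_loader:
        toy_optimizer.zero_grad()
        loss = toy_criterion(toy_model(data), target)
        loss.backward()
        toy_optimizer.step()
        total += loss.item() * data.size(0)  # undo the per-batch mean
    epoch_losses.append(total / len(toy_loader.dataset))

print(f"first epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```

Because the labels are random, a falling loss here only shows that the network is memorizing the toy data, which is enough to confirm that the forward pass, backward pass, and optimizer step are wired together correctly.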


## Mathematical Analysis: Gradients in Sequential Models

The backpropagation algorithm in a sequential model follows a clear path from the output layer back to the input layer. For a loss function $\mathcal{L}$ and a sequential model with $L$ layers, the gradient with respect to the weights $W_i$ of layer $i$ follows from the chain rule:

$$\frac{\partial \mathcal{L}}{\partial W_i} = \frac{\partial \mathcal{L}}{\partial x^{(L)}} \cdot \frac{\partial x^{(L)}}{\partial x^{(L-1)}} \cdots \frac{\partial x^{(i+1)}}{\partial x^{(i)}} \cdot \frac{\partial x^{(i)}}{\partial W_i}$$

Where $x^{(i)}$ is the output of layer $i$. This demonstrates the sequential nature of gradient computation in these models.
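The chain-rule product can be checked numerically on a tiny two-layer network by multiplying the factors by hand and comparing against autograd. The sizes, the `tanh` hidden activation, the linear output layer, and the squared-error loss are all assumptions chosen to keep the derivatives short.

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(4)                       # input x^(0)
y = torch.randn(2)                        # regression target
W1 = torch.randn(3, 4, requires_grad=True)
b1 = torch.randn(3, requires_grad=True)
W2 = torch.randn(2, 3, requires_grad=True)
b2 = torch.randn(2, requires_grad=True)

# Forward pass: x^(1) = tanh(W1 x^(0) + b1), x^(2) = W2 x^(1) + b2
x1 = torch.tanh(W1 @ x0 + b1)
x2 = W2 @ x1 + b2
loss = 0.5 * ((x2 - y) ** 2).sum()
loss.backward()                           # autograd reference gradients

# Manual backward pass, one chain-rule factor at a time
with torch.no_grad():
    g_x2 = x2 - y                         # dLoss/dx^(2)
    g_W2 = torch.outer(g_x2, x1)          # dLoss/dW2
    g_x1 = W2.T @ g_x2                    # dLoss/dx^(1)
    g_z1 = g_x1 * (1 - x1 ** 2)           # through tanh: tanh'(z) = 1 - tanh(z)^2
    g_W1 = torch.outer(g_z1, x0)          # dLoss/dW1

w1_matches = torch.allclose(g_W1, W1.grad, atol=1e-5)
w2_matches = torch.allclose(g_W2, W2.grad, atol=1e-5)
print(w1_matches, w2_matches)
```

The gradient for `W1` reuses `g_x2`, the quantity already computed for `W2`, which is why backpropagation proceeds from the last layer toward the first.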

When using activation functions like ReLU:

$$\frac{\partial \text{ReLU}(z)}{\partial z} = \begin{cases}
1 & \text{if } z > 0 \\
0 & \text{if } z \leq 0
\end{cases}$$

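ReLU supplies the non-linearity that lets the network learn complex patterns while keeping the simple sequential structure. In the chain-rule product above, its derivative acts as a gate: gradients pass unchanged through units with $z > 0$ and are blocked at units with $z \leq 0$. (At $z = 0$ the derivative is undefined; taking it to be 0 there is a convention.)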
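The piecewise derivative can be observed directly with autograd. A minimal sketch, with sample values of `z` chosen here to include negative, zero, and positive entries:

```python
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)

# Summing gives a scalar whose gradient with respect to z is the elementwise ReLU derivative
torch.relu(z).sum().backward()

print(z.grad)  # tensor([0., 0., 0., 1., 1.])
```

PyTorch returns 0 at $z = 0$, matching the case split above.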

## Conclusion

Sequential models provide an intuitive and straightforward approach to building neural networks. They are ideal for problems where the data can be processed in a linear fashion, from input to output. Their mathematical simplicity makes them easier to understand, implement, and debug. However, their linear structure can be limiting for complex architectures requiring skip connections or branches.

For more complex architectures, we'll need to move beyond the sequential model to functional and subclassing approaches, which we'll explore in subsequent sections.

## Exercises

1. **Basic**: Explain how data flows through a sequential model both mathematically and conceptually.

2. **Intermediate**: Implement a sequential model for digit classification on MNIST that achieves at least 97% accuracy.

3. **Advanced**: Compare and contrast the performance of different sequential architectures with varying depths and widths. Analyze how changing layer sizes affects model performance.

4. **Practical**: Design a sequential model for a regression task on a dataset of your choice. Implement early stopping to prevent overfitting.