<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Architectures/vgg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VGG: A Deep Convolutional Neural Network Architecture

## Introduction

VGG (Visual Geometry Group) is a deep convolutional neural network architecture created by the Visual Geometry Group at the University of Oxford. It was introduced by Karen Simonyan and Andrew Zisserman in their 2014 paper "Very Deep Convolutional Networks for Large-Scale Image Recognition." At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, the team's submission took first place in the localization task and second place in the classification task.

VGG is notable for its simplicity and uniform architecture, using only 3×3 convolutional layers stacked on top of each other with increasing depth.

## Historical Importance

VGG made several important contributions to the field of deep learning and computer vision:

1. Demonstrated that network depth is crucial for good performance
2. Showed the effectiveness of using small 3×3 filters throughout the entire network
3. Provided evidence that simple, homogeneous architectures can achieve state-of-the-art results
4. Became a popular feature extractor for many computer vision tasks beyond classification

Following VGG, many architectures continued to explore deeper networks, ultimately leading to designs like ResNet, which introduced skip connections to make much deeper networks trainable.

## Architecture Overview

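The VGG architecture comes in several variants, with VGG16 and VGG19 being the most common. The number in each name is the count of weight layers, that is, convolutional plus fully connected layers; pooling layers have no weights and are not counted.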
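
As a rough illustration of where those numbers come from, the sketch below counts weight layers from a per-variant layer list. The lists are meant to mirror configurations A, B, D, and E from the paper; the names `VGG_CONFIGS`, `NUM_FC_LAYERS`, and `count_weight_layers` are made up for this example, and `"M"` is simply a marker for a max-pooling layer.

```python
# Convolutional layout of each variant: integers are the output channels of a
# 3x3 conv layer, "M" marks a 2x2 max-pooling layer (which has no weights).
VGG_CONFIGS = {
    "VGG11": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "VGG13": [64, 64, "M", 128, 128, "M", 256, 256, "M",
              512, 512, "M", 512, 512, "M"],
    "VGG16": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"],
    "VGG19": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
              512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

NUM_FC_LAYERS = 3  # every variant ends with the same three fully connected layers


def count_weight_layers(config):
    """Return (conv_layers, total_weight_layers) for one layer list."""
    conv_layers = sum(1 for item in config if item != "M")
    return conv_layers, conv_layers + NUM_FC_LAYERS


for name, config in VGG_CONFIGS.items():
    conv, total = count_weight_layers(config)
    print(f"{name}: {conv} conv + {NUM_FC_LAYERS} FC = {total} weight layers")
```

All four variants share the same five-block structure and the same classifier; they differ only in how many convolutions each block contains.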

### Key architectural features of VGG16:

1. **Input**: 224×224×3 RGB images
2. **Convolutional Layers**: All use 3×3 filters with stride 1 and same padding
3. **Pooling Layers**: 2×2 max pooling with stride 2 (no overlap)
4. **Network Depth**: 16 weight layers (13 convolutional + 3 fully connected)
5. **Architecture Pattern**: Blocks of convolutional layers followed by max pooling layers
6. **Fully Connected Layers**: Three FC layers at the end (4096, 4096, 1000 neurons)
7. **Output Layer**: 1000-way softmax (for ImageNet's 1000 classes)
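
To make the interaction between same-padded convolutions and stride-2 pooling concrete, here is a small sketch that traces the spatial size of a 224×224 input through the five blocks. It uses only the standard output-size formula; the helper names (`conv_output_size`, `pool_output_size`, `trace_shapes`) and the `VGG16_BLOCKS` list are illustrative, not part of any library.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - kernel) // stride + 1


# (number of conv layers, output channels) for the five VGG16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_shapes(input_size=224):
    """Return the (height/width, channels) pair after each block's pooling."""
    size, shapes = input_size, []
    for num_convs, channels in VGG16_BLOCKS:
        for _ in range(num_convs):
            size = conv_output_size(size)  # 3x3, stride 1, padding 1 keeps the size
        size = pool_output_size(size)      # 2x2, stride 2 halves it
        shapes.append((size, channels))
    return shapes


shapes = trace_shapes()
for block, (size, channels) in enumerate(shapes, start=1):
    print(f"after block {block}: {size}x{size}x{channels}")

final_size, final_channels = shapes[-1]
print("flattened features:", final_size * final_size * final_channels)
```

The final 7×7×512 volume is why the first fully connected layer takes 7 × 7 × 512 = 25,088 input features.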

### Notable design principles in VGG:

- **Consistent Filter Size**: Use of small 3×3 filters throughout the network
- **Stacking Small Filters**: Multiple 3×3 filters have the same effective receptive field as larger filters (e.g., two 3×3 filters have a 5×5 receptive field) but with fewer parameters
- **Increasing Feature Maps**: Number of feature maps increases as the spatial dimensions decrease
- **ReLU Activation**: Used after every convolutional layer
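
The stacking argument can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; `stacked_receptive_field`, `conv_weights`, and the channel width `C` are names chosen for this example only.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weight count (biases ignored) for conv layers with `channels` in and out."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # example channel width; the ratios do not depend on this value

for num_layers, big_kernel in [(2, 5), (3, 7)]:
    rf = stacked_receptive_field(num_layers)
    small = conv_weights(3, C, num_layers)
    large = conv_weights(big_kernel, C)
    print(f"{num_layers} x 3x3: receptive field {rf}x{rf}, {small:,} weights "
          f"vs one {big_kernel}x{big_kernel}: {large:,} weights "
          f"({small / large:.0%})")
```

With C channels, two 3×3 layers cost 18C² weights against 25C² for a single 5×5 layer, and three 3×3 layers cost 27C² against 49C² for a single 7×7 layer. The stack also applies a ReLU after each layer, so it is more non-linear than the single large filter it replaces.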

## Implementation with PyTorch

Let's implement the VGG16 architecture using PyTorch:

In [None]:
# Install required packages if needed
!pip install torch torchvision matplotlib numpy

In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

# Check if CUDA is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

In [None]:
# Define the VGG16 model
class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super(VGG16, self).__init__()
        
        # Block 1: 64 channels
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # Block 2: 128 channels
        self.block2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # Block 3: 256 channels
        self.block3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # Block 4: 512 channels
        self.block4 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # Block 5: 512 channels
        self.block5 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # Classifier (Fully Connected layers)
        self.classifier = nn.Sequential(
            nn.Linear(7 * 7 * 512, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes)
        )
        
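        # Initialize weights: Kaiming-normal for conv layers and N(0, 0.01) for
        # linear layers (the scheme torchvision uses; the original paper instead
        # pre-trained shallower configurations to initialize deeper ones)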
        self._initialize_weights()
        
    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.block4(x)
        x = self.block5(x)
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = self.classifier(x)
        return x
    
    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

# Create an instance of VGG16
model = VGG16(num_classes=1000).to(device)
print(model)

## Using Pre-Trained VGG from torchvision

In [None]:
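# Load pre-trained VGG16 model from torchvision
# (the `weights` argument replaces the deprecated `pretrained=True` in torchvision >= 0.13)
pretrained_model = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)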
pretrained_model.eval()  # Set to evaluation mode
pretrained_model = pretrained_model.to(device)

# Load ImageNet class labels
import urllib.request

# Download ImageNet class labels if needed
try:
    url = "https://raw.githubusercontent.com/pytorch/examples/master/imagenet/imagenet_classes.txt"
    with urllib.request.urlopen(url) as response:
        classes = [line.decode('utf-8').strip() for line in response.readlines()]
except Exception as e:
    # Fall back to placeholder names if the download fails
    print(f"Could not download class labels ({e}); using placeholder names.")
    classes = [f"Class_{i}" for i in range(1000)]

## Image Classification with Pre-trained VGG

In [None]:
from PIL import Image
from torchvision import transforms

# Define image preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Function to make predictions on an image
def predict_image(image_path):
    # Load and preprocess the image (convert to RGB so grayscale/RGBA files also work)
    img = Image.open(image_path).convert('RGB')
    img_t = preprocess(img)
    batch_t = torch.unsqueeze(img_t, 0).to(device)
    
    # Make a prediction
    with torch.no_grad():
        output = pretrained_model(batch_t)
    
    # Get the top 5 predictions as (class name, probability in %) pairs
    percentages = torch.nn.functional.softmax(output, dim=1)[0] * 100
    top_probs, top_indices = torch.topk(percentages, k=5)
    results = [(classes[idx], prob) for idx, prob in zip(top_indices.tolist(), top_probs.tolist())]
    
    # Display the image
    plt.figure(figsize=(8, 6))
    plt.imshow(img)
    plt.axis('off')
    plt.title("Top predictions:")
    
    # Display the top 5 predictions
    for i, (cls, prob) in enumerate(results):
        plt.text(5, 30 + i*20, f"{cls}: {prob:.2f}%", fontsize=12, 
                 bbox=dict(facecolor='white', alpha=0.8))
    
    plt.tight_layout()
    plt.show()

# To use this function:
# predict_image('path/to/your/image.jpg')

## Visualizing VGG Filters

In [None]:
def visualize_filters(layer_name='features.0'):
    """
    Visualize filters from a specific convolutional layer
    layer_name: Name of the layer (e.g., 'features.0' for first conv layer)
    """
    # Get the layer by name
    layer = dict(pretrained_model.named_modules())[layer_name]
    
    if not isinstance(layer, nn.Conv2d):
        print(f"Layer {layer_name} is not a convolutional layer.")
        return
    
    # Get the filter weights
    filters = layer.weight.data.cpu().numpy()
    
    # Number of filters
    num_filters = filters.shape[0]
    n_cols = 8  # Number of columns in the grid
    n_rows = num_filters // n_cols + (1 if num_filters % n_cols != 0 else 0)
    
    # Create figure for all filters (squeeze=False keeps `axes` 2-D even for one row)
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 2), squeeze=False)
    
    for i, ax in enumerate(axes.flat):
        if i < num_filters:
            # Normalize the filter to [0, 1] for better visualization
            filt = filters[i].transpose(1, 2, 0)
            filt = (filt - filt.min()) / (filt.max() - filt.min() + 1e-5)
            if filt.shape[2] != 3:
                # Deeper layers have more than 3 input channels and cannot be
                # shown as RGB, so display the mean over input channels instead
                filt = filt.mean(axis=2)
            ax.imshow(filt)
            ax.set_title(f'Filter {i}')
        # Hide axes for both used and unused grid cells
        ax.axis('off')
                
    plt.tight_layout()
    plt.suptitle(f"Layer {layer_name} Filters", fontsize=16)
    plt.subplots_adjust(top=0.92)
    plt.show()

# Visualize the first convolutional layer filters (64 filters)
visualize_filters('features.0')

## Feature Map Visualization

In [None]:
def visualize_feature_maps(image_path, layer_name='features.1'):
    """
    Visualize feature maps produced by a specific layer
    layer_name: Name of the layer to visualize
    """
    # Load and preprocess image
    img = Image.open(image_path).convert('RGB')  # ensure 3 channels for the model
    img_t = preprocess(img)
    batch_t = torch.unsqueeze(img_t, 0).to(device)
    
    # Create a hook to capture feature maps
    feature_maps = []
    
    def hook_fn(module, input, output):
        # Clone so a following in-place ReLU cannot overwrite the captured output
        feature_maps.append(output.detach().cpu().clone())
    
    # Get the layer by name
    layer = dict(pretrained_model.named_modules())[layer_name]
    
    # Register the hook
    hook = layer.register_forward_hook(hook_fn)
    
    # Forward pass
    with torch.no_grad():
        pretrained_model(batch_t)
    
    # Remove the hook
    hook.remove()
    
    # Get feature maps
    feature_map = feature_maps[0][0]
    
    # Plot the original image in its own figure
    plt.figure(figsize=(5, 5))
    plt.imshow(img)
    plt.title('Original Image')
    plt.axis('off')
    
    # Plot feature maps
    n = min(16, feature_map.size(0))  # Display up to 16 feature maps
    fig = plt.figure(figsize=(15, 15))
    
    for i in range(n):
        a = fig.add_subplot(4, 4, i+1)
        img_map = feature_map[i].numpy()
        img_map = (img_map - img_map.min()) / (img_map.max() - img_map.min() + 1e-5)
        plt.imshow(img_map, cmap='viridis')
        plt.axis('off')
        a.set_title(f'Feature Map {i}')
        
    plt.tight_layout()
    plt.suptitle(f'Feature Maps from {layer_name}', fontsize=20)
    plt.subplots_adjust(top=0.93)
    plt.show()

# To use this function:
# visualize_feature_maps('path/to/your/image.jpg', 'features.1')

## Comparing Different VGG Models

In [None]:
def compare_vgg_models():
    """
    Compare VGG11, VGG13, VGG16, and VGG19 models in terms of parameters and layers
    """
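    # weights=None builds randomly initialized models (replaces the deprecated
    # pretrained=False); no download is needed just to count layers and parameters
    models = {
        'VGG11': torchvision.models.vgg11(weights=None),
        'VGG13': torchvision.models.vgg13(weights=None),
        'VGG16': torchvision.models.vgg16(weights=None),
        'VGG19': torchvision.models.vgg19(weights=None)
    }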
    
    # Compare parameters
    print("VGG Model Comparison:")
    print("-" * 60)
    print(f"{'Model':<10} {'Total Parameters':<20} {'Conv Layers':<15} {'FC Layers':<15}")
    print("-" * 60)
    
    for name, model in models.items():
        # Count parameters
        total_params = sum(p.numel() for p in model.parameters())
        
        # Count convolutional layers
        conv_layers = sum(1 for m in model.features.modules() if isinstance(m, nn.Conv2d))
        
        # Count fully connected layers
        fc_layers = sum(1 for m in model.classifier.modules() if isinstance(m, nn.Linear))
        
        print(f"{name:<10} {total_params:<20,d} {conv_layers:<15} {fc_layers:<15}")
    
    print("-" * 60)

compare_vgg_models()

## VGG Performance and Historical Context

### Performance on ImageNet

| Model | Top-1 Accuracy | Top-5 Accuracy | Parameters |
|-------|----------------|----------------|------------|
| AlexNet (2012) | 57.1% | 80.2% | 60M |
| VGG16 (2014) | 71.3% | 90.1% | 138M |
| VGG19 (2014) | 71.6% | 90.3% | 144M |
| ResNet-50 (2015) | 76.0% | 92.9% | 25M |

### Impact and Legacy

VGG's impact on deep learning and computer vision has been significant:

1. **Simplicity**: Demonstrated that a simple, uniform architecture could achieve excellent results
2. **Transfer Learning**: Became a popular feature extractor for many downstream tasks
3. **Depth Study**: Provided empirical evidence that increasing network depth, up to the 16-19 weight layers studied in the paper, improves accuracy
4. **Small Filter Design**: Showed the benefits of using small 3×3 filters throughout the network
5. **Feature Visualization**: Its simple structure made it easier to visualize and understand learned features

### Limitations

VGG also has several notable limitations:

- Very large number of parameters (138M for VGG16), making it memory-intensive
- Computationally expensive at inference time
- No batch normalization in the original architecture
- Prone to overfitting due to the large number of parameters
- Limited depth compared to more modern architectures like ResNet
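
To see where the memory and compute actually go, the following back-of-the-envelope sketch counts parameters and multiply-accumulate operations (MACs) for VGG16 on a 224×224 input. It assumes the layer sizes given earlier in this notebook, counts only convolutional and fully connected layers, and ignores ReLU, pooling, and dropout; `VGG16_CONFIG`, `FC_SIZES`, and `vgg_cost` are names invented for this example.

```python
# VGG16 layer list: integers are 3x3 conv output channels, "M" is 2x2 max pooling.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]


def vgg_cost(config, input_size=224, in_channels=3):
    """Return parameter and MAC counts, split into conv and FC parts."""
    size, channels = input_size, in_channels
    conv_params = conv_macs = 0
    for item in config:
        if item == "M":
            size //= 2  # pooling halves the spatial size and has no weights
            continue
        weights = 3 * 3 * channels * item
        conv_params += weights + item       # weights plus one bias per filter
        conv_macs += weights * size * size  # each filter is applied at every position
        channels = item

    fc_params = fc_macs = 0
    features = size * size * channels       # flattened input to the classifier
    for out_features in FC_SIZES:
        fc_params += features * out_features + out_features
        fc_macs += features * out_features
        features = out_features

    return {"conv_params": conv_params, "fc_params": fc_params,
            "conv_macs": conv_macs, "fc_macs": fc_macs}


cost = vgg_cost(VGG16_CONFIG)
total_params = cost["conv_params"] + cost["fc_params"]
total_macs = cost["conv_macs"] + cost["fc_macs"]
print(f"parameters: {total_params:,} "
      f"({cost['fc_params'] / total_params:.0%} in FC layers)")
print(f"MACs per image: {total_macs:,} "
      f"({cost['conv_macs'] / total_macs:.0%} in conv layers)")
```

Under these assumptions the two costs sit in different places: most of the roughly 138M parameters belong to the three fully connected layers, while nearly all of the computation happens in the convolutional layers, which operate on large feature maps.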

## Conclusion

VGG represents an important evolutionary step in the development of convolutional neural networks. Its simple, uniform design principles made it both effective and easy to understand, while its strong performance solidified the importance of network depth in convolutional architectures. Although newer architectures have since surpassed VGG in terms of accuracy and efficiency, VGG's impact on the field remains significant, and its feature extractors continue to be used in various computer vision applications to this day.