# Image classification

# Fashion MNIST Dataset

The dataset is designed for machine learning classification tasks and contains in total 60 000 training and 10 000 test images (gray scale) with each 28x28 pixel. Each training and test case is associated with one of ten labels (0–9).

<img src="./img/comp_vision/img_classification/Fashion_MNIST.png" alt="nearby_objects" width="800"/>

# NNs Architectures for the imgs classification:

# AlexNet 


Background
Motivation
What problems AlexNet solve?
Architecture
Summary



- This was the first architecture that used GPU to boost the training performance.

- AlexNet consists of 5 convolution layers, 3 max-pooling layers, 2 Normalized layers, 2 fully connected layers and 1 SoftMax layer.

- Each convolution layer consists of a convolution filter and a non-linear activation function called “ReLU”.

- The pooling layers are used to perform the max-pooling function and the input size is fixed due to the presence of fully connected layers. 

- input size is mentioned at most of the places as 224x224x3 but due to some padding which happens it works out to be 227x227x3. Above all this AlexNet has over 60 million parameters.

#### Key Features:

- ‘ReLU’ is used as an activation function rather than ‘tanh’
- Batch size of 128
- SGD Momentum is used as a learning algorithm
- Data Augmentation is been carried out like flipping, jittering, cropping, colour normalization, etc.

<img src="./img/comp_vision/img_classification/AlexNet_Architecture.png" alt="nearby_objects" width="800"/>

<img src="./img/comp_vision/img_classification/AlexNet_architecture_layers.png" alt="nearby_objects" width="800"/>

- **Input data:** This is the raw input that feeds into the network. For AlexNet, the input size is typically a 227x227 pixel image with 3 channels (corresponding to RGB color channels).

- **Conv1:** (Convolutional Layer 1): This layer performs convolution operations that involve filtering the input image with learned kernels to create a feature map. The dimensions 55x55x96 indicate that this layer produces 96 separate feature maps, each of size 55x55. This layer detects simple features like edges and corners.

- **Conv2:** (Convolutional Layer 2): Another convolutional layer that takes the output from Conv1 and applies a different set of filters to detect more complex features. It outputs 256 feature maps, each of size 27x27.

- **Conv3:** (Convolutional Layer 3): This is a deeper convolutional layer in the network, which works on even more abstract features. With 384 feature maps, each 13x13, this layer allows the network to begin understanding more complex structures within the image.

- **Conv4:** (Convolutional Layer 4): Similar to Conv3, but typically the filters in this layer are more specialized to detect high-level features. It outputs the same number of feature maps (384) but with a different set of filters.

- **Conv5:** (Convolutional Layer 5): This is usually the last convolutional layer and outputs 256 feature maps of size 13x13. It is directly connected to the fully connected layers and often captures the highest-level features.

beginning the process of classifying the image based on the features extracted by the convolutional layers. Each neuron in this layer is connected to all the activations in the previous layer, thus it's called fully connected.

- **FC6:** (Fully Connected Layer 6): A dense layer with 4096 units that takes all the feature maps from Conv5 and flattens them into a single vector. This layer is responsible for

- **FC7:** (Fully Connected Layer 7): This is another dense layer with 4096 units, which continues the classification process. It takes the output from FC6 and further processes it, potentially learning even more abstract representations of the image features.

FC6 and FC7 have the same size of 4096 units to provide sufficient capacity for complex feature representation and to allow for a deeper hierarchy of learned features, which was empirically found to be effective for image classification tasks.

- **FC8:** (Fully Connected Layer 8): This is the final fully connected layer and usually represents the output layer of the network. In the case of AlexNet, it typically has 1000 units, each corresponding to a different class in a classification task. The network will output a probability distribution across these 1000 classes, indicating how likely the input image belongs to each class. Typically connects to a softmax or logistic regression output layer


<img src="./img/comp_vision/img_classification/AlexNet_architecture_layers2.png" alt="nearby_objects" width="800"/>

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Subset
import numpy as np
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()


# Define the AlexNet Model
class AlexNet(nn.Module):        
    """_summary_
    # Input size is 227x227x3 as per the architecture   
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0),  # First Convolutional Layer
    
    1 First parameter (in_channels=3): This specifies the number of channels in the input image. 
        For RGB images, this is 3, which means there are three channels for red, green, and blue.

    2 Second parameter (out_channels=96): This is the number of filters that will be applied to the input image. 
        It also represents the number of feature maps that will be produced by the convolution. In this case, 
        96 filters will generate 96 separate feature maps, meaning the layer will output a tensor with 96 channels.

    3 kernel_size=11: This defines the size of the filter (also known as the kernel) that will be used to perform the convolution. 
        In this case, the filter will be 11 pixels by 11 pixels. 
        The kernel size affects the area of the input image that is used to compute each element of the output feature map.

    4 stride=4: The stride is the number of pixels by which the filter moves across the input image. 
        A stride of 4 means the filter jumps 4 pixels at a time as it moves across the image. 
        This results in downsampling the output feature map by a factor of 4, making it smaller than the input image.

    5 padding=0: Padding adds zeros to the border of the input image. This is used to control the spatial size of the output feature maps.
        A padding of 0 means no padding is applied, and the spatial dimensions of the output will be reduced compared to the input dimensions. 
        Padding can also be used to ensure that the output feature map has the same spatial dimensions as the input image (called 'same' padding),
        but in this case, since the padding is 0, the output will be smaller.    
    """
    def __init__(self, num_classes=10):  # Default is 10 classes as per the architecture provided
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            # Input size is 227x227x1 as per the architecture One channel == Fashion Dataset                                   
            nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=0),  # First Convolutional Layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2),  # Second Convolutional Layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),  # Third Convolutional Layer
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),  # Fourth Convolutional Layer
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),  # Fifth Convolutional Layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))  # Adaptive pooling to make it size-independent
        self.classifier = nn.Sequential(
            nn.Dropout(),
            # CF 6 
            nn.Linear(256 * 6 * 6, 4096),  # First Fully Connected Layer
            nn.ReLU(inplace=True),
            nn.Dropout(),
            # CF 7
            nn.Linear(4096, 4096),  # Second Fully Connected Layer
            nn.ReLU(inplace=True),
            # FC 8 input features from the left hand side hiddel layer 4096, uotputs classes -> FashionDataset = num_classes 
            nn.Linear(4096, num_classes),  # 
            nn.Softmax(dim=1)  # Softmax activation as per the architecture
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x



# Load and preprocess the dataset
def load_dataset():
    
    transform = transforms.Compose([
    #VGG16 architecture is designed for images of size 224x224
    transforms.Resize((327, 327)),  # Resize the image to 224x224
    transforms.ToTensor()
    ])
    
    
    full_trainset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
    full_testset = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)
    train_indices = np.arange(1000) 
    test_indices = np.arange(100)  
    trainset = Subset(full_trainset, train_indices)
    testset = Subset(full_testset, test_indices)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
    testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
    return trainloader, testloader

trainloader, testloader = load_dataset()

# Define training process
def train_model(model, trainloader, criterion, optimizer, num_epochs=10):
    for epoch in range(num_epochs):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        logger.info(f"Epoch {epoch+1}, Loss: {running_loss/len(trainloader)}")
    logger.info('Finished Training')

# Define testing process
def test_model(model, testloader):
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    logger.info('Accuracy of the network on test images: %d %%' % (100 * correct / total))

# Initialize the AlexNet model
model = AlexNet()

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Train the model
train_model(model, trainloader, criterion, optimizer, num_epochs=10)

# Test the model
test_model(model, testloader)

2024-01-15 21:21:57,995 - INFO - Epoch 1, Loss: 2.302598403930664
2024-01-15 21:22:39,325 - INFO - Epoch 2, Loss: 2.302577648162842
2024-01-15 21:23:24,046 - INFO - Epoch 3, Loss: 2.3025367193222044
2024-01-15 21:24:11,448 - INFO - Epoch 4, Loss: 2.302519413948059
2024-01-15 21:24:56,798 - INFO - Epoch 5, Loss: 2.302498456954956
2024-01-15 21:25:47,228 - INFO - Epoch 6, Loss: 2.3024621686935425
2024-01-15 21:26:28,027 - INFO - Epoch 7, Loss: 2.3024250869750977
2024-01-15 21:27:06,881 - INFO - Epoch 8, Loss: 2.3024126749038696
2024-01-15 21:27:47,419 - INFO - Epoch 9, Loss: 2.3023795166015626
2024-01-15 21:28:28,214 - INFO - Epoch 10, Loss: 2.302334545135498
2024-01-15 21:28:28,215 - INFO - Finished Training
2024-01-15 21:28:31,273 - INFO - Accuracy of the network on test images: 11 %


# VGGNet (Visual Geometry Group (VGG))


- The VGGNet, developed by the Visual Geometry Group (VGG) at the University of Oxford, is primarily used for image classification.

- Visual Geometry Group (VGG) was created to improve the model's performance by increasing the depth of such CNNs.

<img src="./img/comp_vision/img_classification/VGG1.png" alt="nearby_objects" width="600"/>

**Input:** The input layer typically receives the raw pixel data of the image, which in this case seems to be a 224x224 pixel image with 64 channels.

**Convolutional Layers (Conv):** These layers perform convolution operations to extract features from the input image. They

use filters (or kernels) to capture local patterns such as edges, textures, and gradients within the image. Each convolutional layer applies a number of these filters to the image or to the output of previous layers, and the depth of each convolutional block (e.g., depth-64, depth-128) represents the number of unique filters used, producing a corresponding number of feature maps.

- conv1_1, conv1_2: These are the first convolutional layers with a depth of 64, using 3x3 filters to capture basic patterns.

- conv2_1, conv2_2: Increase the depth to 128, allowing the network to capture more complex patterns.

- conv3_1 to conv3_4: The depth is further increased to 256. More layers mean the network can build more complex features from simpler ones captured in earlier layers.

- conv4_1 to conv4_4: Similar to conv3, but with a depth of 512, allowing even more complex features to be captured.

- conv5_1 to conv5_4: These are the final convolutional layers before the network transitions to fully connected
layers. They also have a depth of 512 and continue to refine the high-level features extracted from the image.

**Pooling Layers (Maxpool):** These layers follow some of the convolutional blocks and perform downsampling to reduce the spatial dimensions of the feature maps. This helps to reduce the amount of computation needed and also makes the features somewhat invariant to small translations of the input image. Max pooling layers typically select the maximum value from a small neighborhood (like 2x2 pixels) to represent the region.


**Fully Connected Layers (FC):** After several layers of convolutions and pooling, the network uses fully connected layers to perform classification based on the features extracted. Each neuron in a fully connected layer has connections to all activations in the previous layer. In this network:


**FC1:** The first fully connected layer has 4096 neurons, which take the flattened output of the last max pooling layer as input.

**FC2:** The second fully connected layer also has 4096 neurons and continues the process of interpreting the extracted features.
Size=1000 (Softmax): The final layer has 1000 neurons, each corresponding to a class in the dataset (assuming this network was designed for a 1000-class classification task like ImageNet). The softmax function is applied to this layer to obtain a probability distribution over the 1000 classes.

<img src="./img/comp_vision/img_classification/VGG2.png" alt="nearby_objects" width="800"/>

<img src="./img/comp_vision/img_classification/VGG3.png" alt="nearby_objects" width="800"/>

In [29]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Subset
import torch.nn as nn
import torch.optim as optim
import numpy as np
import logging
import certifi
import ssl

# Set SSL certificate
ssl._create_default_https_context = ssl._create_unverified_context
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())

# Setup logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load and preprocess the dataset
def load_dataset():
    transform = transforms.Compose([
    #VGG16 architecture is designed for images of size 224x224
    transforms.Resize((224, 224)),  # Resize the image to 224x224
    transforms.ToTensor()
    ])
    
    
    full_trainset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
    full_testset = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)
    train_indices = np.arange(100) 
    test_indices = np.arange(10)  
    trainset = Subset(full_trainset, train_indices)
    testset = Subset(full_testset, test_indices)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
    testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
    return trainloader, testloader

import torch.nn as nn
import torch.nn.functional as F
# Define the VGGNet class, which inherits from nn.Module, the base class for all neural network modules in PyTorch.
class VGGNet(nn.Module):
    """
    Implementation of the VGGNet architecture for image classification.

    VGGNet is characterized by its simplicity, using only 3x3 convolutional layers stacked on top of each other in increasing depth. 
    Reducing volume size is handled by max pooling. Two fully connected layers, each with 4096 nodes are then followed by a softmax 
    classifier (output layer).

    There are several versions of VGGNet. This one seems to be VGG16, which has 16 layers that have weights.
    """
    
    def __init__(self, num_classes=1000):
        """
        Initialize the VGGNet model.
        
        Parameters:
            num_classes (int): number of classes for the final softmax layer (default is 1000 for ImageNet).
        """
        super(VGGNet, self).__init__()
        
        # Define the convolutional layers in the feature extractor part of the VGGNet
        self.features = nn.Sequential(
            # conv1: two convolutional layers with 64 output channels each
            #nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
            # Adjust the first conv layer to take in 1-channel images.
            nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1),   
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # conv2: two convolutional layers with 128 output channels each
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # conv3: three convolutional layers with 256 output channels each
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # conv4: three convolutional layers with 512 output channels each
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # conv5: three convolutional layers with 512 output channels each
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # Define the fully connected layers in the classifier part of the VGGNet
        self.classifier = nn.Sequential(
            # First fully connected layer
            nn.Linear(in_features=7 * 7 * 512, out_features=4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            # Second fully connected layer
            nn.Linear(in_features=4096, out_features=4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            # Final layer with output size = number of classes
            nn.Linear(in_features=4096, out_features=num_classes)
        )

    # Define the forward pass of the network.
    def forward(self, x):
        # Pass the input through the convolutional layers.
        x = self.features(x)
        # Flatten the output of the conv layers to fit the fully connected layers.
        x = x.view(x.size(0), -1)
        # Pass the flattened output through the fully connected layers.
        x = self.classifier(x)
        return x



def train(model, trainloader, criterion, optimizer, device):
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 15 == 0:
                logging.info(f'Epoch: {epoch + 1}, Batch: {i + 1}, Avg. Loss: {running_loss / 200:.4f}')
                running_loss = 0.0

def test(model, testloader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    logging.info(f'Accuracy of the network on test images: {100 * correct / total}%')

# Main execution
trainloader, testloader = load_dataset()

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Instantiate the VGGNet class with 10 output classes (default).
# Assuming the use of Fashion-MNIST dataset where images are 28x28 in size.

#TODO implementation from scratch
model = VGGNet().to(device)

#TODO compare with VGG model from torch
#import torchvision.models as models
#num_classes = 10
#model = models.vgg16(pretrained=True)
#model.features[0] = torch.nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1) # Modify the first convolutional layer
#model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, num_classes) # Modify the classifier for 10 output classes


# Show the architecture of the model.
logging.info(model)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

num_epochs = 10  # Define the number of epochs
train(model, trainloader, criterion, optimizer, device)
test(model, testloader, device)


2024-01-11 13:05:52,728 - INFO - VGGNet(
  (features): Sequential(
    (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_siz

# R-CNN (Region-based Convolutional Neural Network):

- The original R-CNN model proposes a method where selective search is used to generate region proposals. These regions are then fed into a CNN to extract features, which are subsequently classified by a set of SVMs (Support Vector Machines).

- It's computationally expensive due to the need to process multiple regions per image separately.

- **SVM Training:** You then train an SVM classifier using these feature vectors as input. The SVM will learn a hyperplane that best separates the feature vectors into different classes based on the training data you provide. Each class in your training set should have a corresponding label.


In [3]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import cv2
import numpy as np
import logging
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from torch.utils.data import Subset

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_dataset():
    """
    Load and preprocess a subset of the Fashion-MNIST dataset.
    Returns DataLoader objects for the subsets of training (100 instances) 
    and testing datasets (10 instances).
    """
    transform = transforms.Compose([transforms.ToTensor()])

    # Load the full Fashion-MNIST datasets
    full_trainset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
    full_testset = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

    # Define indices for the subsets
    train_indices = np.arange(1000) 
    test_indices = np.arange(100)  

    # Create subset datasets
    trainset = Subset(full_trainset, train_indices)
    testset = Subset(full_testset, test_indices)

    # Create data loaders
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
    testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)

    return trainloader, testloader


# Define the R-CNN Model
class FashionRCNN(nn.Module):
    def __init__(self):
        """
        Initialize the FashionRCNN model.
        The model consists of a simple CNN for feature extraction.
        """
        super(FashionRCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

    def forward(self, x):
        """
        Forward pass of the model.
        Applies the feature extraction layers.
        """
        x = self.features(x)
        return x.view(x.size(0), -1)  # Flatten the output

    def extract_features(self, x):
        """
        Extract features from the input batch.
        This function is used for feature extraction for SVM training.
        """
        with torch.no_grad():
            x = self.features(x)
        return x.view(x.size(0), -1).cpu().numpy()

# Main function to execute the training and testing
def main():
    trainloader, testloader = load_dataset()
    net = FashionRCNN()

    # Extracting features from training data
    logger.info('Starting feature extraction for training data')
    features = []
    labels = []
    for i, data in enumerate(trainloader, 0):
        inputs, label = data
        feature = net.extract_features(inputs)
        features.extend(feature)
        labels.extend(label.numpy())

    # SVM classifier training
    logger.info('Starting SVM training')
    svm_classifier = make_pipeline(StandardScaler(), svm.SVC(gamma='auto'))
    svm_classifier.fit(features, labels)
    logger.info('SVM training completed')

    # Testing with SVM classifier
    logger.info('Starting SVM testing')
    correct = 0
    total = 0
    for data in testloader:
        images, label = data
        feature = net.extract_features(images)
        predicted = svm_classifier.predict(feature)
        total += label.size(0)
        correct += (predicted == label.numpy()).sum().item()

    logger.info(f'Accuracy of the SVM classifier on the 100 test images: {100 * correct / total}%')

if __name__ == "__main__":
    main()


INFO:__main__:Starting feature extraction for training data
INFO:__main__:Starting SVM training
INFO:__main__:SVM training completed
INFO:__main__:Starting SVM testing
INFO:__main__:Accuracy of the SVM classifier on the 100 test images: 76.0%
