In [None]:
#Q1

In [None]:
Pooling, in the context of Convolutional Neural Networks (CNN), is a technique used to downsample the spatial dimensions of feature maps. It serves the purpose of reducing the spatial size of the input representation, extracting dominant features, and facilitating translation invariance. Here are the purpose and benefits of pooling in CNN:

1. Dimension Reduction: Pooling reduces the spatial dimensions of the feature maps, which helps to reduce the computational complexity and memory requirements of the subsequent layers in the network. By downsampling the feature maps, pooling enables the network to process larger input sizes efficiently.

2. Translation Invariance: Pooling provides a degree of translation invariance: a small shift in the input often leaves the pooled output unchanged, as long as the feature stays within the same pooling window. This makes CNNs less sensitive to the exact position of a pattern in the input image. Pooling achieves this by summarizing each pooling region with a single value, such as its maximum or its average.

3. Feature Extraction: Pooling helps to extract the most relevant and dominant features from the input data. By selecting the most important features within each pooling region, pooling focuses on capturing the essential characteristics of the input while discarding less informative details. This enhances the ability of the network to identify and discriminate between different patterns or objects.

4. Robustness to Small Distortions: Pooling makes the network less sensitive to small variations in the input, such as slight shifts or local deformations. It does not make the network invariant to larger transformations such as rotation or scaling, so the same feature seen at a very different orientation or scale can still produce a different response.

5. Parameter Efficiency: Pooling layers have no learnable parameters of their own, and they do not change the number of weights in later convolutional layers, whose parameter count depends on filter size and channel count rather than on spatial size. What pooling does reduce is the size of the feature maps that are eventually flattened, which lowers the number of weights in the fully connected layers that follow. Fewer parameters can help generalization and reduce the risk of overfitting.

6. Computational Efficiency: Pooling reduces the computational complexity of the network. By reducing the spatial size of the feature maps, pooling reduces the number of computations required for subsequent layers. This speeds up the training and inference process, making CNNs more computationally efficient.

Pooling operations, such as max pooling or average pooling, play a crucial role in CNN architectures by reducing spatial dimensions, extracting relevant features, achieving translation invariance, and improving computational and parameter efficiency. These benefits contribute to the overall performance and effectiveness of CNN models in various computer vision tasks, such as image classification, object detection, and semantic segmentation.
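
The downsampling and the limited translation invariance described above can be illustrated with a minimal pure-Python sketch. The function name `max_pool2d` and the toy 4x4 feature maps are made up for this illustration; real frameworks implement pooling on tensors, not nested lists.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max-pool a 2D list of numbers with a square window."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# A 4x4 feature map with a strong activation (9) at row 0, column 0.
original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

# The activation moved one pixel right: still inside the same 2x2 window.
shifted_within_window = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

# The activation moved two pixels right: now in the neighbouring window.
shifted_across_windows = [
    [0, 0, 9, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

pooled_original = max_pool2d(original)
pooled_within = max_pool2d(shifted_within_window)
pooled_across = max_pool2d(shifted_across_windows)

print(pooled_original)                   # 4x4 input becomes a 2x2 output
print(pooled_original == pooled_within)  # small shift: output unchanged
print(pooled_original == pooled_across)  # larger shift: output changes
```

The last comparison shows why the invariance is only approximate: once a feature crosses a window boundary, the pooled output changes.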

In [None]:
#Q2

In [None]:
Max pooling and min pooling are both downsampling operations that can be applied to feature maps in convolutional neural networks (CNNs), typically after convolutional layers. Max pooling is a standard building block, while min pooling is rarely used in practice.

The main difference between min pooling and max pooling lies in how they select the representative value from a pool of values.

1. Max Pooling:
   - Max pooling takes a pool of values and outputs the maximum value from that pool.
   - It is used to capture the most salient features within a region of the input.
   - By selecting the maximum value, it emphasizes the presence of a particular feature, which helps in detecting patterns, edges, or textures that are most dominant within the pool.
   - Max pooling is often used to reduce the spatial dimensions of feature maps while retaining the most important information.

2. Min Pooling:
   - Min pooling, on the other hand, takes a pool of values and outputs the minimum value from that pool.
   - It is less commonly used compared to max pooling.
   - Min pooling can help detect features that exhibit low values or represent a specific absence of a certain pattern.
   - It can be useful in certain scenarios, such as anomaly detection or identifying regions with minimal activity.

In both cases, pooling is performed by sliding a window (typically non-overlapping) over the input feature map and applying the pooling operation within each window. This process reduces the spatial dimensions while retaining the most relevant information.

In summary, max pooling focuses on capturing the most prominent features, while min pooling aims to identify low-value or absence patterns. The choice between the two depends on the specific task and the characteristics of the data being processed. In most CNN architectures, max pooling is more commonly used due to its effectiveness in capturing dominant features.
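
As a small illustration of the difference, the sketch below applies both operations to the same toy feature map. The helper `pool2d` and the input values are hypothetical and chosen only for this example.

```python
def pool2d(feature_map, reduce_fn, size=2, stride=2):
    """Slide a square window over a 2D list and reduce each window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            reduce_fn(
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 7, 8],
    [5, 1, 3, 4],
]

max_pooled = pool2d(feature_map, max)  # keeps the strongest value per window
min_pooled = pool2d(feature_map, min)  # keeps the weakest value per window

print(max_pooled)
print(min_pooled)

# Min pooling is equivalent to negating, max pooling, and negating again.
negated = [[-v for v in row] for row in feature_map]
min_via_max = [[-v for v in row] for row in pool2d(negated, max)]
print(min_via_max == min_pooled)
```

The final check shows that min pooling can be expressed through max pooling on the negated input, which is one way to obtain it when only a max pooling operation is available.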

In [None]:
#Q3

In [None]:
Padding in convolutional neural networks (CNNs) refers to the process of adding extra border pixels around the input image or feature map. These additional pixels are typically filled with zeros, hence the name "zero-padding." The primary purpose of padding is to control the spatial dimensions of the output feature maps after convolutional operations.

Padding is significant for several reasons:

1. Preservation of border information: Without padding, pixels at the border of the input are covered by fewer filter positions than pixels in the center, so information at the edges is under-represented in the output feature maps. Padding lets the filter be centered on border pixels as well, giving them a more equal contribution to the output.

2. Retaining spatial resolution: In many cases, preserving the spatial resolution is crucial. For example, in object detection tasks, it is important to precisely localize objects within an image. Padding helps maintain the original spatial resolution of the input, allowing the network to detect objects at different locations accurately.

3. Mitigating information loss in deep networks: Without padding, every convolution shrinks the feature map by (filter_size - 1) pixels per dimension at stride 1, so a deep stack of layers can reduce the map to a very small size and discard fine-grained spatial detail. Padding counteracts this shrinkage. It does not keep the receptive field constant: the receptive field (the area of the input that influences a single output unit) still grows with depth, but padding allows many layers to be stacked without the feature maps collapsing.

4. Border effects: Without padding, a filter can only be applied at positions where it fits entirely inside the input, so output values centered on border pixels are never computed. Padding extends the input's borders so that the filter can be applied at every input position.

The amount of padding applied is determined by the desired output spatial dimensions and the size of the convolutional filters. Two common schemes are "same" padding, which pads the input so that the output feature maps have the same spatial dimensions as the input (for stride 1 and an odd filter size k, this means (k - 1) / 2 pixels on each side), and "valid" padding, which applies no padding at all and therefore produces smaller output feature maps.

In summary, padding in CNNs plays a crucial role in maintaining spatial information, preserving resolution, mitigating information loss, and handling border effects. It ensures that the network can effectively learn and represent features across the entire input space, leading to more accurate and reliable results.
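
The border effect can be made concrete with a short pure-Python sketch. The helpers `zero_pad` and `windows_covering` are hypothetical names introduced for this illustration; the second one counts how many stride-1 filter positions include a given input pixel.

```python
def zero_pad(feature_map, pad):
    """Return a copy of a 2D list surrounded by `pad` rows/columns of zeros."""
    width = len(feature_map[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in feature_map:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def windows_covering(row, col, input_size, kernel, pad):
    """Count stride-1 filter positions whose window contains pixel (row, col)."""
    padded_size = input_size + 2 * pad
    count = 0
    for top in range(padded_size - kernel + 1):
        for left in range(padded_size - kernel + 1):
            # Pixel coordinates are offset by `pad` in the padded frame.
            inside_rows = top <= row + pad < top + kernel
            inside_cols = left <= col + pad < left + kernel
            if inside_rows and inside_cols:
                count += 1
    return count


padded = zero_pad([[1, 2], [3, 4]], pad=1)
for row in padded:
    print(row)

# 5x5 input, 3x3 filter: compare a corner pixel with the center pixel.
corner_valid = windows_covering(0, 0, input_size=5, kernel=3, pad=0)
center_valid = windows_covering(2, 2, input_size=5, kernel=3, pad=0)
corner_same = windows_covering(0, 0, input_size=5, kernel=3, pad=1)
center_same = windows_covering(2, 2, input_size=5, kernel=3, pad=1)

print("no padding:   corner", corner_valid, "center", center_valid)
print("padding of 1: corner", corner_same, "center", center_same)
```

In this toy setting the corner pixel still contributes to fewer windows than the center pixel after padding, but the gap narrows, which is the sense in which padding reduces the under-representation of border pixels.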

In [None]:
#Q4

In [None]:
This answer compares the effects of zero-padding and valid-padding on the output feature map size.

1. Zero-padding:
   - Zero-padding refers to adding extra border pixels around the input image or feature map, typically filled with zeros.
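   - With zero-padding, the output feature map size can be calculated using the formula (with the division rounded down):
     output_size = floor((input_size + 2 * padding - filter_size) / stride) + 1
   - Zero-padding enlarges the input before the convolution, so the output feature map is larger than it would be with valid-padding. With "same" padding at stride 1, where padding = (filter_size - 1) / 2 for an odd filter size, the output has exactly the same spatial size as the input.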
   - The added border pixels provide a buffer zone that allows the convolutional filters to capture information near the borders of the input.
   - Zero-padding helps preserve spatial information, retain spatial resolution, mitigate information loss, and handle border effects.
   - Zero-padding is commonly used in convolutional layers to maintain consistent spatial dimensions throughout the network and prevent information loss at the edges.

2. Valid-padding:
   - Valid-padding, also known as "no padding," means that no extra border pixels are added around the input.
   - With valid-padding, the output feature map size can be calculated using the formula (with the division rounded down):
     output_size = floor((input_size - filter_size) / stride) + 1
   - Valid-padding produces an output that is smaller than the input whenever the filter is larger than 1x1.
   - Without padding, the filter is applied only at positions where it fits entirely inside the input. The receptive field of each output unit is unchanged, but there are fewer output units.
   - Border pixels therefore contribute to fewer output values than central pixels, which can cause a loss of spatial detail near the edges.
   - Valid-padding is often used when the preservation of spatial resolution is not a primary concern or when the input size is large enough to mitigate border effects.

In summary:
- For the same input, filter, and stride, zero-padding produces a larger output feature map than valid-padding; "same" padding keeps the output equal to the input size at stride 1.
- Zero-padding helps preserve spatial information, retain resolution, and mitigate information loss, at the cost of somewhat more computation because the feature maps are larger.
- Valid-padding shrinks the spatial dimensions at every layer and gives pixels near the borders less influence on the output.
- The choice between zero-padding and valid-padding depends on the specific requirements of the task, such as the need for spatial accuracy, sensitivity to border effects, or computational constraints.

It's important to note that the choice of padding and its effects on the output feature map size can have implications for subsequent layers and the overall performance of the CNN.
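
The two formulas above can be checked with a few lines of Python. The function names `conv_output_size` and `same_padding` are illustrative, and the example sizes are arbitrary choices rather than values from a specific network.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution, with the division rounded down."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def same_padding(filter_size):
    """Per-side padding that preserves the size at stride 1 (odd filter sizes)."""
    return (filter_size - 1) // 2


# Valid-padding: a 5x5 filter shrinks a 32x32 input to 28x28.
valid_out = conv_output_size(32, 5, stride=1, padding=0)

# "Same" zero-padding: two pixels per side keep the 32x32 size.
same_out = conv_output_size(32, 5, stride=1, padding=same_padding(5))

# With stride 2 the output is roughly halved in both cases.
valid_stride2 = conv_output_size(32, 5, stride=2, padding=0)
same_stride2 = conv_output_size(32, 5, stride=2, padding=same_padding(5))

print("valid, stride 1:", valid_out)
print("same,  stride 1:", same_out)
print("valid, stride 2:", valid_stride2)
print("same,  stride 2:", same_stride2)
```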

In [None]:
#Q5

In [None]:
LeNet-5 is a pioneering convolutional neural network (CNN) architecture developed by Yann LeCun and his colleagues in the 1990s. It was primarily designed for handwritten digit recognition tasks, such as recognizing digits in postal addresses.

Here's an overview of the LeNet-5 architecture:

1. Input Layer:
   - LeNet-5 takes grayscale images of size 32x32 pixels as input.
   - In modern reimplementations, the images are typically normalized to have pixel values ranging from 0 to 1.

2. Convolutional Layers:
   - LeNet-5 has two main convolutional layers. In the original paper the 120-unit layer is also described as convolutional, but because its 5x5 filters cover its entire 5x5 input it behaves like a fully connected layer and is treated as one here.
   - The first convolutional layer applies six filters (kernels) of size 5x5 to the input.
   - No padding is used, so the 32x32 input yields six 28x28 feature maps.
   - The second convolutional layer applies sixteen filters of size 5x5 to the pooled output of the first layer.
   - This turns the 14x14 pooled maps into sixteen 10x10 feature maps.

3. Pooling Layers:
   - After each convolutional layer, LeNet-5 incorporates a pooling layer.
   - The pooling layers (called subsampling layers in the original paper) use average pooling with a 2x2 filter and a stride of 2.
   - Average pooling halves the spatial dimensions of the feature maps (28x28 to 14x14, then 10x10 to 5x5) while summarizing the local activations.

4. Fully Connected Layers:
   - Following the convolutional and pooling layers, LeNet-5 includes three fully connected layers.
   - The first fully connected layer has 120 neurons, while the second has 84 neurons.
   - These layers help capture high-level features by combining information from the preceding layers.
   - The final fully connected layer has ten neurons, corresponding to the ten possible classes (digits 0-9) in digit recognition tasks.
   - The original network uses a scaled hyperbolic tangent (tanh) activation in its convolutional and fully connected layers; reimplementations sometimes substitute sigmoid or ReLU.

5. Output Layer:
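   - In modern reimplementations, the output layer employs a softmax activation function to produce a probability distribution over the ten possible classes. The original LeNet-5 used radial basis function (RBF) output units instead.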
   - The class with the highest probability is considered the predicted class.

LeNet-5's architecture combined several ideas that are still widely used in modern CNNs, including alternating convolutional and pooling (subsampling) layers, weight sharing through convolution, and non-linear activation functions.

Despite its simplicity by today's standards, LeNet-5 played a crucial role in demonstrating the effectiveness of deep learning in image recognition tasks. It served as a foundation for subsequent advancements in CNN architectures, paving the way for the deep learning revolution.
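
The layer sizes described above can be traced with a short calculation. This is a sketch under stated assumptions rather than a reproduction of the original network: it assumes every filter in the second convolutional layer sees all six input maps, that pooling has no trainable parameters, and that the head is three dense layers of 120, 84, and 10 units. The helper names and the labels `C1`, `S2`, `C3`, `S4`, `FC1`, `FC2` are used here only for readability.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per filter, assuming full connectivity."""
    return out_channels * (kernel * kernel * in_channels + 1)


def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_out * (n_in + 1)


size = 32
shapes = [("input", (size, size, 1))]
params = {}

size = conv_out(size, 5)                # first convolution: 5x5, 6 filters
shapes.append(("C1", (size, size, 6)))
params["C1"] = conv_params(1, 6, 5)

size = conv_out(size, 2, stride=2)      # first pooling: 2x2, stride 2
shapes.append(("S2", (size, size, 6)))

size = conv_out(size, 5)                # second convolution: 5x5, 16 filters
shapes.append(("C3", (size, size, 16)))
params["C3"] = conv_params(6, 16, 5)

size = conv_out(size, 2, stride=2)      # second pooling: 2x2, stride 2
shapes.append(("S4", (size, size, 16)))

flat = size * size * 16                 # features entering the dense layers
params["FC1"] = dense_params(flat, 120)
params["FC2"] = dense_params(120, 84)
params["output"] = dense_params(84, 10)

for name, shape in shapes:
    print(name, shape)
print("flattened features:", flat)
print("parameters per layer:", params)
print("total parameters:", sum(params.values()))
```

Under these assumptions most of the parameters sit in the first dense layer, which is typical for small CNNs with a flattened feature map.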

In [None]:
#Q6

In [None]:
The key components of LeNet-5 and their purposes are as follows:

1. Convolutional Layers:
   - LeNet-5 consists of two convolutional layers.
   - The purpose of the convolutional layers is to extract meaningful local features from the input images.
   - Each convolutional layer applies a set of learnable filters (kernels) to the input feature maps.
   - The convolution operation involves sliding these filters across the input, computing element-wise multiplications, and summing the results to produce feature maps.
   - By applying multiple filters, the network learns to detect different patterns and features at various spatial locations.

2. Pooling Layers:
   - After each convolutional layer, LeNet-5 includes pooling layers.
   - The pooling layers serve to reduce the spatial dimensions of the feature maps while preserving the most salient information.
   - LeNet-5 employs average pooling, where each pooling unit computes the average value within a defined region (usually 2x2) of the input feature map.
   - Average pooling helps to downsample the feature maps, making the network more robust to small variations in the position of features and reducing computational complexity.

3. Fully Connected Layers:
   - Following the convolutional and pooling layers, LeNet-5 incorporates three fully connected layers.
   - The fully connected layers are responsible for capturing high-level abstractions and combining information from the preceding layers.
   - The first fully connected layer has 120 neurons, while the second fully connected layer has 84 neurons.
   - These layers employ activation functions (a scaled hyperbolic tangent in the original network) to introduce non-linearity.
   - The final fully connected layer consists of ten neurons, representing the ten possible classes in digit recognition.
   - In modern reimplementations, the output of this layer is fed into a softmax function to obtain class probabilities; the original network used radial basis function (RBF) output units instead.

4. Activation Functions:
   - Activation functions introduce non-linearity and help the network learn complex relationships between inputs and outputs.
   - The original LeNet-5 applies a scaled hyperbolic tangent (tanh) activation in both its convolutional and fully connected layers; reimplementations sometimes use sigmoid or ReLU instead.
   - The choice of activation functions enables the network to model non-linear mappings and make predictions based on the learned features.

5. Output Layer:
   - In modern reimplementations, the output layer of LeNet-5 uses the softmax activation function.
   - Softmax converts the outputs of the last fully connected layer into a probability distribution over the possible classes.
   - The class with the highest probability is considered the predicted class.

In summary, the convolutional layers extract local features, pooling layers reduce spatial dimensions, fully connected layers capture high-level abstractions, activation functions introduce non-linearity, and the output layer provides class probabilities. Together, these components enable LeNet-5 to perform digit recognition tasks effectively.
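
Two of the operations named above, the sliding-filter convolution and the softmax, can be sketched in pure Python. The functions `conv2d_valid` and `softmax`, the toy image, and the edge filter are all assumptions made for this illustration. As in most deep learning frameworks, the "convolution" here does not flip the filter, so strictly speaking it is a cross-correlation.

```python
import math


def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D list (stride 1, no padding)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            # Element-wise multiply the window with the kernel, then sum.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


# A 4x4 image that is dark on the left and bright in the last column.
image = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

# A 3x3 filter that responds to dark-to-bright transitions from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)  # the response is non-zero only where the edge is

probabilities = softmax([1.0, 2.0, 3.0])
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```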

In [None]:
#Q7

In [None]:
LeNet-5, being one of the early pioneering CNN architectures, has both advantages and limitations in the context of image classification tasks. Let's explore them:

Advantages of LeNet-5:

1. Efficiency with small-sized inputs: LeNet-5 was designed to process small-sized grayscale images, such as 32x32 pixels. Its architecture is optimized for such inputs, making it computationally efficient and requiring fewer parameters compared to architectures designed for larger images.

2. Hierarchical feature learning: LeNet-5's architecture, with alternating convolutional and pooling layers, allows for hierarchical feature learning. The early convolutional layers capture low-level local features, while the subsequent layers learn increasingly complex and abstract representations. This enables the network to understand hierarchical structures in images.

3. Translation invariance: The use of pooling layers in LeNet-5 helps achieve translation invariance. By downsampling the feature maps, the network becomes less sensitive to small translations of the input image, making it robust to slight variations in the position of features.

4. Simplicity and interpretability: LeNet-5 has a relatively simple architecture compared to modern CNNs. Its simplicity makes it easier to understand and interpret. This characteristic is particularly valuable in educational settings or scenarios where model interpretability is crucial.

Limitations of LeNet-5:

1. Limited capacity: LeNet-5's architecture may not have sufficient capacity to handle more complex and larger-scale datasets. It was primarily designed for digit recognition tasks, which have relatively simpler patterns. The shallow architecture and small receptive fields limit its ability to learn intricate features and handle more diverse image datasets.

2. Lack of scalability: Due to its specific design for small-sized inputs, LeNet-5 may not scale well to larger images. The receptive fields and pooling operations may not adequately capture contextual information in bigger images, affecting its performance on tasks that require a broader field of view.

3. Lack of advanced techniques: LeNet-5 predates many modern advancements in CNN architectures. It does not incorporate more recent techniques like batch normalization, residual connections, or advanced activation functions, which have been shown to improve performance in image classification tasks.

4. Not suitable for complex datasets: LeNet-5's architecture and capacity make it less suitable for complex image datasets with a high level of variation and intricacies. It may struggle to capture fine-grained details or handle complex object recognition tasks that require more sophisticated models.

In summary, while LeNet-5 was groundbreaking in its time and laid the foundation for CNNs, it has some limitations when applied to more complex image classification tasks. It excels with small-sized inputs, provides interpretability, and achieves translation invariance but may lack the capacity and scalability needed for handling larger and more complex datasets.

In [None]:
#Q8

In [1]:
!pip install tensorflow


In [2]:
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
num_classes = 10

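# Define a LeNet-5-style architecture.
# Note: this is a modernized variant, not the original LeNet-5. It takes the
# 28x28 MNIST images directly (instead of 32x32 inputs), uses ReLU instead of
# tanh, and uses max pooling instead of average pooling.
model = models.Sequential()
model.add(layers.Conv2D(6, kernel_size=(5, 5), activation='relu', input_shape=(28, 28, 1)))  # -> 24x24x6
model.add(layers.MaxPooling2D(pool_size=(2, 2)))                                             # -> 12x12x6
model.add(layers.Conv2D(16, kernel_size=(5, 5), activation='relu'))                          # -> 8x8x16
model.add(layers.MaxPooling2D(pool_size=(2, 2)))                                             # -> 4x4x16
model.add(layers.Flatten())                                                                  # -> 256 features
model.add(layers.Dense(120, activation='relu'))
model.add(layers.Dense(84, activation='relu'))
model.add(layers.Dense(num_classes, activation='softmax'))  # class probabilities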

# Compile and train the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=10, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)


2023-07-16 16:09:32.174838: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-07-16 16:09:32.238503: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-07-16 16:09:32.240530: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 0.037744369357824326
Test Accuracy: 0.9883999824523926


In [None]:
#Q9

In [None]:
AlexNet is a popular convolutional neural network (CNN) architecture that achieved significant breakthroughs in image classification, specifically winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Here's an overview of the AlexNet architecture:

1. Input Layer:
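   - AlexNet takes RGB images of size 227x227 pixels as input. The paper states 224x224, but 227x227 is the size that is consistent with the layer arithmetic when no padding is used in the first layer.
   - The original preprocessing subtracts the mean activity over the training set from each pixel; reimplementations often also scale pixel values to the range 0 to 1.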

2. Convolutional Layers:
   - AlexNet starts with five convolutional layers.
   - The first convolutional layer applies 96 filters of size 11x11 with a stride of 4.
   - The subsequent convolutional layers apply 256 filters of size 5x5, followed by 384, 384, and 256 filters of size 3x3.
   - The activation function used is a rectified linear unit (ReLU), introducing non-linearity.

3. Max Pooling Layers:
   - AlexNet applies max pooling after the first, second, and fifth convolutional layers.
   - The max pooling layers use a 3x3 filter with a stride of 2.
   - Max pooling reduces the spatial dimensions of the feature maps while retaining the most prominent features.

4. Local Response Normalization (LRN) Layer:
   - AlexNet applies LRN after the first and second convolutional layers, before the max pooling that follows them.
   - LRN layers aim to provide local competition between adjacent neurons and enhance generalization.
   - LRN normalizes the responses of the neurons across channels, promoting the detection of more diverse and robust features.

5. Fully Connected Layers:
   - Following the convolutional and pooling layers, AlexNet includes three fully connected layers.
   - The first two fully connected layers have 4096 neurons each, and the third has 1000 neurons and serves as the output layer.
   - Dropout regularization is applied in the first two fully connected layers to prevent overfitting.
   - The activation function used in the first two fully connected layers is ReLU.

6. Output Layer:
   - The output layer of AlexNet consists of 1000 neurons, representing the 1000 possible classes in the ILSVRC challenge.
   - The activation function used in the output layer is softmax, which produces a probability distribution over the classes.

7. Multi-GPU Training:
   - One distinctive aspect of AlexNet is that the network itself was split across two GPUs, with each GPU holding half of the kernels and the GPUs communicating only in certain layers.
   - This approach worked around the limited GPU memory available at the time and made training on a large-scale dataset feasible.

In summary, AlexNet revolutionized the field of image classification with its deep architecture, ReLU activations, dropout regularization, and large-scale training on the ImageNet dataset. Its success paved the way for the development of deeper and more powerful CNN architectures.
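
The spatial sizes implied by the layer description above can be traced with a short calculation. This is a sketch under assumptions: it uses a 227x227 input, no padding in the first layer, a padding of 2 in the second layer, and a padding of 1 in the 3x3 layers, which are the commonly used values rather than figures stated in this document. The helper `out_size` and the layer labels are illustrative.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, output channels); None keeps the channel count.
layers = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]

size, channels = 227, 3
trace = []
for name, kernel, stride, padding, out_channels in layers:
    size = out_size(size, kernel, stride, padding)
    if out_channels is not None:
        channels = out_channels
    trace.append((name, size, channels))
    print(f"{name}: {size}x{size}x{channels}")

# Number of features fed into the first fully connected layer.
flattened = size * size * channels
print("flattened:", flattened)
```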

In [None]:
#Q10

In [None]:
AlexNet introduced several architectural innovations that contributed to its breakthrough performance. These innovations include:

1. Deep Architecture:
   - AlexNet was one of the first CNN architectures to combine considerable depth (eight learned layers: five convolutional and three fully connected) with training on a large-scale dataset.
   - Prior to AlexNet, shallow networks were commonly used, but AlexNet demonstrated the power of deeper architectures in learning complex features and improving classification accuracy.

2. Convolutional Layers with Large Filter Sizes:
   - AlexNet utilized convolutional layers with large filter sizes, particularly the first layer with an 11x11 filter.
   - This choice allowed the network to capture more spatial context and capture larger-scale patterns in the input images.

3. Rectified Linear Units (ReLU):
   - AlexNet adopted the rectified linear unit (ReLU) activation function instead of traditional activation functions like sigmoid or tanh.
   - ReLU provides faster and more efficient training by mitigating the vanishing gradient problem and enabling better gradient flow during backpropagation.

4. Local Response Normalization (LRN):
   - AlexNet applied LRN after the first and second convolutional layers, before the max pooling that follows them.
   - LRN layers encouraged competition among adjacent neurons, promoting the detection of diverse and robust features.
   - This normalization technique enhanced the generalization ability of the network and contributed to improved accuracy.

5. Overlapping Pooling:
   - AlexNet used overlapping pooling, in which the stride (2) is smaller than the pooling window size (3x3), so that neighboring windows share some inputs.
   - The authors reported slightly lower error rates with overlapping pooling and observed that such models were slightly harder to overfit.

6. Dropout Regularization:
   - AlexNet employed dropout regularization, with a drop probability of 0.5, in the first two fully connected layers.
   - Dropout randomly drops out a fraction of the neurons during training, reducing overfitting and improving the network's generalization ability.

7. GPU Acceleration and Parallelization:
   - AlexNet was designed to take advantage of GPU acceleration and parallel processing.
   - The network was split across two GPUs, with each GPU holding half of the kernels and the GPUs communicating only in certain layers, which made it possible to train a model too large for the memory of a single GPU of that era.

These architectural innovations collectively contributed to the breakthrough performance of AlexNet. They enabled the network to learn rich and discriminative features, effectively handle large-scale datasets, and achieve a significant improvement in accuracy on the challenging ImageNet dataset. AlexNet's success inspired further research in deep learning and paved the way for subsequent advancements in CNN architectures.
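
To make the LRN step more concrete, the sketch below normalizes the activations of several channels at a single spatial position. The function name `local_response_norm` and the sample activations are made up for this illustration; the default constants (k = 2, n = 5, alpha = 1e-4, beta = 0.75) are the values reported in the AlexNet paper. Framework implementations may scale alpha differently, so this should be read as an illustration of the formula rather than a drop-in replacement for a library layer.

```python
def local_response_norm(activations, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of its n nearest channels.

    `activations` holds one value per channel at a single spatial position.
    """
    num_channels = len(activations)
    half = n // 2
    normalized = []
    for i, a_i in enumerate(activations):
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        sum_sq = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(a_i / (k + alpha * sum_sq) ** beta)
    return normalized


# The same small activation with a quiet neighbour and with a very active one.
quiet_neighbour = local_response_norm([1.0, 0.0])
loud_neighbour = local_response_norm([1.0, 100.0])

print(quiet_neighbour[0])
print(loud_neighbour[0])  # smaller: the strong neighbour suppresses it
```

The comparison illustrates the "local competition" idea: a unit's normalized response is reduced when nearby channels respond strongly at the same position.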

In [None]:
#Q11

In [None]:
In AlexNet, convolutional layers, pooling layers, and fully connected layers each play a crucial role in the architecture:

1. Convolutional Layers:
   - Convolutional layers in AlexNet are responsible for learning local patterns and extracting features from the input images.
   - The use of multiple convolutional layers allows the network to capture increasingly complex and abstract features.
   - AlexNet introduced large filter sizes, such as the 11x11 filters in the first layer, which helps capture larger-scale patterns and spatial context.
   - These layers employ the rectified linear unit (ReLU) activation function, promoting non-linearity and better gradient flow during training.

2. Pooling Layers:
   - Pooling layers in AlexNet follow the first, second, and fifth convolutional layers and serve to downsample the feature maps, reducing their spatial dimensions.
   - The pooling operation helps extract the most salient features while reducing the network's sensitivity to small spatial variations.
   - AlexNet incorporates max pooling layers with a 3x3 filter and a stride of 2, which downsamples the feature maps while retaining important information.
   - Overlapping pooling, achieved by using a smaller stride than the pooling window size, preserves more spatial details and improves the network's ability to capture fine-grained features.

3. Fully Connected Layers:
   - Fully connected layers in AlexNet are responsible for capturing high-level abstractions and making class predictions.
   - They take the learned features from the preceding layers and combine them to make predictions.
   - AlexNet includes three fully connected layers with 4096, 4096, and 1000 neurons.
   - ReLU activation functions are used in the first two of these layers, introducing non-linearity and allowing the network to learn complex mappings; the 1000-way output is passed to a softmax.
   - Dropout regularization is applied in the first two fully connected layers to prevent overfitting by randomly dropping out a fraction of the neurons during training.

The convolutional layers extract local features, capturing patterns at different spatial scales. The pooling layers reduce the spatial dimensions and downsample the feature maps, retaining the most salient information. Finally, the fully connected layers capture high-level abstractions and make predictions based on the learned features. The combination of these layers allows AlexNet to effectively learn discriminative features and achieve high accuracy in image classification tasks.

It's important to note that these components work together in a hierarchical manner, with each layer building upon the representations learned by the previous layers. This hierarchical feature learning is a key factor in the success of CNN architectures like AlexNet in handling complex visual tasks.
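
The difference between overlapping and non-overlapping pooling can be seen by listing which input positions each window covers. The sketch below works in one dimension for brevity; the helper names and the input length of 7 are arbitrary choices made for this illustration.

```python
def pooling_windows(length, size, stride):
    """List the input indices covered by each 1D pooling window."""
    return [list(range(start, start + size))
            for start in range(0, length - size + 1, stride)]


def shared_indices(windows):
    """Return the indices that appear in more than one window."""
    seen, shared = set(), set()
    for window in windows:
        for index in window:
            if index in seen:
                shared.add(index)
            seen.add(index)
    return sorted(shared)


# AlexNet-style pooling: window 3, stride 2 (stride smaller than the window).
overlapping = pooling_windows(length=7, size=3, stride=2)

# Conventional pooling: window 2, stride 2 (stride equal to the window).
non_overlapping = pooling_windows(length=7, size=2, stride=2)

print(overlapping, shared_indices(overlapping))
print(non_overlapping, shared_indices(non_overlapping))
```

Both settings produce three outputs from seven inputs here, but only the overlapping windows share positions with their neighbors, and they also cover the last input position that the non-overlapping windows skip.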

In [None]:
#Q12

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyperparameters
num_epochs = 10
batch_size = 128
learning_rate = 0.001

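# Define AlexNet architecture
# Note: the local response normalization (LRN) layers of the original network
# are omitted here for simplicity.
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extractor. The sizes in the comments assume 224x224 inputs.
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),  # -> 96 x 55 x 55
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # -> 96 x 27 x 27
            nn.Conv2d(96, 256, kernel_size=5, padding=2),           # -> 256 x 27 x 27
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # -> 256 x 13 x 13
            nn.Conv2d(256, 384, kernel_size=3, padding=1),          # -> 384 x 13 x 13
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),          # -> 384 x 13 x 13
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),          # -> 256 x 13 x 13
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2)                   # -> 256 x 6 x 6
        )
        # Classifier: dropout in front of the first two fully connected layers.
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Load and preprocess the CIFAR-10 dataset
# CIFAR-10 images are 32x32, which is too small for this architecture: the
# feature map shrinks to 1x1 before the last pooling layer and the forward
# pass fails. The images are therefore upscaled to 224x224.
image_size = 224
normalize = transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))

# Random augmentation is applied to the training set only
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.Resize((image_size, image_size)),
    transforms.ToTensor(),
    normalize
])

# The test set is only resized and normalized so that evaluation is deterministic
test_transform = transforms.Compose([
    transforms.Resize((image_size, image_size)),
    transforms.ToTensor(),
    normalize
])

train_dataset = CIFAR10(root='./data', train=True, download=True, transform=train_transform)
test_dataset = CIFAR10(root='./data', train=False, download=True, transform=test_transform)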

train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# Initialize the model
model = AlexNet().to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')

# Evaluation
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        predicted = outputs.argmax(dim=1)  # index of the highest-scoring class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f'Test Accuracy: {accuracy:.2f}%')
