# TOPIC: Understanding Pooling and Padding in CNN

# 1. Describe the purpose and benefits of pooling in CNN.

Pooling, also known as subsampling or downsampling, is a fundamental operation in Convolutional Neural Networks (CNNs) used for feature extraction and dimensionality reduction. Its purpose is to reduce the spatial dimensions (width and height) of feature maps while retaining the most important information. Pooling is typically applied after convolutional layers and is an essential component of the feature extraction process in CNNs. The main purposes and benefits of pooling in CNNs are as follows:

1. Dimensionality Reduction: One of the primary purposes of pooling is to reduce the spatial dimensions of the feature maps. Pooling layers have no learnable parameters themselves, but smaller feature maps reduce the computation in later layers and the number of weights in any fully connected layers that follow, making the network more efficient and less prone to overfitting. Smaller feature maps also help in accelerating training and inference.

2. Translation Invariance: Pooling creates a degree of translation invariance in the feature representations. This means that even if the object of interest in the input image shifts slightly in position, the same features are activated, as long as they are still present in the region being pooled. This property helps the CNNs recognize patterns regardless of their exact position in the input image.

3. Feature Selection: Pooling retains the most essential features while discarding less important or redundant information. It does this by selecting the most dominant feature in each pooling region. This feature selection process can enhance the model's ability to focus on important patterns and reduce sensitivity to noise.

4. Downscaling: Because each value in a pooled feature map summarizes a whole region of the previous map, filters in later layers cover a larger area of the original input. This growing receptive field makes it possible to capture higher-level, more abstract features in later layers, which is essential for hierarchical feature learning.

5. Computational Efficiency: Pooling reduces the amount of computation required to process feature maps, which is crucial for real-time applications and resource-constrained devices.

There are two common types of pooling used in CNNs:

1. Max Pooling: In max pooling, the maximum value in each pooling region is selected as the representative feature. This helps emphasize the most prominent features in each region.

2. Average Pooling: In average pooling, the average value in each pooling region is used as the representative feature. It provides a smoother downsampling that reflects the overall activation level of a region, but it can blur sharp, localized responses that max pooling would keep.

Overall, pooling plays a vital role in CNNs by reducing spatial dimensions, improving the network's invariance and efficiency, and helping to create more robust and informative feature representations for image classification, object detection, and various computer vision tasks.
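The two pooling types described above can be illustrated with a minimal NumPy sketch. The helper `pool2d` and the sample `feature_map` are hypothetical names chosen for this example, and the loop-based implementation is meant for clarity rather than speed.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2D feature map with a square window and no padding."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=float)
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

# Hypothetical 4x4 feature map used only for illustration
feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 1],
                        [0, 2, 9, 5],
                        [1, 1, 3, 7]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")   # [[6, 2], [2, 9]]
avg_pooled = pool2d(feature_map, mode="mean")  # [[3.5, 1.0], [1.0, 6.0]]
print(max_pooled)
print(avg_pooled)
```

Both outputs are 2x2, a quarter of the original area. Max pooling keeps the strongest response in each window, while average pooling reports the mean level of activation.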

# 2. Explain the difference between min pooling and max pooling.

Min pooling and max pooling are both types of pooling operations used in Convolutional Neural Networks (CNNs) for dimensionality reduction and feature extraction. The main difference between the two lies in the way they select values from the pooling regions:

1. Max Pooling:
   - In max pooling, the maximum value within each pooling region is selected as the representative feature. It emphasizes the most prominent features in the region while discarding less significant information. Max pooling is particularly useful for highlighting the most activated features in an image, helping the network focus on the most important patterns. It contributes to creating translation invariance, as the network can detect the features regardless of their exact position in the input image. Max pooling is commonly used in CNN architectures for tasks such as image classification and object detection.

2. Min Pooling:
   - In contrast, min pooling selects the minimum value within each pooling region. This operation emphasizes the least intense features, which can be useful in specific scenarios where the presence of certain minimal features is significant. However, min pooling is less common than max pooling in CNN architectures, as it is typically not as effective in capturing the most salient features in an image. It may find limited use in specific applications where the minimum values hold critical information, such as anomaly detection or certain specialized image analysis tasks.

In summary, the key distinction between min pooling and max pooling lies in the values they select from the pooling regions. Max pooling highlights the most prominent features, while min pooling emphasizes the least intense features. Max pooling is more commonly used in CNNs for its effectiveness in capturing significant patterns and creating translation invariance, while min pooling finds more limited use in specific applications that require the emphasis of minimal values.
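As a rough illustration of the distinction, the sketch below applies non-overlapping 2x2 max and min pooling to the same small array. The helper `pool_2x2` and the sample values are assumptions made for this example. Most frameworks do not ship a dedicated min pooling layer, and one common workaround is to negate the input, apply max pooling, and negate the result, which the final check confirms.

```python
import numpy as np

def pool_2x2(x, reduce_fn):
    """Non-overlapping 2x2 pooling; assumes even height and width."""
    h, w = x.shape
    # Split rows and columns into (block index, offset inside block)
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

# Hypothetical activations used only for illustration
x = np.array([[0.9, 0.1, 0.4, 0.3],
              [0.2, 0.5, 0.8, 0.6],
              [0.7, 0.0, 0.2, 0.2],
              [0.3, 0.4, 0.1, 0.9]])

max_out = pool_2x2(x, np.max)  # [[0.9, 0.8], [0.7, 0.9]]
min_out = pool_2x2(x, np.min)  # [[0.1, 0.3], [0.0, 0.1]]

# Min pooling is the same as negated max pooling of the negated input
min_via_max = -pool_2x2(-x, np.max)
print(max_out)
print(min_out)
print(np.array_equal(min_out, min_via_max))
```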

# 3. Discuss the concept of padding in CNN and its significance.

In the context of Convolutional Neural Networks (CNNs), padding refers to the process of adding extra pixels around the boundary of an image, typically with zeros, to control the spatial dimensions of the output volume after applying convolutional operations. The use of padding is essential to manage the spatial size of the input volume as it goes through successive layers of convolutions and to preserve important information at the borders of the input data. The significance of padding in CNNs can be understood through the following points:

1. Preservation of Spatial Information: Padding helps in retaining the spatial information at the borders of the input data. Without padding, the spatial dimensions of the feature maps would reduce after each convolutional layer, leading to a gradual loss of information at the borders. Padding ensures that the spatial dimensions are maintained, allowing the CNN to capture features at all spatial locations of the input.

2. Control over Output Size: By controlling the amount of padding, one can manage the spatial size of the output volume after convolution. This control is crucial in designing CNN architectures and controlling the flow of information through the network. It enables the network to retain important spatial information while controlling the downsampling process through convolutional and pooling layers.

3. Mitigation of Boundary Effects: Padding helps in mitigating boundary effects that occur during convolution. When performing convolution operations, the pixels at the border have fewer neighboring pixels compared to the pixels in the center. This can result in a reduction of feature map size and the loss of information at the edges. By using appropriate padding techniques, the boundary effects can be minimized, ensuring that all parts of the input receive similar treatment during convolution.

4. Better Feature Extraction: Padding ensures that the convolutional kernels can process the pixels at the borders effectively, enabling the extraction of meaningful features from the entire input space. This leads to the creation of more representative feature maps, allowing the network to learn more complex and intricate patterns from the input data.

There are different types of padding methods used in CNNs, such as:

- Valid Padding: No padding is added, and the spatial dimensions reduce after each convolutional layer.
- Same Padding: Enough padding is added around the input, split as evenly as possible between opposite sides, so that at a stride of 1 the output spatial dimensions remain the same as the input dimensions after the convolution operation.

Overall, the use of padding in CNNs is critical for maintaining spatial information, controlling the output size, and ensuring effective feature extraction, thus contributing to the robustness and accuracy of the network in various computer vision tasks.
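The effect of padding on output size can be seen in a small NumPy sketch. The function `conv2d_valid` is a hypothetical helper written for this example (it computes the cross-correlation that deep learning frameworks call convolution, at stride 1), and the 5x5 image and averaging kernel are arbitrary choices.

```python
import numpy as np

def conv2d_valid(x, k):
    """Stride-1 cross-correlation with no padding ('valid')."""
    kh, kw = k.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0  # simple averaging filter

# Valid padding: the 3x3 kernel fits in only 3x3 positions
valid_out = conv2d_valid(image, kernel)

# Same padding: one ring of zeros keeps the output at 5x5
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
same_out = conv2d_valid(padded, kernel)

print(valid_out.shape, padded.shape, same_out.shape)
```

The interior of `same_out` matches `valid_out`; the extra border values are the positions where the kernel overlaps the added zeros.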

# 4. Compare and contrast zero-padding and valid-padding in terms of their effects on the output feature map size.

In Convolutional Neural Networks (CNNs), padding is applied to the input data before performing convolutions to control the output size of the feature maps. Two commonly used padding techniques are zero-padding and valid-padding. Let's compare and contrast these padding methods based on their effects on the output feature map size:

1. Zero-padding:

   - Effect on output feature map size: Zero-padding adds zeros to the borders of the input data, effectively increasing the spatial dimensions of the input volume. With a stride of 1 and a padding of (k - 1) / 2 pixels per side for an odd kernel size k (the 'same' setting), the output feature map has the same spatial size as the input. A smaller amount of padding still shrinks the output, only by less than with no padding at all.

   - Purpose: Zero-padding is typically used to maintain the spatial dimensions of the input throughout the convolutional layers. It helps in retaining spatial information, especially at the borders of the input data, and mitigates the boundary effects that occur during convolution.

2. Valid-padding:

   - Effect on output feature map size: Valid-padding, also known as 'no padding,' implies no additional padding is added to the input data. Consequently, the spatial dimensions of the output feature map are reduced compared to the input size. The reduction in size depends on the dimensions of the filter and the stride used in the convolution operation.

   - Purpose: Valid-padding is often used when the objective is to reduce the spatial dimensions of the feature maps, typically for downsampling or dimensionality reduction purposes. It is commonly employed in architectures where reducing the spatial dimensions is a deliberate design choice, such as in pooling layers or certain convolutional layers aimed at feature extraction.

In summary, the key difference between zero-padding and valid-padding lies in their effects on the output feature map size. Zero-padding, when applied in the amount needed for a 'same' convolution at stride 1, keeps the output feature map at the same spatial dimensions as the input, whereas valid-padding reduces the output feature map size, resulting in a downsized feature map compared to the input. The choice between these padding techniques depends on the specific requirements of the CNN architecture and the desired manipulation of spatial dimensions during the convolutional and pooling operations.
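The size comparison above follows from the standard output-size formula for a convolution, out = floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride, and p the padding per side. A minimal sketch, with hypothetical function names chosen for this example:

```python
def conv_output_size(n, k, s=1, p=0):
    """Output size along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def same_padding(k):
    """Per-side zero-padding that preserves size at stride 1 (odd k)."""
    return (k - 1) // 2

n, k = 32, 5
valid_size = conv_output_size(n, k, s=1, p=0)               # 28
same_size = conv_output_size(n, k, s=1, p=same_padding(k))  # 32
print(valid_size, same_size)
```

For a 32x32 input and a 5x5 kernel, valid-padding gives 28x28 while two pixels of zero-padding per side keep the output at 32x32.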

# TOPIC: Exploring LeNet

# 1. Provide a brief overview of LeNet-5 architecture.

LeNet-5 is a pioneering convolutional neural network architecture developed by Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. It was published in 1998 and builds on LeCun's earlier convolutional networks from the late 1980s and early 1990s. It was primarily designed for handwritten and machine-printed character recognition tasks. LeNet-5 played a crucial role in demonstrating the potential of convolutional neural networks in the field of computer vision. Although it is relatively simple compared to modern architectures, LeNet-5 introduced some fundamental concepts that have become integral to the design of more complex CNN architectures. Here is a brief overview of the LeNet-5 architecture:

1. Input Layer: The architecture is designed to handle 32x32-pixel grayscale images, which serve as the input to the network.

2. Convolutional Layers: LeNet-5 has two convolutional layers with 5x5 filters (6 feature maps in the first, 16 in the second), each followed by a sub-sampling layer. The first convolutional layer applies a set of learnable filters to the input image, extracting features such as edges and corners. The second convolutional layer further refines these features to create more abstract representations of the input.

3. Sub-sampling Layers: These layers perform a downsampling operation, typically using average pooling, to reduce the dimensionality of the feature maps and make the network translation-invariant.

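4. Fully Connected Layers: After the second sub-sampling layer, LeNet-5 has layers of 120 and 84 units followed by the output layer, which together act as a classic multilayer perceptron. In the original paper the 120-unit layer (C5) is written as a convolution whose 5x5 kernels cover the entire 5x5 input map, which makes it equivalent to a fully connected layer. These layers are responsible for learning higher-level features and making the final classifications.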

5. Output Layer: The final fully connected layer produces the output for the classification task. In the case of LeNet-5, it was primarily used for character recognition tasks, so the output layer would typically have as many units as the number of classes in the dataset.

Key concepts introduced by LeNet-5, such as convolutional layers, sub-sampling layers, and the combination of these layers with fully connected layers, laid the groundwork for more sophisticated CNN architectures that followed. Although LeNet-5 was initially designed for character recognition, its architectural principles have been adapted and expanded upon to solve a wide range of complex computer vision tasks, including image classification, object detection, and image segmentation.

# 2. Describe the key components of LeNet-5 and their respective purposes.

LeNet-5 is a pioneering convolutional neural network (CNN) architecture primarily designed for character recognition. It consists of several key components, each serving a specific purpose in the process of feature extraction and classification. The key components of LeNet-5 and their respective purposes are as follows:

1. Input Layer:
   - Purpose: The input layer accepts 32x32-pixel grayscale images. This layer serves as the initial point of data entry into the network.

2. Convolutional Layers:
   - Purpose: The convolutional layers apply a set of learnable filters to the input image. These filters detect patterns such as edges, corners, and textures, effectively extracting low-level features from the input.

3. Sub-sampling Layers:
   - Purpose: The sub-sampling layers perform down-sampling operations, typically using average pooling. This process reduces the dimensionality of the feature maps and enhances the translation invariance of the network, allowing it to identify patterns regardless of their position in the image.

4. Fully Connected Layers:
   - Purpose: LeNet-5 includes fully connected layers that function as a classic multilayer perceptron. These layers learn higher-level features by combining the information from the previous layers. They contribute to the network's ability to recognize complex patterns and make accurate classifications.

5. Output Layer:
   - Purpose: The output layer of LeNet-5 produces the final classification results. Since the architecture was primarily designed for character recognition, the output layer typically has as many units as the number of classes in the dataset, with each unit representing a specific character or class.

By combining these key components, LeNet-5 effectively demonstrated the potential of CNNs for pattern recognition tasks, especially in the domain of handwritten and machine-printed character recognition. It laid the foundation for the development of more advanced CNN architectures that have significantly expanded the capabilities of deep learning in various fields of computer vision, including image classification, object detection, and image segmentation.
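To make the flow through these components concrete, the sketch below traces the spatial size of the feature maps through the convolutional and sub-sampling stages for a 32x32 input. It assumes 5x5 convolutions without padding and 2x2 sub-sampling with stride 2; the helper `out_size` and the layer labels are written for this example only.

```python
def out_size(n, k, s):
    """Output size along one axis for an unpadded window of size k, stride s."""
    return (n - k) // s + 1

# (label, kernel size, stride, number of feature maps)
stages = [
    ("C1 conv 5x5", 5, 1, 6),
    ("S2 pool 2x2", 2, 2, 6),
    ("C3 conv 5x5", 5, 1, 16),
    ("S4 pool 2x2", 2, 2, 16),
]

size = 32
trace = []
for label, k, s, maps in stages:
    size = out_size(size, k, s)
    trace.append((label, size, maps))
    print(f"{label}: {size}x{size}x{maps}")

# Number of values passed on to the 120-unit layer
flattened = size * size * stages[-1][3]
print("flattened:", flattened)
```

The sizes run 28, 14, 10, 5, so 5 x 5 x 16 = 400 values reach the dense layers.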

# 3. Discuss the advantages and limitations of LeNet-5 in the context of image classification tasks.

LeNet-5, as one of the early convolutional neural network (CNN) architectures, has several advantages and limitations, particularly in the context of image classification tasks. Understanding these aspects provides insights into its capabilities and areas where it may be less effective:

Advantages of LeNet-5:

1. Simplicity: LeNet-5's relatively simple architecture makes it easy to understand and implement. It serves as a foundational model for more complex CNN architectures, allowing researchers and practitioners to grasp the fundamental concepts of CNNs.

2. Effective for Simple Recognition Tasks: LeNet-5 is effective for simple image recognition tasks, especially in scenarios where the images are relatively low resolution and the classes are well-defined, as demonstrated in its original purpose of character recognition.

3. Translation Invariance: The architecture's inclusion of sub-sampling layers enhances translation invariance, enabling the network to identify patterns regardless of their specific location in the input image.

Limitations of LeNet-5:

1. Limited Complexity: LeNet-5 may struggle with more complex image classification tasks involving high-resolution images or intricate patterns. Its architecture might not be deep or complex enough to capture the full range of features necessary for nuanced image recognition tasks.

2. Limited Capacity for Large-Scale Datasets: With the increase in the scale of modern image datasets, LeNet-5 may face challenges in effectively capturing the diversity and complexity of information in large-scale datasets.

3. Performance on Varied Image Types: LeNet-5 may not perform as well on datasets containing diverse or highly complex images, such as those with multiple objects or scenes, due to its limited capacity to learn intricate hierarchical representations.

4. Sensitivity to Image Variations: The architecture might struggle with variations in image quality, lighting, and orientation, as it was not specifically designed to handle complex variations in image characteristics.

Overall, while LeNet-5 laid the groundwork for contemporary CNNs, it is not as well-equipped to handle the complexities of modern image classification tasks, which often involve large and diverse datasets, high-resolution images, and complex patterns. Subsequent CNN architectures have addressed many of these limitations by incorporating deeper structures, increased capacity for feature extraction, and improved handling of various image variations, leading to significant advancements in the field of computer vision.

# 4. Implement LeNet-5 using a deep learning framework of your choice (e.g., TensorFlow, PyTorch) and train it on a publicly available dataset (e.g., MNIST). Evaluate its performance and provide insights.

In [None]:
from tensorflow import keras
from tensorflow.keras.layers import AveragePooling2D, Conv2D, Dense, Flatten
from tensorflow.keras.models import Sequential

# Load the CIFAR-10 dataset (32x32 RGB images, 10 classes).
# MNIST digits are 28x28 grayscale; using them here would require, for example,
# padding to 32x32 and input_shape=(32, 32, 1).
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize pixel values between 0 and 1
x_train = x_train / 255.0
x_test = x_test / 255.0

# Convert labels to one-hot encoding
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)


# Building the Model Architecture

model = Sequential()

model.add(Conv2D(6, kernel_size = (5,5), padding = 'valid', activation='tanh', input_shape = (32,32,3)))
model.add(AveragePooling2D(pool_size= (2,2), strides = 2, padding = 'valid'))

model.add(Conv2D(16, kernel_size = (5,5), padding = 'valid', activation='tanh'))
model.add(AveragePooling2D(pool_size= (2,2), strides = 2, padding = 'valid'))

model.add(Flatten())

model.add(Dense(120, activation='tanh'))
model.add(Dense(84, activation='tanh'))
model.add(Dense(10, activation='softmax'))

model.summary()


model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=2, verbose=1, validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test)

print('Test Loss:', score[0])
print('Test accuracy:', score[1])

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 28, 28, 6)         456       
                                                                 
 average_pooling2d (Average  (None, 14, 14, 6)         0         
 Pooling2D)                                                      
                                                                 
 conv2d_1 (Conv2D)           (None, 10, 10, 16)        2416      
                                                                 
 average_pooling2d_1 (Avera  (None, 5, 5, 16)          0         
 gePooling2D)                                                    
                                                                 
 flatten (Flatten)           (None, 400)               0         
                                                                 
 dense (Dense)               (None, 120)               4

# TOPIC: Analyzing AlexNet

# 1. Present an overview of the AlexNet architecture.

AlexNet is a deep convolutional neural network architecture that gained significant attention after winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. This groundbreaking architecture played a crucial role in popularizing deep learning and convolutional neural networks. Here's an overview of the key components of the AlexNet architecture:

1. Input Layer: The architecture takes an input image of size 227x227x3, representing the RGB color channels.

2. Convolutional Layers: AlexNet consists of five convolutional layers, with max-pooling layers after the first, second, and fifth. These layers are responsible for feature extraction, capturing different levels of image features through the use of various convolutional filters.

3. ReLU Activation: Rectified Linear Unit (ReLU) activation functions are applied after each convolutional layer. ReLU introduces non-linearity, allowing the network to learn complex patterns in the data.

4. Local Response Normalization (LRN): LRN layers were included in the original AlexNet architecture. They aimed to provide local competition between neurons, enhancing the generalization ability of the network. However, the use of LRN has become less common in modern architectures.

5. Fully Connected Layers: The architecture includes three fully connected layers. These layers serve as a classic artificial neural network, learning high-level features and making the final predictions.

6. Dropout: Dropout, a regularization technique, is applied to the first two fully connected layers to prevent overfitting. During each training iteration it randomly sets the output of each neuron to zero with a fixed probability (0.5 in the original network), forcing the network to learn more robust and generalizable features that do not depend on any single neuron.

7. Softmax Layer: The final layer uses the softmax activation function to produce the probability distribution over the classes, enabling the network to make predictions.

Key contributions of AlexNet include the use of deep convolutional neural networks for large-scale image recognition tasks, the demonstration of the effectiveness of ReLU activation functions, the use of dropout for regularization, and the adoption of the GPU acceleration to significantly speed up the training process. AlexNet paved the way for the development of more sophisticated deep learning architectures, leading to substantial advancements in computer vision and various other domains.
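The dropout step listed above can be sketched in a few lines of NumPy. This is the "inverted" variant commonly used today, which rescales the surviving activations during training so nothing changes at test time. The original AlexNet instead scaled the outputs at test time, so treat this as an illustration of the idea rather than a reproduction of that implementation. The function name and sample data are assumptions.

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    keep_prob = 1.0 - rate
    # Boolean mask: True for units that are kept in this iteration
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(seed=0)
hidden = np.ones((4, 8))  # hypothetical layer output

train_out = dropout(hidden, rate=0.5, rng=rng)        # entries are 0.0 or 2.0
test_out = dropout(hidden, rate=0.5, training=False)  # unchanged
print(train_out)
```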

# 2. Explain the architectural innovations introduced in AlexNet that contributed to its breakthrough performance.

AlexNet introduced several architectural innovations that significantly contributed to its breakthrough performance and played a crucial role in popularizing deep learning and convolutional neural networks. These innovations include:

1. Deep Architecture: AlexNet was one of the first deep convolutional neural network architectures, consisting of eight layers, including five convolutional layers and three fully connected layers. Its depth allowed it to learn complex hierarchical representations of the input data, enabling more abstract and high-level feature learning.

2. ReLU Activation Function: AlexNet incorporated the Rectified Linear Unit (ReLU) activation function after each convolutional layer. ReLU helped address the vanishing gradient problem, allowing the network to train faster and avoid the saturation issues of earlier activation functions such as the sigmoid and tanh. The use of ReLU significantly accelerated the training process and improved the convergence of the network.

3. Local Response Normalization (LRN): Although less commonly used in modern architectures, AlexNet introduced the concept of Local Response Normalization (LRN). LRN aimed to provide local competition between adjacent neurons, enhancing the network's ability to generalize and respond to variations in the input data. While not as widely adopted in subsequent architectures, the inclusion of LRN showcased the potential benefits of incorporating normalization techniques in CNNs.

4. Overlapping Max Pooling: AlexNet used max pooling with 3x3 windows and a stride of 2, so neighboring pooling windows overlap by one row or column. The authors reported that, compared with non-overlapping pooling, this slightly reduced the error rate and made the network somewhat harder to overfit. The overlap in pooling regions provided more comprehensive coverage of the input data, contributing to the network's robustness in detecting and learning features across different regions of the input images.

5. Training on Multiple GPUs: One of the most critical innovations was training the network on Graphics Processing Units (GPUs). The network was too large for the memory of a single GPU of the time, so it was split across two GPUs, with each GPU holding half of the kernels and the two communicating only at certain layers. This significantly reduced the training time compared to traditional CPU-based implementations and allowed the network to handle large-scale datasets and complex computations, marking a significant step forward in the field of deep learning.

Collectively, these architectural innovations in AlexNet set the stage for the development of more advanced deep learning models and played a pivotal role in demonstrating the potential of deep convolutional neural networks for solving complex real-world tasks, particularly in the domain of computer vision and image recognition.
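The saturation argument behind the ReLU point above can be checked numerically. The sketch below compares the derivatives of ReLU, tanh, and sigmoid at a few positive pre-activation values; the function names and sample inputs are chosen for illustration.

```python
import numpy as np

def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def tanh_grad(x):
    """Derivative of tanh(x)."""
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

pre_activations = np.array([0.5, 2.0, 5.0, 10.0])
relu_g = relu_grad(pre_activations)
tanh_g = tanh_grad(pre_activations)
sigmoid_g = sigmoid_grad(pre_activations)

for x, r, t, s in zip(pre_activations, relu_g, tanh_g, sigmoid_g):
    print(f"x={x:5.1f}  relu'={r:.1f}  tanh'={t:.2e}  sigmoid'={s:.2e}")
```

For large inputs the tanh and sigmoid derivatives shrink toward zero, while the ReLU derivative stays at 1 for any positive input, which is why gradients pass through ReLU layers without being scaled down.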

# 3. Discuss the role of convolutional layers, pooling layers, and fully connected layers in AlexNet.

In AlexNet, each type of layer serves a specific purpose in the process of feature extraction, dimensionality reduction, and classification. The architecture of AlexNet utilizes convolutional layers, pooling layers, and fully connected layers in a sequential manner to extract hierarchical representations from input data and make accurate predictions. Here's a breakdown of the roles of each type of layer in AlexNet:

1. Convolutional Layers:
   - Role: The convolutional layers in AlexNet are responsible for feature extraction from the input images. These layers apply convolution operations using learnable filters to extract various low to high-level features such as edges, textures, and patterns. By convolving the input images with these filters, the network can capture meaningful spatial hierarchies of features, allowing it to learn complex patterns and representations.

2. Pooling Layers:
   - Role: The pooling layers in AlexNet, specifically the overlapping max pooling layers, play a crucial role in reducing the spatial dimensions of the feature maps and introducing translation invariance. These layers downsample the feature maps by taking the maximum value within each pooling region, thereby preserving the most salient features and reducing the sensitivity to small variations in the spatial positions of the features. Overlapping pooling regions enable the network to capture more comprehensive spatial hierarchies, making the learned features more robust and translationally invariant.

3. Fully Connected Layers:
   - Role: The fully connected layers in AlexNet serve as a classic artificial neural network, responsible for learning high-level features from the extracted representations and making the final predictions. These layers take the flattened feature maps from the previous layers and learn complex relationships between different features. By applying non-linear transformations and learning the feature combinations, the fully connected layers enable the network to classify the input data into different classes. In the case of AlexNet, the fully connected layers produce the final classification results based on the learned representations.

By combining these different types of layers in a deep architecture, AlexNet effectively learns hierarchical representations of the input data, making it capable of handling complex real-world tasks such as large-scale image classification. The coordinated function of convolutional layers, pooling layers, and fully connected layers allows AlexNet to capture intricate patterns and relationships within the data, contributing to its breakthrough performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.

# 4. Implement AlexNet using a deep learning framework of your choice and evaluate its performance on a dataset of your choice.

In [16]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Load the CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

# Define the AlexNet architecture
model = models.Sequential([
    layers.Conv2D(64, (11, 11), strides=(4, 4), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),
    layers.Conv2D(192, (5, 5), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),
    layers.Conv2D(384, (3, 3), padding='same', activation='relu'),
    layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
    layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),  # final layer already applies softmax
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
print(f'Test accuracy: {test_acc}')


ValueError: Negative dimension size caused by subtracting 3 from 2 for '{{node max_pooling2d_8/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NHWC", explicit_paddings=[], ksize=[1, 3, 3, 1], padding="VALID", strides=[1, 2, 2, 1]](conv2d_13/Relu)' with input shapes: [?,2,2,192].
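The error above comes from the size arithmetic of the first few layers: AlexNet's 11x11 stride-4 convolution and 3x3 stride-2 pooling were designed for inputs of roughly 227x227, and a 32x32 CIFAR-10 image shrinks to 2x2 before the second pooling layer. A minimal sketch of that arithmetic, using hypothetical helper names:

```python
def out_size(n, k, s, p=0):
    """Output size along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def alexnet_front_sizes(n):
    """Spatial size after conv1, pool1, conv2 ('same' padding), pool2."""
    sizes = []
    n = out_size(n, k=11, s=4)      # conv1: 11x11, stride 4, no padding
    sizes.append(n)
    n = out_size(n, k=3, s=2)       # pool1: 3x3, stride 2
    sizes.append(n)
    n = out_size(n, k=5, s=1, p=2)  # conv2: 5x5, 'same' padding
    sizes.append(n)
    n = out_size(n, k=3, s=2)       # pool2: 3x3, stride 2
    sizes.append(n)
    return sizes

cifar_sizes = alexnet_front_sizes(32)      # [6, 2, 2, 0]
imagenet_sizes = alexnet_front_sizes(227)  # [55, 27, 27, 13]
print(cifar_sizes)
print(imagenet_sizes)
```

A result of zero means the 3x3 pooling window no longer fits in the 2x2 feature map, which is what the framework reports. Typical remedies are to upsample the images before the first layer or to use smaller kernels and strides for small inputs.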

In [17]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

# Load and preprocess the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values to the range [0, 1]
y_train, y_test = to_categorical(y_train), to_categorical(y_test)  # One-hot encode labels

# Define the AlexNet architecture
model = models.Sequential()
model.add(layers.Conv2D(96, (11, 11), strides=(4, 4), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((3, 3), strides=(2, 2)))
model.add(layers.Conv2D(256, (5, 5), activation='relu'))
model.add(layers.MaxPooling2D((3, 3), strides=(2, 2)))
model.add(layers.Conv2D(384, (3, 3), activation='relu'))
model.add(layers.Conv2D(384, (3, 3), activation='relu'))
model.add(layers.Conv2D(256, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((3, 3), strides=(2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print("\nTest accuracy:", test_acc)


ValueError: One of the dimensions in the output is <= 0 due to downsampling in conv2d_18. Consider increasing the input size. Received input shape [None, 2, 2, 96] which would produce output shape with a zero or negative value in a dimension.

In [18]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Load and preprocess the CIFAR-10 data (32x32 RGB images, integer labels 0-9)
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

# Build the AlexNet model.
# A 32x32 input shrinks to 2x2 after the first convolution and pooling, so the
# later 3x3 pooling layers cannot be applied. As one possible adaptation, the
# images are upsampled to 227x227 inside the model before the first convolution.
model = models.Sequential()
model.add(layers.Input(shape=(32, 32, 3)))
model.add(layers.Resizing(227, 227))
model.add(layers.Conv2D(96, (11, 11), strides=(4, 4), activation='relu'))   # -> 55x55x96
model.add(layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))            # -> 27x27x96
model.add(layers.Conv2D(256, (5, 5), padding='same', activation='relu'))    # -> 27x27x256
model.add(layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))            # -> 13x13x256
model.add(layers.Conv2D(384, (3, 3), padding='same', activation='relu'))    # -> 13x13x384
model.add(layers.Conv2D(384, (3, 3), padding='same', activation='relu'))    # -> 13x13x384
model.add(layers.Conv2D(256, (3, 3), padding='same', activation='relu'))    # -> 13x13x256
model.add(layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))            # -> 6x6x256
model.add(layers.Flatten())                                                 # -> 9216
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model. The last layer already applies softmax, so the loss
# receives probabilities rather than logits.
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

# Train the model (slow without a GPU because of the 227x227 feature maps)
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(f'Test accuracy: {test_acc}')