# Assignment 6 Solutions

#### 1.	What are the advantages of a CNN over a fully connected DNN for image classification?

**Answer** The advantages of a Convolutional Neural Network (CNN) over a fully connected Deep Neural Network (DNN) for image classification are:

Parameter Efficiency: CNNs use convolutional layers to share weights across local regions of an image, reducing the number of learnable parameters compared to fully connected DNNs. This makes CNNs more efficient in handling large image inputs.

Spatial Hierarchies: CNNs capture spatial hierarchies in images through convolutional layers, learning local patterns and gradually aggregating them to represent higher-level features. This enables CNNs to better capture spatial relationships and local patterns in images.

Translation Invariance: Because the same filters slide over the whole image, a pattern learned in one location can be detected anywhere else (strictly speaking, this is translation equivariance), and pooling layers add a degree of invariance to small shifts. This property is beneficial for tasks like image classification, where the position of objects may vary within an image.

Feature Reuse: CNNs reuse learned features across different spatial locations in an image, allowing them to detect similar patterns in different regions efficiently. This leads to better generalization and robustness to variations in object positions.

Reduced Overfitting: The weight sharing and pooling operations in CNNs contribute to regularization, helping to reduce overfitting and improve generalization on image classification tasks.

Local Connectivity: CNNs exploit local connectivity by focusing on a small receptive field, allowing them to efficiently process spatially localized features.
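To make the parameter-efficiency point concrete, here is a minimal sketch comparing the parameter count of one fully connected layer with one convolutional layer. The 100 × 100 RGB input and the choice of 100 units or filters are hypothetical values picked for illustration, as are the helper names.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """One weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int, n_filters: int) -> int:
    """Each filter has kernel_h x kernel_w x in_channels weights plus one bias."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Hypothetical input: a 100 x 100 RGB image, flattened for the dense layer.
n_inputs = 100 * 100 * 3

print(dense_params(n_inputs, 100))    # 3000100
print(conv2d_params(3, 3, 3, 100))    # 2800
```

The convolutional layer's count does not depend on the image size at all, which is why the gap widens as images get larger.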

#### 2.	Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and "same" padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels.
What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?

**Answer** To count the parameters, add up the weights and biases of each convolutional layer.

Given:

- Three convolutional layers, each with 3 × 3 kernels.
- A stride of 2.
- "Same" padding. With a stride of 2 this does not preserve the spatial dimensions: each layer's output height and width are the input's divided by 2, rounded up. Stride and padding affect the feature map sizes (and therefore memory use), not the parameter count.

A convolutional layer has kernel height × kernel width × input channels × output channels weights, plus one bias per output channel:

First convolutional layer:

- Input channels (RGB): 3
- Output channels: 100
- Number of weights: 3 × 3 × 3 × 100 = 2,700
- Number of biases: 100
- Total parameters in the first layer: 2,700 + 100 = 2,800

Second convolutional layer:

- Input channels: 100
- Output channels: 200
- Number of weights: 3 × 3 × 100 × 200 = 180,000
- Number of biases: 200
- Total parameters in the second layer: 180,000 + 200 = 180,200

Third convolutional layer:

- Input channels: 200
- Output channels: 400
- Number of weights: 3 × 3 × 200 × 400 = 720,000
- Number of biases: 400
- Total parameters in the third layer: 720,000 + 400 = 720,400

Total number of parameters in the CNN:
2,800 + 180,200 + 720,400 = 903,400 parameters.

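To estimate the RAM needed for a single prediction, count both the parameters and the feature maps (4 bytes per 32-bit float):

- Parameters: 903,400 × 4 bytes = 3,613,600 bytes ≈ 3.6 MB.
- With a stride of 2 and "same" padding, each layer halves the height and width (rounding up), so the feature maps are 100 × 150, 50 × 75, and 25 × 38.
- First layer output: 100 × 150 × 100 × 4 bytes = 6 MB. Second layer output: 50 × 75 × 200 × 4 bytes = 3 MB. Third layer output: 25 × 38 × 400 × 4 bytes ≈ 1.5 MB.
- During inference, a layer's output can be released once the next layer has been computed, so at most two consecutive layer outputs need to be in memory at once: 6 MB + 3 MB = 9 MB.

The minimum for one instance is therefore about 9 MB + 3.6 MB ≈ 12.6 MB.

When training on a mini-batch of 50 images, every layer's output must be kept for the backward pass:

- Feature maps: (6 + 3 + 1.52) MB × 50 ≈ 526 MB.
- Input images: 50 × 200 × 300 × 3 × 4 bytes = 36 MB.
- Parameters: 3.6 MB.

That gives roughly 565 MB as a lower bound, before counting gradients and optimizer state.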
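A short script can reproduce this kind of estimate. The sketch below counts parameters and feature-map activations under stated assumptions: "same" padding gives an output size of ceil(input / stride), only two consecutive layer outputs are alive at once during inference, and all layer outputs are kept during training. Gradients and optimizer state are ignored, so the results are lower bounds. The helper names are illustrative.

```python
import math

BYTES_PER_FLOAT32 = 4


def same_padding_output(size: int, stride: int) -> int:
    """Output length along one axis for "same" padding: ceil(size / stride)."""
    return math.ceil(size / stride)


def conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights (kernel x kernel x in x out) plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


height, width, channels = 200, 300, 3
filters_per_layer = [100, 200, 400]
kernel, stride, batch_size = 3, 2, 50

total_params = 0
activation_bytes = []  # bytes held by each layer's output for ONE image
h, w, in_ch = height, width, channels
for out_ch in filters_per_layer:
    total_params += conv_params(kernel, in_ch, out_ch)
    h, w = same_padding_output(h, stride), same_padding_output(w, stride)
    activation_bytes.append(h * w * out_ch * BYTES_PER_FLOAT32)
    in_ch = out_ch

param_bytes = total_params * BYTES_PER_FLOAT32

# Inference: assume only two consecutive layer outputs are alive at once.
inference_bytes = param_bytes + max(
    a + b for a, b in zip(activation_bytes, activation_bytes[1:])
)

# Training: assume every layer output is kept for backpropagation.
input_bytes = batch_size * height * width * channels * BYTES_PER_FLOAT32
training_bytes = param_bytes + input_bytes + batch_size * sum(activation_bytes)

print(total_params)     # 903400
print(inference_bytes)  # 12613600 (about 12.6 MB)
print(training_bytes)   # 565613600 (about 565 MB)
```

The feature maps, not the parameters, account for most of the memory, especially during training.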

#### 3.	If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

**Answer** If the GPU runs out of memory while training a CNN, here are five things you could try:

- Reduce the batch size: a smaller mini-batch needs less memory for intermediate activations and their gradients.
- Reduce the model size: remove layers or use fewer feature maps or neurons, which cuts both parameters and activations. Using a larger stride in one or more layers also shrinks the feature maps.
- Use mixed precision: run most computations and store activations in 16-bit floats (e.g., float16), typically keeping a 32-bit copy of the weights for numerical stability.
- Use gradient accumulation: accumulate gradients over several small mini-batches before applying a weight update. This gives the effect of a larger batch while only one small mini-batch is in memory at a time.
- Trade compute for memory, or add memory: for example, recompute activations during the backward pass instead of storing them (gradient checkpointing), or distribute the model or the batch across multiple devices.
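Gradient accumulation is easiest to see on a toy problem. The sketch below uses a hypothetical one-weight linear model with a mean squared error loss (not a CNN) and checks that averaging the gradients of two micro-batches gives the same gradient as the full batch, assuming equally sized micro-batches. The data and function names are made up for illustration.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient computed on the whole batch at once (needs all 4 samples in memory).
full_batch_grad = grad_mse(w, xs, ys)

# Same gradient, computed two samples at a time and accumulated.
micro_batch_size = 2
n_micro_batches = len(xs) // micro_batch_size
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    end = start + micro_batch_size
    accumulated_grad += grad_mse(w, xs[start:end], ys[start:end]) / n_micro_batches

print(full_batch_grad)   # -22.5
print(accumulated_grad)  # -22.5

# A single weight update is applied only after all micro-batches are processed.
learning_rate = 0.01
w -= learning_rate * accumulated_grad
```

Only one micro-batch of activations has to fit in memory at a time, at the cost of more forward and backward passes per update.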

#### 4.	Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

**Answer** A max pooling layer has no parameters at all, whereas a convolutional layer with the same stride adds weights and biases that must be stored and trained. Max pooling therefore downsamples the feature maps without adding anything to learn, and at a lower memory and computational cost than a strided convolution.
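The difference can be illustrated with a minimal sketch: a 2 × 2 max pooling with stride 2 written in plain Python, next to the parameter count of a stride-2 convolution. The 4 × 4 feature map and the choice of 64 input and output channels are hypothetical.

```python
def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 on a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus biases of a convolutional layer (the stride does not matter)."""
    return kernel * kernel * in_channels * out_channels + out_channels


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

print(max_pool_2x2(feature_map))   # [[4, 2], [2, 8]]
print(conv2d_params(3, 64, 64))    # 36928 trainable parameters
# Max pooling needs no parameters: the function above has nothing to learn.
```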

#### 5.	When would you want to add a local response normalization layer?

**Answer** Local Response Normalization (LRN) layers were commonly used in older CNN architectures like AlexNet to promote competition between neighboring feature maps: a strongly activated neuron inhibits the neurons at the same location in nearby feature maps, which encourages different feature maps to specialize. The use of LRN has declined in modern architectures because batch normalization and other normalization techniques have proven more effective at stabilizing training and improving generalization. LRN may still be useful when you have a specific architectural reason to encourage this kind of competition, but it is no longer a standard layer.
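As a rough illustration of the competition effect, the sketch below applies a cross-channel normalization of the form b_i = a_i / (bias + alpha × sum of squared neighbors)^beta at a single pixel. The hyperparameter defaults and the example activations are illustrative assumptions, not values taken from any particular architecture.

```python
def local_response_norm(activations, depth_radius=2, bias=1.0, alpha=1e-4, beta=0.75):
    """Normalize the activations at one spatial position across feature maps.

    activations: one value per feature map, all at the same pixel.
    Each value is divided by a term that grows with the squared activations
    of the feature maps within `depth_radius` of it.
    """
    n = len(activations)
    normalized = []
    for i, a in enumerate(activations):
        low = max(0, i - depth_radius)
        high = min(n - 1, i + depth_radius)
        squared_sum = sum(activations[j] ** 2 for j in range(low, high + 1))
        normalized.append(a / (bias + alpha * squared_sum) ** beta)
    return normalized


# Illustrative values: the strong middle activation suppresses its neighbors.
out = local_response_norm([1.0, 10.0, 1.0], depth_radius=1, bias=1.0, alpha=1.0, beta=1.0)
print(out)  # each value is scaled down; the weak neighbors shrink to about 0.0098
```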







#### 6.	Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet, ResNet, SENet, and Xception?

**Answer** Main innovations in AlexNet compared to LeNet-5:

- Larger architecture: AlexNet is much larger and deeper, and it stacks convolutional layers directly on top of one another instead of following every convolutional layer with a pooling layer.
- ReLU activation: AlexNet popularized the rectified linear unit (ReLU) in deep CNNs, which helps mitigate the vanishing gradient problem and accelerates convergence.
- Dropout regularization: AlexNet used dropout during training to reduce overfitting and improve generalization.
- Data augmentation: AlexNet relied on extensive data augmentation, including random cropping, flipping, and color alterations, to increase the effective size of the training set and improve robustness.
- GPU acceleration: AlexNet was one of the first CNN architectures to be trained on GPUs, enabling faster computation and larger model capacity.

Main innovations in GoogLeNet (Inception), ResNet, SENet, and Xception:

- GoogLeNet: Introduced the Inception module, which applies several filter sizes (1x1, 3x3, 5x5) in parallel to capture features at different scales, with 1x1 convolutions acting as bottlenecks to keep the cost down. This made it possible to train a much deeper network with far fewer parameters than earlier architectures.
- ResNet: Introduced residual connections (skip connections), which let the signal and the gradients bypass layers. This made it possible to train extremely deep networks (over a hundred layers).
- SENet (Squeeze-and-Excitation Network): Introduced channel-wise attention through SE blocks, which let the network adaptively recalibrate the importance of each feature map. This emphasizes the most relevant feature maps and improves performance.
- Xception: Built the network almost entirely from depthwise separable convolutions, which factorize a standard convolution into a depthwise (per-channel spatial) convolution followed by a pointwise (1x1) convolution. This reduces the number of parameters and the computational cost while maintaining or even improving performance.
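The parameter saving from a depthwise separable convolution can be checked with simple arithmetic. The sketch below compares weight counts (biases omitted) for a hypothetical layer with a 3 × 3 kernel, 128 input channels, and 256 output channels; the function names are illustrative.

```python
def standard_conv_weights(kernel: int, in_channels: int, out_channels: int) -> int:
    """Every output channel has its own kernel x kernel filter over all input channels."""
    return kernel * kernel * in_channels * out_channels


def separable_conv_weights(kernel: int, in_channels: int, out_channels: int) -> int:
    """One spatial filter per input channel, then a 1 x 1 convolution to mix channels."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


print(standard_conv_weights(3, 128, 256))    # 294912
print(separable_conv_weights(3, 128, 256))   # 33920
```

For this configuration the separable version uses roughly 8.7 times fewer weights.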

#### 7.	What is a fully convolutional network? How can you convert a dense layer into a convolutional layer?

**Answer** A fully convolutional network (FCN) is a neural network composed only of convolutional and pooling layers, with no dense layers. FCNs are widely used for semantic segmentation, where the goal is to assign a class label to each pixel of the input image, and because they contain no dense layers they can process images of any size and produce output maps whose spatial dimensions scale with the input.

To convert a dense layer into a convolutional layer, use as many filters as the dense layer has neurons, set the kernel size equal to the size of the input feature maps, and use "valid" padding. The stride can be 1 or more, and the activation function stays the same. Each filter then holds exactly the weights of one neuron, so on an input of the original size the layer outputs a 1x1 feature map per filter with the same values as the dense layer. On a larger input, the same layer slides across the image and outputs a grid of predictions. A 1x1 convolution is only equivalent to a dense layer when the incoming feature maps are themselves 1x1.
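The dense-to-convolution equivalence can be verified numerically. The sketch below uses plain Python lists, a single input channel, and made-up weights: a dense layer with two units over a flattened 2 × 2 feature map is compared with a "valid" convolution whose two 2 × 2 kernels are the same weights reshaped. All values and helper names are illustrative assumptions.

```python
def dense(flat_inputs, weights, biases):
    """weights[u][i] is the weight from input i to unit u."""
    return [
        sum(w * x for w, x in zip(weights[u], flat_inputs)) + biases[u]
        for u in range(len(biases))
    ]


def conv2d_valid(image, kernels, biases):
    """Single-channel "valid" convolution with stride 1; one output map per kernel."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            [
                sum(image[r + i][c + j] * kernels[f][i][j]
                    for i in range(kh) for j in range(kw)) + biases[f]
                for c in range(out_w)
            ]
            for r in range(out_h)
        ]
        for f in range(len(kernels))
    ]


feature_map = [[1.0, 2.0], [3.0, 4.0]]
dense_weights = [[0.5, -1.0, 0.25, 2.0], [1.0, 1.0, 1.0, 1.0]]
biases = [0.1, -0.2]

# Reshape each unit's 4 weights into a 2 x 2 kernel (row-major, like Flatten).
kernels = [[w[0:2], w[2:4]] for w in dense_weights]

flat = [value for row in feature_map for value in row]
dense_out = dense(flat, dense_weights, biases)
conv_out = conv2d_valid(feature_map, kernels, biases)

print(dense_out)                             # two values, one per unit
print([fmap[0][0] for fmap in conv_out])     # the same two values, as 1 x 1 maps

# On a larger 3 x 3 input, the same kernels produce a 2 x 2 map per filter.
larger = [[1.0, 2.0, 0.0], [3.0, 4.0, 1.0], [0.0, 1.0, 2.0]]
print(len(conv2d_valid(larger, kernels, biases)[0]))  # 2
```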

#### 8.	What is the main technical difficulty of semantic segmentation?

**Answer** The main technical difficulty of semantic segmentation is that spatial information is lost as the signal flows through the network, especially in pooling layers and other layers with a stride greater than 1. A classifier only needs to know what is in the image, but a segmentation network must predict a class for every pixel, so this lost resolution has to be restored accurately. FCNs and other modern segmentation architectures address this with upsampling techniques (such as transposed convolutions) and skip connections from lower layers, which recover fine spatial detail and produce dense predictions at the original image resolution.
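The loss of spatial detail can be shown with a toy example. In the sketch below, a stride-2 layer is approximated by simply keeping every second row and column, and the upsampling is a nearest-neighbor repeat. Both are simplifications chosen for illustration (real networks use learned strided and transposed convolutions), and the mask values are made up.

```python
def downsample_stride2(feature_map):
    """Keep every second row and column (a crude stand-in for a stride-2 layer)."""
    return [row[::2] for row in feature_map[::2]]


def upsample_nearest(feature_map, factor):
    """Repeat each value `factor` times along both axes."""
    return [
        [value for value in row for _ in range(factor)]
        for row in feature_map
        for _ in range(factor)
    ]


# A 4 x 4 mask with two single-pixel objects.
mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

small = downsample_stride2(mask)        # 2 x 2
restored = upsample_nearest(small, 2)   # back to 4 x 4

print(small)             # [[0, 0], [0, 0]]
print(restored == mask)  # False: both objects were lost along the way
```

Upsampling restores the size but not the detail, which is why segmentation architectures add skip connections that reuse higher-resolution feature maps from earlier layers.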

#### 9.	Build your own CNN from scratch and try to achieve the highest possible accuracy on MNIST.

```shell
pip install tensorflow
```

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load MNIST dataset (60,000 training and 10,000 test images of 28 x 28 pixels)
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Add a channel dimension and scale pixel values from [0, 255] to [0, 1]
train_images = train_images.reshape(-1, 28, 28, 1).astype('float32') / 255.0
test_images = test_images.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# One-hot encode labels (required by the categorical cross-entropy loss)
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Build CNN model: three convolutional layers followed by a small dense classifier
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model (note: the test set is also used here to monitor validation metrics)
model.fit(train_images, train_labels, epochs=10, batch_size=128, validation_data=(test_images, test_labels))

# Evaluate the model
test_loss, test_accuracy = model.evaluate(test_images, test_labels)
print("Test accuracy:", test_accuracy)
```


```
Test accuracy: 0.991100013256073
```


#### 10.	Use transfer learning for large image classification, going through these steps:
- a.	Create a training set containing at least 100 images per class. For example, you could classify your own pictures based on the location (beach, mountain, city, etc.), or alternatively you can use an existing dataset (e.g., from TensorFlow Datasets).
- b.	Split it into a training set, a validation set, and a test set.
- c.	Build the input pipeline, including the appropriate preprocessing operations, and optionally add data augmentation.
- d.	Fine-tune a pretrained model on this dataset.
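
For step b, one possible approach is sketched below using only the standard library: shuffle the list of examples once with a fixed seed, then slice it into three disjoint parts. The split fractions, the seed, and the file names are hypothetical placeholders rather than part of the assignment.

```python
import random


def split_dataset(items, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once, then slice into disjoint train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical file names standing in for a real image collection.
image_paths = [f"image_{i:03d}.jpg" for i in range(300)]
train, val, test = split_dataset(image_paths)
print(len(train), len(val), len(test))
```

Splitting before building the input pipeline keeps augmentation confined to the training set and leaves the test set untouched until the final evaluation.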